
CSE 535: INFORMATION RETRIEVAL

Apache Lucene and Project 2

Agenda
• Walkthrough for Project 2
• Lucene
  • Introduction and package overview
  • Hands on: Simple indexing and querying
  • Cook book: Some recipes
  • Applications to Project 2
• Recap of concepts

PROJECT TWO
IR models, query processing and evaluation

The two-minute overview

• What's given
  • A big collection with queries and relevance judgments
  • A program that evaluates your results and tells you how well you did
• Your task
  • Section one: Compare different methods
    • Do models behave differently for different queries (query length, out-of-vocabulary (OOV) terms, etc.)?
    • Can you tweak parameters to perform better?
    • Can you simulate Boolean queries?
  • Section two: Write a query processing (QP) module
    • Read up on Lucene query syntax
    • Easy: Give different weights to different zones. Trial and error may work as a first stab
    • Harder: Syntactic parsing, NLP processing
  • Section three: Discuss QP results
    • Did some models respond better to QP than others?
    • Any interesting patterns: Technique-wise? Query-wise?

THE THEORY
What concepts would you need?

Relevance models
• VSM or tf-idf
• Boolean
• BM25
• LM
• DFR
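To make one of these models concrete, here is a minimal sketch of the textbook Okapi BM25 term weight in plain Java. It is not Lucene's BM25Similarity, whose exact normalization may differ. The class name `Bm25Sketch`, the method names, and all collection statistics in `main` are hypothetical and chosen only for illustration. The defaults k1 = 1.2 and b = 0.75 are commonly quoted starting points and are exactly the kind of parameters Section one asks you to tune.

```java
/** Textbook Okapi BM25 term weight; an illustrative sketch, not Lucene's implementation. */
final class Bm25Sketch {

    // Inverse document frequency with +1 inside the log so the value stays positive.
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Contribution of one query term to the score of one document.
    // tf: term frequency in the document; docLen / avgDocLen: length normalization inputs;
    // k1 controls tf saturation, b controls how strongly document length is penalized.
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docFreq, long numDocs, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docFreq, numDocs) * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: a term occurring 3 times in a 120-token document,
        // present in 50 of 10,000 documents whose average length is 100 tokens.
        double score = termScore(3, 120, 100, 50, 10_000, 1.2, 0.75);
        System.out.println("BM25 term score: " + score);
    }
}
```

The score of a document for a whole query is the sum of `termScore` over the query terms. Setting b to 0 switches length normalization off, which is a quick way to see how much it matters for a given query set.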

Evaluation metrics
• Precision, Recall & F-measure: single valued
  • tp, fp, tn, fn
• Based on rank
  • Average precision
  • Mean Average Precision: MAP, GM MAP
  • Precision @ k
  • R-precision
  • Reciprocal rank
• Bpref: Preference based
• Incremental value?
  • DCG
  • nDCG
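The rank-based metrics above can be sketched in a few lines of plain Java. This is an illustrative sketch, not the evaluation program shipped with the project. The class name `RankMetrics`, the document IDs, and the graded gains are all made up. It assumes a ranked list without duplicates, binary relevance for P@k, AP and RR, and the common log2 discount for DCG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Rank-based evaluation metrics for a single query; an illustrative sketch. */
final class RankMetrics {

    // Fraction of the top k results that are relevant.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) return 0.0;
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    // Mean of the precision values at each rank where a relevant document appears,
    // divided by the total number of relevant documents (unretrieved ones count as 0).
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // 1 / rank of the first relevant document, or 0 if none is retrieved.
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // DCG at k with graded gains, normalized by the DCG of the ideal ordering.
    static double ndcgAtK(List<String> ranked, Map<String, Integer> gains, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            dcg += gains.getOrDefault(ranked.get(i), 0) / log2(i + 2);
        }
        List<Integer> ideal = new ArrayList<>(gains.values());
        ideal.sort(Collections.reverseOrder());
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, ideal.size()); i++) {
            idcg += ideal.get(i) / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    public static void main(String[] args) {
        // Hypothetical run: d1 and d2 are retrieved at ranks 2 and 4, d5 is missed.
        List<String> ranked = Arrays.asList("d3", "d1", "d7", "d2", "d9");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        Map<String, Integer> gains = new HashMap<>();
        gains.put("d1", 3);
        gains.put("d2", 1);
        gains.put("d5", 2);

        System.out.println("P@5    = " + precisionAtK(ranked, relevant, 5));
        System.out.println("AP     = " + averagePrecision(ranked, relevant));
        System.out.println("RR     = " + reciprocalRank(ranked, relevant));
        System.out.println("nDCG@5 = " + ndcgAtK(ranked, gains, 5));
    }
}
```

For this run, P@5 is 2/5, AP is (1/2 + 2/4) / 3 = 1/3, and RR is 1/2. MAP is simply the arithmetic mean of `averagePrecision` over all queries, and GM MAP is the geometric mean of the same per-query values.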

LUCENE
A 100% Java Text IR engine

Introduction
• Originally written in 1999 by Doug Cutting
• Part of the Apache Software Foundation family
• Full text indexing and searching
• Flexible, extendable
• Spawned several related projects:
  • Nutch: Web crawling + HTML parsing
  • Solr, ElasticSearch: Enterprise search servers
• Used by:
  • AOL, Apple, CiteSeerX, Eclipse, IBM, JIRA, LinkedIn, Twitter, etc.

Package overview
Package     Usage
analysis    Converts text from a Reader into a TokenStream. An Analyzer combines a Tokenizer and TokenFilters to create a TokenStream
document    Simple Document class that is simply a collection of Fields
index       Primarily composed of IndexWriter and IndexReader
search      Data structures to represent queries, an IndexSearcher to search over docs and QueryParsers to convert strings to queries
store       Abstractions for storing persistent data
util        Just a bunch of handy classes

Getting started
• Pre-shipped files: IndexFiles and SearchFiles
• IndexFiles: Used to create the index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index <directory> -docs <directory>
• SearchFiles: Used to query the index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles

IndexFiles
• Set up Directory, Analyzer, IndexWriter
    Directory dir = FSDirectory.open(new File(indexPath)); // where the index is stored on disk
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // standard, out of the box: different analyzers can do different things
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
• Construct Document, add Fields and add to writer
    Document doc = new Document();
    Field pathField = new StringField("path", file.getPath(), Field.Store.YES); // properties for each field
    doc.add(pathField);
    doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
    writer.addDocument(doc); // add document to writer

SearchFiles
• Set up IndexReader and QueryParser
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index))); // read the physical index
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer); // set up the parser: needs a default field and an analyzer
• Parse input text into Query
    Query query = parser.parse(line);
• Run the search and get results
    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;

Some recipes
• Indexing:
  • Use different analyzers
  • Play around with document and fields
• Searching
  • Different query parsers
  • Powerful query syntax:
    • Boolean queries, phrase queries, proximity queries, field boosting etc.
• Highlighter
  • Highlight passages that correspond to matches
• Similarity classes:
  • Default (TF-IDF) plus other variants: BM25, LM, DFR
  • Defines how relevance ranking is done.
  • New in Lucene 4: Per field similarity

Applications to project 2
• The wrapper code makes the corresponding calls to different similarity classes.
• Customizing scoring
  • Subclass the corresponding similarity class
  • Override methods as needed
  • http://www.lucenetutorial.com/advanced-topics/scoring.html
• Query parser:
  • Simplest is to boost fields: Rank one higher than others
  • Machine-learned weights? Can treat as a classic regression problem
  • Can write more complicated queries:
    • POS tagging: some tags are more important than others
    • SVO analysis: What's the subject? Interchangeable verbs etc.
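The "boost fields" idea can be sketched as a small helper that rewrites raw query text into Lucene's classic query syntax, where `field:(terms)^boost` weights one zone above another. This is a minimal sketch under assumptions: the class name `ZoneBoostQuery`, the field names `title` and `body`, and the boost values are hypothetical, and the helper assumes the terms contain no query-syntax special characters (no escaping is attempted).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a field-boosted query string in Lucene's classic query syntax; illustrative only. */
final class ZoneBoostQuery {

    // Repeats the same terms in every field and attaches that field's boost.
    // Assumes terms are already free of special characters such as ':' or '^'.
    static String build(String terms, Map<String, Float> fieldBoosts) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            clauses.add(e.getKey() + ":(" + terms + ")^" + e.getValue());
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Hypothetical zones: matches in the title count twice as much as matches in the body.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0f);
        boosts.put("body", 1.0f);
        // prints: title:(apache lucene)^2.0 OR body:(apache lucene)^1.0
        System.out.println(build("apache lucene", boosts));
    }
}
```

The resulting string can be handed to `parser.parse(...)` as in SearchFiles. Trying a handful of boost values and comparing the evaluation scores is the trial-and-error approach suggested for Section two; learned weights would replace the hand-picked numbers in the map.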
