Lucene

CSE 535: INFORMATION RETRIEVAL
Apache Lucene and Project 2
Agenda
• Walk through for Project 2
• Lucene
  • Introduction and package overview
  • Hands on: Simple indexing and querying
  • Cook book: Some recipes
  • Applications to Project 2
• Recap of concepts
PROJECT TWO
IR models, query processing and evaluation
The two minute overview

• What's given
  • A big collection with queries and relevance judgments
  • A program that evaluates results and tells you how well you did
• Your task
  • Section one: Compare different methods
    • Do models behave differently for different queries (query length, OOV terms, etc.)?
    • Can you tweak parameters to perform better?
    • Can you simulate Boolean queries?
  • Section two: Write a query processing module
    • Read up on Lucene query syntax
    • Easy: Give different weights to different zones. Trial and error may work as a first stab.
    • Harder: Syntactic parsing, NLP processing
  • Section three: Discuss QP results
    • Did some models respond better to QP than others?
    • Any interesting patterns: Technique-wise? Query-wise?
THE THEORY
What concepts would you need?
Relevance models
• VSM or tf-idf
• Boolean
• BM25
• LM
• DFR
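To make the difference between tf-idf and BM25 concrete, here is a minimal sketch of the per-term weights. It assumes the textbook tf-idf form tf * ln(N / df) and a common BM25 form with idf = ln(1 + (N - df + 0.5) / (df + 0.5)); the class name, method names and the sample numbers are illustrative only, and Lucene's own similarity classes differ in details such as length normalization encoding.

```java
class ScoringSketch {
    /** Textbook tf-idf weight: tf * ln(N / df). Zero when the term is absent. */
    static double tfIdf(int tf, int docFreq, int numDocs) {
        if (tf == 0 || docFreq == 0) return 0.0;
        return tf * Math.log((double) numDocs / docFreq);
    }

    /** BM25 contribution of one query term to one document. */
    static double bm25(int tf, int docFreq, int numDocs,
                       double docLen, double avgDocLen, double k1, double b) {
        if (tf == 0) return 0.0;
        double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // Length normalization: b = 0 ignores document length, b = 1 normalizes fully.
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical term that occurs in 10 of 1000 documents of average length.
        for (int tf = 1; tf <= 16; tf *= 2) {
            System.out.printf("tf=%2d  tf-idf=%.3f  bm25=%.3f%n", tf,
                    tfIdf(tf, 10, 1000),
                    bm25(tf, 10, 1000, 100.0, 100.0, 1.2, 0.75));
        }
    }
}
```

The tf-idf weight grows linearly with tf, while the BM25 weight saturates below idf * (k1 + 1). The parameters k1 and b are the kind of knobs Section one asks you to tweak.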
Evaluation metrics
• Precision, Recall & F-measure: single valued
  • tp, fp, tn, fn
• Based on rank
  • Average precision
  • Mean Average Precision: MAP, GM MAP
  • Precision @ k
  • R-precision
  • Reciprocal rank
• Bpref: Preference based
• Incremental value?
  • DCG
  • nDCG
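Several of these metrics are easy to compute by hand, which helps when checking what the evaluation program reports. The sketch below assumes binary relevance for the rank-based metrics and graded gains for DCG. The nDCG here takes the ideal ordering from the retrieved list only, which is a simplification (the usual definition uses all judged documents). Class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RankMetrics {
    /** F1 from set-based counts: harmonic mean of precision and recall. */
    static double f1(int tp, int fp, int fn) {
        if (tp == 0) return 0.0;
        double p = (double) tp / (tp + fp);
        double r = (double) tp / (tp + fn);
        return 2 * p * r / (p + r);
    }

    /** Precision@k: relevant documents in the top k, divided by k. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) return 0.0;
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    /** Average precision: precision at each relevant hit, averaged over all relevant docs. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** Reciprocal rank of the first relevant result, 0 if none is retrieved. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    /** DCG: sum of gain_i / log2(i + 1), with ranks i starting at 1. */
    static double dcg(int[] gains) {
        double sum = 0.0;
        for (int i = 0; i < gains.length; i++) {
            sum += gains[i] / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    /** nDCG: DCG divided by the DCG of the same gains sorted in descending order. */
    static double ndcg(int[] gains) {
        int[] sorted = gains.clone();
        Arrays.sort(sorted);
        int[] ideal = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) ideal[i] = sorted[sorted.length - 1 - i];
        double best = dcg(ideal);
        return best == 0.0 ? 0.0 : dcg(gains) / best;
    }

    public static void main(String[] args) {
        // Hypothetical run: five results, three relevant docs, one of them never retrieved.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d6"));
        System.out.println("P@5  = " + precisionAtK(ranked, relevant, 5));
        System.out.println("AP   = " + averagePrecision(ranked, relevant));
        System.out.println("RR   = " + reciprocalRank(ranked, relevant));
        System.out.println("nDCG = " + ndcg(new int[] {3, 2, 0, 1}));
    }
}
```

R-precision is precisionAtK with k set to the number of relevant documents, and MAP is the arithmetic mean of averagePrecision over all queries.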
LUCENE
A 100% Java Text IR engine
Introduction
• Originally written in 1999 by Doug Cutting
• Part of the Apache Software Foundation family
• Full text indexing and searching
• Flexible, extendable
• Spawned several related projects:
  • Nutch: Web crawling + HTML parsing
  • Solr, ElasticSearch: Enterprise search servers
• Used by:
  • AOL, Apple, CiteSeerX, Eclipse, IBM, JIRA, LinkedIn, Twitter, etc.
Package overview
Package    Usage
analysis   Converts text from a Reader into a TokenStream. An Analyzer combines a Tokenizer and TokenFilters to create the TokenStream.
document   Simple Document class that is simply a collection of Fields
index      Primarily composed of IndexWriter and IndexReader
search     Data structures to represent queries, an IndexSearcher to search over docs and QueryParsers to convert strings to queries
store      Abstractions for storing persistent data
util       Just a bunch of handy classes
Getting started
• Pre-shipped files: IndexFiles and SearchFiles
• IndexFiles: Used to create an index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index <directory> -docs <directory>
• SearchFiles: Used to query the index
    java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
IndexFiles
• Set up Directory, Analyzer, IndexWriter
    Directory dir = FSDirectory.open(new File(indexPath)); // where the index is stored on disk
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // standard, out of the box; different analyzers tokenize differently
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
• Construct Document, add Fields and add to writer
    Document doc = new Document();
    Field pathField = new StringField("path", file.getPath(), Field.Store.YES); // properties for each field: indexed as a single token, stored
    doc.add(pathField);
    doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8")))); // tokenized by the analyzer, not stored
    writer.addDocument(doc); // add document to writer
SearchFiles
• Set up IndexReader and QueryParser
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index))); // read the physical index
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); // should match the analyzer used at index time
    QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer); // set up the parser: needs a default field and an analyzer
• Parse input text into Query
    Query query = parser.parse(line);
• Run the search and get results
    TopDocs results = searcher.search(query, 100); // top 100 hits
    ScoreDoc[] hits = results.scoreDocs;
Some recipes
• Indexing:
  • Use different analyzers
  • Play around with document and fields
• Searching
  • Different query parsers
  • Powerful query syntax:
    • Boolean queries, phrase queries, proximity queries, field boosting etc.
  • Highlighter
    • Highlight passages that correspond to matches
  • Similarity classes:
    • Default (TF-IDF) plus other variants: BM25, LM, DFR
    • Defines how relevance ranking is done.
    • New in Lucene 4: Per field similarity
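As a small illustration of the query syntax mentioned above, the sketch below builds query strings in the classic Lucene query parser syntax: `+term` for required terms, `-term` for prohibited terms, double quotes for phrases, and `~n` after a phrase for proximity. The helper class and its method names are hypothetical, and it assumes the terms contain no special characters that would need escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class QuerySyntaxSketch {
    /** Boolean AND over all terms: every term is marked required with '+'. */
    static String allRequired(List<String> terms) {
        return terms.stream().map(t -> "+" + t).collect(Collectors.joining(" "));
    }

    /** Boolean NOT: every term is marked prohibited with '-'. */
    static String allProhibited(List<String> terms) {
        return terms.stream().map(t -> "-" + t).collect(Collectors.joining(" "));
    }

    /** Phrase query: terms must appear adjacent and in order. */
    static String phrase(List<String> terms) {
        return "\"" + String.join(" ", terms) + "\"";
    }

    /** Proximity query: terms must appear within 'slop' positions of each other. */
    static String proximity(List<String> terms, int slop) {
        return phrase(terms) + "~" + slop;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("information", "retrieval");
        System.out.println(allRequired(terms));
        System.out.println(allRequired(terms) + " " + allProhibited(Arrays.asList("music")));
        System.out.println(phrase(terms));
        System.out.println(proximity(terms, 5));
    }
}
```

Strings like these can be handed to the query parser's parse method. Marking every term as required is one simple way to approximate a strict Boolean query.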
Applications to project 2
• The wrapper code makes the corresponding calls to different similarity classes.
• Customizing scoring
  • Subclass the corresponding similarity class
  • Override methods as needed
  • http://www.lucenetutorial.com/advanced-topics/scoring.html
• Query parser:
  • Simplest is to boost fields: Rank one higher than others
  • Machine learned weights? Can treat as a classic regression problem
  • Can write more complicated queries:
    • POS tagging: some tags more important than others
    • SVO analysis: What's the subject? Interchangeable verbs etc.
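The simplest option, field boosting, can be sketched as a query rewriter that repeats the user's text once per zone and attaches a weight with the `^` operator. Everything named here is an assumption for illustration: the field names `title` and `body`, the weights, and the hand-rolled `escape` helper, which backslash-escapes the special characters of the classic query syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class BoostedQueryBuilder {
    // Characters with special meaning in the classic query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Backslash-escapes special characters so user text is treated as plain terms. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds field:(query)^weight clauses, one per field, joined with OR. */
    static String build(String userQuery, Map<String, Double> fieldWeights) {
        String escaped = escape(userQuery);
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> e : fieldWeights.entrySet()) {
            clauses.add(e.getKey() + ":(" + escaped + ")^" + e.getValue());
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Hypothetical zones and weights; LinkedHashMap keeps the clause order stable.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("title", 2.0);
        weights.put("body", 1.0);
        System.out.println(build("information retrieval", weights));
        // prints: title:(information retrieval)^2.0 OR body:(information retrieval)^1.0
    }
}
```

The weights can start as trial and error values and later be replaced by learned ones without changing the rewriting step.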
