Java Doc for MoreLikeThis.java (lucene-connector, org.apache.lucene.search.similar)

java.lang.Object
   org.apache.lucene.search.similar.MoreLikeThis

MoreLikeThis
final public class MoreLikeThis
Generate "more like this" similarity queries. Based on this mail:
 Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
 Term frequencies can be computed by re-tokenizing the text, which, for a single document,
 is usually fast enough.  But looking up the docFreq() of every term in the document is
 probably too slow.
 You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
 or at all.  Since you're trying to maximize a tf*idf score, you're probably most interested
 in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
 reduce the number of terms under consideration.  Another heuristic is that terms with a
 high idf (i.e., a low df) tend to be longer.  So you could threshold the terms by the
 number of characters, not selecting anything less than, e.g., six or seven characters.
 With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms
 that do a pretty good job of characterizing a document.
 It all depends on what you're trying to do.  If you're trying to eke out that last percent
 of precision and recall regardless of computational difficulty so that you can win a TREC
 competition, then the techniques I mention above are useless.  But if you're trying to
 provide a "more like this" button on a search results page that does a decent job and has
 good performance, such techniques might be useful.
 An efficient, effective "more-like-this" query generator would be a great contribution, if
 anyone's interested.  I'd imagine that it would take a Reader or a String (the document's
 text), an Analyzer, and return a set of representative terms using heuristics like those
 above.  The frequency and length thresholds could be parameters, etc.
 Doug
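
The heuristics described in the mail above can be illustrated with a small, self-contained sketch. It is not the code of this class: the class name TermPruningSketch, its methods, and the simple regex tokenizer are assumptions made for illustration. It ranks by term frequency only and never calls docFreq(), which is the cheap variant the mail suggests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TermPruningSketch {

    /** Counts how often each lower-cased word occurs in the text (its tf). */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Assumption: a crude tokenizer that splits on anything but a-z and 0-9.
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    /**
     * Keeps words with tf >= minTermFreq and length >= minWordLen, then
     * returns at most maxTerms of them, highest tf first (ties are broken
     * alphabetically so the result is deterministic).
     */
    static List<String> representativeTerms(String text, int minTermFreq,
                                            int minWordLen, int maxTerms) {
        Map<String, Integer> tf = termFrequencies(text);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            if (e.getValue() >= minTermFreq && e.getKey().length() >= minWordLen) {
                kept.add(e.getKey());
            }
        }
        kept.sort((a, b) -> {
            int byTf = Integer.compare(tf.get(b), tf.get(a));
            return byTf != 0 ? byTf : a.compareTo(b);
        });
        return kept.size() > maxTerms
                ? new ArrayList<>(kept.subList(0, maxTerms))
                : kept;
    }

    public static void main(String[] args) {
        String text = "Lucene indexes documents. Lucene searches documents quickly. "
                + "A searcher searches an index of documents.";
        // tf threshold of 2 and a minimum length of 6 characters, as in the mail.
        System.out.println(representativeTerms(text, 2, 6, 10));
    }
}
```

Raising either threshold shrinks the candidate set before any index lookup happens, which is the point of the heuristic.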
 

Initial Usage

This class has many options that make it efficient and flexible. See the body of MoreLikeThis.main() in the source for real code. If you want pseudo code, the simplest possible usage is as follows. The middle fragment (the three lines starting at the MoreLikeThis constructor) is specific to this class.
 IndexReader ir = ...
 IndexSearcher is = ...
 
 MoreLikeThis mlt = new MoreLikeThis(ir);
 Reader target = ... // original source of the doc you want to find similarities to
 Query query = mlt.like(target);
 
 Hits hits = is.search(query);
 // now the usual iteration through 'hits'; the only thing to watch for is to
 // ignore the doc if it matches your 'target' document, as it should be similar to itself
 
Thus you:
  1. do your normal Lucene setup for searching,
  2. create a MoreLikeThis,
  3. get the text of the doc you want to find similarities to,
  4. call one of the like() methods to generate a similarity query,
  5. call the searcher to find the similar docs.
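
Step 5 usually needs one extra filter, because the source document tends to match its own similarity query. The following is a minimal sketch of that filter, assuming the target document is in the index and that you have already collected the ranked document numbers from the search results. SkipSelfSketch and withoutTarget are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipSelfSketch {

    /** Returns the ranked ids with the target id removed, order preserved. */
    static List<Integer> withoutTarget(List<Integer> rankedDocIds, int targetId) {
        List<Integer> similar = new ArrayList<>();
        for (int id : rankedDocIds) {
            if (id != targetId) {
                similar.add(id);
            }
        }
        return similar;
    }

    public static void main(String[] args) {
        // Assumed search result: doc 7 is the target and ranks first against itself.
        System.out.println(withoutTarget(Arrays.asList(7, 3, 9), 7));
    }
}
```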

More Advanced Usage

You may want to use setFieldNames(...) so you can examine multiple fields (e.g. body and title) for similarity.

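Depending on the size of your index and the size and makeup of your documents, you may want to call the other set methods to control how the similarity queries are generated: setMinTermFreq(...), setMinDocFreq(...), setMinWordLen(...), setMaxWordLen(...), setMaxQueryTerms(...), setMaxNumTokensParsed(...), and setStopWords(...).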


 Changes: Mark Harwood 29/02/04
 Some bugfixing, some refactoring, some optimisation.
 - bugfix: retrieveTerms(int docNum) was not working for indexes without a termvector; added missing code
 - bugfix: no significant terms were being created for fields with a termvector, because only one occurrence per term/field pair was counted in calculations (i.e. frequency info from the TermVector was not included)
 - refactor: moved common code into isNoiseWord()
 - optimise: when no termvector support is available, maxNumTermsParsed is used to limit the amount of tokenization
 

author:
   David Spencer
author:
   Bruce Ritchie
author:
   Mark Harwood


Field Summary
final public static Analyzer DEFAULT_ANALYZER
     Default analyzer to parse source doc with.
final public static boolean DEFAULT_BOOST
     Boost terms in query based on score.
final public static String[] DEFAULT_FIELD_NAMES
     Default field names.
final public static int DEFAULT_MAX_NUM_TOKENS_PARSED
     Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
final public static int DEFAULT_MAX_QUERY_TERMS
     Return a Query with no more than this many terms.
final public static int DEFAULT_MAX_WORD_LENGTH
     Ignore words greater than this length or if 0 then this has no effect.
final public static int DEFAULT_MIN_DOC_FREQ
     Ignore words which do not occur in at least this many docs.
final public static int DEFAULT_MIN_TERM_FREQ
     Ignore terms with less than this frequency in the source doc.
final public static int DEFAULT_MIN_WORD_LENGTH
     Ignore words less than this length or if 0 then this has no effect.
final public static Set DEFAULT_STOP_WORDS
     Default set of stopwords.

Constructor Summary
public  MoreLikeThis(IndexReader ir)
     Constructor requiring an IndexReader.

Method Summary
public String describeParams()
     Describe the parameters that control how the "more like this" query is formed.
public Analyzer getAnalyzer()
     Returns the analyzer that will be used to parse the source doc.
public String[] getFieldNames()
     Returns the field names that will be used when generating the 'More Like This' query.
public int getMaxNumTokensParsed()
     Returns the maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
public int getMaxQueryTerms()
     Returns the maximum number of query terms that will be included in any generated query.
public int getMaxWordLen()
     Returns the maximum word length above which words will be ignored.
public int getMinDocFreq()
     Returns the minimum number of docs a word must occur in; words that occur in fewer docs are ignored.
public int getMinTermFreq()
     Returns the frequency below which terms will be ignored in the source doc.
public int getMinWordLen()
     Returns the minimum word length below which words will be ignored.
public Set getStopWords()
     Get the current stop words being used.
public boolean isBoost()
     Returns whether to boost terms in query based on "score" or not.
public Query like(int docNum)
     Return a query that will return docs like the passed lucene document ID.
Parameters:
  docNum - the documentID of the lucene doc to generate the 'More Like This' query for.
public Query like(File f)
     Return a query that will return docs like the passed file.
public Query like(URL u)
     Return a query that will return docs like the passed URL.
public Query like(java.io.InputStream is)
     Return a query that will return docs like the passed stream.
public Query like(Reader r)
     Return a query that will return docs like the passed Reader.
public static void main(String[] a)
     Test driver.
public String[] retrieveInterestingTerms(Reader r)
     Convenience routine to make it easy to return the most interesting words in a document.
public PriorityQueue retrieveTerms(Reader r)
     Find words for a more-like-this query former.
public void setAnalyzer(Analyzer analyzer)
     Sets the analyzer to use.
public void setBoost(boolean boost)
     Sets whether to boost terms in query based on "score" or not.
public void setFieldNames(String[] fieldNames)
     Sets the field names that will be used when generating the 'More Like This' query.
public void setMaxNumTokensParsed(int i)
     Sets the maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
public void setMaxQueryTerms(int maxQueryTerms)
     Sets the maximum number of query terms that will be included in any generated query.
public void setMaxWordLen(int maxWordLen)
     Sets the maximum word length above which words will be ignored.
public void setMinDocFreq(int minDocFreq)
     Sets the minimum number of docs a word must occur in; words that occur in fewer docs are ignored.
public void setMinTermFreq(int minTermFreq)
     Sets the frequency below which terms will be ignored in the source doc.
public void setMinWordLen(int minWordLen)
     Sets the minimum word length below which words will be ignored.
public void setStopWords(Set stopWords)
     Set the set of stopwords.

Field Detail
DEFAULT_ANALYZER
final public static Analyzer DEFAULT_ANALYZER
Default analyzer to parse source doc with.
See Also:   MoreLikeThis.getAnalyzer

DEFAULT_BOOST
final public static boolean DEFAULT_BOOST
Boost terms in query based on score.
See Also:   MoreLikeThis.isBoost
See Also:   MoreLikeThis.setBoost

DEFAULT_FIELD_NAMES
final public static String[] DEFAULT_FIELD_NAMES
Default field names. Null is used to specify that the field names should be looked up at runtime from the provided reader.

DEFAULT_MAX_NUM_TOKENS_PARSED
final public static int DEFAULT_MAX_NUM_TOKENS_PARSED
Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
See Also:   MoreLikeThis.getMaxNumTokensParsed

DEFAULT_MAX_QUERY_TERMS
final public static int DEFAULT_MAX_QUERY_TERMS
Return a Query with no more than this many terms.
See Also:   BooleanQuery.getMaxClauseCount
See Also:   MoreLikeThis.getMaxQueryTerms
See Also:   MoreLikeThis.setMaxQueryTerms

DEFAULT_MAX_WORD_LENGTH
final public static int DEFAULT_MAX_WORD_LENGTH
Ignore words greater than this length or if 0 then this has no effect.
See Also:   MoreLikeThis.getMaxWordLen
See Also:   MoreLikeThis.setMaxWordLen

DEFAULT_MIN_DOC_FREQ
final public static int DEFAULT_MIN_DOC_FREQ
Ignore words which do not occur in at least this many docs.
See Also:   MoreLikeThis.getMinDocFreq
See Also:   MoreLikeThis.setMinDocFreq

DEFAULT_MIN_TERM_FREQ
final public static int DEFAULT_MIN_TERM_FREQ
Ignore terms with less than this frequency in the source doc.
See Also:   MoreLikeThis.getMinTermFreq
See Also:   MoreLikeThis.setMinTermFreq

DEFAULT_MIN_WORD_LENGTH
final public static int DEFAULT_MIN_WORD_LENGTH
Ignore words less than this length or if 0 then this has no effect.
See Also:   MoreLikeThis.getMinWordLen
See Also:   MoreLikeThis.setMinWordLen



DEFAULT_STOP_WORDS
final public static Set DEFAULT_STOP_WORDS
Default set of stopwords. If null, stop words are allowed.
See Also:   MoreLikeThis.setStopWords
See Also:   MoreLikeThis.getStopWords




Constructor Detail
MoreLikeThis
public MoreLikeThis(IndexReader ir)
Constructor requiring an IndexReader.




Method Detail
describeParams
public String describeParams()
Describe the parameters that control how the "more like this" query is formed.

getAnalyzer
public Analyzer getAnalyzer()
Returns the analyzer that will be used to parse the source doc. The default analyzer is MoreLikeThis.DEFAULT_ANALYZER.
Returns:
  the analyzer that will be used to parse the source doc.
See Also:   MoreLikeThis.DEFAULT_ANALYZER

getFieldNames
public String[] getFieldNames()
Returns the field names that will be used when generating the 'More Like This' query. The default is MoreLikeThis.DEFAULT_FIELD_NAMES.
Returns:
  the field names that will be used when generating the 'More Like This' query.

getMaxNumTokensParsed
public int getMaxNumTokensParsed()
Returns the maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
See Also:   MoreLikeThis.DEFAULT_MAX_NUM_TOKENS_PARSED

getMaxQueryTerms
public int getMaxQueryTerms()
Returns the maximum number of query terms that will be included in any generated query. The default is MoreLikeThis.DEFAULT_MAX_QUERY_TERMS.
Returns:
  the maximum number of query terms that will be included in any generated query.

getMaxWordLen
public int getMaxWordLen()
Returns the maximum word length above which words will be ignored. Set this to 0 for no maximum word length. The default is MoreLikeThis.DEFAULT_MAX_WORD_LENGTH.
Returns:
  the maximum word length above which words will be ignored.

getMinDocFreq
public int getMinDocFreq()
Returns the minimum number of docs a word must occur in; words that occur in fewer docs are ignored. The default is MoreLikeThis.DEFAULT_MIN_DOC_FREQ.
Returns:
  the document frequency below which words will be ignored.

getMinTermFreq
public int getMinTermFreq()
Returns the frequency below which terms will be ignored in the source doc. The default is MoreLikeThis.DEFAULT_MIN_TERM_FREQ.
Returns:
  the frequency below which terms will be ignored in the source doc.

getMinWordLen
public int getMinWordLen()
Returns the minimum word length below which words will be ignored. Set this to 0 for no minimum word length. The default is MoreLikeThis.DEFAULT_MIN_WORD_LENGTH.
Returns:
  the minimum word length below which words will be ignored.

getStopWords
public Set getStopWords()
Get the current stop words being used.
See Also:   MoreLikeThis.setStopWords

isBoost
public boolean isBoost()
Returns whether to boost terms in query based on "score" or not. The default is MoreLikeThis.DEFAULT_BOOST.
Returns:
  whether to boost terms in query based on "score" or not.
See Also:   MoreLikeThis.setBoost
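
This page does not say how a score is turned into a boost. One plausible scheme, shown here purely as an assumption, divides every term score by the best score so that the top term gets a boost of 1 and weaker terms get proportionally less. BoostSketch and relativeBoosts are hypothetical names and are not taken from the MoreLikeThis source.

```java
class BoostSketch {

    /**
     * Scales each score by the largest score in the array.
     * Assumption: scores are non-negative; if none is positive, every boost is 1.
     */
    static float[] relativeBoosts(float[] scores) {
        float best = 0f;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        float[] boosts = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            boosts[i] = best > 0f ? scores[i] / best : 1f;
        }
        return boosts;
    }

    public static void main(String[] args) {
        for (float b : relativeBoosts(new float[] { 6.0f, 3.0f, 1.5f })) {
            System.out.println(b);
        }
    }
}
```

With boosting switched off, the equivalent would be to give every term the same weight.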



like
public Query like(int docNum) throws IOException
Return a query that will return docs like the passed lucene document ID.
Parameters:
  docNum - the documentID of the lucene doc to generate the 'More Like This' query for.
Returns:
  a query that will return docs like the passed lucene document ID.

like
public Query like(File f) throws IOException
Return a query that will return docs like the passed file.
Returns:
  a query that will return docs like the passed file.

like
public Query like(URL u) throws IOException
Return a query that will return docs like the passed URL.
Returns:
  a query that will return docs like the passed URL.

like
public Query like(java.io.InputStream is) throws IOException
Return a query that will return docs like the passed stream.
Returns:
  a query that will return docs like the passed stream.

like
public Query like(Reader r) throws IOException
Return a query that will return docs like the passed Reader.
Returns:
  a query that will return docs like the passed Reader.

main
public static void main(String[] a) throws Throwable
Test driver. Pass in "-i INDEX" and then either "-fn FILE" or "-url URL".

retrieveInterestingTerms
public String[] retrieveInterestingTerms(Reader r) throws IOException
Convenience routine to make it easy to return the most interesting words in a document. More advanced users will call retrieveTerms(java.io.Reader) directly.
Parameters:
  r - the source document
Returns:
  the most interesting words in the document
See Also:   MoreLikeThis.retrieveTerms(java.io.Reader)
See Also:   MoreLikeThis.setMaxQueryTerms

retrieveTerms
public PriorityQueue retrieveTerms(Reader r) throws IOException
Find words for a more-like-this query former. The result is a priority queue of arrays with one entry for every word in the document. Each array has 6 elements. The elements are:
  1. The word (String)
  2. The top field that this word comes from (String)
  3. The score for this word (Float)
  4. The IDF value (Float)
  5. The frequency of this word in the index (Integer)
  6. The frequency of this word in the source document (Integer)
This is a somewhat "advanced" routine, and in general only the 1st entry in the array is of interest. This method is exposed so that you can identify the "interesting words" in a document. For an easier method to call, see retrieveInterestingTerms().
Parameters:
  r - the reader that has the content of the document
Returns:
  the most interesting words in the document ordered by score, with the highest scoring, or best entry, first
See Also:   MoreLikeThis.retrieveInterestingTerms
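
The six-element layout described for retrieveTerms can be pictured with a short, self-contained sketch. It uses java.util.PriorityQueue as a stand-in, which is an assumption: the PriorityQueue returned by this class may be a different type with a different API. TermQueueSketch, entry, newQueue, and drainWords are hypothetical names, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TermQueueSketch {

    // Hypothetical helper: one entry in the documented element order
    // (word, top field, score, IDF, index frequency, source-doc frequency).
    static Object[] entry(String word, String topField, float score,
                          float idf, int docFreq, int termFreq) {
        return new Object[] { word, topField, score, idf, docFreq, termFreq };
    }

    /** Creates an empty queue whose head is the entry with the highest score (element 2). */
    static PriorityQueue<Object[]> newQueue() {
        Comparator<Object[]> bestFirst =
                (a, b) -> Float.compare((Float) b[2], (Float) a[2]);
        return new PriorityQueue<Object[]>(11, bestFirst);
    }

    /** Drains the queue and returns only the words (element 0), best first. */
    static List<String> drainWords(PriorityQueue<Object[]> queue) {
        List<String> words = new ArrayList<>();
        while (!queue.isEmpty()) {
            words.add((String) queue.poll()[0]);
        }
        return words;
    }

    public static void main(String[] args) {
        PriorityQueue<Object[]> q = newQueue();
        q.add(entry("search", "body", 2.0f, 1.0f, 40, 2));
        q.add(entry("lucene", "body", 6.0f, 3.0f, 5, 2));
        q.add(entry("index", "title", 4.5f, 1.5f, 20, 3));
        System.out.println(drainWords(q));
    }
}
```

Reading only element 0 of each entry, as drainWords does, mirrors the note that the first element is usually the only one of interest.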



setAnalyzer
public void setAnalyzer(Analyzer analyzer)
Sets the analyzer to use. An analyzer is not required for generating a query with the MoreLikeThis.like(int) method; all other 'like' methods require an analyzer.
Parameters:
  analyzer - the analyzer to use to tokenize text.

setBoost
public void setBoost(boolean boost)
Sets whether to boost terms in query based on "score" or not.
Parameters:
  boost - true to boost terms in query based on "score", false otherwise.
See Also:   MoreLikeThis.isBoost

setFieldNames
public void setFieldNames(String[] fieldNames)
Sets the field names that will be used when generating the 'More Like This' query. Set this to null for the field names to be determined at runtime from the IndexReader provided in the constructor.
Parameters:
  fieldNames - the field names that will be used when generating the 'More Like This' query.

setMaxNumTokensParsed
public void setMaxNumTokensParsed(int i)
Sets the maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
Parameters:
  i - the maximum number of tokens to parse in each example doc field that is not stored with TermVector support.

setMaxQueryTerms
public void setMaxQueryTerms(int maxQueryTerms)
Sets the maximum number of query terms that will be included in any generated query.
Parameters:
  maxQueryTerms - the maximum number of query terms that will be included in any generated query.

setMaxWordLen
public void setMaxWordLen(int maxWordLen)
Sets the maximum word length above which words will be ignored.
Parameters:
  maxWordLen - the maximum word length above which words will be ignored.

setMinDocFreq
public void setMinDocFreq(int minDocFreq)
Sets the minimum number of docs a word must occur in; words that occur in fewer docs are ignored.
Parameters:
  minDocFreq - the document frequency below which words will be ignored.

setMinTermFreq
public void setMinTermFreq(int minTermFreq)
Sets the frequency below which terms will be ignored in the source doc.
Parameters:
  minTermFreq - the frequency below which terms will be ignored in the source doc.

setMinWordLen
public void setMinWordLen(int minWordLen)
Sets the minimum word length below which words will be ignored.
Parameters:
  minWordLen - the minimum word length below which words will be ignored.

setStopWords
public void setStopWords(Set stopWords)
Set the set of stopwords. Any word in this set is considered "uninteresting" and ignored. Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
Parameters:
  stopWords - set of stopwords; if null, stop words are allowed
See Also:   org.apache.lucene.analysis.StopFilter.makeStopSet
See Also:   MoreLikeThis.getStopWords
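
The word-length limits and the stop-word set described on this page combine naturally into a single filter. The following is a minimal sketch of such a check under the documented conventions (a length limit of 0 has no effect, and a null set allows stop words). The changes list mentions an isNoiseWord() method, but its actual code is not shown on this page, so NoiseWordSketch and this signature are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NoiseWordSketch {

    /**
     * Returns true if the word should be skipped. A length limit of 0
     * disables that limit; a null stop-word set means stop words are allowed.
     */
    static boolean isNoiseWord(String word, int minWordLen, int maxWordLen,
                               Set<String> stopWords) {
        int len = word.length();
        if (minWordLen > 0 && len < minWordLen) {
            return true;
        }
        if (maxWordLen > 0 && len > maxWordLen) {
            return true;
        }
        return stopWords != null && stopWords.contains(word);
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "and", "of"));
        for (String w : new String[] { "the", "to", "lucene" }) {
            System.out.println(w + " -> " + isNoiseWord(w, 3, 0, stops));
        }
    }
}
```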



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
