java.lang.Object org.apache.lucene.search.similar.MoreLikeThis
MoreLikeThis | final public class MoreLikeThis (Code) | | Generate "more like this" similarity queries.
Based on this mail:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eke out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.
An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text) and an Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
Doug
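To make the pruning idea concrete, here is a minimal sketch of the heuristics Doug describes: filter by term frequency and word length before paying for docFreq() lookups, then rank by tf*idf. It is only an illustration, not the code this class uses; the termFreqs map, the field name and the threshold parameters are assumed inputs.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Rough sketch of the heuristics from the mail above - NOT the actual
// MoreLikeThis implementation. 'termFreqs' is assumed to already hold the
// term -> frequency counts for the source document, and 'field' is the
// index field to look the terms up in.
class TermPruningSketch {
    static List<Object[]> bestTerms(Map<String, Integer> termFreqs, IndexReader ir,
                                    String field, int minTf, int minWordLen,
                                    int maxTerms) throws IOException {
        int numDocs = ir.numDocs();
        List<Object[]> scored = new ArrayList<Object[]>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String word = e.getKey();
            int tf = e.getValue().intValue();
            if (tf < minTf) continue;                   // heuristic: drop low-tf terms early
            if (word.length() < minWordLen) continue;   // heuristic: drop short words
            int df = ir.docFreq(new Term(field, word)); // the expensive lookup
            if (df == 0) continue;
            float idf = (float) Math.log((double) numDocs / df);
            scored.add(new Object[] { word, new Float(tf * idf) });
        }
        // keep only the highest-scoring terms
        Collections.sort(scored, new Comparator<Object[]>() {
            public int compare(Object[] a, Object[] b) {
                return Float.compare((Float) b[1], (Float) a[1]); // descending by score
            }
        });
        return scored.subList(0, Math.min(maxTerms, scored.size()));
    }
}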
Initial Usage
This class has lots of options to try to make it efficient and flexible.
See the body of main() below in the source for real code, or
if you want pseudo code, the simplest possible usage is as follows. The
lines that mention MoreLikeThis (mlt) are the only part specific to this class.
IndexReader ir = ...
IndexSearcher is = ...
MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);
// now the usual iteration thru 'hits' - the only thing to watch for is to
// make sure you ignore the doc if it matches your 'target' document, as it
// should be similar to itself
Thus you:
- do your normal, Lucene setup for searching,
- create a MoreLikeThis,
- get the text of the doc you want to find similarities to
- then call one of the like() calls to generate a similarity query
- call the searcher to find the similar docs (a fuller sketch follows below)
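Putting those steps together, a fuller sketch using the same pre-3.0 Hits API as the pseudo code above might look like the following. The index path, the target document id and the "title" field are assumptions made for the example.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MoreLikeThisExample {
    public static void main(String[] args) throws Exception {
        // assumed index location - substitute your own
        IndexReader ir = IndexReader.open("/path/to/index");
        IndexSearcher is = new IndexSearcher(ir);

        MoreLikeThis mlt = new MoreLikeThis(ir);
        int targetDocNum = 42;                // the already-indexed doc to find similarities to
        Query query = mlt.like(targetDocNum); // no analyzer needed for like(int)

        Hits hits = is.search(query);
        for (int i = 0; i < hits.length(); i++) {
            if (hits.id(i) == targetDocNum) {
                continue;   // skip the target doc - it is always similar to itself
            }
            // "title" is just an example stored field
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
        }
        is.close();
        ir.close();
    }
}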
More Advanced Usage
You may want to use setFieldNames(...) so you can examine
multiple fields (e.g. body and title) for similarity.
Depending on the size of your index and the size and makeup of your documents you
may want to call the other set methods (setMinTermFreq(), setMinDocFreq(),
setMaxQueryTerms() and so on) to control how the similarity queries are generated.
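For instance, a configuration along the following lines tightens term selection for a larger index. The field names and threshold values are illustrative only, not recommended defaults.
MoreLikeThis mlt = new MoreLikeThis(ir);              // ir is your IndexReader
mlt.setFieldNames(new String[] { "title", "body" });  // fields to mine for terms
mlt.setMinTermFreq(2);     // ignore terms occurring only once in the source doc
mlt.setMinDocFreq(5);      // ignore terms found in fewer than 5 docs
mlt.setMinWordLen(3);      // ignore very short words
mlt.setMaxWordLen(20);     // ignore very long words
mlt.setMaxQueryTerms(25);  // cap the number of terms in the generated query
mlt.setBoost(true);        // boost query terms by their relative score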
Changes: Mark Harwood 29/02/04
Some bugfixing, some refactoring, some optimisation.
- bugfix: retrieveTerms(int docNum) was not working for indexes without a term vector - added missing code
- bugfix: no significant terms were being created for fields with a term vector, because
only one occurrence per term/field pair was counted in the calculations (i.e. frequency info from the TermVector was not included)
- refactor: moved common code into isNoiseWord()
- optimise: when no term vector support is available, used maxNumTokensParsed to limit the amount of tokenization
author: David Spencer author: Bruce Ritchie author: Mark Harwood |
Method Summary | |
public String | describeParams() Describe the parameters that control how the "more like this" query is formed. |
public Analyzer | getAnalyzer() Returns an analyzer that will be used to parse source doc with. |
public String[] | getFieldNames() Returns the field names that will be used when generating the 'More Like This' query. |
public int | getMaxNumTokensParsed() |
public int | getMaxQueryTerms() Returns the maximum number of query terms that will be included in any generated query. |
public int | getMaxWordLen() Returns the maximum word length above which words will be ignored. |
public int | getMinDocFreq() Returns the frequency at which words will be ignored which do not occur in at least this many docs. |
public int | getMinTermFreq() Returns the frequency below which terms will be ignored in the source doc. |
public int | getMinWordLen() Returns the minimum word length below which words will be ignored. |
public Set | getStopWords() Get the current stop words being used. |
public boolean | isBoost() Returns whether to boost terms in query based on "score" or not. |
public Query | like(int docNum) Return a query that will return docs like the passed lucene document ID. Parameters: docNum - the documentID of the lucene doc to generate the 'More Like This' query for. |
public Query | like(File f) Return a query that will return docs like the passed file. |
public Query | like(URL u) Return a query that will return docs like the passed URL. |
public Query | like(java.io.InputStream is) Return a query that will return docs like the passed stream. |
public Query | like(Reader r) Return a query that will return docs like the passed Reader. |
public static void | main(String[] a) Test driver. |
public String[] | retrieveInterestingTerms(Reader r) Convenience routine to make it easy to return the most interesting words in a document. |
public PriorityQueue | retrieveTerms(Reader r) Find words for a more-like-this query former. |
public void | setAnalyzer(Analyzer analyzer) Sets the analyzer to use. |
public void | setBoost(boolean boost) Sets whether to boost terms in query based on "score" or not. |
public void | setFieldNames(String[] fieldNames) Sets the field names that will be used when generating the 'More Like This' query. |
public void | setMaxNumTokensParsed(int i) |
public void | setMaxQueryTerms(int maxQueryTerms) Sets the maximum number of query terms that will be included in any generated query. |
public void | setMaxWordLen(int maxWordLen) Sets the maximum word length above which words will be ignored. |
public void | setMinDocFreq(int minDocFreq) Sets the frequency at which words will be ignored which do not occur in at least this many docs. |
public void | setMinTermFreq(int minTermFreq) Sets the frequency below which terms will be ignored in the source doc. |
public void | setMinWordLen(int minWordLen) Sets the minimum word length below which words will be ignored. |
public void | setStopWords(Set stopWords) Set the set of stopwords. |
DEFAULT_FIELD_NAMES | final public static String[] DEFAULT_FIELD_NAMES(Code) | | Default field names. Null is used to specify that the field names should be looked
up at runtime from the provided reader.
|
DEFAULT_MAX_NUM_TOKENS_PARSED | final public static int DEFAULT_MAX_NUM_TOKENS_PARSED(Code) | | Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
See Also: MoreLikeThis.getMaxNumTokensParsed |
MoreLikeThis | public MoreLikeThis(IndexReader ir)(Code) | | Constructor requiring an IndexReader.
|
describeParams | public String describeParams()(Code) | | Describe the parameters that control how the "more like this" query is formed.
|
getFieldNames | public String[] getFieldNames()(Code) | | Returns the field names that will be used when generating the 'More Like This' query.
The default field names that will be used are
MoreLikeThis.DEFAULT_FIELD_NAMES .
the field names that will be used when generating the 'More Like This' query. |
getMaxQueryTerms | public int getMaxQueryTerms()(Code) | | Returns the maximum number of query terms that will be included in any generated query.
The default is
MoreLikeThis.DEFAULT_MAX_QUERY_TERMS .
the maximum number of query terms that will be included in any generated query. |
getMaxWordLen | public int getMaxWordLen()(Code) | | Returns the maximum word length above which words will be ignored. Set this to 0 for no
maximum word length. The default is
MoreLikeThis.DEFAULT_MAX_WORD_LENGTH .
the maximum word length above which words will be ignored. |
getMinDocFreq | public int getMinDocFreq()(Code) | | Returns the frequency at which words will be ignored which do not occur in at least this
many docs. The default frequency is
MoreLikeThis.DEFAULT_MIN_DOC_FREQ .
the frequency at which words will be ignored which do not occur in at least this many docs. |
getMinTermFreq | public int getMinTermFreq()(Code) | | Returns the frequency below which terms will be ignored in the source doc. The default
frequency is
MoreLikeThis.DEFAULT_MIN_TERM_FREQ .
the frequency below which terms will be ignored in the source doc. |
getMinWordLen | public int getMinWordLen()(Code) | | Returns the minimum word length below which words will be ignored. Set this to 0 for no
minimum word length. The default is
MoreLikeThis.DEFAULT_MIN_WORD_LENGTH .
the minimum word length below which words will be ignored. |
like | public Query like(int docNum) throws IOException(Code) | | Return a query that will return docs like the passed lucene document ID.
Parameters: docNum - the documentID of the lucene doc to generate the 'More Like This' query for. a query that will return docs like the passed lucene document ID. |
like | public Query like(File f) throws IOException(Code) | | Return a query that will return docs like the passed file.
a query that will return docs like the passed file. |
like | public Query like(URL u) throws IOException(Code) | | Return a query that will return docs like the passed URL.
a query that will return docs like the passed URL. |
like | public Query like(Reader r) throws IOException(Code) | | Return a query that will return docs like the passed Reader.
a query that will return docs like the passed Reader. |
main | public static void main(String[] a) throws Throwable(Code) | | Test driver.
Pass in "-i INDEX" and then either "-fn FILE" or "-url URL".
|
retrieveTerms | public PriorityQueue retrieveTerms(Reader r) throws IOException(Code) | | Find words for a more-like-this query former.
The result is a priority queue of arrays with one entry for every word in the document.
Each array has 6 elements.
The elements are:
- The word (String)
- The top field that this word comes from (String)
- The score for this word (Float)
- The IDF value (Float)
- The frequency of this word in the index (Integer)
- The frequency of this word in the source document (Integer)
This is a somewhat "advanced" routine, and in general only the 1st entry in the array is of interest.
This method is exposed so that you can identify the "interesting words" in a document.
For an easier method to call see
retrieveInterestingTerms().
Parameters: r - the reader that has the content of the document the most interesting words in the document ordered by score, with the highest scoring, or best entry, first See Also: MoreLikeThis.retrieveInterestingTerms |
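As an illustration, the queue could be drained as sketched below, assuming the array elements appear in the order listed above; 'mlt' is a MoreLikeThis with an analyzer already set (see setAnalyzer below) and 'docText' is the raw text of the source document.
import java.io.StringReader;
import org.apache.lucene.util.PriorityQueue;

// Drain the queue; per the description above, the best-scoring entries come out first.
PriorityQueue pq = mlt.retrieveTerms(new StringReader(docText));
while (pq.size() > 0) {
    Object[] entry = (Object[]) pq.pop();
    String word  = (String) entry[0];   // the word itself
    String field = (String) entry[1];   // the top field the word comes from
    Float score  = (Float) entry[2];    // the word's score
    System.out.println(word + " (" + field + "): " + score);
}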
setAnalyzer | public void setAnalyzer(Analyzer analyzer)(Code) | | Sets the analyzer to use. An analyzer is not required for generating a query with the
MoreLikeThis.like(int) method; all other 'like' methods require an analyzer.
Parameters: analyzer - the analyzer to use to tokenize text. |
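For example, to build a query from raw text instead of an already-indexed document, an analyzer has to be supplied first; StandardAnalyzer and the 'someText' variable are assumptions made for this sketch.
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// like(int docNum) needs no analyzer, but the Reader/File/URL/InputStream
// variants re-tokenize the text, so an analyzer must be set first:
mlt.setAnalyzer(new StandardAnalyzer());
Query query = mlt.like(new StringReader(someText)); // 'someText' holds the raw document text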
setBoost | public void setBoost(boolean boost)(Code) | | Sets whether to boost terms in query based on "score" or not.
Parameters: boost - true to boost terms in query based on "score", false otherwise. See Also: MoreLikeThis.isBoost |
setFieldNames | public void setFieldNames(String[] fieldNames)(Code) | | Sets the field names that will be used when generating the 'More Like This' query.
Set this to null for the field names to be determined at runtime from the IndexReader
provided in the constructor.
Parameters: fieldNames - the field names that will be used when generating the 'More Like This' query. |
setMaxNumTokensParsed | public void setMaxNumTokensParsed(int i)(Code) | | Parameters: i - The maximum number of tokens to parse in each example doc field that is not stored with TermVector support |
setMaxQueryTerms | public void setMaxQueryTerms(int maxQueryTerms)(Code) | | Sets the maximum number of query terms that will be included in any generated query.
Parameters: maxQueryTerms - the maximum number of query terms that will be included in any generated query. |
setMaxWordLen | public void setMaxWordLen(int maxWordLen)(Code) | | Sets the maximum word length above which words will be ignored.
Parameters: maxWordLen - the maximum word length above which words will be ignored. |
setMinDocFreq | public void setMinDocFreq(int minDocFreq)(Code) | | Sets the frequency at which words will be ignored which do not occur in at least this
many docs.
Parameters: minDocFreq - the frequency at which words will be ignored which do not occur in at least this many docs. |
setMinTermFreq | public void setMinTermFreq(int minTermFreq)(Code) | | Sets the frequency below which terms will be ignored in the source doc.
Parameters: minTermFreq - the frequency below which terms will be ignored in the source doc. |
setMinWordLen | public void setMinWordLen(int minWordLen)(Code) | | Sets the minimum word length below which words will be ignored.
Parameters: minWordLen - the minimum word length below which words will be ignored. |
setStopWords | public void setStopWords(Set stopWords)(Code) | | Set the set of stopwords.
Any word in this set is considered "uninteresting" and ignored.
Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as
for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
Parameters: stopWords - set of stopwords; if null it means to allow stop words. See Also: org.apache.lucene.analysis.StopFilter.makeStopSet See Also: MoreLikeThis.getStopWords |
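For example, a stop set might be built with the StopFilter helper referenced above and handed to MoreLikeThis; the word list here is purely illustrative.
import org.apache.lucene.analysis.StopFilter;

// build a stop set from an illustrative word list and pass it to MoreLikeThis
String[] uninteresting = { "the", "a", "an", "and", "or", "of", "to" };
mlt.setStopWords(StopFilter.makeStopSet(uninteresting));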