Java Doc for MemoryIndex.java (lucene-connector, package org.apache.lucene.index.memory)

java.lang.Object
   org.apache.lucene.index.memory.MemoryIndex

MemoryIndex
public class MemoryIndex
High-performance single-document main memory Apache Lucene fulltext search index.

Overview

This class is a replacement/substitute for a large subset of org.apache.lucene.store.RAMDirectory functionality. It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as Nux XQuery-based XML message queues, publish-subscribe systems for blogs/newsfeeds, text chat, data acquisition and distribution systems, application-level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). For example, as in:
 float score = search(String text, Query query)
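
The signature above is shorthand for the idea, not an actual method of this class. As a rough, Lucene-free illustration of prospective search (many standing queries checked against each small incoming text), consider the toy sketch below. Everything in it is an assumption made for illustration: the class name ProspectiveSearchSketch, the route helper, and the term-overlap scoring are hypothetical and unrelated to how Lucene scores documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for prospective search: many standing queries, one small transient text. */
final class ProspectiveSearchSketch {

    /** Hypothetical scoring: fraction of the query's terms found in the text, in [0.0, 1.0]. */
    static float search(String text, Set<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0f;
        }
        Set<String> textTerms =
                new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        int hits = 0;
        for (String term : queryTerms) {
            if (textTerms.contains(term)) {
                hits++;
            }
        }
        return (float) hits / queryTerms.size();
    }

    /** Returns the names of all standing queries that match the incoming message. */
    static List<String> route(String message, Map<String, Set<String>> standingQueries) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : standingQueries.entrySet()) {
            if (search(message, entry.getValue()) > 0.0f) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> queries = new LinkedHashMap<>();
        queries.put("anglers", new HashSet<>(Arrays.asList("salmon", "fishing")));
        queries.put("cooks", new HashSet<>(Arrays.asList("recipe")));
        System.out.println(route("Alaska fishing manuals", queries));
    }
}
```

With a real MemoryIndex, the per-message work would presumably be one round of addField calls followed by one search(Query) call per standing query.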
 

Each instance can hold at most one Lucene "document", with a document containing zero or more "fields", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on addField(), according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro.
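
The analysis steps described above (splitting, lower-casing, stop word removal, stemming) can be sketched without Lucene as follows. This is a toy model under stated assumptions, not a Lucene Analyzer: the class ToyAnalyzer, its three-word stop list, and its deliberately naive "strip a trailing ing" stemmer are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: whitespace split, lower-casing, stop word removal, naive stemming. */
final class ToyAnalyzer {

    // Hypothetical stop list, taken from the examples in the text above.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("he", "in", "and"));

    /** Turns a fulltext value into zero or more index terms. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue; // drop empty fragments and stop words
            }
            terms.add(stem(term));
        }
        return terms;
    }

    /** Deliberately naive stemmer: only strips a trailing "ing" from longer words. */
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He went fishing in Alaska and Canada"));
    }
}
```

A query for "fishing" would only match the indexed term "fish" if the query text goes through the same analysis, which is why the same analyzer is normally used for indexing and for query parsing.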

Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.

For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.

Example Usage

 Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
 //Analyzer analyzer = new SimpleAnalyzer();
 MemoryIndex index = new MemoryIndex();
 index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
 index.addField("author", "Tales of James", analyzer);
 QueryParser parser = new QueryParser("content", analyzer);
 float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
 if (score > 0.0f) {
     System.out.println("it's a match");
 } else {
     System.out.println("no match found");
 }
 System.out.println("indexData=" + index.toString());
 

Example XQuery Usage

 (: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
 declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
 declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
 for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
 let $score := lucene:match($book/abstract, $query)
 order by $score descending
 return $book
 

No thread safety guarantees

An instance can be queried multiple times with the same or different queries, but an instance is not thread-safe. If desired, use idioms such as:
 MemoryIndex index = ...
 synchronized (index) {
 // read and/or write index (i.e. add fields and/or query)
 } 
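
One way to apply this idiom consistently is to hide the unsafe object behind a small wrapper so that every read and write takes the same lock. The sketch below is Lucene-free and hypothetical: GuardedIndexSketch uses a plain HashMap as a stand-in for the non-thread-safe index, and the demo method only exists to exercise it from several threads.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy wrapper that serializes all access to a non-thread-safe structure. */
final class GuardedIndexSketch {

    // Hypothetical stand-in for a non-thread-safe index: term -> frequency.
    private final Map<String, Integer> index = new HashMap<>();

    /** Write path: always holds the lock on the guarded object. */
    void addTerm(String term) {
        synchronized (index) {
            index.merge(term, 1, Integer::sum);
        }
    }

    /** Read path: takes the same lock as the write path. */
    int frequency(String term) {
        synchronized (index) {
            return index.getOrDefault(term, 0);
        }
    }

    /** Adds the same term from several threads and returns the final frequency. */
    static int demo(int threads, int addsPerThread) {
        GuardedIndexSketch guarded = new GuardedIndexSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < addsPerThread; j++) {
                    guarded.addTerm("salmon");
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return guarded.frequency("salmon");
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000));
    }
}
```

Because a MemoryIndex holds a single small document, giving each thread its own instance may be a simpler alternative to locking.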
 

Performance Notes

Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework.

This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory. Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N log N) in the worst case. Memory consumption is probably larger than for RAMDirectory.
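
As an illustration of how bounds of this shape can arise, the toy structure below records token positions in a hash map (expected O(N) for N tokens) and sorts the distinct terms only when an ordered view is first requested (O(N log N) if every token is distinct). This is an assumption about a plausible design, not a description of the actual internals; ToySingleFieldIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-field index: hash-based term lookup plus a lazily sorted term list. */
final class ToySingleFieldIndex {

    // term -> positions at which the term occurs in the token sequence
    private final Map<String, List<Integer>> positions = new HashMap<>();

    // Sorted distinct terms, built on first use.
    private String[] sortedTerms;

    /** Indexing pass: one hash map update per token, expected O(N) overall. */
    ToySingleFieldIndex(List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            positions.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
    }

    /** Returns the positions of the given term, or an empty list if it is absent. */
    List<Integer> positionsOf(String term) {
        return positions.getOrDefault(term, Collections.emptyList());
    }

    /** Ordered view of the distinct terms; the sort costs O(T log T) for T distinct terms. */
    String[] sortedTerms() {
        if (sortedTerms == null) {
            sortedTerms = positions.keySet().toArray(new String[0]);
            Arrays.sort(sortedTerms);
        }
        return sortedTerms.clone();
    }

    public static void main(String[] args) {
        ToySingleFieldIndex index =
                new ToySingleFieldIndex(Arrays.asList("salmon", "fish", "salmon"));
        System.out.println(index.positionsOf("salmon"));
        System.out.println(Arrays.toString(index.sortedTerms()));
    }
}
```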

Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.

If you're curious about the whereabouts of bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
author:
   whoschek.AT.lbl.DOT.gov




Constructor Summary
public  MemoryIndex()
     Constructs an empty instance.

Method Summary
public void addField(String fieldName, String text, Analyzer analyzer)
     Convenience method; tokenizes the given field text and adds the resulting terms to the index.
public void addField(String fieldName, TokenStream stream)
     Equivalent to addField(fieldName, stream, 1.0f).
public void addField(String fieldName, TokenStream stream, float boost)
     Iterates over the given token stream and adds the resulting terms to the index; equivalent to adding a tokenized, indexed, termVectorStored, unstored Lucene org.apache.lucene.document.Field. Finally closes the token stream.
public IndexSearcher createSearcher()
     Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.
public int getMemorySize()
     Returns a reasonable approximation of the main memory [bytes] consumed by this instance.
public TokenStream keywordTokenStream(Collection keywords)
     Convenience method; creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis.
public float search(Query query)
     Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.
public String toString()
     Returns a String representation of the index data for debugging purposes.


Constructor Detail
MemoryIndex
public MemoryIndex()
Constructs an empty instance.




Method Detail
addField
public void addField(String fieldName, String text, Analyzer analyzer)
Convenience method; tokenizes the given field text and adds the resulting terms to the index. Equivalent to adding an indexed non-keyword Lucene org.apache.lucene.document.Field that is tokenized (org.apache.lucene.document.Field.Index.TOKENIZED), not stored (org.apache.lucene.document.Field.Store.NO), and termVectorStored with positions (org.apache.lucene.document.Field.TermVector.WITH_POSITIONS) or with positions and offsets (org.apache.lucene.document.Field.TermVector.WITH_POSITIONS_OFFSETS).
Parameters:
  fieldName - a name to be associated with the text
  text - the text to tokenize and index
  analyzer - the analyzer to use for tokenization



addField
public void addField(String fieldName, TokenStream stream)
Equivalent to addField(fieldName, stream, 1.0f).
Parameters:
  fieldName - a name to be associated with the text
  stream - the token stream to retrieve tokens from



addField
public void addField(String fieldName, TokenStream stream, float boost)
Iterates over the given token stream and adds the resulting terms to the index; equivalent to adding a tokenized, indexed, termVectorStored, unstored Lucene org.apache.lucene.document.Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via MemoryIndex.keywordTokenStream(Collection), the Lucene contrib KeywordTokenizer or similar utilities.
Parameters:
  fieldName - a name to be associated with the text
  stream - the token stream to retrieve tokens from
  boost - the boost factor for hits for this field
See Also:   Field.setBoost(float)



createSearcher
public IndexSearcher createSearcher()
Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.
Returns:
  a searcher



getMemorySize
public int getMemorySize()
Returns a reasonable approximation of the main memory [bytes] consumed by this instance. Useful for smart memory-sensitive caches/pools. Assumes fieldNames are interned, whereas tokenized terms are memory-overlaid.
Returns:
  the main memory consumption
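
To make the "memory-sensitive cache" use case concrete, here is a minimal sketch of an LRU cache bounded by an estimated byte total rather than by entry count. SizeBoundedCache is a hypothetical class, not part of Lucene; the size estimator is passed in as a function, and for cached indexes it would presumably be MemoryIndex::getMemorySize. The sketch assumes that an entry's estimated size does not change after it has been cached.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Toy LRU cache that evicts least recently used entries once a byte budget is exceeded. */
final class SizeBoundedCache<K, V> {

    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final ToIntFunction<V> sizeOf; // estimated size of a value in bytes
    private final long maxBytes;
    private long usedBytes;

    SizeBoundedCache(long maxBytes, ToIntFunction<V> sizeOf) {
        this.maxBytes = maxBytes;
        this.sizeOf = sizeOf;
    }

    /** Adds or replaces an entry, then evicts old entries until the budget is met. */
    void put(K key, V value) {
        V old = entries.put(key, value);
        if (old != null) {
            usedBytes -= sizeOf.applyAsInt(old);
        }
        usedBytes += sizeOf.applyAsInt(value);
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        // The entry just written is the most recent one, so it is never evicted here.
        while (usedBytes > maxBytes && entries.size() > 1) {
            Map.Entry<K, V> eldest = it.next();
            usedBytes -= sizeOf.applyAsInt(eldest.getValue());
            it.remove();
        }
    }

    /** Returns the cached value (marking it as recently used), or null if absent. */
    V get(K key) {
        return entries.get(key);
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

In the usage below, strings stand in for cached indexes and String::length stands in for the size estimate.

```java
SizeBoundedCache<String, String> cache = new SizeBoundedCache<>(10, String::length);
cache.put("a", "12345");
cache.put("b", "12345");
cache.put("c", "123"); // total would be 13 > 10, so the eldest entry "a" is evicted
```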



keywordTokenStream
public TokenStream keywordTokenStream(Collection keywords)
Convenience method; creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis. The resulting token stream can be fed into MemoryIndex.addField(String,TokenStream), perhaps wrapped into another org.apache.lucene.analysis.TokenFilter, as desired.
Parameters:
  keywords - the keywords to generate tokens for
Returns:
  the corresponding token stream



search
public float search(Query query)
Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.
Parameters:
  query - an arbitrary Lucene query to run against this index
Returns:
  the relevance score of the matchmaking; a number in the range [0.0 .. 1.0], with 0.0 indicating no match. The higher the number, the better the match.
See Also:   org.apache.lucene.queryParser.QueryParser.parse(String)



toString
public String toString()
Returns a String representation of the index data for debugging purposes.
Returns:
  the string representation



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
