org.apache.lucene.benchmark.byTask
Benchmarking Lucene By Tasks

This package provides "task based" performance benchmarking of Lucene. One can use the predefined benchmarks, or create new ones.

Contained packages:

Package        Description
stats          Statistics maintained when running benchmark tasks.
tasks          Benchmark tasks.
feeds          Sources for benchmark inputs: documents and queries.
utils          Utilities used for the benchmark and for the reports.
programmatic   Sample performance test written programmatically.

Table Of Contents

  1. Benchmarking By Tasks
  2. How to use
  3. Benchmark "algorithm"
  4. Supported tasks/commands
  5. Benchmark properties
  6. Example input algorithm and the result benchmark report.
  7. Results record counting clarified

Benchmarking By Tasks

Benchmark Lucene using task primitives.

A benchmark is composed of predefined tasks that allow creating an index, adding documents, optimizing, searching, generating reports, and more. A benchmark run takes an "algorithm" file that describes the sequence of tasks making up the run, plus some properties that define additional characteristics of the run.

How to use

The easiest way to run a benchmark is to use the predefined Ant task:

  • ant run-task
    - would run the micro-standard.alg "algorithm".
  • ant run-task -Dtask.alg=conf/compound-penalty.alg
    - would run the compound-penalty.alg "algorithm".
  • ant run-task -Dtask.alg=[full-path-to-your-alg-file]
    - would run your perf test "algorithm".
  • java org.apache.lucene.benchmark.byTask.programmatic.Sample
    - would run a performance test programmatically, without using an alg file. This is less readable and less convenient, but possible.

You may find the existing tasks sufficient for defining the benchmark you need; otherwise, you can extend the framework to meet your needs, as explained here.

Each benchmark run has a DocMaker and a QueryMaker. These two should usually match, so that "meaningful" queries are used for a certain collection. Properties set at the header of the alg file define which "makers" should be used. You can also specify your own makers, implementing the DocMaker and QueryMaker interfaces.

The benchmark .alg file contains the benchmark "algorithm". The syntax is described below. Within the algorithm, you can specify groups of commands, assign them names, specify commands that should be repeated, run commands serially or in parallel, and also control the rate of "firing" the commands.

This allows you, for instance, to specify that an index should be opened for update, documents should be added to it one by one but no faster than 20 docs a minute, and, in parallel with this, some N queries should be searched against that index, again at no more than 2 queries a second. You can have the searches all share an index reader, or have each of them open its own reader and close it afterwards.

If the commands available for use in the algorithm do not meet your needs, you can add commands by adding a new task under org.apache.lucene.benchmark.byTask.tasks - extend the PerfTask abstract class and make sure that your new task class name is suffixed by Task. Assume you added the class "WonderfulTask"; doing so also enables the command "Wonderful" to be used in the algorithm (see the sketch below).
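
Here is a minimal sketch of such a task, assuming the PerfTask API of this benchmark version (a constructor taking PerfRunData and a doLogic() method that returns the number of records to count); check PerfTask in your own source tree for the exact signatures:

package org.apache.lucene.benchmark.byTask.tasks;

import org.apache.lucene.benchmark.byTask.PerfRunData;

// Enables the command "Wonderful" in .alg files (the class name minus the Task suffix).
public class WonderfulTask extends PerfTask {

  public WonderfulTask(PerfRunData runData) {
    super(runData);
  }

  // doLogic() performs the work being measured and returns the number of
  // records to count for this run (see "Results record counting" below).
  public int doLogic() throws Exception {
    // ... the custom benchmark work goes here ...
    return 1;
  }
}

Once the compiled class is on the classpath (or passed via benchmark.ext.classpath, as described below), Wonderful can appear in an algorithm like any other command.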

External classes: It is sometimes useful to invoke the benchmark package with your own external alg file that configures the use of your own doc/query maker and/or HTML parser. You can do this without modifying the benchmark package code, by passing your classpath with the benchmark.ext.classpath property:

  • ant run-task -Dtask.alg=[full-path-to-your-alg-file] -Dbenchmark.ext.classpath=/mydir/classes -Dtask.mem=512M

Benchmark "algorithm"

The following is an informal description of the supported syntax.

  1. Measuring: When a command is executed, statistics for the elapsed execution time and memory consumption are collected. At any time, those statistics can be printed, using one of the available ReportTasks.
  2. Comments start with '#'.
  3. Serial sequences are enclosed within '{ }'.
  4. Parallel sequences are enclosed within '[ ]'.
  5. Sequence naming: To name a sequence, put '"name"' just after '{' or '['.
    Example - { "ManyAdds" AddDoc } : 1000000 - would name the sequence of 1M add docs "ManyAdds", and this name would later appear in statistic reports. If you don't specify a name for a sequence, it is given one: you can see it as the algorithm is printed just before benchmark execution starts.
  6. Repeating: To repeat sequence tasks N times, add ': N' just after the sequence closing tag - '}' or ']' or '>'.
    Example - [ AddDoc ] : 4 - would do 4 addDoc in parallel, spawning 4 threads at once.
    Example - [ AddDoc AddDoc ] : 4 - would do 8 addDoc in parallel, spawning 8 threads at once.
    Example - { AddDoc } : 30 - would do addDoc 30 times in a row.
    Example - { AddDoc AddDoc } : 30 - would do addDoc 60 times in a row.
    Exhaustive repeating: use * instead of a number to repeat exhaustively. This is sometimes useful for adding as many files as a doc maker can create, without iterating over the same file again, especially when the exact number of documents is not known in advance - for instance, TREC files extracted from a zip file. Note: when using this, you must also set doc.maker.forever to false.
    Example - { AddDoc } : * - would add docs until the doc maker is "exhausted".
  7. Command parameter: a command can optionally take a single parameter. If a command does not support a parameter, or if the parameter is of the wrong type, reading the algorithm fails with an exception and the test does not start. Currently the following tasks take optional parameters:
    • AddDoc takes a numeric parameter, indicating the required size of the added document. Note: if the DocMaker implementation used in the test does not support makeDoc(size), an exception is thrown and the test fails.
    • DeleteDoc takes a numeric parameter, indicating the docid to be deleted. This is not very useful for loops, since the docid is fixed, so for deletion in loops it is better to use the doc.delete.step property.
    • SetProp takes a mandatory name,value parameter, with ',' used as the separator.
    • SearchTravRetTask and SearchTravTask take a numeric parameter, indicating the required traversal size.
    • SearchTravRetLoadFieldSelectorTask takes a string parameter: a comma separated list of Fields to load.

    Example - AddDoc(2000) - would add a document of size 2000 (~bytes).
    See conf/task-sample.alg for how this can be used, for instance, to check which is faster: adding many smaller documents, or fewer larger documents. Next candidates for supporting a parameter may be the Search tasks, for controlling the query size.
  8. Statistic recording elimination: a sequence can also end with '>', in which case child tasks would not store their statistics. This can be useful to avoid an explosion of stats data when adding, say, 1M docs.
    Example - { "ManyAdds" AddDoc > : 1000000 - would add million docs, measure that total, but not save stats for each addDoc.
    Notice that the granularity of System.currentTimeMillis() (which is used here) is system dependent, and on some systems an operation that takes 5 ms to complete may show 0 ms latency in performance measurements. Therefore it is sometimes more accurate to look at the elapsed time of a larger sequence, as demonstrated here.
  9. Rate: To set a rate (ops/sec or ops/min) for a sequence, add ': N : R' just after the sequence closing tag. This specifies repetition of N with a rate of R operations/sec. Use 'R/sec' or 'R/min' to explicitly specify whether the rate is per second or per minute. The default is per second.
    Example - [ AddDoc ] : 400 : 3 - would do 400 addDoc in parallel, starting up to 3 threads per second.
    Example - { AddDoc } : 100 : 200/min - would do 100 addDoc serially, waiting before starting the next add if the rate would otherwise exceed 200 adds/min.
  10. Command names: Each class "AnyNameTask" in the package org.apache.lucene.benchmark.byTask.tasks that extends PerfTask is supported as the command "AnyName", which can be used in the benchmark "algorithm" description. This allows new commands to be added simply by adding such classes.
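
Putting several of these elements together, here is a rough illustrative sketch (not one of the bundled conf/*.alg files) that adds documents serially at a limited rate while, in parallel, running rate-limited searches, without storing per-AddDoc statistics:

# Illustrative sketch only - combines naming, '>', parallel sequences and rates.
CreateIndex
[
    { "AddBatch"    AddDoc > : 100 : 20/min
    { "SearchBatch" Search } : 50 : 2/sec
]
CloseIndex
RepSumByName

The enclosing '[ ]' runs its two child sequences in parallel; within each child, the '{ ... >' and '{ ... }' sequences run serially at the specified rates.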

Supported tasks/commands

Existing tasks can be divided into a few groups: regular index/search work tasks, report tasks, and control tasks.

  1. Report tasks: There are a few Report commands for generating reports. Only task runs that were completed are reported. (The 'Report tasks' themselves are not measured and not reported.)
    • RepAll - all (completed) task runs.
    • RepSumByName - all statistics, aggregated by name. So, if AddDoc was executed 2000 times, only 1 report line would be created for it, aggregating all those 2000 statistic records.
    • RepSelectByPref   prefixWord - all records for tasks whose names start with prefixWord.
    • RepSumByPref   prefixWord - all records for tasks whose names start with prefixWord, aggregated by their full task name.
    • RepSumByNameRound - all statistics, aggregated by name and by Round. So, if AddDoc was executed 2000 times in each of 3 rounds, 3 report lines would be created for it, aggregating all those 2000 statistic records in each round. See more about rounds in the NewRound command description below.
    • RepSumByPrefRound   prefixWord - similar to RepSumByNameRound, just that only tasks whose name starts with prefixWord are included.
    If needed, additional reports can be added by extending the abstract class ReportTask, and by manipulating the statistics data in Points and TaskStats.
  2. Control tasks: A few of the tasks control the benchmark algorithm overall:
    • ClearStats - clears all statistics. Subsequent reports only include task runs that start after this call.
    • NewRound - virtually starts a new round of the performance test. Although this command can be placed anywhere, it mostly makes sense at the end of an outermost sequence.
      This increments a global "round counter". All task runs that start from now on record the new, updated round counter as their round number, and this appears in reports; in particular, see RepSumByNameRound above.
      An additional effect of NewRound is that numeric and boolean properties defined (at the head of the .alg file) as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, cyclically advance to the next value. Note: this is also reflected in the reports, in this case under a column named "mrg".
    • ResetInputs - the DocMaker and the various QueryMakers reset their counters to the start. The way these Maker interfaces work, each call to makeDocument() or makeQuery() creates the next document or query that it "knows" how to create. If that pool is "exhausted", the "maker" starts over again. The ResetInputs command therefore makes the rounds comparable, so it is useful to invoke ResetInputs together with NewRound.
    • ResetSystemErase - resets all index and input data and calls gc. Does NOT reset statistics. This includes ResetInputs. All writers/readers are nullified, deleted, and closed. The index is erased and the directory is erased. You have to call CreateIndex again after this task.
    • ResetSystemSoft - resets all index and input data and calls gc. Does NOT reset statistics. This includes ResetInputs. All writers/readers are nullified and closed. The index is NOT erased and the directory is NOT erased. This is useful for testing performance on an existing index, for instance if the construction of a large index took a very long time and you now want to test its search or update performance.
  3. Other existing tasks are quite straightforward and are only briefly described here.
    • CreateIndex and OpenIndex both leave the index open for later update operations. CloseIndex would close it.
    • OpenReader, similarly, would leave an index reader open for later search operations. But this has further semantics: if a Read operation is performed and an open reader exists, it is used. Otherwise, the read operation opens its own reader and closes it when the read operation is done. This allows testing various scenarios - sharing a reader, searching with a "cold" reader, with a "warmed" reader, etc. The read operations affected by this are: Warm, Search, SearchTrav (search and traverse), and SearchTravRet (search, traverse and retrieve). Notice that each of the 3 search task types maintains its own queryMaker instance. See the sketch below for one way to compare shared and per-search readers.
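
The following rough sketch (not a bundled conf file; it assumes an index already exists in the work directory) contrasts searches that share a pre-opened reader with searches that each open their own "cold" reader, using ResetSystemSoft to close the shared reader without erasing the index:

# Illustrative sketch only - assumes an existing index.
OpenReader
{ "SearchSharedReader" Search } : 100
# ResetSystemSoft closes the shared reader (and resets inputs) but keeps the index.
ResetSystemSoft
{ "SearchOwnReader" Search } : 100
RepSumByName

In the first block all 100 searches use the reader opened by OpenReader; after ResetSystemSoft, each Search opens and closes its own reader.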

Benchmark properties

Properties are read from the header of the .alg file and define several parameters of the performance test. As mentioned above for the NewRound task, numeric and boolean properties that are defined as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, cyclically advance to the next value when NewRound is called, and also appear as a named column in the reports (the column name would be "mrg" in this example).

Some of the currently defined properties are:

  1. analyzer - full class name of the analyzer to use. The same analyzer is used for the entire test.
  2. directory - which directory implementation to use for the performance test; valid values are FSDirectory and RAMDirectory.
  3. Index work parameters: Multi int/boolean values are iterated with calls to NewRound. They are also added as columns in the reports; the first string in the sequence is the column name. (Make sure it is no shorter than any value in the sequence.)
    • max.buffered
      Example: max.buffered=buf:10:10:100:100 - this would define using maxBufferedDocs of 10 in iterations 0 and 1, and 100 in iterations 2 and 3.
    • merge.factor - which merge factor to use.
    • compound - whether the index is using the compound format or not. Valid values are "true" and "false".

Here is a list of currently defined properties:

  1. Root directory for data and indexes:
    • work.dir (default is System property "benchmark.work.dir" or "work".)
  2. Docs and queries creation:
    • analyzer
    • doc.maker
    • doc.maker.forever
    • html.parser
    • doc.stored
    • doc.tokenized
    • doc.term.vector
    • doc.term.vector.positions
    • doc.term.vector.offsets
    • doc.store.body.bytes
    • docs.dir
    • query.maker
    • file.query.maker.file
    • file.query.maker.default.field
  3. Logging:
    • doc.add.log.step
    • doc.delete.log.step
    • log.queries
    • task.max.depth.log
    • doc.tokenize.log.step
  4. Index writing:
    • compound
    • merge.factor
    • max.buffered
    • directory
    • ram.flush.mb
    • autocommit
  5. Doc deletion:
    • doc.delete.step

For sample use of these properties see the *.alg files under conf.

Example input algorithm and the result benchmark report

The following example is in conf/sample.alg:

# --------------------------------------------------------
#
# Sample: what is the effect of doc size on indexing time?
#
# There are two parts in this test:
# - PopulateShort adds 2N documents of length  L
# - PopulateLong  adds  N documents of length 2L
# Which one would be faster?
# The comparison is done twice.
#
# --------------------------------------------------------

# -------------------------------------------------------------------------------------
# multi val params are iterated by NewRound's, added to reports, start with column name.
merge.factor=mrg:10:20
max.buffered=buf:100:1000
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=500

docs.dir=reuters-out

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------
{

    { "PopulateShort"
        CreateIndex
        { AddDoc(4000) > : 20000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    { "PopulateLong"
        CreateIndex
        { AddDoc(8000) > : 10000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    NewRound

} : 2

RepSumByName
RepSelectByPref Populate

The command line for running this sample:
ant run-task -Dtask.alg=conf/sample.alg

The output report from running this test contains the following:

Operation     round mrg  buf   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
PopulateShort     0  10  100        1        20003        119.6      167.26    12,959,120     14,241,792
PopulateLong -  - 0  10  100 -  -   1 -  -   10003 -  -  - 74.3 -  - 134.57 -  17,085,208 -   20,635,648
PopulateShort     1  20 1000        1        20003        143.5      139.39    63,982,040     94,756,864
PopulateLong -  - 1  20 1000 -  -   1 -  -   10003 -  -  - 77.0 -  - 129.92 -  87,309,608 -  100,831,232

Results record counting clarified

Two columns in the results table indicate record counts: records-per-run and records-per-second. What do these mean?

Almost every task gets 1 in this count just for being executed. Task sequences aggregate the counts of their child tasks, plus their own count of 1. So, a task sequence containing 5 other task sequences, each running a single other task 10 times, would have a count of 1 + 5 * (1 + 10) = 56.

The traverse and retrieve tasks "count" more: a traverse task would add 1 for each traversed result (hit), and a retrieve task would additionally add 1 for each retrieved doc. So, regular Search would count 1, SearchTrav that traverses 10 hits would count 11, and a SearchTravRet task that retrieves (and traverses) 10, would count 21.

Confusing? This might help: always examine the elapsedSec column, and always compare "apples to apples", i.e. it is interesting to check how rec/s changed for the same task (or sequence) between two different runs, but it is not very useful to know how rec/s differs between the Search and SearchTrav tasks. For the latter, elapsedSec brings more insight.

 
Java Source Files

Benchmark.java (Class) - Run the benchmark algorithm.
    Usage: java Benchmark algorithm-file
      1. Read algorithm.
      2. Run the algorithm.
    Things to be added/fixed in "Benchmarking by tasks":
      1. TODO - report into Excel and/or graphed view.
      2. TODO - perf comparison between Lucene releases over the years.
      3. TODO - perf report adequate to include in the Lucene nightly build site? (so we can easily track performance changes.)
      4. TODO - add overall time control for repeated execution (vs. ...
PerfRunData.java (Class) - Data maintained by a performance test run.
TestPerfTasksLogic.java (Class) - Test very simply that perf tasks - simple algorithms - are doing what they should.
TestPerfTasksParse.java (Class) - Test very simply that perf tasks are parsed as expected.