Source Code Cross Referenced for Frontier.java in Web-Crawler » heritrix » org » archive » crawler » framework

001:        /* Frontier
002:         *
003:         * $Id: Frontier.java 5045 2007-04-10 01:37:23Z gojomo $
004:         *
005:         * Created on Mar 29, 2004
006:         *
007:         * Copyright (C) 2004 Internet Archive.
008:         *
009:         * This file is part of the Heritrix web crawler (crawler.archive.org).
010:         *
011:         * Heritrix is free software; you can redistribute it and/or modify
012:         * it under the terms of the GNU Lesser Public License as published by
013:         * the Free Software Foundation; either version 2.1 of the License, or
014:         * any later version.
015:         *
016:         * Heritrix is distributed in the hope that it will be useful,
017:         * but WITHOUT ANY WARRANTY; without even the implied warranty of
018:         * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
019:         * GNU Lesser Public License for more details.
020:         *
021:         * You should have received a copy of the GNU Lesser Public License
022:         * along with Heritrix; if not, write to the Free Software
023:         * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
024:         */
025:        package org.archive.crawler.framework;
026:
027:        import java.io.IOException;
028:        import java.util.ArrayList;
029:
030:        import org.archive.crawler.datamodel.CandidateURI;
031:        import org.archive.crawler.datamodel.CrawlSubstats;
032:        import org.archive.crawler.datamodel.CrawlURI;
033:        import org.archive.crawler.framework.exceptions.EndedException;
034:        import org.archive.crawler.framework.exceptions.FatalConfigurationException;
035:        import org.archive.crawler.framework.exceptions.InvalidFrontierMarkerException;
036:        import org.archive.crawler.frontier.FrontierJournal;
037:        import org.archive.net.UURI;
038:        import org.archive.util.Reporter;
039:
040:        /**
041:         * An interface for URI Frontiers.
042:         *
043:         * <p>A URI Frontier is a pluggable module in Heritrix that maintains the
044:         * internal state of the crawl. This includes (but is not limited to):
045:         * <ul>
046:         *     <li>What URIs have been discovered
047:         *     <li>What URIs are being processed (fetched)
048:         *     <li>What URIs have been processed
049:         *     <li>In what order unprocessed URIs will be processed
050:         * </ul>
051:         *
052:         * <p>The Frontier is also responsible for enforcing any politeness restrictions
053:         * that may have been applied to the crawl, such as limiting simultaneous
054:         * connections to the same host, server or IP number to 1 (or any other fixed
055:         * amount), or imposing delays between connections.
056:         *
057:         * <p>A URIFrontier is created by the
058:         * {@link org.archive.crawler.framework.CrawlController CrawlController} which
059:         * is in turn responsible for providing access to it. Most significant among
060:         * those modules interested in the Frontier are the
061:         * {@link org.archive.crawler.framework.ToeThread ToeThreads} which perform the
062:         * actual work of processing a URI.
063:         *
064:         * <p>The methods defined in this interface are those required to get URIs for
065:         * processing, report the results of processing back (ToeThreads) and to get
066:         * access to various statistical data along the way. The statistical data is
067:         * of interest to {@link org.archive.crawler.framework.StatisticsTracking
068:         * Statistics Tracking} modules. A couple of additional methods are provided
069:         * to be able to inspect and manipulate the Frontier at runtime.
070:         *
071:         * <p>The statistical data exposed by this interface is:
072:         * <ul>
073:         *     <li> {@link #discoveredUriCount() Discovered URIs}
074:         *     <li> {@link #queuedUriCount() Queued URIs}
075:         *     <li> {@link #finishedUriCount() Finished URIs}
076:         *     <li> {@link #succeededFetchCount() Successfully processed URIs}
077:         *     <li> {@link #failedFetchCount() Failed to process URIs}
078:         *     <li> {@link #disregardedUriCount() Disregarded URIs}
079:         *     <li> {@link #totalBytesWritten() Total bytes written}
080:         * </ul>
081:         *
082:         * <p>In addition the frontier may optionally implement an interface that
083:         * exposes information about hosts.
084:         *
085:         * <p>Furthermore any implementation of the URI Frontier should trigger
086:         * {@link org.archive.crawler.event.CrawlURIDispositionListener
087:         * CrawlURIDispositionEvents} by invoking the proper methods on the
088:         * {@link org.archive.crawler.framework.CrawlController CrawlController}.
089:         * Doing this allows a custom built
090:         * {@link org.archive.crawler.framework.StatisticsTracking
091:         * Statistics Tracking} module to gather any other additional data it might be
092:         * interested in by examining the completed URIs.
093:         *
094:         * <p>All URI Frontiers inherit from
095:         * {@link org.archive.crawler.settings.ModuleType ModuleType}
096:         * and therefore creating settings follows the usual pattern of pluggable modules
097:         * in Heritrix.
098:         *
099:         * @author Gordon Mohr
100:         * @author Kristinn Sigurdsson
101:         *
102:         * @see org.archive.crawler.framework.CrawlController
103:         * @see org.archive.crawler.framework.CrawlController#fireCrawledURIDisregardEvent(CrawlURI)
104:         * @see org.archive.crawler.framework.CrawlController#fireCrawledURIFailureEvent(CrawlURI)
105:         * @see org.archive.crawler.framework.CrawlController#fireCrawledURINeedRetryEvent(CrawlURI)
106:         * @see org.archive.crawler.framework.CrawlController#fireCrawledURISuccessfulEvent(CrawlURI)
107:         * @see org.archive.crawler.framework.StatisticsTracking
108:         * @see org.archive.crawler.framework.ToeThread
109:         * @see org.archive.crawler.framework.FrontierHostStatistics
110:         * @see org.archive.crawler.settings.ModuleType
111:         */
112:        public interface Frontier extends Reporter {
113:            /**
114:             * All URI Frontiers should have the same 'name' attribute. This constant
115:             * defines that name. This is a name used to reference the Frontier being
116:             * used in a given crawl order and since there can only be one Frontier
117:             * per crawl order a fixed, unique name for Frontiers is optimal.
118:             *
119:             * @see org.archive.crawler.settings.ModuleType#ModuleType(String)
120:             */
121:            public static final String ATTR_NAME = "frontier";
122:
123:            /**
124:             * Initialize the Frontier.
125:             *
126:             * <p> This method is invoked by the CrawlController once it has
127:             * created the Frontier. The constructor of the Frontier should
128:             * only contain code for setting up its settings framework. This
129:             * method should contain all other 'startup' code.
130:             *
131:             * @param c The CrawlController that created the Frontier.
132:             *
133:             * @throws FatalConfigurationException If provided settings are illegal or
134:             *            otherwise unusable.
135:             * @throws IOException If there is a problem reading settings or seeds file
136:             *            from disk.
137:             */
138:            public void initialize(CrawlController c)
139:                    throws FatalConfigurationException, IOException;
140:
141:            /**
142:             * Get the next URI that should be processed. If no URI becomes available
143:             * during the time specified, null will be returned.
144:             *
145:             * @return the next URI that should be processed.
146:             * @throws InterruptedException
147:             * @throws EndedException 
148:             */
149:            CrawlURI next() throws InterruptedException, EndedException;
150:
151:            /**
152:             * Returns true if the frontier contains no more URIs to crawl.
153:             *
154:             * <p>That is to say that there are no more URIs either currently available
155:             * (ready to be emitted), URIs belonging to deferred hosts or pending URIs
156:             * in the Frontier. Thus this method may return false even if there is no
157:             * currently available URI.
158:             *
159:             * @return true if the frontier contains no more URIs to crawl.
160:             */
161:            boolean isEmpty();
162:
163:            /**
164:             * Schedules a CandidateURI.
165:             *
166:             * <p>This method accepts one URI and schedules it immediately. This has
167:             * nothing to do with the priority of the URI being scheduled. Only that
168:             * it will be placed in its respective queue at once. For priority
169:             * scheduling see {@link CandidateURI#setSchedulingDirective(int)}
170:             *
171:             * <p>This method should be synchronized in all implementing classes.
172:             *
173:             * @param caURI The URI to schedule.
174:             *
175:             * @see CandidateURI#setSchedulingDirective(int)
176:             */
177:            public void schedule(CandidateURI caURI);
178:
179:            /**
180:             * Report a URI being processed as having finished processing.
181:             *
182:             * <p>ToeThreads will invoke this method once they have completed work on
183:             * their assigned URI.
184:             *
185:             * <p>This method is synchronized.
186:             *
187:             * @param cURI The URI that has finished processing.
188:             */
189:            public void finished(CrawlURI cURI);
190:
191:            /**
192:             * Number of <i>discovered</i> URIs.
193:             *
194:             * <p>That is any URI that has been confirmed to be within 'scope'
195:             * (i.e. the Frontier decides that it should be processed). This
196:             * includes those that have been processed, are being processed
197:             * and have finished processing. Does not include URIs that have
198:             * been 'forgotten' (deemed out of scope when trying to fetch,
199:             * most likely due to operator changing scope definition).
200:             *
201:             * <p><b>Note:</b> This only counts discovered URIs. Since the same
202:             * URI can (at least in most frontiers) be fetched multiple times, this
203:             * number may be somewhat lower than the combined total of <i>queued</i>,
204:             * <i>in process</i> and <i>finished</i> items due to duplicate
205:             * URIs being queued and processed. This variance is likely to be especially
206:             * high in Frontiers implementing 'revisit' strategies.
207:             *
208:             * @return Number of discovered URIs.
209:             */
210:            public long discoveredUriCount();
211:
212:            /**
213:             * Number of URIs <i>queued</i> up and waiting for processing.
214:             *
215:             * <p>This includes any URIs that failed but will be retried. Basically this
216:             * is any <i>discovered</i> URI that has not either been processed or is
217:             * being processed. The same discovered URI can be queued multiple times.
218:             *
219:             * @return Number of queued URIs.
220:             */
221:            public long queuedUriCount();
222:
223:            public long deepestUri(); // aka longest queue
224:
225:            public long averageDepth(); // aka average queue length
226:
227:            public float congestionRatio(); // multiple of threads needed for max progress
228:
229:            /**
230:             * Number of URIs that have <i>finished</i> processing.
231:             *
232:             * <p>Includes both those that were processed successfully and failed to be
233:             * processed (excluding those that failed but will be retried). Does not
234:             * include those URIs that have been 'forgotten' (deemed out of scope when
235:             * trying to fetch, most likely due to operator changing scope definition).
236:             *
237:             * @return Number of finished URIs.
238:             */
239:            public long finishedUriCount();
240:
241:            /**
242:             * Number of <i>successfully</i> processed URIs.
243:             *
244:             * <p>Any URI that was processed successfully. This includes URIs that
245:             * returned 404s and other error codes that do not originate within the
246:             * crawler.
247:             *
248:             * @return Number of <i>successfully</i> processed URIs.
249:             */
250:            public long succeededFetchCount();
251:
252:            /**
253:             * Number of URIs that <i>failed</i> to process.
254:             *
255:             * <p>URIs that could not be processed because of some error or failure in
256:             * the processing chain. Can include failure to acquire prerequisites, to
257:             * establish a connection with the host and any number of other problems.
258:             * Does not count those that will be retried, only those that have
259:             * permanently failed.
260:             *
261:             * @return Number of URIs that failed to process.
262:             */
263:            public long failedFetchCount();
264:
265:            /**
266:             * Number of URIs that were scheduled at one point but have been
267:             * <i>disregarded</i>.
268:             *
269:             * <p>Counts any URI that is scheduled only to be disregarded
270:             * because it is determined to lie outside the scope of the crawl. Most
271:             * commonly this will be due to robots.txt exclusions.
272:             *
273:             * @return The number of URIs that have been disregarded.
274:             */
275:            public long disregardedUriCount();
276:
277:            /**
278:             * Total number of bytes contained in all URIs that have been processed.
279:             *
280:             * @return The total amounts of bytes in all processed URIs.
281:             * @deprecated misnomer; consult StatisticsTracker instead
282:             */
283:            public long totalBytesWritten();
284:
285:            /**
286:             * Recover earlier state by reading a recovery log.
287:             *
288:             * <p>Some Frontiers are able to write detailed logs that can be loaded
289:             * after a system crash to recover the state of the Frontier prior to the
290:             * crash. This method is the one used to achieve this.
291:             *
292:             * @param pathToLog The name (with full path) of the recover log.
293:             * @param retainFailures If true, failures in log should count as 
294:             * having been included. (If false, failures will be ignored, meaning
295:             * the corresponding URIs will be retried in the recovered crawl.)
296:             * @throws IOException If problems occur reading the recover log.
297:             */
298:            public void importRecoverLog(String pathToLog,
299:                    boolean retainFailures) throws IOException;
300:
301:            /**
302:             * Get a <code>URIFrontierMarker</code> initialized with the given
303:             * regular expression at the 'start' of the Frontier.
304:             * @param regexpr The regular expression that URIs within the frontier must
305:             *                match to be considered within the scope of this marker
306:             * @param inCacheOnly If set to true, only those URIs within the frontier
307:             *                that are stored in cache (usually this means in memory
308:             *                rather than on disk, but that is an implementation
309:             *                detail) will be considered. Others will be entirely
310:             *                ignored, as if they don't exist. This is useful for quick
311:             *                peeks at the top of the URI list.
312:             * @return A URIFrontierMarker that is set for the 'start' of the frontier's
313:             *                URI list.
314:             */
315:            public FrontierMarker getInitialMarker(String regexpr,
316:                    boolean inCacheOnly);
317:
318:            /**
319:             * Returns a list of all uncrawled URIs starting from a specified marker
320:             * until <code>numberOfMatches</code> is reached.
321:             *
322:             * <p>Any encountered URI that has not been successfully crawled, terminally
323:             * failed, disregarded or is currently being processed is included. As
324:             * there may be duplicates in the frontier, there may also be duplicates
325:             * in the report. Thus this includes both discovered and pending URIs.
326:             *
327:             * <p>The list is a set of strings containing the URI strings. If verbose is
328:             * true the string will include some additional information (path to URI
329:             * and parent).
330:             *
331:             * <p>The <code>URIFrontierMarker</code> will be advanced to the position at
332:             * which its maximum number of matches found is reached. Reusing it for
333:             * subsequent calls will thus effectively get the 'next' batch. Making
334:             * any changes to the frontier can invalidate the marker.
335:             *
336:             * <p>While the order returned is consistent, it does <i>not</i> have any
337:             * explicit relation to the likely order in which they may be processed.
338:             *
339:             * <p><b>Warning:</b> It is unsafe to make changes to the frontier while
340:             * this method is executing. The crawler should be in a paused state before
341:             * invoking it.
342:             *
343:             * @param marker
344:             *            A marker specifying from what position in the Frontier the
345:             *            list should begin.
346:             * @param numberOfMatches
347:             *            how many URIs to add at most to the list before returning it
348:             * @param verbose
349:             *            if set to true the strings returned will contain additional
350:             *            information about each URI beyond their names.
351:             * @return a list of all pending URIs falling within the specification
352:             *            of the marker
353:             * @throws InvalidFrontierMarkerException when the
354:             *            <code>URIFrontierMarker</code> does not match the internal
355:             *            state of the frontier. Tolerance for this can vary
356:             *            considerably from one URIFrontier implementation to the next.
357:             * @see FrontierMarker
358:             * @see #getInitialMarker(String, boolean)
359:             */
360:            public ArrayList getURIsList(FrontierMarker marker,
361:                    int numberOfMatches, boolean verbose)
362:                    throws InvalidFrontierMarkerException;
363:
364:            /**
365:             * Delete any URI that matches the given regular expression from the list
366:             * of discovered and pending URIs. This does not prevent them from being
367:             * rediscovered.
368:             *
369:             * <p>Any encountered URI that has not been successfully crawled, terminally
370:             * failed, disregarded or is currently being processed is considered to be
371:             * a pending URI.
372:             *
373:             * <p><b>Warning:</b> It is unsafe to make changes to the frontier while
374:             * this method is executing. The crawler should be in a paused state before
375:             * invoking it.
376:             *
377:             * @param match A regular expression, any URI that matches it will be
378:             *              deleted.
379:             * @return The number of URIs deleted
380:             */
381:            public long deleteURIs(String match);
382:
383:            /**
384:             * Notify Frontier that a CrawlURI has been deleted outside of the
385:             * normal next()/finished() lifecycle. 
386:             * 
387:             * @param curi Deleted CrawlURI.
388:             */
389:            public void deleted(CrawlURI curi);
390:
391:            /**
392:             * Notify Frontier that it should consider the given UURI as if
393:             * already scheduled.
394:             * 
395:             * @param u UURI instance to add to the Already Included set.
396:             */
397:            public void considerIncluded(UURI u);
398:
399:            /**
400:             * Notify Frontier that it should consider updating configuration
401:             * info that may have changed in external files.
402:             */
403:            public void kickUpdate();
404:
405:            /**
406:             * Notify Frontier that it should not release any URIs, instead
407:             * holding all threads, until instructed otherwise. 
408:             */
409:            public void pause();
410:
411:            /**
412:             * Resumes the release of URIs to crawl, allowing worker
413:             * ToeThreads to proceed. 
414:             */
415:            public void unpause();
416:
417:            /**
418:             * Notify Frontier that it should end the crawl, giving
419:             * any worker ToeThread that asks for a next() an 
420:             * EndedException. 
421:             */
422:            public void terminate();
423:
424:            /**
425:             * @return Return the instance of {@link FrontierJournal} that
426:             * this Frontier is using.  May be null if no journaling.
427:             */
428:            public FrontierJournal getFrontierJournal();
429:
430:            /**
431:             * @param cauri CandidateURI for which we're to calculate and
432:             * set class key.
433:             * @return Classkey for <code>cauri</code>.
434:             */
435:            public String getClassKey(CandidateURI cauri);
436:
437:            /**
438:             * Request that the Frontier load (or reload) crawl seeds, 
439:             * typically by contacting the Scope. 
440:             */
441:            public void loadSeeds();
442:
443:            /**
444:             * Request that Frontier allow crawling to begin. Usually
445:             * just unpauses Frontier, if paused. 
446:             */
447:            public void start();
448:
449:            /**
450:             * Get the 'frontier group' (usually queue) for the given 
451:             * CrawlURI. 
452:             * @param curi CrawlURI to find matching group
453:             * @return FrontierGroup for the CrawlURI
454:             */
455:            public FrontierGroup getGroup(CrawlURI curi);
456:
457:            /**
458:             * Generic interface representing the internal groupings 
459:             * of a Frontier's URIs -- usually queues. Currently only 
460:             * offers the HasCrawlSubstats interface. 
461:             */
462:            public interface FrontierGroup extends
463:                    CrawlSubstats.HasCrawlSubstats {
464:
465:            }
466:        }
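
The next()/finished() lifecycle described in the interface above can be illustrated with a small, self-contained sketch. Nothing below is Heritrix code: MiniFrontier, FifoFrontier and WorkerLoopSketch are hypothetical names invented for this illustration, URIs are plain strings rather than CrawlURI/CandidateURI objects, and politeness rules, retries, blocking waits and EndedException are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical, heavily simplified stand-in for the Frontier contract. */
interface MiniFrontier {
    String next();              // next URI to process, or null if none is ready
    boolean isEmpty();          // true when nothing is left to crawl
    void schedule(String uri);  // queue a newly discovered URI
    void finished(String uri);  // report that processing of a URI is done
    long queuedUriCount();
    long finishedUriCount();
}

/** Toy in-memory FIFO implementation; every method is synchronized. */
class FifoFrontier implements MiniFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private long finishedCount = 0;

    @Override public synchronized String next() { return pending.poll(); }
    @Override public synchronized boolean isEmpty() { return pending.isEmpty(); }
    @Override public synchronized void schedule(String uri) { pending.add(uri); }
    @Override public synchronized void finished(String uri) { finishedCount++; }
    @Override public synchronized long queuedUriCount() { return pending.size(); }
    @Override public synchronized long finishedUriCount() { return finishedCount; }
}

class WorkerLoopSketch {
    /** Single-threaded version of the loop a worker thread might run. */
    static List<String> crawl(MiniFrontier frontier) {
        List<String> processed = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String uri = frontier.next();
            if (uri == null) {
                continue; // a real worker would wait rather than spin
            }
            try {
                processed.add(uri); // stand-in for fetching and link extraction
                // Pretend the seed page links to one further page.
                if (uri.equals("http://example.com/")) {
                    frontier.schedule("http://example.com/about");
                }
            } finally {
                // Always report back, even if processing threw an exception.
                frontier.finished(uri);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        FifoFrontier frontier = new FifoFrontier();
        frontier.schedule("http://example.com/");
        List<String> processed = crawl(frontier);
        System.out.println("processed=" + processed
                + " finished=" + frontier.finishedUriCount()
                + " queued=" + frontier.queuedUriCount());
    }
}
```

Calling finished() inside a finally block mirrors the requirement that every URI handed out by next() is eventually reported back, so the frontier's counters stay consistent even when processing fails.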