| org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl
SimpleCocoonCrawlerImpl | public class SimpleCocoonCrawlerImpl extends AbstractLogEnabled implements CocoonCrawler, Configurable, Disposable, Recyclable(Code) | | A simple Cocoon crawler.
author: Bernhard Huber version: CVS $Id: SimpleCocoonCrawlerImpl.java 433543 2006-08-22 06:22:54Z crossley $ |
Inner Class :public static class CocoonCrawlerIterator implements Iterator | |
Method Summary | |
public void | configure(Configuration configuration) Configure the crawler component.
Configuration can specify which URIs to include, and which URIs to exclude
from crawling. |
public void | crawl(URL url) The same as calling crawl(url, -1). |
public void | crawl(URL url, int maxDepth) Start crawling a URL, following links up to maxDepth. |
public void | dispose() Dispose at end of life cycle, releasing all resources. |
public Iterator | iterator() Return an iterator over all links of the currently crawled URL. |
public void | recycle() Recycle this object, releasing resources. |
ACCEPT_CONFIG | final public static String ACCEPT_CONFIG(Code) | | Config element name specifying http header value for accept.
Its value is accept .
|
ACCEPT_DEFAULT | final public static String ACCEPT_DEFAULT(Code) | | Default value of accept configuration option.
Its value is */* .
|
EXCLUDE_CONFIG | final public static String EXCLUDE_CONFIG(Code) | | Config element name specifying excluding regular expression pattern.
Its value is exclude .
|
INCLUDE_CONFIG | final public static String INCLUDE_CONFIG(Code) | | Config element name specifying including regular expression pattern.
Its value is include .
|
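The include and exclude elements hold regular expression patterns that decide whether a URI is crawled. The component's actual matching code is not shown on this page; the following is a minimal sketch of that include/exclude decision using java.util.regex (the class name IncludeExcludeFilter and the "no includes means include everything" default are assumptions for illustration):

```java
import java.util.List;
import java.util.regex.Pattern;

public class IncludeExcludeFilter {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public IncludeExcludeFilter(List<Pattern> includes, List<Pattern> excludes) {
        this.includes = includes;
        this.excludes = excludes;
    }

    /**
     * A URI is accepted for crawling if it matches at least one include
     * pattern (or no include patterns are configured at all) and matches
     * no exclude pattern.
     */
    public boolean accept(String uri) {
        boolean included = includes.isEmpty();
        for (Pattern p : includes) {
            if (p.matcher(uri).matches()) {
                included = true;
                break;
            }
        }
        if (!included) {
            return false;
        }
        for (Pattern p : excludes) {
            if (p.matcher(uri).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

With include pattern .*\.html? and exclude pattern .*\.gif, an HTML page is accepted while an image URI is rejected.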
LINK_CONTENT_TYPE_CONFIG | final public static String LINK_CONTENT_TYPE_CONFIG(Code) | | Config element name specifying the expected link content-type.
Its value is link-content-type .
|
LINK_CONTENT_TYPE_DEFAULT | final public String LINK_CONTENT_TYPE_DEFAULT(Code) | | Default value of link-content-type configuration value.
Its value is application/x-cocoon-links .
|
LINK_VIEW_QUERY_CONFIG | final public static String LINK_VIEW_QUERY_CONFIG(Code) | | Config element name specifying the query-string appended when requesting the links
of a URL.
Its value is link-view-query .
|
LINK_VIEW_QUERY_DEFAULT | final public static String LINK_VIEW_QUERY_DEFAULT(Code) | | Default value of link-view-query configuration option.
Its value is ?cocoon-view=links .
|
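The default link-view query starts with ? (its value is ?cocoon-view=links), so appending it to a URL that already carries a query string needs the leading ? turned into &. The component's own appending code is not shown here; this is a minimal sketch of that rule (the class and method names are assumptions for illustration):

```java
public class LinkViewQuery {
    /** Default value of the link-view-query configuration option. */
    public static final String LINK_VIEW_QUERY_DEFAULT = "?cocoon-view=links";

    /**
     * Append a "?name=value" style query to a URL string, switching the
     * leading '?' to '&' when the URL already has a query string.
     */
    public static String appendLinkViewQuery(String url, String query) {
        if (query == null || query.length() == 0) {
            return url;
        }
        if (url.indexOf('?') >= 0) {
            // URL already has a query string; join with '&' instead of '?'
            return url + "&" + query.substring(1);
        }
        return url + query;
    }
}
```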
USER_AGENT_CONFIG | final public static String USER_AGENT_CONFIG(Code) | | Config element name specifying the HTTP User-Agent header value.
Its value is user-agent .
|
depth | protected int depth(Code) | | |
SimpleCocoonCrawlerImpl | public SimpleCocoonCrawlerImpl()(Code) | | Constructor for the SimpleCocoonCrawlerImpl object
|
configure | public void configure(Configuration configuration) throws ConfigurationException(Code) | | Configure the crawler component.
Configuration can specify which URIs to include, and which URIs to exclude
from crawling. You specify the patterns as regular expressions.
Moreover, you can configure
the required content-type of the crawling request, and the
query-string appended to each crawling request.
<include>.*\.html?</include> or <include>.*\.html?, .*\.xsp</include>
<exclude>.*\.gif</exclude> or <exclude>.*\.gif, .*\.jpe?g</exclude>
<link-content-type> application/x-cocoon-links </link-content-type>
<link-view-query> ?cocoon-view=links </link-view-query>
Parameters: configuration - XML configuration of this Avalon component. exception: ConfigurationException - is thrown if the configuration is invalid. |
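Putting the elements above together, a complete configuration for this component might look like the following sketch. The element names and sample values come from the constants documented on this page; the surrounding component declaration (role and class attributes) and the user-agent value are assumptions for illustration:

```xml
<component role="org.apache.cocoon.components.crawler.CocoonCrawler"
           class="org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl">
  <include>.*\.html?</include>
  <exclude>.*\.gif, .*\.jpe?g</exclude>
  <link-content-type>application/x-cocoon-links</link-content-type>
  <link-view-query>?cocoon-view=links</link-view-query>
  <user-agent>my-crawler/1.0</user-agent>
  <accept>*/*</accept>
</component>
```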
crawl | public void crawl(URL url)(Code) | | The same as calling crawl(url,-1);
Parameters: url - Crawl this URL, getting all links from this URL. |
crawl | public void crawl(URL url, int maxDepth)(Code) | | Start crawling a URL.
Use this method to start crawling.
Get this URL, and all its children, by using iterator() .
The Iterator object will return URL objects.
You may use the crawl() and iterator() methods in the following way:
SimpleCocoonCrawlerImpl scci = ...;
scci.crawl(new URL("http://foo/bar"));
Iterator i = scci.iterator();
while (i.hasNext()) {
    URL url = (URL) i.next();
    ...
}
The i.next() method returns a URL, and calculates the links of the
URL before returning it.
Parameters: url - Crawl this URL, getting all links from this URL. Parameters: maxDepth - maximum depth to crawl to. -1 for no maximum. |
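The maxDepth parameter bounds how many link hops are followed from the start URL, with -1 meaning no limit, as documented above. A minimal sketch of such a depth-bounded breadth-first traversal follows; it walks a stubbed in-memory link map rather than issuing real HTTP requests, and the class and method names are assumptions for illustration, not this component's internals:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BoundedCrawl {
    /**
     * Breadth-first traversal from start, following links up to
     * maxDepth hops; maxDepth == -1 means no depth limit.
     * Returns the URLs in the order they were visited.
     */
    public static List<String> crawl(String start,
                                     Map<String, List<String>> links,
                                     int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>();  // entries: {url, depth}
        queue.add(new String[] { start, "0" });
        seen.add(start);
        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url);
            // Stop expanding children once the depth bound is reached
            if (maxDepth != -1 && depth >= maxDepth) {
                continue;
            }
            for (String child : links.getOrDefault(url, List.of())) {
                if (seen.add(child)) {
                    queue.add(new String[] { child, String.valueOf(depth + 1) });
                }
            }
        }
        return visited;
    }
}
```

With a chain a -> b -> c, a maxDepth of 1 visits only a and b, while -1 visits all three.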
dispose | public void dispose()(Code) | | dispose at end of life cycle, releasing all resources.
|
iterator | public Iterator iterator()(Code) | | Return an iterator over all links of the currently crawled URL.
The Iterator object will return URL objects from its next()
method.
Returns: an Iterator over all links from the crawled URL. |
recycle | public void recycle()(Code) | | Recycle this object, releasing resources.
|