1. Heritrix
License: GNU Library or Lesser General Public License (LGPL)
URL: http://crawler.archive.org/
Description: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since the crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, the name seemed apt.
2. WebSPHINX
License: Apache Software License
URL: http://www-2.cs.cmu.edu/~rcm/websphinx/
Description: WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
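For a flavour of the class library, here is a minimal crawler sketch. It assumes the websphinx.Crawler base class with the visit/shouldVisit hooks described in the WebSPHINX documentation, and a hypothetical start URL; check the library's javadoc before relying on the exact signatures.

```java
import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Minimal WebSPHINX crawler sketch: prints the URL of every page visited,
// staying on one (hypothetical) host. Hook names follow the websphinx
// javadoc; verify against the class library you download.
public class PrintingCrawler extends Crawler {

    // Called once for each page the crawler downloads and parses.
    public void visit(Page page) {
        System.out.println(page.getURL());
    }

    // Called before a link is queued; returning false skips it.
    public boolean shouldVisit(Link link) {
        return link.getURL().getHost().equals("example.org"); // stay on one site
    }

    public static void main(String[] args) throws Exception {
        PrintingCrawler crawler = new PrintingCrawler();
        crawler.setRoot(new Link("http://example.org/")); // hypothetical start URL
        crawler.run(); // crawl until the frontier is empty
    }
}
```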
3. JSpider
License: GNU Library or Lesser General Public License (LGPL)
URL: http://j-spider.sourceforge.net/
Description: JSpider is:
* a highly configurable and customizable Web Spider engine
* developed under the LGPL Open Source license
* written in 100% pure Java
6. WebLech
License: MIT License
URL: http://weblech.sourceforge.net/
Description: WebLech is a fully featured web site download/mirror tool in Java which supports many features required to download websites and emulate standard web-browser behaviour as closely as possible. WebLech is multithreaded and will feature a GUI console.
Similar in some respects to tools such as wget (in recursive retrieval mode), WebSuck or Teleport Pro, WebLech allows you to "spider" a website and recursively download all the pages on it. You can then browse the site offline at your convenience, or even "mirror" the website and re-publish it yourself. Note that WebLech is not suited to downloading single URLs; use wget for that kind of task.
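To make "recursive retrieval" concrete, here is an illustrative stand-alone spider in plain JDK 11+ Java: fetch a page, harvest its links, queue the unseen ones. This is not WebLech's API; the start URL, link-matching regex and page cap are made up for the demo.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a bare-bones recursive retriever showing the
// spidering pattern that tools like WebLech automate.
public class MiniSpider {
    private static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String start = "http://example.org/"; // hypothetical start URL
        HttpClient client = HttpClient.newHttpClient();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);

        while (!queue.isEmpty() && seen.size() <= 50) { // small cap for the demo
            String url = queue.poll();
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + url);

            // Extract absolute links and queue the ones we have not seen.
            Matcher m = HREF.matcher(resp.body());
            while (m.find()) {
                String link = m.group(1);
                // Stay on the starting site, as a mirror tool would.
                if (link.startsWith(start) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
    }
}
```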
7. Arachnid
License: GNU General Public License (GPL)
URL: http://arachnid.sourceforge.net/
Description: Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple web spiders can be created by sub-classing Arachnid and adding a few lines of code that are called after each page of a Web site is parsed. Two example spider applications are included to illustrate how to use the framework; a minimal sketch in the same spirit follows.
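The hook and type names below (handleLink, PageInfo, traverse) are assumptions modelled on the framework's bundled example spiders, not verified signatures; consult the examples in the distribution for the real API.

```java
import java.net.MalformedURLException;

// Sketch of an Arachnid subclass. The framework is used by extending the
// abstract Arachnid class and overriding a per-page callback; the names
// handleLink, PageInfo and traverse are assumptions modelled on the
// bundled example spiders. Check those examples for the real signatures.
public class SimpleSpider extends Arachnid {

    public SimpleSpider(String base) throws MalformedURLException {
        super(base); // base URL of the site to walk
    }

    // Assumed callback, invoked after each page is parsed.
    protected void handleLink(PageInfo page) {
        System.out.println("Visited: " + page.getUrl()); // hypothetical accessor
    }

    public static void main(String[] args) throws Exception {
        new SimpleSpider("http://example.org/").traverse(); // hypothetical entry point
    }
}
```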
8. JoBo
License: GNU Library or Lesser General Public License (LGPL)
URL: http://www.matuschek.net/software/jobo/index.html
Description: JoBo is a simple program for downloading complete websites to your local computer. Internally it is basically a web spider. Its main advantage over other download tools is that it can automatically fill out forms (e.g. for automated login) and use cookies for session handling. Compared to other products the GUI seems very simple, but it is the internal features that matter: few download tools can log in to a web server and download content when that server uses web forms for login and cookies for session handling. JoBo also offers very flexible rules to limit downloads by URL, size and/or MIME type.
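JoBo drives this through its GUI, but the underlying mechanism (POST a login form, keep the session cookie, then fetch protected pages) can be sketched in plain JDK 11+ Java. The login URL, form field names and protected page below are hypothetical; this is not JoBo's API.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: form-based login with cookie session handling,
// the mechanism JoBo automates. All URLs and field names are made up.
public class FormLoginFetch {
    public static void main(String[] args) throws Exception {
        // A CookieManager keeps the session cookie between requests.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        // POST the login form (field names are assumptions).
        HttpRequest login = HttpRequest.newBuilder(URI.create("http://example.org/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("user=alice&pass=secret"))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // Subsequent downloads reuse the session cookie automatically.
        HttpResponse<String> page = client.send(
                HttpRequest.newBuilder(URI.create("http://example.org/private/page.html")).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(page.body());
    }
}
```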