WebSPHINX: A Personal, Customizable Web Crawler
Contents
- About WebSPHINX
- Download
- Examples
- FAQ
- Source Code (latest release v0.5, July 8, 2002; see change history)
- Documentation
- Publications
- Related Links
- Acknowledgements
About WebSPHINX
WebSPHINX (Website-Specific Processors for HTML
INformation eXtraction) is a Java class library and interactive
development environment for web crawlers. A web crawler (also called a
robot or spider) is a program that browses and processes Web pages automatically.
WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX
class library.
Crawler Workbench
The Crawler Workbench is a graphical user interface that lets you configure
and control a customizable web crawler. Using the Crawler Workbench, you
can:
- Visualize a collection of web pages as a graph
- Save pages to your local disk for offline browsing
- Concatenate pages together for viewing or printing them as a single document
- Extract all text matching a certain pattern from a collection of pages
- Develop a custom crawler in Java or JavaScript that processes pages however you want
WebSPHINX class library
The WebSPHINX class library provides support for writing web crawlers
in Java. The class library offers a number of features:
- Multithreaded Web page retrieval in a simple application framework
- An object model that explicitly represents pages and links
- Support for reusable page content classifiers
- Tolerant HTML parsing
- Support for the robot exclusion standard
- Pattern matching, including regular expressions, Unix shell wildcards, and HTML tag expressions. Regular expressions are provided by the Apache jakarta-regexp regular expression library.
- Common HTML transformations, such as concatenating pages, saving pages to disk, and renaming links
Download
First, you need Java 1.2 or later installed on your computer. If you're
not sure, try running java -version. If you need to install Java
on Windows, Linux, or Solaris, go directly to Sun; for other platforms, consult
the list of Java
ports.
If your computer has AFS access, run java -jar /afs/cs.cmu.edu/user/rcm/www/websphinx/websphinx.jar
If you don't have AFS, you'll need to download this JAR file:
websphinx.jar
and then run java -jar websphinx.jar
The Crawler Workbench will appear in a new window.
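The Java 1.2 requirement mentioned above can also be checked from inside a running JVM rather than by reading the output of java -version. The following is only a minimal sketch, written for a current JDK: the class JavaVersionCheck and its method atLeast are hypothetical names and are not part of WebSPHINX. It reads the standard java.version system property and compares the first two numeric components.

```java
// Hypothetical helper for illustration; not part of the WebSPHINX distribution.
final class JavaVersionCheck {

    /**
     * Returns true if a java.version string such as "1.2.2", "1.8.0_292",
     * "11-ea" or "17.0.1" denotes at least the given major.minor version.
     */
    static boolean atLeast(String version, int major, int minor) {
        int[] nums = {0, 0};   // first two numeric components; a missing minor counts as 0
        int found = 0;
        // Split on runs of non-digits so suffixes like "_292" or "-ea" do not break parsing.
        for (String part : version.trim().split("[^0-9]+")) {
            if (part.isEmpty()) {
                continue;
            }
            nums[found++] = Integer.parseInt(part);
            if (found == 2) {
                break;
            }
        }
        if (found == 0) {
            return false;      // no digits at all: treat as unknown, so not satisfied
        }
        return nums[0] > major || (nums[0] == major && nums[1] >= minor);
    }

    public static void main(String[] args) {
        String running = System.getProperty("java.version");
        System.out.println("java.version = " + running);
        System.out.println("Java 1.2 or later: " + atLeast(running, 1, 2));
    }
}
```

Comparing parsed numbers rather than raw strings matters here because newer runtimes report versions such as "17.0.1", which a plain string comparison against "1.2" would rank incorrectly.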
Examples
Here are some things to try in the Workbench.
- Visualize part of the Web as a graph
- This crawler retrieves the WebSPHINX pages
you've been reading and renders them as a graph of pages and links.
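To make "a graph of pages and links" concrete, here is a rough sketch of how such a graph might be built by a breadth-first crawl. It does not use the WebSPHINX class library: the class LinkGraphSketch, the method crawl, and the page names are all hypothetical, and an in-memory map stands in for the Web so that the example runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: names are hypothetical and this is not WebSPHINX's API.
final class LinkGraphSketch {

    /**
     * Breadth-first crawl over a pretend Web (page name -> outgoing links).
     * Visits pages up to maxDepth hops from root and returns, in visit order,
     * each visited page together with the links found on it.
     */
    static Map<String, List<String>> crawl(Map<String, List<String>> web,
                                           String root, int maxDepth) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            List<String> links = web.getOrDefault(page, Collections.emptyList());
            graph.put(page, new ArrayList<>(links));   // record the page and its edges
            int d = depth.get(page);
            if (d == maxDepth) {
                continue;                              // links are recorded but not followed
            }
            for (String target : links) {
                if (!depth.containsKey(target)) {      // enqueue each page at most once
                    depth.put(target, d + 1);
                    queue.add(target);
                }
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new LinkedHashMap<>();
        web.put("index.html", List.of("download.html", "faq.html"));
        web.put("download.html", List.of("index.html"));
        web.put("faq.html", List.of("examples.html"));
        web.put("examples.html", List.of());
        crawl(web, "index.html", 1).forEach(
            (page, links) -> System.out.println(page + " -> " + links));
    }
}
```

The depth limit plays the role of a crawl boundary: pages at the limit still appear as nodes with their outgoing links, but those links are not followed.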