WebSPHINX: A Personal, Customizable Web Crawler

Contents
About WebSPHINX
Download
Examples
FAQ
Source Code (latest release v0.5, July 8, 2002; see change history)
Documentation
Publications
Related Links
Acknowledgements
About WebSPHINX
WebSPHINX (Website-Specific Processors for HTML
INformation eXtraction) is a Java class library and interactive
development environment for web crawlers. A web crawler (also called a
robot or spider) is a program that browses and processes Web pages automatically.
WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX
class library.
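To make the idea of "browsing and processing pages automatically" concrete, here is a minimal sketch of a breadth-first crawl loop. It does not use the WebSPHINX API: the class name TinyCrawler, the method crawl, and the in-memory map standing in for the Web are all assumptions made for illustration. A real crawler would fetch pages over HTTP and parse links out of the HTML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the WebSPHINX class library.
class TinyCrawler {

    /**
     * Visits pages breadth-first, starting at root.
     * The "web" is an in-memory map from a page URL to the URLs it links to,
     * which stands in for real HTTP retrieval and HTML parsing.
     * Returns the URLs of the pages that were processed, in visit order.
     */
    static List<String> crawl(Map<String, List<String>> web, String root, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();       // URLs already queued, to avoid revisits
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(root);
        seen.add(root);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            List<String> links = web.get(url);
            if (links == null) {
                continue;                         // page could not be "retrieved"; skip it
            }
            visited.add(url);                     // "process" the page
            for (String link : links) {
                if (seen.add(link)) {             // add() returns false if already seen
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The maxPages limit is one simple way to keep a crawl bounded; a page-depth limit or a URL filter would serve the same purpose.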
Crawler Workbench
The Crawler Workbench is a graphical user interface that lets you configure
and control a customizable web crawler. Using the Crawler Workbench, you
can:
Visualize a collection of web pages as a graph
Save pages to your local disk for offline browsing
Concatenate pages together for viewing or printing them as a single document
Extract all text matching a certain pattern from a collection of pages
Develop a custom crawler in Java or JavaScript that processes pages however you want
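As a rough illustration of what "extract all text matching a certain pattern" means, the sketch below collects every regular-expression match from a list of page texts. It uses the standard java.util.regex package rather than the Workbench itself; the class name PatternExtractor and the method extractAll are hypothetical, and the pages are plain strings instead of retrieved documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the Crawler Workbench does this through its GUI.
class PatternExtractor {

    /**
     * Returns every substring matching regex, scanning the pages in order.
     * Each page is given as its text content.
     */
    static List<String> extractAll(List<String> pages, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String page : pages) {
            Matcher matcher = pattern.matcher(page);
            while (matcher.find()) {              // advance to each successive match
                hits.add(matcher.group());
            }
        }
        return hits;
    }
}
```

For example, calling extractAll with a pattern such as "[\\w.]+@[\\w.]+" would gather anything in the pages that looks roughly like an e-mail address.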
WebSPHINX class library
The WebSPHINX class library provides support for writing web crawlers
in Java. The class library offers a number of features:
Multithreaded Web page retrieval in a simple application framework
An object model that explicitly represents pages and links
Support for reusable page content classifiers
Tolerant HTML parsing
Support for the robot exclusion standard
Pattern matching, including regular expressions, Unix shell wildcards, and HTML tag expressions. Regular expressions are provided by the Apache jakarta-regexp regular expression library.
Common HTML transformations, such as concatenating pages, saving pages to disk, and renaming links
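To show how Unix shell wildcards relate to regular expressions, here is a small sketch that translates a wildcard into a regular expression and matches text against it. It is not the library's implementation: it uses java.util.regex instead of jakarta-regexp, it assumes only * (any run of characters) and ? (any single character) are special, and the names WildcardMatcher, toPattern, and matches are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; WebSPHINX's own wildcard syntax may differ.
class WildcardMatcher {

    /** Translates a shell-style wildcard into an equivalent regular expression. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");               // * matches any run of characters
            } else if (c == '?') {
                regex.append('.');                // ? matches exactly one character
            } else {
                // Quote everything else so regex metacharacters such as '.' stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole text matches the wildcard. */
    static boolean matches(String wildcard, String text) {
        return toPattern(wildcard).matcher(text).matches();
    }
}
```

With this sketch, the wildcard *.html matches index.html but not index.htm, because the translated pattern must match the entire string.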
Download
First, you need Java 1.2 or later installed on your computer. If you're
not sure which version you have, try running java -version. If you need
to install Java on Windows, Linux, or Solaris, go directly to Sun; for
other platforms, consult the list of Java ports.
If your computer has AFS access, run:
    java -jar /afs/cs.cmu.edu/user/rcm/www/websphinx/websphinx.jar
If you don't have AFS, download the JAR file websphinx.jar and then run:
    java -jar websphinx.jar
The Crawler Workbench will appear in a new window.
Examples

Here are some things to try in the Workbench.

Visualize part of the Web as a graph

This crawler retrieves the WebSPHINX pages
you've been reading and renders them as a graph of pages and links.
