WebSPHINX: A Personal, Customizable Web Crawler

Contents
  • About WebSPHINX
  • Download
  • Examples
  • FAQ
  • Source Code (latest release v0.5, July
    8, 2002; see change history)
  • Documentation
  • Publications
  • Related Links
  • Acknowledgements

About WebSPHINX

WebSPHINX (Website-Specific Processors for HTML
INformation eXtraction) is a Java class library and interactive
development environment for web crawlers. A web crawler (also called a
robot or spider) is a program that browses and processes Web pages automatically.

WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX
class library.

Crawler Workbench

The Crawler Workbench is a graphical user interface that lets you configure
and control a customizable web crawler. Using the Crawler Workbench, you
can:

  • Visualize a collection of web pages as a graph
  • Save pages to your local disk for offline browsing
  • Concatenate pages together for viewing or printing
    them as a single document
  • Extract all text matching a certain pattern from
    a collection of pages
  • Develop a custom crawler in Java or JavaScript that
    processes pages however you want
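
To make the pattern-extraction task in the list above concrete, here is a minimal sketch in plain Java. It is not the Workbench's implementation and does not use the WebSPHINX API: the class name ExtractSketch, the method extractAll, and the sample page strings are all hypothetical, the pages are held in memory rather than fetched, and java.util.regex from a modern JDK stands in for whatever matcher the Workbench uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of WebSPHINX.
class ExtractSketch {
    /** Collects every match of the pattern, page by page, in document order. */
    static List<String> extractAll(Pattern pattern, List<String> pages) {
        List<String> hits = new ArrayList<>();
        for (String page : pages) {
            Matcher m = pattern.matcher(page);
            while (m.find()) {
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up page contents standing in for downloaded pages.
        List<String> pages = Arrays.asList(
                "Released 2002-07-08. Updated 2002-07-09.",
                "No dates on this page.",
                "Archived 2001-01-15.");
        Pattern isoDate = Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");
        System.out.println(extractAll(isoDate, pages));
        // prints [2002-07-08, 2002-07-09, 2001-01-15]
    }
}
```

Keeping the pattern as a parameter means the same loop serves any extraction task; only the regular expression changes.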

WebSPHINX class library

The WebSPHINX class library provides support for writing web crawlers
in Java. The class library offers a number of features:

  • Multithreaded Web page retrieval in a simple
    application framework
  • An object model that explicitly represents pages
    and links
  • Support for reusable page content classifiers
  • Tolerant HTML parsing
  • Support for the robot exclusion standard
  • Pattern matching, including regular expressions,
    Unix shell wildcards, and HTML tag expressions. Regular
    expressions are provided by the Apache jakarta-regexp regular expression
    library.
  • Common HTML transformations, such as concatenating
    pages, saving pages to disk, and renaming links
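
As an illustration of the Unix shell wildcard matching mentioned in the list above, the following sketch translates a wildcard into a regular expression. It is an assumption-laden example, not the library's code: it uses java.util.regex from the JDK instead of jakarta-regexp, it handles only the * and ? wildcards, and the names WildcardSketch, compileWildcard, and matches are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of WebSPHINX.
class WildcardSketch {
    /** Translates a shell-style wildcard (only * and ? are supported) into a regex. */
    static Pattern compileWildcard(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush pending literal text, quoted so regex metacharacters stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true when the whole text matches the wildcard. */
    static boolean matches(String wildcard, String text) {
        return compileWildcard(wildcard).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("http://example.com/docs/*",
                "http://example.com/docs/index.html")); // prints true
        System.out.println(matches("*.html", "index.htm")); // prints false
    }
}
```

Quoting the literal runs keeps characters such as the dot in ".html" from being read as regex operators, which is the usual pitfall when wildcards are converted by simple string replacement.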

Download

First, you need Java 1.2 or later installed on your computer. If you're
not sure whether it is installed, try running java -version. If you need
to install Java on Windows, Linux, or Solaris, go directly to Sun; for
other platforms, consult the list of Java ports.

If your computer has AFS access, run java -jar /afs/cs.cmu.edu/user/rcm/www/websphinx/websphinx.jar

If you don't have AFS, you'll need to download this JAR file:

websphinx.jar

and then run java -jar websphinx.jar

The Crawler Workbench will appear in a new window.

Examples

Here are some things to try in the Workbench.

Visualize part of the Web as a graph

This crawler retrieves the WebSPHINX pages
you've been reading and renders them as a graph of pages and links.
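
The idea of a crawl as a graph of pages and links can be sketched with a breadth-first traversal. The example below is a hedged illustration rather than WebSPHINX code: it walks a made-up, in-memory link table instead of fetching real pages, and the names LinkGraphSketch, crawlOrder, and maxDepth are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of WebSPHINX.
class LinkGraphSketch {
    /** Returns pages in the order a breadth-first crawl discovers them, up to maxDepth links from root. */
    static List<String> crawlOrder(Map<String, List<String>> links, String root, int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>(); // keeps discovery order
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            int d = depth.get(page);
            if (d == maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String target : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!depth.containsKey(target)) {
                    depth.put(target, d + 1);
                    queue.add(target);
                }
            }
        }
        return new ArrayList<>(depth.keySet());
    }

    public static void main(String[] args) {
        // Made-up link table: page name -> pages it links to.
        Map<String, List<String>> links = new HashMap<>();
        links.put("index", Arrays.asList("doc", "faq"));
        links.put("doc", Arrays.asList("faq", "api"));
        links.put("faq", Arrays.asList("index"));
        System.out.println(crawlOrder(links, "index", 1)); // prints [index, doc, faq]
        System.out.println(crawlOrder(links, "index", 2)); // prints [index, doc, faq, api]
    }
}
```

The visited map doubles as the node set of the graph, and each (page, target) pair examined in the inner loop is an edge, which is the information a graph view needs.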
