crawld
crawld is a web crawler built on libcrawl. It uses a relational database to track the status of resources, although other queue implementations could be added later.
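As a purely illustrative sketch, a relational queue might track per-resource state along these lines; the state names and fields here are hypothetical, not crawld's actual schema:

```c
#include <time.h>

/* Hypothetical set of states a relational queue might record for each
 * resource; crawld's real schema may differ. */
typedef enum
{
	QS_NEW,       /* discovered but not yet fetched */
	QS_FETCHING,  /* claimed by a crawl thread */
	QS_COMPLETE,  /* fetched and handed to processors */
	QS_FAILED     /* fetch failed; may be retried later */
} queue_state;

/* Hypothetical queue entry, corresponding to one row in the database */
typedef struct
{
	char *uri;
	queue_state state;
	time_t next_fetch;  /* earliest time the resource may be (re)fetched */
	int attempts;       /* failed fetch attempts so far */
} queue_entry;
```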
crawld supports a configurable number of distinct crawl threads, and can operate as part of a parallel-processing cluster (via libcluster).
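For illustration only, a fixed pool of crawl threads could be driven roughly as follows; crawl_thread_main and the thread count are hypothetical stand-ins for crawld's own configuration and fetch loop:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-thread entry point: each thread would loop,
 * dequeuing resources and fetching them until shut down. */
static void *crawl_thread_main(void *arg)
{
	int id = *(int *) arg;

	printf("crawl thread %d started\n", id);
	/* ... dequeue-fetch-process loop would live here ... */
	return NULL;
}

int main(void)
{
	int nthreads = 4; /* would come from crawld's configuration */
	pthread_t *threads = calloc(nthreads, sizeof(pthread_t));
	int *ids = calloc(nthreads, sizeof(int));

	for(int i = 0; i < nthreads; i++)
	{
		ids[i] = i;
		pthread_create(&threads[i], NULL, crawl_thread_main, &ids[i]);
	}
	for(int i = 0; i < nthreads; i++)
	{
		pthread_join(threads[i], NULL);
	}
	free(ids);
	free(threads);
	return 0;
}
```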
crawld implements basic configuration-driven policy processing: it can be configured to white- or black-list certain URI schemes and MIME types.
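To make the policy idea concrete, below is a minimal sketch of a scheme whitelist check applied before a fetch; the allowed_schemes list and scheme_permitted function are illustrative, not crawld's configuration syntax:

```c
#include <stddef.h>
#include <stdio.h>
#include <strings.h>

/* Hypothetical whitelist: only these URI schemes would be fetched */
static const char *allowed_schemes[] = { "http", "https", NULL };

/* Return nonzero if the scheme passes the (hypothetical) policy */
static int scheme_permitted(const char *scheme)
{
	for(size_t i = 0; allowed_schemes[i]; i++)
	{
		if(!strcasecmp(scheme, allowed_schemes[i]))
		{
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	printf("ftp permitted? %d\n", scheme_permitted("ftp"));     /* 0 */
	printf("https permitted? %d\n", scheme_permitted("https")); /* 1 */
	return 0;
}
```

A MIME-type check would follow the same pattern, applied to the Content-Type of a fetched response rather than to the URI before the fetch.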
crawld provides processors, which act on resources once they have been successfully retrieved.
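A processor is, in essence, a hook invoked after a successful fetch. The callback shape below is a hypothetical illustration of that idea, not libcrawld's actual API:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical processor callback: invoked with the retrieved payload
 * once a resource has been fetched successfully. */
typedef int (*processor_fn)(const char *uri,
                            const char *content_type,
                            const unsigned char *payload,
                            size_t len);

/* Example processor: log what was fetched */
static int log_processor(const char *uri, const char *content_type,
                         const unsigned char *payload, size_t len)
{
	(void) payload;
	printf("fetched %s (%s, %zu bytes)\n", uri, content_type, len);
	return 0;
}

int main(void)
{
	processor_fn proc = log_processor;

	/* A crawl loop would invoke each registered processor like this: */
	return proc("http://example.com/", "text/html",
	            (const unsigned char *) "<html/>", 7);
}
```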
The core of crawld is compiled into libcrawld; the crawld executable itself is a thin wrapper around this library. crawler-add, a command-line utility which adds resources to the configured crawl queue, also links against libcrawld.
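Conceptually, crawler-add just pushes URIs onto the configured queue. A self-contained sketch of such a tool follows, with stubbed, hypothetical stand-ins for libcrawld's queue calls; the real API may differ:

```c
#include <stdio.h>

/* Hypothetical libcrawld entry points, stubbed here so the sketch
 * compiles and runs on its own. */
static int queue_connect(void)
{
	return 0; /* would open the configured crawl queue */
}

static int queue_add_uri(const char *uri)
{
	printf("queued %s\n", uri); /* would insert the URI into the queue */
	return 0;
}

int main(int argc, char **argv)
{
	if(argc < 2)
	{
		fprintf(stderr, "Usage: %s URI...\n", argv[0]);
		return 1;
	}
	if(queue_connect())
	{
		fprintf(stderr, "failed to connect to crawl queue\n");
		return 1;
	}
	for(int i = 1; i < argc; i++)
	{
		queue_add_uri(argv[i]);
	}
	return 0;
}
```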
crawld cannot currently be extended through loadable modules, but is intended to support this in the future via an interface to libcrawld.