Mo McRoberts edited this page Jan 30, 2017 · 2 revisions

crawld is a web crawler built on libcrawl. It operates using a relational database to track the status of resources, although other queue implementations could be added later.

Parallel processing

crawld supports a configurable number of distinct crawl threads, and can operate as part of a parallel-processing cluster (via libcluster).

Policy handling

crawld implements basic configuration-driven policy processing: it can be configured to whitelist or blacklist particular URI schemes and MIME types.

Processors

crawld provides processors, which act on resources once they have been successfully retrieved.

libcrawld

The core of crawld is compiled into libcrawld, around which crawld implements a thin wrapper. A command-line utility to add resources to the configured crawl queue, crawler-add, also links against libcrawld.

crawld does not currently support being extended through loadable modules, but is intended to support this in the future, via an interface to libcrawld.
