Mo McRoberts edited this page Jan 30, 2017 · 2 revisions

crawld is a web crawler built on libcrawl. It operates using a relational database to track the status of resources, although other queue implementations could be added later.

Parallel processing

crawld supports a configurable number of distinct crawl threads, and can operate as part of a parallel-processing cluster (via libcluster).

Policy handling

crawld implements basic configuration-driven policy processing: it can be configured to whitelist or blacklist particular URI schemes and MIME types.

Processors

crawld provides processors, which act on resources once they have been successfully retrieved.

libcrawld

The core of crawld is compiled into libcrawld, around which crawld implements a thin wrapper. A command-line utility to add resources to the configured crawl queue, crawler-add, also links against libcrawld.

crawld does not currently support being extended through loadable modules, but is intended to support this in the future, via an interface to libcrawld.
