crawley

The unix-way web crawler.

Crawls web pages and prints any link it can find. Features:
- fast HTML SAX-parser
- small (under 1500 SLOC), idiomatic, 100% test-covered codebase
- grabs most useful resource URLs
- found URLs are streamed to stdout and guaranteed to be unique (see the example below)
- scan depth can be configured
- can crawl rules and sitemaps from robots.txt
- brute mode - scans HTML comments for URLs
- makes use of the HTTP_PROXY / HTTPS_PROXY environment variables and handles proxy auth
- directory-only scan mode
- user-defined cookies, in curl-compatible format
- user-defined headers
- tag filter
- url ignore
- js parser
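
Because results are plain URLs streamed to stdout, crawley composes naturally with other unix tools. A minimal usage sketch, assuming the binary is on PATH, the target URL is passed as the last argument, and the proxy address and hostnames are placeholders:

```sh
# Proxy settings are read from the environment, as noted above.
export HTTPS_PROXY="http://user:pass@proxy.example:3128"

# Stream unique URLs to stdout and filter them with standard tools.
crawley https://example.com | grep -F '.pdf' > pdf-links.txt
```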