You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
|
9 years ago | |
---|---|---|
.gitignore | 9 years ago | |
LICENSE | 9 years ago | |
README.md | 9 years ago | |
malcrawler.py | 9 years ago |
README.md
malcrawler
Crawler that checks links on websites for malware activity (google safebrowsing v2, surbl, spamhaus)
Prerequisites
- Google Safe Browsing: https://code.google.com/p/google-safe-browsing/
- Spam Blocklist: https://pypi.python.org/pypi/spam-blocklists/0.9.3 (pip fetchable)
- TLD extract: https://pypi.python.org/pypi/tldextract (pip fetchable)
- BeautifulSoup
TODO:
- Maybe split out safebrowse logic in a server part so it can stay running and doesn't need to reinitialize every time.
- Robuster crawling
- Unicode niggles.
- Proper handling of Ctrl-C