TorCrawl.py is a Python script that enables anonymous web scraping of regular and onion webpages through the Tor network.
Crawl and extract (regular or onion) webpages through the Tor network.
This tool is designed for users who need to collect data from websites, including onion services, without exposing their own IP address to the sites they visit. It is suited to privacy-conscious programmers and researchers who want to crawl and extract webpage content anonymously while respecting web crawling ethics.
Users must have the Tor service installed and running for the tool to function properly. While the crawler can bypass robots.txt restrictions, it is recommended to respect website terms of service and copyright laws. The tool is suitable for responsible data gathering and privacy-focused automation.
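One way to respect robots.txt before crawling is to check each URL against the site's rules. The sketch below uses Python's standard `urllib.robotparser` and is not part of TorCrawl itself; the helper name `is_allowed`, the `ExampleBot` user agent, and the example rules and URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/index.html", "http://example.com/private/data"):
        verdict = "allowed" if is_allowed(EXAMPLE_RULES, "ExampleBot", url) else "disallowed"
        print(url, "is", verdict)
```

In practice the robots.txt text would be downloaded from the target site first (through Tor, if anonymity matters) and the check applied before each page is requested.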
git clone https://github.com/MikeMeliz/TorCrawl.py.git
pip install -r requirements.txt
Install and start the Tor service, depending on your OS:
For Debian/Ubuntu: apt-get install tor
Start Tor service: service tor start
For Windows: Download tor.exe
Install Tor service: tor.exe --service install
Start Tor service: tor.exe --service start
For macOS: brew install tor
Start Tor service: brew services start tor
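After starting the service, a quick sanity check is to see whether Tor's SOCKS port accepts connections. The sketch below assumes Tor's default SOCKS port, 9050 on localhost; a customized torrc may use a different port. The helper `is_port_open` is a hypothetical name, not part of TorCrawl, and it only confirms that something is listening on the port, not that the listener is actually Tor.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 9050, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on 127.0.0.1:9050 (assumed to be Tor).")
    else:
        print("Nothing is listening on 127.0.0.1:9050; is the Tor service running?")
```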
torcrawl -u http://www.github.com/ | grep 'google-analytics'
Scrape the specified webpage anonymously through Tor and filter output for 'google-analytics' entries.
torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
Verbose (-v) crawl of the webpage with link crawling enabled (-c), 2 levels deep (-d 2), and a 2-second pause between requests (-p 2).
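The crawl, depth, and pause options above describe a depth-limited crawl with a delay between requests. The sketch below shows that general idea as a breadth-first traversal; it is an illustration under stated assumptions, not TorCrawl's actual implementation. `fetch_links` is a hypothetical callback that returns the links found on a page, and `FAKE_SITE` is made-up data, so the example runs without any network access.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          depth: int = 1,
          pause: float = 0.0) -> List[str]:
    """Breadth-first crawl from start_url, following links up to `depth` levels.

    fetch_links is a hypothetical callback returning the links on a page.
    Returns every discovered URL, in discovery order.
    """
    found = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    requests_made = 0
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # at the depth limit: record the URL but do not fetch it
        if requests_made and pause > 0:
            time.sleep(pause)  # wait between consecutive requests
        requests_made += 1
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append((link, level + 1))
    return found


# Made-up link graph standing in for real pages.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": ["http://example.com/", "http://example.com/c"],
    "http://example.com/c": ["http://example.com/d"],
}

if __name__ == "__main__":
    for found_url in crawl("http://example.com/", lambda u: FAKE_SITE.get(u, []), depth=2):
        print(found_url)
```

Passing the fetcher in as a callback keeps the traversal logic separate from how pages are retrieved, so the same loop could be driven by a Tor-proxied HTTP client or by test data like the dictionary above.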