Basic Web Crawler
Abstract
Create a web crawler that automatically navigates through websites, extracts content, and saves structured data. This project demonstrates advanced web scraping, URL management, and systematic data collection techniques.
Prerequisites
- Solid understanding of Python syntax
- Knowledge of web scraping with BeautifulSoup
- Familiarity with HTTP requests and web protocols
- Understanding of data structures (queues, sets)
- Basic knowledge of CSV file operations
Getting Started
- Install Required Dependencies
  pip install requests beautifulsoup4
- Run the Web Crawler
  python basicwebcrawler.py
- Configure Crawling (a sketch of this step follows the list)
  - Enter the starting URL
  - Set maximum pages to crawl
  - Choose whether to save results to CSV
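The exact prompts are defined in basicwebcrawler.py; the configuration step typically looks something like the following sketch, using the WebCrawler class described below (the prompt wording and variable names here are illustrative assumptions, not taken from the script):

if __name__ == "__main__":
    # Hypothetical interactive configuration - adjust to match basicwebcrawler.py
    start_url = input("Enter the starting URL: ").strip()
    max_pages = int(input("Maximum pages to crawl [10]: ") or "10")
    save_csv = input("Save results to CSV? (y/n): ").strip().lower().startswith("y")

    crawler = WebCrawler(start_url, max_pages=max_pages, delay=1)
    crawler.crawl()
    if save_csv:
        crawler.save_to_csv()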
Code Explanation
Crawler Architecture
from collections import deque

class WebCrawler:
    def __init__(self, start_url, max_pages=10, delay=1):
        self.start_url = start_url          # domain restriction is based on this
        self.max_pages = max_pages          # stop after this many pages
        self.delay = delay                  # seconds to wait between requests
        self.visited_urls = set()           # fast duplicate detection
        self.to_visit = deque([start_url])  # FIFO queue of URLs to crawl
        self.crawled_data = []              # one structured record per crawled page
Uses a queue-based architecture to process URLs systematically while avoiding duplicates.
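For clarity, here is the queue-plus-set pattern in isolation (a minimal standalone sketch, not the crawler itself): the deque gives FIFO ordering, so pages are visited roughly breadth-first, and the set makes each duplicate check a constant-time lookup.

from collections import deque

to_visit = deque(["https://example.com"])   # frontier of URLs still to process
visited = set()                             # every URL already handled

while to_visit:
    url = to_visit.popleft()    # FIFO: oldest discovered URL first (breadth-first)
    if url in visited:
        continue                # O(1) membership test skips duplicates
    visited.add(url)
    # ...fetch the page here and append newly discovered URLs to to_visit...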
URL Validation and Management
def is_valid_url(self, url):
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def extract_links(self, html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a', href=True):
        # Resolve relative hrefs against the page URL
        full_url = urljoin(base_url, link['href'])
        if self.is_valid_url(full_url):
            links.append(full_url)
    return links
Implements robust URL handling with validation and proper absolute URL construction.
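To see what the validation and resolution do in practice, here is a small standalone example (the URLs are illustrative):

from urllib.parse import urlparse, urljoin

base = "https://example.com/blog/post.html"
print(urljoin(base, "/about"))        # -> https://example.com/about
print(urljoin(base, "other.html"))    # -> https://example.com/blog/other.html

parsed = urlparse("mailto:someone@example.com")
print(bool(parsed.netloc) and bool(parsed.scheme))  # -> False, no network location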
Data Extraction Pipeline
def extract_page_data(self, html, url):
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')
    title = title_tag.get_text().strip() if title_tag else ''
    meta_desc = soup.find('meta', attrs={'name': 'description'})
    description = meta_desc.get('content', '').strip() if meta_desc else ''
    headings = [h.get_text().strip() for h in soup.find_all(['h1', 'h2', 'h3'])]
    return {'url': url, 'title': title, 'description': description,
            'headings': '; '.join(headings)}
Extracts structured data including titles, descriptions, headings, and content previews.
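The fragment above does not show how the content preview is produced; one hedged way to do it (the extract_preview name and the 200-character limit are assumptions, not taken from the script) is to drop script and style tags and keep the start of the visible text:

def extract_preview(self, soup, length=200):
    # Remove non-visible content before collapsing whitespace
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = ' '.join(soup.get_text().split())
    return text[:length]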
Respectful Crawling
def crawl(self):
    while self.to_visit and len(self.visited_urls) < self.max_pages:
        url = self.to_visit.popleft()
        if url in self.visited_urls:
            continue
        time.sleep(self.delay)  # be respectful - add a delay between requests
        response = requests.get(url, timeout=10)
        self.visited_urls.add(url)
        self.crawled_data.append(self.extract_page_data(response.text, url))
        for link in self.extract_links(response.text, url):
            if urlparse(link).netloc == urlparse(self.start_url).netloc:  # only crawl within the same domain
                self.to_visit.append(link)
Implements ethical crawling practices with delays and domain restrictions.
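The Features list below also mentions error handling, which the fragment above omits. A hedged sketch of a request helper (the fetch name and timeout value are assumptions) that keeps one bad URL from stopping the whole crawl:

import requests

def fetch(url, timeout=10):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # treat HTTP 4xx/5xx responses as failures
        return response.text
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None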
Data Export
def save_to_csv(self, filename="crawl_results.csv"):
    # Column order for the CSV header
    fieldnames = ['url', 'title', 'description', 'headings']
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for data in self.crawled_data:
            writer.writerow(data)
Exports crawled data to structured CSV format for analysis and reporting.
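Once the file exists, the structured output is easy to consume; for example, reading it back with csv.DictReader (column names as written by save_to_csv above):

import csv

# Load the exported results for a quick summary
with open("crawl_results.csv", newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

print(f"Crawled {len(rows)} pages")
for row in rows:
    print(row['url'], '-', row['title'])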
Features
- Systematic Crawling: Queue-based URL processing with duplicate detection
- Data Extraction: Captures titles, descriptions, headings, and content
- Domain Restriction: Stays within the starting domain for focused crawling
- Rate Limiting: Respectful delays between requests
- Error Handling: Robust handling of network errors and invalid URLs
- CSV Export: Structured data export for further analysis
- Progress Tracking: Real-time crawling progress and statistics
- Configurable Limits: Set maximum pages and crawling parameters
Next Steps
Enhancements
- Add support for robots.txt compliance (see the sketch after this list)
- Implement depth-first vs breadth-first crawling options
- Create advanced filtering based on content type
- Add database storage for large-scale crawling
- Implement parallel/concurrent crawling
- Create web interface for crawler management
- Add image and file download capabilities
- Implement crawling analytics and reporting
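For the robots.txt item above, Python's standard library already handles the parsing; a minimal sketch (the allowed_by_robots name and user agent string are assumptions) that could gate each request before it is made:

from urllib import robotparser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent="BasicWebCrawler"):
    # Fetch and parse robots.txt for the URL's domain, then check permission
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)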
Learning Extensions
- Study advanced web scraping techniques and anti-bot measures
- Explore distributed crawling systems
- Learn about search engine indexing principles
- Practice with database integration for large datasets
- Understand legal and ethical considerations in web crawling
- Explore machine learning for content classification
Educational Value
This project teaches:
- Web Crawling Architecture: Designing systematic data collection systems
- Queue Management: Using data structures for efficient URL processing
- HTTP Programming: Advanced request handling and error management
- Data Extraction: Parsing and structuring web content systematically
- File Operations: Writing structured data to various file formats
- Ethical Programming: Implementing respectful web interaction practices
- URL Management: Handling relative/absolute URLs and link resolution
- System Design: Building scalable and maintainable data collection tools
Perfect for understanding large-scale data collection, web technologies, and building tools for automated information gathering.