Web Page Scraper with Notifications
Abstract
Build an automated web page monitoring system that scrapes websites for content changes and sends notifications through multiple channels including email and desktop alerts. This project demonstrates web scraping, automation, content comparison, and notification systems.
Prerequisites
- Basic understanding of Python syntax
- Knowledge of web scraping concepts
- Familiarity with HTTP requests and HTML parsing
- Understanding of email protocols and SMTP
- Basic knowledge of scheduling and automation
Getting Started
1. Install Required Dependencies
pip install requests beautifulsoup4 schedule plyer
2. Configure Email Settings (Optional)
- Set up a Gmail app password for notifications
- Update the email configuration in the script
- Test the email functionality
3. Run the Web Page Monitor
python webpagescrapernotifications.py
4. Customize Monitoring
- Change the URL to monitor your desired website
- Adjust the checking interval (default: 30 minutes)
- Configure notification preferences
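The email settings are passed to the script as a plain dictionary. A minimal sketch for Gmail, assuming the key names used later in this guide (from_email, to_email, password, smtp_server, smtp_port); the addresses and app password are placeholders:

```python
# Hypothetical Gmail configuration; replace the placeholder values with your own
email_config = {
    "from_email": "you@gmail.com",
    "to_email": "alerts@example.com",
    "password": "your-16-char-app-password",  # Gmail app password, not the account password
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,  # STARTTLS port
}
```

With Gmail, an app password (generated under Google Account security settings) is required; regular account passwords are rejected for SMTP logins.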
Code Explanation
Web Page Content Extraction
def get_page_content(self):
    # Fetch the page; the timeout prevents a hung request from stalling the monitor
    response = requests.get(self.url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.get_text().strip()
Uses BeautifulSoup to parse the HTML and extract the page's text content for change comparison.
Content Change Detection
def get_content_hash(self, content):
    # MD5 is fine here: we only need a fast fingerprint, not cryptographic strength
    return hashlib.md5(content.encode()).hexdigest()

def check_for_changes(self):
    current_content = self.get_page_content()
    current_hash = self.get_content_hash(current_content)
    if self.previous_hash is not None and current_hash != self.previous_hash:
        print("Content change detected!")
    self.previous_hash = current_hash
Generates MD5 hashes of page content to efficiently detect any changes without storing full content.
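The detection logic can be exercised offline. A minimal sketch, using a hypothetical TinyMonitor class that mirrors the hashing pattern above on local strings instead of live pages:

```python
import hashlib

class TinyMonitor:
    """Offline sketch of hash-based change detection (hypothetical class,
    applying the same hashing pattern to local strings instead of live pages)."""
    def __init__(self):
        self.previous_hash = None

    def check(self, content):
        # A change is reported only when a previous hash exists and differs
        current_hash = hashlib.md5(content.encode()).hexdigest()
        changed = self.previous_hash is not None and current_hash != self.previous_hash
        self.previous_hash = current_hash
        return changed
```

The first check returns False (there is nothing to compare against yet), repeated identical content returns False, and new content returns True.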
Email Notification System
def send_notification_email(self, subject, message):
    msg = MIMEMultipart()
    msg['From'] = self.email_config['from_email']
    msg['To'] = self.email_config['to_email']
    msg['Subject'] = subject
    msg.attach(MIMEText(message, 'plain'))
    # The context manager closes the connection even if sending fails
    with smtplib.SMTP(self.email_config['smtp_server'], self.email_config['smtp_port']) as server:
        server.starttls()
        server.login(self.email_config['from_email'], self.email_config['password'])
        server.sendmail(self.email_config['from_email'], self.email_config['to_email'], msg.as_string())
Implements SMTP email sending with proper authentication and formatting for change notifications.
Desktop Notification Integration
def send_desktop_notification(self, title, message):
    from plyer import notification
    notification.notify(
        title=title,
        message=message,
        timeout=10
    )
Provides cross-platform desktop notifications using the plyer library for immediate alerts.
Automated Scheduling
def start_monitoring(self, interval_minutes=30):
    schedule.every(interval_minutes).minutes.do(self.check_for_changes)
    while True:
        schedule.run_pending()
        time.sleep(1)
Implements continuous monitoring with configurable intervals using the schedule library.
Object-Oriented Architecture
class WebPageMonitor:
    def __init__(self, url, email_config=None):
        self.url = url
        self.email_config = email_config
        self.previous_hash = None  # hash of the last fetched content
Uses class-based design for better organization and state management across monitoring sessions.
Features
- Automated Web Scraping: Continuously monitors specified web pages
- Change Detection: Uses content hashing for efficient change identification
- Multiple Notification Channels: Supports email and desktop notifications
- Configurable Intervals: Customizable monitoring frequency
- Error Handling: Robust handling of network and parsing errors
- Cross-Platform Support: Works on Windows, macOS, and Linux
- Email Integration: SMTP support for Gmail and other providers
- Content Hashing: Efficient comparison without storing full content
Next Steps
Enhancements
- Add support for monitoring specific page elements (CSS selectors)
- Implement webhook notifications for integration with other services
- Create a web dashboard for managing multiple monitored sites
- Add database storage for change history and analytics
- Implement different notification types (SMS, Slack, Discord)
- Create configuration file support for managing multiple sites
- Add image/screenshot comparison capabilities
- Implement rate limiting and respectful scraping practices
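The first enhancement above, monitoring specific page elements, can be sketched with BeautifulSoup's CSS-selector support. extract_element_text is a hypothetical helper; it works on an HTML string so that only the matched element's text feeds the hash, ignoring changes elsewhere on the page:

```python
from bs4 import BeautifulSoup

def extract_element_text(html, css_selector):
    # Return the text of the first element matching the selector, or None
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element else None

html = "<div><h1 id='price'>$19.99</h1><p>Ad banner of the day</p></div>"
print(extract_element_text(html, "#price"))  # only the price, not the ad
```

Hashing this narrower text instead of the whole page avoids false alarms from rotating ads, timestamps, or other incidental markup.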
Learning Extensions
- Study advanced web scraping techniques and anti-bot measures
- Explore headless browsers (Selenium, Playwright) for dynamic content
- Learn about web scraping ethics and legal considerations
- Practice with database integration for historical data
- Understand caching strategies for efficient monitoring
- Explore real-time monitoring with WebSocket connections
Educational Value
This project teaches:
- Web Scraping: Extracting and processing data from websites
- Content Comparison: Detecting changes efficiently with hashes instead of full-text comparison
- Automation: Building systems that run continuously without human intervention
- Notification Systems: Implementing multiple communication channels
- Email Programming: Working with SMTP protocols and email formatting
- Scheduling: Creating time-based automation with proper resource management
- Error Handling: Managing network failures and parsing errors gracefully
- Object-Oriented Design: Structuring applications for maintainability and reusability
Perfect for understanding automation, web data monitoring, and building practical tools for content tracking and alerting systems.