Web Page Scraper with Notifications

Abstract

Build an automated web page monitoring system that scrapes websites for content changes and sends notifications through multiple channels including email and desktop alerts. This project demonstrates web scraping, automation, content comparison, and notification systems.

Prerequisites

  • Basic understanding of Python syntax
  • Knowledge of web scraping concepts
  • Familiarity with HTTP requests and HTML parsing
  • Understanding of email protocols and SMTP
  • Basic knowledge of scheduling and automation

Getting Started

  1. Install Required Dependencies

    pip install requests beautifulsoup4 schedule plyer
  2. Configure Email Settings (Optional)

    • Set up Gmail app password for notifications
    • Update email configuration in the script
    • Test email functionality
  3. Run the Web Page Monitor

    python webpagescrapernotifications.py
  4. Customize Monitoring

    • Change the URL to monitor your desired website
    • Adjust checking intervals (default: 30 minutes)
    • Configure notification preferences
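The email settings from step 2 can be collected in a dictionary whose keys match those the script reads. A sketch using the standard Gmail SMTP values; the addresses and app password are placeholders you must replace:

```python
# Hypothetical configuration; key names match those used in the script below.
email_config = {
    'smtp_server': 'smtp.gmail.com',  # Gmail's SMTP host
    'smtp_port': 587,                 # STARTTLS submission port
    'from_email': 'you@gmail.com',    # sender address (placeholder)
    'to_email': 'you@gmail.com',      # recipient address (placeholder)
    'password': 'your-app-password',  # Gmail app password (placeholder)
}
```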

Code Explanation

Web Page Content Extraction

webpagescrapernotifications.py
def get_page_content(self):
    # A timeout keeps a slow or unresponsive site from hanging the monitor
    response = requests.get(self.url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract visible text only, so markup-only changes are ignored
    return soup.get_text().strip()

Uses BeautifulSoup to parse the HTML and extract the page's text content for later comparison.

Content Change Detection

webpagescrapernotifications.py
def get_content_hash(self, content):
    # Hash the text so only a short digest needs to be stored, not the full page
    return hashlib.md5(content.encode()).hexdigest()

def check_for_changes(self):
    current_content = self.get_page_content()
    current_hash = self.get_content_hash(current_content)

    # Skip the first run (nothing to compare against yet)
    if self.previous_content is not None and current_hash != self.previous_content:
        print("Content change detected!")
    # Remember the latest hash for the next comparison
    self.previous_content = current_hash

Generates an MD5 hash of the page text so changes can be detected by comparing short digests instead of storing full content. (MD5 is adequate here because the hash is used for change detection, not security.)
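The hashing step can be exercised on its own. A minimal sketch of the same digest-comparison idea:

```python
import hashlib

def get_content_hash(content):
    # Same digest logic as the monitor: hash the text, compare short digests
    return hashlib.md5(content.encode()).hexdigest()

old_hash = get_content_hash("Breaking news: nothing happened")
new_hash = get_content_hash("Breaking news: something happened")
print(old_hash != new_hash)  # differing text yields differing digests
```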

Email Notification System

webpagescrapernotifications.py
def send_notification_email(self, subject, message):
    msg = MIMEMultipart()
    msg['From'] = self.email_config['from_email']
    msg['To'] = self.email_config['to_email']
    msg['Subject'] = subject
    # Attach the body and serialize the full message before sending
    msg.attach(MIMEText(message, 'plain'))
    text = msg.as_string()

    server = smtplib.SMTP(self.email_config['smtp_server'], self.email_config['smtp_port'])
    server.starttls()  # upgrade the connection to TLS before authenticating
    server.login(self.email_config['from_email'], self.email_config['password'])
    server.sendmail(self.email_config['from_email'], self.email_config['to_email'], text)
    server.quit()

Implements SMTP email sending with proper authentication and formatting for change notifications.

Desktop Notification Integration

webpagescrapernotifications.py
def send_desktop_notification(self, title, message):
    import plyer
    # plyer dispatches to the native notification backend for the current OS
    plyer.notification.notify(
        title=title,
        message=message,
        timeout=10  # seconds the notification stays visible
    )

Provides cross-platform desktop notifications using the plyer library for immediate alerts.

Automated Scheduling

webpagescrapernotifications.py
def start_monitoring(self, interval_minutes=30):
    # Register the periodic job, then poll once a second for due jobs
    schedule.every(interval_minutes).minutes.do(self.check_for_changes)

    while True:
        schedule.run_pending()
        time.sleep(1)

Implements continuous monitoring with configurable intervals using the schedule library.

Object-Oriented Architecture

webpagescrapernotifications.py
class WebPageMonitor:
    def __init__(self, url, email_config=None):
        self.url = url
        self.email_config = email_config
        self.previous_content = None  # holds the last content hash between checks

Uses class-based design for better organization and state management across monitoring sessions.
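The state-keeping idea can be shown in a condensed, self-contained sketch in which the network fetch is replaced by a `page_text` parameter (the URL is a placeholder; `check` is a simplified stand-in for the monitor's `check_for_changes`):

```python
import hashlib

class WebPageMonitor:
    """Minimal sketch of the monitor's state handling (network calls stubbed out)."""

    def __init__(self, url, email_config=None):
        self.url = url
        self.email_config = email_config
        self.previous_content = None  # last content hash, carried between checks

    def check(self, page_text):
        # Compare the hash of the latest text against the stored one
        current = hashlib.md5(page_text.encode()).hexdigest()
        changed = self.previous_content is not None and current != self.previous_content
        self.previous_content = current  # update state for the next check
        return changed

monitor = WebPageMonitor('https://example.com')  # placeholder URL
print(monitor.check("v1"))  # first observation: nothing to compare yet
print(monitor.check("v1"))  # unchanged content
print(monitor.check("v2"))  # content changed
```

Because the hash lives on the instance, each monitored site gets its own `WebPageMonitor` object with independent state.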

Features

  • Automated Web Scraping: Continuously monitors specified web pages
  • Change Detection: Uses content hashing for efficient change identification
  • Multiple Notification Channels: Supports email and desktop notifications
  • Configurable Intervals: Customizable monitoring frequency
  • Error Handling: Robust handling of network and parsing errors
  • Cross-Platform Support: Works on Windows, macOS, and Linux
  • Email Integration: SMTP support for Gmail and other providers
  • Content Hashing: Efficient comparison without storing full content
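The error-handling feature amounts to wrapping the fetch in a try/except so one failed request does not kill the monitoring loop. A minimal sketch of that pattern (the function name and URL are illustrative, not from the script):

```python
import requests

def safe_fetch(url):
    # Catch network failures so a transient outage doesn't stop the monitor;
    # the loop simply retries on the next scheduled interval.
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        print(f"Check failed, will retry next interval: {exc}")
        return None

# A reserved .invalid hostname exercises the error path without a real outage
result = safe_fetch('http://nonexistent.invalid/')
print(result is None)
```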

Next Steps

Enhancements

  • Add support for monitoring specific page elements (CSS selectors)
  • Implement webhook notifications for integration with other services
  • Create a web dashboard for managing multiple monitored sites
  • Add database storage for change history and analytics
  • Implement different notification types (SMS, Slack, Discord)
  • Create configuration file support for managing multiple sites
  • Add image/screenshot comparison capabilities
  • Implement rate limiting and respectful scraping practices
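The first enhancement above, watching only part of a page, is a small change to the extraction step: pass a CSS selector to BeautifulSoup's `select_one` and hash only that element's text. A sketch assuming the same parsing stack (the selector and HTML are illustrative):

```python
from bs4 import BeautifulSoup

def get_selected_text(html, selector):
    # Restrict change detection to one element instead of the whole page
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.select_one(selector)
    return element.get_text().strip() if element else ''

html = '<html><body><div id="price">$19.99</div><footer>ads</footer></body></html>'
print(get_selected_text(html, '#price'))  # only the watched element's text
```

Hashing the selected text instead of the full page ignores churn in ads, footers, and timestamps elsewhere on the page.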

Learning Extensions

  • Study advanced web scraping techniques and anti-bot measures
  • Explore headless browsers (Selenium, Playwright) for dynamic content
  • Learn about web scraping ethics and legal considerations
  • Practice with database integration for historical data
  • Understand caching strategies for efficient monitoring
  • Explore real-time monitoring with WebSocket connections

Educational Value

This project teaches:

  • Web Scraping: Extracting and processing data from websites
  • Content Comparison: Efficiently detecting changes in large datasets
  • Automation: Building systems that run continuously without human intervention
  • Notification Systems: Implementing multiple communication channels
  • Email Programming: Working with SMTP protocols and email formatting
  • Scheduling: Creating time-based automation with proper resource management
  • Error Handling: Managing network failures and parsing errors gracefully
  • Object-Oriented Design: Structuring applications for maintainability and reusability

Perfect for understanding automation, web data monitoring, and building practical tools for content tracking and alerting systems.
