Web Page Scraper with Notifications
Abstract
Build an automated web page monitoring system that scrapes websites for content changes and sends notifications through multiple channels including email and desktop alerts. This project demonstrates web scraping, automation, content comparison, and notification systems.
Prerequisites
- Basic understanding of Python syntax
- Knowledge of web scraping concepts
- Familiarity with HTTP requests and HTML parsing
- Understanding of email protocols and SMTP
- Basic knowledge of scheduling and automation
Getting Started
1. Install Required Dependencies
pip install requests beautifulsoup4 schedule plyer
2. Configure Email Settings (Optional)
- Set up a Gmail app password for notifications
- Update the email configuration in the script
- Test the email functionality
3. Run the Web Page Monitor
python webpagescrapernotifications.py
4. Customize Monitoring
- Change the URL to monitor your desired website
- Adjust the checking interval (default: 30 minutes)
- Configure notification preferences
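The email settings are passed to the script as a plain dictionary. A minimal sketch for Gmail, assuming the key names used later in this guide (from_email, to_email, password, smtp_server, smtp_port); the addresses and app password are placeholders:

```python
# Hypothetical Gmail configuration; replace the placeholder values with your own
email_config = {
    "from_email": "you@gmail.com",
    "to_email": "alerts@example.com",
    "password": "your-16-char-app-password",  # Gmail app password, not the account password
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,  # STARTTLS port
}
```

With Gmail, an app password (generated under Google Account security settings) is required; regular account passwords are rejected for SMTP logins.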
Code Explanation
Web Page Content Extraction
def get_page_content(self):
    # Fetch the page; the timeout prevents a hung request from stalling the monitor
    response = requests.get(self.url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.get_text().strip()
Uses BeautifulSoup to parse the HTML and extract the page's text content for change comparison.
Content Change Detection
def get_content_hash(self, content):
    # MD5 is fine here: we only need a fast fingerprint, not cryptographic strength
    return hashlib.md5(content.encode()).hexdigest()

def check_for_changes(self):
    current_content = self.get_page_content()
    current_hash = self.get_content_hash(current_content)
    if self.previous_hash is not None and current_hash != self.previous_hash:
        print("Content change detected!")
    self.previous_hash = current_hash
Generates MD5 hashes of page content to efficiently detect any changes without storing full content.
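The detection logic can be exercised offline. A minimal sketch, using a hypothetical TinyMonitor class that mirrors the hashing pattern above on local strings instead of live pages:

```python
import hashlib

class TinyMonitor:
    """Offline sketch of hash-based change detection (hypothetical class,
    applying the same hashing pattern to local strings instead of live pages)."""
    def __init__(self):
        self.previous_hash = None

    def check(self, content):
        # A change is reported only when a previous hash exists and differs
        current_hash = hashlib.md5(content.encode()).hexdigest()
        changed = self.previous_hash is not None and current_hash != self.previous_hash
        self.previous_hash = current_hash
        return changed
```

The first check returns False (there is nothing to compare against yet), repeated identical content returns False, and new content returns True.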
Email Notification System
def send_notification_email(self, subject, message):
    msg = MIMEMultipart()
    msg['From'] = self.email_config['from_email']
    msg['To'] = self.email_config['to_email']
    msg['Subject'] = subject
    msg.attach(MIMEText(message, 'plain'))
    # The context manager closes the connection even if sending fails
    with smtplib.SMTP(self.email_config['smtp_server'], self.email_config['smtp_port']) as server:
        server.starttls()
        server.login(self.email_config['from_email'], self.email_config['password'])
        server.sendmail(self.email_config['from_email'], self.email_config['to_email'], msg.as_string())
Implements SMTP email sending with proper authentication and formatting for change notifications.
Desktop Notification Integration
def send_desktop_notification(self, title, message):
    from plyer import notification
    notification.notify(
        title=title,
        message=message,
        timeout=10
    )
Provides cross-platform desktop notifications using the plyer library for immediate alerts.
Automated Scheduling
def start_monitoring(self, interval_minutes=30):
    schedule.every(interval_minutes).minutes.do(self.check_for_changes)
    while True:
        schedule.run_pending()
        time.sleep(1)
Implements continuous monitoring with configurable intervals using the schedule library.
Object-Oriented Architecture
class WebPageMonitor:
    def __init__(self, url, email_config=None):
        self.url = url
        self.email_config = email_config
        self.previous_hash = None  # hash of the last fetched content
Uses class-based design for better organization and state management across monitoring sessions.
Features
- Automated Web Scraping: Continuously monitors specified web pages
- Change Detection: Uses content hashing for efficient change identification
- Multiple Notification Channels: Supports email and desktop notifications
- Configurable Intervals: Customizable monitoring frequency
- Error Handling: Robust handling of network and parsing errors
- Cross-Platform Support: Works on Windows, macOS, and Linux
- Email Integration: SMTP support for Gmail and other providers
- Content Hashing: Efficient comparison without storing full content
Next Steps
Enhancements
- Add support for monitoring specific page elements (CSS selectors)
- Implement webhook notifications for integration with other services
- Create a web dashboard for managing multiple monitored sites
- Add database storage for change history and analytics
- Implement different notification types (SMS, Slack, Discord)
- Create configuration file support for managing multiple sites
- Add image/screenshot comparison capabilities
- Implement rate limiting and respectful scraping practices
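The first enhancement above, monitoring specific page elements, can be sketched with BeautifulSoup's CSS-selector support. extract_element_text is a hypothetical helper; it works on an HTML string so that only the matched element's text feeds the hash, ignoring changes elsewhere on the page:

```python
from bs4 import BeautifulSoup

def extract_element_text(html, css_selector):
    # Return the text of the first element matching the selector, or None
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element else None

html = "<div><h1 id='price'>$19.99</h1><p>Ad banner of the day</p></div>"
print(extract_element_text(html, "#price"))  # only the price, not the ad
```

Hashing this narrower text instead of the whole page avoids false alarms from rotating ads, timestamps, or other incidental markup.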
Learning Extensions
- Study advanced web scraping techniques and anti-bot measures
- Explore headless browsers (Selenium, Playwright) for dynamic content
- Learn about web scraping ethics and legal considerations
- Practice with database integration for historical data
- Understand caching strategies for efficient monitoring
- Explore real-time monitoring with WebSocket connections
Educational Value
This project teaches:
- Web Scraping: Extracting and processing data from websites
- Content Comparison: Detecting changes efficiently with hashes instead of full-text comparison
- Automation: Building systems that run continuously without human intervention
- Notification Systems: Implementing multiple communication channels
- Email Programming: Working with SMTP protocols and email formatting
- Scheduling: Creating time-based automation with proper resource management
- Error Handling: Managing network failures and parsing errors gracefully
- Object-Oriented Design: Structuring applications for maintainability and reusability
Perfect for understanding automation, web data monitoring, and building practical tools for content tracking and alerting systems.