A Python-based web scraper for extracting dental insurance guidelines from major carriers' provider portals.
- Automated extraction of dental insurance guidelines
- Support for major carriers (Aetna, Cigna, MetLife, UHC)
- PDF processing and text extraction
- Structured data output in JSON format
- Rate limiting and proxy support
- Comprehensive error handling
- Python 3.11+
- Docker (optional, for containerized deployment)
- Access to provider portals
- Clone the repository:
    git clone https://github.com/yourusername/insurance-web-scraper.git
    cd insurance-web-scraper
- Install dependencies:
    pip install -r requirements.txt
- Configure your credentials:
    cp .env.example .env
    # Edit .env with your credentials
- Run the scraper:
    python -m scraper run --carrier aetna
Detailed documentation is available in the docs directory.
The scraper uses YAML configuration files for easy setup and maintenance. See the Configuration Guide for details.
Example configuration:
scraper:
  rate_limit: 5  # requests per second
  output_dir: "./data"
  proxy:
    enabled: true
    rotate_every: 60  # seconds
carriers:
  aetna:
    enabled: true
    base_url: "https://www.aetna.com/health-care-professionals"
  cigna:
    enabled: true
    base_url: "https://www.cigna.com/healthcare-providers"

For issues and feature requests, please use the GitHub issue tracker.
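To illustrate what the `rate_limit: 5` setting in the example configuration could mean in practice, here is a minimal sketch of a fixed-interval limiter. The `RateLimiter` class and its `wait` method are hypothetical names used for illustration; they are not taken from this project's middleware, which may work differently.

```python
import time
from typing import Callable


class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(
        self,
        rate_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rate_per_second <= 0:
            raise ValueError("rate_per_second must be positive")
        self.min_interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


# Assumed usage: at most 5 requests per second, as in the example configuration.
limiter = RateLimiter(rate_per_second=5)
for _ in range(3):
    limiter.wait()
    # A request to the carrier portal would be issued here.
```

The clock and sleep functions are injectable so the limiter can be exercised in tests without real delays.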
This project is licensed under the MIT License - see the LICENSE file for details.
- Python 3.11+
- Docker (optional for containerized deployment)
- Docker Compose (optional for containerized deployment)
- At least 4GB RAM available
- At least 10GB disk space
.
├── dental_scraper/ # Main Python package
│ ├── spiders/ # Spider implementations for each carrier
│ ├── scrapers/ # Base scraping functionality
│ ├── models/ # Data models and validation
│ ├── middlewares/ # Request middleware (rate limiting, etc.)
│ ├── utils/ # Utility functions and helpers
│ ├── app.py # Application entry point
│ ├── exceptions.py # Custom exceptions
│ └── pdf/ # PDF processing functionality
├── tests/ # Test suite
│ └── spiders/ # Spider-specific tests
├── data/ # Data storage
│ ├── raw/ # Raw downloaded PDFs
│ └── processed/ # Processed data files
├── logs/ # Application logs
└── cache/ # Cache storage
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r config/requirements.txt
The system uses the following services when deployed with Docker:
- Scraper Service: Python 3.11-based web scraper
- MongoDB: Document storage for structured data
- Qdrant: Vector database for semantic search
- Rotating Proxy: TOR-based proxy rotation system
- MONGO_URI: MongoDB connection string
- QDRANT_HOST: Qdrant service hostname
- QDRANT_PORT: Qdrant service port
- TOR_PROXY: TOR proxy connection string
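As a rough sketch of how these four variables might be read at startup, the snippet below validates that each one is set and converts the port to an integer. `ServiceSettings` and `load_settings` are hypothetical names, and the example values are placeholders rather than the project's actual defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("MONGO_URI", "QDRANT_HOST", "QDRANT_PORT", "TOR_PROXY")


@dataclass(frozen=True)
class ServiceSettings:
    mongo_uri: str
    qdrant_host: str
    qdrant_port: int
    tor_proxy: str


def load_settings(env: Mapping[str, str] = os.environ) -> ServiceSettings:
    """Read the four variables, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return ServiceSettings(
        mongo_uri=env["MONGO_URI"],
        qdrant_host=env["QDRANT_HOST"],
        qdrant_port=int(env["QDRANT_PORT"]),
        tor_proxy=env["TOR_PROXY"],
    )


# Placeholder values for illustration only; real values depend on your deployment.
example_env = {
    "MONGO_URI": "mongodb://mongodb:27017",
    "QDRANT_HOST": "qdrant",
    "QDRANT_PORT": "6333",
    "TOR_PROXY": "socks5://tor:9050",
}
settings = load_settings(example_env)
```

Failing at startup with the names of the missing variables tends to be easier to debug than a connection error later in the run.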
- Start the services:
    docker-compose up -d
- Monitor the logs:
    docker-compose logs -f scraper

python src/main.py

Run tests with pytest:

pytest tests/

- Scraper Service: 2GB RAM, 2 CPU cores
- MongoDB: Uses default resource allocation
- Qdrant: Uses default resource allocation
- TOR Proxy: Uses default resource allocation
- Custom DNS servers: 8.8.8.8, 8.8.4.4
- Bridge network mode for service isolation
- TOR proxy with 5 instances and 60-second rotation
- Logs are stored in the logs directory
- PDFs are stored in the data/raw directory
- Extracted data is stored in the data/processed directory
- Cache is stored in the cache directory
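Since a missing directory is one of the startup failures listed in this document, a small pre-flight check can help. The sketch below assumes the four storage locations named above live under a single project root; `ensure_directories` is a hypothetical helper, not part of the package.

```python
from pathlib import Path

# Storage locations named in this README, relative to the project root.
REQUIRED_DIRS = ("logs", "data/raw", "data/processed", "cache")


def ensure_directories(root: Path, create: bool = True) -> list[Path]:
    """Return the required directories missing under `root`.

    When `create` is true, each missing directory is also created.
    """
    missing = []
    for rel in REQUIRED_DIRS:
        path = root / rel
        if not path.is_dir():
            missing.append(path)
            if create:
                path.mkdir(parents=True, exist_ok=True)
    return missing
```

Calling `ensure_directories(Path("."), create=False)` from the project root would report missing directories without changing anything on disk.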
- If the scraper service fails to start:
  - Check if all required directories exist
  - Verify MongoDB and Qdrant services are running (if using Docker)
  - Check the logs for specific error messages
- If proxy rotation isn't working:
  - Verify TOR service is running
  - Check TOR logs for connection issues
  - Ensure proxy settings are correctly configured
- For memory issues:
  - Monitor container resource usage
  - Adjust memory limits in docker-compose.yml
  - Clear cache directory if needed
- Create a new branch for your feature
- Write tests for new functionality
- Submit a pull request
MIT License
Copyright (c) 2024 AojdevStudio
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The authentication configuration system provides a secure and flexible way to manage authentication credentials and handle different authentication methods for web scraping tasks.
- Secure credential storage using system keyring and encrypted files
- Support for multiple authentication methods:
  - Basic Authentication
  - Form-based Authentication
  - Token-based Authentication
- Configuration validation using Pydantic
- Utility functions for handling CSRF tokens, form fields, and response validation
- Comprehensive test coverage
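To give a sense of what a CSRF utility of the kind mentioned above might do, here is a minimal standard-library sketch that pulls hidden form fields out of a login page. `HiddenInputParser`, `extract_csrf_token`, and the default field name `csrf_token` are assumptions for illustration; the project's own utilities may use different names and a different parser.

```python
from html.parser import HTMLParser
from typing import Optional


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden_fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.hidden_fields[attr["name"]] = attr.get("value") or ""


def extract_csrf_token(html: str, field_name: str = "csrf_token") -> Optional[str]:
    """Return the value of the hidden field `field_name`, or None if absent."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields.get(field_name)
```

The extracted value would typically be merged into the form data before the login request is submitted.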
The authentication system requires additional dependencies. Install them using:
pip install -r requirements.txt

from auth import CredentialManager, AuthConfigManager, create_auth_handler

# Initialize managers
credential_manager = CredentialManager("my_service")
config_manager = AuthConfigManager(credential_manager)

# Add configuration
config = config_manager.add_config("my_service", {
    "auth_type": "basic",
    "service_name": "my_service",
    "base_url": "https://example.com",
    "auth_url": "https://example.com/auth"
})

# Create auth handler
auth_handler = create_auth_handler("basic", credential_manager)

# Store credentials
credential_manager.store_credential("username", "my_username")
credential_manager.store_credential("password", "my_password")

# Authenticate
success = auth_handler.authenticate(
    username=credential_manager.get_credential("username"),
    password=credential_manager.get_credential("password"),
    url=config.auth_url
)

# Add form auth configuration
config = config_manager.add_config("my_service", {
    "auth_type": "form",
    "service_name": "my_service",
    "base_url": "https://example.com",
    "form_url": "https://example.com/login",
    "form_data": {
        "remember_me": "true"
    }
})

# Create form auth handler
auth_handler = create_auth_handler("form", credential_manager)

# Authenticate with form
success = auth_handler.authenticate(
    username=credential_manager.get_credential("username"),
    password=credential_manager.get_credential("password"),
    form_url=config.form_url,
    form_data=config.form_data
)

# Add token auth configuration
config = config_manager.add_config("my_service", {
    "auth_type": "token",
    "service_name": "my_service",
    "base_url": "https://example.com",
    "token_url": "https://example.com/token",
    "token_field": "access_token",
    "expiry_field": "expires_in"
})

# Create token auth handler
auth_handler = create_auth_handler("token", credential_manager)

# Authenticate and get token
success = auth_handler.authenticate(
    username=credential_manager.get_credential("username"),
    password=credential_manager.get_credential("password"),
    token_url=config.token_url
)

- Credentials are stored securely using the system keyring when available
- Fallback to encrypted file storage when keyring is not available
- Sensitive data is never logged or stored in plain text
- SSL certificate verification is enabled by default
- CSRF tokens are automatically handled for form authentication
- Token expiration is tracked and handled appropriately
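The token expiration point above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `TokenStore` is a hypothetical class, the field names mirror the `token_field` and `expiry_field` values from the token configuration example, and the expiry value is assumed to be a lifetime in seconds.

```python
import time
from typing import Callable, Mapping, Optional


class TokenStore:
    """Hold a token and report it as unavailable shortly before it expires."""

    def __init__(
        self,
        token_field: str = "access_token",
        expiry_field: str = "expires_in",
        clock: Callable[[], float] = time.time,
        leeway: float = 30.0,
    ) -> None:
        self.token_field = token_field
        self.expiry_field = expiry_field
        self.leeway = leeway  # refresh this many seconds before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def update(self, payload: Mapping[str, object]) -> None:
        """Record the token and compute its absolute expiry time."""
        self._token = str(payload[self.token_field])
        # Assumption: the expiry field is a lifetime in seconds.
        self._expires_at = self._clock() + float(payload[self.expiry_field])

    def get(self) -> Optional[str]:
        """Return the token, or None if it is missing or about to expire."""
        if self._token is None:
            return None
        if self._clock() >= self._expires_at - self.leeway:
            return None
        return self._token
```

A caller would re-authenticate whenever `get()` returns `None`, then pass the new response payload to `update()`.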
Run the test suite:
pytest tests/auth/