CENACE web scraper for obtaining CSV files (Mexico). These files are processed and uploaded to a SQL server where the data can be analyzed.
Site: CENACE
- Files are downloaded by the PythonWebScrapper.py script.
- Files are read by the DailyScripts from the CSV directory, then they are inserted into the SQL DB.
Pandas_PML_Daily.py and Pandas_PND_Daily.py: Parse the information downloaded from CENACE and insert it into the SQL database every day.
Pandas_PML_Monthly.py and Pandas_PND_Monthly.py: Insert historical data from CENACE into the SQL database.
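As a rough illustration of what a daily load step might look like, the sketch below reads every CSV file in a directory with pandas and appends the rows to a SQL table. It is not the code from the scripts above: the function name `load_csv_dir`, the column names (`hour`, `node`, `price`), the table name `pml`, and the in-memory SQLite database standing in for the SQL server are all assumptions made for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_dir(csv_dir, conn, table):
    """Append every CSV file found in csv_dir to `table`; return the row count."""
    total = 0
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # if_exists="append" creates the table on first use, then adds rows.
        frame.to_sql(table, conn, if_exists="append", index=False)
        total += len(frame)
    return total


# Demo with made-up data (hypothetical columns, not the real CENACE layout).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "day1.csv").write_text("hour,node,price\n1,A,10.5\n2,A,11.0\n")
    Path(tmp, "day2.csv").write_text("hour,node,price\n1,A,9.75\n")

    conn = sqlite3.connect(":memory:")  # stand-in for the real SQL server
    inserted = load_csv_dir(tmp, conn, "pml")
    stored = conn.execute("SELECT COUNT(*) FROM pml").fetchone()[0]
    print(inserted, stored)
```

For a real SQL server, `conn` would typically be a SQLAlchemy engine or connection instead of a `sqlite3` connection.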
To use the tool, it is necessary to download and install:
Note:
- To install Geckodriver on Windows, it is necessary to add geckodriver.exe to the system's PATH
- Enter CENACE and download files to the specified directory
- Validate downloads
- Use the Pandas lib to parse CSV files in the specified directory
- Create daily condensed CSV files for PML and PND
- Validate data integrity as dataframe
- Local DB Connection
- Local DB INSERT and SELECT
- PML monthly script
- PND monthly script
- PML daily script
- PND daily script
- Azure DB connection
- Azure data upload
- Initialize DB with past information using monthly scripts
- Run every 24 hrs
- Performance
Current code performance with large amounts of data is slow: about 165,000 inserts per minute on a system with:
- AMD A8-5550M APU, 2.10 GHz
  - CPU load while inserting historical data: ~40%-55%
  - Working laptop with common programs and bloatware running (Lotus, IBM, Citrix, McAfee, Atom Editor, Spyder, SQL Server Management Studio, etc.)
- 8 GB RAM
  - Memory usage while inserting historical data: ~35%-45%
  - Running SQL Server Windows NT 64-bit, Python and Spyder as the main processes. The Python process tends to consume varying amounts of memory because the file sizes vary.
- 64-bit Windows 8 OS
Site: Tests
Based on the code logic, the performance bottleneck is the to_sql function in the pandas library.
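One option that may be worth trying for the to_sql bottleneck is batching rows, using the `chunksize` and `method="multi"` parameters of `DataFrame.to_sql`. The sketch below only shows the call shape against an in-memory SQLite database; the table name, column names, and chunk size are assumptions, and whether it actually speeds things up depends on the database and driver. On SQL Server a single statement is limited to 2100 parameters, so `chunksize` times the number of columns has to stay below that. With pyodbc, SQLAlchemy's `fast_executemany=True` engine option is another commonly used route.

```python
import sqlite3

import pandas as pd

# Made-up frame standing in for one parsed CENACE file (hypothetical columns).
frame = pd.DataFrame(
    {
        "hour": range(1, 1001),
        "node": ["A"] * 1000,
        "price": [10.0] * 1000,
    }
)

conn = sqlite3.connect(":memory:")  # stand-in for the real SQL server

# method="multi" packs several rows into one INSERT statement;
# chunksize caps how many rows go into each statement
# (200 rows x 3 columns = 600 bound parameters per statement).
frame.to_sql(
    "pml",
    conn,
    if_exists="append",
    index=False,
    chunksize=200,
    method="multi",
)

count = conn.execute("SELECT COUNT(*) FROM pml").fetchone()[0]
print(count)
```

Timing the same load with and without these parameters on the target server is the only reliable way to tell whether the change helps.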
- PML
  - MDA
  - MTR
- PND
  - MDA
  - MTR
Don't forget to change the download paths!!
Example from the file Webdriver_Downloader:
profile.set_preference("browser.download.dir", r"C:\Users\e-jlfloresg\Desktop\Python-Requests-CENACE\SELENIUM\test downloads\PML\MTR")
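Since there is one download folder per PML/PND and MDA/MTR combination, it can help to build the Firefox download preferences from a single base directory instead of editing each path by hand. The sketch below is only an illustration: the helper `download_prefs` and the base directory `C:\data\cenace` are hypothetical, and it uses `PureWindowsPath` so the Windows-style path is built the same way on any operating system. The preference names are standard Firefox preferences.

```python
from pathlib import PureWindowsPath


def download_prefs(base_dir, price_type, market):
    """Return Firefox download preferences for one price type and market folder."""
    target = PureWindowsPath(base_dir) / price_type / market
    return {
        # 2 tells Firefox to use the custom directory given below.
        "browser.download.folderList": 2,
        "browser.download.dir": str(target),
        # Save CSV files without showing the download dialog.
        "browser.helperApps.neverAsk.saveToDisk": "text/csv",
    }


# Hypothetical base directory; replace it with your own download path.
prefs = download_prefs(r"C:\data\cenace", "PML", "MTR")
for name, value in prefs.items():
    # With Selenium, each pair would be applied as:
    #     profile.set_preference(name, value)
    print(name, "=", value)
```

Using a raw string (`r"..."`) for the base directory avoids backslash sequences such as `\t` being read as escape characters.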