A gentle guide to using the WorldCup98 dataset.
Note: Rebuilding the Nginx log and converting it to CSV format takes a long time. The processed results from the Group Request by Minute step onward are provided in the repo. Feel free to use them.
The original dataset description is available at https://ita.ee.lbl.gov/html/contrib/WorldCup.html. The dataset is archived on an FTP server, which modern browsers such as Chrome and Firefox may no longer support for direct download. A practical way is to use a dedicated download tool such as wget or FileZilla:
```
wget -r ftp://ita.ee.lbl.gov/traces/WorldCup/
```
After downloading, reorganize the directory as follows:
- Remove `WorldCup.html`.
- Extract `WorldCup_tools.tar.gz` into its own directory.
- Place all the `wc_day*.gz` data files in a separate directory.
Now the directory tree should look like this:
- ita_public_tools
  - bin
  - ...
- WorldCup
  - wc_day1_1.gz
  - wc_day2_1.gz
  - ...
  - wc_day92_1.gz
Build the ITA tools:

```
cd ita_public_tools && make
```
Run `01_rebuild.py`. The script calls the built tool to rebuild the Nginx log from the `.gz` files. Check the Tools section on the description page for more details. The default output directory is `RecreatedLog`.
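If you are curious what this step boils down to, here is a minimal sketch, assuming `01_rebuild.py` pipes each `wc_day*.gz` file through the `recreate` tool built above (the tool path, directory names, and output naming are assumptions, not the script's exact code):

```python
# Hypothetical sketch: decompress each wc_day*.gz trace and pipe it through
# the ITA "recreate" tool, which writes plain-text log lines to stdout.
import glob
import gzip
import os
import shutil
import subprocess

RECREATE = "ita_public_tools/bin/recreate"  # built by `make` above (assumed path)
INPUT_DIR = "WorldCup"
OUTPUT_DIR = "RecreatedLog"
os.makedirs(OUTPUT_DIR, exist_ok=True)

for gz_path in sorted(glob.glob(os.path.join(INPUT_DIR, "wc_day*.gz"))):
    out_name = os.path.basename(gz_path).replace(".gz", ".log")
    out_path = os.path.join(OUTPUT_DIR, out_name)
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        proc = subprocess.Popen([RECREATE], stdin=subprocess.PIPE, stdout=dst)
        shutil.copyfileobj(src, proc.stdin)  # stream decompressed bytes to the tool
        proc.stdin.close()
        proc.wait()
```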
Run `02_to_csv.py`. The script converts the rebuilt Nginx log into CSV format. Each row looks like `1998-04-30 21:30:17,24736`, representing the timestamp and the transferred data size. Every record in the Nginx log whose request size is `-` is replaced by 1. The default input directory is `RecreatedLog` and the default output directory is `CSVLog`.
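For reference, a minimal sketch of the conversion, assuming the rebuilt log follows the Common Log Format (the regular expression and helper name below are illustrative, not copied from `02_to_csv.py`):

```python
# Convert one Common-Log-Format line into a "timestamp,size" CSV row.
import re
from datetime import datetime
from typing import Optional

LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\S+) (?P<size>\S+)')

def to_csv_row(line: str) -> Optional[str]:
    m = LINE_RE.search(line)
    if m is None:
        return None
    # e.g. "30/Apr/1998:21:30:17 +0000" -> "1998-04-30 21:30:17"
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    raw_size = m.group("size")
    size = 1 if raw_size == "-" else int(raw_size)  # '-' (no body) is counted as 1
    return f"{ts.strftime('%Y-%m-%d %H:%M:%S')},{size}"
```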
Run `03_group_request_by_min.py`. The script further groups the Nginx log into per-minute records. Each row looks like `1998-05-04 22:00:00,543,4234845`, representing the per-minute timestamp, the request number, and the transferred data size. The default input directory is `CSVLog` and the default output directory is `GroupedLog`.
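The grouping itself is a straightforward per-minute aggregation. A minimal sketch using pandas (the file names and the use of pandas are assumptions; the script may do this differently):

```python
# Aggregate the per-request CSV into per-minute request counts and byte totals.
import pandas as pd

df = pd.read_csv("CSVLog/wc_day66_1.csv",           # hypothetical input file
                 names=["timestamp", "size"], parse_dates=["timestamp"])
per_minute = (
    df.set_index("timestamp")["size"]
      .resample("1min")
      .agg(["count", "sum"])                         # requests per minute, bytes per minute
      .reset_index()
)
per_minute.columns = ["timestamp", "request_number", "transferred_size"]
# Rows look like "1998-05-04 22:00:00,543,4234845".
per_minute.to_csv("GroupedLog/wc_day66_1.csv", header=False, index=False)
```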
Run `04_merge.py`. The script merges the grouped requests. The default input directory is `GroupedLog` and the default output directory is `MergedLog`.
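Conceptually, the merge concatenates the per-day grouped files and combines rows that fall in the same minute. A rough sketch, assuming pandas and a single merged output file (both are assumptions):

```python
# Concatenate all grouped per-minute files and sum rows that share a minute,
# since one minute can span two consecutive input files.
import glob
import pandas as pd

frames = [
    pd.read_csv(path,
                names=["timestamp", "request_number", "transferred_size"],
                parse_dates=["timestamp"])
    for path in sorted(glob.glob("GroupedLog/*.csv"))
]
merged = (
    pd.concat(frames)
      .groupby("timestamp", as_index=False)
      .sum()
      .sort_values("timestamp")
)
merged.to_csv("MergedLog/merged.csv", header=False, index=False)  # hypothetical name
```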
Note: The original data processing is now done. The remaining steps remove anomalies and smooth the data. You can stick to the merged data in `MergedLog` if you prefer to shuffle the data yourself.

Note: This step uses data ranging from `1998-06-09T00:00:00Z` to `1998-07-13T23:59:59Z`, when the WorldCup98 event was held.
Run `05_remove_anomaly.ipynb`. The notebook smooths the request number. It finds anomalies and smooths values as follows (a sketch is given below):
- Calculate the absolute value of the first-order difference of the request number.
- Sort the absolute values in descending order and plot them.
- Large, anomalous absolute values make up only a small portion of the whole set, so the plot shows an obvious "elbow".
- Using the elbow value as the threshold, remove the sudden peaks and valleys from the original request-number series.

The default input directory is `MergedLog` and the default output directory is `AnomalyRemovedLog`.
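Here is a minimal sketch of that elbow-based cleaning, assuming the per-minute request numbers are already loaded as a pandas Series. The notebook picks the elbow by inspecting the plot; the quantile below only stands in for that manual choice:

```python
# Drop points whose jump (absolute first-order difference) exceeds the elbow
# threshold, then refill the gaps by linear interpolation.
import pandas as pd

def remove_anomalies(requests: pd.Series, elbow_quantile: float = 0.999) -> pd.Series:
    abs_diff = requests.diff().abs()               # absolute first-order difference
    threshold = abs_diff.quantile(elbow_quantile)  # stand-in for the manually read elbow value
    cleaned = requests.mask(abs_diff > threshold)  # blank out sudden peaks and valleys
    return cleaned.interpolate(method="linear")    # smooth over the blanked points
```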
Some anomalies remain after smoothing and require further fine-tuning, especially around the 22:00 timestamp each day. Manually tagged anomaly indices are stored in `FineTunedLog/anomaly_index.json`. Feel free to edit it if you find more anomalies.
Run `06_fine_tune.ipynb`. The notebook further fine-tunes the smoothed data. The default input directory is `AnomalyRemovedLog` and the default output directory is `FineTunedLog`.
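As an illustration only, the fine-tuning pass might look like the sketch below, assuming `anomaly_index.json` holds a flat list of row indices and that tagged rows are replaced by linear interpolation (both the file structure and the replacement rule are assumptions; check the notebook for the real logic):

```python
# Overwrite the manually tagged rows (e.g. the recurring 22:00 spikes)
# and interpolate across them.
import json
import numpy as np
import pandas as pd

df = pd.read_csv("AnomalyRemovedLog/merged.csv",   # hypothetical file name
                 names=["timestamp", "request_number", "transferred_size"],
                 parse_dates=["timestamp"])

with open("FineTunedLog/anomaly_index.json") as f:
    anomaly_index = json.load(f)                   # assumed: a list of row indices

df.loc[anomaly_index, "request_number"] = np.nan
df["request_number"] = df["request_number"].interpolate(method="linear")
df.to_csv("FineTunedLog/fine_tuned.csv", header=False, index=False)  # hypothetical name
```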