This repo is based on @tomaztk's great work in Benchmarking file formats for cloud Storage.
It provides the `format_benchmark_tool` Python module, which compares the file formats that pandas can read and write:
- .csv
- .json
- .xml
- .xlsx (Excel)
- .pkl (Pickle)
- .h5 (HDF5)
- .feather
- .parquet
- .orc
- .dta (Stata)
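To give a feel for what such a comparison measures, here is a minimal sketch that times one write and one read per format and records the resulting file size. It is not the `format_benchmark_tool` API: the `FORMATS` mapping, the `benchmark` function, and the synthetic `sample` frame are assumptions made up for illustration, and only formats that pandas handles without optional dependencies are included.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical (writer, reader) pairs, limited to formats that need
# no optional dependency such as pyarrow, openpyxl or PyTables.
FORMATS = {
    ".csv": (lambda df, path: df.to_csv(path, index=False), pd.read_csv),
    ".json": (
        lambda df, path: df.to_json(path, orient="records"),
        lambda path: pd.read_json(path, orient="records"),
    ),
    ".pkl": (lambda df, path: df.to_pickle(path), pd.read_pickle),
}


def benchmark(df, formats=FORMATS):
    """Time a single write and read per format and record the file size."""
    rows = []
    with tempfile.TemporaryDirectory() as tmp:
        for ext, (write, read) in formats.items():
            path = os.path.join(tmp, "sample" + ext)

            start = time.perf_counter()
            write(df, path)
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            loaded = read(path)
            read_s = time.perf_counter() - start

            rows.append({
                "format": ext,
                "read_s": read_s,
                "write_s": write_s,
                "size_bytes": os.path.getsize(path),
                "rows_read": len(loaded),
            })
    return pd.DataFrame(rows).set_index("format")


# Synthetic data for illustration only (not the RoboCup recordings).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "x": rng.random(1000),
    "y": rng.integers(0, 100, 1000),
    "label": ["a", "b"] * 500,
})

results = benchmark(sample)
print(results)

# Convert raw measurements to ranks (1 = fastest / smallest).
print(results[["read_s", "write_s", "size_bytes"]].rank(method="min").astype(int))
```

A single timed run is noisy; repeating each measurement and taking the median would give more stable ranks.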
For ease of use, we provide a simple Jupyter Notebook that benchmarks all supported file formats and generates pretty graphs.
Results are based on experiments with multiple datasets from RoboCup 2D Simulation League recordings (≈20 MB of CSV data).
Format | Read time Rank | Write time Rank | File Size Rank | Type | Language Support | Notes |
---|---|---|---|---|---|---|
Pickle | 1 | 1 | 7 | binary | Python | |
Feather | 2 | 2 | 2 | binary | Python, R, Julia, JS | May not be stable |
Parquet | 3 | 4 | 1 | binary | Python, Java, C++, PHP, JS, ... | |
HDF5 | 4 | 3 | 8 | binary | Python, C, C++, Java, ... | |
Orc | 5 | 5 | 3 | binary | Python, Java, C++ | |
Csv | 6 | 7 | 4 | text | UNIVERSAL | |
Stata | 7 | 8 | 6 | binary | Stata, Python (Pandas) | Limited data type support |
Json | 8 | 6 | 9 | text | UNIVERSAL | |
Xml | 9 | 9 | 10 | text | UNIVERSAL | |
Excel | 10 | 10 | 5 | text | UNIVERSAL | |