A benchmarking project to compare the performance and usability of various data processing tools on a single machine:
- Pandas
- DuckDB
- SQLite
- Dask
- Modin
- Vaex
- Ray
- Spark
- cuDF
- Polars
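To make the comparison concrete, here is a minimal sketch of the kind of query each tool will be benchmarked on: the same aggregation expressed in Pandas and in SQLite (via the stdlib `sqlite3` module). The tiny in-memory dataset is illustrative only; the real benchmarks will run against the TPC-H tables.

```python
# Same group-by aggregation in two of the tools under test.
import sqlite3
import pandas as pd

rows = [("A", 10), ("A", 20), ("B", 5)]

# Pandas: in-memory DataFrame group-by.
df = pd.DataFrame(rows, columns=["key", "val"])
pandas_result = df.groupby("key")["val"].sum().to_dict()

# SQLite: the same aggregation expressed in SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (key TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
sqlite_result = dict(con.execute("SELECT key, SUM(val) FROM t GROUP BY key"))
con.close()

# Both tools should agree on the answer.
assert pandas_result == sqlite_result == {"A": 30, "B": 5}
```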
Benchmarking is filled with pitfalls, so we will try to mitigate them by following the recommendations from this paper:
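Independent of the paper's exact recommendations, the measurement harness will follow standard benchmarking practice: untimed warm-up runs to amortize caching and lazy-initialization effects, repeated timed runs, and summary statistics rather than a single measurement. A hedged sketch (the function name and repeat counts are illustrative, not the final design):

```python
# Illustrative timing harness: warm-up runs, repeated measurements,
# and median/min summaries to reduce the impact of outliers.
import statistics
import time

def benchmark(fn, *, warmup=2, repeats=5):
    """Run fn() `warmup` untimed times, then time it `repeats` times."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"median": statistics.median(timings), "min": min(timings)}

result = benchmark(lambda: sum(range(100_000)))
# The fastest run can never exceed the median run.
assert result["min"] <= result["median"]
```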
After creating and activating a virtual environment for the project, run:
pip install -r requirements.txt
Run the bash script, which first generates the TPC-H data tables in CSV format at scale factor 1 (roughly 1 GB) and then converts them to Parquet files:
./populate_data.sh 1
This will create a folder named tables_scale_1 in the root folder of the project.
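Before running any benchmarks it is worth confirming the generation step succeeded. A minimal sanity-check sketch: TPC-H defines eight tables, and the check below assumes the script names the output files after them (the exact filenames produced by populate_data.sh are an assumption, not confirmed).

```python
# Hypothetical sanity check for the generated TPC-H data files.
from pathlib import Path

# The eight tables defined by the TPC-H schema.
TPCH_TABLES = ["customer", "lineitem", "nation", "orders",
               "part", "partsupp", "region", "supplier"]

def check_tables(root="tables_scale_1", ext=".parquet"):
    """Return the list of expected table files missing under `root`."""
    base = Path(root)
    return [t for t in TPCH_TABLES if not (base / f"{t}{ext}").exists()]

missing = check_tables()
# An empty list means all eight tables were found.
```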