A benchmarking project to compare the performance and usability of various data processing tools on a single machine:
- Pandas
- DuckDB
- SQLite
- Dask
- Modin
- Vaex
- Ray
- Spark
- cuDF
- Polars
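To make the comparison concrete, here is a minimal sketch of the kind of query each tool will be benchmarked on: the same aggregation expressed in Pandas and in SQLite (via the stdlib `sqlite3` module). The tiny in-memory dataset is illustrative only; the real benchmarks will run against the TPC-H tables.

```python
# Same group-by aggregation in two of the tools under test.
import sqlite3
import pandas as pd

rows = [("A", 10), ("A", 20), ("B", 5)]

# Pandas: in-memory DataFrame group-by.
df = pd.DataFrame(rows, columns=["key", "val"])
pandas_result = df.groupby("key")["val"].sum().to_dict()

# SQLite: the same aggregation expressed in SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (key TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
sqlite_result = dict(con.execute("SELECT key, SUM(val) FROM t GROUP BY key"))
con.close()

# Both tools should agree on the answer.
assert pandas_result == sqlite_result == {"A": 30, "B": 5}
```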
Benchmarking is filled with pitfalls, so we will try to mitigate them by following the recommendations from this paper:
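Independent of the paper's exact recommendations, the measurement harness will follow standard benchmarking practice: untimed warm-up runs to amortize caching and lazy-initialization effects, repeated timed runs, and summary statistics rather than a single measurement. A hedged sketch (the function name and repeat counts are illustrative, not the final design):

```python
# Illustrative timing harness: warm-up runs, repeated measurements,
# and median/min summaries to reduce the impact of outliers.
import statistics
import time

def benchmark(fn, *, warmup=2, repeats=5):
    """Run fn() `warmup` untimed times, then time it `repeats` times."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"median": statistics.median(timings), "min": min(timings)}

result = benchmark(lambda: sum(range(100_000)))
# The fastest run can never exceed the median run.
assert result["min"] <= result["median"]
```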
After creating and activating a virtual environment for the project, run:
pip install -r requirements.txt
Run the bash script, which first generates the TPC-H data tables in CSV format at scale factor 1 (roughly 1 GB) and then converts them to Parquet files:
./populate_data.sh 1
This will create a folder named tables_scale_1 in the root folder of the project.
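Before running any benchmarks it is worth confirming the generation step succeeded. A minimal sanity-check sketch: TPC-H defines eight tables, and the check below assumes the script names the output files after them (the exact filenames produced by populate_data.sh are an assumption, not confirmed).

```python
# Hypothetical sanity check for the generated TPC-H data files.
from pathlib import Path

# The eight tables defined by the TPC-H schema.
TPCH_TABLES = ["customer", "lineitem", "nation", "orders",
               "part", "partsupp", "region", "supplier"]

def check_tables(root="tables_scale_1", ext=".parquet"):
    """Return the list of expected table files missing under `root`."""
    base = Path(root)
    return [t for t in TPCH_TABLES if not (base / f"{t}{ext}").exists()]

missing = check_tables()
# An empty list means all eight tables were found.
```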