This is a portable, reproducible, and completely automated application designed to benchmark various database platforms and big data technologies, with the goal of comparing how each performs when working with large datasets. By inserting the same dataset into each database platform and executing syntactically equivalent queries, we can make an accurate comparison and understand where one platform outperforms another. An interactive HTML dashboard containing various charting and graphing elements is used to display the benchmarking results for visual analysis.
When it comes to Big Data and the platforms that house the data, the most common question being asked is:
"Which big data platform performs best?"
The answer really depends on a few key factors:
- Raw data size (small data vs. big data)
- Query complexity (joins, unions, aggregate and analytical queries)
- Number of concurrent users
- Memory, CPU, and network configurations
And since each environment is different, the more appropriate question to ask is:
"Which performs best under each scenario?"
If your organization is running multiple Big Data platforms, this benchmarking application is designed to compare them side by side.
- Supports the following big data technologies:
- Oracle Database
- Oracle Database In-Memory
- Microsoft SQL Server
- SAP HANA
- coming soon:
- Apache Hive
- Apache Spark SQL
- SQLite (use pysqlite)
- MySQL (use mysqldb or mysql.connector or pymysql)
- MariaDB (use mysqldb or mysql.connector or pymysql)
- PostgreSQL (use psycopg2)
- Microsoft Azure
- Compatible with Linux, macOS, and Windows in a reproducible, portable development environment
- Able to accept multiple source datasets in CSV file format (containing headers)
- Create tables and insert the specified datasets into the database
- Run benchmarks on these new datasets or on existing tables in the database
- Able to specify the name of the tables to benchmark
- Able to specify the number of benchmarking iterations to perform
- Able to specify the number of concurrent users to simultaneously perform the same benchmarking queries
- Predefined query templates are used to dynamically generate syntactically equivalent queries that are transferable across each database technology (see the sketch after this list)
- Limit query results to a specified row count
- Drop tables (only the newly created tables) immediately after the benchmark
- Output benchmarking results to CSV
- Plot the results in an interactive HTML dashboard containing data tables, charts and graphs using a Bokeh server web app
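To illustrate how a template expands into a concrete query, here is a minimal sketch in Python. The `render_query` helper is hypothetical, not the project's actual API, but the `{table}` and `{numeric_column}` placeholders match the `query_template` column shown in the sample results further below.

```python
# Hypothetical sketch of query-template expansion; render_query is
# illustrative, not the project's actual API.
def render_query(template, table, numeric_column=None):
    """Fill a query template with quoted identifiers."""
    return template.format(
        table=f'"{table}"',
        numeric_column=f'"{numeric_column}"' if numeric_column else "",
    )

template = "SELECT COUNT(DISTINCT {numeric_column}) FROM {table}"
print(render_query(template, "On_Time_Performance_ALL", "DEST_CITY_MARKET_ID"))
# -> SELECT COUNT(DISTINCT "DEST_CITY_MARKET_ID") FROM "On_Time_Performance_ALL"
```

Because only the template text varies per dialect (when it varies at all), the same workload can be replayed against every configured database.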
- Vagrant - A tool for building complete development environments
- VirtualBox - A cross-platform virtualization application
- Read/Write permissions to one or more of the following database technologies:
- Oracle Database
- Oracle Database In-Memory
- Microsoft SQL Server
- SAP HANA
- more coming soon...
- One or more CSV datasets (containing headers). Recommended sites to find open datasets:
- data.gov - The home of the U.S. Government's open data
- /r/datasets - subreddit with hundreds of interesting datasets
- Awesome Public Datasets - list of datasets, hosted on GitHub.
- Kaggle - Kaggle datasets
Download and install the latest versions of Vagrant and VirtualBox.
Install the Vagrant plugin which automatically installs the host's VirtualBox Guest Additions on the guest system
vagrant plugin install vagrant-vbguest
Clone the repo and navigate into the project directory
git clone https://github.com/justinnaldzin/big-data-benchmarking.git
cd big-data-benchmarking
The folder structure looks like the following:
$ tree -L 1
.
├── LICENSE.md
├── README.md
├── app
├── benchmark.py
├── big_data_benchmarking.py
├── config.json.example
├── create_tables.py
├── csv
├── data
├── drop_tables.py
├── log
├── queries
└── vagrant
6 directories, 7 files
Move all source CSV datasets into the big-data-benchmarking/data/ path
mv /path/to/datasets/*.csv data/
Using the example configuration file, create a JSON file named config.json and add database credentials to the connection_string parameter
cp config.json.example config.json
vi config.json
{
"Oracle Database":{
"connection_string":"oracle+cx_oracle://user:password@host"
...
},
"Oracle Database In-Memory":{
"connection_string":"oracle+cx_oracle://user:password@host"
...
},
"SQL Server":{
"connection_string":"mssql+pymssql://user:password@host"
...
},
"HANA":{
"connection_string":"hana://user:password@host:30015"
...
},
"Hive":{
"connection_string":"hive://user:password@host:10000/database"
...
}
}
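The connection_string values are standard SQLAlchemy database URLs (oracle+cx_oracle, mssql+pymssql, and so on are SQLAlchemy dialect names). As a minimal sketch of how such a config file can be consumed, assuming the layout above; the test query is illustrative:

```python
# Minimal sketch: load config.json and open a connection with SQLAlchemy.
# The "Oracle Database" entry and the test query are illustrative.
import json

from sqlalchemy import create_engine, text

with open("config.json") as f:
    config = json.load(f)

engine = create_engine(config["Oracle Database"]["connection_string"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT COUNT(*) FROM user_tables")).scalar())
```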
Build the environment in a virtual machine
cd vagrant
vagrant up
After provisioning, SSH into the VM
vagrant ssh
Navigate to the project directory on the VM
cd /big-data-benchmarking
Print the command line help menu to view all the options for running the application
$ ./big_data_benchmarking.py --help
usage: big_data_benchmarking.py [-h] [-t TABLE_LIKE] [-r ROWS] [-i ITERATIONS]
[-u CONCURRENT_USERS] [-p DATA_PATH] [-c] [-d]
[database_list [database_list ...]]
Big Data Benchmarking
positional arguments:
database_list Specify the list of Databases to benchmark. These
names must match the names pre-configured in the
'config.json' file.
optional arguments:
-h, --help show this help message and exit
-t TABLE_LIKE, --table-like TABLE_LIKE
Specify the name of the tables to benchmark. This uses
the SQL 'LIKE' operator to search a specified pattern
so use the '%' sign to define wildcards. Default is
'%' which will find all existing tables in the
database.
-r ROWS, --rows ROWS The maximum number of rows to return from each query
execution. Default is 10000
-i ITERATIONS, --iterations ITERATIONS
The number of benchmark iterations to perform on the
database. Default is 1
-u CONCURRENT_USERS, --users CONCURRENT_USERS
The number of concurrent users to connect to the
database. Default is 1
-p DATA_PATH, --path DATA_PATH
Full directory path to where the data files are
stored. These will be used to create the tables and
insert into database. Default path is: /data
-c, --create-tables Create tables and insert into database the data files
that exist within the folder specified by the '--path' argument. Not
specifying this option will run benchmarks on all
existing tables in the database.
-d, --drop-tables The '--create-tables' argument must be specified. Only
those tables created will be dropped.
Note! If you simply want to view the interactive dashboard using sample benchmarking results, skip this section and jump down to the Benchmarking Results section.
The big_data_benchmarking.py Python script is the main entry point to the application. Here are some example benchmarking scenarios, and how to execute the script:
- Run against Oracle Database and SQL Server with default options
./big_data_benchmarking.py "Oracle Database" "SQL Server"
- Run on Oracle Database In-Memory against all tables whose names start with "DEV"
./big_data_benchmarking.py "Oracle Database In-Memory" -t "DEV%"
- Run against HANA, limiting the number of rows to return from each query to 5000
./big_data_benchmarking.py "HANA" -r 5000
- Run with 50 concurrent users
./big_data_benchmarking.py "Oracle Database" -u 50
- Run 10 iterations
./big_data_benchmarking.py "Oracle Database" -i 10
- Create tables on the database using all CSV datasets in the default /big-data-benchmarking/data/ path
./big_data_benchmarking.py "Oracle Database" -c
- Create tables, specifying a different directory than the default path where the CSV datasets reside
./big_data_benchmarking.py "Oracle Database" -c -p /some/other/path/
- Create tables and drop only those tables after the benchmark completes
./big_data_benchmarking.py "Oracle Database" -c -d
The /big-data-benchmarking/csv/ directory is where the actual benchmarking results are written. Within this folder you will also find an example benchmarking results CSV file. This example allows full functionality of the Bokeh server web application even without running any benchmarks.
Here is a sample of what the CSV data looks like:
category | concurrency_factor | database | name | query_executed | query_id | query_template | rows | table_name | table_row_count | table_size_category | time |
---|---|---|---|---|---|---|---|---|---|---|---|
Aggregate | 1 | Oracle Database | COUNT * | SELECT COUNT(*) FROM "On_Time_Performance_ALL" | 1 | SELECT COUNT(*) FROM {table} | 1 | On_Time_Performance_ALL | 160690563 | X-Large | 149.38998920100005 |
Aggregate | 1 | Oracle Database | COUNT | SELECT COUNT("LATE_AIRCRAFT_DELAY") FROM "On_Time_Performance_ALL" | 2 | SELECT COUNT({numeric_column}) FROM {table} | 1 | On_Time_Performance_ALL | 160690563 | X-Large | 144.099790143 |
Aggregate | 1 | Oracle Database | COUNT DISTINCT | SELECT COUNT(DISTINCT "DEST_CITY_MARKET_ID") FROM "On_Time_Performance_ALL" | 3 | SELECT COUNT(DISTINCT {numeric_column}) FROM {table} | 1 | On_Time_Performance_ALL | 160690563 | X-Large | 156.50899973100002 |
Aggregate | 1 | Oracle Database | SUM | SELECT SUM("WHEELS_OFF") FROM "On_Time_Performance_ALL" | 4 | SELECT SUM({numeric_column}) FROM {table} | 1 | On_Time_Performance_ALL | 160690563 | X-Large | 156.93902191200004 |
Aggregate | 1 | Oracle Database | SUM DISTINCT | SELECT SUM(DISTINCT "LATE_AIRCRAFT_DELAY") FROM "On_Time_Performance_ALL" | 5 | SELECT SUM(DISTINCT {numeric_column}) FROM {table} | 1 | On_Time_Performance_ALL | 160690563 | X-Large | 149.11920515500003 |
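Because the results are plain CSV, they can also be analyzed outside the dashboard. For example, a quick pandas summary (the file name below is illustrative; use whichever results file is in the csv/ folder):

```python
# Quick ad-hoc look at a results file with pandas.
import pandas as pd

df = pd.read_csv("csv/results.csv")  # illustrative file name

# Mean query time per database and query category
summary = df.groupby(["database", "category"])["time"].mean().unstack()
print(summary)
```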
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. The Bokeh server web app uses the data from the CSV and generates an interactive HTML dashboard consisting of data tables, charts and graphs. This allows us to visually analyze the benchmarking results and compare each database platform.
Instruct the Bokeh server to launch the web app
bokeh serve app
Then open your browser and navigate to the dashboard: http://localhost:5006
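For reference, a Bokeh directory app such as app requires a main.py that attaches its layout to curdoc(). A minimal sketch, not the project's actual dashboard code, might look like:

```python
# app/main.py -- minimal Bokeh server app sketch, NOT the project's actual
# dashboard code. Run with: bokeh serve app
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import curdoc, figure

df = pd.read_csv("csv/results.csv")  # illustrative file name
mean_times = df.groupby("database", as_index=False)["time"].mean()
source = ColumnDataSource(mean_times)

fig = figure(x_range=list(mean_times["database"]),
             title="Mean query time by database", y_axis_label="time")
fig.vbar(x="database", top="time", width=0.8, source=source)

curdoc().add_root(fig)
```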
Finally, when you are finished running the benchmarks and no longer need to view the dashboard results, terminate the VM
vagrant destroy
Note! The big-data-benchmarking project folder remains on your local system, along with the CSV benchmarking results. You will no longer be able to run the benchmarks or view the dashboard once the VM is terminated. You can easily reprovision the VM and start over
vagrant up
- Option to specify the minimum number of rows to return from each query. Currently you can only limit the maximum number of rows
- Benchmark different Python drivers against the same database type (MySQL example: mysqldb vs. mysql.connector vs. pymysql); see the URL sketch after this list
- Embed Bokeh into a Flask app with Jinja templates
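On the driver-comparison idea: with SQLAlchemy, swapping MySQL drivers is only a change of URL scheme, so the comparison could reuse the existing config mechanism. A sketch with placeholder credentials:

```python
# Same MySQL database, three different drivers -- with SQLAlchemy only the
# URL scheme changes. Credentials and host below are placeholders.
from sqlalchemy import create_engine

driver_urls = {
    "mysqldb": "mysql+mysqldb://user:password@host/database",
    "mysql.connector": "mysql+mysqlconnector://user:password@host/database",
    "pymysql": "mysql+pymysql://user:password@host/database",
}

engines = {name: create_engine(url) for name, url in driver_urls.items()}
```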
Justin Naldzin - GitHub