The Pivotal Greenplum Database (GPDB) is a fully featured, open source data warehouse. It provides fast analytics on petabyte-scale data volumes. Geared toward big data analytics, Greenplum Database uses a cost-based query optimizer to deliver high analytical query performance on large data volumes. See https://pivotal.io/pivotal-greenplum
StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data continually arrives on time and with quality, empowering business-critical analysis and decision-making. See https://streamsets.com/
- Loading data from StreamSets data generator into Greenplum
- Streaming data from Kafka into Greenplum
- Loading data from Hadoop into Greenplum
This example uses the StreamSets data generator to produce random data and a JDBC Producer to write that data into Greenplum concurrently.
The purpose of this use case is to demonstrate how to use the StreamSets ETL solution to load large data sets into a Greenplum database. For more details, see this README.MD
The example below shows the number of records processed and the number of records inserted per second while the Dev Data Generator produces data that the JDBC Producer inserts into Greenplum.
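The generate-then-write pattern described here can be sketched outside StreamSets in plain Python. This is only an illustration under stated assumptions, not how the JDBC Producer is implemented: the table name `demo_events`, its columns, and the helper names are hypothetical, and the `%s` placeholders assume a psycopg2-style driver.

```python
import random
import string
from concurrent.futures import ThreadPoolExecutor


def generate_records(n, seed=0):
    """Generate n random (id, name, value) tuples, similar in spirit to a data generator."""
    rng = random.Random(seed)
    return [
        (i, "".join(rng.choices(string.ascii_lowercase, k=8)), rng.randint(0, 1000))
        for i in range(n)
    ]


def chunked(records, size):
    """Split the records into batches of at most `size` rows."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def build_insert(table, columns, batch):
    """Build one multi-row parameterized INSERT for a batch.

    `table` and `columns` are interpolated directly, so they must come from
    trusted configuration. The %s placeholder style is an assumption.
    """
    row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row] * len(batch))
    )
    params = [value for record in batch for value in record]
    return sql, params


def write_concurrently(batches, write_batch, workers=4):
    """Send batches through `write_batch` on a thread pool; return total rows written.

    `write_batch` is a caller-supplied function that executes the INSERT on its
    own connection and returns the number of rows it wrote.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(write_batch, batches))
```

In a real loader, `write_batch` would open (or borrow from a pool) one database connection per worker, call `build_insert` for its batch, and execute the statement. Keeping one connection per worker is what lets the writes proceed concurrently.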
This example uses the StreamSets data generator to produce random data and store it in Kafka. A second step then loads the data from Kafka into Greenplum.
The purpose of this use case is to demonstrate how to use the StreamSets ETL solution to load large data sets from Kafka into a Greenplum database. For more details, see this README.MD
The example below shows the number of records processed and the number of records inserted per second while a Kafka consumer reads the data and inserts it into GPDB via JDBC.
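The consume-and-load step can be sketched as a small batching loop. This is a minimal illustration, assuming JSON-encoded messages: `messages` stands in for whatever a Kafka consumer yields, and `sink` is a hypothetical callable that would perform the actual insert into Greenplum.

```python
import json
from typing import Callable, Iterable, List


def load_in_batches(
    messages: Iterable[bytes],
    sink: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Decode JSON messages and hand them to `sink` in batches.

    Returns the total number of records passed to the sink.
    """
    batch: List[dict] = []
    total = 0
    for raw in messages:
        batch.append(json.loads(raw))
        if len(batch) >= batch_size:
            sink(batch)
            total += len(batch)
            batch = []  # start a fresh list so the flushed batch is not mutated
    if batch:
        # Flush the final, partially filled batch.
        sink(batch)
        total += len(batch)
    return total
```

Batching matters here because row-by-row inserts tend to be slow in an MPP database such as Greenplum. A larger `batch_size` trades latency for throughput.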
Loading data from Hadoop into Greenplum: to be added later.
Alternative solution: you can use a Spark ETL solution to load data from multiple sources, including Kafka, S3, and others. With the Greenplum-Spark connector, you can parallelize the data transfer from a Spark cluster to a Greenplum cluster.
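As a rough sketch of the Spark route, the snippet below builds an option map and writes a DataFrame through the connector. The `greenplum` format name and the option keys (`url`, `dbschema`, `dbtable`, `user`, `password`) reflect the connector documentation as best understood here and may differ between connector versions, so treat them as assumptions to verify. The helper names, host, and table are hypothetical.

```python
def greenplum_options(host, port, database, schema, table, user, password):
    """Build the option map passed to the Greenplum-Spark connector.

    Option keys are assumptions based on the connector documentation;
    check them against the connector version you deploy.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbschema": schema,
        "dbtable": table,
        "user": user,
        "password": password,
    }


def write_dataframe(df, options, mode="append"):
    """Write a Spark DataFrame to Greenplum using the connector's data source.

    `df` is expected to be a pyspark.sql.DataFrame; the connector JAR must be
    on the Spark classpath for the "greenplum" format to resolve.
    """
    df.write.format("greenplum").options(**options).mode(mode).save()
```

A call would look like `write_dataframe(df, greenplum_options("gpdb-master", 5432, "testdb", "public", "events", "gpadmin", "secret"))`, where every value shown is a placeholder.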
Enhancement to StreamSets to use Greenplum native loaders. See also: Greenplum - ETL, Greenplum-github