I recently updated the code to use my fork of Arrow. This required considerable rework, so I am currently restoring functionality that previously worked.
- Upgrade to Apache Arrow 0.12.0
- Allow query to be executed against Arrow CSV reader
- Allow query to be executed against Arrow Parquet reader
- Implement projection push-down so that only necessary columns are loaded into memory
- Logical query plan definition
- SQL Parser
- Query planner
- Projection
- Selection
- Simple aggregate queries with optional GROUP BY
- Support for MIN/MAX
- Support for SUM
- Support for COUNT
- Support for COUNT(DISTINCT)
- Support SQL to register data sources
- SQL console and Docker image for standalone use / easy testing and benchmarking
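To illustrate the projection push-down, projection and selection items above, here is a minimal sketch in Python. It is not this project's implementation: the `Scan`, `Selection` and `Projection` classes, the `push_down_projection` function and the in-memory `SOURCES` table are all hypothetical names invented for the example. Python's `csv` module still parses every field of each row, so the sketch only shows how the plan rewrite works; a columnar reader would skip the unused columns entirely.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scan:
    """Leaf node: read a CSV source, optionally restricted to some columns."""
    source: str
    projection: Optional[List[str]] = None  # None means "load every column"


@dataclass
class Selection:
    """Keep rows where `column` equals `value` (a deliberately tiny predicate)."""
    column: str
    value: str
    input: object


@dataclass
class Projection:
    """Keep only the named columns."""
    columns: List[str]
    input: object


def push_down_projection(plan, required=None):
    """Rewrite the plan so that each Scan loads only the columns used above it."""
    if isinstance(plan, Projection):
        # Everything above a projection can only see the projected columns.
        return Projection(plan.columns,
                          push_down_projection(plan.input, set(plan.columns)))
    if isinstance(plan, Selection):
        # The predicate column must also be loaded, even if it is not projected.
        needed = None if required is None else required | {plan.column}
        return Selection(plan.column, plan.value,
                         push_down_projection(plan.input, needed))
    # Scan: record the required columns on the leaf.
    return Scan(plan.source, None if required is None else sorted(required))


def execute(plan, sources):
    """Evaluate a plan lazily, yielding one dict per row."""
    if isinstance(plan, Scan):
        for row in csv.DictReader(io.StringIO(sources[plan.source])):
            if plan.projection is None:
                yield dict(row)
            else:
                yield {c: row[c] for c in plan.projection}
    elif isinstance(plan, Selection):
        for row in execute(plan.input, sources):
            if row[plan.column] == plan.value:
                yield row
    else:  # Projection
        for row in execute(plan.input, sources):
            yield {c: row[c] for c in plan.columns}


# Hypothetical data source, roughly: SELECT name FROM people WHERE city = 'Oslo'
SOURCES = {"people": "id,name,city,age\n1,Ann,Oslo,31\n2,Bob,Rome,45\n3,Cy,Oslo,27\n"}
plan = Projection(["name"], Selection("city", "Oslo", Scan("people")))
optimized = push_down_projection(plan)
rows = list(execute(optimized, SOURCES))
```

After the rewrite, the scan is asked for only `city` and `name`, and the query result is the same as for the unoptimized plan.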
The goal of this release is to support a larger percentage of real world queries and to focus on improved unit testing to ensure correctness of query execution.
- Scalar UDFs
- Array UDFs
- Support nested objects with dot notation
- JOIN support (hash join and sort merge join)
- Better unit tests / smoke test / performance tests
- Parallel execution using threads (async/await)
- Partitioning
- Query optimizer improvements
- Serializable logical query plan (in protobuf format)
- Worker node that can receive and execute plan against local files
- Write query output to local files or return results in protobuf and/or IPC format
- Consider supporting Hive protocol to allow JDBC/ODBC clients to submit queries to a single node
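As a rough illustration of how partitioning and parallel execution could combine with the aggregate functions, the sketch below computes COUNT, SUM, MIN and MAX per group on each partition in a thread pool and then merges the partial results. It is written in Python with a thread pool rather than async/await, and every name in it (`Partial`, `aggregate_partition`, `parallel_group_by`) is an assumption made for the example, not part of this project. It shows the shape of the technique only and makes no performance claim.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable aggregate state for one group."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value):
        """Fold a single input value into the state."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine with the state computed on another partition."""
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)


def aggregate_partition(rows):
    """Aggregate one partition of (group_key, value) pairs."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, Partial()).update(value)
    return groups


def parallel_group_by(partitions, max_workers=4):
    """Aggregate each partition on its own thread, then merge per group."""
    final = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(aggregate_partition, partitions):
            for key, state in partial.items():
                final.setdefault(key, Partial()).merge(state)
    return final


# Three hypothetical partitions of (group_key, value) rows.
partitions = [
    [("a", 1.0), ("b", 5.0)],
    [("a", 3.0)],
    [("b", 2.0), ("b", 4.0)],
]
result = parallel_group_by(partitions)
```

Because each partition produces a small mergeable state, the final merge step is cheap. COUNT(DISTINCT) does not fit this pattern directly, since distinct counts from different partitions cannot simply be added.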
This release will allow queries to be executed against a cluster, supporting interactive queries that return results in Arrow format or write results to disk.
- Distributed query planner
- Worker can delegate portions of query plan to other workers
- Data source meta-data
- Kubernetes support for spinning up worker nodes
- Web user interface, better tools, monitoring, etc.
- Support for S3
- Support for HDFS
- Authentication/authorization
- Encryption at rest and in transit
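The sketch below shows one way a plan fragment could be serialized, handed to workers and combined. It makes several assumptions: JSON stands in for the protobuf format mentioned earlier, the "cluster" is an in-process dictionary called `WORKER_DATA`, and `make_fragment`, `worker_execute` and `distributed_count` are hypothetical names. Networking, scheduling and Arrow result encoding are left out.

```python
import json

# Hypothetical cluster: worker name -> rows stored locally on that worker.
WORKER_DATA = {
    "worker-1": [{"name": "Ann", "city": "Oslo"}, {"name": "Bob", "city": "Rome"}],
    "worker-2": [{"name": "Cy", "city": "Oslo"}],
    "worker-3": [{"name": "Di", "city": "Oslo"}, {"name": "Ed", "city": "Lima"}],
}


def make_fragment(column, value):
    """Serialize a tiny 'filtered count' plan fragment (JSON, not protobuf)."""
    return json.dumps({"op": "count", "filter": {"column": column, "equals": value}})


def worker_execute(worker, fragment_json):
    """Run a serialized fragment against one worker's local rows."""
    fragment = json.loads(fragment_json)
    if fragment["op"] != "count":
        raise ValueError("unsupported operation: " + fragment["op"])
    flt = fragment["filter"]
    return sum(1 for row in WORKER_DATA[worker] if row[flt["column"]] == flt["equals"])


def distributed_count(column, value):
    """Send the same fragment to every worker and add up the partial counts."""
    fragment = make_fragment(column, value)
    return sum(worker_execute(worker, fragment) for worker in WORKER_DATA)


# Roughly: SELECT COUNT(*) FROM people WHERE city = 'Oslo'
oslo_count = distributed_count("city", "Oslo")
```

Only the serialized fragment and a single number per worker cross the worker boundary here. In a real cluster the same principle keeps data movement small, as long as the operation can be split into per-worker partial results.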