Creating a data analytics project for Uber using modern data engineering on Google Cloud Platform (GCP) involves several steps and components.
Below is a high-level overview of how you could structure such a project:
Objective:
Enhance decision-making and optimize operations for Uber by analyzing and visualizing relevant data.
Data Sources: Uber ride data, driver information, user feedback, and geographic data, among others.
Tools and Technologies: GCP services such as BigQuery, Cloud Storage, Dataflow, Pub/Sub, and Data Studio (now Looker Studio), with Python for data processing and analysis.
Data Ingestion:
Set up data ingestion pipelines to collect and process data from various sources. Use Cloud Pub/Sub for real-time data streaming and Cloud Storage for batch data uploads.
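Pub/Sub carries message payloads as bytes, so a streaming pipeline needs an agreed way to serialize each trip event. The following is a minimal sketch of one option, UTF-8 JSON, using only the standard library. The function names and the event fields (`trip_id`, `pickup_datetime`, `trip_distance`) are hypothetical and chosen for illustration only.

```python
import json


def encode_trip_event(event: dict) -> bytes:
    """Serialize a trip event to UTF-8 JSON bytes (hypothetical helper)."""
    # sort_keys gives a stable byte representation for identical events.
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_trip_event(payload: bytes) -> dict:
    """Reverse of encode_trip_event, as a subscriber might do on receipt."""
    return json.loads(payload.decode("utf-8"))


# Illustrative event; the field names are assumptions, not a fixed schema.
event = {
    "trip_id": "t-001",
    "pickup_datetime": "2016-03-01 00:00:00",
    "trip_distance": 2.5,
}

payload = encode_trip_event(event)
restored = decode_trip_event(payload)
```

With the `google-cloud-pubsub` client, bytes like `payload` would typically be passed as the message data when publishing. For batch uploads, the same events could instead be written to a file and copied to Cloud Storage.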
Data Storage: Store raw and processed data in BigQuery for easy querying and analysis. Organize data into structured tables for efficient retrieval.
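One common way to organize data into structured tables is to split flat trip records into a fact table and small dimension tables before loading them. The sketch below shows the idea in plain Python for a single dimension. The function name, the input fields, and the surrogate key scheme are assumptions made for illustration, not a prescribed schema.

```python
def build_payment_dimension(trips):
    """Split flat trips into (dimension rows, fact rows).

    Each distinct payment type gets a surrogate integer key, and the
    fact rows reference that key instead of repeating the name.
    """
    key_by_name = {}  # payment type name -> surrogate key
    fact_rows = []
    for trip in trips:
        name = trip["payment_type"]
        # Assign the next free key the first time a name is seen.
        payment_type_id = key_by_name.setdefault(name, len(key_by_name))
        fact_rows.append(
            {
                "trip_id": trip["trip_id"],
                "payment_type_id": payment_type_id,
                "fare_amount": trip["fare_amount"],
            }
        )
    dim_rows = [
        {"payment_type_id": key, "payment_type_name": name}
        for name, key in key_by_name.items()
    ]
    return dim_rows, fact_rows


# Made-up records for illustration only.
trips = [
    {"trip_id": 1, "payment_type": "Credit card", "fare_amount": 10.0},
    {"trip_id": 2, "payment_type": "Cash", "fare_amount": 5.0},
    {"trip_id": 3, "payment_type": "Credit card", "fare_amount": 7.5},
]
dim_rows, fact_rows = build_payment_dimension(trips)
```

The two resulting row lists could then be loaded into separate BigQuery tables and joined on `payment_type_id` at query time.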
Data Processing: Utilize Cloud Dataflow for data processing tasks, ensuring scalability and efficiency. Implement data transformation and cleaning processes to handle missing or erroneous data.
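To make the cleaning step concrete, here is a small sketch of a per-row validation function that drops records with missing or erroneous values. It assumes the column names of the TLC yellow taxi schema (`tpep_pickup_datetime`, `tpep_dropoff_datetime`, `trip_distance`, `fare_amount`) and a `YYYY-MM-DD HH:MM:SS` timestamp format. The validation rules themselves are illustrative choices.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def clean_trip(row: dict):
    """Return a typed copy of the row, or None if it should be dropped."""
    try:
        pickup = datetime.strptime(row["tpep_pickup_datetime"], TIME_FORMAT)
        dropoff = datetime.strptime(row["tpep_dropoff_datetime"], TIME_FORMAT)
        distance = float(row["trip_distance"])
        fare = float(row["fare_amount"])
    except (KeyError, TypeError, ValueError):
        # Missing column, None value, or text that does not parse.
        return None
    # Reject rows that parse but are not physically plausible.
    if dropoff < pickup or distance < 0 or fare < 0:
        return None
    return {
        "pickup": pickup,
        "dropoff": dropoff,
        "trip_distance": distance,
        "fare_amount": fare,
    }


# Made-up rows for illustration only.
good = {
    "tpep_pickup_datetime": "2016-03-01 00:00:00",
    "tpep_dropoff_datetime": "2016-03-01 00:07:30",
    "trip_distance": "2.5",
    "fare_amount": "9.0",
}
missing_distance = dict(good, trip_distance="")
reversed_times = dict(good, tpep_dropoff_datetime="2016-02-29 23:00:00")

cleaned = [clean_trip(r) for r in (good, missing_distance, reversed_times)]
```

In a Dataflow pipeline, a function like this could be applied to each element and the `None` results filtered out before the data is written to BigQuery.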
- Programming Language - Python
Google Cloud Platform
- Google Cloud Storage
- Compute Engine instance
- BigQuery
- Looker Studio
Modern Data Pipeline Tool - https://www.mage.ai/
Contribute to this open source project - https://github.com/mage-ai/mage-ai
TLC Trip Record Data:
Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
Here is the dataset used in the video - https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/data/uber_data.csv
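As a rough illustration of working with trip record fields, the sketch below parses a tiny inline CSV sample and derives each trip's duration in minutes. The sample rows are invented for this example and are not taken from the dataset. The column names assume the TLC yellow taxi schema, and `trip_minutes` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# Invented sample rows; the real file has more columns and records.
SAMPLE_CSV = """tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount
2016-03-01 00:00:00,2016-03-01 00:07:30,1,2.5,9.0
2016-03-01 00:10:00,2016-03-01 00:40:00,2,11.0,32.5
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def trip_minutes(row: dict) -> float:
    """Duration between pick-up and drop-off, in minutes."""
    start = datetime.strptime(row["tpep_pickup_datetime"], TIME_FORMAT)
    end = datetime.strptime(row["tpep_dropoff_datetime"], TIME_FORMAT)
    return (end - start).total_seconds() / 60


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
durations = [trip_minutes(row) for row in rows]
```

To run the same logic on the downloaded file, the `io.StringIO` source could be replaced with an opened file handle.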
More info about the dataset can be found here: