This project explores and analyzes U.S. flight data stored as Parquet files on a Spark cluster hosted in the Databricks environment.
The dataset contains detailed flight-level information, including attributes such as flight date, carrier code, departure and arrival times, and delays. Here's a brief description of the dataset columns:
| Variable | Meaning |
|---|---|
| FL_DATE | Date of flight (YYYY-MM-DD) |
| OP_CARRIER | Airline code assigned by the International Air Transport Association (IATA) |
| OP_CARRIER_FL_NUM | Flight number assigned by the airline |
| ORIGIN | IATA code for the airport of departure |
| DEST | IATA code for the destination airport |
| ... | (Other columns omitted for brevity) |
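As a quick orientation, here is a minimal sketch that loads the Parquet files into a Spark DataFrame and inspects a few of the columns above; the `/mnt/<mount_name>/flights/` path is a placeholder for wherever the data is mounted (see the setup steps below):

```python
# In a Databricks notebook, `spark` (a SparkSession) is predefined.
# The path below is a placeholder; point it at your mounted flight data.
flights = spark.read.parquet("/mnt/<mount_name>/flights/")

flights.printSchema()
flights.select("FL_DATE", "OP_CARRIER", "ORIGIN", "DEST").show(5)
```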
Tasks for this part of the project:
- Prepare data: Clean, format, and consolidate the data into a Spark DataFrame.
- Analyze data: Calculate aggregates, including flight counts, delay percentages, and per-carrier statistics (see the first sketch after this list).
- Feature engineering: Derive features such as day of the week, per-airport weather-delay rates, and departures relative to daily averages to improve predictive accuracy (second sketch below).
- Modeling: Apply Spark ML to predict departure delays (third sketch below).
- Validation: Evaluate the model’s performance on the held-out validation set, and report accuracy.
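A minimal sketch of the per-carrier aggregates, assuming a `DEP_DELAY` column (departure delay in minutes) among the columns omitted from the table above, and counting a flight as delayed if it departed more than 15 minutes late:

```python
from pyspark.sql import functions as F

# Flight counts, delay percentage, and average delay per carrier.
# DEP_DELAY (minutes) is assumed to be one of the omitted columns.
by_carrier = (
    flights.groupBy("OP_CARRIER")
    .agg(
        F.count("*").alias("num_flights"),
        F.avg((F.col("DEP_DELAY") > 15).cast("double")).alias("pct_delayed"),
        F.avg("DEP_DELAY").alias("avg_dep_delay"),
    )
    .orderBy(F.desc("num_flights"))
)
by_carrier.show()
```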
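For the feature-engineering step, a sketch of two of the features named above, both derived from columns in the table:

```python
from pyspark.sql import functions as F

# Day of week from FL_DATE (Spark's dayofweek: 1 = Sunday ... 7 = Saturday).
flights_feat = flights.withColumn("DAY_OF_WEEK", F.dayofweek("FL_DATE"))

# Departures on each day relative to the average daily departure count.
daily = flights_feat.groupBy("FL_DATE").agg(F.count("*").alias("daily_departures"))
avg_daily = daily.agg(F.avg("daily_departures")).first()[0]
flights_feat = flights_feat.join(daily, on="FL_DATE").withColumn(
    "DEPARTURES_VS_AVG", F.col("daily_departures") / F.lit(avg_daily)
)
```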
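And a sketch of the modeling and validation steps with Spark ML. Logistic regression is just one reasonable choice here, not necessarily the model used in the notebooks; the label again assumes a `DEP_DELAY` column, with a delay defined as a departure more than 15 minutes late:

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Binary label: 1.0 if the departure was more than 15 minutes late.
labeled = flights_feat.dropna(subset=["DEP_DELAY"]).withColumn(
    "label", (F.col("DEP_DELAY") > 15).cast("double")
)
train, valid = labeled.randomSplit([0.8, 0.2], seed=42)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="OP_CARRIER", outputCol="carrier_idx", handleInvalid="keep"),
    VectorAssembler(
        inputCols=["carrier_idx", "DAY_OF_WEEK", "DEPARTURES_VS_AVG"],
        outputCol="features",
        handleInvalid="skip",
    ),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)
preds = model.transform(valid)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy"
).evaluate(preds)
print(f"Validation accuracy: {accuracy:.3f}")
```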
To get started with the project, follow these steps:
- Clone this repository to your local machine.
git clone <repository_url>
- Set up your Databricks workspace and connect it to your Azure Data Lake Storage account.
# Install the Databricks CLI
pip install databricks-cli
# Configure the CLI with your workspace URL and personal access token
# (the command prompts for both)
databricks configure --token
# Mounting your Azure Data Lake Storage account is done from a Databricks
# notebook with dbutils.fs.mount, not from the CLI; see the sketch below.
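A minimal notebook sketch for the mount step, assuming an ADLS Gen1 account accessed with a service principal; `<application-id>`, `<client-secret>`, and `<tenant-id>` are placeholders you must supply:

```python
# Run inside a Databricks notebook, where dbutils is predefined.
# All angle-bracket values are placeholders for your own credentials.
configs = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": "<client-secret>",
    "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="adl://<storage_account_name>.azuredatalakestore.net/",
    mount_point="/mnt/<mount_name>",
    extra_configs=configs,
)
```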
- Use the provided notebooks to perform data exploration and analysis.