---
title: "prepare_flightdata"
author: "Brigitte"
date: "April 25, 2016"
output: html_document
---
PURPOSE: Clean the data downloaded from http://1.usa.gov/1KEd08B (unzip it first) by keeping only flights between large airports, removing duplicate columns, dropping rows without delay information (ARR_DEL15 or DEP_DEL15), converting columns to suitable types, and splitting the data into training and test sets.
Loading data
```{r}
setwd("~/R-scripts/Predict_Flights")
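# The file is comma-separated, so sep is overridden. read.csv2 still assumes dec = ",",
# so columns containing decimal points may be read as text; they are converted explicitly later.
origData <- read.csv2('963742499_T_ONTIME.csv', sep=',', header=TRUE, stringsAsFactors=FALSE)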
```
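To see what the import step produces, here is a minimal sketch on a made-up three-row extract. The column names follow the real file, but the values are invented, and it uses `read.csv` (defaults `sep = ","`, `dec = "."`) rather than the `read.csv2` call above, so the column types may differ from those in `origData`.

```r
# Hypothetical extract: real column names, invented values.
# Each line ends with a trailing comma, as the comment on origData$X further down suggests.
csv_text <- paste(
  "ORIGIN,DEST,DISTANCE,ARR_DEL15,",
  "ATL,LAX,1947.00,0.00,",
  "JFK,SFO,2586.00,1.00,",
  "ORD,DFW,802.00,,",
  sep = "\n"
)

# read.csv expects "," as the field separator and "." as the decimal mark.
sample_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(sample_data)    # DISTANCE and ARR_DEL15 are parsed as numeric
names(sample_data)  # the trailing comma adds an extra, empty column
```

The empty `ARR_DEL15` field in the last row becomes `NA`, which is the kind of row the later filtering step removes.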
Check a few things. nrow() shows that there are many rows, so we keep only flights between large airports. To find the columns to filter on, list the column names with names(), then subset() the data frame.
```{r}
nrow(origData)
head(origData,3)
names(origData)
largeairports <- c('ATL','LAX','ORD','DFW','JFK','SFO','CLT','LAS','PHX')
origData <- subset(origData,DEST %in% largeairports & ORIGIN %in% largeairports)
```
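The filter keeps a flight only when both its origin and its destination are in the list. A small sketch with an invented `flights` data frame and a shortened, hypothetical `hubs` vector shows the effect:

```r
# Invented toy data; only the column names match the real file.
flights <- data.frame(
  ORIGIN = c("ATL", "ATL", "BOS", "LAX"),
  DEST   = c("LAX", "BOS", "ATL", "ORD"),
  stringsAsFactors = FALSE
)
hubs <- c("ATL", "LAX", "ORD")

# %in% tests membership element-wise; & requires both conditions per row.
both_ends <- subset(flights, DEST %in% hubs & ORIGIN %in% hubs)
both_ends
nrow(both_ends)  # 2: ATL-LAX and LAX-ORD remain, the two BOS flights are dropped
```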
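Locate duplicate columns: use cor() for numeric columns (a correlation of 1 suggests that one column is redundant) and count mismatches with != for string columns (a sum of 0 means the columns are identical). Remove the duplicates by setting them to NULL.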
```{r}
head(origData)
origData$X <- NULL # empty column, probably introduced when reading the data
cor(origData[c("ORIGIN_AIRPORT_SEQ_ID","ORIGIN_AIRPORT_ID")])
head(origData[c("ORIGIN_AIRPORT_SEQ_ID","ORIGIN_AIRPORT_ID")],3)
cor(origData[c("DEST_AIRPORT_SEQ_ID","DEST_AIRPORT_ID")])
sum(origData$CARRIER != origData$UNIQUE_CARRIER)
origData$ORIGIN_AIRPORT_SEQ_ID <- NULL
origData$DEST_AIRPORT_SEQ_ID <- NULL
origData$UNIQUE_CARRIER <- NULL
```
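A correlation of 1 only shows a linear relationship, so a stricter check can be useful as a complement. The sketch below uses a hypothetical helper, `is_one_to_one()`, which is not part of the original script, to test whether two columns map onto each other one-to-one. All ID values are invented.

```r
# Hypothetical helper: TRUE when each value of a pairs with exactly one value of b
# and each value of b pairs with exactly one value of a.
is_one_to_one <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b))
  !anyDuplicated(pairs$a) && !anyDuplicated(pairs$b)
}

# Invented IDs for illustration.
airport_id     <- c(10397, 12892, 10397, 13930)
airport_seq_id <- c(1039705, 1289203, 1039705, 1393003)
is_one_to_one(airport_id, airport_seq_id)  # TRUE: one column carries no extra information

# A case where the first column maps to two different values.
is_one_to_one(c(1, 1, 2), c(5, 6, 7))      # FALSE

# For string columns, counting mismatches is enough.
carrier        <- c("AA", "DL", "AA")
unique_carrier <- c("AA", "DL", "AA")
sum(carrier != unique_carrier)             # 0
```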
Now make sure every row has the values to be predicted, i.e. ARR_DEL15 and DEP_DEL15 are neither missing nor empty, so that each is either 0 or 1.
```{r}
onTimeData <- origData[!is.na(origData$ARR_DEL15) & origData$ARR_DEL15!="" & !is.na(origData$DEP_DEL15) & origData$DEP_DEL15!="", ]
```
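The same condition can be sketched on a few invented rows. The helper `has_value()` is a hypothetical name introduced here for readability and is not used in the original script.

```r
# Invented rows covering the cases: complete, empty string, and NA.
delays <- data.frame(
  ARR_DEL15 = c("0.00", "1.00", "", NA, "0.00"),
  DEP_DEL15 = c("0.00", "", "1.00", "0.00", "1.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is neither NA nor an empty string.
# For NA the expression is FALSE & NA, which evaluates to FALSE.
has_value <- function(x) !is.na(x) & x != ""

kept <- delays[has_value(delays$ARR_DEL15) & has_value(delays$DEP_DEL15), ]
kept
nrow(kept)  # 2: only the first and the last row are complete
```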
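Are there any variables in a 'wrong' format? Convert counts and flags to integer and categorical variables to factor, then count the rows in each ARR_DEL15 class.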
```{r}
onTimeData$DISTANCE <- as.integer(onTimeData$DISTANCE)
onTimeData$CANCELLED <- as.integer(onTimeData$CANCELLED)
onTimeData$ARR_DEL15 <- as.factor(onTimeData$ARR_DEL15)
onTimeData$DIVERTED <- as.integer(onTimeData$DIVERTED)
onTimeData$DEP_DEL15 <- as.factor(onTimeData$DEP_DEL15)
onTimeData$DEST_AIRPORT_ID <- as.factor(onTimeData$DEST_AIRPORT_ID)
onTimeData$ORIGIN_AIRPORT_ID <- as.factor(onTimeData$ORIGIN_AIRPORT_ID)
onTimeData$DAY_OF_WEEK <- as.factor(onTimeData$DAY_OF_WEEK)
onTimeData$DEST <- as.factor(onTimeData$DEST)
onTimeData$ORIGIN <- as.factor(onTimeData$ORIGIN)
onTimeData$DEP_TIME_BLK <- as.factor(onTimeData$DEP_TIME_BLK)
onTimeData$CARRIER <- as.factor(onTimeData$CARRIER)
tapply(onTimeData$ARR_DEL15,onTimeData$ARR_DEL15,length)
```
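When many columns need the same conversion, the repeated assignments can be collapsed with `lapply`. This is only a sketch on an invented `toy` data frame, with `int_cols` and `factor_cols` as hypothetical names; the line-by-line version above does the same thing.

```r
# Invented data, stored as text the way a CSV import may deliver it.
toy <- data.frame(
  DISTANCE  = c("1947.00", "802.00", "2586.00"),
  CARRIER   = c("AA", "DL", "AA"),
  ARR_DEL15 = c("0.00", "1.00", "0.00"),
  stringsAsFactors = FALSE
)

int_cols    <- c("DISTANCE")
factor_cols <- c("CARRIER", "ARR_DEL15")

# Apply one conversion function to each group of columns.
toy[int_cols]    <- lapply(toy[int_cols], as.integer)
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
table(toy$ARR_DEL15)  # class counts, equivalent to the tapply(..., length) call
```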
For many models, the next step in predicting flight delays is to split the data into training (70%) and testing (30%) sets.
We select these variables: Origin and Destination, Day of Week, Carrier, and Departure Time Block (departure times grouped into 1-hour blocks; a late departure often means a late arrival), plus the target variable, Arrival Delay 15 (ARR_DEL15).
Load the caret package (Classification and Regression Training).
Run install.packages('caret') first if you haven't installed it before.
```{r}
library(caret)
set.seed(100)
featureCols <- c('ARR_DEL15','DAY_OF_WEEK','CARRIER','DEST','ORIGIN','DEP_TIME_BLK')
onTimeDataFiltered <- onTimeData[,featureCols]
# The share of delayed and on-time flights should be the same in training and testing data,
# so createDataPartition is given the variable to balance the split on.
inTrainRows <- createDataPartition(onTimeDataFiltered$ARR_DEL15,p=0.7,list=FALSE)
trainDataFiltered <- onTimeDataFiltered[inTrainRows,]
testDataFiltered <- onTimeDataFiltered[-inTrainRows,]
nrow(trainDataFiltered)/(nrow(testDataFiltered)+nrow(trainDataFiltered))
```
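To illustrate what a class-balanced split means, here is a base-R sketch on invented data. It is not caret's implementation, only a simple stratified sample that takes 70% of the rows within each class, and all names in it are hypothetical.

```r
set.seed(100)

# Invented data: 80 on-time flights ("0") and 20 delayed flights ("1").
toy <- data.frame(
  ARR_DEL15 = factor(rep(c("0", "1"), times = c(80, 20))),
  CARRIER   = sample(c("AA", "DL", "UA"), 100, replace = TRUE)
)

# Group the row indices by class, then draw 70% from each group separately.
rows_by_class <- split(seq_len(nrow(toy)), toy$ARR_DEL15)
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Both sets keep the 80/20 class ratio of the full data.
prop.table(table(train$ARR_DEL15))
prop.table(table(test$ARR_DEL15))
```

The same `prop.table(table(...))` check can be run on `trainDataFiltered` and `testDataFiltered` to confirm that the class shares match.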
Save trainDataFiltered and testDataFiltered to CSV files; they can then be read back in for further work.
```{r}
write.csv(trainDataFiltered, file = "train.csv")
write.csv(testDataFiltered, file = "test.csv")
```