layout:true
Data Analysis with R
--
class: center,middle
Follow along at: http://bit.ly/data-analysis-r
See the code at: http://bit.ly/data-analysis-r-code
Data Analysis with R by Richard Dunks and Julia Marden is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
class:center,middle
exclude:true
???
- Facilitators will cover the following skills: muting themselves, stopping their video, typing in chat box, raising their hand, sharing their screen
- Mute and Unmute your microphone
- Start and Stop your video
- Post a message in the Chat window with your name and computer operating system (Windows or MacOS)
- Click the Participants window and Raise your hand
???
- Facilitators establish the intention we have for the culture of the classroom
--
-
Step up, step back --
-
One mic --
-
Be curious and ask questions in the chat box --
-
Assume noble regard and positive intent --
-
Respect multiple perspectives --
-
Be present (phone, email, social media, etc.)
--
--
-
Who you are --
-
Where you work --
-
What you're hoping to learn today --
-
What you've done with code (any code)
--
-
Introduction to R --
-
Using R in Data Analysis --
-
Getting Familiar: R Syntax + R Studio --
-
311 Data Analysis --
-
Presentations!
--
-
R syntax and commands --
-
RStudio --
-
Load data --
-
Explore data --
-
Wrangle data --
-
Visualize data
???
- Students will review progress and give feedback on key takeaways
name:housekeeping
--
-
We’ll have one 15 minute break in the morning --
-
We’ll have an hour for lunch --
-
We’ll have a 15 minute break in the afternoon --
-
Class will start promptly after breaks --
-
Feel free to use the bathroom if you need to during class --
-
Please take any phone conversations into the hall to not disrupt the class
--
“Analysis is simply the pursuit of understanding, usually through detailed inspection or comparison”
???
- Orient students to key concept in analysis
- Use R to uncover meaning in data
???
- Establish frame for the analytics process to be followed in class
- Familiarize students with terminology (esp "data wrangling/data cleaning")
- Demystify the process
- Empower students to do analysis
.caption[Image Credit: Astroval1, CC BY-SA 4.0 via Wikimedia Commons]
???
- Facilitator provides context for the exercise by describing Old Faithful
- Students will download script with prepared code snippets to run
- Students will learn the steps of running summary statistics in R
--
-
What's the minimum amount of time I should plan to spend at Old Faithful? --
-
Is there a relationship between the amount of time I wait and the length of time it erupts? --
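Both questions can be explored with a few base R calls. The following is a minimal sketch using the built-in `faithful` dataset (times in minutes). Reading the shortest and longest recorded waits as planning bounds is an assumption made here for illustration, not a rule from the class script.

```r
# faithful ships with R: 272 eruptions, two numeric columns
summary(faithful$waiting)            # spread of waiting times, in minutes

# Question 1: shortest and longest wait on record
wait_range <- range(faithful$waiting)

# Question 2: do longer waits go with longer eruptions?
# cor() returns a value from -1 to 1; values near 1 mean a strong positive link
wait_vs_eruption <- cor(faithful$waiting, faithful$eruptions)

wait_range
wait_vs_eruption
```

A correlation close to 1 suggests a relationship worth plotting, but it does not say why the two move together.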
???
- Students will understand the problem we're seeking to solve in class
- Students will learn by example the value of problem setting.
- This will be done by writing out explicit problem statement for 311 Noise, possibly vision 0 db after we have exercise.
???
- Students will open and load a simple dataset.
- They will inspect the data in the viewer and confirm it loaded properly.
- This will be done by live demo of code
- Students will be writing code themselves
- Introduce basic commands and tab completion
- Describe comments and their purpose
- Emphasize cooperation between participants
???
- Introduce students to Console, Environment, and Help
- Students will be familiar with the key features of the console for the exercises to come
- This will be done by live demo and verbal discussion
- Ctrl+L clear console
--
.caption[Image Credit: AnonMoos, Public Domain via Wikipedia]
???
- Students will get vocabulary for accomplishing tasks in code
- This will be done with an overview discussion
# basic command
command(dataset)
View(faithful)
???
- Facilitator guides students through basic syntax in R for simple tasks
- Instructor reinforces syntax idea and relation to regular sentence structure to convey meaning where appropriate --
# select a column
command(dataset$column)
mean(faithful$waiting)
--
# get help
?help
?faithful
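The three patterns above can be tried on the built-in `faithful` data. This is a small sketch; the particular commands chosen (`nrow`, `colnames`, `median`) are just examples of the pattern, and the variable name `typical_wait` is made up for illustration.

```r
# Pattern 1: command(dataset)
nrow(faithful)         # number of rows
colnames(faithful)     # names of the columns

# Pattern 2: command(dataset$column)
typical_wait <- median(faithful$waiting)
typical_wait

# Pattern 3: ?topic opens the help page (useful in an interactive session)
# ?median
```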
--
-
Look through the code we just wrote --
-
Make a change to one thing on the chart --
-
If necessary, check out the help documentation --
-
Be ready to describe what you did
--
-
Statistical programming language --
-
Open-source --
-
Made for and by people who work with data --
-
Used for data analysis --
-
For the history of R, see this video
???
- Familiarize students with basics of R and set context
- "Created for and by the people" - Julia Marden
???
-
Facilitator compares R directly to Excel for context (assuming most participants are well-acquainted with Excel) --
-
R is a programming language while Excel is an application --
-
R can work with much larger datasets than Excel --
-
R can perform more complex operations than Excel --
-
R commands can be easily saved, re-run, and automated --
-
R doesn't have the icons, animations, and wizards of Excel
name:nola
.caption[Image Credit: Michael Barnett CC BY-SA 2.5, via Wikimedia Commons]
???
- Students will be inspired to use their knowledge in practical applications
.caption[Image Credit: City of New Orleans, via nola.gov]
???
- Students will be inspired to use their knowledge in practical applications
.caption[Image Credit: City of New Orleans, via nola.gov]
???
- Students will be inspired to use their knowledge in practical applications
class:center,middle
class:center, middle
Source: https://xkcd.com/378/
--
-
Sorting --
-
Filtering --
-
Aggregating (PivotTable) --
-
Transforming --
-
Visualizing
--
-
Reorganize rows in a dataset based on the values in a column --
-
Can sort on multiple columns
--
-
Use `order()` --
-
Specify the column you want to sort by (in our case `eruptions` or `waiting`) --
df[order(df$column_to_sort_by),]
--
- Sort the Old Faithful data to find the shortest waiting time
- Sort the Old Faithful data to find the longest waiting time
???
- Why the comma?
- The syntax is `df[row specifier, column specifier]`.
- If a specifier is absent, R returns all rows (or all columns).
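One possible way to work through the sorting exercise with base R is sketched below. The variable names (`by_wait`, `by_wait_desc`, `by_both`) are made up for illustration.

```r
# order() returns row positions, smallest value first;
# the empty slot after the comma keeps every column
by_wait <- faithful[order(faithful$waiting), ]
head(by_wait, 3)                     # shortest waits at the top

# decreasing = TRUE puts the largest values first
by_wait_desc <- faithful[order(faithful$waiting, decreasing = TRUE), ]
head(by_wait_desc, 3)                # longest waits at the top

# Sort on two columns: waiting first, ties broken by eruptions
by_both <- faithful[order(faithful$waiting, faithful$eruptions), ]
```

Note that `order()` does not change `faithful` itself; the sorted rows live in the new objects.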
--
-
Only show rows that contain some value --
-
Can filter by multiple values --
-
Can filter by values in multiple columns
--
-
Provide some logical test (`<`, `>`, `==`, etc.) --
-
The format is --
df[df$column_to_filter_by <logical test>,]
--
- Filter the Old Faithful data for all eruptions longer than 4 minutes
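A possible solution to the filtering exercise, along with the multi-value and multi-column variations mentioned earlier, might look like this. The 70-minute cutoff and the variable names are assumptions chosen only to illustrate the syntax.

```r
# Rows where the eruption lasted longer than 4 minutes
long_eruptions <- faithful[faithful$eruptions > 4, ]
nrow(long_eruptions)

# Tests on two columns, combined with & (and); | means or
long_after_short_wait <- faithful[faithful$eruptions > 4 & faithful$waiting < 70, ]

# Several accepted values in one column, using %in%
waits_of_interest <- faithful[faithful$waiting %in% c(43, 96), ]
```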
--
-
Trends only become clear in aggregate --
-
Often where you discover the "so what" --
-
Aggregating data meaningfully can be tricky --
-
We'll be showing how to do this with R later
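As a small preview, here is a sketch of aggregation in base R, similar in spirit to a PivotTable. The 3-minute cutoff between "short" and "long" eruptions and the column name `type` are assumptions made only for this example.

```r
eruptions_df <- faithful             # work on a copy

# Label each eruption; the 3-minute cutoff is an illustrative assumption
eruptions_df$type <- ifelse(eruptions_df$eruptions > 3, "long", "short")

# Average waiting time within each group
wait_by_type <- aggregate(waiting ~ type, data = eruptions_df, FUN = mean)
wait_by_type
```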
--
-
Sometimes available categories don't make sense --
-
Values may not be in the format you need (or have mistakes) --
-
You always want to have a clean copy of the data to go back to --
-
Best to keep track of what you've done --
-
We'll be showing how to do this with R later
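A minimal sketch of transforming data while keeping a clean copy is shown below. The new column names (`eruption_seconds`, `wait_band`) and the band boundaries are made up for illustration.

```r
working <- faithful                  # faithful itself stays untouched

# New column: eruption length converted from minutes to seconds
working$eruption_seconds <- working$eruptions * 60

# New column: group waiting times into bands (boundaries are illustrative)
working$wait_band <- cut(
  working$waiting,
  breaks = c(40, 60, 80, 100),
  labels = c("41-60", "61-80", "81-100")
)
table(working$wait_band)

ncol(faithful)                       # the original still has its two columns
```

Because every step is written down in the script, it doubles as a record of what was done to the data.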
--
-
Quickly communicate information --
-
Tell a clearer story --
-
A picture is worth a thousand words --
-
We've already seen this with the Old Faithful data
hist(faithful$waiting)
hist(faithful$eruptions)
plot(faithful, main="Eruptions of Old Faithful", xlab="Eruption Time in Minutes", ylab="Waiting Time to Next Eruption in Min")
abline(lm(faithful$waiting~faithful$eruptions), col="red")
- Sorting
- Filtering
- Aggregating (PivotTable)
- Transforming
- Visualizing
.center[Derelict Vehicles Across NYC]
--
-
How many people complain about derelict vehicles? --
-
Do people complain more at a particular time of day? --
-
Do people complain more in a particular neighborhood or borough? --
???
- Students will understand the problem we're seeking to solve in class
- Students will learn by example the value of problem setting.
- This will be done by writing out explicit problem statement for 311 Noise, possibly vision 0 db after we have exercise.
--
-
Open the `311_a.R` script (already loaded in RStudio) --
-
Follow along with the code as we load the dataset --
-
You can download the code here --
-
The data dictionary explains each column
???
- Students will conduct the same commands from Faithful with 311 exercise
- Students will hit the roadblocks
- Can't run summary statistics
- Exercise will be run through script showing comments (not on slide)
- Script will mirror the Faithful with intention of not working
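The roadblock can be reproduced on a tiny scale. The data frame below is made up for illustration (it is not the 311 file, and `days_open` is a hypothetical column); it only shows what happens when numbers arrive as text.

```r
# Made-up stand-in for a freshly loaded file: numbers stored as text
complaints <- data.frame(
  borough   = c("BROOKLYN", "QUEENS", "BRONX"),
  days_open = c("3", "10", "5"),
  stringsAsFactors = FALSE
)
str(complaints)                      # days_open shows as chr, not num

# mean() on text gives NA (with a warning) instead of a number
suppressWarnings(mean(complaints$days_open))

# Convert the column, then the summary statistic works
complaints$days_open <- as.numeric(complaints$days_open)
mean(complaints$days_open)
```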
???
- Students will understand a few of the different data types in R
- They will use the `str` and `summary` commands
- This will be done with a live demo of code
--
-
Lists and data frames (mixed data types) --
--
- You often need to restructure your data to make it usable
???
- Students will review work done in simple data load
- They will learn key elements of data structures based on Faithful data
- This will be done with live demo and discussion
- They will use the `str` and `summary` commands
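For the live demo, a sketch along these lines would show the structure and types of the Faithful data. The variable name `column_types` is made up for illustration.

```r
str(faithful)        # structure: number of rows, columns, and each column's type
summary(faithful)    # min, quartiles, median, mean, and max for each column
class(faithful)      # the kind of object: a data frame

# The type of every column at once
column_types <- sapply(faithful, class)
column_types
```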
class:center, middle
???
- Facilitator reviews the learning in the morning with participants
- Facilitator answers any questions
- If there is time, facilitator has participants switch and review someone else's code, then has them reflect on what they learned looking at someone else's code
class:center, middle
Source: https://xkcd.com/1319/
class:center, middle
--
-
Get data into right type or structure --
-
Create subsets --
-
Add packages to work with the data we have
???
- start of section discussing manipulating data
- picking up pieces from exercise where script failed
- start of exercise 3
--
-
Add-ons: extra functions, data viz, special features --
-
Can help you load data, work with timestamps, create charts --
-
If you need to do something, there's probably a package for it
--
- To install: `install.packages()`
- To use: `library()`
???
- Students will understand the purpose and value of packages
- This will be done with a discussion
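The two steps for using a package, installing once and loading in each session, can be sketched as follows. To keep the example runnable without an internet connection, the load step is demonstrated with `tools`, a package that ships with R; that substitution is an assumption of this sketch, not part of the class script.

```r
# Step 1 (once per computer, needs an internet connection):
# install.packages("dplyr")

# Step 2 (every new R session):
# library(dplyr)

# The same load step with a package that is already installed with R
library(tools)
toTitleCase("derelict vehicles")     # a function that only exists once tools is loaded
```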
???
- An example question of the 311 dataset
- students will be walked through the exercise with a script
- Prompts in the script with a more specific question
- incidents per borough -> distribution of complaints
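Counting incidents per borough and turning the counts into a distribution might look like the sketch below. The borough values are made up for illustration; in class they would come from the loaded 311 data.

```r
# Made-up complaint records standing in for a borough column
borough <- factor(c("BROOKLYN", "QUEENS", "BROOKLYN", "BRONX", "QUEENS", "BROOKLYN"))

counts <- table(borough)             # incidents per borough
shares <- prop.table(counts)         # the same counts as proportions

counts
round(shares * 100, 1)               # percentages, one decimal place

# barplot(counts) would draw these counts as a bar chart
```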
--
-
Switch out derelict vehicles for another complaint type --
-
Look at a different borough, ZIP, or community board --
-
Look at day of the week instead of hour --
-
Challenge yourself --
-
We'll be around to help
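Switching from hour to day of the week comes down to pulling a different piece out of the timestamp. The sketch below uses base R only, as an alternative to a package-based approach. The timestamps are made up, and their month/day/year plus AM/PM layout is an assumption about how the 311 file stores dates.

```r
# Made-up timestamps in the assumed layout
created <- c("01/15/2017 02:30:00 PM",
             "01/16/2017 08:05:00 AM",
             "01/16/2017 11:45:00 PM")

# Parse the text into date-times
stamp <- as.POSIXct(created, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

hour_of_day <- as.integer(format(stamp, "%H"))   # 0 to 23
day_of_week <- as.integer(format(stamp, "%u"))   # 1 = Monday ... 7 = Sunday

table(hour_of_day)
table(day_of_week)
```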
class:center,middle
???
- Students will understand better the purpose of using code for analysis
- Remind them we all have hypotheses, and these need to be acknowledged
--
-
How many? --
-
Where? --
-
When? --
???
- Prompts for starting your investigation of the data
- Students will have a way to start exploring data
- Discussion leading into guided exercise
--
-
Working in pairs or alone, start working on a question that interests you --
-
Start with a new script and give it a name --
-
Use the skills we've covered --
-
Challenge yourself to do something new --
-
Don't be afraid of not knowing --
-
Use the documentation --
-
Help each other out --
-
We'll be around to help
class:center, middle
Source: https://xkcd.com/1831/
--
-
Everyone gets errors all the time --
-
It's just a matter of how complex they are
--
And fixing them
--
-
Syntax errors -> using the wrong instructions --
-
Semantic errors -> doing the wrong things --
-
When in doubt, take a breath, try breaking things apart into smaller pieces, review the documentation, and search for help
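The difference between the two kinds of error can be seen in a short sketch. The scenario (wanting the average waiting time) and the variable names are made up for illustration.

```r
# Syntax error: R cannot read the instruction at all.
# The closing parenthesis is missing, so parsing fails.
syntax_result <- tryCatch(
  parse(text = "mean(faithful$waiting"),
  error = function(e) "syntax error caught"
)
syntax_result

# Semantic error: the code runs, but answers the wrong question.
# We wanted the average waiting time and asked for the wrong column.
wrong_answer <- mean(faithful$eruptions)
right_answer <- mean(faithful$waiting)
c(wrong = wrong_answer, right = right_answer)
```

R reports the first kind immediately; the second kind produces a number without complaint, which is why checking results against expectations matters.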
???
- Students will be introduced to key concepts in identifying and resolving errors
- This will be done with a lecture/discussion leading into an exercise
- Class exercise finding errors in code -> slide with code snippets in Markdown with errors
- deal with issue of correctness
- Debug your neighbor's R Script and verify results
???
- Students will examine another student's code, run the code, and fix any errors
- Students will have a better understanding of how to think in code
- Goal is to get students talking to each other about their code
- have documentation at end of slides
class:center,middle
exclude:true
class:middle,center
???
- Students will review select code examples
- Goal is to model a collaborative process for data analysis
- Time buffer for end of class
class:center,middle
--
-
R syntax and commands --
-
RStudio --
-
Load data --
-
Explore data --
-
Wrangle data --
-
Visualize data --
-
Anything else?
???
- Students will review progress and give feedback on key takeaways
???
-
Facilitators reinforce key learning points with participants for integrating into their workflow --
-
R is a powerful tool for cleaning, analyzing, and visualizing data --
-
Integrating it into your workflow takes practice and a commitment to not giving up (Google is your friend) --
-
RStudio makes it easy to get started --
-
You should be able to download R and RStudio on your work computer (Use the zip/tarball option)
name:resources
--
--
-
Hands-On Programming with R - Free online book with code examples meant for non-programmers --
-
R for Data Science - Free online book covering basic topics in data science with R --
-
R Cookbook - Free online walkthrough of the basics --
-
R Programming Coursera Course - Free course in R that runs regularly --
-
Swirl - Interactive learning inside of R
install.packages("swirl")
--
-
Tidyverse - R packages for Data Science --
-
Stat Methods - Great documentation for doing data analysis in R --
-
UCLA Stats - Many examples of statistical analysis with comparisons between R, Stata, SPSS, etc. --
-
Stack Overflow - One of the best Q&A sites for technology
???
- Also NYC Open Statistical Programming Meetup - Monthly talks about R and sponsor of the NYC R Conference
- Students will have key resources for moving forward in their learning
- Class handout
- Datapolitan training classes - The online home of our training materials
- Email me
- Check out my website
- Connect on Twitter
- Connect on LinkedIn
- Follow us on Instagram
class:center, middle
View()
# show dataset as spreadsheet in Viewer
str()
# identify data type and structure
nrow()
# identify the number of rows
ncol()
# identify the number of columns
colnames()
# list the name of every column
sort()
# sort the values in a column
data.frame()
# structure data into a data frame (a table whose columns can hold different types)
subset()
# extract data from a dataframe
min()
# identify minimum value
max()
# identify maximum value
median()
# calculate median value
mean()
# calculate mean value
hist()
# make a histogram of numeric data
plot()
# plot two numeric variables along an x-y axis
abline()
# add a trendline to a plot
table()
# make a table with factor data
prop.table()
# make a table with percentages
barplot()
# make a chart with factor data
install.packages("dplyr")
library(dplyr)
tbl_df()
# create a dataframe (deprecated in recent dplyr releases; as_tibble() is the replacement)
filter()
select()
# create a subset; filter for rows, select for columns
mutate()
# add a column
arrange()
# sort rows by the values in one or more columns
install.packages("lubridate")
library(lubridate)
mdy_hms()
# parse a timestamp written as month, day, year, hour, minute and second
# other commands: mdy_hm, mdy, dmy, etc.
hour()
# extract hour from timestamp
# other commands: day, minute, second, etc.
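install.packages("ggplot2")
library(ggplot2)
ggplot()
# start a plot from a dataframe
geom_bar()
# make a bar chart that counts the rows in each category
# alternative is geom_col(), which uses values already in the data
# used for factor data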
ggtitle()
# add a title to a plot