MUSA-620-Spring-2017/Course-Materials

# UPenn: MUSA 620 - Data Wrangling and Data Visualization

SCHEDULING

Class: Wednesdays from 9am to 12pm in the Levin Building, room 111.

Office hours: Mondays 6pm-8pm and Tuesdays 1pm-3pm. Email galkamaxd at gmail to schedule a time.

OBJECTIVE

The purpose of this course is to familiarize students with the “pipeline” approach to data science: gathering data, storing it, analyzing it, and visualizing it so that non-technical decision makers can make sense of it. The course is divided accordingly into four sections.

  1. Data collection: Students will learn how to gather data by way of web scraping, APIs, and other unstructured sources.
  2. Databases: This part of the course teaches students how to store this data for efficient retrieval and analysis.
  3. Analytics: Students will learn a range of machine-driven techniques for analyzing structured and unstructured data.
  4. Data visualization: The last part of the course teaches students how to present the results of their analysis visually using R and the web application framework Shiny.

FORMAT

The course will be conducted in weekly sessions devoted to lectures, demonstrations and discussions.

ASSIGNMENTS

There is one required final project at the end of the semester. Homework will be assigned before the close of each class and will be due at the end of the following week’s class. Four of the homework assignments will be explicitly required. The remainder are optional, but will count toward the participation component of your final grade.

For the final project, students will replicate the pipeline approach on a dataset (or datasets) of their choosing. The final deliverable will be a web-based data visualization and accompanying description including a summary of the results and the methods used in each step of the process (collection, storage, analysis and visualization).

Final Project Description

GRADING

The grading breakdown is as follows: 50% for homework, 40% for the final project, and 10% for participation.

SOFTWARE

This course relies on the R statistical programming language, used together with Shiny and other associated extensions.

SCHEDULE

| Class # | Date | Topic | Notes |
|---|---|---|---|
| Week 1 | Jan 18 | Introduction / Data visualization concepts | Slides |
| Week 2 | Jan 25 | Working with Census data | Slides |
| Week 3 | Feb 1 | Web scraping with R | Slides |
| Week 4 | Feb 8 | Unstructured data: Twitter API | Slides |
| Week 5 | Feb 15 | Large datasets: NYC Taxi trip data with Google BigQuery | Slides |
| Week 6 | Feb 22 | Spatial databases: PostGIS | Slides |
| Week 7 | Mar 1 | Data frames and data manipulation with R: dplyr | Slides |
| | | Spring Break | |
| Week 8 | Mar 15 | Natural language processing | Slides |
| Week 9 | Mar 22 | Data visualization with R: ggplot2 | Slides |
| Week 10 | Mar 29 | Interactive maps with R Leaflet | Slides |
| Week 11 | Apr 5 | Shiny 1 | |
| Week 12 | Apr 12 | Shiny 2 | |
| Week 13 | Apr 19 | Shiny 3 | |
| Week 14 | Apr 26 | In-class work on final projects | |
