For the overall course, we recommend the following books:
- Data Science from Scratch: First Principles with Python, 2nd ed, by Joel Grus, published by O'Reilly.
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed, by Wes McKinney, published by O'Reilly.
Additionally, we recommend Towards Data Science as a useful resource for this space.
- The University of Pennsylvania's CIS 545, Big Data Analytics, www.cis.upenn.edu/~cis545
Students may find the following resources to be useful as background:
- Google's Python class (free): https://developers.google.com/edu/python
- Harvard Online learning course on probability and statistics, https://online-learning.harvard.edu/course/introduction-probability-edx
The OpenDS4All modules can be "mixed and matched" at the discretion of the instructor, according to preferences, time constraints, and the target audience. However, certain elements do have dependencies. We suggest a "core" outline as follows:
- Overview, 1.5 lecture hours (basic)
  - Optional recitation: review of Python basics, including data structures
- Acquiring, wrangling, integrating, and cleaning data, 3-4 lecture hours (basic-intermediate)
  - Optional recitation: basics of HTML and the Document Object Model
  - Optional recitation: basics of regular expressions (often used for pattern matching) and XPath (which builds on some ideas from regular expressions and traverses XML trees)
- Modeling data: types, graphs, schemas, 2-4 lecture hours
  - Optional recitation: encoding tree- or graph-structured data in relations, and traversing the data
- Performance:
  - Foundations: Computer architecture basics, 1 hour (basic, provides an overview of CPU and memory)
  - Efficient data processing, 3-7 lecture hours (intermediate, appropriate for a more computational and big data audience)
    - Optional recitation: Use the `merge` and `merge_map` algorithms from the Lecture Notebook to study the performance of alternative strategies. Use `%%time` and SQLite to study the performance of database indices.
- Building machine learning models
  - Overview and Unsupervised Models, 1 lecture hour, basic.
  - Supervised Models, Decision Trees, Random Forests, 1-1.5 lecture hours, basic.
  - Linear and Logistic Regression, 1-1.5 lecture hours, basic.
  - Neural Networks, builds upon linear and logistic regression, 2-4 lecture hours, intermediate [requires understanding of calculus].
- Validating and tuning models, 1.5-3 hours, basic
Additional and advanced topics:
- Data ethics, 1-2 hours, basic, most appropriately covered after a discussion of machine learning models.
- Data exploration and visualization, 1-2 hours, basic.
- Big data and the cloud, 3-5 hours, intermediate. Most appropriate after a discussion of performance.
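
For instructors planning the optional performance recitation on database indices, the following is a minimal sketch of the kind of experiment it describes. It is not taken from the Lecture Notebook. The table `events`, the index `idx_events_user`, and the helpers `time_lookups` and `query_plan` are hypothetical names chosen for illustration. It uses `time.perf_counter` so that it runs outside a notebook; in Jupyter, the same comparison could be made by placing each batch of lookups in its own cell under `%%time`.

```python
import sqlite3
import time


def time_lookups(conn, keys):
    """Return the seconds spent looking up each key in the (hypothetical) events table."""
    start = time.perf_counter()
    for key in keys:
        conn.execute(
            "SELECT payload FROM events WHERE user_id = ?", (key,)
        ).fetchall()
    return time.perf_counter() - start


def query_plan(conn):
    """Return SQLite's description of how it would run the lookup query."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT payload FROM events WHERE user_id = ?", (1,)
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(str(row[-1]) for row in rows)


# Build an in-memory table with synthetic rows (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i % 5000, f"row {i}") for i in range(50_000)),
)
conn.commit()

keys = list(range(0, 5000, 50))  # 100 lookups

# Without an index, each lookup scans the whole table.
plan_before = query_plan(conn)
t_scan = time_lookups(conn, keys)

# With an index on user_id, each lookup becomes an index search.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn)
t_index = time_lookups(conn, keys)

print(f"without index: {t_scan:.4f}s  plan: {plan_before}")
print(f"with index:    {t_index:.4f}s  plan: {plan_after}")
```

Exact timings depend on the machine and the SQLite version, so the sketch prints the query plan alongside each measurement; the plan shows whether the index was actually used, independent of timing noise.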