Instructor Notes

Readings, Texts, and References

For the course as a whole, we recommend the following books:

Data Science from Scratch: First Principles with Python, 2nd ed, by Joel Grus, published by O'Reilly.
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed, by Wes McKinney, published by O'Reilly.

We also recommend Towards Data Science as an online resource on data science topics.

Courses Using OpenDS4All Materials

The University of Pennsylvania's CIS 545, Big Data Analytics, www.cis.upenn.edu/~cis545

Background Material

Students may find the following resources to be useful as background:

Google's Python class (free): https://developers.google.com/edu/python
Harvard Online Learning course on probability and statistics: https://online-learning.harvard.edu/course/introduction-probability-edx

Suggested Configuration of Modules

The OpenDS4All modules can be "mixed and matched" at the discretion of the instructor, according to preferences, time constraints, and the target audience. However, certain elements do have dependencies. We suggest a "core" outline as follows:

Overview, 1.5 lecture hours (basic)
- Optional recitation: review of Python basics, including data structures
Acquiring, wrangling, integrating, and cleaning data, 3-4 lecture hours (basic-intermediate)
- Optional recitation: basics of HTML and the Document Object Model
- Optional recitation: basics of regular expressions (often used for pattern matching) and XPath (which builds on some ideas from regular expressions and traverses XML trees)
Modeling data: types, graphs, schemas, 2-4 lecture hours
- Optional recitation: encoding tree- or graph-structured data in relations, and traversing the data
Performance:
- Foundations: Computer architecture basics, 1 hour (basic, provides an overview of CPU and memory)
- Efficient data processing, 3-7 lecture hours (intermediate, appropriate for a more computational and big data audience)
- Optional recitation: use the merge and merge_map algorithms from the lecture notebook to compare the performance of alternative strategies, and use %%time and SQLite to study the performance of database indices.
Building machine learning models
- Overview and Unsupervised Models, 1 lecture hour, basic.
- Supervised Models, Decision Trees, Random Forests, 1-1.5 lecture hours, basic.
- Linear and Logistic Regression, 1-1.5 lecture hours, basic.
- Neural Networks, 2-4 lecture hours, intermediate (builds upon linear and logistic regression; requires an understanding of calculus).
Validating and tuning models, 1.5-3 hours, basic
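
For the optional performance recitation in the outline above, the following is a minimal sketch of how an instructor might compare lookups with and without a SQLite index. It is an illustration only, not taken from the OpenDS4All notebooks: the `readings` table, the `idx_sensor` index, the synthetic data, and the helper names are all hypothetical. It uses `time.perf_counter` so that it runs as a plain script; in a notebook, the `%%time` cell magic could be used instead.

```python
import random
import sqlite3
import time


def build_table(n_rows=100_000, n_sensors=1_000, seed=0):
    """Create an in-memory table of synthetic (sensor_id, value) rows."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        ((rng.randrange(n_sensors), rng.random()) for _ in range(n_rows)),
    )
    conn.commit()
    return conn


def timed_lookup(conn, sensor_id, repeats=20):
    """Return (row_count, seconds) for repeated lookups of one sensor_id."""
    query = "SELECT COUNT(*) FROM readings WHERE sensor_id = ?"
    start = time.perf_counter()
    for _ in range(repeats):
        (count,) = conn.execute(query, (sensor_id,)).fetchone()
    return count, time.perf_counter() - start


def query_plan(conn, sensor_id):
    """Return SQLite's description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM readings WHERE sensor_id = ?",
        (sensor_id,),
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


conn = build_table()

# First measurement: no index, so SQLite must scan the whole table.
count_scan, t_scan = timed_lookup(conn, 42)
plan_scan = query_plan(conn, 42)

# Second measurement: same query after adding an index on sensor_id.
conn.execute("CREATE INDEX idx_sensor ON readings (sensor_id)")
count_index, t_index = timed_lookup(conn, 42)
plan_index = query_plan(conn, 42)

print(f"without index: {t_scan:.4f}s  plan: {plan_scan}")
print(f"with index:    {t_index:.4f}s  plan: {plan_index}")
```

Timings vary by machine, so students may find it more instructive to compare the two query plans and to repeat the measurement at several table sizes.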

Additional and advanced topics:

Data ethics, 1-2 hours, basic, most appropriately covered after a discussion of machine learning models.
Data exploration and visualization, 1-2 hours, basic.
Big data and the cloud, 3-5 hours, intermediate. Most appropriate after a discussion of performance.
