CS329e Elements of Data Analytics

Instructor: Dr. Philip E. Cannata, cannata@cs.utexas.edu
Office hours: After class in the classroom or by appointment

TA: TBD

Class Times: MW 4:00 - 5:30 p.m.
Class Location: PHR 2.108

Class Website: http://www.cs.utexas.edu/~cannata/analytics/

Course Description:

This course teaches essential decision-making skills for scientists and business people using the rapidly growing technologies known as predictive analytics, data mining, and machine learning. These technologies focus on finding patterns in large data sets that might otherwise go undetected and then using these patterns for modeling and prediction. Predictive analytics, data mining, and machine learning are especially important today because of advances in computational power and an explosion in the quantity of available data. Students will become familiar with data.world, statistical learning packages in R, and similar capabilities in the Python scikit-learn package. Time permitting, students will also be introduced to "Deep Learning" using the Python TensorFlow package.

Working with data.world gives students the opportunity to collaborate around their specific projects, take advantage of the wealth of data readily available on the data.world platform, and learn how to store their data in a space available to them long after their time with the University.

R is the most popular free software environment for statistical computing. It supports all of the statistical learning methods taught in this course, and it is the environment of choice for research statisticians, which means that it is at the cutting edge with respect to new statistical learning methods.

The Python scikit-learn package is also a popular free software library for machine learning. It provides a single, clean, and consistent interface to many different statistical methods. It offers many options for each method but also tries to choose sensible defaults. It also tries to help users understand the models as well as how to use them properly. Like R, it is under active development, in its case by the Python machine learning community.

Deep learning libraries like Google TensorFlow are also gaining in popularity. "Deep Learning" is mainly concerned with training and using large neural networks. Neural networks are loosely patterned after the operation of neurons in the human brain.

Prerequisites:

Required Texts:

Recommended Texts:

Grading:

Plus and minus grades will not be used for final course grades.

Grades will be calculated as follows:

  • Attendance and Active Class Participation: 60 points (30 points for Attendance and 30 points for Active Class Participation).
  • Top Hat will be used during class to take attendance and to facilitate and grade Active Class Participation.
  • The Extenuating Circumstances section below does not apply to your absences or your class participation grade unless you have a well-documented medical or family emergency or a religious obligation.
  • I have noticed over many semesters that class attendance is strongly correlated with getting a good grade in my classes. To encourage class attendance, the 30 points for Attendance will be assigned as follows:
            Absences      Attendance points
            0             30
            1             25
            2             15
            3 or more     0
  • There will be 8 quizzes over the course of the semester, each worth 10 points for a total of 80 points.
  • There will be 4 lab projects over the course of the semester, each worth 20 points for a total of 80 points. Students will work in groups of 3 or 4 on these labs. Groups will be expected to complete the lab projects on their own, using the provided code when given. Class examples and examples from the web can be used, but these must be documented. To be clear, each group must do a significant amount of the work on the lab projects by itself.
  • Each group will have to complete a separate final project which will be worth 55 points.

Grading will be on a straight scale as follows:

            A =      100 - 90%    (248-275 pts)
            B =      89 - 80%      (220-247 pts)
            C =      79 - 70%      (193-219 pts)
            D =      69 - 60%      (165-192 pts)
            F =      < 60%         (below 165 pts)
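
As a rough illustration of the arithmetic above, the following Python sketch adds up the point categories and maps the total to a letter grade. The function and variable names are hypothetical and chosen only for this example, and the sketch assumes whole-number point totals; only the point values and cutoffs come from the lists and tables above.

```python
# Illustrative sketch only: names are hypothetical, values come from the syllabus.

# Attendance points by number of absences; 3 or more absences earn 0 points.
ATTENDANCE_POINTS = {0: 30, 1: 25, 2: 15}

# Attendance + participation + 8 quizzes + 4 labs + final project.
MAX_POINTS = 30 + 30 + 8 * 10 + 4 * 20 + 55  # 275

# Lowest point total that earns each letter grade (assumes whole-number totals).
GRADE_CUTOFFS = [(248, "A"), (220, "B"), (193, "C"), (165, "D")]


def attendance_points(absences):
    """Return the attendance points earned for a given number of absences."""
    if absences < 0:
        raise ValueError("absences must be non-negative")
    return ATTENDANCE_POINTS.get(absences, 0)


def total_points(absences, participation, quizzes, labs, final_project):
    """Add up all point categories for one student."""
    return (
        attendance_points(absences)
        + participation
        + sum(quizzes)
        + sum(labs)
        + final_project
    )


def letter_grade(points):
    """Map a point total to a letter grade on the straight scale."""
    for cutoff, letter in GRADE_CUTOFFS:
        if points >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical student: 1 absence, 28 participation points,
    # 8 quiz scores, 4 lab scores, and 50 points on the final project.
    points = total_points(
        absences=1,
        participation=28,
        quizzes=[9, 8, 10, 7, 9, 10, 8, 9],
        labs=[18, 19, 17, 20],
        final_project=50,
    )
    print(points, letter_grade(points))  # prints: 247 B
```

In this made-up example a single absence costs 5 attendance points, which leaves the student at 247 points, one point short of the 248 needed for an A.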

Attendance and Active Class Participation:

If you miss something in class, you need to ask questions right then. You should practice what I teach in class as soon as possible after class, and then, if you have problems, come to office hours, stay after class, and/or post on Piazza. If you miss a class, it is your responsibility to catch up as quickly as possible. Procrastination is a killer in this class.

The way to improve your listening skills is to practice "active listening." This is where you make a conscious effort to hear not only the words that another person is saying but, more importantly, try to understand the complete message being sent.

In order to do this you must pay attention to the other person very carefully.

You cannot allow yourself to become distracted by whatever else may be going on around you, or by forming counterarguments that you'll make when the other person stops speaking. Nor can you allow yourself to get bored and lose focus on what the other person is saying. All of these contribute to a lack of listening and understanding.

Extenuating Circumstances:

If you encounter an unexpected medical or family emergency or a random act of nature, or if you have difficulty meeting the requirements of this course, fail to complete a project, or miss a quiz because of extenuating circumstances, please advise Dr. Cannata in writing (not email) during the week of Final Project presentations so that special consideration may be given. A file of all written correspondence will be kept by Dr. Cannata, and decisions regarding it will be made at the end of the semester after the initial final grades have been calculated.

Please note: the University does not consider a job interview as a valid reason for missing class.

Students with disabilities:

Students with disabilities may request appropriate academic accommodations from the Division of Diversity and Community Engagement, Services for Students with Disabilities, 512-471-6259, http://diversity.utexas.edu/disability/

Course Topics:

Each week lists its subject, its readings (best done prior to class), and any projects and quizzes.

Week 1
Subject: Class introduction and getting started with Statistical Learning
Readings: Statistical Learning - Chapter 1; Chapter 1 Videos

Week 2
Subject: Introduction to Statistical Learning
Readings: Statistical Learning - Chapter 2; Chapter 2 Videos

Week 3
Subject: Linear Regression
Readings:
  • Statistical Learning - Chapter 3; Chapter 3 Videos
  • R for Data Science - Chapter 18 (Model Basics with modelr), page 345 (Introduction) through page 358 (end of Residuals)

Week 4
Subject: Residual Analysis and Dichotomous Classification using Logistic Regression
Readings:
  • R for Data Science - Chapter 19 (Model Building), page 375 (Introduction) through page 384 (end of "A More Complicated Model")
  • Statistical Learning - Chapter 4 (pages 127 - 137), the first 3 Chapter 4 Videos, and the Lab: Logistic Regression video
Projects and Quizzes: Project 1

Week 5
Subject: Classification using Linear Discriminant Analysis
Readings: Statistical Learning - Chapter 4; Chapter 4 Videos
Projects and Quizzes: Quiz 1

Week 6
Subject:
  • The tidyverse dplyr package, ROC Curves, Other forms of Discriminant Analysis, Model Comparison
  • Communicate with R Markdown Interactive Documents
Readings:
  • Data Transformation Cheat Sheet at RStudio Cheat Sheets
  • ROC Curve, Model Comparison - Statistical Learning - Chapter 4 (pages 147 - 154); Chapter 4 Videos
  • R Markdown; R Markdown Cheat Sheet
  • Interactive Document
Projects and Quizzes: Quiz 2

Week 7
Subject: Resampling Methods
Readings: Statistical Learning - Chapter 5; Chapter 5 Videos
Projects and Quizzes: Project 2

Week 8
Subject: Linear Model Selection and Regularization
Readings: Statistical Learning - Chapter 6 (Subset Selection and Shrinkage Methods); Chapter 6 Videos (Subset Selection and Shrinkage Methods)
Projects and Quizzes: Quiz 3

Week 9
Subject: SQL and Joining Data Tables with Census Data Tables; Principal Components Analysis (PCA)
Readings: Statistical Learning - Chapters 6 (PCA) and 10 (PCA); Chapters 6 (PCA) and 10 (PCA) Videos; Scikit-Learn Chapter 8
Projects and Quizzes: Project 3

Week 10
Subject: Tree-Based Methods, and Moving Beyond Linearity
Readings: Statistical Learning - Chapters 8 and 7; Chapters 8 and 7 Videos; Scikit-Learn Chapters 6 and 7
Projects and Quizzes: Quiz 4

Week 11
Subject: Support Vector Machines and Unsupervised Learning
Readings: Statistical Learning - Chapters 9 and 10; Chapters 9 and 10 Videos; Scikit-Learn Chapter 5
Projects and Quizzes: Quiz 5

Week 12
Subject: Introduction to Neural Networks, Deep Learning, and TensorFlow
Readings: Scikit-Learn Part II - Chapters 9 and 10
Projects and Quizzes: Project 4

Week 13
Subject: Statistical Learning Recap; Training DNNs
Readings: Scikit-Learn Part II - Chapter 11
Projects and Quizzes: Quiz 6

Week 14
Subject: Convolutional Neural Networks (CNNs)
Readings: Scikit-Learn Part II - Chapter 13
Projects and Quizzes: Quiz 7

Week 15
Subject: Final Project Reviews; Selected Project Presentations and Wrap-up
Projects and Quizzes: Quiz 8