Class Times: Mondays 6:00pm - 9:00pm
Class Location: GDC 1.304
Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office hours: After class or by appointment in GDC 1.304
TA: Nishtha Aggarwal
Email: nishthaaggarwal41197 at gmail dot com
Office hours: Tuesday and Thursday from 10:00am - 11:00am at TA Stations, GDC basement
TA: Mao Wang
Email: maohua dot ethan dot wang at utexas dot edu
Office hours: Wednesday from 5:00pm - 6:00pm and Friday from 1:00pm - 2:00pm at TA Station 4, GDC basement
This course is designed to give students a practical understanding of databases and data systems. The goal is to learn modern data management and data processing techniques through a mix of best practices, experimentation, and problem solving.
The content of the course is organized into three broad areas: 1) SQL with an emphasis on the standard features of the language; 2) data models from relational to hierarchical; and 3) dataflow programming and workflow orchestration for processing data at-scale.
We will first construct a analytics database from single dataset of structured data. We will then expand the scope of the database with additional datasets from multiple sources. We will gradually enrich, cleanse, and transform the data to create a unified view of the data across multiple datasets. This work will be implemented on Google Cloud Platform using BigQuery, Apache Beam/Dataflow, MongoDB, Apache Airflow/Composer, and Data Studio.
Below are some of the topics we will be covering during the term:
- inserts, updates, deletes
- referential integrity
- repeated fields
- data integration
- workflow orchestration
The course assumes a programming background and in particular, a solid working knowledge of Python scripting. As such, the prerequisites for this course are CS 303E, CS 307 or the equivalent. Familiarity with SQL is also helpful, but not required.
There are two required texts for this course:
- Wilfried Lemahieu et. al, Principles of Database Management, First Edition, 2018. Available as an ebook from Amazon.
- Clare Churcher, Beginning SQL Queries: From Notice to Professional, Second Edition, 2016. Available as an ebook from the UT Library.
The course requires consulting the GCP product documentation on BigQuery, Apache Beam/Dataflow, Apache Airflow/Composer, and Data Studio. The documentation is updated frequently and will be reviewed on a regular basis throughout the course.
There will be weekly quizzes based on assigned readings. The quizzes will have 5 multiple-choice questions and they will be taken at the start of class with your iClicker. iClickers are available as a remote and mobile app. Choose one of these two options (not both) and register your device with this course.
The most important component of this course is the term project. The project will be to construct a data warehouse that brings together previously siloed datasets with the goal of deriving some interesting insights from the combined data. It will tie together many of the concepts you will learn throughout this course, such as data modeling, transformations, visualization, and orchestration. It is an ambigious project that will take up the entire term. The project will be carried out in teams of two students. You will form groups at the start of the term and work with the same person on all the milestones. There are 12 milestones overall. More details on the miletones are provided in the Week-by-Week section below.
There is only one exam for this course which is the midterm. This exam is comprehensive and will cover all the material to-date, including SQL, data modeling, and dataflow programming. It is a closed-book exam and will be done entirely in class. Unfortunately, no make-up exam can be offered due to limited resources.
This class will be taught using "active learning" techniques. My goal is to spend the majority of class time clarifying difficult concepts and actively working through problems rather than delivering a traditional lecture. You will need to bring a laptop to class and sit with your partner to able to work together on practice problems. Participation questions will be answered with your iClicker and they will be based on these exercises.
The basic grading rubric is comprised of the four components listed below: