Syllabus for CS 327E Elements of Databases - Fall 2019

Class Times: Mondays 6:00pm - 9:00pm
Class Location: GDC 1.304

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office hours: After class or by appointment in GDC 1.304

TA: Mao Wang
Email: maohua dot ethan dot wang at utexas dot edu
Office hours: Wednesday from 5:00pm - 6:00pm and Friday from 1:00pm - 2:00pm at TA Station 4, GDC basement

TA: Levi Villarreal
Email: villarreallevi at gmail dot com
Office hours: Tuesday from 3:00pm - 4:00pm and Thursday from 2:00pm - 3:00pm at 3rd floor GDC atrium

Course Description:
This course is designed to give students a practical understanding of databases and data systems. The goal is to learn modern data management and data processing techniques through a mix of best practices, experimentation, and problem solving.

The content of the course is organized into three broad areas: 1) SQL with an emphasis on the standard features of the language; 2) data models from relational to hierarchical; and 3) dataflow programming and workflow orchestration for processing data at-scale.

We will first construct a analytics database from single dataset of structured data. We will then expand the scope of the database with additional datasets from multiple sources. We will gradually enrich, cleanse, and transform the data to create a unified view of the data across multiple datasets. This work will be implemented on Google Cloud Platform using a variety of data science tools: BigQuery, Apache Beam and Dataflow, Apache Airflow and Composer, Jupyter Notebooks, and Data Studio.

Below are some of the topics we will be covering during the term:

SQL:
- select-from-where
- order-bys
- joins
- inserts, updates, deletes
- aggregates
- group-bys
- subqueries
- functions

Data Modeling:
- referential integrity
- normalization
- hierarchies
- repeated fields
- views
- partitioning

Dataflow Programming:
- ingestion
- cleansing
- transformation
- data integration
- workflow orchestration

Prerequisites:
The course assumes a programming background and in particular, a solid working knowledge of Python scripting. As such, the prerequisites for this course are CS 303E, CS 307 or the equivalent. Familiarity with SQL is also helpful, but not required.

Textbooks:
There are two required texts for this course:
- Wilfried Lemahieu et. al, Principles of Database Management, First Edition, 2018. Available as an ebook from Amazon.
- Clare Churcher, Beginning SQL Queries: From Notice to Professional, Second Edition, 2016. Available as an ebook from the UT Library.

Supplemental Readings:
The course requires consulting the GCP product documentation on BigQuery, Apache Beam/Dataflow, Apache Airflow/Composer, and Data Studio. The documentation is updated frequently and will be reviewed on a regular basis throughout the course.

Quizzes:
There will be weekly quizzes based on assigned readings. The quizzes are closed notes. They will have 5 multiple-choice questions and they will be taken at the start of class with your iClicker. iClickers are available as a remote and mobile app. Choose one of these two options (not both) and register your device with this course.

Project:
The most important component of this course is the term project. The project will be to construct a data warehouse that brings together previously siloed datasets with the goal of deriving some interesting insights from the combined data. It will tie together many of the concepts you will learn throughout this course, such as data modeling, transformations, visualization, and orchestration. It is an ambigious project that will take up the entire term. The project will be carried out in teams of two students. You will form groups at the start of the term and work with the same person on all the milestones. There are 12 milestones overall. More details on the miletones are provided in the Week-by-Week section below.

Exam:
There is only one exam for this course which is the midterm. This exam is comprehensive and will cover all the material to-date, including SQL, data modeling, and dataflow programming. It is a closed-book exam and will be done entirely in class. Unfortunately, no make-up exam can be offered due to limited resources.

Participation:
This class will be taught using "active learning" techniques. My goal is to spend the majority of class time clarifying difficult concepts and actively working through problems rather than delivering a traditional lecture. You will need to bring a laptop to class and sit with your partner to able to work together on practice problems. Participation questions will be answered with your iClicker and they will be based on these exercises.

Grading Rubric:
The basic grading rubric is comprised of the four components listed below:

Note: The final grade will use the plus/minus grading system.

Late Submission Policy:
There is a 10% reduction in the grade per day. This applies to all project submissions throughout the term.

Tooling:
- Google Cloud Platform for practice problems and project work.
- GitHub for code repository, version control, and how-to guides.
- Lucidchart for diagramming.
- Piazza for asynchronous communication (announcements, questions, discussions).
- Canvas for grade reporting.

Academic Integrity:
This course will abide by UTCS' code of academic integrity.

Students with Disabilities:
Students with disabilities may request appropriate academic accommodations.

Week-by-Week Schedule:
Below is a week-by-week schedule that includes the important milestones and assigned readings:
Acknowledgments:
This course is generously supported by Google by giving us access to their Cloud Platform.