Syllabus for CS 327E Elements of Databases - Spring 2018

Class Times: Mondays 6:00pm - 9:00pm
Class Location: CLA 1.106

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office hours: After class or by appointment in CLA 1.106

TA: Prithvi Chowhan
Email: chowhan at utexas dot edu
Office hours: Tuesdays 3:00pm - 4:00pm and Fridays 1:00pm - 2:00pm in GDC basement TA stations

TA: William Chia
Email: william dot chia at utexas dot edu
Office hours: Wednesdays 1:00pm - 2:00pm and Thursdays 4:00pm - 5:00pm in GDC basement TA stations

Course Description:
This course is designed to give students a practical understanding of database and big data technologies. The focus is on designing and building databases and data processing pipelines, rather than on the theory behind them.

The content of the course is organized into three broad topics: 1) SQL with an emphasis on the standard features of the language; 2) database design with a focus on ER modeling; and 3) ETL programming and building data processing pipelines. The course will also cover such topics as master data management, data integration, and cloud computing.

In the first half of the term, we will use Postgres as our database system. I have chosen Postgres because it is both a popular open-source database system and easy to learn. In the second half, we will transition to Dremel and construct our data warehouse on top of BigQuery.

There will be 10 labs, worth 50% altogether. The labs will build on each other and lead up to the project demo and report, worth 10% combined. Through the labs, students will develop a data pipeline that contains the following steps: collect structured data from multiple sources including files and Web APIs, load the data into a data warehouse, explore the data using SQL queries, normalize and clean up the data, design and populate a dimensional model, create derived tables for reporting, and analyze and visualize the transformed data. The system will be built on Google Cloud using the following services: Cloud SQL for Postgres, Data Studio, BigQuery, and Cloud Dataflow.
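
The pipeline steps above can be previewed in miniature with plain Python. The sketch below uses the standard-library sqlite3 module as a stand-in for Cloud SQL and BigQuery, and all table and column names are hypothetical, not part of any lab:

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a small CSV collected from one data source.
raw_csv = "student,course,grade\nalice,CS327E,92\nbob,CS327E,85\nalice,CS303E,88\n"

# Load step: ingest the raw rows into a staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (student TEXT, course TEXT, grade INTEGER)")
rows = list(csv.DictReader(io.StringIO(raw_csv)))
conn.executemany("INSERT INTO staging VALUES (:student, :course, :grade)", rows)

# Transform step: create a derived table for reporting with a SQL aggregate.
conn.execute(
    """CREATE TABLE report AS
       SELECT student, AVG(grade) AS avg_grade
       FROM staging GROUP BY student"""
)

# Explore step: query the derived table with SQL.
for student, avg_grade in conn.execute(
    "SELECT student, avg_grade FROM report ORDER BY student"
):
    print(student, avg_grade)
```

In the labs, the same extract-load-transform shape is built with managed cloud services instead of an in-memory database.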

Students will work in pairs on the labs. They will choose a project partner at the start of the term and work with the same partner throughout. At the end of the term, each group will demo their project and write up a report that covers the technical details of their system and some lessons learned in the process.

The course will be taught using active learning techniques as much as possible. The goal is to spend the majority of class time clarifying difficult concepts and actively working through challenge problems and lab assignments rather than listening to a traditional lecture. In order for this to happen, students must learn the basic concepts before coming to class by doing the assigned readings and exercises. As an extra incentive for keeping up with the homework, there will be a short quiz at the start of class based on the assigned readings for that day. In addition, students will earn points for participating in class exercises.

Below are some of the topics we will be covering during the term:

SQL:
- select-from-where
- order-bys
- joins
- inserts, updates, deletes
- aggregates and group-bys
- subqueries
- views
- stored procedures and functions
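
Several of the constructs above can be tried out before class with the standard-library sqlite3 module as a lightweight stand-in for Postgres; the schema and data here are illustrative only:

```python
import sqlite3

# Hypothetical two-table schema to exercise joins, aggregates, and views.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT,
                      salary INTEGER, dept_id INTEGER REFERENCES dept);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'sales');
    INSERT INTO emp VALUES (1, 'ann', 100, 1), (2, 'bo', 80, 1),
                           (3, 'cy', 90, 2);
    """
)

# Join + aggregate + group-by: average salary per department.
avg_by_dept = conn.execute(
    """SELECT d.name, AVG(e.salary)
       FROM emp e JOIN dept d ON e.dept_id = d.dept_id
       GROUP BY d.name ORDER BY d.name"""
).fetchall()

# Subquery: employees paid above the overall average.
above_avg = conn.execute(
    "SELECT name FROM emp WHERE salary > (SELECT AVG(salary) FROM emp)"
).fetchall()

# View: a reusable named query.
conn.execute("CREATE VIEW eng_emps AS SELECT name FROM emp WHERE dept_id = 1")
```

Stored procedures and functions are Postgres-specific and have no sqlite3 equivalent, so those will wait for the class database.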

Database Design:
- ER modeling
- hierarchies
- scalar and complex types
- normalization and denormalization
- materialized views
- indexing

Data Engineering:
- dataflow programming
- data extraction
- data cleansing
- data transformations
- data ingestion
- data integration
- data visualization
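
The dataflow style underlying several of these topics can be sketched in plain Python with chained generator stages; the stage names below are illustrative, not part of any course API:

```python
# Each stage consumes an iterable and yields records, so stages chain
# together like a small dataflow pipeline.
def extract(lines):
    # Parse raw comma-separated lines into records.
    for line in lines:
        name, value = line.split(",")
        yield {"name": name.strip(), "value": value.strip()}

def cleanse(records):
    # Drop records whose value is not numeric.
    for rec in records:
        if rec["value"].isdigit():
            yield rec

def transform(records):
    # Convert values to integers and normalize the names.
    for rec in records:
        yield {"name": rec["name"].upper(), "value": int(rec["value"])}

raw = ["ann, 10", "bo, oops", "cy, 7"]
results = list(transform(cleanse(extract(raw))))
print(results)  # the malformed 'oops' record is filtered out
```

Cloud Dataflow applies the same idea, composing named transforms into a pipeline, at a much larger scale.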

Prerequisites:
The course assumes prior knowledge of and expertise in Python. As such, the prerequisite for the course is CS 303E, CS 307, or the equivalent. Familiarity with SQL is helpful but not required.

Textbook:
There is one required text for this class:
- Ramez Elmasri and Shamkant Navathe's Fundamentals of Database Systems, 7th edition, 2015.

Extra Readings:
There are some extra readings drawn from papers, blog posts, and technical manuals, all available online.
- Jeff Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
- Sergey Melnik et al., Dremel: Interactive Analysis of Web-Scale Datasets. VLDB, 2010.
- Tyler Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. VLDB, 2015.
- Google Cloud Platform, Cloud SQL for Postgres, product documentation.
- Google Cloud Platform, Data Studio Tutorials, product documentation.
- Google Cloud Platform, BigQuery Concepts, product documentation.
- Google Cloud Platform, Dataflow Concepts, product documentation.

Clickers:
We will be using clickers for reading quizzes and class participation. Students will need to have their own clickers and bring them to class each time. Either the newer i>clicker 2 or the older i>clicker model can be used. The University Co-op carries the newer i>clicker 2 model. Remember to register your clicker for this class through Canvas (click on the i>Clicker link from the left navigation menu and follow the instructions).

Grading Rubric:
The basic grading rubric comprises the six components listed below:

Note: There is no final exam and the final grade will use a plus/minus grading system.

Late Submission Policy:
10% reduction in the grade per day.

Resources:
- Google Cloud Platform for exercises, practice problems, and lab projects.
- GitHub for code repository, version control, and how-to guides.
- Lucidchart for diagramming.
- Piazza for asynchronous communication (announcements, questions, discussions).
- Canvas for grade reporting.

Academic Integrity:
This course will abide by UTCS' code of academic integrity.

Students with Disabilities:
Students with disabilities may request appropriate academic accommodations.

Week-by-Week Schedule:
Below is a week-by-week schedule that includes the important milestones and assigned readings:

Acknowledgments:
This course is generously supported by Google, which has given us access to its Cloud Platform.