CS 327E Elements of Databases

Syllabus for CS 327E Elements of Databases - Fall 2017

Class Times: Mondays and Wednesdays 6:30pm - 8:00pm
Class Location: CLA 1.104

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office hours: After class or by appointment in CLA 1.104

TA: Kartik Sathyanarayanan
Email: kartik at cs dot utexas dot edu
Office hours: Thursdays 3:30pm - 5:30pm in GDC basement TA station 1.302

TA: Sean Wang
Email: seanyusa at utexas dot edu
Office hours: Wednesdays and Fridays 9:00am - 10:30am in GDC basement TA stations

Course Description:
This course is designed to give students a practical understanding of database and big data technologies. The focus is on designing and building databases and data pipelines, rather than on the theory behind them.

The content of the course is organized into 3 broad topics: 1) SQL with an emphasis on the standard features of the language; 2) schema design with a focus on ER modeling; and 3) data engineering, namely ETL programming and building an end-to-end data processing pipeline. The course will also cover such topics as master data management, data integration, and cloud computing.

During the term we will use Postgres for in-class demos, lab projects and the final project. I have chosen Postgres because it is a popular and scalable open-source relational database system and is also easy to learn.

There will be 3 lab projects, each worth 10%. The projects will build on each other as well as lead up to the final project, worth 30%. Through the lab projects and final project, students will develop a complex data pipeline that contains the following steps: extracts structured and semi-structured data from multiple sources including files and Web APIs, explores the data using a query engine, transforms the data for consistency and analysis, loads the data into a normalized relational database, analyzes the data with SQL, and visualizes the data with simple charts, such as pie charts and histograms. The pipeline will be built on AWS with the following software stack: SQL, Presto, Spark, Python, Postgres, QuickSight.

Students will work in pairs on the lab projects and final project. The code for all deliverables will be kept under version control in GitHub. At the end of the term, each group will submit a technical report that describes their project and lessons learned from the various milestones. Students will choose their lab partner at the start of the term and work with the same partner throughout.

The course will be taught using active learning techniques as much as possible. The goal is to spend the majority of class time clarifying difficult concepts and actively working through challenge problems, lab projects, and the final project rather than listening to a traditional lecture. In order for this to happen, students must learn the basic concepts before coming to class by doing the assigned readings and exercises. As an extra incentive for keeping up with the homework, there will be a short quiz at the start of class based on the assigned readings for that day. In addition, students will earn points for participating in class exercises.

Below is a list of some of the topics we will be covering during the term:

SQL:
- select-from-where
- order-bys
- joins
- inserts, updates, deletes
- aggregates and group-bys
- subqueries
- views
- stored procedures and functions

Database Design:
- ER modeling
- hierarchies
- scalar and complex types
- normalization and denormalization
- materialized views
- indexing

Data Engineering:
- data extraction
- data cleansing
- data transformation
- data ingestion
- data integration
- data visualization

Prerequisites:
The course assumes prior knowledge in Python programming. As such, the prerequisite for the course is CS 303E, CS 307 or the equivalent. Familiarity with SQL is helpful, but not required.

Textbooks:
There are three required texts for this class, of which one is available as an ebook through the UT Library:
- Jan Harrington's Relational Database Design and Implementation, 4th edition, 2016. Available as an ebook through the UT Library.
- Regina Obe and Leo Hsu's PostgreSQL: Up and Running: A Practical Guide to the Advanced Open Source Database, 2nd edition, 2014.
- Frank Kane's Taming Big Data with Apache Spark and Python, 1st edition, 2017.

Extra Readings:
There are some readings based on papers, blog posts, and manuals. Those are all available online.
- Jeff Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
- Michael Armbrust et. al, Spark SQL: Relational Data Processing in Spark. SIGMOD, 2015.
- Amazon, Athena User Guide, August 4, 2017.
- Amazon, QuickSight User Guide, 2017.
- Amazon, EMR Developer Guide, 2017.

Clickers:
We will be using clickers for reading quizzes and class participation. Students will need to have their own clickers and bring them to class each time. Either the newer i>clicker 2 or the older i>clicker model can be used. The University Co-op carries the newer i>clicker 2 model. Remember to register your clicker with this class through Canvas (click on the i>Clicker link from the left navigation menu and follow the instructions).

Grading Rubric:
The basic grading rubric is comprised of the five components listed below:

Quizzes 10%
Class Participation 10%
Labs 30%
Midterm Exam 20%
Final Project 30%

Note: There is no final exam and the final grade will use a plus/minus grading system.

Late Submission Policy:
10% reduction in the grade per day.

Resources:
- Amazon Web Services for exercises, lab projects, and final project.
- GitHub for code repository and version control.
- Lucidchart for diagramming.
- Piazza for class discussions.
- Canvas for grade reporting.

Academic Integrity:
This course will abide by UTCS' code of academic integrity.

Students with Disabilities:
Students with disabilities may request appropriate academic accommodations.

Week-by-Week Schedule:
Below is a week-by-week schedule that includes the important milestones and assigned readings:

08/30: Course Overview.
09/04: Labor Day. No class.
09/06: Relational Data Model. Readings before class: Chapter 5 from Relational Database text and Chapter 1 from Postgres text. Slides Notes
09/11: Simple SQL. Readings before class: Chapter 16 from Relational Database text and Chapter 5 from Postgres text. Slides Notes
09/13: Joins Readings before class: Chapter 17 from Relational Database text. Slides Notes
09/18: Lab 1. Data Exploration. Readings before class: Using Presto at Netflix and Getting Started with Athena. Lab 1 Assignment Lab 1 Rubric Slides Notes
09/20: Lab 1. Data Exploration. Lab 1 due Friday 09/22.
09/25: Normalization. Readings before class: Chapters 3 and 7 (pp. 141-152) from Relational Database text. Slides Notes
09/27: Denormalization and Table Creation. Readings before class: Chapter 11 from Relational Database and Chapter 6 from Postgres text. Slides Notes
10/02: Lab 2. Data Load. Readings before class: Chapter 13 from Relational Database text. Lab 2 Assignment Lab 2 Rubric Slides Notes
10/04: Lab 2. Data Load. Lab 2 due Friday 10/06.
10/09: Aggregations and Groupings. Readings before class: Chapters 18 and 19 from Relational Database text. Slides Notes
10/11: Aggregate Queries, Views, and Dimensional Design. Readings before class: Chapters 20 and 21 from Relational Database text and Chapter 7 from Postgres text. Slides Notes
10/16: Lab 3. Aggregate Queries & Visualizations. Readings before class: Quick Start Guide (pp. 5-34) from QuickSight User Guide. Lab 3 Assignment Lab 3 Rubric Slides Notes
10/18: Lab 3. Aggregate Queries & Visualizations. Lab 3 due Friday 10/20.
10/23: Indexes. Reading before class: Chapter 8 from Relational Database text. Slides Notes
10/25: Midterm Exam.
10/30: Spark Tutorial and Final Project Overview. Readings before class: MapReduce paper, Chapter 2 from PySpark text and How Does Amazon EMR Work? (pp. 7-10) from EMR Developer Guide. Milestone 1 Assignment Milestone 1 Rubric Slides Notes
11/01: Milestone 1. Milestone 1 due Friday 11/03.
11/06: Milestone 2. Reading before class: Chapter 3 from PySpark text. Milestone 2 Assignment Milestone 2 Rubric Slides Notes
11/08: Milestone 2. Milestone 2 due Friday 11/10.
11/13: Milestone 3. Readings before class: Chapter 5 from PySpark text and SparkSQL paper. Milestone 3 Assignment Milestone 3 Rubric Slides Notes
11/15: Milestone 3. Milestone 3 due Friday 11/17.
11/20: Milestone 4. No Readings. Milestone 4 Assignment Milestone 4 Rubric Slides Notes
11/22: Thanksgiving Break. No class.
11/27: Milestone 4. No Readings.
11/29: Milestone 4. Milestone 4 due Friday 12/01.
12/04: Milestones 5 and 6. Group Presentations. Milestone 5 Assignment Milestone 6 Assignment
12/05: Milestones 5 and 6. Group Presentations.
12/06: Milestones 5 and 6. Group Presentations.
12/07: Milestones 5 and 6. Group Presentations.
12/11: Milestones 5 and 6. Group Presentations. Milestone 6 due Monday 12/11.