Syllabus for 378 Foundations of Data Warehousing - Fall 2025

Class Meetings: Friday 2:00pm - 5:00pm
Class Location: JGB 2.218 and Zoom*

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office Hours: Monday 5:00pm - 6:00pm on Zoom

TA: Sai Surya Duvvuri
Email: saisurya at cs dot utexas dot edu
Office Hours: Tuesday, Wednesday, Thursday 3:00pm - 4:00pm on Zoom

*This class will have meetings both in-person and on Zoom. For all Zoom links (class meetings and office hours), please see Canvas.


Course Description
This course will introduce students to designing and implementing a modern data lakehouse by way of three pillars. The first one is data modeling. We will study techniques to model independently produced datasets that are logically related into a cohesive semantic data model that is suitable for business intelligence, data science, and machine learning applications. The source data will come in a variety of forms and modalities, ranging from the traditional structured tabular data to unstructured data such as images and videos.

The second pillar is ETL. We will learn how to extract attributes from unstructured data and design robust pipelines that gradually transform the data from the raw layer to the mart layer of the lakehouse. We will examine the operational aspects of pipeline development, namely incremental updates, quality testing, and orchestration.

The third pillar is AI engineering, which will be interleaved throughout the curriculum. We will look at new approaches to data engineering whose solutions are not yet well understood: how to power data modeling and pipeline development with LLMs and agentic flows. We will experiment with state-of-the-art (SOTA) models and agent development to evaluate the different approaches and understand their tradeoffs.

This is a project-based course with each week's assignment building on the previous one to create a final substantial deliverable. The projects will be coded in SQL and Python with the frameworks dbt (Data Build Tool) and ADK (Agent Development Kit). We will be using a variety of Google Cloud technologies throughout the term, including Cloud Storage, BigQuery, Colab, and Vertex AI.


Prerequisites
The course assumes prior knowledge of SQL, databases, and Python. As such, CS 347 Data Management or equivalent is a recommended prerequisite. Familiarity with machine learning and Google Cloud technologies is also helpful, but not required.


Required Readings
There will be assigned readings most weeks. They will come from 4 texts. For convenience, all required readings will be available through the Longhorn Textbook Access program. To find them, go to Canvas and click on the My Textbooks tab.


Supplemental Readings
In addition to the text readings, assignments will include reading official product documentation, especially when we get to the dbt and ADK portions of the course. All product documentation is available online for free.


Projects
The projects are the most important component of this course. With the exception of the first one, all projects will require substantial design and coding work. They will build on each other throughout the term, up to and including the final project.

There will be 9 weekly projects and 1 final project in total. Each weekly project will be worth roughly 7.8%, for a total of 70% of your final grade. The final project will be worth 10% of your final grade.

All projects will be done in groups of two students. You will choose your project partner on the first day of class and will work with them throughout the term. It is therefore very important that you choose your partner wisely. Both partners will receive the same grade and are expected to collaborate and contribute equally to each project. If you run into any issues, you should first try to work them out with your partner directly and then reach out to me or the TA if the issue persists.


Presentations
There will be two types of presentations. One will be in-class and the other will be recorded.

The in-class presentations will happen at the end of every unit and are marked as such on the course schedule. If your group has done great work on a unit, you may be called upon to share your approach with the whole class. All live presentations will be chosen by the Professor. If you are selected, your group will receive an email Friday morning, a few hours before class, so that you have a bit of time to prep.

There will be two recorded presentations, which are required for everyone. One will be a Midterm around week 7 and the other a Final at the end of the term. Both presentations will be done together with your partner. The recorded presentations will be worth 10% of your grade in total, or 5% per presentation.


Quizzes
There will be in-class reading quizzes most weeks. The quizzes will be drawn from the assigned chapter for that class, not from the project work. They are meant to keep you on track with the readings and check your basic understanding. The quizzes will consist of true/false, multiple-choice, and short-answer questions. There will be about 13 quizzes total throughout the term and they will collectively be worth 10% of your grade.

Students are expected to take the quizzes on their own without collaboration. If you are struggling with any of the content on the quizzes, please come to office hours and ask for help.


Class Structure
We will meet in our physical classroom 5 times during the term: on the first day of class and at the end of each unit, about every 2-3 weeks. Those dates are highlighted on the course schedule. The rest of the time we will meet online on Zoom so that we can accelerate our project work. Typically, there will be a short lecture at the start of class, followed by a work period where we will break up into groups. The TA and I will be visiting with each group to check in on progress, answer questions, and resolve issues.

Attendance for each class is mandatory and expected. If you have to miss a class, you should notify your partner and the teaching staff ahead of time. This will enable us to record your absence and give your partner extra support on that week's project. If you miss more than two classes, you will be asked to work solo for the remainder of the term and your partner will be reassigned to a different group.


Academic Integrity
This course will abide by UTCS' code of academic integrity.


Late Submissions, Extensions, and Resubmissions
You will lose 10% of your score for each late day of your project submission unless you have obtained an extension from the Professor prior to the assignment deadline. In addition, if you submit late, your group will not be able to take advantage of our resubmission opportunity, which allows you to gain lost points back if you have submitted your weekly project on time. To be clear, even if you have received an extension, you are not eligible for the resubmission if you submit after the deadline. This policy applies to all weekly projects throughout the term. For the Final Project, you will not have the option to resubmit or submit after the deadline due to the university's compressed grade reporting timeline.
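
As a rough illustration of the late-day deduction described above, the following minimal sketch computes an adjusted weekly project score. The function name and parameters are hypothetical, and the sketch assumes that the 10% deduction is applied linearly to the raw score for each late day, that the score never drops below zero, and that an approved extension waives the deduction. It does not apply to the Final Project, which cannot be submitted late.

```python
def late_adjusted_score(raw_score, days_late, has_extension=False):
    """Apply a 10%-per-late-day deduction to a weekly project score.

    Assumptions (not spelled out in the syllabus): the deduction is linear
    in the raw score, the result is floored at zero, and an extension
    obtained before the deadline waives the deduction entirely.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if has_extension or days_late == 0:
        return float(raw_score)
    # Cap the total deduction at 100% so the score cannot go negative.
    penalty_fraction = min(1.0, 0.10 * days_late)
    return raw_score * (1.0 - penalty_fraction)


# Hypothetical example: a raw score of 90 submitted two days late.
print(round(late_adjusted_score(90, 2), 2))
```

Under these assumptions, a project submitted ten or more days late without an extension would receive no credit.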

For deadline extension requests, alternate quiz requests, SSD accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Ed. If you don't receive a response within 24 hours, email the teaching staff directly. Please include your reason for the request and any relevant documentation if applicable.


Students with Disabilities
Students with disabilities may request appropriate academic accommodations.


Grading Rubric
The final grade will be made up of the following components:
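
Using the weights stated in the Projects, Presentations, and Quizzes sections above (weekly projects 70%, final project 10%, recorded presentations 10%, quizzes 10%), a final score can be estimated as a weighted sum. The following is a minimal sketch only: the function name, the dictionary keys, and the sample averages are hypothetical, and it assumes each component is averaged on a 0-100 scale before weighting.

```python
# Weights as stated in the Projects, Presentations, and Quizzes sections.
WEIGHTS = {
    "weekly_projects": 0.70,         # 9 weekly projects combined
    "final_project": 0.10,
    "recorded_presentations": 0.10,  # Midterm and Final, 5% each
    "quizzes": 0.10,                 # all reading quizzes combined
}


def final_score(component_averages):
    """Return the weighted sum of per-component averages (0-100 scale)."""
    missing = set(WEIGHTS) - set(component_averages)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_averages[name] for name in WEIGHTS)


# Hypothetical averages, for illustration only.
example = {
    "weekly_projects": 90,
    "final_project": 80,
    "recorded_presentations": 100,
    "quizzes": 70,
}
print(round(final_score(example), 2))  # 88.0 for these sample values
```

The weights sum to 100%, so a student who averages the same score on every component would receive that score as the final grade.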


Tools
We will be using the following tools throughout the term:


Week-by-week Schedule
This schedule is tentative and subject to change based on the needs of the class. The highlighted dates are those when we will meet in the classroom; otherwise, we will meet on Zoom.


Acknowledgments
This course is generously supported by Google, which gives us access to Google Cloud Platform.