Syllabus for 333E Elements of Data Integration - Spring 2026
Class Meetings: Friday 2:00pm - 5:00pm
Class Location: GDC 1.304
Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office Hours:
 • In Person: Monday 4:00pm - 5:00pm in GDC 6.510
 • Online: Tuesday 7:00pm - 8:00pm on Zoom
TA: Sai Surya Duvvuri
Email: saisurya at cs dot utexas dot edu
Office Hours: Wednesday and Thursday 3:00pm - 4:30pm on Zoom
Course Description
This course provides an introduction to the design and implementation of modern data lakehouses, structured around three core pillars. The first pillar focuses on advanced data modeling. Students will master techniques for synthesizing unified models from disparate, independently produced datasets with incompatible formats. These models will be engineered to provide a robust foundation for business intelligence, data science, and machine learning. To reflect real-world complexity, the course will use data across a spectrum of modalities, from traditional structured tables to unstructured formats such as images and video.
The second pillar of this course is ETL and ELT. We will learn how to create robust and scalable data pipelines that gradually clean, normalize, and integrate the data sources from the raw layer to the marts. We will examine the operational aspects of pipeline development, namely orchestration, incremental updates, and quality checks.
The third pillar is AI engineering, which will be interwoven throughout the curriculum. We will look at new approaches to data engineering whose solutions are not yet well understood: how to use the LLM to extract attributes from unstructured data for the purpose of data enrichment, how to use the LLM to augment data that is missing in the database, and how to use the LLM to map schemas from source to target as part of the ELT pipeline. We will experiment with an agent development framework and create semi-autonomous flows to speed up time-consuming data engineering tasks.
This is a project-based course in which each week's assignment builds on the previous one to create a substantial final deliverable. The projects will be coded in SQL and Python with the frameworks Data Build Tool (dbt) and Agent Development Kit (ADK). We will be using a variety of Google Cloud technologies throughout the term, including Cloud Storage, BigQuery, Colab, and Vertex AI.
Prerequisites
CS 327E or equivalent is a required prerequisite. In addition, familiarity with AI/ML and Cloud technologies is helpful, but not required.
Required Readings
Readings will be assigned most weeks. They will come from three texts:
- Bartosz Konieczny, Data Engineering Design Patterns, First Edition, O'Reilly, 2025.
- Chip Huyen, AI Engineering, First Edition, O'Reilly, 2024.
- Michael J. Hernandez, Database Design for Mere Mortals, Fourth Edition, Pearson, 2021.
For convenience, all required readings will be available through the Longhorn Textbook Access program. Go to Canvas and click on the My Textbooks tab to access the readings.
Supplemental Readings
In addition to the text readings, assignments will include reading official product documentation, especially when we get to the dbt and ADK portions of the course. All product documentation is available online for free.
Projects
The projects are the most important component of this course. With the exception of the first one, all projects will require substantial design and coding work. They will build on each other throughout the term, up to and including the Final Project.
There will be 10 Weekly Projects and 1 Final Project in total. Each weekly project will be worth 7% for a total of 70% of your final grade. The Final Project will be worth 10% of your final grade.
All projects will be done in groups of two students. You will choose your project partner on the first day of class and will work with them throughout the term. It is therefore very important that you choose your partner wisely. Both partners will receive the same grade and are expected to collaborate and contribute equally to each project. If you run into any issues, you should first try to work them out with your partner directly and then reach out to me or the TA if the issue persists.
Presentations
There will be two types of presentations. One will be in-class and the other will be recorded.
The live presentations will happen at the end of every unit and are marked as such on the course schedule. If your group has done solid work on a unit, you may be called upon to present your work to the class at this session. All live presentations will be chosen by the Professor. If you are selected, your group will receive an email the day before the session to give you a bit of time to prep. In exchange for presenting live, group members will receive extra credit.
Aside from the live presentations, there will be two recorded presentations that are required for everyone. One will be a midterm around week 7 and the other a final at the end of the term. Both presentations will be done in groups. The recorded presentations will be worth 10% of your grade in total, or 5% per presentation.
Quizzes
There will be in-class reading quizzes most weeks. The quizzes will be drawn from the assigned chapter for that class, not from the project work. They are meant to keep you on track with the readings and check your basic understanding. The quizzes will consist of true/false, multiple-choice, and short-answer questions. There will be about 12 quizzes throughout the term, and they will collectively be worth 10% of your grade.
Students are expected to take the quizzes on their own without collaboration. If you are struggling with any of the content on the quizzes, please come to office hours and ask for help.
Class Structure
Our class sessions follow a workshop-intensive model, typically starting with a brief lecture followed by dedicated group work. To best support this collaborative format, we will primarily meet in person; however, we may occasionally shift to Zoom when a project is better suited for virtual breakout rooms. In either setting, the TA and I will circulate among the groups to provide real-time guidance, answer questions, and help resolve technical hurdles.
Attendance at each class is mandatory. If you have to miss a meeting, you should notify your partner and the teaching staff ahead of time. This will enable us to record your absence and give your partner extra support for that week's project. If you miss more than two class meetings, you will be asked to work solo for the remainder of the term and your partner will be reassigned to a different group.
Academic Integrity
This course will abide by UTCS' code of academic integrity.
Late Submissions, Extensions, and Resubmissions
You will lose 10% of your score for each late day of your project submission unless you have obtained an extension from the Professor prior to the assignment deadline. In addition, if you submit late, your group will not be able to take advantage of our resubmission option, which allows you to gain lost points back if you have submitted your weekly project on time. To be clear, even if you have received an extension, you are not eligible for the resubmission if you submit after the deadline. This policy applies to all weekly projects throughout the term. For the Final Project, you will not have the option to resubmit or submit after the deadline due to the university's compressed grade reporting timeline.
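As a rough illustration of the late penalty described above, here is a minimal sketch. It assumes the 10% is deducted from the score you earned for each late day, counted against your effective deadline (that is, after any extension granted), and that the score cannot drop below zero. The function name and the zero floor are assumptions made for illustration, not part of the official policy; ask the teaching staff if you need the exact rule.

```python
def late_adjusted_score(raw_score, days_late):
    """Apply an assumed 10%-per-late-day deduction to a weekly project score.

    raw_score: points earned before any penalty.
    days_late: whole days past the effective deadline (0 if on time).
    """
    # Assumption: the penalty is capped at 100%, so a score never goes negative.
    penalty_percent = min(max(days_late, 0) * 10, 100)
    return raw_score * (100 - penalty_percent) / 100


# A project that earned 90 points and was submitted two days late.
print(late_adjusted_score(90, 2))  # 72.0
```

Under this reading, an on-time submission keeps its full score and remains eligible for resubmission, while a late one is reduced and is not.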
For deadline extension requests, alternate quiz requests, SSD accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Ed and email the teaching staff directly if you don't receive a response within 24 hours. Please include your reason for requesting an extension and any relevant documentation if applicable.
Students with Disabilities
Students with disabilities may request appropriate academic accommodations.
Grading Rubric
The final grade will be made up of the following components:
- Weekly Projects: 70%
- Final Project: 10%
- Presentations: 10%
- Weekly Quizzes: 10%
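To see how these weights combine, here is a minimal sketch in Python. The weights come from the list above; the function name, the dictionary keys, and the sample scores are hypothetical, and the sketch ignores the extra credit for live presentations.

```python
# Component weights from the grading rubric (percent of the final grade).
WEIGHTS = {
    "weekly_projects": 70,
    "final_project": 10,
    "presentations": 10,
    "weekly_quizzes": 10,
}


def final_grade(scores):
    """Combine per-component scores (each on a 0-100 scale) into a final grade.

    `scores` is a hypothetical dict with the same keys as WEIGHTS.
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores for illustration only.
example = {
    "weekly_projects": 90,
    "final_project": 80,
    "presentations": 100,
    "weekly_quizzes": 70,
}
print(final_grade(example))  # 88.0
```

Because the weekly projects carry 70% of the weight, a few points gained or lost there move the final grade far more than the same change on quizzes.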
Tools
We will be using the following tools throughout the term:
Week-by-week Schedule
This schedule is tentative and subject to change based on the needs of the class.
- 01/16: Course overview & data collection. Read before class: Chapter 1 from Data Eng text. Slides. Project 1.
- 01/23: Data ingestion. Read before class: Chapter 2 from Data Eng text. Slides. Project 2.
- 01/30: Attribute extraction. Read before class: Chapter 5 from AI Eng text. Project 3.
- 02/06: Data enrichment. Read before class: Chapter 8 from AI Eng text. Project 4.
- 02/13: Presentations & data modeling primer. Read before class: Chapter 6 from AI Eng text.
- 02/20: Modeling the staging layer. Read before class: Chapter 7 from Data Modeling text. Project 5.
- 02/27: Modeling the intermediate layer. Read before class: Chapter 8 from Data Modeling text. Project 6.
- 03/06: Modeling the intermediate layer. Read before class: Chapter 10 from Data Modeling text. Project 6.
- 03/13: Modeling the mart layer. Read before class: Chapter 12 from Data Modeling text. Project 7.
- 03/20: Spring Break!
- 03/27: Presentations & data pipelines primer. Read before class: Chapter 4 from Data Eng text. Midterm Presentation.
- 04/03: Orchestrating the staging layer. Read before class: Chapter 6 from Data Eng text. Project 8.
- 04/10: Orchestrating the intermediate layer. Read before class: Chapter 9 from Data Eng text. Project 9.
- 04/17: Orchestrating the mart layer. Read before class: Chapter 10 from Data Eng text. Project 10.
- 04/24: Presentations & AI agent development primer. Read before class: Chapter 6 from AI Eng text. Final Project and Presentation.
Acknowledgments
Google generously supports this course by giving us access to its Cloud Platform.