CS380L: Advanced Operating Systems

Lab #1

The goal of this assignment is to understand the differences between the native host, a container and a VM by measuring the performance of certain programs in these different environments and trying to understand what influences the end-to-end performance.

Before you start

We are interested in doing experimental computer science. We will follow the scientific method, which Wikipedia tells me has been around for a long time. But knowing about science in the abstract is relatively easy; actually doing good science is difficult both to learn and to execute.

Let's start with reproducibility. You will write a report for this lab, and in your report you will include details about your system. Think about what it would take to recreate your results. I won't spell out exactly what information you should include, but please include everything relevant while not bogging down your reader. You should report things like the kernel version for your host and guest system. If you used CloudLab, include details about the hardware of the machine type you used.

Your report should answer every question in this lab and should do so in a way that is clearly labeled.

I have a major pet peeve with excessive digits of precision. Your measurements are usually counts. If you average three counts, don't give me six decimal places of precision even if six digits is the default output format for floats in the language you are using. Decide how many digits are meaningful and then report that many. Also, make your decimal points line up when that makes sense. For example, if you report a mean and a standard deviation, make the decimal places always align so you can see easily if the standard deviation is less than a tenth of the mean (which is a good sign for reproducibility).
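
As an illustration of the point about digits and alignment, here is a minimal sketch of a row formatter. The names format_row, stddev_is_small and LABEL_WIDTH are hypothetical, and the choice of milliseconds with one decimal place is only an assumption; pick the unit and precision that your own measurements justify.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LABEL_WIDTH 26  /* assumed width of the label column */

/*
 * Format one result row with a fixed width and one decimal place, so the
 * decimal points of the mean and the standard deviation line up from row
 * to row. Returns whatever snprintf returns.
 */
int format_row(char *out, size_t len, const char *label,
               double mean_ms, double stddev_ms)
{
  assert(out != NULL && label != NULL);
  assert(strlen(label) <= LABEL_WIDTH); /* longer labels break alignment */
  return snprintf(out, len, "%-*s %9.1f %9.1f",
                  LABEL_WIDTH, label, mean_ms, stddev_ms);
}

/* Rule of thumb from the text: stddev below a tenth of the mean. */
int stddev_is_small(double mean, double stddev)
{
  return stddev < 0.1 * mean;
}
```

Calling format_row once per environment and printing the results one per line gives columns whose decimal points line up, which makes the stddev_is_small check easy to do by eye as well.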

I would use C or C++, but you can use whatever programming tools you want. One thing I want you to do, both for this class and for real life, is to always check the return code of every single system call you ever make. I know it sounds a bit pedantic, but start the habit now and you will have a happier programming life. For almost every system call, all that means is checking whether the return code is less than zero and, if so, calling perror. When system calls don't work, you really want to know about it early; trust me on this point.
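
The habit looks roughly like the sketch below. The wrapper names open_checked and close_checked are hypothetical; they only illustrate the "check for less than zero, then perror" pattern and are not a required interface.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open path read-only. On failure, report the reason and return -1. */
int open_checked(const char *path)
{
  assert(path != NULL);
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    perror("open");
  return fd;
}

/* Close fd. Even close can fail, so its return code is checked too. */
int close_checked(int fd)
{
  if (close(fd) < 0) {
    perror("close");
    return -1;
  }
  return 0;
}
```

Whether you return an error to the caller, as here, or exit immediately is a design choice; for short benchmark programs exiting on the first failure is often simplest.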

Getting a container running using Docker

In Lab #0, you learned how to get a VM running using QEMU/KVM. Now you will run a container using Docker.

Install Docker on your machine. Here is a good guide for installing Docker on an Ubuntu machine.
Add your user to the docker group so that you don't need sudo to launch a container.
Launch a container running a shell based on an Ubuntu 22.04 image.

Once you've been able to get your container running, try installing a package using aptitude to verify your container has network access.

Tools for measuring programs

Before heading to the main part of your lab, I would like to introduce some tools that you will use to measure your programs. You are encouraged to use any other tools that you think are valuable.

time
iostat
perf
getrusage. To use it, call getrusage at the end of your program and print out the fields. Pay particular attention to utime, stime, maxrss, minflt, majflt, inblock, oublock, voluntary and involuntary context switches. Note that the sum of utime and stime may not be equal to the elapsed time reported by time. If you see that, you will want to measure user/system time a different way.
The tracing infrastructure in QEMU
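
For getrusage, a minimal sketch of printing the fields mentioned above could look like this. The function name print_rusage is hypothetical; the struct rusage field names are the ones from the Linux man page, where ru_maxrss is reported in kilobytes.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Print the resource usage of the calling process. Returns 0 or -1. */
int print_rusage(FILE *out)
{
  struct rusage ru;

  assert(out != NULL);
  if (getrusage(RUSAGE_SELF, &ru) < 0) {
    perror("getrusage");
    return -1;
  }
  /* Raw values with microsecond resolution; round them in your report. */
  fprintf(out, "utime   %ld.%06ld s\n",
          (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
  fprintf(out, "stime   %ld.%06ld s\n",
          (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
  fprintf(out, "maxrss  %ld KB\n", ru.ru_maxrss);
  fprintf(out, "minflt  %ld\n", ru.ru_minflt);
  fprintf(out, "majflt  %ld\n", ru.ru_majflt);
  fprintf(out, "inblock %ld\n", ru.ru_inblock);
  fprintf(out, "oublock %ld\n", ru.ru_oublock);
  fprintf(out, "nvcsw   %ld\n", ru.ru_nvcsw);   /* voluntary switches */
  fprintf(out, "nivcsw  %ld\n", ru.ru_nivcsw);  /* involuntary switches */
  return 0;
}
```

Calling print_rusage(stderr) as the last thing your program does keeps these numbers separate from any output you redirect from stdout.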

Experiment setup

Each VM should have 2 CPUs and 2GB of memory and should run the same Linux kernel version as the host. You should enable KVM for acceleration.
Your containers should have 2 CPUs and 2GB of memory. You can use the following command to do this:
```
  docker run -it --cpus="2" --memory="2g" ubuntu:22.04 /bin/bash
```

Measuring mmap

Your first task will be to write a program that mmaps a 1GB region (either file-backed or anonymous) and writes the first byte of each page (chosen in a random order) exactly once. To access each page of a region exactly once in a random order, you might want to generate a random permutation. Here is an example that takes an array and shuffles it based on Fisher-Yates shuffle:

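#include <stdint.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of array[0..n-1]; seed rand() with srand() first. */
void shuffle(uint64_t *array, size_t n)
{
  if (n > 1) {
    size_t i;
    for (i = 0; i < n - 1; i++) {
      /* Pick j in [i, n); assumes n is much smaller than RAND_MAX. */
      size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
      /* Swap array[i] and array[j]. */
      uint64_t t = array[j];
      array[j] = array[i];
      array[i] = t;
    }
  }
}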

We want to produce deterministic results, so bind your program to a specific core. Also, for this experiment, make sure that the entire file is in the system's page cache before each run. You can write a simple program that sequentially reads the entire file several times to load it into the page cache. Before each experiment, use fincore to confirm that the entire file is cached. The point is to keep the standard deviation of your results small. Your results are not deterministic if they vary dramatically from experiment to experiment.
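
One possible way to do the pinning and the cache warming from inside a C program is sketched below. The helper names first_allowed_cpu, pin_to_cpu and warm_page_cache are hypothetical, the 1MB read buffer is an arbitrary choice, and sched_setaffinity is a Linux-specific call that needs _GNU_SOURCE. Running your program under taskset is an equally valid way to pin it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Lowest-numbered CPU this process is allowed to run on, or -1. */
int first_allowed_cpu(void)
{
  cpu_set_t set;
  if (sched_getaffinity(0, sizeof(set), &set) < 0) {
    perror("sched_getaffinity");
    return -1;
  }
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET(cpu, &set))
      return cpu;
  return -1;
}

/* Pin the calling thread to a single CPU. Returns 0 or -1. */
int pin_to_cpu(int cpu)
{
  cpu_set_t set;
  if (cpu < 0 || cpu >= CPU_SETSIZE)
    return -1;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  if (sched_setaffinity(0, sizeof(set), &set) < 0) {
    perror("sched_setaffinity");
    return -1;
  }
  return 0;
}

/*
 * Read the whole file sequentially `passes` times so that it ends up in
 * the page cache. Returns the bytes read in the last pass, or -1.
 */
long long warm_page_cache(const char *path, int passes)
{
  static char buf[1 << 20];
  long long total = 0;

  assert(path != NULL && passes > 0);
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    perror("open");
    return -1;
  }
  for (int p = 0; p < passes; p++) {
    ssize_t got;
    if (lseek(fd, 0, SEEK_SET) < 0) {
      perror("lseek");
      close(fd);
      return -1;
    }
    total = 0;
    while ((got = read(fd, buf, sizeof(buf))) > 0)
      total += got;
    if (got < 0) {
      perror("read");
      close(fd);
      return -1;
    }
  }
  if (close(fd) < 0) {
    perror("close");
    return -1;
  }
  return total;
}
```

Reading the file does not guarantee that every page stays resident, which is why checking with fincore before each run is still necessary.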

First, let's do the experiment on your host machine:

Create a 1GB file on an ext3 or ext4 filesystem that is backed by an SSD. If your machine does not have an SSD, use any storage device that is available. Your filesystem must be either ext3 or ext4. You can use df -hT to check the types of filesystems on your host machine. If you cannot find an existing one, create one using fdisk and mkfs. Whichever filesystem you use here, you must use it in all the following experiments.
Write a program that opens and mmaps the file and randomly writes the first byte of each page exactly once. Also, you should add a call to msync with the MS_SYNC flag at the end of your program to make sure that all the file changes are written to the disk.
Try different mmap flags, including MAP_PRIVATE and MAP_SHARED. Record the elapsed time for your program to finish.
Configure your program to use MAP_ANONYMOUS. Also, try different mmap flags for it, including MAP_PRIVATE and MAP_SHARED. Record the elapsed time for your program to finish.

Next, let's do the same thing in a container under two settings:

(Setting #1) Configure your container to use the host filesystem directly. Use the --mount option or the -v option to expose the same file used in your previous experiments to your container and make sure that your program is accessing that file (for reproducibility).
(Setting #2) Configure your container to use overlayfs, which is the default filesystem used by Docker containers. In order to get a better understanding of how overlayfs works, we want you to manually mount one. Below are the instructions:
```
  cd path/to/your/filesystem
  mkdir lower upper work merged
  truncate -s 1g lower/file-1g
  sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
```
Now what do the lower, upper, work and merged directories look like, respectively? Again, use the --mount option or the -v option to expose the merged directory to your container. Note that if you use the -v option, you must specify a target for it; otherwise, the overlayfs won't work.

For the file-backed cases (file-backed private and file-backed shared), report the amount of time your program takes on its first and second run. Before you start measuring, make sure that lower/file-1g is in the page cache and the upper directory is empty (I refer to this state as the initial state of your overlayfs). Then launch your container, run the same experiment twice using merged/file-1g, and record the elapsed time for each run. Now what do the lower, upper, work and merged directories look like, respectively?

To restore the initial state of your overlayfs (to record the elapsed time for a first run again), umount the merged directory, delete all the directories and redo the above instructions to mount a new overlayfs.

Now, let's do the same thing in a VM under two different settings:

Create a 1GB file on an ext3 or ext4 filesystem in your VM, backed by a virtual disk that is virtualized using virtio. The virtual disk should be backed by the same filesystem and the same SSD used in your previous experiments on the host. You should run your experiments using that 1GB file you just created. If you used an ext3 filesystem in the previous experiments, please also use an ext3 filesystem in your VM. (And if you used an ext4 filesystem previously, continue with ext4.)
(Setting #1) Let's run the experiment in a VM with extended page tables (EPT) support. By default, KVM uses EPT as long as your machine supports it. You can use lscpu to check if your machine supports EPT and use cat /sys/module/kvm_intel/parameters/ept to make sure that EPT is enabled for your VM.
(Setting #2) Now let's run the experiments in a VM with EPT disabled. You can execute the following instructions on the host to disable EPT:

  # make sure that all your VMs are killed
  sudo rmmod kvm_intel # use kvm_amd for AMD machines
  sudo modprobe kvm_intel ept=0 # use npt=0 for AMD machines
  # relaunch your VMs

To re-enable EPT afterwards, reload the module with ept=1 (npt=1 for AMD machines).

Summarize your results in the table below. Please also include the standard deviation of your results.

| Environment | file-backed private | file-backed shared |
| --- | --- | --- |
| Native host | | |
| Container using host FS | | |
| Container using overlayfs | first run: second run: | first run: second run: |
| VM with EPT | | |
| VM without EPT | | |

Are there any numbers you find interesting? You may use the tools we introduced above to measure your programs and help you understand the end-to-end performance. Please answer the following questions in your report:

Explain any performance differences between file-backed and anonymous mmap.
Is there any difference between MAP_PRIVATE and MAP_SHARED? Explain the differences.
If you find that the file-backed private case in a VM is slow, can you explain why? How could you improve the performance?
Is there any difference in performance among the native host, the container and a VM using EPT? Whether there is a difference or not, please explain why.
Explain any performance differences between VM with EPT and VM without EPT.
Is there any performance difference for workloads on a container using the host file system and one using overlayfs? Whether there is a difference or not, please explain why.
In the case of container using overlayfs with file-backed mmap, is there any difference between your first and second run? Explain the differences.

Measuring direct file I/O

The second part of your lab is to fill out the table below. Please also include the standard deviation of your results.

| Environment | sequential read | sequential write | random read | random write |
| --- | --- | --- | --- | --- |
| Native host | | | | |
| Container using host FS | | | | |
| Container using overlayfs | first run: second run: | first run: second run: | first run: second run: | first run: second run: |
| VM with EPT | | | | |

We want you to measure the performance of direct file I/O, including random read/write and sequential read/write. Write a program that opens the same file used in the previous sections using O_DIRECT. Construct an offset_array and pass it to the function below. Here, IO_SIZE is a macro that defines the size of each I/O request; we use 4096 bytes. offset_array stores the offset of each I/O request, and n is its length. For sequential read/write, offset_array should look like {0, 4096, 8192, 12288, ..., FILE_SIZE - 4096}. For random read/write, generate a random permutation of the sequential offset_array and pass it to the function below. If the opt_read flag is true, we read from the file; if it is false, we write to the file.

In the case of a container using overlayfs, just as in the last experiment, report the amount of time your program takes on its first and second run. Again, before you start measuring, make sure that lower/file-1g is cached and the upper directory is empty. You can use the same instructions as in the last experiment to mount an overlayfs and restore its initial state.

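#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_SIZE 4096

/* Issue n reads (opt_read != 0) or writes of IO_SIZE bytes at the given offsets. */
void do_file_io(int fd, char *buf,
      uint64_t *offset_array, size_t n, int opt_read)
{
  for (size_t i = 0; i < n; i++) {
    /* lseek returns an off_t, which can be wider than an int. */
    if (lseek(fd, (off_t)offset_array[i], SEEK_SET) == (off_t)-1) {
      perror("lseek");
      exit(EXIT_FAILURE);
    }
    ssize_t ret;
    if (opt_read)
      ret = read(fd, buf, IO_SIZE);
    else
      ret = write(fd, buf, IO_SIZE);
    if (ret == -1) {
      perror(opt_read ? "read" : "write");
      exit(EXIT_FAILURE);
    }
  }
}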
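
The inputs to a function like this could be prepared as in the sketch below. The helper names make_sequential_offsets and alloc_io_buffer are hypothetical. The aligned buffer reflects the usual O_DIRECT requirement that the buffer address, the transfer size and the file offset be suitably aligned; aligning to IO_SIZE is an assumption that is safe on most systems, but check the open(2) man page for your setup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IO_SIZE 4096

/*
 * Build {0, IO_SIZE, 2 * IO_SIZE, ..., file_size - IO_SIZE} and store the
 * number of entries in *n. The caller frees the array. Returns NULL on
 * failure or if the file is smaller than one request.
 */
uint64_t *make_sequential_offsets(uint64_t file_size, size_t *n)
{
  assert(n != NULL);
  size_t count = (size_t)(file_size / IO_SIZE);
  *n = 0;
  if (count == 0)
    return NULL;
  uint64_t *offsets = malloc(count * sizeof(*offsets));
  if (offsets == NULL) {
    perror("malloc");
    return NULL;
  }
  for (size_t i = 0; i < count; i++)
    offsets[i] = (uint64_t)i * IO_SIZE;
  *n = count;
  return offsets;
}

/* Allocate a zeroed buffer of IO_SIZE bytes aligned to IO_SIZE. */
char *alloc_io_buffer(void)
{
  void *buf = NULL;
  /* posix_memalign returns an error number instead of setting errno. */
  int err = posix_memalign(&buf, IO_SIZE, IO_SIZE);
  if (err != 0) {
    fprintf(stderr, "posix_memalign: %s\n", strerror(err));
    return NULL;
  }
  memset(buf, 0, IO_SIZE);
  return buf;
}
```

For the random cases, shuffling the array returned by make_sequential_offsets gives a permutation in which every offset still appears exactly once.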

Once again we want you to compare the performance measurements and explain the differences. The tools introduced above might help you better understand the measured performance. Please answer the following questions in your report:

What does the flag O_DIRECT do?
Please explain any differences between the performance of sequential I/O and random I/O.
Please explain differences between read and write benchmarks.
Are there any differences in performance among the native host, a container using the host filesystem and a VM with EPT? Explain these differences.
Explain any performance differences for workloads on a container using the host file system and one using overlayfs.
In the case of container using overlayfs, is there any difference between your first and second run? Explain the differences.

Swap

This is the final part of your lab. In this experiment we want you to understand the functionality of swap. You will use the same program written in the first experiment, "Measuring mmap". More specifically, we want you to consider the anonymous private case of your program. Please answer the following questions:

For containers:

Restrict the memory size of your containers to 500MB by specifying --memory="500m" in the Docker command. Then run your program. Does it finish successfully? Explain why.
Add --memory-swap="1.5g" to your docker command (Don't remove the --memory="500m" flag). Also, please make sure that swap is enabled on your host machine. You can use free -m to check if it is enabled. Run your program again. Does it finish successfully? Explain why.

For VMs:

What happens to your program if you give your VM only 500MB of memory? I am assuming that swap is disabled on your VM.
Enable swap on your VM. Configure a 1GB swap region. Run your program again. Does it finish successfully? Explain why.

Report

Your report should be a PDF file submitted to Canvas. Here is a description of its contents.

The first section should include everything the reader needs to reproduce all your results. As always, report your experimental platform. Describe the software you are using, like the version of your kernel, VM images and docker images.

The second section should include the results of your first experiment and your answers and explanations to the corresponding questions. Use the table we specified above to report your results. Please specify the units of your measurements. Your report should answer every question and should do so in a way that is clearly labeled. Your explanation should include how you used the tools to help you understand the differences in performance. Don't just include your hypothesis! Use tools to measure your programs and support your hypothesis.

The third section should include the results of your second experiment and your answers and explanations to the corresponding questions. The requirements are the same as for the second section.

The final section should include your answers and explanations to the questions in the third experiment. Please also include the output of your program.

Please report how much time you spent on the lab.

Notes

Configure your VM with at least two virtual CPUs, but first confirm that your host system has at least two CPUs.

Check for perf availability in your host system before checking/installing in the guest.

If you run perf list on the command line, it will tell you what counters are supported by your combination of hardware, OS and perf tools.

I'm not sure if it is necessary, but if you get a lot of variation in your results for the experiments that follow, you might want to disable CPU frequency scaling on your system. I would do this in the BIOS, but you can also try user-level tools like this one that allow you to set the frequency directly (or perhaps the "-g performance" option would work, I'm not sure). Here is a tool. https://manpages.ubuntu.com/manpages/hardy/man1/cpufreq-selector.1.html

Your report should be a PDF file submitted to Canvas.

Please include how much time you spent on the lab.

Your code will have to run with many different configurations. Consider using getopt, or maybe you would prefer a configuration file, but I find command line options superior for this sort of task as they are more explicit and more easily scripted.

You must give us access to a code repository that contains the history of your lab. This will allow us to see your partial progress.

Your code is subject to visual inspection and should be clean and readable. Points will be deducted if your code is too hard to follow. Your code should follow best practices, for example avoiding arbitrary constants and checking error codes. Seriously, check the return value of every system call you ever make.

Please check in your code as you write it. We will look at your revision history as a way to ensure that you are doing the work. We suggest using GitHub and we request that you share access to your repository with us.