## Lab #1

### Program measurement and mmap

The goal of this assignment is to learn about how to measure your program's behavior, and a bit about mmap.

We are interested in doing experimental computer science. We will follow the scientific method, which wikipedia tells me has been around for a long time. But knowing about science in the abstract is relatively easy; actually doing good science is difficult both to learn and to execute.

Your report should answer every question in this lab and should do so in a way that is clearly labeled.

This lab requires you to report data. But how you report it is up to you. Please think carefully. I do not want a huge table of numbers for every experiment. You can provide a table, discuss what is and is not relevant, then give me a graph, or a small table, or bold the entries that I should pay attention to (and that you explain in your text).

I have a major pet peeve with excessive digits of precision. Your measurements are usually counts. If you average three counts, don't give me six decimal places of precision even if six digits is the default output format for flots in the language you are using. Decide how many digits are meaningful and then report that many. Also, make your decimal points line up when that makes sense. For example, if you report a mean and a standard deviation, make the decimal places always align so you can see easily if the standard deviation is less than a tenth of the mean (which is a good sign for reproducibility).

I would use C, but you can use whatever programming tools you want. One thing I want you to do both for this class and for real life is always check the return code of every single sytem call you ever make. I know it sounds a bit pedantic, but start the habit now and you will have a happier programming life. For almost every system call all that means is checking if the return code less than zero and if so call `perror`. When system calls don't work, you really want to know about it early, trust me on this point.

Please include text about each experiment called for in this section, and include any graphs, tables or other data to make your point clearly and concisely. As always, report your experimental platform. You should find out what kind of CPU you have, its level 1 data cache capacity and associativity, and its data TLB capacity and associativity. Does your TLB have a second level? If so, describe it. Put all this information about your hardware in your report.

### Memory map

Your first task will be to write a program that opens, reads, and prints the /proc/self/maps file. Put the result of your program in your report and write a few sentences about what it means. I want you to look at the output, read the documentation to understand its fields, then think about what is the most interesting thing you found in your output and write a few sentences about what that is. Also answer this question: Report the base address of your executable (the start of the text section) and the start address of `libc`. Why are these numbers so different?

### Getrusage

Next, call `getrusage` at the end of your program and print out the fields. Pay particular attention to `utime`, `stime`, `maxrss`, `minflt`, `majflt`, `inblock`, `oublock`, voluntary and involuntary context switches.

### perf_event_open

You will use the `perf_event_open` interface to measure your code. You can do this in a VM or not, but please specify in your report what you have done. Note that you have to configure your kernel to support perf and you might need to build the perf tools yourself if you are booting the latest kernel (as I requested in Lab 0, but which is not 100% necessary for this lab). Read the documentation. It is pretty wild how you call it, right? Does using the `syscall` routine mean there is a syscall opcode in your program? Check using `objdump -d` and explain what you find in your report.

We want you to monitor the following events: level 1 data cache accesses and misses, and data TLB misses. Figuring out how to encode those events into perf_event_open is not trivial, so use the force (and the Internet). Note the division of events into read, write and prefetch. According to my experiments, the level 1 data cache experiences many misses due to prefetch, but they do not count prefetch accesses or prefetch data TLB misses (the latter likely because a prefetch that would cause a TLB miss is dropped).

Allocate a 1GB buffer, and pass the pointer to this routine (I'm assuming your host and VM have at least 2GB of memory). This routine assumes a global variable called `opt_random_access`. It also assumes your code defines the constant `CACHE_LINE_SIZE`, which is 64 on x86 platforms (and you are allowed to hard code this constant). This code represents the behavior of a program that we wish to study. Briefly summarize its memory access behavior in your report.

```// p points to a region that is 1GB (ideally)
void do_mem_access(char* p, int size) {
int i, j, count, outer, locality;
int ws_base = 0;
int max_base = ((size / CACHE_LINE_SIZE) - 512);
for(outer = 0; outer < (1<<20); ++outer) {
long r = simplerand() % max_base;
// Pick a starting offset
if( opt_random_access ) {
ws_base = r;
} else {
ws_base += 512;
if( ws_base >= max_base ) {
ws_base = 0;
}
}
for(locality = 0; locality < 16; locality++) {
volatile char *a;
char c;
for(i = 0; i < 512; i++) {
// Working set of 512 cache lines, 32KB
a = p + ws_base + i * CACHE_LINE_SIZE;
if((i%8) == 0) {
*a = 1;
} else {
c = *a;
}
}
}
}
}
```

This simple random number generator comes from wikipedia. We give it to you so you can observe it using a simple profiler. Note that our code calls this function whether or not it is doing a random traversal. Explain why in your report.

```///////////////////////////////////////////////////////////////
// Simple, fast random number generator, here so we can observe it using profiler
long x = 1, y = 4, z = 7, w = 13;

long simplerand(void) {
long t = x;
t ^= t << 11;
t ^= t >> 8;
x = y;
y = z;
z = w;
w ^= w >> 19;
w ^= t;
return w;
}
```

Enable your tracing events right before calling do_mem_access, then disable the events, read the results and report them along with the `getrusage` results. You should compute and report the cache miss rate and the data TLB miss rate as a percentage (100.0 * cache misses / cache_accesses and 100.0 * tlb misses / cache_accesses). Also look at the counts themselves.

To be even more careful about generating repeatable results you should flush the level 1 data cache before enabling the performance counters. You can do this by reading (or writing, if your cache does write allocate) a memory buffer that is larger than your cache size. Also, lock your process onto a single processor and describe the system calls you need to do this in your report.

### Measuring memory access behavior

This is the main part of your lab. Perform the following experiments. Allocate your buffer using `mmap`, both mapping anonymous memory and file-backed memory. Consider sequential and random access (these are properties of the `do_mem_access` code we supply). For file-based `mmap` consider `MAP_PRIVATE` and `MAP_SHARED`. Also consider `MAP_POPULATE`. Consider the case when you memset the entire buffer after allocating it. Call `msync` after the memset. What does that do? Remember to examine overall execution time and CPU utilization (often reported by the shell).

Perform all of the above experiments and study the data generated by `getrusage` and `perf_event_open`. Describe the results in your report using tables, graphs or whatever you think is appropriate. Explain the differences between experiments and why they happen. Determine if your results are stable by running your experiment several times and measuring the standard deviation of your results. Include the results of your analysis in your report.

Here is an example of the sort of question you should be asking yourself. Why can your data TLB miss count be lower than the total number of data pages your application accesses? Answer this question in your report.

Run your program under `strace`. Enter the output for `arch_prctl` in your report and explain what this system call does. Put the output of the system call that involves `/etc/ld.so.preload` in your report and explain what is going on.

You should do your measurements on an otherwise quiet system. However, for extra credit, you can study the effect of background activity. Explain clearly how you generated the background activity, what you expected, and what you found.

For the final section of the lab, we will add code that `forks` or `clones` a process that will compete for memory with our process under test. When your main program ends, be sure this competing process also dies. This can be accomplished in many different ways, several of which require less than or equal to three lines of code. For this part, please use the VM you build in Lab 0, configured with at least two virtual CPUs (and check that you have at least two physical CPUs in your host).

Write some code that forks or clones and then calls `compete_for_memory`. You should fork/clone before locking your program under test to a single core. Note that `compete_for_memory` calls `get_mem_size()` which is a function you must provide that returns the size of physical memory in bytes. Because the purpose of this function is to compete for memory with the foreground process, it is ok if this number is not completely equal to the exact amount of RAM in your system, but it should be close. It is NOT acceptable to hard code this number, you must write code to measure this number at runtime.

```int compete_for_memory(void* unused) {
long mem_size = get_mem_size();
int page_sz = sysconf(_SC_PAGE_SIZE);
printf("Total memsize is %3.2f GBs\n", (double)mem_size/(1024*1024*1024));
fflush(stdout);
char* p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
MAP_NORESERVE|MAP_PRIVATE|MAP_ANONYMOUS, -1, (off_t) 0);
if (p == MAP_FAILED)
perror("Failed anon MMAP competition");

int i = 0;
while(1) {
volatile char *a;
long r = simplerand() % (mem_size/page_sz);
char c;
if( i >= mem_size/page_sz ) {
i = 0;
}
// One read and write per page
//a = p + i * page_sz; // sequential access
a = p + r * page_sz;
c += *a;
if((i%8) == 0) {
*a = 1;
}
i++;
}
return 0;
}
```

For your report: Why do we call `fflush` after the `printf`? Would this fflush be necessary if we fprintf'ed to stderr?

Now we will mess with the kernel's head (or more precisely its LRU page replacement algorithm). These instructions are approximate since your version of the kernel might be different, but look in mm/vmscan.c in a function likely called shrink_page_list. In it, you will see a switch statement with a PAGEREF_ACTIVATE case, which is the case where the kernel sees the page has been recently accessed. In this case the kernel gotos activate_locked, but you will change the to fall through, which should be the PAGEREF_RECLAIM case. Then evaluate what this change does to your test program when it has competition for memory. You can do this by having two identical VMs with different kernels or one kernel that you dynamically configure with its default behavior and your modified behavior, perhaps controlled via the `/proc` file system.

### Notes

Configure your VM with at least two virtual CPUs, but first confirm that your host system has at least two CPUs.

To allocate space in a file, look at the abysmally named `ftruncate` and `fallocate`. No matter what you use, call `msync` after your memset and report your findings.

The perf subsystem uses the performance management unit (PMU) which is a set of hardware counters supported directly by your processor. If you look at `dmesg` output, it should show the PMU being initialized.

Check for `perf` availability in your host system before checking the guest. Most of the lab can be done on the host.

If you run `perf list` on the command line, it will tell you what counters are supported by your combination of hardware, OS and perf tools. You need at least level 1 data cache accesses and misses and data TLB misses. Some of the prefetch varieties may be unsupported.

Historically, VirtualBox has done a poor job of virtualizing the performance counters.

Only so many counters can be active at once (the exactly limit depends on the hardware/OS/perf tools). If you try to add too many you might get errors.

I'm not sure it is necessary, but if you get a lot of variation in your results for the experiments that follow, you might want to disable CPU frequency scaling on your system. I would do this in the BIOS, but you can also try user-level tools like this one that allow you to set the frequency directly (or perhaps the "-g performance" option would work, I'm not sure). Here is a tool. http://manpages.ubuntu.com/manpages/hardy/man1/cpufreq-selector.1.html

Your code will have to run with many different configurations. Consider using `getopt`, or maybe you would prefer a configuration file, but I find command line options superior for this sort of task as they are more explicit and more easily scripted.

Please report how much time you spent on the lab.