This code provides a user-level tracing method for Unix
(Solaris, specifically) processes. It is based on Ian Goldberg et al.'s
Janus (see their "Wily Hacker" paper from the USENIX Security
Symposium), modified by Manish Gupta to do tracing, and modified
further by Mike Dahlin to simplify it for tracing purposes.

The idea is to launch a program under janus and have janus write a
log of which system calls the program makes and how long they take.

e.g.
	janus -l out/log netscape

(Two notes on compiling: the code must be built with "gmake", not
plain "make"; and the programs themselves must be compiled with CC,
not gcc. The Makefile is already set up this way.)


The command above writes a log of netscape's activities into the file
out/log-XXXX (where XXXX is the pid of the process).

In fact, if you do the above, you will see several out/log-XXXX
files. Netscape spawns off several helper programs as children, and
janus handles that case correctly (by spawning off its own children to
watch those children). The idea is that we could track everything a
user does by having them start their X session with

	janus -l out/log xinit

We would then track the X process and all processes spawned by it.
(A caveat: we cannot trace setuid programs or programs whose
executables we cannot read; this is a security restriction in
Solaris.)



This code is just the bare bones. You will need to enhance it
significantly to make the overall study work. Here are several issues
off the top of my head:

1) tracing to a log vs. on-the-fly analysis

The current code prints a rather verbose log message every time a
system call is entered or exited, which results in large log
files. We would like to track a user for a day, or a few users for a
week, and the logs may be too large for that. (Possible solutions: I
am willing to buy a few tens of GB of disk for this study if that
will help; it should also be easy to reduce the log size by 10x with
better encoding; and it may be possible to reduce it further by
omitting information we don't need.)

An alternative strategy is to do the analysis on the fly and only
print out results (e.g., in addition to logging all events, the
current code tracks the total time spent in each type of system call
and prints a report on those at the end).

Some tradeoffs between the two approaches: logging is simpler, and it
lets you process the data in new ways if you later decide on a study
you hadn't thought of when you did the on-line analysis. On-line
analysis may make it easier to do a larger-scale study spanning
multiple users or days by reducing the storage requirements.

It may be difficult to do on-line analysis for inter-process analysis
(e.g., when one process is waiting for another to do something).

Your decision will depend on the studies you eventually want to do. My
initial feeling is that logging is the way to go -- disk space is
pretty cheap, so if we can figure out a way to store a reasonable
amount of data to do the analysis off-line, that would be great.




2) deeper analysis of the important sys calls

By default, when a syscall begins, we print out the system call number
and the start time; we do the same when it ends.

For some calls, we will want more information -- which file is being
opened, which machine is being connected to, and so on. To do that, we
subclass Hook with a class that understands what information we care
about for a particular system call (e.g., see RWFileHook, OpenHook,
and CloseHook for specialized code for read(), write(), etc.)

You will want to get more information about more of the calls. Of
course, you probably won't want to write detailed handlers for all of
them. A reasonable strategy is to focus on the calls that account for
the most cumulative time. (One that you will probably have to tackle
near the beginning is mmap and page faults; these are often used in
place of read/write.)



3) Per-process and inter-process trace post-processors

Programs that read the trace files and report interesting
results. Start simply -- which system calls account for the most time,
which calls are most expensive on a per-call basis, etc. Be able to
answer these questions for an individual program and for a collection
of programs (e.g., "when you run netscape, you spend X% waiting for
the local FS, Y% waiting for DNS, Z% waiting for ..." and "during a
day of work you spend X% ...").


4) Known bugs include:

	stdin, stdout, and stderr (fd's 0, 1, and 2) show up as
	"closedfs" since we never see them get opened. If we
	initialize the FDClass table in the root process to mark
	these as "terminal", the right thing should happen (they
	should get propagated to child processes).


	for some reason a lot of /usr/lib... files are
	showing up as unknownfs (the fstatvfs calls
	are failing and I don't know why).


5) Other info: README.documentation has pointers to other info;
"Summary" and "problems" have some of Manish's notes from last
year; I'm not sure I agree with all of his suggestions.

6) I'm sure I've forgotten something. Come talk to me if you get
confused (or if you get a really good idea!)

7) Don't forget. Before you get too far, write your project proposal
so you know where you are going and so you don't waste time doing
things that don't matter (and so that you do complete the things that
do matter.)




