Peter Reiher, Gerald Popek, Richard Guy, Geoff Kuenning, David Ratner, John Saldanha
UCLA
Many operating system components rely heavily on locality to achieve acceptable performance, including virtual memory managers, various forms of cachers, physical disk managers, and network file systems. The principle of locality is that the system can make use of its knowledge of the interrelationships between items to improve performance, from determining the proper contents of a cache, to pre-fetching information likely to be needed soon, to concealing delays caused by relatively slow hardware components.
Today's operating systems largely take advantage of spatial and temporal locality. Spatial locality asserts that items that are close together in space are related. For example, blocks of a file or files in a single directory are often considered highly related because they are spatially close together. Temporal locality assumes that things that happened close together in time are related. Examples include the log structured file system's assumption that file blocks written at proximate times are related [Rosenblum 91], and least-recently-used algorithms for controlling caches.
Both spatial and temporal locality are substitutes for the more fundamental form of locality, semantic locality. Semantic locality captures the locality that occurs between elements of the system as they are actually used in practice. Files that are always opened one after the other display semantic locality. Linked spreadsheets display semantic locality. When a typical user likes to cut and paste information from his mail reader to his calendar program, those programs display semantic locality. When users at several sites spread across the world run cooperative programs, the programs exhibit semantic locality.
Semantic locality is not identical with either spatial or temporal locality. Semantically local things may not be spatially next to each other, including in their disk layout, their physical location in the world, and their place in the file system naming scheme. Semantically local things are often, but not always, temporally co-located. The relationship may persist over long periods of time filled with other intervening, unrelated events. At best, spatial and temporal locality approximate semantic locality.
For many purposes, that approximation is sufficient. LRU, sequential file read-ahead, and other such methods have provided years of useful service. However, most of the benefits available from spatial and temporal locality have already been achieved, so as the relative speeds of various hardware components diverge (CPU, memory, disk, and network speed ratios, for example), spatial and temporal locality may no longer conceal the differences. The increasing trend towards building programs from components (such as CORBA or DCOM objects) makes understanding of locality between disjoint objects even more important. Further, as users become more diverse, their interactions with the system will become less easily predictable, so no simple, static use of a locality measure will properly capture the needs of all users. To continue improving operating system performance, we must attack the problem of understanding and exploiting the true relationships between the data and software using the system.
Since temporal and spatial locality already capture the simpler relationships between operating system entities, the unexploited remainder is mostly complex. Further, it may be data-dependent, changing from one run of a program to another, or changing depending on user, date or time, or other factors. Extracting true semantic locality in the difficult cases is likely to be challenging, requiring deep understanding of the system, yet it must be done at the speed of programs and messages, not at the speed of human analysts. Clearly, such extraction may prove to be a computationally and data-storage intensive effort.
Fortunately, the same hardware trends that make it vital to use semantic locality also provide the resources that make its extraction possible. Most systems have many idle CPU cycles, and the introduction of parallel CPUs on commodity computers will create even more idle cycles. Configuring machines with many megabytes of main memory and large disks is economically feasible. If the resources spent extracting semantic locality are less than those saved by exploiting it, or if most of the costs can be paid while the machine is otherwise idle, our increasingly abundant computing resources can help us solve this problem.
However, the problem cannot be solved by brute force, because the operating system community knows very little today about this form of locality. Before we can effectively exploit semantic locality, we must have much more understanding of what semantic locality is and how we may extract it. Several researchers have realized that they are hindered by insufficient understanding of the semantic relationships between entities controlled by their operating systems. For example, the Seer system uses semantic knowledge about the relationship between different files to hoard data onto a portable computer before the machine is disconnected [Kuenning 97a]. Seer tries to correctly predict what data should be stored on a portable computer user's disk prior to disconnection, to ensure that the portable computer user always has access to the data he needs, despite lack of network connectivity. Seer's basic strategy is to group related files into project clusters, hoarding the most recently used clusters prior to disconnection. The Millennium project is examining the principle of introspection to monitor the system and reason about its configuration and performance [Bolosky 97]. The designers of the log structured file system have proposed a garbage collection and disk layout rearrangement strategy based on principles very much like semantic locality [Neefe 96]. The HOSS system at Texas A&M is centered around the hypermedia paradigm, using various forms of semantic locality extractable from hypermedia links to improve performance [Nurnberg 96].
Preliminary results suggest that semantic locality has tremendous promise for increasing operating system performance. The Seer system works extremely well, allowing its disconnected users to operate practically without any file misses, using nearly the minimum disk space required to satisfy all user requests [Kuenning 97b]. Other preliminary experiments show savings of a factor of two or three in important metrics using rather simple methods. Deeper, more complex methods might allow greater savings. Semantic locality could prove vastly important in obtaining the performance required for ever more demanding applications.
We are building on the Seer system to examine more general use of semantic locality in improving operating systems. This work is at an early stage, so we outline here our existing evidence that semantic locality has promise beyond the Seer system, and our plans for learning more about semantic locality and exploiting that knowledge.
Pre-fetching Web pages for a portable computer is similar to Seer's file hoarding. As a simple test to determine if Seer's methods extended to Web pages, we ran traces of Web traffic from our local Web server through Seer's file hoarding algorithms. We then determined how many accesses to our Web server would be satisfiable by hoarding a reasonable number of the Web page clusters Seer built. The results were quite promising, especially considering that we did nothing whatsoever to optimize Seer for Web pages hoarding.
Typical program startup requires opening several configuration and library files. If the system understood the semantic relationship between the program and those files, it could prefetch them before the program requested them. Since these relationships are predictable, the system can watch previous runs of the program to extract the required semantic locality information. As a simple test of the potential benefit, we measured the startup time of various applications with none of their related files cached in memory, versus the startup time when all related files that would fit were cached. For the Netscape Navigator, startup time dropped from 16 seconds to 4 seconds. Other applications showed lesser improvements.
Exploiting semantic locality will typically require knowing that the locality exists, then taking appropriate actions. The operating system can use various methods to learn of the existence of semantic locality. In places where explicit links of various sorts are present (such as include statements in source files, hypertext links, OLE links, etc.), the system can examine those links and extract the locality. In other cases, the system must watch the behavior of the applications and use information gained to deduce semantic locality. This is Seer's basic method, and the method we intend to use in most of our work.
Typically, any information the operating system can observe is rather noisy, as modern systems tend to have multiple activities occurring simultaneously. Determining semantic locality from the raw data that the system can observe is challenging. For Seer, we developed metrics and algorithms to build project clusters. Seer's methods and metrics are not universally applicable, particularly because Seer only needed its clusters at the moment of making a hoarding decision, a moment that came only once a day or so. Thus, Seer formed new clusters every time it needed to hoard files. For many uses of semantic locality, the operating system needs continuously available semantic locality information, or at least quick, cheap information on demand. Much research will be required to develop more general metrics and clustering algorithms.
In Seer's case, the use of the semantic locality information was fairly simple. Files in a cluster were hoarded together. Other applications of semantic locality may require much more sophisticated uses of the information, which we must research. Once clusters were formed, Seer used LRU methods to determine which clusters to hoard. LRU is not necessarily the best predictive method for all purposes. We will investigate methods that use past behavior and models of future behavior and conditions to determine how to make use of the semantic locality information we have extracted.
Much effort will be required to determine the relative costs of gathering, extracting, and usefully applying semantic locality information. The system must trade off quality of semantic locality information, effective application of that information, and available resources. Little is known today of how to make those tradeoffs properly.
Our approach to exploiting semantic locality is only one of several promising research directions. Clearly, the system can gain a great deal of advantage by utilizing the semantic information built into the file system name structure, process trees, and other existing operating systems structures. Expert systems might have an important role to play in exploiting semantic locality, as might simulation. Programmers and users might be able to provide some semantic hint information that the operating system could use. Other promising approaches are likely to be discovered in the near future.
The use of semantic locality information in the most general way is a huge, largely untapped area for research in systems performance improvement. We plan to make major contributions from our research, but clearly there are other related approaches and other interesting applications that we will not address. Exploiting semantic locality is a rich research area that is likely to provide major, vital performance improvements for systems of the future.
Bibliography
[Bolosky 97] William Bolosky, Richard Draves, Robert Fitzgerald, Christopher Fraser, Michael Jones, Todd Knoblock, and Rick Rashid, Operating Systems Directions for the Next Millennium, Proceedings of the HotOS Workshop, 1997.
[Kuenning 97a] Geoff Kuenning and Gerald Popek, Automated Hoarding for Mobile Computers, to appear in the Proceedings of the Symposium on Operating Systems Principles, October 1997.
[Kuenning 97b] Geoff Kuenning, Peter Reiher, Gerald Popek, Experience With an Automated Hoarding System, to appear in IEEE Personal Technologies, 1997.
[Neefe 96] J. Neefe, D. Rosselli, R. Wang, T. Anderson, and M. Dahlin, Improving the Performance of the Log Structured File System, position paper for work in progress at OSDI, 1996.
[Nurnberg 96] Peter Nurnberg, J. Leggett, E. Schneider, J. Schnase, Hypermedia Operating Systems: A New Paradigm for Computing, Proceedings of the Seventh ACM Conference on Hypertext, 1996.
[Rosenblum 91] Mendel Rosenblum and John Ousterhout, The Design and Implementation of a Log-Structured File System, Proceedings of the Symposium on Operating Systems Principles, 1991.