Saul Edwards
Jianping Song
In recent history, computer hardware has provided predictable rates of growth in performance. However, the rates of progress for different computer components are quite dissimilar. Processing power nearly doubles every 18 months, while the rotational speed and data density of hard disk drives have allowed much slower overall development. This huge disparity has isolated permanent storage as the main bottleneck in most systems. Even in a common desktop PC, millions or even billions of processor instructions may be wasted in the event of an unnecessary disk operation.
This observation was first made at least a decade ago [Rose91], so researchers have already been hard at work on the problem. Controller cache size, controller bus width, disk arm speed, page buffer size, read-ahead buffering, and file system organization policies are just a few of the targets of performance improvements in the set of hardware and software that compose the modern I/O system. In particular, the design of the operating system’s file system component has been the focus of impressive research efforts. Today, almost any letter of the alphabet can be placed before the letters “FS” to construct the nickname of a common file system.
Many new systems emphasize optimization of writing to the file system. These systems assume that cache hits will increase as the average size of memory in a system continues to grow, so reads will decrease in cost relative to writes. Unfortunately, the average amount of data on a hard disk is growing at the same speed as main memory, if not faster [Dahl96]. The traditional Fast File System policy for file and metadata locality is a good general solution to the problem, but performance can easily be improved for “read-intensive” situations. These situations include web servers and databases, where the amount of data far exceeds the available main memory cache and the system deals with only small changes in content over time.
No existing system satisfied all of our requirements for such a “read-intensive” workload, although we are probably not aware of all attempts. We wanted to build a system that reduced overall seek time for reads but required nominal amounts of main memory and processor cycles to achieve it. In addition, we wanted the system to counter the effects of file system aging. This would require reorganization every so often: there is no way to predict the future of disk content, so there is no way to place a file in an optimal position on a hard disk only once. Basic reorganization of existing file systems has already been shown to be effective using gray-box methods [Arpa01].
Requiring the system to perform reorganization dynamically is another important design decision that we felt strongly about. Users and administrators rarely have free time with which to interrupt file system usage or otherwise disturb normal system availability. By building the reorganization into the operating system kernel and using the access of a file to trigger its movement to a more desirable area of disk, we make the process invisible to all users of the system. This invisibility and the system’s independence from tuning and administration make it easy to use, which is often key to any project’s success beyond the realm of academia.
Finally, we tried to keep the modifications simple, modular, and completely independent from user applications. Although the best organization policies might be defined by the applications that use the data, the interface needed to implement such control is complex at best. Furthermore, application programmers are forced to create the best possible system policies under these schemes. It is even possible that applications would make inferior decisions on file and metadata placement because of insufficient knowledge of disk or controller characteristics. Most of all, a superior technology will fail to replace legacy systems if replacement is not completely painless. Even the well-tested ext3 file system, which is completely backwards and forwards compatible with the older ext2 file system and requires only a slightly modified kernel, faces slow adoption by the users of the Linux operating system. We wanted to show that an accepted and proven commodity system could be improved with only slight modifications.
The UC-Berkeley Fast File System [McKu84] has defined the standard locality policies for file systems across many Unix flavors for nearly two decades. Today, the ext2 and ext3 file systems of Linux still use identical policies to allocate space for inodes, directories, and data blocks. Changing the standard block size from 512 bytes to 4 kilobytes was a major breakthrough, and is also still in use today. The traditional Unix file system is still divided into large groups of blocks. Also, copies of the superblock and free inode bitmap are maintained in each group. All of these features contribute to performance and reliability.
When an application calls the system to create a new file, FFS identifies a block group with fewer allocated inodes than the average block group. When writing data to the file, the blocks are obtained from block groups with a large number of free blocks to avoid putting large files in a single block group. These allocation policies are justified because
“they cannot attempt to localize all data references, but must also try to spread unrelated data among different cylinder groups. Taken to an extreme, total localization can result in a single huge cluster of data resembling the old file system.”
In a static file system like FFS, this is true. Unfortunately, in practice this causes a spread of both data and metadata across block groups. Figures 1 and 2 show the default spread of approximately 1.6 GB of data across a 5 GB ext2 partition. All of the block groups are utilized even though only 30 percent are required. This will cause the disk to seek across its entire width, slowing both reads and writes. Although these policies guarantee no locality, files within the same directory and files created at the same time are often successfully grouped as a result.
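The placement rules described above can be illustrated with a minimal sketch. It is a simplification for illustration only, not the ext2 or FFS kernel code: the `BlockGroup` class and both function names are hypothetical, and the real allocators consider more state than these two counters.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    index: int         # position of the group on disk
    inodes_used: int   # allocated inodes in this group
    free_blocks: int   # unallocated data blocks in this group


def pick_group_for_new_file(groups):
    """Return the first group with fewer allocated inodes than the average."""
    average = sum(g.inodes_used for g in groups) / len(groups)
    for group in groups:
        if group.inodes_used < average:
            return group
    # Every group is equally loaded, so none is below average.
    return groups[0]


def pick_group_for_data(groups):
    """Return the group with the most free blocks (simplified rule)."""
    return max(groups, key=lambda g: g.free_blocks)
```

Because both rules favor lightly used groups, a partition that is only partly full still ends up with allocations in every group, which is the spread described above.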
File systems that are reorganized online can avoid these locality constraints when the number of reads is much higher than the number of writes. In fact, the data and metadata can be thought of as a “single huge cluster of data”, with the most frequently accessed blocks closest to the center of this cluster. After gathering statistics on file system usage, data is moved to its preferred location on disk depending on the policy of reorganization. For example, maybe the files with the highest number of accesses during the information-gathering period should be placed in the middle of the cluster. Another policy associates data that is often accessed together in common sequences. Yet another policy clusters files by the last time of modification because this may be a simple way to tell how frequently each file is accessed. An even simpler policy would be to just “compress” the partition’s current state into a smaller area, maintaining the orders of metadata and file blocks. We implemented two of these policies and provided a mechanism to easily construct the others. The best policy may be different from workload to workload, so the user can set the policy with a very simple tuning interface. Alternatively, the system can try several policies, measure its own performance, and select the best one so as to avoid maintenance requirements.
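As a rough illustration of the "most accessed files in the middle" policy, the sketch below orders inodes by access count and assigns them to block groups starting at the middle group and working outward. The function names, the fixed `inodes_per_group` capacity, and the tie-breaking rule are assumptions made for the example; they are not taken from our kernel code.

```python
def order_by_access_count(access_counts):
    """Return inode numbers, most accessed first (ties broken by inode number)."""
    return sorted(access_counts, key=lambda ino: (-access_counts[ino], ino))


def assign_groups_from_center(ordered_inodes, num_groups, inodes_per_group):
    """Map each inode to a block group, filling outward from the middle group."""
    center = num_groups // 2
    # Visit groups in the order center, center+1, center-1, center+2, ...
    visit = [center]
    for step in range(1, num_groups):
        for candidate in (center + step, center - step):
            if 0 <= candidate < num_groups:
                visit.append(candidate)

    placement = {}
    for position, ino in enumerate(ordered_inodes):
        slot = position // inodes_per_group
        if slot >= len(visit):
            raise ValueError("more inodes than the groups can hold")
        placement[ino] = visit[slot]
    return placement
```

Other policies, such as ordering by last modification time, would only need a different sort key in the first function.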
When the disk is nearly full, the file system will observe a reduced average seek time from reorganization because frequently accessed data is closer to the head. Being able to move your workplace, the grocery store, and the city park nearer to your home means you have smaller distances to travel. When the disk is not nearly full, the average seek time will be further reduced because the range of the average seek will also be smaller. Track-to-track seeks are especially cheap.[Ruem94] Living in a smaller city means everything is proportionally closer.
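The "smaller city" argument can be checked with a short calculation. The sketch below assumes that accesses are uniformly random over the occupied tracks and uses seek distance as a rough proxy for seek time; both are simplifying assumptions, and the function name is hypothetical.

```python
def mean_seek_distance(num_tracks):
    """Average |a - b| over all ordered pairs of tracks in [0, num_tracks)."""
    total = sum(abs(a - b)
                for a in range(num_tracks)
                for b in range(num_tracks))
    return total / (num_tracks * num_tracks)


# Data spread over the whole partition versus packed into 30 percent of it.
spread = mean_seek_distance(100)
packed = mean_seek_distance(30)
ratio = packed / spread
```

Under these assumptions the average seek distance shrinks roughly in proportion to the occupied range, so packing the data into 30 percent of the partition cuts the average distance to about 30 percent of its original value.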
Another advantage of reorganization is the avoidance of the aging commonly found in traditional file systems [Smith97]. In the next epoch of reorganization, any large distances between blocks of the same file will be reduced. Holes left in the file system by the deletion of large files, which waste seek time, will also be closed quickly. Basing the system restructuring on file usage statistics ensures that unused and old files will not be moved unnecessarily.
Many reorganization systems have appeared in recent research. One of the simplest methods is employed by a gray-box system [Arpa01]. In a gray-box system, a user application extracts details on the state of the file system from operating system calls and uses other simple calls (create, move, unlink, etc.) to coax the file system into a more desirable organization. Although gray-box methods can be performed without risky kernel modification, can be completely customized by the user, and often require few resources, they have a number of important drawbacks. For example, the kernel has access to much more information on the state of the system, so it will make more informed decisions. Kernel modifications are capable of intercepting system calls, so access information can be gathered and reorganization can be done seamlessly. In general, the kernel is more efficient than user applications because it can avoid the transitions between user and system modes and has access to internal methods that may be more appropriate for a given operation. Also, gray-box systems are not usually any more portable than kernel systems because they too must be recompiled and altered for every new platform. Finally, gray-box systems will almost always require more administration and tuning because they must be controlled by a user and do not have access to all of the information that they use as input parameters.
Other offline reorganization systems are described by [McDonld89] and [Stae91]. These systems are similar to ours in that they use various heuristics to place “hot” files in the center of the disk and “cold” files on the edges. [Akyű95] suggests that these techniques be employed by the disk controller or device driver rather than the file system or a user application. Finally, [Grif94] proposes a prefetching system that logs sequences of file accesses and reads the files most commonly opened next. All of these systems use techniques that resemble our system, but they did not implement their algorithms in a commodity file system or test them with a familiar web server benchmark.
Other research proposes modifying the interface between applications and the file system. With the appropriate interface extensions, applications can use system calls to associate files so that access of files within the same group will be optimized. As discussed earlier, a major downfall of these methods is that they place responsibility on the applications programmer to decide what data should be placed together. More importantly, changing the system call interface is nearly impossible because existing systems would be very slow to convert to the new methods. The kernel may even make more accurate decisions on which files should be placed together based on its own statistics. In general, keeping operating system modifications invisible to the system’s users is vital to its success in the commercial world.
Our modifications to the ext2 file system include an additional source file containing the statistics-gathering structures and algorithms, the file’s associated header, and minor changes to six different existing files in the latest version of the code. In total, our changes add little more than 500 lines of code and are completely modular so that reorganization can easily be turned on and off.
The first phase of restructuring is the collection of file system statistics. This logging is performed at the granularity of the inode because finer statistics (i.e., at the level of the block) would require an excessive amount of main memory. Each time a file is opened, the inode number of the previous open call is recorded in the file’s list of access sequences, along with the time difference between the two accesses. The time of the last access of the file can also be recorded, so all of the aforementioned policies are available.
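The per-inode bookkeeping described above can be sketched in user space as follows. This is an illustrative model rather than the kernel data structure: the `AccessLog` class, its field names, and the injectable `clock` parameter are all hypothetical.

```python
import time
from collections import defaultdict


class AccessLog:
    """Per-inode open statistics (illustrative model, not kernel code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._prev_inode = None
        self._prev_time = None
        # inode -> list of (previously opened inode, seconds since that open)
        self.sequences = defaultdict(list)
        self.last_access = {}                 # inode -> time of last open
        self.open_counts = defaultdict(int)   # inode -> number of opens

    def record_open(self, inode):
        """Log one open call for the given inode number."""
        now = self._clock()
        if self._prev_inode is not None:
            delta = now - self._prev_time
            self.sequences[inode].append((self._prev_inode, delta))
        self.last_access[inode] = now
        self.open_counts[inode] += 1
        self._prev_inode, self._prev_time = inode, now
```

Keeping the counts, the last access times, and the predecessor lists side by side means that a count-based, time-based, or sequence-based policy can be computed from the same log.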
Either a memory limit or a time limit (or both) can be set on phase 1. When the logging information reaches a certain memory size or the system has been logging for a certain amount of time, logging file accesses stops and phase 2 begins. This is a very brief phase in which the file system determines the future locations of files based on its current policy. After optimizing our data structures for this phase, we found that 200,000 inodes could be ordered in much less than a second. This processing will grow linearly with the size of the disk, but since processor speed is growing faster than disk size, this small requirement should never be significant.
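The stopping rule for phase 1 amounts to a check against whichever limits are configured. The following sketch shows one way to express it; the function and parameter names are hypothetical, and a limit left as `None` is treated as not set.

```python
def logging_should_stop(bytes_logged, seconds_elapsed,
                        max_bytes=None, max_seconds=None):
    """Return True once any configured phase 1 limit has been reached."""
    if max_bytes is not None and bytes_logged >= max_bytes:
        return True
    if max_seconds is not None and seconds_elapsed >= max_seconds:
        return True
    return False
```

For example, `logging_should_stop(used, elapsed, max_bytes=5 * 1024 * 1024)` would end logging once the log occupies 5 MB, regardless of how long that takes.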
Immediately after determining the proposed file order, phase 3 begins. This phase “moves” files to their new desired location on disk. The movement of a particular file occurs when the file is first accessed during phase 3. First, it creates a new file with a slightly modified name. The allocation of the inode is controlled by the determined location for the file. Then, the file’s contents are copied using the basic ext2 read and write system calls. During the copying, the data block allocations are also controlled by the organization algorithm. Finally, the old file is unlinked (deleted) while the new file is renamed to the file’s original name. Clearly, this process could be optimized in a number of ways. It also has a few reliability problems in its current state of maturity.
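The copy, unlink, and rename sequence can be mimicked in user space as shown below. This is only an analogue of the steps described above: a user program cannot control where the inode and data blocks are allocated, which is the part our kernel modification adds. The function name and the `.reorg` suffix are hypothetical.

```python
import os
import shutil


def relocate_file(path, suffix=".reorg"):
    """Rewrite a file through a temporary copy, mirroring the phase 3 steps."""
    temp_path = path + suffix
    # Step 1: create a new file with a slightly modified name and copy the
    # contents (copyfile copies data only, not permissions or timestamps).
    shutil.copyfile(path, temp_path)
    # Step 2: unlink the old file.
    os.unlink(path)
    # Step 3: give the new file the original name.
    os.rename(temp_path, path)
    return path
```

A crash between steps 2 and 3 would leave only the temporary name on disk, which is why the sequence is not safe without additional bookkeeping.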
It is possible that a crash could occur between the deletion of the old file and the renaming of the new file. This problem is easily solved by existing journaling or transaction techniques: the new file’s alternate name is recorded before the old file is deleted. Another problem is that for very large files, it may be infeasible for the file system to hold two copies of the file at once. In such cases, it is often desirable to avoid restructuring the file at all, and since the file system obviously has access to this information, it can decide what the length cutoff should be. If the file must be moved, file system primitives can be used to slowly move the blocks of the file over time, meanwhile updating the necessary inode and indirect block structures. Finally, our restructuring algorithm is not compatible with symmetric multiprocessors, but neither is the ext2 implementation in Linux, so we were not concerned with this problem. Again, this is easily solved with conventional synchronization techniques such as monitors and condition variables.
As in phase one, a time limit to phase three can be set. The process then begins again, this time disregarding files completely contained in a single block group and files that are close to their desired locations. Various statistics on the status of restructuring are available via a simple /proc interface. These statistics include the total number of accesses logged, the number of files reorganized, the amount of data to be moved, and the actual desired block group of each inode and its associated data blocks, among others.
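The filter applied when the process begins again can be sketched as a simple predicate. The sketch follows a literal reading of the two conditions above and treats "close" as "within a fixed number of block groups"; the function name and the `tolerance` parameter are assumptions made for the example.

```python
def should_skip(current_groups, desired_group, tolerance=1):
    """Decide whether a file can be disregarded in the next round.

    current_groups is the set of block group indices that hold the file's
    inode and data blocks; desired_group is the group chosen by the policy.
    """
    # Files completely contained in a single block group are disregarded.
    if len(current_groups) == 1:
        return True
    # So are files whose groups all lie close to the desired location.
    return all(abs(g - desired_group) <= tolerance for g in current_groups)
```

For instance, a file spread over groups 2 and 30 with a desired group of 10 would remain a candidate for movement, while one spread over groups 9 and 10 would be skipped.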
In order to judge the performance improvement of our system, we used the popular WebBench web server benchmark from ZD Net labs. The file system of the web server was modified so that our restructuring method was applied only to the partition that contained nothing but the web server’s content. Both client and server were x86-compatible machines with processors of gigahertz-range clock speeds, 512 MB of memory, and about 80 GB of disk space divided unevenly between two disks. After warmup, the server was utilizing nearly 400 MB of its main memory for disk caching. The two machines were connected via 100 Mbps Ethernet with no external interruptions. All disks used the IDE bus and had 2 MB of on-disk cache.
In order to reduce the total testing time for our modifications, the web server’s partition was a mere 5 GB on one 18 GB drive. There was only about 1.7 GB of data in the partition, translating to usage of about 30 percent. The document root of the Apache web server was set to the root directory of this partition.
WebBench is a simple client that sends requests to the desired web server and performs a basic test on the returned data to guarantee limited accuracy. Performance is measured in both requests per second and throughput in bytes. A small (75 MB) workload accompanies WebBench, which we duplicated over 20 times to arrive at our total workload. WebBench’s customizable workload files were modified to include all of the copies of the workload. All of the original qualities modeled by the workload were maintained except size, of course.
The 75 MB workload is based on a large survey performed by ZD Labs. It consists of over 6,000 files ranging from 6 bytes to nearly 1 MB. The distribution of file sizes matches 90 percent of the sites surveyed. In addition, the files are classified into usage groups, where each group is assigned a percentage of the accesses. Multiple directory depths are used to model directory name resolution. The directory names even model the average and distribution of directory name lengths. Average file size as a function of directory depth is modeled also. Finally, the file content is real.
WebBench also allows the user to control the number of threads making requests at once. We tried a large range of thread counts for the client and found that throughput and the number of client threads are roughly related logarithmically. We set this number to the maximum allowed because we wanted to ensure that the disk was the biggest bottleneck in the system.
Figure 3 shows the standard ext2 file system as found in the Linux 2.4 kernel. Each epoch is about five minutes long, and it took the file system about 4 epochs, or 20 minutes, to stabilize after filling its main memory with cached data. Figure 4 shows the restructuring file system in effect. The limit set on reorganization data was 5 MB, and the locality policy was set to group the most accessed files together. It took the file system about 30 minutes to complete the logging of data, and after another 90 minutes, the throughput of the system stabilized. At this point, the throughput was demonstrating an improvement of about 18 to 26 percent. We performed several runs of this test and averaged the results, restoring the initial state of the partition from a backup each time.
No system is without its disadvantages, and ours is no different. Kernel modifications are always dangerous because kernel code must be trustworthy in every aspect. Furthermore, kernel modifications are not portable between platforms, even though the modifications may be easy to apply to other file systems because of the standard VFS interface. Building a working version on a commodity system was an important step in proving the viability of our solution to the locality problem, but the code would have to pass many inspections and far more testing before actual integration into a commercial kernel.
Depending on the policy, logging information and location information for all of the inodes on a disk can require a large amount of main memory. In the worst case, our 1.7 GB of data required about 10 MB of main memory. However, only 3 MB of additional memory is required for 5 GB of data. This will probably not scale to terabytes of storage, but neither will the restructuring itself if months are required to stabilize the file system. In extreme cases, the simplest policies can be reduced to only a few kilobytes of vital memory with no cost in processing time.
This system relies on some administration, even if the requirements are extremely small. Partitions must be labeled by the user as “read-intensive” because partitions with a large ratio of writes to reads or a large file turnover (e.g., swap partitions) would probably not be improved by dynamic restructuring. The system can use simple heuristics to decide how much logging to do in phase 1 or how much memory to occupy. It can use system calls to determine the disk’s characteristics and partition parameters. We were largely successful in avoiding “magic numbers”.
Our tests were performed with a workload several times the size of the system’s main memory. Obviously, if all of the workload fits in main memory, restructuring is a waste of time. However, there are plenty of workloads that represent excellent applications of our system. Databases, code repositories, and web servers are just a few of them.
We successfully combined the features of several other proposals into a system with moderate performance gains. We feel that demonstrating the efficiency of a solution within the bounds of a tested operating system and with an accessible benchmark is almost as important as the idea of dynamic restructuring itself. It is only through integration with already successful mechanisms that we can collectively solve the performance problems of modern operating systems.
References
[Akyű95] S. Akyűrek and K. Salem, Adaptive Block Rearrangement, ACM Transactions on Computer Systems, 13(2):89-121, May 1995.
[Arpa01] Arpaci-Dusseau, Andrea C., and Arpaci-Dusseau, Remzi H., Information and Control in Gray Box Systems, Symposium on Operating Systems Principles, 2001.
[deJo93] de Jonge, Wiebren, et al. The Logical Disk: A New Approach to Improving File Systems. Proceedings of the 14th ACM Symposium on Operating Systems Principles, December 1993.
[Dahl96] Dahlin, Mike. Technology Trends. http://www.cs.utexas.edu/users/dahlin/techTrends.
[Grif94] J. Griffioen and R. Appleton, Reducing File System Latency Using A Predictive Approach. Proc. 1994 Summer USENIX Conference, pp. 197-207, Jun. 1994.
[McDonld89] M. Shane McDonald and Richard B. Bunt, Improving File System Performance by Dynamically Restructuring Disk Space, Proc. Phoenix Conference on Computers and Communication, pp.264-269, Mar. 1989.
[McKu84] McKusick, Marshall K., et al. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984.
[Rose91] Rosenblum, Mendel, and Ousterhout, John K., The Design and Implementation of a Log-Structured File System. Proceedings of the 13th ACM Symposium on Operating Systems Principles, July 1991.
[Ruem94] Ruemmler, Chris, and Wilkes, John. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17-29, Mar. 1994.
[Smith96] Smith, Keith A., and Seltzer, Margo, A Comparison of FFS Disk Allocation Policies. Proc. 1996 USENIX Conf., Jan. 1996.
[Smith97] Smith, Keith A., et al. File System Aging – Increasing the Relevance of File System Benchmarks. Proceedings of the 1997 ACM SIGMETRICS Conference, July 1997.
[Stae91] Carl Staelin and Hector Garcia-Molina, Smart Filesystems. Proc. 1991 Winter USENIX Conference, pp.45-51, Jan. 1991.