Abstract

We collect and analyze a snapshot of data from 10,568 file systems of 4801 Windows personal computers in a commercial environment. The file systems contain 140 million files totaling 10.5 TB of data. We develop analytical approximations for distributions of file size, file age, file functional lifetime, directory size, and directory depth, and we compare them to previously derived distributions. We find that file and directory sizes are fairly consistent across file systems, but file lifetimes vary widely and are significantly affected by the job function of the user. Larger files tend to be composed of blocks sized in powers of two, which noticeably affects their size distribution. File-name extensions are strongly correlated with file sizes, and extension popularity varies with user job function. On average, file systems are only half full.

We found that file and directory sizes are fairly consistent across file systems, but file lifetimes and file-name extensions vary with the job function of the user. We also found that file-name extension is a good predictor of file size but a poor predictor of file age or lifetime, that most large files are composed of records sized in powers of two, and that file systems are only half full on average.

File-system designers require usage data to test hypotheses [8, 10], to drive simulations [6, 15, 17, 29], to validate benchmarks [33], and to stimulate insights that inspire new features [22]. File-system access requirements have been quantified by a number of empirical studies of dynamic trace data [e.g. 1, 3, 7, 8, 10, 14, 23, 24, 26]. However, the details of applications' and users' storage requirements have received comparatively little attention. File systems are not used directly by computer users; they are used through application programs and system utilities, which are in turn constrained by the operating system. Currently, 75% of all client computers [16] run an operating system in the Microsoft Windows family [5, 34]. However, we know of no published studies of file-system usage on Windows computers.

The next section reviews previous work in the collection of empirical data on file-system contents. Section 3 describes the methodology by which we collected the file-system data, and it describes some of our general presentation and analysis techniques. Our results are presented in Sections 4 through 8: Section 4 discusses file sizes, Section 5 discusses directory hierarchies, Section 6 discusses file ages and lifetimes, Section 7 discusses the correlation of various file properties with file-name extensions, and Section 8 discusses overall characteristics of file systems. Section 9 concludes by summarizing our work and highlighting our main contributions.
