
I think it’s a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, lots of files on our servers at work, and we’re starting to put lots of data into the cloud (e.g.

One question that affects storage design and performance is whether these files are large or small and how many of them there are. At this year’s FAST (the USENIX Conference on File and Storage Technologies), the best paper award went to “A Study of Practical Deduplication” by William Bolosky of Microsoft Research and Dutch Meyer of the University of British Columbia. While the paper didn’t really cover Linux (it covered Windows), was focused on desktops, and was primarily about deduplication, it did present some very enlightening insights into file systems from 2000 to 2010. Some of the highlights from the paper are:

- The average file size has increased.
- The average file system capacity has tripled from 2000 to 2010.
- The median file size has remained about the same.

To fully understand the difference between the first point and the third point, you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. The median file size, however, is found by ordering the sizes of every file from smallest to largest; the median is the value in the middle of that ordered list.
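To make the distinction concrete, here is a minimal sketch in Python (my own illustration, not code from the paper; the target directory is just a hypothetical example) that computes both statistics for a directory tree:

```python
#!/usr/bin/env python3
# Sketch: average vs. median file size for a directory tree.

import os
import statistics

def file_sizes(root):
    """Collect the size (in bytes) of every regular file under root."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append(os.lstat(path).st_size)
            except OSError:
                pass  # skip files that vanish or can't be read
    return sizes

if __name__ == "__main__":
    sizes = file_sizes("/home/user")   # hypothetical path
    if sizes:
        print("files:  ", len(sizes))
        # Average: sum of all sizes divided by the number of files.
        print("average:", sum(sizes) / len(sizes))
        # Median: middle value of the sizes ordered smallest to largest.
        print("median: ", statistics.median(sizes))
```

A handful of multi-gigabyte files will pull the average up sharply while leaving the median almost untouched, which is exactly the effect the observations above point to.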

So, with these working definitions, the three observations previously mentioned indicate that perhaps desktops have a few really large files that drive up the average file size, but at the same time there are a number of small files that keep the median file size about the same despite the increase in the number of files and the increase in large files. In other words, we have many more files on our desktops, and we are adding some really large files along with a large number of small files. These observations are another good data point that tells us something about our data.

What does this mean for us? One thing it means to me is that we need to pay much more attention to managing our data. One of the keys to data management is being able to monitor the state of your data, which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for every file, such as its size, its owner and group, its permissions, and its access, modification, and change times (atime, mtime, and ctime).
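As a small illustration (a sketch, not a particular tool; the file path is just an example), all of this metadata is exposed by stat(2), which Python wraps as os.stat():

```python
import os, stat, time

st = os.stat("/etc/hosts")                 # any example file
print("size (bytes): ", st.st_size)
print("owner uid/gid:", st.st_uid, st.st_gid)
print("permissions:  ", stat.filemode(st.st_mode))
print("atime:", time.ctime(st.st_atime))   # last access
print("mtime:", time.ctime(st.st_mtime))   # last modification
print("ctime:", time.ctime(st.st_ctime))   # last metadata change
```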

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, or deleted in a certain period of time). We can also determine how our data is “aging” – that is, how old the average file and the median file are – and we can do this for the entire file system tree or just certain parts of it. In essence, we can get a good statistical overview of the “state of our data”. All of this capability is just great and goes far beyond anything that is available today.
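Here is a rough sketch of how such a scan might look (my own illustration, not an existing utility; the directory and the seven-day window are arbitrary example values). It walks a tree and reports the average and median file age along with how many files were modified recently:

```python
#!/usr/bin/env python3
# Sketch: basic "state of the data" statistics from POSIX metadata.

import os
import time
import statistics

ROOT = "/home/user"          # hypothetical tree to examine
WINDOW = 7 * 24 * 3600       # "recently modified" = last 7 days

now = time.time()
ages_days = []               # file age in days, based on mtime
recent = 0                   # files modified within WINDOW

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)
        except OSError:
            continue                      # file vanished or unreadable
        age = now - st.st_mtime
        ages_days.append(age / 86400.0)   # seconds -> days
        if age <= WINDOW:
            recent += 1

if ages_days:
    print("files scanned:       ", len(ages_days))
    print("average age (days):  ", sum(ages_days) / len(ages_days))
    print("median age (days):   ", statistics.median(ages_days))
    print("modified in last 7d: ", recent)
```

Counting creations and deletions over a period would require comparing against a previous scan, but even a single pass like this gives a useful statistical picture of how the data is aging.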
