Coding Sanity

.NET development and technology thoughts

Storing Millions of files in a File Structure

A recent question on HighScalability.com was “How Do I Organize Millions Of Images?”. The asker had found that storing files in a database was inefficient, and wanted to know what scheme he should use to structure the files on disk. I started writing this as a comment, but decided to do it as a blog post instead. The questioner is moving in the right direction; from a performance perspective, a database is a very poor place to put large numbers of files – although don’t discount its convenience.

So, the question to ask is this:

How many file system entries can a folder efficiently store?

I did tests on this a couple of years back, and on Windows at that time the answer was “about a thousand”. Okay, so that implies that we must go for a tree structure, and that each node should have no more than about a thousand child nodes. It also implies that we want to keep the tree nice and balanced: a huge tree structure with 90% of the file nodes crowded into 5% of the folders is not going to be hugely helpful.

So, for a million files, with a perfectly even distribution, a single folder level is sufficient: one “root folder” containing 1000 child folders, each containing 1000 files. For simplicity’s sake, I’m going to assume that only “leaf” folders store files. Okay, so that will efficiently store about 1,000,000 files. Except he wants to store millions of files, which implies that we either accept more entries per node or increase the tree depth. I’d suggest more entries per node as the starting point to consider, since my “1000 entry” testing is a bit outdated.

So, a 2-level structure – a “root folder” with 1000 folders, each containing 1000 folders, each containing 1000 files – gives us a nice even billion (1000³), assuming an even distribution. That last part is the tricky part: how do we ensure an even distribution? Well, the simplest method would be to generate the folder names randomly, using a pseudo-random number generator with even distribution – probably a cryptographically secure one. The schemes suggested in the comments ranged from generating GUIDs to generating SHA-1 hashes of the files. Some of them may work well; I’ve personally used the GUID one myself to good effect. But a GUID does not guarantee good distribution, and it might bite you, badly.

Using a hash function is cute, though it limits you to a folder size of 256 nodes, which implies a deeper folder structure – and it means you must hash the file as part of computing its location. But, um, if you’re looking for the file, how do you hash it? I assume you store the hash somewhere; this is good for detecting tampering, and if you are doing that, or plan to, then this seems like a good approach. Unfortunately it is inefficient compared to our “ideal” 1000 nodes per folder. As the commenter points out, one other benefit is that if the same image is uploaded multiple times, the same file path will be generated. The problem with this approach is that the commenter is incorrect when he says that SHA-1 does not have collisions; there is in fact a theoretical approach to generating collisions for SHA-1, and NIST has recommended that Federal agencies stop using it where collision resistance is required. So, maybe SHA-2? Well, it is built on a similar base to SHA-1, so it’s possible a collision attack could be found – although none has been found yet. Oh, and why should we worry about a collision attack? Because person A uploads a photo of her wedding, person B uploads some porn – and person B’s upload overwrites person A’s photo.
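
To make this concrete, here is a minimal sketch of the hash-based scheme – my own hypothetical C#, not code from any commenter – assuming SHA-256, using the first two hash bytes as the folder names (hence the 256-nodes-per-level limit), and using the full hex hash as the file name so duplicate uploads map to the same path:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class HashPathDemo
{
    // Derive a two-level path from a SHA-256 hash of the file contents.
    static string GetHashPath(byte[] contents)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(contents);
            string fileName = BitConverter.ToString(hash).Replace("-", "");
            return Path.Combine(
                hash[0].ToString("X2"),   // level 1: 256 possible folders
                hash[1].ToString("X2"),   // level 2: 256 possible folders
                fileName);                // full hex hash as the file name
        }
    }

    static void Main()
    {
        byte[] contents = File.ReadAllBytes("wedding.jpg"); // hypothetical input file
        Console.WriteLine(GetHashPath(contents));           // e.g. "6F\A1\6FA1..."
    }
}
```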

The technique I’ve used many times is the GUID one, and it works well in most cases. For larger systems I’ve used the random number generator approach: random numbers for the folders, and a GUID for the file name. The hashing approach is very interesting; I think I might have to give it a try in a year or two when I have some spare time. I’d want to modify it to have a few thousand nodes per level rather than just 256, and I’d want to handle collisions – but it has some really nice emergent features, and it makes good use of the hash I always store for file verification.
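
Here is a minimal sketch of that combination – again hypothetical code of my own, assuming 1000 folders per level and illustrating only the path generation:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class RandomPathDemo
{
    static readonly RandomNumberGenerator Rng = RandomNumberGenerator.Create();

    // Evenly distributed folder index in [0, 1000), drawn from a
    // cryptographically secure generator as suggested above.
    static int NextFolder()
    {
        byte[] buffer = new byte[4];
        Rng.GetBytes(buffer);
        // The modulo bias here (2^32 is not a multiple of 1000) is
        // negligible for balancing purposes.
        return (int)(BitConverter.ToUInt32(buffer, 0) % 1000);
    }

    // Two random folder levels plus a GUID file name:
    // 1000 x 1000 leaf folders, each holding ~1000 files.
    static string GeneratePath()
    {
        return Path.Combine(
            NextFolder().ToString("D3"),
            NextFolder().ToString("D3"),
            Guid.NewGuid().ToString("N"));
    }

    static void Main()
    {
        Console.WriteLine(GeneratePath()); // e.g. "042\871\0f8fad5bb4e1..."
    }
}
```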

I haven’t touched on approaches that segregate by user ID and the like, since in my case – storing millions of files for a single company – they don’t apply. They may well suit your needs quite nicely, however.

Here are some simple rules to live by:

  • DO Compress stored documents – this will save you huge amounts of storage space
  • DO Compress transmissions – this will save you huge amounts of bandwidth, and speed up downloads
  • DO Support HTTP Resume – this will save you huge amounts of bandwidth
  • DO NOT Store large amounts of BLOBs in a database. If anyone tells you to do this, then they haven’t handled large numbers of binary documents. This always seems like a good idea at the time, and never is. Seriously. NEVER.
  • DO Separate your path generation logic from your path lookup. In other words, don’t replicate your path generation on lookups. Rather, store the generated path, and just read it. This allows you to move files around if you need to, rebase them, change your algorithm – a whole bunch of things. (See the sketch after this list.)
  • DO NOT use MD5 for anything. Ever. No, not even path generation.
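
As promised, here is a minimal sketch of that separation – hypothetical code of my own, where the Dictionary stands in for whatever database table would hold the stored paths:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class FileStore
{
    // Stand-in for the database table that maps a document ID
    // to its stored relative path.
    readonly Dictionary<Guid, string> _pathsById = new Dictionary<Guid, string>();

    // Write path: generate the path ONCE, write the file, persist the path.
    public Guid Save(byte[] contents, Func<string> generatePath)
    {
        Guid id = Guid.NewGuid();
        string path = generatePath(); // assumed to include the folder levels
        Directory.CreateDirectory(Path.GetDirectoryName(path));
        File.WriteAllBytes(path, contents);
        _pathsById[id] = path;        // store the generated path with the record
        return id;
    }

    // Read path: just read the stored path back; the generation algorithm
    // is never re-run, so it is free to change later.
    public byte[] Load(Guid id)
    {
        return File.ReadAllBytes(_pathsById[id]);
    }
}
```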

posted on Friday, June 22, 2012 7:16 AM | Filed Under [ General Performance ]

Feedback


# re: Storing Millions of files in a File Structure

>Using a hash function is cute, it limits you to a folder size of 256 nodes though
I cannot understand how you could have made this mistake.
Using a hash function does not limit ANYTHING. You want 1000? Ok, hash the file and use hash % 1000 for the first level and (hash / 1000) % 1000 for the second level. Hard, isn't it? If you replace 1000 with 1024, it would even be perfectly distributed for int-based hash functions.
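
In code, that scheme would look something like this (a hypothetical C# sketch; it assumes the hash is already available as an unsigned integer, however it was produced):

```csharp
static class ModulusPath
{
    // Peel off two base-1000 "digits" of the hash - it doesn't matter
    // where the hash came from (file name, file contents, whatever).
    public static string FromHash(ulong hash, string fileName)
    {
        ulong level1 = hash % 1000;          // 1000 folders, first level
        ulong level2 = (hash / 1000) % 1000; // 1000 folders, second level
        return string.Format("{0:D3}/{1:D3}/{2}", level1, level2, fileName);
    }
}
```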
10/12/2012 1:30 AM | Ark-kun

# re: Storing Millions of files in a File Structure

Hashing the file cannot work, because then if you needed to find the file again you would need to have the file to hash so you could find it. Oops. You COULD store the hash in a database somewhere to work around this, but now it's starting to look ugly.

Now, Randall's suggestion was to take the first two bytes of the hash output (of the name) and convert them to hexadecimal, providing the node structure. Hexadecimal has 16 possibilities per digit, and 16 ^ 2 is 256. Since filesystems do not allow just ANY value as a folder name, we must use some encoding of the hash output for the folder name.

Thus, there are not infinite possibilities; you are limited in folder width by your encoding of the hash bytes (16 per hex digit), and in folder depth by the output size of the hash (32 bytes for SHA-256).

Now, you could encode 2 bytes per level, or 3, or 4, sacrificing folder depth (you've got more than enough) to get more folder width.

I do like the hash solution to this problem, and I suspect I may use a variant on it in future, but only if I can have a folder width of about 1000, which is where I've found the sweet spot to be.
10/12/2012 5:38 AM | sean

# re: Storing Millions of files in a File Structure

>Hashing the file cannot work
Yes, it can, for some scenarios. I never said anything about hashing the file, though. It's strange to see you disagreeing with me, saying that hashing cannot work, and then talking about using hashes.
And it looks like you haven't understood what I wrote. I wrote about an easy, universal solution that doesn't care where you got the hash.

>because then if you needed to find the file again your would need to have the file to hash so you could find it.
This depends on the nature of the KEY that you want to use to locate the file. If the key is the name of the file, you need to hash the name. If the key is the file contents (e.g. a file database that supports duplicate search), then you need to hash the file contents.
But none of this matters here.

>Now, Randall's suggestion was to take the first two bytes of the hash output (off the name) and convert them to hexadecimal, providing the node structure.
And this is a bit silly and limiting, as I've pointed out. A hash is (usually) an integer number (its representation doesn't matter). Yes, you can convert it into a hexadecimal number and use the last 2 pairs of hex digits to construct the file path, but nobody is forcing you to do that and limit yourself.
The path that you get with that procedure is "hex(hash % 256) / hex((hash / 256) % 256) / file", which gives you 256 folders per level.
Why limit yourself? If you want 1000 folders per level, use the very similar path scheme "(hash % 1000) / ((hash / 1000) % 1000) / file", which gives you 1000 folders per level.

P.S.
Afraid of numbers and want a solution based only on hex strings? Just use 2.5 hex digits (5 octal digits) per level, getting 1024 folders. It's very easy.
Take 5 hex digits from the hash and convert the 3rd digit to octals:
0 => 0/0
1 => 0/1
2 => 0/2
3 => 0/3
4 => 1/0
5 => 1/1
6 => 1/2
7 => 1/3
8 => 2/0
9 => 2/1
A => 2/2
B => 2/3
C => 3/0
D => 3/1
E => 3/2
F => 3/3

But the modulus solution is more flexible since you can get any number of folders you want.
10/17/2012 3:01 AM | Ark-kun

# re: Storing Millions of files in a File Structure

You're quite right that I didn't understand. I still don't. Your comments don't make sense; it's like you're googling things you don't fully understand and then trying to string them together coherently. It ALWAYS matters where you get the hash. Hashes aren't some magical hand-waving thing that gets you out of jail free. It DOES matter if the key is the file contents, because if the key is the thing you're looking for, then you're not going to have it: you don't have the key until you have the file contents, and you don't have the file contents until you get the key.

Circular dependency, see dependency, circular.

As for the encoding scheme you suggest, you don't seem to be aware that the output of a hash is an array of bytes. There is no need to "% 256" them; they are already an array of numbers from 0 to 255. Making them "% 1000" like you suggest is idiotic: taking any number below 1000 modulo 1000 gives you the exact same number back, so there's no point in the modulus operation.

Randall's scheme was hex(hash[0]) / hex(hash[1]).

Your first "solution" works out to the EXACT same thing – oh, except with three levels. So, 3 levels of 256 permutations each. That doesn't solve anything. Your idea about using octal instead of hex? Okay, another encoding scheme, and now you're dropping the radix but increasing the number of digits. Cool, except 5 octal digits (77777 octal) give you 32768 values, NOT 1024, and that's possibly too many nodes in the folder. And your little example at the end is converting to radix 4, not 8. So it seems you're a bit vague on what octal is, too.

So, given all this, I'm wondering why your tone is so angry; it's not a topic that seems that important, and you clearly don't have a background in it, because you're making some fairly fundamental mistakes. I mean, you don't even seem to grasp what a modulus does, nor how octal works, and you seem to have some serious misunderstandings about the fundamentals of the process of hashing. So, given that it's clearly not a topic close to your heart, why are you so worked up about it, and so aggressive?

Okay, since you're determined to find an encoding scheme where hashes will work perfectly, let me put you out of your misery. Work backwards from your problem. The problem is that you need an encoding that will take one or more bytes and output a value, representable as a folder name, with about 1000 permutations. So, take 1000 and find its square root: it's about 31.6. Okay, so with two characters per folder name, a base-32 encoding gives us 1024 permutations per folder. In fact we could probably go as high as base 45, giving us 2025 permutations per folder.
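
A rough sketch of what I mean (hypothetical code; the 32-character alphabet is an arbitrary choice):

```csharp
static class Base32Path
{
    // Arbitrary filesystem-safe 32-symbol alphabet ('i', 'l', 'o', 'u'
    // dropped to avoid visual confusion - any 32 characters would do).
    const string Alphabet = "0123456789abcdefghjkmnpqrstvwxyz";

    // Two base-32 characters per folder name: 32 * 32 = 1024 names.
    static string EncodeFolder(int value)
    {
        return new string(new[] { Alphabet[value / 32], Alphabet[value % 32] });
    }

    // Take the first 20 bits of the hash, 10 bits per folder level.
    public static string FromHash(byte[] hash, string fileName)
    {
        int bits = (hash[0] << 12) | (hash[1] << 4) | (hash[2] >> 4);
        int level1 = bits >> 10;   // top 10 bits: 1024 choices
        int level2 = bits & 0x3FF; // next 10 bits: 1024 choices
        return string.Format("{0}/{1}/{2}",
            EncodeFolder(level1), EncodeFolder(level2), fileName);
    }
}
```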
10/17/2012 6:43 AM | sean

# re: Storing Millions of files in a File Structure

If you really want, you could again drop the radix and increase the number of digits. I just don't feel like calculating this out now.
10/17/2012 9:09 AM | sean
