A recent question on HighScalability.com was “How Do I Organize Millions Of Images?”. The asker had found that storing files in a database was inefficient, and wanted to know what scheme he should use to structure the files. I started writing this as a comment, but decided to do it as a blog post instead. The questioner is moving in the right direction; databases are a very poor place to put large amounts of files from a performance perspective – although don’t discount the convenience of this.
So, the question to ask is this:
How many file system entries can a folder efficiently store?
I did tests on this a couple years back, and on Windows at that time the answer was “about a thousand”. Okay, so that implies that we must go for a tree structure, and each node should have no more than about a thousand child nodes. This implies that we want to keep the tree nice and balanced. Having a huge tree structure with 90% of the files nodes distributed into 5% of the nodes is not going to be hugely helpful.
So, for a million files, with a perfectly even distribution, a single folder level is sufficient. So, that’s one “root folder” containing 1000 child folders, each containing 1000 files. For simplicities sake, I’m going to assume that only “leaf” folders will store files. Okay, so that will efficiently store about 1,000,000 files. Except, he wants to store millions of files. Okay, so that implies that either we accept more entries per node or we increase the tree depth. I’d suggest the more entries as the starting point to consider, my “1000 entry” testing is a bit out-dated.
So, 2-level structure; a “root folder”, with 1000 folders, each containing 1000 folders, each containing 1000 files gives us a nice even billion, 10003, assuming an even distribution. That last part is the tricky part. How do assure even distribution? Well, the simplest method would be to generate the folder names randomly using a pseudo random number generator with even distribution, so probably a cryptographically secure one. Some of the schemes suggested in the comments ranged from generating GUIDs to generating SHA-1 hashes of the files. Some of them may work well; I’ve personally used the GUID one myself to good effect. But a GUID does not guarantee good distribution, and it might bite you, badly.
Using a hash function is cute, it limits you to a folder size of 256 nodes though; which implies a deeper folder structure – additionally it means you must hash the file as part of the file location. But, um, if you’re looking for the file, how do you hash it? I assume you store the hash somewhere; this is good for detecting tampering and if you are doing this or plan on doing this – then this seems like a good approach. Unfortunately it is inefficient compared to our “ideal” 1000 node per folder size. As the commenter points out; one other benefit is that if the same image is uploaded multiple times, the same file path will be generated. The problem with this approach is that the commenter is incorrect when he says that SHA-1 does not have collisions; there is in fact a theoretical approach to generate collisions for SHA-1, and NIST suggests that it’s use for name collision avoidance should be stopped by Federal agencies. So, maybe SHA-2? Well, it is off a similar base to SHA-1, so it’s possible a collision attack could be found – although one hasn’t been found yet. Oh, and why we should worry about a collision attack? Because person A uploads a photo of her wedding and person B uploads some porn – and person B overwrites person A’s photo.
The technique I’ve used many times is the GUID one, and it works well in most cases. The random number generator approach I’ve used for larger systems, using random numbers for folders, and a GUID for the file name. The hashing approach is very interesting. I think I might have to give it a try in a year or two when I have some spare time. I’d want to modify it to have a few thousand nodes per level, rather than just 256; and I’d want to handle collisions – but it has some really nice emergent features; and it makes good use of the hash I always store for file verification.
I haven’t touched on the approaches of segregating based on user ID and similar; since in my case where I need to store millions of files for a single company, this doesn’t apply. It may well apply quite nicely to your needs however.
Here are some simple rules to live by:
- DO Compress stored documents – this will save you huge amounts of storage space
- DO Compress transmissions – this will save you huge amounts of bandwidth, and speed up downloads
- DO Support HTTP Resume – this will save you huge amounts of bandwidth
- DO NOT Store large amounts of BLOBs in a database. If anyone tells you to do this; then they haven’t handled large number of binary documents. This always seems like a good idea at the time, and never is. Seriously. NEVER.
- DO Separate your path generation logic from your path lookup. In other words, don’t replicate your path generation on lookups. Rather store the generated path, and just read it. This allows you to move files around if you need to, rebase them, change your algorithm – a whole bunch of things.
- DO NOT use MDS for anything. Ever. No, not even path generation.