Archive: Storing Millions of Files in a File Structure

A recent question on HighScalability.com was “How Do I Organize Millions Of Images?”. The asker had found that storing files in a database was inefficient, and wanted to know what scheme he should use to structure the files on disk. I started writing this as a comment, but decided to turn it into a blog post instead. The questioner is moving in the right direction: from a performance perspective, databases are a very poor place to put large numbers of files, although you shouldn’t discount the convenience of doing so.
Lots of files
So, the question to ask is this:

How many file system entries can a folder efficiently store?

I did tests on this a couple of years back, and on Windows at that time the answer was “about a thousand”. Okay, so that implies that we must go for a tree structure, and each node should have no more than about a thousand child nodes. It also implies that we want to keep the tree nice and balanced: a huge tree structure with 90% of the files crammed into 5% of the folders is not going to be hugely helpful.

So, for a million files with a perfectly even distribution, a single folder level is sufficient: one “root folder” containing 1,000 child folders, each containing 1,000 files. For simplicity’s sake, I’m going to assume that only “leaf” folders store files. Okay, so that will efficiently store about 1,000,000 files. Except he wants to store millions of files. So either we accept more entries per node or we increase the tree depth. I’d suggest considering more entries per node as the starting point, since my “1,000 entry” testing is a bit outdated.

So, a 2-level structure: a “root folder” with 1,000 folders, each containing 1,000 folders, each containing 1,000 files, gives us a nice even billion (1000³), assuming an even distribution. That last part is the tricky part: how do you ensure an even distribution? Well, the simplest method would be to generate the folder names randomly using a pseudo-random number generator with an even distribution, so probably a cryptographically secure one. The schemes suggested in the comments ranged from generating GUIDs to generating SHA-1 hashes of the files. Some of them may work well; I’ve personally used the GUID one myself to good effect. But a GUID does not guarantee good distribution, and it might bite you, badly.
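Here’s a minimal sketch of what I mean, in C#; the 1,000-per-level fan-out, the zero-padded folder names, and the GUID file name are just my choices for illustration, not a prescription:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class PathGenerator
{
    private static readonly RandomNumberGenerator Rng = RandomNumberGenerator.Create();

    // Evenly distributed integer in [0, max): rejection sampling avoids the
    // small bias that a plain modulo over the full uint range would introduce.
    private static int NextEven(int max)
    {
        var buffer = new byte[4];
        uint limit = uint.MaxValue - (uint.MaxValue % (uint)max);
        while (true)
        {
            Rng.GetBytes(buffer);
            uint value = BitConverter.ToUInt32(buffer, 0);
            if (value < limit)
                return (int)(value % (uint)max);
        }
    }

    // Two folder levels of 1,000 nodes each (a billion leaf slots), with a
    // GUID as the file name so names never clash within a leaf folder.
    public static string GeneratePath(string extension)
    {
        return Path.Combine(
            NextEven(1000).ToString("D3"),
            NextEven(1000).ToString("D3"),
            Guid.NewGuid().ToString("N") + extension);
    }
}
```

Calling `PathGenerator.GeneratePath(".jpg")` yields a relative path along the lines of `042\917\<guid>.jpg`, which you then combine with your storage root.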

Using a hash function is cute, though it limits you to a folder size of 256 nodes (one hash byte per level), which implies a deeper folder structure. Additionally, it means you must hash the file to compute its location. But, um, if you’re looking for the file, how do you hash it? I assume you store the hash somewhere; that’s good for detecting tampering, and if you are doing this, or plan to, then this seems like a good approach. Unfortunately it is inefficient compared to our “ideal” 1,000 nodes per folder. As the commenter points out, one other benefit is that if the same image is uploaded multiple times, the same file path will be generated. The problem with this approach is that the commenter is incorrect when he says that SHA-1 does not have collisions: there is in fact a theoretical approach for generating SHA-1 collisions, and NIST has recommended that Federal agencies stop using SHA-1 where collision resistance is required. So, maybe SHA-2? Well, it is built on a similar base to SHA-1, so it’s possible a collision attack could be found, although none has been found yet. Oh, and why should we worry about a collision attack? Because person A uploads a photo of her wedding, person B uploads some porn, and person B’s file overwrites person A’s photo.
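For comparison, here’s a rough sketch of the hash-derived layout under discussion, assuming SHA-256 (a SHA-2 variant) rather than SHA-1; each of the first two hash bytes gives one folder level of 256 nodes:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class HashPathGenerator
{
    // The first two bytes of the SHA-256 hash give two folder levels of 256
    // nodes each; the full hash becomes the file name. Identical content maps
    // to an identical path, so check for an existing file before writing.
    public static string GeneratePath(byte[] fileContents)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(fileContents);
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return Path.Combine(hex.Substring(0, 2), hex.Substring(2, 2), hex);
        }
    }
}
```

As a side effect, the 64-character hex file name doubles as the verification hash you’d be storing anyway.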

The technique I’ve used many times is the GUID one, and it works well in most cases. For larger systems I’ve used the random number approach: random numbers for the folders, and a GUID for the file name. The hashing approach is very interesting; I think I might have to give it a try in a year or two when I have some spare time. I’d want to modify it to have a few thousand nodes per level rather than just 256, and I’d want to handle collisions, but it has some really nice emergent features, and it makes good use of the hash I always store for file verification.
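If I did widen the fan-out, one way to do it (again, just a sketch) would be to consume 12 bits of the hash per level instead of 8, giving 4,096 folders per level:

```csharp
using System.IO;

static class WideHashPath
{
    // Consumes the first 24 bits of a content hash as two 12-bit folder
    // levels, giving 4,096 folders per level instead of 256.
    public static string GeneratePath(byte[] hash, string fileName)
    {
        int level1 = (hash[0] << 4) | (hash[1] >> 4);    // bits 0..11
        int level2 = ((hash[1] & 0x0F) << 8) | hash[2];  // bits 12..23
        return Path.Combine(level1.ToString("D4"), level2.ToString("D4"), fileName);
    }
}
```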

I haven’t touched on the approaches of segregating based on user ID and similar; since in my case where I need to store millions of files for a single company, this doesn’t apply. It may well apply quite nicely to your needs however.

Here are some simple rules to live by:

  • DO Compress stored documents – this will save you huge amounts of storage space
  • DO Compress transmissions – this will save you huge amounts of bandwidth, and speed up downloads
  • DO Support HTTP Resume – this will save you huge amounts of bandwidth
  • DO NOT Store large amounts of BLOBs in a database. If anyone tells you to do this, then they haven’t handled large numbers of binary documents. This always seems like a good idea at the time, and never is. Seriously. NEVER.
  • DO Separate your path generation logic from your path lookup. In other words, don’t replicate your path generation on lookups. Rather, store the generated path and just read it. This allows you to move files around if you need to, rebase them, change your algorithm – a whole bunch of things. (There’s a sketch of this after the list.)
  • DO NOT use MD5 for anything. Ever. No, not even path generation.
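To make the path-separation rule concrete, here’s a minimal sketch; the StoredDocument record and the persistence step are hypothetical stand-ins for whatever database you use, and PathGenerator is the earlier sketch:

```csharp
using System;
using System.IO;

// Hypothetical metadata record: the generated relative path is persisted
// alongside the document, so lookups never re-run the generation algorithm.
class StoredDocument
{
    public Guid Id { get; set; }
    public string RelativePath { get; set; }
}

class DocumentStore
{
    private readonly string _rootFolder;

    public DocumentStore(string rootFolder)
    {
        _rootFolder = rootFolder;
    }

    // Store: generate the path exactly once and keep it with the record.
    public StoredDocument Save(byte[] contents)
    {
        var doc = new StoredDocument
        {
            Id = Guid.NewGuid(),
            RelativePath = PathGenerator.GeneratePath(".bin") // earlier sketch
        };
        string fullPath = Path.Combine(_rootFolder, doc.RelativePath);
        Directory.CreateDirectory(Path.GetDirectoryName(fullPath));
        File.WriteAllBytes(fullPath, contents);
        // ...persist doc to your database of choice here...
        return doc;
    }

    // Lookup: read the stored path; never regenerate it.
    public byte[] Load(StoredDocument doc)
    {
        return File.ReadAllBytes(Path.Combine(_rootFolder, doc.RelativePath));
    }
}
```

Because lookups read the path back from the record, you can change the generation algorithm tomorrow without breaking a single existing file.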

This article has been recovered from an archive from my old blog site. Slight changes have been made.

Stuff that’s getting me excited

I get very excited about technology, but I also get very frustrated about where it falls down: its bugs. This is especially true of software. I find I have incredibly high standards, and when some part of a piece of technology disappoints me it tends to color my entire experience of it. So in the past I’ve focused on the negative about things like Windows Vista and Entity Framework, and I do think my various criticisms were justified. But it’s too easy to criticise, so I’ve come to the decision that I need to focus more on the positive.

So I’m going to do a quick round-up of things that are exciting me about technology and software. I’m sure I’ll go into more detail about some of these in later posts.

I’m excited about Roslyn

I seriously doubt any readers of this blog don’t know what Roslyn is. The fact that it’s now open source just makes it even more important. Roslyn and its associated projects, .NET Native and RyuJIT, are going to fundamentally shift the way we develop. I’m sure most of it will be hidden by the tooling, but our code will become more predictable and more expressive; our tools will become more insightful; and our software will become more reliable and deliver more bang for our buck.

I imagine we’ll see an explosion of .NET compilers built on the Roslyn platform. Think of the deluge of innovative languages in the Java space; I’m really hopeful that some of that kind of innovation could cross the chasm. Scala certainly seems to be getting copied into C# piece by piece, and maybe we’ll see that kind of thing accelerate. I also hope we’ll see things like better treatments of code contracts, or aspect orientation, or non-nullable types, or, hell, a million other features I’d like to see in C#. I’m sure most of them will be edge cases, but it will be exciting to see the various implementations.

The barrier to entry for a Resharper competitor will be lowered significantly, and maybe projects like StyleCop will get a new lease on life.

I’m excited about ASP.NET 5

ASP.NET 5 has a feature set that I think is heading it in the right direction. OWIN is taking us in the direction of Docker, and will let us happily run multiple web sites, with wildly different Framework versions, side by side without impacting each other, and move them easily from one host to another. Expect to see the number of virtual machines you need to manage drop precipitously.

I’m loving being able to test my MVC 6 website on OS X, but I’m still not thrilled with the development tool options there. Vim is okay, I guess, but I never learned Emacs; and Xamarin seem to care only about mobile devices, not web sites, so their Studio is just awful at ASP.NET. Which is a real pity, because otherwise it’s a really great IDE.

The new client-side asset management pipeline is really great too. Sure, that’s more a Visual Studio 2015 feature, but I really like it. Plus I love having the project file moved away from XML; hopefully that will make merges easier. But guys, please, when can we have a project file that’s code, like a Gulpfile? C’mon, that’d be awesome!

I’m excited about Azure

I’m really hoping to get some quality time on Azure this year. It seems really awesome, and I’ve been blown away by the bits I’ve used so far. Azure BizTalk Services looks absolutely brilliant, and I’m not even a huge BizTalk fan. Azure Active Directory is really, really clever too. Once the tooling makes it easier to use, it will be a real game changer, especially for companies wanting to sell multi-tenant software to corporate clients who want to use their own Active Directory.

All I really, really want is an Azure data centre down here in sunny South Africa. Because right now, the latency is pretty vile.

I’m excited about the Falcon 9

A completely reusable rocket could reduce the cost of space flight by an order of magnitude, or two orders of magnitude if Elon Musk is correct. Such a decrease would make space technology affordable to a vastly wider range of clients; it would bring the cost of spaceflight into the same general range as a space elevator. I doubt they’ll get quite that efficient, but no matter: consider the effects of even a tenfold reduction in spaceflight costs.

It might very well make asteroid mining profitable. If so, that’s a game changer in and of itself. Imagine a world where the dangers and environmental impact of mining happen in space. That’s a much better world in my opinion.

Pity about their last crash, but at least they crashed on the barge they were supposed to land on. That’s threading a mighty small needle, and a hugely impressive achievement in its own right.

Resurrected from the dead

Guess who’s back, back again
Shady’s back, tell a friend

Resurrected… hopefully for good

Imhotep, the resurrected deity of all those who like holding their hands in funny poses
Well, it’s official: I’ve finally dusted myself off and resurrected my blog. Hopefully this streak will last longer than the previous one. I’ve moved to a new blog engine and haven’t yet pulled in my old posts; in fact, I don’t think I will.

My role has changed quite a lot since I last did any blogging. I’m currently a Director of Palantir, and I’m also heading up software engineering at STANLIB. So my posts will probably focus a bit more on business and leadership issues. That said, I’m still incredibly passionate about software engineering as a discipline, and .NET as a platform, so there will still be a good bit of techie talk.

And yes, I’m still unimpressed by Entity Framework. Happily we have RavenDB, so I don’t need to waste my time with that stuff.

High performance .NET

I’m also incredibly upset that Ben Watson wrote the excellent “Writing High-Performance .NET Code” book. I had bought Scrivener and started research on the book I wanted to write about high-performance .NET code. I haven’t yet finished Ben’s book, but from what I’ve read so far, he’s done an excellent job, and his book is far better than mine would ever have been, assuming I had ever managed to finish it.

Huh?

And, assuming you’re among the 99% who have never heard of me or this blog, and are confused about why I’m talking as if you should know me, well, I’m glad to say I won’t waste any of your time by blowing my own trumpet. “Past performance is no indicator of future success” and all that. Welcome, though.

I’ve gone a little crazy with plugins and integration and all that kind of thing, so you really should be able to share and comment easily. If not, please drop me a line and I’ll try and add your favourite feature.
