Serving lots of small files?
I am building a website that depends on serving lots of little MP3 files (approx. 10-15 KB each) quite quickly. Each file contains a word pronunciation, and each user will download 20-30 per minute while using the site. Each user might download 200 a day, and I anticipate 50 simultaneous users. There will be approx. 15,000 separate files eventually.
What would be the best way to store, manage, call and play these files as required? Will I need specialist hosting to deal with all the little files, or will they behave happily in one big folder (using a standard host)? Any delays will ruin the feel.
Update
Having done a bit more searching, I think the problem could be solved with either:
- A service like Photobucket but for audio instead, with its own API
- Some other sort of 'bucket hosting' solution where you can upload thousands of files at a reasonable cost, and call for them easily
Does anyone know of such a product?
15k files in one directory should not be a problem for any modern file system; it certainly isn't for NTFS. What you don't want to do is open a folder that contains 100k+ files in Explorer or something similar, because populating the list box (GUI) is a killer. You also wouldn't want to iterate over the contents of such a folder repeatedly. However, just accessing a file when you know the filename (path) is still very fast, and a server usually does just that.
The frequency doesn't sound too scary either. 50 users * 30 requests/minute/user is 1,500 requests per minute, i.e. 25 requests per second. That's not something you can ignore completely, but any decent web server should be able to serve files at that rate. I also see no need for a specialized in-memory server/database/data store. Every OS has a file cache, and that should take care of keeping frequently accessed files in memory.
If you must guarantee low (worst-case) latency, you might still need an in-memory data-store. But then again if you must guarantee latency, things become complicated anyway.
One last thing: think about reverse proxies. I find it very convenient to be able to primarily store/update data in just one place (of my choosing), and have reverse proxies take care of the rest. If your files never change (i.e. same URL means same data) this is a very easy way to provide really good scalability. If the files can indeed change, just make it so that they cannot :) e.g. by encoding the change date into the filename (and deleting the "old versions").
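As a rough illustration of that versioned-filename idea, here is a minimal Python sketch; the helper name and the choice of the file's modification time (rather than a content hash) are my own assumptions, not part of the original setup:

```python
import os

def versioned_name(path):
    """Build a cache-friendly filename that changes whenever the file does.
    Here the file's mtime is baked into the name; a content hash works too."""
    base, ext = os.path.splitext(os.path.basename(path))
    mtime = int(os.path.getmtime(path))
    return f"{base}.{mtime}{ext}"

# e.g. "foo.mp3" last modified at 1700000000 becomes "foo.1700000000.mp3",
# so a reverse proxy can safely cache each URL forever.
```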
If you want (or need) to store the files on disk instead of as BLOBs in a database, there are a couple of things you need to keep in mind.
Many (but not necessarily all) file systems don't work too well with folders containing many files, so you probably don't want to store everything in one big folder - but that doesn't mean you need specialist hosting.
The key is to distribute the files into a folder hierarchy, based on some hash function. As an example, we'll use the MD5 of the filename here, but it's not particularly important which algorithm you use or what data you are hashing, as long as you're consistent and have the data available when you need to locate a file.
In general, the output of a hash function is formatted as a hexadecimal string: for example, the MD5 of "foo.mp3" is 10ebb1120767e9de166e0f5905077cb1.
You can create 16 folders, one for each possible hexadecimal character: a directory named 0, one named 1, and so on up to f.
In each of those 16 folders, repeat this structure, so you have two levels. (0/0/, 0/1/,... , f/f/)
You then simply place each file in the folder dictated by its hash: the first character determines the first folder, and the second character determines the subfolder. Using that scheme, foo.mp3 would go in 1/0/, bar.mp3 goes in b/6/, and baz.mp3 goes in 1/b/.
Since these hash functions are intended to distribute their values evenly, your files will be spread fairly evenly across these 256 folders, which reduces the number of files in any single folder; statistically, 15,000 files would result in an average of nearly 60 per folder, which should be no problem.
If you're unlucky and the hash function you chose ends up clumping too many of your files in one folder anyway, you can extend the hierarchy to more than 2 levels, or you can simply use a different hash function. In both cases, you need to redistribute the files, but you only need to do that once, and it shouldn't be too much trouble to write a script to do it for you.
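A minimal Python sketch of the scheme described above; the two-level depth and the MD5-of-filename choice follow the example, while the function names and the root folder are just for illustration:

```python
import hashlib
import os
import shutil

def hashed_path(root, filename, levels=2):
    """Return the storage path for a file, using the first characters
    of the MD5 of its name to pick the folder at each level."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i] for i in range(levels)]   # e.g. '1', '0' for foo.mp3
    return os.path.join(root, *parts, filename)

def store(root, source_path):
    """Copy a file into its hashed location, creating folders as needed."""
    target = hashed_path(root, os.path.basename(source_path))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copy2(source_path, target)
    return target

# hashed_path("audio", "foo.mp3") -> "audio/1/0/foo.mp3"
```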
For managing your files, you will likely want a small database indexing what files you currently have, but this does not necessarily need to be used for anything other than managing them - if you know the name of the file, and you use the filename as input to your hash function, you can just calculate the hash again and find its location that way.
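If you do keep such a small index for management, something as light as SQLite would do. A sketch under my own assumptions (the table layout and database name are not prescribed anywhere):

```python
import sqlite3

conn = sqlite3.connect("audio_index.db")
conn.execute("""CREATE TABLE IF NOT EXISTS files (
                    name  TEXT PRIMARY KEY,             -- e.g. 'foo.mp3'
                    added TEXT DEFAULT CURRENT_TIMESTAMP
                )""")

def register(name):
    # Record that the file exists; its location on disk stays
    # recomputable from the name via the hash scheme above.
    conn.execute("INSERT OR IGNORE INTO files (name) VALUES (?)", (name,))
    conn.commit()
```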
I would serve these from an in-memory database. 15 KB * 15,000 files = 225 MB of raw data, so even with significant overhead it will easily fit in a medium hosting plan. Disk-backed caches might be elegant here, e.g. memcachedb, Ehcache or similar; then you only have one API and some configuration.
You should warm up the cache on startup, though.
The metadata can live in MySQL or similar. You might keep a master copy there too, for easier management and as a backend for the cache.
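A minimal sketch of that warm-up step, assuming a plain memcached instance reached through the pymemcache client; the host, key scheme and audio directory are all assumptions for illustration:

```python
import os
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def warm_cache(root="audio"):
    """Walk the audio folder once at startup and push every clip into the
    cache, keyed by filename; ~225 MB fits a medium plan's memory."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                cache.set(name, f.read())

def get_clip(name):
    """Serve a clip from the cache; on a miss, fall back to disk
    (re-read and repopulate, omitted here for brevity)."""
    return cache.get(name)
```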