Serving lots of small files?
I am building a website that depends on serving lots of little MP3 files (approx. 10-15 KB each) quite quickly. Each file contains a word pronunciation, and each user will download 20-30 per minute while using the site. Each user might download 200 a day, and I anticipate 50 simultaneous users. There will be approx. 15,000 separate files eventually.
What would be the best way to store, manage, call and play these files as required? Will I need specialist hosting to deal with all the little files, or will they behave happily in one big folder (using a standard host)? Any delays will ruin the feel.
Update
Having done a bit more searching, I think the problem could be solved with either:
- A service like Photobucket but for audio instead, with its own API
- Some other sort of 'bucket hosting' solution where you can upload thousands of files at a reasonable cost, and call for them easily
Does anyone know of such a product?
15k files in one directory should not be a problem for any modern file system; it certainly isn't for NTFS. What you don't want to do is open a folder that contains 100k+ files in Explorer or something similar, because populating the list box (GUI) is a killer. You also wouldn't want to iterate over the contents of such a folder repeatedly. However, just accessing a file when you know the filename (path) is still very fast, and a server usually does just that.
The frequency doesn't sound too scary either. 50 users * 30 requests/minute/user is 1,500 requests per minute, i.e. 25 requests per second. That's not something you can ignore completely, but any decent web server should be able to serve files at that rate. I also see no need for a specialized in-memory server/database/data store. Every OS has a file cache, and that should take care of keeping frequently accessed files in memory.
If you must guarantee low (worst-case) latency, you might still need an in-memory data-store. But then again if you must guarantee latency, things become complicated anyway.
One last thing: think about reverse proxies. I find it very convenient to be able to primarily store/update data in just one place (of my choosing), and have reverse proxies take care of the rest. If your files never change (i.e. same URL means same data) this is a very easy way to provide really good scalability. If the files can indeed change, just make it so that they cannot :) e.g. by encoding the change date into the filename (and deleting the "old versions").
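As a rough illustration of that versioned-filename idea, here is a minimal Python sketch; the helper name and the choice of the file's modification time (rather than a content hash) are my own assumptions, not part of the original setup:

```python
import os

def versioned_name(path):
    """Build a cache-friendly filename that changes whenever the file does.
    Here the file's mtime is baked into the name; a content hash works too."""
    base, ext = os.path.splitext(os.path.basename(path))
    mtime = int(os.path.getmtime(path))
    return f"{base}.{mtime}{ext}"

# e.g. "foo.mp3" last modified at 1700000000 becomes "foo.1700000000.mp3",
# so a reverse proxy can safely cache each URL forever.
```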
If you want (or need) to store the files on disk instead of as BLOBs in a database, there are a couple of things you need to keep in mind.
Many (but not necessarily all) file systems don't work too well with folders containing many files, so you probably don't want to store everything in one big folder - but that doesn't mean you need specialist hosting.
The key is to distribute the files into a folder hierarchy, based on some hash function. As an example, we'll use the MD5 of the filename here, but it's not particularly important which algorithm you use or what data you are hashing, as long as you're consistent and have the data available when you need to locate a file.
In general, the output of a hash function is formatted as a hexadecimal string: for example, the MD5 of "foo.mp3" is 10ebb1120767e9de166e0f5905077cb1.
You can create 16 folders, one for each possible hexadecimal character: a directory named 0, one named 1, and so on up to f.
In each of those 16 folders, repeat this structure, so you have two levels. (0/0/, 0/1/,... , f/f/)
You then simply place each file in the folder dictated by its hash: the first character determines the first folder, and the second character determines the subfolder. Using that scheme, foo.mp3 would go in 1/0/, bar.mp3 goes in b/6/, and baz.mp3 goes in 1/b/.
Since these hash functions are intended to distribute their values evenly, your files will be spread fairly evenly across these 256 folders, which reduces the number of files in any single folder; statistically, 15,000 files would result in an average of nearly 60 per folder, which should be no problem.
If you're unlucky and the hash function you chose ends up clumping too many of your files in one folder anyway, you can extend the hierarchy to more than 2 levels, or you can simply use a different hash function. In both cases, you need to redistribute the files, but you only need to do that once, and it shouldn't be too much trouble to write a script to do it for you.
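A minimal Python sketch of the scheme described above; the two-level depth and the MD5-of-filename choice follow the example, while the function names and the root folder are just for illustration:

```python
import hashlib
import os
import shutil

def hashed_path(root, filename, levels=2):
    """Return the storage path for a file, using the first characters
    of the MD5 of its name to pick the folder at each level."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i] for i in range(levels)]   # e.g. '1', '0' for foo.mp3
    return os.path.join(root, *parts, filename)

def store(root, source_path):
    """Copy a file into its hashed location, creating folders as needed."""
    target = hashed_path(root, os.path.basename(source_path))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copy2(source_path, target)
    return target

# hashed_path("audio", "foo.mp3") -> "audio/1/0/foo.mp3"
```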
For managing your files, you will likely want a small database indexing what files you currently have, but this does not necessarily need to be used for anything other than managing them - if you know the name of the file, and you use the filename as input to your hash function, you can just calculate the hash again and find its location that way.
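If you do keep such a small index for management, something as light as SQLite would do. A sketch under my own assumptions (the table layout and database name are not prescribed anywhere):

```python
import sqlite3

conn = sqlite3.connect("audio_index.db")
conn.execute("""CREATE TABLE IF NOT EXISTS files (
                    name  TEXT PRIMARY KEY,             -- e.g. 'foo.mp3'
                    added TEXT DEFAULT CURRENT_TIMESTAMP
                )""")

def register(name):
    # Record that the file exists; its location on disk stays
    # recomputable from the name via the hash scheme above.
    conn.execute("INSERT OR IGNORE INTO files (name) VALUES (?)", (name,))
    conn.commit()
```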
I would serve these from an in-memory database. 15 KB * 15,000 files = 225 MB of raw data, so even with significant overhead it will easily fit in a medium hosting plan. Disk-backed caches might be elegant here, e.g. memcachedb, Ehcache or similar; then you only have one API and some configuration.
You should warm up the cache on startup, though.
The metadata can live in MySQL or similar. You might keep a master copy there too, for easier management and as a backend for the cache.
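A minimal sketch of that warm-up step, assuming a plain memcached instance reached through the pymemcache client; the host, key scheme and audio directory are all assumptions for illustration:

```python
import os
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def warm_cache(root="audio"):
    """Walk the audio folder once at startup and push every clip into the
    cache, keyed by filename; ~225 MB fits a medium plan's memory."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                cache.set(name, f.read())

def get_clip(name):
    """Serve a clip from the cache; on a miss, fall back to disk
    (re-read and repopulate, omitted here for brevity)."""
    return cache.get(name)
```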