Using a Filesystem (Not a Database!) for Schemaless Data - Best Practices
After reading over my other question, Using a Relational Database for Schema-Less Data, I began to wonder if a filesystem is more appropriate than a relational database for storing and querying schemaless data.
Rather than just building a file system on top of MySQL, why not just save the data directly to the filesystem? Indexing needs to be figured out, but modern filesystems are very stable, have great features like replication, snapshot and backup facilities, and are flexible at storing schema-less data.
However, I can't find any examples of someone using a filesystem instead of a database.
Where can I find more resources on how to implement a schemaless (or "document-oriented") database as a layer on top of a filesystem? Is anyone using a modern filesystem as a schemaless database?
Yes, a filesystem can be viewed as a special case of a NoSQL-like database system. It has some limitations that should be considered in any design decision:
pros:
- simple, intuitive
- takes advantage of years of tuning and caching algorithms
- easy backup, potentially easy clustering
things to think about:
- richness of metadata - what types of data does it store, how does it let you query them, can you have hierarchical or multivalued attributes
- speed of querying metadata - not all filesystems are particularly well optimized for anything other than size and dates
- inability to join queries (though that's pretty much common to NoSQL)
- inefficient storage usage (unless the file system performs block suballocation, you'll typically blow 4-16K per item stored regardless of size)
- may not have the kind of caching algorithm you want for its directory structure
- tends to be less tunable, etc.
- backup solutions may have trouble depending on how you store things - too deep, too many items per node, etc. - which might negate an obvious advantage of such a structure
- locking for a LOCAL filesystem works pretty well, of course, if you call the right routines, but not necessarily for a network-based filesystem (those problems have been solved in various ways, but it's certainly a design issue)
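As a rough illustration of the "too many items per node" and write-safety points, here is a minimal Python sketch of a key-to-path layout. The function names (`doc_path`, `put`, `get`), the two-level hashed fan-out, and the JSON encoding are all assumptions made for the example, not something this answer prescribes.

```python
import hashlib
import json
import os
import tempfile


def doc_path(root, key):
    """Map a key to root/ab/cd/<sha1>.json so no single directory grows too large."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest + ".json")


def put(root, key, doc):
    """Write a document via a temp file in the same directory, then rename it."""
    path = doc_path(root, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(doc, f)
            f.flush()
            os.fsync(f.fileno())
        # Assumption: both paths are on one local POSIX filesystem,
        # where the rename is atomic for readers.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return path


def get(root, key):
    """Return the stored document, or None if the key is absent."""
    try:
        with open(doc_path(root, key), encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

Hashing the key keeps directories shallow and evenly filled, at the cost of file names that are no longer human-readable. Whether the rename stays atomic on a network filesystem is exactly the kind of design issue mentioned above and would need to be checked per filesystem.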
I had the same idea more than 15 years ago, when hosting costs and hardware limitations were very different from today.
My main motivation was to design a cheap and simple solution able to withstand traffic spikes. Another goal was to improve the security of the applications by taking SQL attack vectors out of the equation.
I ended up with a simple document-oriented database, more like a wrapper around FS functions.
What started as a personal project out of curiosity proved to be very rewarding in the long run. I will try to list both pros and cons.
PROS:
- Fast
- Cheap maintenance. Most applications I built using a file system "database" are still working today with zero maintenance on the database implementation part. This was an unexpected outcome, and it happens because the file system functions rarely change in the programming languages I used this solution with (PHP, C, C++, Erlang). I can't say the same about applications using mainstream databases. They often require fixing deprecated code, and many of my old projects are now dead in the water because either I or the clients decided not to finance the expensive upgrades anymore, or they are running old unsupported database versions that pose a high security risk.
- Resilient to attacks, being completely immune to SQL injection. Many attackers target mainstream products and are clueless when facing a custom storage facility.
- Amazingly good at withstanding traffic spikes compared to many database systems that require socket connections. It's quite easy to exhaust a database's maximum connection limit, and many drivers for well-known NoSQL databases have a limited connection pool that they reuse across multiple threads, forcing the industry to design expensive distributed systems.
- Unexpectedly easy to scale. In one case, when the application required much more data to be stored than I had initially anticipated, I used a distributed file system (Ceph) and solved the problem without any code modification.
- Keeping the files in a RAM FS opens many opportunities to optimize things
- Did I say security? Usually all you have to take care of is making sure that no upload process can write to your FS database files or play tricks with file names. And of course apply your usual OS security measures to protect your files.
- Easy to back up and maintain using file system tools.
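The file-name concern in the security point can be handled by rejecting any key that does not resolve to a file inside the data directory. A minimal Python sketch follows; the helper name `safe_path`, the allowed-character rule, and the `.json` suffix are assumptions for illustration, not the scheme I actually used.

```python
import os
import re

# Assumed policy: keys are short and contain only letters, digits, "_" and "-".
_KEY_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def safe_path(root, key):
    """Return the file path for key inside root, or raise ValueError."""
    if not _KEY_RE.fullmatch(key):
        raise ValueError("illegal key: %r" % key)
    root = os.path.realpath(root)
    path = os.path.realpath(os.path.join(root, key + ".json"))
    # Defence in depth: after resolving symlinks the path must still sit under root.
    if os.path.commonpath([root, path]) != root:
        raise ValueError("key escapes data directory: %r" % key)
    return path
```

An allow-list of characters is easier to reason about than trying to strip `..` or slashes from user input after the fact.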
CONS:
- Atomic operations are hard to implement due to the lack of the supervisor processes found in more complex database systems.
- Implementing counters is hard, and you will have to be quite creative in designing an FS-based database locking mechanism, especially if you want to remain compatible with distributed file systems such as Ceph, for which OS-level file locks are known to be buggy.
- Handling concurrent writes is tricky. I came up with a simple solution resembling Cassandra writes: adding updates as new files and having cron jobs clean up the old "versions" of the data.
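The "new file per update" idea from the last point can be sketched roughly as follows. This is one possible reading of that description, not my original code; the names `write_version`, `read_latest`, and `prune`, and the timestamp-plus-UUID file naming, are assumptions for the example.

```python
import json
import os
import time
import uuid


def write_version(doc_dir, doc):
    """Store an update as a brand-new file; writers never modify existing files."""
    os.makedirs(doc_dir, exist_ok=True)
    # Zero-padded nanosecond timestamp sorts lexicographically; the UUID avoids name clashes.
    name = "%020d-%s.json" % (time.time_ns(), uuid.uuid4().hex)
    tmp = os.path.join(doc_dir, "." + name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(doc, f)
    os.replace(tmp, os.path.join(doc_dir, name))  # publish the finished file
    return name


def _versions(doc_dir):
    """Published version file names, oldest first; unfinished temp files are skipped."""
    try:
        names = os.listdir(doc_dir)
    except FileNotFoundError:
        return []
    return sorted(n for n in names if n.endswith(".json") and not n.startswith("."))


def read_latest(doc_dir):
    """Return the newest version of the document, or None if there is none."""
    names = _versions(doc_dir)
    if not names:
        return None
    with open(os.path.join(doc_dir, names[-1]), encoding="utf-8") as f:
        return json.load(f)


def prune(doc_dir, keep=1):
    """Delete all but the newest `keep` versions (the job a cron task would run)."""
    names = _versions(doc_dir)
    old = names[:-keep] if keep > 0 else names
    for n in old:
        os.remove(os.path.join(doc_dir, n))
    return len(old)
```

Because every write creates a distinct file, two writers never contend for the same path, and the newest timestamp simply wins. The sketch relies on the wall clock for ordering, so ties and clock skew between hosts are not handled here.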
My conclusion was that using the file system as a database is best for applications where the content is maintained by a limited number of administrators and concurrent writes are rarely a concern, but where you want as many cheap reads as possible. For those scenarios this idea can be quite a money saver.
Disclaimer: Please don't judge me too hard :) I'm a programmer with the old mindset of being more a creator than a user of out-of-the-box solutions. I lived through the times when programmers were doing a lot from scratch to fit their needs, including... operating systems. I believe personal experiments (including reinventing the wheel) are good learning opportunities for anybody.
One thing you may want to take into consideration is Oracle's BFILE datatype, which is a pointer to a file on disk. Perhaps that might be the best of both worlds? Microsoft SQL Server doesn't seem to offer this capability.
There's a big example of an implementation at Amazon's S3.
http://aws.amazon.com/s3/
This sort of implementation is what a lot of companies are moving towards, because it scales fundamentally better than a relational database can. The approach is simple, it works, and for some problems it's a great solution. In the case of Amazon's S3, it's particularly nice for cloud storage if you don't want to deal with the hassle of storing the data yourself.
You are welcome to take a look at our Solid File System, which is a virtual file system product with built-in support for file metadata and an SQL-like search mechanism that searches through this metadata. Also please read the article that describes the benefits of storing different types of data in different kinds of storage.