Migrating from processing many small data files to a few large files in ruby

2022-12-12 05:09 问答作者：

What should I keep in mind when migrating from processing many small data files to a few large data files in ruby?

Ba开发者_开发知识库ckground: I'm a bioinformatician who is processing next generation sequencing data, which produces about one million sequences per run. I previously saved each one of the million sequences to its own file, and did a few processing steps to each sequence, producing a couple of files for each sequence. Unfortunately, having a couple of million files is making file input and output a major bottleneck (and also makes backup slow). (Having millions of files is also discouraged in answers to this question)

I considered using sqlite to store each file, but I want to avoid this option if possible, to avoid adding dependencies.

I suspect that I should write one and only one module for handling the large files, and let all of the processing scripts (which run as independent processes) use this module whenever it wants to do input or output. Providing the processing classes with a filestream created with StringIO may be useful for this, as that way they don't need to know about how the large files work.

In order to avoid having to read an entire large file when getting input (I want processing of each sequence to be an independent process, so that an analysis of one sequence can't corrupt the analysis of another sequence), I'll have to keep track of where I'm up to in the large input file. Although more sophisticated inter-process communication techniques exist, I might merely use a temporary file to store the character position for IO#seek.

I'll also have to keep in mind that I won't really be able to run multiple processes at once if they're writing to the same file, and that the large file handler will need to flush its output regularly.

I don't know the details of your situation, but the application you are describing -- I want to store a million things and I'd like to access them quickly and flexibly -- sounds like a DB to me. By avoiding tools like sqlite you aren't necessarily avoiding dependencies; you might be trading one kind of dependency for another.

If you do have to roll your own file-based solution, you don't necessarily have to go from one extreme to the other. What about 1000 medium-sized files, dispersed across 10 subdirectories? And those medium-sized files could be .tar archives or something similar (directories in disguise) that, from the point of view of your code, might behave a lot like the 1 million little files you're used to handling. In addition, those .tar files will remain accessible directly from the command-line without any special software.

Maybe those are crazy ideas, but if you're going to avoid a DB and instead whip together something quick and practical, consider options that don't require you to build the moral equivalent of your own DB system.

If this is just a case of storing "a bunch of files" you might just need a simple key/value store like BDB which could scale up quite easily to any RDBMS including MySQL, SQLite, or even a key/value store like Tokyo-Cabinet.

Any reasons for SQLite being such a problem? A robust data storage mechanism might be a much better approach to the 'pile of files' system.

继续阅读：file-io ruby

Migrating from processing many small data files to a few large files in ruby

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

Easiest way to get words of one line from istream into a vector?

抽烟只抽炫赫门？

性激素六项检查的最佳时间是多久？多少钱？？

Infinite gtk warnings when I right click on the icon