
How should I store a large amount of text data in memory?

I am working on a C parser and wondering how experts manage a large amount of text / strings (> 100 MB) stored in memory? The content is expected to be accessible all the time at a fast pace. Background: Red Hat / gcc / libc

a single char array would be out of bounds, causing a segmentation fault... any idea or experience is welcome to share / discuss...


mmap(2) the file into the VM, and just use that.


"a single char array would be out of boundary causing segmentation fault" - I think this isn't right. A segmentation fault is caused by accessing protected memory, not by allocating too big a chunk. In any case, you should be able to allocate up 2-3GB on a 32-bit machine and much more on 64-bit.

You can use a char array, but if you want fast access then perhaps you need some sort of indexing on top of that.

Could you clarify your use case more? Are you trying to create a parser for the C language? Why do you expect such large input or output? Neither sources nor binaries are commonly that big.


mmap is the best way to deal with a large amount of data that is stored in a file, if you want random access to that data.

mmap tells the virtual memory system to map a contiguous portion of address space to contain the data found in the file. The virtual memory system will allocate a range of address space, backed by that file. When you access any location in that address space, it will allocate a page of physical memory, read that section of the file in from the disk, and point that portion of your virtual address space to the physical memory that it used to read the file. When it needs to make more room in physical memory, it will write out any changes to disk (if applicable), and remove the mapping of that section of virtual address space.

You would use it like so:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* the header where mmap is defined */
#include <fcntl.h>    /* open */
#include <stdlib.h>   /* exit */
#include <unistd.h>   /* close */

int file;
char *contents;
struct stat statbuf;
off_t len;

file = open("path/to/file", O_RDONLY);
if (file < 0)
  exit(1); /* or otherwise handle the error */

if (fstat(file, &statbuf) < 0)
  exit(1);

len = statbuf.st_size;

contents = mmap(0, len, PROT_READ, MAP_SHARED, file, 0);
if (contents == MAP_FAILED)
  exit(1);

// Now you can use contents as a pointer to the contents of the file

// When you're done, unmap and close the file.

munmap(contents, len);
close(file);


It's a very unusual C parser that needs the source text (if that is what you are talking about) to be held in memory. Most parsers read the source effectively a token at a time and convert it immediately into some internal representation. And they typically hold the representation for only a single source file (plus #includes), which is highly unlikely to be as big as 100 MB - perhaps you have some design issues here?
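As a rough illustration of token-at-a-time reading (a minimal sketch, not taken from any particular parser; next_identifier is a hypothetical helper), the scanner only ever buffers the current token, no matter how large the input file is:

#include <ctype.h>
#include <stdio.h>
#include <stddef.h>

/* Minimal sketch: scan one identifier-like token at a time from a
 * stream, so only the current token is ever held in memory. */
int next_identifier(FILE *in, char *buf, size_t bufsize)
{
    int c;
    size_t i = 0;

    /* skip anything that cannot start an identifier */
    while ((c = fgetc(in)) != EOF && !isalpha(c) && c != '_')
        ;
    if (c == EOF)
        return 0;

    do {
        if (i + 1 < bufsize)
            buf[i++] = (char)c;
        c = fgetc(in);
    } while (c != EOF && (isalnum(c) || c == '_'));

    if (c != EOF)
        ungetc(c, in); /* leave the terminating character for the next call */
    buf[i] = '\0';
    return 1;
}

int main(void)
{
    char token[256];
    while (next_identifier(stdin, token, sizeof token))
        puts(token); /* print each identifier-like token found */
    return 0;
}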


If you are allocating a > 100 MB char array on the stack, you'll most likely overflow the stack. While you could increase the stack size using compiler/linker options, this won't necessarily solve the problem, as some OSes expect approximately linear access to stack pages (google "stack guard pages").

Instead, if you know the size at compile time, try allocating a static char array. Better yet, use malloc(). (The code you posted declares an array whose size depends on the variable a -- this is called a "variable-length array", a C99 feature that not all compilers support. OTOH every C implementation lets you call malloc() to allocate memory dynamically.)
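For example (a minimal sketch; the 100 MB figure and the names are just illustrative), both a static array and a malloc()'d buffer avoid the stack entirely:

#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE (100u * 1024 * 1024) /* illustrative 100 MB */

/* A static (file-scope) array lives outside the stack, so it is not
 * limited by the stack size. */
static char static_buffer[DATA_SIZE];

int main(void)
{
    /* Heap allocation via malloc() is also off the stack and lets the
     * size be decided at run time. */
    char *heap_buffer = malloc(DATA_SIZE);
    if (heap_buffer == NULL) {
        perror("malloc");
        return 1;
    }

    static_buffer[0] = 'a'; /* fine */
    heap_buffer[0] = 'a';   /* fine */

    /* By contrast, a 100 MB automatic array declared inside a function
     * would live on the stack and is likely to overflow it. */

    free(heap_buffer);
    return 0;
}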


Such a large amount of data is better stored as:

  1. A global array, if the data is going to be constant.
  2. On the heap (memory allocated dynamically), if globals are not allowed in your case.

But make it a point not to store it on the stack, lest it overflow and cause other problems.

If you are asking about specific data structures that can be used to store/access this data efficiently, then I suggest (a rough sketch of option 1 follows the list):

  1. Hash table
  2. Array
  3. List.
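For the hash-table option, a minimal sketch (hypothetical names like index_put/index_get; a real index would size the table and handle errors to suit the data) can simply record offsets into the big in-memory buffer so lookups stay fast:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

/* Chained hash table mapping a string key to an offset into the
 * large in-memory text buffer. */
struct entry {
    const char *key;     /* key string (points into the parsed text) */
    size_t offset;       /* where the record starts in the buffer */
    struct entry *next;  /* next entry in the same bucket */
};

static struct entry *buckets[NBUCKETS];

static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381; /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int index_put(const char *key, size_t offset)
{
    unsigned long b = hash_str(key) % NBUCKETS;
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1; /* a real parser would report this error */
    e->key = key;
    e->offset = offset;
    e->next = buckets[b];
    buckets[b] = e;
    return 0;
}

struct entry *index_get(const char *key)
{
    struct entry *e;
    for (e = buckets[hash_str(key) % NBUCKETS]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

int main(void)
{
    index_put("example_key", 42);
    return index_get("example_key") ? 0 : 1;
}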


You could save a lot of space by compressing the tokens as you read them from the source stream (presumably a text file). Eliminating excess whitespace and comments as you read the input text could reduce your memory requirements by up to 50%.

But I'm curious why you need to store so much in memory at once. String literals, identifiers, and symbol table entries can be cached to disk when you're at a point in parsing that makes them inaccessible or out of scope.


Sorry if it's a beginner problem; a segmentation fault appears with the following:

int a = 10000000; char content2[a]; content2[0] = 'a';

The use case is: the file is generated daily in a structured plain-text format (similar to XML) before parsing. The data itself is quite static, and I want to make it accessible as fast as I can, so I prefer to keep it in memory after it is parsed.

