
How should I store a large amount of text data in memory?

I am working on a C parser and wondering how experts manage a large amount of text / strings (> 100 MB) stored in memory? The content is expected to be accessible all the time at a fast pace. Background: Red Hat / gcc / libc

a single char array would be out of bounds, causing a segmentation fault... any idea or experience is welcome to share / discuss...


mmap(2) the file into the VM, and just use that.


"a single char array would be out of boundary causing segmentation fault" - I think this isn't right. A segmentation fault is caused by accessing protected memory, not by allocating too big a chunk. In any case, you should be able to allocate up 2-3GB on a 32-bit machine and much more on 64-bit.

You can use a char array, but if you want fast access then perhaps you need some sort of indexing on top of that.

Could you clarify your use case more? Are you trying to create a parser for the C language? Why do you expect such large input or output? Neither sources nor binaries are commonly that big.


mmap is the best way to deal with a large amount of data that is stored in a file, if you want random access to that data.

mmap tells the virtual memory system to map a contiguous portion of address space to contain the data found in the file. The virtual memory system will allocate a range of address space, backed by that file. When you access any location in that address space, it will allocate a page of physical memory, read that section of the file in from the disk, and point that portion of your virtual address space to the physical memory that it used to read the file. When it needs to make more room in physical memory, it will write out any changes to disk (if applicable), and remove the mapping of that section of virtual address space.

You would use it like so:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* the header where mmap is defined */
#include <fcntl.h>    /* open */
#include <stdlib.h>   /* exit */
#include <unistd.h>   /* close */

int file;
char *contents;
struct stat statbuf;
off_t len;

file = open("path/to/file", O_RDONLY);
if (file < 0)
  exit(1); /* or otherwise handle the error */

if (fstat(file, &statbuf) < 0)
  exit(1);

len = statbuf.st_size;

contents = mmap(0, len, PROT_READ, MAP_SHARED, file, 0);
if (contents == MAP_FAILED)
  exit(1);

// Now you can use contents as a pointer to the contents of the file

// When you're done, unmap and close the file.

munmap(contents, len);
close(file);


It's a very unusual C parser that needs the source text (if that is what you are talking about) to be held in memory. Most parsers read the source effectively a token at a time and convert it immediately into some internal representation. And they typically hold the representation for only a single source file (plus #includes), which is highly unlikely to be as big as 100 MB - perhaps you have some design issues here?
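As a rough illustration of token-at-a-time reading (a minimal sketch, not taken from any particular parser; next_identifier is a hypothetical helper), the scanner only ever buffers the current token, no matter how large the input file is:

#include <ctype.h>
#include <stdio.h>
#include <stddef.h>

/* Minimal sketch: scan one identifier-like token at a time from a
 * stream, so only the current token is ever held in memory. */
int next_identifier(FILE *in, char *buf, size_t bufsize)
{
    int c;
    size_t i = 0;

    /* skip anything that cannot start an identifier */
    while ((c = fgetc(in)) != EOF && !isalpha(c) && c != '_')
        ;
    if (c == EOF)
        return 0;

    do {
        if (i + 1 < bufsize)
            buf[i++] = (char)c;
        c = fgetc(in);
    } while (c != EOF && (isalnum(c) || c == '_'));

    if (c != EOF)
        ungetc(c, in); /* leave the terminating character for the next call */
    buf[i] = '\0';
    return 1;
}

int main(void)
{
    char token[256];
    while (next_identifier(stdin, token, sizeof token))
        puts(token); /* print each identifier-like token found */
    return 0;
}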


If you are allocating a > 100 MB char array on the stack, you'll most likely overflow the stack. While you could increase the stack size using compiler/linker options, this won't necessarily solve the problem, as some OSes expect approximately linear access to stack pages (google "stack guard pages").

Instead, if you know the size at compile time, try allocating a static char array. Better yet, use malloc(). (The code you posted declares an array whose size depends on the variable a -- this is called a "variable-length array", a C99 feature that not all compilers support. OTOH every C implementation lets you call malloc() to allocate memory dynamically.)
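For example (a minimal sketch; the 100 MB figure and the names are just illustrative), both a static array and a malloc()'d buffer avoid the stack entirely:

#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE (100u * 1024 * 1024) /* illustrative 100 MB */

/* A static (file-scope) array lives outside the stack, so it is not
 * limited by the stack size. */
static char static_buffer[DATA_SIZE];

int main(void)
{
    /* Heap allocation via malloc() is also off the stack and lets the
     * size be decided at run time. */
    char *heap_buffer = malloc(DATA_SIZE);
    if (heap_buffer == NULL) {
        perror("malloc");
        return 1;
    }

    static_buffer[0] = 'a'; /* fine */
    heap_buffer[0] = 'a';   /* fine */

    /* By contrast, a 100 MB automatic array declared inside a function
     * would live on the stack and is likely to overflow it. */

    free(heap_buffer);
    return 0;
}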


Such a large amount of data is better stored as:

  1. A global array, if the data is going to be constant.
  2. On the heap (memory allocated dynamically), if globals are not allowed in your case.

But make it a point not to store it on the stack, lest it overflow and cause other problems.

If you are asking about specific data structures that can be used to store/access this data efficiently, then I suggest (a rough sketch of option 1 follows the list):

  1. Hash table
  2. Array
  3. List.
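For the hash-table option, a minimal sketch (hypothetical names like index_put/index_get; a real index would size the table and handle errors to suit the data) can simply record offsets into the big in-memory buffer so lookups stay fast:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

/* Chained hash table mapping a string key to an offset into the
 * large in-memory text buffer. */
struct entry {
    const char *key;     /* key string (points into the parsed text) */
    size_t offset;       /* where the record starts in the buffer */
    struct entry *next;  /* next entry in the same bucket */
};

static struct entry *buckets[NBUCKETS];

static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381; /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int index_put(const char *key, size_t offset)
{
    unsigned long b = hash_str(key) % NBUCKETS;
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1; /* a real parser would report this error */
    e->key = key;
    e->offset = offset;
    e->next = buckets[b];
    buckets[b] = e;
    return 0;
}

struct entry *index_get(const char *key)
{
    struct entry *e;
    for (e = buckets[hash_str(key) % NBUCKETS]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

int main(void)
{
    index_put("example_key", 42);
    return index_get("example_key") ? 0 : 1;
}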


You could save a lot of space by compressing the tokens as you read them from the source stream (presumably a text file). Eliminating excess whitespace and comments as you read the input text could reduce your memory requirements by up to 50%.

But I'm curious why you need to store so much in memory at once. String literals, identifiers, and symbol table entries can be cached to disk when you're at a point in parsing that makes them inaccessible or out of scope.


Sorry if it's a beginner problem; a segmentation fault appears with the following:

int a = 10000000; char content2[a]; content2[0] = 'a';

The use case is: the file is generated daily in a structured plain-text format (similar to XML) before parsing. The data itself is quite static, and I want to make it accessible as fast as I can, so I prefer to keep it in memory after it is parsed.

