
.NET string class alternative

Since I am planning an application that will hold much of its data in memory, I would like some kind of 'compact' string class, at least one which stores a string in a format no larger than a zero-terminated ASCII version of the string.

Do you know of any such string class implementation? It should have utility functions like the original string class.

EDIT:

I need to sort the strings and be able to scan through them, just to mention a few of the operations that I will use.

Ideally, it would be source compatible with System.String, so a basic search & replace would be enough to optimize the application's memory footprint.

NUMBERS:

I could have 100k records, each record having up to 10 strings of 30-60 characters. So:

100,000 × 10 × 60 = 60,000,000 characters, i.e. roughly 57 mega-characters. Why not use 60 megs of RAM for that instead of 120 megs? Operations will be faster, everything will be tighter.

Trees will be used for searching, but they won't help with the regex scans that I plan to run.


EDIT: I now have a blog post on this topic which goes into a fair amount more detail.


Going by your numbers:

I could have 100k records, each record having up to 10 strings of 30-60 characters.

Let's start off by adding in the object overhead - a string takes up about 20 bytes (IIRC - possibly more on a 64-bit CLR) plus the actual data, due to the inevitable object overhead and the length. Let's do the maths again:

Using string: 1 million objects at 20+120 bytes = 140MB

Using a new class: 1 million objects at 20+60 bytes = 80MB

Still a 60MB difference of course, but proportionally less than you were expecting. You're only saving 42% of the space instead of 50%.

Now, you talk about things being faster: given that the CLR is natively aware of string, I suspect a third-party class won't be able to match the speed for some of its operations, and you'd have to put a lot of work in to get many of the others to be the same speed. Admittedly you will have better cache coherency, and if you can ignore culture issues, that should save a bit of time too by making all comparisons ordinal.

For the sake of 60MB, I wouldn't bother. That's a tiny difference these days - consider how many more customers you'll have to gain through making this small saving in order to make up for the significant extra cost of working with two different string types.

Having said all of this, I'm quite tempted to implement it myself anyway as a blogging project like Edulinq. Don't expect any results for weeks or months though :)

EDIT: I've just thought of another problem. The numbers we've got above aren't actually right... because the string class is special. It embeds its data directly in the object - unlike any other data type apart from arrays, the size of a string instance isn't fixed; it varies based on the data within it.

Writing your own AsciiString class, you wouldn't be able to do that - you'd have to embed an array reference within the class:

public class AsciiString
{
    private readonly byte[] data;
}

That means you'd need an extra 4 or 8 bytes for the reference (32 or 64-bit CLR) and the additional overhead of an array object (16 bytes, IIRC) per string.

If you designed it like Java, taking a substring could reuse the existing byte array (two strings could share), but then you'd need an extra length and offset within AsciiString. You'd also lose some of the cache coherency benefits.
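
As an illustration of that Java-style layout (the names and details here are assumptions, not an existing library), the sharing could look something like this:

public sealed class AsciiString
{
    private readonly byte[] data;   // possibly shared with other instances
    private readonly int offset;    // where this string starts within data
    private readonly int length;    // number of ASCII bytes

    public AsciiString(byte[] data, int offset, int length)
    {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public int Length { get { return length; } }

    public char this[int index]
    {
        get { return (char) data[offset + index]; }
    }

    // Reuses the existing buffer instead of copying the bytes.
    public AsciiString Substring(int startIndex, int count)
    {
        return new AsciiString(data, offset + startIndex, count);
    }
}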

You could use just raw byte arrays as the data structure and write a bunch of extension methods to act on them... but that would be horrible, as then you couldn't tell the difference between a normal byte array and one which was meant to represent an ASCII string.

Another possibility would be to create a struct like this:

struct AsciiString
{
    private readonly byte[] data;
    ...
}

That would effectively give you strong typing again, but you'd need to think about things like:

AsciiString x = new AsciiString();

which would end up with a null data reference. You could effectively treat this as if x were a null value, but it would be pretty non-idiomatic.
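
One hedged sketch of how that might be handled, treating the default value explicitly as a "null" string (illustrative only):

public struct AsciiString
{
    private readonly byte[] data;

    public AsciiString(byte[] data)
    {
        this.data = data;
    }

    // True for new AsciiString() / default(AsciiString), where the
    // data reference was never assigned.
    public bool IsNull
    {
        get { return data == null; }
    }

    public int Length
    {
        get { return IsNull ? 0 : data.Length; }
    }
}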


I've actually had a similar problem, but with somewhat different problem parameters. My application deals with 2 types of strings - relatively short ones measuring 60-100 chars and longer ones with 100-1000 bytes (averages around 300).

My use case also has to support Unicode text, but a relatively small percentage of the strings actually contain non-English characters.

In my use case I exposed each string property as a native String, but the underlying data structure was a byte[] holding UTF-8 bytes.

My use case also requires searching and sorting through these strings, getting substrings and other common string operations. My dataset measures in the millions.

The basic implementation looks something like this:

byte[] _myProperty;

public String MyProperty
{
    get
    {
        if (_myProperty == null)
            return null;

        return Encoding.UTF8.GetString(_myProperty);
    }
    set
    {
        _myProperty = Encoding.UTF8.GetBytes(value);
    }
}

The performance hit for these conversions, even when searching and sorting, was relatively small (about 10-15%).

This was fine for a while, but I wanted to reduce the overhead further. The next step was to create a merged array for all the strings in a given object (an object would hold either 1 short and 1 long string, or 4 short and 1 long string), so there would be one byte[] per object, plus only 1 byte per string to store its length (the lengths are always < 256). Even if your strings can be longer than 256, an int is still cheaper than the 12-16 byte overhead of a separate byte[].

This eliminated much of the byte[] overhead and added a little complexity, but had no additional impact on performance (the encoding pass is relatively expensive compared with the array copy involved).

This implementation looks something like this:

byte _property1;
byte _property2;
byte _property3;

private byte[] _data;

// I actually used an enum to indicate which property, but I am sure you get the idea.
private int GetStartIndex(int propertyIndex)
{
    // The strings are stored back to back in _data, so a property's start
    // index is the sum of the lengths of the properties before it.
    int result = 0;
    switch (propertyIndex)
    {
        // The fall-through is on purpose (C# requires an explicit goto for it).
        case 2:
            result += _property2;
            goto case 1;
        case 1:
            result += _property1;
            break;
    }

    return result;
}

private int GetLength(int propertyIndex)
{
    switch (propertyIndex)
    {
        case 0:
            return _property1;
        case 1:
            return _property2;
        case 2:
            return _property3;
    }

    return -1;
}

private String GetString(int propertyIndex)
{
    int startIndex = GetStartIndex(propertyIndex);
    int length = GetLength(propertyIndex);
    byte[] result = new byte[length];
    Array.Copy(_data, startIndex, result, 0, length);

    return Encoding.UTF8.GetString(result);
}

So the getter looks like this:

public String Property1
{
    get { return GetString(0); }
}

The setter is in the same spirit: copy the original data into two arrays (from 0 to startIndex, and from startIndex + length to the end), build a new array out of the three pieces (dataAtStart + newData + dataAtEnd), and store the new length in the appropriate length field.
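
Something like this, reusing the fields and helpers from the snippets above (SetLength here is a hypothetical helper that writes the new length back into the right _propertyN field):

private void SetString(int propertyIndex, String value)
{
    byte[] newBytes = Encoding.UTF8.GetBytes(value);
    int startIndex = GetStartIndex(propertyIndex);
    int oldLength = GetLength(propertyIndex);

    byte[] newData = new byte[_data.Length - oldLength + newBytes.Length];

    // dataAtStart + newData + dataAtEnd
    Array.Copy(_data, 0, newData, 0, startIndex);
    Array.Copy(newBytes, 0, newData, startIndex, newBytes.Length);
    Array.Copy(_data, startIndex + oldLength,
               newData, startIndex + newBytes.Length,
               _data.Length - startIndex - oldLength);

    _data = newData;

    // Hypothetical helper: stores (byte) newBytes.Length into _property1/2/3.
    SetLength(propertyIndex, (byte) newBytes.Length);
}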

I was still not happy with the memory saved and with the tedious manual implementation required for each property, so I built an in-memory compressed paging system that uses the amazingly fast QuickLZ to compress a full page. This gave me a lot of control over the time-memory tradeoff (which is essentially the page size).

The compression rate for my use-case (compared with the more efficient byte[] store) approaches 50% (!). I used a page size of approx 10 strings per page and grouped similar properties together (which tend to have similar data). This added an additional overhead of 10-20% (on top of the encoding/decoding pass which is still required). The paging mechanism caches recently accessed pages up to a configurable size. Even without compression this implementation allows you to set a fixed factor on the overhead for each page. The major downside of my current implementation of the page cache is that with compression it is not thread-safe (without it there is no such problem).
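
A rough sketch of the general idea (this is not the real implementation, and it uses DeflateStream as a stand-in for QuickLZ): a page holds the packed bytes of a group of strings and stays compressed until one of them is accessed.

// Needs System.IO and System.IO.Compression.
class CompressedPage
{
    private byte[] _compressed;

    // Compress the packed bytes of roughly 10 strings and keep only the result.
    public void Store(byte[] pageBytes)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(pageBytes, 0, pageBytes.Length);
            }
            _compressed = output.ToArray();
        }
    }

    // Decompress on access; a cache of recently used pages would sit above this.
    public byte[] Load()
    {
        using (var input = new MemoryStream(_compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}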

If you're interested in the compressed paging mechanism let me know (I've been looking for an excuse to open source it).


Alternate data structures

I would suggest that, given your desire to also search through the stored "string" values, you should consider whether a trie structure such as a Patricia trie or, for even better memory amortisation, a Directed Acyclic Word Graph (referred to affectionately as a DAWG) would work better.

Construction of them would take longer (though they are often used in cases where the underlying storage itself already represents this form reasonably well, allowing rapid construction up front), and even though some operations on them are algorithmically superior, you may find that in real-world usage things are actually slower. They do, however, significantly reduce the memory footprint of your data so long as there is a reasonable amount of repetition.

These can be viewed as generalisations of the (built-in) de-duplication that string interning provides in .NET (and Java and many other managed languages).

If you specifically wish to retain an ordering of the strings in some lexicographical manner (so you need only consider one character or code point at a time), then the Patricia trie is likely to be the preferable option; implementing the ordering on top of the DAWG would be problematic.
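
For illustration only, here is a bare-bones trie (a plain one, not a Patricia trie, which would additionally collapse single-child chains into labelled edges to save memory):

// Needs System.Collections.Generic.
class TrieNode
{
    private readonly Dictionary<char, TrieNode> _children = new Dictionary<char, TrieNode>();
    private bool _isWord;

    public void Add(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node._children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node._children.Add(c, child);
            }
            node = child;
        }
        node._isWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node._children.TryGetValue(c, out node))
                return false;
        }
        return node._isWord;
    }
}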

Alternate, more esoteric solutions may work if you have a particular domain of strings, including:

Run-length encoding and other forms of compression.

This comes at the cost of random access into a string, and the risk of actually using more memory if the inputs turn out not to be as hoped. Huffman coding tends to work well on English text and is pretty simple to implement; it has the advantage that the dictionary for it can be shared across all entries in the set, so long as the frequency distribution of letters is comparable. Sorting would become problematic again.

Fixed-length strings.

If you know the strings are small, and all of almost the same (or exactly the same) size, you can store them in fixed-size values - even structs, if desired, when the number of characters is in the region of 16 or less. The usable limit here will depend on your precise usage, and may depend heavily on how willing you are to tune your code to play nicely with this design.
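
A hedged sketch of the struct variant, capped at 16 characters and requiring /unsafe for the inline buffer (names and details are illustrative only):

public unsafe struct Ascii16
{
    private fixed byte _bytes[16]; // characters stored inline, no heap object
    private byte _length;          // how many of the 16 slots are in use

    public static Ascii16 FromString(string value)
    {
        if (value.Length > 16)
            throw new ArgumentException("Too long for an Ascii16.");

        Ascii16 result = default(Ascii16);
        result._length = (byte) value.Length;
        for (int i = 0; i < value.Length; i++)
            result._bytes[i] = (byte) value[i]; // assumes ASCII input

        return result;
    }

    public override string ToString()
    {
        char[] chars = new char[_length];
        fixed (byte* p = _bytes)
        {
            for (int i = 0; i < _length; i++)
                chars[i] = (char) p[i];
        }
        return new string(chars);
    }
}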


You could create a new data structure to hold these, though I think this is overkill.

But if you keep an array (a table) of each distinct word or common phrase, then for each string you store an array of indexes into that table, one per word.

You then pay 4 bytes per word occurrence, but if each word is on average 3.6 characters you save about 3.2 bytes per occurrence, since the 2-bytes-per-letter cost is paid only once per distinct word in the table.

But in order to do searches or sorts you will take a big performance hit from having to rebuild the string, at least for a short time.
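
As a rough sketch of that idea (the names here are invented for illustration):

// Needs System.Collections.Generic.
class WordTable
{
    private readonly List<string> _words = new List<string>();
    private readonly Dictionary<string, int> _ids = new Dictionary<string, int>();

    // Each distinct word is stored exactly once; callers keep only its id.
    public int GetOrAdd(string word)
    {
        int id;
        if (!_ids.TryGetValue(word, out id))
        {
            id = _words.Count;
            _words.Add(word);
            _ids.Add(word, id);
        }
        return id;
    }

    // Rebuilding the full string is the temporary cost mentioned above
    // whenever you need to search or sort.
    public string Rebuild(int[] wordIds)
    {
        return string.Join(" ", Array.ConvertAll(wordIds, id => _words[id]));
    }
}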

You may want to rethink how to design your program, as there are many programs that use large amounts of data and can operate in relatively restricted memory.


Well, there is the UTF8Encoding class

//Example from MSDN
using System;
using System.Text;

public class Example
{
   public static void Main()
   {
      Encoding enc = new UTF8Encoding(true, true);
      string value = "\u00C4 \uD802\u0033 \u00AE"; 

      try
      {
         byte[] bytes = enc.GetBytes(value);
         foreach (var byt in bytes)
            Console.Write("{0:X2} ", byt);
         Console.WriteLine();

         string value2 = enc.GetString(bytes);
         Console.WriteLine(value2);
      }
      catch (EncoderFallbackException e)
      {
         Console.WriteLine("Encoding error: {0}", e.Message);
      }
   }
}

However, like Jon says, any time you want to use it with a method that expects a string (most of the .NET library), you'll have to convert it back to a normal Unicode string anyway... if you gave us more information about what you're trying to do, perhaps we could help you come up with a better solution?

Or, if you truly need low-level byte-array non-internationalizable null-terminated strings, you might be better off just writing this in C++.


How many duplicates are you expecting? If there are lots of duplicates in your data, you might want to consider implementing a string cache (a wrapper around a Dictionary<string, string>) that caches one instance of each distinct string and returns a reference to that instance for every duplicate you add to it.

You can combine that with checking for interned strings, so you always use the interned version, if you have a lot of strings shared over the whole program.
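
A minimal sketch of such a cache, combined with the interning check (illustrative only):

// Needs System.Collections.Generic.
class StringCache
{
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();

    public string GetOrAdd(string value)
    {
        if (value == null)
            return null;

        // Prefer an already-interned instance if one exists.
        string interned = string.IsInterned(value);
        if (interned != null)
            return interned;

        string cached;
        if (_cache.TryGetValue(value, out cached))
            return cached;

        _cache[value] = value;
        return value;
    }
}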

Depending on your data, this might give a much better result than trying to optimize the storage of each individual string.


I think the key to this is that each record has many string fields.

By storing all the string fields for each record in a single array of chars, and then using int fields to hold each field's offset, you could reduce the number of objects greatly. (Each object has an overhead of about 2 words even before you put any data in it.)

Your properties could then convert to/from standard strings. The garbage collector is great at dealing with lots of short-lived garbage, so creating temporary strings when properties are accessed should not be an issue.

(Now, if a lot of the string fields never have their values changed, things become a lot easier.)
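
A very small sketch of that layout (field names are made up for illustration):

class Record
{
    private char[] _chars;     // all of the record's string fields packed together
    private int _nameOffset;   // where the Name field starts within _chars
    private int _nameLength;

    // Creates a temporary string on access; the GC copes well with this
    // kind of short-lived garbage.
    public string Name
    {
        get { return new string(_chars, _nameOffset, _nameLength); }
    }
}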


You can save the per-object overhead by having a big byte[] which stores the characters and then an int-offset into that array as your "string".


Maybe a good old-fashioned character array would suit your needs.


Are all these strings distinct?

In most real-world data sets, I would guess the actual number of distinct strings is probably not that high, and if you take string interning into account, the actual amount of memory consumed may end up being significantly less than you might think.
