开发者

C data structures

Is there a C data structure equatable to the following python structure?

 data = {'X': 1, 'Y': 2}

Basically I want a structure where I can give开发者_如何学Go it an pre-defined string and have it come out with an integer.


The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.

A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.

It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.

This system makes it extremely extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered because a hash function returns a uniformly random result and puts your data unpredictable all over the map (in a perfect world).


Also note that if you're doing a quick one-off hash, like a two or three static hash for some lookup: look at gperf, which generates a perfect hash function and generates simple code for that hash.


The above data structure is a dict type.

In C/C++ paralance, a hashmap should be equivalent, Google for hashmap implementation.


There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.


If the data set will remain relatively small, the easiest solution is to probably just have an array of structures along the lines of:

typedef struct {
    char *key;
    int  val;
} tElement;

then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:

def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0
def add (key,val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay
def del (key):
    pos = lookup (key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1
def find (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found

Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.

Now this isn't the most efficient method in the world but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it massively quick to code up and the performance is irrelevant.


Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.


If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.

If you don't know how big your data set will be getting, I would consider this the absolute minimum.


A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.

A degenerate case of "all items in the same bucket" is no different to an array or linked list.


For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.

Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.

I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.

I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:

char *s = "X";
int val = *s - 'W';

Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.


C doesn't have any collection classes. C++ has std::map.

You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/


A 'trie' or a 'hasmap' should do. The simplest implementation is an array of struct { char *s; int i }; pairs.

Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.


Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜