Efficiently check string for one of several hundred possible suffixes

2022-12-25 20:10 问答作者：

I need to write a C/C++ function that would quickly check if string ends with one of ~1000 predefined suffixes. Specifically the string is a hostname and I need to check if it belongs to one of several hundred predefined second-level domains.

This function will be called a lot so it needs to be written as efficiently as possible. Bitwise hacks etc anything goes as long as it turns out fast.

Set of suffixes is predetermined at compile-time and doesn't change.

I am thinking of either implementing a variation of Rabin-Karp or write a tool that would generate a function with nested ifs and switches that would be custom tailored to specific set of suffixes. Since the application in question is 64-bit to speed up comparisons I could store suffixes of up to 8 bytes in length as const sorted arr开发者_开发百科ay and do binary search within it.

Are there any other reasonable options?

If the suffixes don't contain any expansions/rules (like a regex), you could build a Trie of the suffixes in reverse order, and then match the string based on that. For instance

suffixes:
  foo
  bar
  bao

reverse order suffix trie:
  o
   -a-b  (matches bao)
   -o-f  (matches foo)
  r-a-b  (matches bar)

These can then be used to match your string:

"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix

You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.

Then simply use a hashtable. Dimension it in such a way that there are no collisions, so you don't need buckets; lookups will be exactly O(1). For small hash types (e.g. 32 bits), you'd want to check if the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (order 10^-17) that you can probably live with it.

I would reverse all of the suffix strings, build a prefix tree of them and then test the reverse of your IP string against that.

I think that building your own automata would be the most efficient way.. it's a sort of your second solution, according to which, starting from a finite set of suffixes, it generates an automaton fitted for that suffixes.

I think you can easily use flex to do it, taking care of reversing the input or handling in a special way the fact that you are looking just for suffixes (just for efficienty matters)..

By the way using a Rabin-Karp approach would be efficient too since your suffixes will be short. You can fit a hashset with all the suffixes needed and then

take a string
take the suffix
calculate the hash of the suffix
check if suffix is in the table

Just create a 26x26 array of set of domains. e.g. thisArray[0][0] will be the domains that end in 'aa', thisArray[0][1] is all the domains that end in 'ab' and so on...

Once you have that, just search your array for thisArray[2nd last char of hostname][last char of hostname] to get the possible domains. If there's more than one at that stage, just brute force the rest.

I think that the solution should be very different depending on the type of input strings. If the strings are some kind of string class that can be iterated from the end (such as stl strings) it is a lot easier than if they are NULL-terminated C-strings.

String Class

Iterate the string backwards (don't make a reverse copy - use some kind of backward iterator). Build a Trie where each node consists of two 64-bit words, one pattern and one bitmask. Then check 8 characters at a time in each level. The mask is used if you want to match less than 8 characters - e.g. deny "*.org" would give a mask with 32 bits set. The mask is also used as termination criteria.

C strings

Construct an NDFA that matches the strings on a single-pass over them. That way you don't have to first iterate to the end but can instead use it in one pass. An NDFA can be converted to a DFA, which will probably make the implementation more efficient. Both construction of the NDFA and conversion to DFA will probably be so complex that you will have to write tools for it.

After some research and deliberation I've decided to go with trie/finite state machine approach.

The string is parsed starting from the last character going backwards using a TRIE as long as the portion of suffix that was parsed so far can correspond to multiple suffixes. At some point we either hit the first character of one of the possible suffixes which means that we have a match, hit a dead end, which means there are no more possible matches or get into situation where there is only one suffix candidate. In this case we just do compare remainder of the suffix.

Since trie lookups are constant time, worst case complexity is o(maximum suffix length). The function turned out to be pretty fast. On 2.8Ghz Core i5 it can check 33,000,000 strings per second for 2K possible suffixes. 2K suffixes totaling 18 kilobytes, expanded to 320kb trie/state machine table. I guess that I could have stored it more efficiently but this solution seems to work good enough for the time being.

Since suffix list was so large, I didn't want to code it all by hand so I ended up writing C# application that generated C code for the suffix checking function:

    public static uint GetFourBytes(string s, int index)
    {
        byte[] bytes = new byte[4] { 0, 0, 0, 0};
        int len = Math.Min(s.Length - index, 4);
        Encoding.ASCII.GetBytes(s, index, len, bytes, 0);
        return BitConverter.ToUInt32(bytes, 0);
    }

    public static string ReverseString(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    static StringBuilder trieArray = new StringBuilder();
    static int trieArraySize = 0;

    static void Main(string[] args)
    {
        // read all non-empty lines from input file
        var suffixes = File
            .ReadAllLines(@"suffixes.txt")
            .Where(l => !string.IsNullOrEmpty(l));

        var reversedSuffixes = suffixes
            .Select(s => ReverseString(s));

        int start = CreateTrieNode(reversedSuffixes, "");

        string outFName = @"checkStringSuffix.debug.h";
        if (args.Length != 0 && args[0] == "--release")
        {
            outFName = @"checkStringSuffix.h";
        }

        using (StreamWriter wrt = new StreamWriter(outFName))
        {
            wrt.WriteLine(
                "#pragma once\n\n" +
                "#define TRIE_NONE -1000000\n"+
                "#define TRIE_DONE -2000000\n\n"
                );

            wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray);

            wrt.WriteLine(
                "inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+
                "   int len = trie[0];\n"+
                "   if (curr - str < len) return false;\n"+
                "   const char* cmp = (const char*)(trie + 1);\n"+
                "   while (len-- > 0) {\n"+
                "       if (*--curr != *cmp++) return false;\n"+
                "   }\n"+
                "   return true;\n"+
                "}\n\n"+
                "bool checkStringSuffix(const char* str, int len) {\n" +
                "   if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" +
                "   const char* curr = (str + len - 1);\n"+
                "   int currTrie = " + start.ToString() + ";\n"+
                "   while (curr >= str) {\n" +
                "       assert(*curr >= 0x20 && *curr <= 0x7f);\n" +
                "       currTrie = trieArray[currTrie + *curr - 0x20];\n" +
                "       if (currTrie < 0) {\n" +
                "           if (currTrie == TRIE_NONE) return false;\n" +
                "           if (currTrie == TRIE_DONE) return true;\n" +
                "           return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" +
                "       }\n"+
                "       --curr;\n"+
                "   }\n" +
                "   return false;\n"+
                "}\n"
                );
        }        
    }

    private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix)
    {
        int retVal = trieArraySize;

        if (suffixes.Count() == 1)
        {
            string theSuffix = suffixes.Single();
            trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix);
            ++trieArraySize;
            for (int i = 0; i < theSuffix.Length; i += 4)
            {
                trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i));
                ++trieArraySize;
            }

            retVal = -(retVal + 1);
        }
        else
        {
            var groupByFirstChar =
                from s in suffixes
                let first = s[0]
                let remainder = s.Substring(1)
                group remainder by first;

            string[] trieIndexes = new string[0x60];
            for (int i = 0; i < trieIndexes.Length; ++i)
            {
                trieIndexes[i] = "TRIE_NONE";
            }

            foreach (var g in groupByFirstChar)
            {
                if (g.Any(s => s == string.Empty))
                {
                    trieIndexes[g.Key - 0x20] = "TRIE_DONE";
                    continue;
                }
                trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString();
            }
            trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix);
            retVal = trieArraySize;
            trieArraySize += 0x60;
        }

        return retVal;
    }

So it generates code like this:

    inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {
       int len = trie[0];
       if (curr - str < len) return false;
       const char* cmp = (const char*)(trie + 1);
       while (len-- > 0) {
           if (*--curr != *cmp++) return false;
       }
       return true;
    }

    bool checkStringSuffix(const char* str, int len) {
       if (len < 5) return false;
       const char* curr = (str + len - 1);
       int currTrie = 81921;
       while (curr >= str) {
           assert(*curr >= 0x20 && *curr <= 0x7f);
           currTrie = trieArray[currTrie + *curr - 0x20];
           if (currTrie < 0) {
               if (currTrie == TRIE_NONE) return false;
               if (currTrie == TRIE_DONE) return true;
               return checkSingleSuffix(str, curr, trieArray - currTrie - 1);
           }
           --curr;
       }
       return false;
    }

Since for my particular set of data len in checkSingleSuffix was never more than 9, I tried to replace the comparison loop with switch (len) and hardcoded comparison routines that compared up to 8 bytes of data at a time but it didn't affect overall performance at all either way.

Thanks for everyone who contributed their ideas!

继续阅读：64-bit algorithm string

Efficiently check string for one of several hundred possible suffixes

String Class

C strings

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

String Class

C strings

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？