Why does my finite state machine take so long to execute?

2022-12-23 23:24 问答作者：

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from a file. Currently, to process a 41kb file, this process is taking close to a minute and a half. Is there something I'm seriously misunderstanding here about this finite state machine?

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
 开发者_StackOverflow中文版       switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well this is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.

Unless your 41 kb file is mostly comments or prefs, it's going to spend most of its time in state 0. And for each character in state 0, you make a minimum of two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by pre-testing to see if the current character is 'p'

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character isn't 'p' then there is no need to make any function calls. In particular not creating a iterator, which is probably where the time is being spent.

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is misspelled as "pref".

Well it's hard to say just by looking at it...but I'm guessing the find algorithms are doing it. Why are you searching within a FSM? By definition you're supposed to giving them one character at a time....Add more states. Also try making results a list, not a vector. A lot of copying going on with

vector<string>

But mostly: Profile it!

Finite state machines are a workable solution, but for text processing, its best to use a highly optimized finite state machine generator. In this case, a regular expression. Here it is as Perl regex:

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

Since most regex libraries are Perl compatible, it shouldn't be hard to translate the syntax. If your search becomes more complicated, then a parser is in order.

If you are doing parsing, why not use a parser library.

I typically have Boost.Spirit.Qi in mind.

You express your grammar with EBNF-like expressions which certainly makes it easier for maintenance.
It's a header only library, so you don't have any problem of bringing a whole binary in the mix.

While I can appreciate the minimalist approach, I am afraid that your idea of coding a finite state machine on your own is not that wise. It works well with a toy example, but as the requirements add up you'll have one monstrous switch and it's going to become more and more complicated to understand what's going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)

继续阅读：finite-automata performance

Why does my finite state machine take so long to execute?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？