Reading tokens from file (complicated)

I have a basic tokenization structure/algorithm in place. It's pretty complicated, and I hope I can clarify it simply enough to enlighten you about the "flaw" in my design.

class ParserState
{
protected: // subclasses use m_token and nextToken()
    // bool functions return false if getline() or stream extraction '>>' fails
    static bool nextLine(); // reads and tokenizes next line from file and puts it in m_buffer
    static bool nextToken(); // gets next token from m_buffer, via fetchToken(), and puts it in m_token
    static bool fetchToken( std::string &token ); // procures next token from file/buffer

    static size_t m_lineNumber;
    static std::ifstream m_fstream;
    static std::string m_buffer;
    static std::string m_token;
};

The reason for this setup is to be able to report the line number if a syntax error occurs. Depending on the phase/state of the parser, different things happen in my program, and subclasses of ParserState use m_token and nextToken to continue. fetchToken calls nextLine if m_buffer is empty, and puts the next token in its argument:

istringstream stream;

do // read new line until valid token can be extracted
{
    Debug(5) << "m_buffer contains: " << m_buffer << "\n";
    stream.str( m_buffer );

    if( stream >> token )
    {
        Debug(5) << "Token extracted: " << token << "\n";
        m_token = token;
        return true; // return when token found
    }
    stream.clear();
} while( nextLine() );
// if no tokens can be extracted from the whole file, return false
return false;

The problem is that the extracted token isn't removed from m_buffer, so the same token gets read on every call to nextToken(). m_buffer can be modified between calls, hence the call to istringstream::str in the loop, but that call also resets the stream to the start of the string on every iteration. That is the cause of my issue, and as far as I can see it can't be worked around, hence my question: how can I let stream >> token remove something from the string held internally by the stringstream? Perhaps I should not use a stringstream at all, but something more elementary (like finding the first whitespace and cutting the first token from the string)?

Thanks a billion!

PS: any suggestions altering my function/class structure are OK as long as they allow line numbers to be kept track of (so no reading the full file into m_buffer with a class member istringstream, which is what I had before I wanted line number error reporting).


Why not simply make m_buffer an std::istringstream instead of a std::string? You would remove a temporary variable as well as get the desired effect. Whenever you change m_buffer in statements such as

m_buffer = ...

write this instead:

m_buffer.str(...);
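
As a rough sketch of that idea (not the asker's actual class): the names LineTokenizer and lineNumber() below are made up for illustration, and the members are non-static to keep the example short. One detail the sketch assumes is needed: after an extraction fails at the end of a line, the stream's error flags must be reset with clear(), otherwise every later extraction fails as well.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for ParserState. The line buffer is a stream rather
// than a string, so its read position survives between calls.
class LineTokenizer
{
public:
    explicit LineTokenizer(std::istream& in) : m_in(in) {}

    // Puts the next whitespace-delimited token in `token`.
    // Returns false once the input is exhausted.
    bool fetchToken(std::string& token)
    {
        while (!(m_buffer >> token)) {  // current line is used up
            if (!nextLine())
                return false;
        }
        assert(!token.empty());  // a successful >> extracts at least one character
        return true;
    }

    // Number of the line the most recent token was read from (1-based).
    std::size_t lineNumber() const { return m_lineNumber; }

private:
    bool nextLine()
    {
        std::string line;
        if (!std::getline(m_in, line))
            return false;
        ++m_lineNumber;
        m_buffer.clear();    // reset eofbit/failbit left by the failed extraction
        m_buffer.str(line);  // load the new line; reading restarts at its beginning
        return true;
    }

    std::istream& m_in;
    std::istringstream m_buffer;
    std::size_t m_lineNumber = 0;
};
```

Because str() is now called only when a new line is loaded, each extraction continues where the previous one stopped instead of starting over.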


To avoid reading the same token multiple times, I think you have to save the position in the stream using tellg and then restore it using seekg (both are std::istream member functions). However, using std::istringstream looks like overkill to me here. I would rather work with m_buffer directly.
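
Working on the string directly could look roughly like the following. This is only a sketch: cutToken is a hypothetical helper name, and it assumes tokens are separated by ordinary whitespace, as they are with operator>>.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: moves the first whitespace-delimited token out of
// `buffer` into `token`. Returns false if only whitespace was left.
bool cutToken(std::string& buffer, std::string& token)
{
    static const char* const whitespace = " \t\r\n\v\f";

    const std::string::size_type begin = buffer.find_first_not_of(whitespace);
    if (begin == std::string::npos) {
        buffer.clear();  // nothing but whitespace: the line is used up
        return false;
    }

    const std::string::size_type end = buffer.find_first_of(whitespace, begin);
    token = buffer.substr(begin, end - begin);  // end == npos: rest of the string
    buffer.erase(0, end);                       // erase(0, npos) empties the buffer
    assert(!token.empty());                     // begin < end, so at least one character
    return true;
}
```

In fetchToken, a call such as cutToken( m_buffer, token ) would then replace the stringstream extraction, with nextLine() called whenever it returns false.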


The usual scheme for handling line-number reporting is to read lines one at a time, as you have, incrementing the line count. Then, as your tokenizer starts to build a token, it takes a snapshot of the line number and stores it in the token data structure (typically containing the line number, token type, and token value if any).

This decouples line-reading from token building without losing the line number. It also means you can have lots of tokens, they can all have line numbers (including different ones), a token can start on one line and finish on another, etc.
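
A minimal sketch of that scheme, assuming whitespace-separated tokens that never span lines. The Token struct and the tokenize function are illustrative names, not part of the question's code, and a real token would usually carry a type as well.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token record.
struct Token
{
    std::size_t line;  // line on which the token starts (1-based)
    std::string text;  // token value
};

// Reads `in` line by line and stamps each token with the current line number.
std::vector<Token> tokenize(std::istream& in)
{
    std::vector<Token> tokens;
    std::string line;
    std::size_t lineNumber = 0;

    while (std::getline(in, line)) {
        ++lineNumber;
        std::istringstream words(line);  // fresh stream per line, read to its end
        std::string text;
        while (words >> text) {
            assert(lineNumber >= 1);  // a token always belongs to a line already read
            tokens.push_back(Token{lineNumber, text});
        }
    }
    return tokens;
}
```

A parser state that hits a syntax error can then report the line member of the offending token, regardless of how far the reader has advanced in the file.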
