
Need a regular expression to extract only letters and whitespace from a string

I'm building a small utility method that parses a line (a string) and returns a vector of all the words. The istringstream code below works fine except when the line contains punctuation, so my fix is to "sanitize" the line before running it through the while loop.

I would appreciate some help using the regex library in C++ for this. My initial solution was to use substr() and go to town, but that seems complicated, as I'd have to iterate over the string, test each character to see what it is, and then perform some operations.

vector<string> lineParser(Line * ln)
{
    vector<string> result;
    string word;
    string line = ln->getLine();
    istringstream iss(line);
    while(iss)
    {
        iss >> word;
        result.push_back(word);
    }
    return result;
}


You don't need regular expressions just to deal with punctuation:

// Replace all punctuation with space character.
std::replace_if(line.begin(), line.end(),
                std::ptr_fun<int, int>(&std::ispunct),
                ' '
               );

Or if you want everything but letters and numbers turned into space:

std::replace_if(line.begin(), line.end(),
                std::not1(std::ptr_fun<int,int>(&std::isalnum)),
                ' '
               );
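
One caveat about the two snippets above: std::ptr_fun was removed in C++17 and std::not1 in C++20, so they will not compile on newer standards. A minimal sketch of the same idea with a lambda, assuming a C++11 or later compiler; the helper name sanitize is hypothetical and not part of the original answer. Taking the character as unsigned char avoids passing a negative value to std::isalnum.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (name is an assumption, not from the answer above):
// returns a copy of `line` in which every character that is not a letter
// or digit has been replaced by a space.
std::string sanitize(std::string line)
{
    std::replace_if(line.begin(), line.end(),
                    // unsigned char keeps the argument to isalnum non-negative
                    [](unsigned char c) { return !std::isalnum(c); },
                    ' ');
    return line;
}
```

Swap !std::isalnum(c) for std::ispunct(c) != 0 if only punctuation should be replaced.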

While we are here:
Your while loop is broken and will push the last value into the vector twice.

It should be:

while(iss)
{
    iss >> word;
    if (iss)                    // The read can fail; if it did, iss is now in a bad state.
    {    result.push_back(word);// Only push_back() if the read succeeded.
    }
}

Or the more common version:

while(iss >> word) // Loop is only entered if the read of the word worked.
{
    result.push_back(word);
}

Or you can use the standard library algorithms and stream iterators:

std::copy(std::istream_iterator<std::string>(iss),
          std::istream_iterator<std::string>(),
          std::back_inserter(result)
         );
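
Putting the pieces together, here is a rough sketch of a complete parser. It assumes the function can take the line as a std::string rather than the Line* from the question (whose class is not shown), and the name splitWords is made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical replacement for lineParser: takes the line by value so it
// can be modified locally, blanks out punctuation, then splits on whitespace.
std::vector<std::string> splitWords(std::string line)
{
    // Step 1: turn every punctuation character into a space.
    std::replace_if(line.begin(), line.end(),
                    [](unsigned char c) { return std::ispunct(c) != 0; },
                    ' ');

    // Step 2: let the stream extraction operator do the splitting.
    // The iterator pair stops at the first failed read, so no word is
    // pushed twice.
    std::istringstream iss(line);
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```

With the Line class from the question, the call would presumably look like splitWords(ln->getLine()).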


[^A-Za-z\s] should do what you need if you replace the matching characters with nothing. It removes every character that is not a letter or whitespace. Use [^A-Za-z0-9\s] if you want to keep digits too.

You can use an online tool such as http://gskinner.com/RegExr/ to test your patterns (see its Replace tab). Some modifications may be required depending on the regex library you are using.
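
For what it's worth, a minimal sketch of how that pattern might be applied with the C++11 <regex> library, assuming the default ECMAScript grammar. The function name keepLettersAndSpaces is hypothetical. Note the doubled backslash needed inside a C++ string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes every character that is neither an ASCII
// letter nor whitespace by replacing each match with the empty string.
std::string keepLettersAndSpaces(const std::string& line)
{
    // Built once; constructing a std::regex is comparatively expensive.
    static const std::regex unwanted("[^A-Za-z\\s]");
    return std::regex_replace(line, unwanted, "");
}
```

Because matches are deleted rather than replaced by a space, "foo,bar" becomes "foobar"; pass " " as the replacement if punctuation should still separate words.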


I'm not positive, but I think this is what you're looking for:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int
main()
{
    std::string line("some words: with some punctuation.");
    std::regex words("[\\w]+");
    std::sregex_token_iterator i(line.begin(), line.end(), words);
    std::vector<std::string> list(i, std::sregex_token_iterator());
    for (auto j = list.begin(), e = list.end(); j != e; ++j)
        std::cout << *j << '\n';
}

some
words
with
some
punctuation
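
Keep in mind that \w also matches digits and the underscore, so the program above treats "abc_def" and "x1y" as single words. If only letters should count, as the question title asks, a variation along these lines might be closer. This is a sketch, and lettersOnlyWords is a made-up name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical variant of the program above: collects maximal runs of
// ASCII letters, so digits and underscores act as separators.
std::vector<std::string> lettersOnlyWords(const std::string& line)
{
    // The regex must outlive the iterators, hence a static object
    // rather than a temporary.
    static const std::regex word("[A-Za-z]+");
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), word),
        std::sregex_token_iterator());
}
```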


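The simplest solution is probably to create a filtering streambuf that converts all non-alphanumeric characters to spaces, then to read the words through a pair of std::istream_iterator:

#include <ctype.h>   // ::isalnum
#include <cstdio>    // EOF
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <vector>

class StripPunct : public std::streambuf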
{
    std::streambuf* mySource;
    char            myBuffer;

protected:
    virtual int underflow()
    {
        int result = mySource->sbumpc();
        if ( result != EOF ) {
            if ( !::isalnum( result ) )
                result = ' ';
            myBuffer = result;
            setg( &myBuffer, &myBuffer, &myBuffer + 1 );
        }
        return result;
    }

public:
    explicit StripPunct( std::streambuf* source )
        : mySource( source )
    {
    }
};

std::vector<std::string>
LineParser( std::istream& source )
{
    StripPunct               sb( source.rdbuf() );
    std::istream             src( &sb );
    return std::vector<std::string>(
        (std::istream_iterator<std::string>( src )),
        (std::istream_iterator<std::string>()) );
}