Split a string into words by multiple delimiters [duplicate]

This question already has answers here: Right way to split an std::string into a vector<string> (12 answers) Closed last year.

I have some text (meaningful text or arithmetical expression) and I want to split it into words.

If I had a single delimiter, I'd use:

std::stringstream stringStream(inputString);
std::string word;
while(std::getline(stringStream, word, delimiter)) 
{
    wordVector.push_back(word);
}

How can I break the string into tokens with several delimiters?


Assuming one of the delimiters is newline, the following reads the line and further splits it by the delimiters. For this example I've chosen the delimiters space, apostrophe, and semi-colon.

std::stringstream stringStream(inputString);
std::string line;
while(std::getline(stringStream, line)) 
{
    std::size_t prev = 0, pos;
    while ((pos = line.find_first_of(" ';", prev)) != std::string::npos)
    {
        if (pos > prev)
            wordVector.push_back(line.substr(prev, pos-prev));
        prev = pos+1;
    }
    if (prev < line.length())
        wordVector.push_back(line.substr(prev, std::string::npos));
}
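
The same find_first_of loop can be wrapped in a function. This is only a sketch: the name splitAny is made up for illustration, and it assumes the newline is simply passed as one more delimiter instead of reading line by line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer above): splits `text` on any
// character found in `delims` and skips empty tokens.
std::vector<std::string> splitAny(const std::string& text,
                                  const std::string& delims)
{
    std::vector<std::string> words;
    std::size_t prev = 0;
    while (prev < text.size()) {
        // Position of the next delimiter, or npos if none is left.
        const std::size_t pos = text.find_first_of(delims, prev);
        const std::size_t end = (pos == std::string::npos) ? text.size() : pos;
        if (end > prev)
            words.push_back(text.substr(prev, end - prev));
        prev = end + 1;  // continue after the delimiter
    }
    return words;
}
```

Calling splitAny(inputString, " ';\n") should give the same words as the loop above, without the intermediate stringstream.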


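If you have Boost, you could use:

#include <boost/algorithm/string.hpp>
#include <string>
#include <vector>
std::string inputString("One!Two,Three:Four");
std::string delimiters("|,:");
std::vector<std::string> parts;
// Splits at every '|', ',' or ':'. The '!' is not in delimiters,
// so parts ends up as {"One!Two", "Three", "Four"}.
boost::split(parts, inputString, boost::is_any_of(delimiters));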


Using std::regex

A std::regex can do string splitting in a few lines:

std::regex re("[\\|,:]");
// Passing -1 selects the text that was NOT matched, which is what makes this a split.
std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
std::vector<std::string> tokens{first, last};
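
One thing to watch for: with a single-character class, two adjacent delimiters (or a leading one) produce empty tokens. A possible way around that, as a sketch only, is to wrap the iterator in a function that drops empty pieces. The name regexSplit is an assumption for illustration, not a standard function.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `input` wherever `pattern` matches and
// discards empty pieces.
std::vector<std::string> regexSplit(const std::string& input,
                                    const std::string& pattern)
{
    const std::regex re(pattern);
    // Submatch index -1 selects the text between matches.
    std::sregex_token_iterator first(input.begin(), input.end(), re, -1), last;
    std::vector<std::string> tokens;
    for (; first != last; ++first) {
        if (first->length() > 0)  // e.g. the empty piece before a leading delimiter
            tokens.push_back(first->str());
    }
    return tokens;
}
```

A pattern such as "[|,:]+" additionally treats a run of delimiters as one separator.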


I don't know why nobody pointed out the manual way, but here it is:

const std::string delims(";,:. \n\t");
inline bool isDelim(char c) {
    for (std::size_t i = 0; i < delims.size(); ++i)  // size_t avoids a signed/unsigned comparison
        if (delims[i] == c)
            return true;
    return false;
}

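and in the function:

std::stringstream stringStream(inputString);
std::string word;
int c;  // int, not char: get() returns int so that EOF stays distinct from real characters

while (stringStream) {
    word.clear();

    // Read word: stop at a delimiter or at end of input
    while ((c = stringStream.get()) != EOF && !isDelim(static_cast<char>(c)))
        word.push_back(static_cast<char>(c));
    if (c != EOF)
        stringStream.unget();

    // Skip the empty word produced when the input starts with a delimiter
    if (!word.empty())
        wordVector.push_back(word);

    // Read delims
    while ((c = stringStream.get()) != EOF && isDelim(static_cast<char>(c)));
    if (c != EOF)
        stringStream.unget();
}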

This way you can do something useful with the delims if you want.
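
For example, when tokenizing an arithmetic expression you usually want to keep the operators. Here is a rough sketch of that idea; it works on the string directly instead of a stream, the name splitKeepingDelims is made up, and treating space, tab and newline as discardable is my own assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into words and keeps every
// non-whitespace delimiter as a one-character token of its own.
std::vector<std::string> splitKeepingDelims(const std::string& text,
                                            const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string word;
    for (char c : text) {
        const bool isSpace = (c == ' ' || c == '\t' || c == '\n');
        if (!isSpace && delims.find(c) == std::string::npos) {
            word.push_back(c);  // ordinary character: extend the current word
            continue;
        }
        if (!word.empty()) {    // a delimiter or whitespace ends the current word
            tokens.push_back(word);
            word.clear();
        }
        if (!isSpace)
            tokens.push_back(std::string(1, c));  // keep operators, drop whitespace
    }
    if (!word.empty())
        tokens.push_back(word);
    return tokens;
}
```

With delims set to "+-*/()", the input "12+ (x*3)" should come back as 12, +, (, x, *, 3, ).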


If you are interested in how to do it yourself without using Boost:

Suppose the delimiter string can be very long, say of length M. Checking each character of your string against every delimiter costs O(M), so doing that in a loop over the whole original string, say of length N, costs O(M*N).

I would use a dictionary instead (conceptually a map from character to boolean, but here a simple boolean array that holds true at the index equal to each delimiter's character code).

Checking whether a character is a delimiter is then O(1), which gives O(N) overall (plus O(M) once to fill the array).

Here is my sample code:

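#include <iostream>
#include <string>
#include <vector>
using namespace std;

const int dictSize = 256;

vector<string> tokenizeMyString(const string &s, const string &del)
{
    // Not static: a static table would still hold the delimiters of earlier calls.
    bool dict[dictSize] = { false };

    vector<string> res;
    for (size_t i = 0; i < del.size(); ++i) {
        // Index through unsigned char so bytes above 127 never become negative.
        dict[static_cast<unsigned char>(del[i])] = true;
    }

    string token;
    for (char c : s) {
        if (dict[static_cast<unsigned char>(c)]) {
            // A delimiter ends the current token, if there is one.
            if (!token.empty()) {
                res.push_back(token);
                token.clear();
            }
        }
        else {
            token += c;
        }
    }
    if (!token.empty()) {
        res.push_back(token);
    }
    return res;
}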


int main()
{
    string delString = "MyDog:Odie, MyCat:Garfield  MyNumber:1001001";
    // the delimiters are " " (space) and "," (comma)
    vector<string> res = tokenizeMyString(delString, " ,");

    for (auto &i : res) {
        cout << "token: " << i << endl;
    }
    return 0;
}

Note: tokenizeMyString returns the vector by value. That is cheap here: the compiler can construct the result directly in the caller (named return value optimization, NRVO), and where it does not, the vector is moved rather than copied.
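
If the same delimiter set is used for many strings, the table only needs to be filled once. A minimal sketch of that variation follows; the class name DelimiterTable and its members are hypothetical, chosen just for this example.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical class: builds the 256-entry lookup table once, then reuses it.
class DelimiterTable {
public:
    explicit DelimiterTable(const std::string& delims) : table_{} {
        for (unsigned char c : delims)
            table_[c] = true;
    }

    bool contains(char c) const {
        // Convert through unsigned char so bytes above 127 never index negatively.
        return table_[static_cast<unsigned char>(c)];
    }

    // O(N) split: one table lookup per character of `s`.
    std::vector<std::string> split(const std::string& s) const {
        std::vector<std::string> res;
        std::string token;
        for (char c : s) {
            if (!contains(c)) {
                token += c;
            } else if (!token.empty()) {
                res.push_back(token);
                token.clear();
            }
        }
        if (!token.empty())
            res.push_back(token);
        return res;
    }

private:
    std::array<bool, 256> table_;  // true at the index of each delimiter byte
};
```

Usage would be something like DelimiterTable table(" ,"); followed by table.split(delString) for each input.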


And here, ages later, a solution using C++20:

constexpr std::string_view words{"Hello-_-C++-_-20-_-!"};
constexpr std::string_view delimiters{"-_-"};
for (const auto word : std::views::split(words, delimiters)) {
    // Each element is a subrange of words; build a string_view from it to print.
    std::cout << std::quoted(std::string_view{word.begin(), word.end()}) << ' ';
}
// outputs: "Hello" "C++" "20" "!"

Required headers:

#include <iomanip>      // std::quoted
#include <iostream>
#include <ranges>
#include <string_view>

Reference: https://en.cppreference.com/w/cpp/ranges/split_view
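
Note that the split view above cuts at one multi-character pattern ("-_-"), not at any of several single characters. For the multiple-delimiter case from the question, string_view::find_first_of may be enough. The following is a sketch that only needs C++17; the name splitViews is an assumption for illustration.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: returns views into `text` without copying, so the
// underlying characters must outlive the returned vector.
std::vector<std::string_view> splitViews(std::string_view text,
                                         std::string_view delims)
{
    std::vector<std::string_view> words;
    while (!text.empty()) {
        const std::size_t pos = text.find_first_of(delims);
        if (pos == std::string_view::npos) {
            words.push_back(text);  // no delimiter left: the rest is one word
            break;
        }
        if (pos > 0)
            words.push_back(text.substr(0, pos));
        text.remove_prefix(pos + 1);  // drop the word and the delimiter after it
    }
    return words;
}
```

Because the result holds views rather than strings, do not pass a temporary std::string as the text.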


Using Eric Niebler's range-v3 library:

https://godbolt.org/z/ZnxfSa

#include <string>
#include <iostream>
#include "range/v3/all.hpp"

int main()
{
    std::string s = "user1:192.168.0.1|user2:192.168.0.2|user3:192.168.0.3";
    auto words = s  
        | ranges::view::split('|')
        | ranges::view::transform([](auto w){
            return w | ranges::view::split(':');
        });
    ranges::for_each(words, [](auto i){ std::cout << i  << "\n"; });
}