How can I make my split work on only one real line and skip quoted parts of the string?
So we have a simple split:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;
vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
vector<string> result;
if (delim.empty()) {
result.push_back(s);
return result;
}
string::const_iterator substart = s.begin(), subend;
while (true) {
subend = search(substart, s.end(), delim.begin(), delim.end());
string temp(substart, subend);
if (keep_empty || !temp.empty()) {
result.push_back(temp);
}
if (subend == s.end()) {
break;
}
substart = subend + delim.size();
}
return result;
}
or Boost's split. And we have a simple main like:
int main() {
const vector<string> words = split("close no \"\n matter\" how \n far", " ");
copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}
How can I make it output something like
close
no
"\n matter"
how
end symbol found.
We want to introduce into split two things: structures that shall be kept unsplit (quoted parts) and characters that shall end the parsing process. How can this be done?
Updated: By way of 'thank you' for awarding the bonus, I went and implemented 4 features that I initially skipped as "You Ain't Gonna Need It".
- now supports partially quoted columns
  This is the problem you reported: e.g. with a delimiter , only test,"one,two",three would be valid, not test,one","two","three. Now both are accepted.
- now supports custom delimiter expressions
  You could only specify single characters as delimiters. Now you can specify any Spirit Qi parser expression as the delimiter rule. E.g.
  splitInto(input, output, ' ');                          // single space
  splitInto(input, output, +qi::lit(' '));                // one or more spaces
  splitInto(input, output, +qi::char_(" \t"));            // one or more spaces or tabs
  splitInto(input, output, qi::double_ >> !qi::lit('#')); // -- any parse expression
  Note this changes behaviour for the default overload: the old version treated repeated spaces as a single delimiter by default. You now have to explicitly specify that (2nd example) if you want it.
- now supports quotes ("") inside quoted values (instead of just making them disappear)
  See the code sample. Quite simple of course. Note that the sequence "" outside a quoted construct still represents the empty string (for compatibility with e.g. existing CSV output formats which quote empty strings redundantly).
- support boost ranges in addition to containers as input (e.g. char[])
  Well, you ain't gonna need it (but it was rather handy for me in order to just be able to write splitInto("a char array", ...) :)
As I had half expected, you were gonna need partially quoted fields (see your comment; footnote 1). Well, here you are (the bottleneck was getting it to work consistently across different versions of Boost).
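For readers who do not know Spirit, the quoting rules described above can be illustrated in plain standard C++. This is only a sketch of the rules, not the Spirit grammar itself: the helper name unquoteField is hypothetical, and it assumes the field has already been isolated from its neighbours.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of splitInto): decodes one already-isolated
// field. Quoted sections may start anywhere in the field, and "" inside a
// quoted section stands for one literal quote character.
std::string unquoteField(const std::string& raw)
{
    std::string out;
    bool inQuotes = false;
    for (std::string::size_type i = 0; i < raw.size(); ++i) {
        const char ch = raw[i];
        if (ch != '"') {
            out += ch;                 // ordinary character, quoted or not
        } else if (!inQuotes) {
            inQuotes = true;           // opening quote: dropped from the value
        } else if (i + 1 < raw.size() && raw[i + 1] == '"') {
            out += '"';                // "" inside quotes: one literal quote
            ++i;                       // skip the second quote of the pair
        } else {
            inQuotes = false;          // closing quote: dropped from the value
        }
    }
    if (inQuotes)
        throw std::runtime_error("unterminated quoted section");
    return out;
}
```

With this sketch, the field q"oute"d decodes to qouted, and a lone "" decodes to the empty string, which matches the behaviour described in the list above.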
Introduction
Random notes and observations for the reader:
- the splitInto template function happily supports whatever you throw at it:
  - input from a vector or std::string or std::wstring
  - output to (some combinations shown in demo):
    - vector<string> (all lines flattened)
    - vector<vector<string>> (tokens per line)
    - list<list<string>> (if you prefer)
    - set<set<string>> (unique linewise tokensets)
    - ... any container you dream up
- for demo purposes showing off karma output generation (especially taking care of nested containers)
  - note: \n in output is shown as ? for comprehension (safechars)
- complete with handy plumbing for new Spirit users (legible rule naming, commented DEBUG defines in case you want to play with things)
- you can specify any Spirit parse expression to match delimiters. This means that by passing +qi::lit(' ') instead of the default (' ') you will skip empty fields (i.e. repeated delimiters)
Versions required/tested
This was compiled using
- gcc 4.4.5,
- gcc 4.5.1 and
- gcc 4.6.1.
It works (tested) against
- boost 1.42.0 (possibly earlier versions too) all the way through
- boost 1.47.0.
Note: The flattening of output containers only seems to work for Spirit V2.5 (boost 1.47.0).
(This might be something as simple as needing an extra include for older versions?)
The Code!
//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_DEBUG_PRINT_SOME 80
// YAGNI #4 - support boost ranges in addition to containers as input (e.g. char[])
#define SUPPORT_BOOST_RANGE // our own define for splitInto
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp> // for pre 1.47.0 boost only
#include <boost/spirit/version.hpp>
#include <iostream>
#include <list>
#include <set>
#include <sstream>
#include <string>
#include <vector>
namespace /*anon*/
{
namespace phx=boost::phoenix;
namespace qi =boost::spirit::qi;
namespace karma=boost::spirit::karma;
template <typename Iterator, typename Output>
struct my_grammar : qi::grammar<Iterator, Output()>
{
typedef qi::rule<Iterator> delim_t;
//my_grammar(delim_t const& _delim) : delim(_delim),
my_grammar(delim_t _delim) : delim(_delim),
my_grammar::base_type(rule, "quoted_delimited")
{
using namespace qi;
noquote = char_ - '"';
plain = +((!delim) >> (noquote - eol));
quoted = lit('"') > *(noquote | '"' >> char_('"')) > '"';
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
mixed = *(quoted|plain);
#else
// manual folding
mixed = *( (quoted|plain) [_a << _1]) [_val=_a.str()];
#endif
// you gotta love simple truths:
rule = mixed % delim % eol;
BOOST_SPIRIT_DEBUG_NODE(rule);
BOOST_SPIRIT_DEBUG_NODE(plain);
BOOST_SPIRIT_DEBUG_NODE(quoted);
BOOST_SPIRIT_DEBUG_NODE(noquote);
BOOST_SPIRIT_DEBUG_NODE(delim);
}
private:
qi::rule<Iterator> delim;
qi::rule<Iterator, char()> noquote;
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
qi::rule<Iterator, std::string()> plain, quoted, mixed;
#else
qi::rule<Iterator, std::string()> plain, quoted;
qi::rule<Iterator, std::string(), qi::locals<std::ostringstream> > mixed;
#endif
qi::rule<Iterator, Output()> rule;
};
}
template <typename Input, typename Container, typename Delim>
bool splitInto(const Input& input, Container& result, Delim delim)
{
#ifdef SUPPORT_BOOST_RANGE
typedef typename boost::range_const_iterator<Input>::type It;
It first(boost::begin(input)), last(boost::end(input));
#else
typedef typename Input::const_iterator It;
It first(input.begin()), last(input.end());
#endif
try
{
my_grammar<It, Container> parser(delim);
bool r = qi::parse(first, last, parser, result);
r = r && (first == last);
if (!r)
std::cerr << "parsing failed at: \"" << std::string(first, last) << "\"\n";
return r;
}
catch (const qi::expectation_failure<It>& e)
{
std::cerr << "FIXME: expected " << e.what_ << ", got '";
std::cerr << std::string(e.first, e.last) << "'" << std::endl;
return false;
}
}
template <typename Input, typename Container>
bool splitInto(const Input& input, Container& result)
{
return splitInto(input, result, ' '); // default space delimited
}
/********************************************************************
* replaces '\n' character by '?' so that the demo output is more *
* comprehensible (see when a \n was parsed and when one was output *
* deliberately) *
********************************************************************/
void safechars(char& ch)
{
switch (ch) { case '\r': case '\n': ch = '?'; break; }
}
int main()
{
using namespace karma; // demo output generators only :)
std::string input;
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
// sample invocation: simple vector of elements in order - flattened across lines
std::vector<std::string> flattened;
input = "actually on\ntwo lines";
if (splitInto(input, flattened))
std::cout << format(*char_[safechars] % '|', flattened) << std::endl;
#endif
std::list<std::set<std::string> > linewise, custom;
// YAGNI #1 - now supports partially quoted columns
input = "partially q\"oute\"d columns";
if (splitInto(input, linewise))
std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', linewise) << std::endl;
// YAGNI #2 - now supports custom delimiter expressions
input="custom delimiters: 1997-03-14 10:13am";
if (splitInto(input, custom, +qi::char_("- 0-9:"))
&& splitInto(input, custom, +(qi::char_ - qi::char_("0-9"))))
std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;
// YAGNI #3 - now supports quotes ("") inside quoted values (instead of just making them disappear)
input = "would like ne\"\"sted \"quotes like \"\"\n\"\" that\"";
custom.clear();
if (splitInto(input, custom, qi::char_("() ")))
std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;
return 0;
}
The Output
Output from the sample as shown:
actually|on|two|lines
set['columns', 'partially', 'qouted']
set['am', 'custom', 'delimiters']
set['', '03', '10', '13', '14', '1997']
set['like', 'nested', 'quotes like "?" that', 'would']
Update Output for your previously failing test case:
--server=127.0.0.1:4774/|--username=robota|--userdescr=robot A ? I am cool robot ||--robot|>|echo.txt
1 I must admit I had a good laugh when reading that 'it crashed' [sic]. That sounds a lot like my end-users. Just to be precise: a crash is an unrecoverable application failure. What you ran into was a handled error, and was nothing more than 'unexpected behavior' from your point of view. Anyways, that's fixed now :)
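The following code:
#include <algorithm>
#include <cassert>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;
vector<string>::const_iterator matchSymbol(const string & s, string::const_iterator i, const vector<string> & symbols)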
{
vector<string>::const_iterator testSymbol;
for (testSymbol=symbols.begin();testSymbol!=symbols.end();++testSymbol) {
if (!testSymbol->empty()) {
if (0==testSymbol->compare(0,testSymbol->size(),&(*i),testSymbol->size())) {
return testSymbol;
}
}
}
assert(testSymbol==symbols.end());
return testSymbol;
}
vector<string> split(const string& s, const vector<string> & delims, const vector<string> & terms, const bool keep_empty = true)
{
vector<string> result;
if (delims.empty()) {
result.push_back(s);
return result;
}
bool checkForDelim=true;
string temp;
string::const_iterator i=s.begin();
while (i!=s.end()) {
vector<string>::const_iterator testTerm=terms.end();
vector<string>::const_iterator testDelim=delims.end();
if (checkForDelim) {
testTerm=matchSymbol(s,i,terms);
testDelim=matchSymbol(s,i,delims);
}
if (testTerm!=terms.end()) {
i=s.end();
} else if (testDelim!=delims.end()) {
if (!temp.empty() || keep_empty) {
result.push_back(temp);
temp.clear();
}
string::const_iterator j=testDelim->begin();
while (i!=s.end() && j!=testDelim->end()) {
++i;
++j;
}
} else if ('"'==*i) {
if (checkForDelim) {
string::const_iterator j=i;
do {
++j;
} while (j!=s.end() && '"'!=*j);
checkForDelim=(j==s.end());
if (!checkForDelim && (!temp.empty() || keep_empty)) {
result.push_back(temp);
temp.clear();
}
temp.push_back('"');
++i;
} else {
//matched end quote
checkForDelim=true;
temp.push_back('"');
++i;
result.push_back(temp);
temp.clear();
}
} else if ('\n'==*i) {
temp+="\\n";
++i;
} else {
temp.push_back(*i);
++i;
}
}
if (!temp.empty() || keep_empty) {
result.push_back(temp);
}
return result;
}
int runTest()
{
vector<string> delims;
delims.push_back(" ");
delims.push_back("\t");
delims.push_back("\n");
delims.push_back("split_here");
vector<string> terms;
terms.push_back(">");
terms.push_back("end_here");
const vector<string> words = split("close no \"\n end_here matter\" how \n far testsplit_heretest\"another split_here test\"with some\"mo>re", delims, terms, false);
copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
return 0;
}
generates:
close
no
"\n end_here matter"
how
far
test
test
"another split_here test"
with
some"mo
Based on the examples you gave, you seemed to want newlines to count as delimiters when they appear outside of quotes and to be represented by the literal \n when inside of quotes, so that's what this does. It also adds the ability to have multiple delimiters, such as split_here as I used in the test.
I wasn't sure if you want unmatched quotes to be split the way matched quotes do since the example you gave has the unmatched quote separated by spaces. This code treats unmatched quotes as any other character, but it should be easy to modify if this is not the behavior you want.
The line:
if (0==testSymbol->compare(0,testSymbol->size(),&(*i),testSymbol->size())) {
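will work on most, if not all, implementations of the STL, but it is not guaranteed to work: it assumes the string's characters are stored contiguously and that at least testSymbol->size() characters remain after i. It can be replaced with the safer, but slower, version: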
if (*testSymbol==s.substr(i-s.begin(),testSymbol->size())) {
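Another option, sketched here under the assumption that a small helper is acceptable, is the std::string::compare overload that takes a position and a length within s. It avoids both the raw pointer and the temporary string that substr creates. The helper name startsWithAt is hypothetical and not part of the code above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reports whether `symbol` occurs in `s` starting at
// iterator position `i`. Empty symbols never match, mirroring matchSymbol.
bool startsWithAt(const std::string& s, std::string::const_iterator i,
                  const std::string& symbol)
{
    const std::string::size_type pos =
        static_cast<std::string::size_type>(i - s.begin());
    // compare(pos, len, str) clamps len to the characters remaining in s,
    // so a symbol that would run past the end simply compares unequal.
    return !symbol.empty() && s.compare(pos, symbol.size(), symbol) == 0;
}
```

Inside matchSymbol the test would then read if (startsWithAt(s, i, *testSymbol)), which also makes the separate empty() check unnecessary.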
If your grammar contains escaped sequences, I do not believe you will be able to use simple split techniques.
You will need a state machine.
Here is some example code to give you an idea of what I mean. This solution is neither fully specified nor implied correct. It may well have off-by-one errors that can only be found with thorough testing.
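#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string& str)
{
    std::vector<std::string> result;
    size_t i = 0, last = 0;
    for (;;) {
        last = i; // start of the next token
    scan_plain:
        // scan until a quote, a delimiter or the end of the input
        for (;;) {
            if (i >= str.size())
                goto handle_token;
            switch (str.at(i)) {
                case '"': goto handle_quote;
                case ' ': goto handle_token;
            }
            i++;
        }
    handle_quote:
        // i is on the opening quote; advance to the closing quote
        for (;;) {
            i++;
            if (i >= str.size())
                throw std::runtime_error("invalid format, mismatched quotes");
            if (str.at(i) == '"')
                break;
        }
        i++;             // step over the closing quote
        goto scan_plain; // the token may continue after the quote
    handle_token:
        result.push_back(str.substr(last, i - last));
        if (i >= str.size())
            break;
        i++; // step over the delimiter
    }
    return result;
}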
This sort of code is hard to reason about and maintain. That is what happens when people make crappy grammars, though. Tabs were designed to delimit fields; encourage their use when possible.
I would be ecstatic to upvote another, more object-oriented solution.
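The state-machine idea can also be written without goto by keeping the current state in an explicit enum. The following is a minimal sketch, not a drop-in replacement for anything above: the function name splitQuoted is hypothetical, it assumes a single-character delimiter, it keeps the quotes in the token, and it keeps empty tokens produced by repeated delimiters.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch: split `str` on `delim`, treating double-quoted
// sections (quotes included) as part of the surrounding token.
std::vector<std::string> splitQuoted(const std::string& str, char delim = ' ')
{
    enum State { Plain, Quoted };
    State state = Plain;
    std::vector<std::string> result;
    std::string token;

    for (std::string::size_type i = 0; i < str.size(); ++i) {
        const char ch = str[i];
        switch (state) {
        case Plain:
            if (ch == '"') {
                token += ch;
                state = Quoted;          // delimiters are ignored from here on
            } else if (ch == delim) {
                result.push_back(token); // token complete (may be empty)
                token.clear();
            } else {
                token += ch;
            }
            break;
        case Quoted:
            token += ch;                 // newlines and delimiters are kept
            if (ch == '"')
                state = Plain;           // closing quote found
            break;
        }
    }
    if (state == Quoted)
        throw std::runtime_error("invalid format, mismatched quotes");
    result.push_back(token);             // last token has no trailing delimiter
    return result;
}
```

Each case of the switch handles exactly one state, so adding a third state (for example, one for a terminating symbol) means adding a case rather than another label.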