How to split this string
I have some strings, entered by users, that may look like this:
- ++7
- 7++
- 1++7
- 1+7
- 1++7+10++15+20+30++
Those are to mean:
- Anything up to and including 7
- Anything from 7 and up
- 1 and 7 and anything inbetween
- 1 and 7 only
- 1 to 7, 10 to 15, 20 and 30 and above
I need to parse those strings into actual ranges. That is I need to create a list of objects of type Range which have a start and an end. For single items I just set the start and end to the same, and for those that are above or below, I set start or end to null. For example for the first one I would get one range which had start set to null and end set to 7.
I currently have a kind of messy method using a regular expression to do this splitting and parsing and I want to simplify it. My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess.
Looking at those strings it should be really easy to parse them, I just can't come up with a smart way to do it. It just have to be an easier (cleaner, easier to read) way. Probably involving some easy concept I just haven't heard about before :P
The regular expression looks like this:
private readonly Regex Pattern = new Regex(@" ( [+]{2,} )?
([^+]+)
(?:
(?: [+]{2,} [^+]* )*
[+]{2,} ([^+]+)
)?
( [+]{2,} )? ", RegexOptions.IgnorePatternWhitespace);
That is then used like this:
public IEnumerable<Range<T>> Parse(string subject, TryParseDelegate<string, T> itemParser)
{
if (string.IsNullOrEmpty(subject))
yield break;
for (var item = RangeStringConstants.Items.Match(subject); item.Success; item = item.NextMatch())
{
var startI开发者_如何学编程sOpen = item.Groups[1].Success;
var endIsOpen = item.Groups[4].Success;
var startItem = item.Groups[2].Value;
var endItem = item.Groups[3].Value;
if (endItem == string.Empty)
endItem = startItem;
T start, end;
if (!itemParser(startItem, out start) || !itemParser(endItem, out end))
continue;
yield return Range.Create(startIsOpen ? default(T) : start,
endIsOpen ? default(T) : end);
}
}
It works, but I don't think it is particularly readable or maintainable. For example changing the '+' and '++' into ',' and '-' would not be that trivial to do.
My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess.
You could split on this regex first:
(?<!\+)\+(?!\+)
That way, only the 'single' +
's are being split on, leaving you to parse the ++
's. Note that I am assuming that there cannot be three successive +
's.
The regex above simple says: "split on the '+' only if there's no '+' ahead or behind it".
Edit:
After reading that there can be more than 2 successive +
's, I recommend writing a small grammar and letting a parser-generator create a lexer+parser for your little language. ANTLR can generate C# source code as well.
Edit 2:
But before implementing any solution (parser or regex) you'd first have to define what is and what isn't valid input. If you're going to let more than two successive +
's be valid, ie. 1+++++5
, which is [1++
, +
, ++5
], I'd write a little grammar. See this tutorial how that works: http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required
And if you're going to reject input of more than 2 successive +
's, you can use either Lasse's or my (first) regex-suggestion.
Here's some code that uses regular expressions.
Note that the issue raised by Bart in the comments to your question, ie. "How do you handle 1+++5", is not handled at all.
To fix that, unless your code is already out in the wild and not subject to change of behaviour, I would suggest you change your syntax to the following:
- use .. to denote ranges
- allow both + and - for numbers, for positive and negative numbers
- use comma and/or semicolon to separate distinct numbers or ranges
- allow whitespace
Look at the difference between the two following strings:
- 1++7+10++15+20+30++
- 1..7, 10..15, 20, 30..
The second string is much easier to parse, and much easier to read.
It would also remove all ambiguity:
- 1+++5 = 1++ + 5 = 1.., 5
- 1+++5 = 1 + ++5 = 1, ..5
There's no way to parse wrong the second syntax.
Anyway, here's my code. Basically it works by adding four regex patterns for the four types of patterns:
- num
- num++
- ++num
- num++num
For "num", it will handle negative numbers with a leading minus sign, and one or more digits. It does not, for obvious reasons, handle the plus sign as part of the number.
I've interpreted "and up" to mean "up to Int32.MaxValue" and same for down to Int32.MinValue.
public class Range
{
public readonly Int32 From;
public readonly Int32 To;
public Range(Int32 from, Int32 to)
{
From = from;
To = to;
}
public override string ToString()
{
if (From == To)
return From.ToString();
else if (From == Int32.MinValue)
return String.Format("++{0}", To);
else if (To == Int32.MaxValue)
return String.Format("{0}++", From);
else
return String.Format("{0}++{1}", From, To);
}
}
public static class RangeSplitter
{
public static Range[] Split(String s)
{
if (s == null)
throw new ArgumentNullException("s");
String[] parts = new Regex(@"(?<!\+)\+(?!\+)").Split(s);
List<Range> result = new List<Range>();
var patterns = new Dictionary<Regex, Action<Int32[]>>();
patterns.Add(new Regex(@"^(-?\d+)$"),
values => result.Add(new Range(values[0], values[0])));
patterns.Add(new Regex(@"^(-?\d+)\+\+$"),
values => result.Add(new Range(values[0], Int32.MaxValue)));
patterns.Add(new Regex(@"^\+\+(-?\d+)$"),
values => result.Add(new Range(Int32.MinValue, values[0])));
patterns.Add(new Regex(@"^(-?\d+)\+\+(-?\d+)$"),
values => result.Add(new Range(values[0], values[1])));
foreach (String part in parts)
{
foreach (var kvp in patterns)
{
Match ma = kvp.Key.Match(part);
if (ma.Success)
{
Int32[] values = ma.Groups
.OfType<Group>()
.Skip(1) // group 0 is the entire match
.Select(g => Int32.Parse(g.Value))
.ToArray();
kvp.Value(values);
}
}
}
return result.ToArray();
}
}
Unit-tests:
[TestFixture]
public class RangeSplitterTests
{
[Test]
public void Split_NullString_ThrowsArgumentNullException()
{
Assert.Throws<ArgumentNullException>(() =>
{
var result = RangeSplitter.Split(null);
});
}
[Test]
public void Split_EmptyString_ReturnsEmptyArray()
{
Range[] result = RangeSplitter.Split(String.Empty);
Assert.That(result.Length, Is.EqualTo(0));
}
[TestCase(01, "++7", Int32.MinValue, 7)]
[TestCase(02, "7", 7, 7)]
[TestCase(03, "7++", 7, Int32.MaxValue)]
[TestCase(04, "1++7", 1, 7)]
public void Split_SinglePatterns_ProducesExpectedRangeBounds(
Int32 testIndex, String input, Int32 expectedLower,
Int32 expectedUpper)
{
Range[] result = RangeSplitter.Split(input);
Assert.That(result.Length, Is.EqualTo(1));
Assert.That(result[0].From, Is.EqualTo(expectedLower));
Assert.That(result[0].To, Is.EqualTo(expectedUpper));
}
[TestCase(01, "++7")]
[TestCase(02, "7++")]
[TestCase(03, "1++7")]
[TestCase(04, "1+7")]
[TestCase(05, "1++7+10++15+20+30++")]
public void Split_ExamplesFromQuestion_ProducesCorrectResults(
Int32 testIndex, String input)
{
Range[] ranges = RangeSplitter.Split(input);
String rangesAsString = String.Join("+",
ranges.Select(r => r.ToString()).ToArray());
Assert.That(rangesAsString, Is.EqualTo(input));
}
[TestCase(01, 10, 10, "10")]
[TestCase(02, 1, 10, "1++10")]
[TestCase(03, Int32.MinValue, 10, "++10")]
[TestCase(04, 10, Int32.MaxValue, "10++")]
public void RangeToString_Patterns_ProducesCorrectResults(
Int32 testIndex, Int32 lower, Int32 upper, String expected)
{
Range range = new Range(lower, upper);
Assert.That(range.ToString(), Is.EqualTo(expected));
}
}
精彩评论