Find all substrings between two strings
I need to get all substrings of a string that are enclosed between two given delimiter strings.
For example: StringParser.GetSubstrings("[start]aaaaaa[end] wwwww [start]cccccc[end]", "[start]", "[end]");
should return two strings, "aaaaaa" and "cccccc". Suppose we have only one level of nesting. I'm not sure about regular expressions, but I think they would be useful.
private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
    Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
    MatchCollection matches = r.Matches(input);
    foreach (Match match in matches)
        yield return match.Groups[1].Value;
}
Here's a solution that doesn't use regular expressions and doesn't take nesting into consideration.
public static IEnumerable<string> EnclosedStrings(
    this string s,
    string begin,
    string end)
{
    int beginPos = s.IndexOf(begin, 0);
    while (beginPos >= 0)
    {
        int start = beginPos + begin.Length;
        int stop = s.IndexOf(end, start);
        if (stop < 0)
            yield break;
        yield return s.Substring(start, stop - start);
        beginPos = s.IndexOf(begin, stop + end.Length);
    }
}
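A short sketch of how the method above might be hosted and called. Extension methods must live in a static class; the class name `StringExtensions` is made up for this example. The sketch also passes `StringComparison.Ordinal` to `IndexOf`, which the original does not do; that is an assumption that the delimiters should match character for character rather than by culture rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container class; the name is not from the original answer.
public static class StringExtensions
{
    public static IEnumerable<string> EnclosedStrings(
        this string s, string begin, string end)
    {
        // Ordinal comparison is an assumption added for this sketch.
        int beginPos = s.IndexOf(begin, 0, StringComparison.Ordinal);
        while (beginPos >= 0)
        {
            int start = beginPos + begin.Length;
            int stop = s.IndexOf(end, start, StringComparison.Ordinal);
            if (stop < 0)
                yield break; // begin marker without a closing end marker: stop here
            yield return s.Substring(start, stop - start);
            // Resume the search just past the end marker that was consumed.
            beginPos = s.IndexOf(begin, stop + end.Length, StringComparison.Ordinal);
        }
    }
}
```

Called as "[start]aaaaaa[end] wwwww [start]cccccc[end]".EnclosedStrings("[start]", "[end]"), this yields "aaaaaa" and "cccccc". A trailing "[start]" with no matching "[end]" is silently dropped.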
You can use a regular expression, but remember to call Regex.Escape on your arguments:
public static IEnumerable<string> GetSubStrings(
    string text,
    string start,
    string end)
{
    string regex = string.Format("{0}(.*?){1}",
        Regex.Escape(start),
        Regex.Escape(end));
    return Regex.Matches(text, regex, RegexOptions.Singleline)
        .Cast<Match>()
        .Select(match => match.Groups[1].Value);
}
I also added the RegexOptions.Singleline option so that the pattern still matches when the text between the delimiters contains newlines.
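To illustrate what that option changes, here is a minimal sketch that runs the same pattern with the options passed in by the caller. The class name `SinglelineDemo` and the helper `Extract` are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the original answer.
public static class SinglelineDemo
{
    // Builds the same escaped pattern as above and applies the given options.
    public static List<string> Extract(
        string text, string start, string end, RegexOptions options)
    {
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        return Regex.Matches(text, pattern, options)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

For the input "[start]line1\nline2[end]", calling Extract with RegexOptions.None finds nothing, because by default `.` does not match the newline character. With RegexOptions.Singleline it returns the single string "line1\nline2".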
You're going to need to better define the rules that govern your matching needs. When building any kind of matching or search code, you need to be very clear about what inputs you anticipate and what outputs you need to produce. It's very easy to produce buggy code if you don't consider these questions closely. That said...
You should be able to use regular expressions. Nesting may make it slightly more complicated, but it is still doable (depending on what you expect to match in nested scenarios). Something like this should get you started:
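var start = "[start]";
var end = "[end]";
// Lazy (.*?) stops at the first end marker; greedy (.*) would run on to the last one.
var regEx = new Regex(String.Format("{0}(.*?){1}", Regex.Escape(start), Regex.Escape(end)));
var source = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
// Matches returns every occurrence; Match would return only the first.
var matches = regEx.Matches(source);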
It should be trivial to wrap the code above into a function appropriate for your needs.
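What you get back in nested or repeated scenarios depends largely on whether the capture group is greedy or lazy. The following is a minimal sketch of one way to wrap the code into a function and compare the two; the class name `GreedyVsLazy` and the `lazy` parameter are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper; names are illustrative only.
public static class GreedyVsLazy
{
    public static List<string> Extract(
        string source, string start, string end, bool lazy)
    {
        // (.*?) captures as little as possible, (.*) as much as possible.
        string group = lazy ? "(.*?)" : "(.*)";
        var regEx = new Regex(Regex.Escape(start) + group + Regex.Escape(end));
        return regEx.Matches(source)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

On "[start]aaaaaa[end] wwwww [start]cccccc[end]" the lazy form returns "aaaaaa" and "cccccc", while the greedy form returns the single string "aaaaaa[end] wwwww [start]cccccc". On the nested input "[start]a[start]b[end]c[end]" the lazy form returns "a[start]b" and the greedy form returns "a[start]b[end]c", so neither handles nesting by itself.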
I was bored, so I made a useless micro-benchmark which "proves" (on my dataset, which has strings of up to 7k characters and <b> tags as the start/end parameters) my suspicion that juharr's solution is the fastest of the three overall.
Results (1000000 iterations * 20 test cases):
juharr: 6371ms
Jake: 6825ms
Mark Byers: 82063ms
NOTE: Compiled regex didn't speed things up much on my dataset.
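For readers unfamiliar with the term, "compiled regex" here presumably refers to RegexOptions.Compiled. The sketch below shows one way such a variant could look, with the Regex built once and reused across calls. It is not the code that was benchmarked; the class name `SubstringExtractor` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class; an illustration of reusing one compiled Regex instance.
public sealed class SubstringExtractor
{
    private readonly Regex _regex;

    public SubstringExtractor(string start, string end)
    {
        // Build the pattern once; Compiled trades startup cost for matching speed.
        _regex = new Regex(
            Regex.Escape(start) + "(.*?)" + Regex.Escape(end),
            RegexOptions.Compiled | RegexOptions.Singleline);
    }

    public IEnumerable<string> GetSubStrings(string text)
    {
        foreach (Match match in _regex.Matches(text))
            yield return match.Groups[1].Value;
    }
}
```

Whether this helps at all depends on the data and on how often the same delimiters are reused, so it is worth measuring on your own inputs.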