C# reliable way to pattern match?

At the moment I am trying to match patterns such as

text text date1 date2

So I have regular expressions that do just that. However, if users enter more than one whitespace character, or put some of the text on a new line, the input does not get picked up because it doesn't exactly match the pattern.

Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.

Anyone got any better solutions?

Update

Here is a small example of the pattern I would need to match:

FIND me@test.com 01/01/2010 to 10/01/2010

Here is my current regex:

FIND [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}

This works fine 90% of the time. However, if users submit this information via email, it can contain all kinds of formatting and HTML that I am not interested in. I am using a combination of the HtmlAgilityPack and an HTML tag-removing regex to strip all the HTML from the email, but even then I can't seem to get a match on some occasions.

I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...


To match one or more whitespace characters (space, tab, newline), use:

\s+

Substitute the above wherever you have the physical space in your pattern and you should be fine.
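
Applied to the regex from the question, that substitution might look like the sketch below. The names `FindCommandParser` and `TryParse` are made up for illustration, and the capture groups are added so the pieces can be read back. `RegexOptions.IgnoreCase` is also an addition: the original character classes only list `A-Z`, so a lowercase address such as `me@test.com` would not match without it.

```csharp
using System.Text.RegularExpressions;

public static class FindCommandParser
{
    // The question's pattern with every literal space replaced by \s+,
    // plus capture groups for the email and the two dates.
    private static readonly Regex FindPattern = new Regex(
        @"FIND\s+([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\s+" +
        @"([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})\s+to\s+" +
        @"([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})",
        RegexOptions.IgnoreCase);

    // Returns true and fills the out parameters when the command is found
    // anywhere in the input; otherwise returns false with empty strings.
    public static bool TryParse(string input, out string email, out string from, out string to)
    {
        Match m = FindPattern.Match(input);
        email = m.Success ? m.Groups[1].Value : string.Empty;
        from = m.Success ? m.Groups[2].Value : string.Empty;
        to = m.Success ? m.Groups[3].Value : string.Empty;
        return m.Success;
    }
}
```

Because the pattern is not anchored, it should still find the command when other text surrounds it. Note that `IgnoreCase` also lets `find` and `TO` match, which may or may not be wanted.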


Here is an example that matches multiple groups in text containing extra whitespace and/or newlines:

var txt = "text text   date1\ndate2";
// \s already matches newlines, so RegexOptions.Singleline (which only changes how '.' behaves) is not needed.
var match = Regex.Match(txt, @"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)");

match.Groups[n].Value, with n from 1 to 4, will contain the captured values.


I would split the string into a string array and match each resulting string to the necessary Regular Expression.
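
A rough sketch of that idea for the `FIND` example follows. The per-token patterns and the names `TokenMatcher` and `IsFindCommand` are assumptions for illustration, not part of the original answer. Passing a null separator array to `Split` makes it split on any whitespace, and `RemoveEmptyEntries` drops the empty strings that repeated spaces or line breaks would otherwise produce.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenMatcher
{
    // One anchored pattern per expected token, in order.
    private static readonly Regex[] TokenPatterns =
    {
        new Regex(@"^FIND$", RegexOptions.IgnoreCase),
        new Regex(@"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", RegexOptions.IgnoreCase),
        new Regex(@"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$"),
        new Regex(@"^to$", RegexOptions.IgnoreCase),
        new Regex(@"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$")
    };

    public static bool IsFindCommand(string input)
    {
        // A null separator array means "split on any whitespace character".
        string[] tokens = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length != TokenPatterns.Length)
        {
            return false;
        }
        for (int i = 0; i < tokens.Length; i++)
        {
            if (!TokenPatterns[i].IsMatch(tokens[i]))
            {
                return false;
            }
        }
        return true;
    }
}
```

This version expects the whole input to be the command. For an email body with surrounding text, the same check would have to be run over each window of five consecutive tokens.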


\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b


It's a nasty expression, but here is something that will work for the input you provided:

^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$

This will work with variable amounts of whitespace between the capture groups as well.
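
As a sketch of how the captured dates could then be used, the example below feeds groups 3 and 4 into `DateTime.TryParseExact`. The `dd/MM/yyyy` format is an assumption (the question does not say whether the dates are day-first or month-first), and `DateRangeExtractor`, `TryExtract`, and the `Trim()` call are additions for illustration. The `Trim()` is there because the `^` and `$` anchors otherwise reject input with leading or trailing whitespace.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateRangeExtractor
{
    // The expression from the answer above, unchanged.
    private static readonly Regex Pattern = new Regex(
        @"^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$");

    public static bool TryExtract(string input, out DateTime start, out DateTime end)
    {
        start = default(DateTime);
        end = default(DateTime);

        Match m = Pattern.Match(input.Trim());
        if (!m.Success)
        {
            return false;
        }

        // Assumed format: day first, then month. Swap to "MM/dd/yyyy" if needed.
        const string format = "dd/MM/yyyy";
        return DateTime.TryParseExact(m.Groups[3].Value, format,
                   CultureInfo.InvariantCulture, DateTimeStyles.None, out start)
            && DateTime.TryParseExact(m.Groups[4].Value, format,
                   CultureInfo.InvariantCulture, DateTimeStyles.None, out end);
    }
}
```

Parsing the captures also rejects strings that look like dates but are not, such as `99/99/2010`, which the regex alone would accept.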


Through ORegex you can tokenize your string and just pattern match on token sequences:

var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);

var matches = oregex.Matches(tokens); // the matching token subsequences

...

public bool IsText(string str)
{
    ...
}

public bool IsDate(string str)
{
    ...
}