开发者

Regex vs String.Contains

Hola. I'm failing to write a method to test for words within a plain text or html document. I was reasonably literate with regex, and I am newer to c# (from way more java).

Just 'cause,

string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space

and then,

string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext,@"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);

For a ca开发者_运维知识库se where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!

edit:

I'm searching for matches on skills in resumes. for example, the distinct value "c++".

edit:

a real excerpt is given below:

"...administration- c, c++, perl, shell programming..."


The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character.

If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.


Your regular expression is turning into:

/\bc\+\+\b/

Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.

If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.

Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜