
Using C# regex to select text based on custom tags

I have a string in C# containing some data I need to extract based on certain conditions.

The string contains many tenders in the following form:

<TENDER> some words, don't know how many, may contain numbers and things like slashes (/) or whatever <DESCRIPTION> some more words and possibly other things like numbers or whatever describing the tender here </DESCRIPTION> some more words and possibly numbers and weird things </TENDER>

This string doesn't contain any nested <TENDER> tags; it's flat. The <DESCRIPTION> tags occur only once within each pair of <TENDER> tags.

I'm using <TENDER>(.+?)</TENDER> as the regex to split up the tenders and it works fine. If this is wrong or stupid and you know a better way to write this, please let me know, as I have discovered I suck at regex.

My problem is that I now need to select a tender only if its description contains any word in a list of keywords (let's say for now I want to select a tender only if it contains either "concrete" or "brick" in the description).

So far the regex I have come up with looks like this, but I don't know what to put in the middle. Also I have a vague suspicion that this might return me some false positives.

<TENDER>(.+?)<DESCRIPTION>have no idea what to do here</DESCRIPTION>(.+?)</TENDER>

If any of you regex gurus could point me in the right direction I would be most appreciative.


Use

<TENDER>([^<>]+?)<DESCRIPTION>[^<>]*?(brick|concrete)[^<>]*?</DESCRIPTION>([^<>]+?)</TENDER> 

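I am using [^<>] instead of . so that no part of the match can run past a tag boundary. Note that brick|concrete also matches inside longer words such as "bricks".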
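
Since the question mentions a list of keywords, here is a minimal sketch of how the same idea could be built from a list at run time. The class and method names (TenderFilter, SelectTenders) are made up for illustration, and the \b word boundaries, the IgnoreCase option and the use of * instead of +? are my assumptions, not part of the pattern above. It also assumes the tender text never contains < or > outside the tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TenderFilter
{
    // Returns the full text of every <TENDER>...</TENDER> block whose
    // description contains at least one of the keywords as a whole word.
    public static List<string> SelectTenders(string input, IEnumerable<string> keywords)
    {
        // Escape each keyword so characters such as '.' or '+' are taken literally.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // [^<>]* cannot cross a tag, so the keyword must sit inside the description.
        string pattern =
            @"<TENDER>[^<>]*<DESCRIPTION>[^<>]*\b(?:" + alternation +
            @")\b[^<>]*</DESCRIPTION>[^<>]*</TENDER>";

        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample data: only the first description mentions a keyword.
        string input =
            "<TENDER>T1 <DESCRIPTION>red brick wall</DESCRIPTION> ref 12/3</TENDER>" +
            "<TENDER>T2 <DESCRIPTION>steel frame</DESCRIPTION> brick supplier</TENDER>";

        foreach (string tender in SelectTenders(input, new[] { "concrete", "brick" }))
            Console.WriteLine(tender);
    }
}
```

With this sample input only the first tender should be printed: the second one mentions "brick" outside its description, which the pattern does not accept.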


Use RegexOptions.IgnorePatternWhitespace because I have commented the pattern. It does not affect how the data is processed; it only lets you break the pattern across lines and comment it.

string pattern = @"
(?<=<TENDER>)                  # Look behind for TENDER
(?<TenderBefore>[^<]*)         # Put the data into the TenderBefore Named Match Capture Group (assumes no '<' in the text)
<DESCRIPTION>
(?=[^<]*(?:brick|concrete))    # Look ahead for the keywords, without reading past the description
(?<Description>[^<]*)          # Put the data into the Description NMCG
</DESCRIPTION>
(?<TenderAfter>[^<]*)          # Put text into NMCG TenderAfter
(?=</TENDER>)                  # Tender look ahead";

After processing the matches, extract the data from each match, for example:

string Tender = string.Format("{0}<DESCRIPTION>{1}</DESCRIPTION>{2}",
 myMatch.Groups["TenderBefore"].Value,
 myMatch.Groups["Description"].Value,
 myMatch.Groups["TenderAfter"].Value);
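
To show how the pieces might fit together, here is a small self-contained sketch that runs a commented pattern of this shape with RegexOptions.IgnorePatternWhitespace and reads the named groups. The class name TenderGroupsDemo, the FindTenders helper and the sample string are hypothetical. The pattern in the sketch uses [^<]* for every capture, which assumes the tender text never contains a < character outside the tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TenderGroupsDemo
{
    // Assumption: no '<' appears in the text itself, so [^<]* stops at the next tag.
    private const string Pattern = @"
        (?<=<TENDER>)                  # must be preceded by the opening tag
        (?<TenderBefore>[^<]*)         # text before the description
        <DESCRIPTION>
        (?=[^<]*(?:brick|concrete))    # keyword must occur inside the description
        (?<Description>[^<]*)          # the description text
        </DESCRIPTION>
        (?<TenderAfter>[^<]*)          # text after the description
        (?=</TENDER>)                  # must be followed by the closing tag";

    public static MatchCollection FindTenders(string input)
    {
        // IgnorePatternWhitespace lets the pattern carry whitespace and # comments.
        return Regex.Matches(input, Pattern, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        // Hypothetical sample data: only the first tender mentions a keyword.
        string input =
            "<TENDER>A <DESCRIPTION>poured concrete slab</DESCRIPTION> B</TENDER>" +
            "<TENDER>C <DESCRIPTION>timber</DESCRIPTION> D</TENDER>";

        foreach (Match myMatch in FindTenders(input))
        {
            // Rebuild the tender from its three named groups.
            string tender = string.Format("{0}<DESCRIPTION>{1}</DESCRIPTION>{2}",
                myMatch.Groups["TenderBefore"].Value,
                myMatch.Groups["Description"].Value,
                myMatch.Groups["TenderAfter"].Value);
            Console.WriteLine(tender);
        }
    }
}
```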

HTH


Instead of regex, try using a proper DOM parsing library, such as the Html Agility Pack. It should work with any tags, even custom ones.
