Simple regex help using C# (Regex pattern included)
I am trying to parse the HTML source of a website. My current regex is this:
Regex pattern = new Regex (
@"<a\b # Begin start tag
[^>]+? # Lazily consume up to id attribute
id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]? # $1: id
[^>]+? # Lazily consume up to href attribute
href\s*=\s*['""]?([^>\s'""]+)['""]? # $2: href
[^>]* # Consume up to end of open tag
> # End start tag
(.*?) # $3: name
</a\s*> # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );
But it doesn't match the links anymore. I included a sample string here.
Basically I am trying to match these:
<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>
"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**
The sample I posted contains at least 3 such links; I didn't count the rest.
Also I use RegexHero (online and free) to see my matching interactively before adding it to code.
For completeness, here is how it's done with the Html Agility Pack, a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).
Loading the document, parsing it, and finding the 3 links is as simple as:
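// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
// Keep only the anchors whose id starts with "thread_title_"
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
    .Where(link => link.Id.StartsWith(linkIdPrefix));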
That's it, really. Now you can easily get the data:
foreach (var link in threadLinks)
{
string href = link.GetAttributeValue("href", null);
string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
string text = link.InnerHtml; // or link.InnerText
Console.WriteLine("{0} - {1}", id, href);
}
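To see why the parser approach is less brittle than the regex, here is a minimal offline sketch that runs the same query against an in-memory string instead of the live page. It assumes the HtmlAgilityPack NuGet package is installed. The class name AttributeOrderDemo and the second anchor (the same link with id and href swapped) are made up for illustration and are not taken from the real page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

static class AttributeOrderDemo
{
    static void Main()
    {
        // First anchor: the sample from the question (href before id).
        // Second anchor: hypothetical variant with the attributes swapped.
        string html =
            "<a href=\"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/\" id=\"thread_title_3046631\">How to Get a Travel Visa</a>" +
            "<a id=\"thread_title_3046631\" href=\"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/\">How to Get a Travel Visa</a>";

        const string linkIdPrefix = "thread_title_";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of downloading

        // Attribute order is irrelevant here: attributes are looked up by name.
        var threadLinks = doc.DocumentNode.Descendants("a")
            .Where(a => a.GetAttributeValue("id", "")
                         .StartsWith(linkIdPrefix, StringComparison.Ordinal));

        foreach (var link in threadLinks)
        {
            string id = link.GetAttributeValue("id", "").Substring(linkIdPrefix.Length);
            string href = link.GetAttributeValue("href", "");
            Console.WriteLine("{0} - {1} - {2}", id, href, link.InnerText);
        }
    }
}
```

Both anchors should be picked up by the same query, which is the point: a change in attribute order does not require touching the code.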
This is quite simple: the markup changed, and now the href attribute appears before the id:
<a\b # Begin start tag
[^>]+? # Lazily consume up to href attribute
href\s*=\s*['""]?([^>\s'""]+)['""]? # $1: href
[^>]+? # Lazily consume up to id attribute
id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]? # $2: id
[^>]* # Consume up to end of open tag
> # End start tag
(.*?) # $3: name
</a\s*> # Closing tag
Note that:
- This fragility is the main reason parsing HTML with a regex is a bad idea.
- The group numbers have changed. You can use named groups instead, while you're at it: (?<ID>[^>\s'""]+) instead of ([^>\s'""]+).
- The quotes are still escaped (this should be OK in character sets).
Example on regex hero.
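As a rough sketch of the named-group suggestion, the reordered pattern can be run against the sample anchor from the question like this. The group names Link, TopicId and Title and the class name ThreadLinkRegexDemo are my own picks for illustration; the pattern is otherwise the one above and still assumes href comes before id.

```csharp
using System;
using System.Text.RegularExpressions;

static class ThreadLinkRegexDemo
{
    // Reordered pattern (href first, then id) with named groups.
    static readonly Regex Pattern = new Regex(
        @"<a\b                                      # Begin start tag
          [^>]+?                                    # Lazily consume up to href attribute
          href\s*=\s*['""]?(?<Link>[^>\s'""]+)['""]?   # Link
          [^>]+?                                    # Lazily consume up to id attribute
          id\s*=\s*['""]?thread_title_(?<TopicId>[^>\s'""]+)['""]?   # TopicId
          [^>]*                                     # Consume up to end of open tag
          >                                         # End start tag
          (?<Title>.*?)                             # Title
          </a\s*>                                   # Closing tag",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    static void Main()
    {
        // Sample anchor copied from the question.
        string html = "<a href=\"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/\" id=\"thread_title_3046631\">How to Get a Travel Visa</a>";

        foreach (Match m in Pattern.Matches(html))
        {
            // Groups are read by name, so renumbering no longer matters.
            Console.WriteLine("{0} | {1} | {2}",
                m.Groups["TopicId"].Value,
                m.Groups["Link"].Value,
                m.Groups["Title"].Value);
        }
    }
}
```

For the sample anchor this should print 3046631 | http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/ | How to Get a Travel Visa. It will silently stop matching again if the site swaps the attributes back.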
Don't do that (well, almost, but it's not for everyone). Parsers are meant for that type of thing.