Having trouble with regular expression
I am a total noob at regular expressions and need to parse some HTML. I am looking for individual categories. The following is what the HTML looks like:
<p>Categories:
<a href="/some/URL/That/I/dont/need">Category1</a> |
<a href="/could/be/another/URL/That/I/dont/need">Category2</a>
</p>
There could be 1-5 categories. What I need is the category names: "Category1", "Category2", etc.
This project is in C# using Visual Studio 2010. Currently what I have is this:
private static readonly Regex _categoriesRegex = new Regex("(<p>Categories:)((/w/.?<Categories>.*?).*?)(</p>)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
I know I am probably way off, but I'm wondering if anyone could at least lead me in the right direction.
Don't use regex for this kind of task, use a dedicated tool instead. Your best option is probably to use HTML Agility Pack.
EDIT: here's an example using HTML Agility Pack (written in LINQPad):
void Main()
{
var doc = new HtmlDocument();
doc.Load(@"D:\tmp\foobar.html");
var query =
from p in doc.DocumentNode.Descendants("p")
where p.InnerText.StartsWith("Categories:")
from a in p.Elements("a")
select a.InnerText;
query.Dump();
}
It returns:
Category1
Category2
I should note that it was the first time I actually tried to use HAP, and I'm pleasantly surprised by how easy it is (writing the code above took about 3 minutes). The API is very similar to LINQ to XML, which makes it very intuitive if you're comfortable with LINQ.
The HTML Agility Pack (HAP) is the usual suggestion for this type of question, and Thomas' solution is great. However, if you can guarantee that your input is well-formed and the result you want is straightforward, you can usually get by with LINQ to XML instead of introducing HAP to your project. I demonstrate this approach below. I've also included a regex approach, since your request isn't too wild: non-nested input is simple to deal with.
I recommend you stick with the LINQ solution since it's maintainable and easy for others to understand. The regex was added only to demonstrate how to do it and address your original question.
string input = @"<p>Categories:
<a href=""/some/URL/That/I/dont/need"">Category1</a> |
<a href=""/could/be/another/URL/That/I/dont/need"">Category2</a>
</p>";
// LINQ to XML approach for well-formed HTML (needs using System.Linq and System.Xml.Linq)
var xml = XElement.Parse(input);
var query = xml.Elements("a").Select(e => e.Value);
foreach (var item in query)
{
Console.WriteLine(item);
}
// Regex solution (needs using System.Text.RegularExpressions); every repetition of group 1 is kept in Captures
string pattern = @"Categories:(?:[^<]+<a[^>]+>([^<]+)</a>)+";
Match m = Regex.Match(input, pattern);
if (m.Success)
{
foreach (Capture c in m.Groups[1].Captures)
{
Console.WriteLine(c.Value);
}
}
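As a variation on the regex approach above, here is a minimal two-step sketch: first isolate the "Categories:" paragraph, then collect one match per link. The class and method names (CategoryExtractor, Extract) are made up for illustration, and it assumes the same simple, non-nested markup as in the question. It is not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryExtractor
{
    // Hypothetical helper (not from the original answer): returns the link text
    // of every <a> inside the first "<p>Categories: ... </p>" block.
    public static List<string> Extract(string html)
    {
        var names = new List<string>();

        // Step 1: isolate the paragraph so links elsewhere on the page are ignored.
        Match paragraph = Regex.Match(
            html,
            @"<p>\s*Categories:(.*?)</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!paragraph.Success)
            return names;

        // Step 2: one match per anchor; group 1 is the text between the tags.
        foreach (Match link in Regex.Matches(
            paragraph.Groups[1].Value,
            @"<a\b[^>]*>([^<]*)</a>",
            RegexOptions.IgnoreCase))
        {
            names.Add(link.Groups[1].Value.Trim());
        }
        return names;
    }

    public static void Main()
    {
        string input = @"<p>Categories:
<a href=""/some/URL/That/I/dont/need"">Category1</a> |
<a href=""/could/be/another/URL/That/I/dont/need"">Category2</a>
</p>";
        foreach (string name in Extract(input))
            Console.WriteLine(name);
    }
}
```

Splitting the work into two patterns keeps each one short, and it avoids relying on repeated captures, which are specific to the .NET regex engine.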
Adding a little bit to @Thomas Levesque's answer (which is the right way to go):
If you want to get the link instead of the text between the <a> tags, you just need to do:
var query =
from p in doc.DocumentNode.Descendants("p")
where p.InnerText.StartsWith("Categories:")
from a in p.Elements("a")
select a.Attributes["href"].Value;
EDIT: If you're not familiar with LINQ syntax, you could get the same with:
var nodes = doc.DocumentNode.SelectNodes("//p"); //Here I get all the <p> tags in the document
if (nodes != null)
{
foreach (var n in nodes)
{
if (n.InnerText.StartsWith("Categories:")) //If the <p> tag we need was found
{
var links = n.SelectNodes("./a[@href]"); //All <a> tags with an href that are direct children of the <p> tag
if (links != null) //SelectNodes returns null when nothing matches
{
foreach (var a in links)
{
//It will print something like: "Name: Category1 	 Link: /some/URL/That/I/dont/need"
Console.WriteLine("Name: {0} \t Link: {1}", a.InnerText, a.Attributes["href"].Value);
}
}
break;
}
}
}
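If the input is guaranteed to be well-formed (the assumption behind the LINQ to XML answer above), the name and link can also be read together without HAP. This is only a sketch under that assumption; CategoryLinks and Extract are hypothetical names, and malformed HTML will make XElement.Parse throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CategoryLinks
{
    // Hypothetical helper: returns (name, href) pairs for the <a> children
    // of a single well-formed <p> element.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        XElement p = XElement.Parse(html); // throws XmlException on malformed input
        return p.Elements("a")
            .Select(a => new KeyValuePair<string, string>(
                a.Value,
                (string)a.Attribute("href"))) // null if the attribute is missing
            .ToList();
    }

    public static void Main()
    {
        string input = @"<p>Categories:
<a href=""/some/URL/That/I/dont/need"">Category1</a> |
<a href=""/could/be/another/URL/That/I/dont/need"">Category2</a>
</p>";
        foreach (var link in Extract(input))
            Console.WriteLine("Name: {0} \t Link: {1}", link.Key, link.Value);
    }
}
```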