Having trouble with regular expression

2023-01-28 04:18

I am a total noob at regular expressions and need to parse some HTML. I am looking for individual categories. The following is what the HTML looks like:

<p>Categories: 
        <a href="/some/URL/That/I/dont/need">Category1</a>  | 
        <a href="/could/be/another/URL/That/I/dont/need">Category2</a> 
</p>

There could be 1-5 categories. What I need is the link text of each one: "Category1", "Category2", etc.

This project is in C# using Visual Studio 2010. Currently what I have is this:

private static readonly Regex _categoriesRegex = new Regex("(<p>Categories:)((/w/.?<Categories>.*?).*?)(</p>)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

I know I am probably way off, but I'm wondering if anyone could at least lead me in the right direction.

Don't use regex for this kind of task, use a dedicated tool instead. Your best option is probably to use HTML Agility Pack.

EDIT: here's an example using HTML Agility Pack (written in LINQPad):

void Main()
{
    var doc = new HtmlDocument();
    doc.Load(@"D:\tmp\foobar.html");
    var query =
        from p in doc.DocumentNode.Descendants("p")
        where p.InnerText.StartsWith("Categories:")
        from a in p.Elements("a")
        select a.InnerText;

    query.Dump();
}

It returns:

Category1
Category2

I should note that this was the first time I actually tried to use HAP, and I'm pleasantly surprised by how easy it is (writing the code above took about 3 minutes). The API is very similar to LINQ to XML, which makes it very intuitive if you're comfortable with LINQ.

The HTML Agility Pack (HAP) is the usual suggestion for this type of question, and Thomas' solution is great. However, if you can guarantee that your input is well-formed and your desired result is straightforward, you can usually get by with LINQ to XML instead of introducing HAP to your project. I demonstrate this approach below. I've also included a regex approach since your request isn't too wild, given that non-nested input is simple to deal with.

I recommend you stick with the LINQ solution since it's maintainable and easy for others to understand. The regex was added only to demonstrate how to do it and address your original question.

string input = @"<p>Categories: 
        <a href=""/some/URL/That/I/dont/need"">Category1</a>  | 
        <a href=""/could/be/another/URL/That/I/dont/need"">Category2</a> 
</p>";

// LINQ to XML approach for well formed HTML
var xml = XElement.Parse(input);
var query = xml.Elements("a").Select(e => e.Value);
foreach (var item in query)
{
    Console.WriteLine(item);
}

// regex solution
string pattern = @"Categories:(?:[^<]+<a[^>]+>([^<]+)</a>)+";

Match m = Regex.Match(input, pattern);
if (m.Success)
{
    foreach (Capture c in m.Groups[1].Captures)
    {
        Console.WriteLine(c.Value);    
    }
}
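
One caveat with the LINQ to XML route: XElement.Parse throws an XmlException as soon as the fragment is not well-formed XML (an unclosed <br>, for instance). Here is a minimal sketch of how the two approaches above could be combined, trying LINQ to XML first and falling back to the regex. The class name CategoryParser, the method GetCategories, and the sample strings are hypothetical names made up for illustration, not part of anything above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class CategoryParser
{
    // Hypothetical helper: returns the link text of every <a> in the fragment.
    public static List<string> GetCategories(string html)
    {
        try
        {
            // Only works when the fragment is well-formed XML.
            return XElement.Parse(html)
                           .Elements("a")
                           .Select(e => e.Value.Trim())
                           .ToList();
        }
        catch (XmlException)
        {
            // Fall back to the regex shown above.
            var result = new List<string>();
            Match m = Regex.Match(html, @"Categories:(?:[^<]+<a[^>]+>([^<]+)</a>)+");
            if (m.Success)
            {
                foreach (Capture c in m.Groups[1].Captures)
                {
                    result.Add(c.Value.Trim());
                }
            }
            return result;
        }
    }

    public static void Main()
    {
        // Made-up sample inputs; the second one has an unclosed <br>.
        string wellFormed = "<p>Categories: <a href=\"/a\">Category1</a> | <a href=\"/b\">Category2</a></p>";
        string malformed = "<p>Categories: <a href=\"/a\">Category1</a> | <a href=\"/b\">Category2</a><br></p>";

        Console.WriteLine(string.Join(", ", GetCategories(wellFormed)));
        Console.WriteLine(string.Join(", ", GetCategories(malformed)));
    }
}
```

Both calls should print the same two category names; the second only gets there through the regex fallback. If your real pages are routinely malformed, that is a good sign you want HAP rather than this kind of patching.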

Adding a little bit to @Thomas Levesque's answer (which is the right way to go):

If you want to get the link instead of the text between <a> tags, you just need to do:

    var query =
        from p in doc.DocumentNode.Descendants("p")
        where p.InnerText.StartsWith("Categories:")
        from a in p.Elements("a")
        select a.Attributes["href"].Value;

EDIT: If you're not familiar with LINQ syntax, you could get the same with:

var nodes = doc.DocumentNode.SelectNodes("//p"); // All <p> tags in the document (null if there are none)
if (nodes != null)
{
    foreach (var n in nodes)
    {
        if (n.InnerText.StartsWith("Categories:")) // Found the <p> tag we need
        {
            var links = n.SelectNodes("./a[@href]"); // Direct <a> children of the <p> tag that have an href (null if there are none)
            if (links != null)
            {
                foreach (var a in links)
                {
                    // Prints something like: "Name: Category1        Link: /some/URL/That/I/dont/need"
                    Console.WriteLine("Name: {0} \t Link: {1}", a.InnerText, a.Attributes["href"].Value);
                }
            }
            break;
        }
    }
}
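
If you'd rather not add HAP and your markup is guaranteed to be well-formed, the same name-plus-link output can probably be had with LINQ to XML, along the lines of the other answer. This is only a sketch under that assumption: the class CategoryLinks and the method GetNamesAndLinks are hypothetical names, and XElement.Parse will throw on markup that is not valid XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CategoryLinks
{
    // Hypothetical helper: returns (name, link) pairs for each direct <a> child with an href.
    public static List<KeyValuePair<string, string>> GetNamesAndLinks(string html)
    {
        return XElement.Parse(html) // throws XmlException if the fragment is not well-formed
            .Elements("a")
            .Where(a => a.Attribute("href") != null)
            .Select(a => new KeyValuePair<string, string>(a.Value.Trim(), a.Attribute("href").Value))
            .ToList();
    }

    public static void Main()
    {
        // Same sample markup as in the question.
        string input = @"<p>Categories: 
        <a href=""/some/URL/That/I/dont/need"">Category1</a>  | 
        <a href=""/could/be/another/URL/That/I/dont/need"">Category2</a> 
</p>";

        foreach (var pair in GetNamesAndLinks(input))
        {
            Console.WriteLine("Name: {0} \t Link: {1}", pair.Key, pair.Value);
        }
    }
}
```

The Where clause plays the same role as the [@href] filter in the XPath version: anchors without an href are skipped instead of causing a null reference.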
