Grab 'href' value from following code?

Posted 2023-01-27 02:58

I need to grab the href value from HTML like the following in C#:

<td class="tl"><a href="http://facebook.com/"target="_blank"><img src="images/poput_icon.png"/></a>

Can anyone show me how to do this? Are regular expressions the best approach? I need to gather these from a page that contains hundreds of links, but they all look like the code above. I want to ignore the other hrefs on the page.

Thanks in advance.

Jimmy

First, don't use Regular Expressions to parse XML. See here for more detailed information on the whys and wherefores.

Second, you can use LINQ-to-XML to achieve this. Assuming you have loaded your XML snippet into an XDocument instance (and therefore, td is the root element), you can then do the following:

var href = doc
    .Element("td")
    .Element("a")
    .Attribute("href")
    .Value;
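
For completeness, here is a minimal, self-contained sketch of that LINQ-to-XML approach. It assumes the snippet has first been made well-formed XML (a space between the attributes and a closing </td>), because XDocument.Parse throws on markup as loose as the question's. The HrefFromXml class and GetHref method are names made up for this illustration, and the ?. operator needs C# 6 or later.

```csharp
using System;
using System.Xml.Linq;

public static class HrefFromXml
{
    // Returns the href of the first <a> directly under the root <td>,
    // or null if any step of the path is missing.
    public static string GetHref(string xml)
    {
        // Throws System.Xml.XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(xml);
        return doc.Element("td")?.Element("a")?.Attribute("href")?.Value;
    }

    public static void Main()
    {
        // Hand-cleaned version of the question's markup (assumption:
        // real pages will usually not be this tidy).
        string xml =
            "<td class=\"tl\"><a href=\"http://facebook.com/\" target=\"_blank\">" +
            "<img src=\"images/poput_icon.png\"/></a></td>";

        Console.WriteLine(GetHref(xml)); // prints http://facebook.com/
    }
}
```

If the real page is not valid XML, this approach will fail at the Parse step, so it is only a fit when you control or can clean the input.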

I would do this with a regular expression, yes. So you want to find the value inside an anchor tag surrounding an img tag at the beginning of a table cell?

Here's C# code to create a Regex object that will match links like that, then use it, where document is a String containing the entire document to search:

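// Requires: using System.Text.RegularExpressions;
// Note the [^>]* after the closing quote, so attributes that follow href (such as target) are allowed.
Regex linkscraper = new Regex(@"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*\/a\s*>");
MatchCollection links = linkscraper.Matches(document);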

Each matching link is a Match object in the links collection; the URL itself is in the group named "link".
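
Reading the captured URLs out of the collection might look like the sketch below. The LinkScraper class and ExtractLinks method are hypothetical names, and the pattern is written with [^>]* after the href's closing quote so that attributes following href (like target in the question's sample) do not block the match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkScraper
{
    // td, then an anchor whose href is captured as "link", then an img, then </a>.
    private static readonly Regex Pattern = new Regex(
        @"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*/a\s*>");

    // Collects the value of the "link" group from every match in the document.
    public static List<string> ExtractLinks(string document)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(document))
        {
            result.Add(m.Groups["link"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // The question's sample cell plus a text-only link that should be ignored.
        string document =
            "<td class=\"tl\"><a href=\"http://facebook.com/\"target=\"_blank\">" +
            "<img src=\"images/poput_icon.png\"/></a></td>" +
            "<td><a href=\"http://example.com/\">other</a></td>";

        foreach (string link in ExtractLinks(document))
        {
            Console.WriteLine(link); // prints http://facebook.com/
        }
    }
}
```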

The leading @ turns this into a verbatim string literal: every \ is passed through as-is rather than being treated as an escape, so we aren't forced to double them to get regular expression \ behavior. Since quotes can't be escaped with \" in a verbatim string, they're escaped by doubling them: "".
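
To see the two escaping styles side by side, here is a small sketch that writes the same pattern fragment both ways and checks that they produce the same string. The VerbatimDemo class and its constants are illustrative names only.

```csharp
using System;

public static class VerbatimDemo
{
    // Regular literal: backslashes and quotes both need a backslash escape.
    public const string Regular = "href\\s*=\\s*\"[^\"]*\"";

    // @-prefixed literal: backslashes are literal, quotes are doubled.
    public const string Verbatim = @"href\s*=\s*""[^""]*""";

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim); // prints True
        Console.WriteLine(Verbatim);            // prints href\s*=\s*"[^"]*"
    }
}
```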

This is a fairly complicated regular expression. Breaking it down:

It's splattered with a bunch of \s* elements, roughly meaning "any whitespace, or none". It makes your linkscraper expression ignore variations in spacing allowed by HTML.
The [^>] character class matches anything that isn't a ">"; repeating it (the trailing *) represents "other stuff inside the tag that we don't care about". The exclusion is there to stop the regex from running past the end of a tag. Quantifiers are greedy by default, so without it (say, with .* instead) the expression would cheerfully match from the first part of the first tag in the document all the way to the end of the last one.
With all those pieces explained, it's relatively simple to understand:
- a TD tag (which may or may not have spaces, or attributes), immediately followed by (for definitions of "immediately" that allow arbitrary whitespace)
- an A tag, where the href is captured into a capturing group named "link". The [^""]*, which is the escaped form of [^"]*, matches a run of non-quote characters. We don't care about the rest of the tag.
- An img tag, which can contain whatever it wants.
- The /a closing tag.
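
The point about greediness and [^>] can be checked with a small experiment. This is only a sketch on a made-up one-line input; GreedyDemo and FirstMatch are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns the text of the first match of pattern in input.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "<a href=\"x\"><img src=\"y\"/></a>";

        // Greedy .* runs to the last '>' on the line, swallowing every tag.
        Console.WriteLine(FirstMatch(@"<a.*>", input));
        // prints <a href="x"><img src="y"/></a>

        // [^>]* cannot cross a '>', so the match stops at the end of the first tag.
        Console.WriteLine(FirstMatch(@"<a[^>]*>", input));
        // prints <a href="x">
    }
}
```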

If you know more about the exact formatting of the document you are trying to extract links from, you can tighten up this regular expression. Specifically, the [^>]* groups, the "match zero or more characters that aren't >" blocks used to allow tags to contain whatever they want, should probably be replaced by subexpressions more specific to the actual document. As written, this will catch anything of the form <td><a href="..."><img></a>, which may or may not match more than you want it to. Note that Regex is case-sensitive by default, so pass RegexOptions.IgnoreCase if the page mixes upper- and lower-case tag names.
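
As one possible tightening, the sketch below assumes every cell of interest is exactly <td class="tl"> and that href is the first attribute of the anchor, as in the question's sample. Those assumptions, and the TightLinkScraper name, are mine and may not hold for the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TightLinkScraper
{
    // Only accepts <td class="tl"> cells, with href as the anchor's first attribute.
    private static readonly Regex Pattern = new Regex(
        @"<td\s+class\s*=\s*""tl""\s*>\s*<a\s+href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<img[^>]*>\s*</a>",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string document)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(document))
        {
            result.Add(m.Groups["link"].Value);
        }
        return result;
    }

    public static void Main()
    {
        string document =
            "<td class=\"tl\"><a href=\"http://facebook.com/\"target=\"_blank\">" +
            "<img src=\"images/poput_icon.png\"/></a></td>" +
            // Same shape but a different class, so it is skipped.
            "<td class=\"nav\"><a href=\"http://example.com/\"><img src=\"a.png\"/></a></td>";

        foreach (string link in ExtractLinks(document))
        {
            Console.WriteLine(link); // prints http://facebook.com/
        }
    }
}
```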
