Grab 'href' value from following code?
I need to grab the href value from HTML like the following in C#:
<td class="tl"><a href="http://facebook.com/"target="_blank"><img src="images/poput_icon.png"/></a>
Can anyone show me how to do this? Are RegEx's the best approach? I need to gather these from a page that contains 100s of links, but they all look like the above code. I want to ignore other href's on the page.
Thanks in advance.
Jimmy
First, don't use Regular Expressions to parse XML. See here for more detailed information on the whys and wherefores.
Second, you can use LINQ-to-XML to achieve this. Assuming you have loaded your XML snippet into an XDocument instance (and therefore, td is the root element), you can then do the following:
var href = doc
    .Element("td")
    .Element("a")
    .Attribute("href")
    .Value;
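A minimal, self-contained sketch of the LINQ-to-XML approach follows. It assumes the snippet has first been made well-formed: XDocument.Parse rejects the markup exactly as posted in the question, because there is no whitespace between the href and target attributes and the td element is never closed. The names HrefFromXml and GetHref are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

static class HrefFromXml
{
    // Hypothetical helper: returns the href of the <a> directly under the
    // root element, or null if the element or the attribute is missing.
    public static string GetHref(string xml)
    {
        XDocument doc = XDocument.Parse(xml);   // throws XmlException on malformed input
        return (string)doc.Root.Element("a")?.Attribute("href");
    }

    static void Main()
    {
        // Assumed input: the question's snippet, repaired so it is well-formed XML
        // (space added before target, closing </td> added).
        string xml = "<td class=\"tl\"><a href=\"http://facebook.com/\" target=\"_blank\">" +
                     "<img src=\"images/poput_icon.png\"/></a></td>";
        Console.WriteLine(GetHref(xml));   // prints http://facebook.com/
    }
}
```

For a whole real-world HTML page, which is rarely well-formed XML, this will usually throw at the Parse call, so it only fits if the input can be cleaned up first.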
I would do this with a regular expression, yes. So you want to find the value inside an anchor tag surrounding an img tag at the beginning of a table cell?
Here's C# code to create a Regex object that will match links like that, then use it, where document is a String containing the entire document to search:
Regex linkscraper = new Regex(@"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*\/a\s*>");
MatchCollection links = linkscraper.Matches(document);
Each matching link is a Match object in the links collection; the URL itself is in the capture group named "link", i.e. match.Groups["link"].Value.
The leading @ turns this into a verbatim string literal: all \ are passed through directly, rather than being processed as escapes, so we aren't forced to double them to get regular expression \ behavior. Since quotes can't be escaped with \" in a verbatim string, they're escaped with "".
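To show how the captured values can be read back out, here is a small runnable sketch. The pattern is written with [^>]* (zero or more characters) after the href value so that trailing attributes such as target are allowed; the class name LinkScraperDemo, the helper ExtractLinks, and the sample input are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkScraperDemo
{
    // td, then an a tag whose href is captured as "link", then an img, then </a>.
    static readonly Regex LinkScraper = new Regex(
        @"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*\/a\s*>");

    // Hypothetical helper: collects the "link" group of every match.
    public static List<string> ExtractLinks(string document)
    {
        var result = new List<string>();
        foreach (Match m in LinkScraper.Matches(document))
            result.Add(m.Groups["link"].Value);
        return result;
    }

    static void Main()
    {
        // Sample input: the question's snippet plus one link that should be ignored.
        string document =
            "<td class=\"tl\"><a href=\"http://facebook.com/\"target=\"_blank\">" +
            "<img src=\"images/poput_icon.png\"/></a>" +
            "<p><a href=\"http://example.com/\">plain link, no td or img</a></p>";

        foreach (string link in ExtractLinks(document))
            Console.WriteLine(link);   // prints http://facebook.com/
    }
}
```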
This is a fairly complicated regular expression. Breaking it down:
- It's sprinkled with \s* elements, meaning "any whitespace, or none". They make the linkscraper expression ignore the variations in spacing that HTML allows.
- The [^>] character class matches anything that isn't a ">"; repeating it (the trailing *) represents "other stuff inside the tag that we don't care about". The exclusion keeps the regex from running past the end of a tag. Quantifiers are greedy by default, so a looser pattern such as .* would cheerfully swallow everything from the first tag through to the last ">" it can reach.
With those pieces explained, the overall pattern is relatively simple to read:
- A TD tag (which may or may not have spaces or attributes), immediately followed by (for definitions of "immediately" that allow arbitrary whitespace)
- an A tag, whose href value is captured into a capturing group named "link". The [^""], which is the escaped form of [^"], matches all non-quote characters. We don't care about the rest of the tag.
- An img tag, which can contain whatever it wants.
- The /a closing tag.
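The effect of greedy matching described in the breakdown can be seen in a short experiment. This is only an illustration; the class name GreedyDemo, the helper FirstMatch, and the sample string are invented for it.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // Returns the text of the first match of pattern in input ("" if none).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string html = "<a href=\"x\"><img src=\"y\"/></a>";

        // .* is greedy: it runs on to the last '>' in the string.
        Console.WriteLine(FirstMatch(html, "<a.*>"));     // <a href="x"><img src="y"/></a>

        // [^>]* cannot cross a '>', so the match ends with the opening tag.
        Console.WriteLine(FirstMatch(html, "<a[^>]*>"));  // <a href="x">
    }
}
```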
If you know more about the exact formatting of the document you are trying to extract links from, you can tighten up this regular expression. Specifically, the [^>]* groups, the "match zero or more characters that aren't >" blocks used to allow tags to contain whatever they want, should probably be replaced by subexpressions more specific to the actual document. As written, the expression will catch anything of the form <td><a href=...><img></a>, which may or may not match more than you want it to.
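As one possible tightening, here is a sketch that only accepts cells marked class="tl", as in the question's sample. That every wanted cell carries exactly this attribute, and that href is the first attribute of the a tag, are assumptions about the real page; the names TightLinkScraper and ExtractLinks are hypothetical, and RegexOptions.IgnoreCase is an addition that the original expression does not use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TightLinkScraper
{
    // Assumed shape: <td class="tl"> followed by <a href="..." ...><img ...></a>.
    static readonly Regex Pattern = new Regex(
        @"<td\s+class=""tl""\s*>\s*<a\s+href=""(?<link>[^""]*)""[^>]*>\s*<img[^>]*>\s*</a>",
        RegexOptions.IgnoreCase);  // HTML tag names are case-insensitive

    public static List<string> ExtractLinks(string document)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(document))
            result.Add(m.Groups["link"].Value);
        return result;
    }

    static void Main()
    {
        string document =
            "<td class=\"tl\"><a href=\"http://facebook.com/\"target=\"_blank\">" +
            "<img src=\"images/poput_icon.png\"/></a>" +
            // Same structure but without class="tl": skipped by the tighter pattern.
            "<td><a href=\"http://example.com/\"><img src=\"other.png\"/></a>";

        foreach (string link in ExtractLinks(document))
            Console.WriteLine(link);   // prints http://facebook.com/
    }
}
```

The trade-off is the usual one: the stricter the pattern, the fewer false matches, but the more easily a small change in the page's markup makes it miss links.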