Extracting URLs using regex in .NET
I've taken inspiration from the example show in the following URL csharp-online and intended to retrieve all the URLs from this page alexa
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
class Program
{
static void Main(string[] args)
{
WebClient client = new WebClient();
const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
//Console.WriteLine(Getvals(source));
string matchPattern =
@"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
{
foreach (DictionaryEntry DE in groupi开发者_StackOverflowng)
{
Console.WriteLine("Value = " + DE.Value);
Console.WriteLine("");
}
}
// End.
Console.ReadLine();
}
public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
{
ArrayList keyedMatches = new ArrayList();
int startingElement = 1;
if (wantInitialMatch)
{
startingElement = 0;
}
Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
MatchCollection theMatches = RE.Matches(source);
foreach (Match m in theMatches)
{
Hashtable groupings = new Hashtable();
for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
// If we had just returned the MatchCollection directly, the
// GroupNameFromNumber method would not be available to use
groupings.Add(RE.GroupNameFromNumber(counter),
m.Groups[counter]);
}
keyedMatches.Add(groupings);
}
return (keyedMatches);
}
}
}
But here I face a problem, when I'm executing each URL is being displayed thrice, That's first the whole anchor tag is getting displayed, next the URL is being displayed twice. can anyone suggest me where should I correct so that I can have each URL displayed exactly once.
Use HTML Agility Pack to parse HTML. I think it will make your problem much easier to solve.
Here's one way to do it:
WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']"))
{
Console.WriteLine(link.Attributes["href"].Value);
}
in your regex, you have two groupings, and the entire match. If I'm reading it correctly, you should only want the URL portion of the matches, which is the second of the 3 groupings....
instead of this:
for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
// If we had just returned the MatchCollection directly, the
// GroupNameFromNumber method would not be available to use
groupings.Add(RE.GroupNameFromNumber(counter),
m.Groups[counter]);
}
don't you want this?:
groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
int startingElement = 1;
if (wantInitialMatch)
{
startingElement = 0;
}
...
for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
// If we had just returned the MatchCollection directly, the
// GroupNameFromNumber method would not be available to use
groupings.Add(RE.GroupNameFromNumber(counter),
.Groups[counter]);
}
Your passing wantInitialMatch = true
, so your for loop is returning:
.Groups[0] //entire match
.Groups[1] //(?<url>[^""^']+[.]*) href part
.Groups[2] //(?<name>[^<]+[.]*) link text
take a look of this: http://bouncetadiss.blogspot.com/2008/02/parsing-uri-url-in-c-and-vbnet.html
精彩评论