C# Regex Issue Getting URLs
To explain briefly, I'm trying to search Google with a keyword, then get the URLs of the top 10 results and save them.
This is the stripped-down command-line version of the code. It should return at least 1 result. If it works with that, I can apply it to my full version of the code and get all the results.
The code I have right now fails when I run it against the entire source of the Google results page. If I paste in an arbitrary section of Google's HTML source instead, it works fine. To me, that means my Regex has an error somewhere.
If there is a better way to do this aside from Regex, please let me know. The URLs are between <h3 class="r"><a href="
and " class=l onmousedown="return clk(this.href
I got this Regex code from a generator, but Regex is really hard for me to understand, since nothing I've read explains it clearly. If someone could pick out what's wrong and explain why, I'd greatly appreciate it.
Thanks, Kevin
using System;
using System.Text.RegularExpressions;
using System.Net;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
WebClient wc = new WebClient();
string keyword = "seo nj";
string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", Uri.EscapeDataString(keyword))); // URL-encode the keyword (it contains a space)
string re1 = "(<)"; // Any Single Character 1
string re2 = "(h3)"; // Alphanum 1
string re3 = "(\\s+)"; // White Space 1
string re4 = "(class)"; // Variable Name 1
string re5 = "(=)"; // Any Single Character 2
string re6 = "(\"r\")"; // Double Quote String 1
string re7 = "(>)"; // Any Single Character 3
string re8 = "(<)"; // Any Single Character 4
string re9 = "([a-z])"; // Any Single Word Character (Not Whitespace) 1
string re10 = "(\\s+)"; // White Space 2
string re11 = "((?:[a-z][a-z]+))"; // Word 1
string re12 = "(=)"; // Any Single Character 5
string re13 = ".*?"; // Non-greedy match on filler
string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))"; // HTTP URL 1
string re15 = "(\")"; // Any Single Character 6
string re16 = "(\\s+)"; // White Space 3
string re17 = "(class)"; // Word 2
string re18 = "(=)"; // Any Single Character 7
string re19 = "(l)"; // Any Single Character 8
string re20 = "(\\s+)"; // White Space 4
string re21 = "(onmousedown)"; // Word 3
string re22 = "(=)"; // Any Single Character 9
string re23 = "(\")"; // Any Single Character 10
string re24 = "(return)"; // Word 4
string re25 = "(\\s+)"; // White Space 5
string re26 = "(clk)"; // Word 5
Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(html); // match against the downloaded page; "txt" was never declared
if (m.Success)
{
Console.WriteLine("Good");
String c1 = m.Groups[1].ToString();
String alphanum1 = m.Groups[2].ToString();
String ws1 = m.Groups[3].ToString();
String var1 = m.Groups[4].ToString();
String c2 = m.Groups[5].ToString();
String string1 = m.Groups[6].ToString();
String c3 = m.Groups[7].ToString();
String c4 = m.Groups[8].ToString();
String w1 = m.Groups[9].ToString();
String ws2 = m.Groups[10].ToString();
String word1 = m.Groups[11].ToString();
String c5 = m.Groups[12].ToString();
String httpurl1 = m.Groups[13].ToString();
String c6 = m.Groups[14].ToString();
String ws3 = m.Groups[15].ToString();
String word2 = m.Groups[16].ToString();
String c7 = m.Groups[17].ToString();
String c8 = m.Groups[18].ToString();
String ws4 = m.Groups[19].ToString();
String word3 = m.Groups[20].ToString();
String c9 = m.Groups[21].ToString();
String c10 = m.Groups[22].ToString();
String word4 = m.Groups[23].ToString();
String ws5 = m.Groups[24].ToString();
String word5 = m.Groups[25].ToString();
//Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
Console.WriteLine(httpurl1);
}
else
{
Console.WriteLine("Bad");
}
Console.ReadLine();
}
}
}
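For comparison, the 26 generated fragments can be collapsed into one much shorter pattern with a single named group for the URL. This is only a sketch, assuming the result markup really is the `<h3 class="r"><a href="...">` form described above; the page served to WebClient may not contain that markup at all, in which case no pattern of this shape will match. The `ResultLinks` class and `ExtractUrls` method are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ResultLinks
{
    // Assumption: every result link is written as
    //   <h3 class="r"><a href="URL" ...>
    // The named group "url" captures everything up to the closing quote.
    private static readonly Regex LinkPattern = new Regex(
        "<h3\\s+class=\"r\">\\s*<a\\s+href=\"(?<url>https?://[^\"]+)\"",
        RegexOptions.IgnoreCase);

    // Returns the href of every link that matches the assumed markup.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Using `Matches` instead of `Match` returns every hit rather than only the first, so `ResultLinks.ExtractUrls(html)` could stand in for the whole `Regex`/`Match` block once a single result works.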
You're doing it wrong.
Google has an API for doing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regexes, when there's already a published, supported way to do what you want.
Besides, what you're trying to do -- submit automated searches through Google's Web site and scrape the results -- is a violation of section 5.3 of their Terms of Service:
You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)
Using regex to parse HTML is sado-masochism.
Try using the HTML Agility Pack instead. It will allow you to parse HTML. See this question for an example of using it.
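A rough sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is referenced and that the links still sit inside `<h3 class="r">` as described in the question. The XPath expression and the `HapResultLinks` / `ExtractUrls` names are illustrative assumptions, not something taken from Google's markup documentation.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET framework

static class HapResultLinks
{
    // Returns the href of every <a> directly inside an <h3 class="r">.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//h3[@class='r']/a[@href]");
        if (links == null)
        {
            return urls;
        }

        foreach (HtmlNode link in links)
        {
            urls.Add(link.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }
}
```

Because the parser works on the document tree, attribute order and extra whitespace inside the tags stop mattering, which is where hand-written patterns tend to break.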