C# Regex Issue Getting URLs

2023-03-18 23:34

To explain briefly, I'm trying to search Google with a keyword, then get the URLs of the top 10 results and save them.

This is the stripped-down command-line version of the code. It should return at least 1 result. If it works with that, I can apply it to my full version of the code and get all the results.

The code I have right now fails when I run it against the entire source of a Google results page. If I paste in a random section of Google's HTML source instead, it works fine. To me, that means my regex has an error somewhere.

If there is a better way to do this aside from regex, please let me know. The URLs are between <h3 class="r"><a href=" and " class=l onmousedown="return clk(this.href

I got this regex code from a generator, but regex is really hard for me to understand, since nothing I've read explains it clearly. If someone could pick out what's wrong and explain why, I'd greatly appreciate it.

Thanks, Kevin

using System;
using System.Text.RegularExpressions;
using System.Net;

namespace ConsoleApplication1
{
    class Program
    {
    static void Main(string[] args)
    {
        WebClient wc = new WebClient();
        string keyword = "seo nj";

        // URL-encode the keyword so the space in "seo nj" is sent as %20.
        string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", Uri.EscapeDataString(keyword)));

        string re1 = "(<)"; // Any Single Character 1
        string re2 = "(h3)";    // Alphanum 1
        string re3 = "(\\s+)";  // White Space 1
        string re4 = "(class)"; // Variable Name 1
        string re5 = "(=)"; // Any Single Character 2
        string re6 = "(\"r\")"; // Double Quote String 1
        string re7 = "(>)"; // Any Single Character 3
        string re8 = "(<)"; // Any Single Character 4
        string re9 = "([a-z])"; // Any Single Word Character (Not Whitespace) 1
        string re10 = "(\\s+)"; // White Space 2
        string re11 = "((?:[a-z][a-z]+))";  // Word 1
        string re12 = "(=)";    // Any Single Character 5
        string re13 = ".*?";    // Non-greedy match on filler
        string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))";   // HTTP URL 1
        string re15 = "(\")";   // Any Single Character 6
        string re16 = "(\\s+)"; // White Space 3
        string re17 = "(class)";    // Word 2
        string re18 = "(=)";    // Any Single Character 7
        string re19 = "(l)";    // Any Single Character 8
        string re20 = "(\\s+)"; // White Space 4
        string re21 = "(onmousedown)";  // Word 3
        string re22 = "(=)";    // Any Single Character 9
        string re23 = "(\")";   // Any Single Character 10
        string re24 = "(return)";   // Word 4
        string re25 = "(\\s+)"; // White Space 5
        string re26 = "(clk)";  // Word 5

        Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = r.Match(html); // match against the downloaded page; "txt" was never declared
        if (m.Success)
        {
            Console.WriteLine("Good");
            String c1 = m.Groups[1].ToString();
            String alphanum1 = m.Groups[2].ToString();
            String ws1 = m.Groups[3].ToString();
            String var1 = m.Groups[4].ToString();
            String c2 = m.Groups[5].ToString();
            String string1 = m.Groups[6].ToString();
            String c3 = m.Groups[7].ToString();
            String c4 = m.Groups[8].ToString();
            String w1 = m.Groups[9].ToString();
            String ws2 = m.Groups[10].ToString();
            String word1 = m.Groups[11].ToString();
            String c5 = m.Groups[12].ToString();
            String httpurl1 = m.Groups[13].ToString();
            String c6 = m.Groups[14].ToString();
            String ws3 = m.Groups[15].ToString();
            String word2 = m.Groups[16].ToString();
            String c7 = m.Groups[17].ToString();
            String c8 = m.Groups[18].ToString();
            String ws4 = m.Groups[19].ToString();
            String word3 = m.Groups[20].ToString();
            String c9 = m.Groups[21].ToString();
            String c10 = m.Groups[22].ToString();
            String word4 = m.Groups[23].ToString();
            String ws5 = m.Groups[24].ToString();
            String word5 = m.Groups[25].ToString();
            //Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
            Console.WriteLine(httpurl1);
        }
        else
        {
            Console.WriteLine("Bad");
        }
        Console.ReadLine();
    }
}
}

You're doing it wrong.

Google has an API for doing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regexes, when there's already a published, supported way to do what you want.

Besides, what you're trying to do -- submit automated searches through Google's Web site and scrape the results -- is a violation of section 5.3 of their Terms of Service:

You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)
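
As a rough illustration of the API route: the answer does not say which API it means, so this sketch assumes Google's Custom Search JSON API, with the endpoint `https://www.googleapis.com/customsearch/v1`, the query parameters `key`, `cx`, `q` and `num`, and a response whose `items` array carries a `link` field per result. The class and method names, the placeholder credentials, and the sample JSON are all made up for the example, and no network call is made; check the parameter names against the current API documentation before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class SearchApiSketch
{
    // Builds the request URL. Endpoint and parameter names are assumptions
    // based on the Custom Search JSON API, not something stated in the thread.
    public static string BuildRequestUrl(string apiKey, string engineId, string keyword)
    {
        return "https://www.googleapis.com/customsearch/v1"
            + "?key=" + Uri.EscapeDataString(apiKey)
            + "&cx=" + Uri.EscapeDataString(engineId)
            + "&q=" + Uri.EscapeDataString(keyword)
            + "&num=10";
    }

    // Pulls the "link" string out of every entry in the top-level "items" array.
    // Returns an empty list when "items" is missing or is not an array.
    public static List<string> ExtractLinks(string json)
    {
        var links = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            if (doc.RootElement.TryGetProperty("items", out JsonElement items)
                && items.ValueKind == JsonValueKind.Array)
            {
                foreach (JsonElement item in items.EnumerateArray())
                {
                    if (item.TryGetProperty("link", out JsonElement link)
                        && link.ValueKind == JsonValueKind.String)
                    {
                        links.Add(link.GetString());
                    }
                }
            }
        }
        return links;
    }

    static void Main()
    {
        // Placeholder credentials; a real key and engine ID would be needed.
        Console.WriteLine(BuildRequestUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "seo nj"));

        // Hand-written sample shaped like the assumed response format.
        string sampleJson =
            "{\"items\":[{\"title\":\"A\",\"link\":\"http://example.com/a\"},"
            + "{\"title\":\"B\",\"link\":\"http://example.com/b\"}]}";

        foreach (string url in ExtractLinks(sampleJson))
        {
            Console.WriteLine(url);
        }
    }
}
```

In a real program the JSON would come from an HTTP GET on the built URL (for instance with the `WebClient.DownloadString` call already used in the question), and the regex disappears entirely.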

Using RegEx to parse HTML is sado-masochism.

Try using the HTML Agility Pack instead. It will allow you to parse HTML. See this question for an example of using it.
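
A minimal sketch of what that could look like, assuming the `HtmlAgilityPack` NuGet package is installed and that result links still sit inside `<h3 class="r"><a href="...">` as described in the question. The class name, method name, and sample HTML are invented for illustration; the XPath expression would need adjusting if the real markup differs.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

static class AgilityPackSketch
{
    // Returns the href of every <a> directly inside an <h3 class="r">.
    public static List<string> ExtractResultUrls(string html)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//h3[@class='r']/a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", ""));
        }
        return urls;
    }

    static void Main()
    {
        // Hand-written sample that mimics the markup quoted in the question.
        string sample =
            "<html><body><h3 class=\"r\"><a href=\"http://example.com/\" class=l "
            + "onmousedown=\"return clk(this.href)\">Example</a></h3></body></html>";

        foreach (string url in ExtractResultUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Compared with the 26-part regex, the XPath query names only the two elements that matter, so extra attributes or whitespace changes in the page do not break the match.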
