C# Scrape data from wiki page (screen-scraping)

2023-04-05 16:45 问答作者：

I want to scrape a Wiki page. Specifically, this one.

My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).

For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).

My code so far is (copied and edited from various websites)...

WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";

getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!

HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;

List<string> anchorTags = new List<string>();   

foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
    anchorTags.Add(att);
}

However, I am getting a an ArgumentException was unhandled - Illegal Characters in path.

What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution,开发者_开发百科 I'd be glad to comply.

What's wrong with the code? To be blunt, everything. :P

The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.

The contents of the page (the part we're interested in) looks something like this:

<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
    <!-- ... -->
    <b>SBS8987B</b>
    (SLBP 192/194*)
    <br>
    <b>SBS8988Z</b>
    (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
    <br>
    <b>SBS8989X</b>
    (SLBP SP)
    <br>
    <!-- ... -->
</p>

Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:

static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);

    // Create a LINQ query to find the  requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}

继续阅读：html-agility-pack screen screen-scraping

C# Scrape data from wiki page (screen-scraping)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？