C#: Get exact substring from HTML code using IndexOf and LastIndexOf

I have an HTML page retrieved using GetResponseStream() in C#. I need an exact value (int) from that page, which is different every time I run the program. Nevertheless, the structure of the HTML code is always the same, in particular:

 (...) 
 <td colspan="2" class="txtnormal"><div align="right">&nbsp;TAX:</div></td>
 <td class="txtnormal"><div align="right"><strong>0.00</strong></div></td>
 <td colspan="2">&nbsp;</td> 
 (...) 

and

 (...) 
     <td colspan="2"><div align="right" class="txtnormal">Total:</div></td>
 <td class="txtnormal"><div align="right"><strong>10.00</strong></div></td>
 <td colspan="2">&nbsp;</td>
 (...)

Notice that the same markup is repeated within the page (i.e. <td class="txtnormal"><div align="right"><strong>VALUE</strong></div></td>); the labels of the values (TAX and Total) are the only difference (the actual values could even be the same).

I would like to store the Total value in a variable, that is, 10.00 in this case.

I tried this:

int first = responseFromServer.IndexOf("<td class= \"txtnormal\"><div align=\"right\"><strong>") + "<td class=\"txtnormal\"><div align=\"right\"><strong>".Length;
int last = responseFromServer.LastIndexOf("</strong></div></td>");
string value = responseFromServer.Substring(first, last - first);

But I get bad results: the string stored in value contains almost the whole HTML page up to the last value (because of the difference I'm computing).

Do you know how I could get the exact value, that is, the substring between the pieces of text I pasted?

Thank you very much.


To scrape a value from a page, you have a couple of options. The "best" is to use the DOM to find the node(s) in question and pull its value. If you can't use the DOM for some reason, you can fall back to a regex and pull the value that way.

Your method is "okay" in many instances, as long as you can be sure the site owner will never set up another instance of "</strong></div></td>" anywhere downstream. This is a risky assumption.

What values are you getting for the int variables and the resulting string? That will tell you whether or not your particular pattern is working correctly. I would still consider the HTML DOM, as it is a more accurate way to traverse the nodes.
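If you want to stay with IndexOf, one way to avoid the "everything up to the last match" problem is to search forward from the label instead of using LastIndexOf. The following is only a minimal sketch under the assumption that the label text ("Total:") appears exactly once before its value; the class and method names (TotalExtractor, ExtractAfterLabel) are made up for illustration. Note also that the first search string in the question contains class= "txtnormal" with a space after the equals sign, which does not occur in the sample HTML, so IndexOf would presumably return -1 there.

```csharp
using System;

static class TotalExtractor
{
    // Returns the text between the first <strong> and </strong> that follow
    // the given label, or null if any of the three markers is missing.
    public static string ExtractAfterLabel(string html, string label)
    {
        const string open = "<strong>";
        const string close = "</strong>";

        // 1. Find the label that identifies the row we care about.
        int labelPos = html.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        // 2. Find the first <strong> AFTER the label and skip past the tag.
        int start = html.IndexOf(open, labelPos + label.Length, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length;

        // 3. Find the first </strong> after that, not the last one on the page.
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }
}

class Program
{
    static void Main()
    {
        // Sample markup copied from the question (TAX row, then Total row).
        string html =
            "<td colspan=\"2\" class=\"txtnormal\"><div align=\"right\">&nbsp;TAX:</div></td>\n" +
            "<td class=\"txtnormal\"><div align=\"right\"><strong>0.00</strong></div></td>\n" +
            "<td colspan=\"2\"><div align=\"right\" class=\"txtnormal\">Total:</div></td>\n" +
            "<td class=\"txtnormal\"><div align=\"right\"><strong>10.00</strong></div></td>\n";

        Console.WriteLine(TotalExtractor.ExtractAfterLabel(html, "Total:")); // prints 10.00
        Console.WriteLine(TotalExtractor.ExtractAfterLabel(html, "TAX:"));   // prints 0.00
    }
}
```

Because every search starts where the previous one ended, other occurrences of the same closing tags further down the page no longer matter.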


I think Regex is your friend here:

using System;
using System.Text.RegularExpressions;

namespace SimpleApp
{
    class Program
    {
        static void Main(string[] args)
        {
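            // Lazily match from the "Total:" label to the first <strong>...</strong> after it.
            Regex theRegex = new Regex(@">Total:<.+?<strong>(.+?)</strong>");
            string str = @"<td colspan=""2""><div align=""right"" class=""txtnormal"">Total:</div></td>" +
                     @"<td class=""txtnormal""><div align=""right""><strong>10.00</strong></div></td>" +
                     @"<td colspan=""2"">&nbsp;</td>";
            // Run the match once and reuse the result.
            Match match = theRegex.Match(str);
            if (match.Success)
            {
                // Group 1 is the text captured between <strong> and </strong>.
                Console.WriteLine("Found Total of " + match.Groups[1].Value);
            }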
            else
            {
                Console.WriteLine("Not found");
            }
            Console.ReadLine();
        }
    }
}

Obviously, your HTML page might contain other things that could trip up this simple regular expression, but you get the idea.
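One such thing is a line break between the label and the value: by default `.` does not match `\n`, so the pattern needs RegexOptions.Singleline when it is run against markup where the two cells sit on separate lines, as in the question. The sketch below shows that variant and also converts the captured text to a number. It parses to decimal rather than int because the sample value is 10.00; that choice, and the names TotalParser and TryGetTotal, are assumptions made for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class TotalParser
{
    // Singleline lets '.' match newlines, so the label and the value
    // may be on different lines of the page.
    static readonly Regex TotalRegex = new Regex(
        @">Total:<.+?<strong>(.+?)</strong>",
        RegexOptions.Singleline);

    // Returns true and sets 'total' if the label is found and its value
    // parses as a number with '.' as the decimal separator.
    public static bool TryGetTotal(string html, out decimal total)
    {
        total = 0m;
        Match match = TotalRegex.Match(html);
        return match.Success
            && decimal.TryParse(match.Groups[1].Value, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out total);
    }
}

class Program
{
    static void Main()
    {
        // Same cells as in the question, separated by line breaks.
        string html =
            "<td colspan=\"2\"><div align=\"right\" class=\"txtnormal\">Total:</div></td>\n" +
            "<td class=\"txtnormal\"><div align=\"right\"><strong>10.00</strong></div></td>\n" +
            "<td colspan=\"2\">&nbsp;</td>\n";

        if (TotalParser.TryGetTotal(html, out decimal total))
        {
            Console.WriteLine("Found Total of " + total.ToString(CultureInfo.InvariantCulture));
        }
        else
        {
            Console.WriteLine("Not found");
        }
    }
}
```

Parsing with CultureInfo.InvariantCulture keeps the result independent of the machine's regional settings, which matters if the program runs on a system that uses a comma as the decimal separator.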
