
Search HTML lines and remove lines that don't start with </form></td><td><a

I have an HTML file with very badly formatted code that I got from a website, and I want to extract a few small pieces of information from it.

I am only interested in lines that start like this:

</form></td><td><a href="http://www.mysite.com/users/user897" class="username">   <b>user897</b></a></td></tr><tr><td>HouseA</td><td>2</td><td class="entriesTableRow-gamename">HouseA Type12 <span class="entriesTableRow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>

and I want to extract 4 fields:

  A:HouseA
  B:HouseA Type12
  C:user123
  D:10

I know I've seen people recommend HTML Agility Pack and libxml2, but I really don't think I need all that. My app is in C/C++.

I am already using getline to read the file line by line; I am just not sure what the best way to proceed is. Thanks!

    std::ifstream data("Home.html");
    std::string line;
    int linenum = 0;  // 1-based number of the line just read
    while (std::getline(data, line))
    {
        linenum++;
        // Keep only lines that begin with the marker.
        if (strncmp(line.c_str(), "</form></td><td>", strlen("</form></td><td>")) == 0)
        {
            printf("found a wanted line in line:%d\n", linenum);
        }
    }


In the general case, an XML/HTML parser is likely the best way here, as it will be robust against differing input. (Whatever you do, don't use regexps!)

Update

However, if you're targeting specific input, as you seem to be, you can scan manually with sscanf (as you suggest), cin.read(), or a regexp.
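As a rough sketch of what such manual scanning could look like, here is one way to do it with plain std::string::find instead of sscanf. The names Entry, strip_tags, td_cells and parse_entry are made up for illustration, and the cell positions (1, 3, 5, 6) are an assumption taken from the single sample line in the question, so treat this as a starting point rather than a finished solution.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the four wanted fields (A, B, C, D).
struct Entry { std::string a, b, c, d; };

// Drops everything between '<' and '>' and trims surrounding whitespace.
static std::string strip_tags(const std::string& s) {
    std::string out;
    bool in_tag = false;
    for (char c : s) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) out += c;
    }
    const char* ws = " \t\r\n";
    std::size_t b = out.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = out.find_last_not_of(ws);
    return out.substr(b, e - b + 1);
}

// Returns the text of every complete <td ...>...</td> cell on the line.
// A trailing <td> with no closing tag on the same line is ignored.
static std::vector<std::string> td_cells(const std::string& line) {
    std::vector<std::string> cells;
    std::size_t pos = 0;
    while ((pos = line.find("<td", pos)) != std::string::npos) {
        std::size_t open_end = line.find('>', pos);
        if (open_end == std::string::npos) break;
        std::size_t close = line.find("</td>", open_end);
        if (close == std::string::npos) break;
        cells.push_back(strip_tags(line.substr(open_end + 1, close - open_end - 1)));
        pos = close + 5;  // skip past "</td>"
    }
    return cells;
}

// Fills 'out' and returns true if the line starts with the marker and has
// enough cells. Cell indices are assumed from the sample line only.
static bool parse_entry(const std::string& line, Entry& out) {
    const std::string prefix = "</form></td><td><a";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;
    std::vector<std::string> cells = td_cells(line);
    if (cells.size() < 7) return false;
    out = Entry{cells[1], cells[3], cells[5], cells[6]};
    return true;
}
```

Inside the getline loop you would call parse_entry(line, e) and, when it returns true, read e.a, e.b, e.c and e.d.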

Just beware that this code can break as soon as the HTML changes, even if only in whitespace.

Therefore, my/our recommendation is to use a proper tool for the job. XML/HTML is not raw text, and should not be treated as such.

How about writing a python script instead? :)
