Read file and extract certain part only

2023-01-27 03:17 问答作者：

ifstream toOpen;
openFile.open("sample.html", ios::in); 

if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){   
                start_pos = line.find("href"); 
        tempString = line.substr(start_pos+1); // i dont want the quote
            sto开发者_StackOverflow社区p_pos = tempString .find("\"");
                string testResult = tempString .substr(start_pos, stop_pos);
        cout << testResult << endl;
        }
    }

    toOpen.close();
}

What I am trying to do, is to extrat the "href" value. But I cant get it works.

EDIT:

Thanks to Tony hint, I use this:

if(line.find("href=") != std::string::npos ){   
    // Process
}

it works!!

I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.

Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.

You might want to look at the answers to this previous question for suggestions about an HTML parser to use.

As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:

#include <fstream>
#include <iostream>
#include <string>

int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}

As for the inner part the processes the line, there are several problems.

It doesn't compile. I suppose this is what you meant with "I cant get it works". When asking a question, this is the kind of information you might want to provide in order to get good help.
There is confusion between variable names temp and tempString etc.
string::find() returns a large positive integer to indicate invalid positions (the size_type is unsigned), so you will always enter the loop unless a match is found starting at character position 0, in which case you probably do want to enter the loop.

Here is a simple test content for sample.html.

<html>
    <a href="foo.pdf"/>
</html>

Sticking the following inside the loop:

if ((line.find("href=") != std::string::npos) &&
    (line.find(".pdf" ) != std::string::npos))
{
    const std::size_t start_pos = line.find("href");
    std::string temp = line.substr(start_pos+6);
    const std::size_t stop_pos = temp.find("\"");
    std::string result = temp.substr(0, stop_pos);
    std::cout << "'" << result << "'" << std::endl;
}

I actually get the output

'foo.pdf'

However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.

继续阅读：extract string

Read file and extract certain part only

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？