Parsing HTML with C++ (using Qt preferably)
I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can be inside the href and src attributes).
I tried to use WebKit to do the heavy lifting, but when I load the HTML into a frame with setHtml(), the generated document is all wrong. If I let WebKit fetch the page from the web instead, the generated document is fine, but WebKit then also downloads all images, styles, and scripts, and I don't want that.
Here is what I tried to do:
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection imgs = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements
What am I doing wrong? Is there an easy way to parse HTML with Qt? (Or some other lightweight library)
You can always use XPath expressions to make your parsing life easier; take a look at this for instance.
Or you can do something like this:
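QWebView* view = new QWebView(parent);
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: run the query once loadFinished(bool) has been emitted,
// otherwise the collection may be empty.
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");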
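If you only need the href and src values and would rather not load anything into WebKit at all, a rough standard-library sketch might look like the following. Note that this swaps in a regular expression for a real HTML parser, so it is only an approximation: it handles quoted attribute values, ignores unquoted ones, will also pick up look-alike attributes such as data-src, and can match text inside comments or scripts. The function name extractUrls is made up for this example and is not part of Qt.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not a Qt API): collects the values of quoted href and
// src attributes. This is a regex approximation, not an HTML parser.
std::vector<std::string> extractUrls(const std::string& html)
{
    // Attribute name (case-insensitive), optional whitespace around '=',
    // then a value in double quotes (group 1) or single quotes (group 2).
    static const std::regex attr(
        R"rx(\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'))rx",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two capture groups takes part in each match.
        urls.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return urls;
}
```

Calling extractUrls() on the raw HTML string returns the attribute values in document order. If the HTML is held in a QString, html.toStdString() would be one way to pass it in.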