which layout engine for finding coordinates of html elements on the web page?
I am doing some web data classification task and was thinking if I could get the co-ordinates of html elements as they would appear on a web-browser without taking into consideration any css or javascript being referred in the web page.
My language of programming is c++ and the need results for a couple million of pages, so it has to be fast. I know there is a Microsoft COM component which renders the page in a web browser control and then can be queried for position of different html tags. But this is not suitable in my case as it first renders the whole page which takes up a lot of time.
So as I found out, there are open-source layout engines WebKit, Gecko that can probably be used for this. But that's a huge piece of code and I need someone to direct me to the right classes or right modules to look into or any previous/similar work someone has done previously. Also, please let me know what you guys think is a good choice if I want to customize the existing code for use with multiple threads to make it faster.
Thanks开发者_运维知识库
Generally, you would find that different page rendering engines do render the html in their own way and the results will differ.
The thing is that if you stick to any concrete browser engine, what you are to do is somehow bringing this engine into your project and using engine's interface to retrieve these coordinates. Kind of a tough task though, simply because you'll have to read a lot of documentation and crawl through thousands of files.
I think that right approach would be posting this task in some place, that is specific for the page rendering engine you've chosen. (gecko/webkit/...)
If you prefer sticking to something MS-specific, guess it's gonna be easier, but can't help you with something like class names or code chunks that you want to see. Probably somebody else could guide you in this case.
精彩评论