HTML storage in Java
I want to download an HTML page, extract some useful text from it, convert the HTML to PDF, and then store the useful text and the PDF in a NoSQL solution.
What is the most efficient way to pass the HTML to the module that extracts the useful text and the module that creates the PDF? I don't want to download the same HTML twice. One option is to download the HTML to local disk under a uniquely named folder and pass the path to the other modules so that they can process it.
This approach doesn't look good to me, as there is implementation overhead.
I would love to hold the entire HTML in a single variable that I can give to the other modules, so they can traverse the HTML without loading it again. One idea that crossed my mind is to download and zip the HTML and its related code/pics, then store the binary in a byte[].
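A minimal sketch of the "single variable" idea, assuming Java 11+ (for java.net.http) and a page small enough to hold in memory. The class name HtmlPipeline and the method extractUsefulText are hypothetical stand-ins for the real modules; the regex-based tag stripping is only a placeholder, not a proper HTML parser, and the PDF step is left as a comment because it needs a third-party library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HtmlPipeline {

    /** Downloads the page once and returns the whole response body as a String. */
    static String download(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    /**
     * Hypothetical stand-in for the text-extraction module.
     * Crude placeholder: replaces tags with spaces and collapses whitespace.
     */
    static String extractUsefulText(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java HtmlPipeline <url>");
            return;
        }
        // One download; the same String is handed to every module.
        String html = download(args[0]);

        String usefulText = extractUsefulText(html);
        System.out.println("Extracted " + usefulText.length() + " characters of text");

        // The PDF module (not shown, it needs an external library) would
        // receive the same 'html' variable here, so nothing is fetched twice.
    }
}
```

Passing the String (or a byte[] if the original encoding must be preserved) avoids both the second download and the temporary folder. Images and stylesheets referenced by the page are not fetched by this sketch.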
I haven't used these before, but a quick type search in Eclipse for the text "html" gave me this:
Class HTMLDocument
From the docs:
A document that models HTML. The purpose of this model is to support both browsing and editing.
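As a rough illustration of how an already-downloaded HTML string could be loaded into javax.swing.text.html.HTMLDocument, here is a sketch using HTMLEditorKit from the JDK. The class name HtmlDocumentDemo and its helper methods are made up for this example. The Swing parser targets old HTML (roughly 3.2), so modern pages may not be modelled faithfully.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlDocumentDemo {

    /** Parses an in-memory HTML string into a Swing HTMLDocument. */
    static HTMLDocument parse(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // The input is already a decoded String, so tell the parser not to
        // act on any <meta> charset declaration inside the page.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    /** Returns the document's text content with the markup removed. */
    static String textOf(HTMLDocument doc) throws BadLocationException {
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        HTMLDocument doc = parse("<html><body><p>Hello world</p></body></html>");
        System.out.println(textOf(doc).trim());
    }
}
```

The resulting HTMLDocument object can then be passed between modules in the same way as a String or byte[].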