Are there any Java HTML parsers where the generated Nodes retain indexes to the original text?

2023-04-02 01:59 Q&A author:

I'd like to query an HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.

But I'd also like to make modifications to the original source string based on the results of the queries.

Is there a Java HTML parser around that retains indexes to the original source string, so I can locate a node and modify the correct part of the original string?

Cheers.

It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.

While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.

Jericho records begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can edit the document yourself with that information. Where Jericho really shines, though, is the OutputDocument class. It lets you specify replacements directly by passing the Jericho elements that match your query to the appropriate methods, instead of explicitly calling getBegin() and getEnd() on them and handing the offsets to some replacement method.
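If you do go the manual route with begin and end offsets, the one thing to watch is that each replacement shifts everything after it. Below is a minimal sketch of that idea using only the JDK. OffsetEdits and Edit are hypothetical names made up for illustration, not part of Jericho; the offsets are assumed to come from whatever parser you use and to be non-overlapping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Jericho or any other library.
final class OffsetEdits {

    // One replacement of source[begin, end) by the given text.
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    // Applies non-overlapping edits, all expressed as offsets into the ORIGINAL string.
    static String apply(String source, List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        // Work from the end of the string backwards so that the offsets of
        // edits nearer the start stay valid while later regions change length.
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin).reversed());
        StringBuilder sb = new StringBuilder(source);
        for (Edit e : sorted) {
            sb.replace(e.begin, e.end, e.replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <span>world</span></p>";
        int open = html.indexOf("<span>");
        int close = html.indexOf("</span>");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(open, open + "<span>".length(), "<b>"));
        edits.add(new Edit(close, close + "</span>".length(), "</b>"));
        System.out.println(apply(html, edits)); // prints <p>Hello <b>world</b></p>
    }
}
```

Everything outside the edited ranges is copied through untouched, which is the point of keeping indexes into the original string in the first place.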

We use the Jericho HTML parser to do the parsing and HtmlCleaner to do the actual cleanup.

We had problems with Jericho's behavior within a server app (memory management, logging), which we fixed. The original developer didn't think our issues were important enough to put in the main code branch, so our fork is on GitHub. We also made fixes to HtmlCleaner.

I don't know about the "retain indexes to the original text" part but Jericho is a very good HTML parser library.

Here is an example of how to remove every span tag from an HTML string:

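import java.util.List;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.Tag;

// Removes the <span> and </span> tags themselves; the text between them is kept.
public static String removeSpans(String html) {
    Source source = new Source(html);
    source.fullSequentialParse();
    OutputDocument outputDocument = new OutputDocument(source);
    // getAllTags() returns start tags and end tags alike
    List<Tag> tags = source.getAllTags();
    for (Tag tag : tags) {
        if (tag.getName().equalsIgnoreCase("span")) {
            // mark this tag's range in the source for removal from the output
            outputDocument.remove(tag);
        }
    }
    return outputDocument.toString();
}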

I guess you could use HTML Parser.

You can get indexes into the original page using getStartPosition() and getEndPosition() from the Node class.

As others have suggested, you probably want to render the DOM. This basically just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like jTidy. You then have easy access to the document and can modify it as required. I would suggest DOM4J: it has a good API and XPath support too.

Re your "indexing" requirement, during your traversal/querying of the document you can cache in a list or map any elements or nodes that you wish to modify the text of at a later point.
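To make the caching idea concrete, here is a rough sketch. It uses the JDK's built-in JAXP DOM and XPath classes rather than DOM4J, purely so that it runs without extra jars, and it assumes the HTML has already been cleaned into well-formed XML. CachedNodes, parse and select are made-up names for illustration. Note that this changes the node tree, not the original source string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class CachedNodes {

    // Parses well-formed XML (e.g. the output of an HTML cleaner) into a DOM tree.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    // Runs the XPath query once and copies the matches into a list for later use.
    static List<Node> select(Document doc, String xpath) {
        try {
            NodeList found = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<Node> cached = new ArrayList<>();
            for (int i = 0; i < found.getLength(); i++) {
                cached.add(found.item(i));
            }
            return cached;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>one</p><p>two</p></body></html>");
        List<Node> paragraphs = select(doc, "//p"); // query phase: cache the nodes
        for (Node p : paragraphs) {                 // later: modify the cached nodes
            p.setTextContent(p.getTextContent().toUpperCase());
        }
        System.out.println(paragraphs.get(1).getTextContent()); // prints TWO
    }
}
```

The cached entries are live references into the tree, so changing them changes the document itself.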

This works great:

http://jtidy.sourceforge.net/

EXAMPLE

Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);    // set desired config options using tidy setters
// ...                  // (equivalent to command line options)

tidy.parse(inputStream, System.out); // parse the input and write the tidied markup to System.out

For crawling the DOM, I recommend using JDOM; it's way faster than simple XML.

http://www.jdom.org/

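import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Note: these are the standard W3C DOM (JAXP) classes, not JDOM's own API.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder(); // throws ParserConfigurationException
Document doc = builder.newDocument();
Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root");
root.appendChild(text);
doc.appendChild(root);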

As far as implementation is concerned, I would make a new document and add nodes to it from the source.
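For what it's worth, copying nodes from one document into a fresh one could look roughly like this with the standard W3C DOM API. CopyIntoNewDocument, newDocument and copyOf are hypothetical names, and the source document is built by hand here only to keep the sketch self-contained; in practice it would come from your parser or cleaner.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class for illustration only.
final class CopyIntoNewDocument {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a new document whose root is a deep copy of the source's root.
    static Document copyOf(Document source) {
        Document target = newDocument();
        // A node belongs to the document that created it, so it has to be
        // imported before it can be appended elsewhere; true = copy the subtree.
        Node copy = target.importNode(source.getDocumentElement(), true);
        target.appendChild(copy);
        return target;
    }

    public static void main(String[] args) {
        Document source = newDocument();
        Element root = source.createElement("root");
        root.appendChild(source.createTextNode("This is the root"));
        source.appendChild(root);

        Document target = copyOf(source);
        target.getDocumentElement().setTextContent("changed in the copy");

        System.out.println(source.getDocumentElement().getTextContent()); // prints This is the root
        System.out.println(target.getDocumentElement().getTextContent()); // prints changed in the copy
    }
}
```

Because importNode makes a copy, edits to the new document leave the source tree as it was.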

You could try ANTLR with an HTML grammar.

You could take (at least) two approaches. One is to use it as an actual HTML parser, and then get the indexes into the original string that you are interested in.

The other is to use its built-in support for in-place transformations on source text, where you define the transformations that you want to perform on the text as part of the grammar.
