TEXT_NODE: returns ONLY text?

2023-03-07 19:43 Q&A author:

I'm using JavaScript to extract all the text from a DOM object. My algorithm walks the DOM object and its descendants; if a node is of type TEXT_NODE, it accumulates that node's nodeValue.

For some weird reason I also get things like:

#hdr-editions a { text-decoration:none; }
#cnn_hdr-editionS { text-align:left;clear:both; }
#cnn_hdr-editionS a { text-decoration:none;font-size:10px;top:7px;line-height:12px;font-weight:bold; }
#hdr-prompt-text b { display:inline-block;margin:0 0 0 20px; }
#hdr-editions li { padding:0 10px; }

How do I filter this? Do I need to use something else? I want ONLY text.

From the looks of things, you're also collecting the text from <style> elements. You might want to run a check for those:

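// Tag names whose text content should never be collected
var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };

// Inside the traversal loop: skip any element listed above
if (element.tagName in ignore)
    continue;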

You can add any other elements to the object map to ignore them.
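
To show where that check could sit, here is a minimal sketch of an iterative traversal that uses an explicit stack. The function name collectText and the numeric node-type constants are illustrative assumptions, not part of the original question; the constants have the same values as Node.ELEMENT_NODE (1) and Node.TEXT_NODE (3), so the sketch also runs on plain objects outside a browser.

```javascript
// Hypothetical helper (the name is an assumption): gathers text while
// skipping every element listed in the ignore map.
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
var TEXT_NODE = 3;    // same value as Node.TEXT_NODE

var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };

function collectText(root) {
    var text = "";
    var stack = [root];
    while (stack.length > 0) {
        var element = stack.pop();
        if (element.nodeType === TEXT_NODE) {
            text += element.nodeValue;
            continue;
        }
        // Comments and other non-element nodes carry no text we want
        if (element.nodeType !== ELEMENT_NODE)
            continue;
        // Skip the ignored element together with everything below it
        if (element.tagName in ignore)
            continue;
        // Push children in reverse so they are popped in document order
        var children = element.childNodes;
        for (var i = children.length - 1; i >= 0; i--)
            stack.push(children[i]);
    }
    return text;
}
```

In a browser this could be called as collectText(document.body).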

You want to skip over style elements.

In your loop, you could do this...

if (element.tagName == 'STYLE') {
   continue;
}

You also probably want to skip over script, textarea, etc.

This is text as far as the DOM is concerned. You'll have to filter out (skip) <script> and <style> tags.

[Answer added after reading OP's comments to Andy's excellent answer]

The problem is that you see the text nodes inside elements whose content is normally not rendered by browsers - such as STYLE and SCRIPT tags.

When you scan the DOM tree (using depth-first search, I assume), your scan should skip over the content of such tags.

For example - a recursive depth-first DOM tree walker might look like this:

function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    // only element nodes have children we need to descend into
    if (domObject.nodeType != Node.ELEMENT_NODE) return;
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}

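var textvalue = "";
// Start at the root element: document itself is a DOCUMENT_NODE, so the walker would stop there
walker(document.documentElement, function(node) {
    if (node.nodeType == Node.TEXT_NODE)
        textvalue += node.nodeValue;
});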

In that case, when your walker encounters a tag whose content you know you don't want, it should simply not descend into that part of the tree. walker() can be adapted like this:

var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 }

function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    if (domObject.nodeType != Node.ELEMENT_NODE) return;

    if (domObject.tagName in ignore) return; // <--- HERE

    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}

That way, if we see a tag that you don't like, we simply skip it and all its children, and your extractor will never be exposed to the text nodes inside such tags.
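
As a quick sanity check of that behaviour, the sketch below runs a walker of this shape over a small hand-built tree of plain objects standing in for DOM nodes. The object tree, the extractText wrapper, the el and text helpers, and the numeric node-type constants are all assumptions made for illustration; in a browser you would pass a real element such as document.body instead.

```javascript
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
var TEXT_NODE = 3;    // same value as Node.TEXT_NODE

var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };

function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    if (domObject.nodeType !== ELEMENT_NODE) return;
    if (domObject.tagName in ignore) return; // do not descend into ignored tags
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback);
}

// Hypothetical wrapper: collects the nodeValue of every text node the walker visits
function extractText(root) {
    var textvalue = "";
    walker(root, function (node) {
        if (node.nodeType === TEXT_NODE)
            textvalue += node.nodeValue;
    });
    return textvalue;
}

// Hypothetical helpers that build plain-object stand-ins for DOM nodes
function text(value) {
    return { nodeType: TEXT_NODE, nodeValue: value };
}
function el(tagName, childNodes) {
    return { nodeType: ELEMENT_NODE, tagName: tagName, childNodes: childNodes };
}

// Stand-in for:
// <div><style>#hdr-editions li { padding:0 10px; }</style><p>Headline</p><script>var x = 1;</script></div>
var page = el("DIV", [
    el("STYLE", [text("#hdr-editions li { padding:0 10px; }")]),
    el("P", [text("Headline")]),
    el("SCRIPT", [text("var x = 1;")])
]);

var result = extractText(page); // "Headline"
```

One caveat: tagName is upper-case for elements in HTML documents, which is what the ignore map relies on; in XML or XHTML documents the case is preserved, so the keys may need adjusting there.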
