How can I check if HTML contains extended entities like &lt;?

2023-01-20 18:34

Let's say we have an HTML string like "2 &lt; 4".

How can I determine whether it contains any of these entity sequences?

I've found HTML::Entities on CPAN, but it doesn't provide a 'check' method.

Details: I'm fixing a 'truncate' method so that it does not leave a corrupted string like "2 &l" and does not do unnecessary work. It should look like this:

$s = HTML::Entities::decode_entities ($s) if $has_ext_chars;
$s = substr ($s, 0, $len - 3) . '...' if length $s > $len;
$s = HTML::Entities::encode_entities ($s, "‚„-‰‹‘-™›\xA0¤¦§©«-®°-±µ-·»") if $has_ext_chars;

How do I determine $has_ext_chars?

A complete list of character entities can be found on the W3C reference.

You also have to match numeric references: \&#u?\d+; and \&#x[a-fA-F0-9]+;

From perldoc HTML::Entities:

The module can also export the %char2entity and the %entity2char hashes, which contain the mapping from all characters to the corresponding entities (and vice versa, respectively).

You can probably use them to build regexes. For example, to match entities:

use HTML::Entities '%entity2char';

my $regex = "&(?:" . join("|", map {s/;\z//; $_} keys %entity2char) . ");";

if ($str =~ /$regex/) {
    print "$str contains entities\n";
}

This will miss numeric references like &#entity_number; though.
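The named-entity regex and the numeric patterns can be combined into a single check. The following is only a sketch: has_entities is a made-up helper name, and the numeric part assumes plain decimal and hexadecimal references (&#60; and &#x3C;) rather than the exact patterns quoted earlier.

```perl
use strict;
use warnings;
use HTML::Entities '%entity2char';

# Collect the entity names known to HTML::Entities. Some keys carry a
# trailing ';', so strip it and de-duplicate before building the regex.
my %names;
for my $key (keys %entity2char) {
    (my $name = $key) =~ s/;\z//;
    $names{$name} = 1;
}
my $named = join '|', sort keys %names;

# A named entity, a decimal reference, or a hexadecimal reference.
my $entity_re = qr/&(?:(?:$named)|#\d+|#[xX][0-9a-fA-F]+);/;

# has_entities is a hypothetical helper, not part of HTML::Entities.
sub has_entities {
    my ($str) = @_;
    return $str =~ $entity_re ? 1 : 0;
}

my $has_ext_chars = has_entities('2 &lt; 4');
```

Because the terminating semicolon is required, a bare ampersand as in "AT&T" is not reported as an entity.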

You can try it with a regular expression

$str =~ /.*\&[^\s]+;.*/

Judging by your code sample, you have probably just introduced a cross-site scripting vulnerability into your application. If I were to get your code to process something like &lt;script src="evil.example.com"&gt;&lt;/script&gt;, it would decode that to valid HTML and not re-encode the < and > back to entities. (The angle brackets in your encode_entities character list are not ASCII angle brackets.)

If you are truncating a string that contains any HTML tags or entities you will probably break something if you use a simple solution. You might be better off building a solution based on an HTML parsing module. If you are only looking at text inside an element with no elements inside it you can grab the text, truncate it and then replace it back into the element. If you have to deal with mixed content it will be more complicated.
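For the simple case (text with entities but no tags inside it), one way to avoid both the half-cut entity and the unescaped markup is to decode unconditionally, truncate the decoded text, and re-encode with the default character set of encode_entities, which covers <, >, & and " as well as control and high-bit characters. A minimal sketch, assuming $len is at least 3; truncate_text is a hypothetical helper name:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# truncate_text is a hypothetical helper, not part of HTML::Entities.
# $html is assumed to hold text with entities but no markup.
sub truncate_text {
    my ($html, $len) = @_;
    my $text = decode_entities($html);      # "2 &lt; 4" becomes "2 < 4"
    $text = substr($text, 0, $len - 3) . '...'
        if length($text) > $len;
    return encode_entities($text);          # default set re-escapes < > & "
}

print truncate_text('&lt;script&gt;alert(1)&lt;/script&gt;', 10), "\n";
```

Since everything unsafe is re-encoded on the way out, the $has_ext_chars test becomes an optimization rather than something correctness depends on.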

But in the interest of bad solutions:

#treats each entity as one character "2 &lt; 4" is 5 characters long
$trunc_len = $len - 3;
$str =~ s/^((?>(?:[^&]|&[^\s;]+;?){$trunc_len}))(?:[^&]|&[^\s;]+;?){4,}/$1.../;

#abuses the procedural nature of the regexp engine
#treats each input character as one character "2 &lt; 4" is 8 characters long
$str =~ s/^( (?:[^&]|&[^\s;]+;?)+ )(?(?{ $found = (pos() > ( $found ? $len - 3 : $len ))})(?!)).*$(?(?{pos() < $len })(?!))/$1.../x;

Both are fairly permissive in what is an entity to allow for common browser quirks.
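If the regex tricks are hard to follow, the first variant (each entity counts as one character) can also be written as a tokenizing loop. This is a sketch rather than a drop-in equivalent: truncate_entities is a hypothetical name, and unlike the regexes above it only treats &...; as an entity when the closing semicolon is present.

```perl
use strict;
use warnings;

# truncate_entities is a hypothetical helper name.
sub truncate_entities {
    my ($str, $len) = @_;
    # Split into tokens: an entity-like run (&name; or &#123;) or any
    # single character, newlines included.
    my @tokens = $str =~ /&[^\s;&]+;|./gs;
    return $str if @tokens <= $len;
    # Keep $len - 3 tokens and append the ellipsis.
    return join('', @tokens[0 .. $len - 4]) . '...';
}

print truncate_entities('2 &lt; 4 &amp; more', 6), "\n";
```

Cutting on token boundaries means an entity is either kept whole or dropped, so a fragment like "2 &l" cannot appear in the result.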
