convert HTML Character Entity Encoding in R

2023-02-12 09:23 问答作者：

Is there a way in R to convert HTML Character Entity Encodings?

I would like to convert HTML character entities like & to & or > to >

For Perl exists the package HTML::Entities which could do that, b开发者_开发问答ut I couldn't find something similar in R.

I also tried iconv() but couldn't get satisfying results. Maybe there is also a way using the XML package but I haven't figured it out yet.

Unescape xml/html values using xml2 package:

unescape_xml <- function(str){
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}

unescape_html <- function(str){
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

Examples:

unescape_xml("3 &lt; x &amp; x &gt; 9")
# [1] "3 < x & x > 9"
unescape_html("&euro; 2.99")
# [1] "€ 2.99"

Update: this answer is outdated. Please check the answer below based on the new xml2 pkg.

Try something along the lines of:

# load XML package
library(XML)

# Convenience function to convert html codes
html2txt <- function(str) {
      xpathApply(htmlParse(str, asText=TRUE),
                 "//body//text()", 
                 xmlValue)[[1]] 
}

# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"

# converted string
html2txt(x)
[1] "isn't"

UPDATE: Edited the html2txt() function so it applies to more situations

While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and therefore slow if applied to a large number of characters. In addition, it only works with a character vector of length one and one has to use sapply for a longer character vector.

To demonstrate this, I first create a large character vector:

set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

And apply the function:

unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
##    user  system elapsed 
##   2.327   0.000   2.326 
head(res)
## [1] "& ' >" "€ <"   "& ' >" "€ <"   "€ <"   "abcd"

It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():

unescape_html2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

system.time(res2 <- unescape_html2(many_strings))
##    user  system elapsed 
##   0.011   0.000   0.010 
identical(res, res2)
## [1] TRUE

Of course, you need to be careful that the string that you use to combine the various strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, you will introduce an error, when the large string is split again in the end.

Based on Stibu's answer, I went to benchmark the functions.

# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

# then benchmark the functions by Stibu and Jeroen
bench::mark(
  textutils::HTMLdecode(many_strings),
  map_chr(many_strings, unescape_html),
  unescape_html2(many_strings)
)

# A tibble: 3 x 13
  expression                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings)  855.02ms 855.02ms     1.17   329.18MB    10.5      1     9   855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html)    1.09s    1.09s     0.919    6.79MB     5.51     1     6      1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings)           4.85ms   5.13ms   195.     581.48KB     0       98     0   503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

Here I vectorize Jeroen's unescape_html function by purrr::map_chr operator. So far, this just confirms Stibu's claim that the unescape_html2 is indeed many times faster! It is even way faster than textutils::HTMLdecode function.

But I also found that the xml version could be even faster.

unescape_xml2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_xml(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

However, this function fails when dealing with the many_strings object (maybe because read_xml can not read Euro symbol. So I have to try a different way for benchmarking.

library(tidyverse)
library(rvest)

entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>% 
  html_node(css = "table") %>% 
  html_table() %>% 
  rename(text = X1,
         named = X2,
         hex = X3, 
         dec = X4,
         desc = X5) %>% 
  as_tibble
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named

bench::mark(
  textutils::HTMLdecode(s2),
  map_chr(s2, unescape_xml),
  map_chr(s2, unescape_html),
  unescape_xml2(s2),
  unescape_html2(s2)
)

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s2)   191.7ms  194.9ms      5.16    64.1MB    10.3      3     6      582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml)    73.8ms   80.9ms     11.9   1006.9KB     5.12     7     3      586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html)  162.4ms  163.7ms      5.83  1006.9KB     5.83     3     3      514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2)           459.2µs    473µs   2034.      37.9KB     2.00  1017     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2)            590µs  607.5µs   1591.      37.9KB     2.00   796     1      500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

We can also try on hex ones.

> bench::mark(
+   # gsubreplace_mapping(s2, entity_mapping),
+   # gsubreplace_local(s2),
+   textutils::HTMLdecode(s3),
+   map_chr(s3, unescape_xml),
+   map_chr(s3, unescape_html),
+   unescape_xml2(s3),
+   unescape_html2(s3)
+ )

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s3)   204.2ms  212.3ms      4.72    64.1MB     7.87     3     5      636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml)    76.4ms   80.2ms     11.8   1006.9KB     5.04     7     3      595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html)  164.6ms  165.3ms      5.80  1006.9KB     5.80     3     3      518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3)           487.4µs  500.5µs   1929.      74.5KB     2.00   965     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3)          611.1µs  627.7µs   1574.      40.4KB     0      788     0      501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

Here the xml version is even more faster than the html version.

library(xml2)
xml_text(read_html(charToRaw("&amp; &gt;")))

gives:

[1] "& >"

继续阅读：character-encoding encoding r

convert HTML Character Entity Encoding in R

Is there a way in R to convert HTML Character Entity Encodings?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

Is there a way in R to convert HTML Character Entity Encodings?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生 新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？