Convert HTML character entity encoding in R

Is there a way in R to convert HTML Character Entity Encodings?

I would like to convert HTML character entities like &amp; to & or &gt; to >

Perl has the HTML::Entities package, which can do that, but I couldn't find anything similar in R.

I also tried iconv() but couldn't get satisfactory results. Maybe there is also a way using the XML package, but I haven't figured it out yet.


Unescape XML/HTML values using the xml2 package:

unescape_xml <- function(str){
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}

unescape_html <- function(str){
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

Examples:

unescape_xml("3 &lt; x &amp; x &gt; 9")
# [1] "3 < x & x > 9"
unescape_html("&euro; 2.99")
# [1] "€ 2.99"


Update: this answer is outdated. Please see the answer based on the newer xml2 package.


Try something along the lines of:

# load XML package
library(XML)

# Convenience function to convert html codes
html2txt <- function(str) {
  xpathApply(htmlParse(str, asText = TRUE),
             "//body//text()",
             xmlValue)[[1]]
}

# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"

# converted string
html2txt(x)
[1] "isn't"

UPDATE: Edited the html2txt() function so it applies to more situations


While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised: it only works on a character vector of length one, so for a longer vector one has to call it once per element, for example with sapply(). That makes it slow when applied to a large number of strings.

To demonstrate this, I first create a large character vector:

set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

And apply the function:

unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
##    user  system elapsed 
##   2.327   0.000   2.326 
head(res)
## [1] "& ' >" "€ <"   "& ' >" "€ <"   "€ <"   "abcd" 

It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():

unescape_html2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

system.time(res2 <- unescape_html2(many_strings))
##    user  system elapsed 
##   0.011   0.000   0.010 
identical(res, res2)
## [1] TRUE

Of course, you need to make sure that the separator used to combine the strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, the large string will be split in the wrong places at the end and the result will no longer line up with the input.
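The separator caveat can be turned into an explicit check. The following is only a sketch, assuming the xml2 package is installed; unescape_html_checked() and its sep argument are hypothetical names introduced for this illustration and are not part of the answer above. It also checks the output length, because strsplit() drops a trailing empty string.

```r
# Hypothetical variant of unescape_html2() with two guards (illustration only)
unescape_html_checked <- function(str, sep = "#_|") {
  # Guard 1: refuse to run if the separator already occurs in the input
  if (any(grepl(sep, str, fixed = TRUE))) {
    stop("separator occurs in the input; choose a different 'sep'")
  }
  # Combine everything into one document so the parser runs only once
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  out <- strsplit(parsed, sep, fixed = TRUE)[[1]]
  # Guard 2: strsplit() drops a trailing empty string, and an entity could
  # decode to the separator itself, so verify nothing was lost or added
  if (length(out) != length(str)) {
    stop("output length differs from input length")
  }
  out
}

unescape_html_checked(c("&amp;", "a &lt; b", "&euro; 2.99"))
```

The separator should also be free of characters that are special in markup, such as < or &, since it becomes part of the parsed document.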


Based on Stibu's answer, I benchmarked the functions.

# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

# then benchmark the functions by Stibu and Jeroen
library(purrr)  # provides map_chr()
bench::mark(
  textutils::HTMLdecode(many_strings),
  map_chr(many_strings, unescape_html),
  unescape_html2(many_strings)
)

# A tibble: 3 x 13
  expression                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings)  855.02ms 855.02ms     1.17   329.18MB    10.5      1     9   855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html)    1.09s    1.09s     0.919    6.79MB     5.51     1     6      1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings)           4.85ms   5.13ms   195.     581.48KB     0       98     0   503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Here I apply Jeroen's unescape_html() to each element with purrr::map_chr(). The result confirms Stibu's claim that unescape_html2() is many times faster. It is also far faster than textutils::HTMLdecode().

But I also found that the XML version can be even faster:

unescape_xml2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_xml(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

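However, this function fails on the many_strings object, probably because &euro; is an HTML entity that is not among the five named entities predefined in XML (&amp;, &lt;, &gt;, &quot; and &apos;), so read_xml() cannot resolve it. So I have to use different test data for benchmarking.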
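The failure can be reproduced in isolation. This is a minimal sketch, assuming the xml2 package is installed; it reuses the unescape_xml() definition from the xml2 answer, and try_unescape_xml() is a hypothetical helper introduced only for this illustration, returning NA instead of stopping when parsing fails.

```r
library(xml2)

# Same definition as in the xml2 answer
unescape_xml <- function(str) {
  xml_text(read_xml(paste0("<x>", str, "</x>")))
}

# Hypothetical helper: return NA instead of raising an error
try_unescape_xml <- function(str) {
  tryCatch(unescape_xml(str), error = function(e) NA_character_)
}

# The five named entities predefined in XML are accepted
try_unescape_xml("&amp; &lt; &gt; &quot; &apos;")
# Numeric character references are accepted as well
try_unescape_xml("&#8364;")
# An HTML-only named entity is not defined in plain XML, so this should give NA
try_unescape_xml("&euro;")
```

This is also why the decimal and hexadecimal references work as benchmark input for the XML-based functions, while the named column would not.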

library(tidyverse)
library(rvest)

entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>% 
  html_node(css = "table") %>% 
  html_table() %>% 
  rename(text = X1,
         named = X2,
         hex = X3, 
         dec = X4,
         desc = X5) %>% 
  as_tibble()
s2 <- entity_mapping %>% pull(dec) # decimal references
s3 <- entity_mapping %>% pull(hex) # hexadecimal references, used in the second benchmark

bench::mark(
  textutils::HTMLdecode(s2),
  map_chr(s2, unescape_xml),
  map_chr(s2, unescape_html),
  unescape_xml2(s2),
  unescape_html2(s2)
)

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s2)   191.7ms  194.9ms      5.16    64.1MB    10.3      3     6      582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml)    73.8ms   80.9ms     11.9   1006.9KB     5.12     7     3      586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html)  162.4ms  163.7ms      5.83  1006.9KB     5.83     3     3      514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2)           459.2µs    473µs   2034.      37.9KB     2.00  1017     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2)            590µs  607.5µs   1591.      37.9KB     2.00   796     1      500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

We can also try the hexadecimal references (s3, taken from the hex column):

bench::mark(
  # gsubreplace_mapping(s2, entity_mapping),
  # gsubreplace_local(s2),
  textutils::HTMLdecode(s3),
  map_chr(s3, unescape_xml),
  map_chr(s3, unescape_html),
  unescape_xml2(s3),
  unescape_html2(s3)
)

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s3)   204.2ms  212.3ms      4.72    64.1MB     7.87     3     5      636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml)    76.4ms   80.2ms     11.8   1006.9KB     5.04     7     3      595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html)  164.6ms  165.3ms      5.80  1006.9KB     5.80     3     3      518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3)           487.4µs  500.5µs   1929.      74.5KB     2.00   965     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3)          611.1µs  627.7µs   1574.      40.4KB     0      788     0      501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Here, too, the XML version is faster than the HTML version.


library(xml2)
xml_text(read_html(charToRaw("&amp; &gt;")))

gives:

[1] "& >"