Convert HTML character entity encoding in R
Is there a way in R to convert HTML Character Entity Encodings?
I would like to convert HTML character entities like
&amp;
to &
or
&gt;
to >
For Perl there is the package HTML::Entities, which can do that, but I couldn't find anything similar in R.
I also tried iconv()
but couldn't get satisfactory results. Maybe there is also a way using the XML
package but I haven't figured it out yet.
Unescape xml/html values using xml2
package:
unescape_xml <- function(str){
xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
unescape_html <- function(str){
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
Examples:
unescape_xml("3 &lt; x &amp; x &gt; 9")
# [1] "3 < x & x > 9"
unescape_html("&euro; 2.99")
# [1] "€ 2.99"
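One thing the two helpers above do not handle is missing values: paste0() turns NA into the literal text "NA", so the wrappers return the string "NA" rather than NA. The following is a minimal sketch of a wrapper that keeps NA as NA and accepts vectors of any length. The name unescape_html_na is hypothetical, and xml2 is assumed to be installed.

```r
# Hypothetical helper (not part of xml2): element-wise HTML unescaping
# that leaves missing values untouched.
unescape_html_na <- function(str) {
  out <- rep(NA_character_, length(str))
  ok <- !is.na(str)
  # Parse each non-missing element separately, as unescape_html() does.
  out[ok] <- vapply(
    str[ok],
    function(s) xml2::xml_text(xml2::read_html(paste0("<x>", s, "</x>"))),
    character(1),
    USE.NAMES = FALSE
  )
  out
}

unescape_html_na(c("3 &lt; x", NA, "&euro; 2.99"))
```

Because every element is still parsed on its own, this is no faster than calling the original helper in a loop.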
Update: this answer is outdated. Please check the answer based on the newer xml2 package.
Try something along the lines of:
# load XML package
library(XML)
# Convenience function to convert html codes
html2txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}
# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"
# converted string
html2txt(x)
[1] "isn't"
UPDATE: Edited the html2txt() function so it applies to more situations
While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and is therefore slow when applied to a large number of strings. It only works with a character vector of length one, so one has to use sapply
for a longer character vector.
To demonstrate this, I first create a large character vector:
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
And apply the function:
unescape_html <- function(str) {
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
## user system elapsed
## 2.327 0.000 2.326
head(res)
## [1] "& ' >" "€ <" "& ' >" "€ <" "€ <" "abcd"
It is much faster if all the strings in the character vector are combined into a single, large string, so that read_html() and xml_text() need only be called once. The strings can then easily be separated again using strsplit():
unescape_html2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_html(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
system.time(res2 <- unescape_html2(many_strings))
## user system elapsed
## 0.011 0.000 0.010
identical(res, res2)
## [1] TRUE
Of course, you need to make sure that the separator used to join the strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, splitting the large string again at the end will produce wrong results.
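The separator caveat can be turned into an explicit check. This is only a sketch under assumptions: the function name unescape_html_checked and the sep argument are hypothetical, and xml2 is assumed to be installed. It stops if the separator already occurs in the input and verifies afterwards that the number of pieces matches the number of inputs.

```r
# Hypothetical variant of unescape_html2() that refuses to run when the
# separator already occurs in the input.
unescape_html_checked <- function(str, sep = "#_|") {
  if (any(grepl(sep, str, fixed = TRUE))) {
    stop("separator '", sep, "' occurs in the input; choose another one")
  }
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  out <- strsplit(parsed, sep, fixed = TRUE)[[1]]
  # The split can still go wrong, e.g. if decoding produces the separator
  # or the last element is empty, so verify the length before returning.
  if (length(out) != length(str)) {
    stop("number of pieces differs from number of inputs")
  }
  out
}

unescape_html_checked(c("a &amp; b", "&lt;p&gt;"))
```

The check only looks at the raw input, which is why the length comparison is kept as a second line of defence.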
Based on Stibu's answer, I benchmarked the functions.
library(purrr)  # provides map_chr()
# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
# then benchmark the functions by Stibu and Jeroen
bench::mark(
textutils::HTMLdecode(many_strings),
map_chr(many_strings, unescape_html),
unescape_html2(many_strings)
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings) 855.02ms 855.02ms 1.17 329.18MB 10.5 1 9 855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html) 1.09s 1.09s 0.919 6.79MB 5.51 1 6 1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings) 4.85ms 5.13ms 195. 581.48KB 0 98 0 503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here I vectorize Jeroen's unescape_html function with purrr::map_chr. So far, this confirms Stibu's claim that unescape_html2 is indeed many times faster. It is also much faster than the textutils::HTMLdecode function.
But I also found that the xml version could be even faster.
unescape_xml2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_xml(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
However, this function fails on the many_strings object, probably because read_xml cannot resolve the &euro; entity: XML itself predefines only &amp;, &lt;, &gt;, &quot; and &apos;. So I have to benchmark in a different way.
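The failure on many_strings is consistent with how XML treats entities: named entities such as &euro; come from HTML and are not predefined in XML, whereas numeric character references are part of XML itself. A small sketch to illustrate the difference, assuming xml2 is installed (the variable names euro_num, euro_xml and euro_html are only for illustration):

```r
library(xml2)

# Numeric character references are part of XML itself, so this parses.
euro_num <- xml_text(read_xml("<x>&#8364; 2.99</x>"))

# &euro; is an HTML named entity that XML does not predefine, so
# read_xml() is expected to raise an error, caught here and turned into NA.
euro_xml <- tryCatch(
  xml_text(read_xml("<x>&euro; 2.99</x>")),
  error = function(e) NA_character_
)

# The HTML parser knows the named entity.
euro_html <- xml_text(read_html("<x>&euro; 2.99</x>"))
```

This is why the XML-based functions are only suitable for input restricted to the five predefined XML entities or to numeric references.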
library(tidyverse)
library(rvest)
entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>%
html_node(css = "table") %>%
html_table() %>%
rename(text = X1,
named = X2,
hex = X3,
dec = X4,
desc = X5) %>%
as_tibble
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named
bench::mark(
textutils::HTMLdecode(s2),
map_chr(s2, unescape_xml),
map_chr(s2, unescape_html),
unescape_xml2(s2),
unescape_html2(s2)
)
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s2) 191.7ms 194.9ms 5.16 64.1MB 10.3 3 6 582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml) 73.8ms 80.9ms 11.9 1006.9KB 5.12 7 3 586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html) 162.4ms 163.7ms 5.83 1006.9KB 5.83 3 3 514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2) 459.2µs 473µs 2034. 37.9KB 2.00 1017 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2) 590µs 607.5µs 1591. 37.9KB 2.00 796 1 500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
We can also try the hex ones.
# assumption: s3 holds the hex column, by analogy with s2 above
s3 <- entity_mapping %>% pull(hex)
bench::mark(
# gsubreplace_mapping(s2, entity_mapping),
# gsubreplace_local(s2),
textutils::HTMLdecode(s3),
map_chr(s3, unescape_xml),
map_chr(s3, unescape_html),
unescape_xml2(s3),
unescape_html2(s3)
)
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s3) 204.2ms 212.3ms 4.72 64.1MB 7.87 3 5 636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml) 76.4ms 80.2ms 11.8 1006.9KB 5.04 7 3 595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html) 164.6ms 165.3ms 5.80 1006.9KB 5.80 3 3 518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3) 487.4µs 500.5µs 1929. 74.5KB 2.00 965 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3) 611.1µs 627.7µs 1574. 40.4KB 0 788 0 501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here, too, the xml version is faster than the html version.
library(xml2)
xml_text(read_html(charToRaw("&amp; &gt;")))
gives:
[1] "& >"