
Importing wikipedia tables in R

I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In a Google spreadsheet, I can enter this:

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", "table", 3)

and this function will download the 3rd table from that page, which lists all the counties of the UP of Michigan.

Is there something similar in R, or can it be created via a user-defined function?


Building on Andrie's answer, and addressing SSL: if you can take one additional library dependency:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

doc[[6]]


The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3
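Note that the numeric columns in the output above come through as character vectors because of the thousands separators. A small cleanup sketch (the column values are taken from the printed output above; the helper name is ours):

```r
# Strip thousands separators, then convert to numeric
clean_num <- function(x) as.numeric(gsub(",", "", x))

clean_num("317,258")            # 317258
clean_num(c("9,862", "16,420")) # c(9862, 16420)
```

Apply it to each affected column (e.g. `doc[[6]]$V2 <- clean_num(doc[[6]]$V2)`) before doing any arithmetic on the table.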

readHTMLTable returns a list of data frames, one for each table-like element of the HTML page. You can use names to get information about each element:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL" 
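Since the list elements are named, you can also index a table by its name instead of its position, which survives tables being added or reordered on the page. A minimal stand-in sketch (the real doc comes from readHTMLTable above; this fake list only mimics its structure):

```r
# Stand-in for the named list readHTMLTable() returns
doc <- list(
  toc = data.frame(),
  `Upper Peninsula Land Area and Population Density by County` =
    data.frame(County = "Alger")
)

# Name-based indexing instead of doc[[6]]
doc[["Upper Peninsula Land Area and Population Density by County"]]
```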


Here is a solution that works with the secure (https) link:

install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", 3)


One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:

http://www.omegahat.org/RGoogleDocs/run.html

You can then use the =ImportHtml Google Docs function with all its pre-built magic.


A tidyverse solution using rvest. It's very useful if you need to find a table based on some keywords, for example in the table headers. Here is an example where we want to get the table on vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the available tables on the page.

library(magrittr)
library(rvest)
library(stringr)  # for str_which()

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>% 
    # list all tables on the page
    html_nodes(css = "table") %>% 
    # select the one containing needed key words
    extract2(., str_which(string = . , pattern = "Live births")) %>% 
    # convert to a table
    html_table(fill = T) %>%  
    View()
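The selection step works because str_which coerces the node set to character and returns the indices of matches. A self-contained illustration of just that step, using base grep (which str_which wraps with stringr semantics) on stand-in HTML strings:

```r
# Stand-in HTML for two tables; html_nodes() output coerces to
# character strings in much the same way
tables <- c("<table><th>Year</th><th>Population</th></table>",
            "<table><th>Live births</th><th>Deaths</th></table>")

# Index of the table whose markup contains the keyword
grep("Live births", tables)  # 2
```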


That table is the only table which is a child of the second td child, so you can specify that pattern with CSS. Rather than use a type selector of table to grab the child table, you can use the class, which is faster:

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)
