Scraping movie scripts failing on small subset

I'm working on scraping the Lord of the Rings movie scripts from the Age of the Ring website (ageofthering.com). Each script is split across multiple pages.

I can get the info I need for a single page with this code:

library(dplyr)
library(rvest)

url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"

success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()

summary(success)
     Length Class  Mode
[1,] 2      tbl_df list

This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for the Two Towers pages covering scenes 57 to 66. However, every other Two Towers page (scenes 1 to 56) returns an empty list instead:

url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"

fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()

summary(fail)
Length  Class   Mode 
     0   list   list 

I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?


It works if you select the table with an XPath expression instead of the CSS selector. The cause is perhaps ill-formed HTML (the page doesn't seem very spec-compliant).

library(rvest)

url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"

fail <- read_html(url_fail) %>%
  html_elements( xpath = '//*[@id="AutoNumber1"]') %>% 
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#>    X1                                                                      X2   
#>    <chr>                                                                   <chr>
#>  1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#>  2 "GANDALF VOICE OVER:"                                                   "You…
#>  3 "FRODO VOICE OVER:"                                                     "Gan…
#>  4 "GANDALF VOICE OVER:"                                                   "I a…
#>  5 "The scene changes to \r\n    inside Moria.  Gandalf is on the Bridge … "The…
#>  6 "GANDALF:"                                                              "You…
#>  7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#>  8 "BOROMIR :"                                                             "(ho…
#>  9 "FRODO:"                                                                "Gan…
#> 10 "GANDALF:"                                                              "Fly…
#> # … with 129 more rows
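
If you need to run this over many pages, it may help to wrap the XPath lookup in a small function. Below is a minimal sketch, not a definitive implementation: `extract_script_table()` is a hypothetical helper name, and the inline HTML is made-up placeholder content standing in for a real page so the example runs without network access. It assumes the table you want always carries the id `AutoNumber1`, as in the question.

```r
library(rvest)

# Hypothetical helper (not part of rvest): pull the table with the given id
# out of an already-parsed page, using XPath rather than a CSS selector.
extract_script_table <- function(page, table_id = "AutoNumber1") {
  xp <- sprintf('//*[@id="%s"]', table_id)
  tables <- html_table(html_elements(page, xpath = xp))
  if (length(tables) == 0) {
    return(NULL)  # nothing matched, so let the caller decide what to do
  }
  tables[[1]]
}

# Made-up placeholder page; the real pages would come from read_html(url).
demo_page <- minimal_html('
  <table id="AutoNumber1">
    <tr><td>SPEAKER ONE:</td><td>First placeholder line.</td></tr>
    <tr><td>SPEAKER TWO:</td><td>Second placeholder line.</td></tr>
  </table>')

demo_table <- extract_script_table(demo_page)
missing_table <- extract_script_table(demo_page, table_id = "NoSuchId")
```

For a real page you would call something like `extract_script_table(read_html(url_fail))`. Returning `NULL` when nothing matches makes it easy to spot any page where the table is still not found.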