
Generate a plain text file from a list of words on a webpage

I am trying to generate a plain text file containing a list of words that is on a webpage. The problem is that the list is divided into multiple pages.

http://www.whonamedit.com/eponyms/A/?start=50&maxrows=25

This is what I mean: for the letter A alone, I need all 13 pages of words, and I need the same for every letter of the alphabet.

I was thinking of maybe modifying a web crawler to do this task. Would that be the easiest way?

I prefer Java, but Python is ok.

Sorry if the answer is obvious, but any nudges in the right direction would be SO GREATLY appreciated!!


Assuming this is specifically for the whonamedit website, you can do the following:

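// Returns the words listed on a single results page.
List<String> getWordsOnPage(String url) {
  // read words within <ul class="result-list"> element.
}

// Collects the words for every letter, walking through each letter's pages.
List<String> getAllWords() {
  List<String> all = new ArrayList<String>();
  for (char letter = 'A'; letter <= 'Z'; ++letter) {
    // Each page holds up to 25 entries; stop at the first empty page.
    for (int start = 0; true; start += 25) {
      List<String> page = getWordsOnPage("http://www.whonamedit.com/eponyms/" + letter + "/?start=" + start + "&maxrows=25");
      if (page.isEmpty()) {
        break;
      }
      all.addAll(page);
    }
  }
  return all;  // hand the collected words back to the caller
}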
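
To fill in the outline above, here is a minimal sketch using only the standard library. It rests on assumptions that are not confirmed by the site: that each entry is an <li> inside <ul class="result-list">, and that a page past the end contains no entries. The class name EponymListSketch, the helpers parseWords, collectLetter and writeWords, and the injected fetch function are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EponymListSketch {
    // Assumed markup: <ul class="result-list"><li><a href="...">Word</a></li>...</ul>
    private static final Pattern LIST =
        Pattern.compile("<ul[^>]*class=\"result-list\"[^>]*>(.*?)</ul>", Pattern.DOTALL);
    private static final Pattern ITEM =
        Pattern.compile("<li[^>]*>(.*?)</li>", Pattern.DOTALL);

    /** Returns the text of each <li> in the first result-list, with inner tags stripped. */
    static List<String> parseWords(String html) {
        List<String> words = new ArrayList<>();
        Matcher list = LIST.matcher(html);
        if (!list.find()) {
            return words;  // no result list on this page
        }
        Matcher item = ITEM.matcher(list.group(1));
        while (item.find()) {
            String text = item.group(1).replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                words.add(text);
            }
        }
        return words;
    }

    /**
     * Pages through one letter until a page yields no words.
     * fetch maps a URL to that page's HTML, so the download step can be swapped out.
     */
    static List<String> collectLetter(char letter, int pageSize, Function<String, String> fetch) {
        List<String> all = new ArrayList<>();
        for (int start = 0; ; start += pageSize) {
            String url = "http://www.whonamedit.com/eponyms/" + letter
                + "/?start=" + start + "&maxrows=" + pageSize;
            List<String> page = parseWords(fetch.apply(url));
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
        }
        return all;
    }

    /** Writes one word per line to a UTF-8 plain text file. */
    static void writeWords(List<String> words, Path out) throws IOException {
        Files.write(out, words, StandardCharsets.UTF_8);
    }
}
```

Calling collectLetter for each letter from 'A' to 'Z' and passing the combined list to writeWords would produce the text file. The regular expressions are fragile against markup changes and do not decode HTML entities, so a real HTML parser is likely the sturdier choice. If the site repeats its last page for an out-of-range start value instead of returning an empty list, the loop needs an extra stop condition, such as a cap on the number of pages.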


I use HtmlUnit to write spiders.
