
Getting a large number of (but not all) Wikipedia pages

For an NLP project of mine, I want to download a large number of pages (say, 10,000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:

  1. Open a Wikipedia page
  2. Parse the HTML for links in a Breadth First Search fashion and open each page
  3. Recursively open links on the pages obtained in step 2

In steps 2 and 3, I will stop once I have reached the number of pages I want.

How would you do it? Please suggest any better ideas you can think of. (A rough sketch of the crawl I have in mind follows.)
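This is only a rough Python 2 sketch: the seed URL, the target of 10000 pages and the regular-expression link extraction are all placeholders, and a real crawler should respect robots.txt and rate-limit itself.

# Rough breadth-first crawl sketch (placeholders: SEED, TARGET, and the
# crude regex used to pull article links out of the HTML).
import re
import urllib2
from collections import deque

SEED = 'http://en.wikipedia.org/wiki/Special:Random'
TARGET = 10000
ARTICLE_LINK = re.compile(r'href="(/wiki/[^":#?]+)"')  # skips Special:, File:, anchors

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

seen = set([SEED])
queue = deque([SEED])
pages = []

while queue and len(pages) < TARGET:
    url = queue.popleft()
    try:
        html = opener.open(url).read()
    except urllib2.URLError:
        continue
    pages.append(html)
    for path in ARTICLE_LINK.findall(html):
        full = 'http://en.wikipedia.org' + path
        if full not in seen:           # breadth-first: enqueue unseen links
            seen.add(full)
            queue.append(full)

print "Collected", len(pages), "pages"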

ANSWER: This is my Python code:

# Get N random pages from Wikipedia by requesting Special:Random repeatedly.
import urllib2
import os
import shutil

# Recreate the directory that will hold the downloaded HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')

print "Creating the directory for storing the pages"
os.mkdir('randompages')

num_page = raw_input('Number of pages to retrieve: ')

# Build one opener with a browser-like User-Agent (Wikipedia tends to
# reject the default urllib2 agent) and reuse it for every request.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(int(num_page)):
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()

    print "Retrieved and saved page", i + 1


ANSWER: The whole approach boils down to this pseudocode:

for i = 1 to 10000
    get "http://en.wikipedia.org/wiki/Special:Random"


ANSWER: Wikipedia has an API. With it you can get random articles in a given namespace:

http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5

and for each article you can also get the wiki text:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content
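Put together, the two calls look roughly like this (Python 2 to match the code above; format=json, the User-Agent string and the batch size are assumptions, and the response layout follows the API's default JSON output):

import json
import urllib
import urllib2

API = 'http://en.wikipedia.org/w/api.php'

def api_get(params):
    # Minimal helper: GET against the MediaWiki API and decode the JSON reply.
    url = API + '?' + urllib.urlencode(params)
    req = urllib2.Request(url, headers={'User-Agent': 'nlp-corpus-builder/0.1 (example)'})
    return json.load(urllib2.urlopen(req))

# 1. Ask for a batch of random article titles (rnnamespace=0 -> real articles).
rand = api_get({'action': 'query', 'list': 'random',
                'rnnamespace': 0, 'rnlimit': 5, 'format': 'json'})

for entry in rand['query']['random']:
    title = entry['title']
    # 2. Fetch the current wiki text of that article.
    rev = api_get({'action': 'query', 'prop': 'revisions',
                   'rvprop': 'content', 'format': 'json',
                   'titles': title.encode('utf-8')})
    page = rev['query']['pages'].values()[0]
    wikitext = page['revisions'][0]['*']      # '*' holds the raw wiki markup
    print title.encode('utf-8'), '->', len(wikitext), 'characters of wiki text'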


ANSWER: I'd go the opposite way: start with the XML dump, and then throw away what you don't want.

In your case, if you are looking to do natural language processing, I would assume that you are interested in pages that have complete sentences, and not lists of links. If you spider the links in the manner you describe, you'll be hitting a lot of link pages.

And why avoid the XML dump, when you get the benefit of XML parsing tools that will make your selection process easier?
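For example, a streaming parser can walk the dump and hand you (title, wikitext) pairs without ever loading the whole file. This is only a sketch, and the dump filename in the usage comment is illustrative:

import bz2
import xml.etree.cElementTree as etree

def articles(dump_file, limit=10000):
    # Yield (title, wikitext) pairs from a MediaWiki pages-articles dump.
    count = 0
    title = None
    for event, elem in etree.iterparse(dump_file):
        tag = elem.tag.rsplit('}', 1)[-1]    # strip the MediaWiki XML namespace
        if tag == 'title':
            title = elem.text
        elif tag == 'text':
            yield title, elem.text or ''
            count += 1
            if count >= limit:
                return
        elem.clear()                         # keep memory use flat

# Usage (illustrative filename; dumps are distributed bz2-compressed):
# for title, text in articles(bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2')):
#     process(title, text)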


ANSWER: You may be able to do an end run around most of the requirement:

http://cs.fit.edu/~mmahoney/compression/enwik8.zip

is a ZIP file containing 100 MB of Wikipedia, already pulled out for you. The linked file is ~ 16 MB in size.
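Fetching and unpacking it takes only a few lines (the local filename is arbitrary; the archive should unpack to a single file, enwik8, holding the first 10^8 bytes of an English Wikipedia XML dump):

import urllib2
import zipfile

URL = 'http://cs.fit.edu/~mmahoney/compression/enwik8.zip'

# Download the archive to disk, then unpack it in the current directory.
data = urllib2.urlopen(URL).read()
with open('enwik8.zip', 'wb') as f:
    f.write(data)

with zipfile.ZipFile('enwik8.zip') as z:
    print z.namelist()     # should list the single member, 'enwik8'
    z.extractall()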


ANSWER: I know it has been a long time, but for those who are still looking for an efficient way to crawl and download a large number of Wikipedia pages (or all of Wikipedia) without violating robots.txt, the 'Webb' library is useful. Here is the link:

Webb Library for Web Crawling and Scraping


ANSWER: Look at the DBpedia project.

There are small downloadable chunks containing at least some article URLs. Once you have parsed 10000 of them, you can batch-download the pages carefully ...
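Once you have the list of article URLs, the batch download itself is simple. This sketch assumes a plain text file with one URL per line (the filename 'article_urls.txt' and the output directory are made up) and throttles to roughly one request per second:

import os
import time
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

if not os.path.isdir('dbpedia_pages'):
    os.mkdir('dbpedia_pages')

# 'article_urls.txt' is a hypothetical file holding one article URL per line.
with open('article_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls[:10000]):
    try:
        html = opener.open(url).read()
    except urllib2.URLError:
        continue
    with open('dbpedia_pages/page%d.html' % i, 'w') as out:
        out.write(html)
    time.sleep(1)          # be polite: roughly one request per second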
