
Extract data from Wikipedia as cleanly as possible using Rails 3

I am developing a Rails 3 application in which I want to be able to extract data (a title and a short text) about any topic from Wikipedia.

I need the info to be very "clean", in other words free from HTML, wiki tags, and irrelevant data such as reference lists.

Is it possible to get only the title and some text about the topic?

I am using a gem to get the data, but what comes back is very ugly:

```
{{for|the television series|Solsidan (TV series)}}
{{Infobox settlement
|official_name = Solsidan
|image_skyline =
|image_caption =
|pushpin_map = Sweden
|pushpin_label_position =
|coordinates_region = SE
|subdivision_type = [[Country]]
|subdivision_name = [[Sweden]]
|subdivision_type3 = [[Municipalities of Sweden|Municipality]]
|subdivision_name3 = [[Nacka Municipality]]
|subdivision_type2 = [[Counties of Sweden|County]]
|subdivision_name2 = [[Stockholm County]]
|subdivision_type1 = [[Provinces of Sweden|Province]]
|subdivision_name1 = [[Uppland]]
|area_footnotes = {{cite web | title=Tätorternas landareal, folkmängd och invånare per km2 2000 och 2005 | publisher=[[Statistics Sweden]] | url=http://www.scb.se/statistik/MI/MI0810/2005A01B/T%c3%a4torternami0810tab1.xls | format=xls | language=Swedish | accessdate=2009-05-08}}
|area_total_km2 = 0.23
|population_as_of = 2005-12-31
|population_footnotes =
|population_total = 209
|population_density_km2 = 895
|timezone = [[Central European Time|CET]]
|utc_offset = +1
|timezone_DST = [[Central European Summer Time|CEST]]
|utc_offset_DST = +2
|coordinates_display = display=inline,title
|latd=59 |latm=17 |lats= |latNS=N
|longd=17 |longm=51 |longs= |longEW=E
|website =
}}
'''Solsidan''' is a [[Urban areas in Sweden|locality]] situated in [[Nacka Municipality]], [[Stockholm County]], [[Sweden]]

== References ==
{{Reflist}}
{{Stockholm-geo-stub}}
{{Localities in Nacka Municipality}}
[[Category:Populated places in Stockholm County]]
[[no:Solsidan]]
[[sv:Solsidan, Nacka kommun]]
```


Wikipedia provides regular database dumps at Wikipedia:Database download, both as MySQL dumps in the schema used by MediaWiki and in an XML interchange format. You can load these onto your own server (~6 GiB to download, ~30 GB uncompressed for the current text of all English Wikipedia articles) and query/process them however you wish. The content is not pre-rendered to HTML, so you can process the wiki markup yourself and emit whatever you want from it. The page has lots of links to libraries in various languages that process these dumps, though I don't see a Ruby one, so you might have to do it yourself.
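If you do end up doing it yourself, a handful of regular expressions gets you surprisingly far for stub-sized articles like the one above. Here is a minimal sketch, not a real wikitext parser (templates nest and the markup has many corner cases, so treat this as a starting point only):

```ruby
# Naive wikitext cleanup -- a sketch, not a full parser.
# Assumes `raw` holds the raw wikitext of one article.
def strip_wikitext(raw)
  text = raw.dup
  # Drop {{...}} templates (hatnotes, infoboxes, cite web, stubs).
  # Loop so templates nested inside other templates are removed too.
  text.gsub!(/\{\{[^{}]*\}\}/m, '') while text =~ /\{\{[^{}]*\}\}/m
  # [[target|label]] -> label, [[target]] -> target
  text.gsub!(/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/) { $1 }
  # Remove bold/italic quoting ('''...''' and ''...'')
  text.gsub!(/'{2,}/, '')
  # Cut everything from the References section onwards
  # (which also drops categories and interlanguage links)
  text.sub!(/==\s*References\s*==.*/m, '')
  text.squeeze(' ').strip
end
```

Run over the sample above, this should leave roughly "Solsidan is a locality situated in Nacka Municipality, Stockholm County, Sweden". For anything beyond quick-and-dirty extraction you would want a proper wikitext parser.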

There are also various subsets provided. abstract.xml contains just the titles and abstracts, which sounds like what you want, and is only ~3 GB.
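In case it helps, here is a rough sketch of streaming that abstracts file with Nokogiri (it is too big to parse into a DOM in one go). It assumes the dump is a flat list of repeated `<doc>` elements with `<title>` and `<abstract>` children; I'm going from memory on the exact element names, so check a few lines of your copy of the file first:

```ruby
require 'nokogiri'

# Stream abstract.xml rather than loading ~3 GB into memory at once.
# Assumes each article is a <doc> element with <title> and <abstract>
# children -- verify against your copy of the dump.
def each_abstract(path)
  reader = Nokogiri::XML::Reader(File.open(path))
  reader.each do |node|
    next unless node.name == 'doc' &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    # Parse just this one small <doc> subtree into a document.
    doc      = Nokogiri::XML(node.outer_xml)
    title    = doc.at('title')
    abstract = doc.at('abstract')
    yield(title && title.text, abstract && abstract.text)
  end
end

each_abstract('abstract.xml') do |title, abstract|
  puts "#{title}: #{abstract}"
end
```

Each yielded pair is a plain title string and an already-clean plain-text abstract, which is the "title and some text" you are after, with no wikitext to strip.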

See also Wikipedia:Mirrors_and_forks for some discussion of the licensing requirements involved in reusing Wikipedia content.
