Extraction of specific fields from a thread in a forum
I am working on a data-mining project in which I need to analyse how a discussion progresses within a forum thread. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, join date, etc.), the text of the post, and so on.
However, when using standard scraping tools (such as Scrapy in Python), I need to write regular expressions to detect these fields in the page's HTML source. Because the tags vary with the type of forum, writing regular expressions for every forum is becoming a major problem. Is there a standard bank of such regular expressions that can be selected based on the type of forum?
Or is there any other technique for extracting these fields from a forum page?
I wrote configuration files for some of the major forum packages. Hopefully you can decipher them and infer how to parse the pages from them.
For VBulletin:
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
enclosed_section is the element that contains the links to all the threads.
thread is where you'll find the link to each thread.
list_next_page is the link to the next page of the thread list.
post is the div with the post text.
thread_next_page is the link to the next page of the thread.
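To give an idea of how lines like these could be read, here is a minimal sketch in Python. It is not the code I actually use: the function name parse_rule and the dictionary keys are made up for illustration, and it assumes that a leading "RE" in the attribute value means "match this as a regex prefix" rather than as an exact value.

```python
import re


def parse_rule(line):
    """Parse one 'name=key:value,key:value' config line into (name, rule).

    Hypothetical helper: the key names in the returned dict are assumptions.
    """
    name, _, spec = line.strip().partition("=")
    rule = {}
    # Split only on commas that are followed by a new "key:" pair.
    for part in re.split(r",(?=\w+:)", spec):
        key, _, value = part.partition(":")
        if key == "attributes":
            # "attributes:id;REthread_title_" -> attribute name and pattern
            attr, _, pattern = value.partition(";")
            # Assumption: an "RE" prefix marks the pattern as a regex.
            is_regex = pattern.startswith("RE")
            rule["attr"] = attr
            rule["pattern"] = pattern[2:] if is_regex else pattern
            rule["is_regex"] = is_regex
        else:
            # tag, type, inside_tag_attribute, reg_exp, ...
            rule[key] = value
    return name, rule
```

Calling parse_rule("thread=tag:a,attributes:id;REthread_title_") would give the name "thread" and a rule saying: look for a tags whose id starts with thread_title_.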
For Invision:
enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
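Once a rule is known, applying it to a page could look roughly like the sketch below. It uses only the standard library's html.parser; the class name RuleMatcher and the sample HTML are invented for the example, and a real forum page will need more care (nested quotes, signatures, malformed markup).

```python
import re
from html.parser import HTMLParser


class RuleMatcher(HTMLParser):
    """Collect the text inside every element matching a tag/attribute rule."""

    def __init__(self, tag, attr, pattern, is_regex=False):
        super().__init__()
        self.tag, self.attr = tag, attr
        self.pattern, self.is_regex = pattern, is_regex
        self.depth = 0      # nesting depth of self.tag inside the current match
        self.results = []   # collected text, one entry per matched element

    def _matches(self, attrs):
        value = dict(attrs).get(self.attr) or ""
        if self.is_regex:
            return re.match(self.pattern, value) is not None
        return value == self.pattern

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1          # same tag nested inside a match
        elif tag == self.tag and self._matches(attrs):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Made-up sample in the style of the VBulletin "post" rule.
sample = (
    '<div id="post_message_1">First post</div>'
    '<div id="sig">ignored</div>'
    '<div id="post_message_2">Second <b>post</b></div>'
)
matcher = RuleMatcher("div", "id", "post_message_", is_regex=True)
matcher.feed(sample)
matcher.close()
posts = [text.strip() for text in matcher.results]

# The reg_exp field of post_count applied to a made-up stats string.
reply_count = re.search(r"(\d+) Repl", "12 Replies").group(1)
```

The same matcher can be reused for every forum type; only the tag, attribute and pattern taken from the configuration change.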
You'll still have to write a separate approach for each type of forum. But as Henley suggests, a lot of forums share the same structure.
As for parsing the dates in forum threads, dateparser was born from this specific requirement and could be of great help.