Help (or advice) to get me started with lxml

I am trying to learn Python, and I actually feel that "Learn Python the Hard Way", "A Byte of Python", and "Head First Python" are really great books. However, now that I want to start a "real" project, lxml makes me feel like a complete git.

This is what I would like to do (objectives)

I am trying to parse a newspaper site's articles about politics.

The url is http://politiken.dk/politik/

The final project should

  • 1) each day (maybe each hour) visit the above URL
  • 2) for each relevant article, I want to save the URL to a database. The relevant articles are in a <div class="w460 section_forside sec-forside">. Some of the elements have images, some don't.

I would like to save the following:

  • a - the headline (<h1 class="top-art-header fs-26">)
  • b - the subheader (<p class="subheader-art">)
  • c - if the element has corresponding img, then the "alt" or "title" attribute

  • 3) visit each relevant URL, scrape the article's body and save it to the database.

  • 4) if a relevant URL is already in the database, skip it (the relevant articles as defined above are always the latest 10 published)

The desired result should be a database table with fields:

  • art.i) ID
  • art.ii) URL
  • art.iii) headline
  • art.iiii) subheader
  • art.iiiii) img alt
  • art.iiiiii) article body.
  • art.iiiiiii) date and time (a string located in <span class="date tr-upper m-top-2">)
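
A minimal sketch of that table, assuming SQLite through Python's built-in sqlite3 module. The table name `article`, the column names, the helper `save_if_new` and the example URL are all made up for illustration. Declaring the URL column UNIQUE would let the database itself take care of step 4:

```python
import sqlite3

# Assumption: an in-memory database for the demo; a file path in real use.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS article (
        id        INTEGER PRIMARY KEY,
        url       TEXT UNIQUE NOT NULL,
        headline  TEXT,
        subheader TEXT,
        img_alt   TEXT,
        body      TEXT,
        published TEXT  -- the date string, stored as scraped
    )
""")


def save_if_new(conn, url, headline, subheader=None, img_alt=None,
                body=None, published=None):
    """Insert the article unless its URL is already stored.

    Returns True if a row was inserted, False if the URL was known.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO article "
        "(url, headline, subheader, img_alt, body, published) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (url, headline, subheader, img_alt, body, published),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical URL, only to show the skip-if-known behaviour.
first = save_if_new(conn, "http://example.com/a1", "A headline")
second = save_if_new(conn, "http://example.com/a1", "A headline")
print(first, second)
```

The second call leaves the table untouched because the URL is already there.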

The above is what I would like help to accomplish. Since screen-scraping is not always benevolent, I would like to explain why I want to do this.

Basically I want to mine the data for occurrences of members of parliament or political parties. I will not republish the articles, sell the data or some such thing (I have not checked the legality of my approach, but hope and think it should be legal).

I imagine I have a table of politicians and a table of political parties.

For each politician I will have:

  • pol.i) ID
  • pol.ii) first_name
  • pol.iii) sur_name
  • pol.iiii) party

For each political party I will have:

  • party.i) ID
  • party.ii) correct-name
  • party.iii) calling-name
  • party.iiii) abbreviation

I want to do this for several Danish newspaper sites, and then analyse whether one newspaper gives preference to some politicians / parties, simply based on number of mentions.

This I will also need help to do - but one step at a time :-)

Later I would like to explore NLTK and the possibilities for sentiment mining.

I want to see if this could turn into a Ph.D. project in political science/journalism.

This is basically what I have (i.e. nothing)

I really have a hard time wrapping my head around lxml, the concept of elements, the different parsers etc. I have of course read the tutorials but I am still very much stuck.

import lxml.html

url = "http://politiken.dk/politik/"
root = lxml.html.parse(url).getroot()
# this should return all the relevant elements
# does not work:
#relevant = root.cssselect("div.w460 section_forside sec-forside") # the class has spaces in the name - but I can't seem to escape them?

# this will return the headlines of all the linked articles
artikler = root.cssselect("h1.top-art-header")

# narrowing down, we use the same call to get just the URLs of the articles that we have already retrieved
# these urls we will later mine, and subsequently skip
retrived_urls=[]
for a in root.cssselect("h1.top-art-header a"):
    retrived_urls.append(a)
# this works.

What I hope to get from the answers

First off - as long as you don't call me (very bad) names - I will continue to be happy.

  • But what I really hope is a simple to understand explanation of how lxml works. If I know what tools to use for the above tasks it would be so much easier for me to really "dive into lxml". Maybe because of my short attention span, I currently get disillusioned when reading stuff way above my level of understanding, when I am not even sure that I am looking in the right place.
  • If you could provide any example code that fits some of the tasks, that would be really great. I hope to turn this project into a ph.d. but I am sure this sort of thing must have been done a thousand times already? If so, it is my experience that learning from others is a great way to get smarter.
  • If you feel strongly that I should forget about lxml and use e.g. scrapy or html5lib then please say so :-) I started to look into html5lib because Drew Conway suggests it in a blog post about Python tools for the political scientist, but I couldn't find any introduction-level material. Also, lxml is what the good people at ScraperWiki recommend. As for scrapy, this might be the best solution, but I am afraid that scrapy is too much of a framework - as such really good if you know what you are doing and want to do it fast, but maybe not the best way to learn Python magic.
  • I plan on using a relational database, but if you think e.g. mongo would be an advantage, I will change my plans.
  • Since I can't install lxml for Python 3.1, I am using 2.6. If this is wrong - please say so also.

Timeframe

I have asked a bunch of beginner questions on StackOverflow. Too many to be proud of. But with more than a full-time job I never seem to be able to bury myself in code and just absorb the skillz I so long for. I hope this will be a question/answer that I can come back to regularly to update what I have learned and relearn what I have forgotten. This also means that this question will most likely remain active for quite some time. But I will comment on every answer that I might be lucky enough to receive, and I will continuously update the "what I have" section.

Currently I feel that I might have bitten off more than I can chew - so now it's back to "Head First Python" and "Learn Python the Hard Way".

Final words

If you have gotten this far - you are amazing - even if you don't answer the question. You have now read a lot of simple, confused, and stupid questions (I am proud of asking those questions, so don't argue). You should grab a coffee and a filterless smoke and congratulate yourself :-)

Happy holidays (in Denmark we celebrate Easter and currently the sun is shining like Samuel Jackson's wallet in Pulp Fiction)

Edits

It seems BeautifulSoup is a good choice. According to its developer, however, BeautifulSoup is not a good choice if I want to use Python 3. But as per this I would prefer Python 3 (not strongly though).

I have also discovered that there is an lxml chapter in "Dive Into Python 3". Will look into that as well.


This is a lot to read - perhaps you could break it up into smaller, specific questions.

Regarding lxml, here are some examples. The official documentation is also very good - take the time to work through the examples. And the mailing list is very active.

Regarding BeautifulSoup, lxml is more efficient and in my experience can handle broken HTML better than BeautifulSoup. The downside is that lxml relies on C libraries, so it can be harder to install.


lxml is definitely the tool of choice these days for HTML parsing.

There is an lxml cheat sheet with many of your answers here:

http://scraperwiki.com/docs/contrib/python_lxml_cheat_sheet/

That batch of code you wrote works as-is and it runs in a ScraperWiki edit window. http://scraperwiki.com/scrapers/andreas_stackoverflow_example/edit/
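
On the commented-out selector in that code: the spaces in class="w460 section_forside sec-forside" separate three class names, so in CSS they are chained with dots (div.w460.section_forside.sec-forside) rather than escaped. Another option is to match the whole attribute string exactly. Here is a rough sketch of the second approach. It assumes a tiny made-up stand-in for the real page, and it uses the standard library's xml.etree.ElementTree (whose element API lxml mirrors) so that it runs without lxml installed:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the front page markup.
page = ET.fromstring("""
<body>
  <div class="w460 section_forside sec-forside">
    <h1 class="top-art-header fs-26"><a href="/politik/a1/">First</a></h1>
  </div>
  <div class="other">
    <h1 class="top-art-header fs-26"><a href="/sport/b2/">Second</a></h1>
  </div>
</body>
""")

# Exact match on the full attribute value, spaces included.
relevant = page.findall(".//div[@class='w460 section_forside sec-forside']")

# Collect the href of every link inside the matching divs.
urls = [a.get("href") for div in relevant for a in div.findall(".//a")]
print(urls)
```

An exact match breaks if the site reorders or adds class names, which is why the dotted CSS form is usually the more forgiving choice.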

Normally a link is of the form: <a href="link">title</a>

After parsing by lxml, you can get at the link using: a.attrib.get("href") and the text using a.text

However, in this particular case the links are of the form: <a href="link"> <span> </span> title</a> so the value a.text represents only the characters between '<a href="link">' and that first '<span>'.
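
To make the text/tail split concrete, here is a small sketch on the link shape described above. It assumes the standard library's xml.etree.ElementTree as a stand-in, since lxml elements expose the same .text, .tail and .get() for this purpose:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a href="link"> <span> </span> title</a>')
span = a[0]  # the first (and only) child element

href = a.get("href")   # same result as a.attrib.get("href")
before_span = a.text   # only the characters before <span>
after_span = span.tail # the characters after </span>, i.e. the title

print(repr(href), repr(before_span), repr(after_span))
```

The visible title ends up in the tail of the span, not in the text of the link.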

But you can use the following code to flatten it down by recursing through the sub-elements (the <span> in this case):

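def flatten(el):
    # start with the text that comes before the first child element
    result = [ (el.text or "") ]
    for sel in el:
        # the child's own (recursively flattened) text ...
        result.append(flatten(sel))
        # ... followed by the text between this child and the next one
        result.append(sel.tail or "")
    return "".join(result)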
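
As a quick check of the function above, this sketch runs it on the link shape from this answer and compares it with the built-in itertext() method, which both ElementTree and lxml elements provide. It assumes the standard library's xml.etree.ElementTree so that it runs without lxml. Elements parsed with lxml.html additionally offer a text_content() method for the same job.

```python
import xml.etree.ElementTree as ET


def flatten(el):
    # Same function as above, repeated so the example is self-contained.
    result = [(el.text or "")]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)


a = ET.fromstring('<a href="link"> <span> </span> title</a>')

flat = flatten(a)
joined = "".join(a.itertext())  # built-in equivalent

print(repr(flat.strip()), flat == joined)
```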