Easy way to get data between tags of XML or HTML files in Python?
I am using Python and need to find and retrieve all character data between tags:
<tag>I need this stuff</tag>
I then want to output the found data to another file. I am just looking for a very easy and efficient way to do this.
If you can, please post a quick code snippet to show the ease of use, because I am having a bit of trouble understanding the parsers.
Without external modules, e.g.:
>>> myhtml = """ <tag>I need this stuff</tag>
... blah blah
... <tag>I need this stuff too
... </tag>
... blah blah """
>>> for item in myhtml.split("</tag>"):
...     if "<tag>" in item:
...         print(item[item.find("<tag>") + len("<tag>"):])
...
I need this stuff
I need this stuff too
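If the split-based approach above becomes fragile (attributes on the tag, nested tags, entities), the standard library's html.parser module can do the same job, still without external modules. The following is only a minimal Python 3 sketch; the names TagTextCollector and text_between are made up for illustration, and the tag name is assumed to be passed in lowercase because HTMLParser lowercases tag names.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the character data found inside every <tag_name> element."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name  # hypothetical parameter: lowercase tag to look for
        self.depth = 0            # how many matching tags are currently open
        self.chunks = []          # text pieces of the element being read
        self.results = []         # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost matching tag closed: store its text.
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # Only keep text that appears while a matching tag is open.
        if self.depth:
            self.chunks.append(data)


def text_between(markup, tag_name):
    collector = TagTextCollector(tag_name)
    collector.feed(markup)
    collector.close()
    return collector.results


myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

for text in text_between(myhtml, "tag"):
    print(text)
```

Unlike the split version, this strips the surrounding whitespace and ignores any text outside the tag.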
Beautiful Soup is a wonderful HTML/XML parser for Python:
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
- Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
- Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
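Since the question asks for a quick snippet, here is a rough sketch of what this could look like with Beautiful Soup. It assumes the current beautifulsoup4 package (imported as bs4) is installed and uses Python's built-in "html.parser" backend; the sample markup is the one from the question, not from any real page.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

myhtml = """<tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah"""

# Parse with the standard-library backend so no extra parser is needed.
soup = BeautifulSoup(myhtml, "html.parser")

# find_all() returns every <tag> element; get_text() returns its character data.
texts = [element.get_text().strip() for element in soup.find_all("tag")]

for text in texts:
    print(text)
```

The resulting list of strings can then be written to another file with an ordinary open(..., "w") call.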
I quite like parsing into an ElementTree and then using element.text and element.tail.
It also has XPath-like searching:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element html at b7d3f1ec>
>>> p = tree.find("body/p") # Finds first occurrence of tag p in body
>>> p
<Element p at 8416e0c>
>>> p.text
"Some text in the Paragraph"
>>> links = list(p.iter("a")) # Returns list of all links (getiterator() was removed in Python 3.9)
>>> links
[<Element a at b7d4f9ec>, <Element a at b7d4fb0c>]
>>> for i in links: # Iterates through all found links
...     i.attrib["target"] = "blank"
>>> tree.write("output.xhtml")
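To make the difference between element.text and element.tail concrete, here is a small self-contained Python 3 sketch. The markup is an invented example, not taken from index.xhtml: text is the character data before an element's first child, and tail is the character data that follows the element's closing tag.

```python
import xml.etree.ElementTree as ET

# Invented sample: a paragraph with one inline child element.
root = ET.fromstring("<body><p>Hello <b>bold</b> world</p></body>")

p = root.find("p")
b = p.find("b")

print(repr(p.text))  # text before the first child of <p>
print(repr(b.text))  # text inside <b>
print(repr(b.tail))  # text after </b>, up to the next tag

# itertext() walks text and tail in document order, so joining it
# yields all character data inside <p>.
full_text = "".join(p.itertext())
print(full_text)
```

For the original question, itertext() is usually the most convenient, because it gathers all character data inside a tag even when the tag contains child elements.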
This is how I am doing it:
(myhtml.split('<tag>')[1]).split('</tag>')[0]
Tell me if it worked!
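Note that the one-liner above returns only the text of the first <tag> and raises IndexError if the tag is missing. A possible extension of the same split idea that collects every occurrence might look like this; the function name all_between is made up for illustration.

```python
def all_between(text, open_tag, close_tag):
    """Return the text found between each open_tag/close_tag pair."""
    results = []
    # Everything before the first open_tag is skipped by the [1:] slice.
    for piece in text.split(open_tag)[1:]:
        if close_tag in piece:
            results.append(piece.split(close_tag)[0])
    return results


myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

found = all_between(myhtml, "<tag>", "</tag>")
print(found)
```

Like the original, this is plain string handling, so it will not cope with attributes on the opening tag or with nested tags.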
Use XPath and lxml:
from lxml import etree
pageInMemory = open("pageToParse.html", "r")
parsedPage = etree.HTML(pageInMemory.read())  # etree.HTML() expects a string, not a file object
yourListOfText = parsedPage.xpath("//tag//text()")
saveFile = open("savedFile", "w")
saveFile.writelines(yourListOfText)
pageInMemory.close()
saveFile.close()
This is faster than Beautiful Soup.
If you want to test your XPath expressions, I find Firefox's XPather extremely helpful.
Further Notes:
- lxml-an-underappreciated-web-scraping-library
- web-scraping-with-lxml
def value_tag(s):
    # Return the text between the first '>' and the next '<'
    i = s.index('>')
    s = s[i+1:]
    i = s.index('<')
    s = s[:i]
    return s