Extract strings in Python
Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file...
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....
I want something like if I do:-
data = foo("file.txt")
I get:-
data = ['AAA','BBB','CCC','DDD']
What is the best possible way? My file is not big...
Basically, I want to extract "remaining upload data transfer" from this file, which in HTML looks like THIS
You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that HTML is a mess: it's rarely well-formed, and that causes problems when you rely on it for data.
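For markup as regular as the sample in the question, a regex can still get the job done. Here is a minimal sketch, assuming every target sits in a font tag whose only attribute is class='textfont'; the names FONT_RE, extract_textfont and the path-taking foo are made up for illustration, and the pattern will silently miss anything that deviates from that shape:

```python
import re

# Assumption: each target looks like <font class='textfont'>...</font>,
# with single or double quotes around the class value and no other attributes.
FONT_RE = re.compile(
    r"<font\s+class=['\"]textfont['\"]\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)

def extract_textfont(html):
    """Return the stripped text of every matching <font> element."""
    return [text.strip() for text in FONT_RE.findall(html)]

def foo(path):
    """Read the file at `path` and extract the strings from it."""
    with open(path, encoding="utf-8") as fh:
        return extract_textfont(fh.read())

sample = """
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""
print(extract_textfont(sample))
# ['AAA', 'BBB', 'CCC', 'DDD']
```

If the attribute order changes, or a tag gains another attribute, the pattern stops matching, which is exactly the fragility described above.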
I would personally use BeautifulSoup. It does more than you're asking for, but at a fraction of the effort.
You want BeautifulSoup:
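from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"
with open("file.txt") as your_file:
    soup = BeautifulSoup(your_file, "html.parser")
# find() returns only the first match; find_all() returns every one
data = [font.get_text(strip=True) for font in soup.find_all("font", "textfont")]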
def foo():
    # Read the whole file into a single string
    with open("myfile.txt", 'r') as input_file:
        text = input_file.read()
    looking_for = ['AAA', 'BBB', 'CCC', 'DDD']
    have = []
    # Keep each expected string that occurs anywhere in the file
    for thing in looking_for:
        if thing in text:
            have.append(thing)
    return have
In a case like this, your options are to attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with f = open(), f.read(), and your own parser.
If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:
from html.parser import HTMLParser  # Python 3; in Python 2 this was "import HTMLParser"

class DataOnlyParser(HTMLParser):
    def parse(self, text):
        # Feed the whole document and return the collected text nodes
        self.result = []
        self.feed(text)
        self.close()
        return self.result

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace
        data = data.strip()
        if data:
            self.result.append(data)

p = DataOnlyParser()
data = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""
print(p.parse(data))
# ['AAA', 'BBB', 'CCC', 'DDD']
If your selection criteria are more complex, though, and/or the input is malformed, you'd probably be better off with a library like lxml.
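If the only extra criterion is "text inside a font tag with class textfont", the standard-library parser can still cope before you reach for a third-party library. A rough sketch, assuming reasonably well-formed input; TextFontParser and extract_textfont_cells are hypothetical names, not part of any library:

```python
from html.parser import HTMLParser

class TextFontParser(HTMLParser):
    """Collect text found inside <font class="textfont"> elements only."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._depth = 0  # greater than 0 while inside a matching <font>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs
        if tag == "font":
            if self._depth or dict(attrs).get("class") == "textfont":
                self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if self._depth and data:
            self.result.append(data)

def extract_textfont_cells(html):
    """Return the text of every <font class="textfont"> element in `html`."""
    parser = TextFontParser()
    parser.feed(html)
    parser.close()
    return parser.result

html = (
    "<p>skip me</p>"
    "<TD align=\"left\" class=texttd><font class='textfont'>AAA</font></TD>"
    "<font class=\"other\">not this</font>"
    "<TD align=\"left\" class=texttd><font class='textfont'>BBB</font></TD>"
)
print(extract_textfont_cells(html))
# ['AAA', 'BBB']
```

This only tracks nesting of font tags, so badly broken markup (an unclosed font tag, say) will throw it off; that is where a more forgiving library earns its keep.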
You do NOT want to use regular expressions to "parse" html. See here.