Best method of extracting text from multiple HTML files into one CSV file

After reading this forum, I am still not sure which method is best for extracting sections of data from HTML files into a CSV file, e.g. Python with Beautiful Soup or html2text. Because of the large number of files, I want to write a script that I can run from the Terminal.

Output: one CSV file, with lines of text and five columns of data, e.g. the first and last lines:

100 2010-12-20 145 ABC 04110000

1 2010-11-10 133 DDD 041123847

Thanks!


I would recommend using BeautifulSoup. Something like this will do (completely untested). Read the documentation for more.

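import glob
from BeautifulSoup import BeautifulSoup

csvfile = open('dump.csv', 'w')
for file in glob.glob('*.html'):
    print 'Processing', file
    soup = BeautifulSoup(open(file).read())
    for tr in soup.findAll('tr'):
        # Join the text nodes of each cell; joining the Tag objects themselves raises TypeError
        cells = [''.join(td.findAll(text=True)).strip() for td in tr.findAll('td')]
        print >>csvfile, u' '.join(cells).encode('utf-8')
csvfile.close()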


I don't know if Python natively supports XPath, but if it does, you should do some research on that subject.

Another alternative solution would be regular expressions.
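
On the XPath question: Python's standard library includes xml.etree.ElementTree, which supports a limited subset of XPath. It only parses well-formed XML, though, so it suits XHTML-style pages and will raise a ParseError on typical sloppy HTML. Below is a minimal sketch in Python 3, assuming well-formed markup; the sample table (built from the two example lines in the question) and the function name rows_from_xhtml are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; real pages may need a more forgiving parser.
SAMPLE = """<html><body><table>
<tr><td>100</td><td>2010-12-20</td><td>145</td><td>ABC</td><td>04110000</td></tr>
<tr><td>1</td><td>2010-11-10</td><td>133</td><td>DDD</td><td>041123847</td></tr>
</table></body></html>"""


def rows_from_xhtml(markup):
    """Return one list of cell strings per <tr>, using ElementTree's XPath subset."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall(".//tr"):
        # itertext() also gathers text from tags nested inside the cell
        rows.append(["".join(td.itertext()).strip() for td in tr.findall("td")])
    return rows


if __name__ == "__main__":
    for row in rows_from_xhtml(SAMPLE):
        print(" ".join(row))
```

For HTML that is not well formed, the third-party lxml package provides full XPath 1.0 through its xpath() method and is the more usual choice.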


I've modified my code to:

#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup
with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        print 'Processing', file
        soup = BeautifulSoup(open(file).read())
        rows = soup.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            #print >> csvfile,"#".join(col.string for col in cols)
            #print >> csvfile,"#".join(td.find(text=True))
            for col in cols:
                print >> csvfile, col.string
            print >> csvfile, "==="
        print >> csvfile, "***"

The code now pulls out the data with *** and === separators, which I then turn into a clean CSV file with Perl. For some reason it does not pull out all the required data: some fields are missing, e.g. the Address1 and Address 2 data, and the Date&Time and Number at the start of the table.
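
One plausible cause, assuming the missing cells contain nested markup such as <br> or <b>: in BeautifulSoup, col.string is None when a tag holds more than a single string, so those cells yield no usable text. Joining every text node in the cell (''.join(col.findAll(text=True)) in BeautifulSoup 3) avoids that. The sketch below shows the same idea in Python 3 using only the standard library (html.parser and csv) rather than BeautifulSoup, and writes real CSV directly so no Perl post-processing step is needed. The names TableText and html_to_rows and the sample markup are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the full text of every <td>, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def _flush_cell(self):
        # Close the open cell, if any, and store its whitespace-normalised text
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join(" ".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        # Close the open row, if any (also covers rows with unclosed cells)
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()
            self._row = []
        elif tag == "td" and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (<b>, <br>, ...) still lands in the open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush_cell()
        elif tag == "tr":
            self._flush_row()


def html_to_rows(markup):
    """Parse an HTML string and return a list of rows, each a list of cell texts."""
    parser = TableText()
    parser.feed(markup)
    parser.close()
    return parser.rows


# Hypothetical sample with nested tags inside the cells
SAMPLE = "<table><tr><td><b>100</b></td><td>1 Main St<br>Flat 2</td></tr></table>"

if __name__ == "__main__":
    out = io.StringIO()
    csv.writer(out).writerows(html_to_rows(SAMPLE))
    print(out.getvalue(), end="")
```

To process a whole directory, one could loop over glob.glob('*.html'), pass each file's contents to html_to_rows, and write all the rows through a single csv.writer opened on the output file. Text fragments within a cell are joined with a space here, which keeps "1 Main St" and "Flat 2" apart but would also split a value that is broken up by inline tags.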
