Parsing text files using Python

I am very new to Python and am looking to use it to parse a text file. The file has between 250 and 300 lines of the following format:

---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----

I need to store the following information in another file (Excel or text) for all the entries from this file:

UserName/ID  Previous Status New Status Date Time

So my result file should look like this for the above entries:

Mark Grey/mark.grey@gmail.com  Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com  NaN  Available 14/07/2010 16:32:39

Thanks in advance,

Any help would be really appreciated


To get you started:

import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)
with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            # Match attempt failed; skip this line
            continue

will get you an array of the parts of the match. I'd then suggest you use the csv module to store the results in a standard format.
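As a rough sketch of that csv step (Python 3), assuming `result` ends up holding six fields per row; the sample rows, the header names and the `write_rows` helper are illustrative assumptions, not part of the answer above:

```python
import csv
import io

# Hypothetical rows in the shape the loop above would build:
# name, email, previous status (None when absent), new status, date, time.
result = [
    ["Mark Grey", "mark.grey@gmail.com", "Busy", "Available", "14/07/2010", "16:32:36"],
    ["Silvia Pablo", "spablo@gmail.com", None, "Available", "14/07/2010", "16:32:39"],
]

def write_rows(rows, out):
    """Write a header plus the parsed rows to an open text file, using NaN for missing fields."""
    writer = csv.writer(out)
    writer.writerow(["UserName", "ID", "Previous Status", "New Status", "Date", "Time"])
    for row in rows:
        writer.writerow(["NaN" if field is None else field for field in row])

# Demonstrate with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_rows(result, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a real file, for example `with open("result.csv", "w", newline="") as out: write_rows(result, out)`.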


import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print("{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date))

This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.
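One way that error checking might look, as a minimal sketch in Python 3: the pattern is the one above, while the `parse_lines` helper, the sample lines and the idea of collecting rejected lines in a list are assumptions made for illustration.

```python
import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")

def parse_lines(lines):
    """Return (formatted rows, lines that did not match) instead of crashing on dirty input."""
    rows, rejected = [], []
    for line in lines:
        m = pat.match(line)
        if m is None:
            # pat.match() returns None for non-conforming lines; keep them for inspection.
            rejected.append(line)
            continue
        name, email, prev, curr, date = m.groups()
        rows.append("{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date))
    return rows, rejected

sample = [
    "---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----\n",
    "----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----\n",
    "this line is not a status entry\n",
]
rows, rejected = parse_lines(sample)
for row in rows:
    print(row)
```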


The two RE patterns of interest seem to be...:

p1 = r'^----\s+([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) @ (\S+) (\S+) ----$'
p2 = r'^----\s+([^(]+) \(([^)]+)\) became (\w+) @ (\S+) (\S+) ----$'

so I'd do:

import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)

r1 = re.compile(p1)
r2 = re.compile(p2)
data = []

with open('somefile.txt') as f:
    for line in f:
        # match against the compiled patterns (p1 and p2 are plain strings)
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            sys.stderr.write("No match for line: %r\n" % line)
            continue
        # groups() returns a tuple, so copy it to a list before inserting
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    # one header per captured field: name, email, previous, new, date, time
    w.writerow(['UserName', 'ID', 'Previous Status', 'New Status', 'Date', 'Time'])
    w.writerows(data)

If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatical ad hoc text processing.

Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).
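To make that last contrast concrete, here is a small sketch in Python 3; the sample strings are made up for illustration:

```python
import re

s = "Mark Grey became Available"

# For a fixed substring, both tests give the same answer,
# and the plain `in` operator is the simpler of the two.
found_with_re = re.search("Available", s) is not None
found_with_in = "Available" in s

# A regular expression becomes worthwhile once there is a pattern to describe,
# such as a dd/mm/yyyy date rather than one fixed string.
date_match = re.search(r"\d{2}/\d{2}/\d{4}", "became Available @ 14/07/2010 16:32:39")
print(found_with_re, found_with_in, date_match.group())
```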

But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend to all programmers to learn at least REs' basics.


Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:

from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR,RPAR,AT = map(Suppress,"()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR + 
            (statechange1 | statechange2) +
            AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d,mon,yr = map(int, tokens.date.split('/'))
    h,m,s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print("%(name)s/%(email)s  %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields)

prints:

Mark Grey/mark.grey@gmail.com  Busy       Available  2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com  NULL       Available  2010-07-14 16:32:39

pyparsing allows you to attach names to the results fields (just like the named groups in Tom Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed tokens. Note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss nor fuss.
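For readers not using pyparsing, the same date and time conversion can be sketched with the standard library alone (Python 3). This swaps in `datetime.strptime` for the manual split-and-`int` approach in `convertFields`; the helper name `to_datetime` is an assumption for illustration.

```python
from datetime import datetime

def to_datetime(date_text, time_text):
    """Combine a dd/mm/yyyy date and an hh:mm:ss time into one datetime object."""
    return datetime.strptime(date_text + " " + time_text, "%d/%m/%Y %H:%M:%S")

stamp = to_datetime("14/07/2010", "16:32:36")
print(stamp)  # 2010-07-14 16:32:36
```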

Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print(fields.dump())

prints:

['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available

I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.


If I were to approach this problem, I'd probably start by splitting each entry into its own separate string. The input looks line oriented, so an inputfile.split('\n') is probably adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
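A minimal sketch of that idea in Python 3, assuming only the two status-change forms shown in the question; the names `PREFIX`, `SUFFIX`, `PATTERNS` and `parse_entry` are made up for illustration:

```python
import re

text = """---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----"""

# Shared start and end of every entry; only the middle differs per status change.
PREFIX = r"-+\s+(?P<name>.+?)\s+\((?P<email>[^)]+)\)\s+"
SUFFIX = r"\s+@\s+(?P<date>\S+)\s+(?P<time>\S+)\s+-+\s*$"

# One regular expression per kind of status change, with named subgroups.
PATTERNS = [
    re.compile(PREFIX + r"changed status from (?P<previous>\w+) to (?P<new>\w+)" + SUFFIX),
    re.compile(PREFIX + r"became (?P<new>\w+)" + SUFFIX),
]

def parse_entry(entry):
    """Try each pattern in turn; return a dict of fields, or None if nothing matches."""
    for pattern in PATTERNS:
        m = pattern.match(entry)
        if m:
            fields = m.groupdict()
            fields.setdefault("previous", "NaN")  # "became" entries have no previous status
            return fields
    return None

# Split the input into one string per entry, skipping blank lines.
entries = [parse_entry(e) for e in text.split("\n") if e.strip()]
for fields in entries:
    print(fields)
```

Supporting a new kind of status change would then mean appending one more pattern to `PATTERNS`.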


Thanks very much for all your comments. They were very useful. I wrote my code using the directory functionality. It reads through the file and creates an output file for each user with all of his status updates. The code is pasted below.

#Script to extract info from individual data files and print out a data file combining info from these files

import os

dataFileDir="data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};

#User id  to user name mapping to check for duplicate names
usrId2Name={};

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):

    userName="";
    for i in range(mailInd-1,0,-1):

        if info[i].endswith("-") or info[i].endswith("+"):
            break;

        userName=info[i]+" "+userName;

    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");

    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];

    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):

    print line;
    info=line.split(" ");

    mailInd=-1;userId="-NONE-";
    userName="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd=-1;
    statusMsg="-NONE-";

    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd=i;
        if(info[i]=="@"):
            timeStartInd=i;

        if(info[i]=="became"):
            becameInd=i;

    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""

    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);

    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);

    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";

    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;

    #Store status messages keyed by user email ids
    timeDict={};

    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];

    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):

        if not(userId in infoDict.keys()):
            infoDict[userId]={};

        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;


#Print for each user a file containing status
def printStatusFiles(dataFileDir):


    volNum=0;

    for userName in usrName2Id:
        volNum=volNum+1;

        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");

        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");

        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");

        #Close the file so the output is flushed to disk
        file.close();


#Read and store data from individual data files
def readDataFiles(dataFileDir):

    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
    #for i in range(0,1):

        file=files[i];

        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue

        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();

        #Process lines
        for line in lines:

            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();

            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;


            extractLineInfo(line);


print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");
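The per-user grouping that the script above builds by hand could also be sketched more compactly with a regular expression and `collections.defaultdict` (Python 3). This is only an illustration under the assumption that every line has the form shown in the question, with an email address in parentheses; it does not handle the lines without an id that the script also covers, and the names `LINE` and `group_by_user` are made up.

```python
import re
from collections import defaultdict

# name, (email), free-text status message, then "@ date time" between dashes
LINE = re.compile(
    r"-+\s+(?P<name>.+?)\s+\((?P<email>[^)]+)\)\s+"
    r"(?P<status>.+?)\s+@\s+(?P<stamp>\S+\s+\S+)\s+-+\s*$"
)

def group_by_user(lines):
    """Map each email address to a list of (timestamp, status message) pairs."""
    updates = defaultdict(list)
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            updates[m.group("email")].append((m.group("stamp"), m.group("status")))
    return updates

sample = [
    "---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----\n",
    "----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----\n",
]
updates = group_by_user(sample)
for email, events in updates.items():
    print(email, events)
```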