Using Python to remove incomplete line from the end of a JSON formatted log file

I have some JSON formatted log files that I am copying to S3 so I can run Hive queries on them using Elastic Map Reduce. The script I use to copy the log files to S3 is written in Python.

Every once in a while I encounter a file with an incomplete line, typically at the end of the file. This causes any Hive queries that need that file to fail. I've been manually fixing the files by removing the bad line, but I'd like to integrate this step into my Python script to prevent these failures.

Here's an example of the type of file I'm working with:

{"logLine":{"browserName":"FireFox","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"}}
{"logLine":{"browserName":"Pre","userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko; Google Web Preview) Chrome/11.0.696 Safari/534.24"}}
{"logLine":{"browserName":"Internet Explorer","userAgent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1

In that case I want to remove the last line since it's incomplete. I know it's incomplete because it's missing the end of line character(s), and also because it's not valid JSON due to the missing end quote and curly braces.

Is there an easy way to identify and remove that line from the file using Python?


Python has a json module in its standard library. Its parser raises an exception (ValueError) if the input isn't valid JSON. To check the last line, you could do something like:

import json

with open('log.txt') as file:
    lines = file.readlines()

try:
    json.loads(lines[-1])  # only the last line is checked
except IndexError:
    pass  # empty file, nothing to remove
except ValueError:
    # The last line is not valid JSON: rewrite the file without it.
    with open('log.txt', 'w') as file:
        file.writelines(lines[:-1])
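
The question also notes that the bad line is missing its end-of-line character. As a minimal sketch that relies only on that observation (and assumes every complete record ends with a newline), the file can be cut back to its last newline without parsing any JSON. The function name truncate_to_last_newline is hypothetical, and this version still reads the whole file into memory:

```python
def truncate_to_last_newline(path):
    """Cut off any bytes after the final newline; return how many were removed."""
    with open(path, 'rb+') as f:
        data = f.read()
        # rfind returns -1 when there is no newline at all, so keep becomes 0
        # and the whole (single, unterminated) line is removed.
        keep = data.rfind(b'\n') + 1
        f.truncate(keep)
    return len(data) - keep
```

This does not catch a line that is terminated by a newline but is still invalid JSON, so it complements the json.loads check rather than replacing it.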


I would use the example below, which drops every line that fails to parse, not just the last one. Note that it loads the whole file into memory, so if the file is big you may prefer to process it line by line.

import json
with open('log.txt') as file:
    lines = file.readlines()

towrite = ''
for line in lines:
    try:
        towrite += json.dumps(json.loads(line)) + '\n'
    except ValueError:
        pass
with open('log.txt', 'w') as file:
    file.write(towrite)
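
For big files, the line-by-line variant mentioned above could look roughly like the sketch below. It is an illustration, not a tested production script: the function name drop_invalid_lines is hypothetical, it writes to a temporary file in the same directory and then swaps it in with os.replace, and unlike the code above it keeps each valid line as originally written instead of re-serializing it. Blank lines also fail to parse, so they are dropped too.

```python
import json
import os
import tempfile

def drop_invalid_lines(path):
    """Rewrite path keeping only lines that parse as JSON; return the number dropped."""
    dropped = 0
    # Put the temp file next to the original so the final rename stays on one filesystem.
    dir_name = os.path.dirname(os.path.abspath(path))
    with open(path) as src, \
            tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False) as dst:
        for line in src:
            try:
                json.loads(line)
            except ValueError:
                dropped += 1
                continue
            # Make sure every kept line ends with a newline.
            dst.write(line if line.endswith('\n') else line + '\n')
    os.replace(dst.name, path)
    return dropped
```

Only one line is held in memory at a time, at the cost of needing temporary disk space for a second copy of the file.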


You can grab each line and pass it through a filter function.

This function would be something like:

def isLineComplete(line):
    # readlines() keeps the trailing newline, so strip whitespace before checking
    return line.rstrip().endswith("}")

Overview:

myFile = ...

cleanLines = filter(isLineComplete, myFile.readlines())
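
One caveat with a closing-brace check is that a line cut off right after an inner object still ends with "}". As a hedged sketch, the same filter pattern can use a stricter predicate that actually parses the line; the name parses_as_json_object and the sample data are made up for illustration:

```python
import json

def parses_as_json_object(line):
    """Return True only if the whole line parses as a JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:
        return False

sample = [
    '{"logLine":{"browserName":"FireFox"}}\n',
    '{"logLine":{"browserName":"Pre"}\n',  # truncated, yet still ends with "}"
    '{"logLine":{"browserName":"Internet Explorer"\n',
]
# filter() returns an iterator in Python 3, so wrap it in list() to get a list.
clean_lines = list(filter(parses_as_json_object, sample))
```

Here only the first sample line survives the filter.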


Assuming that you can isolate lines, here is how you would check:

import json

try:
    json.loads('{"logLine":{"browserName":"Internet Explorer","userAgent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1')
except ValueError:
    pass  # code to remove the line from the file goes here


You could use json.loads to try to parse every line and ignore the ones that raise an exception:

import json

lines = """{"logLine":{"browserName":"FireFox"}}
{"logLine":{"browserName":"Pre"}}
{"logLine":{"browserName":"Internet Explorer"
"""
cleaned = []
for line in lines.splitlines():
    try:
        json.loads(line)
    except ValueError:
        continue  # skip lines that are not valid JSON
    cleaned.append(line)
print(cleaned)