JSON to fixed width file

I have to extract data from a JSON file based on a specific key. The data then has to be filtered (by that key's value) and written to separate fixed-width flat files. I have to develop the solution using shell scripting.

Since the data is just key:value pairs, I can extract them by processing each line of the JSON file, checking the type, and writing the values to the corresponding fixed-width file.

My problem is that the input JSON file is approximately 5GB in size. My method is very basic, and I would like to know whether there is a better way to achieve this using shell scripting.

A sample JSON file looks like this:

{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}

The above is a sample of the kind of data I need to process.


Give this a try:

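#!/usr/bin/awk -f
{
    line = ""
    file = "missing_type"               # fallback when a line has no Type field
    gsub(/[{}"]/, "", $0)               # strip braces and double quotes
    f = split($0, a, "[:,]")            # split on colons and commas
    for (i = 1; i <= f; i++)
        if (a[i] == "Type")
            file = a[++i]               # the Type value selects the output file
        else
            line = line sprintf("%-15s", a[i])   # pad each item to 15 columns
    print line > (file ".fixed.out")
}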

I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
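One cheap way to see whether the data breaks these assumptions is to count the pieces that the same split produces. A line made only of plain key:value pairs gives an even count, so an odd count points to a stray colon or comma. The sketch below is only a rough check: the function name `report_odd_lines` is hypothetical, and it misses lines where two stray delimiters cancel each other out.

```shell
#!/bin/sh
# Hypothetical helper: print the line number and text of every input line
# whose naive split yields an odd number of pieces.
report_odd_lines() {
    awk '
    {
        s = $0
        gsub(/[{}"]/, "", s)          # same stripping as the script above
        n = split(s, a, "[:,]")       # same delimiters as the script above
        if (n % 2 == 1)               # names and values should pair up
            print NR ": " $0
    }'
}
```

It could be run as `report_odd_lines < input_file` before the real conversion; any output means the naive approach is not safe for that input.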

This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.

The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.

Otherwise, both the field names and values are output. This can be changed.

The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.

Let me know how close this comes to what you're looking for and I can make some adjustments.
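The adjustments mentioned above could be sketched as follows. This variant writes only the values (no field names) and takes the column width as a parameter. The function name `json_to_fixed`, the `width` variable, and the `missing_type` fallback are illustrative assumptions, and the same restriction applies: no colons, commas, quotes or braces inside names or values.

```shell
#!/bin/sh
# Hypothetical variant: values only, column width given as the first argument
# (default 15). Reads JSON lines on stdin, writes <Type>.fixed.out files.
json_to_fixed() {
    awk -v width="${1:-15}" '
    BEGIN { fmt = "%-" width "s" }        # e.g. "%-15s": left-justify and pad
    {
        line = ""
        file = "missing_type"             # assumed fallback name
        gsub(/[{}"]/, "")
        n = split($0, a, "[:,]")
        # odd positions hold names, even positions hold values
        for (i = 1; i < n; i += 2) {
            if (a[i] == "Type")
                file = a[i + 1]
            else
                line = line sprintf(fmt, a[i + 1])
        }
        print line > (file ".fixed.out")
    }'
}
```

It would be called as `json_to_fixed 20 < input_file`. Note that `%-15s` pads short values but does not truncate long ones; if every column must be exactly the given width, a format such as `%-15.15s` cuts longer values off, at the cost of losing data.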


perl script

#!/usr/bin/perl -w
use strict;
use warnings;

no strict 'refs'; # for FileCache
use FileCache; # avoid exceeding system's maximum number of file descriptors
use JSON;

my $type;
my $json = JSON->new->utf8(1); #NOTE: expect utf-8 strings

while(my $line = <>) { # for each input line
    # extract type
    eval { $type = $json->decode($line)->{Type} };
    $type = 'json_decode_error' if $@;
    $type ||= 'missing_type';

    # print to the appropriate file
    my $fh = cacheout '>>', "$type.out";
    print $fh $line; #NOTE: use cache if there are too many hdd seeks
}

corresponding shell script

#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly

__extract_type()
{
    perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}

__process_input()
{
    # IFS= and -r keep leading whitespace and JSON backslash escapes intact
    while IFS= read -r line; do # for each input line
        # extract type
        local type="$(__extract_type "$line" 2>/dev/null ||
            echo json_decode_error)"
        [ -z "$type" ] && type=missing_type

        # print to the appropriate file
        printf '%s\n' "$line" >> "$type.out"
    done
}

__process_input

Example:

$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out
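
The shell version above starts one perl process per input line, which is likely to be very slow for an input of around 5GB. A single-pass alternative in awk might look like the sketch below. It copies each raw line to `<Type>.out` without a JSON parser, so it assumes that `"Type":"..."` appears exactly in that form (no whitespace around the colon, no escaped quotes in the value). The function name `split_by_type` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical single-pass splitter: one awk process for the whole input.
split_by_type() {
    awk '
    {
        type = ""
        # "Type":" is 8 characters long; the match also includes the closing quote
        if (match($0, /"Type":"[^"]*"/))
            type = substr($0, RSTART + 8, RLENGTH - 9)
        if (type == "")
            type = "missing_type"         # same fallback name as the scripts above
        print > (type ".out")
    }'
}
```

Usage would be `split_by_type < input_file`. Each distinct type keeps one output file open, so an input with very many types may run into the open-file limit of some awk implementations.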