开发者

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested array's returned by Psycopg2. The DB I'm working on returns records that can have nested array's as valu开发者_JAVA技巧e. Psycopg only parses the outer array of such values.

My first approach was splitting the string on comma's, but then I ran into the problem that sometimes a string within the result also contains comma's, which renders the entire approach unusable. My next attempt was using regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).

Currently, this is my code:

import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()

The result of this should be:

['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']

Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.

Am I using a wrong approach? Or can anyone point me in the right direction?

Thanks in advance!


Python's native lib should do a good work. Have you tried it already?

http://docs.python.org/library/csv.html


From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.


If you can do ASSERTIONS, this will get you on the right track.

This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match. But your intented result requires sub-processing after the match. For that reason, its better to write a simpler global parser, then itterate over the results for validation and fixup (yes, you have fixup stipulated in your example).

The two main parsing regex's are these:

  1. strips delimeter quote too and only $2 contains data, use in a while loop, global context
    /(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/

  2. my preferred one, does not strip quotes, only captures $1, can use to capture in an array or in a while loop, global context
    /(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/

This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)

use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
         (?!}$)           # ASSERT ahead:  NOT a } plus end
         (?:^{?|,)        # Boundry: Start of string plus { OR comma
         \s*              # 0 or more whitespace
         ( ".*?" | .*?)   # Capture "Quoted" or non quoted data
         \s*              # 0 or more whitespace
         (?=,|}$)         # Boundry ASSERT ahead:  Comma OR } plus end
  /x;

my ($newstring, $sucess) = ('[', 0);

for my $field ($str =~ /$rx/g)
{
   my $tmp = $field;
   $sucess = 1;

   if (  $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
      $tmp = "'$tmp'";
   }
   $newstring .= "$tmp,";
}
if ( $sucess ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}

Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'6 2fe6393-00f7-418d-b0b3-7116f6d5cf10']


It seemed that the CSV approach was the easiest to implement:

def parsePsycopgSQLArray(input):
    import csv
    import cStringIO

    input = input.strip("{")
    input = input.strip("}")

    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')   

    return reader.next() #There can only be one row 

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}' 
    result = parsePsycopgSQLArray(text)
    print result

Thanks for the responses, they were most helpfull!


Improved upon Dirk's answer. This handles escape characters better as well as the empty array case. One less strip call as well:

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python

    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜