Python: Regex question / CSV parsing / Psycopg nested arrays

2023-02-10 22:17

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.

My first approach was splitting the string on commas, but then I ran into the problem that a string within the result can itself contain commas, which renders the entire approach unusable. My next attempt was using a regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers reliably (since numbers can also occur within strings).

Currently, this is my code:

import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()

The result of this should be:

['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']

Since I would like this functionality to be generic, I cannot be certain of the order of the arguments. I only know that the supported types are strings, UUIDs, (signed) integers and (signed) decimals.

Am I using a wrong approach? Or can anyone point me in the right direction?

Thanks in advance!

Python's built-in csv library should do a good job here. Have you tried it already?

http://docs.python.org/library/csv.html

From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.

If your regex engine supports assertions (lookahead and lookbehind), this will get you on the right track.

This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match, but your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, your example does stipulate fixup).

The two main parsing regexes are these:

Strips the delimiter quotes too, and only $2 contains data; use in a while loop, global context:
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
My preferred one: does not strip quotes and only captures $1; can be used to capture into an array or in a while loop, global context:
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/

This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)

use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
         (?!}$)           # ASSERT ahead:  NOT a } plus end
         (?:^{?|,)        # Boundary: Start of string plus { OR comma
         \s*              # 0 or more whitespace
         ( ".*?" | .*?)   # Capture "Quoted" or non quoted data
         \s*              # 0 or more whitespace
         (?=,|}$)         # Boundary ASSERT ahead:  Comma OR } plus end
  /x;

my ($newstring, $success) = ('[', 0);

for my $field ($str =~ /$rx/g)
{
   my $tmp = $field;
   $success = 1;

   # Quote the field if it was a quoted string or looks like a UUID
   if (  $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
      $tmp = "'$tmp'";
   }
   $newstring .= "$tmp,";
}
if ( $success ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}

Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
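Since the question is about Python, the same two-step idea (tokenize with a simple global regex, then fix up each field) could be ported roughly as follows. This is only a sketch: the name split_fields is made up for illustration, the braces are escaped to be explicit, and it assumes a flat array whose quoted fields contain no embedded double quotes.

```python
import re

# Python port of the "preferred" pattern above (quotes are kept in the capture).
FIELD_RX = re.compile(r'(?!\}$)(?:^\{?|,)\s*(".*?"|.*?)\s*(?=,|\}$)')

def split_fields(text):
    """Hypothetical helper: return the fields of a flat array literal as strings."""
    fields = FIELD_RX.findall(text)
    # Fixup step: drop the delimiting quotes from quoted fields.
    return [f[1:-1] if len(f) >= 2 and f[0] == f[-1] == '"' else f
            for f in fields]

text = ('{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",'
        '398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}')
print(split_fields(text))
```

Every field comes back as a string here; converting numbers is a separate fixup step.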

It seemed that the CSV approach was the easiest to implement:

def parsePsycopgSQLArray(value):
    import csv
    import io

    # Remove the braces that surround the array literal
    value = value.strip("{")
    value = value.strip("}")

    buffer = io.StringIO(value)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')

    return next(reader)  # There can only be one row

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print(result)

Thanks for the responses, they were most helpful!
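Note that the csv approach returns every field as a string, while the result asked for in the question has real integers and decimals. One possible way to close that gap is sketched below. The names convert_field and parse_flat_array are hypothetical, and because csv discards the quotes, a quoted string such as "123" would also be turned into a number here.

```python
import csv
import io
import re

INT_RX = re.compile(r'[+-]?\d+')        # signed integer
DEC_RX = re.compile(r'[+-]?\d+\.\d+')   # signed decimal

def convert_field(field):
    """Hypothetical helper: int or float where the text matches, else a string."""
    field = field.strip()
    if INT_RX.fullmatch(field):
        return int(field)
    if DEC_RX.fullmatch(field):
        return float(field)
    return field  # UUIDs and ordinary strings stay strings

def parse_flat_array(text):
    """Parse a flat '{...}' array literal and convert its numeric fields."""
    body = text.strip("{}")
    if not body:
        return []
    row = next(csv.reader(io.StringIO(body), delimiter=',', quotechar='"'))
    return [convert_field(f) for f in row]

text = ('{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",'
        '398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}')
print(parse_flat_array(text))
```

If decimals need to stay exact, decimal.Decimal could be used instead of float.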

Improved upon Dirk's answer. This handles escape characters better as well as the empty array case. One less strip call as well:

import csv
from io import StringIO

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python

    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return next(reader)
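The csv-based functions above only handle a flat array. For arrays that really are nested, such as '{{1,2},{3,4}}', a small hand-written parser may be easier to reason about than a regex. The following is a minimal sketch, assuming the input follows PostgreSQL's usual array text format (braces, double-quoted elements with backslash escapes, unquoted NULL). The function names are made up for illustration and are not part of Psycopg.

```python
def parse_pg_array(text):
    """Hypothetical helper: parse a possibly nested array literal into lists."""
    items, _ = _parse_array(text.strip(), 0)
    return items

def _parse_array(s, pos):
    """Parse one '{...}' group starting at s[pos]; return (list, next position)."""
    if pos >= len(s) or s[pos] != '{':
        raise ValueError("expected '{' at position %d" % pos)
    pos += 1
    items = []
    while pos < len(s):
        ch = s[pos]
        if ch == '}':
            return items, pos + 1
        if ch == ',' or ch.isspace():
            pos += 1                      # separators and padding
        elif ch == '{':
            sub, pos = _parse_array(s, pos)
            items.append(sub)
        elif ch == '"':
            value, pos = _parse_quoted(s, pos)
            items.append(value)
        else:
            # Unquoted token: runs up to the next ',' or '}'
            end = pos
            while end < len(s) and s[end] not in ',}':
                end += 1
            token = s[pos:end].strip()
            items.append(None if token.upper() == 'NULL' else token)
            pos = end
    raise ValueError("unbalanced braces")

def _parse_quoted(s, pos):
    """Parse a double-quoted element at s[pos], resolving backslash escapes."""
    pos += 1                              # skip the opening quote
    chars = []
    while pos < len(s):
        ch = s[pos]
        if ch == '\\' and pos + 1 < len(s):
            chars.append(s[pos + 1])      # keep the escaped character as-is
            pos += 2
        elif ch == '"':
            return ''.join(chars), pos + 1
        else:
            chars.append(ch)
            pos += 1
    raise ValueError("unterminated quoted element")

print(parse_pg_array('{{1,2},{3,4}}'))
print(parse_pg_array(r'{a,"b, \"c\"",{x,NULL}}'))
```

Elements are returned as strings (or None for NULL), so any conversion to numbers would still have to be applied afterwards.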
