Parsing YAML-like text file into hash structure

Posted 2023-02-03 02:25

I've got the text file:

country = { 
    tag = ENG 
    ai = { 
        flags = { } 
        combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD } 
        continent = { "Oceania" } 
        area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" } 
        region = { "North America" "Carribean" "India" } 
        war = 50 
        ferocity = no 
    }
    date = { year = 0 month = january day = 0 } 
}

What I'm trying to do is parse this text into a Perl hash structure, so that the Data::Dumper output looks like this:

$VAR1 = {
          'country' => {
                         'ai' => {
                                   'area' => [
                                               'America',
                                               'Maine',
                                               'Georgia',
                                               'Newfoundland',
                                               'Cuba',
                                               'Bengal',
                                               'Carnatic',
                                               'Ceylon',
                                               'Tanganyika',
                                               'The Mascarenes',
                                               'The Cape',
                                               'Gold',
                                               'St Helena',
                                               'Guiana',
                                               'Falklands',
                                               'Bermuda',
                                               'Oregon'
                                             ],
                                   'combat' => [
                                                 'ROY',
                                                 'WLS',
                                                 'PUR',
                                                 'SCO',
                                                 'EIR',
                                                 'FRA',
                                                 'DEL',
                                                 'USA',
                                                 'QUE',
                                                 'BGL',
                                                 'MAH',
                                                 'MOG',
                                                 'VIJ',
                                                 'MYS',
                                                 'DLH',
                                                 'GUJ',
                                                 'ORI',
                                                 'JAI',
                                                 'ASS',
                                                 'MLC',
                                                 'MYA',
                                                 'ARK',
                                                 'PEG',
                                                 'TAU',
                                                 'HYD'
                                               ],
                                   'continent' => [
                                                    'Oceania'
                                                  ],
                                   'ferocity' => 'no',
                                   'flags' => [],
                                   'region' => [
                                                 'North America',
                                                 'Carribean',
                                                 'India'
                                               ],
                                   'war' => 50
                                 },
                         'date' => {
                                     'day' => 0,
                                     'month' => 'january',
                                     'year' => 0
                                   },
                         'tag' => 'ENG'
                       }
        };

Hardcoded version might look like this:

#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;

my $ret;

$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my @qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \@qw; 
$ret->{'country'}->{ai}->{continent} =  ["Oceania"];
$ret->{'country'}->{ai}->{area} =  ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;

sub hash_sort {
  my ($hash) = @_;
  return [ (sort keys %$hash) ];
}

# Sortkeys expects a code reference; \hash_sort would call the sub instead.
$Data::Dumper::Sortkeys = \&hash_sort;

print Dumper($ret);

I have to admit I have a huge problem dealing with the nested curly brackets. I've tried to solve it with greedy and non-greedy matching, but that didn't do the trick. I've also read about extended patterns (like (?PARNO)), but I have absolutely no clue how to use them for my particular problem. The order of the data is irrelevant, since I have the hash_sort subroutine. I'd appreciate any help.

I broke it down to some simple assumptions:

An entry consists of an identifier followed by an equals sign.
An entry is one of three basic types: a level, a set, or a single value.
A set has three forms: 1) a quoted, space-separated list; 2) key-value pairs; 3) a qw-like unquoted list.
A set of key-value pairs must use an identifier for each key and either a run of non-space characters or a quoted string for each value.

See the interspersed comments.

use strict;
use warnings;

my $simple_value_RE
    = qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
    ;
my $set_or_level_RE
    = qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
    ;
my $quoted_set_RE
    = qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
    ;
my $associative_RE
    = qr/^ \s* 
        (?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
        \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) 
        \s* $
    /x
    ;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;

sub get_level { 
    my $handle = shift;
    my %level;
    while ( <$handle> ) {
        # if the first character on the line is a close, then we're done
        # at this level
        last if  m/^\s*[}]/; 
        my ( $key, $value );

        # get simple values
        if (( $key, $value ) =  m/$simple_value_RE/ ) { 
            # done.
        }
        elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
            if ( defined $complete_set ) {
                if ( $complete_set =~ m/$quoted_set_RE/ ) { 
                    # Pull all quoted values with global flag
                    $value = [ $complete_set =~ m/"([^"]+)"/g ];
                }
                elsif ( $complete_set =~ m/$associative_RE/ ) { 
                    # Build a hashref. With the global flag, $pair_RE
                    # returns its two captures (key, value) for every
                    # pair, so the flat list can initialize the hash
                    # directly.
                    $value = { $complete_set =~ m/$pair_RE/g };
                }
                else {
                    # qw-like
                    $value = [ split( ' ', $complete_set ) ];
                }
            }
            else { 
                $value = get_level( $handle );
            }
        }
        # blank or unrecognized lines leave $key undefined; skip them
        $level{ $key } = $value if defined $key;
    }
    return wantarray ? %level : \%level;
}

# The input is expected in the script's __DATA__ section.
my %base = get_level( \*DATA );

Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.

But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.
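To give a feel for what such a grammar might look like, here is a rough sketch that uses only the named recursive subpatterns built into Perl 5.10 regexes (the same family of features as (?PARNO)). It is not Regexp::Grammars or Parse::RecDescent syntax, the rule names are made up for illustration, and it only recognizes the format; it does not build the hash. It also assumes quoted strings never contain a double quote.

```perl
use strict;
use warnings;
use v5.10;    # named recursion and (?(DEFINE)...) need Perl 5.10 or later

# A block is a hash when it holds at least one "word = value" pair,
# otherwise a (possibly empty) list of scalars.
my $document_RE = qr/
    \A \s* (?&pairs) \z

    (?(DEFINE)
        (?<pairs>  (?: (?&pair) )*+ )
        (?<pair>   (?&word) \s* = \s* (?&value) \s* )
        (?<value>  (?&hash) | (?&list) | (?&scalar) )
        (?<hash>   \{ \s* (?: (?&pair) )++ \} )
        (?<list>   \{ \s* (?: (?&scalar) \s* )*+ \} )
        (?<scalar> "[^"]*+" | (?&word) )
        (?<word>   [^\s{}="]++ )
    )
/x;

for my $text ( 'date = { year = 0 month = january day = 0 }',
               'ai = { flags = { }' ) {
    print $text =~ $document_RE ? "ok:  $text\n" : "bad: $text\n";
}
```

The first string is accepted and the second is rejected because its outer brace is never closed. The hash/list distinction the answer warns about shows up as the ordering of the alternatives in the value rule.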

The contents look pretty regular. Why not perform some substitutions on the content to convert it to Perl hash syntax, then eval it? That would be a quick and dirty way to convert it.
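A minimal sketch of that substitute-and-eval idea follows. The helper names are hypothetical, and it rests on several assumptions: the input is trusted (string eval runs whatever the text turns into), keys match \w+, and quoted strings contain no braces, equals signs, or double quotes. Brace groups without an equals sign are treated as lists, everything else as hashes.

```perl
use strict;
use warnings;

# Quote a string as a single-quoted Perl literal.
sub perl_quote {
    my ($string) = @_;
    $string =~ s/([\\'])/\\$1/g;
    return "'$string'";
}

# key = value  ->  'key' => 'value',
sub quote_pair {
    my ( $key, $value ) = @_;
    $value =~ s/^"(.*)"$/$1/s;    # drop surrounding double quotes
    return perl_quote($key) . ' => ' . perl_quote($value) . ',';
}

# body of a brace group without '='  ->  ['a', 'b'],
sub quote_list {
    my ($body) = @_;
    my @items = $body =~ m/(?| "([^"]*)" | ([^\s"]+) )/gx;
    return '[' . join( ', ', map { perl_quote($_) } @items ) . '],';
}

sub text_to_hash {
    my ($text) = @_;
    # 1. scalar pairs, 2. keys that open a block, 3. lists, 4. commas
    $text =~ s/(\w+) \s* = \s* ( "[^"]*" | [^\s{}]+ )/quote_pair($1, $2)/gex;
    $text =~ s/(\w+) \s* = \s* \{/perl_quote($1) . ' => {'/gex;
    $text =~ s/\{ ( [^{}=]* ) \}/quote_list($1)/gex;
    $text =~ s/\}/},/g;
    my $data = eval( '+{' . $text . '}' );
    die "converted text did not evaluate: $@" unless ref($data) eq 'HASH';
    return $data;
}

my $data = text_to_hash('ai = { war = 50 region = { "North America" "India" } }');
print $data->{ai}{region}[0], "\n";    # prints "North America"
```

All scalars come back as strings, so war is '50' rather than the number 50. An empty block such as flags = { } becomes an empty array reference, which matches the desired dump.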

You can also write a parser, assuming you know the grammar.
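As a sketch of what a hand-written parser could look like, the following tokenizes the text and then parses it by recursive descent using only core Perl. The sub names are made up for this example, and the grammar is inferred from the sample file: a braced block whose second token is an equals sign is a hash, any other block (including an empty one) is a list, and quoted strings are assumed not to contain double quotes.

```perl
use strict;
use warnings;

# Split the input into tokens: braces, '=', quoted strings, bare words.
sub tokenize {
    my ($text) = @_;
    return $text =~ m/ [{}=] | "[^"]*" | [^\s{}="]+ /gx;
}

# Strip the quotes from a quoted token; leave bare words alone.
sub scalar_value {
    my ($token) = @_;
    return $token =~ m/^"(.*)"$/s ? $1 : $token;
}

# Parse "key = value" pairs up to a closing brace or end of input.
sub parse_pairs {
    my ($tokens) = @_;
    my %hash;
    while ( @$tokens && $tokens->[0] ne '}' ) {
        my $key = shift @$tokens;
        my $eq  = shift @$tokens;
        die "expected '=' after '$key'\n" unless defined $eq && $eq eq '=';
        $hash{$key} = parse_value($tokens);
    }
    return \%hash;
}

# Parse one value: a scalar, a hash block, or a list block.
sub parse_value {
    my ($tokens) = @_;
    my $token = shift @$tokens;
    die "unexpected end of input\n" unless defined $token;
    return scalar_value($token) unless $token eq '{';

    my $result;
    if ( @$tokens > 1 && $tokens->[0] ne '}' && $tokens->[1] eq '=' ) {
        $result = parse_pairs($tokens);
    }
    else {
        my @list;
        push @list, parse_value($tokens)
            while @$tokens && $tokens->[0] ne '}';
        $result = \@list;
    }
    my $close = shift @$tokens;
    die "missing '}'\n" unless defined $close && $close eq '}';
    return $result;
}

sub parse_text {
    my ($text) = @_;
    my @tokens = tokenize($text);
    my $data   = parse_pairs( \@tokens );
    die "unexpected '}'\n" if @tokens;
    return $data;
}

my $data = parse_text('date = { year = 0 month = january day = 0 }');
print $data->{date}{month}, "\n";    # prints "january"
```

Because the tokenizer ignores line breaks, this handles blocks written on one line and blocks spread over several lines the same way. Passing the whole file contents to parse_text and handing the result to Dumper should give a structure shaped like the one in the question, with all scalars as strings.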
