Split line with perl

I have a long line of credits that is missing a few commas:

rendező: Joe Carnahan forgatókönyvíró: Brian Bloom, Michael Brandt, Skip Woods zeneszerző: Alan Silvestri operatőr: Mauro Fiore producer: Stephen J. Cannell, Jules Daly, Ridley Scott szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes) Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy) szinkronhang: Gáti Oszkár (John 'Hannibal' (Smith magyar hangja)) Rajkai Zoltán (Templeton 'Faceman' Peck magyar hangja)

Because of this, I cannot split the line on commas alone:

@credits = split /, */, $line;

I want to split after each comma and, where there is no comma between credits, split after the first credit. For example:

rendező: Joe Carnahan
forgatókönyvíró: Brian Bloom
Michael Brandt
Skip Woods
zeneszerző: Alan Silvestri
operatőr: Mauro Fiore
producer: Stephen J. Cannell
Jules Daly
Ridley Scott
szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes)
Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)
szinkronhang: Gáti Oszkár (John 'Hannibal' (Smith magyar hangja))
Rajkai Zoltán (Templeton 'Faceman' Peck magyar hangja)

Thanks


So you can split by a comma-space in most cases, but otherwise by a space character preceded by a right parenthesis. This would be:

/, |(?<=\)) /

Or, perhaps (?) more clearly:

/,[[:space:]]|(?<=\))[[:space:]]/

The pipe character makes the pattern an alternation: it matches whatever is on either side of it. Beyond splitting the names, you also need to parse out the roles, and the entire string is full of non-ASCII characters.
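As a quick check of the split pattern, here is a minimal sketch that applies it to two name lists taken from the question. The variable names ($with_commas, $without_commas, @a, @b) are only illustrative.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# One name list with commas, and one where the separators are missing.
my $with_commas    = "Brian Bloom, Michael Brandt, Skip Woods";
my $without_commas = "Liam Neeson (John 'Hannibal' Smith ezredes) "
                   . "Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)";

# Split on comma-space, or on a space that directly follows a ")".
my @a = split /, |(?<=\)) /, $with_commas;
my @b = split /, |(?<=\)) /, $without_commas;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @a, @b;
```

The lookbehind (?<=\)) only checks that a ")" precedes the space; it does not consume it, so the closing parenthesis stays attached to the name.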

Script:

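use strict;
use warnings;
use utf8;                 # the source contains UTF-8 string literals
use Data::Dump 'dump';

my $big_string = q/rendező: ... hangja)/;
# Outer split: cut the line before each "title:"; map then handles one credit at a time.
my @credits = map {
    # Separate the title from its list of names.
    my ($title, $names) = /([[:alpha:]()]+): (.+)/;
    # Split names on comma-space, or on whitespace that follows a ")".
    my @names = split /,[[:space:]]|(?<=\))[[:space:]]/, $names;
    # The hashref is the value of the block, so map collects it.
    my $credit = { $title => \@names };
} split / (?=[[:alpha:]()]+:)/, $big_string;
binmode STDOUT, ':utf8';  # encode wide characters on output
print dump \@credits;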

Output:

[
  { rendező => ["Joe Carnahan"] },
  {
    forgatókönyvíró => ["Brian Bloom", "Michael Brandt", "Skip Woods"],
  },
  { zeneszerző => ["Alan Silvestri"] },
  { operatőr => ["Mauro Fiore"] },
  {
    producer => ["Stephen J. Cannell", "Jules Daly", "Ridley Scott"],
  },
  {
    "szerepl\x{151}(k)" => [
      "Liam Neeson (John 'Hannibal' Smith ezredes)",
      "Bradley Cooper (Templeton 'Sz\xE9pfi\xFA' Peck hadnagy)",
    ],
  },
  {
    szinkronhang => [
      "G\xE1ti Oszk\xE1r (John 'Hannibal' (Smith magyar hangja))",
      "Rajkai Zolt\xE1n (Templeton 'Faceman' Peck magyar hangja)",
    ],
  },
]

Notes:

  • An array of hashrefs is used to preserve the order of the list.
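  • The utf8 pragma tells Perl that the source file is UTF-8 encoded, so the accented letters in the string literal are treated as characters and the [:alpha:] construct matches them.
  • Given Perl >= v5.10, the utf8::all module (from CPAN) can replace utf8 and also removes the need to call binmode prior to output.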
  • Lookarounds ((?=), (?<=), etc.) can be tricky; see perlre and this guide for good information on them.
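If the goal is the line-per-name layout from the question rather than a data structure, the same two patterns can be reused. The following is a minimal sketch under that assumption; it uses a shortened sample of the credits line, and the names $line and @lines are hypothetical.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Shortened sample of the credits line from the question.
my $line = "rendező: Joe Carnahan forgatókönyvíró: Brian Bloom, Michael Brandt, "
         . "Skip Woods szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes) "
         . "Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)";

my @lines;
# Cut the line at each space that is followed by a "title:".
for my $credit (split / (?=[[:alpha:]()]+:)/, $line) {
    # Skip anything that does not look like "title: names".
    my ($title, $names) = $credit =~ /^([[:alpha:]()]+): (.+)$/ or next;
    my @names = split /,[[:space:]]|(?<=\))[[:space:]]/, $names;
    # The first name stays on the title line; the rest get a line each.
    push @lines, "$title: " . shift(@names);
    push @lines, @names;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

Collecting the lines in an array first keeps the parsing separate from the output, which makes it easy to join them or write them elsewhere.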


I think you can try to set up a regular expression. You can substitute any ' word:' with '\nword:', and in the same way you can substitute ', ' with ',\n'.

For an introduction to regular expressions, check this page: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

The two rules should be something similar to the following (note that this needs the substitution operator s///, not tr///, which only maps single characters):

(my $newstr = $str) =~ s/ ([\w()]+:)/\n$1/g;
$newstr =~ s/, /,\n/g;

it's just a guess... not really aware of Perl syntax
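For comparison, here is a hedged sketch of this substitution idea as a complete script. It assumes a shortened sample of the credits line, drops the commas to match the layout asked for in the question, and adds a third rule (not part of this answer) for the names that are separated only by a closing parenthesis. The names $str and $newstr are illustrative.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Shortened sample of the credits line from the question.
my $str = "rendező: Joe Carnahan forgatókönyvíró: Brian Bloom, Michael Brandt "
        . "szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes) "
        . "Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)";

# Copy $str, then edit the copy in place.
(my $newstr = $str) =~ s/ (?=[[:alpha:]()]+:)/\n/g;  # newline before each "title:"
$newstr =~ s/, /\n/g;                                # newline instead of ", "
$newstr =~ s/(?<=\)) /\n/g;                          # newline after ")" where the comma is missing

binmode STDOUT, ':encoding(UTF-8)';
print "$newstr\n";
```

This gives plain text only; if the roles and names are needed separately afterwards, splitting into a data structure is the more convenient route.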
