How can I split a line when some fields contain spaces?

2023-01-20 23:14 Asked by:

I have a text file that I extracted from a PDF file. It's arranged in a tabular format; this is part of it:

 DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A- 2 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B 4 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B+ 2 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B- 1 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 WU 1 

 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 

 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 

 2007/09 1 FUENTES TANIA DACSB 06500 002 A 3 

 2007/09 1 FUENTES TANIA DACSB 06500 002 A- 8 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B 5 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B- 1 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 C 1 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 C+ 1 

 2007/09 1 LIGGINS FREDER DACSB 06500 003 A 1

The first line holds the column names, and the remaining lines are the data. There are 8 columns that I want to extract. At first it seemed very easy: call split(/\s+/, ...) on each line I read. But then I noticed that some lines contain additional spaces, for example:

2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1

Sometimes the data for a certain column is optional, as you can see.

The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".

A regex would describe this record:

my $record_exp
    = qr{ ^ \s*
          (\d{4}/\d{2}) # yyyy/mm date
          \s+
          (\d+)         # any number of digits
          \s+
          (\S+ \s \S+) # non-space cluster, single space, non-space cluster
          \s+
          # same as the last, but optional; the separating spaces are inside
          # the optional group so that the next rule starts right after it
          (?:(\S+ \s \S+)\s+)?
          # a cluster of alpha, single space, cluster of digits
          (\p{Alpha}+ \s \d+)
          \s+    # one or more spaces
          (\S+)  # one or more non-space characters
          \s+    # ditto..
          (\S+)  
          \s+    
          (\S+)  
        }x;

Which makes the loop a lot easier:

while ( <$input> ) { 
    # skip the header, blank lines, and anything else that does not match
    my @fields = m{$record_exp} or next;
    # ... list of semantic actions here...
}
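As a rough, self-contained sketch of the same idea, the pattern can be written with named captures (Perl 5.10+) so each field is addressed by name rather than by position. The sub name parse_record and the hash keys are illustrative choices, not part of the original answer, and the pattern still assumes every professor name is exactly two tokens.

```perl
use 5.010;
use strict;
use warnings;

# Same shape as the pattern above, plus named captures and an end anchor.
# Assumption: each professor name is exactly two tokens (LAST FIRST).
my $record_re = qr{
    ^ \s*
    (?<date>   \d{4}/\d{2} )         \s+
    (?<sess>   \d+ )                 \s+
    (?<prof1>  \S+ \s+ \S+ )         \s+
    (?: (?<prof2> \S+ \s+ \S+ )      \s+ )?
    (?<course> \p{Alpha}+ \s+ \d+ )  \s+
    (?<sec>    \S+ )                 \s+
    (?<grade>  \S+ )                 \s+
    (?<count>  \d+ )
    \s* $
}x;

# Returns a hash reference of fields, or nothing if the line does not match
# (header line, blank line, unexpected layout).
sub parse_record {
    my ($line) = @_;
    return unless $line =~ $record_re;
    my %rec = %+;            # %+ holds only the named groups that matched
    $rec{prof2} //= '';      # normalise a missing second professor
    return \%rec;
}

my $rec = parse_record(' 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 ');
print "$rec->{prof1} | $rec->{prof2} | $rec->{course}\n" if $rec;
```

Because the optional group is only kept when the rest of the pattern still matches, a one-professor line leaves prof2 empty instead of swallowing the course.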

But you could also store it into structures, knowing that the only variable part of the data is the profs:

use strict;
use warnings;
# $input is assumed to be an already-opened filehandle
my @records;
<$input>; # discard the header line
while ( <$input> ) { 
    my @fields         = split; # split on white-space
    next unless @fields;        # skip blank lines
    my $record         = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs}   = [ join( ' ', splice( @fields, 0, 2 )) ];
    while ( @fields > 5 ) { 
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
    # splice in scalar context returns only the last element removed,
    # so join the two course tokens explicitly
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}

I believe it is ambiguous:

If PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or three spaces?

You probably can't even tell yourself, and if you can it's because you can tell the difference between a first-name and a surname.

If you're on Linux/Unix, try running a PDF-to-text converter such as pdftotext on the PDF; it might give you better results.

It looks to me like the first four whitespace-separated fields and the last five are always present, and the 5th and 6th fields (prof2) are optional.

So split the line as you were attempting and pull off the first four and last five elements from the resulting array; whatever remains is your optional prof2 (the 5th and 6th fields).

If, however, either the prof1 or the prof2 entry can be missing, you're stuck: your file format is ambiguous.
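A minimal sketch of that approach, assuming exactly four leading tokens (date, session, and a two-token prof1) and five trailing tokens (two-token course, section, grade, count). The sub name split_record and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Take four tokens off the front and five off the back;
# whatever remains in the middle is the optional prof2.
sub split_record {
    my ($line) = @_;
    my @tokens = split ' ', $line;   # ' ' also discards leading whitespace
    return if @tokens < 9;           # blank, header (8 tokens) or malformed line

    my ( $date, $sess, @prof1 ) = splice @tokens, 0, 4;
    my ( $code, $num, $sec, $grade, $count ) = splice @tokens, -5;

    return {
        date   => $date,
        sess   => $sess,
        prof1  => "@prof1",
        prof2  => "@tokens",         # empty string when there is no prof2
        course => "$code $num",
        sec    => $sec,
        grade  => $grade,
        count  => $count,
    };
}

my $rec = split_record(' 2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3 ');
print "$rec->{prof1} / $rec->{course} / $rec->{grade}\n" if $rec;
```

In the sample data the header line has only eight tokens, so the length check happens to skip it as well; that is a property of this particular file, not something to rely on in general.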

There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
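One way this chunk-by-chunk pruning might look: strip the fixed head with one substitution, strip the fixed tail with another, and treat the leftover middle as the professor text. The sub name peel_record is hypothetical, and the tail pattern assumes the course is always an alphabetic code followed by digits.

```perl
use strict;
use warnings;

sub peel_record {
    my ($line) = @_;
    my %rec;

    # 1. Peel the fixed head: date and session.
    $line =~ s{^\s*(\d{4}/\d{2})\s+(\d+)\s+}{} or return;
    @rec{qw(date sess)} = ( $1, $2 );

    # 2. Peel the fixed tail: course, section, grade, count.
    $line =~ s{\s+(\p{Alpha}+\s+\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$}{} or return;
    @rec{qw(course sec grade count)} = ( $1, $2, $3, $4 );

    # 3. Whatever is left in the middle is the professor text, spaces and all.
    $rec{profs} = $line;
    return \%rec;
}

my $rec = peel_record(' 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 ');
print "$rec->{profs}\n" if $rec;
```

Each substitution only has to describe one predictable part of the line, which keeps the individual patterns short.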

I would probably still use split(), but then access the data thusly:

my @values = split ' ', $string;   # ' ' ignores leading whitespace
my $date = $values[0];
my $sess = $values[1];
my $count = $values[-1];
my $grade = $values[-2];
my $sec = $values[-3];
my $course = join ' ', @values[-5, -4];   # course is two tokens, e.g. "DACSB 06500"
my @profs = @values[2..($#values-5)];

With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).
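If individual names are needed, the @profs tokens could then be grouped into pairs. This is only a sketch under the assumption that every name is exactly two tokens; the helper name pair_profs is made up for illustration. An odd token count (for example a surname like "VAN DYKE") cannot be paired reliably, so the helper reports it rather than guessing.

```perl
use strict;
use warnings;

# Group prof tokens into "LAST FIRST" pairs.
# Returns an array reference of names, or undef when the token count is odd.
sub pair_profs {
    my @tokens = @_;
    return undef if @tokens % 2;     # odd count: the line is ambiguous
    my @names;
    while (@tokens) {
        push @names, join( ' ', splice( @tokens, 0, 2 ) );
    }
    return \@names;
}

my $names = pair_profs(qw(NOOB ADRIENNE JOSH ROGER));
print join( ', ', @$names ), "\n" if $names;
```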
