
Multiple multi-line regex matches in Bash

I'm trying to do some fairly simple string parsing in a bash script. Basically, I have a file that is made up of multiple multi-line fields. Each field is surrounded by a known header and footer.

I want to extract each field separately into an array or similar, like this:

>FILE=$(cat file)
>REGEX="@#@#@#[\s\S]+?@#@#@#"
> 
>if [[ $FILE =~ $REGEX ]]; then
>    echo "$BASH_REMATCH"
>fi

FILE:

@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#

Now I'm pretty sure the problem is that bash's =~ operator uses POSIX extended regular expressions, which support neither the \s and \S classes nor the non-greedy +? quantifier.

I can match this with "pcregrep -M", but of course the whole file is going to match. Can I get one match at a time from pcregrep?

I'm not opposed to using some inline perl or similar.


If you have gawk:

awk 'BEGIN{ RS="@#*#" }
NF{
    gsub("\n"," ") #remove this if you want to retain new lines
    print "-->"$0 
    # put to array
    arr[++d]=$0
} ' file

output

$ ./shell.sh
--> this is field one
--> this is field two they can be any number of lines
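If you want the fields in an actual bash array rather than just printed, the awk output can be captured with mapfile (bash 4+). This sketch uses a portable flag-based awk instead of gawk's regex RS, so it is an assumption that any POSIX awk is available; the data file is recreated from the question.

```shell
# Recreate the sample data from the question
cat > file <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

# One field per output line: start collecting after a header line
# (@#@#@# followed by more #), emit on the exact footer line.
mapfile -t fields < <(awk '
    /^@#@#@##+$/ { f=1; s=""; next }
    /^@#@#@#$/   { f=0; print s; next }
    f { s = s (s == "" ? "" : " ") $0 }
' file)

echo "${fields[0]}"   # this is field one
echo "${fields[1]}"   # this is field two they can be any number of lines
```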


The TXR language performs whole-document multi-line matching, binds variables, and (with the -B "dump bindings" option) emits properly escaped shell variable assignments that can be eval-ed. Arrays are supported.

The @ character is special so it has to be doubled up to match literally.

$ cat fields.txr
@(collect)
@@#@@#@@#################################
@  (collect)
@field
@  (until)
@@#@@#@@#
@  (end)
@  (cat field)@# <- catenate the fields together with a space separator by default
@(end)

$ txr -B fields.txr data
field[0]="this is field one"
field[1]="this is field two they can be any number of lines"

$ eval $(txr -B fields.txr data)
$ echo ${field[0]}
this is field one
$ echo ${field[1]}
this is field two they can be any number of lines

The @field syntax matches an entire line. These are collected into a list since it is inside a @(collect), and the lists are collected into lists-of-lists because that is nested inside another @(collect). The inner @(cat field) however, reduces the inner lists to a single string, so we end up with a list of strings.

This is "classic TXR": how it was originally designed and used, sparked by the idea:

Why don't we make here-documents work backwards and do parsing from reams of text into variables?

This implicit emission of matched variables, in shell syntax by default, continues to be supported behavior even though the language has grown much more powerful, so there is less need to integrate with shell scripts.


I would build something around awk. Here is a first proof of concept:

awk '
    BEGIN{ f=0; fi="" }
    /^@#@#@#################################$/{ f=1 }   # header: arm the collector
    /^@#@#@#$/{ f=0; print "Field:" fi; fi="" }         # footer: emit the field
    { if(f==2) fi = fi "-" $0; if(f==1) f++ }           # f is still 1 on the header line itself, so it is skipped
' file
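A quick check of this proof of concept, recreating the question's file (note the header regex must contain exactly the same run of # characters as the header line):

```shell
# Recreate the sample data from the question
cat > file <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

awk '
    BEGIN{ f=0; fi="" }
    /^@#@#@#################################$/{ f=1 }
    /^@#@#@#$/{ f=0; print "Field:" fi; fi="" }
    { if(f==2) fi = fi "-" $0; if(f==1) f++ }
' file
# prints:
# Field:-this is field one
# Field:-this is field two-they can be any number of lines
```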


begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0

while read -r line
do
    case $line in
        $begin)
            flag=1;;
        $end)
            ((i++))
            flag=0;;
        *)
            if [[ $flag == 1 ]]
            then
                array[i]+="$line"$'\n'    # retain the newline
            fi;;
    esac
done < datafile

If you want to keep the marker lines in the array elements, move the assignment statement (with its flag test) to the top of the while loop before the case.
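To see the loop in action, here is a self-contained run that recreates the question's data as datafile (an assumption about the exact layout) and then prints every collected element; it uses i=$((i+1)) instead of ((i++)) so the script also behaves under set -e.

```shell
# Recreate the sample data from the question
cat > datafile <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0

while read -r line
do
    case $line in
        $begin) flag=1;;
        $end)   i=$((i+1)); flag=0;;
        *)  if [[ $flag == 1 ]]; then
                array[i]+="$line"$'\n'    # retain the newline
            fi;;
    esac
done < datafile

# Print every collected field with its index
for idx in "${!array[@]}"
do
    printf 'field %s:\n%s' "$idx" "${array[idx]}"
done
```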
