
Multiple multi-line regex matches in Bash

I'm trying to do some fairly simple string parsing in a bash script. Basically, I have a file that is made up of multiple multi-line fields. Each field is surrounded by a known header and footer.

I want to extract each field separately into an array or similar, like this:

>FILE=$(cat file)
>REGEX="@#@#@#[\s\S]+?@#@#@#"
> 
>if [[ $FILE =~ $REGEX ]]; then
>    echo "$BASH_REMATCH"
>fi

FILE:

@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#

Now I'm pretty sure the problem is that bash's =~ operator uses POSIX extended regular expressions, which support neither the \s and \S classes nor the non-greedy +? quantifier.

I can match this with "pcregrep -M", but of course the whole file is going to match. Can I get one match at a time from pcregrep?

I'm not opposed to using some inline perl or similar.


If you have gawk:

awk 'BEGIN{ RS="@#*#" }
NF{
    gsub("\n"," ") #remove this if you want to retain new lines
    print "-->"$0 
    # put to array
    arr[++d]=$0
} ' file

output

$ ./shell.sh
--> this is field one
--> this is field two they can be any number of lines
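If you want the fields in an actual bash array rather than just printed, the awk output can be captured with mapfile (bash 4+). This sketch uses a portable flag-based awk instead of gawk's regex RS, so it is an assumption that any POSIX awk is available; the data file is recreated from the question.

```shell
# Recreate the sample data from the question
cat > file <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

# One field per output line: start collecting after a header line
# (@#@#@# followed by more #), emit on the exact footer line.
mapfile -t fields < <(awk '
    /^@#@#@##+$/ { f=1; s=""; next }
    /^@#@#@#$/   { f=0; print s; next }
    f { s = s (s == "" ? "" : " ") $0 }
' file)

echo "${fields[0]}"   # this is field one
echo "${fields[1]}"   # this is field two they can be any number of lines
```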


The TXR language performs whole-document multi-line matching, binds variables, and (with the -B "dump bindings" option) emits properly escaped shell variable assignments that can be eval-ed. Arrays are supported.

The @ character is special so it has to be doubled up to match literally.

$ cat fields.txr
@(collect)
@@#@@#@@#################################
@  (collect)
@field
@  (until)
@@#@@#@@#
@  (end)
@  (cat field)@# <- catenate the fields together with a space separator by default
@(end)

$ txr -B fields.txr data
field[0]="this is field one"
field[1]="this is field two they can be any number of lines"

$ eval $(txr -B fields.txr data)
$ echo ${field[0]}
this is field one
$ echo ${field[1]}
this is field two they can be any number of lines

The @field syntax matches an entire line. These are collected into a list since it is inside a @(collect), and the lists are collected into lists-of-lists because that is nested inside another @(collect). The inner @(cat field) however, reduces the inner lists to a single string, so we end up with a list of strings.

This is "classic TXR": how it was originally designed and used, sparked by the idea:

Why don't we make here-documents work backwards and do parsing from reams of text into variables?

This implicit emission of matched variables, in shell syntax by default, continues to be supported behavior even though the language has grown much more powerful, so there is less need to integrate with shell scripts.


I would build something around awk. Here is a first proof of concept:

awk '
    BEGIN{ f=0; fi="" }
    /^@#@#@#################################$/{ f=1 }   # header: arm the collector
    /^@#@#@#$/{ f=0; print "Field:" fi; fi="" }         # footer: emit the field
    { if(f==2) fi = fi "-" $0; if(f==1) f++ }           # f is still 1 on the header line itself, so it is skipped
' file
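A quick check of this proof of concept, recreating the question's file (note the header regex must contain exactly the same run of # characters as the header line):

```shell
# Recreate the sample data from the question
cat > file <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

awk '
    BEGIN{ f=0; fi="" }
    /^@#@#@#################################$/{ f=1 }
    /^@#@#@#$/{ f=0; print "Field:" fi; fi="" }
    { if(f==2) fi = fi "-" $0; if(f==1) f++ }
' file
# prints:
# Field:-this is field one
# Field:-this is field two-they can be any number of lines
```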


begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0

while read -r line
do
    case $line in
        $begin)
            flag=1;;
        $end)
            ((i++))
            flag=0;;
        *)
            if [[ $flag == 1 ]]
            then
                array[i]+="$line"$'\n'    # retain the newline
            fi;;
    esac
done < datafile

If you want to keep the marker lines in the array elements, move the assignment statement (with its flag test) to the top of the while loop before the case.
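To see the loop in action, here is a self-contained run that recreates the question's data as datafile (an assumption about the exact layout) and then prints every collected element; it uses i=$((i+1)) instead of ((i++)) so the script also behaves under set -e.

```shell
# Recreate the sample data from the question
cat > datafile <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0

while read -r line
do
    case $line in
        $begin) flag=1;;
        $end)   i=$((i+1)); flag=0;;
        *)  if [[ $flag == 1 ]]; then
                array[i]+="$line"$'\n'    # retain the newline
            fi;;
    esac
done < datafile

# Print every collected field with its index
for idx in "${!array[@]}"
do
    printf 'field %s:\n%s' "$idx" "${array[idx]}"
done
```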
