Multiple multi-line regex matches in Bash
I'm trying to do some fairly simple string parsing in a bash script. Basically, I have a file that consists of multiple multi-line fields. Each field is surrounded by a known header and footer.
I want to extract each field separately into an array or similar, like this:
>FILE=`cat file`
>REGEX="@#@#@#[\s\S]+?@#@#@"
>
>if [[ $FILE =~ $REGEX ]]; then
> echo $BASH_REMATCH
>fi
FILE:
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
Now I'm pretty sure the problem is that bash doesn't match newlines with the "."
I can match this with "pcregrep -M", but of course the whole file is going to match. Can I get one match at a time from pcregrep?
I'm not opposed to using some inline perl or similar.
If you have gawk:
awk 'BEGIN{ RS="@#*#" }
NF{
gsub("\n"," ") # remove this if you want to retain newlines
print "-->"$0
# put to array
arr[++d]=$0
} ' file
Output:
$ ./shell.sh
--> this is field one
--> this is field two they can be any number of lines
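To get those records into a bash array rather than an awk one, something like the following should work. It is only a sketch: it assumes bash 4 or later (for mapfile), an awk that treats a multi-character RS as a regular expression (gawk does), and that joining each field onto one line is acceptable. The names datafile and fields are made up for the example, and the extra gsub that trims the leading and trailing space is my addition.

```shell
#!/usr/bin/env bash
# Sample data in the question's format, written to a temporary file
# (datafile is a hypothetical name used only for this sketch).
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#
EOF

# awk prints one line per field; mapfile stores each line as an array element.
mapfile -t fields < <(
  awk 'BEGIN { RS = "@#*#" }
       NF {
         gsub("\n", " ")        # join the lines of one field
         gsub(/^ +| +$/, "")    # trim the spaces left by the outer newlines
         print
       }' "$datafile"
)
rm -f "$datafile"

printf 'number of fields: %s\n' "${#fields[@]}"
printf '[%s]\n' "${fields[@]}"
```

If the newlines inside a field have to survive, a line-per-element approach like this one will not do, and a read loop in bash is probably the simpler route.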
The TXR language performs whole-document multi-line matching, binds variables, and (with the -B "dump bindings" option) emits properly escaped shell variable assignments that can be eval-ed. Arrays are supported.
The @ character is special, so it has to be doubled up to match literally.
$ cat fields.txr
@(collect)
@@#@@#@@#################################
@ (collect)
@field
@ (until)
@@#@@#@@#
@ (end)
@ (cat field)@# <- catenate the fields together with a space separator by default
@(end)
$ txr -B fields.txr data
field[0]="this is field one"
field[1]="this is field two they can be any number of lines"
$ eval $(txr -B fields.txr data)
$ echo ${field[0]}
this is field one
$ echo ${field[1]}
this is field two they can be any number of lines
The @field syntax matches an entire line. These are collected into a list since it is inside a @(collect), and the lists are collected into lists-of-lists because that is nested inside another @(collect). The inner @(cat field), however, reduces the inner lists to a single string, so we end up with a list of strings.
This is "classic TXR": how it was originally designed and used, sparked by the idea:
Why don't we make here-documents work backwards and do parsing from reams of text into variables?
This implicit emission of matched variables by default, in the shell syntax by default, continues to be a supported behavior even though the language has grown much more powerful, so there is less of a need to integrate with shell scripts.
I would build something around awk. Here is a first proof of concept:
awk '
BEGIN{ f=0; fi="" }
/^@#@#@#################################$/{ f=1 }
/^@#@#@#$/{ f=0; print"Field:"fi; fi="" }
{ if(f==2)fi=fi"-"$0; if(f==1)f++ }
' file
begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0
while read -r line
do
case $line in
$begin)
flag=1;;
$end)
((i++))
flag=0;;
*)
if [[ $flag == 1 ]]
then
array[i]+="$line"$'\n' # retain the newline
fi;;
esac
done < datafile
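If you want to keep the marker lines in the array elements, move the assignment statement (with its flag test) to the top of the while loop, before the case. Note that the begin marker is read while flag is still 0, so with that change alone only the end marker is kept; to keep the begin marker as well, set flag=1 before the test.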
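One way to keep both marker lines is sketched below. It is not the loop above verbatim: the case is replaced by two explicit tests so that the flag is switched on before the append (keeping the header) and switched off after it (keeping the footer). The temporary datafile and the IFS= in front of read (which preserves leading whitespace) are additions made for this example.

```shell
#!/usr/bin/env bash
begin="@#@#@#################################"
end="@#@#@#"

# Build sample data from the same marker strings so they are guaranteed to match.
datafile=$(mktemp)
cat > "$datafile" <<EOF
$begin
this is field one
$end
$begin
this is field two
they can be any number of lines
$end
EOF

i=0
flag=0
array=()
while IFS= read -r line
do
    # Switch the flag on first, so the header line itself is appended.
    [[ $line == "$begin" ]] && flag=1
    if [[ $flag == 1 ]]
    then
        array[i]+="$line"$'\n'   # retain the newline
    fi
    # Switch the flag off afterwards, so the footer line is appended too.
    if [[ $line == "$end" ]]
    then
        flag=0
        i=$((i + 1))
    fi
done < "$datafile"
rm -f "$datafile"

printf '%s' "${array[0]}"
```

Each element then holds one complete block, markers included, with its newlines intact.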