Parse text using regular expressions
I have a dictionary in .txt format, which looks like this:
term 1
definition 1
definition 2
term 2
definition 1
definition 2
definition 3
etc.
There is always a tab before each definition, so it actually looks like this:
term 1
[tab]definition 1
[tab]definition 2
etc.
Now I need to wrap every term and its definitions in a <term> tag, i.e.:
<term>
term 1
definition 1
definition 2
</term>
I tried to use regular expressions to match a term together with its definitions, but with no luck. Could you please help me with this?
Thank you for any suggestions!
Try this regular expression:
(^|\n).+(\n[ \t]+.+)*
This assumes that ^ marks the start of the string, \n is the line break character, and . does not match line breaks.
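For illustration, here is a minimal sketch of how that expression could be applied in Python. The function name wrap_terms and the extra capture group around the term and its definitions are my own additions, not part of the answer; the sketch assumes every definition line is indented and that there are no blank lines inside an entry.

```python
import re

# Same idea as the pattern above, with one extra group so the matched
# term-plus-definitions text can be reused in the replacement.
# Without re.MULTILINE, ^ only matches at the very start of the string.
ENTRY = re.compile(r"(^|\n)(.+(?:\n[ \t]+.+)*)")


def wrap_terms(text):
    """Wrap each term and its indented definitions in <term>...</term>."""
    # group(1) is either empty (start of string) or the newline that
    # precedes the term line; it is put back unchanged.
    return ENTRY.sub(lambda m: f"{m.group(1)}<term>\n{m.group(2)}\n</term>", text)


if __name__ == "__main__":
    sample = "term 1\n\tdefinition 1\n\tdefinition 2\nterm 2\n\tdefinition 1\n"
    print(wrap_terms(sample), end="")
```

Note that .+ also matches an indented line, so this relies on the file starting with a term rather than a definition.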
Assuming an implementation that
- matches multiple lines (/.../m)
- uses \A to indicate the start of a line
this should match one "term":
\A[^\t][^\n]+\n(\t[^\n]+\n)+
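As a rough check of what this pattern picks up, the sketch below lists the matched blocks in Python. Two things here are assumptions on my part: in Python's re module \A only matches at the very start of the string, so ^ with re.MULTILINE is used as the start-of-line anchor instead, and the helper name find_term_blocks is made up for the example.

```python
import re

# ^ with re.MULTILINE plays the role of the "start of a line" anchor.
# The rest of the pattern is unchanged, apart from a non-capturing group.
TERM_BLOCK = re.compile(r"^[^\t][^\n]+\n(?:\t[^\n]+\n)+", re.MULTILINE)


def find_term_blocks(text):
    """Return each term line together with its tab-indented definitions."""
    return [m.group(0) for m in TERM_BLOCK.finditer(text)]


if __name__ == "__main__":
    sample = "term 1\n\tdefinition 1\n\tdefinition 2\nterm 2\n\tdefinition 1\n"
    for block in find_term_blocks(sample):
        print(repr(block))
```

Because every definition has to end in \n, the last entry is only matched if the file ends with a newline.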
Match a line with a leading non-whitespace character followed by one or more lines with leading TABs:
$ perl -0077 -pe 's/^(\S.+\n(^\t.+\n)+)/<term>\n$1<\/term>\n/mg' dict
<term>
term 1
	definition 1
	definition 2
</term>
<term>
term 2
	definition 1
	definition 2
	definition 3
</term>
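If Perl is not at hand, the same substitution can probably be written in Python along these lines. This is only a sketch: the function name wrap_terms_like_perl is hypothetical, and it assumes the whole file has already been read into one string (which is roughly what the -0 switch achieves in the one-liner above).

```python
import re


def wrap_terms_like_perl(text):
    """Python counterpart of s/^(\\S.+\\n(^\\t.+\\n)+)/<term>\\n$1<\\/term>\\n/mg."""
    # re.MULTILINE corresponds to /m; re.sub replaces every match, like /g.
    # \1 in the replacement corresponds to Perl's $1.
    return re.sub(
        r"^(\S.+\n(?:\t.+\n)+)",
        r"<term>\n\1</term>\n",
        text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    sample = "term 1\n\tdefinition 1\n\tdefinition 2\nterm 2\n\tdefinition 1\n"
    print(wrap_terms_like_perl(sample), end="")
```

To process a real file, read it with open(path).read() and pass the result to the function.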