Extracting some data items in a string using regular expression

2022-12-12 03:52

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>

In the above string, I would like a regex that matches every <![FruitName]!>. There may be some garbage text between the tags. My first attempt looks like this:

<!\[[^\]!>]+\]!>

It works, but as you can see I've used this part:

[^\]!>]+

This kills some innocents. If the fruit name contains any one of these characters: ] ! > it'd be discarded, and we love eating fruit so much that this should not happen.
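To see the problem concretely, here is a small Python sketch of the first attempt. The name Star>Fruit is made up for illustration and is not part of the original string:

```python
import re

# First attempt from the question: a negated character class for the name.
first_attempt = re.compile(r"<!\[[^\]!>]+\]!>")

# "Star>Fruit" is a hypothetical name containing one of the excluded characters.
text = "<![Apple]!>junk<![Star>Fruit]!><![Pear]!>"

matches = first_attempt.findall(text)
print(matches)  # the Star>Fruit tag is silently skipped
```

The tag whose name contains > never matches, which is exactly the loss described above.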

How do we construct a regex that disallows only the exact sequence ]!> in the FruitName, while still matching all of the tags?

The above example is made up; I just want to know what the regex would look like if this has to be done with a regex.

The simplest way would be <!\[.+?]!> - just don't care at all about what is matched between the two delimiters. Only make sure that the match always ends at the earliest possible closing delimiter, hence the ? that makes the quantifier lazy.

(Also, in most regex flavors there is no need to escape the ] outside a character class.)

About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
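As a rough illustration of why the lazy quantifier matters, this Python sketch compares it with a greedy one. The test string and the name Star>Fruit are assumptions made up for the demo:

```python
import re

# Hypothetical input; "Star>Fruit" contains a character the first attempt rejected.
text = "<![Apple]!>junk<![Star>Fruit]!><![Pear]!>"

# Lazy: stop at the earliest closing delimiter.
lazy = re.findall(r"<!\[.+?]!>", text)

# Greedy: run on to the last closing delimiter in the string.
greedy = re.findall(r"<!\[.+]!>", text)

print(lazy)
print(greedy)
```

The lazy version yields one match per tag, while the greedy version swallows the whole string as a single match.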

To match a fruit name, you could use:

<!\[(.*?)]!>

After the opening <![, the lazy .*? matches as little text as possible before the first ]!>, whereas a greedy .* would run on to the last ]!> in the string. The parentheses capture the fruit name itself.
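A minimal Python sketch of this pattern applied to the string from the question. With a single capture group, re.findall returns just the captured names:

```python
import re

sample = (
    "<![Apple]!>some garbage text may be here"
    "<![Banana]!>some garbage text may be here"
    "<![Orange]!><![Pear]!><![Pineapple]!>"
)

# One capture group, so findall returns a list of the captured names only.
names = re.findall(r"<!\[(.*?)]!>", sample)
print(names)
```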

Here's a full regex to match each fruit with the following text:

<!\[(.*?)]!>(.*?)(?=(<!\[)|$)

This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
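The lookahead version can be tried out with a short Python sketch like the one below. It uses re.finditer rather than re.findall because the pattern has three groups and only the first two are of interest:

```python
import re

sample = (
    "<![Apple]!>some garbage text may be here"
    "<![Banana]!>some garbage text may be here"
    "<![Orange]!><![Pear]!><![Pineapple]!>"
)

# group(1) is the fruit name, group(2) is the text up to the next tag or end of string.
pattern = re.compile(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
for fruit, trailing in pairs:
    print(repr(fruit), repr(trailing))
```

Tags that are immediately followed by another tag, or by the end of the string, get an empty trailing text.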

Depending on what language you are using, you can use the string methods your language provides and do simple splitting (plus a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If it is found, remove all characters from the front of the field up to and including <!. This gives you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.

For example, with gawk:

# set the field separator to !>
awk -F'!>' '
{
  # for each field
  for (i = 1; i <= NF; i++) {
    # keep only fields that contain <!
    if ($i ~ /<!/) {
      # remove everything from the front up to and including <!
      sub(/.*<!/, "", $i)
      # print the result (fields without <!, such as the empty last one, are skipped)
      print $i
    }
  }
}
' file

Output:

# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]

No complicated regex needed.
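For readers not using awk, the same splitting steps might look like this in Python. This is only a sketch; the function name fruits_by_splitting is made up for illustration:

```python
def fruits_by_splitting(text):
    """Split on '!>' and keep what follows the last '<!' in each field."""
    results = []
    for field in text.split("!>"):
        # rfind mirrors the greedy .*<! in the awk version: cut up to the LAST <!
        start = field.rfind("<!")
        if start != -1:
            results.append(field[start + 2:])
    return results


sample = (
    "<![Apple]!>some garbage text may be here"
    "<![Banana]!>some garbage text may be here"
    "<![Orange]!><![Pear]!><![Pineapple]!>"
)
fruits = fruits_by_splitting(sample)
print("\n".join(fruits))
```

As in the awk output, the square brackets stay attached to each name. Note that this approach splits on every !>, so it would break on a name or garbage text that itself contains that sequence.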
