Scala Regex Multiple Block Capturing
I'm trying to capture parts of a multi-lined string with a regex in Scala. The input is of the form:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
I've tried several possibilities that should get me the text out of the begin { } blocks. One of them:
val Block = """(?s).*begin \{(.*)\}""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
I get NO MATCH. If I drop the \}, the regex becomes (?s).*begin \{(.*) and it matches the last block, including the unwanted } and "some text". I checked my regex at rubular.com as /.*begin \{(.*)\}/m, and there it matches at least one block. I thought that once my Scala regex matched the same way, I could start using findAllIn to match all blocks. What am I doing wrong?
I had a look at "Scala Regex enable Multiline option", but I could not manage to capture all the occurrences of the text blocks in, for example, a Seq[String].
Any help is appreciated.
As Alex has said, when using pattern matching to extract fields from regular expressions, the pattern acts as if it were bounded (i.e., as if it were wrapped in ^ and $). The usual way to avoid this problem is to use findAllIn first, like this:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
val Block = """(?s)begin \{(.*)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})
Otherwise, you can use .* at the beginning and end to get around that restriction:
val Block = """(?s).*begin \{(.*)\}.*""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
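Note that wrapping the pattern in .* makes the match succeed, but it still yields a single capture. A minimal sketch of what that capture is, assuming a shortened stand-in input (the object name WrappedPatternDemo and the "first"/"second" contents are made up for illustration):

```scala
object WrappedPatternDemo {
  // Shortened stand-in for the question's input (illustrative contents).
  val input: String =
    "some text\nbegin {\n first\n}\nsome text\nbegin {\n second\n}\nsome text"

  // The leading greedy .* consumes as much as it can, so "begin {" is
  // matched at its LAST occurrence in the input.
  val Block = """(?s).*begin \{(.*)\}.*""".r

  val captured: Option[String] = input match {
    case Block(content) => Some(content.trim)
    case _              => None
  }
}

// WrappedPatternDemo.captured is Some("second"): only the last block is seen.
```

So this form is fine when only one block is expected, but it is not a way to collect every block.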
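By the way, you probably want a non-greedy (reluctant) matcher, so that each match stops at the first closing brace instead of running on to the last one: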
val Block = """(?s)begin \{(.*?)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})
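Since the question asks for a Seq[String], here is a minimal sketch that collects the captured groups directly with findAllMatchIn instead of re-matching each hit. The names BlockExtractor and extractBlocks are hypothetical, and trimming the surrounding whitespace is an assumption about what the caller wants:

```scala
import scala.util.matching.Regex

object BlockExtractor {
  // (?s) lets . match newlines; .*? is reluctant, so every match ends
  // at the first closing brace after "begin {".
  private val Block: Regex = """(?s)begin \{(.*?)\}""".r

  // Returns the content of each block, one element per block,
  // with leading and trailing whitespace trimmed.
  def extractBlocks(text: String): Seq[String] =
    Block.findAllMatchIn(text).map(_.group(1).trim).toList
}
```

Calling BlockExtractor.extractBlocks(input) on the input from the question should give one element per begin { } block. This simple pattern does not handle nested braces inside a block.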
When doing a match, I believe a full match is implicitly required. Your match is equivalent to:
val Block = """^(?s).*begin \{(.*)\}$""".r
It works if you add .* to the end:
val Block = """(?s).*begin \{(.*)\}.*""".r
I haven't been able to find any documentation on this, but I have encountered this same issue.
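For what it's worth, Scala 2.10 and later expose this behaviour through Regex.unanchored, which returns a regex whose extractor accepts a match anywhere in the input. A small sketch contrasting the two, assuming a shortened stand-in input (the object name AnchoringDemo and the "inner" content are made up for illustration):

```scala
object AnchoringDemo {
  // Shortened stand-in for the question's input (illustrative contents).
  val input: String = "some text\nbegin {\n inner\n}\nsome text"

  // Default extractor: succeeds only if the pattern matches the WHOLE input.
  val Anchored = """(?s)begin \{(.*?)\}""".r

  // Same pattern, but the extractor is satisfied by the first match found.
  val Unanchored = Anchored.unanchored

  val anchoredResult: Option[String] = input match {
    case Anchored(content) => Some(content.trim)
    case _                 => None
  }

  val unanchoredResult: Option[String] = input match {
    case Unanchored(content) => Some(content.trim)
    case _                   => None
  }
}

// AnchoringDemo.anchoredResult   is None
// AnchoringDemo.unanchoredResult is Some("inner")
```

The unanchored extractor only reports the first match, so findAllIn or findAllMatchIn is still needed to get every block.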
As a complement to the other answers, I wanted to point out the existence of kantan.regex, which lets you write the following:
import kantan.regex.ops._
// The type parameter is the type as which to decode results,
// the value parameters are the regular expression to apply and the group to
// extract data from.
input.evalRegex[String]("""(?s)begin \{(.*?)\}""", 1).toList
This yields:
List(Success(
content to extract
content to extract
), Success(
other content to extract
))