Regular expression to match <title></title> with newlines anywhere inside

I'm trying to write a regular expression to extract the page title from a URL, but the problem is that "." doesn't match newlines, as we already know. How do I write a regular expression that matches and extracts the page title, i.e. <title>(.*?)</title>, when a newline could appear anywhere in between?

I'm using Grails.


Whilst you can't use a regex to parse general HTML, you can probably get away with it in this case. In Groovy, you can use the (?s) flag to make the dot match newlines. You should probably also use the (?i) flag to make your regex case-insensitive. You can combine the two as (?is).

For example:

def titleTagWithNoLineBreaks = "<title>This is a title</title>"
def titleTagWithLineBreaks = """<title>This is
a title</title>"""

// Note the (?is) at the beginning of the regex
// The 'i' makes the regex case-insensitive
// The 's' makes the dot match newline characters
def pattern = ~/(?is)<title>(.*?)<\/title>/

def matcherWithNoLineBreaks = titleTagWithNoLineBreaks =~ pattern
def matcherWithLineBreaks = titleTagWithLineBreaks =~ pattern

assert matcherWithNoLineBreaks.size() == 1
assert matcherWithLineBreaks.size() == 1

assert matcherWithLineBreaks[0][1].replaceAll(/\n/,' ') == "This is a title"
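
The same pattern can be wrapped in a small helper for use on a whole page. This is only a sketch: extractTitle is a hypothetical name, not something from Grails or the code above, and it assumes you want runs of whitespace (including line breaks) collapsed to a single space and null returned when no title is found.

```groovy
// Hypothetical helper (illustrative only): returns the <title> text with
// whitespace collapsed, or null when the input has no <title> element.
String extractTitle(String html) {
    // (?is) = case-insensitive + dot matches newlines;
    // \b[^>]* tolerates attributes on the opening tag
    def matcher = html =~ /(?is)<title\b[^>]*>(.*?)<\/title>/
    if (!matcher.find()) {
        return null
    }
    return matcher.group(1).replaceAll(/\s+/, ' ').trim()
}

def page = """<html><head>
<TITLE>
  This is
  a title
</TITLE></head><body></body></html>"""

assert extractTitle(page) == 'This is a title'
assert extractTitle('<html><body>no title here</body></html>') == null
```

Calling find() explicitly avoids an exception from indexing into a matcher that has no matches.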

Hope that helps.


Assuming it's for PHP:

preg_match( "#<title>(.*?)</title>#s", $source, $match );
$title = $match[1];

In PHP's PCRE functions, adding the s modifier (PCRE_DOTALL) changes the . (any character) so that it also matches newlines. Most regex engines have an equivalent option, although its name varies from one engine to another.
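
Since the question mentions Grails, here is a rough Groovy counterpart of the PHP snippet. Groovy uses java.util.regex, where the option corresponding to the s modifier is Pattern.DOTALL. The variable names and the sample input are made up for illustration.

```groovy
import java.util.regex.Pattern

// Pass the flags as compile-time constants instead of an inline (?is) prefix
def titlePattern = Pattern.compile('<title>(.*?)</title>',
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE)

// Sample input (an assumption): the title spans two lines
def source = '<html><head><TITLE>Line one\nLine two</TITLE></head></html>'
def matcher = titlePattern.matcher(source)

assert matcher.find()
assert matcher.group(1) == 'Line one\nLine two'

// Without DOTALL the dot stops at the newline, so the same input does not match
def withoutDotAll = Pattern.compile('<title>(.*?)</title>', Pattern.CASE_INSENSITIVE)
assert !withoutDotAll.matcher(source).find()
```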


If all you need is to parse possibly-malformed HTML documents you could try using the TagSoup parser. Then you could just use GPath expressions and won't have to worry about weirdness like "</title>" in a comment in the title and such.

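// Requires the TagSoup jar on the classpath
import org.ccil.cowan.tagsoup.Parser

// TagSoup is a SAX parser that tolerates malformed HTML
final parser  = new Parser()
final slurper = new XmlSlurper(parser)
final html    = slurper.parse('http://www.example.com/')

// Print the text of the first element named 'title'
println html.depthFirst().find { it.name() == 'title' }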