Perl RegEx: Limiting the pattern to only the first occurrence of a character

2023-01-08 20:27

I am trying to extract the content of a date element from many ill-formed SGML documents. For instance, a document can contain a simple date element like

<DATE>4th July 1936</DATE>

or

<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>

but it can also be as hairy as:

<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>

The aim is to get the "4th July 1936". Since the files are not big, I chose to read the whole content into a variable and do the regex. The following is the snippet of my Perl code:

{
    local $/ = undef;
    open FILE, "$file" or die "Couldn't open file: $!";
    $fileContent = <FILE>;
    close FILE;

    if ( $fileContent =~ m/<DATE(.*)>(.*)<\/DATE>/)
    {
        # $2 should contain the "4th July 1936" but it did not.
    }
}

Unfortunately the regex does not work for the hairy example. This is because inside the <DATE> there is an <EM> element and it also spans multiple lines.

Can any kind soul give me some pointers, directions, or clues?

Thanks heaps!

Use an XML parser if you can.

But judging from your example, you could probably try:

if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
  # use $1 here
  # you may need to strip new lines
}
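
A minimal sketch of this approach, wrapped in a hypothetical helper named first_date_text (the name and the whitespace trimming are assumptions, not part of the answer above). Because [^<]+ is a negated character class, it also matches newlines, so no /s modifier is needed:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the opening <DATE ...> tag
# and the next '<', with leading/trailing whitespace removed.
sub first_date_text {
    my ($content) = @_;
    return unless $content =~ m/<DATE[^>]*>([^<]+)/;
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;    # trims the newline left before <EM>
    return $text;
}

my $hairy = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936\n}
          . qq{<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>};

print first_date_text($hairy), "\n";    # prints "4th July 1936"
```

The helper returns nothing when no DATE element is found, so the caller can test the result with defined before using it.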

If the date format is fixed, you might want to use something like this:

m/<DATE(.*)>([0-9]+(st|nd|rd|th)\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]+)(.*)<\/DATE>/
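
As a sketch of the fixed-format idea, the same pattern can be spelled out with the /x modifier so that each part is commented. The helper name fixed_format_date is hypothetical, and two details are assumptions that differ from the one-liner above: the tag attributes are matched with [^>]* instead of (.*), and the match stops after the year rather than requiring </DATE> on the same line:

```perl
use strict;
use warnings;

my $month = qr/January|February|March|April|May|June|July
              |August|September|October|November|December/x;

# Hypothetical helper: returns the date only if it has the fixed
# "4th July 1936" shape right after the opening tag.
sub fixed_format_date {
    my ($content) = @_;
    return unless $content =~ m{
        <DATE [^>]* >               # opening tag, attributes allowed
        \s*
        ( [0-9]+ (?:st|nd|rd|th)    # day with ordinal suffix
          \s+ $month                # month name
          \s+ [0-9]+ )              # year
    }x;
    return $1;
}

my $date = fixed_format_date("<DATE>4th July 1936\n<EM>note</EM></DATE>");
print "$date\n" if defined $date;
```

Content that does not look like a date, such as <DATE>sometime in 1936</DATE>, is rejected instead of being returned as-is.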

Instead of matching .*, you should match "everything that is not an angle bracket",

i.e.:

if ($string =~ /<DATE[^>]*>([^<]+)</) {
    # here, $1 is your date
}

Use an HTML parser.

Please, use an HTML parser.

But for a regex, I'd try

<DATE(.*?)>(.*)<\/DATE>

which should be faster than KennyTM's alternative... By the way, why are you capturing that second group?

You should use non-greedy matching and the /s modifier to make . match newlines:

my @l = (
'<DATE>4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>'
);

foreach(@l) {
  /^<DATE.*?>(.*?)\s*</s && print "$1\n";
}

output:

4th July 1936
4th July 1936
4th July 1936
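
To illustrate why both non-greedy matching and /s matter once the whole file is in one string, here is a small sketch. The two-element sample document is made up for the illustration and is not taken from the question:

```perl
use strict;
use warnings;

# Two DATE elements; the first one spans two lines.
my $doc = "<DATE>4th July 1936\n<EM>note</EM></DATE>\n<DATE>5th May 1940</DATE>";

# Greedy: (.*) runs to the LAST </DATE> in the string.
my ($greedy) = $doc =~ m/<DATE[^>]*>(.*)<\/DATE>/s;

# Non-greedy: (.*?) stops at the FIRST </DATE>.
my ($lazy) = $doc =~ m/<DATE[^>]*>(.*?)<\/DATE>/s;

# Without /s, '.' cannot cross the newline, so the first element is skipped
# and the match is found in the single-line second element instead.
my ($no_s) = $doc =~ m/<DATE[^>]*>(.*?)<\/DATE>/;

printf "greedy: %d chars, lazy: %d chars, no /s: %s\n",
    length($greedy), length($lazy), $no_s;
```

Only the non-greedy /s variant captures exactly the content of the first element. The greedy one swallows both elements, and the one without /s silently picks the wrong element.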

Even your "hairy" example can be reduced to a similar type. If you are always going to have 1) the actual date on the same line as the start tag--and 2) that's all you want--it doesn't matter where the end tag is.

$fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/

should work as long as the date keeps this day-month-year shape. (If you're not going to find '>' inside the tag, it's a good idea to use [^>]* rather than .*, which avoids a lot of backtracking: .* eats up your entire line, causes the expression to fail, and then has to give back and check, give back and check, ...)

By default . does not match a newline, so a pattern built on .* will not span multiple lines unless you add the /s modifier. Alternatively, you can use a little trick. If the files aren't too big, as you have mentioned, you can first replace all '\n' characters with some value (NEW_LINE or something like that), or you can delete them, and then use your pattern.
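
The newline-replacement trick can be sketched as follows. The helper name flattened_date_content is hypothetical, and replacing each line break with a single space (rather than a marker such as NEW_LINE) is an assumption made for brevity:

```perl
use strict;
use warnings;

# Hypothetical helper: flattens line breaks to spaces, then matches the
# whole DATE element on what is now a single line.
sub flattened_date_content {
    my ($content) = @_;
    (my $flat = $content) =~ s/\r?\n/ /g;
    return unless $flat =~ m/<DATE[^>]*>(.*?)<\/DATE>/;
    return $1;
}

my $hairy = "<DATE blaAttrib=\"89787adjd98d9\">4th July 1936\n"
          . "<EM>spanned across multiple lines</EM></DATE>";

my $content = flattened_date_content($hairy);
(my $date = $content) =~ s/\s*<.*//;    # keep only the text before the first tag
print "$date\n";                         # prints "4th July 1936"
```

Note that the flattened match returns the full element content, including the nested EM element, so a second step is still needed to cut it down to the date itself.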
