Intelligent RegEx in Perl?

2022-12-13 23:39

Background

Consider the following input:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>

After processing, I need it to look like the following:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>

Implementation

This is all I need to do for no more than 5 files, so using anything other than a regular expression seems like overkill. Anyway, I came up with the following (Perl) regular expression to accomplish this:

$data =~ s/(<\s*Foo)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;

Problems

This works well; however, there is one obvious problem. This sort of pattern is "dumb" because if CustomAttribute has already been added, the operation outlined above will simply append another CustomAttribute=... blindly.

A simple solution, of course, is to write a secondary expression that will attempt to match for CustomAttribute prior to running the replacement operation.

Questions

Since I'm rather new to the scripting language and regular expression worlds, I'm wondering whether it's possible to solve this problem without introducing any host language constructs (i.e., an if-statement in Perl), and simply use a more "intelligent" version of what I wrote above?

I won't beat you over the head with how you should not use a regex for this. I mean, you shouldn't, but you obviously know that from what you said in your question, so moving on...

Something that will accomplish what you're asking for is called a negative lookahead assertion (usually (?!...)), which basically says that you don't want the match to apply if the pattern inside the assertion is found ahead of this point. In your example, you don't want it to apply if CustomAttribute is already present, so:

$data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
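
A quick way to check that the lookahead makes the substitution safe to repeat is to run it twice over the same text. The sketch below uses the input from the question. The helper name add_attr is made up for illustration, and the replacement is written with a space before CustomAttribute so the new attribute stays separated from whatever precedes it.

```perl
use strict;
use warnings;

# Hypothetical helper: add CustomAttribute to every <Foo ...> tag lacking it.
sub add_attr {
    my ($data) = @_;
    # (?!...) rejects a <Foo whose tag already contains CustomAttribute=
    $data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
    return $data;
}

# The input from the question, one attribute per line.
my $input = qq{<Foo\n    Bar="bar"\n    Baz="1"\n    Bax="bax"\n>\n};

my $once  = add_attr($input);   # first pass adds the attribute
my $twice = add_attr($once);    # second pass should find nothing to do

print $once;
print $once eq $twice ? "second pass changed nothing\n"
                      : "second pass changed the text\n";
```

Running the substitution a second time leaves the text unchanged, because the lookahead now finds CustomAttribute= before the closing '>' and refuses the match.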

This sounds like it might be a job for XML::Twig, which can process XML and change parts of it as it runs into them, including adding attributes to tags. I suspect you'd spend as much time getting used to Twig as you would finding a regex solution that only mostly worked. And at the end, you'd know enough Twig to use it on the next project. :)

Time for a lecture I guess ;--)

I am not sure why you think using a full-blown XML processor is overkill. It is actually easier to write the code using the proper tool. A regexp will be more complex and will rely on unwritten assumptions about the data, which is dangerous. Some of those assumptions are likely to be: no '>' in attribute values, no CDATA sections, no non-ASCII characters in tag or attribute names, consistent attribute value quoting...
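
To make the first of those assumptions concrete, here is a small sketch of what happens when an attribute value contains '>'. It reuses the lookahead-based substitution from earlier in this thread (written with a space before the new attribute). The sample tag is invented for illustration.

```perl
use strict;
use warnings;

# Invented sample: a well-formed tag whose first attribute value is ">".
my $data = qq{<Foo Bar=">" Baz="1">};

# The regex treats the first '>' it meets as the end of the tag,
# even though that '>' sits inside a quoted attribute value.
(my $out = $data) =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;

print "$out\n";
# prints: <Foo Bar=" CustomAttribute="TRUE">" Baz="1">
```

The result is no longer well-formed XML: the new attribute lands inside the value of Bar. The same '>' also hides any CustomAttribute that comes later in the tag from the lookahead, since [^>]* cannot look past it.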

The only thing a regexp will give you is the assurance that the output keeps the original format of the data (in your case, the fact that the attributes are each on a separate line). But if your format is consistent, it can be reproduced on output, and if not, it should not matter, unless you keep your XML in a line-oriented revision control system.

Here is an example with XML::Twig. It assumes you have enough memory to keep any entire Foo element in memory, and it works even on the admittedly contrived bit of XML in the DATA section. It would probably be just as easy to do with XML::LibXML (read the XML in memory, select all Foo elements, add attribute to each of them, output, that's 5 easy to understand lines by my count).

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my( $tag, $att, $val)= ( 'Foo', 'CustomAttribute', 'TRUE');

XML::Twig->new(                 # only process those elements
                twig_roots => { $tag => sub { 
                                              # add/set attribute
                                              $_->set_att( $att => $val); 
                                              # output and free memory
                                              $_->flush;
                                            }
                              },
                twig_print_outside_roots => 1, # output everything else
                pretty_print => 'cvs',         # seems to be the right format
              )
         ->parse( \*DATA)  # use parsefile( $file) if parsing... a file
         ->flush;          # not needed in XML::Twig 3.33
__DATA__
<doc>
  <Foo
      Bar="bar"
      Baz="1"
      Bax="bax"
  >
  here is some text
  </Foo>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
  <Foo><![CDATA[<Foo no_att="1" CustomAttribute="TRUE">tricked?</Foo>]]></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
      CustomAttribute="TRUE"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="b
ax"
      CustomAttribute="TR
UE"
  ></Foo>
</doc>
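
For comparison, the XML::LibXML route mentioned above might look roughly like the following. This is a sketch, not code from the original answer: it assumes XML::LibXML is installed, uses a shortened, invented sample instead of the full DATA section, and makes no attempt to keep the one-attribute-per-line layout.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened, invented sample with a '>' in a value, a nested Foo and a CDATA trap.
my $xml = <<'XML';
<doc>
  <Foo Bar=">" Baz="1"/>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
</doc>
XML

my $doc = XML::LibXML->load_xml( string => $xml);    # read the XML in memory
for my $foo ( $doc->findnodes( '//Foo')) {           # select all Foo elements
    $foo->setAttribute( CustomAttribute => 'TRUE');  # add (or overwrite) the attribute
}
print $doc->toString;                                # output
```

Because the parser works on elements rather than text, the Foo that appears inside the CDATA section is never selected, and an existing CustomAttribute is overwritten rather than duplicated.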

You can send your matches through a function with the 'e' modifier for more processing.

my $str = qq`
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>
`;

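# Append the attribute to a tag's contents unless it is already there.
sub foo {
    my $guts = shift;
    $guts .= qq` CustomAttribute="TRUE"` if $guts !~ m/\bCustomAttribute=/;
    return $guts;
}
# /e evaluates the replacement as Perl code, so foo() runs on each match.
# \b keeps the pattern from matching longer names such as <Foobar.
$str =~ s/(<Foo\b)([^>]*)(>)/$1.foo($2).$3/sge;
print $str;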
