Parsing a Syntax Tree with Perl Regex
Perhaps regex is not the best way to parse this, tell me if I it is not. Anyway, here开发者_C百科 are some examples of what the syntax tree looks like:
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
Anyway, what I am trying to do is pull the connective out (and, then, once, etc) and its corresponding head (CC,IN,CC), which I already know for each syntax tree so it can act as an anchor, and I also need to retrieve its parent (in the first it is S, second SBARTMP, and third it is S), and its siblings, if there are any (in the first none, in the second left hand side sibling, and third left-hand-side and right-hand-side sibling). Anything higher than the parent is not included
my $pos = "(\\\w|-)*";
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};
This code works for things like:
my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";
my $test2 = "(X (SBAR-TMP (IN once))";
my $test3 = "(X (SBAR-TMP (IN once) (X as))";
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";
It will throw away the X on top and keep everything else, however, once the siblings have stuff embedded in them then it does not match because the regex does not go deeper.
my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
I am not sure how to account for this. I am kind of new to the extended patterns for Perl, just started learning it. To clarify a bit about what the regex is doing: it looks for the connective within two parentheses and the capital-letter/- combo, looks for a complete parent of the same format closing with two parentheses and then should look for any number of siblings that have all their parentheses paired off.
To only get the nearest 'parent' to your anchor connective you can do it as a recursive parent with a FAIL or do it directly. (for some reason I can't edit my other posts, must be cookies being deleted).
use strict;
use warnings;
my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
\s*
(
(?! $connective )
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex1 = qr/
\( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
/x;
my $regex2 = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my $sample = qq/
(X (SBAR-TMP (IN once) (S sdf) (S sdf)))
(X (SBAR-TMP (IN once))
(X (SBAR-TMP (IN once) (X as))
(X (SBAR-TMP (X adsf) (IN once))
(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;
while ($sample =~ /$regex1/xg) {
print "Found: $1\n";
}
print '-' x 20, "\n";
while ($sample =~ /$regex2/xg) {
print "Found: $2\n";
}
__END__
Why did you give up on this, you almost had it. Try this:
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
\s*
(
(?!$connect)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex = qr/
( #1
\(
\s* [\w-]+ \s*
(?> $sibling* \s* $connective \s* $sibling*
| (?1)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $1\n";
}
}
my $another =<<EOS;
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S
(S
(NP blah
(VP blah)
)
(CC then)
(NP blah
(VP blah
(PP blah)
)
)
)
)
EOS
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$1\n";
}
END
This should work as well
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
(?: \s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s* )
/x;
my $regex = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
'(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $2\n";
}
}
my $another = "
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$2\n";
}
__END__
精彩评论