String corruption and nonprintable characters using XML::Twig in Win32 Perl
This is a really weird problem. It's taken me practically all day to whittle it down to a small executable script that demonstrates the problem fully.
Problem Summary: I'm using XML::Twig to pull a data snippet from an XML file, then I'm sticking that data snippet into the middle of another piece of data; let's call it the parent data. The parent data has a weird non-printable character at its beginning when I start. It's vendor-supplied data, so I cannot control it. My problem is that after I stick the data snippet into the middle of the parent data, the final product has a new non-printable character at its beginning in addition to the one it started with. This new non-printable character was in neither the parent data nor the child data snippet. I don't know where it's coming from or how it's getting into my data.
I doubt that it is an XML::Twig bug, because the string corruption occurs while reading a line from a filehandle in a while loop. However, I've been unable to recreate the problem when I remove the XML::Twig code from my script, so I had to leave it in.
This is my first experience with non-printable characters in strings that I'm trying to process. Do I need to do something special instead of treating them like ordinary strings or something?
I'm using ActiveState Perl 5.10.1 and XML::Twig 3.32 (latest) and the Eclipse 3.5.1 IDE on Windows XP.
Here is a script that demonstrates the problem:
use strict;
use warnings;
use XML::Twig;
my $FALSE = 0;
my $TRUE = 1;
my $name = 'KurtsProgram';
my $task = 'MainTask';
my $hidden_char = "\xBF";
my $data = $hidden_char .
'(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
END_TASK
';
my $new_data = insertProgram( $name, $task, $data );
# test to see if results start out as expected
if ( $new_data =~ m/^\Q$hidden_char\E/ ) {
    print "SUCCESS\n";
}
else {
    print STDERR "ERROR: What happened?\n";
    print STDERR "ORIGINAL: \n$data\n";
    print STDERR "MODIFIED: \n$new_data\n";
}
sub insertProgram {
    my ( $local_name, $local_task, $local_data ) = @_;
    # get program section from XML template
    my $twig = new XML::Twig;
    $twig->parse( '<?xml version="1.0"?>
<TemplateSet>
<PROGRAM>PROGRAM <Name>ProgramNameGoesHere</Name> ()
END_PROGRAM</PROGRAM>
<TASK>TASK <Name>TaskNameGoesHere</Name> ()
END_TASK</TASK>
</TemplateSet>
' );
    my $program = $twig->root->first_child('PROGRAM');
    # replace program name in XML template
    $program->first_child('Name')->set_text($local_name);
    my $insert = $program->text();
    # stick modified program into data
    if ( $local_data =~ s/(\s+PROGRAM\s+[^\s]+\s+\()/\n\n $insert $1/ ) {
        # found it and inserted new program
    }
    else {
        # not found
        return;
    }
    # add program name to task list
    my $added_program_to_task = $FALSE;
    my $found_start = $FALSE;
    my $found_end = $FALSE;
    my $new_data = "";
    # open string as a filehandle for line by line processing
    my $filehandle;
    open( $filehandle, '<', \$local_data )
        or die("Can't open string as a filehandle: $!");
    while (defined (my $line = <$filehandle>)) {
        # look for start of our task
        if (
            ( !$found_start ) &&
            ( $line =~ m/\s+TASK\s+\Q$local_task\E\s+\(/ )
        ) {
            # found the task!
            $found_start = $TRUE;
        }
        # look for end of our task
        if (
            ( $found_start ) && ( !$found_end ) &&
            ( $line =~ m/\s+END_TASK/ )
        )
        {
            # found the end tag for the task section!
            $found_end = $TRUE;
            # add the program name to the bottom of the list
            $line = " " . $local_name . ";\n" . $line;
            $added_program_to_task = $TRUE;
        }
        # compile new data from processed line or original line
        $new_data = $new_data . $line;
    }
    close($filehandle);
    if ($added_program_to_task) {
        # success
    }
    else {
        # unable to find task
        return;
    }
    return $new_data;
}
When I run this script, I get the following output:
ERROR: What happened?
ORIGINAL:
¿(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
END_TASK
MODIFIED:
¿(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM KurtsProgram ()
END_PROGRAM
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
KurtsProgram;
END_TASK
You can see the extra character that was added to the front of the data right under the M in MODIFIED.
It has done an ISO-8859-1 to UTF-8 encoding conversion on the character: \xBF -> \xC2\xBF.
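To see that conversion in isolation, here is a minimal sketch using only the core Encode module. The helper name hex_bytes is made up for this illustration and is not part of the original script; the sketch only shows what the byte sequence looks like before and after encoding, not where in the script the conversion is triggered.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not from the original script):
# render a string as space-separated two-digit hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02X', ord($_) ) } split( //, $bytes );
}

my $latin1_char = "\xBF";                           # one character, code point 0xBF
my $utf8_bytes  = encode( 'UTF-8', $latin1_char );  # the same character as UTF-8 bytes

# Show the single byte next to its two-byte UTF-8 form.
print hex_bytes($latin1_char), " -> ", hex_bytes($utf8_bytes), "\n";

# Decoding the UTF-8 bytes gives back the single original character.
my $round_trip = decode( 'UTF-8', $utf8_bytes );
print "round trip ok\n" if $round_trip eq $latin1_char;
```

The extra \xC2 is the "new" non-printable character: it is the lead byte of the two-byte UTF-8 form of the original \xBF.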
XML::Twig converts all its input to UTF-8.
You could tell Twig to keep the input encoding using the keep_encoding option (also see the XML::Twig FAQ: My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?).
But perhaps it would be better to keep the UTF-8, or perhaps silently drop the character, depending on what exactly you're going to do with it.
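If dropping the character turns out to be acceptable, a rough sketch could look like the following. The helper name drop_leading_nonprintable and the choice of "printable ASCII plus tab, LF and CR" as the set to keep are assumptions made for this example; whether the vendor tool still accepts the data without its leading byte would need to be checked.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this sketch): remove any leading
# characters that are not printable ASCII, tab, line feed or carriage return.
sub drop_leading_nonprintable {
    my ($text) = @_;
    $text =~ s/\A[^\x09\x0A\x0D\x20-\x7E]+//;
    return $text;
}

# Same shape as the data in the question: one odd character, then the header.
my $data  = "\xBF" . "(*****\nData-File-Header-Junk\n";
my $clean = drop_leading_nonprintable($data);

print $clean;
```

Because the pattern removes a run of leading non-printable characters, it also handles the two-byte \xC2\xBF form if the data has already been converted.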
I can't really make sense of your code; it is still too complex to debug quickly. But maybe the problem has to do with a BOM (see the Unicode BOM FAQ), which would be ignored at the beginning of an XML document but not if you copy it into the middle of another one. I'm just guessing here because of the \xBF value, which is the last byte of the BOM for a UTF-8 document (\xEF\xBB\xBF).
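One way to test the BOM guess is to look at the first three bytes of the raw vendor data. This is only a sketch: strip_utf8_bom is a hypothetical helper, and it assumes the data is held as a plain byte string (for example, read from a filehandle in binmode). The demo script in the question starts with a lone \xBF, which this check would not treat as a BOM.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this sketch): strip a UTF-8 BOM
# (the bytes EF BB BF) from the start of a byte string.
# Returns the remaining bytes and a flag saying whether a BOM was found.
sub strip_utf8_bom {
    my ($bytes) = @_;
    my $had_bom = ( $bytes =~ s/\A\xEF\xBB\xBF// ) ? 1 : 0;
    return ( $bytes, $had_bom );
}

my $with_bom = "\xEF\xBB\xBF" . "(*****\n";
my ( $body, $had_bom ) = strip_utf8_bom($with_bom);

print $had_bom ? "BOM found and removed\n" : "no BOM at start of data\n";
print $body;
```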