Minor changes to a regex to deal with special characters [ä, ö, ü, ß, ...]

I have parsed a larger dataset (in German) and ran into a problem with the results. One little thing is left: German has special characters, and they are not recognized correctly. See the following lines from the output:

lfd. Nr. Schul- nummer Schulname Stra�e PLZ Ort Telefon Fax Schulart Webseite
1 0401 M�dchenrealschule Marienburg,�Abenberg, der Di�zese Eichst�tt Marienburg 1 91183� Abenberg�  09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg�(Grundschule) G�ss�belstr. 2 91183� Abenberg�  09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
3 6913 Mittelschule Abenberg� G�ss�belstr. 2 91183� Abenberg�  09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
4 0402 Johann-Turmair-Realschule�Staatliche Realschule Abensberg Stadionstra�e 46 93326� Abensberg�  09443/9143-0,12,13 09443/914330 Realschulen  www.rs-abensberg.de 
5 3041 Cabrini-Schule Offenstetten, Priv. F�rderzentrum�F�rderschwerp. geist.Entwickl. d. Kath.Jugendf�rs. Am Schmiedweiher 8 93326� Abensberg�Offenstetten 09443/9188-3 09443/918855 Volksschulen zur sonderp�dog. F�rderung  www.cabrinischule.de 
6 3074 Private Berufsschule zur sonderp�d. F�rderung,�F�rderschwerpunkt Lernen, Abensberg Regensburger Stra�e 60 93326� Abensberg�  09443/709191 09443/709193 Berufsschulen zur sonderp�dog. F�rderung  www.berufsschule-abensberg.de 

In the following lines I have put in the correct characters; some of the corrections are shown in bold:

lfd. Nr. Schul- nummer Schulname **Straße** PLZ Ort Telefon Fax Schulart Webseite
1 0401 **Mädchenrealschule** Marienburg, Abenberg, der **Diözese** Eichstätt Marienburg 1 91183 Abenberg  09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg (Grundschule) **Güssübelstr**. 2 91183 Abenberg 

How can I rewrite the regex to get around the issue with the special characters? Any hints?

By the way, here is the code:

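# Assumes LWP::Simple (getstore, is_success) and HTML::TableExtract are loaded,
# and that $te, $range, $counter, $total_records, $url_to_process,
# $suchbegriffe, $treffer and the OUTFILE handle are set up elsewhere.
sub processData {
   $| = 1;   # unbuffer STDOUT so the progress line updates immediately
   while ( $range <= $total_records ) {
      # getstore returns the HTTP status code, which is always a true value,
      # so it has to be tested with is_success rather than "or die".
      my $status = getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html');
      die "Unable to get page (HTTP status $status)" unless is_success($status);
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      for my $row ( $table->rows ) {
         cleanup(@$row);
         print OUTFILE "@$row\n";
      }
      print "Processed records $range to $counter\r";
      $counter += 50;
      $range   += 50;
      $te = HTML::TableExtract->new;   # fresh parser for the next page
   }
}

# Collapse every run of whitespace in each argument to a single space.
# The elements of @_ are aliases, so the caller's array is changed in place.
sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}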


This has nothing to do with regexes. The issue is that you have an encoding problem. Normalize everything to UTF-8 and you will be far happier.

And for goodness’ sake, don’t use POSIX locales! Use the UCA (the Unicode Collation Algorithm).
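
As a rough sketch of what "normalize to UTF-8 and use the UCA" could look like in Perl: decode the bytes once on input, work with characters, and encode once on output. The source charset below (ISO-8859-1) and the sample words are assumptions made for illustration, not something the question confirms; check the page's Content-Type header or meta tag for the real charset. Sorting goes through the core Unicode::Collate module, which implements the UCA.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate;

# Assumption: the scraped page is ISO-8859-1. Replace this with the
# charset the server actually declares.
my $source_charset = 'ISO-8859-1';

# Hypothetical raw bytes as they might arrive from such a page:
# 0xC4 is "Ä" and 0xDF is "ß" in ISO-8859-1.
my @raw = ("Zebra", "\xC4pfel", "Apfel", "Stra\xDFe");

# Decode bytes into Perl character strings once, at the input boundary.
my @words = map { decode($source_charset, $_) } @raw;

# Sort with the Unicode Collation Algorithm instead of a POSIX locale.
my @sorted = Unicode::Collate->new->sort(@words);

# Encode characters back into UTF-8 bytes once, at the output boundary.
binmode STDOUT, ':raw';
print encode('UTF-8', "$_\n") for @sorted;
```

In the script from the question, the same idea would mean decoding the downloaded page before parsing it and writing OUTFILE through a UTF-8 layer, for example by opening it with '>:encoding(UTF-8)'.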


The question is not clear, because I see no regex in your code except for the substitution in cleanup(). Is that what you believe is causing you the problem? The 'special' German characters that are being corrupted will not match a \s pattern, and I very much doubt if this is the culprit.

Your data is encoded in UTF-8, both on input and output. But the output text replaces various two-byte characters with EF BF BD, which is the UTF-8 encoding of Unicode U+FFFD, the 'REPLACEMENT CHARACTER'. As long as you open all files as UTF-8, all should be well. I don't believe there is much that a simple use encoding 'UTF8' at the head of your program won't cure.
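
To make the EF BF BD observation concrete, here is a small sketch of one way such bytes can appear: bytes that are not valid UTF-8 get decoded as UTF-8, and the decoder substitutes U+FFFD for whatever it cannot interpret. That the original page was ISO-8859-1 is an assumption made for this demonstration. It uses explicit Encode calls rather than the encoding pragma, which newer Perl versions deprecate.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "Straße" as ISO-8859-1 bytes (0xDF is "ß").
my $latin1_bytes = "Stra\xDFe";

# Wrong: treat the bytes as UTF-8. 0xDF followed by "e" is not a valid
# UTF-8 sequence, so the decoder inserts U+FFFD in its place.
my $garbled = decode('UTF-8', $latin1_bytes);

# Right (under the stated assumption): decode with the real charset.
my $correct = decode('ISO-8859-1', $latin1_bytes);

# Hex-dump the UTF-8 bytes of a character string.
sub hex_bytes {
    my ($text) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $text);
}

print "decoded as UTF-8:      ", hex_bytes($garbled), "\n";
print "decoded as ISO-8859-1: ", hex_bytes($correct), "\n";
```

The first dump contains EF BF BD where the "ß" should be; the second contains C3 9F, the proper UTF-8 encoding of "ß". Once a character has been replaced by U+FFFD the original is lost, so the fix has to happen where the page is read, not in a later substitution.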
