minor changes on a regex to deal with special chars [ä,ö,ü,ß...]

Asked 2023-02-11 08:55

I have parsed a larger dataset (in German) and ran into a problem with the results. One thing is left to solve: German has special characters, and they are not rendered correctly. See the following lines taken from the output:

lfd. Nr. Schul- nummer Schulname Stra�e PLZ Ort Telefon Fax Schulart Webseite
1 0401 M�dchenrealschule Marienburg,�Abenberg, der Di�zese Eichst�tt Marienburg 1 91183� Abenberg�  09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg�(Grundschule) G�ss�belstr. 2 91183� Abenberg�  09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
3 6913 Mittelschule Abenberg� G�ss�belstr. 2 91183� Abenberg�  09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
4 0402 Johann-Turmair-Realschule�Staatliche Realschule Abensberg Stadionstra�e 46 93326� Abensberg�  09443/9143-0,12,13 09443/914330 Realschulen  www.rs-abensberg.de 
5 3041 Cabrini-Schule Offenstetten, Priv. F�rderzentrum�F�rderschwerp. geist.Entwickl. d. Kath.Jugendf�rs. Am Schmiedweiher 8 93326� Abensberg�Offenstetten 09443/9188-3 09443/918855 Volksschulen zur sonderp�dog. F�rderung  www.cabrinischule.de 
6 3074 Private Berufsschule zur sonderp�d. F�rderung,�F�rderschwerpunkt Lernen, Abensberg Regensburger Stra�e 60 93326� Abensberg�  09443/709191 09443/709193 Berufsschulen zur sonderp�dog. F�rderung  www.berufsschule-abensberg.de

In the following lines I have put in the correct characters; some of the corrections are in bold:

lfd. Nr. Schul- nummer Schulname **Straße** PLZ Ort Telefon Fax Schulart Webseite
1 0401 **Mädchenrealschule** Marienburg, Abenberg, der **Diözese** Eichstätt Marienburg 1 91183 Abenberg  09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg (Grundschule) **Güssübelstr**. 2 91183 Abenberg

So how can we rewrite the regex to get around the issue with the special characters? Any hints?

By the way, here is the code:

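sub processData {
   $| = 1;   # unbuffer STDOUT so the progress line updates immediately
   while ( $range <= $total_records ) {
      # getstore() returns the HTTP status code, which is always true,
      # so check it with is_success() (also exported by LWP::Simple).
      my $status = getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html');
      die "Unable to get page (HTTP status $status)" unless is_success($status);
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      for my $row ( $table->rows ) {
         cleanup(@$row);
         print OUTFILE "@$row\n";
      }
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
      $te = HTML::TableExtract->new;   # fresh parser for the next page
   }
}

# Collapse every run of whitespace in each cell to a single space.
# This works in place because the elements of @_ alias the caller's values.
sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}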

This has nothing to do with regexes. The issue is that you have an encoding problem. Normalize everything to UTF-8 and you will be far happier.
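As a rough illustration of what "normalize to UTF-8" can mean in Perl: decode incoming bytes using the encoding they were really written in, work with the resulting characters, and encode to UTF-8 on the way out. The sketch below assumes the source bytes are ISO-8859-1, which the question does not confirm, and `to_utf8` is a hypothetical helper name rather than anything from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw bytes in a known source encoding
# into UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $source_encoding) = @_;
    my $text = decode($source_encoding, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $text);                 # characters -> UTF-8 bytes
}

# Assumption: "Straße" arrives as ISO-8859-1, where ß is the single byte 0xDF.
my $latin1_bytes = "Stra\xDFe";
my $utf8_bytes   = to_utf8($latin1_bytes, 'ISO-8859-1');

# Show the result as hex; in UTF-8 the ß becomes the two bytes C3 9F.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8_bytes), "\n";
```

If the source encoding is guessed wrongly, the output will still be valid UTF-8 but will contain the wrong characters, so it is worth checking the page's HTTP headers or meta tag first.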

And for goodness’ sake, don’t use POSIX locales! Use the UCA (Unicode Collation Algorithm).
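For reference, the UCA is available in Perl through the core module Unicode::Collate. The following is a minimal sketch with made-up sample words, contrasting a plain code-point sort with a UCA sort using the default collation (no German-specific tailoring is applied here).

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Sample data (invented for illustration); \N{U+00C4} is the letter Ä.
my @words = ("Zebra", "\N{U+00C4}pfel", "Apfel", "Banane");

# Plain sort compares code points, so Ä (U+00C4) lands after Z.
my @by_codepoint = sort @words;

# The UCA treats Ä as a variant of A, so it sorts next to "Apfel".
my @by_uca = Unicode::Collate->new->sort(@words);

print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
```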

The question is not clear, because I see no regex in your code except for the substitution in cleanup(). Is that what you believe is causing you the problem? The 'special' German characters that are being corrupted will not match a \s pattern, and I very much doubt if this is the culprit.
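A quick way to check this claim is to run the same substitution over strings that contain umlauts and ß. The sketch below copies the `cleanup` logic from the question and applies it to invented sample cells; only the whitespace should change.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Same substitution as in the question: collapse whitespace runs in place.
sub cleanup {
    for (@_) {
        s/\s+/ /g;
    }
}

# Invented sample cells; \N{U+00E4} is ä and \N{U+00DF} is ß.
my @row = ("M\N{U+00E4}dchenrealschule   Marienburg", "Stra\N{U+00DF}e\t46");
cleanup(@row);

print "$_\n" for @row;
```

The letters ä and ß pass through untouched, which supports the view that the corruption happens when the data is read or written, not in the regex.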

Your data is encoded in UTF-8, both on input and output. But the output text substitutes various two-byte characters with EF BF BD, which is UTF-8 for Unicode U+FFFD, the 'REPLACEMENT CHARACTER'. As long as you open all files as UTF-8, all should be well. I don't believe there is much that a simple use encoding 'UTF8' at the head of your program won't cure. (Note that the encoding pragma has been deprecated since Perl 5.18; use open qw(:std :encoding(UTF-8)) is the usual way to make UTF-8 the default I/O layer.)
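To make the EF BF BD pattern concrete, here is a small sketch of one way it can arise, followed by file handles opened with an explicit UTF-8 layer. It assumes, purely for illustration, that some input bytes were ISO-8859-1 but were decoded as UTF-8; the question does not confirm the real source encoding, and `hex_bytes` is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0];
}

# Assumption for this demo: ß arrived as the single ISO-8859-1 byte 0xDF.
my $latin1_bytes = "Stra\xDFe";

# Decoding those bytes as UTF-8 hits an invalid sequence; the decoder
# substitutes U+FFFD, which is then written out as the bytes EF BF BD.
my $damaged = encode('UTF-8', decode('UTF-8', $latin1_bytes));
print "damaged: ", hex_bytes($damaged), "\n";

# Decoding with the real encoding keeps the character intact.
my $text = decode('ISO-8859-1', $latin1_bytes);

# Write and re-read through explicit UTF-8 layers on the file handles.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} "$text\n";
close $out or die "Cannot close $path: $!";

open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
chomp(my $round_trip = <$in>);
close $in;

print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";
```

In the question's script, the same idea would apply to wherever OUTFILE is opened, which is not shown in the posted code.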
