
Regular expression replace

I need a regular expression (or a short script) that does the following:

  • remove all symbols
  • allow at most one hyphen in a row (collapse runs of consecutive hyphens into one)
  • allow at most one period in total

Examples:

  • Mike&Ike output is: MikeIke
  • Mike-Ike output is: Mike-Ike
  • Mike-Ike-Jill output is: Mike-Ike-Jill
  • Mike--Ike-Jill output is: Mike-Ike-Jill
  • Mike--Ike---Jill output is: Mike-Ike-Jill
  • Mike.Ike.Bill output is: Mike.IkeBill
  • Mike***Joe output is: MikeJoe
  • Mike123 output is: Mike123


#!/usr/bin/env perl

use 5.10.0;
use strict;
use warnings;

my @samples = (
    "Mike&Ike"          => "MikeIke",
    "Mike-Ike"          => "Mike-Ike",
    "Mike-Ike-Jill"     => "Mike-Ike-Jill",
    "Mike--Ike-Jill"    => "Mike-Ike-Jill",
    "Mike--Ike---Jill"  => "Mike-Ike-Jill",
    "Mike.Ike.Bill"     => "Mike.IkeBill",
    "Mike***Joe"        => "MikeJoe",
    "Mike123"           => "Mike123",
);

while (my($got, $want) = splice(@samples, 0, 2)) {
    my $had = $got;
    for ($got) {
        # 1) Collapse each run of dash characters down to its first dash.
        s/ ( \p{Dash} ) \p{Dash}+                           /$1/xg;
        # 2) Allow at most one period in total: keep deleting the second one.
        1 while s/ ^ [^.]* \. [^.]* \K \.                   //x   ;
        # 3) Remove all symbols and punctuation,
        #    except for dashes and periods.
        s/ (?! [\p{Dash}.] ) [\p{Symbol}\p{Punctuation}]    //xg  ;
    }

    if ($got eq $want) { print "RIGHT" }
    else               { print "WRONG" }
    print ":\thad\t<$had>\n\twanted\t<$want>\n\tgot\t<$got>\n";
}

Generates:

RIGHT:  had <Mike&Ike>
    wanted  <MikeIke>
    got <MikeIke>
RIGHT:  had <Mike-Ike>
    wanted  <Mike-Ike>
    got <Mike-Ike>
RIGHT:  had <Mike-Ike-Jill>
    wanted  <Mike-Ike-Jill>
    got <Mike-Ike-Jill>
RIGHT:  had <Mike--Ike-Jill>
    wanted  <Mike-Ike-Jill>
    got <Mike-Ike-Jill>
RIGHT:  had <Mike--Ike---Jill>
    wanted  <Mike-Ike-Jill>
    got <Mike-Ike-Jill>
RIGHT:  had <Mike.Ike.Bill>
    wanted  <Mike.IkeBill>
    got <Mike.IkeBill>
RIGHT:  had <Mike***Joe>
    wanted  <MikeJoe>
    got <MikeJoe>
RIGHT:  had <Mike123>
    wanted  <Mike123>
    got <Mike123>


You could do this in several passes.
It's a fairly generic workaround that could be shortened by using lookbehind
(not all regex flavors support it).

  1. replace each run of multiple - with a single - using the regex -{2,}
  2. remove all symbols except - and . using the regex [^-\.A-Za-z0-9]
  3. replace the first . with a temporary character, e.g. !, then remove the remaining .
  4. replace the ! from the previous step with .
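
As a rough illustration, the four passes above could be sketched in Python like this. The function name clean is my own choice for the sketch and not part of the question. ! serves as the temporary character, which is only safe because pass 2 has already removed every ! from the input.

```python
import re


def clean(text: str) -> str:
    """Hypothetical helper: apply the four passes in order."""
    # Pass 1: collapse runs of two or more hyphens into a single hyphen.
    text = re.sub(r"-{2,}", "-", text)
    # Pass 2: drop everything except ASCII letters, digits, periods and hyphens.
    text = re.sub(r"[^A-Za-z0-9.-]", "", text)
    # Pass 3: turn the first period into the placeholder, delete the others.
    text = text.replace(".", "!", 1).replace(".", "")
    # Pass 4: turn the placeholder back into a period.
    return text.replace("!", ".")


if __name__ == "__main__":
    for sample in ["Mike&Ike", "Mike--Ike---Jill", "Mike.Ike.Bill", "Mike***Joe"]:
        print(sample, "->", clean(sample))
```

One thing to watch with this ordering: because hyphens are collapsed before symbols are removed, an input such as a-&-b ends up as a--b. Running pass 2 before pass 1 would avoid that.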

Update, using C# / .NET
(I'm not a C# programmer; I used this regex tester and this reference for the C# / .NET regex flavor.)

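using System.Text.RegularExpressions;

String str = "Mike&Ike ......";
str = Regex.Replace( str, @"-+", @"-" );
str = Regex.Replace( str, @"(?<=\.)(.*?)\.", @"$1" );
// Keep hyphens and periods here; a bare [^\w\r\n] would strip them as well.
str = Regex.Replace( str, @"[^\w.\r\n-]", @"" );

  1. replace each run of - with a single -
  2. remove every . that is not the first one, using the positive lookbehind (?<=...)
  3. remove symbols (actually everything that is not a word character, period, hyphen, or newline). \w is roughly [A-Za-z0-9_], and in .NET it also matches Unicode letters and digits by default, so underscores survive; use [^A-Za-z0-9.\r\n-] instead if they should be removed too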