How to write a conditional regex

I want a regex that does one thing if the string contains 3 instances of . and something else if it contains more than 3 instances.

For example:

aaa.bbb.ccc.ddd // one part of the regex

aaa.bbb.ccc.ddd.eee // the second part of the regex

How do I achieve this in either JavaScript or C#?

Something like:

?(\.){4} then THIS else THAT

within the regex...

Update

Ok basically what I'm doing is this:

I want to switch, for any given System.Uri, to another subdomain in an extension method.

The problem I came across is that my domains are usually of the form http://subdomain.domain.TLD.TLD/more/url, but sometimes they are just http://domain.TLD.TLD/more/url (which simply points to www).

So this is what I came up with:

public static class UriExtensions
{
    private const string TopLevelDomainRegex = @"(\.[^\.]{2,3}|\.[^\.]{2,3}\.[^\.]{2,3})$";
    private const string UnspecifiedSubdomainRegex = @"^((http[s]?|ftp):\/\/)(()([^:\/\s]+))(:([^\/]*))?((?:\/)?|(?:\/)(((\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?))?$";
    private const string SpecifiedSubdomainRegex = @"^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?((?:\/)?|(?:\/)(((\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?))?$";

    public static string AbsolutePathToSubdomain(this Uri uri, string subdomain)
    {
        subdomain = subdomain == "www" ? string.Empty : string.Concat(subdomain, ".");

        var replacement = "$1{0}$5$6".FormatWith(subdomain);

        var spec = Regex.Replace(uri.Authority, TopLevelDomainRegex, string.Empty).Distinct().Count(c => c == '.') != 0;
        return Regex.Replace(uri.AbsoluteUri, spec ? SpecifiedSubdomainRegex : UnspecifiedSubdomainRegex, replacement);
    }
}

Basically with this code I take the System.Uri and:

  1. Take just the subdomain.domain.TLD.TLD using the Authority property.
  2. Match it against "pseudo TLDs": the regex checks for anything ending in .XX[X] or .XX[X].XX[X]. (I'm never going to have a registered domain of only 2-3 letters, which is the case that would break it.)
  3. I strip the TLDs, and end up with either domain or subdomain.domain
  4. If the resulting data has zero dots, I use the UnspecifiedSubdomainRegex, because I couldn't figure out how to use SpecifiedSubdomainRegex to tell it that if it has no dots on that part, it should return string.Empty

My question, then, is whether there is a way to merge these three regexes into something simpler.

PS: Forget about JavaScript; I was only using it to test the regex on the fly.


You can do this using the (?(?=condition)then|else) construct. However, this is not available in JavaScript (but it is available in .NET, Perl and PCRE):

^(?(?=(?:[^.]*\.){3}[^.]*$)aaa|eee)

for example, will check if a string contains exactly three dots, and if it does, it tries to match aaa at the start of the string; otherwise it tries to match eee. So it will match the first three letters of

aaa.bbb.ccc.ddd
eee.ddd.ccc.bbb.aaa
eee

but fail on

aaa.bbb.ccc
eee.ddd.ccc.bbb
aaa.bbb.ccc.ddd.eee

Explanation:

^            # Start of string
(?           # Conditional: If the following lookahead succeeds:
 (?=         #   Positive lookahead - can we match...
  (?:        #     the following group, consisting of
   [^.]*\.   #     0+ non-dots and 1 dot
  ){3}       #     3 times
  [^.]*      #     followed only by non-dots...
  $          #     until end-of-string?
 )           #   End of lookahead
 aaa         # Then try to match aaa
|            # else...
 eee         # try to match eee
)            # End of conditional
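
For engines without lookahead conditionals (JavaScript, and also Python's built-in re module), the same branching can be approximated by pairing a positive lookahead with its negation inside an alternation. This is only a sketch of that workaround, not the conditional construct itself; the names EXACTLY_THREE_DOTS and first_three are made up for illustration.

```python
import re

# Condition: the whole string contains exactly three dots.
EXACTLY_THREE_DOTS = r"(?:[^.]*\.){3}[^.]*$"

# Emulated conditional: (?=cond)then | (?!cond)else
pattern = re.compile(
    r"^(?:(?=" + EXACTLY_THREE_DOTS + r")aaa"
    r"|(?!" + EXACTLY_THREE_DOTS + r")eee)"
)

def first_three(text):
    """Return the matched prefix, or None when the selected branch fails."""
    m = pattern.match(text)
    return m.group(0) if m else None

for s in ("aaa.bbb.ccc.ddd", "eee.ddd.ccc.bbb.aaa", "eee",
          "aaa.bbb.ccc", "eee.ddd.ccc.bbb", "aaa.bbb.ccc.ddd.eee"):
    print(s, "->", first_three(s))
```

The first three strings yield a match and the last three yield None, the same split as the lists above. The cost of the workaround is that the condition is written (and possibly evaluated) twice.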


^(?:[^.]*\.[^.]*){3}$

The regex above matches a string that has exactly 3 dots: http://rubular.com/r/Tsaemvz1Yi.

^(?:[^.]*\.[^.]*){4,}$

And this one matches a string that has 4 dots or more: http://rubular.com/r/IJDeQWVhEB
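
If the "then/else" decision can live in the calling code rather than inside the regex, the two patterns can simply be tried in turn. A rough sketch in Python; the function name classify and its return labels are invented here for illustration.

```python
import re

EXACTLY_THREE = re.compile(r"^(?:[^.]*\.[^.]*){3}$")
FOUR_OR_MORE = re.compile(r"^(?:[^.]*\.[^.]*){4,}$")

def classify(text):
    """Label a string by how many dots it contains."""
    if EXACTLY_THREE.match(text):
        return "three"   # run the first part of the logic here
    if FOUR_OR_MORE.match(text):
        return "more"    # run the second part of the logic here
    return "fewer"

print(classify("aaa.bbb.ccc.ddd"))
print(classify("aaa.bbb.ccc.ddd.eee"))
```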


In Python (excuse me, but regexes know no language boundaries):

import re

# Succeeds as soon as a fourth dot is found; the rest of the string is not scanned.
regx = re.compile(r'^([^.]*?\.){3}[^.]*?\.')

for ss in ("aaa.bbb.ccc",
           "aaa.bbb.ccc.ddd",
           'aaa.bbb.ccc.ddd.eee',
           'a.b.c.d.e.f.g.h.i...'):
    if regx.search(ss):
        print(ss + '     has at least 4 dots in it')
    else:
        print(ss + '     has a maximum of 3 dots in it')

Result:

aaa.bbb.ccc     has a maximum of 3 dots in it
aaa.bbb.ccc.ddd     has a maximum of 3 dots in it
aaa.bbb.ccc.ddd.eee     has at least 4 dots in it
a.b.c.d.e.f.g.h.i...     has at least 4 dots in it

This pattern doesn't require the entire string to be analysed (there is no $ in it): the match ends at the fourth dot, so it does less work on long strings that contain at least four dots.
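
To see where the match actually stops, one can look at the end offset of the match object. This is only an illustration of the point above, not a benchmark; the test string is made up.

```python
import re

regx = re.compile(r'^([^.]*?\.){3}[^.]*?\.')

# Four dots up front, followed by a long tail the regex never has to read.
long_string = "a.b.c.d." + "x" * 1_000_000

m = regx.search(long_string)
print(m.group(0))  # a.b.c.d.
print(m.end())     # 8, the offset just past the fourth dot
```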


You don't need Regex for this (as for many other common tasks).

public static string AbsolutePathToSubdomain(this Uri uri, string subdomain)
{
    // Pre-process the new subdomain
    if (subdomain == null || subdomain.Equals("www", StringComparison.CurrentCultureIgnoreCase))
        subdomain = string.Empty;

    // Count number of TLDs (assume at least one)
    List<string> parts = uri.Host.Split('.').ToList();
    int tldCount = 1;
    if (parts.Count >= 2 && parts[parts.Count - 2].Length <= 3)
    {
        tldCount++;
    }

    // Drop all subdomains
    if (parts.Count - tldCount > 1)
        parts.RemoveRange(0, parts.Count - tldCount - 1);

    // Add new subdomain, if applicable
    if (subdomain != string.Empty)
        parts.Insert(0, subdomain);

    // Construct the new URI
    UriBuilder builder = new UriBuilder(uri);
    builder.Host = string.Join(".", parts.ToArray());
    builder.Path = "/";
    builder.Query = "";
    builder.Fragment = "";

    return builder.Uri.ToString();
}
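
For comparison, here is roughly the same split-based idea sketched in Python with urllib.parse. It is an assumption-laden sketch: the name switch_subdomain is made up, the URL is assumed to be absolute, any user:password part is dropped, and the "short second-to-last label means a two-part TLD" heuristic is copied from the C# above. Unlike the C# version, it keeps the path, query and fragment, which is closer to what the regex replacement in the question did.

```python
from urllib.parse import urlsplit, urlunsplit

def switch_subdomain(url, subdomain):
    """Hypothetical helper: return `url` with its subdomain replaced."""
    # "www" (or nothing) means "no explicit subdomain".
    if subdomain is None or subdomain.lower() == "www":
        subdomain = ""

    parts = urlsplit(url)
    labels = parts.hostname.split(".")

    # Heuristic from the C# code: one TLD label, or two when the
    # second-to-last label is at most 3 characters long (e.g. "co.uk").
    tld_count = 1
    if len(labels) >= 2 and len(labels[-2]) <= 3:
        tld_count += 1

    # Drop existing subdomains: keep the domain label plus the TLD labels.
    labels = labels[-(tld_count + 1):]

    if subdomain:
        labels.insert(0, subdomain)

    host = ".".join(labels)
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(switch_subdomain("http://sub.domain.co.uk/more/url", "api"))
print(switch_subdomain("http://domain.co.uk/more/url", "shop"))
```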