Get domain name (not subdomain) in php

2022-12-28 02:45 问答作者：

I have a URL which can be any of the following formats:

http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar

Essential开发者_StackOverflow社区ly, I need to be able to match any normal URL. How can I extract example.com (or .net, whatever the tld happens to be. I need this to work with any TLD.) from all of these via a single regex?

Well you can use parse_url to get the host:

$info = parse_url($url);
$host = $info['host'];

Then, you can do some fancy stuff to get only the TLD and the Host

$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];

Not very elegant, but should work.

If you want an explanation, here it goes:

First we grab everything between the scheme (http://, etc), by using parse_url's capabilities to... well.... parse URL's. :)

Then we take the host name, and separate it into an array based on where the periods fall, so test.world.hello.myname would become:

array("test", "world", "hello", "myname");

After that, we take the number of elements in the array (4).

Then, we subtract 2 from it to get the second to last string (the hostname, or example, in your example)

Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD

Then we combine those two parts with a period, and you have your base host name.

It is not possible to get the domain name without using a TLD list to compare with as their exist many cases with completely the same structure and length:

 nas.db.de (Subdomain)
 bbc.co.uk (Top-Level-Domain)

 www.uk.com (Subdomain)
 big.uk.com (Second-Level-Domain)

Mozilla's public suffix list should be the best option as it is used by all major browsers:
https://publicsuffix.org/list/public_suffix_list.dat

Feel free to use my function:

function tld_list($cache_dir=null) {
    // we use "/tmp" if $cache_dir is not set
    $cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
    $lock_dir = $cache_dir . '/public_suffix_list_lock/';
    $list_dir = $cache_dir . '/public_suffix_list/';
    // refresh list all 30 days
    if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
        return $list_dir;
    }
    // use exclusive lock to avoid race conditions
    if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
        // read from source
        $list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
        if ($list) {
            // the list is older than 30 days so delete everything first
            if (file_exists($list_dir)) {
                foreach (glob($list_dir . '*') as $filename) {
                    unlink($filename);
                }
                rmdir($list_dir);
            }
            // now set list directory with new timestamp
            mkdir($list_dir);
            // read line-by-line to avoid high memory usage
            while ($line = fgets($list)) {
                // skip comments and empty lines
                if ($line[0] == '/' || !$line) {
                    continue;
                }
                // remove wildcard
                if ($line[0] . $line[1] == '*.') {
                    $line = substr($line, 2);
                }
                // remove exclamation mark
                if ($line[0] == '!') {
                    $line = substr($line, 1);
                }
                // reverse TLD and remove linebreak
                $line = implode('.', array_reverse(explode('.', (trim($line)))));
                // we split the TLD list to reduce memory usage
                touch($list_dir . $line);
            }
            fclose($list);
        }
        @rmdir($lock_dir);
    }
    // repair locks (should never happen)
    if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
        @rmdir($lock_dir);
    }
    return $list_dir;
}
function get_domain($url=null) {
    // obtain location of public suffix list
    $tld_dir = tld_list();
    // no url = our own host
    $url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
    // add missing scheme      ftp://            http:// ftps://   https://
    $url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
    // remove "/path/file.html", "/:80", etc.
    $url = parse_url($url, PHP_URL_HOST);
    // replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
    $url = trim($url, '.');
    // check if TLD exists
    $url = explode('.', $url);
    $parts = array_reverse($url);
    foreach ($parts as $key => $part) {
        $tld = implode('.', $parts);
        if (file_exists($tld_dir . $tld)) {
            return !$key ? '' : implode('.', array_slice($url, $key - 1));
        }
        // remove last part
        array_pop($parts);
    }
    return '';
}

What it makes special:

it accepts every input like URLs, hostnames or domains with- or without scheme
the list is downloaded row-by-row to avoid high memory usage
it creates a new file per TLD in a cache folder so get_domain() only needs to check through file_exists() if it exists so it does not need to include a huge database on every request like TLDExtract does it.
the list will be automatically updated every 30 days

Test:

$urls = array(
    'http://www.example.com',// example.com
    'http://subdomain.example.com',// example.com
    'http://www.example.uk.com',// example.uk.com
    'http://www.example.co.uk',// example.co.uk
    'http://www.example.com.ac',// example.com.ac
    'http://example.com.ac',// example.com.ac
    'http://www.example.accident-prevention.aero',// example.accident-prevention.aero
    'http://www.example.sub.ar',// sub.ar
    'http://www.congresodelalengua3.ar',// congresodelalengua3.ar
    'http://congresodelalengua3.ar',// congresodelalengua3.ar
    'http://www.example.pvt.k12.ma.us',// example.pvt.k12.ma.us
    'http://www.example.lib.wy.us',// example.lib.wy.us
    'com',// empty
    '.com',// empty
    'http://big.uk.com',// big.uk.com
    'uk.com',// empty
    'www.uk.com',// www.uk.com
    '.uk.com',// empty
    'stackoverflow.com',// stackoverflow.com
    '.foobarfoo',// empty
    '',// empty
    false,// empty
    ' ',// empty
    1,// empty
    'a',// empty    
);

Recent version with explanations (German):
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm

My solution in https://gist.github.com/pocesar/5366899

and the tests are here http://codepad.viper-7.com/GAh1tP

It works with any TLD, and hideous subdomain patterns (up to 3 subdomains).

There's a test included with many domain names.

Won't paste the function here because of the weird indentation for code in StackOverflow (could have fenced code blocks like github)

echo getDomainOnly("http://example.com/foo/bar");

function getDomainOnly($host){
    $host = strtolower(trim($host));
    $host = ltrim(str_replace("http://","",str_replace("https://","",$host)),"www.");
    $count = substr_count($host, '.');
    if($count === 2){
        if(strlen(explode('.', $host)[1]) > 3) $host = explode('.', $host, 2)[1];
    } else if($count > 2){
        $host = getDomainOnly(explode('.', $host, 2)[1]);
    }
    $host = explode('/',$host);
    return $host[0];
}

I recommend using TLDExtract library for all operations with domain name.

I think the best way to handle this problem is:

$second_level_domains_regex = '/\.asn\.au$|\.com\.au$|\.net\.au$|\.id\.au$|\.org\.au$|\.edu\.au$|\.gov\.au$|\.csiro\.au$|\.act\.au$|\.nsw\.au$|\.nt\.au$|\.qld\.au$|\.sa\.au$|\.tas\.au$|\.vic\.au$|\.wa\.au$|\.co\.at$|\.or\.at$|\.priv\.at$|\.ac\.at$|\.avocat\.fr$|\.aeroport\.fr$|\.veterinaire\.fr$|\.co\.hu$|\.film\.hu$|\.lakas\.hu$|\.ingatlan\.hu$|\.sport\.hu$|\.hotel\.hu$|\.ac\.nz$|\.co\.nz$|\.geek\.nz$|\.gen\.nz$|\.kiwi\.nz$|\.maori\.nz$|\.net\.nz$|\.org\.nz$|\.school\.nz$|\.cri\.nz$|\.govt\.nz$|\.health\.nz$|\.iwi\.nz$|\.mil\.nz$|\.parliament\.nz$|\.ac\.za$|\.gov\.za$|\.law\.za$|\.mil\.za$|\.nom\.za$|\.school\.za$|\.net\.za$|\.co\.uk$|\.org\.uk$|\.me\.uk$|\.ltd\.uk$|\.plc\.uk$|\.net\.uk$|\.sch\.uk$|\.ac\.uk$|\.gov\.uk$|\.mod\.uk$|\.mil\.uk$|\.nhs\.uk$|\.police\.uk$/';
$domain = $_SERVER['HTTP_HOST'];
$domain = explode('.', $domain);
$domain = array_reverse($domain);
if (preg_match($second_level_domains_regex, $_SERVER['HTTP_HOST']) {
    $domain = "$domain[2].$domain[1].$domain[0]";
} else {
    $domain = "$domain[1].$domain[0]";
}

$onlyHostName = implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2));

Using https://subdomain.domain.com/some/path as example

parse_url($link, PHP_URL_HOST) returns subdomain.domain.com

explode('.', parse_url($link, PHP_URL_HOST)) then breaks subdomain.domain.com into an array:

array(3) {
  [0]=>
  string(5) "subdomain"
  [1]=>
  string(7) "domain"
  [2]=>
  string(3) "com"
}

array_slice then slices the array so only the last 2 values are in the array (signified by the -2):

array(2) {
  [0]=>
  string(6) "domain"
  [1]=>
  string(3) "com"
}

implode then combines those two array values back together, ultimately giving you the result of domain.com

Note: this will only work when end domain you're expecting only has one . in it, like something.domain.com or else.something.domain.net

It will not work for something.domain.co.uk where you would expect domain.co.uk

There are two ways to extract subdomain from a host:

The first method that is more accurate is to use a database of tlds (like public_suffix_list.dat) and match domain with it. This is a little heavy in some cases. There are some PHP classes for using it like php-domain-parser and TLDExtract.

The second way is not as accurate as the first one, but is very fast and it can give the correct answer in many case, I wrote this function for it:

function get_domaininfo($url) {
    // regex can be replaced with parse_url
    preg_match("/^(https|http|ftp):\/\/(.*?)\//", "$url/" , $matches);
    $parts = explode(".", $matches[2]);
    $tld = array_pop($parts);
    $host = array_pop($parts);
    if ( strlen($tld) == 2 && strlen($host) <= 3 ) {
        $tld = "$host.$tld";
        $host = array_pop($parts);
    }

    return array(
        'protocol' => $matches[1],
        'subdomain' => implode(".", $parts),
        'domain' => "$host.$tld",
        'host'=>$host,'tld'=>$tld
    );
}

Example:

print_r(get_domaininfo('http://mysubdomain.domain.co.uk/index.php'));

Returns:

Array
(
    [protocol] => https
    [subdomain] => mysubdomain
    [domain] => domain.co.uk
    [host] => domain
    [tld] => co.uk
)

Here's a function I wrote to grab the domain without subdomain(s), regardless of whether the domain is using a ccTLD or a new style long TLD, etc... There is no lookup or huge array of known TLDs, and there's no regex. It can be a lot shorter using the ternary operator and nesting, but I expanded it for readability.

// Per Wikipedia: "All ASCII ccTLD identifiers are two letters long, 
// and all two-letter top-level domains are ccTLDs."

function topDomainFromURL($url) {
  $url_parts = parse_url($url);
  $domain_parts = explode('.', $url_parts['host']);
  if (strlen(end($domain_parts)) == 2 ) { 
    // ccTLD here, get last three parts
    $top_domain_parts = array_slice($domain_parts, -3);
  } else {
    $top_domain_parts = array_slice($domain_parts, -2);
  }
  $top_domain = implode('.', $top_domain_parts);
  return $top_domain;
}

function getDomain($url){
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if(preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)){
        return $regs['domain'];
    }
    return FALSE;
}

echo getDomain("http://example.com"); // outputs 'example.com'
echo getDomain("http://www.example.com"); // outputs 'example.com'
echo getDomain("http://mail.example.co.uk"); // outputs 'example.co.uk'

I had problems with the solution provided by pocesar. When I would use for instance subdomain.domain.nl it would not return domain.nl. Instead it would return subdomain.domain.nl Another problem was that domain.com.br would return com.br

I am not sure but i fixed these issues with the following code (i hope it will help someone, if so I am a happy man):

function get_domain($domain, $debug = false){
    $original = $domain = strtolower($domain);
    if (filter_var($domain, FILTER_VALIDATE_IP)) {
        return $domain;
    }
    $debug ? print('<strong style="color:green">&raquo;</strong> Parsing: '.$original) : false;
    $arr = array_slice(array_filter(explode('.', $domain, 4), function($value){
        return $value !== 'www';
    }), 0); //rebuild array indexes
    if (count($arr) > 2){
        $count = count($arr);
        $_sub = explode('.', $count === 4 ? $arr[3] : $arr[2]);
        $debug ? print(" (parts count: {$count})") : false;
        if (count($_sub) === 2){ // two level TLD
            $removed = array_shift($arr);
            if ($count === 4){ // got a subdomain acting as a domain
                $removed = array_shift($arr);
            }
            $debug ? print("<br>\n" . '[*] Two level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }elseif (count($_sub) === 1){ // one level TLD
            $removed = array_shift($arr); //remove the subdomain
            if (strlen($arr[0]) === 2 && $count === 3){ // TLD domain must be 2 letters
                array_unshift($arr, $removed);
            }elseif(strlen($arr[0]) === 3 && $count === 3){
                array_unshift($arr, $removed);
            }else{
                // non country TLD according to IANA
                $tlds = array(
                    'aero',
                    'arpa',
                    'asia',
                    'biz',
                    'cat',
                    'com',
                    'coop',
                    'edu',
                    'gov',
                    'info',
                    'jobs',
                    'mil',
                    'mobi',
                    'museum',
                    'name',
                    'net',
                    'org',
                    'post',
                    'pro',
                    'tel',
                    'travel',
                    'xxx',
                );
                if (count($arr) > 2 && in_array($_sub[0], $tlds) !== false){ //special TLD don't have a country
                    array_shift($arr);
                }
            }
            $debug ? print("<br>\n" .'[*] One level TLD: <strong>'.join('.', $_sub).'</strong> ') : false;
        }else{ // more than 3 levels, something is wrong
            for ($i = count($_sub); $i > 1; $i--){
                $removed = array_shift($arr);
            }
            $debug ? print("<br>\n" . '[*] Three level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }
    }elseif (count($arr) === 2){
        $arr0 = array_shift($arr);
        if (strpos(join('.', $arr), '.') === false && in_array($arr[0], array('localhost','test','invalid')) === false){ // not a reserved domain
            $debug ? print("<br>\n" .'Seems invalid domain: <strong>'.join('.', $arr).'</strong> re-adding: <strong>'.$arr0.'</strong> ') : false;
            // seems invalid domain, restore it
            array_unshift($arr, $arr0);
        }
    }
    $debug ? print("<br>\n".'<strong style="color:gray">&laquo;</strong> Done parsing: <span style="color:red">' . $original . '</span> as <span style="color:blue">'. join('.', $arr) ."</span><br>\n") : false;
    return join('.', $arr);
}

Here's one that works for all domains, including those with second level domains like "co.uk"

function strip_subdomains($url){

    # credits to gavingmiller for maintaining this list
    $second_level_domains = file_get_contents("https://raw.githubusercontent.com/gavingmiller/second-level-domains/master/SLDs.csv");

    # presume sld first ...
    $possible_sld = implode('.', array_slice(explode('.', $url), -2));

    # and then verify it
    if (strpos($second_level_domains, $possible_sld)){
        return  implode('.', array_slice(explode('.', $url), -3));
    } else {
        return  implode('.', array_slice(explode('.', $url), -2));
    }
}

Looks like there's a duplicate question here: delete-subdomain-from-url-string-if-subdomain-is-found

Very late, I see that you marked regex as a keyword and my function works like a charm, so far I haven't found a url that fails:

function get_domain_regex($url){
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }else{
    return false;
  }
}

if you want one without regex I have this one, which I am sure I also took from this post

function get_domain($url){
  $parseUrl = parse_url($url);
  $host = $parseUrl['host'];
  $host_array = explode(".", $host);
  $domain = $host_array[count($host_array)-2] . "." . $host_array[count($host_array)-1];
  return $domain;
}

They both work amazing, BUT, this took me a while to realize if the url doesn't start with http:// or https:// it will fail so make sure the url string starts with the protocol.

Simply try this:

   preg_match('/(www.)?([^.]+\.[^.]+)$/', $yourHost, $matches);

   echo "domain name is: {$matches[0]}\n";

this working for majority of domains.

This function will return the domain name without the extension of any url given even if you parse a url without the http:// or https://

You can extend this code

(?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?

with more extensions if you want to handle more second level domainnames.

    function get_domain_name($url){
      $pieces = parse_url($url);
      $domain = isset($pieces['host']) ? $pieces['host'] : $url;
      $domain = strtolower($domain);
      $domain = preg_replace('/.international$/', '.com', $domain);
      if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,90}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
          if (preg_match('/(.*?)((?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?(?:\.asn)?.[a-z]{2,6})$/i', $regs['domain'], $matches)) {
              return $matches[1];
          }else  return $regs['domain'];
      }else{
        return $url;
      }
    }

I'm using this to achieve the same target and it always works, I hope it will help others.

$url          = https://use.fontawesome.com/releases/v5.11.2/css/all.css?ver=2.7.5
$handle       = pathinfo( parse_url( $url )['host'] )['filename'];
$final_handle = substr( $handle , strpos( $handle , '.' ) + 1 );

print_r($final_handle); // fontawesome

Simplest solution

@preg_replace('#\/(.)*#', '', @preg_replace('#^https?://(www.)?#', '', $url))

Simply try this:

<?php
  $host = $_SERVER['HTTP_HOST'];
  preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
  echo "domain name is: {$matches[0]}\n";
?>

继续阅读：domain-name php regex

Get domain name (not subdomain) in php

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？