Get domain from URL to GROUP BY using MySQL
I have a table filled with URLs. The URLs are in all sorts of formats: http://foo.com, http://bar.foo.com, http://foo.com/b开发者_运维百科ar, etc. But I'm only interested in the domain name itself, so in this case: foo.com. What I'd like to do is select how many times domain names exists in this table. So something like:
SELECT "whatever the domain is in field 'url'", COUNT(*) AS count
FROM table_with_urls
GROUP BY "whatever the domain is in field 'url'"
There are a few similar questions on Stack Overflow, but nothing really answered this. I can't use LIKE or match something with REGEXP, because I'm not (always) looking for specific domain names to match against, but mostly I just want all domain names from the table along with a total count.
Is this possible using MySQL?
i had the same problem and this is what i did:
select SUBSTRING(url from 1 for locate('/',url ,10)-1),count(*) from url_list group by SUBSTRING(url from 1 for locate('/',url ,10)-1);
Add another indexed column for 'domain' and when you do an INSERT, store this value separately.
If you want to install a MySQL extension then https://github.com/StirlingMarketingGroup/mysql-get-etld-p1
It extracts basically what you'd expect it to
select`get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar');-- 'complex-domain.co.uk'
select`get_etld_p1`('https://www.bbc.co.uk/');-- 'bbc.co.uk'
select`get_etld_p1`('https://github.com/StirlingMarketingGroup/');-- 'github.com'
select`get_etld_p1`('https://localhost:10000/index');-- 'localhost'
select`get_etld_p1`('android-app://com.google.android.gm');-- 'com.google.android.gm'
select`get_etld_p1`('example.test.domain.com');-- 'domain.com'
select`get_etld_p1`('postgres://user:pass@host.com:5432/path?k=v#f');-- 'host.com'
select`get_etld_p1`('exzvk.omsk.so-ups.ru');-- 'so-ups.ru'
select`get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck');-- '10.64.3.5'
select`get_etld_p1`('not a domain');-- null
Then, if you wanted that to be performant, you could make a second, denormalizing, column that stores just those values, something like
CREATE TABLE `db`.`sometablewithurls` (
`SomeTableWithURLsID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`URL` TEXT NOT NULL DEFAULT '',
`_ETLDP1` VARCHAR(255) NOT NULL DEFAULT '',
PRIMARY KEY (`SomeTableWithURLsID`),
INDEX `_ETLDP1` (`_ETLDP1` ASC));
DROP TRIGGER IF EXISTS `db`.`sometablewithurls_BEFORE_INSERT`;
DELIMITER $$
USE `db`$$
CREATE DEFINER = CURRENT_USER TRIGGER `db`.`sometablewithurls_BEFORE_INSERT` BEFORE INSERT ON `sometablewithurls` FOR EACH ROW
BEGIN
set new.`_ETLDP1`=ifnull(`get_etld_p1`(new.`URL`),'');
END$$
DELIMITER ;
DROP TRIGGER IF EXISTS `db`.`sometablewithurls_BEFORE_UPDATE`;
DELIMITER $$
USE `db`$$
CREATE DEFINER = CURRENT_USER TRIGGER `db`.`sometablewithurls_BEFORE_UPDATE` BEFORE UPDATE ON `sometablewithurls` FOR EACH ROW
BEGIN
set new.`_ETLDP1`=ifnull(`get_etld_p1`(new.`URL`),'');
END$$
DELIMITER ;
Notice the index on the _ETLDP1
(stands for extended top-level domain plus 1), and the trigger updating it both on insert and on update to make sure it keeps up to date even if URL changes.
精彩评论