How to detect the presence of a domain name in a string of text in PHP
I've been having a massive spam problem on my site: people have been creating dozens of accounts every day and spamming links to their sites via URL shortening services (of which there are hundreds). Is there a function that will check for the existence of a link (without http:// or www.) which could be at any TLD, mostly the less common ones?
Most of them are in the form domain.ext/43tg34g
I would like to check the presence of domain.ext and prohibit new users from posting these.
If they create them as hyperlinks then maybe you can search for <a>...</a>
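A minimal sketch of that check, assuming the posts are stored as raw HTML. The function name containsHyperlink is made up for illustration, and a regex is only a rough test for anchor tags, not a real HTML parser:

```php
<?php
// Hypothetical helper (not part of PHP): returns true if the text
// contains an opening <a ...> tag. \b keeps tags such as <abbr> from matching.
function containsHyperlink(string $text): bool
{
    return preg_match('~<a\b[^>]*>~i', $text) === 1;
}

var_dump(containsHyperlink('Visit <a href="http://example.com">my site</a>'));
var_dump(containsHyperlink('Just a plain sentence with <abbr>HTML</abbr>.'));
```

This only catches links written as HTML anchors, so it would not help with bare domain.ext/43tg34g text.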
I think the trick is to scan the text for TLDs and look backwards for all consecutive non-whitespace characters before each one. Since a dot that ends a sentence or an ellipsis is typically followed by a space, I think you can safely assume that the dot in, for instance:
.com
does not represent a sentence ending or the end of an ellipsis. (Can't think of any other common uses for dots at the moment.)
Mind you though that it's fairly easy to circumvent such preventive measures with things like:
test . com
test[dot]com
test. com
etc...
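One way to blunt the simplest of these tricks is to normalize the text before scanning it. The following is only a heuristic sketch: the function name normalizeObfuscatedDots is made up, and the rule "only glue the dot back when a lowercase letter follows" is an assumption meant to leave ordinary sentence endings alone. Run it on a copy of the text used for detection, not on what you display.

```php
<?php
// Hypothetical helper: undo a few common dot obfuscations before scanning.
function normalizeObfuscatedDots(string $text): string
{
    // "[dot]", "(dot)" or "{dot}", optionally padded with spaces, becomes ".".
    $text = preg_replace('~\s*[\[({]\s*dot\s*[\])}]\s*~i', '.', $text);

    // "test . com", "test .com" and "test. com" become "test.com".
    // Deliberately case-sensitive: a capital letter after ". " is treated
    // as the start of a new sentence and is left untouched.
    $text = preg_replace(
        '~(?<=[a-z0-9])\s+\.\s*(?=[a-z])|(?<=[a-z0-9])\.\s+(?=[a-z])~',
        '.',
        $text
    );

    return $text;
}

echo normalizeObfuscatedDots('buy at test [dot] com or test . com'), "\n";
```

Anyone determined will find spellings this does not cover, so treat it as raising the bar rather than closing the hole.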
Perhaps I'll offer you a basic regex later on. I have to go now though.
Alright, I've had a crack at it. Probably still a bit of a naive solution, but it's a start:
// TLDs acquired from http://www.iana.org/domains/root/db/
// left out non-Latin TLDs though
$tlds = array(
'ac', 'ad', 'aero', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'arpa', 'ar', 'asia', 'as', 'at', 'au', 'aw', 'ax', 'az',
'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'biz', 'bi', 'bj', 'bl', 'bm', 'bn', 'bo', 'bq', 'br', 'bs', 'bt', 'bv', 'bw', 'by', 'bz',
'cat', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'coop', 'com', 'co', 'cr', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz',
'de', 'dj', 'dk', 'dm', 'do', 'dz',
'ec', 'edu', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu',
'fi', 'fj', 'fk', 'fm', 'fo', 'fr',
'ga', 'gb', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gov', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy',
'hk', 'hm', 'hn', 'hr', 'ht', 'hu',
'id', 'ie', 'il', 'im', 'info', 'int', 'in', 'io', 'iq', 'ir', 'is', 'it',
'je', 'jm', 'jobs', 'jo', 'jp',
'ke', 'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz',
'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv', 'ly',
'ma', 'mc', 'md', 'me', 'mf', 'mg', 'mh', 'mil', 'mk', 'ml', 'mm', 'mn', 'mobi', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'museum', 'mu', 'mv', 'mw', 'mx', 'my', 'mz',
'name', 'na', 'nc', 'net', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz',
'om', 'org',
'pa', 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pro', 'pr', 'ps', 'pt', 'pw', 'py',
'qa',
're', 'ro', 'rs', 'ru', 'rw',
'sa', 'sb', 'sc', 'sd', 'se', 'sg', 'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'su', 'sv', 'sx', 'sy', 'sz',
'tc', 'td', 'tel', 'tf', 'tg', 'th', 'tj', 'tk', 'tl', 'tm', 'tn', 'to', 'tp', 'travel', 'tr', 'tt', 'tv', 'tw', 'tz',
'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz',
'va', 'vc', 've', 'vg', 'vi', 'vn', 'vu',
'wf', 'ws',
'xxx',
'ye', 'yt',
'za', 'zm', 'zw'
);
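// $text holds the user-submitted text to scan; a sample value is used here.
$text = 'Cheap stuff at example.com/43tg34g and more at example.org today';
// Wrap every "something.tld" (plus an optional "/path") in a link so the matches are visible.
echo preg_replace( '~(\S+\.(' . implode( '|', $tlds ) . ')(/\S*)?)~i', '<a href="\1">\1</a>', $text );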
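The TLDs array is ordered so that, wherever one TLD is a prefix of another, the longer one comes first (for example 'com' before 'co'). Regex alternation tries the options from left to right, so this gives the longer TLDs precedence.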
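Rather than relying on hand ordering, the list could be sorted by length before the regex is built. A small sketch, using a shortened list purely for illustration:

```php
<?php
// Shortened TLD list for illustration only.
$tlds = array('co', 'com', 'coop', 'in', 'info', 'int');

// Longest first; equal lengths fall back to alphabetical order.
usort($tlds, function ($a, $b) {
    $byLength = strlen($b) - strlen($a);
    return $byLength !== 0 ? $byLength : strcmp($a, $b);
});

echo implode('|', $tlds), "\n";
```

With this ordering a longer TLD is always tried before any shorter TLD it starts with.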
Replace '<a href="\1">\1</a>'
with ''
to actually remove the URLs. I've just left the link markup in so you can easily see where the URLs are recognized in the text.
You can see it in action on codepad.
Improvements are very much welcome.
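One possible improvement, since the goal is to block posts rather than rewrite them: a yes/no check. This is only a sketch. The function name containsDomain is made up, the TLD list is shortened for illustration, and restricting the part before the dot to letters, digits and hyphens is an assumption about what a hostname label looks like.

```php
<?php
// Hypothetical helper: reports whether $text appears to contain a bare
// domain name ending in one of the given TLDs.
function containsDomain(string $text, array $tlds): bool
{
    // Escape each TLD so it is matched literally inside the pattern.
    $alternatives = implode('|', array_map(function ($tld) {
        return preg_quote($tld, '~');
    }, $tlds));

    // A hostname label, a dot, a TLD, then a word boundary so that
    // "great.community" is not mistaken for "great.com".
    $pattern = '~\b[a-z0-9-]+\.(?:' . $alternatives . ')\b(?:/\S*)?~i';

    return preg_match($pattern, $text) === 1;
}

$tlds = array('com', 'co', 'ly', 'net', 'org'); // shortened list for illustration

var_dump(containsDomain('check out bit.ly/43tg34g', $tlds));
var_dump(containsDomain('No links here. Honest.', $tlds));
```

Because of the trailing word boundary, the order of the TLDs no longer matters in this variant. It will still flag harmless text such as a file name ending in .net, so it is probably best applied to new accounts only.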