How to detect the presence of a domain name in a string of text, in PHP

2023-03-28 21:14 Q&A author:

I've been having a massive spam problem on my site: people create dozens of accounts every day and spam links to their sites via URL-shortening services (of which there are hundreds). Is there a function that will check for the existence of a link (without http:// or www.), which could be at any TLD, mostly the less common ones?

Most of them are of the form domain.ext/43tg34g

I would like to check the presence of domain.ext and prohibit new users from posting these.

If they create them as hyperlinks then maybe you can search for <a>...</a>
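As a rough illustration of that idea, the check for anchor tags could look something like the sketch below. The function name containsAnchorTag is made up for this example, and it only looks for the opening <a tag, so it will not catch BBCode or bare links.

```php
<?php
// Hypothetical helper: true when the text contains an opening HTML anchor tag.
function containsAnchorTag(string $text): bool
{
    // "<a" must be followed by whitespace or ">", so tags like <abbr> are not matched.
    return preg_match('~<a[\s>]~i', $text) === 1;
}

var_dump(containsAnchorTag('see <a href="http://domain.ext/43tg34g">this</a>')); // bool(true)
var_dump(containsAnchorTag('an <abbr>abbreviation</abbr>'));                     // bool(false)
```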

I think the trick is to scan the text for TLDs and look backwards for all consecutive non-whitespace characters before each one. Since a dot that ends a sentence or an ellipsis is typically followed by a space, I think you can safely assume that the dot in, for instance:

.com

does not mark the end of a sentence or of an ellipsis. (Can't think of any other common uses for dots at the moment.)

Mind you though that it's fairly easy to circumvent such preventive measures with things like:

test . com
test[dot]com
test. com

etc...
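One possible countermeasure is to normalize the most obvious obfuscations before running any domain check. The sketch below is only that, a sketch: the function name normalizeObfuscatedDots and both patterns are assumptions of mine, and the second rule will also glue together ordinary sentences whose next sentence starts with a short lowercase word, so expect some false positives.

```php
<?php
// Hypothetical helper: undo a few common dot obfuscations before scanning for domains.
function normalizeObfuscatedDots(string $text): string
{
    // "[dot]", "(dot)" or "{dot}", with optional surrounding spaces, becomes ".".
    $text = preg_replace('~\s*[\[({]\s*dot\s*[\])}]\s*~i', '.', $text);

    // Whitespace around a dot is dropped when a whole lowercase word of 2-6 letters follows,
    // so "test . com" and "test. com" both become "test.com".
    $text = preg_replace('~(\w)\s*\.\s*([a-z]{2,6}\b)~', '$1.$2', $text);

    return $text;
}

echo normalizeObfuscatedDots('test . com'), "\n";   // test.com
echo normalizeObfuscatedDots('test[dot]com'), "\n"; // test.com
echo normalizeObfuscatedDots('test. com'), "\n";    // test.com
```

A sentence such as "Done. Next line" is left alone because the word after the dot starts with a capital letter.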

~~Perhaps I'll offer you a basic regex later on. I have to go now though.~~

Alright, I've had a crack at it. Probably still a bit of a naive solution, but it's a start:

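// TLDs acquired from http://www.iana.org/domains/root/db/
// Non-Latin TLDs are left out. This is a snapshot: the IANA list has since grown
// considerably, so refresh the array from the URL above before relying on it.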
$tlds = array(
    'ac', 'ad', 'aero', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'arpa', 'ar', 'asia', 'as', 'at', 'au', 'aw', 'ax', 'az',
    'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'biz', 'bi', 'bj', 'bl', 'bm', 'bn', 'bo', 'bq', 'br', 'bs', 'bt', 'bv', 'bw', 'by', 'bz',
    'ca', 'cat', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'coop', 'com', 'co', 'cr', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz',
    'de', 'dj', 'dk', 'dm', 'do', 'dz',
    'ec', 'edu', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu',
    'fi', 'fj', 'fk', 'fm', 'fo', 'fr',
    'ga', 'gb', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gov', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy',
    'hk', 'hm', 'hn', 'hr', 'ht', 'hu',
    'id', 'ie', 'il', 'im', 'info', 'in', 'int', 'io', 'iq', 'ir', 'is', 'it',
    'je', 'jm', 'jobs', 'jo', 'jp',
    'ke', 'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz',
    'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv', 'ly',
    'ma', 'mc', 'md', 'me', 'mf', 'mg', 'mh', 'mil', 'mk', 'ml', 'mm', 'mn', 'mobi', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'museum', 'mu', 'mv', 'mw', 'mx', 'my', 'mz',
    'name', 'na', 'nc', 'ne', 'net', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz',
    'om', 'org',
    'pa', 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pro', 'pr', 'ps', 'pt', 'pw', 'py',
    'qa',
    're', 'ro', 'rs', 'ru', 'rw',
    'sa', 'sb', 'sc', 'sd', 'se', 'sg', 'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'su', 'sv', 'sx', 'sy', 'sz',
    'tc', 'td', 'tel', 'tf', 'tg', 'th', 'tj', 'tk', 'tl', 'tm', 'tn', 'to', 'tp', 'travel', 'tr', 'tt', 'tv', 'tw', 'tz',
    'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz',
    'va', 'vc', 've', 'vg', 'vi', 'vn', 'vu',
    'wf', 'ws',
    'xxx',
    'ye', 'yt',
    'za', 'zm', 'zw'
);
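// $text is assumed to hold the user-submitted text to scan.
// The lookahead (?![a-z0-9-]) keeps a TLD from matching the start of a longer word,
// so "file.config" is no longer reported as "file.co".
echo preg_replace( '~(\S+\.(' . implode( '|', $tlds ) . ')(?![a-z0-9-])(/\S*)?)~i', '<a href="\1">\1</a>', $text );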

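The TLDs array is ordered so that, among TLDs that share a prefix, the longer one comes first (for example 'aero' before 'ae', and 'coop' and 'com' before 'co'). Regex alternation tries its alternatives from left to right, so this gives the longer TLDs precedence.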
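To see why the order matters, here is a tiny demonstration with just two alternatives. The variable names are made up for the example.

```php
<?php
// Alternatives are tried left to right; the first one that lets the match succeed wins.
preg_match('~\.(co|com)~', 'example.com', $shortFirst);
preg_match('~\.(com|co)~', 'example.com', $longFirst);

echo $shortFirst[1], "\n"; // co
echo $longFirst[1], "\n";  // com
```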

Replace '<a href="\1">\1</a>' with '' to actually remove the URLs. I've left the link markup in so you can easily see where URLs are recognized in the text.

You can see it in action on codepad.

Improvements are very much welcome.
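Since the question is about detecting a domain and refusing the post rather than rewriting it, a preg_match variant may be closer to what is needed. The following is a minimal sketch under a few assumptions: containsDomain is a hypothetical name, the pattern is a stricter cousin of the one above (it only accepts letters, digits and hyphens before the dot), and the short $sampleTlds array stands in for the full TLD list.

```php
<?php
// Hypothetical helper: true when $text contains something shaped like domain.tld.
function containsDomain(string $text, array $tlds): bool
{
    // Quote each TLD so it is treated literally inside the pattern.
    $alternatives = implode('|', array_map(function ($tld) {
        return preg_quote($tld, '~');
    }, $tlds));

    // Label characters, a dot, a known TLD, and then no further label character.
    $pattern = '~[a-z0-9-]+\.(?:' . $alternatives . ')(?![a-z0-9-])~i';

    return preg_match($pattern, $text) === 1;
}

// Sample list for illustration only; pass the full TLD array in practice.
$sampleTlds = array('com', 'co', 'ly', 'gd');

var_dump(containsDomain('check out is.gd/43tg34g', $sampleTlds)); // bool(true)
var_dump(containsDomain('nothing to see here.', $sampleTlds));    // bool(false)
```

A post from a new user could then be rejected whenever containsDomain() returns true, ideally after undoing tricks such as "test[dot]com".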
