How can I ensure a URL points to safe, non-adult, non-spam content when allowing people to post content to my website?

2023-02-07 22:00 Q&A author:

I am working on a PHP site that allows users to post a listing for their business related to the site's theme. Each listing includes a single link URL, some text, and an optional URL for an image file.

Example:

<img src="http://www.somesite.com" width="40" />
<a href="http://www.abcbusiness.com" target="new">ABC Business</a>
<p>
Some text about how great abc business is...
</p>

The HTML in the text is filtered using the class from htmlpurifier.org and the content is checked for bad words, so I feel pretty good about that part.

The image file URL is always placed inside an <img src="" /> tag with a fixed width and validated to be an actual HTTP URL, so that should be OK.

The dangerous part is the link.

Question: How can I be sure (using code) that the link does not point to a spam, unsafe, or porn site?

I can check headers for 404, etc., but is there a quick and easy way to validate a site's content from a link?

EDIT:

I am using a CAPTCHA and do require registration before posting is allowed.

It's going to be very hard to determine this yourself by scraping the sites in question. You'll probably want to rely on a third-party API that can check for you.

http://code.google.com/apis/safebrowsing/

Check out that API: you send it a URL and it tells you what it thinks. It mainly checks for malware and phishing, not so much porn and spam. There are other services that do the same thing; just search around on Google.
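As a rough sketch of what such a lookup might look like in PHP, the snippet below targets the Safe Browsing v4 Lookup API endpoint (`threatMatches:find`), which is an assumption on my part since the link above predates it. The function names, the client ID, and the API key are placeholders, so check the current Safe Browsing documentation before relying on the exact request shape.

```php
<?php
// Sketch only: buildLookupBody, isFlagged and checkUrl are hypothetical names.

/** Build the JSON body for a v4 threatMatches:find request. */
function buildLookupBody(string $url): string
{
    return json_encode([
        // Placeholder client identification; use your own values.
        'client' => ['clientId' => 'my-listing-site', 'clientVersion' => '1.0'],
        'threatInfo' => [
            'threatTypes'      => ['MALWARE', 'SOCIAL_ENGINEERING'],
            'platformTypes'    => ['ANY_PLATFORM'],
            'threatEntryTypes' => ['URL'],
            'threatEntries'    => [['url' => $url]],
        ],
    ]);
}

/** An empty JSON object means no match; a non-empty "matches" list means the URL is listed. */
function isFlagged(string $responseJson): bool
{
    $data = json_decode($responseJson, true);
    return is_array($data) && !empty($data['matches']);
}

/** Returns true if flagged, false if not listed, null if the lookup itself failed. */
function checkUrl(string $url, string $apiKey): ?bool
{
    $endpoint = 'https://safebrowsing.googleapis.com/v4/threatMatches:find?key=' . urlencode($apiKey);
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildLookupBody($url),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $status !== 200) {
        return null; // treat as "unknown" and hold the listing for review
    }
    return isFlagged($body);
}
```

A `null` result is kept separate from `false` so that a failed lookup is not mistaken for a clean URL. As noted above, this covers malware and phishing, not porn or spam.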

"is there a quick and easy way to validate a site's content from a link?"

No. There is no global white/blacklist of URLs which you can use to somehow filter out "bad" sites, especially since your definition of a "bad" site is so unspecific.

Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.

If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect, and you're going to wind up manually moderating anyway.

Manual moderation, perhaps. I can't think of any way to automate this other than using some sort of blacklist, but even then that is not always reliable as newer sites might not be on the list.

Additionally, you could try using cURL to download the index page and look for certain keywords that would raise a red flag, and then hold those listings for manual validation.

I would suggest keeping a list of these keywords in an array (porn, sex, etc.). If the index page that you downloaded with cURL contains any of those keywords, reject the link or flag it for moderation.

This is neither reliable nor the most efficient way of approving links.

Ultimately, you should have manual moderation regardless, but if you wish to automate it, this is a possible route for you to take.
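The keyword check described above could be sketched roughly as follows. The function names and the keyword list are illustrative only, and whole-word matching is my own choice to cut down on false positives such as "Essex".

```php
<?php
// Sketch only: findFlaggedKeywords, fetchPage and screenLink are hypothetical names.

/** Return the keywords that appear as whole words in the page text (case-insensitive). */
function findFlaggedKeywords(string $html, array $keywords): array
{
    $text  = strip_tags($html); // drop markup, keep the visible text
    $found = [];
    foreach ($keywords as $keyword) {
        if (preg_match('/\b' . preg_quote($keyword, '/') . '\b/i', $text)) {
            $found[] = $keyword;
        }
    }
    return $found;
}

/** Download a page with cURL; returns null on failure. */
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final page
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_PROTOCOLS      => CURLPROTO_HTTP | CURLPROTO_HTTPS, // only fetch over HTTP(S)
    ]);
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

/** Returns 'pass' or 'hold'; anything suspicious or unreachable goes to manual review. */
function screenLink(string $url, array $keywords): string
{
    $html = fetchPage($url);
    if ($html === null) {
        return 'hold';
    }
    return findFlaggedKeywords($html, $keywords) === [] ? 'pass' : 'hold';
}
```

Calling `screenLink('http://www.abcbusiness.com', ['porn', 'sex'])` would then return `'hold'` whenever a keyword appears or the page cannot be fetched. A page can easily evade this by using images or different wording, which is why the result only feeds a moderation decision.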

You can create a small moderation system that sends user-submitted content to an approval queue that only administrators can access. Content is displayed on the site only after an administrator approves it.
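A minimal in-memory sketch of such an approval queue is shown here. The class and method names are hypothetical, and a real site would keep the status in its database rather than in an array.

```php
<?php
// Sketch only: ApprovalQueue and its methods are hypothetical; storage is in memory.

final class ApprovalQueue
{
    /** @var array<int, array{url: string, status: string}> */
    private array $listings = [];
    private int $nextId = 1;

    /** New submissions always start as pending and are not shown publicly. */
    public function submit(string $url): int
    {
        $id = $this->nextId++;
        $this->listings[$id] = ['url' => $url, 'status' => 'pending'];
        return $id;
    }

    /** Only administrators may approve or reject a listing. */
    public function review(int $id, bool $approve, bool $isAdmin): void
    {
        if (!$isAdmin) {
            throw new RuntimeException('Only administrators can review listings.');
        }
        if (!isset($this->listings[$id])) {
            throw new InvalidArgumentException("Unknown listing $id.");
        }
        $this->listings[$id]['status'] = $approve ? 'approved' : 'rejected';
    }

    /** URLs that may be rendered on the public site. */
    public function visibleUrls(): array
    {
        $approved = array_filter(
            $this->listings,
            fn(array $listing): bool => $listing['status'] === 'approved'
        );
        return array_column($approved, 'url');
    }
}
```

The public page would render only what `visibleUrls()` returns, so a pending or rejected link never reaches visitors.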
