How does Google recognize adult content with SafeSearch?

2023-02-02 15:41

I am creating a search engine (for study purposes) and I want to know how Google recognizes adult content and images with SafeSearch (http://en.wikipedia.org/wiki/Safesearch).

The programming language doesn't matter; I only want to know the general approach, independent of any particular language.

If the rules for any sort of content filter fell into the hands of people trying to get that content through the filter, the filter would become ineffective.

So I imagine that Google's rules (1) are not publicly available and (2) change frequently.

That said, starting with a small blacklist of adult sites and following outgoing links (and/or finding sites that link to the blacklisted sites) probably finds a huge number of adult sites, though by no means all of them. You'd want some sort of text processing and image recognition algorithms in addition.
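The link-following idea can be sketched as a breadth-first walk from a seed blacklist. This is only an illustration, not Google's method: the `expand_blacklist` function, the `max_hops` limit, and the toy `LINKS` graph with its `.example` domains are all made up for the example.

```python
from collections import deque

def expand_blacklist(seeds, outlinks, max_hops=2):
    """Breadth-first walk over outgoing links, starting from known adult sites.

    Returns a dict mapping each reached site to its hop distance from the
    nearest seed. Reached sites are candidates for review, not confirmed hits.
    """
    distance = {site: 0 for site in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        if distance[site] >= max_hops:
            continue  # do not follow links beyond the hop limit
        for target in outlinks.get(site, ()):
            if target not in distance:
                distance[target] = distance[site] + 1
                queue.append(target)
    return distance

# Hypothetical toy link graph: site -> sites it links to.
LINKS = {
    "adult-a.example": ["adult-b.example", "ads.example"],
    "adult-b.example": ["adult-c.example"],
    "adult-c.example": ["far.example"],
    "news.example": ["ads.example"],
}

candidates = expand_blacklist(["adult-a.example"], LINKS, max_hops=2)
```

Note that `ads.example` ends up among the candidates even though a news site links to it too. Link proximity alone produces false positives like this, which is why the text and image checks are still needed.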

NOTE: A popular theory is that adult content providers pay people to ask questions on stackoverflow.com so that Jon Skeet and Marc Gravell will have less time to update the SafeSearch filters. However, it is easily shown that Jon and Marc answer questions at such a high rate that any such strategy would not be economically viable.

Ben's answer is correct about all points, but I would like to add my considerations.

About image recognition: given a large set of training images, you will find it fairly easy to identify objects such as naked breasts and penises in them using pattern recognition.

All artificial intelligence algorithms, however, have weak points. Depending on the quality of the classifier used, a certain percentage of your images will be misclassified.

So you have to apply criteria beyond image processing. Google's criteria are surely not public, but you could consider ICRA tags (used to voluntarily mark material as adult), text processing, and cross-domain links. If I were the creator of SafeSearch, I would have exploited the following pattern: adult sites often exchange links, so you'll find lots of intersections in the link graphs between a group of adult sites.

Putting it all together, a good classification approach uses several smaller criteria, scoring them to determine whether an image is an adult image or not.
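As a rough sketch of that scoring idea, each criterion could produce a value between 0 and 1, and a weighted average could be compared against a threshold. The signal names, the weights, and the 0.5 threshold below are assumptions chosen for illustration; a real system would have to tune them on labeled data.

```python
def adult_score(signals, weights):
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * value for name, value in signals.items())
    return weighted_sum / total_weight

# Hypothetical weights: how much each criterion is trusted.
WEIGHTS = {"image": 0.4, "text": 0.3, "links": 0.2, "icra_tag": 0.1}
THRESHOLD = 0.5  # assumed cut-off, not a known value

# Hypothetical outputs of the individual checks for one page.
page_signals = {"image": 0.9, "text": 0.7, "links": 0.6, "icra_tag": 0.0}

score = adult_score(page_signals, WEIGHTS)
is_adult = score >= THRESHOLD
```

Dividing by the total weight of the signals actually present means a page with a missing signal (say, no images at all) is still scored on the same 0 to 1 scale.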

Possibly in a similar way to how spam is filtered.

First step is to create a training set, based on known adult sites, and extract features from them. These could be keywords, colors used in images, domain name structure, whois details, whatever. Anything that could in some way be specifically different for adult content as compared to non-adult content.

Next step is to apply some sort of statistical model to that. Bayesian models seem to work well for spam, but may not for adult stuff.
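To make those two steps concrete, here is a minimal naive Bayes sketch over keyword features, in the same spirit as a spam filter. Everything in it is a toy assumption: the `train` and `classify` helpers, the labels, and the four-document training set. Whether this model works well for adult content is exactly the open question above.

```python
import math
from collections import Counter

def train(documents):
    """documents: iterable of (label, list_of_tokens). Returns the raw counts."""
    word_counts = {}
    doc_counts = Counter()
    vocabulary = set()
    for label, tokens in documents:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log-posterior for the given tokens."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in doc_counts:
        logp = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            logp += math.log(count / (total_words + len(vocabulary)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical training set: (label, keywords extracted from a page).
TRAINING = [
    ("adult", ["explicit", "xxx", "video"]),
    ("adult", ["xxx", "nude", "gallery"]),
    ("safe", ["weather", "forecast", "video"]),
    ("safe", ["recipe", "gallery", "garden"]),
]

model = train(TRAINING)
label_a = classify(["xxx", "gallery"], model)
label_b = classify(["weather", "recipe"], model)
```

Other features from the training set (image colors, domain name structure, whois details) would need to be turned into discrete tokens or numeric values before a model like this could use them.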

Support vector machines seem like a good fit, but they are a lot more complex and I'm not really familiar with them myself.

