How can I use PHP to get basic information about text quality?

I have a PHP/MySQL-driven site that I have not maintained for the past 6 months. It is a site where users come and submit their articles. I have 50,000 articles, and based on some ad hoc tests I would say that about 50-60% of them are spam or text copied and pasted from other sites.

I am looking to write a PHP script that uses some basic parameters to mark or remove spam text (not copied and pasted text; for this step, only pure spam). My idea is a script that takes every article, counts characters, words, distinct words, phrase usage, and word density, and, depending on those factors, removes the article as pure spam (for example, when it has many repeated phrases). Writing this would cost me a whole day, so my question is:

Is there a solution already developed in PHP? If I need to code it myself, what parameters should I use to determine whether a text is spam?
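For illustration, the metrics described in the question (character count, word count, distinct words, word density, and repeated phrases) could be gathered roughly as follows. This is only a sketch under stated assumptions: the function name `textStats`, the array keys, and the choice of two-word phrases are all hypothetical, and the tokenizer assumes English ASCII text.

```php
<?php
// Hypothetical helper: basic statistics for one article.
// Assumes English text; the regex only matches ASCII letters, digits and apostrophes.
function textStats(string $text): array
{
    preg_match_all("/[a-z0-9']+/", strtolower($text), $m);
    $words = $m[0];
    $total = count($words);

    if ($total === 0) {
        return [
            'chars' => strlen($text), 'words' => 0, 'unique_words' => 0,
            'unique_ratio' => 0.0, 'top_word_density' => 0.0, 'max_phrase_repeats' => 0,
        ];
    }

    // How often each word occurs.
    $wordCounts = array_count_values($words);

    // Build every two-word phrase (bigram) and count repeats.
    $phrases = [];
    for ($i = 0; $i < $total - 1; $i++) {
        $phrases[] = $words[$i] . ' ' . $words[$i + 1];
    }
    $phraseCounts = $phrases ? array_count_values($phrases) : [];

    return [
        'chars'              => strlen($text),              // length in bytes
        'words'              => $total,
        'unique_words'       => count($wordCounts),
        'unique_ratio'       => count($wordCounts) / $total, // low value = repetitive text
        'top_word_density'   => max($wordCounts) / $total,   // share of the most frequent word
        'max_phrase_repeats' => $phraseCounts ? max($phraseCounts) : 0,
    ];
}
```

Which thresholds separate spam from legitimate articles is not something this sketch can decide; they would have to be tuned against a hand-labelled sample of the existing articles.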


Here's a PHP class that I've used in the past: Basic Spam Class. I am not the author, so I don't take any responsibility for potential damage done by the code. I've only used it for checking short texts (user comments on a site), so I'm not sure how it will perform on 50k long articles; you may need to make some enhancements to it. But at least you have something to start from.


Maybe you could take a look at Akismet and Bad Behaviour. You could use the first to analyze the articles you already have (as well as future ones), and Bad Behaviour to combat spam before it ever gets into your database.

They may not be ideal, but they could help you on your way.


I've observed that a lot of spam posts on sites like that have a lack of grammatical articles. They contain just a bunch of keywords and links. You could add a parameter for minimum percentage of articles. If less than 1% of the post is articles, you could reject it as spam.

For example, if you count every "the", "an", "a", and "some" in the above paragraph, you get 4 a's and 1 the: 5 articles out of about 50 words, or roughly 10%.
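A minimal sketch of that check in PHP might look like the following. The function names `articleRatio` and `looksLikeKeywordSpam` are hypothetical, the article list is limited to "a", "an", and "the", and the 1% default is simply the threshold suggested above, not a tested value.

```php
<?php
// Hypothetical helper: share of words in $text that are English articles.
function articleRatio(string $text): float
{
    // Lowercase the text and pull out word-like tokens (ASCII letters and apostrophes).
    preg_match_all("/[a-z']+/", strtolower($text), $matches);
    $words = $matches[0];
    if (count($words) === 0) {
        return 0.0;
    }

    $articles = ['a', 'an', 'the'];
    $hits = count(array_filter($words, fn($w) => in_array($w, $articles, true)));

    return $hits / count($words);
}

// Flags a post whose article share falls below $minRatio (1% by default).
// Note that an empty post has a ratio of 0 and is therefore flagged too.
function looksLikeKeywordSpam(string $text, float $minRatio = 0.01): bool
{
    return articleRatio($text) < $minRatio;
}
```

A call such as `looksLikeKeywordSpam($article)` would then mark keyword-only posts for review. Very short legitimate posts can also contain no articles, so it may be safer to combine this with a minimum length before rejecting anything.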
