PHP nested array search

I'm new to PHP.

I have an array like this:

$suspiciousList = array(
array ("word" => "badword1", "score" => 400, "type" => 1), 
array ("word" => "badword2", "score" => 250, "type" => 1),
array ("word" => "badword3", "score" => 400, "type" => 1), 
array ("word" => "badword4", "score" => 400, "type" => 1));

I have a problem when users enter words with spaces in them, such as "badw ord1" or "b adword2", or even with every letter separated, such as "b a d w o r d 1".

How can I detect or search for combinations from the array (dictionary)?

My idea is to split the input on spaces so that every word becomes an array element:

$this->suspiciousPart[] = $word;

I wrote the following function:

public function deepDetect2() {
    for($i=0;$i<sizeof($this->suspiciousPart);$i++) {
        $word = "";
        for($j=$i;$j<sizeof($this->suspiciousPart);$j++) {
            $word .= $this->suspiciousPart[$j];
            //var_dump($word);
            if(strpos(in_array($word, $this->suspiciousList), $word) !== false) {
                if($this->detect($word) == true) {
                    $i++;
                } else {
                    $j++;   
                }
            } else {
                $i++;
            }
        }
    }
}

Does anybody have other ideas for how to do this?

Thanks


  1. Strip spaces
  2. Search with ONE regular expression containing all your keywords, like this: (word1|word2|word3)
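The two steps above could be sketched roughly as follows. This is only an illustration: findSuspiciousWords() is a made-up name, the dictionary is a shortened copy of the one in the question, and case-insensitive matching (the i flag) is an assumption that the answer does not mention.

```php
<?php
// Hypothetical dictionary in the same shape as $suspiciousList from the question.
$suspiciousList = [
    ['word' => 'badword1', 'score' => 400, 'type' => 1],
    ['word' => 'badword2', 'score' => 250, 'type' => 1],
];

/**
 * Returns every dictionary word found in $text once all whitespace is removed.
 * Illustrative helper, not part of the original code.
 */
function findSuspiciousWords(string $text, array $list): array
{
    if ($list === []) {
        return []; // an empty alternation would match everywhere
    }

    // Step 1: strip all whitespace ("b a d w o r d 1" becomes "badword1").
    $stripped = preg_replace('/\s+/u', '', $text) ?? '';

    // Step 2: build ONE alternation from all keywords, escaping each one.
    $alternatives = [];
    foreach ($list as $entry) {
        $alternatives[] = preg_quote($entry['word'], '/');
    }
    $pattern = '/(' . implode('|', $alternatives) . ')/iu';

    preg_match_all($pattern, $stripped, $matches);
    return $matches[1];
}

$found = findSuspiciousWords('hello b a d w o r d 1 and badw ord2', $suspiciousList);
```

Here $found contains 'badword1' and 'badword2'. Matches are returned as they appear in the input, so lowercase them first if you need to look the score up in the dictionary.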


This question is a good start: How do you implement a good profanity filter? I agree with its conclusion, i.e. that automatic detection will always give poor results.

I would try these approaches:

1) Simply detect words that are vulgar according to your dictionary.

2) Come up with a few heuristics, such as "a continuous sequence of 'words' composed of one letter" (b a d w o r d), and use them to evaluate users' posts. Then you can compute the expected number of vulgar words: \sum_{i=1}^{n} P_i N_i, where n is the number of your heuristics, P_i is the probability that a word found by heuristic i really is vulgar, and N_i is the number of words found by heuristic i. I think the probabilistic approach is better than simply stating "this post does (not) contain a vulgar word".

3) Let a moderator decide whether a post is really vulgar or not. Otherwise the imperfections of your automatic replacement method will most likely make your users mad.

4) I think it's useless to look up words in an English (or Turkish?) dictionary in order to find words that are not really English words because people misspell words too much these days.
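As a rough sketch of the expected-value idea in point 2, the snippet below counts hits for two heuristics and weights them. Everything specific here is an assumption made for illustration: the heuristic names, the probabilities 0.95 and 0.6 (placeholders, not measured values), the dictionary, and the function names.

```php
<?php
// Assumed probabilities P_i that a hit from heuristic i is really vulgar.
$probabilities = ['exact' => 0.95, 'spelledOut' => 0.6];

// Hypothetical dictionary of vulgar words.
$dictionary = ['badword1', 'badword2'];

/** Counts the hits N_i of each heuristic in a post. */
function countHits(string $post, array $dictionary): array
{
    $tokens = preg_split('/\s+/', strtolower(trim($post)), -1, PREG_SPLIT_NO_EMPTY);
    $hits = ['exact' => 0, 'spelledOut' => 0];
    $run = '';

    // The trailing empty token flushes a run that ends the post.
    foreach (array_merge($tokens, ['']) as $token) {
        if (strlen($token) === 1) {
            $run .= $token; // extend a run of one-letter "words"
            continue;
        }
        if ($run !== '' && in_array($run, $dictionary, true)) {
            $hits['spelledOut']++; // heuristic: word spelled out letter by letter
        }
        $run = '';
        if (in_array($token, $dictionary, true)) {
            $hits['exact']++; // heuristic: plain dictionary match
        }
    }
    return $hits;
}

/** Expected number of vulgar words: the sum of P_i * N_i over all heuristics. */
function expectedVulgarWords(array $hits, array $probabilities): float
{
    $expected = 0.0;
    foreach ($hits as $heuristic => $count) {
        $expected += $probabilities[$heuristic] * $count;
    }
    return $expected;
}

$hits = countHits('you b a d w o r d 1 and badword2 again', $dictionary);
$expected = expectedVulgarWords($hits, $probabilities);
```

With these placeholder probabilities, the sample post gives one hit per heuristic and an expected count of about 1.55. A post could then be sent to a moderator when the value crosses a threshold you choose.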


Anyway, you can strip whitespace characters and use (mb_)substr_count(), but that leads to false positives.


As Jirka Helmich suggested, you could remove whitespace (and maybe other special characters) and then search the string for the words from your array.

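public function searchForBadWords($strippedText) {
     $counts = array();
     foreach($this->suspiciousList as $suspiciousPart) {
          // Number of times this dictionary word occurs in the stripped text
          $counts[$suspiciousPart['word']] = substr_count($strippedText, $suspiciousPart['word']);
          // You could use str_replace() here instead; it depends on what you want to achieve
     }
     return $counts;
}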

The problem is that if the text contains something like "blablabad wordblabla" and you remove the spaces, normal words can turn into bad words: "blablabadwordblabla" (know what I mean?) :D

Cheers

Edit: So Ahmad, I see that you recognize words simply by the " " at their beginning and end. Maybe you should try to implement both methods: yours, which works on single words, and the one above, which uses substring searching. It also depends on how much you care about performance. Maybe you should do some research to see how effective each one is? :D
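To compare the two methods mentioned in the edit, here is a minimal sketch. The function names and the one-word dictionary are made up for illustration, and joining runs of whole tokens is only one possible reading of the single-word approach from the question.

```php
<?php
// Hypothetical dictionary; only the 'word' key is used here.
$suspiciousList = [
    ['word' => 'badword', 'score' => 400, 'type' => 1],
];

/** Method A: strip whitespace, then search for substrings (may over-match). */
function containsAfterStripping(string $text, array $list): bool
{
    $stripped = preg_replace('/\s+/', '', strtolower($text));
    foreach ($list as $entry) {
        if (substr_count($stripped, $entry['word']) > 0) {
            return true;
        }
    }
    return false;
}

/** Method B: join consecutive whole tokens and require an exact dictionary match. */
function containsAsJoinedTokens(string $text, array $list): bool
{
    $words = array_column($list, 'word');
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        $joined = '';
        for ($j = $i; $j < $count; $j++) {
            $joined .= $tokens[$j]; // tokens i..j glued together
            if (in_array($joined, $words, true)) {
                return true;
            }
        }
    }
    return false;
}

$text = 'blablabad wordblabla';
$methodA = containsAfterStripping($text, $suspiciousList);
$methodB = containsAsJoinedTokens($text, $suspiciousList);
```

For this input Method A reports a hit and Method B does not, while both catch "b a d w o r d" and "bad word". Method B tries every run of tokens, so its cost grows quadratically with the number of tokens; stopping the inner loop once $joined is longer than the longest dictionary word would keep that in check.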


@f1ames: I'm using the following code to turn the input into an array.

    $words = mb_strtolower($words, 'UTF-8');
    $words = $this->removeUniCharCategories($words);
    $words = explode(" ", $words);
    // Remove empty strings, then reindex the array from 0.
    // Note: array_filter() without a callback also drops the token "0".
    $words = array_values(array_filter($words));

But I'm still looking for the best solution.
