MongoDB UTF-8 exception

2023-04-11 20:56 Q&A author:

I'm wondering why I'm having trouble inserting strings into the DB. With a string like hey hey %80, the '%80' still produces an exception:

Uncaught exception 'MongoException' with message 'non-utf8 string: hey hey �'

What do I need to do? :( Is %80 not a UTF-8 character? :O

The JS passes the string to the controller:

function new_pool_post(_url, _data, _starter) {
    $.ajax({
        type: 'POST',
        data: _data,
        dataType: 'json',
        url: _url,
        beforeSend: function () {
            $('.ajax-loading').show();
            $(_starter).attr('disabled', 'disabled');
        },
        error: function () {
            $('.ajax-loading').hide();
            $(_starter).removeAttr('disabled');
        },
        success: function (json) {
            $('.ajax-loading').hide();
            $(_starter).removeAttr('disabled');
            if (json) {
                $('.pool-append').prepend(json.pool_post);
            }
        }
    });
}

The controller receives the data:

$id_project = $this->input->post('id_project', true);
$id_user = $this->session->userdata('user_id');
$pool_post = $this->input->post('pool_post', true);

The controller sanitizes the data:

public function xss_clean($str, $is_image = FALSE)
    {
        /*
         * Is the string an array?
         *
         */
        if (is_array($str))
        {
            while (list($key) = each($str))
            {
                $str[$key] = $this->xss_clean($str[$key]);
            }

            return $str;
        }
        /* Remove non-UTF-8 chars */

        $str = htmlspecialchars(urlencode(preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $str)));

        /*
         * Remove Invisible Characters
         */
        $str = remove_invisible_characters($str);

        // Validate Entities in URLs
        $str = $this->_validate_entities($str);

        /*
         * URL Decode
         *
         * Just in case stuff like this is submitted:
         *
         * <a href="http://%77%77%77%2E%67%6F%6F%67%6C%65%2E%63%6F%6D">Google</a>
         *
         * Note: Use rawurldecode() so it does not remove plus signs
         *
         */
        $str = rawurldecode($str);

        /*
         * Convert character entities to ASCII
         *
         * This permits our tests below to work reliably.
         * We only convert entities that are within tags since
         * these are the ones that will pose security problems.
         *
         */

        $str = preg_replace_callback("/[a-z]+=([\'\"]).*?\\1/si", array($this, '_convert_attribute'), $str);

        $str = preg_replace_callback("/<\w+.*?(?=>|<|$)/si", array($this, '_decode_entity'), $str);

        /*
         * Remove Invisible Characters Again!
         */
        $str = remove_invisible_characters($str);

        /*
         * Convert all tabs to spaces
         *
         * This prevents strings like this: ja  vascript
         * NOTE: we deal with spaces between characters later.
         * NOTE: preg_replace was found to be amazingly slow here on 
         * large blocks of data, so we use str_replace.
         */

        if (strpos($str, "\t") !== FALSE)
        {
            $str = str_replace("\t", ' ', $str);
        }

        /*
         * Capture converted string for later comparison
         */
        $converted_string = $str;

        // Remove Strings that are never allowed
        $str = $this->_do_never_allowed($str);

        /*
         * Makes PHP tags safe
         *
         * Note: XML tags are inadvertently replaced too:
         *
         * <?xml
         *
         * But it doesn't seem to pose a problem.
         */
        if ($is_image === TRUE)
        {
            // Images have a tendency to have the PHP short opening and 
            // closing tags every so often so we skip those and only 
            // do the long opening tags.
            $str = preg_replace('/<\?(php)/i', "&lt;?\\1", $str);
        }
        else
        {
            $str = str_replace(array('<?', '?'.'>'),  array('&lt;?', '?&gt;'), $str);
        }

        /*
         * Compact any exploded words
         *
         * This corrects words like:  j a v a s c r i p t
         * These words are compacted back to their correct state.
         */
        $words = array(
                'javascript', 'expression', 'vbscript', 'script', 
                'applet', 'alert', 'document', 'write', 'cookie', 'window'
            );

        foreach ($words as $word)
        {
            $temp = '';

            for ($i = 0, $wordlen = strlen($word); $i < $wordlen; $i++)
            {
                $temp .= substr($word, $i, 1)."\s*";
            }

            // We only want to do this when it is followed by a non-word character
            // That way valid stuff like "dealer to" does not become "dealerto"
            $str = preg_replace_callback('#('.substr($temp, 0, -3).')(\W)#is', array($this, '_compact_exploded_words'), $str);
        }

        /*
         * Remove disallowed Javascript in links or img tags
         * We used to do some version comparisons and use of stripos for PHP5, 
         * but it is dog slow compared to these simplified non-capturing 
         * preg_match(), especially if the pattern exists in the string
         */
        do
        {
            $original = $str;

            if (preg_match("/<a/i", $str))
            {
                $str = preg_replace_callback("#<a\s+([^>]*?)(>|$)#si", array($this, '_js_link_removal'), $str);
            }

            if (preg_match("/<img/i", $str))
            {
                $str = preg_replace_callback("#<img\s+([^>]*?)(\s?/?>|$)#si", array($this, '_js_img_removal'), $str);
            }

            if (preg_match("/script/i", $str) OR preg_match("/xss/i", $str))
            {
                $str = preg_replace("#<(/*)(script|xss)(.*?)\>#si", '[removed]', $str);
            }
        }
        while($original != $str);

        unset($original);

        // Remove evil attributes such as style, onclick and xmlns
        $str = $this->_remove_evil_attributes($str, $is_image);

        /*
         * Sanitize naughty HTML elements
         *
         * If a tag containing any of the words in the list
         * below is found, the tag gets converted to entities.
         *
         * So this: <blink>
         * Becomes: &lt;blink&gt;
         */
        $naughty = 'alert|applet|audio|basefont|base|behavior|bgsound|blink|body|embed|expression|form|frameset|frame|head|html|ilayer|iframe|input|isindex|layer|link|meta|object|plaintext|style|script|textarea|title|video|xml|xss';
        $str = preg_replace_callback('#<(/*\s*)('.$naughty.')([^><]*)([><]*)#is', array($this, '_sanitize_naughty_html'), $str);

        /*
         * Sanitize naughty scripting elements
         *
         * Similar to above, only instead of looking for
         * tags it looks for PHP and JavaScript commands
         * that are disallowed.  Rather than removing the
         * code, it simply converts the parenthesis to entities
         * rendering the code un-executable.
         *
         * For example: eval('some code')
         * Becomes:     eval&#40;'some code'&#41;
         */
        $str = preg_replace('#(alert|cmd|passthru|eval|exec|expression|system|fopen|fsockopen|file|file_get_contents|readfile|unlink)(\s*)\((.*?)\)#si', "\\1\\2&#40;\\3&#41;", $str);


        // Final clean up
        // This adds a bit of extra precaution in case
        // something got through the above filters
        $str = $this->_do_never_allowed($str);

        /*
         * Images are Handled in a Special Way
         * - Essentially, we want to know that after all of the character 
         * conversion is done whether any unwanted, likely XSS, code was found.  
         * If not, we return TRUE, as the image is clean.
         * However, if the string post-conversion does not match the 
         * string post-removal of XSS, then it fails, as there was unwanted XSS 
         * code found and removed/changed during processing.
         */

        if ($is_image === TRUE)
        {
            return ($str == $converted_string) ? TRUE: FALSE;
        }

        log_message('debug', "XSS Filtering completed");
        return $str;
    }

The controller passes the sanitized data to the model, and the model inserts it into MongoDB; nothing more ... :)

I had a related problem.

For UTF-8 strings, ucfirst has to be replaced with a multibyte-aware version, e.g. mb_ucfirst('helo','UTF-8');

And I think in your situation the problem is substr; you need to use mb_substr instead.
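As a rough illustration of why byte-based slicing can break UTF-8, here is a small Node.js sketch (JavaScript rather than PHP, so it is only an analogy for substr versus mb_substr). The sample string and the helper name isValidUtf8 are assumptions made up for this example.

```javascript
// Hypothetical helper: true when the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Sample text (an assumption): U+00E9 takes two bytes (0xC3 0xA9) in UTF-8.
const text = 'h\u00e9llo';
const bytes = Buffer.from(text, 'utf8');

// Cutting after 2 bytes splits the two-byte character in half,
// which is what a byte-oriented substring function can do.
const byteCut = bytes.subarray(0, 2);

// Cutting after 2 characters keeps the character whole,
// which is what a character-aware substring function does.
const charCut = Buffer.from(Array.from(text).slice(0, 2).join(''), 'utf8');

console.log(isValidUtf8(byteCut)); // false
console.log(isValidUtf8(charCut)); // true
```

Whether this is what actually happens in the question depends on which strings reach substr; in the posted xss_clean it is only applied to a fixed list of ASCII words.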

Otherwise:

So maybe convert to ISO-8859-1 with iconv at the start, and convert back to UTF-8 with iconv when writing to the DB.
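To make the conversion idea a bit more concrete, here is a minimal Node.js sketch (JavaScript, not the PHP from the question, so treat it as an analogy). It assumes the offending byte is 0x80, which is what a URL-decoded %80 becomes, and that the input really is ISO-8859-1; isValidUtf8 and latin1ToUtf8 are hypothetical helper names, not library functions.

```javascript
// Hypothetical helper: true when the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: read each byte as an ISO-8859-1 code point,
// then re-encode the resulting string as UTF-8.
function latin1ToUtf8(bytes) {
  return Buffer.from(bytes.toString('latin1'), 'utf8');
}

// "hey " followed by the single byte 0x80. In UTF-8, 0x80 is a
// continuation byte and cannot stand on its own.
const raw = Buffer.from([0x68, 0x65, 0x79, 0x20, 0x80]);

console.log(isValidUtf8(raw));               // false
console.log(isValidUtf8(latin1ToUtf8(raw))); // true
```

On the PHP side, iconv('ISO-8859-1', 'UTF-8', $str) would presumably play the role of latin1ToUtf8, but only if the incoming bytes really are ISO-8859-1.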

To prevent the problem you can use

header("Content-Type: text/html; charset=UTF-8");

at the top of the PHP file.
I found the solution in this Stack Overflow post, and it worked for me when migrating a MySQL DB to MongoDB with Latin special characters.
