PHP preg_replace doesn't act as expected with file name strings
I'm trying to create a function which removes all non-English characters (except spaces, dots and hyphens) from a string. For this I tried using preg_replace, but the function produces strange results.
I have a file called "example-נידדל.jpg"
Here is what I'm getting when trying to sanitize the file name:
echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');
The above produces: example.jpg as expected.
But when I try to pull the file name from a $_FILES array after uploading it to the server I get:
echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);
The above produces example-15041497149114911500.jpg
The numbers I'm getting are in fact the HTML numeric character references (decimal code points) of the characters which were supposed to be removed, see the following for character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1
I can't figure out why preg_replace doesn't work with file names.
Can anyone help?
Thanks,
Roy
What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?
echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
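A minimal sketch of the same idea using html_entity_decode, which avoids mbstring's 'HTML-ENTITIES' pseudo-encoding (deprecated as of PHP 8.2). The helper name sanitize_upload_name is hypothetical, and the sketch assumes the browser sent the name with numeric character references, as in the question:

```php
<?php
// Hypothetical helper (not from the original answer): decode numeric
// character references first, then strip everything outside the whitelist.
function sanitize_upload_name(string $name): string
{
    // "&#1504;" becomes the UTF-8 bytes of the character it refers to.
    $decoded = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // No /u flag needed here: every byte of a multibyte UTF-8 character
    // lies outside the ASCII whitelist, so it is removed byte by byte.
    return preg_replace('/[^A-Za-z0-9.]/', '', $decoded) ?? '';
}

// The name as it arrives in $_FILES in the question's scenario.
echo sanitize_upload_name('example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg'); // example.jpg
```

Without the decoding step, the regex strips only "&", "#" and ";" from each reference and leaves the digits behind, which is where the long number in the question comes from.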
I would use a combination of regular expressions and iconv to transliterate it.
Update: Before transliterating/filtering, the file name may need to be URL-decoded:
$path = urldecode($path); // convert triplets to bytes.
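To illustrate what that step does, here is a small sketch. The sample value is an assumption: a name that arrived with the UTF-8 bytes of one Hebrew letter percent-encoded.

```php
<?php
// Assumed input: "נ" (U+05E0) is 0xD7 0xA0 in UTF-8, sent as "%D7%A0".
$path = 'example-%D7%A0.jpg';

// urldecode turns each %XX triplet back into the raw byte.
$decoded = urldecode($path);

var_dump($decoded === "example-\u{05E0}.jpg"); // bool(true)
```

If the triplets are not decoded first, a whitelist regex would keep the hex digits and letters ("D7A0") in the result.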
Here is a code example (see the reference below) that does something very similar to what you are asking:
function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-");
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}
It expects your input to be UTF-8 encoded.
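Note that the function above also turns dots into the separator, so a file extension would not survive. A rough sketch for file names, under stated assumptions: clean_file_name is a hypothetical helper, it cleans the base name and the extension separately, and it skips the iconv transliteration step and simply drops non-ASCII letters (which is what the question asks for):

```php
<?php
// Hypothetical helper: split at the last dot, clean each part, rejoin.
function clean_file_name(string $name): string
{
    $dot  = strrpos($name, '.');
    $base = $dot === false ? $name : substr($name, 0, $dot);
    $ext  = $dot === false ? '' : substr($name, $dot + 1);

    $clean = static function (string $part): string {
        // Runs of anything but letters (any script), digits and "_" become "-".
        // preg_replace returns null on invalid UTF-8, hence the fallback.
        $part = preg_replace('~[^\pL0-9_]+~u', '-', $part) ?? '';
        // Drop whatever is still non-ASCII instead of transliterating it.
        $part = preg_replace('~[^-a-z0-9_]+~', '', strtolower($part)) ?? '';
        return trim($part, '-');
    };

    $base = $clean($base);
    $ext  = $clean($ext);

    return $ext === '' ? $base : $base . '.' . $ext;
}

echo clean_file_name("example-\u{05E0}\u{05D9}\u{05D3}\u{05D3}\u{05DC}.jpg"); // example.jpg
```

Like the function above, this expects UTF-8 input; a name consisting only of non-ASCII letters would come back with an empty base name, which the caller would have to handle.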
Reference
 