
PHP preg_replace doesn't act as expected with file name strings

I'm trying to create a function which removes all non-English characters (except spaces, dots and hyphens) from a string. For this I tried using preg_replace, but the function produces strange results.

I have a file called "example-נידדל.jpg"

Here is what I'm getting when trying to sanitize the file name:

echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');

The above produces: example.jpg as expected.

But when I try to pull the file name from a $_FILES array after uploading it to the server I get:

echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);

The above produces example-15041497149114911500.jpg

The numbers I'm getting are in fact the HTML numeric character references of the characters which were supposed to be removed; see the following for a character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1

I can't figure out why preg_replace doesn't work with file names.

Can anyone help?



What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?

echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
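To see why the digits survive: the name apparently reaches the server with the Hebrew letters written as numeric character references such as &#1504;, and the pattern strips only the &, # and ; around each number. Below is a minimal sketch of decoding first with html_entity_decode, as an alternative to the mbstring 'HTML-ENTITIES' conversion (which, as far as I know, is deprecated as of PHP 8.2). The hard-coded $uploaded value is an assumed stand-in for $_FILES['file_upload']["name"], and the second pattern is widened to keep hyphens and spaces, as the question asks.

```php
<?php
// Assumed stand-in for $_FILES['file_upload']["name"] as it arrived on the server.
$uploaded = 'example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg';

// Without decoding, '&', '#' and ';' are stripped but the digits stay
// (the hyphen goes too, since this class does not list it).
$naive = preg_replace('/[^A-Za-z0-9.]/', '', $uploaded);

// Decode the numeric references into real UTF-8 characters first.
$decoded = html_entity_decode($uploaded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Then drop every byte that is not an ASCII letter, digit, dot, space or hyphen.
$clean = preg_replace('/[^A-Za-z0-9. -]/', '', $decoded);

echo $naive, "\n"; // example15041497149114911500.jpg
echo $clean, "\n"; // example-.jpg
```

Note the trailing hyphen left before the extension; trimming it is a separate step if you care about it.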

I would use a combination of regular expressions and iconv to transliterate it.

Update: Prior to transliterating/filtering, the file name may need to be URL-decoded:

$path = urldecode($path); // convert triplets to bytes.
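As a small illustration of that step, assuming the name arrived percent-encoded (the sample string below is made up): each %XX triplet becomes one byte, so the UTF-8 sequence %D7%A0 turns back into the letter נ. One thing to be aware of is that urldecode also turns "+" into a space, while rawurldecode leaves it alone, which may matter for file names.

```php
<?php
// Made-up percent-encoded name; %D7%A0 is the UTF-8 encoding of U+05E0 (נ).
$path = 'example-%D7%A0.jpg';

$decoded = urldecode($path); // convert triplets to bytes

echo $decoded, "\n";                                // example-נ.jpg
echo strlen($path), ' -> ', strlen($decoded), "\n"; // 18 -> 14

// urldecode treats '+' as an encoded space; rawurldecode does not.
$plusA = urldecode('my+file.jpg');    // "my file.jpg"
$plusB = rawurldecode('my+file.jpg'); // "my+file.jpg"
```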

Here is a code example from here that does something very similar to your question:

function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-"); // strip separators left at either end
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}

It expects your input to be UTF-8 encoded.
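The UTF-8 requirement matters because a pattern with the u modifier makes preg_replace return null when the subject is not valid UTF-8. A minimal sketch of guarding against that, assuming the mbstring extension is available; the helper name ensure_utf8 and the ISO-8859-1 fallback are assumptions for illustration, so substitute whatever encoding your uploads really use.

```php
<?php
// Hypothetical helper: return $s unchanged if it is valid UTF-8,
// otherwise convert it from an assumed fallback encoding.
function ensure_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9.jpg"; // "café.jpg" encoded as ISO-8859-1, not valid UTF-8

// With /u, invalid UTF-8 makes preg_replace fail and return null.
$failed = preg_replace('~[^\pL0-9_]+~u', '-', $latin1);
var_dump($failed); // NULL

// After normalising the encoding, the same pattern works.
$ok = preg_replace('~[^\pL0-9_]+~u', '-', ensure_utf8($latin1));
echo $ok, "\n"; // café-jpg
```

The dot becomes a separator here as well, so for file names you would probably want to split off the extension before cleaning and re-attach it afterwards.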







