PHP readdir with european characters

2022-12-12 14:36

I receive image files with Czech characters in their filenames (e.g. ěščřžýáíé), and I want to rename them without the accents so that they are more web-friendly. I thought I could use a simple str_replace call, but it doesn't seem to work the same on the file names read from disk as it does on a string literal.

I read the files with readdir, after checking for extension.

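function readFiles($dir, $ext = false) {
    if (!is_dir($dir)) {
        return false;
    }
    $dh = opendir($dir);
    if ($dh === false) {
        return false;
    }

    $f = array(); // always return an array, even when nothing matches
    while (($file = readdir($dh)) !== false) {
        if ($ext) {
            // end() takes its argument by reference, so it needs a variable
            $parts = explode('.', $file);
            if (end($parts) == $ext) {
                $f[] = $file;
            }
        } else {
            $f[] = $file;
        }
    }

    closedir($dh);
    return $f;
}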

$files = readFiles(".", "jpg");

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$string = "čšěáýísdjksnalci sášěééalskcnkkjy+ěéší";
$safe_string = str_replace($search, $replace, $string);

echo '<pre>';

foreach($files as $fl) {
    $safe_files[] = str_replace($search, $replace, $fl);
}

var_dump($files);
var_dump($safe_files);

var_dump($string);
var_dump($safe_string);

echo '</pre>';

Output

array(6) {
  [0]=>
  string(21) "Hl�vka s listem01.jpg"
  [1]=>
  string(23) "Hl�vky v atelieru02.jpg"
  [2]=>
  string(17) "Jarn� v�hon03.jpg"
  [3]=>
  string(17) "Mlad� chmel04.jpg"
  [4]=>
  string(23) "Stavba chmelnice 05.jpg"
  [5]=>
  string(21) "Zimni chmelnice06.jpg"
}
array(6) {
  [0]=>
  string(21) "Hl�vka-s-listem01.jpg"
  [1]=>
  string(23) "Hl�vky-v-atelieru02.jpg"
  [2]=>
  string(17) "Jarn�-v�hon03.jpg"
  [3]=>
  string(17) "Mlad�-chmel04.jpg"
  [4]=>
  string(23) "Stavba-chmelnice-05.jpg"
  [5]=>
  string(21) "Zimni-chmelnice06.jpg"
}
string(53) "čšěáýísdjksnalci sášěééalskcnkkjy+ěéší"
string(38) "cseayisdjksnalci-saseeealskcnkkjy+eesi"

Right now I'm running on WAMP but answers that work across platforms are even better :)

Judging by the 0xFFFD marks (which appear in Firefox as diamonds with a question mark inside), the file names are already not being read with the correct encoding (which would be Unicode / UTF-8). As far as I can tell, this bug seems to be related.
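One quick way to confirm that diagnosis is to check whether the bytes returned by readdir are valid UTF-8 at all. A minimal sketch, assuming the PCRE functions are available (they are bundled with PHP by default); the helper name isValidUtf8 is made up for illustration:

```php
<?php
// Hypothetical helper: returns true when $bytes is well-formed UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// "Hlávka" as UTF-8 (á = C3 A1) versus a single-byte encoding (á = E1).
var_dump(isValidUtf8("Hl\xC3\xA1vka")); // bool(true)
var_dump(isValidUtf8("Hl\xE1vka"));     // bool(false)
```

If the names from readdir fail this check, they are probably in the system's single-byte code page and need converting before any UTF-8 search strings can match.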

Here's another SO topic about that: php readdir problem with japanese language file name

To the point, wait until they get PHP6 stable and then use it.

Unrelated to the problem: the Normalizer is a better tool to get rid of diacritical marks.
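A rough sketch of the Normalizer approach, assuming the intl extension is installed and the input is already valid UTF-8. The function name stripDiacritics is hypothetical, not part of any library:

```php
<?php
// Sketch only: needs ext/intl for the Normalizer class.
// stripDiacritics() is a hypothetical helper name.
function stripDiacritics(string $utf8): string
{
    // Decompose "č" into "c" + combining caron (U+030C), and so on.
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $utf8; // input was not valid UTF-8; leave it untouched
    }
    // Drop the combining marks (Unicode category Mn), keeping the base letters.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    echo stripDiacritics("Jarní výhon03.jpg"), "\n"; // Jarni vyhon03.jpg
}
```

Letters without a canonical decomposition (for example đ or ł) pass through unchanged, so a small lookup table may still be needed for those.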

If it works with strings but not with arrays, just apply it to strings :-)

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$len = count($files);

for ($i = 0; $i < $len; $i++) {
    $safe_files[$i] = str_replace($search, $replace, $files[$i]);
}

I think str_replace accepts arrays only for the first two parameters and not the last. I may be wrong, but this should work anyway.
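On that last point: the PHP manual does document the subject parameter of str_replace as accepting an array too, in which case the replacement runs on every element and an array comes back. A small sketch with two of the file names from the question, assuming both this source file and the names are UTF-8:

```php
<?php
$search  = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$files = array('Jarní výhon03.jpg', 'Mladý chmel04.jpg');

// Passing the whole array as the subject returns a new array.
$safe_files = str_replace($search, $replace, $files);

print_r($safe_files);
```

So the loop is optional. If the output still contains accented bytes, the cause is the encoding mismatch rather than the array handling.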

If you do have a real encoding problem, it could simply be that your OS uses a single-byte encoding while your source file uses another, probably UTF-8.

In that case, do something like:

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$code_encoding = "UTF-8"; // this is my guess, but put whatever is yours
$os_encoding = "CP1250"; // this is my guess, but put whatever is yours

$len = count($files);

for ($i = 0; $i < $len; $i++)
{
    $safe_files[$i] = iconv($os_encoding, $code_encoding, $files[$i]); // convert before replace
    /*
     Alternatively:
     $safe_files[$i] = mb_convert_encoding($files[$i], $code_encoding, $os_encoding);
    */
    $safe_files[$i] = str_replace($search, $replace, $safe_files[$i]);
}

mb_convert_encoding() requires the ext/mbstring extension and iconv() requires ext/iconv.

Not directly an answer to your question maybe, but you might want to take a look at the iconv() function in PHP, and in particular the //TRANSLIT option that you can append to the second argument. I've used it several times to turn French and Eastern European strings into their a-z, URL-friendly counterparts.

From PHP.net (http://www.php.net/manual/en/function.iconv.php)

If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters.

Your source code (and the test string) appear to be in UTF-8, while the file names seem to use a single-byte encoding. I'd suggest you use the same encoding for your replacement strings. To avoid source-encoding issues, it would be better to write the accented characters in your code in hex form (like \xE8 for "č" etc).
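As a sketch of that idea, here is the search array from the question written as byte escapes. It assumes the file names come back in Windows-1250 (the usual code page on a Czech Windows install); the byte values are taken from that code page, and the sample file name is one from the question's output:

```php
<?php
// Windows-1250 byte values written as escapes, so the encoding of this
// source file no longer matters.
$search = array(
    "\x9A", // š
    "\xE1", // á
    "\x9E", // ž
    "\xED", // í
    "\xEC", // ě
    "\xE9", // é
    "\xF8", // ř
    "\xF2", // ň
    "\xFD", // ý
    "\xE8", // č
    ' ',
);
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

// "Jarní výhon03.jpg" as Windows-1250 bytes, as readdir might return it.
$file = "Jarn\xED v\xFDhon03.jpg";
$safe = str_replace($search, $replace, $file);

echo $safe, "\n"; // Jarni-vyhon03.jpg
```

Because the original bytes in $file are left untouched, the same value can be passed to rename() as the source name, with $safe as the target.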

So I got it working on my Windows XP system with this:

$search = array('š','á','ž','í','e','é','r','n','ý','c',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$files = readFiles(".", "jpg");
$len = count($files);

for($i = 0; $i < $len; $i++){
  if(mb_check_encoding($files[$i], 'ASCII')){
    $safe_files[$i] = $files[$i];
  }else{
    $safe_files[$i] = str_replace(
        $search, $replace, iconv("iso-8859-1", "utf-8//TRANSLIT", $files[$i]));
  }
  if($files[$i] != $safe_files[$i]){
    rename($files[$i], $safe_files[$i]);
  }
}

I don't know if it's a coincidence or not, but calling mb_get_info() shows

[internal_encoding] => ISO-8859-1

Here is another function I found helpful on the PHP strtr page

<?php
// Windows-1250 to ASCII
// This function replaces all Windows-1250 accented characters with
// their non-accented equivalents. Useful for Czech and Slovak.

function win2ascii($str)
{
    // lowercase: á č ď ě é í ň
    $str = strtr($str,
        "\xE1\xE8\xEF\xEC\xE9\xED\xF2",
        "\x61\x63\x64\x65\x65\x69\x6E");

    // lowercase: ó ř š ť ů ú ý ž ô, plus Ľ and ľ
    $str = strtr($str,
        "\xF3\xF8\x9A\x9D\xF9\xFA\xFD\x9E\xF4\xBC\xBE",
        "\x6F\x72\x73\x74\x75\x75\x79\x7A\x6F\x4C\x6C");

    // uppercase: Á Č Ď Ě É Í Â Ó Ř (0xC2 is Â, so it maps to A)
    $str = strtr($str,
        "\xC1\xC8\xCF\xCC\xC9\xCD\xC2\xD3\xD8",
        "\x41\x43\x44\x45\x45\x49\x41\x4F\x52");

    // uppercase: Š Ť Ú Ý Ž Ň Ů (ď and Ď are repeated from above)
    $str = strtr($str,
        "\x8A\x8D\xDA\xDD\x8E\xD2\xD9\xEF\xCF",
        "\x53\x54\x55\x59\x5A\x4E\x55\x64\x44");

    return $str;
}
?>

Basically, converting the European characters to an ASCII equivalent wasn't much of a problem, but I could find no reliable way to rename the files (i.e. to reference files whose names contain non-ASCII characters).

To get UTF-8, use the PHP function utf8_encode, which converts ISO-8859-1 to UTF-8. Microsoft Windows uses ISO-8859-1 (or a single-byte code page close to it), so in this case a conversion is necessary.

Example - listing the files in a dir:

<?php
$dir_handle = opendir(".");
while (false !== ($file = readdir($dir_handle)))
{
  echo utf8_encode($file)."<br>";
}
?>

Area5one has it right - it's a problem of different encoding.

When I upgraded my machine from XP to Win7, I also upgraded my version of MySQL and PHP. Somewhere along the way, PHP programs that used to work stopped working. In particular, scandir, readdir and utf-8 had lived happily together, but no longer.

So, I've modified my code. Variables holding data taken from the hard disk end in "_iso" to reflect Windows' ISO-8859-1 encoding; data from the MySQL database goes in variables ending in "_utf". Thus, the code from area5one would look like this:

$dir_handle_iso = opendir(".");
while (false !== ($file_iso = readdir($dir_handle_iso)))
{
  $file_utf = utf8_encode($file_iso);
  ...
}

This works for me 100%:

setlocale(LC_ALL,"cs_CZ");
$new_str = iconv("UTF-8","ASCII//TRANSLIT",$orig_str);

$file = mb_convert_encoding($file, 'UTF-8', "iso-8859-1");

Worked for me (Windows, Danish characters).
