Working with files and utf8 in PHP

2023-01-17 12:48

Let's say I have a file called foo.txt encoded in UTF-8:

aoeu
qjkx
ñpyf

I want to get an array containing only those lines of the file (one line per element) that consist entirely of the letters aoeuñpyf.

I wrote the following code (the script is also saved as UTF-8):

$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
    $line=fgets($f);
    foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
        if(!in_array($letter,$allowed_letters)){
            $line="";
        }
    }
    if($line!=""){
        $lines[]=$line;
    }
}
fclose($f);

However, after that, the $lines array just has the aoeu line in it.

This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.

Also, if I print a "ñ" read from the file, a question mark appears, but printing the literal directly with print "ñ"; works.

How can I make it work?

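If you are running Windows, the file may not have been saved as UTF-8: many Windows tools default to a legacy code page such as Windows-1252. Either save the file as UTF-8 explicitly, or convert each line before performing your check. If the file is actually ISO-8859-1, utf8_encode() does that conversion (note that it is deprecated as of PHP 8.2, where mb_convert_encoding() is the usual replacement). I.e.: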

$line=utf8_encode(fgets($f));
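If you do not know in advance which encoding the file uses, one option is to convert only the lines that are not already valid UTF-8. The following is a minimal sketch, not a definitive implementation: the helper name to_utf8() is hypothetical, it assumes the mbstring extension is loaded, and it assumes that any non-UTF-8 input is Windows-1252.

```php
<?php
// Hypothetical helper: return $line as UTF-8, converting only when needed.
// Assumes the mbstring extension, and that non-UTF-8 input is Windows-1252.
function to_utf8(string $line, string $fallback = 'Windows-1252'): string
{
    // Leave the line alone if its bytes already form valid UTF-8.
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    // Otherwise reinterpret the bytes using the assumed legacy encoding.
    return mb_convert_encoding($line, 'UTF-8', $fallback);
}

$legacy = "\xF1pyf";       // "ñpyf" as Windows-1252 bytes
$utf8   = "\xC3\xB1pyf";   // "ñpyf" as UTF-8 bytes

var_dump(to_utf8($legacy) === $utf8);
var_dump(to_utf8($utf8) === $utf8);
```

In the reading loop this would replace the utf8_encode() call, e.g. $line=to_utf8(fgets($f));. Checking first matters because converting a line that is already UTF-8 would double-encode it.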

If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?

If everything is UTF-8, then this is what you need:

foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
   // ...
}

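(The u modifier appended to the pattern makes the regex engine treat both the pattern and the subject as UTF-8, so the split happens per character rather than per byte.)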

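However, let me suggest a more compact way to perform your check:

$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
// Loop on fgets() itself so the final false return is never processed.
while(($line=fgets($f))!==false){
    $line=rtrim($line);

    // Split per UTF-8 character; the byte-based str_split() would cut "ñ" in two.
    $letters=preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY);
    // Keep the line only if it is non-empty and every letter is allowed.
    if ($letters && count(array_intersect($letters,$allowed_letters))==count($letters)) {
        $lines[]=$line;
    }
}
fclose($f);

(To allow space characters as well, add " " to $allowed_letters and change rtrim($line) to rtrim($line,"\r\n") so that only the line ending is stripped.)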

In UTF-8, ñ is encoded as two bytes (0xC3 0xB1). Most PHP string operations are byte-based, so when you preg_split the input without the u modifier, the first byte and the second byte end up in separate array items. Neither byte on its own equals the two-byte string stored in $allowed_letters, so ñ never matches.
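The byte-level behaviour described above can be checked directly. This is a small illustrative sketch; it spells ñ with explicit byte escapes so that the result does not depend on how the script file itself is saved.

```php
<?php
$n = "\xC3\xB1";   // the two UTF-8 bytes of "ñ"

echo strlen($n), "\n";    // strlen() counts bytes, not characters
echo bin2hex($n), "\n";   // hex dump of those bytes

// Without the u modifier the split is per byte; with it, per character.
$bytes = preg_split('//',  $n, -1, PREG_SPLIT_NO_EMPTY);
$chars = preg_split('//u', $n, -1, PREG_SPLIT_NO_EMPTY);
echo count($bytes), " ", count($chars), "\n";

// Only the character-wise split yields an item equal to the full "ñ".
var_dump(in_array($n, $bytes, true));
var_dump(in_array($n, $chars, true));
```

The byte-wise split produces two one-byte strings, which is why in_array() against $allowed_letters fails in the original code.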

As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as UTF-8 characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.

A simpler and quicker way than splitting would be to match each line against a character-class regex. Again, this must be a u regex.

if(preg_match('/^[aoeuñpyf]+$/u', $line))
    $lines[]= $line;
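For completeness, the character-class approach can be wrapped into a self-contained sketch. The function name filter_lines() and its parameters are hypothetical, and the example assumes that both the script and its input are UTF-8.

```php
<?php
// Hypothetical helper: keep only the lines made up entirely of allowed letters.
// Assumes this script and the input lines are both UTF-8.
function filter_lines(array $rawLines, string $allowed = 'aoeuñpyf'): array
{
    // preg_quote() escapes any characters that are special in a regex.
    $pattern = '/^[' . preg_quote($allowed, '/') . ']+$/u';
    $kept = [];
    foreach ($rawLines as $raw) {
        $line = rtrim($raw, "\r\n");   // strip only the line ending
        if (preg_match($pattern, $line) === 1) {
            $kept[] = $line;
        }
    }
    return $kept;
}

// Same content as foo.txt in the question, given inline for the example.
$result = filter_lines(["aoeu\n", "qjkx\n", "ñpyf\n"]);
print_r($result);
```

To run it against the real file, the lines could be read with file("foo.txt") and passed in as $rawLines.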

It sounds like you've already got your answer, but it is important to recognize that Unicode characters can be stored in multiple ways. For example, ñ can be the single code point U+00F1, or the letter n followed by the combining tilde U+0303. Unicode normalization is a process which can help ensure comparisons work as expected.

http://en.wikipedia.org/wiki/Unicode_equivalence
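As a rough illustration of why normalization matters, the sketch below compares the two representations of ñ. It assumes the intl extension is available, since that is what provides the Normalizer class.

```php
<?php
// Assumes the intl extension (provides the Normalizer class).
$composed   = "\u{00F1}";    // ñ as a single code point
$decomposed = "n\u{0303}";   // n followed by a combining tilde

// The two strings display the same but differ byte for byte.
var_dump($composed === $decomposed);

// Normalizing to NFC composes the sequence into the single code point.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $composed);
```

Normalizing each line to the same form as the entries in $allowed_letters before comparing would make the check independent of how the file's author typed the character.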
