Handling Extended ASCII in File Uploads

2023-01-04 05:10

A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text format (to ease development) is .txt, and uploads normally go off without a hitch (or not...).

The problem I've encountered is the same as any developer's: Microsoft's Extended ASCII.

Before outputting the text from the file, I run it through several layers of cleanup:

$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);

// BOM Fun
    $boms = array
    (
        "utf8"    => array(3,pack("CCC",0xEF,0xBB,0xBF)),
        "utf16be"       => array(2,pack("CC",0xFE,0xFF)),
        "utf16le"       => array(2,pack("CC",0xFF,0xFE)),
        "utf32be"       => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
        "utf32le"       => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
        "gb18030"       => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
    );
    foreach($boms as $bom)
    {
        if(mb_substr($txtfile,0,$bom[0]) == $bom[1])
        {
            $txtfile = substr($txtfile,$bom[0]);
            break;
        }
    }
$txtfile_o = $txtfile;
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");

The str_replace call is the usual way of converting Microsoft's awful smart quotes, em dash, and ellipsis into their plain ASCII equivalents for output.

This code works perfectly fine as long as the uploaded file is ANSI / us-ascii.

This code does not work (for no particular reason) when the uploaded file is UTF-8.

When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In this event, the smart quotes become some sort of accented a character.

This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.

Any and all help on this would be greatly appreciated.

If I understand correctly, your problem is that the code that replaces "extended ASCII" characters with their ASCII counterparts fails when the user submits a file in UTF-8.

This was to be expected. You cannot search UTF-8 text for single Windows-1252 byte values with str_replace and the like. Those functions operate at the byte level, and in UTF-8 only characters in the ASCII range are encoded as a single byte; every other character, smart quotes included, takes two or more bytes.
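To make the byte-level mismatch concrete, here is a small sketch using the right single quote as the example character. The variable names are only illustrative and are not taken from the question's code.

```php
<?php
$cp1252_quote = chr(146);        // right single quote in Windows-1252: one byte, 0x92
$utf8_quote   = "\xE2\x80\x99";  // the same character (U+2019) in UTF-8: three bytes

// Byte counts of the two encodings of the same character.
$lengths = array(strlen($cp1252_quote), strlen($utf8_quote));

// Searching UTF-8 text for the single Windows-1252 byte matches nothing here,
// so str_replace hands the text back unchanged.
$utf8_text = "It" . $utf8_quote . "s";
$unchanged = str_replace($cp1252_quote, "'", $utf8_text);
```

The byte 0x92 never occurs in the sequence E2 80 99, which is why the replacement in the question silently does nothing for this character in a UTF-8 file.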

What I'd recommend is to use some heuristic to determine whether the file is encoded in UTF-8 (the BOM is a good indicator if you're sure it'll be present), Windows-1252, or something else, and then convert it to UTF-8 if it isn't already. That way you wouldn't need to replace any characters; you could preserve the smart quotes.
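A minimal sketch of that approach, assuming the mbstring extension is available and that any upload which is not valid UTF-8 is Windows-1252. That second assumption is a guess rather than real detection, and the function name to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: normalise the raw bytes of an uploaded text file to UTF-8.
// Assumption: anything that is not valid UTF-8 is treated as Windows-1252.
function to_utf8(string $raw): string
{
    // Strip a UTF-8 BOM if present. This is a byte comparison, so substr
    // is used rather than mb_substr.
    if (substr($raw, 0, 3) === "\xEF\xBB\xBF") {
        $raw = substr($raw, 3);
    }

    // Valid UTF-8 (plain ASCII also passes this check) is left untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Otherwise interpret the bytes as Windows-1252 and convert them,
    // which keeps the smart quotes as real UTF-8 characters.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}
```

With this in place, the result can be printed directly on a UTF-8 page. Note that the source encoding is passed explicitly to mb_convert_encoding; the call in the question omits it, so the conversion there depends on PHP's configured internal encoding.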

The characters you are trying to replace have different byte values in UTF-8. In fact, each of them takes more than one byte in UTF-8. You are searching for them using their Windows-1252 byte values, which is why you won't find them.

Look up the UTF-8 byte sequences of the characters and use those for the search.
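As a sketch of that suggestion, the six characters from the question's $badwords array can be searched for by their UTF-8 byte sequences instead. The function name ascii_punctuation is hypothetical, and the input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical helper: replace typographic punctuation in UTF-8 text with ASCII.
// Assumes $utf8_text is already valid UTF-8.
function ascii_punctuation(string $utf8_text): string
{
    $bad = array(
        "\xE2\x80\x98", // U+2018 left single quote  (Windows-1252 0x91, chr(145))
        "\xE2\x80\x99", // U+2019 right single quote (Windows-1252 0x92, chr(146))
        "\xE2\x80\x9C", // U+201C left double quote  (Windows-1252 0x93, chr(147))
        "\xE2\x80\x9D", // U+201D right double quote (Windows-1252 0x94, chr(148))
        "\xE2\x80\x94", // U+2014 em dash            (Windows-1252 0x97, chr(151))
        "\xE2\x80\xA6", // U+2026 ellipsis           (Windows-1252 0x85, chr(133))
    );
    $fix = array("'", "'", '"', '"', '-', '...');

    // Matching whole multi-byte sequences with a byte-level function is safe
    // in UTF-8, because one character's bytes cannot start inside another's.
    return str_replace($bad, $fix, $utf8_text);
}
```

For a file that arrives in Windows-1252, the original chr() values would still be the right search keys, so the two lists need to be chosen according to the detected encoding.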
