How can I preg_replace special characters like those in 'Prêt-à-porter'?
There are heaps of questions about this on this forum and on the web in general, but I just don't get it.
Here is my code:
function updateGuideKeywords($dal)
{
$pattern = "/[^a-zA-Z-êàé]/";
$keywords = preg_replace($pattern, '', $_POST['keywords']);
echo json_encode($keywords);
}
Now, the input is Prêt-à-porter, and the output is "Pr\u00eat-\u00e0-porter".
Why do I get the '\u00e'?
And how can I alter my pattern to include the characters ê, à and é?
EDIT
Hmm... since it looks like a Unicode / character-encoding issue, I might go for the solution I found on this page. Here they suggest doing something like this:
$chain="prêt-à-porter";
$pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");
$replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'a', 'A', 'A', 'c', 'C');
$chain = preg_replace($pattern, $replace, $chain);
EDIT 2
This is my solution so far:
function updateGuideKeywords()
{
//First we replace characters with accents
$pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");
$replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'a', 'A', 'A', 'c', 'C');
$shguideID = $_POST['shguideID'];
$keywords = preg_replace($pattern, $replace, $_POST['keywords']);
//Then we remove unwanted characters by only allowing a-z, A-Z, comma, 'minus' and white space
$keywords = preg_replace("/[^a-zA-Z-,\s]/", "", $keywords);
echo json_encode($keywords);
}
If you want to replace 'é' with 'e', etc., use iconv() with the //TRANSLIT modifier, e.g.:
$newString = iconv('UTF-8', 'ASCII//TRANSLIT', $myString);
A more complete example:
$ cat scratch.php
<?php
$x = "Prêt-à-porter";
var_dump(json_encode(iconv("UTF-8", "ASCII//TRANSLIT", $x)));
$ php scratch.php
string(15) ""Pret-a-porter""
$
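As a rough sketch of how the transliteration step could be combined with the asker's clean-up step: the helper name normalizeKeywords below is hypothetical, and what //TRANSLIT actually produces depends on the iconv implementation and locale on your system, so treat the output as something to verify rather than rely on.

```php
<?php
// Hypothetical helper (not from the original post): transliterate to ASCII,
// then strip everything except letters, commas, hyphens and whitespace.
function normalizeKeywords(string $raw): string
{
    // iconv() returns false when the input cannot be converted at all.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $raw);
    if ($ascii === false) {
        $ascii = '';
    }

    // The hyphen sits at the end of the class so it is read as a literal.
    return (string) preg_replace('/[^a-zA-Z,\s-]/', '', $ascii);
}

// Assumes this file is saved as UTF-8.
echo json_encode(normalizeKeywords("Prêt-à-porter, café")), "\n";
```

Because the second step removes any stray characters the transliteration may leave behind, the result contains only the allowed ASCII characters on any platform, although which letters survive can still differ.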
"Pr\u00eat-\u00e0-porter" is a correct JavaScript string literal representation of Prêt-à-porter. I assume you're doing a json_encode at some point along the line?
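To see that the escapes are only a representation, here is a small sketch comparing the default json_encode() output with the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4). It assumes the source file is saved as UTF-8.

```php
<?php
$keywords = "Prêt-à-porter"; // assumes this file is saved as UTF-8

$escaped = json_encode($keywords);
$literal = json_encode($keywords, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n"; // "Pr\u00eat-\u00e0-porter"
echo $literal, "\n"; // "Prêt-à-porter"

// Both forms decode back to the same string.
var_dump(json_decode($escaped) === $keywords); // bool(true)
var_dump(json_decode($literal) === $keywords); // bool(true)
```

Any JSON parser on the receiving side turns \u00ea back into ê, so the escaped form is usually safe to leave as it is.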
Note also that PHP's preg functions are not Unicode-aware unless you add the /u modifier, so if you are using UTF-8 (which generally you want to be), the character ê is not a single character to them, but byte C3 followed by byte AA. That's fine for simple literal matches, but in situations like a character class you're now matching the two bytes separately instead of one after the other, which can easily mess up your expression.
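The byte-versus-character difference can be illustrated with a minimal sketch. It assumes both the source file and the input are UTF-8, and it uses an input containing ï (bytes C3 AF), which shares its first byte with ê, à and é.

```php
<?php
// Assumes this file and the input are UTF-8 encoded.
$input = "naïve-ê";

// Byte mode: the class lists the bytes C3, AA, A0 and A9 individually, so the
// first byte of ï (C3 AF) is kept while the second one is stripped.
$bytewise = preg_replace('/[^a-zA-Z\-êàé]/', '', $input);

// UTF-8 mode: ï is one character outside the class and is removed whole.
$unicode = preg_replace('/[^a-zA-Z\-êàé]/u', '', $input);

// An empty pattern with /u only matches when the subject is valid UTF-8.
var_dump(preg_match('//u', $bytewise) === 1); // bool(false): a lone C3 byte remains
var_dump($unicode);                           // string(7) "nave-ê"
```

The byte-mode result is no longer valid UTF-8, and json_encode() returns false for such a string by default.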
This may not be 100% accurate, but looking at the regex you're using, I don't think preg_replace() is the issue. I think the reason you are getting '\u00e' is PHP's poor support for character encodings.
From what I see of your output, your characters are not removed (since they are in your pattern), so the only difference is that the output is written as Unicode escapes. Try changing your document to UTF-8 or encoding HTML entities and it should work, but beware: if you encode entities before replacing, the pattern won't detect the characters, as they will already be converted.
Your code, with the latest edits so far, works this way:
The expression /[^a-zA-Z-êàé]/ means "match anything that's not an English letter, minus sign, ê, à or é".
preg_replace($pattern, '', 'Prêt-à-porter') returns 'Prêt-à-porter' since nothing matches.
json_encode() returns the JSON representation of 'Prêt-à-porter', which is "Pr\u00eat-\u00e0-porter".
It's not clear to me what your exact goal is. If you want to remove anything that's not a minus or a letter, you can try this pattern:
/[^\p{L}-]/u
You could also use mb_ereg_replace() to work with multibyte characters in your string.
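For the letters-and-minus case, a small sketch using the Unicode letter property \p{L} may help. It assumes UTF-8 input; the comparison with \w is included because \w also accepts digits and underscores but not the minus sign.

```php
<?php
$input = "Prêt-à-porter #1!"; // assumes this file is saved as UTF-8

// \p{L} matches any Unicode letter; the trailing "-" is a literal hyphen.
$lettersAndHyphen = preg_replace('/[^\p{L}-]/u', '', $input);

// For comparison: a \w-based class keeps digits and "_" but drops "-".
$wordChars = preg_replace('/[^\w]/u', '', $input);

var_dump($lettersAndHyphen); // string(15) "Prêt-à-porter"
var_dump($wordChars);
```

The /u modifier is what makes \p{L} and the accented letters work as whole characters; without it the pattern is applied byte by byte.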