Removing accents chars from arabic text
I've a database with collation of utf8 / utf8_bin. DB have arabic text with accent chars (kasar etc). I want to search text without accent chars.
What i did is, i wrote 'like' query ...
WHERE replace(field1,0x[CODE],'') like '%[arabic text]%' --- where [CODE] is th开发者_StackOverflow中文版e accent char.
This is working fine when i write static utf arabic text in place of field1. But it does not replace when it run from database field.
May be db field have latin1 encoding. How can i fix / check this issue.
Thanks.
utf8_bin
is a binary collation - it is extremely literal and strict in comparing characters.
utf8_general_ci
is more lenient and normalizes Umlauts and accents to their "basic" version:
Ä
=> A
Ü
=> U
etc.
I have never worked with Arabic before so I don't know whether it applies for those accents as well but I would expect so.
You should be able to temporarily use the lenient collation like this:
WHERE field1 like '%[arabic text]%' COLLATE utf8_general_ci;
i used the below query in postgreSQL/navicat to create a new accentless column of the original accented text
with remove_accent_1 as (
SELECT
index, text as original_text, replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace("text", 'ٍ',''), 'ِ',''), 'ً',''), 'َ',''), 'ّ',''), 'ْ',''), 'ۖ',''), 'ٌ',''), 'ُ',''), 'آ','ا'), 'ٰ ',''), ' ۚ ',''), 'ۗ ','') as new_text
FROM "arabic_text"
ORDER BY "index" ASC )
select
index, original_text, new_text
from remove_accent_1
精彩评论