Finding and removing Non-ASCII characters from an Oracle Varchar2

2022-12-20 05:07 问答作者：

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit. When we try and migrate these record they fail as they contai开发者_如何学Cn characters that become multibyte UF8 characters. What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.

I would like to do :

SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')

but Oracle does not implement the [:ascii:] character class.

Is there a simple way doing what I want to do?

I think this will do the trick:

SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')

...where field and table are your field and table names respectively.

I wouldn't recommend it for production code, but it makes sense and seems to work:

SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')

The select may look like the following sample:

select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)  
order by nvalue;

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

Create a PLSQL function to receive your input string and return a varchar2.

In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.

That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

The following also works:

select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
          CONVERT (
             '3735844533120%$03  ',
             'US7ASCII',
             'WE8ISO8859P1'),
          '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);

I had a similar issue and blogged about it here. I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:

select dump(a,1016), a, b
from
 (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
         COLUMN b
  from TABLE)
where a is not null
order by a;

I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

I found the answer here:

http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html

CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/

Then run this to update your data

update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);

Try the following:

-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')

-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual

You can try something like following to search for the column containing non-ascii character :

select * from your_table where your_col <> asciistr(your_col);

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.

I tried this from an excellent comment

'[^ -~|[:space:]]'

but got this ORA-12728: invalid range in regular expression .

but it lead me to my solution:

select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from  
    (select '-   <- strangest thing here, and I want to keep line break after
-' deta from dual ) t

displays (in my TOAD tool) as

Finding and removing Non-ASCII characters from an Oracle Varchar2

replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.

REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))

I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.

Here is the simple test in Oracle 11.2.03

select s
     , regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
     , dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
  from (select 127+level l,chr(127+level) c from dual connect by level < 129))

And "rep 127-255" is

Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

i.e for some reason this version of Oracle does not replace char(226) and above. Using '['||chr(127)||'-'||chr(225)||']' gives the desired result. If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

Please note that whenever you use

regexp_like(column, '[A-Z]')

Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.

Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.

Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like

å => a
ä => a
ö => o

of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:

SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;

This will replace anything outside that printable range as a question mark.

This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

Do this, it will work.

trim(replace(ntwk_slctor_key_txt, chr(0), ''))

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

SELECT est.clients_ref
  ,TRANSLATE (
              est.clients_ref
             ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
              || REPLACE (
                          TRANSLATE (
                                     est.clients_ref
                                    ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                    ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                    )
                         ,'~'
                         )
             ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
             )
      clean_ref

FROM edms_staging_table est

继续阅读：ascii oracle regex

Finding and removing Non-ASCII characters from an Oracle Varchar2

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？