开发者

Remove special characters from a database field

I have a database with severa开发者_如何学JAVAl thousand records, and I need to strip down one of the fields to ensure that it only contains certain characters (Alphanumeric, spaces, and single quotes). What SQL can I use to strip any other characters (such as slashes, etc) from that field in the whole database?


update mytable
set FieldName = REPLACE(FieldName,'/','')

That's a good place to start.


I have created simple function for this

DROP FUNCTION IF EXISTS `regex_replace`$$

CREATE FUNCTION `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000)) RETURNS VARCHAR(1000) CHARSET utf8mb4
    DETERMINISTIC
BEGIN    
    DECLARE temp VARCHAR(1000); 
    DECLARE ch VARCHAR(1); 
    DECLARE i INT;
    SET i = 1;
    SET temp = '';
    IF original REGEXP pattern THEN 
        loop_label: LOOP 
            IF i>CHAR_LENGTH(original) THEN
                LEAVE loop_label;  
            END IF;

            SET ch = SUBSTRING(original,i,1);

            IF NOT ch REGEXP pattern THEN
                SET temp = CONCAT(temp,ch);
            ELSE
                SET temp = CONCAT(temp,replacement);
            END IF;

            SET i=i+1;
        END LOOP;
    ELSE
        SET temp = original;
    END IF;

    RETURN temp;
END

Usage example:

SELECT <field-name> AS NormalText, regex_replace('[^A-Za-z0-9 ]', '', <field-name>)AS RegexText FROM 
<table-name>


This might be useful.

This solution doesn't involves creating procedures or functions or lengthy use of replace within replace. Instead we know that all the ASCII characters that doesn't involves special character lies within ASCII codes \x20-\x7E (Hex representation). Source ASCII From Wikipedia, the free encyclopedia Below are all those characters in that interval.

Hex: 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E
Glyph:  space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ↑ ← @ a b c d e f g h i j k l m n o p q r s t u v w x y z { ACK } ESC

so as simple regular expression replace will do the job

SELECT REGEXP_REPLACE(columnName, '[^\\x20-\\x7E]', '') from tableName;

PHP Custom query string

$query = "select REGEXP_REPLACE(columnName, '(.*)[(].*[)](.*)', CONCAT('\\\\1', '\\\\2')) `Alias` FROM table_Name";

The above statement replaces the content in between brackets as well as brackets. e.g. if the column contains 'Staff Oreintation (CMST TOT)' then above statement will removes the brackets and its contant i.e. 'Staff Oreintation'.

PS: If you are doing any DML (select, update ...) operation using prepare statement in stored procedure OR through PHP (creating a custom query string); then remember to escape the slash i.e.

SET @sql = CONCAT("SELECT REGEXP_REPLACE(columnName, '[^\\\\x20-\\\\x7E]', '') from tableName");
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

The above SQL statement does a simple regular expression replaces (actually removes) of all the special character; i.e. In the SQL a REGEX pattern is mention of all the special characters to be replaced with nothing.

Explanation of the pattern

A character group is started with the square bracket. The first character is caret which means; negation of all the characters mention in the group (i.e. with in the squares brackets). This simply means select compliment (other character than those selected) of all the characters in the group.

Just to summarize the above statement will

FYI: Remember Enter (line feed \n 0A, Carriage Return \r 0D), Tab (Horizontal Tab \t 09, Vertical Tab \v 0B) are not printable characters but are sometimes significant; So if you want them excluded as well add them also. i.e.

[^\x20-\x7E\x0A\x0D\x09\x0B]

Unchanged: all the alphanumeric characters, punctuation characters, arithmetic operators.

Remove all the Unicode characters (other than Latin alphabets) or special characters.


The Replace() function is first choice. However, Special Characters can sometimes be tricky to write in a console. For those you can combine Replace with the Char() function.

e.g. removing €

Update products set description = replace(description, char(128), '');

You can find all the Ascii values here

Ideally you could do a regex to find all the special chars, but apparently that's not possible with MySQL.

Beyond that, you'd need to run it through your favorite scripting language.


This may also be useful.

First you have to know the character set of the database and / or of the table. For example, let us suppose you have a UTF-8 environment and you want to strip / remove symbols like circled registered symbols, circled copyright symbol, and registered trademark symbol from a field then search the internet via bing or yahoo or google for for the hex code values of these symbols in the UTF-8 system:

Symbol                       Utf-8 Hex
=======                      =========
circled copyright              C2A9
circled registered             C2AE
Trademark (i.e., TM)           E284A2

Then your scrubbing select sql for field f1 from table t1, using the hex / unhex facility in conjunction with the replace function, will most likely look like this:

SELECT cast(unhex(replace(replace(replace(hex(f1),'C2A9',''),'C2AE',''),'E284A2','')) AS char) AS cleanf1 FROM t1 ;

Above, note the original field to be scrubbed / cleansed is f1, the table is t1 and the output header is cleanf1. The "as char" casting is necessary because, w/o it, the mysql 5.5.8 on which I tested is returning blob. Hope this helps


Adeel's answer is by far the best and simplest.

The OP needed to update the db, which is what I need too. So I figured I'd put that here for the next poor sole, like me, not to have to redo what I did.

Double check first, select it and scan them to make sure you're getting the right rows, before you update.

SELECT REGEXP_REPLACE(columnName, '[^\\x20-\\x7E]', '') from tableName;

Count to do a safety check ...

SELECT count(*) from tableName WHERE columnName REGEXP '[^\\x20-\\x7E]';

For some names I had to do another mapping so as not to lose their meaning like Ramon to Ramn because the o has a umlaut or grave or circumflex. So I used this to map ... https://theasciicode.com.ar

Then update This update is a catch all after the mapping update. Change the limit number to the count value above ...

UPDATE tablename SET columnName = REGEXP_REPLACE(columnName, '[^\\x20-\\x7E]', '') WHERE columnName REGEXP '[^\\x20-\\x7E]' LIMIT 1;


Elaborating on Vinnies answer... you can use the following (note the escaping in the last two statements...

update table set column = REPLACE(column,"`","");
update table set column = REPLACE(column,"~","");
update table set column = REPLACE(column,"!","");
update table set column = REPLACE(column,"@","");
update table set column = REPLACE(column,"#","");
update table set column = REPLACE(column,"$","");
update table set column = REPLACE(column,"%","");
update table set column = REPLACE(column,"^","");
update table set column = REPLACE(column,"&","");
update table set column = REPLACE(column,"*","");
update table set column = REPLACE(column,"(","");
update table set column = REPLACE(column,")","");
update table set column = REPLACE(column,"-","");
update table set column = REPLACE(column,"_","");
update table set column = REPLACE(column,"=","");
update table set column = REPLACE(column,"+","");
update table set column = REPLACE(column,"{","");
update table set column = REPLACE(column,"}","");
update table set column = REPLACE(column,"[","");
update table set column = REPLACE(column,"]","");
update table set column = REPLACE(column,"|","");
update table set column = REPLACE(column,";","");
update table set column = REPLACE(column,":","");
update table set column = REPLACE(column,"'","");
update table set column = REPLACE(column,"<","");
update table set column = REPLACE(column,",","");
update table set column = REPLACE(column,">","");
update table set column = REPLACE(column,".","");
update table set column = REPLACE(column,"/","");
update table set column = REPLACE(column,"?","");
update table set column = REPLACE(column,"\\","");
update table set column = REPLACE(column,"\"","");


There is no regular expression replacement. Use the following code to replace all special characters with '-'.

UPDATE <table> SET <column> = REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (<column>, '/', '-'), ',', '-'), '.', '-'), '<', '-'), '>', '-'), '?', '-'), ';', '-'), ':', '-'), '"', '-'), "'", '-'), '|', '-'), '\\', '-'), '=', '-'), '+', '-'), '*', '-'), '&', '-'), '^', '-'), '%', '-'), '$', '-'), '#', '-'), '@', '-'), '!', '-'), '~', '-'), '`', '-'), '', '-'), '{', '-' ), '}', '-' ), '[', '-' ), ']', '-' ), '(', '-' ), ')', '-' )

Code formatted

UPDATE
    <table>
SET
    <column> =
REPLACE
    (
    REPLACE
        (
        REPLACE
            (
            REPLACE
                (
                REPLACE
                    (
                    REPLACE
                        (
                        REPLACE
                            (
                            REPLACE
                                (
                                REPLACE
                                    (
                                    REPLACE
                                        (
                                        REPLACE
                                            (
                                            REPLACE
                                                (
                                                REPLACE
                                                    (
                                                    REPLACE
                                                        (
                                                        REPLACE
                                                            (
                                                            REPLACE
                                                                (
                                                                REPLACE
                                                                    (
                                                                    REPLACE
                                                                        (
                                                                        REPLACE
                                                                            (
                                                                            REPLACE
                                                                                (
                                                                                REPLACE
                                                                                    (
                                                                                    REPLACE
                                                                                        (
                                                                                        REPLACE
                                                                                            (
                                                                                            REPLACE
                                                                                                (
                                                                                                REPLACE
                                                                                                    (
                                                                                                    REPLACE
                                                                                                        (
                                                                                                        REPLACE
                                                                                                            (
                                                                                                            REPLACE
                                                                                                                (
                                                                                                                REPLACE
                                                                                                                    (
                                                                                                                    REPLACE
                                                                                                                        (
                                                                                                                    REPLACE
                                                                                                                        (<column>, '/', '-'),
                                                                                                                        ',',
                                                                                                                        '-'
                                                                                                                    ),
                                                                                                                    '.',
                                                                                                                    '-'
                                                                                                                ),
                                                                                                                '<',
                                                                                                                '-'
                                                                                                            ),
                                                                                                            '>',
                                                                                                            '-'
                                                                                                        ),
                                                                                                        '?',
                                                                                                        '-'
                                                                                                    ),
                                                                                                    ';',
                                                                                                    '-'
                                                                                                ),
                                                                                                ':',
                                                                                                '-'
                                                                                            ),
                                                                                            '"',
                                                                                            '-'
                                                                                        ),
                                                                                        "'",
                                                                                        '-'
                                                                                    ),
                                                                                    '|',
                                                                                    '-'
                                                                                ),
                                                                                '\\',
                                                                                '-'
                                                                            ),
                                                                            '=',
                                                                            '-'
                                                                        ),
                                                                        '+',
                                                                        '-'
                                                                    ),
                                                                    '*',
                                                                    '-'
                                                                ),
                                                                '&',
                                                                '-'
                                                            ),
                                                            '^',
                                                            '-'
                                                        ),
                                                        '%',
                                                        '-'
                                                    ),
                                                    '$',
                                                    '-'
                                                ),
                                                '#',
                                                '-'
                                            ),
                                            '@',
                                            '-'
                                        ),
                                        '!',
                                        '-'
                                    ),
                                    '~',
                                    '-'
                                ),
                                '`',
                                '-'
                            ),
                            '',
                            '-'
                        ),
                        '{',
                        '-'
                    ),
                    '}',
                    '-'
                ),
                '[',
                '-'
            ),
            ']',
            '-'
        ),
        '(',
        '-'
    ),
    ')',
    '-'
)


Have a look at LIB_MYSQLUDF_PREG which needs to be compiled into the MySQL server, but has advanced regular expression facilities such as preg_replace which will help with your task.


My version of MySQL doesn't have REGEXP_REPLACE(). I used the following two workarounds: 1. Remove specified characters (if you know what characters you want to remove)

    create function fn_remove_selected_characters
        (v_input_string varchar(255),
         v_unacceptable_characters varchar(255))
    RETURNS varchar(255)
    BEGIN

    -- declare variables
    declare i int;
    declare unacceptable_values varchar(255);
    declare this_character char(1);
    declare output_string varchar(255);
    declare input_length int;
    declare boolean_value int;
    declare space varchar(3);

    -- Set variable values
    set input_length = char_length(v_input_string);
    set i = 0;
    set unacceptable_values = v_unacceptable_characters;
    set output_string = '';
    set boolean_value = 0;
    set space = 'no';

    begin
    -- Leave spaces if they aren't in the exclude list
    if instr( unacceptable_values, ' ') = 0 then
        begin
        while i < input_length do
            SET this_character = SUBSTRING( v_input_string, i, 1 );
                -- If the current character is a space, 
                -- then concatenate a space to the output
                -- Although it seems redundant to explicitly add a space,
                -- SUBSTRING() equates a space to the empty string
                if this_character = ' ' then
                    set output_string = concat(output_string, ' ');
                -- if the current character is not a space, remove it if it's unwanted
                elseif instr(unacceptable_values, this_character) then
                    set output_string = concat(output_string, '');
                -- otherwise include the character
                else set output_string = concat(output_string, this_character);
                end if;
            set i = i + 1;
        end while;
        end;
    else
        begin
        while i < input_length do
            begin
            SET this_character = SUBSTRING( v_input_string, i, 1 );
            if instr(unacceptable_values, this_character) > 0 then
                set output_string = concat(output_string, '');
            else set output_string = concat(output_string, this_character);
            end if;
            end;
            set i = i + 1;
        end while;
        end;
    end if;
    end;
        RETURN output_string;
  1. Keep only the characters you want:
    create function fn_preserve_selected_characters
        (v_input_string varchar(255),
         v_acceptable_characters varchar(255))
    returns varchar(255)

    begin
    declare i int;
    declare acceptable_values varchar(255);
    declare this_character char(1);
    declare output_string varchar(255);
    declare input_length int;
    declare boolean_value int;
    declare space varchar(3);

    set input_length = char_length(v_input_string);
    set i = 0;
    set acceptable_values = v_acceptable_characters;
    set output_string = '';
    set boolean_value = 0;
    set space = 'no';

    begin

    -- check for existence of spaces
    if instr( acceptable_values, ' ') then
        begin
        while i < input_length do
            -- SUBSTRING() treats spaces as empty strings
            -- so handle them specially
            SET this_character = SUBSTRING( v_input_string, i, 1 );
                if this_character = ' ' then
                    set output_string = concat(output_string, ' ');
                elseif instr(acceptable_values, this_character) then
                    set output_string = concat(output_string, this_character);
                else set output_string = concat(output_string, '');
                end if;
            set i = i + 1;
        end while;
        end;
    -- if there are no spaces in input string
    -- then this section is complete
    else 
        begin
        while i <= input_length do
            SET this_character = SUBSTRING( v_input_string, i, 1 );
            -- if the current character exists in the punctuation string
            if LOCATE( this_character, acceptable_values ) > 0 THEN
                set output_string = concat(output_string, this_character);
            end if;
            set i = i+1;
        end while;
        end;
    end if;
    end;
        RETURN output_string;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜