Regex question in gawk
My CSV data file looks like this:
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
As you can see, I want to filter out rows like the 2nd and 3rd data rows. A valid name must contain no whitespace and be at least 3 characters long, so these two fail:
MRS.,RAJ KUMAR,male
MR.,N,Male
The rejected rows should go into a file called rejected_list.csv, and everything else into a file called clean_list.csv.
Here is my gawk script for it:
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that this script does not recognise '\S' as a character class (any character except whitespace). Instead it selects every name that starts with or contains an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add a {3,} limit the script fails:
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+' and so on, but I can't get what I want.
Can anyone tell me what the problem is?
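Use [[:graph:]] (the POSIX class [:graph:] written inside a bracket expression) instead of \S to match any printable, visible character. Older versions of gawk (before 4.0, as far as I know) do not recognize \S as a regex operator, so it ends up matching a literal S, which is the behaviour you are seeing.
Additionally, in those older versions the {3,} interval expression only works in POSIX or re-interval mode (gawk --posix or gawk --re-interval). Without one of those options the braces are treated as literal characters.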
I also added one more rejection condition: any line that does not have exactly 3 fields.
gawk -F, '
BEGIN {
titles = "MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF."
genders = "M|F|Male|Female"
}
$1 !~ titles || $2 ~ /[[:space:]]/ || length($2) < 3 || $3 !~ genders || NF != 3 {
print > "rejected_list.csv"
next
}
{ print > "clean_list.csv" }
' < DATA_file.csv