
grep for (curly|microsoft|smart) quotes

I have a huge folder filled with xml documents, some of which may break because they contain those curly quotes, i.e. Microsoft Word quotes, i.e. smart quotes. I just want to run a quick check to see what I'm up against. Anybody know how to grep for them so I can easily find the offenders?

Edit

Here's a simplified example.

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>Pretend this is a curly quote: '</item>
</items>


Curly quotes have the following Unicode code points and UTF-8 sequences:

Name                                     CodePoint     UTF-8 sequence
----                                     ---------     --------------
LEFT SINGLE QUOTATION MARK               U+2018        0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK              U+2019        0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK              U+201A        0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK    U+201B        0xE2 0x80 0x9B 
LEFT DOUBLE QUOTATION MARK               U+201C        0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK              U+201D        0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK              U+201E        0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK    U+201F        0xE2 0x80 0x9F

XML is usually stored in UTF-8, so you could just compare directly for the byte sequence.
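A minimal sketch of that direct byte comparison, assuming the files really are UTF-8: printf builds the three raw bytes for U+201C from the table above, and grep -F searches for them as a fixed string, so no regex engine is involved. The function name find_left_double_quote is hypothetical.

```shell
# Hypothetical helper: list files under a directory that contain U+201C.
find_left_double_quote() {
    # printf turns the octal escapes into the raw bytes 0xE2 0x80 0x9C.
    needle=$(printf '\342\200\234')
    # -r recurses, -l prints only file names, -F treats the needle as a fixed string.
    grep -rlF -- "$needle" "${1:-.}"
}

# Usage: find_left_double_quote /path/to/xml/folder
```

Swapping the last octal value (\230 through \237) selects any other row of the table.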


You can find files containing the UTF-8 sequences dalle mentioned thusly:

grep -r -P "\xE2\x80\x9C" .

The -r makes it recursive and the -P tells grep to use Perl compatible regular expressions.
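The command above only matches U+201C. Since all eight code points in dalle's table share the prefix 0xE2 0x80 and end in a byte between 0x98 and 0x9F, one pattern can cover them all. This is a sketch that assumes a GNU grep built with -P support. The LC_ALL=C prefix is my addition: in a UTF-8 locale, -P may interpret \xE2 as the code point U+00E2 rather than the byte 0xE2. The function name is hypothetical.

```shell
# Hypothetical helper: list files containing any of U+2018 through U+201F.
find_curly_quote_files() {
    # LC_ALL=C makes grep work on raw bytes, so each \xNN is a single byte.
    # The bracket range covers the final byte of all eight quote characters.
    LC_ALL=C grep -rlP '\xE2\x80[\x98-\x9F]' "${1:-.}"
}

# Usage: find_curly_quote_files /path/to/xml/folder
```

Replacing -l with -n prints the matching lines with their line numbers instead of just the file names.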


If they're xml documents, you could open one of them that you know contains the offending quotes, to see exactly what they look like in the xml file (and copy them to clipboard, if you can't reproduce them easily with your keyboard).

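Assuming that your quotes look like ” or „, you could do something like sed -i.bak 's/[”„]/"/g' file1 file2 ... (on Linux, OS X, or Cygwin on Windows) to quickly substitute the offending quotes with normal quotes, modifying the files in place. The attached -i.bak form works with both GNU and BSD sed, and the g flag replaces every occurrence on a line, not just the first.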
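If you would rather not paste the quote characters into the command, the same substitution can be written against the UTF-8 byte sequences. This is only a sketch: the function name straighten_quotes is hypothetical, it assumes the files are UTF-8, and it maps all four single variants (U+2018 to U+201B) to ' and all four double variants (U+201C to U+201F) to ", which may not be what you want for the low-9 forms. Note too that a new ' or " can break an XML attribute value delimited by that same quote.

```shell
# Hypothetical helper: replace curly quotes with straight ones, keeping .bak backups.
straighten_quotes() {
    # Byte patterns: 0xE2 0x80 followed by 0x98-0x9B (single) or 0x9C-0x9F (double).
    singles=$(printf '\342\200[\230-\233]')
    doubles=$(printf '\342\200[\234-\237]')
    # LC_ALL=C makes sed treat the patterns and the input as plain bytes.
    LC_ALL=C sed -i.bak -e "s/$singles/'/g" -e "s/$doubles/\"/g" "$@"
}

# Usage: straighten_quotes file1.xml file2.xml
```

Each original is kept next to the modified file with a .bak suffix, so the change can be reviewed with diff before the backups are deleted.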


MIGHT BE A DUPLICATE

I had a situation where users would copy and paste strings from anywhere, and I had to accept their input while rejecting every special character except quotes, whether smart, fancy, or straight. To illustrate:

Text     | Error
-----------------
O*Connor | Yes
O'Connor | No
O’Connor | No

I came up with the solution below for my ColdFusion (CF) code.

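<!--- Only inspect names that contain punctuation or digits --->
<cfif REFind("[[:punct:],[:digit:]]", textName) GT 0>
    <!--- Strip everything except letters, straight/curly single quotes, primes, hyphens, and spaces --->
    <cfset temp_name = textName.ReplaceAll(JavaCast("string", "[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035\'\-\ ]"), JavaCast("string", ""))>
    <cfif len(temp_name) EQ len(textName)>
        <!--- Nothing was stripped, so only quotes or hyphens were found; do nothing --->
    <cfelse>
        <cfset errormsg = "The text contains a special character">
    </cfif>
</cfif>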

Immense help from: http://axonflux.com/handy-regexes-for-smart-quotes


I am on a Mac, and the built-in grep didn't work for me right away (see neubert's answer). I ended up installing Homebrew's version of GNU grep:

brew install grep    # formerly: brew tap homebrew/dupes && brew install homebrew/dupes/grep

Then I could run the commands in a similar fashion:

ggrep -r -P "\xE2\x80\x9C" .
etc.

I ended up combining dalle's and neubert's answers into a script that runs every case I currently know about and prints the matches for each.
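The script itself is not reproduced here. As a rough sketch of the idea, and not the original, the function below loops over the eight sequences from dalle's table and lists the files that contain each one. The name report_curly_quotes and the output format are my own assumptions. It uses grep -F with literal bytes in place of -P, so it should also run with the stock grep on a Mac.

```shell
# Hypothetical helper: for each curly-quote code point, list the files containing it.
report_curly_quotes() {
    dir=${1:-.}
    # Each table row is: UTF-8 bytes as octal escapes, then a description.
    while read -r octal name; do
        # printf expands the octal escapes into the raw three-byte sequence.
        seq=$(printf "$octal")
        printf '== %s ==\n' "$name"
        # -F matches the bytes literally; print a marker when nothing is found.
        grep -rlF -- "$seq" "$dir" || printf '(none)\n'
    done <<'EOF'
\342\200\230 LEFT SINGLE QUOTATION MARK (U+2018)
\342\200\231 RIGHT SINGLE QUOTATION MARK (U+2019)
\342\200\232 SINGLE LOW-9 QUOTATION MARK (U+201A)
\342\200\233 SINGLE HIGH-REVERSED-9 QUOTATION MARK (U+201B)
\342\200\234 LEFT DOUBLE QUOTATION MARK (U+201C)
\342\200\235 RIGHT DOUBLE QUOTATION MARK (U+201D)
\342\200\236 DOUBLE LOW-9 QUOTATION MARK (U+201E)
\342\200\237 DOUBLE HIGH-REVERSED-9 QUOTATION MARK (U+201F)
EOF
}

# Usage: report_curly_quotes /path/to/xml/folder
```

Other troublesome characters, such as dashes or ellipses, could be covered by adding rows in the same format.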
