grep for (curly|microsoft|smart) quotes

2023-02-14 17:48

I have a huge folder filled with xml documents, some of which may break because they contain those curly quotes, i.e. Microsoft Word quotes, i.e. smart quotes. I just want to run a quick check to see what I'm up against. Anybody know how to grep for them so I can easily find the offenders?

Edit

Here's a simplified example.

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>Pretend this is a curly quote: '</item>
</items>

Curly quotes have the following Unicode code points and UTF-8 sequences:

Name                                     CodePoint     UTF-8 sequence
----                                     ---------     --------------
LEFT SINGLE QUOTATION MARK               U+2018        0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK              U+2019        0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK              U+201A        0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK    U+201B        0xE2 0x80 0x9B 
LEFT DOUBLE QUOTATION MARK               U+201C        0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK              U+201D        0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK              U+201E        0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK    U+201F        0xE2 0x80 0x9F

XML is usually stored in UTF-8, so you could just search directly for these byte sequences.

You can find files containing the UTF-8 sequences dalle mentioned thusly:

grep -r -P "\xE2\x80\x9C" .

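The -r makes it recursive and the -P tells grep to use Perl-compatible regular expressions. Note that this particular pattern matches only the LEFT DOUBLE QUOTATION MARK (U+201C); substitute the other byte sequences to find the rest.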
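All eight characters in the table share the first two bytes (0xE2 0x80) and differ only in the third (0x98 through 0x9F), so one byte range should catch them all in a single pass. A rough sketch, assuming GNU grep with -P support; the function name find_curly_quotes is made up for illustration, and the LC_ALL=C prefix is my own addition so the pattern is compared byte by byte whatever your locale is:

```shell
# List every file under a directory (default: current directory) that
# contains any of U+2018..U+201F, i.e. bytes E2 80 followed by 98..9F.
# Assumption: grep is GNU grep built with -P (PCRE) support.
find_curly_quotes() {
    LC_ALL=C grep -rlP '\xE2\x80[\x98-\x9F]' "${1:-.}"
}
```

Call it as find_curly_quotes /path/to/xml. Swapping -l for -n should print the matching lines with line numbers instead of just the file names.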

If they're xml documents, you could open one of them that you know contains the offending quotes, to see exactly what they look like in the xml file (and copy them to clipboard, if you can't reproduce them easily with your keyboard).

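Assuming that your quotes look like „ or ”, you could do something like sed -i.bak 's/[”„]/"/g' file1 file2 ... (on Linux, OS X, or Cygwin on Windows) to quickly substitute the offending quotes with normal quotes, modifying the files in place. The g flag replaces every occurrence on a line rather than just the first, and writing -i.bak without a space works with both GNU and BSD sed.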
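To cover all eight characters rather than just two, something like the following might do. It is only a sketch: straighten_quotes is a made-up name, I am assuming a sed that understands -E (GNU and BSD sed both do), and I use alternation instead of a bracket expression so the multi-byte characters are matched as whole sequences even outside a UTF-8 locale:

```shell
# Replace curly single quotes with ' and curly double quotes with ",
# editing the given files in place and keeping a .bak copy of each.
straighten_quotes() {
    sed -E -i.bak -e "s/‘|’|‚|‛/'/g" -e 's/“|”|„|‟/"/g' "$@"
}
```

Run it as straighten_quotes file1.xml file2.xml and compare against the .bak copies before deleting them.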

MIGHT BE A DUPLICATE

I had a situation where users would copy and paste strings from anywhere, and I had to accept their input while rejecting any special character except quotes, whether smart, fancy, or straight. For example:

Text     | Error
---------|------
O*Connor | Yes
O'Connor | No
O’Connor | No

I came up with the following solution in my ColdFusion (CF) code.

<cfif REFind("[[:punct:],[:digit:]]", textName) GT 0>
    <!--- Strip everything except letters, quotes (straight or curly), hyphens, and spaces --->
    <cfset temp_name = textName.ReplaceAll(JavaCast("string", "[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035\'\-\ ]"), JavaCast("string", ""))>
    <cfif len(temp_name) EQ len(textName)>
        <!--- Only quotes or hyphens were found; do nothing --->
    <cfelse>
        <cfset errormsg = "The text contains a special character">
    </cfif>
</cfif>

Immense help from: http://axonflux.com/handy-regexes-for-smart-quotes

I am on a Mac, and the built-in grep didn't work for me right away (see neubert's answer). I ended up installing Homebrew's version of GNU grep:

brew tap homebrew/dupes
brew install homebrew/dupes/grep

Then I could run the commands in a similar fashion:

ggrep -r -P "\xE2\x80\x9C" .
etc.

I ended up combining dalle and neubert's answers into this script which will run all of the cases that I currently know about and print them all off.
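The script itself isn't reproduced here. A minimal sketch of what such a script could look like, looping over the eight sequences from dalle's table and printing the hits for each under its Unicode name. The function name report_curly_quotes and the GREP override variable are my own inventions (set GREP=ggrep if that is what your GNU grep is called), and LC_ALL=C is an assumption to force byte-wise matching:

```shell
# Search a directory (default: current directory) once per curly-quote
# character and print the matching lines under that character's name.
# Assumption: $GREP (default: grep) is a GNU grep with -P support.
report_curly_quotes() {
    dir=${1:-.}
    while read -r bytes name; do
        echo "== $name =="
        LC_ALL=C "${GREP:-grep}" -rnP "$bytes" "$dir"
    done <<'EOF'
\xE2\x80\x98 LEFT SINGLE QUOTATION MARK
\xE2\x80\x99 RIGHT SINGLE QUOTATION MARK
\xE2\x80\x9A SINGLE LOW-9 QUOTATION MARK
\xE2\x80\x9B SINGLE HIGH-REVERSED-9 QUOTATION MARK
\xE2\x80\x9C LEFT DOUBLE QUOTATION MARK
\xE2\x80\x9D RIGHT DOUBLE QUOTATION MARK
\xE2\x80\x9E DOUBLE LOW-9 QUOTATION MARK
\xE2\x80\x9F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
EOF
}
```

On a Mac with Homebrew's grep, GREP=ggrep report_curly_quotes /path/to/xml should do the same thing. Headings with nothing under them mean that character was not found.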
