Remove Repeating and Control Characters in sed
Let's say I have a word at the beginning of a line, HHEELLLLOO for example. How can I replace repeated characters with single characters? The output should be HELLO.
Also, does anyone know how to remove or specify control characters in sed, ^H for example?
Question 1
Yes, regex can handle that. In sed:
$ echo HHEELLLLOO | sed 's/\(.\)\1/\1/g'
HELLO
This will do it.
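One thing to keep in mind: the command above halves each adjacent pair, so it only yields HELLO because every letter in the input is exactly doubled. A rough sketch of the difference between collapsing pairs and collapsing whole runs (the function names are made up for illustration):

```shell
# collapse_pairs: the command from the answer above; each adjacent pair
# of identical characters becomes one character.
collapse_pairs() { sed 's/\(.\)\1/\1/g'; }

# collapse_runs: hypothetical variant; any run of the same character,
# whatever its length, becomes one character (\1* is plain POSIX BRE).
collapse_runs() { sed 's/\(.\)\1*/\1/g'; }

echo HHEELLLLOO   | collapse_pairs   # HELLO
echo HHHHEELLLLOO | collapse_pairs   # HHELLO (HHHH is only halved)
echo HHEELLLLOO   | collapse_runs    # HELO
```

Which one you want depends on whether the input is reliably doubled or just stretched by arbitrary amounts.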
Question 2
It may vary depending on your system. Here (BSD) you can type ctrl-v ctrl-h to insert a literal backspace character to be interpreted by sed. Give it a try.
$ cat -v file
H^HE^HL^HL^HO^H
$ sed 's/^H//g' file > new_file
$ cat new_file
HELLO
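If typing the literal character is awkward (in a script, say), one possible alternative is to let printf produce the backspace and splice it into the sed expression. This is only a sketch; strip_backspaces is a hypothetical helper name, not something sed provides.

```shell
# Hypothetical helper: delete every backspace (^H, ASCII 8) from stdin.
strip_backspaces() {
  bs=$(printf '\b')     # a literal backspace character
  sed "s/${bs}//g"      # sed treats it as an ordinary character
}

printf 'H\bE\bL\bL\bO\b\n' | strip_backspaces   # prints HELLO
```

printf '\b' is specified by POSIX, so this should not depend on whether sed is the BSD or GNU version.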
See "limiting repetition" from this site: http://www.regular-expressions.info/repeat.html
An actual script, inspired by chown and that site (note that \+ is a GNU sed extension):
sed 's/\([a-zA-Z]\)\1\+/\1/g'
However, you won't be able to get HELLO; you would only get HELO. A regex is not sophisticated enough to determine that there should be two L's. For that, you would need to match the word against a dictionary. Though you could use a regex for that, e.g. H+E+L+O+ ...
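The dictionary idea could be sketched roughly like this: turn each candidate word into a pattern where every letter may repeat (HELLO becomes ^H+E+L+L+O+$) and test the stretched input against it. The helper name and the three-word list are invented for illustration, and the sketch assumes the words contain only letters.

```shell
# Hypothetical helper: read candidate words from stdin and print each one
# that the stretched input ($1) could have come from.
unstretch() {
  while IFS= read -r word; do
    # HELLO -> ^H+E+L+L+O+$ : every letter may repeat one or more times.
    pattern="^$(printf '%s\n' "$word" | sed 's/./&+/g')\$"
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
      printf '%s\n' "$word"
    fi
  done
}

printf 'HELP\nHELLO\nHOLE\n' | unstretch HHEELLLLOO   # prints HELLO
```

If several words in the list fit the same input, all of them are printed, so the result can still be ambiguous.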
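For the control characters, a numeric escape will match an arbitrary ASCII character; in GNU sed these are written \xHH (hex) or \oNNN (octal). You'll have to look up what ^H represents (it is backspace, ASCII 8, i.e. \x08 or \o010).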
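One way to do that lookup from the shell is to dump the byte with od. The second half of this sketch deletes the byte with tr and its octal escape rather than with sed, since tr's \NNN form is in POSIX; strip_octal_010 is a hypothetical name.

```shell
# Show the octal value of the backspace character (^H).
printf '\b' | od -An -to1        # 010, padded with whitespace

# Hypothetical helper: delete that byte with tr instead of sed.
strip_octal_010() { tr -d '\010'; }

printf 'H\bE\bL\bL\bO\b\n' | strip_octal_010   # prints HELLO
```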
Try this for removing duplicates: sed 's/\([a-zA-Z]\)\1\+/\1/g'
It will produce 'HELO', not 'HELLO', though. See the other answer for the reasons why.
$ echo BookKeeper | perl -pe 's/(.)\1+/$1/gi'
Bokeper
$ perl -le 'print "\cSome \cEvil \cControl \cMess\c?"' | perl -ple 's/\pC//g'
ome vil ontrol ess
Technically, control characters are \p{Cc}.