Sed to remove underscores and promote character
I am trying to migrate some code from an old naming scheme to a new one. The old naming scheme is:
int some_var_name;
The new one is:
int someVarName_;
So what I would like is some form of sed / regexy goodness to ease the process. Fundamentally, what needs to happen is:
Find each lowercase word containing an _, remove the underscores, and promote the character to the right of each _ to uppercase. Then append an _ to the end of the match. Is it possible to do this with sed and/or awk and regex? If not, why not?
Any example scripts would be appreciated.
Thanks very much for any assistance.
EDIT:
For a bit of clarity, the renaming is for a number of files that were written with the wrong naming convention and need to be brought into line with the rest of the codebase. It is not expected that this do a perfect replace that leaves everything in a compilable state. Rather, the script will be run and then looked over by hand for any anomalies. The replace script would be purely to ease the burden of having to correct everything by hand, which I'm sure you would agree is considerably tedious.
sed -re 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'
Explanation:
This is a sed command with two expressions (each in quotes after a -e). s,,,g is a global substitution. You usually see it with slashes instead of commas, but I think this is easier to read when you're using backslashes in the patterns (and no commas). The trailing g (for "global") means to apply this substitution to all matches on each line, rather than just the first.
The first expression will append an underscore to every token made up of a lowercase word ([a-z]+) followed by a nonzero number of lowercase words separated by underscores ((_[a-z]+)+). We replace this with &_, where & means "everything that matched", and _ is just a literal underscore. So in total, this expression is saying to add an underscore to the end of every underscore_separated_lowercase_token.
The second expression matches the pattern _([a-z]), where everything between ( and ) is a capturing group. This means we can refer back to it later as \1 (because it's the first capturing group; if there were more, they would be \2, \3, and so on). So we're saying to match a lowercase letter following an underscore, and remember the letter. We replace it with \u\1, which is the letter we just remembered, but made uppercase by that \u.
This code doesn't do anything clever to avoid munging #include lines or the like; it will replace every instance of a lowercase letter following an underscore with its uppercase equivalent.
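If the #include problem bothers you, one possible workaround is to put a line address in front of the two substitutions so they are skipped on include lines. This is only a sketch: it assumes GNU sed (-r and \u are GNU extensions), and the function name rename_idents is made up for the example.

```shell
#!/bin/sh
# Sketch only; assumes GNU sed. rename_idents is a hypothetical helper name.
# Reads source text on stdin and applies the two substitutions from above
# to every line that does NOT look like an #include directive.
rename_idents() {
    sed -r -e '/^[[:space:]]*#[[:space:]]*include/!{
        s,[a-z]+(_[a-z]+)+,&_,g
        s,_([a-z]),\u\1,g
    }'
}

printf '%s\n' '#include <some_header.h>' 'int some_var_name;' | rename_idents
# prints:
#   #include <some_header.h>
#   int someVarName_;
```

Everything else (library names, struct members from system headers, text inside strings) is still fair game for the substitution, so the output still needs looking over by hand.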
A few years ago I successfully converted a legacy 300,000 LOC 23-year-old code base to camelCase. It took only two days. But there were a few lingering effects that took a couple of months to sort out. And it is a very good way to annoy your fellow coders.
I believe that a simple, dumb, sed-like approach has advantages. IDE-based tools, and the like, cannot, as far as I know:
- change code not compiled via #ifdef's
- change code in comments
And the legacy code had to be maintained on several different compiler/OS platforms (= lots of #ifdefs).
The main disadvantage of a dumb, sed-like approach is that strings (such as keywords) can inadvertently be changed. And I've only done this for C; C++ might be another kettle of fish.
There are about five stages:
1) Generate a list of tokens that you wish to change, and manually edit.
2) For each token in that list, determine the new token.
3) Apply these changes to your code base.
4) Compile.
5) Double-check via a manual diff, and do a final clean-up.
For step 1, to generate a list of tokens that you wish to change, the command:
cat *.[ch] | sed 's/\([_A-Za-z0-9][_A-Za-z0-9]*\)/\nzzz \1\n/g' | grep -w zzz | sed 's/^zzz //' | grep '_[a-z]' | sort -u > list1
will produce in list1:
st_atime
time_t
...
In this sample, you really don't want to change these two tokens, so manually edit the list to delete them. But you'll probably miss some, so for the sake of this example, suppose you keep these.
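If your grep supports -o (GNU and BSD grep do), the same list can probably be built with a shorter pipeline. This is an alternative to the command above, not what I actually ran back then; the helper name list_tokens and the demo file are invented for the example. Unlike the sed version, it only picks up tokens that start with a letter or underscore.

```shell
#!/bin/sh
# Sketch only; relies on grep -o, which is not in POSIX but is widely available.
# list_tokens is a hypothetical helper: it prints every unique identifier
# in the given files that contains an underscore followed by a lowercase letter.
list_tokens() {
    grep -ohE '[_A-Za-z][_A-Za-z0-9]*' "$@" | grep '_[a-z]' | sort -u
}

# Small demo on a throwaway file.
tmp=$(mktemp -d)
printf '%s\n' 'time_t t = buf.st_atime;' 'int some_var_name = 0;' > "$tmp/demo.c"
list_tokens "$tmp/demo.c"
# prints:
#   some_var_name
#   st_atime
#   time_t
rm -rf "$tmp"
```

On a real tree you would call it as list_tokens *.[ch] > list1 and then edit list1 by hand as described.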
The next step, 2, is to generate a script to do the changes. For example, the command:
cat list1 | sed 's/\(.*\)/glob_sub "\\<\1\\>" xxxx_\1/;s/\(xxxx_.*\)_a/\1A/g;s/\(xxxx_.*\)_b/\1B/g;s/\(xxxx_.*\)_c/\1C/g;s/\(xxxx_.*\)_t/\1T/g' | sed 's/zzz //' > list2
will change _a, _b, _c, and _t to A, B, C, and T, to produce:
glob_sub "\<st_atime\>" xxxx_stAtime
glob_sub "\<time_t\>" xxxx_timeT
You just have to extend it to cover d, e, f, ..., x, y, z.
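With GNU sed you can probably avoid writing out 26 substitutions by using \u in a small loop. This is a sketch under that assumption (it is not the command I used), and make_script is just a name picked for the example.

```shell
#!/bin/sh
# Sketch only; assumes GNU sed (-r, \u). make_script is a hypothetical name.
# Reads one token per line and emits one glob_sub command per token,
# with the new name prefixed by xxxx_ and every _<lowercase> camel-cased.
make_script() {
    sed -r \
        -e 's/.*/glob_sub "\\<&\\>" xxxx_&/' \
        -e ':a' \
        -e 's/(xxxx_[^ ]*)_([a-z])/\1\u\2/' \
        -e 'ta'
}

printf '%s\n' st_atime time_t | make_script
# prints:
#   glob_sub "\<st_atime\>" xxxx_stAtime
#   glob_sub "\<time_t\>" xxxx_timeT
```

The loop (:a ... ta) repeats the substitution until no _<lowercase> is left in the new name; the quoted search pattern is untouched because the regex is anchored on the xxxx_ prefix. Typical use would be make_script < list1 > list2.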
I'm presuming you have already written something like 'glob_sub' for your development environment. (If not, give up now.) My version (csh, Cygwin) looks like:
#!/bin/csh
foreach file (`grep -l "$1" */*.[ch] *.[ch]`)
/bin/mv -f $file $file.bak
/bin/sed "s/$1/$2/g" $file.bak > $file
end
(Some of my sed's don't support the --in-place option, so I have to use a mv.)
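If you don't have csh, something along these lines should work in a plain sh. It is a sketch, not a drop-in port: it takes the files as arguments instead of globbing */*.[ch] *.[ch] itself, and it assumes a grep and sed that understand \< and \> (GNU ones do).

```shell
#!/bin/sh
# Sketch only; sh variant of glob_sub that takes the file list as arguments.
#   $1 = sed search pattern, $2 = replacement, remaining arguments = files.
# Files that do not contain the pattern are left alone; changed files keep
# a .bak copy, as in the csh version.
glob_sub() {
    pat=$1
    rep=$2
    shift 2
    for file in "$@"; do
        grep -q "$pat" "$file" || continue
        mv -f "$file" "$file.bak"
        sed "s/$pat/$rep/g" "$file.bak" > "$file"
    done
}

# Example call (file names are placeholders):
#   glob_sub '\<time_t\>' xxxx_timeT *.c *.h
```

As with the csh version, the pattern and replacement are pasted straight into the sed expression, so a / in either one will break it.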
The third step is to apply this script in list2 to your code base. For example, in csh use source list2.
The fourth step is to compile. The compiler will (hopefully!) object to xxxx_timeT. Indeed, it should likely object to just timeT, but the extra xxxx_ adds insurance. So for time_t you've made a mistake. Undo it with e.g.
glob_sub "\<xxxx_timeT\>" time_t
The fifth and final step is to do a manual inspection of your changes using your favorite diff utility, and then clean up by removing all the unwanted xxxx_ prefixes. Grepping for "xxxx_ will also help check for tokens in strings. (Indeed, adding a _xxx suffix is probably a good idea.)
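For the clean-up, a couple of small filters along these lines might help. Both are sketches with made-up names, and they assume GNU sed for \<. The string check is a crude heuristic: it flags any line where the prefix appears somewhere after a double quote.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# Print lines (with line numbers) where xxxx_ shows up after a double quote,
# i.e. places where a string literal may have been renamed by accident.
prefixed_in_strings() {
    grep -n '"[^"]*xxxx_' "$@"
}

# Remove the xxxx_ insurance prefix from the start of every word.
strip_prefix() {
    sed 's/\<xxxx_//g' "$@"
}

printf '%s\n' 'puts("xxxx_fooBar");' 'int xxxx_someVarName;' | prefixed_in_strings
# prints:
#   1:puts("xxxx_fooBar");

printf '%s\n' 'int xxxx_someVarName;' | strip_prefix
# prints:
#   int someVarName;
```

Run the string check first and fix those by hand; only strip the prefix once the diff looks sane.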
Consider what happens when you use sed to search and replace all text like this. Without a C++ tokenizer to recognize identifiers (and specifically your identifiers and not those in the standard library, e.g.), you are screwed. push_back gets renamed to pushBack_. map::insert to map::insert_. map to map_. basic_string to basicString_. printf to printf_ (if you use C libraries), etc. You're going to be in a world of hurt if you do it indiscriminately.
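To make that concrete, here is roughly what a blind underscore-to-camelCase substitution does to standard library names. This is a sketch assuming GNU sed, using the substitution pattern from the other answer; naive_rename is just a label for the demo.

```shell
#!/bin/sh
# Sketch only; assumes GNU sed (-r, \u). naive_rename is a hypothetical name.
# Applies the rename to all text on stdin with no idea what is an identifier
# of yours and what belongs to a library.
naive_rename() {
    sed -r -e 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'
}

printf '%s\n' 'v.push_back(some_var_name);' 'std::basic_string<char> s;' | naive_rename
# prints:
#   v.pushBack_(someVarName_);
#   std::basicString_<char> s;
```

Your own variable got renamed as intended, but so did push_back and basic_string, and nothing in the text tells sed which is which.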
I don't know of any existing tool to automagically rename some_var_name to someVarName_ without the problems described above. People voted this post down probably because they didn't understand what I meant here. I'm not saying sed can't do it, I'm just saying it won't give you what you want to just use it as is. The parser needs contextual information to do this right, else it'll replace a lot more things it shouldn't than it should.
It would be possible to write a parser that would do this (ex: using sed) if it could recognize which tokens were identifiers (specifically your identifiers), but I doubt there's a tool specifically for what you want to do that does it off the bat without some manual elbow grease (though I could be wrong). Doing a simple search and replace on all text this way would be inherently problematic.
However, Visual Assist X (which can optionally replace instances in documentation) or any other refactoring tool capable of smartly renaming identifiers for every instance in which they occur at least eases the burden of refactoring code this way quite considerably. If you have a symbol named some_var_name and it's referenced in a thousand different places in your system, with Visual Assist X you can just use one rename function to rename all references smartly (this is not a mere text search and replace). Check out the refactoring features of Visual Assist X.
It might take 15 minutes to half an hour to refactor a hundred variables this way with Visual Assist X (faster if you use the hotkeys), but it certainly beats using a text search and replace with sed as described in the other answer and having all kinds of code replaced that shouldn't be replaced.
[subjective]BTW: underscores still don't belong in camel case if you ask me. A lowerCamelCase naming convention should use lowerCamelCase. There are plenty of interesting papers on this, but at least your convention is consistent. If it's consistent, then that's a huge plus as opposed to something like fooBar_Baz which some goofy coders write who think it somehow makes things easier to make special exceptions to the rule.[/subjective]