Frequency of symbols in programming languages
I'm looking for some kind of reference which shows the frequency of symbols of popular programming languages. I'm trying to design an optimal keyboard layout for programming.
If there is no such reference, I wouldn't mind creating a simple utility that figures this out. However, I would need suggestions as to which files to analyze for each language.
One problem I can foresee: if I sample Objective-C code from a simple program with no objects, the [ and ] keys will be far less frequent than in an average Objective-C file. So one guideline is that the sample code should be representative of an average file and use the most commonly used features of the language.
Originally I was thinking that I should get the same code written in different languages, but I'm not sure if that's a good idea since some languages have different uses than others.
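A utility of the kind described above could be sketched roughly as follows. This is only a minimal sketch: the function names `symbol_frequencies` and `corpus_frequencies` are made up for illustration, and it assumes that "frequency" means the percentage of all characters in the sample that are a given symbol (ASCII punctuation plus space).

```python
import string
from collections import Counter
from pathlib import Path

# Characters to tally: ASCII punctuation plus the space character.
SYMBOLS = set(string.punctuation) | {" "}


def symbol_frequencies(text):
    """Return {symbol: percentage of all characters in text}."""
    total = len(text)
    if total == 0:
        return {}
    counts = Counter(ch for ch in text if ch in SYMBOLS)
    return {sym: 100.0 * n / total for sym, n in counts.items()}


def corpus_frequencies(root, extensions):
    """Tally symbols over every file under root whose suffix is in extensions.

    Hypothetical helper: extensions is a set such as {".c", ".h"}.
    """
    chunks = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            # Ignore undecodable bytes rather than failing on odd files.
            chunks.append(path.read_text(encoding="utf-8", errors="ignore"))
    return symbol_frequencies("".join(chunks))


def print_table(freqs):
    """Print symbols from most to least frequent."""
    for sym, pct in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{sym!r:>5} {pct:6.3f}")
```

Calling `print_table(corpus_frequencies("path/to/project", {".m", ".h"}))` on several sizeable projects per language, rather than on one small program, is one way to address the concern about unrepresentative samples.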
For large code samples to use for statistical analysis, you might try browsing popular open-source projects or searching on Koders by language.
I made some simple changes to a QWERTY layout a few years ago, and I've been using it ever since as my general-purpose layout:
- Swap digits for their corresponding shift-symbols.
- Swap _ and -: names with underscores are common, and now - and + both require Shift.
- Swap [] and {}: blocks are more common than subscripts.
Plus two optional changes, to taste:
- Swap ` and ~: destructors are common.
- Swap ' and ": strings are more common than characters.
The last is the only one that typically would interfere with typing ordinary English text. The layout works beautifully for C++, Perl, and whatever else I've used in the past two or three years. The noticeable speed increase comes from the drastic reduction in the need to hit the Shift key. I find that using Shift for the numbers isn't a big deal since the number pad is usually faster anyway.
The book The New C Standard: An economic and cultural commentary contains a lot of measurements of C source usage. The usage figures and tables are available as a stand-alone PDF.
@Derek Jones cited The New C Standard: An economic and cultural commentary, which has the information. For quick reference, here are the frequencies it contains:
space 15.083
! 0.102
" 0.376
# 0.175
$ 0.005
% 0.105
& 0.237
' 0.101
( 1.372
) 1.373
* 1.769
+ 0.182
, 1.565
- 1.176
. 1.512
/ 0.718
: 0.192
; 1.276
< 0.118
= 1.039
> 0.587
? 0.022
@ 0.009
[ 0.163
\ 0.97
] 0.163
^ 0.003
_ 2.550
{ 0.303
| 0.098
} 0.210
~ 0.002
Here is the same sorted by frequency:
space 15.083
_ 2.550
* 1.769
, 1.565
. 1.512
) 1.373
( 1.372
; 1.276
- 1.176
= 1.039
/ 0.718
> 0.587
" 0.376
{ 0.303
& 0.237
} 0.210
: 0.192
+ 0.182
# 0.175
] 0.163
[ 0.163
< 0.118
% 0.105
! 0.102
' 0.101
| 0.098
? 0.022
@ 0.009
$ 0.005
^ 0.003
~ 0.002
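These figures can be used to estimate how much Shift usage the swaps proposed in the other answer would save for C code. The sketch below assumes a standard US QWERTY layout (where _, {, } and " need Shift) and simply subtracts the frequency of the symbols that would need Shift after a swap from the frequency of those that need it before. The digit swap and the ` / ~ swap are left out because the table above has no figures for digits or the backtick.

```python
# Frequencies copied from the table above, for the symbols involved
# in the proposed swaps only.
FREQ = {
    "_": 2.550, "-": 1.176,
    "[": 0.163, "]": 0.163,
    "{": 0.303, "}": 0.210,
    "'": 0.101, '"': 0.376,
}

# Each swap: (symbols needing Shift on US QWERTY before the swap,
#             symbols needing Shift after the swap).
SWAPS = {
    "underscore/hyphen": (["_"], ["-"]),
    "braces/brackets": (["{", "}"], ["[", "]"]),
    "quotes": (['"'], ["'"]),
}


def shift_saving(before, after):
    """Reduction in shifted-symbol frequency, in the table's units."""
    return sum(FREQ[s] for s in before) - sum(FREQ[s] for s in after)


savings = {name: shift_saving(b, a) for name, (b, a) in SWAPS.items()}
total = sum(savings.values())

for name, value in savings.items():
    print(f"{name:<18} {value:.3f}")
print(f"{'total':<18} {total:.3f}")
```

With these numbers the underscore/hyphen swap accounts for most of the saving (about 1.374 out of about 1.836 in the table's units), the quote swap saves about 0.275, and the braces/brackets swap about 0.187.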
There is a version of the Dvorak keyboard layout available, optimized for programmers.
http://www.kaufmann.no/roland/dvorak/
If you happen to use Ubuntu, it is already on your system.
There's a vast collection of open-source software that you could measure to gain some good data on character frequency. SourceForge and GitHub would be the places to look.
Developers don't just write code, though; they also write design documents, emails, and answers to Stack Overflow questions. Maybe installing a key logger on a few consenting developers' computers would be the best way.
What you're looking for is a good corpus of programming languages. While nothing immediately turned up in a cursory Google search, the following links may prove useful if you do create your own tool.
A novel framework to detect source code plagiarism
Calgary Corpus
Generating an NLP Corpus from Java Source Code
A Computer Science Text Corpus/Search Engine X-Tec and Its Applications
Mining search topics from a code search engine usage log