Identifier splitting to approximately match documentation

Different software projects follow different coding conventions; even within a single project, several languages may be in use, each with its own conventions. What is a good way to search the documentation (which lives outside the source files) using identifier tokens taken from the source code?

For example, if the source contains self._def_passwd or this.defPasswrd, a query against the documentation tree should strive to match "default password".
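To make the first step concrete, here is a minimal sketch of splitting an identifier into word tokens before any fuzzy matching happens. The helper name `split_identifier` and the boundary rules in the regular expression are assumptions made for illustration; real projects may need extra rules for their own conventions.

```python
import re

# Assumed boundary rules: an acronym run (HTTP), a capitalized or lowercase
# word (Passwrd, def), or a run of digits. Underscores and dots are skipped.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier):
    """Split a source identifier into lowercase word tokens (hypothetical helper)."""
    # Keep only the last component of a dotted access such as this.defPasswrd.
    name = identifier.rsplit(".", 1)[-1]
    return [match.group(0).lower() for match in _WORD.finditer(name)]


if __name__ == "__main__":
    for ident in ("self._def_passwd", "this.defPasswrd", "getContent"):
        print(ident, split_identifier(ident))
```

Each resulting token (for instance "def" and "passwrd") can then be matched against documentation words separately, which sidesteps the problem of comparing one long identifier against a phrase containing spaces.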

So far I've been sorting candidates by Levenshtein distance. That works nicely for small edit distances, but when I raise the threshold there are too many false positives, which is especially problematic with whitespace in the documentation.

8  0.666667  announcement  getContent  AnnouncementBean.java  (Token.Name.Function)
8  0.666667  announcement  getPercent  DataObservation.java   (Token.Name.Function)
8  0.666667  announcement  GroupBean   GroupBean.java         (Token.Name.Class)

where the first value is the Levenshtein distance and the second is that distance divided by the length of the matched word. I'm thinking of:

  1. looking into the Jaccard and Tanimoto coefficients
  2. IntelliSense/suggest-style code
  3. the sequence-matching algorithms that bioinformatics people use (there were posts about them somewhere on SO)
  4. coming up with a chain of regular-expression rules based on http://en.wikipedia.org/wiki/Naming_convention_%28programming%29

the last one being literally the last resort. Which other algorithms do you think could give better results for this kind of problem?
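For reference, the Jaccard idea from the list could look roughly like the sketch below, which compares words by their sets of character bigrams (on plain sets, the Tanimoto coefficient reduces to the same formula). The function names, the choice of bigrams, and the `^`/`$` padding are assumptions for illustration, not an established recipe.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded at both ends."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def rank(query_token, candidates, n=2):
    """Sort candidate documentation words from most to least similar."""
    return sorted(candidates, key=lambda c: jaccard(query_token, c, n), reverse=True)


if __name__ == "__main__":
    print(rank("passwd", ["percent", "password", "content"]))
```

Unlike a raw edit-distance threshold, the score is bounded between 0 and 1 and does not depend on character positions, so an abbreviation that keeps most of a word's bigrams still scores well.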


Try a weighted edit distance: it lets you encode knowledge of common abbreviations, and of likely typos via the distance between keys on the keyboard. For example, you can give zero weight to vowels such as [ao], so that password becomes equal to psswrd. Another option is to build a word-level edit distance and use synonyms there. I have also built an edit distance that works on words and characters simultaneously.
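A minimal sketch of the vowel-weighting idea, assuming that inserting or deleting any vowel is free and every other edit costs 1. The cost functions are hypothetical placeholders; `sub_cost` is where a keyboard-distance or abbreviation table could be plugged in.

```python
FREE_CHARS = frozenset("aeiou")  # assumption: vowels are free to insert or delete


def indel_cost(ch):
    """Cost of inserting or deleting a single character."""
    return 0.0 if ch in FREE_CHARS else 1.0


def sub_cost(a, b):
    """Cost of replacing a with b; a keyboard-distance table could go here."""
    return 0.0 if a == b else 1.0


def weighted_edit_distance(source, target):
    """Levenshtein-style dynamic programme with per-character edit costs."""
    source, target = source.lower(), target.lower()
    # Row for the empty source prefix: build target purely by insertions.
    prev = [0.0]
    for ch in target:
        prev.append(prev[-1] + indel_cost(ch))
    for s in source:
        curr = [prev[0] + indel_cost(s)]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + indel_cost(s),      # delete s from source
                curr[j - 1] + indel_cost(t),  # insert t into source
                prev[j - 1] + sub_cost(s, t), # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_edit_distance("password", "psswrd"))
    print(weighted_edit_distance("password", "pswrd"))
```

With these weights, only the consonant skeleton matters, so dropping a consonant (as in pswrd) still costs one edit while dropping vowels costs nothing.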
