What is a suitable lexer generator that I can use to strip identifiers from many language source files?

2022-12-17 16:20

I'm working on a group project at my university that will be used for plagiarism detection in Computer Science.

My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students' source code and looking them up in a database for relevant matches (along with a lot of optimization in how we determine which hashes to select as a document's fingerprints).
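For readers unfamiliar with the technique, the k-gram hashing and fingerprint selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (WinnowingSketch, kGramHashes, winnow) are made up for this example, String.hashCode stands in for the rolling hash a real implementation would likely use, and ties are broken by taking the rightmost minimum in each window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of k-gram hashing plus window-based fingerprint selection.
class WinnowingSketch {

    // Hash every k-character substring (k-gram) of the text.
    // String.hashCode is a stand-in; a rolling hash would avoid rehashing each k-gram.
    static int[] kGramHashes(String text, int k) {
        int n = Math.max(0, text.length() - k + 1);
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        return hashes;
    }

    // Slide a window of w consecutive hashes and keep the minimum of each window.
    // Each result is {hash, position}; a position is recorded only once even if
    // it is the minimum of several overlapping windows.
    static List<int[]> winnow(int[] hashes, int w) {
        List<int[]> fingerprints = new ArrayList<>();
        int lastPos = -1;
        for (int start = 0; start + w <= hashes.length; start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes[j] <= hashes[minPos]) {
                    minPos = j; // "<=" selects the rightmost minimum on ties
                }
            }
            if (minPos != lastPos) {
                fingerprints.add(new int[] {hashes[minPos], minPos});
                lastPos = minPos;
            }
        }
        return fingerprints;
    }

    public static void main(String[] args) {
        // Input that has already had its identifiers replaced by a constant.
        int[] hashes = kGramHashes("V=V+V;V=V*V;", 5);
        for (int[] fp : winnow(hashes, 4)) {
            System.out.println("hash=" + fp[0] + " pos=" + fp[1]);
        }
    }
}
```

The values of k and w here are arbitrary; in practice they would be tuned to the minimum match length the detector should notice.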

The first aspect of our project is the "Front-End" portion of it, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip details from the document that we no longer want for the purpose of plagiarism detection. Basically, we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (a lexer generator or something similar) that we can use to rename all variables in source files written in different languages to constants?

Our project is being written in Java.

Ideally, I'd simply like to be able to define a grammar for each language, and then our front end will be able to rename all identifiers in that language's source file to some constant. We would then do this for each file format we wanted to support (Java, C++, Python, etc.).

For a lexer/parser generator, you should look at ANTLR. TXL, which is a textual transformation interpreter, is also worth a look. Ready-made grammars should be available for both.

Apart from ANTLR, which was already suggested, you can also take a look at JFlex.

Be aware that there are some languages where it's not really possible to do what you're trying to do, specifically those where the grammar alone cannot tell you what is or isn't a variable. Tcl is one example, but a number of dynamic languages have the same issue (Lisp?).
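To make that caveat concrete for a language where identifiers can at least be recognized lexically: a purely lexical pass can tell identifiers from keywords, literals and comments, but not variables from type or method names, so it ends up collapsing all of them. Below is a minimal hand-written sketch in Java using java.util.regex rather than a generated lexer. The class name, the method name, the replacement constant "V" and the deliberately incomplete keyword set are all assumptions made for illustration; a lexer built with ANTLR or JFlex would replace the regular expression with a proper token definition per language.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: replace every non-keyword identifier with the constant "V".
class IdentifierNormalizerSketch {

    // Small, incomplete keyword set for a Java-like language (assumption).
    private static final Set<String> KEYWORDS = Set.of(
            "int", "double", "boolean", "void", "class", "public", "private",
            "static", "final", "if", "else", "for", "while", "return", "new",
            "true", "false", "null");

    // Alternatives are tried left to right at each position: string literals,
    // char literals, comments and numeric literals are matched first so that
    // words inside them are never mistaken for identifiers.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<skip>\"(?:\\\\.|[^\"\\\\])*\""   // string literal
            + "|'(?:\\\\.|[^'\\\\])*'"            // char literal
            + "|//[^\\n]*"                        // line comment
            + "|/\\*[\\s\\S]*?\\*/"               // block comment
            + "|\\d[\\w.]*)"                      // numeric literal such as 0x1F or 10L
            + "|(?<ident>[A-Za-z_$][A-Za-z0-9_$]*)");

    static String normalize(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String ident = m.group("ident");
            // Keywords and skipped tokens are copied through unchanged.
            String replacement =
                    (ident != null && !KEYWORDS.contains(ident)) ? "V" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("int count = total + 1; // count items"));
        // prints: int V = V + 1; // count items
    }
}
```

Note that a type name such as String is also rewritten to V by this sketch, since nothing at the lexical level distinguishes it from a variable name. Whether that matters depends on how the fingerprints are used.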

The acacia-lex lexer has a replace method.

In the lexer's token definitions, describe what identifiers look like, for example "ident1" -> "[a..d]", "ident2" -> "[e..h]".

In the input map passed to the replace method, specify which identifier type to replace with which constant (object), for example "ident1" -> "ident1", "ident2" -> "ident2".
