How to implement Lexical Analysis in Javascript

2023-02-05 03:41 问答作者：

Hey folks, thanks for reading

I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.

I began slowly with the basics : + - / * and parenthesis handling.

I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.

This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.

I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.

Analysis

Let's define what a calculator query is:

// NON TERMINAL EXPRESSIONS //
query     -> statement
query     -> ε // means end of query

statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number

number    -> integer
number    -> float

// TERMINAL EXPRESSIONS //
operator  -> [+*/%^-]

prefix    -> -

integer   -> [0-9]+

float     -> [0-9]+[.,][0-9]+

Javascript

Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:

(I added spaces to make it more readable)

var calcPat = 
/^ (\s*
    ( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;

If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part

I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:

I created a method called isStatement(string) that's supposed to call i开发者_Go百科tself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.

Process is the following:

-If the first two tokens are a number followed by an operator:

-Then,

-- If the remaining is just one token and it is a number:

--- Then this is a statement.

--- Else, check if the remaining tokens form a statement (recursive call)

-Else, If the first token is a parenthesis

-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)

-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.

What's the problem ?

My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?

Thank you so much for having taken the time to read everything. Gael

(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.

The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.

You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.

Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.

edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.

继续阅读：javascript lexical-analysis pattern-matching regex

How to implement Lexical Analysis in Javascript

Analysis

Javascript

What's the problem ?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Analysis

Javascript

What's the problem ?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？