Why is there no decent sql parser? [closed]
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this questionI'm currently doing some oracle sql analysis and often encounter valid sql statement that can not be properly parsed by various sql parsers. Either they fail parsing it, or their abstract syntax tree generated is somehow messed up. It seems that the only parser that can really handle oracle sql is their own one that is not publicly available and cannot be obtained as a stand-alone parser.
I'm aware that there are different sql grammars and complying to all may not be possible. But even parsers that claim to be Oracle SQL parsers do not succeed in all cases.
What do you think are the main reasons that make it hard to implement sql parsers in general or oracle sql parsers in particular?
Best, Will
Good parsers are hard to write. That starts with the code generator for the parser code (which usually eats some (E)BNF-like syntax which has its own limitations).
Error handling in parsers is a research topic of its own. This is not only about detecting errors but also giving useful information what could be wrong and how to solve it. Some parsers don't even offer location information ("error happened at line/column").
Next, you have SQL which means "Structured Query Language", not "Standard Query Language". There is a SQL standard, even several, but you won't find a single database which implements any of them.
Oracle grudgingly offers VARCHAR but you better use VARCHAR2. Some databases offer recursive/tree-like queries. All of them use their own, special syntax for this. Joining is defined pretty clearly in the standard (join
, left join
, ...) but why bother if you can use +
?
On top of that, for every database version, new features are added to the grammar.
So while you could write a parser that can read the standard cases, writing a parser that can support all the features which all the databases around the globe offer, is nearly impossible. And I'm not even talking about the bugs which you can encounter in these parsers.
One solution would be if all database vendors would publish the grammar files. But these are crown jewels (IP). So you should be happy that you can use them without having to pay a license fee per parsed character * number of CPUs.
When a manufacturer claims to support a language X, he means "something like the X standard" but not the standard. Manufacturers for historical reasons implement language X before the standard was as standard, so they start on the wrong foot; trying to make their version match the standard usually breaks their large base of user code; and they always want to add their own goodies to lock in their users.
This is true for SQL, C, C++... the only languages I know of where people try really hard to match the standard is Ada, and even it comes in multiple dialects. (Look what browsers accept!).
So you can't expect a off-the-shelf generic SQL parser to parse PLSQL. You really have to have a PLSQL parser. And these are hard to build as the documentation is poor, Oracle has no reason to fix it, and certainly has no motivation to help the grammar builder.
My company (Semantic Designs) has a PLSQL parser that covers 10g pretty well (Oracle's documentation is poor...we keep finding variations from the reference docuements) and does most of 11g. We've run it across millions of line of PLSQL code.
They do it wrong? :)... It can obviously be done since the parsers in the database engines works fine ;)... It could probably be due to several factors. The dialect might not be well documented or there might have been recent changes to the dialect not implemented in the parser in question.
Metadata.
SELECT identifier_1.identifier_2 FROM table
could mean identifier_1
is a schema or a package, and identifier_2
might be a function or synonym to a function.
There's a whole bunch of reasons why a statement might be correct but can't be understood without the metadata about the database objects. Given those limitations, there's a limit to how far a parser can go.
If a parser can work out 80% of your code, and 15% can't be worked out without the metadata, there's diminishing returns in stretching the parser to cope with the missing "5%".
If you take a look at Oracle SQL reference: http://docs.oracle.com/cd/B28359_01/server.111/b28286/toc.htm
you will know how difficult it is to create a SQL Parser that fully support all Oracle SQL syntax, it's almost impossible.
Even the documentation listed above haven't documented all syntax precisely that can be used to create an Oracle SQL parser.
For every database version, new syntax will be added constantly.
I think a SQL Parser like general sql parser that covers the most important SQL syntax of various major databases maybe a choice.
精彩评论