Parsing Transact SQL with RegEx

2023-03-09 07:00

I'm quite inexperienced with RegEx: I've only written the occasional straightforward RegEx for a programming task, worked out by trial and error. Now I have a serious RegEx challenge:

I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and prefix the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using RegEx (Python 3 or Delphi-PCRE) that will perform this task.

The rules are as follows:

Table names are ALWAYS upperCase - so I'm only looking for upperCase words;

Column names, SQL expressions and variables are ALWAYS lowerCase;

SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';

Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the previous word, which will be a table name.

Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.

Here is some sample text requiring application of all the above mentioned rules:

update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'

So far I've gotten only this far (not too far...):

(?-i)[[:upper:]]

Any help would be most appreciated. TIA,

This is not doable with a simple regex replacement. You will not be able to distinguish between upper-case words that are table names, upper-case words inside string literals, and upper-case words inside comments:

update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER' 
-- AND NO TABLES HERE AS WELL
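
To make the problem concrete, here is a minimal Python 3 sketch of such a naive replacement. The helper name `naive_prefix` is made up for illustration. It prefixes every upper-case word with a single substitution, so it also rewrites the words inside the string literals and the comment:

```python
import re

def naive_prefix(sql: str) -> str:
    """Prefix every upper-case word with '#', with no awareness of context."""
    return re.sub(r"\b[A-Z][A-Z_]*\b", lambda m: "#" + m.group(), sql)

sql = (
    "update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'\n"
    "-- AND NO TABLES HERE AS WELL\n"
)

result = naive_prefix(sql)
print(result)
# update #TABLE set x='#NOT_A_TABLE' where y='#NOT #TABLES #EITHER'
# -- #AND #NO #TABLES #HERE #AS #WELL
```

Only the first `#TABLE` is wanted; every other prefix corrupts a value or a comment.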

EDIT

You seem to think that determining whether a word is inside a string literal is easy. If so, consider SQL like this:

-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '

update TABLE set x='not '' A '''' table' where y=666
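
Both cases can be handled, but only by recognizing comments and string literals as whole tokens. The following Python 3 sketch (the pattern and variable names are my own, not part of the question) tries a comment first and a string literal second, treating `''` inside a literal as an escaped quote:

```python
import re

# A comment runs to the end of the line; a string literal may contain ''
# as an escaped single quote.
TOKEN = re.compile(r"(?P<comment>--[^\r\n]*)|(?P<string>'(?:''|[^'])*')")

sql = (
    "-- a quote: '\n"
    "update TABLE set x='not '' A '''' table' where y=666\n"
    "-- another quote: '\n"
)

tokens = [(m.lastgroup, m.group()) for m in TOKEN.finditer(sql)]
for kind, text in tokens:
    print(kind, repr(text))
```

The stray quotes in the two comments are swallowed by the comment pattern, so they never open a string literal, and the doubled quotes stay inside one literal.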

EDIT II

Okay, I (obsessively) hammered on the fact that a simple regex replacement is not doable, but I haven't offered a (possible) solution yet. What you could do is create a sort of "hybrid lexer" based on a couple of different regexes: scan through the input and, at each character position, try to match a comment, a string literal, a keyword, or an upper-case word, in that order. If none of these four patterns matches, consume a single character and repeat the process.

A little demo in Python:

#!/usr/bin/env python3
import re

# sample input (named `sql` so the built-in input() is not shadowed)
sql = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""

regex = r"""(?xs)          # x = enable inline comments, s = enable DOT-ALL
  (--[^\r\n]*)             # [1] comments
  |                        # OR
  ('(?:''|[^\r\n'])*')     # [2] string literal ('' is an escaped quote)
  |                        # OR
  (\b(?:AND|UPDATE|SET)\b) # [3] keywords (extend this list as needed)
  |                        # OR
  ([A-Z][A-Z_]*)           # [4] upper-case word
  |                        # OR
  .                        # fall through: matches any single char
"""

output = ''

for m in re.finditer(regex, sql):
    # prepend a `#` if group(4) matched
    if m.group(4):
        output += '#'
    # append the matched text (whichever alternative matched)
    output += m.group()

# print the adjusted SQL
print(output)

which produces:

UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'

This may not be the exact output you want (the table aliases are prefixed as well, for example), but I'm hoping the script is simple enough for you to adjust to your needs.
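
One possible adjustment, as a rough sketch rather than a finished tool: skip upper-case words that are followed by a dot (alias references such as `MB.limit`) and upper-case words that directly follow a table name with only whitespace in between (alias declarations, per the rules in the question). The function name `prefix_tables`, the `prefix` parameter, and the keyword list are my own assumptions, and the keyword list is far from complete:

```python
import re

# Assumption: illustrative keyword list only; extend it for real T-SQL.
KEYWORDS = {"SELECT", "UPDATE", "INSERT", "DELETE", "SET", "FROM", "WHERE",
            "AND", "OR", "NOT", "IN", "AS", "ON", "JOIN", "INTO", "VALUES", "NULL"}

TOKEN = re.compile(r"""
    (?P<comment> --[^\r\n]* )
  | (?P<string>  '(?:''|[^'])*' )
  | (?P<word>    [A-Za-z_][A-Za-z0-9_]* )
  | (?P<space>   \s+ )
  | (?P<other>   . )
""", re.X | re.S)

def prefix_tables(sql: str, prefix: str = "#") -> str:
    out = []
    prev_was_table = False  # True while only whitespace separates us from a table name
    for m in TOKEN.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "space":
            out.append(text)
            continue  # whitespace leaves prev_was_table unchanged
        is_table = False
        if kind == "word" and text.isupper() and text not in KEYWORDS:
            is_alias_ref = sql.startswith(".", m.end())  # e.g. MB.limit
            is_table = not is_alias_ref and not prev_was_table
        out.append(prefix + text if is_table else text)
        prev_was_table = is_table
    return "".join(out)

sql = (
    "update SYBASE_TABLE\n"
    "set ok = convert(char(10),MB.limit)\n"
    "from MOVE_BOOKS MB, PEOPLEPLACES PPL\n"
    "where MB.move_num = PPL.move_num\n"
    "AND PPL.mot_ind = 'B'\n"
    "AND PPL.trade_type_ind = 'P'\n"
)

result = prefix_tables(sql)
print(result)
```

On the sample from the question this prefixes only `SYBASE_TABLE`, `MOVE_BOOKS` and `PEOPLEPLACES`. Pass `prefix=" #"` if the leading space is really wanted. Qualified names such as `DB.TABLE` would be mistaken for alias references, so check the result against a few real files first.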

Good luck.
