开发者

Parsing Transact SQL with RegEx

I'm quite inexperienced with RegEx - just an occasional straighforward RegEx for a programming task that I worked out by trial and error, but now I have a serious regEx challenge:

I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using regEx (Python 3 or Delphi-PRCE) that will perform this task.

The rules are as follows:

Table names are ALWAYS upperCase - so I'm only looking for upperCase words;

Column names, SQL expressions and variables are ALWAYS lowerCase;

SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';

Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the previous word, which will be a table name.

Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.

Here is some sample text requiring application of all the above mentioned rules:

update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'

So far w开发者_运维知识库ith I've gotten only this far: (not too far...)

(?-i)[[:upper:]]

Any help would be most appreciated. TIA,

MN


This is not doable with a simple regex-replacement. You will not be able to make a distinction between upper case words that are tables, are string literals or are commented:

update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER' 
-- AND NO TABLES HERE AS WELL

EDIT

You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:

-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '

or

update TABLE set x='not '' A '''' table' where y=666 

EDIT II

Okay, I (obsessively) hammered on the fact that a simple regex replacements is not doable. But I didn't offer a (possible) solution yet. What you could do is create some sort of "hybrid-lexer" based on a couple of different regex-es. What you do is scan through the input file and at the start of each character, try to match either a comment, a string literal, a keyword, or a capitalized word. And if none of these 4 previous patterns matched, then just consume a single character and repeat the process.

A little demo in Python:

#!/usr/bin/env python
import re 

input = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""

regex = r"""(?xs)          # x = enable inline comments, s = enable DOT-ALL
  (--[^\r\n]*)             # [1] comments
  |                        # OR
  ('(?:''|[^\r\n'])*')     # [2] string literal
  |                        # OR
  (\b(?:AND|UPDATE|SET)\b) # [3] keywords
  |                        # OR
  ([A-Z][A-Z_]*)           # [4] capitalized word
  |                        # OR
  .                        # [5] fall through: matches any char
"""

output = ''

for m in re.finditer(regex, input): 
    # append a `#` if group(4) matched
    if m.group(4): output += '#'
    # append the matched text (any of the groups!)
    output +=  m.group()

# print the adjusted SQL
print output

which produces:

UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'

This may not be the exact output you want, but I'm hoping the script is simple enought for you to adjust to your needs.

Good luck.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜