Parsing Transact SQL with RegEx
I'm quite inexperienced with RegEx - just an occasional straighforward RegEx for a programming task that I worked out by trial and error, but now I have a serious regEx challenge:
I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using regEx (Python 3 or Delphi-PRCE) that will perform this task.
The rules are as follows:
Table names are ALWAYS upperCase - so I'm only looking for upperCase words;
Column names, SQL expressions and variables are ALWAYS lowerCase;
SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';
Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the previous word, which will be a table name.
Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far w开发者_运维知识库ith I've gotten only this far: (not too far...)
(?-i)[[:upper:]]
Any help would be most appreciated. TIA,
MN
This is not doable with a simple regex-replacement. You will not be able to make a distinction between upper case words that are tables, are string literals or are commented:
update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'
-- AND NO TABLES HERE AS WELL
EDIT
You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:
-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '
or
update TABLE set x='not '' A '''' table' where y=666
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacements is not doable. But I didn't offer a (possible) solution yet. What you could do is create some sort of "hybrid-lexer" based on a couple of different regex-es. What you do is scan through the input file and at the start of each character, try to match either a comment
, a string literal
, a keyword
, or a capitalized word
. And if none of these 4 previous patterns matched, then just consume a single character and repeat the process.
A little demo in Python:
#!/usr/bin/env python
import re
input = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""
regex = r"""(?xs) # x = enable inline comments, s = enable DOT-ALL
(--[^\r\n]*) # [1] comments
| # OR
('(?:''|[^\r\n'])*') # [2] string literal
| # OR
(\b(?:AND|UPDATE|SET)\b) # [3] keywords
| # OR
([A-Z][A-Z_]*) # [4] capitalized word
| # OR
. # [5] fall through: matches any char
"""
output = ''
for m in re.finditer(regex, input):
# append a `#` if group(4) matched
if m.group(4): output += '#'
# append the matched text (any of the groups!)
output += m.group()
# print the adjusted SQL
print output
which produces:
UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'
This may not be the exact output you want, but I'm hoping the script is simple enought for you to adjust to your needs.
Good luck.
精彩评论