
Trying to remove hex codes from regular expression results

My first question here at SO!

To the point:

I'm pretty newbish when it comes to regular expressions.

To learn it a bit better and create something I can actually use, I'm trying to create a regexp that will find all the CSS tags in a CSS file.

So far, I'm using:

[#.]([a-zA-Z0-9_\-])*

This works pretty well: it finds the #TB_window as well as the #TB_window img#TB_Image and the .TB_Image#TB_window.

The problem is that it also finds the hex codes in the CSS file, e.g. #FFF or #eaeaea.

The .png, the .jpg and the 0.75 are found as well.

Actually it's pretty logical that they are found, but aren't there smart workarounds for that?

Like excluding anything between the brackets {..}?

(I'm pretty sure that's possible, but my regexp experience is not much yet).

Thanks in advance!

Cheers!

Mike


CSS is a very simple, regular language (at least as long as there are no nested blocks such as @media), which means it can be completely parsed by regex. All there is to it are groups of selectors, each followed by a group of rules separated by semicolons.

Note that all regexes in this post should have the dotall and verbose flags set (/s and /x in some languages, re.DOTALL and re.VERBOSE in Python).

To get pairs of (selectors, rules):

\s*        # Match any initial space
([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  (.*?)    # Ungreedily match anything any number of times.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
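
As a minimal sketch, this is how the pattern above could be compiled and tried out in Python. The names RE_SIMPLE_BLOCK and sample are made up for illustration, and the sample CSS is just an assumption about what the input might look like:

```python
import re

# The simple block pattern from above, compiled with the verbose and dotall flags.
RE_SIMPLE_BLOCK = re.compile(r"""
    \s*        # Match any initial space
    ([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
    \s*        # Arbitrary spacing again.
    \{         # Opening brace.
      \s*      # Arbitrary spacing again.
      (.*?)    # Ungreedily match anything any number of times.
      \s*      # Arbitrary spacing again.
    \}         # Closing brace.
""", re.VERBOSE | re.DOTALL)

# Hypothetical input, including the hex code and decimal from the question.
sample = "#TB_window { color: #FFF; }\n.TB_Image,\nimg { opacity: 0.75 }"

blocks = RE_SIMPLE_BLOCK.findall(sample)
for selectors, rules in blocks:
    print(repr(selectors), "->", repr(rules))
```

The hex code and the 0.75 end up in the second element of each pair, so they never get mixed up with the selectors.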

This will not work in the rare case of having a quoted curly bracket in an attribute selector (e.g. img[src~='{abc}']) or in a rule (e.g. background: url('images/ab{c}.jpg')). This can be fixed by complicating the regex some more:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^{}\"\']           # Any character other than braces or quotes.
  |                   # OR
  \"                  # An opening double quote.
    (?:[^\"\\]|\\.)*  # Either a neither-quote-not-backslash, or an escaped character.
  \"                  # And a closing double quote.
  |                   # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that once or more.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  ((?:[^{}\"\']|\"(?:[^\"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')*?)
           # The above line is the same as the one in the selector capture group.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
# This will even correctly identify escaped quotes.

Woah, that's a handful. But if you approach it in a modular fashion, you'll notice it's not as complex as it seems at first glance.
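
One way to make the modular structure explicit, sketched in Python: build the pattern from a reusable "quoted string" piece. QUOTED, chunk and RE_BLOCK_MODULAR are hypothetical names chosen for this illustration, not part of any library:

```python
import re

# A quoted string: double- or single-quoted, with backslash escapes allowed.
QUOTED = r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\''

def chunk(excluded):
    """Match one character that is neither in `excluded` nor a quote,
    or one whole quoted string."""
    return r'(?:[^%s"\']|%s)' % (excluded, QUOTED)

# selectors { rules }, with both halves built from the same piece.
RE_BLOCK_MODULAR = re.compile(
    r'\s*(%s+?)\s*\{\s*(%s*?)\s*\}' % (chunk('{}'), chunk('{}')),
    re.DOTALL,
)

# The two problem cases mentioned above: braces inside quoted strings.
sample = "img[src~='{abc}'] { background: url('images/ab{c}.jpg') }"
parsed = RE_BLOCK_MODULAR.findall(sample)
print(parsed)
```

Because a quoted string is always consumed as a single unit, the braces inside it can never be mistaken for the start or end of a block.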

Now, to split selectors and rules, we have to match strings of characters that are either non-delimiters (where a delimiter is the comma for selectors and the semicolon for rules) or quoted strings with anything inside. We'll use the same pattern we used above.

For selectors:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^,\"\']             # Any character other than commas or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:,|$)      # Followed by a comma or the end of a string.

For rules:

\s*        # Match any initial space
((?:       # Start the rules capture group.
  [^;\"\']             # Any character other than semicolons or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:;|$)      # Followed by a semicolon or the end of a string.

Finally, for each rule, we can split (once!) on a colon to get a property-value pair.
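
A small sketch of why the split has to happen only once, assuming a rule whose value itself contains a colon (the URL here is a made-up example):

```python
import re

# Hypothetical rule with a second colon inside the value.
rule = "background: url(http://example.com/bg.png) no-repeat"

# maxsplit=1 stops after the first colon, so the colon inside the URL survives.
prop, value = re.split(r'\s*:\s*', rule, maxsplit=1)
print(prop)
print(value)
```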

Putting that all together into a Python program (the regexes are the same as above, but non-verbose to save space):

import re

CSS_FILENAME = 'C:/Users/Max/frame.css'

RE_BLOCK = re.compile(r'\s*((?:[^{}"\'\\]|\"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')+?)\s*\{\s*((?:[^{}"\'\\]|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')*?)\s*\}', re.DOTALL)
RE_SELECTOR = re.compile(r'\s*((?:[^,"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:,|$)', re.DOTALL)
RE_RULE = re.compile(r'\s*((?:[^;"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:;|$)', re.DOTALL)

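# Read the whole stylesheet; the with block closes the file afterwards.
with open(CSS_FILENAME) as f:
    css = f.read()

# For each block: the list of selectors, then [property, value] pairs.
print([(RE_SELECTOR.findall(i),
        [re.split(r'\s*:\s*', k, maxsplit=1)
         for k in RE_RULE.findall(j)])
       for i, j in RE_BLOCK.findall(css)])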

For this sample CSS:

body, p#abc, #cde, a img .fgh, * {
  font-size: normal; background-color: white !important;

  -webkit-box-shadow: none
}

#test[src~='{a\'bc}'], .tester {
  -webkit-transition: opacity 0.35s linear;
  background: white !important url("abc\"cd'{e}.jpg");
  border-radius: 20px;
  opacity: 0;
  -webkit-box-shadow: rgba(0, 0, 0, 0.6) 0px 0px 18px;
}

span {display: block;} .nothing{}

... we get (spaced for clarity):

[(['body',
   'p#abc',
   '#cde',
   'a img .fgh',
   '*'],
  [['font-size', 'normal'],
   ['background-color', 'white !important'],
   ['-webkit-box-shadow', 'none']]),
 (["#test[src~='{a\\'bc}']",
   '.tester'],
  [['-webkit-transition', 'opacity 0.35s linear'],
   ['background', 'white !important url("abc\\"cd\'{e}.jpg")'],
   ['border-radius', '20px'],
   ['opacity', '0'],
   ['-webkit-box-shadow', 'rgba(0, 0, 0, 0.6) 0px 0px 18px']]),
 (['span'],
  [['display', 'block']]),
 (['.nothing'],
  [])]

Simple exercise for the reader: write a regex to remove CSS comments (/* ... */).
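
For reference, one possible answer to that exercise, as a rough sketch. RE_COMMENT and strip_comments are hypothetical names, and this version assumes no /* or */ ever appears inside a quoted string:

```python
import re

# Non-greedy so each comment ends at the first closing */; DOTALL so
# comments may span several lines.
RE_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)

def strip_comments(css):
    """Return css with every /* ... */ comment removed."""
    return RE_COMMENT.sub('', css)

sample = "a { color: red; /* first */ }\n/* multi\n   line */\nb { margin: 0 }"
cleaned = strip_comments(sample)
print(cleaned)
```

Running this before the block regex keeps comments from leaking into the selectors or rules.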


What about this:

([#.]\S+\s*,?)+(?=\{)
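
A quick sketch of how this pattern behaves in Python; RE_SELECTOR_GROUP and the sample CSS are assumptions made for the illustration:

```python
import re

RE_SELECTOR_GROUP = re.compile(r'([#.]\S+\s*,?)+(?=\{)')

sample = "#TB_window { color: #FFF; }\n.TB_Image, #TB_overlay { opacity: 0.75; }"

# finditer + group(0) gives the whole match; findall would return only the
# last repetition of the capture group.
found = [m.group(0).strip() for m in RE_SELECTOR_GROUP.finditer(sample)]
print(found)
```

The lookahead for { is what keeps #FFF and .75 out of the results. Note that selectors containing bare tag names, such as #TB_window img, would only be partly matched by this pattern.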


First off, I don't see how the RE you posted would find .TB_Image#TB_window. You could do something like:

/^[#\.]([a-zA-Z0-9_\-]*)\s*{?\s*$/

This would find any occurrences of # or . at the beginning of the line, followed by the tag, optionally followed by a { and then a newline.

Note that this would NOT work for lines like .TB_Image { something: 0; } (all on one line) or div.mydivclass since the . is not at the beginning of the line.

Edit: As far as I know, nested braces only show up in at-rules such as @media, so if your file has none of those and you read in all the data and get rid of newlines, you could do something like:

/([a-zA-Z0-9_\-]*([#\.][a-zA-Z0-9_\-]+)+\s*,?\s*)+{.*}/

There's a way to tell a regex to ignore newlines as well, but I never seem to get that right.
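
The mode in question is usually called dotall (or single-line) mode. A minimal sketch in Python, where the flag is re.DOTALL; the sample CSS is made up:

```python
import re

css = ".TB_Image {\n  opacity: 0.75;\n}"

# Without DOTALL, "." stops at a newline, so a multi-line block is not matched.
without_flag = re.search(r'\{.*\}', css)
# With DOTALL, "." also matches newlines.
with_flag = re.search(r'\{.*\}', css, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```

With the flag set there is no need to strip newlines from the file first.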


It's actually not an easy task to solve with regular expressions, since there are a lot of possibilities. Consider:

  • descendant selectors like #someid ul img -- those are all valid tags and are separated by spaces
  • tags that don't start with . or # (i.e. HTML tag names) -- you have to provide a list of those in order to match them, since nothing else distinguishes them from property names
  • comments
  • more that I can't think of right now

I think you should instead consider some CSS parsing library suitable for your preferred language.
