Trying to remove hex codes from regular expression results

Posted: 2022-12-18 20:15

My first question here at SO!

To the point:

I'm pretty newbish when it comes to regular expressions.

To learn it a bit better and create something I can actually use, I'm trying to create a regexp that will find all the CSS tags in a CSS file.

So far, I'm using:

[#.]([a-zA-Z0-9_\-])*

Which is working pretty fine and finds the #TB_window as well as the #TB_window img#TB_Image and the .TB_Image#TB_window.

The problem is that it also finds the hex codes in the CSS file, i.e. #FFF or #eaeaea.

The .png, .jpg and 0.75 are found as well.

Actually it's pretty logical that they are found, but aren't there smart workarounds for that?

Like excluding anything between the brackets {..}?

(I'm pretty sure that's possible, but my regexp experience is not much yet).

Thanks in advance!

Cheers!

Mike

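CSS is a very simple language, and its basic form (flat, un-nested blocks) is regular, which means it can be parsed with regexes. All there is to it are groups of selectors, each followed by a brace-delimited group of rules separated by semicolons, where each rule is a property and a value separated by a colon.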

Note that all regexes in this post should have the verbose and dotall flags set (/s and /x in some languages, re.DOTALL and re.VERBOSE in Python).

To get pairs of (selectors, rules):

\s*        # Match any initial space
([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  (.*?)    # Ungreedily match anything any number of times.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
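A minimal sketch of how the pattern above could be run from Python, assuming the re.VERBOSE and re.DOTALL flags mentioned earlier. The names RE_BLOCK_SIMPLE and sample_css are made up for this illustration.

```python
import re

# The verbose pattern from above, compiled with both flags set.
RE_BLOCK_SIMPLE = re.compile(r'''
    \s*        # Match any initial space
    ([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
    \s*        # Arbitrary spacing again.
    \{         # Opening brace.
      \s*      # Arbitrary spacing again.
      (.*?)    # Ungreedily match anything any number of times.
      \s*      # Arbitrary spacing again.
    \}         # Closing brace.
''', re.VERBOSE | re.DOTALL)

# Hypothetical sample input, not taken from the question's real file.
sample_css = """
#TB_window { color: #FFF; }
.TB_Image, img#TB_Image {
  opacity: 0.75;
}
"""

# findall returns one (selectors, rules) tuple per block.
blocks = RE_BLOCK_SIMPLE.findall(sample_css)
for selectors, rules in blocks:
    print(repr(selectors), '->', repr(rules))
```

Note that #FFF and 0.75 end up in the rules half of each pair, so they never get mixed up with the selectors.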

This will not work in the rare case of having a quoted curly bracket in an attribute selector (e.g. img[src~='{abc}']) or in a rule (e.g. background: url('images/ab{c}.jpg')). This can be fixed by complicating the regex some more:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^{}\"\']           # Any character other than braces or quotes.
  |                   # OR
  \"                  # An opening double quote.
    (?:[^\"\\]|\\.)*  # Either a neither-quote-not-backslash, or an escaped character.
  \"                  # And a closing double quote.
  |                   # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that once or more.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  ((?:[^{}\"\']|\"(?:[^\"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')*?)
           # The above line is the same as the one in the selector capture group.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
# This will even correctly identify escaped quotes.
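To see the difference between the two versions, here is a rough comparison on an input with quoted braces. This is only a sketch: QUOTED, PIECE, RE_BLOCK_SIMPLE and RE_BLOCK_QUOTED are names I made up, and the patterns are written non-verbose and assembled from pieces to keep the example short.

```python
import re

# One quoted string, double- or single-quoted, with backslash escapes allowed.
QUOTED = r'"(?:[^"\\]|\\.)*"' + '|' + r"'(?:[^'\\]|\\.)*'"
# One "piece": a character that is neither brace nor quote, or a whole quoted string.
PIECE = r'(?:[^{}"\']|' + QUOTED + r')'

RE_BLOCK_SIMPLE = re.compile(r'\s*([^{}]+?)\s*\{\s*(.*?)\s*\}', re.DOTALL)
RE_BLOCK_QUOTED = re.compile(
    r'\s*(' + PIECE + r'+?)\s*\{\s*(' + PIECE + r'*?)\s*\}', re.DOTALL)

css = "img[src~='{abc}'] { background: url('images/ab{c}.jpg'); }"

simple_blocks = RE_BLOCK_SIMPLE.findall(css)
quoted_blocks = RE_BLOCK_QUOTED.findall(css)

print(simple_blocks)  # the simple pattern cuts the block at the quoted brace
print(quoted_blocks)  # the quote-aware pattern keeps the block whole
```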

Woah, that's a handful. But if you approach it in a modular fashion, you'll notice it's not as complex as it seems at first glance.

Now, to split selectors and rules, we have to match strings of characters that are either non-delimiters (where the delimiter is a comma for selectors and a semicolon for rules) or quoted strings with anything inside. We'll use the same pattern we used above.

For selectors:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^,\"\']             # Any character other than commas or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:,|$)      # Followed by a comma or the end of a string.
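As a quick check, something like the following should split a selector group while leaving a comma inside a quoted attribute value alone. It is a sketch with a non-verbose version of the pattern; QUOTED and selector_group are illustrative names.

```python
import re

# One quoted string, double- or single-quoted, with backslash escapes allowed.
QUOTED = r'"(?:[^"\\]|\\.)*"' + '|' + r"'(?:[^'\\]|\\.)*'"

# Non-verbose equivalent of the selector pattern above.
RE_SELECTOR = re.compile(
    r'\s*((?:[^,"\']|' + QUOTED + r')+?)\s*(?:,|$)', re.DOTALL)

# Hypothetical selector group with a comma inside a quoted string.
selector_group = "body, p#abc, a[title='x, y'], a img .fgh"

selectors = RE_SELECTOR.findall(selector_group)
print(selectors)
```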

For rules:

\s*        # Match any initial space
((?:       # Start the rules capture group.
  [^;\"\']             # Any character other than semicolons or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:;|$)      # Followed by a semicolon or the end of a string.
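The same idea with the semicolon as the delimiter can be sketched like this. The pattern is non-verbose, and QUOTED and rule_group are illustrative names; the sample includes a semicolon inside a quoted URL to show that it is not treated as a separator.

```python
import re

# One quoted string, double- or single-quoted, with backslash escapes allowed.
QUOTED = r'"(?:[^"\\]|\\.)*"' + '|' + r"'(?:[^'\\]|\\.)*'"

# Non-verbose rule pattern: semicolons delimit rules, quoted strings are opaque.
RE_RULE = re.compile(
    r'\s*((?:[^;"\']|' + QUOTED + r')+?)\s*(?:;|$)', re.DOTALL)

# Hypothetical rule group; the last rule has no trailing semicolon.
rule_group = 'background: url("a;b.png"); opacity: 0.75; color: #FFF'

rules = RE_RULE.findall(rule_group)
print(rules)
```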

Finally, for each rule, we can split (once!) on a colon to get a property-value pair.
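The "once" matters because values can contain colons themselves. A small sketch, where split_rule is a made-up helper name and the URL is just a placeholder:

```python
import re

def split_rule(rule):
    """Split 'property: value' on the first colon only."""
    return re.split(r'\s*:\s*', rule, maxsplit=1)

rule = 'background: url(http://example.com/a.png)'

pair = split_rule(rule)
print(pair)

# Without maxsplit, the colon inside the URL would be split on as well.
over_split = re.split(r'\s*:\s*', rule)
print(over_split)
```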

Putting that all together into a Python program (the regexes are the same as above, but non-verbose to save space):

import re

CSS_FILENAME = 'C:/Users/Max/frame.css'

RE_BLOCK = re.compile(r'\s*((?:[^{}"\'\\]|\"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')+?)\s*\{\s*((?:[^{}"\'\\]|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')*?)\s*\}', re.DOTALL)
RE_SELECTOR = re.compile(r'\s*((?:[^,"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:,|$)', re.DOTALL)
RE_RULE = re.compile(r'\s*((?:[^;"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:;|$)', re.DOTALL)

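# Read the whole stylesheet; the with-block closes the file afterwards.
with open(CSS_FILENAME) as css_file:
    css = css_file.read()

# For each block: the list of selectors, plus a [property, value] pair per rule.
print([(RE_SELECTOR.findall(i),
        [re.split(r'\s*:\s*', k, maxsplit=1)
         for k in RE_RULE.findall(j)])
       for i, j in RE_BLOCK.findall(css)])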

For this sample CSS:

body, p#abc, #cde, a img .fgh, * {
  font-size: normal; background-color: white !important;

  -webkit-box-shadow: none
}

#test[src~='{a\'bc}'], .tester {
  -webkit-transition: opacity 0.35s linear;
  background: white !important url("abc\"cd'{e}.jpg");
  border-radius: 20px;
  opacity: 0;
  -webkit-box-shadow: rgba(0, 0, 0, 0.6) 0px 0px 18px;
}

span {display: block;} .nothing{}

... we get (spaced for clarity):

[(['body',
   'p#abc',
   '#cde',
   'a img .fgh',
   '*'],
  [['font-size', 'normal'],
   ['background-color', 'white !important'],
   ['-webkit-box-shadow', 'none']]),
 (["#test[src~='{a\\'bc}']",
   '.tester'],
  [['-webkit-transition', 'opacity 0.35s linear'],
   ['background', 'white !important url("abc\\"cd\'{e}.jpg")'],
   ['border-radius', '20px'],
   ['opacity', '0'],
   ['-webkit-box-shadow', 'rgba(0, 0, 0, 0.6) 0px 0px 18px']]),
 (['span'],
  [['display', 'block']]),
 (['.nothing'],
  [])]

Simple exercise for the reader: write a regex to remove CSS comments (/* ... */).
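One possible answer to the exercise, as a sketch: RE_COMMENT and strip_comments are names I made up, and this simple version assumes that no quoted string in the stylesheet contains a /* sequence.

```python
import re

# Ungreedy match from "/*" to the nearest "*/"; DOTALL lets it span lines.
RE_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)

def strip_comments(css):
    """Remove every /* ... */ comment from the given CSS text."""
    return RE_COMMENT.sub('', css)

sample = "a { color: red; /* note */ }\n/* multi\nline */\nb { }"
stripped = strip_comments(sample)
print(stripped)
```

Running this before the block regex keeps comments out of the selector and rule strings.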

What about this:

([#.]\S+\s*,?)+(?=\{)
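A rough way to try this pattern from Python, with find_selectors as a made-up helper. It uses finditer rather than findall, because the repeated capture group would otherwise only report its last repetition.

```python
import re

# The pattern above: runs of "#..." or ".." tokens that end right before a "{".
RE_BEFORE_BRACE = re.compile(r'([#.]\S+\s*,?)+(?=\{)')

def find_selectors(css):
    """Return the full text of each match, with surrounding spaces trimmed."""
    return [m.group(0).strip() for m in RE_BEFORE_BRACE.finditer(css)]

sample = (
    "#TB_window { color: #FFF; opacity: 0.75; }\n"
    ".TB_Image#TB_window { background: url(a.png); }\n"
)

found = find_selectors(sample)
print(found)

# A descendant selector is only partly matched: the "#TB_window img" part is lost.
partial = find_selectors("#TB_window img#TB_Image {}")
print(partial)
```

The hex code, 0.75 and .png are skipped in this sample because none of them is directly followed by an opening brace. Minified CSS without spaces around the braces may behave differently.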

First off, I don't see how the RE you posted would find .TB_Image#TB_window. You could do something like:

/^[#\.]([a-zA-Z0-9_\-]*)\s*{?\s*$/

This would find any occurrences of # or . at the beginning of the line, followed by the tag, optionally followed by a { and then a newline.

Note that this would NOT work for lines like .TB_Image { something: 0; } (all on one line) or div.mydivclass since the . is not at the beginning of the line.
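Here is a small sketch of that behaviour in Python, assuming re.MULTILINE so that ^ and $ apply to each line. RE_LINE_SELECTOR and the sample text are illustrative only.

```python
import re

# Line-anchored version of the pattern above; MULTILINE makes ^ and $ per-line.
RE_LINE_SELECTOR = re.compile(
    r'^[#.]([a-zA-Z0-9_\-]*)\s*\{?\s*$', re.MULTILINE)

css = (
    "#TB_window {\n"
    "  color: #FFF;\n"
    "}\n"
    ".TB_Image { opacity: 0.75; }\n"
    "div.mydivclass {\n"
    "}\n"
)

# group(1) is the name without the leading "#" or ".".
names = [m.group(1) for m in RE_LINE_SELECTOR.finditer(css)]
print(names)
```

Only the first selector is found; the one-line block and div.mydivclass are missed, as described above.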

Edit: I don't think nested braces are allowed in CSS, so if you read in all the data and get rid of newlines, you could do something like:

/([a-zA-Z0-9_\-]*([#\.][a-zA-Z0-9_\-]+)+\s*,?\s*)+{.*}/

There's a way to tell a regex to ignore newlines as well, but I never seem to get that right.
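In Python, for instance, that is the re.DOTALL flag, which lets . match newlines too. A minimal sketch with the pattern above (the braces are escaped here, and the names are made up for the example):

```python
import re

PATTERN = r'([a-zA-Z0-9_\-]*([#.][a-zA-Z0-9_\-]+)+\s*,?\s*)+\{.*\}'

css = ".TB_Image,\n#TB_window {\n  color: #FFF;\n}"

# Without DOTALL, "." stops at the newline after "{", so nothing matches.
without_flag = re.search(PATTERN, css)

# With DOTALL, ".*" can run across lines up to the closing brace.
with_flag = re.search(PATTERN, css, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```

Keep in mind that .* is greedy, so with several blocks in the input it would run to the last closing brace; .*? may be the better choice there.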

It's actually not an easy task to solve with regular expressions since there are a lot of possibilities, consider:

descendant selectors like #someid ul img -- those are all valid tags and are separated by spaces
tags that don't start with . or # (i.e. HTML tag names) -- you have to provide a list of those in order to match them since they have no other difference from attributes
comments
more that I can't think of right now

I think you should instead consider some CSS parsing library suitable for your preferred language.
