Replace a class with another in an HTML string

I want to replace one class name with another in an HTML string: class="abc" would become class="xyz". I tried to use regular expressions (I'm using C#) with no success:

const string input = @"abc class=""abcd abc zabc ab c"" abc";

Regex regex = new Regex(string.Format(@"class="".*(?({0})).*""", "abc")); // change this line ?!!

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

PS: if it matters: this isn't homework :p


No wonder you had no success. Parsing HTML can't be done using regexes.

You should use a proper HTML parser like HTML Agility Pack.


Parsing HTML with regular expressions tends to be a futile effort. Because most browsers allow a fair amount of leeway for badly formed HTML, you aren't guaranteed to get consistently formed HTML that regular expressions can parse easily (as svick also commented).

That said, you are better off using a formal HTML parser (I recommend the HTML Agility Pack): parse the document, change the attribute values, and then output the changed document if need be.
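As a rough illustration of the parser-based approach, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The helper name RenameClass and the choice to split the class value on whitespace are illustrative assumptions, not something taken from the question:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class ClassRenamer
{
    // Hypothetical helper: swaps the class token oldName for newName
    // on every element that carries a class attribute.
    public static string RenameClass(string html, string oldName, string newName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@class]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            // Compare whole tokens so "abcd" or "abc-1" is never touched.
            var tokens = node.GetAttributeValue("class", "")
                .Split(new[] { ' ', '\t', '\r', '\n', '\f' },
                       StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t == oldName ? newName : t);
            node.SetAttributeValue("class", string.Join(" ", tokens));
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling ClassRenamer.RenameClass(html, "abc", "xyz") on markup such as <p class="abc-1 abc not-abc">text</p> should change only the standalone abc token. Note that the parser may normalize other details of the markup when it is serialized again.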


Is it a real HTML string? I mean, are you sure you are dealing with well formed HTML? Could there be some error inside your string?

Depending on your answers to the questions above, you can choose how to solve your problem.

  • Yep: use HTML Agility Pack or something similar to parse your string correctly;
  • Nope: consider using an XML Parser (like the ones integrated in .NET assemblies). Make sure, however, it works well for you (remember XML is not HTML).

Whatever you choose, please: NEVER use Regular Expressions to parse HTML.
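For the XML-parser option mentioned above, a minimal sketch using LINQ to XML (System.Xml.Linq, part of .NET) could look like the following. It assumes the input is well-formed XML (for example XHTML with a single root element); the helper name RenameClass is hypothetical:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlClassRenamer
{
    // Hypothetical helper: only works when the markup is well-formed XML.
    public static string RenameClass(string xml, string oldName, string newName)
    {
        // Throws System.Xml.XmlException if the input is not well-formed.
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        // ToList() takes a snapshot of the attributes before they are modified.
        foreach (var attr in doc.Descendants().Attributes("class").ToList())
        {
            // A null separator array splits on any whitespace.
            var tokens = attr.Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t == oldName ? newName : t);
            attr.Value = string.Join(" ", tokens);
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Typical real-world HTML (unclosed tags, unquoted attributes, entities such as &nbsp;) will make XDocument.Parse throw, which is the "XML is not HTML" caveat in practice.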


I've made a best-effort attempt at answering this. A regex similar to the following could be used:

@"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)(?<![\w-])abc(?![\w-])(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)"

broken down a little bit:

(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)  #Make sure its inside a tag
(?<![\w-])abc(?![\w-])                                #just the class abc (not abcd, etc)
(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)             #Make sure its really INSIDE a tag

a little further:

(?<=                           #lookbehind
   <[\w-]+\s+                  # match tag name and whitespace
   ([\w-]+=""[^""]*""\s*)*     # match any attributes coming before the class attribute
   class=""[^""]*              # match the class attribute and any other classes before
)                              #end lookbehind
(?<![\w-])abc(?![\w-])         #"abc" at appropriate boundaries
(?=                            #lookahead
   [^""]*""                    # match any remaining classes in the declaration
   \s*([\w-]+=""[^""]*""\s*)*  # match any remaining attributes in the tag
   /?>                         # match the end of the tag
)                              #end lookahead

This will match the string abc inside any class attribute value that is inside a tag (not in text in between tags), and which might or might not have other attributes before or after it.

Attention!

  • IT ONLY HANDLES attribute values in double quotes (")
  • IT ONLY ALLOWS underscores, letters, numbers and dash symbols in the tag and attribute names - you'll need to add colons and periods if you want them (and make it only match names STARTING with a letter if you want it strict)
  • EDIT: As discussed in a comment somewhere around here, a version that relies on \b WILL ALSO MATCH abc-1 or not-abc in addition to abc, thus turning <p class="abc-1 abc not-abc">text</p> into <p class="xyz-1 xyz not-xyz">text</p>, because \b matches at the dash character... this gets EXTREMELY HARD TO ACCOUNT FOR!! FOLLOW-UP: I added an additional lookahead and lookbehind to hopefully account for the dashes, but who knows... END EDITS

Also, there are bound to be other situations that can break this...

In short - it's probably best not to use this, but instead to use something like HTML Agility Pack - good luck!
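For anyone who still wants to try the pattern above, here is a small sketch of how it might be wired up in C#. The class and method names are made up for illustration; the pattern itself is the one shown above, split over three verbatim strings for readability:

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // The lookbehind / target / lookahead pieces from the breakdown above.
    private static readonly Regex ClassAbc = new Regex(
        @"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)" +
        @"(?<![\w-])abc(?![\w-])" +
        @"(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)");

    // Replaces the standalone class token "abc" with "xyz".
    public static string Replace(string html)
    {
        return ClassAbc.Replace(html, "xyz");
    }
}
```

Because the lookbehind requires an opening tag, the sample string from the question (which contains class="..." but no surrounding <tag ...>) is left unchanged by this pattern. It only fires on input such as <p class="abc-1 abc not-abc">abc</p>, where just the middle class token is replaced.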


I'm not sure of the C# version of this regex, but here's how it would be done in Ruby:

regex = / class="[^"]*"/i

input.gsub( regex, ' class="abc"' )

This replaces every class specifier in the input with class="abc" (gsub is global; use sub to replace only the first one). It assumes no spaces around the equals sign, and it matches case-insensitively.

I assume C# is very similar in terms of describing the regex, and you might have to escape the double quotes.

Are you looking for something more specific? E.g., a method that takes two inputs (s1 and s2) and replaces class "s1" with class "s2"?
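A rough C# counterpart of the Ruby snippet, offered as a sketch rather than a verified translation of the author's intent, might look like this. The method names are hypothetical; in a verbatim string each double quote is written as "":

```csharp
using System.Text.RegularExpressions;

public static class WholeAttributeDemo
{
    private const string Pattern = @" class=""[^""]*""";
    private const string Replacement = @" class=""abc""";

    // Like Ruby's gsub: replaces every class attribute in the input.
    public static string ReplaceAll(string input)
    {
        return Regex.Replace(input, Pattern, Replacement, RegexOptions.IgnoreCase);
    }

    // Like Ruby's sub: the count argument limits the replacement to the first match.
    public static string ReplaceFirst(string input)
    {
        return new Regex(Pattern, RegexOptions.IgnoreCase).Replace(input, Replacement, 1);
    }
}
```

Like the Ruby version, this overwrites the whole attribute value instead of swapping a single class name inside it.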


Obviously Regex is unlikely to be your best choice when working with HTML or XML. You will probably get more consistent results if you try something suggested by the other people. Meanwhile, if you really want a Regex, here it is:

const string input = @"abc class=""abcd abc zabc ab c"" abc"; 

Regex regex = new Regex(string.Format(@"(?<=class\=""[^""]*\b){0}\b", "abc")); // I changed this line ?!! 

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output); 

To break it down:

(               #Start a group
    ?<=         #Positive lookbehind
    class\="    #Some characters to match against (without consuming)
    [^"]*       #Any other characters which are not "
                #This stops us from accidentally leaving the class attribute
)               #Close the lookbehind group
\b              #A word boundary (such as after whitespace or just before a ")
abc             #Your target
\b              #Another word boundary

Note the positive lookbehind means that we check for "class=" without it being part of our match. That is what we mean by "without consuming".

Note the use of the word boundaries, \b, so that we don't accidentally match abcd.
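Since the pattern above is built with string.Format, a class name containing regex metacharacters could break it, and \b also matches next to a dash (so abc-1 would be hit). A hedged variation on the same lookbehind idea, with Regex.Escape and explicit [\w-] lookarounds in place of \b, might look like this. The helper name and signature are assumptions for illustration:

```csharp
using System.Text.RegularExpressions;

public static class ClassTokenReplacer
{
    // Hypothetical helper: replaces oldClass with newClass inside
    // double-quoted class="..." values only.
    public static string Replace(string input, string oldClass, string newClass)
    {
        string pattern =
            @"(?<=class=""[^""]*)" +   // somewhere after class=" with no quote in between
            @"(?<![\w-])" +            // not preceded by a word character or a dash
            Regex.Escape(oldClass) +   // the class name, taken literally
            @"(?![\w-])";              // not followed by a word character or a dash

        // "$$" keeps any '$' in the new name literal in the replacement string.
        return Regex.Replace(input, pattern, newClass.Replace("$", "$$"));
    }
}
```

With the input from the question, ClassTokenReplacer.Replace(input, "abc", "xyz") produces the string expected by the assertion. The sketch still has the limitations discussed elsewhere in this thread: it only handles double quotes, and it does not check that the match sits inside a real tag.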


Disclaimer:

As others have pointed out, using regex to parse non-regular languages is fraught with peril! It is best to use a dedicated parser specifically designed for the job, especially when parsing the tag soup that is HTML.

That said...

If you insist on using a regular expression, here is a regex solution that will do a pretty good job:

text = Regex.Replace(text, @"
    # Change HTML element class attribute value: 'abc' to: 'xyz'.
    (                   # $1: Everything up to 'abc'.
      <\w+              # Begin (X)HTML element open tag.
      (?:               # Match any attribute(s) preceding 'class'.
        \s+             # Whitespace required before each attribute.
        (?!class\b)     # Assert this attribute name is not 'class'.
        [\w\-.:]+       # Required attribute name.
        (?:             # Begin optional attribute value.
          \s*=\s*       # Attribute value separated by =.
          (?:           # Group for attrib value alternatives.
            ""[^""]*""  # Either a double quoted value,
          | '[^']*'     # or a single quoted value,
          | [\w\-.:]+   # or an unquoted value.
          )             # End group for attrib value alternatives.
        )?              # End optional attribute value.
      )*                # Zero or more attributes may precede class.
      \s+               # Whitespace required before class attribute.
      class             # Literal class attribute name.
      \s*=\s*           # Attribute value separated by =.
      (?:               # Group for attrib value alternatives.
        ""              # Either a double quoted value.
        [^""]*?         # Zero or more classes may precede 'abc'.
      | '               # Or a single quoted value.
        [^']*?          # Zero or more classes may precede 'abc'.
      )?                # Or 'abc' class attrib value is unquoted.
    )                   # End $1: Everything up to 'abc'.
    (?<=['""\s=])       # Assert 'abc' not part of '123-abc'.
    abc                 # Match the 'abc' in class attribute value.
    (?=['""\s>])        # Assert 'abc' not part of 'abc-123'.",
    "$1xyz", RegexOptions.IgnorePatternWhitespace);

Example input:

class=abc ... class="abc" ... class='abc'
class = abc ... class = "abc" ... class = 'abc'
class="123 abc 456" ... class='123 abc 456'
class="123-abc abc 456-abc" ... class='123-abc abc 456-abc'
class="abc-123 abc abc-456" ... class='abc-123 abc abc-456'

Example output:

class=xyz ... class="xyz" ... class='xyz'
class = xyz ... class = "xyz" ... class = 'xyz'
class="123 xyz 456" ... class='123 xyz 456'
class="123-abc xyz 456-abc" ... class='123-abc xyz 456-abc'
class="abc-123 xyz abc-456" ... class='abc-123 xyz abc-456'

Note that there will always be edge cases where this solution will fail, e.g. evil strings within CDATA sections, comments, scripts, styles and tag attribute values can trip this up. (See disclaimer above.) That said, this solution will do a pretty good job for many cases (but will never be 100% reliable!)
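If one big pattern feels too fragile, a different (swapped-in) technique is to split the job in two: a simple regex finds each quoted class value, and ordinary string comparison replaces the token inside it. The sketch below is not the regex from this answer; the names are hypothetical, and it deliberately ignores unquoted values:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoStepReplacer
{
    // Step 1: find class="..." or class='...' (the same quote must close the value).
    private static readonly Regex ClassAttr = new Regex(
        @"(?<head>\sclass\s*=\s*)(?<q>[""'])(?<value>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Replace(string html, string oldClass, string newClass)
    {
        return ClassAttr.Replace(html, m =>
        {
            // Step 2: split on whitespace; the capturing group keeps the
            // separators in the result, so the original spacing is preserved.
            var tokens = Regex.Split(m.Groups["value"].Value, @"(\s+)")
                .Select(t => t == oldClass ? newClass : t);

            return m.Groups["head"].Value + m.Groups["q"].Value
                 + string.Concat(tokens) + m.Groups["q"].Value;
        });
    }
}
```

Comparing whole tokens sidesteps the abc-123 / 123-abc boundary problem, but the same edge cases (comments, scripts, CDATA sections) can still trip up step 1.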

Edit: 2011-10-10 14:00 MDT Streamlined overall answer. Removed first regex solution. Modified to correctly ignore classes having similar names like: abc-123 and 123-abc.
