How to strip out one common attribute from every form element on the page?

I have a string variable that contains an HTML page's response. It contains hundreds of tags, including the following three HTML tags:

<tag1 prefix1314030136543="2">
<tag2 prefix131403013654="1" anotherAttribute="432">
<tag3 prefix13140301376543="4">

I need to be able to strip out any attribute that starts with "prefix" along with its value, regardless of tag name. In the end, I'd like to have:

<tag1>
<tag2 anotherAttribute="432">
<tag3>

I am using C#. I'm assuming RegEx is the solution, but I'm horrible with RegEx and hope someone can help me out here.


Look at Html Agility Pack.

Using regex:

(?<=<[^<>]*)\sprefix\w+="[^"]*"(?=[^<>]*>)

var result = Regex.Replace(s, 
    @"(?<=<[^<>]*)\sprefix\w+=""[^""]*""(?=[^<>]*>)", string.Empty);
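
As a rough check of the pattern against the question's sample, here is a minimal sketch. The class and method names (PrefixRegexDemo, Strip) are made up for illustration, and it assumes every value is double-quoted and contains no angle brackets.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PrefixRegexDemo
{
    // Assumption: values are double-quoted and never contain '<' or '>'.
    private static readonly Regex PrefixAttribute = new Regex(
        @"(?<=<[^<>]*)\sprefix\w+=""[^""]*""(?=[^<>]*>)");

    // Removes each matching attribute together with the space in front of it.
    public static string Strip(string html)
    {
        return PrefixAttribute.Replace(html, string.Empty);
    }
}
```

Calling PrefixRegexDemo.Strip on the three sample tags from the question should leave <tag1>, <tag2 anotherAttribute="432"> and <tag3>. Single-quoted or unquoted values are not matched by this pattern.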


Regex is not the right tool here: HTML is not a regular language, so it shouldn't be parsed with regular expressions. I've heard good things about Html Agility Pack for parsing and working with HTML. Check it out.


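// Requires the HtmlAgilityPack package and "using System.Linq;" for Where/ToArray.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(/* your html here */);
foreach (var item in doc.DocumentNode.Descendants()) {
    // ToArray() takes a snapshot so attributes can be removed while iterating.
    foreach (var attr in item.Attributes.Where(x => x.Name.StartsWith("prefix")).ToArray()) {
        item.Attributes.Remove(attr);
    }
}
// The cleaned markup is then available as a string.
var cleaned = doc.DocumentNode.OuterHtml;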


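// prefix holds the attribute-name prefix to strip, e.g. "prefix".
// The lookbehind must not demand its own whitespace, or an attribute that comes first in the tag is missed.
html = Regex.Replace(html, @"(?<=<\w+[^>]*)\s" + Regex.Escape(prefix) + @"\w+\s?=\s?""[^""]*""(?=[^>]*>)", "");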

You have a lookbehind and a lookahead that make sure the match sits inside a tag, between its opening and closing angle brackets, then you have a matcher for the prefix#####="?????" part itself.
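
To see what the lookarounds buy you, here is a small sketch that splits the pattern into its three parts. LookaroundDemo and Strip are hypothetical names, and the sketch assumes double-quoted values with no ">" inside them.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LookaroundDemo
{
    public static string Strip(string html, string prefix)
    {
        string pattern =
            // Lookbehind: a tag has been opened and not yet closed.
            @"(?<=<\w+[^>]*)"
            // The attribute itself: whitespace, name starting with the prefix, quoted value.
            + @"\s" + Regex.Escape(prefix) + @"\w+\s?=\s?""[^""]*"""
            // Lookahead: the tag is closed somewhere after the attribute.
            + @"(?=[^>]*>)";
        return Regex.Replace(html, pattern, string.Empty);
    }
}
```

With the input <p prefix1="a">see prefix2="b" here</p> and the prefix "prefix", only the attribute inside the tag is removed. The prefix2="b" in the text content stays, because the lookbehind cannot reach back past the ">" that closes the p tag.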


Here's the heavy-handed method of doing it.

    string str = "<tag1 prefix131403013654=\"2\">";
    const string marker = "prefix131403013654=\"";
    int point = str.IndexOf(marker);
    while (point != -1) // At least one still exists...
    {
        // We know there's a leading and an ending double quote around the value, so find the second quote.
        int secondQuote = str.IndexOf("\"", point + marker.Length);
        if (secondQuote == -1) break; // unterminated value, give up
        if (point > 0 && str[point - 1] == ' ')
        {
            // Remove the attribute together with the space in front of it.
            str = str.Remove(point - 1, secondQuote - point + 2);
            point = str.IndexOf(marker, point - 1);
        }
        else
        {
            // Not an attribute: skip it instead of looping forever.
            point = str.IndexOf(marker, point + marker.Length);
        }
    }

Edited for better code, and again after testing so that the final quote is included in the removed span. It works. You could wrap this in a loop that goes through a list of all the attribute names you want removed.
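
The "loop over a list of names" idea might look roughly like this. AttributeRemover and RemoveAll are names invented for the sketch, and it assumes exact attribute names, double-quoted values, and a single space in front of each attribute.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class AttributeRemover
{
    // Removes name="value", plus the space in front of it, for every name in the list.
    public static string RemoveAll(string html, IEnumerable<string> names)
    {
        foreach (string name in names)
        {
            string marker = name + "=\"";
            int point = html.IndexOf(marker, StringComparison.Ordinal);
            while (point != -1)
            {
                int closingQuote = html.IndexOf('"', point + marker.Length);
                if (closingQuote == -1) break; // unterminated value: stop on this name
                if (point > 0 && html[point - 1] == ' ')
                {
                    // Cut from the leading space through the closing quote.
                    html = html.Remove(point - 1, closingQuote - point + 2);
                    point = html.IndexOf(marker, point - 1, StringComparison.Ordinal);
                }
                else
                {
                    // Not preceded by a space, so not treated as an attribute: skip it.
                    point = html.IndexOf(marker, point + marker.Length, StringComparison.Ordinal);
                }
            }
        }
        return html;
    }
}
```

It still scans the raw string, so text content that happens to look like name="value" after a space would be removed as well.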

If you don't know the attribute's full name, you can change it up like so:

    string str = "<tag1 prefix131403013654=\"2\">";
    int point = str.IndexOf("prefix");
    while (point != -1) // At least one still exists...
    {
        // Find the quote that opens the value, then the second quote that closes it.
        int firstQuote = str.IndexOf("\"", point);
        int secondQuote = firstQuote == -1 ? -1 : str.IndexOf("\"", firstQuote + 1);
        if (secondQuote == -1) break; // no complete quoted value left
        if (point > 0 && str[point - 1] == ' ') // checking that it's actually an attribute
        {
            // Remove the attribute together with the space in front of it.
            str = str.Remove(point - 1, secondQuote - point + 2);
            point = str.IndexOf("prefix", point - 1);
        }
        else
        {
            // Skip this occurrence instead of looping forever.
            point = str.IndexOf("prefix", point + 1);
        }
        // Like I said, a very heavy way of doing it.
    }

That will catch all of them that start with prefix.
