Strip HTML tags?

How can I strip the tags from this text

<html>

<body>      

<h1>My First Heading</h1>

<p>My first paragraph.</p>
<test@test.com>
</body>
</html>

to look like

My First Heading
My first paragraph.
<test@test.com>

Using the function

public static string StripHTML(this string htmlText)
{
    var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
    return reg.Replace(htmlText, "");
}

I get

My First Heading My first paragraph.


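Use Html Agility Pack for these kinds of operations. It parses the markup instead of pattern-matching it, so it copes with real-world HTML better than a regex, and it supports LINQ. Because <test@test.com> can be mistaken for a tag, the emas helper below first removes the angle brackets around e-mail addresses so that the address itself survives the stripping.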


using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class Program
{
    static string input = " Your html goes here  ";

    static void Main(string[] args)
    {
        // Unwrap <user@host> first so the address is not treated as a tag.
        string modified_html = emas(input);

        // Html Agility Pack: parse the markup and read the text content.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(modified_html);
        string test1 = doc.DocumentNode.InnerText;
        Console.WriteLine(test1);

        // For comparison: the original regex applied to the pre-processed text.
        var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
        Console.WriteLine(reg.Replace(modified_html, ""));

        Console.Read();
    }

    // Replaces every "<address>" with "address" for each e-mail address found.
    public static string emas(string text)
    {
        string stripped = text;

        const string MatchEmailPattern =
            @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
            + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
            + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
            + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Find every e-mail address in the text.
        MatchCollection matches = rx.Matches(text);

        // Drop the angle brackets around each address that has them.
        foreach (Match match in matches)
        {
            stripped = stripped.Replace("<" + match.Value + ">", match.Value);
        }

        return stripped;
    }
}
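
If pulling in a library is not an option and the angle brackets around the address should be kept, as in the desired output in the question, a narrower regex may be enough. The original pattern `<(.|\n)*?>` removes anything between `<` and `>`, which is why `<test@test.com>` disappears. The sketch below only matches text shaped like a tag: a letter-led tag name followed by whitespace, `/`, or `>`. The names `TagStripper` and `StripTags` are made up for this example, and the pattern is an assumption about what counts as a tag. It does not handle comments, doctype declarations, `<script>` bodies, or attribute values that contain `>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches <name>, </name>, <name attr="...">, and <name/>.
    // "<test@test.com>" is not matched because '@' cannot follow a tag name.
    private static readonly Regex TagPattern =
        new Regex(@"</?[a-zA-Z][a-zA-Z0-9]*(\s[^>]*)?/?>", RegexOptions.Compiled);

    // Removes tag-shaped substrings and leaves all other text untouched.
    public static string StripTags(string htmlText)
    {
        return TagPattern.Replace(htmlText, "");
    }
}

public static class Demo
{
    public static void Main()
    {
        string html = "<h1>My First Heading</h1><p>My first paragraph.</p><test@test.com>";

        // Prints: My First HeadingMy first paragraph.<test@test.com>
        Console.WriteLine(TagStripper.StripTags(html));
    }
}
```

Whitespace and line breaks between the tags are left as they are, so the result may still need trimming or collapsing of blank lines.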