What am I doing wrong with my Regex?

I am not sure what I am doing wrong. I am trying to use the .NET Regex.Replace method, but it keeps replacing the wrong item.

I have two replacements. The first one replaces exactly what I want. The second, which is almost a mirror image of the first, does not.

Here is my sample markup:

<%@ Page Title="Tour" Language="C#" MasterPageFile="~/Views/Shared/Site.Master" Inherits="System.Web.Mvc.ViewPage" %>
<asp:Content ID="Content1" ContentPlaceHolderID="HeadContent" runat="server">
    <title>Website Portfolio Section - VisionWebCS</title>
    <meta name="description" content="A" />
    <meta name="keywords" content="B" />
</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="MainContent" runat="server">
    <!-- **START** -->

I am looking to replace both the meta tags.

<meta name=\"description\" content=\"A\" />
<meta name=\"keywords\" content=\"B\" />

In my code first I replace the keywords meta tag with

<meta name=\"keywords\" content=\"C\" />

This works so my next task is to replace the description meta tag with this

<meta name=\"description\" content=\"D\" />

This does not work: instead of replacing only the "description" meta tag, it replaces the "keywords" meta tag as well.

Here is my test program so you can all try it out. Just throw it into a C# console app.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        // Note: both patterns use a greedy .* inside the content attribute.
        private const string META_DESCRIPTION_REGEX = "<\\s* meta \\s* name=\"description\" \\s* content=\"(?<Description>.*)\" \\s* />";
        private const string META_KEYWORDS_REGEX = "<\\s* meta \\s* name=\"keywords\" \\s* content=\"(?<Keywords>.*)\" \\s* />";

        // IgnorePatternWhitespace makes the literal spaces in the patterns above insignificant.
        private static RegexOptions regexOptions = RegexOptions.IgnoreCase
                                   | RegexOptions.Multiline
                                   | RegexOptions.CultureInvariant
                                   | RegexOptions.IgnorePatternWhitespace
                                   | RegexOptions.Compiled;

        static void Main(string[] args)
        {
            string text = "<%@ Page Title=\"Tour\" Language=\"C#\" MasterPageFile=\"~/Views/Shared/Site.Master\" Inherits=\"System.Web.Mvc.ViewPage\" %><asp:Content ID=\"Content1\" ContentPlaceHolderID=\"HeadContent\" runat=\"server\">    <title>Website Portfolio Section - VisionWebCS</title>    <meta name=\"description\" content=\"A\" />    <meta name=\"keywords\" content=\"B\" /></asp:Content><asp:Content ID=\"Content2\" ContentPlaceHolderID=\"MainContent\" runat=\"server\"><!-- **START** -->";

            // First replacement: keywords (works as expected).
            Regex regex = new Regex(META_KEYWORDS_REGEX, regexOptions);
            string newKeywords = String.Format("<meta name=\"keywords\" content=\"{0}\" />", "C");
            string output = regex.Replace(text, newKeywords);

            // Second replacement: description (also removes the keywords tag).
            Regex regex2 = new Regex(META_DESCRIPTION_REGEX, regexOptions);
            string newDescription = String.Format("<meta name=\"description\" content=\"{0}\" />", "D");
            string newOutput = regex2.Replace(output, newDescription);
            Console.WriteLine(newOutput);
        }
    }

This gets me a final output of

<%@ Page Title="Tour" Language="C#" MasterPageFile="~/Views/Shared/Site.Master" Inherits="System.Web.Mvc.ViewPage" %>
<asp:Content ID="Content1" ContentPlaceHolderID="HeadContent" runat="server">
    <title>Website Portfolio Section - VisionWebCS</title>
    <meta name="description" content="D" />
</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="MainContent" runat="server">
    <!-- **START** -->

Thanks


What are you doing wrong? You are parsing HTML with a regex!

Recommended library for .NET: HTML Agility Pack


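To answer your question without useless life lessons: you are having trouble because of greedy quantifiers. A greedy .* inside content="..." runs on to the last closing quote it can find, so the match swallows the following tag too. Try making the quantifiers lazy by adding question marks: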

<meta\\s+?name=\"description\"\\s+?content=\"(?<Description>.*?)\"\\s*?/>

Sure, this regex won't work for every page in the world, but if you just need a quick replacement script for your own templates, a regex is the fastest and easiest solution and the way to go.
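
To make the difference concrete, here is a minimal sketch that runs a greedy and a lazy version of the description pattern against the two meta tags from the question. The class and method names (MetaTagDemo, ReplaceDescription) are made up for illustration, and the input is shortened to just the two tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaTagDemo
{
    // The two meta tags from the question, on one line as in the test program.
    public const string Sample =
        "<meta name=\"description\" content=\"A\" />    <meta name=\"keywords\" content=\"B\" />";

    // Greedy: .* runs to the LAST quote that is followed by "/>".
    public const string GreedyPattern =
        "<meta\\s+name=\"description\"\\s+content=\"(?<Description>.*)\"\\s*/>";

    // Lazy: .*? stops at the FIRST quote that lets the rest of the pattern match.
    public const string LazyPattern =
        "<meta\\s+name=\"description\"\\s+content=\"(?<Description>.*?)\"\\s*/>";

    // Hypothetical helper: replaces the matched description tag with a new one.
    // The value is assumed not to contain '$', which is special in replacement strings.
    public static string ReplaceDescription(string html, string pattern, string value)
    {
        string replacement = "<meta name=\"description\" content=\"" + value + "\" />";
        return Regex.Replace(html, pattern, replacement, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Greedy version: the keywords tag is swallowed by the match.
        Console.WriteLine(ReplaceDescription(Sample, GreedyPattern, "D"));
        // Lazy version: only the description tag is replaced.
        Console.WriteLine(ReplaceDescription(Sample, LazyPattern, "D"));
    }
}
```

With the greedy pattern the whole sample collapses into a single description tag, which is the behaviour the question describes; with the lazy pattern the keywords tag survives.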


I agree with @serg555's answer: the issue is the greedy quantifier. Making it lazy with '?' should solve the problem:

<meta\\s*name=\"description\"\\s*content=\"(?<Description>.*?)\"\\s*/>


Learn, love, and use the DOM. It is the W3C-standardized API for working with XML and HTML documents (HTML itself is not strictly XML, though XHTML is). Unless you have good reason to believe your input HTML is badly malformed, this is usually the best approach to start with.

Learn here

You are highly encouraged to check out Walkthrough: Accessing the DHTML DOM from C#

You may also want to try jQuery, as it makes it very easy to search the DOM. Like so.
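
As a rough illustration of the DOM approach in C#, the sketch below uses System.Xml.XmlDocument, the DOM implementation that ships with .NET. It assumes the input is a well-formed XML fragment; the .aspx file from the question is not (the <%@ Page %> directive alone would make LoadXml throw), so treat this as a sketch for XHTML-like input only. The names DomMetaDemo and SetMetaContent are hypothetical.

```csharp
using System;
using System.Xml;

public static class DomMetaDemo
{
    // Hypothetical helper: sets the content attribute of <meta name="..."> via the DOM.
    // Assumes 'xml' is well-formed and 'name' contains no quote characters.
    public static string SetMetaContent(string xml, string name, string value)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        // XPath lookup of the first meta element with the given name attribute.
        var meta = doc.SelectSingleNode("//meta[@name='" + name + "']") as XmlElement;
        if (meta != null)
        {
            meta.SetAttribute("content", value);
        }
        return doc.OuterXml;
    }

    public static void Main()
    {
        string head = "<head><meta name=\"description\" content=\"A\" /><meta name=\"keywords\" content=\"B\" /></head>";
        string updated = SetMetaContent(head, "description", "D");
        updated = SetMetaContent(updated, "keywords", "C");
        Console.WriteLine(updated);
    }
}
```

Because each tag is located by its name attribute rather than by text position, changing one tag cannot accidentally consume its neighbour.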


I needed to read the description of a URL in C# code, and used this site to check my regex.

This is my final version, which works perfectly:

            WebClient x = new WebClient { Encoding = Encoding.UTF8 };
            string source = x.DownloadString(url);

            // [\"'] accepts either quote style around the name value;
            // [^\"]* cannot run past the closing quote of the content attribute.
            string description = Regex.Match(source, "<meta[^>]*name=[\"']description[\"'][^>]*content=\"([^\"]*)\"[^>]*>", RegexOptions.IgnoreCase).Groups[1].Value;
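
The same extraction can be tried without a network call. The sketch below wraps a version of the pattern above in a small helper and runs it on an in-memory string; the class name MetaDescriptionReader and the method Extract are made up for illustration, and the pattern assumes the name attribute comes before content and that content is double-quoted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaDescriptionReader
{
    // [^>]* keeps the match inside a single tag; [^"]* keeps the capture
    // inside a single attribute value, so no lazy quantifier is needed.
    private static readonly Regex DescriptionRegex = new Regex(
        "<meta[^>]*name=[\"']description[\"'][^>]*content=\"([^\"]*)\"[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the description text, or an empty string when no tag matches.
    public static string Extract(string html)
    {
        Match m = DescriptionRegex.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<head><meta name=\"description\" content=\"A\" /><meta name=\"keywords\" content=\"B\" /></head>";
        Console.WriteLine(Extract(html));
    }
}
```

Negated character classes such as [^"]* sidestep the greedy-quantifier problem from the question, because the capture cannot cross a quote at all.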
