How to solve the string replace fiasco

2023-03-20 12:28

NOTE: My problem is NOT that my links are not being replaced. It is that the replacements are being NESTED. For example, take this comment:

some string with www.google.com/blah/blah also something else www.google.com

By the time the second string replace runs, part of the first link (www.google.com/blah/blah) also matches, so that link gets replaced twice.

I have a web app that lets users comment. When I display a comment on the page, I process the input string and convert all links to HTML link format. The original user input stays in the DB untouched, so it is never corrupted by processing; I only run my function on it when I show it on the page.

Now, this is the logic I am using to replace all links with their HTML versions:

Regex all links.
For each match, replace the link with its HTML version in the original string.
Finally, display the string.

Example: www.google.com becomes <a href="http://www.google.com" target="_blank">www.google.com</a> just before it's displayed on the page.

This was working great until recently, when one of my customers posted a comment with two links from the same domain.

The links were, say,

www.google.com/images/blahblah
www.google.com

My problem is that the second time around, when a string replace is done (I am using StringBuilder.Replace), the first link gets replaced as well!

So, first,

www.google.com/images/blahblah

becomes

<a href="http://www.google.com/images/blahblah" target="_blank">www.google.com/images/blahblah</a>

which is fine. The problem arises on the second string replace: since Replace is global, it also runs on the already-processed link, so the anchor above gets mangled because it contains www.google.com as well.
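
The failure described above can be reproduced in a few lines. This is only a sketch of the approach as described in the question: the class name, the method name, and the simplified URL pattern are assumptions made for illustration, not the original code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NestedReplaceDemo
{
    // Simplified stand-in for the real URL pattern (an assumption for this demo).
    private static readonly Regex Links =
        new Regex(@"www\.[\w\-./]*\w", RegexOptions.IgnoreCase);

    // Mirrors the approach in the question: one global Replace per match.
    public static string NaiveLinkify(string input)
    {
        var sb = new StringBuilder(input);
        foreach (Match match in Links.Matches(input))
        {
            string anchor = "<a href=\"http://" + match.Value
                + "\" target=\"_blank\">" + match.Value + "</a>";
            // StringBuilder.Replace rewrites EVERY occurrence in the buffer,
            // including text inside anchors written by earlier iterations.
            sb.Replace(match.Value, anchor);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(
            NaiveLinkify("see www.google.com/images/blahblah and www.google.com"));
    }
}
```

Running it on the two example links shows the second pass inserting a new anchor inside the href and the link text of the first one, which is the nesting complained about here.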

This messes things up so badly that I end up with a mutilated abomination of a string.

How do I avoid this?

Does Regex.Matches provide the index of each matched element for me to work with? I couldn't find it anywhere.

What's the best way to deal with this? Any suggestions?

Sorry for the lengthy question.

I can probably do this by manually traversing the string, but that's long and painful. There's got to be a good way to do it...

Edit: adding extra info, as someone asked:

My regex:

    string rPattern = @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?";

    Regex rLinks = new Regex(rPattern, RegexOptions.IgnoreCase);
    MatchCollection matches = rLinks.Matches(inputString);

Then I am using

foreach (Match match in matches)
{
    if (match.Value.StartsWith("www.youtube.com/watch"))
    {
        // logic to embed youtube video - this works fine.
    }
}

// Here I replace all hyperlinks with their <a href> versions

Regex.Matches returns a MatchCollection. Match.Index is what you're looking for.

string pattern = @"(https?://)?(?:www(?:\.\w+)+|(?:\w+\.)+(?:com|org|us|net|...))(/\w*)*"; // your pattern here.
foreach (Match match in Regex.Matches (input, pattern))
{
   // Use match.Index and match.Length;
}
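
One way to put Match.Index and Match.Length to work is to build the output in a single pass over the original string, so text that has already been written is never scanned again. The sketch below assumes a simplified URL pattern, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class IndexLinkifier
{
    // Simplified stand-in pattern (an assumption); substitute the real one.
    private static readonly Regex Links =
        new Regex(@"www\.[\w\-./]*\w", RegexOptions.IgnoreCase);

    public static string Linkify(string input)
    {
        var sb = new StringBuilder();
        int cursor = 0; // end of the previous match in the ORIGINAL string
        foreach (Match match in Links.Matches(input))
        {
            // Copy the untouched text between the previous match and this one.
            sb.Append(input, cursor, match.Index - cursor);
            sb.Append("<a href=\"http://").Append(match.Value)
              .Append("\" target=\"_blank\">").Append(match.Value).Append("</a>");
            cursor = match.Index + match.Length;
        }
        // Copy whatever follows the last match.
        sb.Append(input, cursor, input.Length - cursor);
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(
            Linkify("see www.google.com/images/blahblah and www.google.com"));
    }
}
```

Because every match is located in the untouched input and the result is appended to a separate buffer, a shorter link can never be found inside an anchor generated for a longer one.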

But really, you're probably looking for something more like this:

string originalPost = 
   @"Ooh shiney: www.google.com/images/blahblah
   Look here: www.google.com";

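// patternString is your URL pattern. $0 is the whole match
// ($1 would be only the first capture group, not the full link).
string html = Regex.Replace (
   originalPost, patternString, 
   "<a href='http://$0' target='_blank'>$0</a>");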

Or you can use a MatchEvaluator to do more advanced work (like ensuring we don't add a double http://):

string html = Regex.Replace (
   originalPost, patternString, 
   m => 
      string.Format (
         "<a href='{0}{1}' target='_blank'>{1}</a>",
          m.Value.StartsWith ("http", StringComparison.OrdinalIgnoreCase) ? "" : "http://",
          m.Value));

I had the same need, and this is what I've been using for the past couple of years:

public static string MakeCommentSafe(string strComment)
{
    // Normalize line endings: CRLF becomes LF, then any stray CR becomes LF.
    strComment = Regex.Replace(strComment, "\r\n", "\n").Replace((char)13, (char)10);

    // HTML-encode so user-supplied markup is displayed as text, not rendered.
    strComment = System.Web.HttpContext.Current.Server.HtmlEncode(strComment);

    // Collapse runs of two or more line feeds into exactly two (one blank line).
    strComment = Regex.Replace(strComment, "\n(\n)+", "$1\n");

    // Find all links and make them active, then turn line feeds into <br />.
    strComment = Regex.Replace(strComment, @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", "<a href=\"$1\" target=\"_blank\" rel=\"nofollow\">$1</a>");
    return Regex.Replace(strComment, "\n", "<br />");
}

And here's a tip. If you really want this to perform well with lots of comments on the page, then store both the unsafe and safe versions in the database when the comment is posted. That way you don't have to call this function repeatedly when displaying every comment on a page.

Use the Regex.Replace method, e.g.:

var result = Regex.Replace(input, pattern, "<a href=\"$0\" target=\"_blank\">$0</a>");

To play devil's advocate:

So, you want to correct strings that look like:

www.example.com
www.example.com/foo/bar
www.example.co.tw/baz.moo?foo=1

but, not strings that look like:

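<a href="http://www.example.com" target="_blank">www.example.com</a>
<a href="http://www.example.com/foo/bar" target="_blank">www.example.com/foo/bar</a>
<a href="http://www.example.co.tw/baz.moo?foo=1" target="_blank">www.example.co.tw/baz.moo?foo=1</a>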

I would guess that I am correct. A simple solution: expand your regex to look at either side of the thing that looks like a URL, and ignore it if it:

Is between a href=" and a " target="_blank">
Is between a " target="_blank"> and a </a>
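
A related way to get the same effect, sketched here with alternation rather than the lookarounds suggested above, is to let the regex match an existing anchor element as a whole and hand it back unchanged, wrapping only bare links. The class name, the method name, and the simplified URL pattern are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorAwareLinkifier
{
    // First alternative: an existing <a ...>...</a> element, matched whole.
    // Second alternative (named group "url"): a bare link. The URL part is a
    // simplified stand-in pattern, an assumption for this sketch.
    private static readonly Regex AnchorOrUrl = new Regex(
        @"<a\b[^>]*>.*?</a>|(?<url>www\.[\w\-./]*\w)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string input)
    {
        return AnchorOrUrl.Replace(input, m =>
            m.Groups["url"].Success
                ? "<a href=\"http://" + m.Value + "\" target=\"_blank\">" + m.Value + "</a>"
                : m.Value); // already an anchor: leave it untouched
    }

    public static void Main()
    {
        string once = Linkify("see www.example.com/foo/bar and www.example.com");
        Console.WriteLine(once);
        Console.WriteLine(Linkify(once) == once); // a second pass changes nothing
    }
}
```

Since anchors are consumed by the first alternative before the URL alternative can see their contents, running the function on already-processed text should leave it as it is.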

Tags: c#-4.0 regex string-formatting stringbuilder
