How to solve the string replace fiasco
NOTE: My problem is NOT that my links are not being replaced. The problem is that the replacements get NESTED. E.g., take this comment:
some string with www.google.com/blah/blah also something else www.google.com
By the time the second string replace runs, part of the first link (www.google.com/blah/blah) also matches, so that link gets replaced twice.
I have a web app which lets users comment. When I display a comment on the page, I process the input string and convert all links to HTML link format. The original user input stays in the DB untouched, so it is never corrupted by processing. I only run my function on it when I show it on the page.
Now, this is the logic I am using to replace all links with their html formats
- Regex all links
- For each match, replace the link with its HTML format version in the original string.
- Finally display string.
ex: www.google.com
becomes <a href="http://www.google.com" target="_blank">www.google.com</a>
just before it's displayed on page.
This was working great until recently, when one of my customers posted a comment with two links from the same domain.
The links were, say,
- www.google.com/images/blahblah
- www.google.com
My problem is that the second time around, when a string replace is done (I am using StringBuilder.Replace), the first link gets replaced as well!
so, firstly,
www.google.com/images/blahblah
becomes
<a href="http://www.google.com/images/blahblah" target="_blank">www.google.com/images/blahblah</a>
which is fine. But the problem arises on the second string replace: since Replace is global, it also runs on the already processed link, so the link above gets mangled because it contains www.google.com as well.
This is messing up so much that I actually get a mutilated abomination of a string.
How do I avoid this?
Does Regex.Matches provide an index of the matched element for me to work with? I couldn't find it anywhere.
What's the best way to deal with this? Any suggestions?
Sorry for the lengthy question.
I can probably do this by manually traversing the string, but that's long and painful. There's got to be a good way to do it...
Edit: adding extra info as someone asked:
My regex:
string rPattern = @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?";
Regex rLinks = new Regex(rPattern, RegexOptions.IgnoreCase);
MatchCollection matches = rLinks.Matches(inputString);
then I am using
foreach(Match match in matches)
{
if(match.Value.StartsWith("www.youtube.com/watch"))
{
//logic to embed youtube video - this works fine.
}
}
//Here I replace all hyperlinks to their <a href> parts
Regex.Matches returns a MatchCollection. Match.Index is what you're looking for.
string pattern = @"(https?://)?(?:www(?:\.\w+)+|(?:\w+\.)+(?:com|org|us|net|...))(/\w*)*"; // your pattern here.
foreach (Match match in Regex.Matches (input, pattern))
{
// Use match.Index and match.Length;
}
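As a rough sketch of how match.Index and match.Length could be used (not necessarily what the answerer had in mind): walk the matches from last to first and splice each anchor into a StringBuilder by position. Going backwards means an edit never shifts the index of a match that is still waiting, and nothing is ever searched for by text, so www.google.com cannot hit the inside of an already built link. The class name LinkifyByIndex and the href handling are my own assumptions; the pattern is the one from the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinkifyByIndex
{
    // Pattern copied from the question, unchanged.
    private static readonly Regex Links = new Regex(
        @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?",
        RegexOptions.IgnoreCase);

    public static string Linkify(string input)
    {
        var sb = new StringBuilder(input);
        MatchCollection matches = Links.Matches(input);

        // Last match first: positions of earlier matches stay valid.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];

            // Assumption: bare "www." links get an http:// scheme in the href.
            string href = m.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + m.Value
                : m.Value;
            string anchor = "<a href=\"" + href + "\" target=\"_blank\">" + m.Value + "</a>";

            // Replace exactly the matched span, by position rather than by text.
            sb.Remove(m.Index, m.Length);
            sb.Insert(m.Index, anchor);
        }
        return sb.ToString();
    }
}
```

Calling LinkifyByIndex.Linkify on the comment from the question should wrap each of the two links exactly once.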
But really, you're probably looking for something more like this:
string originalPost =
@"Ooh shiney: www.google.com/images/blahblah
Look here: www.google.com";
string html = Regex.Replace (
originalPost, patternString,
// $0 is the whole match; $1 would be only the first capture group.
"<a href='http://$0' target='_blank'>$0</a>");
Or, you can use a MatchEvaluator to do more advanced work (like ensuring we don't add a double http://).
string html = Regex.Replace (
originalPost, patternString,
m =>
string.Format (
"<a href='{0}{1}' target='_blank'>{1}</a>",
m.Value.StartsWith ("http", StringComparison.OrdinalIgnoreCase) ? "" : "http://",
m.Value));
I had the same need and this is what I've been using for the past couple years now:
public static string MakeCommentSafe(string strComment)
{
// Replace carriage return / line feeds with line feeds. Then HtmlEncode. Then collapse runs of consecutive line feeds to exactly two (i.e. a single blank line).
strComment = Regex.Replace(System.Web.HttpContext.Current.Server.HtmlEncode(Regex.Replace(strComment, "\r\n", "\n").Replace((char)13, (char)10)), "\n(\n)+", "$1\n");
// Find all links and make them active
return Regex.Replace(Regex.Replace(strComment, @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", "<a href=\"$1\" target=\"_blank\" rel=\"nofollow\">$1</a>"), "\n", "<br />");
}
And here's a tip. If you really want this to perform well with lots of comments on the page, then store both the unsafe and safe versions in the database when the comment is posted. That way you don't have to call this function repeatedly when displaying every comment on a page.
Use the Regex.Replace method, e.g.:
var result = Regex.Replace(input, pattern, "<a href=\"$0\" target=\"_blank\">$0</a>");
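If the per-match YouTube check from the question is still needed, it could probably live in the same single pass by using the MatchEvaluator overload of Regex.Replace. The sketch below assumes the question's pattern. EmbedYouTube is a hypothetical stand-in for the embed logic the question does not show, and LinkifyOnePass is just a name I picked. Because Regex.Replace scans the original input once, the markup it inserts is never scanned again.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifyOnePass
{
    // Pattern copied from the question, unchanged.
    private const string Pattern =
        @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?";

    // Hypothetical placeholder: the real embed logic is not shown in the question.
    private static string EmbedYouTube(string url)
    {
        return "[video: " + url + "]";
    }

    public static string Linkify(string input)
    {
        return Regex.Replace(input, Pattern, m =>
        {
            string url = m.Value;

            // Same prefix check as in the question's foreach loop.
            if (url.StartsWith("www.youtube.com/watch", StringComparison.OrdinalIgnoreCase))
                return EmbedYouTube(url);

            // Assumption: bare "www." links get an http:// scheme in the href.
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
        }, RegexOptions.IgnoreCase);
    }
}
```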
To play devil's advocate:
So, you want to correct strings that look like:
www.example.com
www.example.com/foo/bar
www.example.co.tw/baz.moo?foo=1
but not strings that look like this (the same URLs, already wrapped in a link):
<a href="http://www.example.com" target="_blank">www.example.com</a>
I would guess that I am correct. Simple solution: expand your regex to look on either side of the thing that looks like a URL, and ignore it if it:
- Is between a href=" and a " target="_blank">
- Is between a " target="_blank"> and a </a>
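One way this could be sketched in .NET is with a negative lookbehind, which (unlike in many other regex engines) may be variable-length. This is only an approximation of the two rules above: it rejects a candidate URL when it directly follows href=" or "> (optionally with a scheme in between), and it does not check for the closing </a>. The class name LinkifyGuarded is my own and the URL pattern is the one from the question. It is tied to the exact markup being generated, so treat it as fragile.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifyGuarded
{
    // Pattern copied from the question, unchanged.
    private const string Url =
        @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?";

    // Skip anything sitting right after href=" or "> (with an optional scheme
    // in between), i.e. URLs that are already part of a generated anchor.
    private static readonly Regex RawLinks = new Regex(
        @"(?<!(?:href=""|"">)(?:(?:https?|ftp)://)?)" + Url,
        RegexOptions.IgnoreCase);

    public static string Linkify(string input)
    {
        return RawLinks.Replace(input, m =>
        {
            // Assumption: bare "www." links get an http:// scheme in the href.
            string href = m.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + m.Value
                : m.Value;
            return "<a href=\"" + href + "\" target=\"_blank\">" + m.Value + "</a>";
        });
    }
}
```

With this guard, running Linkify on its own output should leave the string unchanged. That said, a single Regex.Replace pass over the original text avoids the need for the guard in the first place.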