PHP Regex Cleaning of User Posts
I am trying to clean up user-submitted comments in PHP using regex, but I have become rather stuck and confused!
Is it possible using regex to:
1. Remove punctuation repeated more than twice, so that:
OMG it was AWESOME!!!!
becomes: OMG it was AWESOME!!
!!!!!!!!!!.........------
becomes: !!..--
!?!?!?
becomes: !?
2. Remove duplicated words or phrases (for example, when a user has copied and pasted a message), so that:
spamspamspamspam
becomes: spam
I love copy and paste. I love copy and paste. I love copy and paste.
becomes: I love copy and paste.
3. Convert runs of capital letters and spaces longer than, say, 10 characters to lower case:
I LOVE CAPITALS THEY ARE SO AWESOME
becomes: I love capitals they are so awesome
GOOD that sounds
stays the same
Do you have any suggestions?
This is for a student system (hence the urge to at least try to tidy up what they post). I do not wish to go as far as filtering or blocking their messages, just "correct" them with some regex.
Thanks for your time,
Edit:
If it isn't possible using regex (or regex mixed with other PHP), how would you do it?
1:
// the same punctuation mark repeated more than twice: keep two
$string = preg_replace('#([?!.-])\1{2,}#', '$1$1', $string);
// a group of two or more punctuation marks repeated back to back: keep one copy
$string = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $string);
2:
// any sequence of two or more characters repeated back to back: keep one copy
$string = preg_replace('#(.{2,}?)\1+#', '$1', $string);
3:
// runs of 10 or more uppercase letters and spaces: convert to lower case
function tolower_cb($match) {
    return strtolower($match[0]);
}
$string = preg_replace_callback('#([A-Z ]{10,})#', 'tolower_cb', $string);
Try it here: http://codepad.org/iQsZ2vJ0
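For reference, the three steps above could be chained into one function along these lines. This is only a sketch: the name clean_comment is hypothetical, and step 2 deliberately differs from the pattern above by allowing optional whitespace between copies, (?:\s*\1)+. With a bare \1+ the pasted-sentence example seems to collapse only to two copies, because the last copy has no trailing space to match.

```php
<?php
// Hypothetical helper; the name clean_comment is not part of the answer above.
function clean_comment(string $text): string
{
    // preg_replace() returns null on a PCRE error, so fall back to the input.

    // 1a. The same punctuation mark three or more times in a row: keep two.
    $text = preg_replace('#([?!.-])\1{2,}#', '$1$1', $text) ?? $text;

    // 1b. A group of two or more punctuation marks repeated: keep one group.
    $text = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $text) ?? $text;

    // 2. Two or more characters repeated back to back: keep one copy.
    //    Assumption: \s* tolerates whitespace between the copies.
    $text = preg_replace('#(.{2,}?)(?:\s*\1)+#', '$1', $text) ?? $text;

    // 3. Runs of 10 or more uppercase letters and spaces: lower-case them.
    $text = preg_replace_callback(
        '#[A-Z ]{10,}#',
        function (array $m): string {
            return strtolower($m[0]);
        },
        $text
    ) ?? $text;

    return $text;
}

echo clean_comment('OMG it was AWESOME!!!!'), "\n"; // OMG it was AWESOME!!
echo clean_comment('!?!?!?'), "\n";                 // !?
echo clean_comment('spamspamspamspam'), "\n";       // spam
```

Two caveats worth knowing before using this on real posts: strtolower() also lower-cases the leading "I", so the capitals example comes out as "i love capitals they are so awesome", and the repeat pattern in step 2 cannot tell spam from ordinary spelling, so a word like "banana" is shortened to "bana".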
A good rule of thumb is never to try to "fix" user input. If a user wants to type four exclamation points after a sentence, allow it. There is no reason not to.
You should be more concerned with injection attacks than with things like this.
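As a rough illustration of that point, one common defence against markup injection is to store the comment as typed and escape it when it is printed into HTML. The helper name render_comment below is hypothetical, and the sketch assumes the page is served as UTF-8; for database writes, parameterized queries (for example via PDO) are the usual counterpart.

```php
<?php
// Hypothetical helper: escape a stored comment at output time so that any
// markup inside it is displayed as text instead of being interpreted.
function render_comment(string $comment): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of returning an empty string.
    return htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo render_comment('<script>alert("hi")</script>'), "\n";
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```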