Detecting Russian characters on a form in PHP
I have a site where people can submit links to sites about iPhone apps. The submitter provides the application name, description, category and URL. The site has been running for years and has never received a constructive submission from a Russian developer, but unfortunately it was discovered by Russian spammers, who annoy the hell out of me. Even with all the anti-spam measures in place, such as CAPTCHA boxes, some people insist on sending Russian porn spam that has nothing to do with the iPhone.
I would like to completely ban any URL or post written in Russian characters. For URLs there is not much I can do, except check whether the URL contains ".ru". But for descriptions, I would like to detect Russian characters. How do I do that in PHP?
Thanks.
Yes, it's very simple. It is easy to do with UTF-8 regular expressions (assuming your site uses UTF-8 encoding):
function isRussian($text) {
    // Ё and ё sit outside the contiguous А-я range, so they are listed separately.
    return preg_match('/[А-Яа-яЁё]/u', $text);
}
According to the PHP documentation, since version 5.1.0 it has been possible to match specific writing scripts in UTF-8 PCRE regular expressions by using \p{script name}. For Russian, which is written in the Cyrillic script, that is
preg_match( '/[\p{Cyrillic}]/u', $text);
There is a warning on the page:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters.
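The property match above could be wrapped in a small helper. This is only a sketch, assuming the input is meant to be UTF-8; the name containsCyrillic is made up for illustration. Note that \p{Cyrillic} covers the whole Cyrillic script, so it also matches letters used in Ukrainian, Bulgarian, Serbian and other languages, not only Russian.

```php
<?php
// Hypothetical helper: true when $text contains at least one Cyrillic-script character.
function containsCyrillic(string $text): bool
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8,
    // so only an explicit 1 is treated as a match here.
    return preg_match('/\p{Cyrillic}/u', $text) === 1;
}
```

Treating malformed input as "no match" is a design choice in this sketch; a stricter form handler might prefer to reject such input outright.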
Now, this code is about 5 years old, and it worked for me back when I had a similar problem:
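function detect_cyr_utf8($content)
{
    // Encode every character as a decimal numeric entity, then look for
    // entities starting with 107x or 108x (1070-1089 covers Ю, Я and а through с).
    return preg_match('/&#10[78]\d/', mb_encode_numericentity($content, array(0x0, 0x2FFFF, 0, 0xFFFF), 'UTF-8'));
}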
So there is no warranty of any kind, but it may help you out. Basically, it encodes all characters as numeric entities and then checks for common Cyrillic characters.
Best!
I would get a list of the Russian alphabet and then check the input string with strstr(). For example:
$russianChars = array('з', 'я' /* ... etc. */);
foreach ($russianChars as $char) {
    // strstr() returns false when the character is not present
    if (strstr($input, $char) !== false) {
        // Russian char found in input, do something
    }
}
A good algorithm would probably act only after finding 3 or so Russian characters, to be more confident that the language is actually Russian. The same characters may show up in other languages, so I suggest doing some research on whether that applies to your case.
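The threshold idea can be sketched as follows. This swaps the per-character strstr() loop for a single preg_match_all() call with \p{Cyrillic}; the function names and the default threshold of 3 are assumptions for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: counts Cyrillic-script characters in a UTF-8 string.
function countCyrillic(string $text): int
{
    $count = preg_match_all('/\p{Cyrillic}/u', $text);
    // preg_match_all() returns false on failure (for example malformed UTF-8).
    return $count === false ? 0 : $count;
}

// Hypothetical helper: true once the text holds at least $threshold Cyrillic characters.
function hasEnoughCyrillic(string $text, int $threshold = 3): bool
{
    return countCyrillic($text) >= $threshold;
}
```

A single stray Cyrillic letter in an otherwise Latin description would then pass, while a sentence written in Russian would be flagged.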
SOURCE: http://zurb.com/forrst/posts/Convert_cyrillic_to_latin_in_PHP-vWz
function ru2lat($str) {
$tr = array(
"А"=>"a", "Б"=>"b", "В"=>"v", "Г"=>"g", "Д"=>"d",
"Е"=>"e", "Ё"=>"yo", "Ж"=>"zh", "З"=>"z", "И"=>"i",
"Й"=>"j", "К"=>"k", "Л"=>"l", "М"=>"m", "Н"=>"n",
"О"=>"o", "П"=>"p", "Р"=>"r", "С"=>"s", "Т"=>"t",
"У"=>"u", "Ф"=>"f", "Х"=>"kh", "Ц"=>"ts", "Ч"=>"ch",
"Ш"=>"sh", "Щ"=>"sch", "Ъ"=>"", "Ы"=>"y", "Ь"=>"",
"Э"=>"e", "Ю"=>"yu", "Я"=>"ya", "а"=>"a", "б"=>"b",
"в"=>"v", "г"=>"g", "д"=>"d", "е"=>"e", "ё"=>"yo",
"ж"=>"zh", "з"=>"z", "и"=>"i", "й"=>"j", "к"=>"k",
"л"=>"l", "м"=>"m", "н"=>"n", "о"=>"o", "п"=>"p",
"р"=>"r", "с"=>"s", "т"=>"t", "у"=>"u", "ф"=>"f",
"х"=>"kh", "ц"=>"ts", "ч"=>"ch", "ш"=>"sh", "щ"=>"sch",
"ъ"=>"", "ы"=>"y", "ь"=>"", "э"=>"e", "ю"=>"yu",
"я"=>"ya", " "=>"-", "."=>"", ","=>"", "/"=>"-",
":"=>"", ";"=>"","—"=>"", "–"=>"-"
);
return strtr($str,$tr);
}
Then:
echo ru2lat("текст по-русски"); // prints "tekst-po-russki" (spaces become hyphens)
If you have an input for your description named description, like this:
<input name="description"/>
Add a condition like this to the script that handles the form (PHPMailer or similar):
if (preg_match("/[А-Яа-яЁё]/u", $_POST['description'])) {
    echo "Sorry, no Russian descriptions allowed";
    die();
}
I know this is a little unrelated to PHP, but I had a similar problem with spam from a contact form. If your site is behind Cloudflare, you can limit the spam by checking which country the request comes from. You can then flag it as potential spam and verify later whether it is publishable.
I eventually started marking as spam everything that came from a country other than mine. I take a quick look to see if there is anything valuable there and delete the rest. I also tell the potential spammer that the reCAPTCHA was solved incorrectly, even when it was solved correctly. Over time the number of spam messages dropped significantly.
Cloudflare returns the country code in a header, and this value is available in the $_SERVER['HTTP_CF_IPCOUNTRY'] variable.
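A rough sketch of that check, assuming the site really is behind Cloudflare and the country header is being sent. The function names and the allow-list are made up for illustration; the $_SERVER array is passed in as a parameter so the logic can be tried without a live request.

```php
<?php
// Hypothetical helper: returns the country code Cloudflare attached to the
// request, or null when the header is absent or empty.
function requestCountry(array $server): ?string
{
    $code = $server['HTTP_CF_IPCOUNTRY'] ?? null;
    return (is_string($code) && $code !== '') ? strtoupper($code) : null;
}

// Hypothetical policy: a submission needs manual review unless it comes
// from one of the allowed countries.
function needsReview(array $server, array $allowedCountries): bool
{
    $country = requestCountry($server);
    return $country === null || !in_array($country, $allowedCountries, true);
}
```

In a form handler this might be called as needsReview($_SERVER, array('US')), with flagged submissions stored for later review instead of being published immediately.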