Is there any way to detect strings like putjbtghguhjjjanika?
People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question is there any way to detect strings that开发者_开发百科 similar to ones above ?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean the "gibberish searches". For example some people search strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect jibberish searches.
It doesn't matter if search result will be 0 or anything else. I can't use this logic.
Some new brands or products will be ignored if I will consider "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
Assuming you mean jibberish searches... It would be more trouble than it's worth. You are providing them with a search functionality, let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than just simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to stem off the gibberish searches some other way. For example, if it's comment spam phrases that people (or a script) is looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
As some people commented, there are no hits in google for tapoktrpasawe or putjbtghguhjjjanika (Well, there are now, of course) so if you have a way to do a quick google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer - Jibberish Search
Probabilistic Language Model works.
Logic
word is made up of sequence of characters, and if 2 characters come together more frequently and if we sum up all frequency of 2 contiguous characters coming together in word, and sum cross threshold limit (being an english word), it is said to proper english word. In brief, this logic is famous by Markov chains.
Link
For Mathematics of Gibberish and better understanding, refer to video https://www.youtube.com/watch?v=l15C8UJu17s . Thanks !!
If the search is performed on products, you could cache their names or codes and check them against that list before quering database. Else, if your site is for english users, you can build a dictionary of strings that aren't used in the english language, like qwkfagsd. Which, and agreeing with other answer, will be more resource intensive than if not there.
精彩评论