Google-alike search algorithm
I am trying to implement a search algorithm in my simple datastructure. However, this is not a "HOW DO I DO THIS?"-question, but rather a "how could i optimize the algorithm?"
I am trying to keep a index of files, and each file can be associated with any number of tags (which acts like a category)
开发者_如何学运维This is how my data is structured:
Entries:
------------------------------------
| id | description | short | score |
------------------------------------
Tags:
-------------
| id | text |
-------------
EntryTags:
-------------------
| entry_id | tag_id |
-------------------
In the search field, the search request, will always be turned into single words split apart with a plus (+).
in the following example i will be searching for "blue+website+simple+layout"
- split searchterm up into array named t
- convert each word in array t into a number using the id from "Tags" table
- for each element in array t, select make new array for each element with "EntryTags" matching the search
- generate array A, where elements that are in all 4 arrays are put into
- generate array B, where elements that are in 3 of the 4 arrays are put into
- generate array C, where elements that are in 2 of the 4 arrays are put into
- generate array D with the last elemenets rest
- sort array A,B,C and D by the score parameter from the table
- output array A, then B, then C, then D
ofcourse this is not optimized or anything, but my lack of experience with more complex SQL is kicking my butt :(
In the end, all of this will be written in PHP and the mysqli library (and i will ofcourse keep the thread updated as i progress further)
You could use a kind of Bloom filter (at least this is part of Google's strategy). First you look up for entries with all of the entered tags. If you find nothing, try out all combinations with one tag missing, then with two tags missing ... until you have enough matches. The lookup in the Bloom filter is very fast, so it's okay to do many lookups.
Woah, let's keep it simple (KISS), that's just too complicated and not flexible.
How about this : With SQL, do one search for each search term, and include a column that puts in a 'point' value for the term's particular relevance. Sum up the searches on this 'point' value and find the results that have the biggest relevance via 'points'.
Check this out : http://www.jarrodgoddard.com/web-development/advanced-web-site-search-with-sql
SELECT title, filename, sum(relevance)
FROM (
SELECT title, filename, 10 AS relevance FROM page WHERE title like ‘%about%’
UNION
SELECT title, filename, 7 AS relevance FROM page WHERE filename like ‘%about%’
UNION
SELECT title, filename, 5 AS relevance FROM page WHERE keywords like ‘%about%’
UNION
SELECT title, filename, 2 AS relevance FROM page WHERE description like ‘%about%’
) results
GROUP BY title, filename
ORDER BY relevance desc;
精彩评论