Algorithm to find all possible results using text search
I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My only interface for doing this is the site's search form: http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters, it only returns the first 1000. The text search does not seem to understand any of the wildcards *, ? or _.
If I search for the letter a, I get all the characters that have an a anywhere in their names rather than only the characters that start with a.
I'm guessing I could search for every two-letter combination (aa, ab, ba, etc.), but that has two problems:
- It doesn't guarantee that I will never get more than 1000 results.
- It doesn't seem very efficient, as many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm on how to construct my search text.
Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy will succeed in finding all the results. Suppose there were a character called "Ar": how would you find it? If you search for "ar", the results only go as far as names beginning with "Ak". If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)
The main question here is what you are planning to do with all those characters. What is the purpose of your program? Putting that aside, you can search for a single letter and filter by both main skill and world (using a double loop). It is highly unlikely that you will ever get more than 1000 hits that way for any consonant. If you want to search for names starting with a vowel, use a two-letter string of the form vowel + other_letter, in a loop that iterates other_letter from A to Z.
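The query plan described above could be sketched roughly as follows. This is only an illustration: `search_terms`, `build_queries`, and `merge_results` are hypothetical names, the world and skill lists are placeholders you would have to read off the search form, and treating only a, e, i, o, u as vowels is an assumption. No HTTP calls are made here; the sketch only builds the list of queries and shows one way to filter out duplicates afterwards.

```python
from itertools import product
from string import ascii_lowercase

VOWELS = "aeiou"  # assumption: 'y' is treated as a consonant


def search_terms():
    """Return single consonants plus every vowel + letter pair."""
    terms = []
    for letter in ascii_lowercase:
        if letter in VOWELS:
            # Vowels are too common on their own, so pair each with a-z.
            terms.extend(letter + other for other in ascii_lowercase)
        else:
            terms.append(letter)
    return terms


def build_queries(worlds, skills):
    """Combine every search term with every (world, skill) filter."""
    return list(product(search_terms(), worlds, skills))


def merge_results(batches):
    """Deduplicate (character_id, name) pairs collected from many queries."""
    seen = {}
    for batch in batches:
        for character_id, name in batch:
            seen.setdefault(character_id, name)
    return seen


if __name__ == "__main__":
    # Placeholder filter values, not real world or skill names.
    queries = build_queries(["world-1", "world-2"], ["skill-1"])
    print(len(queries), "queries, first:", queries[0])
```

Any query that still comes back with exactly 1000 results has probably hit the cap and would need a narrower term or filter.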
An additional optimization is to guess the page at which the entries for the needed letter start. If you know the total number of pages (TNOP), the list will start somewhere near page TNOP * LETTER / 27, where LETTER is the position of the letter in the alphabet.
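As a rough sketch of that estimate, assuming pages are numbered from 1, that A counts as letter 1, and that names are spread evenly over the alphabet (which real data will only approximate), the guess could be computed like this. `estimate_start_page` is a hypothetical helper name, and the divisor 27 is taken directly from the formula above.

```python
def estimate_start_page(total_pages, letter, divisor=27):
    """Guess the page where names starting with `letter` begin.

    Implements TNOP * LETTER / 27 with integer division and clamps
    the result to the valid page range 1..total_pages.
    """
    position = ord(letter.lower()) - ord("a") + 1  # a=1, ..., z=26
    if not 1 <= position <= 26:
        raise ValueError("letter must be in A-Z")
    page = total_pages * position // divisor
    return max(1, min(total_pages, page))


if __name__ == "__main__":
    for letter in "amz":
        print(letter, estimate_start_page(54, letter))
```

Since the result is only a starting guess, the crawler would still need to step forward or backward from that page until it finds the first entry for the letter.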