PHP proximity script - how to calculate the number of words/characters between two given terms/words?
Basically - I want to calculate the "proximity" of various terms. By "proximity" I mean specifically the number of spaces/characters/words that sit between them.
Example:
Terms = Word1 / Word2
Chunk = "blah Word1 blah blah blah blah blah Word2 blah"
Proximity = Word1-Word2:5
The script would see the 2 terms, locate them and then work out the distance based on the words that lie between them.
A more advanced version would be to examine the semantic structure - and identify whether the terms occur within the same semantic element, or a sibling, or a parent etc. Thus proximity discovery of terms may be within the same paragraph, or in sequential paragraphs, or under the same "parent" (heading) but otherwise separate etc.
Further - introducing things like word stemming/relationships/soundings at a later date may be useful too.
I've looked around the net (Google, here, PHP forums, PHP script sites) and am not seeing anything like it. I can see tools on some sites that do something similar (in a limited way) - usually SEO based tools. I want to be able to apply this to "text" in general ... as I may apply it to uploaded word/txt files etc.
I'm not seeing any real examples - so I can only assume it's more than a trifle to code it.
The question is - how can I do this? How would I handle variant order of the words (Word1+Word2 / Word2+Word1)? How could I handle identifying proximity within/outside of the same element/structure?
Hoping someone can shed some light/make some suggestions.
If you need to do a lot of this kind of search on a given text, you could begin by indexing the whole text into a database containing the word, its position in the text, and the paragraph number (if needed). Then, you could select all the Word1 and Word2 positions, and it shouldn't be too hard to infer the minimal distance.
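As a rough sketch of the indexing idea, the snippet below keeps the index in a PHP array rather than a database table. The function names `buildIndex` and `minWordsBetween` are made up for illustration, and the sketch assumes paragraphs are separated by blank lines and that the two terms are different single words.

```php
<?php
// Hypothetical helper: an in-memory stand-in for the database index.
// Each entry records a word's position in the text and its paragraph number.
function buildIndex(string $text): array
{
    $index = [];
    $position = 0;
    // Assumption: paragraphs are separated by one or more blank lines.
    $paragraphs = preg_split('/\R\s*\R/', trim($text));
    foreach ($paragraphs as $paragraphNo => $paragraph) {
        // Lowercase (ASCII only), then split on anything that is not a letter or digit.
        $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($paragraph), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $index[$word][] = ['pos' => $position, 'para' => $paragraphNo];
            $position++;
        }
    }
    return $index;
}

// Smallest number of words between any occurrence of $a and any occurrence
// of $b, in either order. Returns null when either word is missing.
function minWordsBetween(array $index, string $a, string $b): ?int
{
    $a = strtolower($a);
    $b = strtolower($b);
    if (!isset($index[$a], $index[$b])) {
        return null;
    }
    $posA = array_column($index[$a], 'pos');
    $posB = array_column($index[$b], 'pos');
    $best = null;
    $i = 0;
    $j = 0;
    // Both position lists are sorted, so a two-pointer sweep finds the closest pair.
    while ($i < count($posA) && $j < count($posB)) {
        $gap = abs($posA[$i] - $posB[$j]) - 1;
        if ($best === null || $gap < $best) {
            $best = $gap;
        }
        if ($posA[$i] < $posB[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $best;
}

$index = buildIndex("blah Word1 blah blah blah blah blah Word2 blah");
var_dump(minWordsBetween($index, 'Word1', 'Word2')); // int(5)
```

Because each entry also stores `para`, the same index could be used to check whether two occurrences fall in the same paragraph. With a real database, the two position lists would come from a query instead of `array_column`.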
Edit: Here is an attempt at a simple one-shot algorithm that does not use a database.
1. Remove any HTML and punctuation to keep only the words
2. Search for the first occurrence of Word1
3. Count the number of words (or chars, or spaces) until you reach the next occurrence of Word2
4. If you reach Word1 again before reaching Word2, restart the counter
5. Record the distance, then repeat steps 2-4 to get the other occurrences of Word1 and Word2
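The steps above could be sketched roughly as follows. The function name `proximities` is hypothetical, and the sketch goes slightly beyond the steps by also recording a distance when Word2 appears before Word1, since the question asks about either order. It counts words only (not characters or spaces) and assumes each term is a single word.

```php
<?php
// Hypothetical helper: returns the number of words between each adjacent
// pair of the two terms, in either order, in a single pass over the text.
function proximities(string $html, string $term1, string $term2): array
{
    // Strip HTML; a space is added before each tag so that words in
    // neighbouring elements do not get glued together.
    $text = strtolower(strip_tags(str_replace('<', ' <', $html)));
    // Drop punctuation by splitting on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $term1 = strtolower($term1);
    $term2 = strtolower($term2);

    $distances = [];
    $lastTerm = null; // which of the two terms was seen most recently
    $lastPos = null;  // and at which word position
    foreach ($words as $pos => $word) {
        if ($word !== $term1 && $word !== $term2) {
            continue;
        }
        if ($lastTerm !== null && $lastTerm !== $word) {
            // The other term was seen earlier: record the words in between.
            $distances[] = $pos - $lastPos - 1;
        }
        // Seeing the same term again just restarts the count from here.
        $lastTerm = $word;
        $lastPos = $pos;
    }
    return $distances;
}

print_r(proximities("blah Word1 blah blah blah blah blah Word2 blah", 'Word1', 'Word2'));
// Prints an array with the single value 5
```

Taking `min()` of a non-empty result gives the closest pair. Restricting matches to the same paragraph or heading would need the structure to be tracked before the tags are stripped, which this sketch does not attempt.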