javascript string word wrap on matches in another string
If I have a certain string input, that I want to compare to another string and wrap matches of the input string whithin the other string using the largest possible match. How would I best wrap the matches in a tag? This is a non-trivial question.
Basically, I want to match an entered string with another string, using a span tag to show matched portions of the target found in the entered string.
- Match from start of input string first (largest possible match)
- Partial match of a search word should be highlighted (see "barge", "barges" in example)
- Special character breaks should match "fred/dred" entered would be two words.
- The input string will vary depending upon what the users input.
- Match the input string from the start as a priority
Match each word where it occurs
if the user enters a string with multiple words, I want to wrap matches of them incrementally starting from the beginning where they occur in a second string. They might or might not have spaces at the start/end of the string that is entered. I would want the largest portion to be wrapped.
Example input strings:
"Brown cats cannot be white cats"
"blue pigs "
"large, charged/marged barge pigs"
I would want them wrapped as so:
"<span class='wrapper'>Brown cat开发者_开发问答s cannot be white cats</span>"
in the destination string where matches occur, even partial but with the largest possible match wrapped.
Example of a string to wrap:
"Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal"
Final string for each example input:
"Hi bill, <span class='wrapper'>brown cats cannot be white cats</span> and cows are not blue pigs, blue melons are large but not batteries charged barges with <span class='wrapper'>white cats</span> carry coal"
"Hi bill, brown cats cannot be white cats and cows are not <span class='wrapper'>blue pigs</span>, blue melons are large but not batteries charged barges with white cats carry coal"
"Hi bill, brown cats cannot be white cats and cows are not blue <span class='wrapper'>pigs</span>, blue melons are large but not batteries <span class='wrapper'>charged</span> <span class='wrapper'>barge</span>s with white cats carry coal"
Possible matches for: "Brown cats cannot be white cats"
"Brown cats cannot be white cats"
"Brown cats cannot be white"
"Brown cats cannot be"
"Brown cats cannot"
"Brown cats"
"Brown"
"Brown" "cats" "cannot" "be" "white" "cats"
If I were to simply wrap each matched word I could do:
function replaceWords(wordsy, text) {
var re = '(' + wordsy + ')(?![^<]*(?:<\/script|>))',
regExp = new RegExp(re, 'ig'),
sTag = "<span class='wrapper'>",
eTag = "</span>";
return text.replace(regExp, sTag + '$&' + eTag);
};
var matchstring = "Brown cats cannot be white cats";
var wrapstring = "Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal";
var words = myValue.split(" ");
var i = words.length; while (i--) {
wrapstring = replaceWords(words[i], wrapstring );
};
That does not satisfy the requirement for "largest match". I would want the largest possible match of any portion of the matchstring that occurs in the wrapstring.
Solutions using pure javascript or jquery or a combo are acceptable.
EDIT: Some have suggested KMP, here is example of KMP jsfiddle.net/y5yJY/2 but it does not, it its current form fit all criterial and does a single match.
I've got an interesting solution which should work as your original spec. It hasn't been stress tested, I'm not sure if you want to process a large amount of text or not, and it does quite a few regular expression matches. Not necessarily the cleanest or simplest solution, but it works an intended.
Features and limitations:
It handles most odd cases in the match string, such as repeated words, very similar or repeated phrases, etc.
At the moment you can't reliably have
[
and]
characters in the source string, as they're used internally. If it's a problem, you have to swap them to any other character or character combination before matching.For a match string of
N
words,2*N + 5
string replaces are done using regular expressions of varying complexity.It matches the words and phrases case-insensitive, ignoring any non-word characters. At the same time, it preserves mixed-case words and non-word characters in the result.
How it works:
First, it looks for every word separately, and appends their index in the match string to them in square brackets:
word[2]
. If the word appears multiple times, it appends all indices:word[3][2][1]
.Next, it finds and marks words that are not on wrapping boundaries by looking at the indices of the surrounding words. In a separate step, it removes the indices from those words. In the end
one[1] two[2] three[3]
will becomeone[1] []two three[3]
.All that's left now is to do some assumptions in a certain order, and wrap the words/phrases. Have a look at the code to see all the replaces that's done.
It is important, that after step one, we never match the words directly, from then on a word is just refered to as any number of word characters before [index]
or any number of word characters after []
. This makes sure we wrap repeating words/phrases correctly.
Have a look at this demo. I have added a hover effect, so you can see what words are grouped and wrapped together.
And here's the crazy code, enjoy!
var matchstring = 'Brown cats cannot be white cats';
var wrapstring = 'Hi bill, brown cats cannot be white cats and cows are not blue pigs, blue melons are large but not batteries charged barges with white cats carry coal, and the word "cannot" should match ';
// Pre-process matchstring to make it a flat list of words
// separated by single spaces.
matchstring = matchstring.replace(/\W+/g,' ');
var wrapStart = '<span class="wrapped">';
var wrapEnd = '</span>';
var matcharray = matchstring.split(' ');
var i, reg;
// Mark all matched words with indices
// one -> one[1]
for (i = 0; i < matcharray.length; i++) {
reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig');
wrapstring = wrapstring.replace(reg, '$&[' + i + ']');
}
// Mark all inner words
// one[1] two[2] three[3] -> one[1] []two[2] three[3]
for (i = 1; i < matcharray.length; i++) {
reg = new RegExp('\\b(\\w+)([\\]\\d\\[]*\\[' + (i - 1) + '\\][\\]\\d\\[]*)(\\W+)(\\w+)([\\]\\d\\[]*\\[' + i + '\\][\\]\\d\\[]*)(?=\\W+\\w+[\\[\\d\\]]*\\[' + (i + 1) + '\\])', 'ig');
wrapstring = wrapstring.replace(reg, '$1$2$3[]$4$5');
}
// Remove indices from inner words
// one[1] []two[2] three[3] -> one[1] []two three[3]
wrapstring = wrapstring.replace(/\[\](\w+)[\[\d\]]*/g, '[]$1');
// Start tags
// one[1] []two three[3] -> {one []two three[3]
wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+)\[\]/g, wrapStart + '$1$2[]');
// End tags
// {one []two three[3] -> {one []two three}
wrapstring = wrapstring.replace(/\[\](\w+\W+\w+)\[[\[\d\]]+\]/g, '$1' + wrapEnd);
// Wrap double words
// one[1] two[2] -> {one two}
wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\](\W+\w+)\[[\[\d\]]*\]/g, wrapStart + '$1$2' + wrapEnd);
// Orphan words
// unmatched matched[1] unmatched -> unmatched {matched} unmatched
wrapstring = wrapstring.replace(/(\w+)\[[\[\d\]]+\]/g, wrapStart + '$1' + wrapEnd);
// Remove left-over tags
// []word -> word
wrapstring = wrapstring.replace(/\[\]/g, '');
alert(wrapstring);
Matching partial words
As previously mentioned, after the first step words are only processed by their appended indices. This means if we want to do some clever matching instead of just whole words, we just need to modify the regex in the first for
loop. This is the piece of code with which we will fiddle in this section:
reg = new RegExp('\\b' + matcharray[i] + '\\b', 'ig');
The \b
in the regex means match word boundaries, i.e. beginning or end of a sequence of word characters. That's why the above \bword\b
regex only gives us whole words, as word
needs to be surrounded by word boundaries. But it doesn't have to be this way.
If we want to match all words in the text beginning with our keyword, we can change the above line to the following:
reg = new RegExp('\\b' + matcharray[i] + '\\w*\\b', 'ig');
This results in the regex \bword\w*\b
. It matches all word
character sequences followed by 0 or more further word characters (\w*
), surrounded by word boundaries. Note that backslashes need to be escaped within javascript strings (\\
means a single \
).
Depending on requirements, we can easily create further regex combinations:
\bword\w*\b
matches words beginning with a keyword.\b\w*word\b
matches words ending with a keyword.\b\w*word\w*\b
matches words containing a keyword.\b(\w*word|word\w*)\b
matches words ending or beginning with a keyword.
You can even say that you only want to match minor modifications of a words. For example \b\w{0,2}word\w{0,2}\b
will only match a word if it has at most a two letter prefix and/or suffix. So danger
will match endanger
, cat
will match cats
, but can
won't match cannot
, as that would be 3 extra letters.
Matching complex plural forms and irregular verbs is not easy, you could build a massive dictionary of irregular words on your server and pre-process the word, so if the user enters foot
, using the regex \b(foot|feet)\b
will match both forms. An easier solution would be to only care about regular words. For most words, matching \bword(s|es|)\b
will be enough to catch plurals, it matches word
, words
and wordes
as well. For words like fly
, the regex \bfl(y|ies)\b
will do the job. For words like index
, the regex \bind(ex|exes|ices)\b
will match most common forms.
As I'm not really a language expert, I'll just leave it at that for now.
Wildcards in input
Similarly to the above, it's very easy to add support for wildcards in the input string. Say we want to let the user enter ?
to mean any character. If the input is ?red
, we just need to replace ?
with \w
in our regex. For example \b\wred\b
will match fred
and dred
as well.
Just like the above, you can also use multiple wildcards, replace them with \w+
for one or more characters or \w*
for zero or more characters. \bf\w+d\b
will match fed
and feed
, \w*
will match fd
too.
What about this: (describes algorithm only, not written in code)
Imagine you have the two strings written out on two sheets of paper. Place the two sheets so that one is above the other. Move the top sheet off to the left, so that its last letter is on top of the bottom sheet's first letter. Now, do the two overlapping letters match? If so, you have a match of length 1. Record that as your longest match. Then, move the top sheet one character to the right. Now two letters overlap. Do they match? If so, you have a max match of size 2. Keep moving the top sheet over 1 character to the right, and each time, find the largest section of matched, overlapping characters. Always keep track of your biggest match. Keep going until your top sheet is so far over to the right that its first character overlaps with the other sheet's last character.
I don't know how easy this would be to implement in javascript, but as an algorithm, I think it is sound.
PS- for the bit where you need to find the "largest section of matched, overlapping characters", you could do something like this:
/* Note: str1 and str2 are the two overlapping portions of the strings */
var largestMatch = 0;
var currMatch = 0;
for (var i = 0; i < str1.length; i++) {
if (str1[i] == str2[i]) currMatch++;
else currMatch = 0;
largestMatch = Math.max(largestMatch, currMatch);
}
// largestMatch is the size of the largest section of matched characters
Here is what I did to address this: (looking for improvement as it is not perfect) (this is wrapped in a jQuery document ready) as here:http://jsfiddle.net/KvM47/
function findStringLimit(searchChar, searchCharIndex, searchedString) {
return searchedString.substring(0, searchedString.lastIndexOf(searchChar, searchCharIndex));
};
function replaceWords(wordsy, text) {
var re = '(' + wordsy + ')(?![^<]*(?:<\/script|>))',
regExp = new RegExp(re, 'ig'),
sTag = "<span class='wrappedWord'>",
eTag = "</span>";
return text.replace(regExp, sTag + '$&' + eTag);
};
var longstring = $('#mystring');
var htmlString =longstring .html(); // instance html
myValue = "Brown cats cannot be white cats";
myValue = myValue.replace(/^\s+|\s+$/g, "");//trim whitespace at each end
var words = myValue.split(" ");
var allPhrases = [];
allPhrases.push(myValue);
var i = words.length;
while (i--) {
allPhrases.push(findStringLimit(" ", allPhrases[(words.length - i) - 1].length, allPhrases[(words.length - i) - 1]));
};
var i = allPhrases.length;
while (i--) {
if (allPhrases[i] != "") words = words.concat(allPhrases[i]);
};
var i = words.length;
while (i--) {
htmlString = replaceWords(words[i], htmlString);
};
longstring.html(htmlString);
Things to improve:
- use other characters to separate words, not just spaces.
- make it more efficient
- better detection of "chunks" of strings (two or more words together) in the "search" and "matched" strings and handle those as well.
精彩评论