How to find a word that is broken up by HTML tags?
I'm writing a spell checker in JavaScript that uses an OpenOffice dictionary, and I have run into a serious problem.
I can find whole words using a regex, but if a word looks like prog<b>ram</b>ing, I can only find it after removing all HTML tags with jQuery's .text() method. But how can I then replace the word and rebuild the original HTML structure?
Spellchecker.com does this very smartly: its spell check recognizes even words like prog<b>ram</b>ing if they are misspelled!
/([\s>"'])prog(<[^>]+>)ram(<[^>]+>)ing([\s\.,:;"'<])/g
will match your example.
So, roughly, the following builds a regex that finds all instances of a word, even those broken up by HTML tags:
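// Allow zero or more tags between letters; backslashes are doubled because the pattern is a string.
var regExp = new RegExp('([\\s>"\'])' + word.split('').join('(?:<[^>]+>)*') + '([\\s.,:;"\'<])', 'g');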
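To make the idea concrete, here is a minimal sketch of a helper that builds such a regex for any word. The names escapeRegExp and buildTagTolerantRegExp are made up for illustration, and the boundary characters are an assumption carried over from the regex above, so adjust them to your input.

```javascript
// Hypothetical helper: escape regex metacharacters so letters are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a pattern that matches `word` even when
// HTML tags sit between its letters.
function buildTagTolerantRegExp(word) {
  var TAGS = '(?:<[^>]+>)*'; // zero or more tags between two letters
  var body = word.split('').map(escapeRegExp).join(TAGS);
  // Group 1 is the character before the word (or the start of the string),
  // group 2 is the word itself, tags included.
  return new RegExp('(^|[\\s>"\'])(' + body + ')(?=$|[\\s.,:;"\'<])', 'g');
}

var html = 'I like prog<b>ram</b>ing and programing.';
var matches = html.match(buildTagTolerantRegExp('programing'));
// matches holds both occurrences, each with its leading space
```

Note that each match includes the boundary character in front of the word, so use group 2 if you only want the word itself.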
God knows how that'll help you build a spellchecker, though. I suspect the approach used in spellcheckers is more like this: run the spell check assuming no HTML, and if a word contains HTML, strip it out using something like the method below, then spell check the resulting string as normal:
String.prototype.stripHtml = function() {
    // The g flag removes every tag, not just the first one.
    return this.replace(/<[^>]+>/g, '');
};
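Stripping alone loses the positions you need to put a correction back. One possible way around that, sketched below, is to record where each plain-text character came from while stripping, then rewrite only those characters. The function names stripWithMap and replaceWord are hypothetical, and the sketch assumes well-formed tags, no HTML entities, and a plain substring search rather than a whole-word match.

```javascript
// Strip tags while remembering where each plain-text character came from.
function stripWithMap(html) {
  var text = '';
  var map = []; // map[i] = index in html of text.charAt(i)
  var inTag = false;
  for (var i = 0; i < html.length; i++) {
    var ch = html.charAt(i);
    if (ch === '<') { inTag = true; }
    else if (ch === '>' && inTag) { inTag = false; }
    else if (!inTag) { text += ch; map.push(i); }
  }
  return { text: text, map: map };
}

// Replace the first occurrence of `wrong` with `right`, keeping all tags in place.
function replaceWord(html, wrong, right) {
  var stripped = stripWithMap(html);
  var start = stripped.text.indexOf(wrong);
  if (start === -1) { return html; }
  var first = stripped.map[start]; // the correction is written here
  var drop = {};                   // html indexes of the remaining old letters
  for (var k = 1; k < wrong.length; k++) { drop[stripped.map[start + k]] = true; }
  var out = '';
  for (var i = 0; i < html.length; i++) {
    if (i === first) { out += right; }
    else if (!drop[i]) { out += html.charAt(i); }
  }
  return out;
}

var fixed = replaceWord('I like prog<b>ram</b>ing.', 'programing', 'programming');
// fixed is 'I like programming<b></b>.'
```

The tags survive but end up empty, so the bold formatting on part of the word is lost. Spreading the corrected letters back across the original tags would need a smarter alignment between the old and new word.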
I would use something to pull out any HTML so that you are dealing with plain text. I can't speak for any tools like this in JavaScript, but I'm sure they exist. If you can find something to 'scrub' the HTML out of your content, you can run the search that way.
Try something like this (a Perl module): http://metacpan.org/pod/HTML::Scrubber