Split string by HTML entities?
My string contains a lot of HTML entities, like this:
"&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;"
I want to split it on the HTML entities into this:
Hello
everybody there
Can anybody suggest a way to do this, please? Maybe using regex?
It looks like you can just split on the regex &[^;]*;. That is, the delimiters are strings that start with &, end with ;, and contain anything but ; in between.
If you can have multiple delimiters in a row, and you don't want the empty strings between them, just use (&[^;]*;)+ (or, in general, the pattern (delim)+).
If you can have delimiters at the beginning or end of the string, and you don't want the empty strings they cause, then just trim them away before you split.
Example
Here's a snippet to demonstrate the above ideas (see also on ideone.com):
var s = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";
print (s.split(/&[^;]*;/));
// ,Hello,,everybody,,there,
print (s.split(/(?:&[^;]*;)+/));
// ,Hello,everybody,there,
print (
s.replace(/^(?:&[^;]*;)+/, "")
.replace(/(?:&[^;]*;)+$/, "")
.split(/(?:&[^;]*;)+/)
);
// Hello,everybody,there
var a = str.split(/\&[#a-z0-9]+\;/i);
should do it, although you'll end up with empty slots in the array when you have two entities next to each other.
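The empty slots mentioned above can be dropped after the split. Here is a minimal sketch, assuming the input is the question's string with its entities still encoded; the names `str`, `parts` and `words` are only illustrative:

```javascript
// Assumed input: the question's string with its entities still encoded.
var str = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";

// Splitting on single entities leaves "" between adjacent entities
// and at both ends of the array.
var parts = str.split(/&[#a-z0-9]+;/i);

// Keep only the non-empty pieces.
var words = parts.filter(function (piece) {
  return piece.length > 0;
});
```

Filtering after the split avoids having to trim leading and trailing entities first, at the cost of also discarding any empty strings you might have wanted to keep.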
Use split(/&.*?;(?=[^&]|$)/)
and drop the first and last elements of the result:
["", "Hello", "everybody", "there", ""]
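Dropping the first and last elements could be done with slice(1, -1). This is only a sketch: it assumes the string both starts and ends with an entity, as the question's example does, and the variable names are hypothetical:

```javascript
// Assumed input: starts and ends with an entity, so the split result
// has an empty string at each end.
var input = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";

// The lookahead makes a run of adjacent entities match as one delimiter.
var pieces = input.split(/&.*?;(?=[^&]|$)/);

// Drop the leading and trailing empty strings.
var trimmed = pieces.slice(1, -1);
```

If the string might begin or end with ordinary text, slice(1, -1) would discard real content, so check for empty end elements first in that case.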
>> "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;".split(/(?:&[^;]+;)+/)
['', 'Hello', 'everybody', 'there', '']
The regex is: /(?:&[^;]+;)+/
It matches an entity as & followed by one or more non-; characters, followed by a ;, and then matches one or more of those in a row as the split delimiter. The (?:expression) non-capturing syntax is used so that the matched delimiters don't get put into the result array (split() puts capture groups into the result array if they appear in the pattern).
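The difference between a capturing and a non-capturing group in split() can be seen in a small sketch. The input "a&amp;b" and the variable names are made up for illustration:

```javascript
// Made-up input with a single entity between two letters.
var sample = "a&amp;b";

// Capturing group: the matched delimiter is included in the result.
var withCapture = sample.split(/(&[^;]+;)/);

// Non-capturing group: only the text between delimiters is returned.
var withoutCapture = sample.split(/(?:&[^;]+;)/);
```

Here withCapture holds three elements (the two letters with the entity between them), while withoutCapture holds only the two letters.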