Regex to split on punctuation excluding URLs

2022-12-10 00:36

I'm trying to split a string on its punctuation, but the string may contain URLs (which conveniently have all the typical punctuation marks).

I have a basic working knowledge of RegEx, but not enough to help me out here. This is what I was using when I discovered the problem:

$text[$i] = preg_split('/[\.\?!\-]+/', $post->text);

(this also accounts for multiple consecutive punctuation characters: ellipses, !!!!, ????, ?!?, etc.)

How would I split a string on the punctuation while maintaining the integrity of URLs? Thanks!

Edit:

My apologies...an example would be something along the lines of a tweet:

"Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."

The results should look something like this:

[0] => "Blah blah blah?"
[1] => "A sentence."
[2] => "Here's a link: http://somelink.com?key=value ."

What you're doing here isn't quite splitting on punctuation, because you're trying to keep the punctuation in one of the split items. You're also attempting to discard the whitespace afterwards, but don't seem to have covered that in your question.

I would tackle this in the following way: split your input string with a regular expression which matches punctuation or a URL, and keep the pieces, including the separators. Then iterate over the items, and for each separator decide whether it was punctuation, in which case you can strip trailing whitespace and move it to the end of the previous item, or a URL, in which case you just join it with the preceding and following items.

In PHP, you can keep the delimiters using something like this:

$text[$i] = preg_split('/([\.\?!\-]+|https?:\/\/\S+)/', $post->text, -1, PREG_SPLIT_DELIM_CAPTURE);

where the PREG_SPLIT_DELIM_CAPTURE flag is explained in the documentation as:

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
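The "iterate over the items" step described above could be sketched roughly as follows. This is only a sketch under stated assumptions: the function name splitKeepingUrls is hypothetical, every run of punctuation is assumed to end a sentence, and anything from http:// or https:// up to the next whitespace is assumed to be a URL.

```php
<?php
// Hypothetical helper (not from the original answer): split into sentences,
// keeping the closing punctuation and leaving URLs intact.
function splitKeepingUrls(string $text): array
{
    // With one capturing group and no PREG_SPLIT_NO_EMPTY, the result
    // alternates: text, delimiter, text, delimiter, ...
    $parts = preg_split(
        '/([.?!\-]+|https?:\/\/\S+)/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $sentences = [];
    $current = '';
    foreach ($parts as $i => $part) {
        $current .= $part;
        $isDelimiter = ($i % 2 === 1);
        $isUrl = $isDelimiter && preg_match('/^https?:\/\//', $part) === 1;
        if ($isDelimiter && !$isUrl) {
            // A punctuation run closes the current sentence.
            $sentences[] = trim($current);
            $current = '';
        }
    }
    // Keep any trailing text that had no closing punctuation.
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }
    return $sentences;
}

$result = splitKeepingUrls(
    "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."
);
print_r($result);
```

Note that a URL immediately followed by a full stop (no space) would swallow the full stop, since \S+ runs to the next whitespace.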

Is there a pattern that your non-URL punctuation marks follow? In most English sentences, punctuation marks are followed (or sometimes preceded) by a space character. I don't know what your source text is like, but that MIGHT be a reliable way to do it, because the punctuation marks inside a URL will NOT have a space on either side. A URL could still END with a punctuation mark followed by a space, though, so it depends on the URLs you anticipate as well.
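As a rough illustration of that whitespace heuristic, the split can be made on whitespace that directly follows a punctuation mark. The character set [.?!] and the sample text are assumptions taken from the question's example; the hyphen is left out here because " - " inside a sentence would otherwise cause a split.

```php
<?php
$text = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value .";

// Split on whitespace that comes right after ., ? or !
// The lookbehind is a single character, so it is fixed-length.
$sentences = preg_split('/(?<=[.?!])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

The punctuation inside the URL is never followed by whitespace, so it is left alone; a URL that itself ends in "." followed by a space would still be treated as the end of a sentence.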

Another approach (if you don't mind doing this in stages) is to remove all of the URLs from the string and then do the rest of your processing on the result. That only works if you don't need the URLs. If you need to preserve them, you can add placeholder strings on either side of each URL, such as ">>>>http://placeholder.com<<<<", and then, when you split on punctuation, be sure to exclude any punctuation that occurs between >>>> and <<<<. Afterwards, you would have to remove the >>>> and <<<< markers.
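A staged version of this might look like the sketch below. It is a variation on the idea rather than a literal implementation: instead of wrapping each URL in >>>> and <<<<, it swaps the whole URL for a numbered token such as >>>>0<<<<, so the split never sees the URL's punctuation. The function name splitWithUrlPlaceholders is hypothetical, and the sketch assumes the input does not already contain text that looks like one of these tokens.

```php
<?php
// Hypothetical helper: mask URLs, split into sentences, then restore the URLs.
function splitWithUrlPlaceholders(string $text): array
{
    $urls = [];

    // Stage 1: replace each URL with a token such as >>>>0<<<<.
    $masked = preg_replace_callback(
        '/https?:\/\/\S+/',
        function (array $m) use (&$urls): string {
            $urls[] = $m[0];
            return '>>>>' . (count($urls) - 1) . '<<<<';
        },
        $text
    );

    // Stage 2: collect runs of text that end in punctuation (or at end of input).
    preg_match_all('/[^.?!]+(?:[.?!]+|$)/', $masked, $matches);

    // Stage 3: put the URLs back and trim surrounding whitespace.
    $sentences = [];
    foreach ($matches[0] as $piece) {
        $restored = preg_replace_callback(
            '/>>>>(\d+)<<<</',
            function (array $m) use ($urls): string {
                return $urls[(int) $m[1]];
            },
            $piece
        );
        if (trim($restored) !== '') {
            $sentences[] = trim($restored);
        }
    }
    return $sentences;
}

$result = splitWithUrlPlaceholders(
    "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."
);
print_r($result);
```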

This regex produces the example you've given:

/(?<!http[^\s]{0,2048})[\.\?\!\-]+\B/

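It looks for a run of your punctuation characters that is not preceded by 'http' plus up to 2048 non-whitespace characters, in other words punctuation that is not inside a URL. The trailing \B prevents a hyphenated word from causing a split. Note that this relies on a variable-length lookbehind, which not every engine accepts; PCRE, which PHP's preg functions use, has traditionally required lookbehinds to be fixed-length.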

but...

This input:

Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value.blah blah blah...

won't split value.blah into two... but I think any URL-matching regex would have the same problem, as 'value.blah' could be part of a valid URL. I think your data, coming from Twitter users, will be very inconsistent and therefore hard to clean up, even if you go for FrustratedWithFormsDes' second suggestion.
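For engines that reject the lookbehind above, one possible substitute (a swapped-in technique, not the pattern from this answer) is PCRE's (*SKIP)(*FAIL) idiom: match a URL first, then deliberately discard that match so the punctuation inside it is never considered. The sample text is the one from the question. Like the regex above, this drops the punctuation itself.

```php
<?php
$text = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value .";

// First alternative: consume a whole URL, then skip past it and fail.
// Second alternative: a punctuation run that is not followed by a word
// character (the \B keeps hyphenated words together).
$pattern = '/https?:\/\/\S+(*SKIP)(*FAIL)|[.?!\-]+\B/';

$pieces = array_map('trim', preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY));

print_r($pieces);
```

As with the lookbehind version, "value.blah" in the second input would stay in one piece, because \S+ treats everything up to the next whitespace as part of the URL.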

You can try:

/((?![.?!] ).)+[.?!]+/
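This pattern matches sentences rather than separators, so it would presumably be used with preg_match_all instead of preg_split. A minimal sketch, assuming the question's sample text and trimming the leading space each match picks up:

```php
<?php
$text = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value .";

// Consume characters until a ., ? or ! that is followed by a space,
// then take the closing punctuation run.
preg_match_all('/((?![.?!] ).)+[.?!]+/', $text, $matches);

// $matches[0] holds the full matches; trim the leading whitespace.
$sentences = array_map('trim', $matches[0]);

print_r($sentences);
```

Trailing text with no closing punctuation would not be matched at all, so it may need separate handling.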
