
How do you fix sentence spacing on extracted plain text from HTML?

I'm pulling articles from specific URLs and converting them to sentences, but the extracted body text randomly loses the whitespace between some sentences, resulting in:

Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.

Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.

Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.

Blindly inserting a space after every period in the example above would break the stock symbol.

Does anyone know the cause of this? I have tried several HTML and DOM parsers. I use Simple_DOM to grab the plaintext, but I get the same result if I do it manually or with any other parsing engine.


Unfortunately I don't have an approach for your specific question, but is it possible that the missing space between sentences is actually a linebreak (e.g. \n) that your text viewer (whatever it is) isn't showing you?

Perhaps try something like this, just to make sure:

var articleContent = ... // get content
articleContent = articleContent.replace(/\n/g, ' NEW LINE ');
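A slightly broader version of the same check, as a minimal sketch: it marks several invisible characters, not only \n, so you can see what actually sits between the sentences. The helper name revealWhitespace and the bracketed markers are made up for illustration.

```javascript
// Hypothetical helper: replace invisible characters with visible markers
// so that "missing" separators between sentences become obvious.
function revealWhitespace(text) {
  return text
    .replace(/\r/g, '[CR]')        // carriage return
    .replace(/\n/g, '[LF]')        // line feed
    .replace(/\t/g, '[TAB]')       // tab
    .replace(/\u00a0/g, '[NBSP]'); // non-breaking space
}

// Sample input (an assumption): a line feed where a space was expected.
const sample = 'Jane went to the store.\nShe bought a dog.';
console.log(revealWhitespace(sample));
```

If no marker shows up between two run-together sentences, the separator really is missing from the extracted text and was not just hidden by the viewer.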


Try doing:

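// Add a space after each period that lacks one, then remove it again inside parentheses such as (TY.JPN).
$str = trim(preg_replace('~(\([^()]*\.)\s([^()]*\))~', '$1$2', preg_replace('~\.(?=\S)~', '. ', $str)));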
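A single-pass variant of the same idea, sketched in JavaScript under two assumptions taken from the question's examples: stock symbols always appear inside parentheses, and a new sentence starts with an uppercase letter, optionally preceded by an opening quote. The function name fixSentenceSpacing is hypothetical.

```javascript
// Sketch: insert a missing space after sentence-ending punctuation,
// leaving anything inside parentheses, e.g. (TY.JPN), untouched.
function fixSentenceSpacing(text) {
  // Alternative 1 matches a whole parenthesized group and returns it unchanged.
  // Alternative 2 matches ., ! or ? directly followed by an optional quote
  // and an uppercase letter, and appends a space after the punctuation.
  const pattern = /\([^()]*\)|([.!?])(?=["']?[A-Z])/g;
  return text.replace(pattern, (match, punct) => (punct ? punct + ' ' : match));
}

console.log(fixSentenceSpacing(
  'Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.'
));
```

This will still split symbols that are not parenthesized, as well as initials such as U.S.A, so treat it as a starting point only. The same pattern should carry over to PHP with preg_replace_callback.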