Comparing two English strings for similarities

2023-03-29 02:09

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.

Fossil fuels are fuels formed by natural processes such as anaerobic decomposition of buried dead organisms. The age of the organisms and their resulting fossil fuels is typically millions of years, and sometimes exceeds 650 million years. The fossil fuels, which contain high percentages of carbon, include coal, petroleum, and natural gas. Fossil fuels range from volatile materials with low carbon:hydrogen ratios like methane, to liquid petroleum to nonvolatile materials composed of almost pure carbon, like anthracite coal. Methane can be found in hydrocarbon fields, alone, associated with oil, or in the form of methane clathrates. It is generally accepted that they formed from the fossilized remains of dead plants by exposure to heat and pressure in the Earth's crust over millions of years. This biogenic theory was first introduced by Georg Agricola in 1556 and later by Mikhail Lomonosov in the 18th century.

Second:

Fossil fuel reforming is a method of producing hydrogen or other useful products from fossil fuels such as natural gas. This is achieved in a processing device called a reformer which reacts steam at high temperature with the fossil fuel. The steam methane reformer is widely used in industry to make hydrogen. There is also interest in the development of much smaller units based on similar technology to produce hydrogen as a feedstock for fuel cells. Small-scale steam reforming units to supply fuel cells are currently the subject of research and development, typically involving the reforming of methanol or natural gas but other fuels are also being considered such as propane, gasoline, autogas, diesel fuel, and ethanol.

That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.

I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.

In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.

I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.

Hope this helps!

You can also have a look at the Latent Dirichlet Allocation (LDA) model from machine learning. The idea is to find a low-dimensional representation of each document (or paragraph) as a distribution over some 'topics'. The model is trained in an unsupervised fashion on a collection of documents/paragraphs.

If you run LDA on your collection of paragraphs, you can then compare the hidden topic vectors of any two paragraphs to judge whether they are related.
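To make that concrete, here is a rough sketch of LDA fitted by collapsed Gibbs sampling in plain Python, followed by a comparison of the resulting topic vectors. Everything named here is my own assumption for illustration: the function names `lda_gibbs` and `hellinger_similarity`, the hyperparameters `alpha` and `beta`, the choice of two topics, and the three-document toy corpus. Hellinger similarity is just one reasonable way to compare two distributions. A corpus this small will not give meaningful topics; you would need a much larger collection of paragraphs.

```python
import math
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic distribution per doc."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_vocab = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_vocab for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                           # all tokens in topic k
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][vocab[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = vocab[w], assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions (each row sums to 1).
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def hellinger_similarity(p, q):
    """1 minus the Hellinger distance: 1.0 for identical distributions."""
    dist = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))
    return 1.0 - dist


# Hypothetical toy corpus, already tokenized.
docs = [
    "fossil fuels include coal petroleum and natural gas".split(),
    "a reformer reacts steam with natural gas to make hydrogen".split(),
    "the orchestra performed a symphony in the concert hall".split(),
]
theta = lda_gibbs(docs, num_topics=2)
sim = hellinger_similarity(theta[0], theta[1])
print(theta, sim)
```

In practice you would probably reach for an existing implementation (scikit-learn and gensim both ship one) rather than hand-rolling the sampler; the sketch is only meant to show what the "topic vector" is and how two of them get compared.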

Of course, the baseline is to skip LDA and instead use term frequencies (weighted with tf-idf) to measure similarity (the vector space model).
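A minimal sketch of that baseline, using only the standard library: build tf-idf weighted term vectors and compare them with cosine similarity. The helper names (`tokenize`, `tfidf_vectors`, `cosine`) and the smoothed idf formula are my own choices, not something prescribed above. The first two texts are sentences taken from the paragraphs in the question; the third is an unrelated sentence I made up for contrast.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse {term: weight} tf-idf vector per input text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf (an assumption; other variants exist).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


texts = [
    "The fossil fuels, which contain high percentages of carbon, "
    "include coal, petroleum, and natural gas.",
    "Fossil fuel reforming is a method of producing hydrogen or other "
    "useful products from fossil fuels such as natural gas.",
    "An orchestra performed a symphony in a concert hall.",  # made-up contrast
]
vectors = tfidf_vectors(texts)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

Keep in mind this only measures word overlap, so two paragraphs that share vocabulary (like the two in the question) will score as similar even when they cover different topics. That limitation is the reason to look at topic models at all.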

