Parsing / Extracting Text from String in Rails?

2023-03-15 12:17

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy".

Is this a matter of using Regex and lifting the text between "#books" to "."?

What if there's no structure to the message, like: "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"? How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex ante?

Are there any gems, methods, etc. that can help me do this?

At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.

--- edit --- based on @rogeliog's suggestion, I will add the following:

I can live with the garbage text that comes after #books, but nothing before. I tried ".match(/#books.*/)" -- results here: www.rubular.com/r/gM7oSZxF5M.

But how can I capture Result #6? (e.g., when someone puts #books at the end of the sentence)?

Is there a way for me to do an if-then with regex? Something like:

if [#books is at the end of the message],

then [take the last 10 words preceding #books],

else [.match(/#books.*/)]

If you offer a regex, please post your solution via a permalink using rubular.com.

I think what you're going to need is Natural Language Processing. It's a very large field and has many techniques and applications. With Ruby in particular you may want to look at the Ruby Linguistics project.

Good luck to you. Parsing and processing natural language is not an easy thing to do.

I think you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That would help a lot.
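If such a list of titles is available, a lookup avoids guessing where the title starts and ends. The following is only a minimal sketch: `KNOWN_TITLES` and `find_known_title` are hypothetical names, and the array stands in for whatever table or model actually holds the titles.

```ruby
# Hypothetical stand-in for a database table of known titles.
KNOWN_TITLES = [
  "War & Peace by Leo Tolstoy",
  "Anna Karenina by Leo Tolstoy"
].freeze

# Returns the first known title that appears in the text
# (case-insensitive), or nil when none is found.
def find_known_title(text, titles = KNOWN_TITLES)
  haystack = text.downcase
  titles.find { |title| haystack.include?(title.downcase) }
end

messages = [
  "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!",
  "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
]
messages.each { |message| puts find_known_title(message).inspect }
```

This only finds titles written exactly as stored, so misspellings or abbreviated titles would still need fuzzier matching.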

To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!") you can simply do:

"This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book".match(/#books.*?\./).to_s.sub("#books", '')

That will return: " War & Peace by Leo Tolstoy."
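A variation on the same idea, offered as a sketch rather than a definitive fix: a capture group can return just the title, without the leading space or the trailing period. It assumes the title itself contains no period, so something like "Dr. Zhivago" would be cut short.

```ruby
text = "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"

# String#[] with a regex and a group index returns that capture group.
# The lazy (.*?) stops at the first period after "#books".
title = text[/#books\s*(.*?)\./, 1]

puts title.inspect
```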

If you want an if/else statement that depends on whether #books is at the end or not, you can do:

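if text.match(/#books\s*$/)
  # #books closes the message: take up to 10 words before it
  puts text.match(/(\S+\s+){0,10}#books\s*$/).to_s.sub(/#books\s*$/, '').strip
else
  # otherwise take whatever follows #books
  puts text.match(/#books.*/).to_s.sub("#books", '').strip
end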

That will give you the last 10 words preceding #books if #books is at the end, and whatever comes after #books if it is not at the end.
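The same branching can be wrapped in a method so it is easy to try against several messages. This is a sketch under assumptions: `text_near_hashtag`, its `tag` parameter, and its `max_words` parameter are hypothetical names, and it returns up to `max_words` words (fewer if the message is short) rather than requiring exactly ten.

```ruby
# Hypothetical helper: returns the text associated with a hashtag.
def text_near_hashtag(text, tag = "#books", max_words = 10)
  escaped = Regexp.escape(tag)
  if text.match?(/#{escaped}\s*\z/)
    # Tag closes the message: keep up to max_words words before it.
    text[/((?:\S+\s+){0,#{max_words}})#{escaped}\s*\z/, 1].to_s.strip
  else
    # Otherwise keep everything after the tag (empty string if no tag).
    text[/#{escaped}(.*)/m, 1].to_s.strip
  end
end

[
  "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!",
  "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
].each { |message| puts text_near_hashtag(message) }
```

Either branch still returns surrounding words along with the title, which is the trade-off the question already accepts for text after #books.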

I don't really have a better idea. Hope that works for you; let me know. :)

