
Regex puzzle for parsing structured posts

Assume a person posts this message:

"#books 'War and Peace' by Leo Tolstoy - I love this book."

I want to parse this into three variables, like this:

@title = "War and Peace"

@author = "Leo Tolstoy"

@comment = "I love this book"

I'm sure this is a simple puzzle for a Regex Ninja. Unfortunately, I am but a lowly villager that mops the bloody, sweaty floors upon which real Regex Ninjas train.

BONUS points if you can suggest a regex that does not require so much structure in the message post. Ideally, I want to obtain the same three variables without the structure (or at least with less structure / requirements): "@title" by @author - @comment.

Thanks!


regex = /'(.+)'\s+by\s+(.+)\s+-\s+(.+)/
"#books 'War and Peace' by Leo Tolstoy - I love this book.".scan(regex)

=>

[["War and Peace", "Leo Tolstoy", "I love this book."]]
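To end up with three separate values instead of a nested array, one option is String#match with named captures. This is only a sketch, assuming the same "'title' by author - comment" layout as the example post. The helper name parse_post, the group names, and the tighter [^']+ for the title are illustrative choices, not part of the answer above.

```ruby
# Hypothetical helper: returns a hash with the three fields, or nil when the
# post does not follow the "'title' by author - comment" layout.
def parse_post(post)
  m = post.match(/'(?<title>[^']+)'\s+by\s+(?<author>.+?)\s+-\s+(?<comment>.+)/)
  return nil unless m

  { title: m[:title], author: m[:author], comment: m[:comment] }
end

fields = parse_post("#books 'War and Peace' by Leo Tolstoy - I love this book.")
puts fields[:title], fields[:author], fields[:comment] if fields
```

Note that [^']+ stops at the first apostrophe, so a title that itself contains one would need a different delimiter or the greedy (.+) from the original pattern.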


I don't know Ruby syntax, but the regex itself for the format you gave would look something like this:

#books\s'([^']+)'\s+by\s+([^-]+)-\s+(.*)

But to answer your question about not making it so dependent on format: ideally you should make it three separate fields to fill out. Or, if it's general content in a message post and you are looking for a specific format (a bit like BBCode), then I would suggest something more like

[book title='title' author='author']comment[/book]

That would be much easier to parse.
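As a rough illustration of why the tagged form is easier, here is a sketch in Ruby that pulls every such tag out of a longer message. It assumes the exact attribute order and single-quoted values shown above; the helper name parse_book_tags is hypothetical.

```ruby
# Hypothetical helper for the suggested tag format:
#   [book title='...' author='...']comment[/book]
# Returns one hash per tag found in the text (an empty array if none).
def parse_book_tags(text)
  tag = %r{\[book\s+title='(?<title>[^']*)'\s+author='(?<author>[^']*)'\](?<comment>.*?)\[/book\]}m

  text.scan(tag).map do |title, author, comment|
    { title: title, author: author, comment: comment.strip }
  end
end

message = "Hi! [book title='War and Peace' author='Leo Tolstoy']I love this book.[/book]"
parse_book_tags(message).each { |book| puts book[:title], book[:author], book[:comment] }
```

Because the tag has explicit start and end markers, the comment can contain hyphens or the word "by" without confusing the parser.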


(["'])(?<title>[^"']*)\1\s+by\s+(?<author>[\p{L}\s']+)\s*-\s*(?<comment>.*)$

About the bonus question: it is impossible to implement using only a regex. A regular expression, by definition, matches regular patterns, and a free-form sentence may not follow any regular pattern.


An alternate answer:

You could pick a delimiter that you know isn't going to show up very often and just split the string by that. And then enforce the standard/assumption of which order the values will be in (which you are more or less already doing). So for instance, you could have people post

"War and Peace ~ Leo Tolstoy ~ I love this book"

and then just explode/split at the ~ and assume the first element is the title, the second the author, and the third the comment.
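A minimal Ruby sketch of that split approach, assuming "~" as the delimiter and the title / author / comment order described above. The helper name split_post and the choice to return nil on too few fields are assumptions for illustration.

```ruby
# Hypothetical helper: splits a "title ~ author ~ comment" post on the delimiter.
# The limit of 3 keeps any later delimiter characters inside the comment.
def split_post(post, delimiter = "~")
  title, author, comment = post.split(delimiter, 3).map(&:strip)
  return nil if comment.nil? # fewer than three fields were supplied

  { title: title, author: author, comment: comment }
end

fields = split_post("War and Peace ~ Leo Tolstoy ~ I love this book")
puts fields[:title], fields[:author], fields[:comment] if fields
```

Passing a limit to split means only the first two delimiters are significant, so a stray "~" in the comment does not shift the fields.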


/["'](.*?)["'] by (.*?)\s+-\s+(.*)/