开发者

Coding a domain specific text generator

A friend of mine is in the real estate business and after being showed the art of writing copy for real estate ads, I realized that it is very formulaic. Especially when advertising online as there are predefined fields you fill in.

Naturally, I thought about creating a generator that pretty much automates writing 开发者_运维技巧the ads. i don't expect it to generate outstanding or even very good copy, just that it can put together words and sentences like a human would.

I have a skeleton/template that defines an ad and I've also put together a set of phrases and words that can be randomly selected, but I am interested in more general aspects of coding such a generator? Any suggestions, tips or literature that I can read to better understand this little project better?


using metadata about the listing would be one way.

Say for a given house, you have these attributes:

(type: bungalo, sq feet: <= 1400) You could use the phrase "cozy cottage".

bedrooms: obvious, same thing with bathrooms. Assume using the word Large, medium, etc.

garage spots: if > 2 then "Can park many vehicles", etc.

You could go even further with this given the lat/lon for the address, there are web services that you can find the amount of parks nearby, crime in the neighborhood, etc.

Rick


I'd say there are three basic approaches you could take to a problem like this, depending on how flexible you want the system to be and on how much work you want to put into it. The simplest is to treat it as a report generation problem, along the lines of Rick's suggestion. That's probably the way I'd go to produce a first draft of a listing. The results would be pure boilerplate, but each listing could be quickly punched up by the copywriter.

If you wanted to get fancy, though, you could come at it as a natural language generation problem. You'd start with some kind of a knowledge representation describing the meaning of the listing and set of rules (finite state transducers, say) for mapping meanings to linguistic forms. There's a sizable academic literature on that kind of stuff, though it's kind of out of fashion these days. Places to start might be Blackburn & Bos's book or the NLTK suite (especially some of the projects in the contrib package).

The third way of doing it would be to treat it as a translation problem, essentially "translating" database entries into ad copy. You'd start with a large collection of listings and the corresponding human-written ads and construct a statistical model of the relationship between the two. Moses/Giza++ is a general purpose tool for building and applying such models.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜