How best to store data for a chatbot?
I was looking around the internet at chatbots. At first it was just for fun, but now I like the subject so much that I want to develop my own chatbot.
But the first task is to find a good way to manage the "brain" of my chatbot. I think saving everything in an XML file is the best solution, isn't it? So the file type is settled; what remains is the relationships between different nouns and so on. When I have a noun, e.g. a tree, what is the best way to store that a tree has leaves, branches, and roots, and that a tree needs water and sunlight to survive? Should I save it like below, or some other way? This would be my XML for the tree example:
<nouns>
  <noun id="noun_0">
    <name>tree</name>
    <relationship>
      <has>noun_1</has>
      <has>noun_2</has>
      <has>noun_3</has>
      <need>noun_4</need>
      <need>noun_5</need>
    </relationship>
  </noun>
  <noun id="noun_1">
    <name>root</name>
  </noun>
  <noun id="noun_2">
    <name>branch</name>
    <relationship>
      <has>noun_3</has>
    </relationship>
  </noun>
  <noun id="noun_3">
    <name>leaf</name>
  </noun>
  <noun id="noun_4">
    <name>water</name>
  </noun>
  <noun id="noun_5">
    <name>light</name>
  </noun>
  . . .
</nouns>
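For what it's worth, here is a minimal sketch of how that layout could be read back and queried with Python's standard-library xml.etree.ElementTree module; the file name brain.xml is an assumption:

import xml.etree.ElementTree as ET

doc = ET.parse("brain.xml")
nouns = doc.getroot()  # the <nouns> element

# Build an id -> name lookup table.
names = {n.get("id"): n.findtext("name") for n in nouns.findall("noun")}

# For each noun, resolve its <has>/<need> references to names.
for noun in nouns.findall("noun"):
    rel = noun.find("relationship")
    if rel is None:
        continue
    for link in rel:  # the <has> or <need> children
        print(noun.findtext("name"), link.tag, names[link.text])
# prints e.g.: tree has root, tree has branch, ..., tree need water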
Data Storage Choices: It Depends
Simple, non-learning bots: XML is fine
It looks like you already have a basic XML structure worked out. For just starting out, I'd say that's fine, especially for support-chat kinds of bots (e.g. if userMsg.contains('legal') then print('TOS & Copyright...')). A toy sketch of that rule-based style follows below.
Of course, switching to any new format later will cost time and add overhead.
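Here is the toy rule-based responder mentioned above, in Python; the keywords and replies are made up for illustration:

# A tiny keyword -> canned-reply bot; rules are checked in order.
RULES = [
    ("legal", "TOS & Copyright: see our terms page."),
    ("hello", "Hi! How can I help you?"),
]

def respond(user_msg: str) -> str:
    msg = user_msg.lower()
    for keyword, reply in RULES:
        if keyword in msg:
            return reply
    return "Sorry, I don't understand."

print(respond("I have a legal question"))  # -> the TOS & Copyright reply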
Learning, complicated bots: database!
If you're looking to do something much larger, especially if you have CleverBot in mind, I think you're going to need a database: a flat file becomes gigantic, and trying to keep it all available in memory is resource-intensive. For that kind of project, I'd recommend a database.
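As a minimal sketch, the same tree facts could look like this in a relational database, using Python's built-in sqlite3 module (the table and column names are my own):

import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real bot
conn.executescript("""
    CREATE TABLE nouns (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE relations (
        subject INTEGER REFERENCES nouns(id),
        verb    TEXT,              -- e.g. 'has' or 'need'
        object  INTEGER REFERENCES nouns(id)
    );
""")
for name in ("tree", "root", "branch", "leaf", "water", "light"):
    conn.execute("INSERT INTO nouns (name) VALUES (?)", (name,))
conn.executemany(
    "INSERT INTO relations VALUES "
    "((SELECT id FROM nouns WHERE name=?), ?, "
    " (SELECT id FROM nouns WHERE name=?))",
    [("tree", "has", "root"), ("tree", "has", "branch"),
     ("tree", "has", "leaf"), ("tree", "need", "water"),
     ("tree", "need", "light"), ("branch", "has", "leaf")],
)

# "What does a tree have?" becomes a simple query instead of a file scan:
rows = conn.execute("""
    SELECT o.name FROM relations r
    JOIN nouns s ON s.id = r.subject
    JOIN nouns o ON o.id = r.object
    WHERE s.name = 'tree' AND r.verb = 'has'
""").fetchall()
print([r[0] for r in rows])  # e.g. ['root', 'branch', 'leaf']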
Why? English is Complicated
A while back I wrote a naive Bayes spam sorter. It took about 10,000 pieces of spam to "train" it at a 7% accuracy rate, which took about 6 hours and 1.5 GB of RAM to hold the data in memory. That's a lot of data. English is very hard and can't really be broken down into if 'pony' then 'saddle' rules, so for a bot to "learn" the best responses, your database is going to become massive, and very quickly.
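To make that concrete, here is a toy word-count core of a naive Bayes filter in Python; the tiny training corpora are made up, and the point is that a real filter keeps per-class counts for every distinct word it has ever seen, which is where the memory goes:

import math
from collections import Counter

spam = ["buy cheap pills now", "cheap pills cheap"]
ham = ["meeting at noon", "see you at the meeting"]

spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts = Counter(w for msg in ham for w in msg.split())
vocab = set(spam_counts) | set(ham_counts)

def log_prob(msg: str, counts: Counter) -> float:
    total = sum(counts.values())
    # Laplace smoothing so unseen words don't zero out the product.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in msg.split())

msg = "cheap pills"
print("spam" if log_prob(msg, spam_counts) > log_prob(msg, ham_counts)
      else "ham")  # -> spam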
I think we can model this information as an ontology. You can encode much richer information in terms of relations, attributes, levels, etc. There are formats like RDF and OWL that you can use, and they are supported by almost all languages.
And most importantly, managing the data becomes easy if you use an ontology editor; I would recommend Protege (http://protege.stanford.edu/). Take a look at it.
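As a rough sketch, the tree facts from the question could be written as RDF triples with the third-party rdflib package (pip install rdflib); the example.org namespace and the property names are placeholders:

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/chatbot/")
g = Graph()

# Each fact is a (subject, predicate, object) triple.
for part in ("root", "branch", "leaf"):
    g.add((EX.tree, EX.has, EX[part]))
for need in ("water", "light"):
    g.add((EX.tree, EX.needs, EX[need]))
g.add((EX.branch, EX.has, EX.leaf))

print(g.serialize(format="turtle"))  # human-readable Turtle (rdflib 6+ returns a str)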
You could also try something like the graph database that Freebase uses to store relations between various entities. Basically, it is a graph of nodes and edges: each node has attributes and values for those attributes, the edges have attributes similar to nodes, and an edge connecting two nodes defines a relationship between them.
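A small sketch of that nodes-and-edges-with-attributes model, using the third-party networkx package (the node names and attribute keys are purely illustrative):

import networkx as nx

G = nx.DiGraph()
G.add_node("tree", kind="noun")
G.add_node("leaf", kind="noun")
G.add_node("water", kind="noun")
G.add_edge("tree", "leaf", relation="has")    # the edge carries the relation
G.add_edge("tree", "water", relation="need")

# Everything a tree is related to, with the relation read off the edge:
for _, obj, attrs in G.out_edges("tree", data=True):
    print("tree", attrs["relation"], obj)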
You are probably looking at a database. Any serious NLP system would use one, unless you have a rule-based thing that operates on a small set of rules. Think about whether you would want to write a piece of C code that handles a 5 MB XML file; I most definitely would not. Stanford University hosts a nice demo if you are interested in the linguistic side of it.