Structured text and unstructured text
With respect to data mining, what are the differences between structured text and unstructured text? What are the major considerations when choosing/developing data mining approaches for analyzing these different kinds of text?
I'll preface this by saying that the specific domain you are dealing with matters a great deal when answering these types of questions. Adding some context to your question will allow much more helpful responses.
The central difference between structured and unstructured text, in the general case, is that structured text already has an easily digested form and unstructured text does not, so unstructured text has to be processed into such a form before it can be analyzed. For text mining, that processing can be as simple as a bag-of-words model (how many times does each word occur?) or as complicated as NLP approaches that attempt to pull out deeper language structure, such as part-of-speech tagging or entity detection/resolution. An everyday example of structured data is the metadata of a post on Twitter (username, time stamp, retweet info, etc.); the related unstructured data is the text of the post itself.
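To make the Twitter example concrete, here is a minimal sketch of the simplest case, a bag-of-words count. The post record, its field names, and the `bag_of_words` helper are all made up for illustration and do not reflect any real Twitter API; the tokenizer is a deliberately naive regular expression.

```python
from collections import Counter
import re

# Hypothetical post: field names and values are invented for illustration.
post = {
    "username": "example_user",
    "timestamp": "2011-03-01T12:00:00Z",
    "retweet_count": 3,
    "text": "Data mining is fun, and text mining is fun too!",
}

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Structured fields can be read off directly.
retweets = post["retweet_count"]

# The unstructured text needs a processing step before it yields numbers.
counts = bag_of_words(post["text"])

print(retweets)
print(counts["mining"], counts["data"])
```

The structured fields are usable as-is, while the text only becomes countable after a tokenization choice has been made, and that choice (case folding, punctuation handling, and so on) is itself a modeling decision.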
Without knowing exactly what you are interested in, a major consideration is that structured text is often already in a convenient form for simple machine learning models, while unstructured text rarely is: it cannot easily be treated as a set of binary/real-valued features and thrown into your favorite statistical model without some feature extraction first.
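As a rough sketch of what "treating text as binary/real-valued features" can look like, the snippet below maps a couple of toy documents onto fixed-length vectors over a shared vocabulary. The helper names and the two example documents are assumptions for illustration only; a real project would more likely use an existing library's vectorizer.

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase, keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(documents):
    """Assign each distinct token a fixed column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Real-valued features: how often each vocabulary token occurs."""
    vec = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec

def to_binary_vector(text, vocabulary):
    """Binary features: whether each vocabulary token occurs at all."""
    return [1 if c > 0 else 0 for c in to_count_vector(text, vocabulary)]

# Toy documents, invented for illustration.
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)  # columns: cat, dog, sat, saw, the
count_vectors = [to_count_vector(d, vocab) for d in docs]
binary_vectors = [to_binary_vector(d, vocab) for d in docs]

print(count_vectors)
print(binary_vectors)
```

Once every document is a vector of the same length, it can be handed to the same kinds of models that consume structured data, at the cost of discarding word order and most of the language structure.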
Hope this helps on a high level -- feel free to update the original post with details if I'm being too broad with my response =)