Is information a subset of data?
I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.
That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning开发者_开发问答. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?
The words data
,information
and knowlege
are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing; In the field of information theory these have no meaning at all, because all three are the same thing: a collection of "information" (in the Information-theoric sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.
Information is obtained (or sometimes induced) from data, but it can be richer, as well a cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set theory sense, Information is not a subset of Data, but a separate set [which typically intersects, somewhat, with the data but also can have elements of its own].
Knowledge (sometimes called insight) is yet another level up, it is based on information and too is not a [set theory] subset of information. Indeed Knowledge typically doesn't have direct reference to information elements, but rather tells a "meta story" about the information / data.
The unfounded idea that along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones, probably stems from the fact that there is [typically] a reduction of the volume of [IT sense] information. But qualitatively this info is different, hence no real [set theory] subset relationship.
Example:
Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example the price info may sometimes be expressed in a text string with 1/32th of a dollar precision, in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also the field which indicate, say, the buyer ID, or seller ID may include typos, and hence point to the wrong seller/buyer. etc.A spreadsheet made from the above is ... Information
Various processes were applied to the data:
-cleaning / correcting various values
-cross referencing (for example looking up associated codes such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
-merging when duplicate records pertaining to the same event (but say from different sources) are used to corroborate each other, but are also combined in one single record.
-aggregating: for example making the sum of all transaction value for a given stock (rather than showing all the individual transactions.
All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
Note that while being more convenient to use, in part more exact/precise, and also boiled down, there is not real [IT sense] information in there which couldn't be located or computed from the original by relatively simple (if only painstaking) processes.- An financial analyst's report may contain ... knowledge
For example if the report indicate [bogus example] that whenever the price of Oil goes past a certain threshold, the value of gold start declining, but then quickly spikes again, around the time the price of coffee and tea stabilize. This particular insight constitute knowledge. This knowledge may have been hidden in the data alone, all along, but only became apparent when one applied some fancy statistically analysis, and/or required the help of a human expert to find or confirm such patterns.
By the way, in the Information Theory sense of the word Information, "data", "information" and "knowlegde" all contain [IT sense] information.
One could possibly get on the slippery slope of stating that "As we go up the chain the entropy decreases", but that is only loosely true because
- entropy decrease is not directly or systematically tied to "usefulness for human"
(a typical example is that a zipped text file has less entropy yet is no fun reading) - there is effectively a loss of information (in addition to entropy loss)
(for example when data is aggregate the [IT sense] information about individual records get lost) - there is, particular in the case of Information -> Knowlege, a change in level of abstration
A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge
chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli
in a comment below provides the example of the spell checker, i.e. when the focus is on English orthography, the most insightful paper from a Wallstreet genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain.
Similarly, a linguist using thousands of newspaper articles which typically (we can hope...) contain at least some insight/knowledge (in the general sense), may just consider these articles raw data, which will help him/her create automatically French-German lexicon (this will be information), and as he works on the project, he may discover a systematic semantic shift in the use of common words betwen the two languages, and hence gather insight into the distinct cultures.
Define information and data first, very carefully.
What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the the ISP it's just data to be passed on.
Sometimes just adding the right context changes data to information.
So, to answer you question: No, information is not a subset of data. It could be at least the following.
A superset, when you add context
A subset, needle-in-a-haystack issue
A function of the data, e.g. in a digest
There are probably more situations.
This is how I see it...
Data is dirty and raw. You'll probably have too much of it.
... Jason ... 27 ... Denton ...
Information is the data you need, organised and meaningful.
Jason.age=27
Jason.city=Denton
Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.
information is an enhancement of data:
- data is inert
- information is actionable
note that information without data is merely an opinion ;-)
Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.
精彩评论