Is information a subset of data?

2022-12-23 06:47 问答作者：

I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.

That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning开发者_开发问答. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?

The words data,information and knowlege are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing; In the field of information theory these have no meaning at all, because all three are the same thing: a collection of "information" (in the Information-theoric sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.

Information is obtained (or sometimes induced) from data, but it can be richer, as well a cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set theory sense, Information is not a subset of Data, but a separate set [which typically intersects, somewhat, with the data but also can have elements of its own].

Knowledge (sometimes called insight) is yet another level up, it is based on information and too is not a [set theory] subset of information. Indeed Knowledge typically doesn't have direct reference to information elements, but rather tells a "meta story" about the information / data.

The unfounded idea that along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones, probably stems from the fact that there is [typically] a reduction of the volume of [IT sense] information. But qualitatively this info is different, hence no real [set theory] subset relationship.

Example:

Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example the price info may sometimes be expressed in a text string with 1/32th of a dollar precision, in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also the field which indicate, say, the buyer ID, or seller ID may include typos, and hence point to the wrong seller/buyer. etc.
A spreadsheet made from the above is ... Information
Various processes were applied to the data:
-cleaning / correcting various values
-cross referencing (for example looking up associated codes such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
-merging when duplicate records pertaining to the same event (but say from different sources) are used to corroborate each other, but are also combined in one single record.
-aggregating: for example making the sum of all transaction value for a given stock (rather than showing all the individual transactions.
All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
Note that while being more convenient to use, in part more exact/precise, and also boiled down, there is not real [IT sense] information in there which couldn't be located or computed from the original by relatively simple (if only painstaking) processes.
An financial analyst's report may contain ... knowledge
For example if the report indicate [bogus example] that whenever the price of Oil goes past a certain threshold, the value of gold start declining, but then quickly spikes again, around the time the price of coffee and tea stabilize. This particular insight constitute knowledge. This knowledge may have been hidden in the data alone, all along, but only became apparent when one applied some fancy statistically analysis, and/or required the help of a human expert to find or confirm such patterns.

By the way, in the Information Theory sense of the word Information, "data", "information" and "knowlegde" all contain [IT sense] information.
One could possibly get on the slippery slope of stating that "As we go up the chain the entropy decreases", but that is only loosely true because

entropy decrease is not directly or systematically tied to "usefulness for human"
(a typical example is that a zipped text file has less entropy yet is no fun reading)
there is effectively a loss of information (in addition to entropy loss)
(for example when data is aggregate the [IT sense] information about individual records get lost)
there is, particular in the case of Information -> Knowlege, a change in level of abstration

A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli in a comment below provides the example of the spell checker, i.e. when the focus is on English orthography, the most insightful paper from a Wallstreet genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain.
Similarly, a linguist using thousands of newspaper articles which typically (we can hope...) contain at least some insight/knowledge (in the general sense), may just consider these articles raw data, which will help him/her create automatically French-German lexicon (this will be information), and as he works on the project, he may discover a systematic semantic shift in the use of common words betwen the two languages, and hence gather insight into the distinct cultures.

Define information and data first, very carefully.

What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the the ISP it's just data to be passed on.

Sometimes just adding the right context changes data to information.

So, to answer you question: No, information is not a subset of data. It could be at least the following.

A superset, when you add context
A subset, needle-in-a-haystack issue
A function of the data, e.g. in a digest

There are probably more situations.

This is how I see it...

Data is dirty and raw. You'll probably have too much of it.

... Jason ... 27 ... Denton ...

Information is the data you need, organised and meaningful.

Jason.age=27
Jason.city=Denton

Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.

information is an enhancement of data:

data is inert
information is actionable

note that information without data is merely an opinion ;-)

Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.

继续阅读：information-theory

Is information a subset of data?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？