WII? (2a) Information Theory, Claude Shannon, Entropy, Redundancy, Data Compression & Bits


In the previous video I discussed several
definitions of information and I mainly concentrated on Gregory Bateson’s definition, which describes
information as “a difference that makes a difference”. I modified his definition to
information is a “perceived difference that can make a difference” and discussed how information
has the potential to be used as the basic currency for describing reality. Reality is a subjective experience, it is
perception, it is qualia. In this way, information represents the connection between the observer
and the observed, it encapsulates the link between the subject and the object, the relationship
between what we conceptualise as the external world and what it feels like to experience
this external world from within. Therefore, using information to describe reality
also embodies the recognition that, in many contexts, it doesn’t make sense anymore to
speak of things in themselves, but that all we can possibly describe are perceived distinctions,
perceived properties, perceived patterns and regularities in Nature. Information has the potential to be assigned
a meaning and therefore it has the potential to become knowledge. Meaning is subjective
so information doesn’t really have objective intrinsic meaning. Now, you may be wondering what science has
to say about all this. How does science currently define information? Has science been able
to objectively quantify information? The answer to these questions lies at the heart of Information
Theory. Information Theory is a branch of mathematics
and computer science which studies the quantification of information. As you have probably realised
by now, the concept of information can be defined in many different ways. Clearly, if
we want to quantify information, we need to use definitions which are as objective as
possible. While subjectivity can never be completely
removed from the equation (reality is, after all, always perceived and interpreted in a
subjective manner), we will now explore a definition of information that is much more technical
and objective than the definitions we discussed in the previous video. This is Claude Shannon, an American mathematician
and electronic engineer who is now considered the “Father of Information Theory”. While
working at Bell Laboratories, he formulated a theory which aimed to quantify the communication
of information. Shannon’s article titled “A Mathematical Theory
of Communication”, published in 1948, as well as his book “The Mathematical Theory of Communication”,
co-written with mathematician Warren Weaver and published in 1949, truly revolutionised
science. Shannon’s theory tackled the problem of how
to transmit information most efficiently through a given channel as well as many other practical
issues such as how to make communication more secure (for instance, how to tackle unauthorised
eavesdropping). Thanks to Shannon’s ideas on signal processing,
data compression, as well as data storage and communication, useful applications have
been found in many different areas. For instance, lossless data compression is used in ZIP files,
while lossy data compression is used in other types of files such as MP3s or JPGs. Other
important applications of Information Theory are within the fields of cryptography, thermal
physics, neurobiology or quantum computing, to name just a few. It is important to realise that Shannon’s
Theory was created to find ways to optimise the physical encoding of information, to find
fundamental limits on signal processing operations. Because of this, Shannon’s definition of information
was originally associated with the actual symbols which are used to encode a message
(for instance, the letters or words contained in a message), and was not intended to relate
to the actual interpretation, assigned meaning or importance of the message. As we have seen in the previous video, information
itself does not really have intrinsic objective meaning. What Shannon’s Theory aimed to tackle
were practical issues completely unrelated to qualitative properties such as meaningfulness,
but more related to the actual encoding used in the transmission of a particular message. Therefore, what we are dealing with here is
a very different definition of information than those we have discussed so far. Shannon
defines information as a purely quantitative measure of communication exchanges. As we
will see, Shannon’s definition represents a way to measure the amount of information
that can potentially be gained when one learns of the outcome of a random process. And it
is precisely this probabilistic nature of Shannon’s definition that turns out to be
a very useful tool; one that physicists (and other science disciplines) now use in many
different areas of research. Shannon’s information is in fact known as
Shannon’s entropy (Legend says that it was the mathematician John von Neumann who suggested
that Shannon use this term, instead of information). In general, I will refer to Shannon’s definition
as Shannon’s entropy, information entropy or Shannon’s information, to avoid confusion
with other definitions of information or with the concept of thermodynamical entropy. So what is Shannon’s definition then? Well,
the amount of information contained in a given message is defined as the negative of a certain
sum involving the probabilities of its possible outcomes. Don’t worry, we are not going to deal with the actual details
of the equation here, but instead I will give you a flavour of what it means. Suppose we
have a probabilistic process which has a certain number of possible outcomes, each with a different
probability of occurring. Let’s call the total number of possible outcomes
N and the probabilities of each outcome p(1), p(2), p(3),…, p(N). For instance, let’s
say that the probabilistic process we are dealing with is the throw of a coin. In this
case, the total number of possible outcomes is 2, so N=2 (that is, heads and tails – let’s
call heads 1 and tails 2). The probability associated with each of these possible outcomes,
assuming a fair coin, is 1/2. So we have p(1)=1/2 and p(2)=1/2 (which is the same as
saying each possible outcome, heads or tails, has a 50% probability of occurring). Shannon showed that the entropy (designated
by the letter H) is equivalent to the potential information gain once the experimenter learns
the outcome of the experiment, and is given by the formula H = − [ p(1) log2 p(1) + p(2) log2 p(2) + … + p(N) log2 p(N) ], where log2 denotes the logarithm to base 2, so that H is measured in bits. This formula implies that the more entropy
a system has, the more information we can potentially gain once we know the outcome
of the experiment. Shannon’s entropy can be thought of as a way to quantify the potential
reduction in our uncertainty once we have learnt the outcome of the probabilistic process. Don’t worry if you feel a bit confused right
now. I’ll explain the concept of Shannon’s entropy for you in very easy terms. Let’s
go back to the coin example: The coin is fair, so the probability of heads
is the same as the probability of tails (that is, 1/2 each). Let’s consider the event of
throwing the coin. Plugging the given probabilities into the equation gives us a Shannon entropy,
that is, an information content of one bit, because there are two possible outcomes and
each has equal probability. Once we have thrown the coin and we have learnt its outcome, we
can say that we have gained one bit of information, or alternatively, we can say that our uncertainty
has been reduced by one bit. Now imagine you have a coin which has two
heads. In this case, N=1, that is, there is only one possible outcome. The likelihood
of obtaining heads is therefore equal to 1 (a probability equal to 1 means absolute certainty,
100%). Since the uncertainty is zero in this case, Shannon’s entropy is zero, and so is
the information content. There is no longer the presence of two different alternatives
here. The information we gain after throwing the coin is therefore, zero. Look at it this
way: we already knew with certainty what was going to happen in advance, so there is no
potential gain in information after learning the outcome.
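As a quick illustration of the formula in action, here is a minimal Python sketch (my own, not something shown in the video; the function name shannon_entropy is simply illustrative) that evaluates the entropy of the fair coin, the two-headed coin and, for comparison, a biased coin whose two outcomes are not equally likely:

from math import log2

def shannon_entropy(probabilities):
    # Shannon's entropy H = -[ p(1) log2 p(1) + ... + p(N) log2 p(N) ], in bits.
    # Outcomes with probability zero contribute nothing to the sum.
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit of uncertainty
print(shannon_entropy([1.0]))       # two-headed coin: zero bits (printed as -0.0)
print(shannon_entropy([0.9, 0.1]))  # biased coin: roughly 0.47 bits, less than 1

The biased-coin value anticipates a point made later in the video: when the two outcomes are not equally likely, the entropy drops below one bit.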
Another way of describing Shannon’s entropy is to say that it represents the amount of information the experimenter lacks prior to
learning the outcome of a probabilistic process. Hence, according to Shannon’s formula, a message’s
entropy is maximised when the occurrence of each of its individual parts is equally probable.
What this means is that we will gain the largest amount of Shannon’s information when dealing
with systems whose individual possible outcomes are equally likely to occur (for instance,
throwing a fair coin or rolling a fair die, both systems having a set of possible outcomes
which are all equally likely to occur). Shannon’s entropy is a measure of the potential
reduction in uncertainty in the receiver’s knowledge. We can see the process of gaining
information as equivalent to the process of losing uncertainty. You may be wondering what
all this has to do with the actual content and encoding of a message, since so far we
have only been talking of coins. Here’s another example, which illustrates the usefulness
of Shannon’s formula when it comes to written language. Can we estimate the information entropy of
the written English language? Consider starting with one particular letter which is picked
at random. Knowing this first letter, you then want to estimate the probability of getting
another particular letter after that one, and the probability of getting another letter
after that first and second one, and so on. Knowing these probabilities is what we need
in order to calculate the information entropy associated with the English text. If we assume we are dealing with 27 characters
(that is, 26 letters plus space), and that all of these characters are equally probable,
then we have an information entropy of about 4.8 bits per character. But we know that the
characters are not equally probable; for instance, the letter E has the highest frequency of
occurrence, while the letter Z has the lowest. This is related to the concept of redundancy,
which reflects the constraints imposed on the text of the English language:
for example, the letter Q is almost always followed by U, and we also have rules such as “I before
E except after C”, and so on.
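To make these numbers concrete, here is a small Python sketch (again my own illustration, using an arbitrary sample string, not a calculation from the video) comparing the equal-probability baseline for a 27-character alphabet with the entropy computed from the actual character frequencies of a piece of text:

from collections import Counter
from math import log2

def entropy_per_character(text):
    # First-order estimate: entropy computed from single-character frequencies only.
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Equal-probability baseline for 26 letters plus space.
print(log2(27))  # about 4.75 bits per character

sample = ("the quick brown fox jumps over the lazy dog "
          "while the lazy dog sleeps under the quick brown fox")
print(entropy_per_character(sample))  # lower, because the character frequencies are unequal

With a long, representative text this single-character estimate settles at roughly 4 bits per character; exploiting the statistical dependencies between neighbouring letters and words, as Shannon’s own methods do, pushes the figure much lower still.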
There are various methods for calculating the information entropy of the written English language. For instance, Shannon’s methods
– which take into account many factors, including redundancy and contextuality for instance
– give the English language text an information entropy of between 0.6 and 1.3 bits per character. So, if we compare this with the previous result
of around 4.8 bits per character, we can see that the constraints imposed by factors such
as redundancy have the overall effect of reducing the entropy per character. What this means
is that finding the amount of redundancy in a language can help us find the minimum amount
of information needed to encode a message. And of course, this leads us to the important
concept of data compression. In information theory, the redundancy in a
message is the number of bits used to encode it minus the number of bits of Shannon’s information
contained in the message. Redundancy in a message is related to the
extent to which it is possible to compress it. What lossless data compression does is
reduce the number of bits used to encode a message by identifying and eliminating statistical
redundancy. Hence: The more redundancy there is in a message,
the more predictability we have – that means less entropy per encoded symbol – hence the
higher the compressibility. When we compress data, we extract redundancy.
When we compress a message, what we do is encode the same amount of Shannon’s information
by using fewer bits. Hence, we end up having more Shannon’s information per encoded symbol,
more Shannon’s information per bit of encoded message. A compressed message is less predictable,
since the repeated patterns have been eliminated; the redundancy has been removed.
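The link between redundancy and compressibility is easy to see with any off-the-shelf lossless compressor. In the following Python sketch (my own illustration; zlib is simply a convenient stand-in for the compressors discussed here), a highly repetitive message shrinks to a small fraction of its size, while an unpredictable message of the same length barely shrinks at all:

import os
import zlib

# Two messages of the same length: one highly redundant, one close to maximum entropy.
redundant = b"abcabcabcabc" * 1000   # 12,000 bytes of repeated pattern
unpredictable = os.urandom(12000)    # 12,000 random bytes

print(len(zlib.compress(redundant)))      # a tiny fraction of 12,000 bytes
print(len(zlib.compress(unpredictable)))  # barely any smaller (it usually grows slightly)

The repetitive message carries very little Shannon’s information per encoded byte, so almost all of its bits are redundancy that the compressor can strip out; the random message has essentially no redundancy left to remove.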
In fact, Shannon’s entropy represents a lower limit for lossless data compression: the minimum number of bits that can be used to encode
a message without loss. Shannon’s source coding theorem states that a lossless data compression
scheme cannot compress messages, on average, to have more than one bit of Shannon’s information
per bit of encoded message. Calculating the redundancy and the information
entropy of the English language has therefore many practical applications. For instance,
ASCII codes (which are codes that represent text in computers, communications equipment,
and other devices that use text) allocate 8 bits (one byte) per character. But this is very inefficient when we consider
Shannon’s and other similar calculations which, as we have seen, give us an information entropy
of around 1 bit per character. Put another way, a smaller number of bits can be used
to store the same text. What this means is that, in theory, there exists a compression
scheme which is 8 times more effective than ASCII. Let’s use an example to see how lossless
data compression works. We will use Huffman coding, an algorithm developed by electrical
engineer David Huffman in 1952. Huffman coding is a variable-length code which assigns codes
using the estimated probability of occurrence of each source symbol. For instance, let’s
say we take as symbols all the letters of the English alphabet plus space. Huffman coding
assigns codes with varying length depending on the frequency of occurrence of each symbol.
Just as with Morse code, the most frequent symbols are assigned the shortest codes and
the less frequent symbols are assigned the longest codes. Let’s take a piece of text written in English,
which is long enough so that we can approximate our calculations by using standard frequency
tables for the letters of the written English language. The most frequent symbols, such
as space and the letter e will be assigned the shortest codes, while the least frequent
symbols, such as the letters q and z will be assigned the longest codes. Applying the
Huffman algorithm using standard frequency tables, we obtain the Huffman codes given
on this table. As you can see, the lengths of the codes vary from 3 to 11 bits, depending
on the character.
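Here is a compact Python sketch of the construction just described (my own illustration: it builds the code from the character frequencies of whatever text it is given, rather than from the standard frequency tables used for the table in the video):

import heapq
from collections import Counter

def huffman_code(text):
    # Build a Huffman code from the character frequencies of the given text.
    # Returns a dictionary mapping each character to its binary code string.
    frequencies = Counter(text)
    if len(frequencies) == 1:
        return {char: "0" for char in frequencies}  # degenerate single-symbol case
    # Each heap entry: (subtree frequency, tie-breaker, {character: code so far}).
    heap = [(freq, i, {char: ""}) for i, (char, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        freq1, _, codes1 = heapq.heappop(heap)  # least frequent subtree
        freq2, _, codes2 = heapq.heappop(heap)  # second least frequent subtree
        # Merging two subtrees prefixes their codes with 0 and 1 respectively.
        merged = {char: "0" + code for char, code in codes1.items()}
        merged.update({char: "1" + code for char, code in codes2.items()})
        heapq.heappush(heap, (freq1 + freq2, counter, merged))
        counter += 1
    return heap[0][2]

text = "this is an example of huffman coding applied to a short piece of english text"
codes = huffman_code(text)
encoded_bits = sum(len(codes[char]) for char in text)
print(encoded_bits / len(text))  # average bits per character, well below 8
print(sorted(codes.items(), key=lambda item: len(item[1]))[:5])  # most frequent symbols get the shortest codes

Applied to a long English text over the 27-symbol alphabet, a table built this way gives code lengths and an average number of bits per character in line with the figures discussed here.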
We are going to use this short passage from Brave New World, by Aldous Huxley. Excluding punctuation and just counting letters
and spaces, we have a total of 462 characters to encode. If we encode this piece of text
using ASCII, we will need to use a total of 3,696 bits (8 bits per character). However,
if we use Huffman code, we will only need to use a total of 1,883 bits (an average of
about 4.1 bits per character). So we see that by using Huffman encoding instead
of ASCII, we can store the same amount of information, but twice as effectively. Although
Huffman coding is more efficient, we are still far from reaching the limit given by Shannon’s
entropy, which, as we saw earlier on, can be approximated to be around 1 bit per character. By using Huffman coding, we have managed to
extract a certain amount of redundancy, but we are still far from the limit where all
the redundancy would have been extracted: that limit is Shannon’s entropy. There are however many other compression techniques;
for instance, there is a technique called arithmetic coding, which can extract a lot
more redundancy than Huffman coding, and hence it can create compressed messages where the
average number of bits per character is much closer to Shannon’s entropy. So by using this particular example, we have
seen how the concept of Shannon’s entropy, in this case calculated from the probabilities
of occurrence associated with the letters belonging to the words of a particular language,
has very important applications; data compression being one of them. Summarising: Shannon’s entropy is a measure of uncertainty,
of unpredictability, and also a measure of information content, of potential information
gain. Shannon’s entropy can also represent a lower
limit for lossless data compression: the minimum number of bits that can be used to encode
a message without loss. Also note that, with this definition, a message’s
information content has nothing to do with its quality. So in this sense, a larger amount
of Shannon’s entropy does not necessarily imply a better quality of content (two
examples of subjective concepts which could be linked to quality are meaningfulness and
importance). I will expand on this topic in the following
video, when we discuss Norbert Wiener’s ideas. Now, there is one point that needs clarifying.
In this past section of the video, while discussing concepts such as compression and redundancy,
we have actually been talking about different kinds of bits, that is, bits which represent
different concepts. You may recall that in the previous video
we defined the bit as a variable which can have two possible values, which we represent
by the digits 0 and 1. This is the most popular definition, one that is usually associated
with the storage or transmission of encoded data. In this way, one bit is the capacity
of a system which can exist in only two states. In information theory, however, the bit can
be defined in a different way. As we have seen, it can be a unit of measurement for
Shannon’s information. In this context, one bit is defined as the uncertainty associated
with a binary random variable that can be in one of two possible states with equal probability. Put another way, one Shannon bit is the amount
of entropy that is present in the selection of two equally probable options, it is the
information that is gained when the value of this variable becomes known. Remember,
this is exactly what we showed earlier on when we applied Shannon’s formula to the throw
of a fair coin. Well, it turns out that the first time the
word “bit” appeared in a written document was precisely in Shannon’s ground-breaking
paper “A mathematical theory of communication”. In it, Shannon clearly states that it was
mathematician John Tukey, a fellow Bell Labs researcher, who coined the term “bit”, short
for binary digit. While Tukey’s binary digit is a unit related
to the encoding of data, Shannon’s bit is a unit that is related to uncertainty, unpredictability.
In one Tukey bit of encoded data there is often less than one bit of Shannon’s information.
Why? Well, Shannon’s entropy is maximised when
all possible states are equally likely, and one Shannon bit is defined in relation to
a system whose states are equally likely. Therefore, in a binary system, when the two
possible states are not equally likely, such as the toss of a biased coin, Shannon’s information
is less than 1 bit. Recall Shannon’s source coding theorem, which
states that a lossless data compression scheme cannot compress messages, on average, to have
more than one bit of Shannon’s information per bit of encoded message. Well, now you
know that these bits are different kinds of bits, in the sense that they represent different
concepts. The latter refers to the encoding, storage or transmission of binary data, redundant
or not, whereas the former refers to amounts of Shannon’s information, as defined by his
entropy equation. An encoded bit of data contains, at most, one bit of Shannon’s information,
usually a lot less. In information theory, therefore, it is important
to understand the distinction between encoded binary data and information. In this context,
the word information is used in reference to Shannon’s entropy, a much more abstract
concept that does not equate with the actual encoding of zeros and ones. Now, so far we have discussed the important
concepts of entropy, uncertainty, information and redundancy… and we have seen how Shannon’s
Source Coding Theorem provides us with a theoretical lower bound for data compression. Hopefully
you now understand the relationship between entropy, uncertainty reduction, information
gain, information content, redundancy and compression. We have covered the most important theoretical
aspects of Information Theory which establish the limits to how efficiently a particular
message can be encoded. But what does Shannon have to say about sending
a message not only efficiently but also reliably through a given channel? How does one transmit
information reliably in the presence of noise? That is, in the presence of physical disturbances
which can introduce random errors in the encoding of a particular message, what are the theoretical
limits to detecting and correcting these errors? This leads us to the very important topic
of Error-Detection and Correction.

Coming up in Part 2b:
– Shannon’s Noisy Channel Coding Theorem
– Adding redundancy bits to a message (called parity or check bits) to detect and correct errors
– Error-correcting codes
– What are Hamming codes?
– Example using a Hamming code
– What are doubly-even self-dual linear binary error-correcting block codes?
– James Gates’ discovery: error-correcting codes found buried inside the fundamental equations of physics?
– Little side journey: Cosmological and biological evolution. The Universe as a self-organised system. How does Nature store information? Nature’s use of redundancy and error-correcting codes. Information, the Laws of Physics and unknown organising principles: fixed or evolving? Biology, DNA and information.
