## WII? (2a) Information Theory, Claude Shannon, Entropy, Redundancy, Data Compression & Bits

In the previous video I discussed several

definitions of information and I mainly concentrated on Gregory Bateson’s definition, which describes

information as “a difference that makes a difference”. I modified his definition to

information is a “perceived difference that can make a difference” and discussed how information

has the potential to be used as the basic currency for describing reality. Reality is a subjective experience, it is

perception, it is qualia. In this way, information represents the connection between the observer

and the observed, it encapsulates the link between the subject and the object, the relationship

between what we conceptualise as the external world and what it feels like to experience

this external world from within. Therefore, using information to describe reality

also embodies the recognition that, in many contexts, it doesn’t make sense anymore to

speak of things in themselves, but that all we can possibly describe are perceived distinctions,

perceived properties, perceived patterns and regularities in Nature. Information has the potential to be assigned

a meaning and therefore it has the potential to become knowledge. Meaning is subjective

so information doesn’t really have objective intrinsic meaning. Now, you may be wondering what science has

to say about all this. How does science currently define information? Has science been able

to objectively quantify information? The answer to these questions lies at the heart of Information

Theory. Information Theory is a branch of mathematics

and computer science which studies the quantification of information. As you have probably realised

by now, the concept of information can be defined in many different ways. Clearly, if

we want to quantify information, we need to use definitions which are as objective as

possible. While subjectivity can never be completely

removed from the equation (reality is, after all, always perceived and interpreted in a

subjective manner) we will now explore a definition of information that is much more technical

and objective than the definitions we discussed in the previous video. This is Claude Shannon, an American mathematician

and electronic engineer who is now considered the “Father of Information Theory”. While

working at Bell Laboratories, he formulated a theory which aimed to quantify the communication

of information. Shannon’s article titled “A Mathematical Theory

of Communication”, published in 1948, as well as his book “The Mathematical Theory of Communication”,

co-written with mathematician Warren Weaver and published in 1949, truly revolutionised

science. Shannon’s theory tackled the problem of how

to transmit information most efficiently through a given channel as well as many other practical

issues such as how to make communication more secure (for instance, how to tackle unauthorised

eavesdropping). Thanks to Shannon’s ideas on signal processing,

data compression, as well as data storage and communication, useful applications have

been found in many different areas. For instance, lossless data compression is used in ZIP files,

while lossy data compression is used in other types of files such as MP3s or JPGs. Other

important applications of Information Theory are within the fields of cryptography, thermal

physics, neurobiology and quantum computing, to name just a few. It is important to realise that Shannon’s

Theory was created to find ways to optimise the physical encoding of information, to find

fundamental limits on signal processing operations. Because of this, Shannon’s definition of information

was originally associated with the actual symbols which are used to encode a message

(for instance, the letters or words contained in a message), and not intended to relate

to the actual interpretation, assigned meaning or importance of the message. As we have seen in the previous video, information

itself does not really have intrinsic objective meaning. What Shannon’s Theory aimed to tackle

were practical issues completely unrelated to qualitative properties such as meaningfulness,

but more related to the actual encoding used in the transmission of a particular message. Therefore, what we are dealing with here is

a very different definition of information than those we have discussed so far. Shannon

defines information as a purely quantitative measure of communication exchanges. As we

will see, Shannon’s definition represents a way to measure the amount of information

that can potentially be gained when one learns of the outcome of a random process. And it

is precisely this probabilistic nature of Shannon’s definition that turns out to be

a very useful tool; one that physicists (and scientists in many other disciplines) now use in many

different areas of research. Shannon’s information is in fact known as

Shannon’s entropy (Legend says that it was the mathematician John von Neumann who suggested

that Shannon use this term, instead of information). In general, I will refer to Shannon’s definition

as Shannon’s entropy, information entropy or Shannon’s information, to avoid confusion

with other definitions of information or with the concept of thermodynamical entropy. So what is Shannon’s definition then? Well,

the amount of information contained in a given message is defined as the negative of a certain

sum of probabilities. Don’t worry, we are not going to deal with the actual details

of the equation here, but instead I will give you a flavour of what it means. Suppose we

have a probabilistic process which has a certain number of possible outcomes, each with a different

probability of occurring. Let’s call the total number of possible outcomes

N and the probabilities of each outcome p(1), p(2), p(3),…, p(N). For instance, let’s

say that the probabilistic process we are dealing with is the throw of a coin. In this

case, the total number of possible outcomes is 2, so N=2 (that is, heads and tails – let’s

call heads 1 and tails 2). The probability associated with each of these possible outcomes,

assuming a fair coin, is 1/2. So we have p(1)=1/2 and p(2)=1/2 (which is the same as

saying each possible outcome, heads or tails, has a 50% probability of occurring). Shannon showed that the entropy (designated

by the letter H) is equivalent to the potential information gain once the experimenter learns

the outcome of the experiment, and is given by the following formula: H = -[p(1)·log2 p(1) + p(2)·log2 p(2) + … + p(N)·log2 p(N)], that is, the negative of the sum of each probability multiplied by its base-2 logarithm. This formula implies that the more entropy

a system has, the more information we can potentially gain once we know the outcome

of the experiment. Shannon’s entropy can be thought of as a way to quantify the potential

reduction in our uncertainty once we have learnt the outcome of the probabilistic process. Don’t worry if you feel a bit confused right

now. I’ll explain the concept of Shannon’s entropy for you in very easy terms. Let’s

go back to the coin example: The coin is fair, so the probability of heads

is the same as the probability of tails (that is, 1/2 each). Let’s consider the event of

throwing the coin. Plugging the given probabilities into the equation gives us a Shannon entropy,

that is, an information content of one bit, because there are two possible outcomes and

each has equal probability. Once we have thrown the coin and we have learnt its outcome, we

can say that we have gained one bit of information, or alternatively, we can say that our uncertainty

has been reduced by one bit. Now imagine you have a coin which has two

heads. In this case, N=1, that is, there is only one possible outcome. The likelihood

of obtaining heads is therefore equal to 1 (a probability equal to 1 means absolute certainty,

100%). Since the uncertainty is zero in this case, Shannon’s entropy is zero, and so is

the information content. There is no longer the presence of two different alternatives

here. The information we gain after throwing the coin is therefore, zero. Look at it this

way: we already knew with certainty what was going to happen in advance, so there is no

potential gain in information after learning the outcome. Another way of describing Shannon’s entropy

is to say that it represents the amount of information the experimenter lacks prior to

learning the outcome of a probabilistic process. Hence, according to Shannon’s formula, a message’s

entropy is maximised when the occurrence of each of its individual parts is equally probable.
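These claims are easy to check numerically. Here is a short sketch (my own Python illustration, not from the video) that evaluates Shannon’s entropy, H = -Σ p(i)·log2 p(i), for a few simple distributions:

```python
from math import log2

def shannon_entropy(probs):
    """Entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1.0]))        # two-headed coin: zero bits (no uncertainty)
print(shannon_entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
print(shannon_entropy([1/6] * 6))    # fair die: about 2.58 bits (log2 of 6)
```

Note that the fair coin, whose two outcomes are equally probable, yields the maximum entropy possible for two outcomes; any bias reduces it.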

What this means is that we will gain the largest amount of Shannon’s information when dealing

with systems whose individual possible outcomes are equally likely to occur (for instance,

throwing a fair coin or rolling a fair die, both systems having a set of possible outcomes

which are all equally likely to occur). Shannon’s entropy is a measure of the potential

reduction in uncertainty in the receiver’s knowledge. We can see the process of gaining

information as equivalent to the process of losing uncertainty. You may be wondering what

all this has to do with the actual content and encoding of a message, since so far we

have only been talking of coins. Here’s another example, which illustrates the usefulness

of Shannon’s formula when it comes to written language. Can we estimate the information entropy of

the written English language? Consider starting with one particular letter which is picked

at random. Knowing this first letter, you then want to estimate the probability of getting

another particular letter after that one, and the probability of getting another letter

after that first and second one, and so on. Knowing these probabilities is what we need

in order to calculate the information entropy associated with the English text. If we assume we are dealing with 27 characters

(that is, 26 letters plus space), and that all of these characters are equally probable,

then we have an information entropy of about 4.8 bits per character. But we know that the

characters are not equally probable; for instance, the letter E has the highest frequency of

occurrence, while the letter Z has the lowest. This is related to the concept of redundancy,

which is nothing more than the number of constraints imposed on the text of the English language:

for example, the letter Q is always followed by U, and we also have rules such as “I before

E except after C”, and so on. There are various methods for calculating

the information entropy of the written English language. For instance, Shannon’s methods

– which take into account many factors, including redundancy and contextuality

– give the English language text an information entropy of between 0.6 and 1.3 bits per character. So, if we compare this with the previous result

of around 4.8 bits per character, we can see that the constraints imposed by factors such

as redundancy have the overall effect of reducing the entropy per character. What this means

is that finding the amount of redundancy in a language can help us find the minimum amount

of information needed to encode a message. And of course, this leads us to the important

concept of data compression. In information theory, the redundancy in a

message is the number of bits used to encode it minus the number of bits of Shannon’s information

contained in the message. Redundancy in a message is related to the

extent to which it is possible to compress it. What lossless data compression does is

reduce the number of bits used to encode a message by identifying and eliminating statistical

redundancy. Hence: The more redundancy there is in a message,

the more predictability we have – that means less entropy per encoded symbol – hence the

higher the compressibility. When we compress data, we extract redundancy.
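This link between redundancy and compressibility can be demonstrated with any off-the-shelf lossless compressor. Here is a sketch (my own, not from the video) using Python’s standard zlib library: a highly repetitive text shrinks dramatically, while an unpredictable text of the same length barely shrinks at all.

```python
import random
import string
import zlib

random.seed(0)
redundant = ("the quick brown fox " * 50).encode()  # highly repetitive: 1000 bytes
unpredictable = "".join(
    random.choice(string.ascii_lowercase + " ") for _ in range(len(redundant))
).encode()  # same length, but statistically unpredictable

print(len(redundant), "->", len(zlib.compress(redundant)))          # large reduction
print(len(unpredictable), "->", len(zlib.compress(unpredictable)))  # small reduction
```

The random text still compresses a little, because packing a 27-symbol alphabet into 8-bit bytes leaves some residual redundancy; but nothing like the repetitive text, whose patterns the compressor eliminates.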

When we compress a message, what we do is encode the same amount of Shannon’s information

by using fewer bits. Hence, we end up having more Shannon’s information per encoded symbol,

more Shannon’s information per bit of encoded message. A compressed message is less predictable,

since the repeated patterns have been eliminated; the redundancy has been removed. In fact, Shannon’s entropy represents a lower

limit for lossless data compression: the minimum amount of bits that can be used to encode

a message without loss. Shannon’s source coding theorem states that a lossless data compression

scheme cannot compress messages, on average, to have more than one bit of Shannon’s information

per bit of encoded message. Calculating the redundancy and the information

entropy of the English language has therefore many practical applications. For instance,

ASCII codes (which are codes that represent text in computers, communications equipment,

and other devices that use text) allocate exactly 8 bits per character. But this is very inefficient when we consider

Shannon’s and other similar calculations which, as we have seen, give us an information entropy

of around 1 bit per character. Put another way, a smaller number of bits can be used

to store the same text. What this means is that, in theory, there exists a compression

scheme which is 8 times more effective than ASCII. Let’s use an example to see how lossless

data compression works. We will use Huffman coding, an algorithm developed by electrical

engineer David Huffman in 1952. Huffman coding is a variable length code which assigns codes

using the estimated probability of occurrence of each source symbol. For instance, let’s

say we take as symbols all the letters of the English alphabet plus space. Huffman coding

assigns codes with varying length depending on the frequency of occurrence of each symbol.
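Here is a minimal sketch of the algorithm in Python (my own illustration; it derives the symbol counts from a sample string rather than from standard frequency tables). It repeatedly merges the two least frequent subtrees, prepending a 0 to the codes on one side and a 1 to the other:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code by repeatedly merging the two rarest subtrees."""
    heap = [(count, i, {symbol: ""})
            for i, (symbol, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # keeps tuple comparison away from the dicts
    while len(heap) > 1:
        count1, _, left = heapq.heappop(heap)
        count2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (count1 + count2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "this is an example of huffman coding"
codes = huffman_codes(text)
# frequent symbols (like space) receive short codes, rare ones longer codes
encoded = "".join(codes[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))
```

Because no code word is a prefix of another, the encoded bit stream can be decoded unambiguously without separators between symbols.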

Just as with Morse code, the most frequent symbols are assigned the shortest codes and

the less frequent symbols are assigned the longest codes. Let’s take a piece of text written in English,

which is long enough so that we can approximate our calculations by using standard frequency

tables for the letters of the written English language. The most frequent symbols, such

as space and the letter e will be assigned the shortest codes, while the least frequent

symbols, such as the letters q and z will be assigned the longest codes. Applying the

Huffman algorithm using standard frequency tables, we obtain the Huffman codes given

on this table. As you can see, the lengths of the codes vary from 3 to 11 bits, depending

on the character. We are going to use this short passage, from

Brave New World, by Aldous Huxley. Excluding punctuation and just counting letters

and spaces, we have a total of 462 characters to encode. If we encode this piece of text

using ASCII, we will need to use a total of 3,696 bits (8 bits per character). However,

if we use Huffman code, we will only need to use a total of 1,883 bits (an average of

about 4.1 bits per character). So we see that by using Huffman encoding instead

of ASCII, we can store the same amount of information about twice as efficiently. Although

Huffman coding is more efficient, we are still far from reaching the limit given by Shannon’s

entropy, which as we saw earlier on, can be approximated to be around 1 bit per character. By using Huffman’s code, we have managed to

extract a certain amount of redundancy, but we are still far from the limit where all

the redundancy would have been extracted: that limit is Shannon’s entropy. There are however many other compression techniques;

for instance, there is a technique called arithmetic coding, which can extract a lot

more redundancy than Huffman coding, and hence it can create compressed messages where the

average number of bits per character is much closer to Shannon’s entropy. So by using this particular example, we have

seen how the concept of Shannon’s entropy, in this case calculated from the probabilities

of occurrence associated with the letters belonging to the words of a particular language,

has very important applications; data compression being one of them. Summarising: Shannon’s entropy is a measure of uncertainty,

of unpredictability, and also a measure of information content, of potential information

gain. Shannon’s entropy can also represent a lower

limit for lossless data compression: the minimum amount of bits that can be used to encode

a message without loss. Also note that with this definition, more

information content has nothing to do with its quality. So in this sense, a larger amount

of Shannon’s entropy does not necessarily imply a better quality of its content (two

examples of subjective concepts which could be linked to quality are meaningfulness and

importance). I will expand on this topic in the following

video, when we discuss Norbert Wiener’s ideas. Now, there is one point that needs clarifying.

In this past section of the video, while discussing concepts such as compression and redundancy,

we have actually been talking about different kinds of bits, that is, bits which represent

different concepts. You may recall that in the previous video

we defined the bit as a variable which can have two possible values, which we represent

by the digits 0 and 1. This is the most popular definition, one that is usually associated

with the storage or transmission of encoded data. In this way, one bit is the capacity

of a system which can exist in only two states. In information theory, however, the bit can

be defined in a different way. As we have seen, it can be a unit of measurement for

Shannon’s information. In this context, one bit is defined as the uncertainty associated

with a binary random variable that can be in one of two possible states with equal probability. Put another way, one Shannon bit is the amount

of entropy that is present in the selection of two equally probable options, it is the

information that is gained when the value of this variable becomes known. Remember,

this is exactly what we showed earlier on when we applied Shannon’s formula to the throw

of a fair coin. Well, it turns out that the first time the

word “bit” appeared in a written document was precisely in Shannon’s ground-breaking

paper “A mathematical theory of communication”. In it, Shannon clearly states that it was

mathematician John Tukey, a fellow Bell Labs researcher, who coined the term “bit”, short

for binary digit. While Tukey’s binary digit is a unit related

to the encoding of data, Shannon’s bit is a unit that is related to uncertainty, unpredictability.

In one Tukey bit of encoded data there is often less than one bit of Shannon’s information.
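This claim is easy to check numerically; a small sketch (my own Python illustration, reusing the coin entropy formula from earlier):

```python
from math import log2

def entropy_bits(p_heads):
    """Shannon information (in bits) gained from one toss of a coin with bias p_heads."""
    return -sum(p * log2(p) for p in (p_heads, 1 - p_heads) if p > 0)

# Each toss is recorded in one encoded (Tukey) bit,
# but the Shannon information it carries varies with the bias:
for p in (0.5, 0.7, 0.9, 0.99):
    print(p, round(entropy_bits(p), 3))
```

Only at p = 0.5, when both outcomes are equally likely, does the encoded bit carry a full Shannon bit; any bias leaves it carrying less.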

Why? Well, Shannon’s entropy is maximised when

all possible states are equally likely, and one Shannon bit is defined in relation to

a system whose states are equally likely. Therefore, in a binary system, when the two

possible states are not equally likely, such as the toss of a biased coin, Shannon’s information

is less than 1 bit. Recall Shannon’s source coding theorem, which

states that a lossless data compression scheme cannot compress messages, on average, to have

more than one bit of Shannon’s information per bit of encoded message. Well, now you

know that these bits are different kinds of bits, in the sense that they represent different

concepts. The latter refers to the encoding, storage or transmission of binary data, redundant

or not, whereas the former refers to amounts of Shannon’s information, as defined by his

entropy equation. An encoded bit of data contains, at most, one bit of Shannon’s information,

usually a lot less. In information theory, therefore, it is important

to understand the distinction between encoded binary data and information. In this context,

the word information is used in reference to Shannon’s entropy, a much more abstract

concept that does not equate with the actual encoding of zeros and ones. Now, so far we have discussed the important

concepts of entropy, uncertainty, information and redundancy… and we have seen how Shannon’s

Source Coding Theorem provides us with a theoretical lower bound for data compression. Hopefully

you now understand the relationship between entropy, uncertainty reduction, information

gain, information content, redundancy and compression. We have covered the most important theoretical

aspects of Information Theory which establish the limits to how efficiently a particular

message can be encoded. But what does Shannon have to say about sending

a message not only efficiently but also reliably through a given channel? How does one transmit

information reliably in the presence of noise? That is, in the presence of physical disturbances

which can introduce random errors in the encoding of a particular message, what are the theoretical

limits to detecting and correcting these errors? This leads us to the very important topic

of Error-Detection and Correction. Coming up in Part 2b:

– Shannon’s Noisy Channel Coding Theorem
– Adding redundancy bits to a message (called parity or check bits) to detect and correct errors
– Error-correcting codes
– What are Hamming codes?
– Example using a Hamming code
– What are doubly-even self-dual linear binary error-correcting block codes?
– James Gates’ discovery: error-correcting codes found buried inside the fundamental equations of physics?
– Little side journey: cosmological and biological evolution. The Universe as a self-organised system. How does Nature store information? Nature’s use of redundancy and error-correcting codes. Information, the Laws of Physics and unknown organising principles: fixed or evolving? Biology, DNA and information.
