## Information Inequalities

Welcome to the course on An Introduction to
Information Theory. I am Adrish Banerjee. And today we are going to talk about information
inequalities. In particular we will talk about what is Jensen’s inequality and how this
is used to find whether a function is concave or convex. We will talk about log sum inequality
and we will show one example of how we can use log sum inequality to prove concavity
or convexity of a function. Next we will talk about data processing lemma and finally, we
will talk about Fanos lemma. So, this lecture will we will talk about these inequalities
So, before we go to Jensen’s inequality, let us just see what do we mean by a function
is concave or convex. So, function f is concave over a non zero interval I, if for each point
x naught which belongs to this interval I, there exists a real number c such that this
condition holds. What is this condition? So, let us just draw, let us say this is a function
defined over this interval from here to here. Now take any point belonging to this interval,
let us just take we take this point x naught, and this is my function f of x. So, what is
f of x 0 that is this point this is f of x 0. So there exists a real number c, of course
the value of c may depend on will depend on x naught such that function f of x is below
f x 0 plus c x minus x naught. What is c x minus x naught? It is a line passing
through x naught and passing through this point this will be basically a line, this
is my, this straight line passing through x naught. Now, note that this c the slope
of this line depends on what choice of x naught I choose. For example, if I choose this as
my x naught then the slope is negative, so this is what the line is. Since, the function
is concave, I can draw a line passing through x naught and this function will lie below
that so that is what I mean by function is concave. Now, if the function is convex, then
what is going to happen is you will have a function like this, let us say this is a function
defined over this interval x 1 to x 2, and take any point, let us say just take this
point x naught. If I draw a line passing through this thing x naught then for the function
to be convex f of x should be above this line. This is the condition for function to be convex,
and this is the condition for the function to be concave. Now, we will use this result to obtain Jensen’s
inequality which says, if the function is concave and we have a random variable x that
takes values within the interval over which this function f of x is defined. We are with
what the interval where the function is concave then expected value of the function is less
than equal to function evaluated at expected value of x. Now, let us see at the proof of
this, we will get the proof from this result. So, if I take expectation both side what do
I get f of f x this is less than expected value of f of x naught. Now, what is f of
x naught, x naught is the number in the interval over which this function is concave. So, expected
value of f of x naught will be f of x naught plus c times expected value of x minus x naught.
Now, if I choose my point x naught such that x naught is expected value of x, which is
possible because x naught is belongs to the interval over which this function is concave.
So, I can choose my x naught such that x naught is expected value of x. Now, if I choose this
value of x naught and I plug it in here, what do I get? I get expected value of f of x,
is less than equal to function. What is the x naught, x naught is expected value of x
and c this is the expected value of x minus expected value of x. So, this term will become
0, so what I will be left with is, this and this is precisely my Jensen’s inequality.
So if I have a function f, which is concave then this relation holds. Now, similarly if
the function f is convex, we can similarly show that expected value of f of x will be
greater that equal to function evaluated at expected value of x. The proof is very similar
to how we proved starting from this result, so this is our Jensen’s inequality. Now, we will show some more ways of proving
whether a function is concave or convex. So, if you have a function f which is concave
over an interval I, then it is concave if and only if for every x 1 and x 2 which belongs
to this interval I, this relation holds. What is this? This lambda times f of x 1, plus
1 minus lambda times f of x 2, this should be less than equal to function evaluated at
lambda time x 1 plus 1 minus lambda time x 2, where lambda lies between 0 and 1. Now,
using Jensen’s inequality we can prove this result. So, let us look at the proof.
Let us say we have a random variable x and that takes two values x 1 and x 2, it takes
x 1 with probability lambda, so x 2 happens with probability 1 minus lambda. Now, what
does Jensen’s inequality says that Jensen’s inequality says, if the function f is concave
then expected value of the function is less than equal to function evaluated at expected
value. So, then let us compute expected value of the function. What is the expected value
of the function? Expected value of the function is so x takes value lambda 1, x 1 and x 2
with probability lambda and 1 minus lambda. So, the expected value of the function will
be f of x 1 multiplied by probability of occurrence of x 1 which is lambda. And f of x 2 multiplied
by probability of occurrence of x 2 which is 1 minus lambda. So, this is a value for
expected value of the function f of x. Now, Jensen’s inequality says, if the function
is concave then function expected value of the function is less than equal to function
evaluated at expected value of x. So, according to Jensen’s inequality this term will be
less than this. Now, what is expected value of x? So x 1 happens with probability lambda
and x 2 happens with probability 1 minus lambda. So, the expected value of x is given by this
expression, this is your expected value of x. So, then applying Jensen’s inequality,
we get this result on the left hand side is the expected value of the function and on
the right hand side is function evaluated at expected value. So, if the function is
concave over this interval this relation holds; and vice versa if the function is convex then
this relation will be greater than equal to OK. Now, we will show another way of proving whether
a function is concave or convex. So, if we have a function f that has a second derivative
and the second derivative is non-negative over this interval over which function is
defined then the function is concave over that interval, so what do we need, we want
the second derivative to be non-negative. Now, how can we prove this? So we do a Taylor
expansion of this function around x naught, where x naught is point in this interval.
So, Taylor’s series expansion, I can write this f of x naught plus first derivative of
f, evaluated at x naught x minus x naught plus second derivative of this function f,
evaluated at some x star by 2 x by x naught, where this x star point lies between x 0 and
x. Now, what am I seeing? I am seeing for function
to be convex, I want this second derivative to be non-negative. So, what I need to show
is, if the second derivative is non-negative, then this function f of x is convex. So, let
us prove it. So, we are seeing that if second derivative is positive then function should
be convex. So, if the second derivative is positive then this last term is going to be
if this is non-negative this is square of x minus x naught, so that is also non-negative,
so this whole term will be non-negative term. So, then what I can do is I can write f of
x to be, so since this term is positive, so f of x will be greater than equal, this is
non-negative so f of x will be greater than equal to f of x naught, plus f first derivative
of f evaluated at x naught x minus x naught. So, this follows from the fact that since
this term is non-negative this whole term will be non negative. So, f of x will be greater
than equal to this term, so that is what I have here. Now, let us take the value of x
naught to be this which is lambda times x 1 plus 1 minus lambda times x 2 and let us
take x to be x 1. So, I plug in this value of x and x naught into this expression. So,
I get here f of x 1 is greater than equal to f of x naught plus first derivative of
this function f evaluated at x naught 1 minus lambda x minus x naught where x naught is
given by this. Next, so I put x equal to x 2 in this, this
equation. Here, I put x equal to x 2. So, if I similarly put x equal x 2, what I get
is f of x 2 is greater equal to f of x naught first derivative of f evaluated at x naught
lambda x 2 minus x 1, where x naught is again lambda times x 1 plus 1 minus lambda time
x 2. Now, what I am going to do is I am going to multiply this expression by 1 minus lambda
and this expression I am going to multiply by lambda. So, equation 2, I am going to multiply
by lambda; and equation 3, I am going to multiply by 1 minus lambda. And then I am going to
add them up so if I do that what I get here is I get f of x naught is less than equal
to lambda time’s f of x 1 plus 1 minus lambda time’s f of x 2. So, again I repeat if I
multiply equation two by lambda and equation three by 1 minus lambda and I add them up
I get this condition that f of x naught is less than equal to lambda times f of x 1 plus
1 minus lambda time f of x 2. Now, what is this, you go back to the previous
expression that we have, we have said if the function is concave then lambda f of x 1 plus
1 minus lambda f of x 2 is less than equal to function evaluated at lambda x 2. And if
the function is convex we said lambda f of x 1 plus 1 minus lambda f of x 2 would be
greater than equal to function evaluated at lambda x 1 plus 1 minus lambda x 2.
Now, go back and see what we have arrived at, we have arrived at the condition where
function evaluated at x naught is less than equal to lambda f of x 1 plus 1 minus lambda
f of x 2. And this is from the result which I have just shown you, this is the condition
for the function to be convex. Hence, we have proved that if the second derivative is positive
then the function is convex. So if function has second derivative that is non-negative
over the interval then the function is convex. And if the second derivative is positive then
it is strictly convex. And similarly, we can also show the second derivative is non-positive
then function is concave. So, let us take some example to illustrate
the Jensen’s inequality that we have talked about. So, use Jensen’s inequality to find
appropriate inequalities between expected value of this function e raise to the power
of minus a x and e raise to the power to minus a expected value of x, where a is greater
than equal to 0. And the second function that we are considering is expected value of under
root X and square root of expected value of X. So, in the first case, this function is
given by this. So, we need to first find out so we want to find out whether the function
is concave or convex; and depending on whether the function is concave or convex, we can
find relationship between expected value of the function and function evaluated at expected
value. So, in this case, if we look at this function
f of x, we take the first derivative that is given by this and the second derivative
is given by this. And note a is greater than equal to 0, so a square is basically positive,
this quantity e of a minus x is also positive, so this term will be greater than 0. The second
derivative is greater than 0, what do we know about the function we know that the function
is convex function. So, then our f of x is convex function and strictly convex function
because this second derivative is positive. Now, if it is a convex function then we know
the expected value of the function is greater than equal to function evaluated at expected
value of x. So, for this particular function, we can write expected value of the function
is greater than equal to function evaluated at expected value, this is because this function
was convex function. Now, let us look at this second example. So,
here the function is under root of x. Now, we take the first derivative that is given
by this, and this is the second derivative. This second derivative is non-positive so
that means this is a concave function. And if it is the concave function then we know
expected value of the function is less than equal to function evaluated at expected value.
So, what is the function here function is under root of x. So, expected value of under
root of x will be less than equal to now the function is under root x so that means, under
root and function evaluated expected value of x so this expected value of under root
x will be less than equal to under root of expected value of x. Now, let us just move now to log sum inequality.
So, what does log sum inequality says if you have non-negative numbers a 1 a 2, a 3, a
n lets say and b 1, b 2, b 3, b n where summation of if a i is bounded and summation of this
b i greater than 0. And again its bounded then log sum inequality says that summation
a i log a i by b i is greater than equal to summation of a i log of summation of a i by
summation b i. So, this is basically your log sum inequality.
Now let us prove this. So, let us consider a i dash which is a i by summation of this
a is bi’s summation of this bi’s. Now, let us find divergence between a i dash and
b i dash. Now, what do we know about divergence between two density function basically we
know the divergence is greater than equal to 0. So, divergence between a i dash and
b i dash would be greater than equal to 0. So, this we are writing it here. So, this
is the expression for divergence between a i dash and bi dash and we know that divergence
is greater than equal to 0, this we have proved in the earlier lectures.
Now, let us replace a i dash and b i dash in this expression. So, a i dash is given
by this and b i dash is given by this. So, we plug in these values of a i dash and bi
dash, what we get is this expression. You can see this is my a i dash this is my a i
dash and this is my b i dash. Now, further simplifying, I take this summation a i out,
so what I get here is summation of a i log a i by b i minus summation a i log of summation
of a i divided by summation of bi’s, now I know this is greater than equal to 0. Now,
this is sum of non-negative numbers, so that is the positive quantity. So, when will this
be greater than equal to 0, when this term is greater than equal to this term. So, what
we have proved here is now then this relationship holds and this is known as log sum inequality. Now, let us take a simple example, where we
will prove that divergence is a convex function in the probability period p and q. And we
are going to make use of log sum inequality to prove this result. So, we want to show
that divergence between p and q is convex function in the pair p and q. So, what do
we mean is if p 1, q 1 and p 2, q 2 are two pairs of probability mass function then function
evaluated at x naught is less than equal to expected value of the function. This is when
the function is convex. When the function we have just showed that function at expected
value of x is less than equal to expected value of function this is for the case when
f of x is convex, this is from the Jensen’s inequality. So, to prove that this is convex
function we have to then from Jensen’s inequality we have to show this result where lambda lies
between 0 and 1. So, let us see how we can make use of log
sum inequality to prove this result. Now, we are applying log sum inequality. Now, what
does log sum inequality says, the log sum inequality says summation a i log a i by b
i this is greater than equal to summation a i log of summation a i by summation b i.
Now, let us take a 1 to be lambda times p 1 x and a 2 to be 1 minus lambda times p 2
x. Similarly, let us take b 1 to be lambda times q 1 x and b 2 to be 1 minus lambda times
q 2 x and plug in these value of a 1, a 2 and b 1, b 2 into our log sum inequality.
Then according to log sum inequality summation a i log summation a i by summation b i is
less than equal to summation of a i log of a i by b i.
So, what is summation a i this is this plus this, so this is this quantity. What is summation
a i, so log of summation a i is this quantity. And what is summation b i, b 1 is lambda q
1 x and b 2 is 1 minus lambda q 2 x. So, this is my summation bi. Now according to log sum
inequality this summation is less than equal to summation of a i log a i by b i. So, this
summation is less than a 1 log a 1 by b 1 plus a 2 log a 2 by b 2. So, what is a 1 this
is my a 1 and this is my a 1, and this is my b 1 this is my a 2, this is my a 2 and
this is my b 2. So, this result which I have written so far is just a direct application
of log sum inequality where I chose a 1 to be given by this, a 2 to be given by this,
b 1 to be given by this, and b 2 to be given by this. So if I do this using log sum inequality,
I get this result. Now, what is the next step I do, I sum it
over all x, so I sum it over all x, sum it over all x, sum it over all x, sum it over
all x. So, what do I get here, you can think of new probability which is lambda p 1 x plus
1 minus lambda p 2 x, and this is another new probability lambda q 1 x plus 1 minus
lambda q 2 x. So, this quantity that you see here this quantity that you see here is nothing
but this divergence between this probability and this probability. So, the term that you
see on the left hand side is nothing but this term.
What about this? This is lambda p 1 x log of this lambda, lambda cancels out. So, this
is lambda times p 1 x log p 1 x by q 1 x sum over x. This quantity is nothing but this
quantity, it is 1 minus lambda times p 2 x log this 1 minus lambda 1 minus lambda cancels
out, this p 2 x log of p 2 x by q 2 x summation of over x 1 minus lambda times. So, this term
will be 1 minus lambda times divergence between p 2 and q 2. So, using log sum inequality
we have shown that this relation holds. And from Jensen’s inequality, we know this is
the condition for this function divergence to be convex and the pair p and q. Now, let us use these results to prove some
more results on concavity and convexity. So, let us prove that entropy is a concave function
of p. Now, in the previous lecture, we have shown when we have defined divergence between
two random variables x and x hat where x hat was uniformly distributed, we have shown that
entropy of p is basically log of L minus divergence between p and uniform distribution.
Now, we have already shown in the previous slide that divergence function is a convex
function, so minus of a convex function will be a concave function. So, then from the previous
result it is very clear that entropy is a concave function of p that is because this
is just a constant and minus of a convex function will be a concave function. So, as a function
of p h of p entropy is a concave function. So, this result follows from the previous
result that we have shown that divergence is a convex function. So, the concavity of
entropy function follows from the convexity of the divergence function. Now, let us prove and state what is known
as data processing lemma. So, data processing lemma says that if we have random variable
x y and z. And if they form a Markov chain then mutual information between x and z is
less than mutual information between x and y or mutual information between x and z is
less than mutual information between y and z. So, in other words, further processing
of data does not increase the information content. So, how do we prove it? So we are
first going to prove. So, let us prove, so since x y and z forms the Markov chain then
we use the property of Markov chain, so probability of Z given X Y will only dependent on probability
of Z given Y because X Y and Z form the Markov chain. Or in terms of entropy we can write
uncertainty in z given x y is same as uncertainty in z given y this follows from the point that
x y z form the Markov chain. So, let us write down the mutual information
between Y and Z. Now, following the definition of mutual information, we can write this follows
from the definition this follows from the definition. So, mutual information, I can
write as h of z minus h of z by given y now from this I know H of Z given y is same as
H of Z given X and Y. So, I replace this H of Z given by this uncertainty in z given
x y so this line follows from here Markov process.
Now, I know that conditioning cannot increase entropy so uncertainty in z given x y is less
than equal to uncertainty in Z given X, this follows from the property that conditioning
cannot increase entropy, so if I use this relationship and replace uncertainty in Z
given X Y by larger term H of Z given X. So, what I get here is greater than equal to sine
and I am replacing this by H of Z given X. And what is this, this term uncertainty in
Z minus uncertainty in Z given X thus mutual information between this is from the definition
this is the mutual information between X and Z. So, what I have shown you is mutual information
between X and Z is less than equal to mutual information between Y and Z, when X, Y and
Z forms a Markov chain. So, I have shown you this, I have proved this
result. Now, how do I prove this result? Now if you see closely, if I can show that if
x y and z forms a Markov chain. Then it also follows that z y and x forms a Markov chain
then I can use the previous derivation to show that mutual information between z and
x. If I can show that this implies, if I can show that x y and z if they form a Markov
chain then z x and z y and x also forms a Markov chain if I can prove this then from
the previous result which I just now proved. I can show that mutual information between
X and Z will be less than mutual information between X and Y because I just now showed
you that if X, Y and Z forms a Markov chain. Then mutual information between X and Z is
less than equal to mutual information between Y and Z. So to prove this result, it is sufficient
to prove that if X, Y, Z forms a Markov chain it implies that Z, Y and k also forms a Markov
chain. So, we will show now that if X, Y and Z they
form a Markov chain then Z, Y and X will also form a Markov chain. So, to prove this we
will first consider the joint entropy of X, Y and Z; and applying chain rule we can write
this as uncertainty in y plus uncertainty in X given Y plus uncertainty in Z given X
Y. Now, since x y and z they form a Markov chain we know that uncertainty in Z given
X Y is same as uncertainty in z given y so we can write this joint entropy between X,
Y and Z as H of Y plus H of X given Y plus H of Z given Y. Now, we are writing this joint entropy again
using chain rule in a different way so I can write the same thing as H of Y plus H of Z
given Y plus H of X given Z Y. Now, let us compare this is joint entropy given by this
expression and this is joint entropy of X, Y, Z given by this expression. So, let us
compare these two equations. This is a common term H of Y it is here, it is here; and then
I have a common term h of z given y and a common term here H of Z given Y. So, these
two are same what do I get, I get this condition that this term must be equal to this. So,
in other words, uncertainty in x given y z is same as uncertainty in X given Y and this
is true when so in terms of probability I can write probability of X given Y, Z is same
as probability of X given Y. So, when is this true? This is true when Z,
Y and x form a Markov chain. So, in that case probability of X given Y Z will only depend
on probability of x given y. So, what I have shown you is if X, Y and Z forms a Markov
chain this also implies that Z Y and X will also form a Markov chain. And hence using
the proof that we had just said before we can write that mutual information between
X and Z is less than equal to mutual information between X and Y. And this proves our second
result that we had this proves our result that mutual information between X and Z is
less that mutual information between X and Y. So, this is our data processing lemma. Now, finally, we are going to state and prove
what is known as Fano’s lemma. So, we have a random variable U, let us denote U hat is
the estimate of this random variable U and we define probability of error. So, when does
an error occur, an error occur if our estimated bit is not same as the transmitted bit. So,
if U hat is not same as U, error happens and probability of that will give you probability
of error. Now, what does Fano’s lemma says is as follows. If U and U hat are L-ary random
variables and probability of error is given by P sub e then this relation holds. The binary
entropy function of well the entropy of this P e plus P e log of L minus 1 is greater than
equal to uncertainty in U given U hat. And the equality here happens if and only if probability
of error given u hat is u is same for all u and if there is an error all L minus 1 erroneous
value are equally likely to happen. So, Fano’s lemma links probability of error to the estimate
of U that we are getting uncertainty in U given U hat. So, let us prove this result. So, let us denote
a random variable Z, which basically denotes or indicates whether there is an error. So,
this random variable Z which is an indicator random variable will be 0, if there is no
error and this will be 1. If there is an error, so Z take two values either 0 or 1. 0 when
there is no error and 1 when there is an error. So, what is the entropy associated with Z,
so we can see basically, so Z takes two value so with probability 1 minus P e this thing
happens there is no error this probability and with this probability there is an error.
So, then the entropy of Z will be given by minus of P e log P e minus 1 minus p e log
1 minus P e which is nothing but entropy of P e.
Now, let us look at this term uncertainty in U Z given the estimate of U. Now, we apply
chain rule. So, this if we apply chain rule what we get is uncertainty in u given u hat
plus uncertainty Z given U, U hat is equal to this quantity. Now, let us look at each
of these two quantities what is the uncertainty in Z given U, U hat. Now, what is Z? Z is
an indicator random variable which indicates whether there is an error and it depends on
what is the value of U and U hat, so if you tell me what u and u hat are I can easily
find out. What is Z, because if U is same as U hat, Z is 0 and U is not same as u hat
and z is one so there is no uncertainty in Z given U U hat, so this term is 0 then I
can write this term is equal to this. Now, the same thing I am writing, so this
expression I got get from here. Now, again I apply chain rule. So, uncertainty in U Z
given U hat can be written as uncertainty in Z given U hat plus uncertainty in U given
U hat. Now, uncertainty we know that conditioning cannot increase uncertainty so h of z given
u hat has to be less then equal to h of z. So, then I can write this as this term is
less than equal to h of z plus h of u given U hat Z and when is this, this, this, this
is same when basically when u hat is independent of Z. So, in other words, when this condition
holds when probability of error is same irrespective of what U is then H of Z given u hat will
be H of Z. So, from here to here, it follows from the fact that conditioning cannot increase
entropy. Next, we are going to look into little more
detail with this expression. So, let us look at this expression. So, we will first look
at this when Z is 0 and then we will look at the case when Z is 1. So, what is the uncertainty
in U given U hat and Z is 0. Now, what does Z equal to 0 means, Z equal to 0 means there
is no error. So, if I give you U hat and I tell you there is no error then you know precisely
what U is? So, there is no uncertainty in U given U hat and Z equal to 0. So, then this
uncertainty is 0. What about when Z is 1, when Z is 1, this corresponds to the case
when U is not same as U hat that means an error has happened. So, in this case, you
know that U is not equal to U hat, but you do not know which of the remaining L minus
1 possibility. So, we know that U can take any value except U hat when Z is 1, and we
know that if you discreet random variable the entropy is upper bounded by number of
possibilities of x. In this case, number of possibilities for U is L minus 1, because
U is not same as U hat and U is a L valued random variable. So, this entropy is upper
bounded by log of L minus 1. So, if I plug that in then the uncertainty
in U given U hat and Z can be written as probability of Z equal to 1 uncertainty in U given U hat
and Z equal 1 plus probability of z equal to 0, and uncertainty in U given U hat and
z equal 0. Now, this uncertainty is 0, so only term that will be left is probability
of Z equal to 1 multiplied by uncertainty in u given U hat and Z equal to 1. And this
quantity we just showed is upper bounded by this, so we plug this value in here what we
get is probability of Z equal to 1 that is basically probability of error and this is
log of L minus 1. So what we have shown is this quantity is
upper bounded by this quantity. Now when is this equal to this? When all the errors are
equally likely to happen, we know that H of X is equal to log of number of possibilities
of X that is when this is uniformly distributed. So, when all the L minus 1 possibilities erroneous
possibilities are equally likely then this happens with equality. So, in this expression
of H of U given U, U hat Z, we can plug in this upper bound. If we do that what we get
is this expression. So again let me just go back. We had this
expression; H of U given U hat is less than equal to H of Z plus H of U given U hat and
Z. We have shown H of Z is given by H of p e and we showed that this is this quantity
is upper bounded by P e log of L minus 1. So, if we plug this values this is H of P
e, if we plug this values in we get our Fano’s lemma which is this. Note, this is convex
function of P e this is a linear function of P e, so sum of a convex function of P e
and linear function, so this is what is our Fano’s lemma. Now we can also get a weaken
version of this Fano’s lemma, we know that h of p e is upper bounded by 1. So, we can
get a weaker version if we just replaced this by 1. We can also say 1 plus p e log of L
minus 1 is greater than equal to H of U given U hat, so this relates probability of error
to the uncertainty in estimation of U. Thank you.