Information theory provides a different perspective for evaluating the performance of a communication system in that the performance can be compared with the theoretically best system for a given bandwidth and SNR. Significant insight into the performance characteristics of a communication system can often be gained through the study of information theory. More explicitly, information theory provides a quantitative measure of the information contained in message signals and allows us to determine the capability of a system to transfer this information from source to destination. Coding, a major application area of information theory, will be briefly presented in this chapter. We make no attempt in this chapter to be complete or rigorous. Rather we present an overview of basic ideas and illustrate these ideas through simple examples. We hope that students who study this chapter will be motivated to study these topics in more detail.
Information theory provides us with the performance characteristics of an ideal, or optimum, communication system. The performance of an ideal system provides a meaningful basis against which to compare the performance of the realizable systems studied in previous chapters.
Performance characteristics of ideal systems illustrate the gain in performance that can be obtained by implementing more complicated transmission and detection schemes.
Motivation for the study of information theory is provided by Shannon's coding theorem, which can be stated as follows: If a source has an information rate less than the channel capacity, there exists a coding procedure such that the source output can be transmitted over the channel with an arbitrarily small probability of error. This is a powerful result. Shannon tells us that transmission and reception can be accomplished with negligible error, even in the presence of noise. An understanding of this process, called coding, and an understanding of its impact on the design and performance of communication systems require an understanding of several basic concepts of information theory.
We will see that there are two basic applications of coding. The first of these is referred to as source coding. Through the use of source coding, redundancy can be removed from message signals so that each transmitted symbol carries maximum information. In addition, through the use of channel, or error-correcting, coding, systematic redundancy can be introduced into the transmitted signal so that errors caused by imperfect practical channels can be corrected.
11.1 BASIC CONCEPTS
Consider a hypothetical classroom situation occurring early in a course at the end of a class period. The professor makes one of the following statements to the class:
A. I shall see you next period.
B. My colleague will lecture next period.
C. Everyone gets an A in the course, and there will be no more class meetings.
What is the relative information conveyed to the students by each of these statements, assuming that there had been no previous discussion on the subject? Obviously, there is little information conveyed by statement (A), since the class would normally assume that their regular professor would lecture; that is, the probability $P(A)$ of the regular professor lecturing is nearly unity.
Intuitively, we know that statement (B) contains more information, and the probability $P(B)$ of a colleague lecturing is relatively low. Statement (C) contains a vast amount of information for the entire class, and most would agree that such a statement has a very low probability of occurrence in a typical classroom situation. It appears that the lower the probability of a statement, or event, the greater is the information conveyed by that statement. Stated another way, the students' surprise on hearing a statement appears to be a good measure of the information contained in that statement. Information is defined consistent with this intuitive example.
11.1.1 Information
Let $x_j$ be an event that occurs with probability $p(x_j)$. If we are told that event $x_j$ has occurred, we say that we have received

$$I(x_j) = \log_a \frac{1}{p(x_j)} = -\log_a p(x_j) \qquad (11.1)$$

units of information. This definition is consistent with the previous example since $I(x_j)$ increases as $p(x_j)$ decreases. Note that $I(x_j)$ is nonnegative since $0 \le p(x_j) \le 1$. The base of the logarithm in (11.1) is arbitrary and determines the units by which information is measured.
R. V. Hartley,¹ who first suggested the logarithmic measure of information in 1928, used logarithms to the base 10 since tables of base-10 logarithms were widely available, and the resulting measure of information was the hartley. Today it is standard to use logarithms to the base 2, and the unit of information is the binary unit, or bit. If logarithms to the base $e$ are used, the corresponding unit is the nat, or natural unit.
There are several reasons for us to adopt the base 2 logarithm to measure information. The simplest random experiment that one can imagine is an experiment with two equally likely outcomes. Flipping an unbiased coin is a common example. Knowledge of each outcome has associated with it one bit of information, since the logarithm base is 2 and the probability of each outcome is 0.5. Since the digital computer is a binary machine, each logical 0 and each logical 1 has associated with it one bit of information, assuming that each of these logical states is equally likely.
¹Hartley (1928)
EXAMPLE 11.1
Consider a random experiment with 16 equally likely outcomes. The information associated with each outcome is
$$I(x_j) = -\log_2 \frac{1}{16} = \log_2 16 = 4 \text{ bits} \qquad (11.2)$$
where $j$ ranges from 1 to 16. The information associated with each outcome is greater than one bit, since the random experiment generates more than two equally likely outcomes and, therefore, the probability of each outcome is less than one-half.
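The information measure in (11.1) is straightforward to evaluate numerically. The short Python sketch below is an illustration added here, not part of the original development; it checks the result of Example 11.1 and the one-bit coin-flip case.

```python
import math

def information(p):
    """Information I(x) = -log2 p(x), in bits, conveyed by an outcome of probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(p)

print(information(1 / 16))   # 4.0 bits, matching Example 11.1
print(information(0.5))      # 1.0 bit for one flip of a fair coin
```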
11.1.2 Entropy
In general, the average information associated with the outcomes of an experiment is of interest rather than the information associated with a particular outcome. The average information associated with a discrete random variable $X$ is defined as the entropy $H(X)$. Thus
$$H(X) = E\{I(x_j)\} = -\sum_{j=1}^{n} p(x_j)\log_2 p(x_j) \qquad (11.3)$$

where $n$ is the total number of possible outcomes. Entropy can be regarded as average uncertainty and therefore achieves a maximum when all outcomes are equally likely.
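Definition (11.3) translates directly into a few lines of code. The Python sketch below is an illustration only; it follows the usual convention that terms with zero probability contribute nothing to the sum.

```python
import math

def entropy(probs):
    """Entropy H(X) = -sum p log2 p of a discrete distribution, in bits (0 log 0 taken as 0)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely outcomes
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely outcomes
print(entropy([0.9, 0.1]))   # about 0.47 bits: a highly skewed binary source
```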
EXAMPLE 11.2
For a binary source, let $p(1) = \alpha$ and $p(0) = 1 - \alpha = \beta$. From (11.3), the entropy is

$$H(\alpha) = -\alpha\log_2 \alpha - (1-\alpha)\log_2(1-\alpha) \qquad (11.4)$$

This is sketched in Figure 11.1. We note that if $\alpha = \frac{1}{2}$, each symbol is equally likely, and our uncertainty, and therefore the entropy, is a maximum. If $\alpha \ne \frac{1}{2}$, one of the two symbols becomes more likely than the other. Therefore the uncertainty, and consequently the entropy, decreases. If $\alpha$ is equal to zero or one, our uncertainty is zero, since we know exactly which symbol will occur.
From Example 11.2 we conclude, at least for the special case illustrated in Figure 11.1, that the entropy function has a maximum, which occurs when all probabilities are equal. This fact is of sufficient importance to warrant a more complete derivation.

Figure 11.1 Entropy of a binary source (entropy versus $\alpha$).

Assume that a chance experiment has $n$ possible outcomes and that $p_n$ is a dependent variable depending on the other probabilities. Thus

$$p_n = 1 - (p_1 + p_2 + \cdots + p_k + \cdots + p_{n-1}) \qquad (11.5)$$

where $p_j$ is concise notation for $p(x_j)$. The entropy associated with the chance experiment is
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (11.6)$$
In order to find the maximum value of entropy, the entropy is differentiated with respect to $p_k$, holding all probabilities constant except $p_k$ and $p_n$. This gives a relationship between $p_k$ and $p_n$ that yields the maximum value of $H$. Since all derivatives are zero except the ones involving $p_k$ and $p_n$,
$$\frac{dH}{dp_k} = \frac{d}{dp_k}\left(-p_k\log_2 p_k - p_n\log_2 p_n\right) \qquad (11.7)$$

Using (11.5) and

$$\frac{d}{dx}\log_a u = \frac{1}{u}\log_a e\,\frac{du}{dx} \qquad (11.8)$$
gives
$$\frac{dH}{dp_k} = -p_k\frac{1}{p_k}\log_2 e - \log_2 p_k + p_n\frac{1}{p_n}\log_2 e + \log_2 p_n \qquad (11.9)$$

or

$$\frac{dH}{dp_k} = \log_2\frac{p_n}{p_k} \qquad (11.10)$$

which is zero if $p_k = p_n$. Since $p_k$ is arbitrary,
$$p_1 = p_2 = \cdots = p_n = \frac{1}{n} \qquad (11.11)$$
To show that the preceding condition yields a maximum and not a minimum, note that when $p_1 = 1$ and all other probabilities are zero, the entropy is zero. From (11.6), the case where all probabilities are equal yields $H = \log_2 n$.
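The claim that equal probabilities maximize entropy, with maximum value $\log_2 n$, is easy to spot-check numerically. The Python sketch below is illustrative only; it compares the uniform distribution with a few randomly chosen distributions for $n = 4$.

```python
import math
import random

def entropy(probs):
    """H = -sum p log2 p, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
print(entropy([1 / n] * n), math.log2(n))   # both 2.0: H = log2 n at equal probabilities

# No randomly chosen distribution over n outcomes exceeds log2 n.
random.seed(0)
for _ in range(5):
    raw = [random.random() for _ in range(n)]
    probs = [x / sum(raw) for x in raw]
    print(round(entropy(probs), 4), "<=", math.log2(n))
```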
11.1.3 Discrete Channel Models
Throughout most of this chapter we will assume the communications channel to be memoryless. For such channels, the channel output at a given time is a function of the channel input at that time and is not a function of previous channel inputs. Discrete memoryless channels are completely specified by the set of conditional probabilities that relate the probability of each output state to the input probabilities. An example illustrates the technique. A diagram of a channel with two inputs and three outputs is illustrated in Figure 11.2. Each possible input-to-output path is indicated along with a conditional probability $p_{ij}$, which is concise notation for $p(y_j|x_i)$. Thus $p_{ij}$ is the conditional probability of output $y_j$ given input $x_i$ and is called a channel transition probability. The complete set of transition probabilities defines the channel. In this
chapter, the transition probabilities are assumed constant. However, in many commonly encountered situations, the transition probabilities are time varying. An example is the wireless mobile channel in which the transmitter–receiver distance is changing with time.
We can see from Figure 11.2 that the channel is completely specified by the set of transition probabilities. Accordingly, the memoryless channel illustrated in Figure 11.2 can be defined by the matrix of transition probabilities $[P(Y|X)]$, where
$$[P(Y|X)] = \begin{bmatrix} p(y_1|x_1) & p(y_2|x_1) & p(y_3|x_1) \\ p(y_1|x_2) & p(y_2|x_2) & p(y_3|x_2) \end{bmatrix} \qquad (11.12)$$

Since each channel input results in some output, each row of $[P(Y|X)]$ must sum to unity. We refer to the matrix of transition probabilities as the channel matrix.
The channel matrix is useful in deriving the output probabilities given the input probabilities. For example, if the input probabilities $P(X)$ are represented by the row matrix
$$[P(X)] = [\,p(x_1) \quad p(x_2)\,] \qquad (11.13)$$

then

$$[P(Y)] = [\,p(y_1) \quad p(y_2) \quad p(y_3)\,] \qquad (11.14)$$

which is computed by

$$[P(Y)] = [P(X)]\,[P(Y|X)] \qquad (11.15)$$

If $[P(X)]$ is written as a diagonal matrix, (11.15) yields a matrix $[P(X,Y)]$. Each element in the matrix has the form $p(x_i)\,p(y_j|x_i)$ or $p(x_i, y_j)$. This matrix is known as the joint probability matrix, and the term $p(x_i, y_j)$ is the joint probability of transmitting $x_i$ and receiving $y_j$.
EXAMPLE 11.3
Consider the binary input–output channel shown in Figure 11.3. The matrix of transition probabilities is

$$[P(Y|X)] = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} \qquad (11.16)$$

If the input probabilities are $P(x_1) = 0.5$ and $P(x_2) = 0.5$, the output probabilities are

$$[P(Y)] = [\,0.5 \quad 0.5\,]\begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} = [\,0.55 \quad 0.45\,] \qquad (11.17)$$

and the joint probability matrix for the channel is

$$[P(X,Y)] = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} = \begin{bmatrix} 0.35 & 0.15 \\ 0.20 & 0.30 \end{bmatrix} \qquad (11.18)$$

Figure 11.2 Channel diagram (two inputs $x_1$, $x_2$; three outputs $y_1$, $y_2$, $y_3$; transition probabilities $p_{ij}$).
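The matrix relation (11.15) and the joint probability matrix construction used in (11.18) are easy to reproduce numerically. The following Python sketch is an illustration using NumPy, not part of the text; it recomputes the output probabilities and joint probability matrix of Example 11.3.

```python
import numpy as np

# Channel matrix [P(Y|X)]: rows are inputs, columns are outputs; each row sums to 1.
P_Y_given_X = np.array([[0.7, 0.3],
                        [0.4, 0.6]])

P_X = np.array([0.5, 0.5])          # input probabilities as a row vector

P_Y = P_X @ P_Y_given_X             # (11.15): [P(Y)] = [P(X)][P(Y|X)]
P_XY = np.diag(P_X) @ P_Y_given_X   # joint probability matrix [P(X,Y)]

print(P_Y)    # [0.55 0.45], matching (11.17)
print(P_XY)   # [[0.35 0.15] [0.2 0.3]], matching (11.18)
```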
As we first observed in Chapter 9, a binary satellite communication system can often be represented by the cascade combination of two binary channels. This is illustrated in Figure 11.4(a), in which the first binary channel represents the uplink and the second binary channel represents the downlink. These channels can be combined as shown in Figure 11.4(b).
By determining all possible paths from $x_i$ to $z_j$, it is clear that the following probabilities define the overall channel illustrated in Figure 11.4(b):

$$p_{11} = \alpha_1\beta_1 + \alpha_2\beta_3 \qquad (11.19)$$
$$p_{12} = \alpha_1\beta_2 + \alpha_2\beta_4 \qquad (11.20)$$
$$p_{21} = \alpha_3\beta_1 + \alpha_4\beta_3 \qquad (11.21)$$
$$p_{22} = \alpha_3\beta_2 + \alpha_4\beta_4 \qquad (11.22)$$
Figure 11.4 Two-hop satellite system. (a) Binary satellite channel, with uplink transition probabilities $\alpha_i$ and downlink transition probabilities $\beta_i$. (b) Composite satellite channel, with overall transition probabilities $p_{ij}$ from $x_i$ to $z_j$.

Figure 11.3 Binary channel (transition probabilities 0.7, 0.3 and 0.4, 0.6).
Thus the overall channel matrix
$$[P(Z|X)] = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix} \qquad (11.23)$$

can be represented by the matrix multiplication

$$[P(Z|X)] = \begin{bmatrix} \alpha_1 & \alpha_2 \\ \alpha_3 & \alpha_4 \end{bmatrix}\begin{bmatrix} \beta_1 & \beta_2 \\ \beta_3 & \beta_4 \end{bmatrix} \qquad (11.24)$$

For a two-hop communications system, the right-hand side of the preceding expression is simply the uplink channel matrix multiplied by the downlink channel matrix.
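Equation (11.24) is simply a matrix product, as the following Python sketch illustrates. The uplink and downlink error probabilities used here (0.01 and 0.02) are hypothetical values chosen for the illustration, not values from the text.

```python
import numpy as np

# Hypothetical uplink and downlink channel matrices (binary symmetric, error probabilities 0.01 and 0.02).
uplink   = np.array([[0.99, 0.01],
                     [0.01, 0.99]])
downlink = np.array([[0.98, 0.02],
                     [0.02, 0.98]])

# (11.24): the composite channel matrix is the uplink matrix times the downlink matrix.
P_Z_given_X = uplink @ downlink
print(P_Z_given_X)               # [[0.9704 0.0296] [0.0296 0.9704]]
print(P_Z_given_X.sum(axis=1))   # each row still sums to 1
```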
11.1.4 Joint and Conditional Entropy
Using the input probabilities $p(x_i)$, the output probabilities $p(y_j)$, the transition probabilities $p(y_j|x_i)$, and the joint probabilities $p(x_i, y_j)$, we can define several different entropy functions for a channel with $n$ inputs and $m$ outputs. These are
$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i) \qquad (11.25)$$

$$H(Y) = -\sum_{j=1}^{m} p(y_j)\log_2 p(y_j) \qquad (11.26)$$

$$H(Y|X) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log_2 p(y_j|x_i) \qquad (11.27)$$

and

$$H(X,Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log_2 p(x_i, y_j) \qquad (11.28)$$

An important and useful entropy, $H(X|Y)$, is defined as

$$H(X|Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log_2 p(x_i|y_j) \qquad (11.29)$$

These entropies are easily interpreted. $H(X)$ is the average uncertainty of the source, whereas $H(Y)$ is the average uncertainty of the received symbol. Similarly, $H(X|Y)$ is a measure of our average uncertainty of the transmitted symbol after we have received a symbol. The function $H(Y|X)$ is the average uncertainty of the received symbol given that $X$ was transmitted. The joint entropy $H(X,Y)$ is the average uncertainty of the communication system as a whole.
Two important and useful relationships, which can be obtained directly from the previously defined entropies, are
$$H(X,Y) = H(X|Y) + H(Y) \qquad (11.30)$$

and

$$H(X,Y) = H(Y|X) + H(X) \qquad (11.31)$$

These relationships are developed in Problem 11.13.
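The entropy definitions (11.25) through (11.29) and the relationships (11.30) and (11.31) can be verified numerically from a joint probability matrix. The Python sketch below is an illustration, applied to the joint matrix of Example 11.3; it computes the marginal, joint, and conditional entropies and checks their consistency.

```python
import numpy as np

def H(probs):
    """Entropy -sum p log2 p over the nonzero entries of an array of probabilities."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint probability matrix [P(X,Y)] from Example 11.3: rows are inputs, columns are outputs.
P_XY = np.array([[0.35, 0.15],
                 [0.20, 0.30]])

P_X = P_XY.sum(axis=1)      # marginal input probabilities
P_Y = P_XY.sum(axis=0)      # marginal output probabilities

H_XY = H(P_XY)              # joint entropy H(X,Y), (11.28)
H_X, H_Y = H(P_X), H(P_Y)   # (11.25) and (11.26)
H_X_given_Y = H_XY - H_Y    # from (11.30)
H_Y_given_X = H_XY - H_X    # from (11.31)

# Direct evaluation of (11.29) as a cross-check on (11.30):
H_X_given_Y_direct = -np.sum(P_XY * np.log2(P_XY / P_Y))
print(np.isclose(H_X_given_Y, H_X_given_Y_direct))    # True
print(H_X, H_Y, H_XY, H_X_given_Y, H_Y_given_X)
```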
11.1.5 Channel Capacity
Consider for a moment an observer at the channel output. The observer's average uncertainty concerning the channel input will have the value $H(X)$ before the reception of an output, and this average uncertainty of the input will typically decrease when the output is received. In other words, $H(X|Y) \le H(X)$. The decrease in the average uncertainty of the transmitted signal when the output is received is a measure of the average information transmitted through the channel. This is defined as the mutual information $I(X;Y)$. Thus
$$I(X;Y) = H(X) - H(X|Y) \qquad (11.32)$$

It follows from (11.30) and (11.31) that we can also write (11.32) as

$$I(X;Y) = H(Y) - H(Y|X) \qquad (11.33)$$

It should be observed that mutual information is a function of the source probabilities as well as of the channel transition probabilities.
It is easy to show mathematically that
$$H(X) \ge H(X|Y) \qquad (11.34)$$

by showing that

$$H(X) - H(X|Y) = I(X;Y) \ge 0 \qquad (11.35)$$

Substitution of (11.29) for $H(X|Y)$ and (11.25) for $H(X)$ allows us to write $I(X;Y)$ as
$$I(X;Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log_2 \frac{p(x_i)}{p(x_i|y_j)} \qquad (11.36)$$

Since

$$\log_2 x = \frac{\ln x}{\ln 2} \qquad (11.37)$$

and

$$\frac{p(x_i)}{p(x_i|y_j)} = \frac{p(x_i)\,p(y_j)}{p(x_i, y_j)} \qquad (11.38)$$
we can write $I(X;Y)$ as

$$I(X;Y) = -\frac{1}{\ln 2}\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\ln\frac{p(x_i)\,p(y_j)}{p(x_i, y_j)} \qquad (11.39)$$

In order to carry the derivation further, we need the often-used inequality

$$\ln x \le x - 1 \qquad (11.40)$$

which we can easily derive by considering the function

$$f(x) = \ln x - (x - 1) \qquad (11.41)$$
The derivative of $f(x)$,

$$\frac{df}{dx} = \frac{1}{x} - 1 \qquad (11.42)$$

is equal to zero at $x = 1$. It follows that $f(1) = 0$ is the maximum value of $f(x)$, since we can make $f(x)$ less than zero by choosing $x$ sufficiently large $(x > 1)$. Using the inequality (11.40) in (11.39) results in
$$-I(X;Y) \le \frac{1}{\ln 2}\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\left[\frac{p(x_i)\,p(y_j)}{p(x_i, y_j)} - 1\right] \qquad (11.43)$$

which yields

$$-I(X;Y) \le \frac{1}{\ln 2}\left[\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i)\,p(y_j) - \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\right] \qquad (11.44)$$

Since both of the double sums equal 1, we have the desired result

$$-I(X;Y) \le 0 \quad \text{or} \quad I(X;Y) \ge 0 \qquad (11.45)$$

Thus we have shown that mutual information is always nonnegative and, consequently, $H(X) \ge H(X|Y)$.
The channel capacity $C$ is defined as the maximum value of mutual information, which is the maximum average information per symbol that can be transmitted through the channel for each channel use. Thus

$$C = \max\left[I(X;Y)\right] \qquad (11.46)$$

The maximization is with respect to the source probabilities, since the transition probabilities are fixed by the channel. However, the channel capacity is a function of only the channel transition probabilities, since the maximization process eliminates the dependence on the source probabilities. The following examples illustrate the method.
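For a channel with two inputs, the maximization in (11.46) reduces to a one-dimensional search over the input probability. The Python sketch below is an illustration, not a derivation; it evaluates $I(X;Y) = H(Y) - H(Y|X)$ on a grid of input probabilities for the channel of Figure 11.3 and reports the largest value found as a capacity estimate.

```python
import numpy as np

def mutual_information(p_x1, P_Y_given_X):
    """I(X;Y) = H(Y) - H(Y|X) for a two-input channel with input probability p(x1)."""
    P_X = np.array([p_x1, 1.0 - p_x1])
    P_XY = np.diag(P_X) @ P_Y_given_X      # joint probability matrix
    P_Y = P_XY.sum(axis=0)                 # output probabilities

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_Y = H(P_Y)
    H_Y_given_X = -np.sum(P_XY[P_XY > 0] * np.log2(P_Y_given_X[P_XY > 0]))
    return H_Y - H_Y_given_X

P_Y_given_X = np.array([[0.7, 0.3],
                        [0.4, 0.6]])       # channel of Figure 11.3

grid = np.linspace(0.001, 0.999, 999)
I_vals = [mutual_information(a, P_Y_given_X) for a in grid]
best = int(np.argmax(I_vals))
print(grid[best], I_vals[best])            # best input probability and capacity estimate (bits/symbol)
```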
EXAMPLE 11.4
The channel capacity of the discrete noiseless channel illustrated in Figure 11.5 is easily determined. We start with
$$I(X;Y) = H(X) - H(X|Y)$$

and write
$$H(X|Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log_2 p(x_i|y_j) \qquad (11.47)$$
Figure 11.5 Noiseless channel ($n$ inputs, each connected to a single output with probability 1).
For the noiseless channel, all $p(x_i, y_j)$ and $p(x_i|y_j)$ are zero unless $i = j$. For $i = j$, $p(x_i|y_j)$ is unity. Thus $H(X|Y)$ is zero for the noiseless channel, and

$$I(X;Y) = H(X) \qquad (11.48)$$
We have seen that the entropy of a source is maximum if all source symbols are equally likely. Thus

$$C = \sum_{i=1}^{n} \frac{1}{n}\log_2 n = \log_2 n \qquad (11.49)$$
EXAMPLE 11.5
An important and useful channel model is the binary symmetric channel (BSC) illustrated in Figure 11.6.
We determine the capacity by maximizing
$$I(X;Y) = H(Y) - H(Y|X)$$

where

$$H(Y|X) = -\sum_{i=1}^{2}\sum_{j=1}^{2} p(x_i, y_j)\log_2 p(y_j|x_i) \qquad (11.50)$$

Using the probabilities defined in Figure 11.6, we obtain
$$H(Y|X) = -\alpha p\log_2 p - (1-\alpha)\,p\log_2 p - \alpha q\log_2 q - (1-\alpha)\,q\log_2 q \qquad (11.51)$$

or

$$H(Y|X) = -p\log_2 p - q\log_2 q \qquad (11.52)$$

Thus

$$I(X;Y) = H(Y) + p\log_2 p + q\log_2 q \qquad (11.53)$$

which is maximum when $H(Y)$ is maximum. Since the system output is binary, $H(Y)$ is a maximum when each output has a probability of $\frac{1}{2}$. Note that for a BSC, equally likely outputs result from equally likely inputs.

Since the maximum value of $H(Y)$ for a binary channel is unity, the channel capacity is

$$C = 1 + p\log_2 p + q\log_2 q = 1 - H(p) \qquad (11.54)$$

where $H(p)$ is defined in (11.4).
The capacity of a BSC is sketched in Figure 11.7. As expected, if $p = 0$ or 1, the channel output is completely determined by the channel input, and the capacity is 1 bit per symbol. If $p$ is equal to 0.5, an input symbol yields either output symbol with equal probability, and the capacity is zero.
Figure 11.6 Binary symmetric channel, with input probabilities $P(x_1) = \alpha$ and $P(x_2) = 1 - \alpha$, correct-reception probability $p$, and error probability $q$.
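As a numerical check on (11.54), the short Python sketch below (illustrative only) evaluates the BSC capacity at a few values of $p$, reproducing the endpoints and midpoint of the curve in Figure 11.7.

```python
import math

def bsc_capacity(p):
    """Capacity C = 1 - H(p) of a binary symmetric channel, in bits per symbol."""
    if p in (0.0, 1.0):
        return 1.0                     # output completely determined by input
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bsc_capacity(p), 4))   # 1.0, 0.531, 0.0, 0.531, 1.0
```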
It is worth noting that the capacity of the channel illustrated in Figure 11.5 is most easily found by starting with (11.32), while the capacity of the channel illustrated in Figure 11.6 is most easily found by starting with (11.33). Choosing the appropriate expression for $I(X;Y)$ can often save considerable effort. It sometimes takes insight and careful study of a problem to choose the expression for $I(X;Y)$ that yields the capacity with minimum computational effort.
The error probability $P_E$ of a binary symmetric channel is easily computed. From

$$P_E = \sum_{i=1}^{2} p(e|x_i)\,p(x_i) \qquad (11.55)$$

where $p(e|x_i)$ is the error probability given input $x_i$, we have

$$P_E = q\,p(x_1) + q\,p(x_2) = q\,[p(x_1) + p(x_2)] \qquad (11.56)$$

Thus

$$P_E = q$$

which states that the unconditional error probability $P_E$ is equal to the conditional error probability $p(y_j|x_i)$, $i \ne j$.
In Chapter 8 we showed that $P_E$ is a decreasing function of the energy of the received symbols. Since the symbol energy is the received power multiplied by the symbol period, it follows that if the transmitter power is fixed, the error probability can be reduced by decreasing the source rate. This can be accomplished by removing the redundancy at the source through a process called source coding.
EXAMPLE 11.6
In Chapter 8 we showed that for binary coherent FSK systems, the probability of symbol error is the same for each transmitted symbol. Thus, a BSC model is a suitable model for FSK transmission. In this example we determine the channel matrix assuming that the transmitter power is 1000 W, the attenuation in the channel from transmitter to receiver input is 30 dB, the source rate $r$ is 10,000 symbols per second, and the noise power spectral density $N_0$ is $2\times10^{-5}$ W/Hz.
Since the channel attenuation is 30 dB, the signal power $P_R$ at the input to the receiver is

$$P_R = (1000)\,10^{-3} = 1 \text{ W} \qquad (11.57)$$

This corresponds to a received energy per symbol of

$$E_s = P_R T = \frac{1}{10{,}000} = 10^{-4} \text{ J} \qquad (11.58)$$
Figure 11.7 Capacity of a binary symmetric channel (capacity versus $p$).
In Chapter 8 we saw that the error probability for a coherent FSK receiver is

$$P_E = Q\!\left(\sqrt{\frac{E_s}{N_0}}\right) \qquad (11.59)$$

which, with the given values, yields $P_E = 0.0127$. Thus, the channel matrix is

$$[P(Y|X)] = \begin{bmatrix} 0.9873 & 0.0127 \\ 0.0127 & 0.9873 \end{bmatrix} \qquad (11.60)$$

It is interesting to compute the change in the channel matrix resulting from a moderate reduction in source symbol rate with all other parameters held constant. If the source symbol rate is reduced 25% to 7500 symbols per second, the received energy per symbol becomes

$$E_s = \frac{1}{7500} = 1.333\times10^{-4} \text{ J} \qquad (11.61)$$

With the other given parameters, the symbol-error probability becomes $P_E = 0.0049$, which yields the channel matrix

$$[P(Y|X)] = \begin{bmatrix} 0.9951 & 0.0049 \\ 0.0049 & 0.9951 \end{bmatrix} \qquad (11.62)$$

Thus the 25% reduction in source symbol rate results in an improvement of the system symbol-error probability by a factor of almost 3. In Section 11.2 we will investigate a technique that sometimes allows the source symbol rate to be reduced without reducing the source information rate.
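The numbers in Example 11.6 can be reproduced with a few lines of Python. The sketch below is an illustration only; it expresses the Gaussian Q-function in terms of the complementary error function.

```python
import math

def Q(x):
    """Gaussian Q-function: area under the standard normal tail beyond x."""
    return 0.5 * math.erfc(x / math.sqrt(2))

P_T = 1000.0            # transmitter power, W
atten_dB = 30.0         # channel attenuation, dB
N0 = 2e-5               # noise power spectral density, W/Hz

P_R = P_T * 10 ** (-atten_dB / 10)     # received power: 1 W

for r in (10_000, 7_500):              # source symbol rates, symbols/s
    Es = P_R / r                       # received energy per symbol, J
    PE = Q(math.sqrt(Es / N0))         # coherent FSK symbol-error probability, (11.59)
    print(r, round(PE, 4))             # approximately 0.0127 and 0.0049, matching (11.60) and (11.62)
```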
11.2 SOURCE CODING
We determined in the preceding section that the information from a source producing symbols according to some probability scheme could be described by the entropy $H(X)$. Since entropy has units of bits per symbol, we also must know the symbol rate in order to specify the source information rate in bits per second. In other words, the source information rate $R_s$ is given by

$$R_s = r H(X) \text{ bps} \qquad (11.63)$$

where $H(X)$ is the source entropy in bits per symbol and $r$ is the symbol rate in symbols per second.
Let us assume that this source is the input to a channel with capacity $C$ bits per symbol, or $SC$ bits per second, where $S$ is the available symbol rate for the channel. An important theorem of information theory, Shannon's noiseless coding theorem, is stated as follows: Given a channel and a source that generates information at a rate less than the channel capacity, it is possible to code the source output in such a manner that it can be transmitted through the channel. A proof of this theorem is beyond the scope of this introductory treatment of information theory and can be found in any of the standard information theory textbooks.² However, we demonstrate the theorem by a simple example.
²See, for example, Gallagher (1968).