SYMBOLS, SIGNALS AND NOISE: The Nature and Process of Communication

BOOKS BY J. R. PIERCE
SYMBOLS, SIGNALS AND NOISE
MAN'S WORLD OF SOUND (WITH EDWARD E. DAVID, JR.)
ELECTRONS, WAVES AND MESSAGES
TRAVELING-WAVE TUBES
THEORY AND DESIGN OF ELECTRON BEAMS

BY J. R. PIERCE

HARPER MODERN SCIENCE SERIES, EDITED BY JAMES R. NEWMAN
HARPER & BROTHERS, NEW YORK

COPYRIGHT 1961 BY JOHN R. PIERCE
PRINTED IN THE UNITED STATES OF AMERICA
ALL RIGHTS RESERVED. NO PART OF THE BOOK MAY BE USED OR REPRODUCED IN ANY MANNER WHATSOEVER WITHOUT WRITTEN PERMISSION EXCEPT IN THE CASE OF BRIEF QUOTATIONS EMBODIED IN CRITICAL ARTICLES AND REVIEWS. FOR INFORMATION ADDRESS HARPER & BROTHERS, 49 EAST 33RD STREET, NEW YORK 16, N.Y.
FIRST EDITION
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 61-10215

TO CLAUDE AND BETTY SHANNON

Contents

PREFACE
I. THE WORLD AND THEORIES
II. THE ORIGINS OF INFORMATION THEORY
III. A MATHEMATICAL MODEL
IV. ENCODING AND BINARY DIGITS
V. ENTROPY
VI. LANGUAGE AND MEANING
VII. EFFICIENT ENCODING
VIII. THE NOISY CHANNEL
IX. MANY DIMENSIONS
X. INFORMATION THEORY AND PHYSICS
XI. CYBERNETICS
XII. INFORMATION THEORY AND PSYCHOLOGY
XIII. INFORMATION THEORY AND ART
XIV. BACK TO COMMUNICATION THEORY
APPENDIX: ON MATHEMATICAL NOTATION
GLOSSARY
INDEX

Preface

WHEN JAMES R. NEWMAN suggested to me that I write a book about communication I was delighted. All my technical work has been inspired by one aspect or another of communication. Of course I would like to tell others what seems to me to be interesting and challenging in this important field.

It would have been difficult to do this and to give any sense of unity to the account before Claude E. Shannon published "A Mathematical Theory of Communication" in 1948.
Shannon's communication theory, which is also called information theory, has brought into a reasonable relation the many problems that have been troubling communication engineers for years. It has created a broad but clearly defined and limited field where before there were many special problems and ideas whose interrelations were not well understood. No one can accuse me of being a Shannon worshiper and get away unrewarded.

Thus, I felt that my account of communication must be an account of communication theory as Shannon formulated it. The account would have to be broader than Shannon's in that it would discuss the relation or lack of relation of communication theory to the many fields to which people have applied it or tried to apply it. It would have to be narrower than Shannon's account in that it would have to be less mathematical.

Here came the rub. My account could be less mathematical than Shannon's, but it could not be nonmathematical. Communication theory is a mathematical theory. It starts from certain premises that define the aspects of communication with which it will deal, and it proceeds from these premises to various logical conclusions. The glory of communication theory lies in certain mathematical theorems which are both surprising and important. To talk about communication theory without communicating its real mathematical content would be like endlessly telling a man about a wonderful composer yet never letting him hear an example of the composer's music.

How was I to proceed? It seemed to me that I had to make the book self-contained, so that any mathematics in it could be understood without referring to other books or without calling for the particular content of early mathematical training, such as high school algebra. Did this mean that I had to avoid mathematical notation? Not necessarily, but any mathematical notation would have to be explained in the most elementary terms.
I have done this both in the text and in an appendix; by going back and forth between the two the mathematically untutored reader should be able to resolve any difficulties.

But just how difficult should the most difficult mathematical arguments be? Although it meant sliding over some very important points, I resolved to keep things easy compared with, say, the more difficult parts of Newman's The World of Mathematics. When the going is very difficult, I have merely indicated the general nature of the sort of mathematics used rather than trying to describe its content clearly.

Nonetheless, this book has sections which will be hard for the nonmathematical reader. I advise him merely to skim through these, gathering what he can. When he has gone through the book in this manner he will see why the difficult sections are there. Then he can turn back and understand them if he wishes. But, had I not put these difficult sections in, and had the reader wanted the sort of understanding that takes real thought, he would have been stuck. As far as I know, other available literature on communication theory is either too simple or too difficult to help the diligent but inexpert reader beyond the easier parts of this book. I might note also that some of the literature is confused and some of it is just plain wrong.

By this sort of talk I may have raised wonder in the reader's mind as to whether or not communication theory is really worth so much trouble, either on his part or on mine for that matter. I can only say that to the degree that the whole world of science and technology around us is important, communication theory is important, for it is an important part of that world. To the degree to which an intelligent reader wants to know something both about that world and about communication theory, it is worth his while trying to get a clear picture.
Such a picture must show communication theory neither as something utterly alien and unintelligible nor as something that can be epitomized in a few easy words and appreciated without effort.

The process of writing this book was not easy. Of course it could never have been written at all but for the work of Claude Shannon, who, besides inspiring the book through his work, read the manuscript and suggested several valuable changes. David Slepian jolted me out of the rut of error and confusion in an even more vigorous way. E. N. Gilbert deflected me from error in several instances. Milton Babbitt reassured me concerning the major contents of the chapter on information theory and art and suggested a few changes. P. D. Bricker, H. M. Jenkins, and R. N. Shepard advised me in the field of psychology, but the views I finally expressed should not be attributed to them. M. V. Mathews provided the computer program in Chapter XI. Benoit Mandelbrot helped me with Chapter XII. J. P. Runyon read the manuscript with care, and Eric Wolman uncovered an appalling number of textual errors as well as making valuable suggestions. The reader is indebted to James R. Newman for the fact that I have provided a glossary, summaries at the ends of some chapters, and for my final attempts to make some difficult points a little clearer. To all of these I am indebted and not less to Miss F. M. Costello, who triumphed over the chaos of preparing and correcting the manuscript and figures.

J. R. PIERCE

CHAPTER I
The World and Theories

IN 1948, CLAUDE E. SHANNON published a paper called "A Mathematical Theory of Communication"; it appeared in book form in 1949. Before that time, a few isolated workers had from time to time taken steps toward a general theory of communication. Now, twelve years later, communication theory, or information theory as it is sometimes called, is an accepted field of research.
Several books on communication theory have been published, and several international symposia and conferences have been held. The Institute of Radio Engineers has a professional group on information theory, whose learned Transactions appears quarterly, and the journal Information and Control is largely devoted to communication theory.

All of us use the words communication and information, and we are unlikely to underestimate their importance. A modern philosopher, A. J. Ayer, has commented on the wide meaning and importance of communication in our lives. We communicate, he observes, not only information, but also knowledge, error, opinions, ideas, experiences, wishes, orders, emotions, feelings, moods. Heat and motion can be communicated. So can strength and weakness and disease. He cites other examples and comments on the manifold manifestations and puzzling features of communication in man's world.

Surely, communication being so various and so important, a theory of communication, a theory of generally accepted soundness and usefulness, must be of incomparable importance to all of us. When we add to theory the word mathematical with all its implications of rigor and magic, the attraction becomes almost irresistible. Perhaps if we learn a few formulae our problems of communication will be solved, and we shall become the masters of information rather than the slaves of misinformation.

Unhappily, this is not the course of science. Some 2,300 years ago, another philosopher, Aristotle, discussed in his Physics a notion as universal as that of communication, that is, motion. Aristotle defined motion as the fulfillment, insofar as it exists potentially, of that which exists potentially. He included in the concept of motion the increase and decrease of that which can be increased or decreased, coming to and passing away, and also being built. He spoke of three categories of motion, with respect to magnitude, affection, and place.
He found, indeed, as he said, as many types of motion as there are meanings of the word is.

Here we see motion in all its manifest complexity. The complexity is perhaps a little bewildering to us, for the associations of words differ in different languages, and we would not necessarily associate motion with all the changes of which Aristotle speaks.

How puzzling this universal matter of motion must have been to the followers of Aristotle. It remained puzzling for over two millennia, until Newton enunciated the laws which engineers still use in designing machines and astronomers in studying the motions of stars, planets, and satellites. While later physicists have found that Newton's laws are only the special forms which more general laws assume when velocities are small compared with that of light and when the scale of the phenomena is large compared with the atom, they are a living part of our physics rather than a historical monument.

Surely, when motion is so important a part of our world, we should study Newton's laws of motion. They say:

1. A body continues at rest or in motion with a constant velocity in a straight line unless acted upon by a force.
2. The change in velocity of a body is in the direction of the force acting on it, and the magnitude of the change is proportional to the force acting on the body times the time during which the force acts, and is inversely proportional to the mass of the body.
3. Whenever a first body exerts a force on a second body, the second body exerts an equal and oppositely directed force on the first body.

To these laws Newton added the universal law of gravitation:

4. Two particles of matter attract one another with a force acting along the line connecting them, a force which is proportional to the product of the masses of the particles and inversely proportional to the square of the distance separating them.

Newton's laws brought about a scientific and a philosophical revolution.
Using them, Laplace reduced the solar system to an explicable machine. They have formed the basis of aviation and rocketry, as well as of astronomy. Yet, they do little to answer many of the questions about motion which Aristotle considered. Newton's laws solved the problem of motion as Newton defined it, not of motion in all the senses in which the word could be used in the Greek of the fourth century before our Lord or in the English of the twentieth century after.

Our speech is adapted to our daily needs or, perhaps, to the needs of our ancestors. We cannot have a separate word for every distinct object and for every distinct event; if we did we should be forever coining words, and communication would be impossible. In order to have language at all, many things or many events must be referred to by one word. It is natural to say that both men and horses run (though we may prefer to say that horses gallop) and convenient to say that a motor runs and to speak of a run in a stocking or a run on a bank.

The unity among these concepts lies far more in our human language than in any physical similarity with which we can expect science to deal easily and exactly. It would be foolish to seek some elegant, simple, and useful scientific theory of running which would embrace runs of salmon and runs in hose. It would be equally foolish to try to embrace in one theory all the motions discussed by Aristotle or all the sorts of communication and information which later philosophers have discovered.

In our everyday language, we use words in a way which is convenient in our everyday business. Except in the study of language itself, science does not seek understanding by studying words and their relations. Rather, science looks for things in nature, including our human nature and activities, which can be grouped together and understood.
Such understanding is an ability to see what complicated or diverse events really do have in common (the planets in the heavens and the motions of a whirling skater on ice, for instance) and to describe the behavior accurately and simply.

The words used in such scientific descriptions are often drawn from our everyday vocabulary. Newton used force, mass, velocity, and attraction. When used in science, however, a particular meaning is given to such words, a meaning narrow and often new. We cannot discuss in Newton's terms force of circumstance, mass media, or the attraction of Brigitte Bardot. Neither should we expect that communication theory will have something sensible to say about every question we can phrase using the words communication or information.

A valid scientific theory seldom if ever offers the solution to the pressing problems which we repeatedly state. It seldom supplies a sensible answer to our multitudinous questions. Rather than rationalizing our ideas, it discards them entirely, or, rather, it leaves them as they were. It tells us in a fresh and new way what aspects of our experience can profitably be related and simply understood. In this book, it will be our endeavor to seek out the ideas concerning communication which can be so related and understood.

When the portions of our experience which can be related have been singled out, and when they have been related and understood, we have a theory concerning these matters. Newton's laws of motion form an important part of theoretical physics, a field called mechanics. The laws themselves are not the whole of the theory; they are merely the basis of it, as the axioms or postulates of geometry are the basis of geometry. The theory embraces both the assumptions themselves and the mathematical working out of the logical consequences which must necessarily follow from the assumptions.
Of course, these consequences must be in accord with the complex phenomena of the world about us if the theory is to be a valid theory, and an invalid theory is useless.

The ideas and assumptions of a theory determine the generality of the theory, that is, to how wide a range of phenomena the theory applies. Thus, Newton's laws of motion and of gravitation are very general; they explain the motion of the planets, the timekeeping properties of a pendulum, and the behavior of all sorts of machines and mechanisms. They do not, however, explain radio waves.

Maxwell's equations¹ explain all (non-quantum) electrical phenomena; they are very general. A branch of electrical theory called network theory deals with the electrical properties of electrical circuits, or networks, made by interconnecting three sorts of idealized electrical structures: resistors (devices such as coils of thin, poorly conducting wire or films of metal or carbon, which impede the flow of current), inductors (coils of copper wire, sometimes wound on magnetic cores), and capacitors (thin sheets of metal separated by an insulator or dielectric such as mica or plastic; the Leyden jar was an early form of capacitor). Because network theory deals only with the electrical behavior of certain specialized and idealized physical structures, while Maxwell's equations describe the electrical behavior of any physical structure, a physicist would say that network theory is less general than are Maxwell's equations, for Maxwell's equations cover the behavior not only of idealized electrical networks but of all physical structures and include the behavior of radio waves, which lies outside of the scope of network theory.

Certainly, the most general theory, which explains the greatest range of phenomena, is the most powerful and the best; it can always be specialized to deal with simple cases.
That is why physicists have sought a unified field theory to embrace mechanical laws and gravitation and all electrical phenomena. It might, indeed, seem that all theories could be ranked in order of generality, and, if this is possible, we should certainly like to know the place of communication theory in such a hierarchy.

¹ In 1873, in his treatise Electricity and Magnetism, James Clerk Maxwell presented and fully explained for the first time the natural laws relating electric and magnetic fields and electric currents. He showed that there should be electromagnetic waves (radio waves) which travel with the speed of light. Hertz later demonstrated these experimentally, and we now know that light is electromagnetic waves. Maxwell's equations are the mathematical statement of Maxwell's theory of electricity and magnetism. They are the foundation of all electric art.

Unfortunately, life isn't as simple as this. In one sense, network theory is less general than Maxwell's equations. In another sense, however, it is more general, for all the mathematical results of network theory hold for vibrating mechanical systems made up of idealized mechanical components as well as for the behavior of interconnections of idealized electrical components. In mechanical applications, a spring corresponds to a capacitor, a mass to an inductor, and a dashpot or damper, such as that used in a door closer to keep the door from slamming, corresponds to a resistor. In fact, network theory might have been developed to explain the behavior of mechanical systems, and it is so used in the field of acoustics. The fact that network theory evolved from the study of idealized electrical systems rather than from the study of idealized mechanical systems is a matter of history, not of necessity.
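
The correspondence between spring and capacitor, mass and inductor, and dashpot and resistor can be illustrated numerically. The following is only a minimal sketch under stated assumptions that are not in the text: it assumes a single mass on a spring with a dashpot, compared with a series resistor-inductor-capacitor circuit, and it assumes the capacitance corresponds to the reciprocal of the spring stiffness. All function names are hypothetical and chosen for illustration.

```python
import math

def mechanical_natural_frequency(mass, stiffness):
    """Undamped natural frequency (rad/s) of a mass on a spring."""
    return math.sqrt(stiffness / mass)

def mechanical_damping_ratio(mass, stiffness, damping):
    """Damping ratio of a mass-spring-dashpot system."""
    return damping / (2.0 * math.sqrt(stiffness * mass))

def electrical_natural_frequency(inductance, capacitance):
    """Undamped natural frequency (rad/s) of a series LC circuit."""
    return 1.0 / math.sqrt(inductance * capacitance)

def electrical_damping_ratio(inductance, capacitance, resistance):
    """Damping ratio of a series RLC circuit."""
    return (resistance / 2.0) * math.sqrt(capacitance / inductance)

def to_circuit(mass, stiffness, damping):
    """Assumed mapping: mass -> inductor, spring -> capacitor
    (capacitance = 1 / stiffness), dashpot -> resistor."""
    return mass, 1.0 / stiffness, damping  # (L, C, R)

if __name__ == "__main__":
    m, k, c = 2.0, 50.0, 3.0          # illustrative values only
    L, C, R = to_circuit(m, k, c)
    print(mechanical_natural_frequency(m, k), electrical_natural_frequency(L, C))
    print(mechanical_damping_ratio(m, k, c), electrical_damping_ratio(L, C, R))
```

Under this assumed mapping the two systems share the same natural frequency and the same damping ratio, which is one small instance of the same mathematics serving both.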
Because all of the mathematical results of network theory apply to certain specialized and idealized mechanical systems, as well as to certain specialized and idealized electrical systems, we can say that in a sense network theory is more general than Maxwell's equations, which do not apply to mechanical systems at all. In another sense, of course, Maxwell's equations are more general than network theory, for Maxwell's equations apply to all electrical systems, not merely to a specialized and idealized class of electrical circuits.

To some degree we must simply admit that this is so, without being able to explain the fact fully. Yet, we can say this much. Some theories are very strongly physical theories. Newton's laws and Maxwell's equations are such theories. Newton's laws deal with mechanical phenomena; Maxwell's equations deal with electrical phenomena. Network theory is essentially a mathematical theory. The terms used in it can be given various physical meanings. The theory has interesting things to say about different physical phenomena, about mechanical as well as electrical vibrations.

Often a mathematical theory is the offshoot of a physical theory or of physical theories. It can be an elegant mathematical formulation and treatment of certain aspects of a general physical theory. Network theory is such a treatment of certain physical behavior common to electrical and mechanical devices. A branch of mathematics called potential theory treats problems common to electric, magnetic, and gravitational fields and, indeed, in a degree to aerodynamics. Some theories seem, however, to be more mathematical than physical in their very inception.

We use many such mathematical theories in dealing with the physical world. Arithmetic is one of these.
If we label one of a group of apples, dogs, or men 1, another 2, and so on, and if we have used up just the first 16 numbers when we have labeled all members of the group, we feel confident that the group of objects can be divided into two equal groups each containing 8 objects (16 ÷ 2 = 8) or that the objects can be arranged in a square array of four parallel rows of four objects each (because 16 is a perfect square; 16 = 4 × 4). Further, if we line the apples, dogs, or men up in a row, there are 20,922,789,888,000 possible sequences in which they can be arranged, corresponding to the 20,922,789,888,000 different sequences of the integers 1 through 16. If we used up 13 rather than 16 numbers in labeling the complete collection of objects, we feel equally certain that the collection could not be divided into any number of equal heaps, because 13 is a prime number and cannot be expressed as a product of smaller factors.

This seems not to depend at all on the nature of the objects. Insofar as we can assign numbers to the members of any collection of objects, the results we get by adding, subtracting, multiplying, and dividing numbers or by arranging the numbers in sequence hold true. The connection between numbers and collections of objects seems so natural to us that we may overlook the fact that arithmetic is itself a mathematical theory which can be applied to nature only to the degree that the properties of numbers correspond to properties of the physical world.

Physicists tell us that we can talk sense about the total number of a group of elementary particles, such as electrons, but we can't assign particular numbers to particular particles because the particles are in a very real sense indistinguishable. Thus, we can't talk about arranging such particles in different orders, as numbers can be arranged in different sequences. This has important consequences in a part of physics called statistical mechanics.
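
The arithmetic facts cited above for collections of 16 and 13 labeled objects can be checked mechanically. The short sketch below is only an illustration; the helper name `is_prime` is hypothetical, and the number of orderings of 16 objects is computed as the factorial of 16.

```python
import math

def is_prime(n):
    """Trial division; adequate for small integers such as 13 or 16."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))

group = 16
halves = group // 2                 # two equal groups of 8
side = math.isqrt(group)            # a 4-by-4 square array
orderings = math.factorial(group)   # 16! = 20,922,789,888,000 sequences

if __name__ == "__main__":
    print(halves, side * side == group, f"{orderings:,}")
    print(is_prime(13), is_prime(16))
```

Nothing in the check depends on whether the objects are apples, dogs, or men, which is the point the passage makes about arithmetic.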
We may also note that while Euclidean geometry is a mathematical theory which serves surveyors and navigators admirably in their practical concerns, there is reason to believe that Euclidean geometry is not quite accurate in describing astronomical phenomena.

How can we describe or classify theories? We can say that a theory is very narrow or very general in its scope. We can also distinguish theories as to whether they are strongly physical or strongly mathematical. Theories are strongly physical when they describe very completely some range of physical phenomena, which in practice is always limited. Theories become more mathematical or abstract when they deal with an idealized class of phenomena or with only certain aspects of phenomena. Newton's laws are strongly physical in that they afford a complete description of mechanical phenomena such as the motions of the planets or the behavior of a pendulum. Network theory is more toward the mathematical or abstract side in that it is useful in dealing with a variety of idealized physical phenomena. Arithmetic is very mathematical and abstract; it is equally at home with one particular property of many sorts of physical entities, with numbers of dogs, numbers of men, and (if we remember that electrons are indistinguishable) with numbers of electrons. It is even useful in reckoning numbers of days.

In these terms, communication theory is both very strongly mathematical and quite general. Although communication theory grew out of the study of electrical communication, it attacks problems in a very abstract and general way. It provides, in the bit, a universal measure of amount of information in terms of choice or uncertainty. Specifying or learning the choice between two equally probable alternatives, which might be messages or numbers to be transmitted, involves one bit of information.
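
As a small illustration of the bit as a measure of choice, the sketch below computes the information involved in a choice among equally probable alternatives. The passage above states only the two-alternative case; the extension to n equally probable alternatives as the base-2 logarithm of n is the standard convention and is an assumption here, as is the hypothetical function name.

```python
import math

def bits_for_equal_choices(n_alternatives):
    """Information, in bits, in a choice among n equally probable alternatives.

    Assumes the standard convention log2(n); the text above states only
    that a choice between two such alternatives involves one bit.
    """
    if n_alternatives < 1:
        raise ValueError("need at least one alternative")
    return math.log2(n_alternatives)

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, bits_for_equal_choices(n))
```

On this convention, each doubling of the number of equally probable alternatives adds one bit.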
Communication theory tells us how many bits of information can be sent per second over perfect and imperfect communication channels in terms of rather abstract descriptions of the properties of these channels. Communication theory tells us how to measure the rate at which a message source, such as a speaker or a writer, generates information. Communication theory tells us how to represent, or encode, messages from a particular message source efficiently for transmission over a particular sort of channel, such as an electrical circuit, and it tells us when we can avoid errors in transmission.

Because communication theory discusses such matters in very general and abstract terms, it is sometimes difficult to use the understanding it gives us in connection with particular, practical problems. However, because communication theory has such an abstract and general mathematical form, it has a very broad field of application. Communication theory is useful in connection with written and spoken language, the electrical and mechanical transmission of messages, the behavior of machines, and, perhaps, the behavior of people. Some feel that it has great relevance and importance to physics in a way that we shall discuss much later in this book.

Primarily, however, communication theory is, as Shannon described it, a mathematical theory of communication. The concepts are formulated in mathematical terms, of which widely different physical examples can be given. Engineers, psychologists, and physicists may use communication theory, but it remains a mathematical theory rather than a physical or psychological theory or an engineering art.

It is not easy to present a mathematical theory to a general audience, yet communication theory is a mathematical theory, and to pretend that one can discuss it while avoiding mathematics entirely would be ridiculous.
Indeed, the reader may be startled to find equations and formulae in these pages; these state accurately ideas which are also described in words, and I have included an appendix on mathematical notation to help the nonmathematical reader who wants to read the equations aright.

I am aware, however, that mathematics calls up chiefly unpleasant pictures of multiplication, division, and perhaps square roots, as well as the possibly traumatic experiences of high-school classrooms. This view of mathematics is very misleading, for it places emphasis on special notation and on tricks of manipulation, rather than on the aspect of mathematics that is most important to mathematicians. Perhaps the reader has encountered theorems and proofs in geometry; perhaps he has not encountered them at all, yet theorems and proofs are of primary importance in all mathematics, pure and applied. The important results of information theory are stated in the form of mathematical theorems, and these are theorems only because it is possible to prove that they are true statements.

Mathematicians start out with certain assumptions and definitions, and then by means of mathematical arguments or proofs they are able to show that certain statements or theorems are true. This is what Shannon accomplished in his "Mathematical Theory of Communication." The truth of a theorem depends on the validity of the assumptions made and on the validity of the argument or proof which is used to establish it.

All of this is pretty abstract. The best way to give some idea of the meaning of theorem and proof is certainly by means of examples. I cannot do this by asking the general reader to grapple, one by one and in all their gory detail, with the difficult theorems of communication theory. Really to understand thoroughly the proofs of such theorems takes time and concentration even for one with some mathematical background.
At best, we can try to get at the content, meaning, and importance of the theorems.

The expedient I propose to resort to is to give some examples of simpler mathematical theorems and their proof. The first example concerns a game called hex, or Nash. The theorem which will be proved is that the player with first move can win.

Hex is played on a board which is an array of forty-nine hexagonal cells or spaces, as shown in Figure 1-1, into which markers may be put. One player uses black markers and tries to place them so as to form a continuous, if wandering, path between the black area at the left and the black area at the right. The other player uses white markers and tries to place them so as to form a continuous, if wandering, path between the white area at the top and the white area at the bottom. The players play alternately, each placing one marker per play. Of course, one player has to start first.

[Fig. 1-1]

In order to prove that the first player can win, it is necessary first to prove that when the game is played out, so that there is either a black or a white marker in each cell, one of the players must have won.

Theorem I: Either one player or the other wins.

Discussion: In playing some games, such as chess and ticktacktoe, it may be that neither player will win, that is, that the game will end in a draw. In matching heads or tails, one or the other necessarily wins. What one must show to prove this theorem is that, when each cell of the hex board is covered by either a black or a white marker, either there must be a black path between the black areas which will interrupt any possible white path between the white areas or there must be a white path between the white areas which will interrupt any possible black path between the black areas, so that either white or black must have won.

Proof: Assume that each hexagon has been filled in with either a black or a white marker.
Let us start from the left-hand corner of the upper white border, point I of Figure 1-2, and trace out the boundary between white and black hexagons or borders. We will proceed always along a side with black on our right and white on our left. The boundary so traced out will turn at the successive corners, or vertices, at which the sides of hexagons meet.

[Fig. 1-2]

At a corner, or vertex, we can have only two essentially different conditions. Either there will be two touching black hexagons on the right and one white hexagon on the left, as in a of Figure 1-3, or two touching white hexagons on the left and one black hexagon on the right, as shown in b of Figure 1-3. We note that in either case there will be a continuous black path to the right of the boundary and a continuous white path to the left of the boundary. We also note that in neither a nor b of Figure 1-3 can the boundary cross or join itself, because only one path through the vertex has black on the right and white on the left. We can see that these two facts are true for boundaries between the black and white borders and hexagons as well as for boundaries between black and white hexagons. Thus, along the left side of the boundary there must be a continuous path of white hexagons to the upper white border, and along the right side of the boundary there must be a continuous path of black hexagons to the left black border. As the boundary cannot cross itself, it cannot circle indefinitely, but must eventually reach a black border or a white border. If the boundary reaches a black border or white border with black on its right and white on its left, as we have prescribed, at any place except corner II or corner III, we can extend the boundary further with black on its right and white on its left. Hence, the boundary will reach either point II or point III.
If it reaches point II, as shown in Figure 1-2, the black hexagons on the right, which are connected to the left black border, will also be connected to the right black border, while the white hexagons to the left will be connected to the upper white border only, and black will have won. It is clearly impossible for white to have won also, for the continuous band of adjacent black cells from the left border to the right precludes a continuous band of white cells to the bottom border. We see by similar argument that, if the boundary reaches point III, white will have won.

[Fig. 1-3: (a), (b)]

Theorem II: The player with the first move can win.

Discussion: By "can" is meant that there exists a way, if only the player were wise enough to know it. The method for winning would consist of a particular first move (more than one might be allowable but are not necessary) and a chart, formula, or other specification or recipe giving a correct move following any possible move made by his opponent at any subsequent stage of the game, such that if, each time he plays, the first player makes the prescribed move, he will win regardless of what moves his opponent may make.

Proof: Either there must be some way of play which, if followed by the first player, will insure that he wins or else, no matter how the first player plays, the second player must be able to choose moves which will preclude the first player from winning, so that he, the second player, will win. Let us assume that the player with the second move does have a sure recipe for winning. Let the player with the first move make his first move in any way, and then, after his opponent has made one move, let the player with the first move apply the hypothetical recipe which is supposed to allow the player with the second move to win. If at any time a move calls for putting a piece on a hexagon occupied by a piece he has already played, let him place his piece instead on any unoccupied space.
The designated space will thus be occupied. The fact that by starting first he has an extra piece on the board may keep his opponent from occupying a particular hexagon but not the player with the extra piece. Hence, the first player can occupy the hexagons designated by the recipe and must win. This is contrary to the original assumption that the player with the second move can win, and so this assumption must be false. Instead, it must be possible for the player with the first move to win.

A mathematical purist would scarcely regard these proofs as rigorous in the form given. The proof of theorem II has another curious feature; it is not a constructive proof. That is, it does not show the player with the first move, who can win in principle, how to go about winning. We will come to an example of a constructive proof in a moment. First, however, it may be appropriate to philosophize a little concerning the nature of theorems and the need for proving them.

Mathematical theorems are inherent in the rigorous statement of the general problem or field. That the player with the first move can win at hex is necessarily so once the game and its rules of play have been specified. The theorems of Euclidean geometry are necessarily so because of the stated postulates.

With sufficient intelligence and insight, we could presumably see the truth of theorems immediately. The young Newton is said to have found Euclid's theorems obvious and to have been impatient with their proofs.

Ordinarily, while mathematicians may suspect or conjecture the truth of certain statements, they have to prove theorems in order to be certain. Newton himself came to see the importance of proof, and he proved many new theorems by using the methods of Euclid.

By and large, mathematicians have to proceed step by step in attaining sure knowledge of a problem. They laboriously prove one theorem after another, rather than seeing through everything in a flash.
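Theorem I on hex, proved earlier in this chapter, lends itself to a quick experimental check, though such a check is no substitute for the proof. The following Python sketch fills a 7 by 7 board at random many times and counts the fillings in which both players, or neither player, have a connecting path. The board representation, the neighbour offsets, and every function name here are assumptions made for this illustration; none of them come from the text.

```python
import random

N = 7  # 7 x 7 = 49 cells, as in the text

# Assumed layout: the board is stored as a rhombus of rows and columns,
# in which each hexagon touches these six neighbours (row, column offsets).
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]


def connects(board, colour):
    """Return True if `colour` has a chain of cells joining its two borders.

    'B' (black) must join the left column to the right column;
    'W' (white) must join the top row to the bottom row.
    """
    if colour == "B":
        start = [(r, 0) for r in range(N) if board[r][0] == "B"]
        at_goal = lambda r, c: c == N - 1
    else:
        start = [(0, c) for c in range(N) if board[0][c] == "W"]
        at_goal = lambda r, c: r == N - 1

    seen = set(start)
    stack = list(start)
    while stack:  # depth-first search through cells of one colour
        r, c = stack.pop()
        if at_goal(r, c):
            return True
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < N and 0 <= nc < N
                    and (nr, nc) not in seen
                    and board[nr][nc] == colour):
                seen.add((nr, nc))
                stack.append((nr, nc))
    return False


def random_full_board(rng):
    """Fill every cell with a black or a white marker at random."""
    return [[rng.choice("BW") for _ in range(N)] for _ in range(N)]


def count_exceptions(trials=2000, seed=0):
    """Count fillings where both players win or neither player wins."""
    rng = random.Random(seed)
    exceptions = 0
    for _ in range(trials):
        board = random_full_board(rng)
        if connects(board, "B") == connects(board, "W"):
            exceptions += 1
    return exceptions


print(count_exceptions())  # Theorem I says this count must be 0
```

The random fillings need not have equal numbers of black and white markers; the proof of Theorem I assumes only that every cell is filled, so the count should be zero for any number of trials.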
Too, they need to prove the theorems in order to convince others.

Sometimes a mathematician needs to prove a theorem to convince himself, for the theorem may seem contrary to common sense. Let us take the following problem as an example: Consider the square, 1 inch on a side, at the left of Figure 1-4. We can specify any point in the square by giving two numbers, y, the height of the point above the base of the square, and x, the distance of the point from the left-hand side of the square. Each of these numbers will be less than one. For instance, the point shown will be represented by

x = 0.547000 . . . (ending in an endless sequence of zeros)
y = 0.312000 . . . (ending in an endless sequence of zeros)

Suppose we pair up points on the square with points on the line, so that every point on the line is paired with just one point on the square and every point on the square with just one point on the line. If we do this, we are said to have mapped the square onto the line in a one-to-one way, or to have achieved a one-to-one mapping of the square onto the line.

[Fig. 1-4]

Theorem: It is possible to map a square of unit area onto a line of unit length in a one-to-one way.[2]

Proof: Take the successive digits of the height of the point in the square and let them form the first, third, fifth, and so on digits of a number x'. Take the digits of the distance of the point P from the left side of the square, and let these be the second, fourth, sixth, etc., of the digits of the number x'. Let x' be the distance of the point P' from the left-hand end of the line. Then the point P' maps the point P of the square onto the line uniquely, in a one-to-one way. We see that changing either x or y will change x' to a new and appropriate number, and changing x' will change x and y.
To each point x, y in the square corresponds just one point x' on the line, and to each point x' on the line corresponds just one point x, y in the square, the requirement for one-to-one mapping.[3]

In the case of the example given before

x = 0.547000 . . .
y = 0.312000 . . .
x' = 0.351427000 . . .

In the case of most points, including those specified by irrational numbers, the endless string of digits representing the point will not become a sequence of zeros nor will it ever repeat.

Here we have an example of a constructive proof. We show that we can map each point of a square into a point on a line segment in a one-to-one way by giving an explicit recipe for doing this. Many mathematicians prefer constructive proofs to proofs which are not constructive, and mathematicians of the intuitionist school reject nonconstructive proofs in dealing with infinite sets, in which it is impossible to examine all the members individually for the property in question.

[2] This has been restricted for convenience; the size doesn't matter.

[3] This proof runs into resolvable difficulties in the case of some numbers such as 1/2, which can be represented decimally .5 followed by an infinite sequence of zeros or .4 followed by an infinite sequence of nines.

Let us now consider another matter concerning the mapping of the points of a square on a line segment. Imagine that we move a pointer along the line, and imagine a pointer simultaneously moving over the face of the square so as to point out the points in the square corresponding to the points that the first pointer indicates on the line. We might imagine (contrary to what we shall prove) the following: If we moved the first pointer slowly and smoothly along the line, the second pointer would move slowly and smoothly over the face of the square. All the points lying in a small cluster on the line would be represented by points lying in a small cluster on the face of the square.
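The digit-interleaving recipe of the constructive proof above can be tried out directly. The following Python sketch works on strings of decimal digits (the part after the decimal point) so that no rounding occurs; the function names and the choice of finite digit strings padded with zeros are assumptions made for this illustration only, and the sketch ignores the difficulty with numbers such as 1/2 noted in the footnote.

```python
def interleave(x_digits, y_digits):
    """Build the digits of x' from the digits of x and y.

    The digits of y (the height) become the 1st, 3rd, 5th, ... digits
    of x'; the digits of x become the 2nd, 4th, 6th, ... digits.
    Shorter inputs are padded with zeros, which does not change their value.
    """
    n = max(len(x_digits), len(y_digits))
    x_digits = x_digits.ljust(n, "0")
    y_digits = y_digits.ljust(n, "0")
    return "".join(y + x for x, y in zip(x_digits, y_digits))


def deinterleave(xp_digits):
    """Recover the digits of x and y from the digits of x'."""
    if len(xp_digits) % 2:          # pad to an even number of digits
        xp_digits += "0"
    return xp_digits[1::2], xp_digits[0::2]   # (x, y)


# The example from the text: x = 0.547, y = 0.312 gives x' = 0.351427
print(interleave("547", "312"))      # 351427
print(deinterleave("351427"))        # ('547', '312')

# Two points on the line that lie very close together ...
print(deinterleave("19999999"))      # ('9999', '1999')
print(deinterleave("20000000"))      # ('0000', '2000')
```

In the last two lines the points on the line, 0.19999999 and 0.20000000, differ by only one part in a hundred million, yet the corresponding points in the square have x = 0.9999 and x = 0.0000, nearly the full width of the square apart.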
If we moved the pointer a short distance along the line, the other pointer would move a short distance over the face of the square, and if we moved the pointer a shorter distance along the line, the other pointer would move a shorter distance across the face of the square, and so on. If this were true we could say that the one-to-one mapping of the points of the square into points on the line was continuous.

However, it turns out that a one-to-one mapping of the points in a square into the points on a line cannot be continuous. As we move smoothly along a curve through the square, the points on the line which represent the successive points on the square necessarily jump around erratically, not only for the mapping described above but for any one-to-one mapping whatever. Any one-to-one mapping of the square onto the line is discontinuous.

Theorem: Any one-to-one mapping of a square onto a line must be discontinuous.

[Fig. 1-5: curve AB in the square; points C', A', B', D' on the line]

Proof: Assume that the one-to-one mapping is continuous. If this is to be so then all the points along some arbitrary curve AB of Figure 1-5 on the square must map into the points lying between the corresponding points A' and B'. If they did not, in moving along the curve in the square we would either jump from one end of the line to the other (discontinuous mapping) or pass through one point on the line twice (not one-to-one mapping). Let us now choose a point C' to the left of line segment A'B' and D' to the right of A'B' and locate the corresponding points C and D in the square. Draw a curve connecting C and D and crossing the curve from A to B. Where the curve crosses the curve AB it will have a point in common with AB; hence, this one point of CD must map into a point lying between A' and B', and all other points which are not on AB must map to points lying outside of A'B', either to the left or the right of A'B'.
This is contrary to our assumption that the mapping was continuous, and so the mapping cannot be continuous.

We shall find that these theorems, that the points of a square can be mapped onto a line and that the mapping is necessarily discontinuous, are both important in communication theory, so we have proved one theorem which, unlike those concerning hex, will be of some use to us.

Mathematics is a way of finding out, step by step, facts which are inherent in the statement of the problem but which are not immediately obvious. Usually, in applying mathematics one must first hit on the facts and then verify them by proof. Here we come upon a knotty problem, for the proofs which satisfied mathematicians of an earlier day do not satisfy modern mathematicians.

In our own day, an irascible minor mathematician who reviewed Shannon's original paper on communication theory expressed doubts as to whether or not the author's mathematical intentions were honorable. Shannon's theorems are true, however, and proofs have been given which satisfy even rigor-crazed mathematicians. The simple proofs which I have given above as illustrations of mathematics are open to criticism by purists.

What I have tried to do is to indicate the nature of mathematical reasoning, to give some idea of what a theorem is and of how it may be proved. With this in mind, we will go on to the mathematical theory of communication, its theorems, which we shall not really prove, and to some implications and associations which extend beyond anything that we can establish with mathematical certainty.

As I have indicated earlier in this chapter, communication theory as Shannon has given it to us deals in a very broad and abstract way with certain important problems of communication and information, but it cannot be applied to all problems which we can phrase using the words communication and information in their many popular senses.
Communication theory deals with certain aspects of communication which can be associated and organized in a useful and fruitful way, just as Newton's laws of motion deal with mechanical motion only, rather than with all the named and indeed different phenomena which Aristotle had in mind when he used the word motion.

To succeed, science must attempt the possible. We have no reason to believe that we can unify all the things and concepts for which we use a common word. Rather we must seek that part of experience which can be related. When we have succeeded in relating certain aspects of experience we have a theory. Newton's laws of motion are a theory which we can use in dealing with mechanical phenomena. Maxwell's equations are a theory which we can use in connection with electrical phenomena. Network theory we can use in connection with certain simple sorts of electrical or mechanical devices. We can use arithmetic very generally in connection with numbers of men, stones, or stars, and geometry in measuring land, sea, or galaxies.

Unlike Newton's laws of motion and Maxwell's equations, which are strongly physical in that they deal with certain classes of physical phenomena, communication theory is abstract in that it applies to many sorts of communication, written, acoustical, or electrical. Communication theory deals with certain important but abstract aspects of communication. Communication theory proceeds from clear and definite assumptions to theorems concerning information sources and communication channels. In this it is essentially mathematical, and in order to understand it we must understand the idea of a theorem as a statement which must be proved, that is, which must be shown to be the necessary consequence of a set of initial assumptions. This is an idea which is the very heart of mathematics as mathematicians understand it.

CHAPTER II

The Origins of Information Theory

MEN HAVE BEEN at odds concerning the value of history.
Some have studied earlier times in order to find a universal system of the world, in whose inevitable unfolding we can see the future as well as the past. Others have sought in the past prescriptions for success in the present. Thus, some believe that by studying scientific discovery in another day we can learn how to make discoveries. On the other hand, one sage observed that we learn nothing from history except that we never learn anything from history, and Henry Ford asserted that history is bunk.

All of this is as far beyond me as it is beyond the scope of this book. I will, however, maintain that we can learn at least two things from the history of science.

One of these is that many of the most general and powerful discoveries of science have arisen, not through the study of phenomena as they occur in nature, but, rather, through the study of phenomena in man-made devices, in products of technology, if you will. This is because the phenomena in man's machines are simplified and ordered in comparison with those occurring naturally, and it is these simplified phenomena that man understands most easily.

Thus, the existence of the steam engine, in which phenomena involving heat, pressure, vaporization, and condensation occur in a simple and orderly fashion, gave tremendous impetus to the very powerful and general science of thermodynamics. We see this especially in the work of Carnot.[1] Our knowledge of aerodynamics and hydrodynamics exists chiefly because airplanes and ships exist, not because of the existence of birds and fishes. Our knowledge of electricity came mainly not from the study of lightning, but from the study of man's artifacts.

Similarly, we shall find the roots of Shannon's broad and elegant theory of communication in the simplified and seemingly easily intelligible phenomena of telegraphy.

The second thing that history can teach us is with what difficulty understanding is won.
Today, Newton's laws of motion seem simple and almost inevitable, yet there was a day when they were undreamed of, a day when brilliant men had the oddest notions about motion. Even discoverers themselves sometimes seem incredibly dense as well as inexplicably wonderful. One might expect of Maxwell's treatise on electricity and magnetism a bold and simple pronouncement concerning the great step he had taken. Instead, it is cluttered with all sorts of such lesser matters as once seemed important, so that a naive reader might search long to find the novel step and to restate it in the simple manner familiar to us. It is true, however, that Maxwell stated his case clearly elsewhere.

Thus, a study of the origins of scientific ideas can help us to value understanding more highly for its having been so dearly won. We can often see men of an earlier day stumbling along the edge of discovery but unable to take the final step. Sometimes we are tempted to take it for them and to say, because they stated many of the required concepts in juxtaposition, that they must really have reached the general conclusion. This, alas, is the same trap into which many an ungrateful fellow falls in his own life. When someone actually solves a problem that he merely has had ideas about, he believes that he understood the matter all along.

Properly understood, then, the origins of an idea can help to show what its real content is; what the degree of understanding was before the idea came along and how unity and clarity have been attained. But to attain such understanding we must trace the actual course of discovery, not some course which we feel discovery should or could have taken, and we must see problems (if we can) as the men of the past saw them, not as we see them today.

[1] N. L. S. Carnot (1796-1832) first proposed an ideal expansion of gas (the Carnot cycle) which will extract the maximum possible mechanical energy from the thermal energy of the steam.

In looking for the origin of communication theory one is apt to fall into an almost trackless morass. I would gladly avoid this entirely but cannot, for others continually urge their readers to enter it. I only hope that they will emerge unharmed with the help of the following grudgingly given guidance.

A particular quantity called entropy is used in thermodynamics and in statistical mechanics. A quantity called entropy is used in communication theory. After all, thermodynamics and statistical mechanics are older than communication theory. Further, in a paper published in 1929, L. Szilard, a physicist, used an idea of information in resolving a particular physical paradox. From these facts we might conclude that communication theory somehow grew out of statistical mechanics.

This easy but misleading idea has caused a great deal of confusion even among technical men. Actually, communication theory evolved from an effort to solve certain problems in the field of electrical communication. Its entropy was called entropy by mathematical analogy with the entropy of statistical mechanics. The chief relevance of this entropy is to problems quite different from those which statistical mechanics attacks.

In thermodynamics, the entropy of a body of gas depends on its temperature, volume, and mass and on what gas it is, just as the energy of the body of gas does. If the gas is allowed to expand in a cylinder, pushing on a slowly moving piston, with no flow of heat to or from the gas, the gas will become cooler, losing some of its thermal energy. This energy appears as work done on the piston. The work may, for instance, lift a weight, which thus stores the energy lost by the gas.

This is a reversible process.
By this we mean that if work is done in pushing the piston slowly back against the gas and so recompressing it to its original volume, the exact original energy, pressure, and temperature will be restored to the gas. In such a reversible process, the entropy of the gas remains constant, while its energy changes.

Thus, entropy is an indicator of reversibility; when there is no change of entropy, the process is reversible. In the example discussed above, energy can be transferred repeatedly back and forth between thermal energy of the compressed gas and mechanical energy of a lifted weight.

Most physical phenomena are not reversible. Irreversible phenomena always involve an increase of entropy.

Imagine, for instance, that a cylinder which allows no heat flow in or out is divided into two parts by a partition, and suppose that there is gas on one side of the partition and none on the other. Imagine that the partition suddenly vanishes, so that the gas expands and fills the whole container. In this case, the thermal energy remains the same, but the entropy increases.

Before the partition vanished we could have obtained mechanical energy from the gas by letting it flow into the empty part of the cylinder through a little engine. After the removal of the partition and the subsequent increase in entropy, we cannot do this. The entropy can increase while the energy remains constant in other similar circumstances. For instance, this happens when heat flows from a hot object to a cold object. Before the temperatures were equalized, mechanical work could have been done by making use of the temperature difference. After the temperature difference has disappeared, we can no longer use it in changing part of the thermal energy into mechanical energy.

Thus, an increase in entropy means a decrease in our ability to change thermal energy, the energy of heat, into mechanical energy.
An increase of entropy means a decrease of available energy.

While thermodynamics gave us the concept of entropy, it does not give a detailed physical picture of entropy, in terms of positions and velocities of molecules, for instance. Statistical mechanics does give a detailed mechanical meaning to entropy in particular cases. In general, the meaning is that an increase in entropy means a decrease in order. But, when we ask what order means, we must in some way equate it with knowledge. Even a very complex arrangement of molecules can scarcely be disordered if we know the position and velocity of every one. Disorder in the sense in which it is used in statistical mechanics involves unpredictability based on a lack of knowledge of the positions and velocities of molecules. Ordinarily we lack such knowledge when the arrangement of positions and velocities is "complicated."

Let us return to the example discussed above in which all the molecules of a gas are initially on one side of a partition in a cylinder. If the molecules are all on one side of the partition, and if we know this, the entropy is less than if they are distributed on both sides of the partition. Certainly, we know more about the positions of the molecules when we know that they are all on one side of the partition than if we merely know that they are somewhere within the whole container. The more detailed our knowledge is concerning a physical system, the less uncertainty we have concerning it (concerning the location of the molecules, for instance) and the less the entropy is. Conversely, more uncertainty means more entropy.

Thus, in physics, entropy is associated with the possibility of converting thermal energy into mechanical energy. If the entropy does not change during a process, the process is reversible. If the entropy increases, the available energy decreases.
Statistical mechanics interprets an increase of entropy as a decrease in order or, if we wish, as a decrease in our knowledge.

The applications and details of entropy in physics are of course much broader than the examples I have given can illustrate, but I believe that I have indicated its nature and something of its importance. Let us now consider the quite different purpose and use of the entropy of communication theory.

In communication theory we consider a message source, such as a writer or a speaker, which may produce on a given occasion any one of many possible messages. The amount of information conveyed by the message increases as the amount of uncertainty as to what message actually will be produced becomes greater. A message which is one out of ten possible messages conveys a smaller amount of information than a message which is one out of a million possible messages. The entropy of communication theory is a measure of this uncertainty and the uncertainty, or entropy, is taken as the measure of the amount of information conveyed by a message from a source. The more we know about what message the source will produce, the less uncertainty, the less the entropy, and the less the information.

We see that the ideas which gave rise to the entropy of physics and the entropy of communication theory are quite different. One can be fully useful without any reference at all to the other. Nonetheless, both the entropy of statistical mechanics and that of communication theory can be described in terms of uncertainty, in similar mathematical terms. Can some significant and useful relation be established between the two different entropies and, indeed, between physics and the mathematical theory of communication?

Several physicists and mathematicians have been anxious to show that communication theory and its entropy are extremely important in connection with statistical mechanics. This is still a confused and confusing matter.
The confusion is sometimes aggravated when more than one meaning of information creeps into a discussion. Thus, information is sometimes associated with the idea of knowledge through its popular use rather than with uncertainty and the resolution of uncertainty, as it is in communication theory.

We will consider the relation between communication theory and physics in Chapter X, after arriving at some understanding of communication theory. Here I will merely say that the efforts to marry communication theory and physics have been more interesting than fruitful. Certainly, such attempts have not produced important new results or understanding, as communication theory has in its own right.

Communication theory has its origins in the study of electrical communication, not in statistical mechanics, and some of the ideas important to communication theory go back to the very origins of electrical communication.

During a transatlantic voyage in 1832, Samuel F. B. Morse set to work on the first widely successful form of electrical telegraph. As Morse first worked it out, his telegraph was much more complicated than the one we know. It actually drew short and long lines on a strip of paper, and sequences of these represented, not the letters of a word, but numbers assigned to words in a dictionary or code book which Morse completed in 1837. This is (as we shall see) an efficient form of coding, but it is clumsy.

While Morse was working with Alfred Vail, the old coding was given up, and what we now know as the Morse code had been devised by 1838. In this code, letters of the alphabet are represented by spaces, dots, and dashes. The space is the absence of an electric current, the dot is an electric current of short duration, and the dash is an electric current of longer duration.

Various combinations of dots and dashes were cleverly assigned to the letters of the alphabet.
E, the letter occurring most frequently in English text, was represented by the shortest possible code symbol, a single dot, and, in general, short combinations of dots and dashes were used for frequently used letters and long combinations for rarely used letters. Strangely enough, the choice was not guided by tables of the relative frequencies of various letters in English text nor were letters in text counted to get such data. Relative frequencies of occurrence of various letters were estimated by counting the number of types in the various compartments of a printer's type box!

We can ask, would some other assignment of dots, dashes, and spaces to letters than that used by Morse enable us to send English text faster by telegraph? Our modern theory tells us that we could only gain about 15 per cent in speed. Morse was very successful indeed in achieving his end, and he had the end clearly in mind. The lesson provided by Morse's code is that it matters profoundly how one translates a message into electrical signals. This matter is at the very heart of communication theory.

In 1843, Congress passed a bill appropriating money for the construction of a telegraph circuit between Washington and Baltimore. Morse started to lay the wire underground, but ran into difficulties which later plagued submarine cables even more severely. He solved his immediate problem by stringing the wire on poles.

The difficulty which Morse encountered with his underground wire remained an important problem. Different circuits which conduct a steady electric current equally well are not necessarily equally suited to electrical communication. If one sends dots and dashes too fast over an underground or undersea circuit, they are run together at the receiving end. As indicated in Figure II-1, when we send a short burst of current which turns abruptly on and off, we receive at the far end of the circuit a longer, smoothed-out rise and fall of current.
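The smoothing of a sharp pulse described above can be imitated with a very simple numerical model. The Python sketch below passes an abrupt on-off burst through a first-order smoothing filter, in which the received current moves a fixed fraction of the way toward the sent current at each time step. This filter, the value of `alpha`, and all names are assumptions chosen only for illustration; they are not a model of any actual cable.

```python
def smooth(sent, alpha=0.2):
    """Return the smoothed version of the sent current.

    At each time step the received level moves a fraction `alpha`
    of the way toward the value being sent (an assumed, simplified
    stand-in for the sluggish response of a long cable).
    """
    received = []
    level = 0.0
    for s in sent:
        level += alpha * (s - level)
        received.append(level)
    return received


# A short burst: no current for 5 steps, current on for 5, then off for 15.
sent = [0] * 5 + [1] * 5 + [0] * 15
received = smooth(sent)

for step, (s, r) in enumerate(zip(sent, received)):
    bar = "#" * round(20 * r)          # crude picture of the received current
    print(f"{step:2d}  sent={s}  received={r:.3f}  {bar}")
```

In this sketch the received current never reaches the full sent value during the burst, and it is still well above zero several steps after the burst has ended, so it spills into the time reserved for whatever symbol is sent next.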
This longer flow of current may overlap the current of another symbol sent, for instance, as an absence of current. Thus, as shown in Figure II-2, when a clear and distinct signal is transmitted it may be received as a vaguely wandering rise and fall of current which is difficult to interpret.

[Fig. II-1: sent and received current versus time]

Of course, if we make our dots, spaces, and dashes long enough, the current at the far end will follow the current at the sending end better, but this slows the rate of transmission. It is clear that there is somehow associated with a given transmission circuit a limiting speed of transmission for dots and spaces. For submarine cables this speed is so slow as to trouble telegraphers; for wires on poles it is so fast as not to bother telegraphers. Early telegraphists were aware of this limitation, and it, too, lies at the heart of communication theory.

[Fig. II-2: sent and received signals]

Even in the face of this limitation on speed, various things can be done to increase the number of letters which can be sent over a given circuit in a given period of time. A dash takes three times as long to send as a dot. It was soon appreciated that one could gain by means of double-current telegraphy. We can understand this by imagining that at the receiving end a galvanometer, a device which detects and indicates the direction of flow of small currents, is connected between the telegraph wire and the ground. To indicate a dot, the sender connects the positive terminal of his battery to the wire and the negative terminal to ground, and the needle of the galvanometer moves to the right. To send a dash, the sender connects the negative terminal of his battery to the wire and the positive terminal to the ground, and the needle of the galvanometer moves to the left.
We say that an electric current in one direction (into the wire) represents a dot and an electric current in the other direction (out of the wire) represents a dash. No current at all (battery disconnected) represents a space. In actual double-current telegraphy, a different sort of receiving instrument is used.

In single-current telegraphy we have two elements out of which to construct our code: current and no current, which we might call 1 and 0. In double-current telegraphy we really have three elements, which we might characterize as forward current, or current into the wire; no current; backward current, or current out of the wire; or as +1, 0, -1. Here the + or - sign indicates the direction of current flow and the number 1 gives the magnitude or strength of the current, which in this case is equal for current flow in either direction.

In 1874, Thomas Edison went further; in his quadruplex telegraph system he used two intensities of current as well as two directions of current. He used changes in intensity, regardless of changes in direction of current flow, to send one message, and changes of direction of current flow, regardless of changes in intensity, to send another message. If we assume the currents to differ equally one from the next, we might represent the four different conditions of current flow by means of which the two messages are conveyed over the one circuit simultaneously as +3, +1, -1, -3. The interpretation of these at the receiving end is shown in Table I.

TABLE I

Current            Meaning
Transmitted    Message 1    Message 2
    +3            on           on
    +1            off          on
    -1            off          off
    -3            on           off

Figure II-3 shows how the dots, dashes, and spaces of two simultaneous, independent messages can be represented by a succession of the four different current values.
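The scheme of Table I can be written out as a small encoder and decoder. In the Python sketch below, each message is a sequence of on/off values (True for on), the sender combines one value from each message into a single current according to Table I, and the receiver recovers message 1 from the intensity of the current and message 2 from its direction. The function and variable names are assumptions made for this illustration.

```python
# Table I: (message 1 on?, message 2 on?) -> current transmitted
TABLE_I = {
    (True, True): +3,
    (False, True): +1,
    (False, False): -1,
    (True, False): -3,
}


def encode(message_1, message_2):
    """Combine two on/off messages into one sequence of currents."""
    return [TABLE_I[(m1, m2)] for m1, m2 in zip(message_1, message_2)]


def decode(currents):
    """Separate the two messages again at the receiving end.

    Message 1 is read from the intensity (magnitude 3 means on);
    message 2 is read from the direction (positive means on).
    """
    message_1 = [abs(c) == 3 for c in currents]
    message_2 = [c > 0 for c in currents]
    return message_1, message_2


message_1 = [True, False, False, True]
message_2 = [True, True, False, False]

currents = encode(message_1, message_2)
print(currents)                                     # [3, 1, -1, -3]
print(decode(currents) == (message_1, message_2))   # True
```

Each current value sent carries one on/off choice for each of the two messages, so the two messages share the circuit without slowing one another.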
Clearly, how much information it is possible to send over a circuit depends not only on how fast one can send successive symbols (successive current values) over the circuit but also on how many different symbols (different current values) one has available to choose among. If we have as symbols only the two currents +1 or 0 or, which is just as effective, the two currents +1 and -1, we can convey to the receiver only one of two possibilities at a time. We have seen above, however, that if we can choose among any one of four current values (any one of four symbols) at a time, such as +3 or +1 or -1 or -3, we can convey by means of these current values (symbols) two independent pieces of information: whether we mean a 0 or 1 in message 1 and whether we mean a 0 or 1 in message 2. Thus, for a given rate of sending successive symbols, the use of four current values allows us to send two independent messages, each as fast as two current values allow us to send one message. We can send twice as many letters per minute by using four current values as we could using two current values.

[Fig. II-3: on and off states of the messages and the corresponding current, ranging from +3 to -3]

The use of multiplicity of symbols can lead to difficulties. We have noted that dots and dashes sent over a long submarine cable tend to spread out and overlap. Thus, when we look for one symbol at the far end we see, as Figure II-2 illustrates, a little of several others. Under these circumstances, a simple identification, as 1 or 0 or else +1 or -1, is easier and more certain than a more complicated identification, as among +3, +1, -1, -3.

Further, other matters limit our ability to make complicated distinctions. During magnetic storms, extraneous signals appear on telegraph lines and submarine cables.² And if we look closely enough, as we can today with sensitive electronic amplifiers, we see that minute, undesired currents are always present.
These are akin to the erratic Brownian motion of tiny particles observed under a microscope and to the agitation of air molecules and of all other matter which we associate with the idea of heat and temperature. Extraneous currents, which we call noise, are always present to interfere with the signals sent.

Thus, even if we avoid the overlapping of dots and spaces which is called intersymbol interference, noise tends to distort the received signal and to make difficult a distinction among many alternative symbols. Of course, increasing the current transmitted, which means increasing the power of the transmitted signal, helps to overcome the effect of noise. There are limits on the power that can be used, however. Driving a large current through a submarine cable takes a large voltage, and a large enough voltage can destroy the insulation of the cable; it can in fact cause a short circuit. It is likely that the large transmitting voltage used caused the failure of the first transatlantic telegraph cable in 1858.

² The changing magnetic field of the earth induces currents in the cables. The changes in the earth's magnetic field are presumably caused by streams of charged particles due to solar storms.

Even the early telegraphists understood intuitively a good deal about the limitations associated with speed of signaling, interference, or noise, the difficulty in distinguishing among many alternative values of current, and the limitation on the power that one could use. More than an intuitive understanding was required, however. An exact mathematical analysis of such problems was needed.

Mathematics was early applied to such problems, though their complete elucidation has come only in recent years. In 1855, William Thomson, later Lord Kelvin, calculated precisely what the received current will be when a dot or space is transmitted over a submarine cable.
A more powerful attack on such problems followed the invention of the telephone by Alexander Graham Bell in 1875. Telephony makes use, not of the slowly sent off-on signals of telegraphy, but rather of currents whose strength varies smoothly and subtly over a wide range of amplitudes with a rapidity several hundred times as great as encountered in manual telegraphy.

Many men helped to establish an adequate mathematical treatment of the phenomena of telephony: Henri Poincaré, the great French mathematician; Oliver Heaviside, an eccentric, English, minor genius; Michael Pupin, of From Immigrant to Inventor fame; and G. A. Campbell, of the American Telephone and Telegraph Company, are prominent among these.

The mathematical methods which these men used were an extension of work which the French mathematician and physicist, Joseph Fourier, had done early in the nineteenth century in connection with the flow of heat. This work had been applied to the study of vibration and was a natural tool for the analysis of the behavior of electric currents which change with time in a complicated fashion, as the electric currents of telephony and telegraphy do.

It is impossible to proceed further on our way without understanding something of Fourier's contribution, a contribution which is absolutely essential to all communication and communication theory. Fortunately, the basic ideas are simple; it is their proof and the intricacies of their application which we shall have to omit here.

Fourier based his mathematical attack on some of the problems of heat flow on a very particular mathematical function called a sine wave. Part of a sine wave is shown at the right of Figure II-4. The height of the wave h varies smoothly up and down as time passes, fluctuating so forever and ever. A sine wave has no beginning or end. A sine wave is not just any smoothly wiggling curve.
The height of the wave (it may represent the strength of a current or voltage) varies in a particular way with time. We can describe this variation in terms of the motion of a crank connected to a shaft which revolves at a constant speed, as shown at the left of Figure II-4. The height h of the crank above the axle varies exactly sinusoidally with time.

[Fig. II-4: a crank on a revolving shaft (left) and the sine wave it traces against time (right)]

A sine wave is a rather simple sort of variation with time. It can be characterized, or described, or differentiated completely from any other sine wave by means of just three quantities. One of these is the maximum height above zero, called the amplitude. Another is the time at which the maximum is reached, which is specified as the phase. The third is the time T between maxima, called the period. Usually, we use instead of the period the reciprocal of the period, called the frequency, denoted by the letter f. If the period T of a sine wave is 1/100 second, the frequency f is 100 cycles per second, abbreviated cps. A cycle is a complete variation from crest, through trough, and back to crest again. The sine wave is periodic in that one variation from crest through trough to crest again is just like any other.

Fourier succeeded in proving a theorem concerning sine waves which astonished his, at first, incredulous contemporaries. He showed that any variation of a quantity with time can be accurately represented as the sum of a number of sinusoidal variations of different amplitudes, phases, and frequencies. The quantity concerned might be the displacement of a vibrating string, the height of the surface of a rough ocean, the temperature of an electric iron, or the current or voltage in a telephone or telegraph wire. All are amenable to Fourier's analysis. Figure II-5 illustrates this in a simple case. The height of the periodic curve a above the centerline is the sum of the heights of the sinusoidal curves b and c.
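A short Python sketch may make the three quantities, and the adding of sine waves, concrete. The particular amplitudes and frequencies below are assumptions chosen for illustration and are not taken from Figure II-5; the phase is expressed here as an angle in radians rather than as a time of maximum, and the names `sine_wave` and `curve_a` are hypothetical.

```python
import math

def sine_wave(t, amplitude, frequency, phase=0.0):
    """Height at time t (seconds) of a sine wave.

    frequency is in cycles per second; phase is an angle in radians
    (an assumption of this sketch; the text describes phase as a time).
    """
    return amplitude * math.sin(2 * math.pi * frequency * t + phase)

def curve_a(t):
    """A periodic curve built as the sum of two sinusoidal curves b and c."""
    b = sine_wave(t, amplitude=1.0, frequency=100.0)   # period 1/100 second
    c = sine_wave(t, amplitude=0.5, frequency=200.0)   # period 1/200 second
    return b + c

if __name__ == "__main__":
    period = 1 / 100.0  # the sum repeats with the longer of the two periods
    for k in range(5):
        t = k * 0.002
        print(f"t = {t:.3f} s   a(t) = {curve_a(t):+.4f}   "
              f"a(t + T) = {curve_a(t + period):+.4f}")
```

The two printed columns agree, which is simply the statement that the summed curve is periodic, one cycle being just like any other.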
The mere representation of a complicated variation of some physical quantity with time as a sum of a number of simple sinusoidal variations might seem a mere mathematician's trick. Its utility depends on two important physical facts. The circuits used in the transmission of electrical signals do not change with time, and they behave in what is called a linear fashion. Suppose, for instance, we send one signal, which we will call an input signal, over the line and draw a curve showing how the amplitude of the received signal varies with time. Suppose we send a second input signal and draw a curve showing how the corresponding received signal varies with time. Suppose we now send the sum of the two input signals, that is, a signal whose current is at every moment the simple sum of the currents of the two separate input signals. Then, the received output signal will be merely the sum of the two output signals corresponding to the input signals sent separately.

[Fig. II-5: a periodic curve (a) shown as the sum of two sinusoidal curves (b) and (c)]

We can easily appreciate the fact that communication circuits don't change significantly with time. Linearity means simply that if we know the output signals corresponding to any number of input signals sent separately, we can calculate the output signal when several of the input signals are sent together merely by adding the output signals corresponding to the input signals. In a linear electrical circuit or transmission system, signals act as if they were present independently of one another; they do not interact. This is, indeed, the very criterion for a circuit being called a linear circuit.

While linearity is a truly astonishing property of nature, it is by no means a rare one. All circuits made up of the resistors, capacitors, and inductors discussed in Chapter I in connection with network theory are linear, and so are telegraph lines and cables.
Indeed, usually electrical circuits are linear, except when they include vacuum tubes, or transistors, or diodes, and sometimes even such circuits are substantially linear.

Because telegraph wires are linear, which is just to say because telegraph wires are such that electrical signals on them behave independently without interacting with one another, two telegraph signals can travel in opposite directions on the same wire at the same time without interfering with one another. However, while linearity is a fairly common phenomenon in electrical circuits, it is by no means a universal natural phenomenon. Two trains can't travel in opposite directions on the same track without interference. Presumably they could, though, if all the physical phenomena comprised in trains were linear. The reader might speculate on the unhappy lot of a truly linear race of beings.

With the very surprising property of linearity in mind, let us return to the transmission of signals over electrical circuits. We have noted that the output signal corresponding to most input signals has a different shape or variation with time from the input signal. Figures II-1 and II-2 illustrate this. However, it can be shown mathematically (but not here) that, if we use a sinusoidal signal, such as that of Figure II-4, as an input signal to a linear transmission path, we always get out a sine wave of the same period, or frequency. The amplitude of the output sine wave may be less than that of the input sine wave; we call this attenuation of the sinusoidal signal. The output sine wave may rise to a peak later than the input sine wave; we call this phase shift, or delay of the sinusoidal signal.

The amounts of the attenuation and delay depend on the frequency of the sine wave. In fact, the circuit may fail entirely to transmit sine waves of some frequencies.
Thus, corresponding to an input signal made up of several sinusoidal components, there will be an output signal having components of the same frequencies but of different relative phases or delays and of different amplitudes. Thus, in general the shape of the output signal will be different from the shape of the input signal. However, the difference can be thought of as caused by the changes in the relative delays and amplitudes of the various components, differences associated with their different frequencies. If the attenuation and delay of a circuit are the same for all frequencies, the shape of the output wave will be the same as that of the input wave; such a circuit is distortionless.

Because this is a very important matter, I have illustrated it in Figure II-6. In a we have an input signal which can be expressed as the sum of the two sinusoidal components, b and c. In transmission, b is neither attenuated nor delayed, so the output b' of the same frequency as b is the same as b. However, the output c' due to the input c is attenuated and delayed. The total output a', the sum of b' and c', clearly has a different shape from the input a. Yet, the output is made up of two components having the same frequencies that are present in the input. The frequency components merely have different relative phases or delays and different relative amplitudes in the output than in the input.

The Fourier analysis of signals into components of various frequencies makes it possible to study the transmission properties of a linear circuit for all signals in terms of the attenuation and delay it imposes on sine waves of various frequencies as they pass through it.

Fourier analysis is a powerful tool for the analysis of transmission problems. It provided mathematicians and engineers with a bewildering variety of results which they did not at first clearly understand.
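The situation of Figure II-6 can be imitated numerically. The following is a minimal sketch under stated assumptions: the circuit is represented by a hypothetical table giving, for each frequency, a gain (1 means no attenuation) and a delay in seconds, and the particular amplitudes, frequencies, gains, and delays are invented for illustration rather than read from the figure.

```python
import math

def component(t, amplitude, frequency, delay=0.0):
    """One sinusoidal component, optionally delayed by `delay` seconds."""
    return amplitude * math.sin(2 * math.pi * frequency * (t - delay))

# Input a = b + c, given as (amplitude, frequency in cps). Assumed values.
INPUT = [(1.0, 100.0), (0.5, 200.0)]

# Hypothetical circuits: frequency -> (gain, delay in seconds).
# b passes unchanged; c is attenuated and delayed.
DISTORTING = {100.0: (1.0, 0.0), 200.0: (0.5, 0.001)}
# Same gain and delay at every frequency: a distortionless circuit.
DISTORTIONLESS = {100.0: (0.5, 0.002), 200.0: (0.5, 0.002)}

def input_signal(t):
    return sum(component(t, a, f) for a, f in INPUT)

def output_signal(t, circuit):
    """Treat each component separately, then add the results (linearity)."""
    total = 0.0
    for amplitude, frequency in INPUT:
        gain, delay = circuit[frequency]
        total += component(t, gain * amplitude, frequency, delay)
    return total

if __name__ == "__main__":
    for k in range(6):
        t = k * 0.001
        print(f"t = {t:.3f}   in = {input_signal(t):+.3f}   "
              f"distorted = {output_signal(t, DISTORTING):+.3f}   "
              f"distortionless = {output_signal(t, DISTORTIONLESS):+.3f}")
```

With the distortionless table the output is just the input, uniformly weakened and uniformly late; with the other table the two components no longer line up as they did, and the shape changes even though the same two frequencies are present.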
Thus, early telegraphists invented all sorts of shapes and combinations of signals which were alleged to have desirable properties, but they were often inept in their mathematics and wrong in their arguments. There was much dispute concerning the efficacy of various signals in ameliorating the limitations imposed by circuit speed, intersymbol interference, noise, and limitations on transmitted power.

In 1917, Harry Nyquist came to the American Telephone and Telegraph Company immediately after receiving his Ph.D. at Yale (Ph.D.'s were considerably rarer in those days). Nyquist was a much better mathematician than most men who tackled the problems of telegraphy, and he has remained a clear, original, and philosophical thinker concerning communication. He tackled the problems of telegraphy with powerful methods and with clear insight. In 1924, he published his results in an important paper, "Certain Factors Affecting Telegraph Speed."

[Fig. II-6: an input signal a with components b and c, and the output a' with components b' and c']

This paper deals with a number of problems of telegraphy. Among other things, it clarifies the relation between the speed of telegraphy and the number of current values such as +1, -1 (two current values) or +3, +1, -1, -3 (four current values). Nyquist says that if we send symbols (successive current values) at a constant rate, the speed of transmission, W, is related to m, the number of different symbols or current values available, by

$$W = K \log m$$

Here K is a constant whose value depends on how many successive current values are sent each second. The quantity log m means logarithm of m. There are different bases for taking logarithms. If we choose 2 as a base, then the values of log m for various values of m are given in Table II.
TABLE II

   m     log m
   1      0
   2      1
   3      1.6
   4      2
   8      3
  16      4

To sum up the matter by means of an equation, log x is such a number that

$$2^{\log x} = x$$

We may see by taking the logarithm of each side that the following relation must be true:

$$\log 2^{\log x} = \log x$$

If we write M in place of log x, we see that

$$\log 2^{M} = M$$

All of this is consistent with Table II.

We can easily see by means of an example why the logarithm is the appropriate function in Nyquist's relation. Suppose that we wish to specify two independent choices of off-or-on, 0-or-1, simultaneously. There are four possible combinations of two independent 0-or-1 choices, as shown in Table III.

TABLE III

  Number of      First 0-or-1    Second 0-or-1
  Combination    Choice          Choice
      1             0               0
      2             0               1
      3             1               0
      4             1               1

Further, if we wish to specify three independent choices of 0-or-1 at the same time, we find eight combinations, as shown in Table IV.

TABLE IV

  Number of      First 0-or-1    Second 0-or-1    Third 0-or-1
  Combination    Choice          Choice           Choice
      1             0               0                0
      2             0               0                1
      3             0               1                0
      4             0               1                1
      5             1               0                0
      6             1               0                1
      7             1               1                0
      8             1               1                1

Similarly, if we wish to specify four independent 0-or-1 choices, we find sixteen different combinations, and, if we wish to specify M different independent 0-or-1 choices, we find $2^M$ different combinations.

If we can specify M independent 0-or-1 combinations at once, we can in effect send M independent messages at once, so surely the speed should be proportional to M. But, in sending M messages at once we have $2^M$ possible combinations of the M independent 0-or-1 choices. Thus, to send M messages at once, we need to be able to send $2^M$ different symbols or current values. Suppose that we can choose among $2^M$ different symbols.
Nyquist tells us that we should take the logarithm of the number of symbols in order to get the line speed, and

$$\log 2^{M} = M$$

Thus, the logarithm of the number of symbols is just the number of independent 0-or-1 choices that can be represented simultaneously, the number of independent messages we can send at once, so to speak.

Nyquist's relation says that by going from off-on telegraphy to three-current (+1, 0, -1) telegraphy we can increase the speed of sending letters or other symbols by 60 per cent, and if we use four current values (+3, +1, -1, -3) we can double the speed. This is, of course, just what Edison did with his quadruplex telegraph, for he sent two messages instead of one. Further, Nyquist showed that the use of eight current values (0, 1, 2, 3, 4, 5, 6, 7, or +7, +5, +3, +1, -1, -3, -5, -7) should enable us to send three times as fast as with two current values, since the logarithm of 8 to the base 2 is 3. However, he clearly realized that fluctuations in the attenuation of the circuit, interference or noise, and limitations on the power which can be used, make the use of many current values difficult.

Turning to the rate at which signal elements can be sent, Nyquist defined the line speed as one half of the number of signal elements (dots, spaces, current values) which can be transmitted in a second. We will find this definition particularly appropriate for reasons which Nyquist did not give in this early paper.

By the time that Nyquist wrote, it was common practice to send telegraph and telephone signals on the same wires. Telephony makes use of frequencies above 150 cps, while telegraphy can be carried out by means of lower frequency signals. Nyquist showed how telegraph signals could be so shaped as to have no sinusoidal components of high enough frequency to be heard as interference by telephones connected to the same line.
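The arithmetic behind Nyquist's relation is easily checked. The sketch below is an illustration only: the function names are hypothetical, logarithms are taken to the base 2 as in Table II, and the constant K is assumed to be the same in every case, so that it cancels when two speeds are compared.

```python
import itertools
import math

def relative_speed(m, baseline=2):
    """Speed with m current values relative to `baseline` values.

    From W = K log m with the same K in both cases, the ratio of
    speeds is log m / log baseline.
    """
    return math.log2(m) / math.log2(baseline)

def count_combinations(choices):
    """Count the distinct combinations of `choices` independent 0-or-1 choices."""
    return len(list(itertools.product((0, 1), repeat=choices)))

if __name__ == "__main__":
    for m in (2, 3, 4, 8, 16):
        print(f"m = {m:2d}   log m = {math.log2(m):.2f}   "
              f"speed relative to off-on telegraphy = {relative_speed(m):.2f}")
    for choices in (2, 3, 4):
        print(f"{choices} independent choices give "
              f"{count_combinations(choices)} combinations")
```

For three current values the ratio comes out a little under 1.6, which is the 60 per cent increase mentioned above, and for four it is exactly 2.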
He noted that the line speed, and hence also the speed of transmission, was proportional to the width or extent of the range or band (in the sense of strip) of frequencies used in telegraphy; we now call this range of frequencies the band width of a circuit or of a signal.

Finally, in analyzing one proposed sort of telegraph signal, Nyquist showed that it contained at all times a steady sinusoidal component of constant amplitude. While this component formed a part of the transmitter power used, it was useless at the receiver, for its eternal, regular fluctuations were perfectly predictable and could have been supplied at the receiver rather than transmitted thence over the circuit. Nyquist referred to this useless component of the signal, which, he said, conveyed no intelligence, as redundant, a word which we will encounter later.

Nyquist continued to study the problems of telegraphy, and in 1928 he published a second important paper, "Certain Topics in Telegraph Transmission Theory." In this he demonstrated a number of very important points. He showed that if one sends some number 2N of different current values per second, all the sinusoidal components of the signal with frequencies greater than N are redundant, in the sense that they are not needed in deducing from the received signal the succession of current values which were sent. If all of these higher frequencies were removed, one could still deduce by studying the signal which current values had been transmitted. Further, he showed how a signal could be constructed which would contain no frequencies above N cps and from which it would be very easy to deduce at the receiving point what current values had been sent. This second paper was more quantitative and exact than the first; together, they embrace much important material that is now embodied in communication theory. R. V. L.
Hartley, the inventor of the Hartley oscillator, was thinking philosophically about the transmission of information at about this time, and he summarized his reflections in a paper, "Transmission of Information," which he published in 1928.

Hartley had an interesting way of formulating the problem of communication, one of those ways of putting things which may seem obvious when stated but which can wait years for the insight that enables someone to make the statement. He regarded the sender of a message as equipped with a set of symbols (the letters of the alphabet for instance) from which he mentally selects symbol after symbol, thus generating a sequence of symbols. He observed that a chance event, such as the rolling of balls into pockets, might equally well generate such a sequence. He then defined H, the information of the message, as the logarithm of the number of possible sequences of symbols which might have been selected and showed that

$$H = n \log s$$

Here n is the number of symbols selected, and s is the number of different symbols in the set from which symbols are selected.

This is acceptable in the light of our present knowledge of information theory only if successive symbols are chosen independently and if any of the s symbols is equally likely to be selected. In this case, we need merely note, as before, that the logarithm of s, the number of symbols, is the number of independent 0-or-1 choices that can be represented or sent simultaneously, and it is reasonable that the rate of transmission of information should be the rate of sending symbols per second, n, times the number of independent 0-or-1 choices that can be conveyed per symbol.

Hartley goes on to the problem of encoding the primary symbols (letters of the alphabet, for instance) in terms of secondary symbols (e.g., the sequences of dots, spaces, and dashes of the Morse code).
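Hartley's formula follows from counting: n selections from s symbols can form s multiplied by itself n times different sequences, and the logarithm of that number is n log s. A brief Python sketch, with hypothetical function names and with logarithms assumed to be to the base 2, confirms the agreement for a small case.

```python
import math

def count_sequences(n, s):
    """Number of possible sequences of n symbols drawn from a set of s."""
    return s ** n

def hartley_information(n, s):
    """Hartley's H = n log s, here with logarithms to the base 2."""
    return n * math.log2(s)

if __name__ == "__main__":
    n, s = 3, 26   # three letters chosen from a 26-letter alphabet (assumed example)
    print("possible sequences:", count_sequences(n, s))
    print("log of that number:", math.log2(count_sequences(n, s)))
    print("n log s:           ", hartley_information(n, s))
```

The last two printed values agree, as they must. The sketch, like the formula, takes every symbol to be equally likely and each selection to be independent of the others.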
He observes that restrictions on the selection of symbols (the fact that E is selected more often than Z) should govern the lengths of the secondary symbols (Morse code representations) if we are to transmit messages most swiftly. As we have seen, Morse himself understood this, but Hartley stated the matter in a way which encouraged mathematical attack and inspired further work. Hartley also suggested a way of applying such considerations to continuous signals, such as telephone signals or picture signals.

Finally, Hartley stated, in accord with Nyquist, that the amount of information which can be transmitted is proportional to the band width times the time of transmission. But this makes us wonder about the number of allowable current values, which is also important to speed of transmission. How are we to enumerate them?

After the work of Nyquist and Hartley, communication theory appears to have taken a prolonged and comfortable rest. Workers busily built and studied particular communication systems. The art grew very complicated indeed during World War II. Much new understanding of particular new communication systems and devices was achieved, but no broad philosophical principles were laid down.

During the war it became important to predict from inaccurate or "noisy" radar data the courses of airplanes, so that the planes could be shot down. This raised an important question: Suppose that one has a varying electric current which represents data concerning the present position of an airplane but that there is added to it a second meaningless erratic current, that is, a noise. It may be that the frequencies most strongly present in the signal are different from the frequencies most strongly present in the noise.
If this is so, it would seem desirable to pass the signal with the noise added through an electrical circuit or filter which attenuates the frequencies strongly present in the noise but does not attenuate very much the frequencies strongly present in the signal. Then, the resulting electric current can be passed through other circuits in an effort to estimate or predict what the value of the original signal, without noise, will be a few seconds from the present. But what sort of combination of electrical circuits will enable one best to predict from the present noisy signal the value of the true signal a few seconds in the future?

In essence, the problem is one in which we deal with not one but with a whole ensemble of possible signals (courses of the plane), so that we do not know in advance which signal we are dealing with. Further, we are troubled with an unpredictable noise.

This problem was solved in Russia by A. N. Kolmogoroff. In this country it was solved independently by Norbert Wiener. Wiener is a mathematician whose background ideally fitted him to deal with this sort of problem, and during the war he produced a yellow-bound document, affectionately called "the yellow peril" (because of the headaches it caused), in which he solved the difficult problem.

During and after the war another mathematician, Claude E. Shannon, interested himself in the general problem of communication. Shannon began by considering the relative advantages of many new and fanciful communication systems, and he sought some basic method of comparing their merits. In the same year (1948) that Wiener published his book, Cybernetics, which deals with communication and control, Shannon published in two parts a paper which is regarded as the foundation of modern communication theory.
Wiener and Shannon alike consider, not the problem of a single signal, but the problem of dealing adequately with any signal selected from a group or ensemble of possible signals. There was a free interchange among various workers before the publication of either Wiener's book or Shannon's paper, and similar ideas and expressions appear in both, although Shannon's interpretation appears to be unique.

Chiefly, Wiener's name has come to be associated with the field of extracting signals of a given ensemble from noise of a known type. An example of this has been given above. The enemy pilot follows a course which he chooses, and our radar adds noise of natural origin to the signals which represent the position of the plane. We have a set of possible signals (possible courses of the airplane), not of our own choosing, mixed with noise, not of our own choosing, and we try to make the best estimate of the present or future value of the signal (the present or future position of the airplane) despite the noise.

Shannon's name has come to be associated with matters of so encoding messages chosen from a known ensemble that they can be transmitted accurately and swiftly in the presence of noise. As an example, we may have as a message source English text, not of our own choosing, and an electrical circuit, say, a noisy telegraph cable, not of our own choosing. But in the problem treated by Shannon, we are allowed to choose how we shall represent the message as an electrical signal: how many current values we shall allow, for instance, and how many we shall transmit per second. The problem, then, is not how to treat a signal plus noise so as to get a best estimate of the signal, but what sort of signal to send so as best to convey messages of a given type over a particular sort of noisy circuit.

This matter of efficient encoding and its consequences form the chief substance of information theory.
In that an ensemble of messages is considered, the work reflects the spirit of the work of Kolmogoroff and Wiener and of the work of Morse and Hartley as well.

It would be useless to review here the content of Shannon's work, for that is what this book is about. We shall see, however, that it sheds further light on all the problems raised by Nyquist and Hartley and goes far beyond those problems.

In looking back on the origins of communication theory, two other names should perhaps be mentioned. In 1946, Dennis Gabor published an ingenious paper, "Theory of Communication." This, suggestive as it is, missed the inclusion of noise, which is at the heart of modern communication theory. Further, in 1949, W. G. Tuller published an interesting paper, "Theoretical Limits on the Rate of Transmission of Information," which in part parallels Shannon's work.

The gist of this chapter has been that the very general theory of communication which Shannon has given us grew out of the study of particular problems of electrical communication. Morse was faced with the problem of representing the letters of the alphabet by short or long pulses of current with intervening spaces of no current, that is, by the dots, dashes, and spaces of telegraphy. He wisely chose to represent common letters by short combinations of dots and dashes and uncommon letters by long combinations; this was a first step in efficient encoding of messages, a vital part of communication theory.

Ingenious inventors who followed Morse made use of different intensities and directions of current flow in order to give the sender a greater choice of signals than merely off-or-on. This made it possible to send more letters per unit time, but it made the signal more susceptible to disturbance by unwanted electrical disturbances called noise as well as by inability of circuits to transmit accurately rapid changes of current.
An evaluation of the relative advantages of many different sorts of telegraph signals was desirable. Mathematical tools were needed for such a study. One of the most important of these is Fourier analysis, which makes it possible to represent any signal as a sum of sine waves of various frequencies.

Most communication circuits are linear. This means that several signals present in the circuit do not interact or interfere. It can be shown that while even linear circuits change the shape of most signals, the effect of a linear circuit on a sine wave is merely to make it weaker and to delay its time of arrival. Hence, when a complicated signal is represented as a sum of sine waves of various frequencies, it is easy to calculate the effect of a linear circuit on each sinusoidal component separately and then to add up the weakened or attenuated sinusoidal components in order to obtain the over-all received signal.

Nyquist showed that the number of distinct, different current values which can be sent over a circuit per second is twice the total range or band width of frequencies used. Thus, the rate at which letters of text can be transmitted is proportional to band width. Nyquist and Hartley also showed that the rate at which letters of text can be transmitted is proportional to the logarithm of the number of current values used.

A complete theory of communication required other mathematical tools and new ideas. These are related to work done by Kolmogoroff and Wiener, who considered the problem of an unknown signal of a given type disturbed by the addition of noise. How does one best estimate what the signal is despite the presence of the interfering noise? Kolmogoroff and Wiener solved this problem.

The problem Shannon set himself is somewhat different. Suppose we have a message source which produces messages of a given type, such as English text. Suppose we have a noisy communication channel of specified characteristics.
How can we represent or encode messages from the message source by means of electrical signals so as to attain the fastest possible transmission over the noisy channel? Indeed, how fast can we transmit a given type of message over a given channel without error? In a rough and general way, this is the problem that Shannon set himself and solved.

CHAPTER III

A Mathematical Model

A MATHEMATICAL THEORY which seeks to explain and to predict the events in the world about us always deals with a simplified model of the world, a mathematical model in which only things pertinent to the behavior under consideration enter.

Thus, planets are composed of various substances, solid, liquid, and gaseous, at various pressures and temperatures. The parts of their substances exposed to the rays of the sun reflect various fractions of the different colors of the light which falls upon them, so that when we observe planets we see on them various colored features. However, the mathematical astronomer in predicting the orbit of a planet about the sun need take into account only the total mass of the sun, the distance of the planet from the sun, and the speed and direction of the planet's motion at some initial instant. For a more refined calculation, the astronomer must also take into account the total mass of the planet and the motions and masses of other planets which exert gravitational forces on it.

This does not mean that astronomers are not concerned with other aspects of planets, and of stars and nebulae as well. The important point is that they need not take these other matters into consideration in computing planetary orbits. The great beauty and power of a mathematical theory or model lies in the separation of the relevant from the irrelevant, so that certain observable behavior can be related and understood without the need of comprehending the whole nature and behavior of the universe.
Mathematical models can have various degrees of accuracy or applicability. Thus, we can accurately predict the orbits of planets by regarding them as rigid bodies, despite the fact that no truly rigid body exists. On the other hand, the long-term motions of our moon can only be understood by taking into account the motion of the waters over the face of the earth, that is, the tides. Thus, in dealing very precisely with lunar motion we cannot regard the earth as a rigid body.

In a similar way, in network theory we study the electrical properties of interconnections of ideal inductors, capacitors, and resistors, which are assigned certain simple mathematical properties. The components of which the actual useful circuits in radio, TV, and telephone equipment are made only approximate the properties of the ideal inductors, capacitors, and resistors of network theory. Sometimes, the difference is trivial and can be disregarded. Sometimes it must be taken into account by more refined calculations.

Of course, a mathematical model may be a very crude or even an invalid representation of events in the real world. Thus, the self-interested, gain-motivated "economic man" of early economic theory has fallen into disfavor because the behavior of the economic man does not appear to correspond to or to usefully explain the actual behavior of our economic world and of the people in it.

In the orbits of the planets and the behavior of networks, we have examples of idealized deterministic systems which have the sort of predictable behavior we ordinarily expect of machines. Astronomers can compute the positions which the planets will occupy millennia in the future. Network theory tells us all the subsequent behavior of an electrical network when it is excited by a particular electrical signal. Even the individual economic man is deterministic, for he will always act for his economic gain.
But, if he at some time gambles on the honest throw of a die because the odds favor him, his economic fate becomes to a degree unpredictable, for he may lose even though the odds do favor him.

We can, however, make a mathematical model for purely chance events, such as the drawing of some number, say three, of white or black balls from a container holding equal numbers of white and black balls. This model tells us, in fact, that after many trials we will have drawn all white about 1/8 of the time, two whites and a black about 3/8 of the time, two blacks and a white about 3/8 of the time, and all black about 1/8 of the time. It can also tell us how much of a deviation from these proportions we may reasonably expect after a given number of trials.

Our experience indicates that the behavior of actual human beings is neither as determined as that of the economic man nor as simply random as the throw of a die or as the drawing of balls from a mixture of black and white balls. It is clear, however, that a deterministic model will not get us far in the consideration of human behavior, such as human communication, while a random or statistical model might.

We all know that the actuarial tables used by insurance companies make fair predictions of the fraction of a large group of men in a given age group who will die in one year, despite the fact that we cannot predict when a particular man will die. Thus a statistical model may enable us to understand and even to make some sort of predictions concerning human behavior, even as we can predict how often, on the average, we will draw three black balls by chance from an equal mixture of white and black balls.

It might be objected that actuarial tables make predictions concerning groups of people, not predictions concerning individuals. However, experience teaches us that we can make predictions concerning the behavior of individual human beings as well as of groups of individuals.
For instance, in counting the frequency of usage of the letter E in all English prose we will find that E constitutes about 0.13 of all the letters appearing, while W, for instance, constitutes only about 0.02 of all letters appearing. But, we also find almost the same proportions of E's and W's in the prose written by any one person. Thus, we can predict with some confidence that if you, or I, or Joe Doakes, or anyone else writes a long letter, or an article, or a book, about 0.13 of the letters he uses will be E's.

This predictability of behavior limits our freedom no more than does any other habit. We don't have to use in our writing the same fraction of E's, or of any other letter, that everyone else does. In fact, several untrammeled individuals have broken away from the common pattern. William F. Friedman, the eminent cryptanalyst and author of The Shakespearean Ciphers Examined, has supplied me with the following examples.

Gottlob Burmann, a German poet who lived from 1737 to 1805, wrote 130 poems, including a total of 20,000 words, without once using the letter R. Further, during the last seventeen years of his life, Burmann even omitted the letter from his daily conversation.

In each of five stories published by Alonso Alcala y Herrera in Lisbon in 1641 a different vowel was suppressed. Francisco Navarrete y Ribera (1659), Fernando Jacinto de Zurita y Haro (1654), and Manuel Lorenzo de Lizarazu y Berbuizana (1654) provided other examples.

In 1939, Ernest Vincent Wright published a 267-page novel, Gadsby, in which no use is made of the letter E. I quote a paragraph below:

Upon this basis I am going to show you how a bunch of bright young folks did find a champion; a man with boys and girls of his own; a man of so dominating and happy individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story about a small town.
It is not a gossipy yarn; nor is it a dry, monotonous account, full of such customary "fill-ins" as "romantic moonlight casting murky shadows down a long, winding country road." Nor will it say anything about tinklings lulling distant folds; robins carolling at twilight, nor any "warm glow of lamplight" from a cabin window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth as it is today; and a practical discarding of that worn-out notion that "a child don't know anything."

While such exercises of free will show that it is not impossible to break the chains of habit, we ordinarily write in a more conventional manner. When we are not going out of our way to demonstrate that we can do otherwise, we customarily use our due fraction of 0.13 E's with almost the consistency of a machine or a mathematical rule.

We cannot argue from this to the converse idea that a machine into which the same habits were built could write English text. However, Shannon has demonstrated how English words and text can be approximated by a mathematical process which could be carried out by a machine.

Suppose, for instance, that we merely produce a sequence of letters and spaces with equal probabilities. We might do this by putting equal numbers of cards marked with each letter and with the space into a hat, mixing them up, drawing a card, recording its symbol, returning it, remixing, drawing another card, and so on. This gives what Shannon calls the zero-order approximation to English text. His example, obtained by an equivalent process, goes:

1. Zero-order approximation (symbols independent and equiprobable)

XFOML RXKHRJFFJTUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.

Here there are far too many Z's and W's, and not nearly enough E's and spaces. We can approach more nearly to English text by choosing letters independently of one another, but choosing E more often than W or Z.
We could do this by putting many E's and few W's and Z's into the hat, mixing, and drawing out the letters. As the probability that a given letter is an E should be .13, out of every hundred letters we put into the hat, 13 should be E's. As the probability that a letter will be W should be .02, out of each hundred letters we put into the hat, 2 should be W's, and so on. Here is the result of an equivalent procedure, which gives what Shannon calls a first-order approximation of English text:

2. First-order approximation (symbols independent but with frequencies of English text).

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL

In English text we almost never encounter any pair of letters beginning with Q except QU. The probability of encountering QX or QZ is essentially zero. While the probability of QU is not 0, it is so small as not to be listed in the tables I consulted. On the other hand, the probability of TH is .037, the probability of OR is .010 and the probability of WE is .006. These probabilities have the following meaning. In a stretch of text containing, say, 10,001 letters, there are 10,000 successive pairs of letters, i.e., the first and second, the second and third, and so on to the next to last and the last. Of the pairs a certain number are the letters TH. This might be 370 pairs. If we divide the total number of times we find TH, which we have assumed to be 370 times, by the total number of pairs of letters, which we have assumed to be 10,000, we get the probability that a randomly selected pair of letters in the text will be TH, that is, 370/10,000, or .037.

Diligent cryptanalysts have made tables of such digram probabilities for English text. To see how we might use these in constructing sequences of letters with the same digram probabilities as English text, let us assume that we use 27 hats, 26 for digrams beginning with each of the letters and one for digrams beginning with a space.
We will then put a large number of digrams into the hats according to the probabilities of the digrams. Out of 1,000 digrams we would put in 37 TH's, 6 WE's, and so on.

Let us consider for a moment the meaning of these hats full of digrams in terms of the original counts which led to the evaluations of digram probabilities.

In going through the text letter by letter we will encounter every T in the text. Thus, the number of digrams beginning with T, all of which we put in one hat, will be the same as the number of T's. The fraction these represent of the total number of digrams counted is the probability of encountering T in the text; that is, .10. We might call this probability p(T)

p(T) = .10

We may note that this is also the fraction of digrams, distributed among the hats, which end in T as well as the fraction that begin with T.

Again, basing our total numbers on 1,001 letters of text, or 1,000 digrams, the number of times the digram TH is encountered is 37, and so the probability of encountering the digram TH, which we might call p(T, H), is

p(T, H) = .037

Now we see that 0.10 of the 1,000 digrams, or 100, will begin with T and hence will be in the T hat, and of these 37 will be TH. Thus, the fraction of the T digrams which are TH will be 37/100, or 0.37. Correspondingly, we say that the probability that a digram beginning with T is TH, which we might call p_T(H), is

p_T(H) = .37

This is called the conditional probability that the letter following a T will be an H.

One can use these probabilities, which are adequately represented by the numbers of various digrams in the various hats, in the construction of text which has both the same letter frequencies and digram frequencies as does English text. To do this one draws the first digram at random from any hat and writes down its letters. He then draws a second digram from the hat indicated by the second letter of the first digram and writes down the second letter of this second digram.
Then he draws a third digram from the hat indicated by the second letter of the second digram and writes down the second letter of this third digram, and so on. The space is treated just like a letter. There is a particular probability that a space will follow a particular letter (ending a "word") and a particular probability that a particular letter will follow a space (starting a new "word").

By an equivalent process, Shannon constructed what he calls a second-order approximation to English; it is:

3. Second-order approximation (digram structure as in English).

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Cryptanalysts have even produced tables giving the probabilities of groups of three letters, called trigram probabilities. These can be used to construct what Shannon calls a third-order approximation to English. His example goes:

4. Third-order approximation (trigram structure as in English).

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE

When we examine Shannon's examples 1 through 4 we see an increasing resemblance to English text. Example 1, the zero-order approximation, has no wordlike combinations. In example 2, which takes letter frequencies into account, OCRO and NAH somewhat resemble English words. In example 3, which takes digram frequencies into account, all the "words" are pronounceable, and ON, ARE, BE, AT, and ANDY occur in English. In example 4, which takes trigram frequencies into account, we have eight English words and many English-sounding words, such as GROCID, PONDENOME, and DEMONSTURES.

G. T.
Guilbaud has carried out a similar process using the statistics of Latin and has so produced a third-order approximation (one taking into account trigram frequencies) resembling Latin, which I quote below:

IBUS CENT IPITIA VETIS IPSE CUM VIVIVS SE ACETITI DEDENTUR

The underlined words are genuine Latin words.

It is clear from such examples that by giving a machine certain statistics of a language, the probabilities of finding a particular letter or group of 1, or 2, or 3, or n letters, and by giving the machine an ability equivalent to picking a ball from a hat, flipping a coin, or choosing a random number, we could make the machine produce a close approximation to English text or to text in some other language. The more complete information we gave the machine, the more closely would its product resemble English or other text, both in its statistical structure and to the human eye.

If we allow the machine to choose groups of three letters on the basis of their probability, then any three-letter combination which it produces must be an English word or a part of an English word and any two-letter "word" must be an English word. The machine is, however, less inhibited than a person, who ordinarily writes down only sequences of letters which do spell words. Thus, he misses ever writing down pompous PONDENOME, suspect ILONASIVE, somewhat vulgar GROCID, learned DEMONSTURES, and wacky but delightful DEAMY. Of course, a man in principle could write down such combinations of letters but ordinarily he doesn't.

We could cure the machine of this ability to produce un-English words by making it choose among groups of letters as long as the longest English word. But, it would be much simpler merely to supply the machine with words rather than letters and to let it produce these words according to certain probabilities.
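The hat-drawing procedure for the second-order (digram) approximation can be sketched in a few lines of Python. This is only an illustrative sketch, not Shannon's own procedure: the names build_hats and second_order_text are hypothetical, and the digram counts are taken from whatever sample text is supplied rather than from a cryptanalyst's table.

```python
import random
from collections import Counter, defaultdict

def build_hats(text):
    """One 'hat' per symbol: counts of the symbols that follow it in `text`."""
    hats = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        hats[first][second] += 1
    return hats

def second_order_text(text, length, rng):
    """Generate `length` symbols whose digram statistics follow `text`.

    The space is treated just like a letter.
    """
    hats = build_hats(text)
    # Every symbol that begins at least one digram, with its count.
    starts = Counter(text[:-1])

    def draw_start():
        return rng.choices(list(starts), weights=list(starts.values()))[0]

    current = draw_start()
    out = [current]
    while len(out) < length:
        hat = hats[current]
        if hat:
            # Draw from the hat indicated by the previous symbol.
            current = rng.choices(list(hat), weights=list(hat.values()))[0]
        else:
            # Assumption: if a symbol never had a successor, start afresh.
            current = draw_start()
        out.append(current)
    return "".join(out)
```

With hats = build_hats(sample), the conditional probability that H follows T in the sample is hats["T"]["H"] / sum(hats["T"].values()), which corresponds to the quantity called p_T(H) earlier. A longer sample gives hats that better reflect English as a whole.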
Shannon has given an example in which words were selected independently, but with the probabilities of their occurring in English text, so that the, and, man, etc., occur in the same proportion as in English. This could be achieved by cutting text into words, scrambling the words in a hat, and then drawing out a succession of words. He calls this a first-order word approximation. It runs as follows:

5. First-order word approximation. Here words are chosen independently but with their appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

There are no tables which give the probability of different pairs of words. However, Shannon constructed a random passage in which the probabilities of pairs of words were the same as in English text by the following expedient. He chose a first pair of words at random in a novel. He then looked through the novel for the next occurrence of the second word of the first pair and added the word which followed it in this new occurrence, and so on.

This process gave him the following second-order word approximation to English.

6. Second-order word approximation. The word transition probabilities are correct, but no further structure is included.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

We see that there are stretches of several words in this final passage which resemble and, indeed, might occur in English text.

Let us consider what we have found. In actual English text, in that text which we send by teletypewriter, for instance, particular letters occur with very nearly constant frequencies. Pairs of letters and triplets and quadruplets of letters occur with almost constant frequencies over long stretches of the text.
Words and pairs of words occur with almost constant frequencies. Further, we can by means of a random mathematical process, carried out by a machine if you like, produce sequences of English words or letters exhibiting these same statistics.

Such a scheme, even if refined greatly, would not, however, produce all sequences of words that a person might utter. Carried to an extreme, it would be confined to combinations of words which had occurred; otherwise, there would be no statistical data available on them. Yet I may say, "The magenta typhoon whirled the farded bishop away," and this may well never have been said before.

The real rules of English text deal not with letters or words alone but with classes of words and their rules of association, that is, with grammar. Linguists and engineers who try to make machines for translating one language into another must find these rules, so that their machines can combine words to form grammatical utterances even when these exact combinations have not occurred before (and also so that the meaning of words in the text to be translated can be deduced from the context). This is a big problem. It is easy, however, to describe a "machine" which randomly produces endless, grammatical utterances of a limited sort.

Figure III-1 is a diagram of such a "machine." Each numbered box represents a state of the machine. Because there is only a finite number of boxes or states, this is called a finite-state machine. From each box a number of arrows go to other boxes. In this particular machine, only two arrows go from each box to each of two other boxes. Also, in this case, each arrow is labeled 1/2. This indicates that the probability of the machine passing from, for instance, state 2 to state 3 is 1/2 and the probability of the machine passing from state 2 to state 4 is 1/2.

To make the machine run, we need a sequence of random choices, which we can obtain by flipping a coin repeatedly.
We can let heads (H) mean follow the top arrow and tails (T), follow the bottom arrow. This will tell us to pass to a new state. When we do this we print out the word, words, or symbol written in that state box and flip again to get a new state.

As an example, if we started in state 7 and flipped the following sequence of heads and tails: T H H H T T H T T T H H H H, the "machine" would print out

THE COMMUNIST PARTY INVESTIGATED THE CONGRESS. THE COMMUNIST PARTY PURGED THE CONGRESS AND DESTROYED THE COMMUNIST PARTY AND FOUND EVIDENCE OF THE CONGRESS.

This can go on and on, never retracing its whole course and producing "sentences" of unlimited length.

Random choice according to a table of probabilities of sequences of symbols (letters and space) or words can produce material resembling English text. A finite-state machine with a random choice among allowed transitions from state to state can produce material resembling English text. Either process is called a stochastic process, because of the random element involved in it.

We have examined a number of properties of English text. We have seen that the average frequency of E's is commonly constant for both the English text produced by one writer and, also, for the text produced by all writers. Other more complicated statistics, such as the frequency of digrams (TH, WE, and other letter pairs), are also essentially constant. Further, we have shown that English-like text can be produced by a sequence of random choices, such as drawings of slips of paper from hats, or flips of a coin, if the proper probabilities are in some way built into the process. One way of producing such text is through the use of a finite-state machine, such as that of Figure III-1.

We have been seeking a mathematical model of a source of English text.
Such a model should be capable of producing text which corresponds closely to actual English text, closely enough so that the problem of encoding and transmitting such text is essentially equivalent to the problem of encoding and transmitting actual English text. The mathematical properties of the model must be mathematically defined so that useful theorems can be proved concerning the encoding and transmission of the text it produces, theorems which are applicable to a high degree of approximation to the encoding of actual English text. It would, however, be asking too much to insist that the production of actual English text conform with mathematical exactitude to the operation of the model.

The mathematical model which Shannon adopted to represent the production of text (and of spoken and visual messages as well) is the ergodic source. To understand what an ergodic source is, we must first understand what a stationary source is, and to explain this is our next order of business.

The general idea of a stationary source is well conveyed by the name. Imagine, for instance, a process, i.e., an imaginary machine, that produces forever after it is started the sequences of characters

AEAEAEAEAE, etc.

Clearly, what comes later is like what has gone before, and stationary seems an apt designation of such a source of characters. We might contrast this with a source of characters which, after starting, produced

AEAAEEAAAEEE, etc.

Here the strings of A's and E's get longer and longer without end; certainly this is not a stationary source.

Similarly, a sequence of characters chosen at random with some assigned probabilities (the first-order letter approximation of example 2 above) constitutes a stationary source and so do the digram and trigram sources of examples 3 and 4. The general idea of a stationary source is clear enough. An adequate mathematical definition is a little more difficult.
The idea of stationarity of a source demands no change with time. Yet, consider a digram source, in which the probability of the second character depends on what the previous character is. If we start such a source out on the letter A, several different letters can follow, while if we start such a source out on the letter Q, the second letter must be U. In general, the manner of starting the source will influence the statistics of the sequence of characters produced, at least for some distance from the start.

To get around this, the mathematician says, let us not consider just one sequence of characters produced by the source. After all, our source is an imaginary machine, and we can quite well imagine that it has been started an infinite number of times, so as to produce an infinite number of sequences of characters. Such an infinite number of sequences is called an ensemble of sequences.

These sequences could be started in any specified manner. Thus, in the case of a digram source, we can if we wish start a fraction, 0.13, of the sequences with E (this is just the probability of E in English text), a fraction, 0.02, with W (the probability of W), and so on. If we do this, we will find that the fraction of E's is the same, averaging over all the first letters of the ensemble of sequences, as it is averaging over all the second letters of the ensemble, as it is averaging over all the third letters of the ensemble, and so on. No matter what position from the beginning we choose, the fraction of E's or of any other letter occurring in that position, taken over all the sequences in the ensemble, is the same. This independence with respect to position will be true also for the probability with which TH or WE occurs among the first, second, third, and subsequent pairs of letters in the sequences of the ensemble.

This is what we mean by stationarity.
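The role of the starting fractions can be checked on a toy digram source. The following Python sketch is an illustration under stated assumptions, not anything taken from Shannon: the two-letter source and its probabilities are invented, and the names TRANSITIONS, next_position, and positions are hypothetical. In this source, B is always followed by A, much as Q is always followed by U.

```python
from fractions import Fraction as F

# Hypothetical digram source: TRANSITIONS[x][y] is the conditional
# probability that letter y follows letter x.
TRANSITIONS = {
    "A": {"A": F(1, 2), "B": F(1, 2)},
    "B": {"A": F(1), "B": F(0)},
}

def next_position(dist):
    """Letter fractions one position later, averaged over the ensemble."""
    return {
        y: sum(dist[x] * TRANSITIONS[x][y] for x in TRANSITIONS)
        for y in TRANSITIONS
    }

def positions(start, n):
    """Letter fractions at positions 1..n of the ensemble of sequences."""
    dists = [start]
    for _ in range(n - 1):
        dists.append(next_position(dists[-1]))
    return dists

# Start 2/3 of the sequences with A and 1/3 with B: the fractions are
# then the same at every position from the beginning.
stationary_start = {"A": F(2, 3), "B": F(1, 3)}

# Start every sequence with B: the fractions change from one position
# to the next, at least for some distance from the start.
always_b_start = {"A": F(0), "B": F(1)}
```

Calling positions(stationary_start, 5) gives the same fractions at all five positions, while positions(always_b_start, 3) gives different fractions at each of the first three positions.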
If we can find a way of assigning probabilities to the various starting conditions used in forming the ensemble of sequences of characters which we allow the source to produce, probabilities such that any statistic obtained by averaging over the ensemble doesn't depend on the distance from the start at which we take an average, then the source is said to be stationary. This may seem difficult or obscure to the reader, but the difficulty arises in giving a useful and exact mathematical form to an idea which would otherwise be mathematically useless.

In the argument above we have, in discussing the infinite ensemble of sequences produced by a source, considered averaging over all first characters or over all second or third characters (or pairs, or triples of characters, as other examples). Such an average is called an ensemble average. It is different from a sort of average we talked about earlier in this chapter, in which we lumped together all the characters in one sequence and took the average over them. Such an average is called a time average.

The time average and the ensemble average can be different. For instance, consider a source which starts a third of the time with A and produces alternately A and B, a third of the time with B and produces alternately B and A, and a third of the time with E and produces a string of E's. The possible sequences are

1. ABABABAB, etc.
2. BABABABA, etc.
3. EEEEEEEE, etc.

We can see that this is a stationary source, yet we have the probabilities shown in Table V.

TABLE V

Probability of   Time Average    Time Average    Time Average    Ensemble
                 Sequence (1)    Sequence (2)    Sequence (3)    Average
A                1/2             1/2             0               1/3
B                1/2             1/2             0               1/3
E                0               0               1               1/3

When a source is stationary, and when every possible ensemble average (of letters, digrams, trigrams, etc.) is equal to the corresponding time average, the source is said to be ergodic.
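The entries of Table V can be reproduced with a short Python sketch of the three-sequence source described above. The function names sequence, time_average, and ensemble_average are hypothetical, and the time averages are taken over a finite, even number of characters rather than over an infinite sequence.

```python
from fractions import Fraction as F

# Each starting letter occurs a third of the time.
STARTS = {"A": F(1, 3), "B": F(1, 3), "E": F(1, 3)}

def sequence(start, n):
    """First n characters of the sequence that begins with `start`."""
    pattern = {"A": "AB", "B": "BA", "E": "E"}[start]
    return "".join(pattern[i % len(pattern)] for i in range(n))

def time_average(start, letter, n):
    """Fraction of `letter` among the first n characters of one sequence."""
    return F(sequence(start, n).count(letter), n)

def ensemble_average(letter, position):
    """Fraction of sequences in the ensemble showing `letter` at `position`.

    Positions are counted from 0.
    """
    return sum(
        weight
        for start, weight in STARTS.items()
        if sequence(start, position + 1)[position] == letter
    )
```

The ensemble average of each letter is 1/3 at every position, so the source is stationary, yet no single sequence has a time average of 1/3 for any letter. The source is therefore not ergodic.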
The theorems of information theory which are discussed in subsequent chapters apply to ergodic sources, and their proofs rest on the assumption that the message source is ergodic.¹

¹ Some work has been done on the encoding of nonstationary sources, but it is not discussed in this book.

While we have here discussed discrete sources which produce sequences of characters, information theory also deals with continuous sources, which generate smoothly varying signals, such as the acoustic waves of speech or the fluctuating electric currents which correspond to these in telephony. The sources of such signals are also assumed to be ergodic.

Why is an ergodic message source an appropriate and profitable mathematical model for study? For one thing, we see by examining the definition of an ergodic source as given above that for an ergodic source the statistics of a message, for instance, the frequency of occurrence of a letter, such as E, or of a digram, such as TH, do not vary along the length of the message. As we analyze a longer and longer stretch of a message, we get a better and better estimate of the probabilities of occurrence of various letters and letter groups. In other words, by examining a longer and longer stretch of a message we are able to arrive at and refine a mathematical description of the source.

Further, the probabilities, the description of the source arrived at through such an examination of one message, apply equally well to all messages generated by the source and not just to the particular message examined. This is assured by the fact that the time and ensemble averages are the same.

Thus, an ergodic source is a particularly simple kind of probabilistic or stochastic source of messages, and simple processes are easier to deal with mathematically than are complicated processes. However, simplicity in itself is not enough.
The ergodic source would not be of interest in communication theory if it were not reasonably realistic as well as simple.

Communication theory has two sides. It has a mathematically exact side, which deals rigorously with hypothetical, exactly ergodic sources, sources which we can imagine to produce infinite ensembles of infinite sequences of symbols. Mathematically, we are free to investigate rigorously either such a source itself or the infinite ensemble of messages which it can produce.

We use the theorems of communication theory in connection with the transmission of actual English text. A human being is not a hypothetical, mathematically defined machine. He cannot produce even one infinite sequence of characters, let alone an infinite ensemble of sequences.

A man does, however, produce many long sequences of characters, and all the writers of English together collectively produce a great many such long sequences of characters. In fact, part of this huge output of very long sequences of characters constitutes the messages actually sent by teletypewriter.

We will, thus, think of all the different Americans who write out telegrams in English as being, approximately at least, an ergodic source of telegraph messages and of all Americans speaking over telephones as being, approximately at least, an ergodic source of telephone signals.

Clearly, however, all men writing French plus all men writing English could not constitute an ergodic source. The output of each would have certain time-average probabilities for letters, digrams, trigrams, words, and so on, but the probabilities for the English text would be different from the probabilities for the French text, and the ensemble average would resemble neither.

We will not assert that all writers of English (and all speakers of English) constitute a strictly ergodic message source.
The statistics of the English we produce change somewhat as we change subject or purpose, and different people write somewhat differently. Too, in producing telephone signals by speaking, some people speak softly, some bellow, and some bellow only when they are angry. What we do assert is that we find a remarkable uniformity in many statistics of messages, as in the case of the probability of E for different samples of English text. Speech and writing as ergodic sources are not quite true to the real world, but they are far truer than is the economic man. They are true enough to be useful.

This difference between the exactly ergodic source of the mathematical theory of communication and the approximately ergodic message sources of the real world should be kept in mind. We must exercise a reasonable caution in applying the conclusions of the mathematical theory of communication to actual problems. We are used to this in other fields. For instance, mathematics tells us that we can deduce the diameter of a circle from the coordinates or locations of any three points on the circle, and this is true for absolutely exact coordinates. Yet no sensible man would try to determine the diameter of a somewhat fuzzy real circle drawn on a sheet of paper by trying to measure very exactly the positions of three points a thousandth of an inch apart on its circumference. Rather, he would draw a line through the center and measure the diameter directly as the distance between diametrically opposite points. This is just the sort of judgment and caution one must always use in applying an exact mathematical theory to an inexact practical case.

Whatever caution we invoke, the fact that we have used a random, probabilistic, stochastic process as a model of man in his role of a message source raises philosophical questions. Does this mean that we imply that man acts at random? There is no such implication.
Perhaps if we knew enough about a man, his environment, and his history, we could always predict just what word he would write or speak next.

In communication theory, however, we assume that our only knowledge of the message source is obtained either from the messages that the source produces or perhaps from some less-than-complete study of man himself. On the basis of information so obtained, we can derive certain statistical data which, as we have seen, help to narrow the probability as to what the next word or letter of a message will be. There remains an element of uncertainty. For us who have incomplete knowledge of it, the message source behaves as if certain choices were made at random, insofar as we cannot predict what the choices will be. If we could predict them, we should incorporate the knowledge which enables us to make the predictions into our statistics of the source. If we had more knowledge, however, we might see that the choices which we cannot predict are not really random, in that they are (on the basis of knowledge that we do not have) predictable.

We can see that the view we have taken of finite-state machines, such as that of Figure III-1, has been limited. Finite-state machines can have inputs as well as outputs. The transition from a particular state to one among several others need not be chosen randomly; it could be determined or influenced by various inputs to the machine. For instance, the operation of an electronic digital computer, which is a finite-state machine, is determined by the program and data fed to it by the programmer.

It is, in fact, natural to think that man may be a finite-state machine, not only in his function as a message source which produces words, but in all his other behavior as well. We can think if we like of all possible conditions and configurations of the cells of the nervous system as constituting states (states of mind, perhaps).
We can think of one state passing to another, sometimes with the production of a letter, word, sound, or a part thereof, and sometimes with the production of some other action or of some part of an action. We can think of sight, hearing, touch, and other senses as supplying inputs which determine or influence what state the machine passes into next. If man is a finite-state machine, the number of states must be fantastic and beyond any detailed mathematical treatment. But so are the configurations of the molecules in a gas, and yet we can explain much of the significant behavior of a gas in terms of pressure and temperature merely.

Can we someday say valid, simple, and important things about the working of the mind in producing written text and other things as well? As we have seen, we can already predict a good deal concerning the statistical nature of what a man will write down on paper, unless he is deliberately trying to behave eccentrically, and, even then, he cannot help conforming to habits of his own.

Such broad considerations are not, of course, the real purpose or meat of this chapter. We set out to find a mathematical model adequate to represent some aspects of the human being in his role as a source of messages and adequate to represent some aspects of the messages he produces. Taking English text as an example, we noted that the frequencies of occurrence of various letters are remarkably constant, unless the writer deliberately avoids certain letters. Likewise, frequencies of occurrence of particular pairs, triplets, and so on, of letters are very nearly constant, as are frequencies of various words.

We also saw that we could generate sequences of letters with frequencies corresponding to those of English text by various random or stochastic processes, such as cutting a lot of text into letters (or words), scrambling the bits of paper in a hat, and drawing them out one at a time.
More elaborate stochastic processes, including finite-state machines, can produce an even closer approximation to English text.

Thus, we take a generalized stochastic process as a model of a message source, such as a source producing English text. But how must we mathematically define or limit the stochastic sources we deal with so that we can prove theorems concerning the encoding of messages generated by the sources? Of course, we must choose a definition consistent with the character of real English text.

The sort of stochastic source chosen as a model of actual message sources is the ergodic source. An ergodic source can be regarded as a hypothetical machine which produces an infinite number, or ensemble, of infinite sequences of characters. Roughly, the nature or statistics of the sequences of characters or messages produced by an ergodic source do not change with time; that is, the source is stationary. Further, for an ergodic source the statistics based on one message apply equally well to all messages that the source generates.

The theorems of communication theory are proved exactly for truly ergodic sources. All writers writing English text together constitute an approximately ergodic source of text. The mathematical model, the truly ergodic source, is close enough to the actual situation so that the mathematics we base on it is very useful. But we must be wise and careful in applying the theorems and results of communication theory, which are exact for a mathematical ergodic source, to actual communication problems.

CHAPTER IV

Encoding and Binary Digits

A SOURCE OF INFORMATION may be English text, a man speaking, the sound of an orchestra, photographs, motion picture films, or scenes at which a television camera may be pointed. We have seen that in information theory such sources are regarded as having the properties of ergodic sources of letters, numbers, characters, or electrical signals.
A chief aim of information theory is to study how such sequences of characters and such signals can be most effectively encoded for transmission, commonly by electrical means.

Everyone has heard of codes and the encoding of messages. Romantic spies use secret codes. Edgar Allan Poe popularized cryptography in The Gold Bug. The country is full of amateur cryptanalysts who delight in trying to read encoded messages that others have devised.

In this historical sense of cryptography or secret writing, codes are used to conceal the content of an important message from those for whom it is not intended. This may be done by substituting for the words of the message other words which are listed in a code book. Or, in a type of code called a cipher, letters or numbers may be substituted for the letters in the message according to some previously agreed upon secret scheme.

The idea of encoding, of the accurate representation of one thing by another, occurs in other contexts as well. Geneticists believe that the whole plan for a human body is written out in the chromosomes of the germ cell. Some assert that the "text" consists of an orderly linear arrangement of four different units, or "bases," in the DNA (desoxyribonucleic acid) forming the chromosome. This text in turn produces an equivalent text in RNA (ribonucleic acid), and by means of this RNA text proteins made up of sequences of twenty amino acids are synthesized. Some cryptanalytic effort has been spent in an effort to determine how the four-character message of RNA is reencoded into the twenty-character code of the protein.

Actually, geneticists have been led to such considerations by the existence of information theory.
The study of the transmission of information has brought about a new general understanding of the problems of encoding, an understanding which is important to any sort of encoding, whether it be the encoding of cryptography or the encoding of genetic information.

We have already noted in Chapter II that English text can be encoded into the symbols of Morse code and represented by short and long pulses of current separated by short and long spaces. This is one simple form of encoding. From the point of view of information theory, the electromagnetic waves which travel from an FM transmitter to the receiver in your home are an encoding of the music which is transmitted. The electric currents in telephone circuits are an encoding of speech. And the sound waves of speech are themselves an encoding of the motions of the vocal tract which produce them.

Nature has specified the encoding of the motions of the vocal tract into the sounds of speech. The communication engineer, however, can choose the form of encoding by means of which he will represent the sounds of speech by electric currents, just as he can choose the code of dots, dashes, and spaces by means of which he represents the letters of English text in telegraphy. He wants to perform this encoding well, not poorly. To do this he must have some standard which distinguishes good encoding from bad encoding, and he must have some insight into means for achieving good encoding. We learned something of these matters in Chapter II.

It is the study of this problem, a study that might in itself seem limited, which has provided through information theory new ideas important to all encoding, whether cryptographic or genetic. These new ideas include a measure of amount of information, called entropy, and a unit of measurement, called the bit.
I would like to believe that at this point the reader is clamoring to know the meaning of "amount of information" as measured in bits, and if so I hope that this enthusiasm will carry him over a considerable amount of intervening material about the encoding of messages.

It seems to me that one can't understand and appreciate the solution to a problem unless he has some idea of what the problem is. You can't explain music meaningfully to a man who has never heard any. A story about your neighbor may be full of insight, but it would be wasted on a Hottentot. I think it is only by considering in some detail how a message can be encoded for transmission that we can come to appreciate the need for and the meaning of a measure of amount of information.

It is easiest to gain some understanding of the important problems of coding by considering simple and concrete examples. Of course, in doing this we want to learn something of broad value, and here we may foresee a difficulty.

Some important messages consist of sequences of discrete characters, such as the successive letters of English text or the successive digits of the output of an electronic computer. We have seen, however, that other messages seem inherently different.

Speech and music are variations with time of the pressure of air at the ear. This pressure we can accurately represent in telephony by the voltage of a signal traveling along a wire or by some other quantity. Such a variation of a signal with time is illustrated in (a) of Figure IV-1. Here we assume the signal to be a voltage which varies with time, as shown by the wavy line. Information theory would be of limited value if it were not applicable to such continuous signals or messages as well as to discrete messages, such as English text.

In dealing with continuous signals, information theory first invokes a mathematical theorem called the sampling theorem, which we will use but not prove.
This theorem states that a continuous signal can be represented completely by and reconstructed perfectly from a set of measurements or samples of its amplitude which are made at equally spaced times. The interval between such samples must be equal to or less than one-half of the period of the highest frequency present in the signal. A set of such measurements or samples of the amplitude of the signal (a) of Figure IV-1 is represented by a sequence of vertical lines of various heights in (b) of Figure IV-1.

[Fig. IV-1: (a) a signal voltage varying with time; (b) samples of its amplitude taken at equally spaced times]

We should particularly note that for such samples of the signal to represent a signal perfectly they must be taken frequently enough. For a voice signal including frequencies from 0 to 4,000 cycles per second we must use 8,000 samples per second. For a television signal including frequencies from 0 to 4 million cycles per second we must use 8 million samples per second. In general, if the frequency range of the signal is f cycles per second we must use at least 2f samples per second in order to describe it perfectly.

Thus, the sampling theorem enables us to represent a smoothly varying signal by a sequence of samples which have different amplitudes one from another. This sequence of samples is, however, still inherently different from a sequence of letters or digits. There are only ten digits and there are only twenty-six letters, but a sample can have any of an infinite number of amplitudes. The amplitude of a sample can lie anywhere in a continuous range of values, while a character or a digit has only a limited number of discrete values.

The manner in which information theory copes with samples having a continuous range of amplitudes is a topic all in itself, to which we will return later. Here we will merely note that a signal need not be described or reproduced perfectly. Indeed, with real physical apparatus a signal cannot be reproduced perfectly.
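The sampling rule stated above, at least 2f samples per second for a signal whose frequencies extend up to f cycles per second, can be sketched as follows. The function names and the 1,000-cycle test tone are illustrative assumptions; the sketch only takes the samples and does not attempt the reconstruction that the theorem guarantees.

```python
import math

def minimum_sample_rate(highest_frequency):
    """Samples per second required for frequencies up to highest_frequency."""
    return 2 * highest_frequency

def take_samples(signal, highest_frequency, duration):
    """Measure signal(t) at equally spaced times over `duration` seconds."""
    rate = minimum_sample_rate(highest_frequency)
    count = round(duration * rate)
    return [signal(k / rate) for k in range(count)]

def tone(t):
    """A hypothetical 1,000-cycle-per-second tone, well inside a 4,000-cycle band."""
    return math.sin(2 * math.pi * 1000 * t)

print(minimum_sample_rate(4000))       # voice: 8000 samples per second
print(minimum_sample_rate(4_000_000))  # television: 8000000 samples per second
samples = take_samples(tone, 4000, 0.001)
print(len(samples))                    # 8 samples in one millisecond
```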
In the transmission of speech, for instance, it is sufficient to represent the amplitude of a sample to an accuracy of about 1 per cent. Thus, we can, if we wish, restrict ourselves to the numbers 0 to 99 in describing the amplitudes of successive speech samples and represent the amplitude of a given sample by that one of these hundred integers which is closest to the actual amplitude. By so quantizing the signal samples, we achieve a representation comparable to the discrete case of English text.

We can, then, by sampling and quantizing, convert the problem of coding a continuous signal, such as speech, into the seemingly simpler problem of coding a sequence of discrete characters, such as the letters of English text.

We noted in Chapter II that English text can be sent, letter by letter, by means of the Morse code. In a similar manner, such messages can be sent by teletypewriter. Pressing a particular key on the transmitting machine sends a particular sequence of electrical pulses and spaces out on the circuit. When these pulses and spaces reach the receiving machine, they activate the corresponding type bar, and the machine prints out the character that was transmitted.

Patterns of pulses and spaces indeed form a particularly useful and general way of describing or encoding messages. Although Morse code and teletypewriter codes make use of pulses and spaces of different lengths, it is possible to transmit messages by means of a sequence of pulses and spaces of equal length, transmitted at perfectly regular intervals. Figure IV-2 shows how the electric current sent out on the line varies with time for two different patterns, each six intervals long, of such equal pulses and spaces. Sequence a is pulse-space-space-pulse-space-pulse. Sequence b is pulse-pulse-pulse-space-pulse-pulse.

The presence of a pulse or a space in a given interval specifies one of two different possibilities.
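The quantizing step described earlier in this passage, representing each sample by the closest of the hundred integers 0 to 99, can be sketched as below. The assumption that amplitudes have been scaled to lie between 0.0 and 1.0, and the names `quantize` and `restore`, are illustrative and not from the text.

```python
def quantize(amplitude, levels=100):
    """Return the integer 0 .. levels-1 closest to an amplitude scaled to 0.0 .. 1.0."""
    if not 0.0 <= amplitude <= 1.0:
        raise ValueError("amplitude must lie between 0.0 and 1.0")
    return round(amplitude * (levels - 1))

def restore(level, levels=100):
    """Map a quantized level back to an approximate amplitude."""
    return level / (levels - 1)

print(quantize(0.0), quantize(0.25), quantize(1.0))  # 0 25 99

# The error of the round trip never exceeds half a step, about 0.5 per cent of full scale.
worst = max(abs(restore(quantize(a / 1000)) - a / 1000) for a in range(1001))
print(worst <= 0.5 / 99 + 1e-12)                     # True
```

With a hundred levels the step between neighboring values is about 1 per cent of the full range, which matches the accuracy the text calls sufficient for speech.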
We could use any pair of symbols to represent such patterns of pulses or spaces as those of Figure IV-2: yes, no; +, -; 1, 0. Thus we could represent pattern a as follows:

pulse   space   space   pulse   space   pulse
Yes     No      No      Yes     No      Yes

The representation by 1 or 0 is particularly convenient and important. It can be used to relate patterns of pulses to numbers expressed in the binary system of notation.

When we write 315 we mean

$$3 \times 10^2 + 1 \times 10^1 + 5 \times 1 = 3 \times 100 + 1 \times 10 + 5 \times 1 = 315$$

In this ordinary decimal system of representing numbers we make use of the ten different digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. In the binary system we use only two digits, 0 and 1. When we write 100101 we mean

$$1 \times 2^5 + 0 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2 + 1 \times 1 = 1 \times 32 + 0 \times 16 + 0 \times 8 + 1 \times 4 + 0 \times 2 + 1 \times 1 = 37 \text{ in decimal notation}$$

It is often convenient to let zeros precede a number; this does not change its value. Thus, in decimal notation we can say,

$$0016 = 16$$

Or in binary notation

$$001010 = 1010$$

[Fig. IV-2: current versus time over intervals 1 through 6 for pulse patterns (a) and (b)]

In binary numbers, each 0 or 1 is a binary digit. To describe the pulses or spaces occurring in six successive intervals, we can use a sequence of six binary digits. As a pulse or space in one interval is equivalent to a binary digit, we can also refer to a pulse group of six binary digits, or we can refer to the pulse or space occurring in one interval as one binary digit.

Let us consider how many patterns of pulses and spaces there are which are three intervals long. In other words, how many three-digit binary numbers are there? These are all shown in Table VI.

TABLE VI

000   (0)
001   (1)
010   (2)
011   (3)
100   (4)
101   (5)
110   (6)
111   (7)

The decimal numbers corresponding to these sequences of 1's and 0's regarded as binary numbers are shown in parentheses to the right. We see that there are 8 (0 and 1 through 7) three-digit binary numbers. We may note that 8 is $2^3$.
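The positional arithmetic and the listing of three-digit patterns above can be checked with a short sketch. The function names are illustrative; the values tested are the ones worked in the text.

```python
from itertools import product

def binary_to_decimal(digits):
    """Evaluate a string of 0's and 1's as a binary number, leftmost digit first."""
    value = 0
    for digit in digits:
        if digit not in "01":
            raise ValueError("binary digits must be 0 or 1")
        value = value * 2 + int(digit)  # shift one place, then add the new digit
    return value

def patterns(n):
    """List every sequence of n binary digits in counting order."""
    return ["".join(p) for p in product("01", repeat=n)]

print(binary_to_decimal("100101"))                               # 37
print(binary_to_decimal("001010") == binary_to_decimal("1010"))  # True
for pattern in patterns(3):
    print(pattern, binary_to_decimal(pattern))                   # 000 0 through 111 7
```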
We can, in fact, regard an orderly listing of binary digits n intervals long as simply setting down $2^n$ successive binary numbers, starting with 0. As examples, in Table VII the numbers of different patterns corresponding to different numbers n of binary digits are tabulated.

TABLE VII

n (Number of Binary Digits)    Number of Patterns ($2^n$)
 1                             2
 2                             4
 3                             8
 4                             16
 5                             32
10                             1,024
20                             1,048,576

We see that the number of different patterns increases very rapidly with the number of binary digits. This is because we double the number of possible patterns each time we add one digit. When we add one digit, we get all the old sequences preceded by a 0 plus all the old sequences preceded by a 1.

The binary system of notation is not the only alternative to the decimal system. The octal system is very important to people who use computers. We can regard the octal system as made up of the eight digits 0, 1, 2, 3, 4, 5, 6, 7. When we write 356 in the octal system we mean

$$3 \times 8^2 + 5 \times 8 + 6 \times 1 = 3 \times 64 + 5 \times 8 + 6 \times 1 = 238 \text{ in decimal notation}$$

We can convert back and forth between the octal and the binary systems very simply. We need merely replace each successive block of three binary digits by the appropriate octal digit, as, for instance,

binary   010   111   011   110
octal     2     7     3     6

People who work with binary notation in connection with computers find it easier to remember and transcribe a short sequence of octal digits than a long group of binary digits. They learn to regard patterns of three successive binary digits as an entity, so that they will think of a sequence of twelve binary digits as a succession of four patterns of three, that is, as a sequence of four octal digits.

It is interesting to note, too, that, just as a pattern of pulses and spaces can correspond to a sequence of binary digits, so a sequence of pulses of various amplitudes (0, 1, 2, 3, 4, 5, 6, 7) can correspond to a sequence of octal digits. This is illustrated in Figure IV-3.
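The block-of-three rule for converting between binary and octal notation, described above, can be sketched as follows. The function names are illustrative; padding with leading zeros relies on the earlier remark that zeros preceding a number do not change its value.

```python
def binary_to_octal(bits):
    """Replace each successive block of three binary digits by one octal digit."""
    padded = bits.zfill(-(-len(bits) // 3) * 3)  # pad on the left to a multiple of 3
    return "".join(str(int(padded[i:i + 3], 2)) for i in range(0, len(padded), 3))

def octal_to_binary(octal_digits):
    """Replace each octal digit by its block of three binary digits."""
    return "".join(format(int(digit, 8), "03b") for digit in octal_digits)

print(binary_to_octal("010111011110"))  # 2736
print(octal_to_binary("2736"))          # 010111011110
print(int("356", 8))                    # 238, that is, 3*64 + 5*8 + 6
print([2 ** n for n in (1, 2, 3, 4, 5, 10, 20)])  # pattern counts for n binary digits
```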
In a, we have the sequence of off-on, 0-1 pulses corresponding to the binary number 010111011110. The corresponding octal number is 2736, and in b this is represented by a sequence of four pulses of current having amplitudes 2, 7, 3, 6.

[Fig. IV-3: (a) the off-on pulse pattern 010111011110; (b) four pulses of amplitudes 2, 7, 3, 6]

Conversion from binary to decimal numbers is not so easy. On the average, it takes about 3.32 binary digits to represent one decimal digit. Of course we can assign four binary digits to each decimal digit, as shown in Table VIII, but this means that some patterns are wasted; there are more patterns than we use.

TABLE VIII

Binary Number    Decimal Digit
0000             0
0001             1
0010             2
0011             3
0100             4
0101             5
0110             6
0111             7
1000             8
1001             9
1010             not used
1011             not used
1100             not used
1101             not used
1110             not used
1111             not used

It is convenient to think of sequences of 0's and 1's or sequences of pulses and spaces as binary numbers. This helps us to understand how many sequences of a given length there are and how numbers written in the binary system correspond to numbers written in the octal or in the decimal system. In the transmission of information, however, the particular number assigned to a sequence of binary digits is irrelevant. For instance, if we wish merely to transmit representations of octal digits, we could make the assignments shown in Table IX rather than those in Table VI.

TABLE IX

Sequence of Binary Digits    Octal Digit Represented
000                          5
001                          7
010                          1
011                          6
100                          0
101                          4
110                          2
111                          3

Here the "binary numbers" in the left column designate octal numbers of different numerical value.

In fact, there is another way of looking at such a correspondence between binary digits and other symbols, such as octal digits, a way in which we do not regard the sequence of binary digits as part of a binary number but rather as means of choosing or designating a particular symbol.
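The point that the numerical value of a binary sequence is irrelevant in transmission can be illustrated with the arbitrary assignment of Table IX: any one-to-one table works, so long as sender and receiver share it. The dictionary below copies Table IX; the function names are illustrative.

```python
import math

# Table IX: each 3-digit sequence merely designates an octal digit.
TABLE_IX = {
    "000": "5", "001": "7", "010": "1", "011": "6",
    "100": "0", "101": "4", "110": "2", "111": "3",
}
ENCODE = {digit: bits for bits, digit in TABLE_IX.items()}  # the sender's view

def encode(octal_digits):
    """Replace each octal digit by the binary sequence that designates it."""
    return "".join(ENCODE[digit] for digit in octal_digits)

def decode(bits):
    """Look up each block of three binary digits in the shared table."""
    return "".join(TABLE_IX[bits[i:i + 3]] for i in range(0, len(bits), 3))

print(encode("2736"))           # 110001111011
print(decode(encode("2736")))   # 2736
print(round(math.log2(10), 2))  # 3.32 binary digits per decimal digit, on the average
print(2 ** 4 - 10)              # 6 four-digit patterns unused in Table VIII
```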
We can regard each 0 or 1 as expressing an elementary choice between two possibilities. Consider, for instance, the "tree of choice" shown in Figure IV-4. As we proceed upward from the root to the twigs, let 0 signify that we take the left branch and let 1 signify that we take the right branch. Then 011 means left, right, right and takes us to the octal digit 6, just as in Table IX.

Just as three binary digits give us enough information to determine one among eight alternatives, four binary digits can determine one among sixteen alternatives, and twenty binary digits can determine one among 1,048,576 alternatives. We can do this by assigning the required binary numbers to the alternatives in any order we wish.

[Fig. IV-4: tree of choice; at each fork 0 is the left branch and 1 the right; twigs, left to right: 5 7 1 6 0 4 2 3]

The alternatives which we wish to specify by successions of binary digits need not of course be numbers at all. In fact, we began by considering how we might encode English text so as to transmit it electrically by sequences of pulses and spaces, which can be represented by sequences of binary digits.

A bare essential in transmitting English text letter by letter is twenty-six letters plus a space, or twenty-seven symbols in all. This of course allows us no punctuation and no Arabic numbers. We can write out the numbers (three, not 3) if we wish and use words for punctuation (stop, comma, colon, etc.).

Mathematics says that a choice among 27 symbols corresponds to about 4.75 binary digits. If we are not too concerned with efficiency, we can assign a different 5-digit binary number to each character, which will leave five 5-digit binary numbers unused.

My typewriter has 48 keys, including shift and shift lock. We might add two more "symbols" representing carriage return and line advance, making a total of 50.
I could encode my actions in typing, capitalization, punctuation, and all (but not insertion of the paper) by a succession of choices among 50 symbols, each choice corresponding to about 5.64 binary digits. We could use 6 binary digits per character and waste some sequences of binary digits.

This waste arises because there are only thirty-two 5-digit binary numbers, which is too few, while there are sixty-four 6-digit binary numbers, which is too many. How can we avoid this waste? If we have 50 characters, we have 125,000 possible different groups of 3 ordered characters. There are 131,072 different combinations of 17 binary digits. Thus, if we divide our text into blocks of 3 successive characters, we can specify any possible block by a 17-digit binary number and have a few left over. If we had represented each separate character by 6 binary digits, we would have needed 18 binary digits to represent 3 successive characters. Thus, by this block coding, we have cut down the number of binary digits we use in encoding a given length of text by a factor 17/18.

Of course, we might encode English text in quite a different way. We can say a good deal with 16,384 English words. That's quite a large vocabulary. There are just 16,384 fourteen-digit binary numbers. We might assign 16,357 of these to different useful words and 27 to the letters of the alphabet and the space, so that we could spell out any word or sequence of words we failed to include in our word vocabulary. We won't need to put a space between words to which numbers have been assigned; it can be assumed that a space goes with each word.

If we have to spell out words very infrequently, we will use about 14 binary digits per word in this sort of encoding. In ordinary English text there are on the average about 4.5 letters per word.
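The block-coding arithmetic earlier in this passage can be checked with a short sketch. The function name `digits_per_block` is illustrative; it uses the fact that, for a whole number N of at least 1, `(N - 1).bit_length()` is the smallest k for which 2 to the power k is at least N.

```python
import math

def digits_per_block(alphabet_size, block_length=1):
    """Smallest number of binary digits that gives every possible block its own number."""
    block_count = alphabet_size ** block_length
    return (block_count - 1).bit_length()

print(round(math.log2(27), 2), digits_per_block(27))      # 4.75 needed, 5 used
print(round(math.log2(50), 2), digits_per_block(50))      # 5.64 needed, 6 used
print(50 ** 3, 2 ** 17)                                   # 125000 blocks, 131072 numbers
print(digits_per_block(50, 3), 3 * digits_per_block(50))  # 17 versus 18
```

Longer blocks waste fewer binary numbers, because the count of possible blocks can be brought closer to a power of two.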
As we must separate words by a space, when we send the message character by character, even if we disregard capitalization and punctuation, we will require on the average 5.5 characters per word. If we encode these using 5 binary digits per character, we will use on the average 27.5 binary digits per word, while in encoding the message word by word we need only 14 binary digits per word.

How can this be so? It is because, in spelling out the message letter by letter, we have provided means for sending with equal facility all sequences of English letters, while, in sending word by word, we restrict ourselves to English words.

Clearly, the average number of binary digits per word required to represent English text depends strongly on how we encode the text.

Now, English text is just one sort of message we might want to transmit. Other messages might be strings of numbers, the human voice, a motion picture, or a photograph. If there are efficient and inefficient ways of encoding English text, we may expect that there will be efficient and inefficient ways of encoding other signals as well.

Indeed, we may be led to believe that there exists in principle some best way of encoding the signals from a given message source, a way which will on the average require fewer binary digits per character or per unit time than any other way.

If there is such a best way of encoding a signal, then we might use the average number of binary digits required to encode the signal as a measure of the amount of information per character or the amount of information per second of the message source which produced the signal.

This is just what is done in information theory. How it is done and further reasons for so doing will be considered in the next chapter.

Let us first, however, review very briefly what we have covered in this chapter. In communication theory, we regard coding very broadly, as representing one signal by another.
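The comparison made earlier in this passage between letter-by-letter and word-by-word encoding reduces to a few lines of arithmetic. The constant names below are illustrative; the figures are those quoted in the text.

```python
LETTERS_PER_WORD = 4.5     # average for ordinary English text, as quoted in the text
DIGITS_PER_CHARACTER = 5   # enough for 26 letters plus the space
DIGITS_PER_WORD_CODE = 14  # 2**14 = 16,384 vocabulary entries

characters_per_word = LETTERS_PER_WORD + 1  # each word carries one space
letter_by_letter = characters_per_word * DIGITS_PER_CHARACTER

print(characters_per_word)                      # 5.5
print(letter_by_letter)                         # 27.5 binary digits per word
print(DIGITS_PER_WORD_CODE)                     # 14 binary digits per word
print(letter_by_letter / DIGITS_PER_WORD_CODE)  # roughly twice as many digits
```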
Thus a radio wave can represent the sounds of speech and so form an encoding of these sounds. Encoding is, however, most simply explained and explored in the case of discrete message sources, which produce messages consisting of sequences of characters or numbers. Fortunately, we can represent a continuous signal, such as the current in a telephone line, by a number of samples of its amplitude, using, each second, twice as many samples as the highest frequency present in the signal. Further, we can if we wish represent the amplitude of each of these samples approximately by a whole number.

The representation of letters or numbers by sequences of off-or-on signals, which can in turn be represented directly by sequences of the binary digits 0 and 1, is of particular interest in communication theory. For instance, by using sequences of 4 binary digits we can form 16 binary numbers, and we can use 10 of these to represent the 10 decimal digits. Or, by using sequences of 5 binary digits we can form 32 binary numbers, and we can use 27 of these to represent the letters of the English alphabet plus the space. Thus, we can transmit decimal numbers or English text by sending sequences of off-or-on signals.

We should note that while it may be convenient to regard the sequences of binary digits so used as binary numbers, the numerical value of the binary number has no particular significance; we can choose any binary number to represent a particular decimal digit.

If we use 10 of the 16 possible 4-digit binary numbers to encode the 10 decimal digits, we never use (we waste) 6 binary numbers. We could, but never do, transmit these sequences as sequences of off-or-on signals. We can avoid such waste by means of block coding, in which we encode sequences of 2, 3, or more decimal digits or other characters by means of binary digits.
For instance, all sequences of 3 decimal digits can be represented by 10 binary digits, while it takes a total of 12 binary digits to represent separately each of 3 decimal digits.

Any sequence of decimal digits may occur, but only certain sequences of English letters ever occur, that is, the words of the English language. Thus, it is more efficient to encode English words as sequences of binary digits rather than to encode the letters of the words individually. This again emphasizes the gain to be made by encoding sequences of characters, rather than encoding each character separately.

All of this leads us to the idea that there may be a best way of encoding the messages from a message source, a way which calls for the least number of binary digits.

CHAPTER V

Entropy

IN THE LAST CHAPTER, we have considered various ways in which messages can be encoded for transmission. Indeed, all communication involves some sort of encoding of messages. In the electrical case, letters may be encoded in terms of dots or dashes of electric current or in terms of several different strengths of current and directions of current flow, as in Edison's quadruplex telegraph. Or we can encode a message in the binary language of zeros and ones and transmit it electrically as a sequence of pulses or absences of pulses.

Indeed, we have shown that by periodically sampling a continuous signal such as a speech wave and by representing the amplitudes of each sample approximately by the nearest of a set of discrete values, we can represent or encode even such a continuous wave as a sequence of binary digits.

We have also seen that the number of digits required in encoding a given message depends on how it is encoded. Thus, it takes fewer binary digits per character when we encode a group or block of English letters than when we encode the letters one at a time.
More important, because only a few combinations of letters form words, it takes considerably fewer digits to encode English text word by word than it does to encode the same text letter by letter.

Surely, there are still other ways of encoding the messages produced by a particular ergodic source, such as a source of English text. How many binary digits per letter or per word are really needed? Must we try all possible sorts of encoding in order to find out? But, if we did try all forms of encoding we could think of, we would still not be sure we had found the best form of encoding, for the best form might be one which had not occurred to us. Is there not, in principle at least, some statistical measurement we can make on the messages produced by the source, a measure which will tell us the minimum average number of binary digits per symbol which will serve to encode the messages produced by the source?

In considering this matter, let us return to the model of a message source which we discussed in Chapter III. There we regarded the message source as an ergodic source of symbols, such as letters or words. Such an ergodic source has certain unvarying statistical properties: the relative frequencies of symbols; the probability that one symbol will follow a particular other symbol, or pair of symbols, or triplet of symbols; and so on. In the case of English text, we can speak in the same terms of the relative frequencies of words and of the probability that one word will follow a particular word or a particular pair, triplet, or other combination of words.

In illustrating the statistical properties of sequences of letters or words, we showed how material resembling English text can be produced by a sequence of random choices among letters and words, provided that the letters or words are chosen with due regard for their probabilities or their probabilities of following a preceding sequence of letters or words.
In these examples, the throw of a die or the picking of a letter out of a hat can serve to "choose" the next symbol. In writing or speaking, we exercise a similar choice as to what we shall set down or say next. Sometimes we have no choice; Q must be followed by U. We have more choice as to the next symbol in beginning a word than in the middle of a word. However, in any message source, living or mechanical, choice is continually exercised. Otherwise, the messages produced by the source would be predetermined and completely predictable.

Corresponding to the choice exercised by the message source in producing the message, there is an uncertainty on the part of the recipient of the message. This uncertainty is resolved when the recipient examines the message. It is this resolution of uncertainty which is the aim and outcome of communication.

If the message source involved no choice, if, for instance, it could produce only an endless string of ones or an endless string of zeros, the recipient would not need to receive or examine the message to know what it was; he could predict it in advance. Thus, if we are to measure information in a rational way, we must have a measure that increases with the amount of choice of the source and, thus, with the uncertainty of the recipient as to what message the source may produce and transmit.

Certainly, for any message source there are more long messages than there are short messages. For instance, there are 2 possible messages consisting of 1 binary digit, 4 consisting of 2 binary digits, 16 consisting of 4 binary digits, 256 consisting of 8 binary digits, and so on. Should we perhaps say that amount of information should be measured by the number of such messages? Let us consider the case of four telegraph lines used simultaneously in transmitting binary digits between two points, all operating at the same speed.
Using the four lines, we can send 4 times as many digits in a given period of time as we could using one line. It also seems reasonable that we should be able to send 4 times as much information by using four lines. If this is so, we should measure information in terms of the number of binary digits rather than in terms of the number of different messages that the binary digits can form. This would mean that amount of information should be measured, not by the number of possible messages, but by the logarithm of this number.

The measure of amount of information which communication theory provides does this and is reasonable in other ways as well. This measure of amount of information is called entropy. If we want to understand this entropy of communication theory, it is best first to clear our minds of any ideas associated with the entropy of physics. Once we understand entropy as it is used in communication theory thoroughly, there is no harm in trying to relate it to the entropy of physics, but the literature indicates that some workers have never recovered from the confusion engendered by an early admixture of ideas concerning the entropies of physics and communication theory.

The entropy of communication theory is measured in bits. We may say that the entropy of a message source is so many bits per letter, or per word, or per message. If the source produces symbols at a constant rate, we can say that the source has an entropy of so many bits per second.

Entropy increases as the number of messages among which the source may choose increases. It also increases as the freedom of choice (or the uncertainty to the recipient) increases and decreases as the freedom of choice and the uncertainty are restricted. For instance, a restriction that certain messages must be sent either very frequently or very infrequently decreases choice at the source and uncertainty for the recipient, and thus such a restriction must decrease entropy.
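A minimal numerical sketch of the four-line argument may help. It assumes, purely for illustration, that each line carries 10 binary digits in the period considered; the variable names are likewise hypothetical. The number of possible messages multiplies when lines are combined, while its logarithm simply adds.

```python
import math

digits_per_line = 10   # illustrative assumption, not a figure from the text
lines = 4

messages_one_line = 2 ** digits_per_line
messages_four_lines = messages_one_line ** lines

# The count of messages grows enormously, not by a factor of 4 ...
print(messages_four_lines // messages_one_line)

# ... but its logarithm to the base 2 is exactly 4 times as large.
print(math.log2(messages_one_line))     # 10.0
print(math.log2(messages_four_lines))   # 40.0
```

Measured by the logarithm, four lines carry four times as much as one, which is the behavior the text asks of a reasonable measure of information.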
It is best to illustrate entropy first in a simple case. The mathematical theory of communication treats the message source as an ergodic process, a process which produces a string of symbols that are to a degree unpredictable. We must imagine the message source as selecting a given message by some random, i.e., unpredictable means, which, however, must be ergodic. Perhaps the simplest case we can imagine is that in which there are only two possible symbols, say, X and Y, between which the message source chooses repeatedly, each choice uninfluenced by any previous choices. In this case we can know only that X will be chosen with some probability $p_0$ and Y with some probability $p_1$, as in the outcomes of the toss of a biased coin. The recipient can determine these probabilities by examining a long string of characters (X's, Y's) produced by the source. The probabilities $p_0$ and $p_1$ must not change with time if the source is to be ergodic.

For this simplest of cases, the entropy H of the message source is defined as

$$H = -(p_0 \log p_0 + p_1 \log p_1) \text{ bits per symbol}$$

Thus, the entropy is the negative of the sum of the probability $p_0$ that X will be chosen (or will be received) times the logarithm of $p_0$ and the probability $p_1$ that Y will be chosen (or will be received) times the logarithm of this probability.

Whatever plausible arguments one may give for the use of entropy as defined in this and in more complicated cases, the real and true reason is one that will become apparent only as we proceed, and the justification of this formula for entropy will therefore be deferred. It is, however, well to note again that there are different kinds of logarithms and that, in information theory, we use logarithms to the base 2. Some facts about logarithms to the base 2 are noted in Table X.
TABLE X

Fraction p    Another Way of Writing p    Still Another Way of Writing p    Log p
$3/4$         $1/2^{.415}$                $2^{-.415}$                       $-.415$
$1/2$         $1/2^{1}$                   $2^{-1}$                          $-1$
$3/8$         $1/2^{1.415}$               $2^{-1.415}$                      $-1.415$
$1/4$         $1/2^{2}$                   $2^{-2}$                          $-2$
$1/8$         $1/2^{3}$                   $2^{-3}$                          $-3$
$1/16$        $1/2^{4}$                   $2^{-4}$                          $-4$
$1/64$        $1/2^{6}$                   $2^{-6}$                          $-6$
$1/256$       $1/2^{8}$                   $2^{-8}$                          $-8$

The logarithm to the base 2 of a number is the power to which 2 must be raised to give the number.

Let us consider, for instance, a "message source" which consists of the tossing of an honest coin. We can let X represent heads and Y represent tails. The probability $p_1$ that the coin will turn up heads is 1/2 and the probability $p_0$ that the coin will turn up tails is also 1/2. Accordingly, from our expression for entropy and from Table X we find that

$$H = -(\tfrac{1}{2} \log \tfrac{1}{2} + \tfrac{1}{2} \log \tfrac{1}{2})$$
$$H = -[(\tfrac{1}{2})(-1) + (\tfrac{1}{2})(-1)]$$
$$H = 1 \text{ bit per toss}$$

If the message source is the sequence of heads and tails obtained by tossing a coin, it takes one bit of information to convey whether heads or tails has turned up.

Let us notice, now, that we can represent the outcome of successively tossing a coin by a number of binary digits equal to the number of tosses, letting 1 stand for heads and 0 stand for tails. Hence, in this case at least, the entropy, one bit per toss, and the number of binary digits which can represent the outcome, one binary digit per toss, are equal. In this case at least, the number of binary digits necessary to transmit the message generated by the source (the succession of heads and tails) is equal to the entropy of the source.

Suppose the message source produces a string of 1's and 0's by tossing a coin so weighted that it turns up heads 3/4 of the time and tails only 1/4 of the time. Then

$$p_1 = \tfrac{3}{4} \qquad p_0 = \tfrac{1}{4}$$
$$H = -(\tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{3}{4} \log \tfrac{3}{4})$$
$$H = .811 \text{ bit per toss}$$

We feel that, in the case of a coin which turns up heads more often than tails, we know more about the outcome than if heads or tails were equally likely. Further, if we were constrained to choose heads more often than tails we would have less choice than if we could choose either with equal probability.
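The two coin calculations can be reproduced with a short sketch. The function name `binary_entropy` is an assumption made for illustration; the code uses $\log (1/p)$, which equals $-\log p$, so that the sum comes out positive without a leading minus sign.

```python
import math


def binary_entropy(p1):
    """Entropy, in bits per symbol, of a source choosing between two symbols.

    Implements H = -(p0 log p0 + p1 log p1) with logarithms to the base 2.
    """
    p0 = 1.0 - p1
    # A symbol of probability 0 is never chosen and contributes nothing.
    return sum(p * math.log2(1.0 / p) for p in (p0, p1) if p > 0)


print(binary_entropy(0.5))              # honest coin
print(round(binary_entropy(0.75), 3))   # coin weighted to give heads 3/4 of the time
```

The honest coin gives 1 bit per toss and the weighted coin about .811 bit per toss, in agreement with the values worked out from Table X.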
We feel that such a bias must reduce choice, for if the probability for heads were 1 and for tails 0, we would have no choice at all. And, we see that the entropy for the case above is only .811 bit per toss. We feel somehow that we ought to be able to represent the outcome of a sequence of such biased tosses by fewer than one binary digit per toss, but it is not immediately clear how many binary digits we must use.

If we choose heads over tails with probability $p_1$, the probability $p_0$ of choosing tails must of course be $1 - p_1$. Thus, if we know $p_1$ we know $p_0$ as well. We can compute H for various values of $p_1$ and plot a graph of H vs. $p_1$. Such a curve is shown in Figure V-1. H has a maximum value of 1 when $p_1$ is 0.5 and is 0 when $p_1$ is 0 or 1, that is, when it is certain that the message source always produces either one symbol or the other.

Really, whether we call heads X and tails Y or heads Y and tails X is immaterial, so the curve of H vs. $p_1$ must be the same as H vs. $p_0$. Thus, the curve of Figure V-1 is symmetrical about the dashed center line at $p_1$ and $p_0$ equal to 0.5.

A message source may produce successive choices among the ten decimal digits, or among the twenty-six letters of the alphabet, or among the many thousands of words of the English language. Let us consider the case in which the message source produces one among n symbols or words, with probabilities which are independent of previous choices. In this case the entropy is defined as

$$H = -\sum_{i=1}^{n} p_i \log p_i \text{ bits per symbol} \qquad (5.1)$$

Here the sign $\Sigma$ (sigma) means to sum or to add up various terms. $p_i$ is the probability of the i-th symbol being chosen. The i = 1 below and n above the $\Sigma$ mean to let i be 1, 2, 3, etc. up to n, so the equation says that the entropy will be given by adding $p_1 \log p_1$ and $p_2 \log p_2$ and so on, including all symbols, and taking the negative of the total. We see that when n = 2 we have the simple case which we considered earlier.

Let us take an example.
Suppose, for instance, that we toss two coins simultaneously. Then there are four possible outcomes, which we can label with the numbers 1 through 4:

H H or 1
H T or 2
T H or 3
T T or 4

If the coins are honest, the probability of each outcome is 1/4 and the entropy is

$$H = -(\tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4})$$
$$H = -(-\tfrac{1}{2} - \tfrac{1}{2} - \tfrac{1}{2} - \tfrac{1}{2})$$
$$H = 2 \text{ bits per pair tossed}$$

It takes 2 bits of information to describe or convey the outcome of tossing a pair of honest coins simultaneously. As in the case of tossing one coin which has equal probabilities of landing heads or tails, we can in this case see that we can use 2 binary digits to describe the outcome of a toss: we can use 1 binary digit for each coin. Thus, in this case too, we can transmit the message generated by the process (of tossing two coins) by using a number of binary digits equal to the entropy.

If we have some number n of symbols all of which are equally probable, the probability of any particular one turning up is $1/n$, so we have n terms, each of which is $(1/n) \log (1/n)$. Thus, the entropy is in this case

$$H = -\log (1/n) \text{ bits per symbol}$$

For instance, an honest die when rolled has equal probabilities of turning up any number from 1 to 6. Hence, the entropy of the sequence of numbers so produced must be $-\log (1/6)$, or 2.58 bits per throw.

More generally, suppose that we choose each time with equal likelihood among all binary numbers with N digits. There are $2^N$ such numbers, so

$$n = 2^N$$

From Table X we easily see that

$$\log (1/n) = \log 2^{-N} = -N$$

Thus, for a source which produces at each choice with equal likelihood some N-digit binary number, the entropy is N bits per number. Here the message produced by the source is a binary number which can certainly be represented by binary digits. And, again, the message can be represented by a number of binary digits equal to the entropy of the message, measured in bits.
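The equally likely cases worked out above all follow from equation 5.1, as the following sketch shows. The function name `entropy` and the choice N = 8 are assumptions made for illustration.

```python
import math


def entropy(probabilities):
    """Entropy in bits per symbol, H = -sum of p_i log p_i (equation 5.1)."""
    probabilities = list(probabilities)
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("the probabilities of all outcomes must add up to unity")
    # log2(1/p) equals -log2(p); outcomes of probability 0 contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


two_coins = entropy([1 / 4] * 4)           # HH, HT, TH, TT
honest_die = entropy([1 / 6] * 6)          # faces 1 to 6
N = 8                                      # illustrative number of binary digits
n_digit_numbers = entropy([1 / 2 ** N] * 2 ** N)

print(two_coins)
print(round(honest_die, 2))
print(n_digit_numbers)
```

The three results are 2 bits per pair of coins, about 2.58 bits per throw of the die, and N bits per N-digit binary number.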
The example of N-digit binary numbers illustrates graphically how the logarithm must be the correct mathematical function in the entropy.

Ordinarily the probability that the message source will produce a particular symbol is different for different symbols. Let us take as an example a message source which produces English words independently of what has gone before but with the probabilities characteristic of English prose. This corresponds to the first-order word approximation given in Chapter III.

In the case of English prose, we find as an empirical fact that if we order the words according to frequency of usage, so that the most frequently used, the most probable word (the, in fact) is word number 1, the next most probable word (of) is number 2, and so on, then the probability for the r-th word is very nearly (if r is not too large)

$$p_r = .1/r \qquad (5.2)$$

If equation 5.2 were strictly true, the points in Figure V-2, in which word probability or frequency $p_r$ is plotted against word order or rank r, would fall on the solid line which extends from upper left to lower right. We see that this is very nearly so.

[Fig. V-2: word frequency plotted against word order on logarithmic scales, with word order running from 1 to 10,000; labeled points include OR, SAY, and QUALITY.]

This empirical inverse relation between word probability and word rank is known as Zipf's law. We will discuss Zipf's law in Chapter XII; here, we propose merely to use it.

We can show that this equation (5.2) cannot hold for all words. To see this, let us consider tossing a coin. If the probability of heads turning up is 1/2 and the probability of tails turning up is 1/2, then there is no other possible outcome: 1/2 + 1/2 = 1. If there were an additional probability of 1/10 that the coin would stand on edge, we would have to conclude that in a hundred tosses we would expect 110 outcomes: heads 50 times, tails 50 times, and standing on edge 10 times. This is patently absurd. The probabilities of all outcomes must add up to unity. Now, let us note that if we add up successively $p_1$ plus $p_2$, etc., as given by equation 5.2, we find that by the time we come to $p_{8727}$ the sum of the successive probabilities has become unity. If we took this literally, we would conclude that no additional word could ever occur. Equation 5.2 must be a little in error.

Nonetheless, the error is not great, and Shannon used equation 5.2 in computing the entropy of a message source which produces words independently but with the probability of their occurring in English text. In order to make the sum of the probabilities of all words unity, he included only the 8,727 most frequently used words. He found the entropy to be 11.8 bits per word.

In Chapter IV, we saw that English text can be encoded letter by letter by using 5 binary digits per character or 27.5 binary digits per word. We also saw that by providing different sequences of binary digits for each of 16,357 words and 27 characters, we could encode English text by using about 14 binary digits per word. We are now beginning to suspect that the number of binary digits actually required is given by the entropy, and, as we have seen, Shannon's estimate, based on the relative probabilities of English words, would be 11.8 binary digits per word.

As a next step in exploring this matter of the number of binary digits required to encode the message produced by a message source, we will consider a startling theorem which Shannon proved concerning the "messages" produced by an ergodic source which selects a sequence of letters or words independently with certain probabilities.

Let us consider all of the messages the source can produce which consist of some particular large number of characters. For example, we might consider all messages which are 100,000 symbols (letters, words, characters) long. More generally, let us consider messages having a number M of characters. Some of these messages are more probable than others.
In the probable messages, symbol 1 occurs about $Mp_1$ times, symbol 2 occurs about $Mp_2$ times, etc. Thus, in these probable messages each symbol occurs with about the frequency characteristic of the source. The source might produce other sorts of messages, for instance, a message consisting of one symbol endlessly repeated or merely a message in which the numbers of the various symbols differed markedly from M times their probabilities, but it seldom does.

The remarkable fact is that, if H is the entropy of the source per symbol, there are just about $2^{MH}$ probable messages, and the rest of the messages all have vanishingly small probabilities of ever occurring. In other words, if we ranked the messages from most probable to least probable, and assigned binary numbers of MH digits to the $2^{MH}$ most probable messages, we would be almost certain to have a number corresponding to any M-symbol message that the source actually produced.

Let us illustrate this in particular simple cases. Suppose that the symbols produced are 1 or 0. If these are produced with equal probabilities, a probability 1/2 for 1 and a probability 1/2 for 0, the entropy H is, as we have seen, 1 bit per symbol. Let us let the source produce messages M = 1,000 digits long. Then MH = 1,000, and, according to Shannon's theorem, there must be $2^{1000}$ different probable messages.

Now, by using 1,000 binary digits we can write just $2^{1000}$ different binary numbers. Thus, in order to assign a different binary number to each probable message, we must use binary numbers 1,000 digits long. This is just what we would expect. In order to designate to the message destination which 1,000-digit binary number the message source produces, we must send a message 1,000 binary digits long.

But, suppose that the digits constituting the messages produced by the message source are obtained by tossing a coin which turns up heads, designating 1, 3/4 of the time and tails, designating 0, 1/4 of the time.
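The count of about $2^{MH}$ probable messages can be made plausible, though not proved, by a short calculation for this weighted coin. The sketch below assumes M = 1,000 and looks at one message containing exactly the characteristic numbers of 1's and 0's; the variable names are hypothetical. Such a message turns out to have probability $2^{-MH}$, so there cannot be many more than $2^{MH}$ messages of this kind.

```python
import math

M = 1000                 # message length, as in the illustration in the text
p1, p0 = 0.75, 0.25      # probabilities of 1 (heads) and 0 (tails)

# Entropy per toss of the weighted coin.
H = -(p1 * math.log2(p1) + p0 * math.log2(p0))

# A message with the characteristic frequencies: M*p1 ones and M*p0 zeros.
ones = round(M * p1)
zeros = M - ones

# Independent tosses: the probability of the message is p1**ones * p0**zeros.
# Working with logarithms avoids numbers too small to represent directly.
log2_probability = ones * math.log2(p1) + zeros * math.log2(p0)

print(round(H, 3))                 # entropy per toss
print(round(M * H))                # M times H
print(round(-log2_probability))    # the same number: the probability is 2**(-M*H)
```

Both of the last two figures come to 811, which is the number of binary digits the text goes on to use for labeling the probable 1,000-digit messages.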
The typical messages so produced will contain more 1's than 0's, but that is not all. We have seen that in this case the entropy H is only .811 bit per toss. If M, the length of the message, is again taken as 1,000 binary digits, MH is only 811. Thus, while as before there are $2^{1000}$ possible messages, there are only $2^{811}$ probable messages.

Now, by using 811 binary digits we can write $2^{811}$ different binary numbers, and we can assign one of these to each of the 1,000-digit probable messages, leaving the other improbable 1,000-digit messages unnumbered. Thus, we can send word to a message destination which probable 1,000-digit message our message source produces by sending only 811 binary digits. And the chance that the message source will produce an improbable 1,000-digit message, to which we have assigned no number, is negligible. Of course, the scheme is not quite foolproof. The message source may still very occasionally turn up a message for which we have no label among all $2^{811}$ of our 811-digit binary labels. In this case we cannot transmit the message, at least not by using 811 binary digits.

We see that again we have a strong indication that the number of binary digits required to transmit a message is just the entropy in bits per symbol times the number of symbols. And, we might note that in this last illustration we achieved such an economical transmission by block encoding, that is, by lumping 1,000 (or some other large number) message digits together and representing each probable combination of digits by its individual code (of 811 binary digits).

How firmly and generally can this supposition be established? So far we have considered only cases in which the message source produces each symbol (number, letter, word) independently of the symbols it has produced before. We know this is not true for English text.
Besides the constraints of word frequency, there are constraints of word order, so that the writer has less choice as to what the next word will be than he would if he could choose it independently of what has gone before.

How are we to handle this situation? We have a clue in the block coding which we discussed in Chapter IV, and which has been brought to our mind again in the last example. In an ergodic process the probability of the next letter may depend only on the preceding 1, 2, 3, 4, 5, or more letters but not on earlier letters. The second and third order approximations to English given in Chapter III illustrate text produced by such a process. Indeed, in any ergodic process of which we are to make mathematical sense the effect of the past on what symbol will be produced next must decrease as the remoteness of that past is greater. This is reasonably valid in the case of real English as well. While we can imagine examples to the contrary (the consistent use of the same name for a character in a novel), in general the word I write next does not depend on just what word I wrote 10,000 words back.

Now, suppose that before we encode a message we divide it up into very long blocks of symbols. If the blocks are long enough, only the symbols near the beginning will depend on symbols in the previous block, and, if we make the block long enough, these symbols that do depend on symbols in the previous block will form a negligible part of all the symbols in the block. This makes it possible for us to compute the entropy per block of symbols by means of equation 5.1. To keep matters straight, let us call the probability of a particular one of the multitudinous long blocks of symbols, which we will call the i-th block, $P(B_i)$. Then the entropy per block will be

$$H = -\sum_{i} P(B_i) \log P(B_i) \text{ bits per block}$$

Any mathematician would object to calling this the entropy.
He would say that the quantity H given by the above equation approaches the entropy as we make the block longer and longer, so that it includes more and more symbols. Thus, we must assume that we make the blocks very long indeed and get a very close approximation to the entropy. With this proviso, we can obtain the entropy per symbol by dividing the entropy per block by the number N of symbols per block

$$H = -(1/N) \sum_{i} P(B_i) \log P(B_i) \text{ bits per symbol} \qquad (5.3)$$

In general, an estimate of entropy is always high if it fails to take into account some relations between symbols. Thus, as we make N, the number of symbols per block, greater and greater, H as given by 5.3 will decrease and approach the true entropy.

We have insisted from the start that amount of information must be so defined that if separate messages are sent over several telegraph wires, the total amount of information must be the sum of the amounts of information sent over the separate wires. Thus, to get the entropy of several message sources operating simultaneously, we add the entropies of the separate sources. We can go further and say that if a source operates intermittently we must multiply its information rate or entropy by the fraction of the time that it operates in order to get its average information rate.

Now, let us say that we have one message source when we have just sent a particular sequence of letters such as TH. In this case the probability that the next letter will be E is very high. We have another particular message source when we have just sent NQ. In this case the probability that the next symbol will be U is unity. We calculate the entropy for each of these message sources. We multiply the entropy of a source which we label $B_i$ by the probability $p(B_i)$ that this source will occur (that is, by the fraction of instances in which this source is in operation).
We multiply the entropy of each other source by the probability that that source will occur, and so on. Then we add all the numbers we get in this way in order to get the average entropy or rate of the over-all source, which is a combination of the many different sources, each of which operates only part time. As an example, consider a source involving digram probabilities only, so that the whole effect of the past is summed up in the letter last produced. One source will be the source we have when this letter is E; this will occur in .13 of the total instances. Another source will be the source we have when the letter just produced is W; this will occur in .02 of the total instances.

Putting this in formal mathematical terms, we say that if a particular block of N symbols, which we designate by $B_i$, has just occurred, the probability that the next symbol will be symbol $S_j$ is

$$p_{B_i}(S_j)$$

The entropy of this "source" which operates only when a particular block of N symbols designated by $B_i$ has just been produced is

$$-\sum_{j} p_{B_i}(S_j) \log p_{B_i}(S_j)$$

But, in what fraction of instances does this particular message source operate? The fraction of instances in which this source operates is the fraction of instances in which we encounter block $B_i$ rather than some other block of symbols; we call this fraction

$$p(B_i)$$

Thus, taking into account all blocks of N symbols, we write the sum of the entropies of all the separate sources (each separate source defined by what particular block $B_i$ of N symbols has preceded the choice of the symbol $S_j$) as

$$H_N = -\sum_{i,j} p(B_i)\, p_{B_i}(S_j) \log p_{B_i}(S_j) \qquad (5.4)$$

The i, j under the summation sign mean to let i and j assume all possible values and to add all the numbers we get in this way.

As we let the number N of symbols preceding symbol $S_j$ become very large, $H_N$ approaches the entropy of the source. If there are no statistical influences extending over more than N symbols (this will be true for a digram source for N = 1 and for a trigram source for N = 2), then $H_N$ is the entropy.
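Equation 5.4 can be tried out on a toy digram source. Everything in the sketch below is hypothetical: a two-letter alphabet, transition probabilities picked for illustration, and frequencies $p(B_i)$ chosen to be the long-run letter frequencies that those transition probabilities imply. With N = 1, each "block" is simply the letter last produced.

```python
import math

# next_given[b][s]: probability that symbol s follows the letter b (assumed values).
next_given = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

# Fraction of instances in which each letter is the one just produced.
# 5/6 and 1/6 are the long-run frequencies implied by next_given.
p_block = {"A": 5 / 6, "B": 1 / 6}


def entropy(probabilities):
    """Entropy in bits, the sum of p log2(1/p), which equals -sum of p log2 p."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


# Equation 5.4 with N = 1: weight the entropy of each part-time source
# by the fraction of instances in which that source operates.
h_1 = sum(p_block[b] * entropy(next_given[b].values()) for b in p_block)

# Estimate that ignores the digram relations and uses letter frequencies only.
h_0 = entropy(p_block.values())

print(round(h_1, 4))
print(round(h_0, 4))
```

For these assumed numbers the digram entropy is about .56 bit per symbol, while the estimate from letter frequencies alone is about .65 bit per symbol, which illustrates the remark that an estimate is high when it ignores relations between symbols.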
Shannon writes equation 5.4 a little differently. The probability $p(B_i, S_j)$ of encountering the block $B_i$ followed by the symbol $S_j$ is the probability $p(B_i)$ of encountering the block $B_i$ times the probability $p_{B_i}(S_j)$ that symbol $S_j$ will follow block $B_i$. Hence, we can write 5.4 as follows:

$$H_N = -\sum_{i,j} p(B_i, S_j) \log p_{B_i}(S_j)$$

In Chapter III we considered a finite-state machine, such as that shown in Figure III-3, as a source of text. We can, if we wish, base our computation of entropy on such a machine. In this case, we regard each state of the machine as a message source and compute the entropy for that state. Then we multiply the entropy for that state by the probability that the machine will be in that state and sum (add up) all states in order to get the entropy.

Putting the matter symbolically, suppose that when the machine is in a particular state i it has a probability $p_i(j)$ of producing a particular symbol which we designate by j. For instance, in a state labeled i = 10 it might have a probability of 0.03 of producing the third letter of the alphabet, which we label j = 3. Then

$$p_{10}(3) = .03$$

The entropy $H_i$ of state i is computed in accord with 5.1:

$$H_i = -\sum_{j} p_i(j) \log p_i(j)$$

Now, we say that the machine has a probability $P_i$ of being in the i-th state. The entropy per symbol for the machine as a source of symbols is then

$$H = \sum_{i} P_i H_i \text{ bits per symbol}$$

We can write this as

$$H = -\sum_{i,j} P_i\, p_i(j) \log p_i(j) \text{ bits per symbol} \qquad (5.5)$$

$P_i$ is the probability that the finite-state machine is in the i-th state, and $p_i(j)$ is the probability that it produces the j-th symbol when it is in the i-th state. The i and j under the $\Sigma$ mean to allow both i and j to assume all possible values and to add all the numbers so obtained.

Thus, we have gone easily and reasonably from the entropy of a source which produces symbols independently and to which equation 5.1 applies to the more difficult case in which the probability of a symbol occurring depends on what has gone before.
And, we have three alternative methods for computing or defining the entropy of the message source. These three methods are equivalent and rigorously correct for true ergodic sources. We should remember, of course, that the source of English text is only approximately ergodic.

Once having defined entropy per symbol in a perfectly general way, the problem is to relate it unequivocally to the average number of binary digits per symbol necessary to encode a message. We have seen that if we divide the message into a block of letters or words and treat each possible block as a symbol, we can compute the entropy per block by the same formula we used per independent symbol and get as close as we like to the source entropy merely by making the blocks very long.

Thus, the problem is to find out how to encode efficiently in binary digits a sequence of symbols chosen from a very large group of symbols, each of which has a certain probability of being chosen. Shannon and Fano both showed ways of doing this, and Huffman found an even better way, which we shall consider here.

Let us for convenience list all the symbols vertically in order of decreasing probability. Suppose the symbols are the eight words the, man, to, runs, house, likes, horse, sells, which occur independently with probabilities of their being chosen, or appearing, as listed in Table XI.

TABLE XI

Word      Probability
the       .50
man       .15
to        .12
runs      .10
house     .04
likes     .04
horse     .03
sells     .02

We can compute the entropy per word by means of 5.1; it is 2.21 bits per word. However, if we merely assigned one of the eight 3-digit binary numbers to each word, we would need 3 digits to transmit each word. How can we encode the words more efficiently?

Figure V-3 shows how to construct the most efficient code for encoding such a message word by word. The words are listed to the left, and the probabilities are shown in parentheses.
In constructing the code, we first find the two lowest probabilities, .02 (sells) and .03 (horse), and draw lines to the point marked .05, the probability of either horse or sells. We then disregard the individual probabilities connected by the lines and look for the two lowest probabilities, which are .04 (likes) and .04 (house). We draw lines to the right to a point marked .08, which is the sum of .04 and .04. The two lowest remaining probabilities are now .05 and .08, so we draw a line to the right connecting them, to give a point marked .13. We proceed thus until paths run from each word to a common point to the right, the point marked 1.00. We then label each upper path going to the left from a point 1 and each lower path 0. The code for a given word is then the sequence of digits encountered going left from the common point 1.00 to the word in question. The codes are listed in Table XII.

[Fig. V-3: construction of the code. The words THE (.50), MAN (.15), TO (.12), RUNS (.10), HOUSE (.04), LIKE (.04), HORSE (.03), and SELLS (.02) are listed at the left; lines join them through intermediate points, including .05 and .13, to the common point 1.00 at the right.]

TABLE XII

Word      Probability p    Code     Number of Digits in Code, N    Np
the       .50              1        1                              .50
man       .15              001      3                              .45
to        .12              011      3                              .36
runs      .10              010      3                              .30
house     .04              00011    5                              .20
likes     .04              00010    5                              .20
horse     .03              00001    5                              .15
sells     .02              00000    5                              .10
                                                                   2.26

In Table XII we have shown not only each word and its code but also the probability of each code and the number of digits in each code. The probability of a word times the number of digits in the code gives the average number of digits per word in a long message due to the use of that particular word. If we add the products of the probabilities and the numbers of digits for all the words, we get the average number of digits per word, which is 2.26. This is a little larger than the entropy per word, which we found to be 2.21 bits per word, but it is a smaller number of digits than the 3 digits per word we would have used if we had merely assigned a different 3-digit code to each word.
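The construction just described, repeatedly joining the two lowest probabilities, can be sketched in a few lines. This is an illustration, not the book's own procedure in code: the function name is hypothetical, the probabilities of Table XI are written in hundredths so that the arithmetic stays exact, and only the number of digits in each code is computed, since the labeling of paths with 1 and 0 does not affect the lengths.

```python
import heapq
import math

# Probabilities from Table XI, in hundredths.
weights = {"the": 50, "man": 15, "to": 12, "runs": 10,
           "house": 4, "likes": 4, "horse": 3, "sells": 2}


def huffman_code_lengths(weights):
    """Number of binary digits the construction assigns to each symbol."""
    lengths = {symbol: 0 for symbol in weights}
    # Heap entries: (combined weight, tie-breaker, symbols below this point).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)   # lowest remaining probability
        w2, _, group2 = heapq.heappop(heap)   # next lowest
        for symbol in group1 + group2:
            lengths[symbol] += 1              # one more digit on each path joined
        heapq.heappush(heap, (w1 + w2, tie, group1 + group2))
        tie += 1
    return lengths


lengths = huffman_code_lengths(weights)
average_digits = sum(weights[s] * lengths[s] for s in weights) / 100
entropy_bits = sum((w / 100) * math.log2(100 / w) for w in weights.values())

print(lengths)
print(average_digits)
print(round(entropy_bits, 3))
```

The lengths agree with Table XII (1, 3, 3, 3, 5, 5, 5, 5), and the average is 2.26 binary digits per word. Evaluated directly from Table XI in this way, the entropy comes out near 2.25 bits per word, marginally above the 2.21 quoted in the text, so the code lies even closer to the entropy than the text's figures suggest.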
Not only can it be proved that this Huffman code is the most efficient code for encoding a set of symbols having different probabilities, it can be proved that it always calls for less than one binary digit per symbol more than the entropy (in the above example, it calls for only 0.05 extra binary digits per symbol).

Now suppose that we combine our symbols into blocks of 1, 2, 3, or more symbols before encoding. Each of these blocks will have a probability (in the case of symbols chosen independently, the probability of a sequence of symbols will be the product of the probabilities of the symbols). We can find a Huffman code for these blocks of symbols. As we make the blocks longer and longer, the number of binary digits in the code for each block will increase. Yet, our Huffman code will take less than one extra digit per block above the entropy in bits per block! Thus, as the blocks and their codes become very long, the less-than-one extra digit of the Huffman code will become a negligible fraction of the total number of digits, and, as closely as we like (by making the blocks longer), the number of binary digits per block will equal the entropy in bits per block.

Suppose we have a communication channel which can transmit a number C of off-or-on pulses per second. Such a channel can transmit C binary digits per second. Each binary digit is capable of transmitting one bit of information. Hence we can say that the information capacity of this communication channel is C bits per second. If the entropy H of a message source, measured in bits per second, is less than C, then, by encoding with a Huffman code, the signals from the source can be transmitted over the channel.

Not all channels transmit binary digits. A channel, for instance, might allow three amplitudes of pulses, or it might transmit different pulses of different lengths, as in Morse code. We can imagine connecting various different message sources to such a channel.
Each source will have some entropy or information rate. Some source will give the highest entropy that can be transmitted over the channel, and this highest possible entropy is called the channel capacity C of the channel and is measured in bits per second.

By means of the Huffman code, the output of the channel when it is transmitting a message of this greatest possible entropy can be coded into some least number of binary digits per second, and, when long stretches of message are encoded into long stretches of binary digits, it must take very close to C binary digits per second to represent the signals passing over the channel.

This encoding can, of course, be used in the reverse sense, and C independent binary digits per second can be so encoded as to be transmitted over the channel. Thus, a source of entropy H can be encoded into H binary digits per second, and a general discrete channel of capacity C can be used to transmit C bits per second.

We are now in a position to appreciate one of the fundamental theorems of information theory. Shannon calls this the fundamental theorem of the noiseless channel. He states it as follows:

    Let a source have entropy H (bits per symbol) and a channel have a capacity [to transmit] C bits per second. Then it is possible to encode the output of the source in such a way as to transmit at the average rate (C/H) - ε symbols per second over the channel, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.

Let us restate this without mathematical niceties. Any discrete channel that we may specify, whether it transmits binary digits, letters and numbers, or dots, dashes, and spaces of certain distinct lengths, has some particular unique channel capacity C. Any ergodic message source has some particular entropy H. If H is less than or equal to C, we can transmit the messages generated by the source over the channel.
If H is greater than C, we had better not try to do so, because we just plain can't.

We have indicated above how the first part of this theorem can be proved. We have not shown that a source of entropy H cannot be encoded in less than H binary digits per symbol, but this also can be proved.

We have now firmly arrived at the fact that the entropy of a message source measured in bits tells us how many binary digits (or off-or-on pulses, or yeses-or-noes) are required, per character, or per letter, or per word, or per second, in order to transmit messages produced by the source. This identification goes right back to Shannon's original paper. In fact, the word bit is merely a contraction of binary digit and is generally used in place of binary digit.

Here I have used bit in a particular sense, as a measure of amount of information, and in other contexts I have used a different expression, binary digit. I have done this in order to avoid a confusion which might easily have arisen had I started out by using bit to mean two different things.

After all, in practical situations the entropy in bits is usually different from the number of binary digits involved. Suppose, for instance, that a message source randomly produces the symbol 1 with a probability 1/4 and the symbol 0 with the probability 3/4 and that it produces 10 symbols per second. Certainly such a source produces binary digits at a rate of 10 per second, but the information rate or entropy of the source is .811 bit per binary digit and 8.11 bits per second. We could encode the sequence of binary digits produced by this source by using on the average only 8.11 binary digits per second.

Similarly, suppose we have a communication channel which is capable of transmitting 10,000 arbitrarily chosen off-or-on pulses per second. Certainly, such a channel has a channel capacity of 10,000 bits per second.
However, if the channel is used to transmit a completely repetitive pattern of pulses, we must say that the actual rate of transmission of information is 0 bits per second, despite the fact that the channel is certainly transmitting 10,000 binary digits per second.

Here we have used bit only in the sense of a binary measure of amount of information, as a measure of the entropy or information rate of a message source in bits per symbol or in bits per second or as a measure of the information transmission capabilities of a channel in bits per symbol or bits per second. We can describe it as an elementary binary choice or decision among two possibilities which have equal probabilities. At the message source a bit represents a certain amount of choice as to the message which will be generated; in writing grammatical English we have on the average a choice of about one bit per letter. At the destination a bit of information resolves a certain amount of uncertainty; in receiving English text there is, on the average, about one bit of uncertainty as to what the next letter will be.

When we are transmitting messages generated by an information source by means of off-or-on pulses, we know how many binary digits we are transmitting per second even when (as in most cases) we don't know the entropy of the source. (If we know the entropy of the source in bits per second to be less than the binary digits used per second, we would know that we could get along in principle with fewer binary digits per second.) We know how to use the binary digits to specify or determine one out of several possibilities, either by means of a tree such as that of Figure IV-4 or by means of a Huffman code such as that of Figure V-3. It is common in such a case to speak of the rate of transmission of binary digits as a bit rate, but there is a certain danger that the inexperienced may muddy their thinking if they do this.
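The difference between binary digits and bits can be checked numerically for the source described earlier, which emits the symbol 1 with probability 1/4 and the symbol 0 with probability 3/4. The sketch below, under the assumption that symbols are chosen independently, computes the entropy per symbol and then the average number of binary digits per symbol used by a Huffman code applied to blocks of 1, 2, 3, and 4 symbols. The helper names (`entropy_bits`, `huffman_average_length`, `digits_per_symbol`) are illustrative choices, not taken from the text.

```python
import heapq
from fractions import Fraction
from itertools import product
from math import log2

P_ONE = Fraction(1, 4)    # probability of the symbol 1
P_ZERO = Fraction(3, 4)   # probability of the symbol 0

def entropy_bits(probs):
    """Entropy in bits: -sum(p * log2(p))."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def huffman_average_length(probs):
    """Average code length of a Huffman code for the given probabilities.

    Relies on the fact that the average length equals the sum of the
    probabilities of all the merged points in the Huffman tree.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def digits_per_symbol(block_len):
    """Huffman digits per source symbol when encoding blocks of block_len symbols."""
    probs = []
    for block in product((P_ZERO, P_ONE), repeat=block_len):
        p = Fraction(1)
        for q in block:      # independent symbols: multiply probabilities
            p *= q
        probs.append(p)
    return huffman_average_length(probs) / block_len

H = entropy_bits([P_ONE, P_ZERO])
print(f"entropy: {H:.3f} bit per symbol, {10 * H:.2f} bits per second at 10 symbols per second")
for n in (1, 2, 3, 4):
    print(f"blocks of {n}: {float(digits_per_symbol(n)):.4f} binary digits per symbol")
```

Symbol by symbol, the code can do no better than 1 binary digit per symbol; with longer blocks the figure moves toward the entropy, and it always stays within 1/N of it for blocks of N symbols.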
All that I really ask of the reader is to remember that we have used bit in one sense only, as a measure of information, and have called 0 or 1 a binary digit. If we can transmit 1,000 freely chosen binary digits per second, we can transmit 1,000 bits of information a second. It may be convenient to use bit to mean binary digit, but when we do so we should be sure that we understand what we are doing.

Let us now return for a moment to an entirely different matter, the Huffman code given in Table XII and Figure V-3. When we encode a message by using this code and get an uninterrupted string of symbols, how do we tell whether we should take a particular 1 in the string of symbols as indicating the word the or as part of the code for some other word?

We should note that of the codes in Table XII, none forms the first part of another. This is called the prefix property. It has important and, indeed, astonishing consequences, which are easily illustrated. Suppose, for instance, that we encode the message: the man sells the house to the man the horse runs to the man. The encoded message is as follows:

    the  man  sells  the  house  to   the  man  the  horse  runs  to   the  man
    1    001  00000  1    00011  011  1    001  1    00001  010   011  1    001

[Figure: in the original, vertical dashed lines below the digits mark the code groups obtained when decoding starts partway through the string, with the "decoded" words written beneath.]

Here the message words are written above the code groups.

Now suppose we receive only the digits following the first vertical dashed line below the digits. We start to decode by looking for the shortest sequence of digits which constitutes a word in our code. This is 00010, which corresponds to likes. We go on in this fashion. The "decoded" words are written under the code, separated by dashed lines.

We see that after a few errors the dashed lines correspond to the solid lines, and from that point on the deciphered message is correct. We see that we don't even need to know where the sequence of digits representing a message starts in order to decode it correctly (as correctly as possible).
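The decoding experiment described above can be reproduced with a short sketch. The code groups are those of Table XII. The starting point is an assumption made here: the first six digits are dropped, which is the offset at which the first decodable group is 00010 (likes), as the text states. The function names `encode` and `decode` are illustrative.

```python
# Code of Table XII (word -> binary code group).
CODE = {"the": "1", "man": "001", "to": "011", "runs": "010",
        "house": "00011", "likes": "00010", "horse": "00001", "sells": "00000"}
DECODE = {bits: word for word, bits in CODE.items()}

def encode(words):
    """Concatenate the code groups with no separators."""
    return "".join(CODE[w] for w in words)

def decode(digits):
    """Emit a word as soon as the pending digits form a code group.

    Because no code group is the first part of another (the prefix
    property), the first match is the only possible one.
    """
    words, pending = [], ""
    for d in digits:
        pending += d
        if pending in DECODE:
            words.append(DECODE[pending])
            pending = ""
    return words

message = "the man sells the house to the man the horse runs to the man".split()
digits = encode(message)
print(digits)
print(" ".join(decode(digits)))        # decoding from the true start

# Assumed starting point: the receiver misses the first 6 digits.
late = decode(digits[6:])
print(" ".join(late))
```

With this starting point the first three decoded words are wrong, after which the decoder falls into step with the true code groups and the rest of the message comes out correctly. This illustrates the behavior for this particular code and message; it is not a proof that every code with the prefix property resynchronizes.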
When we look back we can see that we have fulfilled the purpose of this chapter. We have arrived at a measure of the amount of information per symbol or per unit time of an ergodic source, and we have shown how this is equal to the average number of binary digits per symbol necessary to transmit the messages produced by the source. We have noted that to attain transmission with negligibly more bits than the entropy, we must encode the messages produced by the source in long blocks, not symbol by symbol.

We might ask, however, how long do the blocks have to be? Here we come back to another consideration. There are two reasons for encoding in long blocks. One is to make the average number of binary digits per symbol used in the Huffman code negligibly larger than the entropy per symbol. The other is that to encode such material as English text efficiently we must take into account the influence of preceding symbols on the probability that a given symbol will appear next. We have seen that we can do this using equation 5.3 and taking very long blocks.

We return, then, to the question: how many symbols N must the block of characters have so that (1) the Huffman code is very efficient, (2) the entropy per block, disregarding interrelations outside of the block, is very close to N times the entropy per symbol? In the case of English text, condition 2 is governing.

Shannon has estimated the entropy per letter for English text by measuring a person's ability to guess the next letter of a message after seeing 1, 2, 3, etc., preceding letters. In these tests the "alphabet" used consisted of 26 letters plus the space.

Figure V-4 shows the upper and lower bounds on the entropy of English plotted vs. the number of letters the person saw in making his prediction.

[Fig. V-4. Upper and lower bounds on the entropy of English text, plotted against the number of preceding letters seen.]

While the curve seems to drop slowly as the number of letters is increased from 10 to 15, it drops substantially between 15 and 100.
This would appear to indicate that we might have to encode in blocks as large as 100 letters long in order to encode English really efficiently.

From Figure V-4 it appears that the entropy of English text lies somewhere between 0.6 and 1.3 bits per letter. Let us assume a value of 1 bit per letter. Then it will take on the average 100 binary digits to encode a block of 100 letters. This means that there are $2^{100}$ probable English sequences of 100 letters. In our usual decimal notation, $2^{100}$ can be written approximately as 1 followed by 30 zeroes, a fantastically large number.

In endeavoring to find the probability in English text of all meaningful blocks of letters 100 letters long, we would have to count the relative frequency of occurrence of each such block. Since there are $10^{30}$ highly likely blocks, this would be physically impossible.

Further, this is impossible in principle. Most of these $10^{30}$ sequences of letters and spaces (which do not include all meaningful sequences) have never been written down! Thus, it is impossible to speak of the relative frequencies or probabilities of such long blocks of letters as derived from English text.

Here we are really confronted with two questions: the accuracy of the description of English text as the product of an ergodic source and the most appropriate statistical description of that source. One may believe that appropriate probabilities do exist in some form in the human being even if they cannot be evaluated by the examination of existing text. Or one may believe that the probabilities exist and that they can be derived from data taken in some way more appropriate than a naive computation of the probabilities of sequences of letters. We may note, for instance, that equations 5.4 and 5.5 also give the entropy of an ergodic source. Equation 5.5 applies to a finite-state machine.
We have noted at the close of Chapter III that the idea of a human being being in some particular state and in that state producing some particular symbol or word is an appealing one.

Some linguists hold, however, that English grammar is inconsistent with the output of a finite-state machine. Clearly, in trying to understand the structure and the entropy of actual English text we would have to consider such text much more deeply than we have up to this point.

It is safe if not subtle to apply an exact mathematical theory blindly and mechanically to the ideal abstraction for which it holds. We must be clever and wise in using even a good and appropriate mathematical theory in connection with actual, nonideal problems. We should seek a simple and realistic description of the laws governing English text if we are to relate it with communication theory as successfully as possible. Such a description must certainly involve the grammar of the language, which we will discuss in the next chapter.

In any event, we know that there are some valid statistics of English text, such as letter and word frequencies, and the coding theorems enable us to take advantage of such known statistics.

If we encode English letter by letter, disregarding the relative frequencies of the letters, we require 4.76 binary digits per character (including space). If we encode letter by letter, taking into account the relative probabilities of various letters, we require 4.03 binary digits per character. If we encode word by word, taking into account relative frequencies of words, we require 2.14 binary digits per character. And, by using an ingenious and appropriate means, Shannon has estimated the entropy of English text to be between .6 and 1.3 bits per letter, so that we may hope for even more efficient encoding.
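Two of the figures used in this discussion are simple enough to check directly. The sketch below assumes that the figure for encoding with letter frequencies disregarded is the base-2 logarithm of the 27-character alphabet (26 letters plus the space), and that the number of likely 100-letter blocks at 1 bit per letter is 2 raised to the 100th power; the variable names are illustrative.

```python
from math import log2

ALPHABET_SIZE = 27                      # 26 letters plus the space
digits_per_char = log2(ALPHABET_SIZE)   # all characters treated as equally likely
print(f"{digits_per_char:.3f} binary digits per character")

BLOCK_LETTERS = 100
BITS_PER_LETTER = 1                     # the round value assumed in the text
likely_blocks = 2 ** (BLOCK_LETTERS * BITS_PER_LETTER)
print(f"2^100 = {likely_blocks}")
print(f"      = about {likely_blocks:.2e}, a number of {len(str(likely_blocks))} digits")
```

The logarithm comes out at about 4.75, in line with the figure quoted above to within rounding, and the block count is a 31-digit number, which is what "1 followed by 30 zeroes" indicates in round terms.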
If, however, we mechanically push some particular procedure for finding the entropy of English text to the limit, we can easily engender not only difficulties but nonsense. Perhaps we can ascribe this nonsense partly to differences between man as a source of English text and our model of an ideal ergodic source, but partly we should ascribe it to the use of an inappropriate approach. We can surely say that the model of man as an ergodic source of text is good and useful if not perfect, and we should regard it highly for these qualities.

This chapter has been long and heavy going, and a summary seems in order. Clearly, it is impossible to recapitulate briefly all those matters which took so many pages to expound. We can only re-emphasize the most vital points.

In communication theory the entropy of a signal source in bits per symbol or per second gives the average number of binary digits, per symbol or per second, necessary to encode the messages produced by the source.

We think of the message source as randomly, that is, unpredictably, choosing one among many possible messages for transmission. Thus, in connection with the message source we think of entropy as a measure of choice, the amount of choice the source exercises in selecting the one particular message that is actually transmitted.

We think of the recipient of the message, prior to the receipt of the message, as being uncertain as to which among the many possible messages the message source will actually generate and transmit to him. Thus, we think of the entropy of the message source as measuring the uncertainty of the recipient as to which message will be received, an uncertainty which is resolved on receipt of the message.

If the message is one among n equally probable symbols or messages, the entropy is log n.
This is perfectly natural, for if we have log n binary digits, we can use them to write out $2^{\log n} = n$ different binary numbers, and one of these numbers can be used as a label for each of the n messages.

More generally, if the symbols are not equally probable, the entropy is given by equation 5.1. By regarding a very long block of symbols, whose content is little dependent on preceding symbols, as a sort of super symbol, equation 5.1 can be modified to give the entropy per symbol for information sources in which the probability that a symbol is chosen depends on what symbols have been chosen previously. This gives us equation 5.3. Other general expressions for entropy are given by equations 5.4 and 5.5.

By assuming that the symbols or blocks of symbols which a source produces are encoded by a most efficient binary code called a Huffman code, it is possible to prove that the entropy of an ergodic source measured in bits is equal to the average number of binary digits necessary to encode it.

An error-free communication channel may not transmit binary digits; it may transmit letters or other symbols. We can imagine attaching different message sources to such a channel and seeking (usually mathematically) the message source that causes the entropy of the message transmitted over the channel to be as large as possible. This largest possible entropy of a message transmitted over an error-free channel is called the channel capacity. It can be proved that, if the entropy of a source is less than the channel capacity of the channel, messages from the source can be encoded so that they can be transmitted over the channel. This is Shannon's fundamental theorem for the noiseless channel.

In principle, expressions such as equations 5.1, 5.3, 5.4, and 5.5 enable us to compute the entropy of a message source by statistical analysis of messages produced by the source.
Even for an ideal ergodic source, this would often call for impractically long computations. In the case of an actual source, such as English text, some naive prescriptions for computing entropy can be meaningless.

An approximation to the entropy can be obtained by disregarding the effect of some past symbols on the probability of the source producing a particular symbol next. Such an approximation to the entropy is always too large and calls for encoding by means of more binary digits than are absolutely necessary. Thus, if we encode English text letter by letter, disregarding even the relative probabilities of letters, we require 4.76 binary digits per letter, while if we encode word by word, taking into account the relative probability of words, we require 2.14 binary digits per letter.

If we wanted to do even better we would have to take into account other features of English such as the effect of the constraints imposed by grammar on the probability that a message source will produce a particular word.

While we do not know how to encode English text in a highly efficient way, Shannon made an ingenious experiment which shows that the entropy of English text must lie between .6 and 1.3 bits per character. In this experiment a person guessed what letter would follow the letters of a passage of text many letters long.

CHAPTER VI

Language and Meaning

THE TWO GREAT TRIUMPHS of information theory are establishing the channel capacity and, in particular, the number of binary digits required to transmit information from a particular source and showing that a noisy communication channel has an information rate in bits per character or bits per second up to which errorless transmission is possible despite the noise. In each case, the results must be demonstrated for discrete and for continuous sources and channels.
After four chapters of by no means easy preparation, we were finally ready to essay in the previous chapter the problem of the number of binary digits required to transmit the information generated by a truly ergodic discrete source. Were this book a text on information theory, we would proceed to the next logical step, the noisy discrete channel, and then on to the ergodic continuous channel.

At the end of such a logical progress, however, our thoughts would necessarily be drawn back to a consideration of the message sources of the real world, which are only approximately ergodic, and to the estimation of their entropy and the efficient encoding of the messages they produce.

Rather than proceeding further with the strictly mathematical aspects of communication theory at this point, is it not more attractive to pause and consider that chief form of communication, language, in the light of communication theory? And, in doing so, why should we not let our thoughts stray a little in viewing an important part of our world from the small eminence we have attained? Why should we not see whether even the broad problems of language and meaning seem different to us in the light of what we have learned?

In following such a course the reader should heed a word of caution. So far the main emphasis has been on what we know. What we know is the hard core of science. However, scientists find it very difficult to share the things that they know with laymen. To understand the sure and the reasonably sure knowledge of science takes the sort of hard thought which I am afraid was required of the reader in the last few chapters.

There is, however, another and easier though not entirely frivolous side to science. This is a peculiar type of informed ignorance.
The scientist's ignorance is rather different from the layman's ignorance, because the background of established fact and theory on which the scientist bases his peculiar brand of ignorance excludes a wide range of nonsense from his speculations. In the higher and hazier reaches of the scientist's ignorance, we have scientifically informed ignorance about the origin of the universe, the ultimate basis of knowledge, and the relation of our present scientific knowledge to politics, free will, and morality. In this particular chapter we will dabble in what I hope to be scientifically informed ignorance about language.

The warning is, of course, that much of what will be put forward here about language is no more than informed ignorance. The warning seems necessary because it is very hard for laymen to tell scientific ignorance from scientific fact. Because the ignorance is necessarily expressed in broader, sketchier, and less qualified terms than is the fact, it is easier to assimilate. Because it deals with grand and unsolved problems, it is more romantic. Generally, it has a wider currency and is held in higher esteem than is scientific fact.

However hazardous such ignorance may be to the layman, it is valuable to the scientist. It is this vision of unattained lands, of unscaled heights, which rescues him from complacency and spurs him beyond mere plodding. But when the scientist is airing his ignorance he usually knows what he is doing, while the unwarned layman apparently often does not and is left scrambling about on cloud mountains without ever having set foot on the continents of knowledge.

With this caution in mind, let us return to what we have already encountered concerning language and proceed thence.

In what follows we will confine ourselves to a discussion of grammatical English.
We all know (and especially those who have had the misfortune of listening to a transcription of a seemingly intelligible conversation or technical talk) that much spoken English appears to be agrammatical, as, indeed, much of Gertrude Stein is. So are many conventions and cliches. "Me heap big chief" is perfectly intelligible anywhere in the country, yet it is certainly not grammatical. Purists do not consider the inverted word order which is so characteristic of second-rate poetry as being grammatical.

Thus, a discussion of grammatical English by no means covers the field of spoken and written communication, but it charts a course which we can follow with some sense of order and interest.

We have noted before that, if we are to write what will be accepted as English text, certain constraints must be obeyed. We cannot simply set down any word following any other. A complete grammar of a language would have to express all of these constraints fully. It should allow within its rules the construction of any sequence of English words which will be accepted, at some particular time and according to some particular standard, as grammatical.

The matter of acceptance of constructions as grammatical is a difficult and hazy one. The translators who produced the King James Bible were free to say "fear not," "sin not," and "speak not" as well as "think not," "do not," or "have not," and we frequently repeat the aphorism "want not, waste not." Yet in our everyday speech or writing we would be constrained to say "do not fear," "do not sin," or "do not speak," and we might perhaps say, "If you are not to want, you should not waste." What is grammatical certainly changes with time. Here we can merely notice this and pass on to other matters.

Certainly, a satisfactory grammar must prescribe certain rules which allow the construction of all possible grammatical utterances and of grammatical utterances only.
Besides doing this, satisfactory rules of grammar should allow us to analyze a sentence so as to distinguish the features which were determined merely by the rules of grammar from any other features.

If we once had such rules, we would be able to make a new estimate of the entropy of English text, for we could see what part of sentence structure is a mere mechanical following of rules and what part involves choice or uncertainty and hence contributes to entropy. Further, we could transmit English efficiently by transmitting as a message only data concerning the choices exercised in constructing sentences; at the receiver, we could let a grammar machine build grammatical sentences embodying the choices specified by the received message.

Even grammar, of course, is not the whole of language, for a sentence can be very odd even if it is grammatical. We can imagine that, if a machine capable of producing only grammatical sentences made its choices at random, it might perhaps produce such a sentence as "The chartreuse semiquaver skinned the feelings of the manifold." A man presumably makes his choices in some other way if he says, "The blue note flayed the emotions of the multitude." The difference lies in what choices one makes while following grammatical rules, not in the rules themselves. An understanding of grammar would not unlock to us all of the secrets of language, but it would take us a long step forward.

What sort of rules will result in the production of grammatical sentences only and of all grammatical sentences, even when choices are made at random? In Chapter III we saw that English-like sequences of words can be produced by choosing a word at random according to its probability of succeeding a preceding sequence of words some M words long. An example of a second-order word approximation, in which a word is chosen on the basis of its succeeding the previous word, was given.
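A second-order word approximation of the kind just recalled can be sketched as follows. Each new word is drawn at random from the words that followed the previous word in some body of text, so that frequent successors are chosen more often. The tiny corpus below is a hypothetical stand-in made up for illustration (a real experiment would draw on a large body of English text), and the function names are likewise illustrative.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus, used only to make the sketch self-contained.
CORPUS = ("the man runs to the house the horse likes the man "
          "the man sells the horse to the man the house likes the horse").split()

def build_successors(words):
    """Map each word to the list of words that followed it in the corpus."""
    successors = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        successors[prev].append(nxt)   # repeats preserve relative frequency
    return successors

def second_order_words(words, length, seed=0):
    """Generate `length` words, each chosen on the basis of the previous word."""
    rng = random.Random(seed)
    successors = build_successors(words)
    current = rng.choice(words)
    out = [current]
    while len(out) < length:
        # Fall back to the whole corpus if a word has no recorded successor.
        options = successors.get(current) or words
        current = rng.choice(options)
        out.append(current)
    return out

sample = second_order_words(CORPUS, 12)
print(" ".join(sample))
```

Every adjacent pair of words in the output occurs somewhere in the corpus, yet nothing constrains words that are two or more places apart, which is why such sequences look locally plausible but wander.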
One can construct higher-order word approximations by using the knowledge of English which is stored in our heads. One can, for instance, obtain a fourth-order word approximation by simply showing a sequence of three connected words to a person and asking him to think up a sentence in which the sequence of words occurs and to add the next word. By going from person to person a long string of words can be constructed, for instance:

1. When morning broke after an orgy of wild abandon he said here head shook vertically aligned in a sequence of words signifying what.

2. It happened one frosty look of trees waving gracefully against the wall.

3. When cooked asparagus has a delicious flavor suggesting apples.

4. The last time I saw turn when he lived.

These "sentences" are as sensible as they are because selections of words were not made at random but by thinking beings. The point to be noted is how astonishingly grammatical the sentences are, despite the fact that rules of grammar (and sense) were applied to only four words at a time (the three shown to each person and the one he added). Still, example 4 is perhaps dubiously grammatical.

If Shannon is right and there is in English text a choice of about 1 bit per symbol, then choosing among a group of 4 words could involve about 22 binary choices, or a choice among some 10 million 4-word combinations. In principle, a computer could be made to add words by using such a list of combinations, but the result would not be assuredly grammatical, nor could we be sure that this cumbersome procedure would produce all possible grammatical sequences of words. There probably are sequences of words which could form a part of a grammatical sentence in one case and could not in another case. If we included such a sequence, we would produce some nongrammatical sentences, and, if we excluded it, we would fail to produce all grammatical sentences.
If we go to combinations of more than four words, we will favor grammar over completeness. If we go to fewer than four words, we will favor completeness over grammar. We can't have both.

The idea of a finite-state machine recurs at this point. Perhaps at each point in a sentence a sentence-producing machine should be in a particular state, which allows it certain choices as to what state it will go to next. Moreover, perhaps such a machine can deal with certain classes or subclasses of words, such as singular nouns, plural nouns, adjectives, adverbs, verbs of various tense and number, and so on, so as to produce grammatical structures into which words can be fitted rather than sequences of particular words.

The idea of grammar as a finite-state machine is particularly appealing because a mechanist would assert that man must be a finite-state machine, because he consists of only a finite number of cells, or of atoms if we push the matter further.

Noam Chomsky, a brilliant and highly regarded modern linguist, rejects the finite-state machine as either a possible or a proper model of grammatical structure. Chomsky points out that there are many rules for constructing sequences of characters which cannot be embodied in a finite-state machine. For instance, the rule might be, choose letters at random and write them down until the letter Z shows up, then repeat all the letters since the preceding Z in reverse order, and then go on with a new set of letters, and so on. This process will produce a sequence of letters showing clear evidence of long-range order. Further, there is no limit to the possible length of the sequence between Z's. No finite-state machine can simulate this process and this result.
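The rule just described is easy to state as a program, and doing so makes the memory requirement visible. In the sketch below, the generator must hold the whole run of letters since the last fresh start in order to repeat it in reverse, and the checker must likewise remember an entire run before it can verify the mirrored part. Since runs can be arbitrarily long, the memory needed is unbounded, which is the informal reason a machine with a fixed number of states cannot do the job. The function names and the choice of the 26 capital letters are assumptions made for illustration, and the sketch is a demonstration, not a proof.

```python
import random
import string

LETTERS = string.ascii_uppercase   # the 26 capital letters, Z included

def mirrored_sequence(n_segments, seed=0):
    """Write random letters until Z appears, then repeat them reversed; repeat."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_segments):
        segment = []                       # must remember every letter so far
        while True:
            letter = rng.choice(LETTERS)
            if letter == "Z":
                break
            segment.append(letter)
        out.extend(segment)
        out.append("Z")
        out.extend(reversed(segment))
    return "".join(out)

def follows_rule(text):
    """Check that each run before a Z is repeated in reverse right after it."""
    parts = text.split("Z")
    segment = parts[0]
    for part in parts[1:]:
        mirror = segment[::-1]
        if not part.startswith(mirror):
            return False
        segment = part[len(mirror):]       # what remains starts the next run
    return segment == ""                   # the final mirror must be complete

text = mirrored_sequence(5)
print(text)
print(follows_rule(text))
```

A sequence such as ABZBA satisfies the rule, while ABZAB does not; telling them apart requires having kept the letters A and B, in order, until after the Z.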
Chomsky points out that there is no limit to the possible length of grammatical sentences in English and argues that English sentences are organized in such a way that this is sufficient to rule out a finite-state machine as a source of all possible English text. But, can we really regard a sentence miles long as grammatical when we know darned well that no one ever has or will produce such a sentence and that no one could understand it if it existed?

To decide such a question, we must have a standard of being grammatical. While Chomsky seems to refer being or not being grammatical, and some questions of punctuation and meaning as well, to spoken English, I think that his real criterion is: a sentence is grammatical if, in reading or saying it aloud with a natural expression and thoughtfully but ingenuously, it is deemed grammatical by a person who speaks it, or perhaps by a person who hears it. Some problems which might plague others may not bother Chomsky because he speaks remarkably well-connected and grammatical English.

Whether or not the rules of grammar can be embodied in a finite-state machine, Chomsky offers persuasive evidence that it is wrong and cumbersome to try to generate a sentence by basing the choice of the next word entirely and solely on words already written down. Rather, Chomsky considers the course of sentence generation to be something of this sort:

We start with one or another of several general forms the sentence might take; for example, a noun phrase followed by a verb phrase. Chomsky calls such a particular form of sentence a kernel sentence. We then invoke rules for expanding each of the parts of the kernel sentence. In the case of a noun phrase we may first describe it as an article plus a noun and finally as "the man." In the case of a verb phrase we may describe it as a verb plus an object, the object as an article plus a noun, and, in choosing particular words, as "hit the ball."
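The expansion just described can be sketched as a small set of rewriting rules. The rule table below is a toy assumed for illustration, using only the words of the example sentences in the text; it is not Chomsky's grammar, and the names `RULES` and `expansions` are hypothetical.

```python
from itertools import product

# Toy rewriting rules (an assumption for illustration only).
# Each symbol maps to a list of alternative expansions.
RULES = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Article", "Noun"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
    "Article": [["the"], ["a"]],
    "Noun": [["man"], ["ball"], ["girl"], ["cat"]],
    "Verb": [["hit"], ["caught"]],
}


def expansions(symbol):
    """Return every word sequence that can be derived from symbol."""
    if symbol not in RULES:  # a particular word: nothing left to expand
        return [[symbol]]
    results = []
    for alternative in RULES[symbol]:
        parts = [expansions(part) for part in alternative]
        for combo in product(*parts):
            results.append([word for seq in combo for word in seq])
    return results


sentences = [" ".join(words) for words in expansions("Sentence")]
print(len(sentences))
print("the man hit the ball" in sentences)
print("a girl caught a cat" in sentences)
```

Choice is exercised at each branch of the tree (which article, which noun, which verb) rather than word by word from left to right. With so few rules the toy grammar also yields grammatical but odd sentences such as "a ball caught the man."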
Proceeding in this way from the kernel sentence, noun phrase plus verb phrase, we arrive at the sentence, "The man hit the ball." At any stage we could have made other choices. By making other choices at the final stages we might have arrived at "A girl caught a cat."

Here we see that the element of choice is not exercised sequentially along the sentence from beginning to end. Rather, we choose an over-all skeletal plan or scheme for the whole final sentence at the start. That scheme or plan is the kernel sentence. Once the kernel sentence has been chosen, we pass on to parts of the kernel sentence. From each part we proceed to the constituent elements of that part and from the constituent elements to the choice of particular words. At each branch of this treelike structure growing from the kernel sentence, we exercise choice in arriving at the particular final sentence, and, of course, we chose the kernel sentence to start with.

Here I have indicated Chomsky's ideas very incompletely and very sketchily. For instance, in dealing with irregular forms of words Chomsky will first indicate the root word and its particular grammatical form, and then he will apply certain obligatory rules in arriving at the correct English form. Thus, in the branching construction of a sentence, use is made both of optional rules, which allow choice, and of purely mechanical, deterministic obligatory rules, which do not.

To understand this approach further and to judge its merit, one must refer to Chomsky's book [1] and to the references he gives.

[1] Noam Chomsky, Syntactic Structures, Mouton and Co., 's-Gravenhage, 1957.

Chomsky must, of course, deal with the problem of ambiguous sentences, such as, "The lady scientist made the robot fast while she ate." The author of this sentence, a learned information theorist, tells me that, allowing for the vernacular, it has at least four different meanings.
It is perhaps too complicated to serve as an example for detailed analysis.

We might think that ambiguity arises only when one or more words can assume different meanings in what is essentially the same grammatical structure. This is the case in "he was mad" (either angry or insane) or "the pilot was high" (in the sky or in his cups). Chomsky, however, gives a simple example of a phrase in which the confusion is clearly grammatical. In "the shooting of the hunters," the noun hunters may be either the subject, as in "the growling of lions" or the object, as in "the growing of flowers."

Chomsky points out that different rules of transformation applied to different kernel sentences can lead to the same sequence of grammatical elements. Thus, "the picture was painted by a real artist" and "the picture was painted by a new technique" seem to correspond grammatically word for word, yet the first sentence could have arisen as a transformation of "a real artist painted the picture" while the second could not have arisen as a transformation of a sentence having this form. When the final words as well as the final grammatical elements are the same, the sentence is ambiguous.

Chomsky also faces the problem that the distinction between the provinces of grammar and meaning is not clear. Shall we say that grammar allows adjectives but not adverbs to modify nouns? This allows "colorless green." Or should grammar forbid the association of some adjectives with some nouns, of some nouns with some verbs, and so on? With one choice, certain constructions are grammatical but meaningless; with the other they are ungrammatical.

We see that Chomsky has laid out a plan for a grammar of English which involves at each point in the synthesis of a sentence certain steps which are either obligatory or optional.
The processes allowed in this grammar cannot be carried out by a finite-state machine, but they can be carried out by a more general machine called a Turing machine, which is a finite-state machine plus an infinitely long tape on which symbols can be written and from which symbols can be read or erased. The relation of Chomsky's grammar to such machines is a proper study for those interested in automata.

We should note, however, that if we arbitrarily impose some bound on the length of a sentence, even if we limit the length to 1,000 or 1 million words, then Chomsky's grammar does correspond to a finite-state machine. The imposition of such a limit on sentence length seems very reasonable in a practical way.

Once a general specification or model of a grammar of the sort Chomsky proposes is set up, we may ask under what circumstances and how can an entropy be derived which will measure the choice or uncertainty of a message source that produces text according to the rules of the grammar? This is a question for the mathematically skilled information theorist.

Much more important is the production of a plausible and workable grammar. This might be a phrase-structure grammar, as Chomsky proposes, or it might take some other form. Such a grammar might be incomplete in that it failed to produce or analyze some constructions to be found in grammatical English. It seems more important that its operation should correspond to what we know of the production of English by human beings. Further, it should be simple enough to allow the generation and analysis of text by means of an electronic computer. I believe that computers must be used in attacking problems of the structure and statistics of English text.
While a great many people are convinced that Chomsky's phrase-structure approach is a very important aspect of grammar, some feel that his picture of the generation of sentences should be modified or narrowed if it is to be used to describe the actual generation of sentences by human beings. Subjectively, in speaking or listening to a speaker one has a strong impression that sentences are generated largely from beginning to end. One also gets the impression that the person generating a sentence doesn't have a very elaborate pattern in his head at any one time but that he elaborates the pattern as he goes along.

I suspect that studies of the form of grammars and of the statistics of their use as revealed by language will in the not distant future tell us many new things about the nature of language and about the nature of men as well. But, to say something more particular than this, I would have to outreach present knowledge, mine and others'.

A grammar must specify not only rules for putting different types of words together to make grammatical structures; it must divide the actual words of English into classes on the basis of the places in which they can appear in grammatical structures. Linguists make such a division purely on the basis of grammatical function without invoking any idea of meaning. Thus, all we can expect of a grammar is the generation of grammatical sentences, and this includes the example given earlier: "The chartreuse semiquaver skinned the feelings of the manifold." Certainly the division of words into grammatical categories such as nouns, adjectives, and verbs is not our sole guide concerning the use of words in producing English text.

What does influence the choice among words when the words used in constructing grammatical sentences are chosen, not at random by a machine, but rather by a live human being who, through long training, speaks or writes English according to the rules of the grammar?
This question is not to be answered by a vague appeal to the word meaning. Our criteria in producing English sentences can be very complicated indeed. Philosophers and psychologists have speculated about and studied the use of words and language for generations, and it is as hard to say anything entirely new about this as it is to say anything entirely true. In particular, what Bishop Berkeley wrote in the eighteenth century concerning the use of language is so sensible that one can scarcely make a reasonable comment without owing him credit.

Let us suppose that a poet of the scanning, rhyming school sets out to write a grammatical poem. Much of his choice will be exercised in selecting words which fit into the chosen rhythmic pattern, which rhyme, and which have alliteration and certain consistent or agreeable sound values. This is particularly notable in Poe's "The Bells," "Ulalume," and "The Raven."

Further, the poet will wish to bring together words which through their sound as well as their sense arouse related emotions or impressions in the reader or hearer. The different sections of Poe's "The Bells" illustrate this admirably. There is a marked contrast between:

    How they tinkle, tinkle, tinkle,
    In the icy air of night!
    While the stars that oversprinkle
    All the heavens, seem to twinkle
    In a crystalline delight; . . .

and

    Through the balmy air of night
    How they ring out their delight!
    From the molten-golden notes,
    And all in tune,
    What a liquid ditty floats . . .

Sometimes, the picture may be harmonious, congruous, and moving without even the trivial literal meaning of this verse of Poe's, as in Blake's two lines:

    Tyger, Tyger, burning bright
    In the forests of the night . . .

In instances other than poetry, words may be chosen for euphony, but they are perhaps more often chosen for their associations with and ability to excite passions such as those listed by Berkeley: fear, love, hatred, admiration, disdain.
Particular words or expressions move each of us to such feelings. In a given culture, certain words and phrases will have a strong and common effect on the majority of hearers, just as the sights, sounds or events with which they are associated do. The words of a hymn or psalm can induce a strong religious emotion; political or racial epithets, a sense of alarm or contempt, and the words and phrases of dirty jokes, sexual excitement.

One emotion which Berkeley does not mention is a sense of understanding. By mouthing commonplace and familiar patterns of words in connection with ill-understood matters, we can associate some of our emotions of familiarity and insight with our perplexity about history, life, the nature of knowledge, consciousness, death, and Providence. Perhaps such philosophy as makes use of common words should be considered in terms of assertion of a reassurance concerning the importance of man's feelings rather than in terms of meaning.

One could spend days on end examining examples of motivation in the choice of words, but we do continually get back to the matter of meaning. Whatever meaning may be, all else seems lost without it. A Chinese poem, hymn, deprecation, or joke will have little effect on me unless I understand Chinese in whatever sense those who know a language understand it.

Though Colin Cherry, a well-known information theorist, appears to object, I think that it is fair to regard meaningful language as a sort of code of communication. It certainly isn't a simple code in which one mechanically substitutes a word for a deed. It's more like those elaborate codes of early cryptography, in which many alternative code words were listed for each common letter or word (in order to suppress frequencies). But in language, the listings may overlap. And one person's code book may have different entries from another's, which is sure to cause confusion.
If we regard language as an imperfect code of communication, we must ultimately refer meaning back to the intent of the user. It is for this reason that I ask, "What do you mean?" even when I have heard your words. Scholars seek the intent of authors long dead, and the Supreme Court seeks to establish the intent of Congress in applying the letter of the law.

Further, if I become convinced that a man is lying, I interpret his words as meaning that he intends to flatter or deceive me. If I find that a sentence has been produced by a computer, I interpret it to mean that the computer is functioning very cleverly.

I don't think that such matters are quibbles; it seems that we are driven to such considerations in connection with meaning if we do regard language as an imperfect code of communication, and as one which is sometimes exploited in devious ways. We are certainly far from any adequate treatment of such problems.

Grammatical sentences do, however, have what might be called a formal meaning, regardless of intent. If we had a satisfactory grammar, a machine should be able to establish the relations between the words of a sentence, indicating subject, verb, object, and what modifying phrases or clauses apply to what other words. The next problem beyond this in seeking such formal meaning in sentences is the problem of associating words with objects, qualities, actions, or relations in the world about us, including the world of man's society and of his organized knowledge.

In the simple communications of everyday life, we don't have much trouble in associating the words that are used with the proper objects, qualities, actions, and relations. No one has trouble with "close the east window" or "Henry is dead," when he hears such a simple sentence in simple, unambiguous surroundings. In a familiar American room, anyone can point out the window; we have closed windows repeatedly, and we know what direction east is.
Also, we know Henry (if we don't get Henry Smith mixed up with Henry Jones), and we have seen dead people. If the sentence is misheard or misunderstood, a second try is almost sure to succeed.

Think, however, how puzzling the sentence about the window would be, even in translation, to a shelterless savage. And we can get pretty puzzled ourselves concerning such a question as, is a virus living or dead?

It appears that much of the confusion and puzzlement about the associations of words with things of the world arose through an effort by philosophers from Plato to Locke to give meaning to such ideas as window, cat, or dead by associating them with general ideas or ideal examples. Thus, we are presumed to identify a window by its resemblance to a general idea of a window, to an ideal window, in fact, and a cat by its resemblance to an ideal cat which embodies all the attributes of cattiness. As Berkeley points out, the abstract idea of a (or the ideal) triangle must at once be "neither oblique, rectangle, equilateral, equicrural nor scalenon, but all and none of these at once."

Actually, when a doctor pronounces a man dead he does so on the basis of certain observed signs which he would be at a loss to identify in a virus. Further, when a doctor makes a diagnosis, he does not start out by making an over-all comparison of the patient's condition with an ideal picture of a disease. He first looks for such signs as appearance, temperature, pulse, lesions of the skin, inflammation of the throat, and so on, and he also notes such symptoms as the patient can describe to him. Particular combinations of signs and symptoms indicate certain diseases, and in differential diagnoses further tests may be used to distinguish among diseases producing similar signs and symptoms.

In a similar manner, a botanist identifies a plant, familiar or unfamiliar, by the presence or absence of certain qualities of size, color, leaf shape and disposition, and so on.
Some of these qualities, such as the distinction between the leaves of monocotyledonous and dicotyledonous plants, can be decisive; others, such as size, can be merely indicative. In the end, one is either sure he is right or perhaps willing to believe that he is right; or the plant may be a new species.

Thus, in the workaday worlds of medicine and botany, the ideal disease or plant is conspicuous by its absence as any actual useful criterion. Instead, we have lists of qualities, some decisive and some merely indicative.

The value of this observation has been confirmed strongly in recent work toward enabling machines to carry out tasks of recognition or classification. Early workers, perhaps misled by early philosophers, conceived the idea of matching a letter to an ideal pattern of a letter or the spectrogram of a sound to an ideal spectrogram of the sound. The results were terrible. Audrey, a pattern-matching machine with the bulk of a hippo and brains beneath contempt, could recognize digits spoken by one voice or a selected group of voices, but Audrey was sadly fallible. We should, I think, conclude that human recognition works this way in very simple cases only, if at all.

Later and more sophisticated workers in the field of recognition look for significant features. Thus, as a very simple example, rather than having an ideal pattern of a capital Q, one might describe Q as a closed curve without corners or reversals of curvature and with something attached between four and six o'clock.

In 1959, L. D. Harmon built at the Bell Laboratories a simple device weighing a few pounds which almost infallibly recognizes the digits from one to zero written out as words in longhand. Does this gadget match the handwriting against patterns? You bet it doesn't! Instead, it asks such questions as, how many times did the stylus go above or below certain lines? Were i's dotted or t's crossed?
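The contrast between matching an ideal pattern and consulting a list of qualities can be made concrete with a toy sketch. Everything in it is assumed for illustration: the feature names, the three classes, and the function `classify` are hypothetical, and the sketch makes no claim to describe Harmon's device or any actual recognizer.

```python
# Hypothetical lists of qualities. "decisive" qualities must all be present
# as listed; "indicative" qualities only tip the balance among candidates.
CLASSES = {
    "Q": {
        "decisive": {"closed_curve": True, "corners": False, "tail_4_to_6": True},
        "indicative": {},
    },
    "O": {
        "decisive": {"closed_curve": True, "corners": False, "tail_4_to_6": False},
        "indicative": {"narrow": False},
    },
    "0": {
        "decisive": {"closed_curve": True, "corners": False, "tail_4_to_6": False},
        "indicative": {"narrow": True},
    },
}


def classify(observed):
    """Return the best-fitting class name, or None if no class fits."""
    candidates = []
    for name, spec in CLASSES.items():
        decisive_ok = all(
            observed.get(key) == value for key, value in spec["decisive"].items()
        )
        if decisive_ok:
            score = sum(
                observed.get(key) == value
                for key, value in spec["indicative"].items()
            )
            candidates.append((score, name))
    if not candidates:
        return None  # like the plant that may be a new species
    return max(candidates)[1]


print(classify({"closed_curve": True, "corners": False, "tail_4_to_6": True}))
print(classify({"closed_curve": True, "corners": False,
                "tail_4_to_6": False, "narrow": True}))
```

No ideal Q is stored anywhere in the sketch; a shape is called a Q only because it answers a few questions in the right way.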
Certainly, no one doubts that words refer to classes of objects, actions, and so on. We are surrounded by and involved with a large number of classes and subclasses of objects and actions which we can usefully associate with words. These include such objects as plants (peas, sunflowers . . .), animals (cats, dogs . . .), machines (autos, radios . . .), buildings (houses, towers . . .), clothing (skirts, socks . . .), and so on. They include such very complicated sequences of actions as dressing and undressing (the absent-minded, including myself, repeatedly demonstrate that they can do this unconsciously); tying one's shoes (an act which children have considerable difficulty in learning), eating, driving a car, reading, writing, adding figures, playing golf or tennis (activities involving a host of distinct subsidiary skills), listening to music, making love, and so on and on and on.

It seems to me that what delimits a particular class of objects, qualities, actions, or relations is not some sort of ideal example. Rather, it is a list of qualities. Further, the list of qualities cannot be expected to enable us to divide experience up into a set of logical, sharply delimited, and all-embracing categories. The language of science may approach this in dealing with a narrow range of experience, but the language of everyday life makes arbitrary, overlapping, and less than all-inclusive divisions of experience. Yet, I believe that it is by means of such lists of qualities that we identify doors, windows, cats, dogs, men, monkeys, and other objects of daily life. I feel also that this is the way in which we identify common actions such as running, skipping, jumping, and tying, and such symbols as words, written and spoken, as well.
I think that it is only through such an approach that we can hope to make a machine classify objects and experience in terms of language, or recognize and interpret language in terms of other language or of action. Further, I believe that when a word cannot offer a table of qualities or signs whose elements can be traced back to common and familiar experiences, we have a right to be wary of the word.

If we are to understand language in such a way that we can hope some day to make a machine which will use language successfully, we must have a grammar and we must have a way of relating words to the world about us, but this is of course not enough. If we are to regard sentences as meaningful, they must in some way correspond to life as we live it.

Our lives do not present fresh objects and fresh actions each day. They are made up of familiar objects and familiar though complicated sequences of actions presented in different groupings and orders. Sometimes we learn by adding new objects, or actions, or combinations of objects or sequences of actions to our stock, and so we enrich or change our lives. Sometimes we forget objects and actions.

Our particular actions depend on the objects and events about us. We dodge a car (a complicated sequence of actions). When thirsty, we stop at the fountain and drink (another complicated but recurrent sequence). In a packed crowd we may shoulder someone out of the way as we have done before. But our information about the world does not all come from direct observation, and our influence on others is happily not confined to pushing and shoving. We have a powerful tool for such purposes: language and words.

We use words to learn about relations among objects and activities and to remember them, to instruct others or to receive instruction from them, to influence people in one way or another.
For the words to be useful, the hearer must understand them in the same sense that the speaker means them, that is, insofar as he associates them with nearly enough the same objects or skills. It's no use, however, to tell a man to read or to add a column of figures if he has never carried out these actions before, so that he doesn't have these skills. It is no use to tell him to shoot the aardvark and not the gnu if he has never seen either.

Further, for the sequences of words to be useful, they must refer to real or possible sequences of events. It's of no use to advise a man to walk from London to New York in the forenoon immediately after having eaten a seven o'clock dinner.

Thus, in some way the meaningfulness of language depends not only on grammatical order and on a workable way of associating words with collections of objects, qualities, and so on; it also depends on the structure of the world around us. Here we encounter a real and an extremely serious difficulty with the idea that we can in some way translate sentences from one language into another and accurately preserve the "meaning."

One obvious difficulty in trying to do this arises from differences in classification. We can refer to either the foot or the lower leg; the Russians have one word for the foot plus the lower leg. Hungarians have twenty fingers (or toes), for the word is the same for either appendage. To most of us today, a dog is a dog, male or female, but men of an earlier era distinguished sharply between a dog and a bitch. Eskimos make, it is said, many distinctions among snow which in our language would call for descriptions, and for us even these descriptions would have little real content of importance or feeling, because in our lives the distinctions have not been important. Thus, the parts of the world which are common and meaningful to those speaking different languages are often divided into somewhat different classes.
It may be impossible to write down in different languages words or simple sentences that specify exactly the same range of experience.

There is a graver problem than this, however. The range of experience to which various words refer is not common among all cultures. What is one to do when faced with the problem of translating a novel containing the phrase, "tying one's shoelace," which as we have noted describes a complicated action, into the language of a shoeless people? An elaborate description wouldn't call up the right thing at all. Perhaps some cultural equivalent (?) could be found. And how should one deal with the fact that "he built a house" means personal tree cutting and adzing in a pioneer novel, while it refers to the employment of an architect and a contractor in a contemporary story?

It is possible to make some sort of translation between closely related languages on a word-for-word or at least phrase-for-phrase basis, though this is said to have led from "out of sight, out of mind" to "blind idiot." When the languages and cultures differ in major respects, the translator has to think what the words mean in terms of objects, actions, or emotions and then express this meaning in the other language. It may be, of course, that the culture with which the language is associated has no close equivalents to the objects or actions described in the passage to be translated. Then the translator is really stuck.

How, oh how is the man who sets out to build a translating machine to cope with a problem such as this? He certainly cannot do so without in some way enabling the machine to deal effectively with what we refer to as understanding. In fact, we see understanding at work even in situations which do not involve translation from one language into another.
A screen writer who can quite accurately transfer the essentials of a scene involving a dying uncle in Omsk to one involving a dying father in Dubuque will repeatedly make complete nonsense in trying to rephrase a simple technical statement. This is clearly because he understands grief but not science.

Having grappled painfully with the word meaning, we are now faced with the word understanding. This seems to have two sides. If we understand algebra or calculus, we can use their manipulations to solve problems we haven't encountered before or to supply proofs of theorems we haven't seen proved. In this sense, understanding is manifested by a power to do, to create, not merely to repeat. To some degree, an electronic computer which proves theorems in mathematical logic which it has not encountered before (as computers can be programmed to do) could perhaps be said to understand the subject. But there is an emotional side to understanding, too. When we can prove a theorem in several ways and fit it together with other theorems or facts in various manners, when we can view a field from many aspects and see how it all fits together, we say that we understand the subject deeply. We attain a warm and confident feeling about our ability to cope with it. Of course, at one time or another most of us have felt the warmth without manifesting the ability. And how disillusioned we were at the critical test!

In discussing language from the point of view of information theory, we have drifted along a tide of words, through the imperfectly charted channels of grammar and on into the obscurities of meaning and understanding. This shows us how far ignorance can take one. It would be absurd to assert that information theory, or anything else, has enabled us to solve the problems of linguistics, of meaning, of understanding, of philosophy, of life.
At best, we can perhaps say that we are pushing a little beyond the mechanical constraints of language and getting at the amount of choice that language affords. This idea suggests views concerning the use and function of language, but it does not establish them. The reader may share my freely offered ignorance concerning these matters, or he may prefer his own sort of ignorance.

CHAPTER VII

Efficient Encoding

WE WILL NEVER AGAIN understand nature as well as Greek philosophers did. A general explanation of common phenomena in terms of a few all-embracing principles no longer satisfies us. We know too much. We must explain many things of which the Greeks were unaware. And, we require that our theories harmonize in detail with the very wide range of phenomena which they seek to explain. We insist that they provide us with useful guidance rather than with rationalizations. The glory of Newtonian mechanics is that it has enabled men to predict the positions of planets and satellites and to understand many other natural phenomena as well; it is surely not that Newtonian mechanics once inspired and supported a simple mechanistic view of the universe at large, including life.

Present-day physicists are gratified by the conviction that all (non-nuclear) physical, chemical, and biological properties of matter can in principle be completely and precisely explained in all their detail by known quantum laws, assuming only the existence of electrons and of atomic nuclei of various masses and charges. It is somewhat embarrassing, however, that the only physical system all of whose properties actually have been calculated exactly is the isolated hydrogen atom.

Physicists are able to predict and explain some other physical phenomena quite accurately and many more semiquantitatively.
However, a basic and accurate theoretical treatment, founded on electrons, nuclei, and quantum laws only, without recourse to other experimental data, is lacking for most common thermal, mechanical, electrical, magnetic, and chemical phenomena. Tracing complicated biological phenomena directly back to quantum first principles seems so difficult as to be scarcely relevant to the real problems of biology. It is almost as if we knew the axioms of an important field of mathematics but could prove only a few simple theorems.

Thus, we are surrounded in our world by a host of intriguing problems and phenomena which we cannot hope to relate through one universal theory, however true that theory may be in principle. Until recently the problems of science which we commonly associate with the field of physics have seemed to many to be the most interesting of all the aspects of nature which still puzzle us. Today, it is hard to find problems more exciting than those of biochemistry and physiology.

I believe, however, that many of the problems raised by recent advances in our technology are as challenging as any that face us. What could be more exciting than to explore the potentialities of electronic computers in proving theorems or in simulating other behavior we have always thought of as "human"? The problems raised by electrical communication are just as challenging. Accurate measurements made by electrical means have revolutionized physical acoustics. Studies carried out in connection with telephone transmission have inaugurated a new era in the study of speech and hearing, in which previously accepted ideas of physiology, phonetics, and linguistics have proved to be inadequate. And, it is this chaotic and intriguing field of much new ignorance and of a little new knowledge to which communication theory most directly applies.
If communication theory, like Newton's laws of motion, is to be taken seriously, it must give us useful guidance in connection with problems of communication. It must demonstrate that it has a real and enduring substance of understanding and power. As the name implies, this substance should be sought in the efficient and accurate transmission of information. The substance indeed exists. As we have seen, it existed in an incompletely understood form even before Shannon's work unified it and made it intelligible.
To deal with the matter of accurate transmission of information we need new basic understanding, and this matter will be tackled in the next chapter. The foregoing chapters have, however, put us in a position to discuss some challenging aspects of the efficient transmission of information.
We have seen that in the entropy of an information source measured in bits per symbol or per second we have a measure of the number of binary digits, of off-or-on pulses, per symbol or per second which are necessary to transmit a message. Knowing this number of binary digits required for encoding and transmission, we naturally want a means of actually encoding messages with, at the most, not many more binary digits than this minimum number.
Novices in mathematics, science, or engineering are forever demanding infallible, universal, mechanical methods for solving problems. Such methods are valuable in proving that problems can be solved, but in the case of difficult problems they are seldom practical, and they may sometimes be completely unfeasible. As an example, we may note that an explicit solution of the general cubic equation exists, but no one ever uses it in a practical problem. Instead, some approximate method suited to the type or class of cubics actually to be solved is resorted to.
The person who isn't a novice thinks hard about a specific problem in order to see if there isn't some better approach than a machine-like application of what he has been taught. Let us see how this applies in the case of information theory. We will first consider the case of a discrete source which produces a string of symbols or characters.
In Chapter V we saw that the entropy of a source can be computed by examining the relative probabilities of occurrence of various long blocks of characters. As the length of the block is increased, the approximation to the entropy gets closer and closer. In a particular case, perhaps blocks 5, or 10, or 100 characters in length might be required to give a very good approximation to the entropy.
We also saw that by dividing the message into successive blocks of characters, to each of which a probability of occurrence can be attached, and by encoding these blocks into binary digits by means of the Huffman code, the number of digits used per character approaches the entropy as the blocks of characters are made longer and longer. Here indeed is our foolproof mechanical scheme. Why don't we simply use it in all cases?
To see one reason, let us examine a very simple case. Suppose that an information source produces a binary digit, a 1 or a 0, randomly and with equal probability and then follows it with the same digit twice again before producing independently another digit. The message produced by such a source might be:
000111000111111000000111
Would anyone be foolish enough to divide such a message successively into blocks of 1, 2, 3, 4, 5, etc., characters, compute the probabilities of the blocks, encode them with a Huffman code, and note the improvement in the number of binary digits required for transmission? I don't know; it sometimes seems to me that there are no limits to human folly. Clearly, a much simpler procedure is not only adequate but absolutely perfect.
Because of the repetition, the entropy is clearly the same as for a succession of a third as many binary digits chosen randomly and independently with equal probability of 1 or 0. That is, it is 1/3 binary digit per character of the repetitious message. And, we can transmit the message perfectly efficiently simply by sending every third character and telling the recipient to write down each received character three times.
This example is simple but important. It illustrates the fact that we should look for natural structure in a message source, for salient features of which we can take advantage.
The discussion of English text in Chapter IV illustrates this. We might, for instance, transmit text merely as a picture by television or facsimile. This would take many binary digits per character. We would be providing a transmission system capable of sending not only English text, but Cyrillic, Greek, Sanskrit, Chinese, and other text, and pictures of landscapes, storms, earthquakes, and Marilyn Monroe as well. We would not be taking advantage of the elementary and all-important fact that English text is made up of letters.
If we encode English text letter by letter, taking no account of the different probabilities of various letters (and excluding the space), we need 4.7 binary digits per letter. If we take into account the relative probabilities of letters, as Morse did, we need 4.14 binary digits per letter. If we proceeded mechanically to encode English text more efficiently, we might go on to encoding pairs of letters, sequences of three letters, and so on. This, however, would provide for encoding many sequences of letters which aren't English words. It seems much more sensible to go on to the next larger unit of English text, the word. We have seen in Chapter IV that we would expect to use only about 14 binary digits per word or 2.5 binary digits per character in so encoding English text.
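The repeated-digit source discussed above is simple enough to try out directly. What follows is only a minimal sketch in Python, not anything from the original discussion: the function names are made up for illustration, and the block entropy is estimated from blocks that happen to line up with the groups of three, which is an assumption of the sketch.

```python
import math
import random
from collections import Counter

def repeated_source(n_digits, rng):
    """Emit n_digits random binary digits, each written three times in a row."""
    return "".join(rng.choice("01") * 3 for _ in range(n_digits))

def encode(message):
    """Send only every third character; the other two are pure repetition."""
    return message[::3]

def decode(sent):
    """The recipient writes down each received character three times."""
    return "".join(ch * 3 for ch in sent)

def block_entropy_per_char(message, block_len):
    """Estimate entropy, in bits per character, from non-overlapping blocks."""
    blocks = [message[i:i + block_len]
              for i in range(0, len(message) - block_len + 1, block_len)]
    counts = Counter(blocks)
    total = len(blocks)
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return h / block_len

rng = random.Random(0)
message = repeated_source(20000, rng)
sent = encode(message)

print("digits sent per message character:", len(sent) / len(message))
for k in (1, 3, 6):
    print("block length", k, "->", round(block_entropy_per_char(message, k), 3))
```

Looking at single characters alone suggests about one binary digit per character; only when the blocks are long enough to expose the repetition does the estimate fall toward 1/3. The every-third-character code reaches that figure at once, with no table of block probabilities at all.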
If we want to proceed further, the next logical step would be to consider the structure of phrases or sentences; that is, to take advantage of the rules of grammar. The trouble is that we don't know the rules of grammar completely enough to help us, and if we did, a communication system which made use of these rules would probably be unpractically complicated. Indeed, in practical cases it still seems best to encode the letters of English text independently, using at least 5 binary digits per character.
It is, however, important to get some idea of what could be accomplished in transmitting English text. To this end, Shannon considered the following communication situation. Suppose we ask a man, using all his knowledge of English, to guess what the next character in some English text is. If he is right we tell him so, and he writes the character down. If he is wrong, we may either tell him what the character actually is or let him make further guesses until he guesses the right character.
Now, suppose that we regard this process as taking place at the transmitter, and say that we have an absolutely identical twin to guess for us at the receiver, a twin who makes just the same mistakes that the man at the transmitter does. Then, to transmit the text, we let the man at the receiver guess. When the man at the transmitter guesses right, so will the man at the receiver. Thus, we need send information to the man at the receiver only when the man at the transmitter guesses wrong and then only enough information to enable the men at the transmitter and the receiver to write down the right character.
Shannon has drawn a diagram of such a communication system, which is shown in Figure VII-1. A predictor acts on the original text. The prediction of the next letter is compared with the actual letter. If an error is noted, some information is transmitted.
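The identical-twin arrangement can be imitated with any predictor that behaves in exactly the same way at both ends. The following is a rough sketch in Python under loudly stated assumptions: the `Predictor` class is a hypothetical stand-in that merely guesses whichever character has most often followed the previous one so far (nothing like a man using all his knowledge of English), and a wrong guess is handled by sending the actual character rather than by allowing further guesses.

```python
from collections import Counter, defaultdict

class Predictor:
    """Toy deterministic predictor; a hypothetical stand-in for the 'twin'."""

    def __init__(self):
        self.follow = defaultdict(Counter)  # previous char -> counts of next char
        self.prev = ""

    def guess(self):
        counts = self.follow[self.prev]
        if not counts:
            return " "  # arbitrary first guess, the same at both ends
        # Ties are broken by character order so both ends always agree.
        return max(sorted(counts), key=lambda ch: counts[ch])

    def update(self, actual):
        self.follow[self.prev][actual] += 1
        self.prev = actual

def transmit(text):
    """Return the 'reduced text': None where the guess was right, else the character."""
    predictor = Predictor()
    reduced = []
    for ch in text:
        reduced.append(None if predictor.guess() == ch else ch)
        predictor.update(ch)
    return reduced

def receive(reduced):
    """Rebuild the text by guessing, accepting a correction whenever one arrives."""
    predictor = Predictor()
    out = []
    for item in reduced:
        ch = predictor.guess() if item is None else item
        out.append(ch)
        predictor.update(ch)
    return "".join(out)

text = "the theory of the thing is that the twin guesses the same"
reduced = transmit(text)
assert receive(reduced) == text
print(sum(item is None for item in reduced), "of", len(text),
      "characters needed no correction")
```

The point of the sketch is only that the receiver never sees the original text, yet reconstructs it exactly, because its predictor makes the same mistakes as the one at the transmitter.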
At the receiver, a prediction of the next character is made from the already reconstructed text. A comparison involving the received signal is carried out. If no error has been made, the predicted character is used; if an error has been made, the "reduced text" information coming in will make it possible to correct the error.
[Fig. VII-1. Block diagram: original text, comparison, reduced text, comparison, original text, with a predictor at the transmitter and at the receiver.]
Of course, we don't have such identical twins or any other highly effective identical predictors. Nonetheless, a much simpler but purely mechanical system based on this diagram has been used in transmitting pictures. Shannon's purpose was different, however. By using just one person, and not twins, he was able to find what transmission rate would be required in such a system merely by examining the errors made by the one man in the transmitter situation. The results are summed up in Figure V-4 of Chapter V. A better prediction is made on the basis of the 100 preceding letters than on the basis of the preceding 10 or 15. To correct the errors in prediction, something between 0.6 and 1.3 binary digits per character is required. This tells us that, insofar as this result is correct, the entropy of English text must lie between 0.6 and 1.3 bits per letter.
A discrete source of information provides a good example for discussion but not an example of much practical importance in communication. The reason is that, by modern standards of electrical communication, it takes very few binary digits or off-or-on pulses to send English text. We have to hurry to speak a few hundred words a minute, yet it is easy to send over a thousand words of text over a telephone connection in a minute or to send 10 million words a minute over a TV channel, and, in principle if not in practice, we could transmit some 50,000 words a minute over a telephone channel and some 50 million words a minute over a TV channel.
As a matter of fact, in practical cases we have even retreated from Morse's ingenious code which sends an E faster than a Z. A teletype system uses the same length of signal for any letter.
Efficient encoding is thus potentially more important for voice transmission than for transmission of text, for voice takes more binary digits per word than does text. Further, efficient encoding is potentially more important for TV than for voice.
Now, a voice or a TV signal is inherently continuous as opposed to English text, numbers, or binary digits, which are discrete. Disregarding capitalization and punctuation, an English character may be any one of the letters or the space. At a given moment, the sound wave of the human voice may have any pressure at all lying within some range of pressures. We have noted in Chapter IV that if the frequencies of such a continuous signal are limited to some bandwidth B, the signal can be accurately represented by 2B samples or measurements of amplitude per second.
We remember, however, that the entropy per character depends on how many values the character can assume. Since a continuous signal can assume an infinite number of different values at a sample point, we are led to assume that a continuous signal must have an entropy of an infinite number of bits per sample.
This would be true if we required an absolutely accurate reproduction of the continuous signal. However, signals are transmitted to be heard or seen. Only a certain degree of fidelity of reproduction is required. Thus, in dealing with the samples which specify continuous signals, Shannon introduces a fidelity criterion. To reproduce the signal in a way meeting the fidelity criterion requires only a finite number of binary digits per sample or per second, and hence we can say that, within the accuracy imposed by a particular fidelity criterion, the entropy of a continuous source has a particular value in bits per sample or bits per second.
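As a rough worked illustration of how a fidelity criterion turns an infinite requirement into a finite one, suppose the criterion is met by letting each sample take one of L distinguishable values. The symbols R and L, and the bandwidth of 4,000 cycles per second used for telephone speech, are assumptions introduced here for the arithmetic only; the figure of 128 values is the one the text quotes for telephone-quality speech.

```latex
% R: binary digits per second; B: bandwidth; L: number of allowed sample values
R = 2B \log_2 L
% Assumed B = 4000 cycles per second, with L = 128 values per sample:
R = 2 \times 4000 \times \log_2 128 = 8000 \times 7 = 56{,}000 \ \text{binary digits per second}
```

This treats every sample independently and so is an upper figure for such a scheme, not the entropy of the source.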
It is extremely important to realize that the fidelity criterion should be associated with long stretches of the signal, not with individual samples. For instance, in transmitting a sound, if we make each sample 10 per cent larger, we will merely make the sound louder, and no damage will be done to its quality. If we make a random error of 10 per cent in each sample, the recovered signal will be very noisy. Similarly, in picture transmission an error in brightness or contrast which changes smoothly and gradually across the picture will pass unnoticed, but an equal but random error differing from point to point will be intolerable.
We have seen that we can send a continuous signal by quantizing each sample, that is, by allowing it to assume only certain preassigned values. It appears that 128 values are sufficient for the transmission of telephone-quality speech or of pictures. We must realize, however, that, in quantizing a speech signal or a picture signal sample by sample, we are proceeding in a very unsophisticated manner, just as we are if we encode text letter by letter rather than word by word.
The name hyperquantization has been given to the quantization of continuous signals of more than one sample at a time. This is undoubtedly the true road to efficient encoding of continuous signals. One can easily ruin his chances of efficient encoding completely by quantizing the samples at the start. Yet, to hyperquantize a continuous signal effectively is not easy, and in the present art independent quantization of samples is the method commonly used. It is used in pulse code modulation, which is used in some military telephone systems and is being developed for multiplex transmission of telephone signals, that is, for sending many speech signals over the same circuit. In pulse code modulation, the nearest of a number of standard levels or amplitudes is assigned to each sample.
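The assignment of a sample to the nearest standard level, and the sending of that level as a binary number, might be sketched as follows in Python. This is a minimal illustration and not a description of any actual pulse code modulation system: the function names, the amplitude range of -1 to 1, and the placing of levels at both ends of that range are all assumptions.

```python
def pcm_encode(sample, bits, lo=-1.0, hi=1.0):
    """Return the binary number of the nearest of 2**bits equally spaced levels."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)      # spacing between adjacent levels
    clipped = min(max(sample, lo), hi)   # samples outside the range are clipped
    index = round((clipped - lo) / step)
    return format(index, "0{}b".format(bits))

def pcm_decode(code, lo=-1.0, hi=1.0):
    """Recover the amplitude of the level named by the binary number."""
    bits = len(code)
    step = (hi - lo) / (2 ** bits - 1)
    return lo + int(code, 2) * step

for sample in (-1.0, -0.2, 0.3, 1.0):
    code = pcm_encode(sample, 7)         # 7 binary digits: 128 levels
    print(sample, code, round(pcm_decode(code), 4))
```

With equally spaced levels the recovered amplitude differs from the sample by at most half the spacing between levels, whatever the size of the sample.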
As an example, if eight levels were used, they might be equally spaced as in a of Figure VII-2. The level representing the sample is then transmitted by sending the binary number written to the right of it.
Some subtlety of encoding can be used even in such a system. Instead of the equally spaced amplitudes of Figure VII-2a, we can use quantization levels which are close together for small signals and farther apart for large signals, as shown in Figure VII-2b. The reason for doing this is, of course, that our ears are sensitive to a fractional error in signal amplitude rather than to an error of so many dynes below or above average pressure or so many volts positive or negative, in the signal. By such companding (compressing the high amplitudes at the transmitter and expanding them again at the receiver), 7 binary digits per sample can give a signal almost as good as 11 binary digits would if the signal levels transmitted were separated by equal differences in amplitude.
[Fig. VII-2. Eight quantization levels about zero amplitude, labeled with the binary numbers 000 through 111: (a) equally spaced; (b) closely spaced for small amplitudes and farther apart for large ones.]
To send speech more efficiently than this, we need to examine the characteristics both of speech and of hearing. After all, we require only enough accuracy of transmission to convince the hearer that transmission is good enough.
There have been many efforts to encode speech efficiently merely on the basis of an examination of the speech wave. None has been highly effective. One may note that the speech wave doesn't ordinarily change much from sample to sample. This has led to the transmission of differences between successive samples rather than to transmission of samples themselves.
Figure VII-3 shows the wave forms of several speech sounds, that is, how the pressure of the sound wave or the voltage representing it in a communication system varies with time.
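The companding scheme of Figure VII-2b can be tried numerically. The sketch below, in Python, assumes a logarithmic compression curve of the kind commonly called mu-law, with a constant of 255; the text does not say which curve is meant, so both the curve and the constant are assumptions, as are the function names.

```python
import math

MU = 255.0  # assumed compression constant; not taken from the text

def compress(x):
    """Squeeze the high amplitudes together at the transmitter (x in [-1, 1])."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Undo the compression at the receiver."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(x, bits):
    """Nearest of 2**bits equally spaced levels on [-1, 1]."""
    step = 2.0 / (2 ** bits - 1)
    return -1.0 + round((x + 1.0) / step) * step

def uniform(x, bits):
    """Equally spaced levels, as in Figure VII-2a."""
    return quantize(x, bits)

def companded(x, bits):
    """Compress, quantize with equal spacing, then expand."""
    return expand(quantize(compress(x), bits))

for x in (0.01, 0.1, 0.9):
    err_uniform = abs(uniform(x, 7) - x) / x
    err_companded = abs(companded(x, 7) - x) / x
    print(x, "fractional error:", round(err_uniform, 4), round(err_companded, 4))
```

With the same 7 binary digits per sample, the companded levels give a much smaller fractional error for small signals, at the cost of a somewhat larger one for the largest signals, which is the trade the ear is said to tolerate.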
We see that many of the wave forms, and especially those for the vowels (a through d), repeat over and over almost exactly. Couldn't we perhaps transmit just one complete period of variation and use it to replace several succeeding periods? This is very difficult, for it is hard for a machine to determine just how long a period is in actual speech. It has been tried. The speech reproduced is intelligible but seriously distorted.
[Fig. VII-3. Wave forms of several speech sounds, plotted against time in seconds from 0 to 0.02.]
If speech is to be encoded efficiently, a much more fundamental approach is required. We must know how great a variety of speech sounds must be transmitted and how effective our sense of hearing is in distinguishing among speech sounds.
The fluctuations of air pressure which constitute the sounds of speech are very rapid indeed, of the order of thousands per second. Our voluntary control over our vocal tracts is exercised at a much lower rate. At the most, we change the manner of production of sounds a few tens of times a second. Thus, speech may well be (and is) simpler than we might conclude by examining the rapidly fluctuating sound waves of speech.
What control do we exercise over our vocal organs? First of all, we control the production of voiced sounds by our control over our vocal cords. These are two lips or folds of muscular tissue attached to a cartilaginous box called the larynx, which is prominent in man as the Adam's apple. When we are not giving voice to sound, these are wide open. They can be drawn together more or less tightly, so that when air from the lungs is forced through them they emit a sound something like a Bronx cheer. If they are held very tight, the sound has a high pitch; if they are more relaxed, the sound has a lower pitch.
The pulses of air passing the vocal cords contain many frequencies. The mouth and lips act as a complex resonator which emphasizes certain frequencies more than others.
What frequencies are emphasized depends on how much and at what position the tongue is raised or humped in the mouth, on whether the soft palate opens the nasal cavities to the mouth and throat, and on the opening of the jaws and the position of the lips. Particular sounds of voiced speech, which includes vowels and other continuants, such as m and r, are formed by exciting the vocal cords and giving particular characteristic shapes to the mouth.
Stop consonants, or plosives, such as p, b, g, t, are formed by stopping off the vocal passage at various points with the tongue or lips, creating an air pressure, and suddenly releasing it. The vocal cords are used in producing some of these sounds (b, for instance) and not in producing others (p, for instance).
Fricatives, such as s and sh, are produced by the passage of air through various constrictions. Sometimes the vocal cords are used as well (in a zh sound, as in azure).
A specification of the movements of the vocal organs would be much more slowly changing than a description of the sound produced. May this not be a clue to efficient encoding of speech?
In the early thirties, long before Shannon's work on information theory, Homer Dudley of the Bell Laboratories invented such a form of speech transmission, which he called the vocoder (from voice coder). The transmitting (analyzer) and receiving (synthesizer) units of a vocoder are illustrated in Figure VII-4.
In the analyzer, an electrical replica of the speech is fed to 16 filters, each of which determines the strength of the speech signal in a particular band of frequencies and transmits a signal to the synthesizer which gives this information. In addition, an analysis is made to determine whether the sound is voiceless (s, f) or voiced (o, u) and, if voiced, what the pitch is.
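The filter part of the analyzer can be imitated, very crudely, by measuring how much energy a short stretch of signal has in each of 16 frequency bands. The Python sketch below is an assumption-laden illustration and not Dudley's circuit: it uses a directly computed discrete Fourier transform in place of electrical filters, equal-width bands, and an assumed top frequency of 3,200 cycles per second; the function name and parameters are invented. It omits the voiced-or-voiceless and pitch analysis altogether.

```python
import cmath
import math

def band_energies(frame, sample_rate, n_bands=16, top_hz=3200.0):
    """Energy of one frame in n_bands equal-width bands from 0 to top_hz."""
    n = len(frame)
    energies = [0.0] * n_bands
    width = top_hz / n_bands                 # width of each band in cycles per second
    for k in range(1, n // 2):               # skip the zero-frequency term
        freq = k * sample_rate / n
        if freq >= top_hz:
            break
        # One term of the discrete Fourier transform, computed directly.
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energies[int(freq // width)] += abs(coeff) ** 2
    return energies

# A 20-millisecond frame of a pure 1,000-cycle tone sampled 8,000 times a second.
rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(160)]
energies = band_energies(frame, rate)
print("strongest band:", energies.index(max(energies)))
```

Sixteen such numbers per frame, renewed a few tens of times a second, change far more slowly than the thousands of samples per second from which they are computed.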
At the synthesizer, if the sound is voiceless, a hissing noise is produced; if the sound is voiced, a sequence of electrical pulses is produced at the proper rate, corresponding to the puffs of air passing the vocal cords of the speaker. The hiss or pulses are fed to an array of filters, each passing a band of frequencies corresponding to a particular filter in the analyzer. The amount of sound passing through a particular filter in the synthesizer is controlled by the output of the corresponding analyzer filter so as to be the same as that which the analyzer filter indicates to be present in the voice in that frequency range.
This process results in the reproduction of intelligible speech. In effect, the analyzer listens to and analyzes speech, and then instructs the synthesizer, which is an artificial speaking machine, how to say the words all over again with the very pitch and accent of the speaker.
Most vocoders have a strong and unpleasant electrical accent. The study of this has led to new and important ideas concerning what determines and influences speech quality; we cannot afford time to go into this matter here. Even imperfect vocoders can be very useful. For instance, it is sometimes necessary to resort to enciphered speech transmission. If one merely directly reduces speech to binary digits by pulse code modulation, 30,000 to 60,000 binary digits per second must be sent. By using a vocoder, speech can be sent with around 1,500 binary digits per second.
The sort of vocoder described sends information concerning from 10 to 30 frequency bands (16 in the example of Figure VII-4). Speech sounds actually have only a few very prominent frequency ranges called formants. These correspond to the resonances of the vocal tract. One can recreate intelligible speech by sending information concerning the location and intensity of two or three formants.
Such a formant tracking vocoder can be used to transmit speech with even fewer binary digits per second than the channel vocoder of Figure VII-4 needs.
In an even more economical and less intelligible vocoder, called the phoneme vocoder, the analyzer recognizes a number of basic voice sounds called phonemes and instructs the synthesizer to speak these.
For ordinary telephone use, vocoder quality is scarcely adequate. The unnatural sound of the channel vocoder appears to be associated with the failure of the sound generator of the synthesizer to adequately follow pitch, changes from voiced to voiceless sound, and other qualities of the excitation of the speaker's vocal tract. By sending a band a few hundred cycles wide of the speech to be recreated and distorting this at the synthesizer, a more satisfactory source of sound with which to feed the synthesizer filters is obtained. Such a voice-excited vocoder sounds almost as good as regular telephone speech and takes only one-half as much channel capacity to transmit. The cost of the vocoder equipment would preclude its use on any but long and expensive communication circuits, such as transatlantic telephone cables.
Let us consider the vocoder for a moment before leaving it. We note that transmission of voice using even the most economical of vocoders takes many more binary digits per word than transmission of English text. Partly, this is because of the technical difficulties of analyzing and encoding speech as opposed to print. Partly, it is because, in the case of speech, we are actually transmitting information about speech quality, pitch, stress, and accent as well as such information as there is in text. In other words, the entropy of speech is somewhat greater per word than the entropy of text.
That the vocoder does encode speech more efficiently than other methods depends on the fact that the configuration of the vocal tract changes less rapidly than the fluctuations of the sound waves which the vocal tract produces. Its effectiveness also depends on limitations of the human sense of hearing.
From an electrical point of view, the most complicated speech sounds are the hissing fricatives, such as sh (f of Figure VII-3) and s (g of Figure VII-3). Furthermore, the wave forms of two s's uttered successively may have quite a different sequence of ups and downs. It would take many binary digits per second to transmit each in full detail. But, to the ear, one s sounds just like another if it has in a broad way the same frequency content. Thus, the vocoder doesn't have to reproduce the s sound the speaker uttered; it has merely to reproduce an s sound that has roughly the same frequency content and hence sounds the same.
We see that, in transmitting speech, the royal road to efficient encoding appears to be the detection of certain simple and important patterns and their recreation at the receiving end. Because of the greater channel capacity required, efficient encoding is even more important in TV transmission than in speech transmission. Can we perhaps apply a similar principle in TV?
The TV problem is much more difficult than the speech transmission problem. Partly, this is because the sense of sight is inherently more detailed and discriminating than the sense of hearing. Partly, though, it is because many sorts of pictures from many sources are transmitted by TV, while speech is all produced by the same sort of vocal apparatus.
In the face of these facts, is some vocoder-like way of transmitting pictures possible if we confine ourselves to one sort of picture source, for instance, the human face?
One can conceive of such a thing. Imagine that we had at the receiver a sort of rubbery model of a human face.
Or we might have a description of such a model stored in the memory of a huge electronic computer. First, the transmitter would have to look at the face to be transmitted and "make up" the model at the receiver in shape and tint. The transmitter would also have to note the sources of light and reproduce these in intensity and direction at the receiver. Then, as the person before the transmitter talked, the transmitter would have to follow the movements of his eyes, lips and jaws, and other muscular movements and transmit these so that the model at the receiver could do likewise. Such a scheme might be very effective, and it could become an important invention if anyone could specify a useful way of carrying out the operations I have described. Alas, how much easier it is to say what one would like to do (whether it be making such an invention, composing Beethoven's tenth symphony, or painting a masterpiece on an assigned subject) than it is to do it.
In our day of unlimited science and technology, people's unfulfilled aspirations have become so important to them that a special word, popular in the press, has been coined to denote such dreams. That word is breakthrough. More rarely, it may also be used to describe something, usually trivial, which has actually been accomplished.
If we turn from such dreams of the future, we find that all actual picture-transmission systems follow a common pattern. The picture or image to be transmitted is scanned to discover the brightness at successive points. The scanning is carried out along a sequence of closely spaced lines. In color TV, three images of different colors are scanned simultaneously. Then, at the receiver, a point of light whose intensity varies in accord with the signal from the transmitter paints out the picture in light and shade, following the same line pattern.
So far all practical attempts at efficient encoding have started out with the signal generated by such a scanning process.
The outstanding efficient encoding scheme is that used in color TV. The brightness of a color TV picture has very fine detail; the pattern of color has very much less detail. Thus, color TV of almost the same detail as monochrome TV can be sent over the same channel as is used for monochrome. Of course, color TV uses an analog signal; the picture is not reduced to discrete on-or-off pulses.
A proposed method for the efficient encoding of monochrome TV is to send the slow variations of the signal in great detail and the fast variations either less accurately or only intermittently, as they occur. There is a good deal of controversy as to how effective this is.
In TV, a complete picture is sent every 1/30 second in order to avoid flicker. In motion pictures a new picture is used every 1/24 second, but, in order to avoid flicker, it is turned on and off by a shutter several times before the next picture is substituted. In the case of many subjects, such as a face, a new picture every 1/10 second would be sufficient if flicker could be avoided by showing it several times. This would require repeatedly storing a length of signal corresponding to a complete picture at the receiver. At present, this seems too expensive to do, but such a scheme might cut down the required number of binary digits per second by a factor of 3.
Suppose that the voltage of the picture signal varied with time as shown in a of Figure VII-5. A great many samples, also shown, might be used to represent it. Instead, couldn't we perhaps use a number of straight lines to approximate the picture signal, as in b of Figure VII-5? Then we would send only the heights of the end points of the lines (h1 to h9 in the figure) and the distances between the end points of the lines (l1 to l8 in the figure). This is quite an old idea.
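One way the straight-line idea might be tried is sketched below in Python. It is only an illustration under assumptions of my own: the lines are chosen by a simple greedy rule that stretches each line as far as it can go while every sample it replaces stays within a stated tolerance, and the names `fit_segments`, `rebuild`, and `tol` are invented for the purpose. Nothing here is claimed to be the method used in the experiments mentioned.

```python
def _fits(samples, a, b, tol):
    """True if every sample strictly between a and b lies within tol of the line a-b."""
    for i in range(a + 1, b):
        line = samples[a] + (samples[b] - samples[a]) * (i - a) / (b - a)
        if abs(samples[i] - line) > tol:
            return False
    return True

def fit_segments(samples, tol):
    """Greedy straight-line fit; returns (heights of end points, distances between them)."""
    heights = [samples[0]]
    distances = []
    start = 0
    n = len(samples)
    while start < n - 1:
        end = start + 1
        # Stretch the line while all the samples it replaces stay within tol.
        while end + 1 < n and _fits(samples, start, end + 1, tol):
            end += 1
        heights.append(samples[end])
        distances.append(end - start)
        start = end
    return heights, distances

def rebuild(heights, distances):
    """Redraw the samples from the end-point heights and distances."""
    out = [heights[0]]
    for h0, h1, d in zip(heights, heights[1:], distances):
        out.extend(h0 + (h1 - h0) * j / d for j in range(1, d + 1))
    return out

samples = [0, 1, 2, 3, 4, 3, 2, 1, 0]
heights, distances = fit_segments(samples, tol=0.01)
print(heights, distances)
print(rebuild(heights, distances))
```

For a signal made up of long, nearly straight stretches, a handful of heights and distances replaces a great many samples; for a signal full of fine detail, the lines become as numerous as the samples and nothing is gained.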
It has been tried experimentally recently, but there is little agreement as to how effective it is.
[Fig. VII-5. (a) The picture signal with its many samples; (b) the same signal approximated by straight lines.]
We may remember that, in transmitting speech by pulse code modulation, it is effective to assign closely spaced amplitudes or levels of quantization for small signals and more widely spaced levels for large signals. This is not effective in the case of picture transmission, for fine detail, as the texture of hair or cloth, may occur in either the dark or the bright part of the picture, that is, at either high or low signal levels. However, it is not necessary to reproduce large changes in light intensity as accurately as small changes. Thus, if we send the differences in amplitude of successive samples, we can use closely spaced levels of quantization for small differences (as in hair) and coarsely spaced levels for large differences and get a saving similar to that attained in speech transmission. By using a refined form of this scheme, in which one can choose to send the difference from an already transmitted sample either just above or just to the left of the sample to be sent, one can do almost as well with 3 binary digits per sample as one can with 7 binary digits per sample if the amplitude of each sample is encoded and sent separately.
Reviewing what has been said, we see that there are three important principles in encoding signals efficiently: (1) Don't encode the signal one sample or one character at a time; encode a considerable stretch of a signal at a time (hyperquantization); (2) take into account the limitations on the source of the signal; (3) take into account any inabilities of the eye or the ear to detect errors in a reconstruction of the signal.
The vocoder illustrates these principles excellently. The fine temporal structure of the speech wave is not examined in detail.
Instead, a description specifying the average intensities over certain ranges of frequencies is transmitted, together with a signal which tells whether the speech is voiced or unvoiced and, if it is voiced, what its pitch is. This description of a signal is efficient because the vocal organs don't change position rapidly in producing speech. At the receiver, the vocoder generates a speech signal which doesn't resemble the original speech signal in fine detail but sounds like the original speech signal, because of the natural limitations of our hearing.
The vocoder is a sort of paragon of efficient transmission devices. Next perhaps comes color TV, in which the variations of color over the picture are defined much less sharply than variations of intensity are. This takes advantage of the eyes' inability to see fine detail in color patterns.
Beyond this, the present art of communication has had to make use of means which, because they do not encode long stretches of signal at a time, must, according to communication theory, be rather inefficient.
Still, efficient encoding is potentially important. This is especially so in the case of the transmission of relatively broad-band signals (TV or even voice signals) over very expensive circuits, such as transoceanic telephone cables. No doubt much ingenuity will be spent in efficient encoding in the future, and many startling results will be attained. But we should perhaps beware of going too far.
Imagine, for instance, that we send English text letter by letter. If we make an error in sending a few letters we can still make some sense out of the text:
Here I hove reploced a few vowols by o.
We can even replace the vowels by x's and read with some facility:
Hxrx X hxvx rxplxcxd thx vxwxls bx x.
It is more efficient to encode English text word by word. In this case, if an error is made in transmission, we are not tipped off by finding a misspelled word. Instead, one word is replaced by another.
This might have embarrassing results. Suppose it changed "The President is a good Republican" to "The President is a good Communist" (or donkey, or poltroon, or many other nouns). We might still detect an error by the fact that the word was inappropriate. But suppose we used a more refined encoding scheme that could reproduce grammatical utterances only. Then we would have little chance of detecting an error in transmission.

English text, and most other information sources, are redundant in that the messages they produce give many clues to the recipient. A few errors caused by replacing one letter by another don't destroy the message because we can infer it from other letters which are transmitted correctly. Indeed, it is only because of this redundancy that anyone can read my handwriting. When a continuous signal is sent a sample at a time, a few errors in sample amplitude result in a few clicks in sound transmission or in a few specks in picture transmission.

Our ideal so far has been to remove this redundancy, so that we transmit the absolutely minimum number of clues by means of which the message can be reconstructed. But we see that if we do this with perfect success, any error in transmission will send, not a distorted message, but a false and misleading message. If we fall a little short of the ideal, an error may produce merely a terrible garble.

We all know that there is some noise in electrical communication: a hiss in the background on radio and a little snow, at least, in TV. That such noise is an inevitable fact of nature we must accept. Is this going to vitiate in principle our grand plan to encode the messages from a signal source into scarcely more binary digits than the entropy of the source? This is the subject that we will consider in the next chapter.

CHAPTER VIII

The Noisy Channel

It is hard to put oneself in the place of another, and, especially, it is hard to put oneself in the place of a person of an earlier day.
What would a Victorian have thought of present-day dress? Were Newton's laws of motion and of gravitation as astonishing and disturbing to his contemporaries as Einstein's theory of relativity appears to have been to his? And what is disturbing about relativity? Present-day students accept it, not only without a murmur, but with a feeling of inevitability, as if any other idea must be very odd, surprising, and inexplicable. Partly, this is because our attitudes are bred of our times and surroundings. Partly, in the case of science at least, it is because ideas come into being as a response to new or better-phrased questions. We remember that according to Plato, Socrates drew a geometrical proof from a slave simply by means of an ingenious sequence of questions. Those who have not seriously asked themselves a particular question are not likely to have come upon the proper answer, and, sometimes, when the question is phrased with the answer in mind, the answer appears to be obvious.

Those interested in communication have been aware from the very beginning that communication circuits or channels are imperfect. In telephony and radio, we hear the desired signal against a background of noise, which may be strong or faint and which may vary in quality from the crackling of static to a steady hiss. In TV, the picture is overlaid faintly or strongly with an ever-changing granular "snow." In teletypewriter transmission, the received character may occasionally differ from that transmitted.

Suppose that one had questioned a communication engineer about this general problem of "noise" in 1945. One might have asked, "What can one do about noise?" The engineer might have answered, "You can increase the transmitter power or make the receiver less noisy. And be sure that the receiver is insensitive to disturbances with frequencies other than the signal frequencies."

One might have persisted, "Can't one do anything else?"
The engineer might have answered, "Well, by using frequency modulation, which takes a very large band width, one can reduce the effect of noise."

Suppose, however, that one had asked, "In teletypewriter systems, noise may cause some received characters to be wrong; how can one guard against this?" The engineer could and might perhaps have answered, "I know that if I use five off-or-on pulses to represent a decimal digit and assign to the decimal digits only such sequences as all have two ons and three offs, I can often tell when an error has been made in transmission, for when errors are made the received sequence may have other than 2 ons."

One might have pursued the matter further with, "If the teletypewriter circuit does cause errors, is there any way that one can get the correct message to the destination?" The engineer might have answered, "I suppose you can if you repeat it enough times, but that's very wasteful. You'd better fix the circuit."

Here we are getting pretty close to questions that just hadn't been asked before Shannon asked them. Nonetheless, let us go on and imagine that one had said, "Suppose that I told you that by properly encoding my message, I can send it over even a noisy channel with a completely negligible fraction of errors, a fraction smaller than any assignable value. Suppose that I told you that, if the sort of noise in the channel is known and if its magnitude is known, I can calculate just how many characters I can send over the channel per second and that, if I send any number fewer than this, I can do so virtually without error, while if I try to send more, I will be bound to make errors."

The engineer might well have answered, "You'd sure have to show me. I never thought of things in quite that way before, but what you say seems extremely improbable. Why, every time the noise increases, the error rate increases.
Of course, repeating a message several times does work better when there aren't too many errors. But, it is always very costly. Maybe there's something in what you say, but I'd be awfully surprised if there was. Still, the way you put it . . ."

Whatever we may imagine concerning an engineer benighted in the days of error, mathematicians and engineers who have survived the transition all feel that Shannon's results concerning the transmission of information over a noisy channel were and still are very surprising. Yet I have known an intelligent layman to see nothing remarkable in Shannon's results. What is one to think of this? Perhaps the best course is merely to describe and explain the problem of the noisy channel as we now understand it, raising and answering questions that, however natural and inevitable they now seem, belong in their trend and content to the post-Shannon era. The reader can be surprised or not as he chooses.

So far we have discussed both simple and complex means for encoding text and numbers for efficient transmission. We have noted further that any electrical signal of limited band width W can be represented by 2W amplitudes or samples per second, measured or taken at intervals 1/2W seconds apart. We have seen that, by means of pulse code modulation, we can use some number, around 7, of binary digits to represent adequately the amplitude of any sample. Thus, by using pulse code modulation or some more complicated and more efficient scheme, we can transmit speech or picture signals by means of a sequence of binary digits or off-or-on or positive-or-negative pulses of current.

All of this works perfectly if the recipient of the message receives the same signal that the sender transmits. The actual facts are different. Sometimes he receives a 0 when a 1 is transmitted, and sometimes he receives a 1 when a 0 is transmitted. This can happen through the malfunction of electrical relays in a slow-speed telegraph circuit or through the malfunction of vacuum tubes or transistors in a higher speed circuit. It can also happen because of interfering signals or noise, either noise from man-made apparatus, or noise from magnetic storms.

We can easily see in a simple case how errors can occur because of the admixture of noise with a signal. Imagine that we want to send a large number of binary digits, 0 or 1, per second over a wire by means of an electrical signal. We may represent the signal conveying these digits by the succession of samples s of Figure VIII-1, each of which will be +1 or -1. Here we have a succession of positive and negative voltages which represent the digits 101110010. Now suppose a random noise voltage, which may be either positive or negative, is added to the signal. We can represent this also by a number of noise samples n of Figure VIII-1 taken simultaneously with the signal samples. The signal plus the noise is obtained by adding the signal and the noise samples and is shown as s + n in Figure VIII-1.

If we interpret a positive signal-plus-noise in the received message as a 1 and a negative signal-plus-noise as a 0, then the received message will be represented by the digits r of Figure VIII-1. Thus, errors in transmission, as indicated, occur in positions 2, 3, and 7.

[Fig. VIII-1: signal samples s, noise samples n, signal plus noise s + n, received digits r, and errors marked by position, 1 through 9.]

The effect of such errors in transmission can range from annoying to dangerous. In speech or picture transmission by means of simple coding schemes, they result in clicks, hissing noises, or "snow." If more efficient block encoding schemes (hyperquantization) are used, the effects of errors will be more pronounced. In general, however, we may expect the most dangerous effects of errors in the transmission of text.
In the transmission of English text by conventional means, errors merely put a wrong letter in here and there. The text is so redundant that we catch such errors by eye. However, when type is set remotely by teletypewriter signals, as it is, for instance, in the simultaneous printing of news magazines in several parts of the country, even errors of this sort can be costly.

When numbers are sent, errors are much more serious. An error might change $1,000 into $9,000. If the error occurred in a program intended to make an electronic computer carry out a complicated calculation, the error could easily cause the whole calculation to be meaningless.

Further, we have seen that, if we encode English text or any other signal very efficiently, so as largely to remove the redundancy, an error can cause a gross change in the meaning of the received signal.

When errors are very important to us, how indeed may we guard against them? One way would be to send every letter twice or to send every binary digit used in transmitting a letter or a number twice. Thus, in transmitting the binary sequence 101001101, we might send and receive as follows:

sent:      11 00 11 00 00 11 11 00 11
received:  11 00 11 00 01 11 11 00 11
                       X error (fifth pair)

For a given rate of sending binary digits, this will cut our rate of transmitting information in half, for we have to pause and retransmit every digit. However, we can now see from the received signal that an error has occurred at the marked point, because instead of a pair of like digits, 0 0 or 1 1, we have received a pair of unlike digits, 0 1. We don't know whether the correct, transmitted pair was 0 0 or 1 1. We have detected the error, but we have not corrected it.
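The doubling scheme just described can be tried out in a few lines. What follows is only a minimal sketch, assuming digits are held as a string of the characters 0 and 1; the function names `send_twice` and `unlike_pairs` are invented for this illustration and come from no library.

```python
def send_twice(message):
    """Repeat every binary digit of the message once, as in the text."""
    return "".join(digit + digit for digit in message)


def unlike_pairs(received):
    """Return the 1-based numbers of the pairs whose two digits differ.

    An unlike pair reveals that an error occurred in it, but not which
    of its two digits is the wrong one.
    """
    return [
        i // 2 + 1
        for i in range(0, len(received) - 1, 2)
        if received[i] != received[i + 1]
    ]


sent = send_twice("101001101")       # 110011000011110011
received = "110011000111110011"      # one digit changed in transit
print(sent)
print(unlike_pairs(received))        # [5]
```

The sketch can only flag the fifth pair; it cannot tell whether 0 0 or 1 1 was intended, which is exactly the limitation noted above.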
If errors aren't too frequent, that is, if the chance of two errors occurring in the transmission of three successive digits is negligible, we can correct as well as detect an error by transmitting each digit three times, as follows:

sent:      111 000 111 000 000 111 111
received:  111 000 101 000 000 111 111
                   X error (third group)

We have now cut our rate of transmission to one-third, because we have to pause and retransmit each digit twice. However, we can now correct the error indicated by the fact that the digits in the indicated group 101 are not all the same. If we assume that there was only one error in the transmission of this group of digits, then the transmitted group must have been 111, representing 1, rather than 000, representing 0.

We see that a very simple scheme of repeating transmitted digits can detect or even correct infrequent errors of transmission. But how costly it is! If we use this means of error correction or detection, even when almost all of the transmitted digits are correct we have to cut our rate of transmission in half by repeating digits in order just to detect errors, and we have to cut our rate of transmission to one-third by transmitting each digit three times in order to get error correction. Moreover, these schemes won't work if errors are frequent enough so that more than one will sometimes occur in the transmission of two or three digits.

Clearly, this simple approach will never lead to a sound understanding of the possibility of error correction. What is required is a deep and powerful mathematical attack. This is just what Shannon provided in discovering and proving his fundamental theorem for the noisy channel. It is the course of his reasoning that we are about to follow.

In formulating an abstract and general model of noise or errors, we will deal with the case of a discrete communication system which transmits some group of characters, such as the digits from 0 to 9 or the letters of the alphabet.
For convenience, let us consider a system for transmitting the digits 0 through 9. This is illustrated in Figure VIII-2. At the left we have a number of little circles labeled with the digits; we may regard these little circles as pushbuttons. To the right we have a number of little circles, again labeled with the digits. We may regard these as lights. When we push a digit button at the transmitter to the left, some digit light lights up at the receiver to the right.

[Fig. VIII-2: transmitter buttons 0 through 9 at the left and receiver lights 0 through 9 at the right, with lines radiating from button 4 to the lights and arrows from light 6 back to the buttons.]

If our communication system were noiseless, pushing the 0 button would always light the 0 light, pushing the 1 button would always light the 1 light, and so on. However, in an imperfect or noisy communication system, pushing the 4 button, for instance, may light the 0 light, or the 1 light, or the 2 light, or any other light, as shown by the lines radiating from the 4 button in Figure VIII-2.

In a simple, noisy communication system, we can say that when we press a button the light which lights is a matter of chance, independent of what has gone before and that, if the 4 button is pressed, there is some probability $p_4(6)$ that the 6 light will light, and so on.

If the sender can't be sure which light will light when he presses a particular button, then the recipient of the message can't be sure which button was pressed when a particular light lights. This is indicated by the arrows from light 6 to various buttons on the left. If, for instance, light 6 lights, there is some probability $p_6(4)$ that button 4 was pressed, and so on. Only for a noiseless system will $p_6(6)$ be unity and $p_6(4)$, $p_6(9)$, etc., be zero.
The diagram of Figure VIII-2 would be too complicated if all possible arrows were put in, and the number of probabilities is too great to list, but I believe that the general idea of the degree and nature of uncertainty of the character received when the sender tries to send a particular character, and the uncertainty of the character sent when the recipient receives a particular character, have been illustrated. Let us now consider this noisy communication channel in a rather general way. In doing so we will represent by x all of the characters sent and by y all of the characters received.

The characters x are just the characters generated by the message source from which the message comes. If there are m of these characters and if they occur independently with probabilities p(x), then we know from Chapter V that the entropy H(x) of the message source, the rate at which the message source generates information, must be

$$H(x) = -\sum_{x=1}^{m} p(x) \log p(x) \tag{8.1}$$

We can regard the output of the device, which we designate by y, as another message source. The number of lights need not be equal to the number of buttons, but we will assume that it is, so that there are m lights. The entropy of the output will be

$$H(y) = -\sum_{y=1}^{m} p(y) \log p(y) \tag{8.2}$$

We note that while H(x) depends only on the input to the communication channel, H(y) depends both on the input to the channel and on the errors made in transmission. Thus, the probability of receiving a 4 if nothing but a 4 is ever sent is different from the probability of receiving a 4 if transmitting buttons are pressed at random.

If we imagine that we can see both the transmitter and the receiver, we can observe how often certain combinations of x and y occur; say, how often 4 is sent and 6 is received. Or, knowing the statistics of the message source and the statistics of the noisy channel, we can compute such probabilities. From these we can compute another entropy.
$$H(x, y) = -\sum_{x=1}^{m} \sum_{y=1}^{m} p(x, y) \log p(x, y) \tag{8.3}$$

This is the uncertainty of the combination of x and y.

Further, we can say, suppose that we know x (that is, we know what key was pressed). What are the probabilities of various lights lighting (as illustrated by the arrows to the right in Figure VIII-2)? This leads to an entropy,

$$H_x(y) = -\sum_{x=1}^{m} \sum_{y=1}^{m} p(x)\, p_x(y) \log p_x(y) \tag{8.4}$$

This is a conditional entropy or uncertainty. Its form is reminiscent of the entropy of a finite-state machine. As in that case, we multiply the uncertainty for a given condition (state, value of x) by the probability that that condition (state, value of x) will occur and sum over all conditions (states, values of x).

Finally, suppose we know what light lights. We can say what the probabilities are that various buttons were pressed. This leads to another conditional entropy

$$H_y(x) = -\sum_{y=1}^{m} \sum_{x=1}^{m} p(y)\, p_y(x) \log p_y(x) \tag{8.5}$$

This is the sum over y of the probability that y is received times the uncertainty that x is sent when y is received.

These conditional entropies depend on the statistics of the message source, because they depend on how often x is transmitted or how often y is received, as well as on the errors made in transmission.

The entropies listed above are best interpreted as uncertainties involving the characters generated by the message source and the characters received by the recipient. Thus:

$H(x)$ is the uncertainty as to x, that is, as to which character will be transmitted.

$H(y)$ is the uncertainty as to which character will be received in the case of a given message source and a given communication channel.

$H(x, y)$ is the uncertainty as to when x will be transmitted and y received.

$H_x(y)$ is the uncertainty of receiving y when x is transmitted. It is the average uncertainty of the sender as to what will be received.

$H_y(x)$ is the uncertainty that x was transmitted when y is received.
It is the average uncertainty of the message recipient as to what was actually sent.

There are relations among these quantities:

$$H(x, y) = H(x) + H_x(y) \tag{8.6}$$

That is, the uncertainty of sending x and receiving y is the uncertainty of sending x plus the uncertainty of receiving y when x is sent.

$$H(x, y) = H(y) + H_y(x) \tag{8.7}$$

That is, the uncertainty of receiving y and sending x is the uncertainty of receiving y plus the uncertainty that x was sent when y was received.

We see that when $H_x(y)$ is zero, $H_y(x)$ must be zero, and $H(y)$ is then just $H(x)$. This is the case of the noiseless channel, for which the entropy of the received signal is just the same as the entropy of the transmitted signal. The sender knows just what will be received, and the recipient of the message knows just what was sent.

The uncertainty as to which symbol was transmitted when a given symbol is received, that is, $H_y(x)$, seems a natural measure of the information lost in transmission. Indeed, this proves to be the case, and the quantity $H_y(x)$ has been given a special name; it is called the equivocation of the communication channel. If we take $H(x)$ and $H_y(x)$ as entropies in bits per second, the rate R of transmission of information over the channel can be shown to be, in bits per second,

$$R = H(x) - H_y(x) \tag{8.8}$$

That is, the rate of transmission of information is the source rate or entropy less the equivocation. It is the entropy of the message as sent less the uncertainty of the recipient as to what message was sent.

The rate is also given by

$$R = H(y) - H_x(y) \tag{8.9}$$

That is, the rate is the entropy of the received signal y less the uncertainty that y was received when x was sent. It is the entropy of the message as received less the sender's uncertainty as to what will be received.

The rate is also given by

$$R = H(x) + H(y) - H(x, y) \tag{8.10}$$

The rate is the entropy of x plus the entropy of y less the uncertainty of occurrence of the combination x and y.
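The five entropies and the relations 8.6 through 8.10 can be checked numerically. The following is a sketch only: it assumes a small hypothetical channel with three characters rather than ten, the probabilities are made up for illustration, logarithms are taken to base 2 so that the results are in bits per character, and the function names are not from any library.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of a list of probabilities (zero terms contribute nothing)."""
    return -sum(p * log2(p) for p in probs if p > 0)


def channel_entropies(p_x, p_y_given_x):
    """Compute H(x), H(y), H(x,y), H_x(y) and H_y(x) as in 8.1 through 8.5.

    p_x[x] is the probability that character x is sent;
    p_y_given_x[x][y] is p_x(y), the probability that y is received when x is sent.
    """
    n_in, n_out = len(p_x), len(p_y_given_x[0])
    joint = [[p_x[x] * p_y_given_x[x][y] for y in range(n_out)] for x in range(n_in)]
    p_y = [sum(joint[x][y] for x in range(n_in)) for y in range(n_out)]
    h_x = entropy(p_x)                                    # 8.1
    h_y = entropy(p_y)                                    # 8.2
    h_xy = entropy([p for row in joint for p in row])     # 8.3
    h_y_given_x = sum(p_x[x] * entropy(p_y_given_x[x]) for x in range(n_in))  # 8.4
    h_x_given_y = -sum(                                   # 8.5, using p_y(x) = p(x,y)/p(y)
        joint[x][y] * log2(joint[x][y] / p_y[y])
        for x in range(n_in)
        for y in range(n_out)
        if joint[x][y] > 0
    )
    return h_x, h_y, h_xy, h_y_given_x, h_x_given_y


# Hypothetical source and channel, chosen only for illustration.
p_x = [0.5, 0.3, 0.2]
noisy = [[0.90, 0.05, 0.05],
         [0.10, 0.80, 0.10],
         [0.00, 0.20, 0.80]]

h_x, h_y, h_xy, h_y_given_x, h_x_given_y = channel_entropies(p_x, noisy)
rate_8_8 = h_x - h_x_given_y
rate_8_9 = h_y - h_y_given_x
rate_8_10 = h_x + h_y - h_xy
print(rate_8_8, rate_8_9, rate_8_10)   # the three agree up to rounding
```

For a noiseless channel (each row of the table having a single 1 on the diagonal) the same function gives zero for both conditional entropies, and the rate reduces to the entropy of the source.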
We will note from 8.3 that, for a noiseless channel, since $p(x, y)$ is zero except when x = y, $H(x, y) = H(x) = H(y)$. The information rate is just the entropy of the information source, $H(x)$.

Shannon makes expression 8.8 for the rate plausible by means of the sketch shown in Figure VIII-3. Here we assume a system in which an observer compares transmitted and received signals and then sends correction data by means of which the erroneous received signal is corrected. Shannon is able to show that in order to correct the message, the entropy of the correction signal must be equal to the equivocation.

[Fig. VIII-3: a correction system in which an observer compares the transmitted and received signals and sends correction data.]

We see that the rate R of relation 8.8 depends both on the channel and on the message source. How can we describe the capacity of a noisy or imperfect channel for transmitting information? We can choose the message source so as to make the rate R as large as possible for a given channel. This maximum possible rate of transmission for the channel is called the channel capacity C. Shannon's fundamental theorem for a noisy channel involves the channel capacity C. It says:

Let a discrete channel have a capacity C and a discrete source the entropy per second H. If $H \le C$ there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If $H > C$ it is possible to encode the source so that the equivocation is less than $H - C + \epsilon$, where $\epsilon$ is arbitrarily small. There is no method of encoding which gives an equivocation less than $H - C$.

This is a precise statement of the result which so astonished engineers and mathematicians. As errors in transmission become more probable, that is, as they occur more frequently, the channel capacity as defined by Shannon gradually goes down.
For instance, if our system transmits binary digits and if some are in error, the channel capacity C, that is, the number of bits of information we can send per binary digit transmitted, decreases. But the channel capacity decreases gradually as the errors in transmission of digits become more frequent. To achieve transmission with as few errors as we may care to specify, we have to reduce our rate of transmission so that it is equal to or less than the channel capacity.

How are we to achieve this result? We remember that in efficiently encoding an information source, it is necessary to lump many characters together and so to encode the message a long block of characters at a time. In making very efficient use of a noisy channel, it is also necessary to transmit and interpret blocks of received characters, each many characters long. Among such blocks, only certain transmitted and received sequences of characters will occur with other than a vanishing probability.

In proving the fundamental theorem for a noisy channel, Shannon finds the average frequency of error for all possible codes (for all associations of particular input blocks of characters with particular output blocks of characters), when the codes are chosen at random, and he then shows that when the channel capacity is greater than the entropy of the source, the error rate averaged over all of these encoding schemes goes to zero as the block length is made very long. If we get this good a result by averaging over all codes chosen at random, then there must be some one of the codes which gives this good a result. One information theorist has characterized this mode of proof as weird. It is certainly not the sort of attack that would occur to an uninspired mathematician. The problem isn't one which would have occurred to an uninspired mathematician, either.

The foregoing work is entirely general, and hence it applies to all problems.
I think it is illuminating, however, to return to the example of the binary channel with errors, which we discussed early in this chapter and which is illustrated in Figure VIII-1, and see what Shannon's theorem has to say about this simple and common case.

Suppose that the probability that over this noisy channel a 0 will be received as a 0 is equal to the probability p that a 1 will be received as a 1. Then the probability that a 1 will be received as a 0 or a 0 as a 1 must be (1 - p). Suppose further that these probabilities do not depend on past history and do not change with time. Then, the proper abstract representation of this situation is a symmetric binary channel (in the manner of Figure VIII-2) as shown in Figure VIII-4.

Because of the symmetry of this channel, the maximum information rate, that is, the channel capacity, will be attained for a message source such that the probability of sending a 1 is equal to the probability of sending a zero. Thus, in the case of x (and, because the channel is symmetrical, in the case of y also)

$$p(1) = p(0) = \tfrac{1}{2}$$

We already know that under these circumstances

$$H(x) = H(y) = -\left(\tfrac{1}{2} \log \tfrac{1}{2} + \tfrac{1}{2} \log \tfrac{1}{2}\right) = 1 \text{ bit per symbol}$$

What about the conditional probabilities? What about the equivocation, for instance, as given by 8.5? Four terms will contribute to this conditional entropy. The sources and contributions are:

The probability that 1 is received is 1/2. When 1 is received, the probability that 1 was sent is p and the probability that 0 was sent is (1 - p). The contribution to the equivocation from these events is:

$$\tfrac{1}{2}\left(-p \log p - (1 - p) \log (1 - p)\right)$$

There is a probability of 1/2 that 0 is received. When 0 is received, the probability that 0 was sent is p and the probability that 1 was sent is (1 - p).
The contribution to the equivocation from these events is:

$$\tfrac{1}{2}\left(-p \log p - (1 - p) \log (1 - p)\right)$$

Accordingly, we see that, for the symmetrical binary channel, the equivocation, the sum of these terms, is

$$H_y(x) = -p \log p - (1 - p) \log (1 - p)$$

Thus the channel capacity C of the symmetrical binary channel is, from 8.8,

$$C = 1 + p \log p + (1 - p) \log (1 - p)$$

We should note that this channel capacity C is just unity less the function plotted against p in Figure V-1. We see that if p is 1/2, the channel capacity is 0. This is natural, for in this case, if we receive a 1, it is equally likely that a 1 or a 0 was transmitted, and the received message does nothing to resolve our uncertainty as to what digit the sender sent. We should also note that the channel capacity is the same for p = 0 as for p = 1. If we consistently receive a 0 when we transmit a 1 and a 1 when we transmit a 0, we are just as sure of the sender's intentions as if we always get a 1 for a 1 and a 0 for a 0.

If, on the average, 1 digit in 10 is in error, the channel capacity is reduced to .53 of its value for errorless transmission, and for one error in 100 digits, the channel capacity is reduced merely to .92.

The writer would like to testify at this point that the simplicity of the result we have obtained for the symmetrical binary channel is in a sense misleading (it was misleading to the writer at least). The expression for the optimum rate (channel capacity) of an unsymmetrical binary channel in which the probability that a 1 is received as a 1 is p and the probability that a 0 is received as a 0 is a different number q is a mess, and more complicated channels must offer almost intractable problems.

Perhaps for this reason as well as for its practical importance, much consideration has been given to transmission over the symmetrical binary channel. What sort of codes are we to use in order to attain errorless transmission over such a channel? Examples devised by R. W. Hamming were mentioned by Shannon in his original paper.
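The capacity formula for the symmetrical binary channel is easy to evaluate. Here is a minimal sketch, assuming logarithms to base 2 so that the capacity is in bits per binary digit; the function names are chosen for this illustration only.

```python
from math import log2


def binary_entropy(p):
    """The function -p log p - (1 - p) log (1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0          # the limit of p log p as p goes to 0 is 0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def symmetric_channel_capacity(p):
    """Capacity C = 1 + p log p + (1 - p) log (1 - p), p being the
    probability that a digit is received correctly."""
    return 1.0 - binary_entropy(p)


print(round(symmetric_channel_capacity(0.90), 2))   # 0.53, one error in 10
print(round(symmetric_channel_capacity(0.99), 2))   # 0.92, one error in 100
print(symmetric_channel_capacity(0.5))              # 0.0
```

The two rounded values reproduce the figures .53 and .92 quoted in the text, and the capacity is the same for p and for 1 - p.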
Later, Marcel J. E. Golay published concerning error-correcting codes in 1949, and Hamming published his work in 1950. We should note that these codes were devised subsequent to Shannon's work. They might, I suppose, have been devised before, but it was only when Shannon showed error-free transmission to be possible that people asked, "How can we achieve it?"

We have noted that to get an efficient correction of errors, we must encode a long block of message digits at a time. As a simple example, suppose we encode our message digits in blocks of 16 and add after each block a sequence of check digits which enable us to detect a single error in any one of the digits, message digits or check digits. As a particular example, consider the sequence of message digits 1 1 1 1 1 1 1 1 0. To find the appropriate check digits, we write the 0's and 1's constituting the message digits in the 4 by 4 grid shown in Figure VIII-5. Associated with each row and each column is a circle. In each circle is a 0 or a 1 chosen so as to make the total number of 1's in the column or row (including the circle as well as the squares) even. Such added digits are called check digits. For the particular assortment of message digits used as an example, together with the appropriately chosen check digits, the numbers of 1's in successive columns (left to right) are 2, 2, 2, 4, all being even numbers, and the numbers of 1's in successive rows (top to bottom) are 4, 2, 2, 2, which are again all even.

What happens if a single error is made in the transmission of a message digit among the 16? There will be an odd number of ones in a row and in a column. This tells us to change the message digit where the row and column intersect.

What happens if a single error is made in a check digit? In this case there will be an odd number of ones in a row or in a column. We have detected an error, but we see that it was not among the message digits.
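The row-and-column check just described can be sketched as follows. This is an illustration under stated assumptions, not the book's own example: the 16 message digits below are hypothetical, at most one error per block is assumed, and the function names are invented for the sketch.

```python
def check_digits(grid):
    """Return (row_checks, column_checks) that make every row and every
    column, check digit included, contain an even number of 1's."""
    row_checks = [sum(row) % 2 for row in grid]
    column_checks = [sum(column) % 2 for column in zip(*grid)]
    return row_checks, column_checks


def correct_single_error(grid, row_checks, column_checks):
    """Locate and repair a single error, assuming at most one occurred."""
    odd_rows = [i for i, row in enumerate(grid)
                if (sum(row) + row_checks[i]) % 2 == 1]
    odd_columns = [j for j, column in enumerate(zip(*grid))
                   if (sum(column) + column_checks[j]) % 2 == 1]
    repaired = [list(row) for row in grid]
    if len(odd_rows) == 1 and len(odd_columns) == 1:
        # Odd row and odd column: the message digit at their crossing is wrong.
        repaired[odd_rows[0]][odd_columns[0]] ^= 1
        return repaired, "message digit corrected"
    if odd_rows or odd_columns:
        # Only a row or only a column is odd: a check digit was hit.
        return repaired, "error in a check digit"
    return repaired, "no error detected"


# Hypothetical block of 16 message digits in a 4 by 4 grid.
message = [[1, 0, 1, 1],
           [0, 1, 0, 0],
           [1, 1, 0, 1],
           [0, 0, 1, 0]]
rows, columns = check_digits(message)

received = [list(row) for row in message]
received[2][1] ^= 1                      # one message digit changed in transit
repaired, status = correct_single_error(received, rows, columns)
print(status, repaired == message)       # message digit corrected True
```

With more than one error in a block the same parity tests can point to the wrong digit, which is why the sketch states the single-error assumption explicitly.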
The total number of digits transmitted for 16 message digits is 16 + 8, or 24; we have increased the number of digits needed in the ratio 24/16, or 1.5. If we had started out with 400 message digits, we would have needed 40 check digits and we would have increased the number of digits needed only in the ratio of 440/400, or 1.1. Of course, we would have been able to correct only one error in 440 rather than one error in 24.

[Fig. VIII-5: the 16 message digits in a 4 by 4 grid, with a circled check digit for each row and each column.]

Codes can be devised which can be used to correct larger numbers of errors in a block of transmitted characters. Of course, more check digits are needed to correct more errors. A final code, however we may devise it, will consist of some set of $2^M$ blocks of 0's and 1's representing all of the blocks of digits M digits long which we wish to transmit. If the code were not error correcting, we could use a block just M digits long to represent each block of M digits which we wish to transmit. We will need more digits per block because of the error-correcting feature.

When we receive a given block of digits, we must be able to deduce from it which block was sent despite some number n of errors in transmission (changes of 0 to 1 or 1 to 0). A mathematician would say that this is possible if the distance between any two blocks of the code is at least 2n + 1.

Here distance is used in a queer sense indeed, as defined by the mathematician for his particular purpose. In this sense, the distance between two sequences of binary digits is the number of 0's or 1's that must be changed in order to convert one sequence into the other. For instance, the distance between 0010 and 1111 is 3, because we can convert one sequence into the other only by changing three digits in one sequence or in the other.

We can get this distance between two binary sequences or numbers by a process called addition modulo 2, or, more usually, simply addition mod 2.
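The distance just defined, and the requirement that code blocks be at least 2n + 1 apart, can be tried out directly. The sketch below assumes code blocks are given as strings of 0's and 1's of equal length; the function names and the tiny two-block code are hypothetical, chosen only for illustration.

```python
def distance(a, b):
    """Number of digit positions in which two binary sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(digit_a != digit_b for digit_a, digit_b in zip(a, b))


def correctable_errors(code):
    """Largest n such that every pair of blocks is at least 2n + 1 apart."""
    smallest = min(distance(a, b)
                   for i, a in enumerate(code)
                   for b in code[i + 1:])
    return (smallest - 1) // 2


print(distance("0010", "1111"))            # 3
print(correctable_errors(["000", "111"]))  # 1
```

The second line shows that sending each digit three times gives two blocks a distance 3 apart, enough to correct a single error per group, in agreement with the repetition scheme discussed earlier in the chapter.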
The rules for adding binary digits mod 2 are

    0 + 0 = 0
    0 + 1 = 1
    1 + 0 = 1
    1 + 1 = 0

The following examples illustrate addition of binary numbers mod 2:

      1111            10110101
    + 0010          + 10011101
      1101            00101000
    (3 ones)          (2 ones)

We note that we can get this result by simply throwing away the 1's we would carry to the next column to the left if we were doing binary addition. We should note that the numbers in the circles in Figure VIII-5 can be obtained by addition mod 2 of the corresponding row or column.

By definition, the distance between the code groups 1111 and 0010 is 3, while the distance between the code groups 10110101 and 10011101 is 2, and we obtain this distance by counting the 1's in the sum mod 2.

When we make n errors in digits in transmission, we get a block of characters which is a distance n from that which was sent and which may be a distance n nearer to some other block of characters constituting a code group which might have been sent. If we want the received code group always to be nearer to the group that was sent than to any other code group which might have been sent, despite a change of n digits in transmission, we must see that all the code groups we use are separated by a distance of at least 2n + 1.

Thus, to correct n errors, we must find $2^M$ code groups each at a distance at least 2n + 1 from every other. If we are to have an efficient code, we must use the least possible number of digits in the groups (which will certainly be more than M). The astonishing thing is that, for quite a number of values of M and n, Slepian and other mathematicians have actually found the best codes. I won't attempt to tell how!

As a matter of fact, although the general problem of how to produce the best error-correcting code for given values of M and n has not been solved, we now have more error-correcting codes than we know what to do with.
The reason is that equipment which will make use of the longer and more efficient of these highly efficient codes is too complicated to use. Moreover, the simpler codes, which correct only one error per block, don't help in many actual cases. For instance, in transmission over telephone lines, a chief source of interference is long pulses of noise produced by the operation of various pieces of telephone apparatus. These tend to cause errors in several successive digits.

In view of this sad situation, D. W. Hagelbarger of the Bell Laboratories recently devised a method of encoding which, by using twice the number of digits in the text to be sent, corrects up to six adjacent errors and can be implemented with quite simple equipment. It might be described as an inefficient but useful method of error-correction in contrast to codes which are efficient but useless (in an engineering, not a mathematical sense).

In Chapter VII, we discussed ways of removing redundancy from a message so that it could be transmitted by means of fewer binary digits. In this chapter, we have considered the matter of adding redundancy to a nonredundant message in order to attain virtually error-free transmission over a noisy channel. The fact that such error-free transmission can be attained using a noisy channel was and is surprising to communication engineers and mathematicians, but Shannon has proved that it is necessarily so.

Prior to receiving a message over an error-free channel, the recipient is uncertain as to what particular message out of many possible messages the sender will actually transmit. The amount of the recipient's uncertainty is the entropy or information rate of the message source, measured in bits per symbol or per second. The recipient's uncertainty as to what message the message source will send is completely resolved if he receives an exact replica of the message transmitted.
A message may be transmitted by means of positive and negative pulses of current. If a strong enough noise consisting of random positive and negative pulses is added to the signal, a positive signal pulse may be changed into a negative pulse or a negative signal pulse may be changed into a positive pulse. When such a noisy channel is used to transmit the message, if the sender sends any particular symbol there is some uncertainty as to what symbol will be received by the recipient of the message.

When the recipient receives a message over a noisy channel, he knows what message he has received, but he cannot ordinarily be sure what message was transmitted. Thus, his uncertainty as to what message the sender chose is not completely resolved even on the receipt of a message. The remaining uncertainty depends on the probability that a received symbol will be other than the symbol transmitted.

From the sender's point of view, the uncertainty of the recipient as to the true message is the uncertainty, or entropy, of the message source plus the uncertainty of the recipient as to what message was transmitted when he knows what message was received. The measure which Shannon provides of this latter uncertainty is the equivocation, and he defines the rate of transmission of information as the entropy of the message source less the equivocation.

The rate of transmission of information depends both on the amount of noise or uncertainty in the channel and on what message source is connected to the channel at the transmitting end. Let us suppose that we choose a message source such that this rate of transmission which we have defined is as great as it is possible to make it. This greatest possible rate of transmission is called the channel capacity for a noisy channel. The channel capacity is measured in bits per symbol or per second.
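These definitions can be tried out numerically. The sketch below assumes the simplest case: a binary symmetric channel, in which each digit is received wrongly with probability `p_err`, fed by a source that produces independent digits with a fraction `p_one` of 1's. The function names and the search over 1,001 candidate sources are illustrative choices, not Shannon's procedure.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a choice with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def equivocation(p_one: float, p_err: float) -> float:
    """Uncertainty about the digit sent once the received digit is known."""
    p_recv_one = p_one * (1 - p_err) + (1 - p_one) * p_err
    # H(sent | received) = H(sent) + H(received | sent) - H(received)
    return (binary_entropy(p_one) + binary_entropy(p_err)
            - binary_entropy(p_recv_one))


def transmission_rate(p_one: float, p_err: float) -> float:
    """Entropy of the source less the equivocation, in bits per digit."""
    return binary_entropy(p_one) - equivocation(p_one, p_err)


# Channel capacity: the greatest rate over the sources tried.
P_ERR = 0.01  # assumed: 1 digit in 100 received in error
capacity = max(transmission_rate(k / 1000, P_ERR) for k in range(1001))
print(round(capacity, 4))
```

For this channel the greatest rate is found with a source that sends 0's and 1's equally often, and it comes to a little under 0.92 bit per digit.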
So far, the channel capacity is merely a mathematically defined quantity which we can compute if we know the probabilities of various sorts of errors in the transmission of symbols. The channel capacity is important, because Shannon proves, as his fundamental theorem for the noisy channel, that when the entropy or information rate of a message source is less than this channel capacity, the messages produced by the source can be so encoded that they can be transmitted over the noisy channel with an error less than any specified amount.

In order to encode messages for error-free transmission over noisy channels, long sequences of symbols must be lumped together and encoded as one supersymbol. This is the sort of block encoding that we have encountered earlier. Here we are using it for a new purpose. We are not using it to remove the redundancy of the messages produced by a message source. Instead, we are using it to add redundancy to nonredundant messages so that they can be transmitted without error over a noisy channel. Indeed, the whole problem of efficient and error-free communication turns out to be that of removing from messages the somewhat inefficient redundancy which they have and then adding redundancy of the right sort in order to allow correction of errors made in transmission.

The redundant digits we must use in encoding messages for error-free transmission, of course, slow the speed of transmission. We have seen that in using a binary symmetric channel in which 1 transmitted digit in 100 is erroneously received, we can send only 92 correct nonredundant message digits for each 100 digits we feed into the noisy channel. This means that on the average, we must use a redundant code in which, for each 92 nonredundant message digits, we must include in some way 8 extra check digits, thus making the over-all stream of digits redundant.

Shannon's very general work tells us in principle how to proceed.
But, the mathematical difficulties of treating complicated channels are great. Even in the case of the simple, symmetric, off-on binary channel, the problem of finding efficient codes is formidable, although mathematicians have found a large number of best codes. Alas, even these seem to be too complicated to use!

Is this a discouraging picture? How much wiser we are than in the days before information theory! We know what the problem is. We know in principle how well we can do, and the result has astonished engineers and mathematicians. Further, we do have useful if inefficient error-correcting codes which we can use in doing something about the problem. In a day in which the importance of accurate transmission of digital data is growing almost beyond conceiving, this is worth more than the whole price of admission.

CHAPTER IX
Many Dimensions

YEARS AND YEARS AGO (over thirty) I found in the public library of St. Paul a little book which introduced me to the mysteries of the fourth dimension. It was Flatland, by Abbott. It describes a two-dimensional world without thickness. Such a world and all its people could be drawn in complete detail, inside and out, on a sheet of paper.

What I now most remember and admire about the book are the descriptions of Flatland society. The inhabitants are polygonal, and sidedness determines social status. The most exalted of the multisided creatures hold the honorary status of circles. The lowest order is isosceles triangles. Equilateral triangles are a step higher, for regularity is admired and required. Indeed, irregular children are cracked and reset to attain regularity, an operation which is frequently fatal. Women are extremely narrow, needle-like creatures and are greatly admired for their swaying motion. The author of record, A. Square, accords well with all we have come to associate with the word.

Flatland has a mathematical moral as well.
The protagonist is astonished when a circle of varying size suddenly appears in his world. The circle is, of course, the intersection of a three-dimensional creature, a sphere, with the plane of Flatland. The sphere explains the mysteries of three dimensions to A. Square, who in turn preaches the strange doctrine. The reader is left with the thought that he himself may someday encounter a fluctuating and disappearing entity, the three-dimensional intersection of a four-dimensional creature with our world.

Four-dimensional cubes or tesseracts, hyperspheres, and other hypergeometric forms are old stuff both to mathematicians and to science fiction writers. Supposing a fourth dimension like unto the three which we know, we can imagine many three-dimensional worlds existing as close to one another as the pages of a manuscript, each imprinted with different and distinct characters and each separate from every other. We can imagine traveling through the fourth dimension from one world to another or reaching through the fourth dimension into a safe to steal the bonds or into the abdomen to snatch an appendix.

Most of us have heard also that Einstein used time as a fourth dimension, and some may have heard of the many-dimensional phase spaces of physics, in which the three coordinates and three velocity components of each of many particles are all regarded as dimensions. Clearly, this sort of thing is different from the classical idea of a fourth spatial dimension which is just like the three dimensions of up and down, back and forth, and left and right, those we all know so well. The truth of the matter is that nineteenth-century mathematicians succeeded in generalizing geometry to include any number of dimensions or even an infinity of dimensions.

These dimensions are for the pure mathematician merely mental constructs. He starts out with a line called the x direction or x axis, as shown in a of Figure IX-1. Some point p lies a distance $x_p$ to the right of the origin O on the x axis. This coordinate $x_p$ in fact describes the location of the point p. The mathematician can then add a y axis perpendicular to the x axis, as shown in b of Figure IX-1. He can specify the location of a point p in the two-dimensional space or plane in which these axes lie by means of two numbers or coordinates: the distance from the origin O in the y direction, that is, the height $y_p$, and the distance from the origin O in the x direction, $x_p$, that is, how far p is to the right of the origin O.

In c of Figure IX-1 the x, y, and z axes are supposed to be all perpendicular to one another, like the edges of a cube. These axes represent the directions of the three-dimensional space with which we are familiar. The location of the point p is given by its height $y_p$ above the origin O, its distance $x_p$ to the right of the origin O, and its distance $z_p$ behind the origin O.

[Fig. IX-1: coordinate axes in (a) one, (b) two, (c) three, and (d) five dimensions]

Of course, in the drawing c of Figure IX-1 the x, y, and z axes aren't really all perpendicular to one another. We have here merely a two-dimensional perspective sketch of an actual three-dimensional situation in which the axes are all perpendicular to one another. In d of Figure IX-1, we similarly have a two-dimensional perspective sketch of axes in a five-dimensional space. Since we come to the end of the alphabet in going from x to z, we have merely labeled these directions $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, according to the practice of mathematicians.

Of course these five axes of d of Figure IX-1 are not all perpendicular to one another in the drawing, but neither are the three axes of c. We can't lay out five mutually perpendicular lines in our three-dimensional space, but the mathematician can deal logically with a "space" in which five or more axes are mutually perpendicular.
He can reason out the properties of various geometrical figures in a five-dimensional space, in which the position of a point p is described by five coordinates $x_{1p}$, $x_{2p}$, $x_{3p}$, $x_{4p}$, $x_{5p}$. To make the space like ordinary space (a Euclidean space) the mathematician says that the square of the distance d of the point p from the origin shall be given by

$$d^2 = x_{1p}^2 + x_{2p}^2 + x_{3p}^2 + x_{4p}^2 + x_{5p}^2 \qquad (9.1)$$

In dealing with multidimensional spaces, mathematicians define the "volume" of a "cubical" figure as the product of the lengths of its sides. Thus, in a two-dimensional space the figure is a square, and, if the length of each side is L, the "volume" is the area of the square, which is $L^2$. In three-dimensional space the volume of a cube of width, height, and thickness L is $L^3$. In five-dimensional space the volume of a hypercube of extent L in each direction is $L^5$, and a ninety-nine dimensional cube L on a side would have a volume $L^{99}$.

Some of the properties of figures in multidimensional space are simple to understand and startling to consider. For instance, consider a circle of radius 1 and a concentric circle of radius 1/2 inside of it, as shown in Figure IX-2. The area ("volume") of a circle is $\pi r^2$, so the area of the outer circle is $\pi$ and the area of the inner circle is $\pi(1/2)^2 = (1/4)\pi$. Thus, a quarter of the area of the whole circle lies within a circle of half the diameter.

[Fig. IX-2: a circle of radius 1 with a concentric circle of radius 1/2]

Suppose, however, that we regard Figure IX-2 as representing spheres. The volume of a sphere is $(4/3)\pi r^3$, and we find that 1/8 of the volume of a sphere lies within a sphere of 1/2 the diameter. In a similar way, the volume of a hypersphere of n dimensions is proportional to $r^n$, and as a consequence the fraction of the volume which lies in a hypersphere of half the radius is $1/2^n$. For instance, for n = 7 this is a fraction 1/128.

We could go through a similar argument concerning the fraction of the volume of a hypersphere of radius r that lies within a sphere of radius 0.99r.
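Since the volume is proportional to $r^n$, these fractions are easy to check. The following sketch, whose function name is a hypothetical choice, evaluates the fraction of the volume lying inside a concentric hypersphere of smaller radius.

```python
def inner_volume_fraction(radius_ratio: float, n: int) -> float:
    """Fraction of an n-dimensional hypersphere's volume that lies within
    a concentric hypersphere whose radius is radius_ratio times as large.

    Follows directly from volume being proportional to r**n.
    """
    return radius_ratio ** n


# Half the radius in 2, 3, and 7 dimensions: 1/4, 1/8, 1/128.
for n in (2, 3, 7):
    print(n, inner_volume_fraction(0.5, n))

# 0.99 of the radius in 1,000 dimensions.
print(inner_volume_fraction(0.99, 1000))
```

The last value printed is a few parts in 100,000, so a shell only one per cent of the radius thick already holds nearly all of the volume.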
For a 1,000-dimension hypersphere we find that a fraction 0.00004 of the volume lies in a sphere of 0.99 the radius. The conclusion is inescapable that in the case of a hypersphere of a very high dimensionality, essentially all of the volume lies very near to the surface!

Are such ideas anything but pure mathematics of the most esoteric sort? They are pure and esoteric mathematics unless we attach them to some problem pertaining to the physical world. Imaginary numbers, such as $\sqrt{-1}$, once had no practical physical meaning. However, imaginary numbers have been assigned meanings in electrical engineering and physics. Can we perhaps find a physical situation which can be represented accurately by the mathematical properties of hyperspace? We certainly can, right in the field of communication theory. Shannon has used the geometry of multidimensional space to prove an important theorem concerning the transmission of continuous, band-limited signals in the presence of noise.

Shannon's work provides a wonderful example of the use of a new point of view and of an existing but hitherto unexploited branch of mathematics (in this case, the geometry of multidimensional spaces) in solving a problem of great practical interest. Because it seems to me so excellent an example of applied mathematics, I propose to go through a good deal of Shannon's reasoning. I believe that the course of this reasoning is more unfamiliar than difficult, but the reader will have to embark on it at his own peril.

In order to discuss this problem of transmission of continuous signals in the presence of noise, we must have some common measure of the strength of the signal and of the noise. Power turns out to be an appropriate and useful measure.

When we exert a force of 1 lb over a distance of 1 ft in raising a 1 lb weight to the height of 1 ft we do work. The amount of work done is 1 foot-pound (ft-lb). The weight has, by virtue of its height, an energy of 1 ft-lb.
In falling, the weight can do an amount of work (as in driving a clock) equal to this energy.

Power is rate of doing work. A machine which expends 33,000 ft-lb of energy and does 33,000 ft-lb of work in a minute has by definition a power of 1 horsepower (hp).

In electrical calculations, we reckon energy and work in terms of a unit called the joule and power in terms of a unit called a watt. A watt is one joule per second.

If we double the voltage of a signal, we increase its energy and power by a factor of 4. Energy and power are proportional to the square of the voltage of a signal.

We have seen as far back as Chapter IV that a continuous signal of band width W can be represented completely by its amplitude at 2W sample points per second. Conversely, we can construct a band-limited signal which passes through any 2W sample points per second which we may choose. We can specify each sample arbitrarily and change it without changing any other sample. When we so change any sample we change the corresponding band-limited signal.

We can measure the amplitudes of the samples in volts. Each sample represents an energy proportional to the square of its voltage. Thus, we can express the squares of the amplitudes of the samples in terms of energy. By using rather special units to measure energy, we can let the energy be equal to the square of the sample amplitude, and this won't lead to any troubles.

Let us, then, designate the amplitudes of successive and correctly chosen samples of a band-limited signal, measured perhaps in volts, by the letters $x_1$, $x_2$, $x_3$, etc. The parts of the signal energy represented by the samples will be $x_1^2$, $x_2^2$, $x_3^2$, etc. The total energy of the signal, which we shall call E, will be the sum of these energies:

$$E = x_1^2 + x_2^2 + x_3^2 + \text{etc.} \qquad (9.2)$$

But we see that in geometrical terms E is just the square of the distance from the origin, as given by 9.1, if $x_1$, $x_2$, $x_3$, etc., are the coordinates of a point in multidimensional space!

Thus, if we let the amplitudes of the samples of a band-limited signal be the coordinates of a point in hyperspace, the point itself represents the complete signal, that is, all the samples taken together, and the square of the distance of the point from the origin represents the energy of the complete signal.

Why should we want to represent a signal in this geometrical fashion? The reason that Shannon did so was to prove an important theorem of communication theory concerning the effect of noise on signal transmission.

In order to see how this can be done, we should recall the mathematical model of a signal source which we adopted in Chapter III. We there assumed that the source is both stationary and ergodic. These assumptions must extend to the noise we consider and to the combined "source" of signal plus noise.

It is not actually impossible that such a source might produce a signal or a noise consisting of a very long succession of very high-energy samples or of very low-energy samples, any more than it is impossible that an ergodic source of letters might produce an extremely long run of E's. It is merely very unlikely. Here we are dealing with the theorem we encountered first in Chapter V. An ergodic source can produce a class of messages which are probable and a class which are so very improbable that we can disregard them. In this case, the improbable messages are those for which the average power of the samples produced departs significantly from the time average (and the ensemble average) characteristic of the ergodic source.

Thus, for all the long messages that we need to consider, there is a meaningful average power of the signal which does not change appreciably with time.
We can measure this average power by adding the energies of a large number of successive samples and dividing by the time T during which the samples are sent. As we make the time T longer and longer and the number of samples larger and larger, we will get a more and more accurate value for the average power. Because the source is stationary, this average power will be the same no matter what succession of samples we use.

We can say this in a different way. Except in cases so unlikely that we need not consider them, the total energy of a large number of successive samples produced by a stationary source will be nearly the same (to a small fractional difference) regardless of what particular succession of samples we choose.

Because the signal source is ergodic as well as stationary, we can say more. For each signal the source produces, regardless of what the particular signal is, it is practically certain that the energy of the same large number of successive samples will be nearly the same, and the fractional differences among energies get smaller and smaller as the number of samples is made larger and larger.

Let us represent the signals from such a source by points in hyperspace. A signal of band width W and duration T can be represented by 2WT samples, and the amplitude of each of these samples is the distance along one coordinate axis of hyperspace. If the average energy per sample is P, the total energy of the 2WT samples will be very close to 2WTP if 2WT is a very large number of samples. We have seen that this total energy tells how far from the origin the point which represents the signal is. Thus, as the number of samples is made larger and larger, the points representing different signals of the same duration produced by the source lie within a smaller and smaller distance (measured as a fraction of the radius) from the surface of a hypersphere of radius $\sqrt{2WTP}$.
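This crowding toward the surface can be watched in a small numerical experiment. The sketch below rests on an assumption the text does not make at this point: it takes the samples to be independent Gaussian values of average energy `avg_energy`. The function name is hypothetical, and `num_samples` plays the part of 2WT.

```python
import math
import random


def relative_radius(num_samples: int, avg_energy: float,
                    rng: random.Random) -> float:
    """Distance of a random signal point from the origin, divided by
    the radius sqrt(num_samples * avg_energy) of the hypersphere.

    Assumes independent Gaussian samples; this is an illustrative choice.
    """
    sigma = math.sqrt(avg_energy)
    total_energy = sum(rng.gauss(0.0, sigma) ** 2
                       for _ in range(num_samples))
    return math.sqrt(total_energy) / math.sqrt(num_samples * avg_energy)


rng = random.Random(0)  # fixed seed so the run can be repeated
for num_samples in (10, 1000, 50000):
    ratios = [relative_radius(num_samples, 2.0, rng) for _ in range(5)]
    spread = max(ratios) - min(ratios)
    print(num_samples, round(spread, 4))
```

As the number of samples grows, the spread of the ratios shrinks, and every signal point lies at very nearly the full radius.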
The fact that the points representing the different signals all lie so close to the surface is not surprising if we remember that for a hypersphere of high dimensionality almost all of the volume is very close to the surface.

We receive, not the signal itself, but the signal with noise added. The noise which Shannon considers is called white Gaussian noise. The word white implies that the noise contains all frequencies equally, and we assume that the noise contains all frequencies equally up to a frequency of W cycles per second and no higher frequencies. The word Gaussian refers to a law for the probability of samples of various amplitudes, a law which holds for many natural sources of noise. For such Gaussian noise, each of the 2W samples per second which represent it is uncorrelated and independent. If we know the average energy of the samples, which we will call N, knowing the energy of some samples doesn't help to predict the energy of others. The total energy of 2WT samples will be very nearly 2WTN if 2WT is a large number of samples, and the energy will be almost the same for any succession of noise samples that are added to the signal samples.

We have seen that a particular succession of signal samples is represented by some point in hyperspace a distance $\sqrt{2WTP}$ from the origin. The sum of a signal plus noise is represented by some point a little distance away from the point representing the signal alone. In fact, we see that the distance from the point representing the signal alone to the point representing the signal plus the noise is $\sqrt{2WTN}$. Thus, the signal plus the noise lies in a little hypersphere of radius $\sqrt{2WTN}$ centered on the point representing the signal alone.

Now, we don't receive the signal alone. We receive a signal of average energy P per sample plus Gaussian noise of average energy N per sample.
In a time T, the total received energy is 2WT(P + N) and the point representing whatever signal was sent plus whatever noise was added to it lies within a hypersphere of radius $\sqrt{2WT(P + N)}$.

After we have received a signal plus noise for T seconds we can find the location of the point representing the signal plus noise. But how are we to find the signal? We only know that the signal lies within a distance $\sqrt{2WTN}$ of the point representing the signal plus noise.

How can we be sure of deducing what signal was sent? Suppose that we put into the hypersphere of radius $\sqrt{2WT(P + N)}$, in which points representing a signal plus noise must lie, a large number of little nonoverlapping hyperspheres of radius a bare shade larger than $\sqrt{2WTN}$. Let us then send only signals represented by the center points of these little spheres.

When we receive the 2WT samples of any particular one of these signals plus any noise samples, the corresponding point in hyperspace can only lie within the particular little hypersphere surrounding that signal point and not within any other. This is so because, as we have noted, the points representing long sequences of samples produced by an ergodic noise source must be almost at the surface of a sphere of radius $\sqrt{2WTN}$. Thus, the signal sent can be identified infallibly despite the presence of the noise.

How many such nonoverlapping hyperspheres of radius $\sqrt{2WTN}$ can be placed in a hypersphere of radius $\sqrt{2WT(P + N)}$? The number certainly cannot be larger than the ratio of the volume of the larger sphere to that of the smaller sphere.

The number n of dimensions in the space is equal to the number of signal (and noise) samples 2WT. The volume of a hypersphere in a space of n dimensions is proportional to $r^n$.
Hence, the ratio of the volume of the large signal-plus-noise sphere to the volume of the little noise sphere is

$$\left(\frac{\sqrt{2WT(P + N)}}{\sqrt{2WTN}}\right)^{2WT} = \left(1 + \frac{P}{N}\right)^{WT}$$

This is a limit to the number of distinguishable messages we can transmit in a time T. The logarithm of this number is the number of bits which we can transmit in the time T. It is

$$WT \log\left(1 + \frac{P}{N}\right)$$

As the message is T seconds long, the corresponding number of bits per second C is

$$C = W \log\left(1 + \frac{P}{N}\right) \qquad (9.3)$$

Having got to this point, we can note that the ratio of average energy per signal sample to average energy per noise sample must be equal to the ratio of average signal power to average noise power, and we can, in 9.3, regard P/N as the ratio of signal power to noise power instead of as the ratio of average signal-sample energy to average noise-sample energy.

The foregoing argument, which led to 9.3, has merely shown that no more than C bits per second can be sent with a band width of W cycles per second using a signal of power P mixed with a Gaussian noise of power N. However, by a further geometrical argument, in which he makes use of the fact that the volume of a hypersphere of high dimensionality is almost all very close to the surface, Shannon shows that the signaling rate can approach as close as one likes to C as given by 9.3 with as small a number of errors as one likes. Hence, C, as given by 9.3, is the channel capacity for a continuous channel in which a Gaussian noise is added to the signal.

It is perhaps of some interest to compare equation 9.3 with the expressions for speed of transmission and for information which Nyquist and Hartley proposed in 1928 and which we discussed in Chapter II. Nyquist and Hartley's results both say that the number of binary digits which can be transmitted per second is

$$n \log m$$

Here m is the number of different symbols, and n is the number of symbols which are transmitted per second.
One sort of symbol we can consider is a particular value of voltage, as, +3, +1, -1, or -3. Nyquist knew, as we do, that the number of independent samples or values of voltage which can be transmitted per second is 2W. By using this fact, we can rewrite equation 9.3 in the form

$$C = \frac{n}{2} \log\left(1 + \frac{P}{N}\right) = n \log \sqrt{1 + \frac{P}{N}}$$

Here we are really merely retracing the steps which led us to 9.3. We see that in equation 9.3 we have got at the average number m of different symbols we can send per sample, in terms of the ratio of signal power to noise power. If the signal power becomes very small or the noise power becomes very large, so that P/N is nearly 0, then the average number of different symbols we can transmit per sample goes to

$$\log 1 = 0$$

Thus, the average number of symbols we can transmit per sample and the channel capacity go to 0 as the ratio of signal power to noise power goes to 0. Of course, the number of symbols we can transmit per sample and the channel capacity become large as we make the ratio of signal power to noise power large.

Our understanding of how to send a large average number of independent symbols per sample has, however, gone far beyond anything which Nyquist or Hartley told us. We know that if we are to do this most efficiently we must, in general, not try to encode a symbol for transmission as a particular sample voltage to be sent by itself. Instead, we must, in general, resort to the now-familiar procedure of block encoding and encode a long sequence of symbols by means of a large number of successive samples. Thus, if the ratio of signal power to noise power is 24, we can on the average transmit with negligible error $\sqrt{1 + 24} = \sqrt{25} = 5$ different symbols per sample, but we can't transmit any of 5 different symbols by means of one particular sample.
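Relation 9.3 and its rewritten form are simple to evaluate. The sketch below takes the logarithm to the base 2, so that C comes out in bits per second. The function names are hypothetical, and the band width of 3,000 cycles per second is merely an assumed figure for illustration.

```python
import math


def channel_capacity(bandwidth: float, snr: float) -> float:
    """C = W log2(1 + P/N), in bits per second (relation 9.3)."""
    return bandwidth * math.log2(1 + snr)


def symbols_per_sample(snr: float) -> float:
    """Average number m of different symbols per sample, sqrt(1 + P/N)."""
    return math.sqrt(1 + snr)


W = 3000.0   # assumed band width, cycles per second
SNR = 24.0   # ratio of signal power to noise power used in the text

m = symbols_per_sample(SNR)
n = 2 * W    # independent samples per second

# Both forms of the relation give the same capacity.
print(m)
print(channel_capacity(W, SNR), n * math.log2(m))
```

With P/N = 24 the sketch gives m = 5, and n log m agrees with W log(1 + P/N), as the rewritten form of 9.3 requires.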
In Figure VIII-1 of Chapter VIII, we considered sending binary digits one at a time in the presence of noise by using a signal which was either a positive or a negative pulse of a particular amplitude and calling the received signal a 1 if the signal plus noise was positive and a 0 if the received signal plus noise was negative. Suppose that in this case we make the signal powerful enough compared with the noise, which we assume to be Gaussian, so that only 1 received digit in 100,000 will be in error. Calculations show that this calls for about six times the signal power which equation 9.3 says we will need for the same band width and noise power. The extra power is needed because we use as a signal either a short positive or negative pulse specifying one binary digit, rather than using one of many long signals consisting of many different samples of various amplitudes to represent many successive binary digits.

One very special way of approaching the ideal signaling rate or channel capacity for a small, average signal power in a large noise power is to concentrate the signal power in a single short but powerful pulse and to send this pulse in one of many possible time positions, each of which represents a different symbol. In this very special and unusual case we can efficiently transmit symbols one at a time. In general, however, to achieve something close to the ideal signaling rate, we must use as the elements of the code a set of long, complicated signal waves which resemble Gaussian noise.

We can if we wish look on relation 9.3 not narrowly as telling us how many bits per second we can send over a particular communication channel but, rather, as telling us something about the possibilities of transmitting a signal of a specified band width with some required signal-to-noise ratio over a communication channel of some other band width and signal-to-noise ratio.
For instance, suppose we must send a signal with a band width of 4 megacycles per second and attain a ratio of signal power to noise power P/N of 1,000. Relation 9.3 tells us that the corresponding channel capacity is

C = 40,000,000 bits/second

But the same channel capacity can be attained with the combinations shown in Table XIII.

TABLE XIII
Combinations of W and P/N Which Give Same Channel Capacity

W (cycles per second)    P/N
4,000,000                1,000
8,000,000                30.6
2,000,000                1,000,000

We see from Table XIII that, in attaining a given channel capacity, we can use a broader band width and a lower ratio of signal to noise or a narrower band width and a higher ratio of signal to noise.

Early workers in the field of information theory were intrigued with the idea of cutting down the band width required by increasing the power used. This calls for lots of power. Experience has shown that it is much more useful and practical to increase the band width so as to get a good signal-to-noise ratio with less power than would otherwise be required.

This is just what is done in FM transmission, as an example. In FM transmission, a particular amplitude of the message signal to be transmitted, which may, for instance, be music, is encoded as a radio signal of a particular frequency. As the amplitude of the message signal rises and falls, the frequency of the FM signal which represents it changes greatly, so that in sending high fidelity music which has a band width of 15,000 cps, the FM radio signal can range over a band width of 150,000 cps. Because FM transmission makes use of a band width much larger than that of the music of which it is an encoding, the signal-to-noise ratio of the received music can be much higher than the ratio of signal power to noise power in the FM signal that the radio receiver receives. FM is not, however, an ideally efficient system; it does not work the improvement which we might expect from 9.3.
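The entries of Table XIII follow from solving relation 9.3 for P/N at a fixed capacity. The sketch below, with illustrative function names and a base-2 logarithm assumed, recomputes them.

```python
import math

def capacity(bandwidth, snr):
    """Relation 9.3 in bits per second: C = W log2(1 + P/N)."""
    return bandwidth * math.log2(1.0 + snr)

def snr_for_capacity(target_capacity, bandwidth):
    """Relation 9.3 solved for P/N: P/N = 2**(C/W) - 1."""
    return 2.0 ** (target_capacity / bandwidth) - 1.0

# Reference channel from the text: W = 4,000,000 cycles per second, P/N = 1,000.
c_ref = capacity(4.0e6, 1000.0)
print(round(c_ref))  # close to the 40,000,000 bits/second quoted in the text

# Other band widths that carry the same capacity.
for w in (4.0e6, 8.0e6, 2.0e6):
    print(int(w), snr_for_capacity(c_ref, w))
```

Doubling the band width takes the square root of 1 + P/N, and halving it squares 1 + P/N, which is why a narrower band is so costly in power.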
Ingenious inventors are ever devising improved systems of modulation. Twice in my experience someone has proposed to me a system which purported to do better than equation 9.3, for the ideal channel capacity, allows. The suggestions were plausible, but I knew, just as in the case of perpetual motion machines, that something had to be wrong with them. Careful analysis showed where the error lay. Thus, communication theory can be valuable in telling us what can't be accomplished as well as in suggesting what can be.

One thing that can't be accomplished in improving the signal-to-noise ratio by increasing the band width is to make a system which will behave in an orderly and happy way for all ratios of signal power to noise power.

According to the view put forward in this chapter, we look on a signal as a point in a multidimensional space, where the number of dimensions is equal to the number of samples. To send a narrow-band signal of a few samples by means of a broad-band signal having more samples, we must in some way map points in a space of few dimensions into points in a space of more dimensions in a one-to-one fashion.

Way back in Chapter I, we proved a theorem concerning the mapping of points of a space of two dimensions (a plane) onto points of a space of one dimension (a line). We proved that if we map each point of the plane in a one-to-one fashion into a single corresponding point on the line, the mapping cannot be continuous. That is, if we move smoothly along a path in the plane from point to nearby point, the corresponding positions on the line must jump back and forth discontinuously. A similar theorem is true for the mapping of the points of any space onto a space of different dimensionality. This bodes trouble for transmission schemes in which few message samples are represented by many signal samples.

Shannon gives a simple example of this sort of trouble, which is illustrated in Figure IX-3.
Suppose that we use two sample amplitudes $v_2$ and $v_1$ to represent a single sample amplitude u. We regard $v_2$ and $v_1$ as the distance up from and to the right of the lower left-hand corner of a square. In the square, we draw a snaky line which starts near the lower left-hand corner and goes back and forth across the square, gradually progressing upward. We let distance along this line, measured from its origin near the lower left-hand corner to some specified point along the line, be u, the voltage or amplitude of the signal to be transmitted.

Certainly, any value of u is represented by particular values of $v_1$ and $v_2$. We see that the range of $v_1$ or $v_2$ is less than the range of u. We can transmit $v_1$ and $v_2$ and then reconstruct u with great accuracy. Or can we?

Suppose a little noise gets into $v_1$ and $v_2$, so that, when we try to find the corresponding value of u at the receiver, we land somewhere in a circle of uncertainty due to noise. As long as the diameter of the circle is less than the distance between the loops of the snaky path, we can tell what the correct value of u is to a fractional error much smaller than the fractional error of $v_1$ or $v_2$.

[Fig. IX-3. Label in figure: uncertainty due to noise]

But if the noise is larger, we can't be sure which loop of the snaky path was intended, and we frequently make a larger error in u.

This sort of behavior is inevitable in systems, such as FM, which use a large band width in order to get a better signal-to-noise ratio. As the noise added in transmission is increased, the noise in the received (demodulated) signal at first increases gradually and then increases catastrophically. The system is said to "break" at this level of signal to noise. Here we have an instance in which a seemingly abstract theorem of mathematics tells us that a certain type of behavior cannot be avoided in electrical communication systems of a certain general type.
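The "break" described above can be seen in a small numerical experiment. The sketch below is a simplified stand-in for the snaky line of Figure IX-3, not a reconstruction of Shannon's own example: it assumes a unit square, ten equally spaced horizontal runs traversed alternately left-to-right and right-to-left, and it ignores the short vertical pieces that join one run to the next. The receiver picks the nearest run and reads off the position along it. All names and parameter values are illustrative.

```python
import math
import random

def encode(u, rows):
    """Map u in [0, rows] to a point (v1, v2) on a zig-zag path in the unit square."""
    i = min(int(u), rows - 1)             # which horizontal run of the path
    t = u - i                             # distance along that run, 0..1
    v1 = t if i % 2 == 0 else 1.0 - t     # alternate runs go right, then left
    v2 = (i + 0.5) / rows                 # height of the run
    return v1, v2

def decode(v1, v2, rows):
    """Pick the nearest run, then read off the distance along it."""
    i = min(max(int(round(v2 * rows - 0.5)), 0), rows - 1)
    x = min(max(v1, 0.0), 1.0)
    t = x if i % 2 == 0 else 1.0 - x
    return i + t

def rms_error(sigma, rows=10, trials=20000, seed=0):
    """Root-mean-square error in u when Gaussian noise of std sigma is added to v1, v2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = rng.uniform(0.0, rows)
        v1, v2 = encode(u, rows)
        u_hat = decode(v1 + rng.gauss(0.0, sigma), v2 + rng.gauss(0.0, sigma), rows)
        total += (u_hat - u) ** 2
    return math.sqrt(total / trials)

# The runs are 0.1 apart. Small noise stays on the right run; larger noise does not.
for sigma in (0.005, 0.01, 0.02, 0.05):
    err = rms_error(sigma)
    # Fractional errors: u has range 10 (the number of runs), v1 and v2 have range 1.
    print(sigma, err, err / 10.0)
```

For the smaller noise levels the fractional error in u is about a tenth of the fractional error in $v_1$ or $v_2$; once the noise is comparable to the spacing between runs, the error in u grows far faster than the noise itself.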
The approach in this chapter has been essentially geometrical. This is only one way of dealing with the problems of continuous signals. Indeed, Shannon gives another in his book on communication theory, an approach which is applicable to all types of signals and noise. The geometrical approach is interesting, however, because it is proving illuminating and fruitful in many problems concerning electric signals which are not directly related to communication theory.

Here we have arrived at a geometry of band-limited signals by sampling the signals and then letting the amplitudes of the samples be the coordinates of a point in a multidimensional space. It is possible, however, to geometrize band-limited signals without speaking in terms of samples, and mathematicians interested in problems of signal transmission have done this. In fact, it is becoming increasingly common to represent band-limited signals as points in a multidimensional "signal space" or "function space" and to prove theorems about signals by the methods of geometry.

The idea of signals as points in a multidimensional signal space or function space is important, because it enables mathematicians to think about and to make statements which are true about all band-limited signals, or about large classes of band-limited signals, without considering the confusing details of particular signals, just as mathematicians can make statements about all triangles or all right triangles. Signal space is a powerful tool in the hands or, rather, in the minds of competent mathematicians. We can only wonder and admire.

From the point of view of communication theory, our chief concern in this chapter has been to prove an important theorem concerning a noisy continuous channel.
The result is embodied in equation 9.3, which gives the rate at which we can transmit binary digits with negligible error over a continuous channel in which a signal of band width W and power P is mixed with a white Gaussian noise of band width W and power N.

Nyquist knew, in 1928, that one can send 2W independent symbols per second over a channel of band width W, but he didn't know how many different symbols could be sent per second for a given ratio of signal power to noise power. We have found this out for the case of a particular, common type of noise. We also know that even if we can transmit some average number m of symbols per sample, in general, we can't do this by trying to encode successive symbols independently as particular voltages. Instead, we must use block encoding, and encode a large number of successive symbols together.

Equation 9.3 shows that we can use a signal of large band width and low ratio of signal power to noise power in transmitting a message which has a small band width and a large ratio of signal power to noise power. FM is an example of this. Such considerations will be pursued further in Chapter X.

This chapter has had another aspect. In it we have illustrated the use of a novel viewpoint and the application of a powerful field of mathematics in attacking a problem of communication theory. Equation 9.3 was arrived at by the by-no-means-obvious expedient of representing long electrical signals and the noises added to them by points in a multidimensional space. The square of the distance of a point from the origin was interpreted as the energy of the signal represented by the point.

Thus, a problem in communication theory was made to correspond to a problem in geometry, and the desired result was arrived at by geometrical arguments. We noted that the geometrical representation of signals has become a powerful mathematical tool in studying the transmission and properties of signals.
The geometrization of signal problems is of interest in itself, but it is also of interest as an example of the value of seeking new mathematical tools in attacking the problems raised by our increasingly complex technology. It is only by applying this order of thought that we can hope to deal with the increasingly difficult problems of engineering.

CHAPTER X

Information Theory and Physics

I have given something of the historical background of communication theory in Chapter II. From this we can see that communication theory is an outgrowth of electrical communication, and we know that the behavior of electric currents and electric and magnetic fields is a part of physics.

To Morse and to other early telegraphists, electricity provided a very limited means of communication compared with the human voice or the pen in hand. These men had to devise codes by means of which the letters of the alphabet could be represented by turning an electric current successively on and off. This same problem of the representation of material to be communicated by various sorts of electrical signals has led to the very general ideas concerning encoding which are so important in communication theory. In this relation of encoding to particular physical phenomena, we see one link between communication theory and physics.

We have also noted that when we transmit signals by means of wire or radio, we receive them inevitably admixed with a certain amount of interfering disturbances which we call noise. To some degree, we can avoid such noise. The noise which is generated in our receiving apparatus we can reduce by careful design and by ingenious invention. In receiving radio signals, we can use an antenna which receives signals most effectively from the direction of the transmitter and which is less sensitive to signals coming from other directions.
Further, we can make sure that our receiver responds only to the frequencies we mean to use and rejects interfering signals and noise of other frequencies.

Still, when all this is done, some noise will inevitably remain, mixed with the signals that we receive. Some of this noise may come from the ignition systems of automobiles. Far away from man-made sources, some may come from lightning flashes. But even if lightning were abolished, some noise would persist, as surely as there is heat in the universe.

Many years ago an English biologist named Brown saw small pollen particles, suspended in a liquid, dance about erratically in the field of his microscope. The particles moved sometimes this way and sometimes that, sometimes swiftly and sometimes slowly. This we call Brownian motion. Brownian motion is caused by the impact on the particles of surrounding molecules, which themselves execute an even wilder dance. One of Einstein's first major works was a mathematical analysis of Brownian motion.

The pollen grains which Brown observed would have remained at rest had the molecules about them been at rest, but molecules are always in random motion. It is this motion which constitutes heat. In a gas, a molecule moves in a disorganized way. It moves swiftly or slowly in straight lines between frequent collisions. In a liquid, the molecules jostle about in close proximity to one another but continually change place, sometimes moving swiftly and sometimes slowly. In a solid, the molecules vibrate about their mean positions, sometimes with a large amplitude and sometimes with a small amplitude, but never moving much with respect to their nearest neighbors. Always, however, in gas, liquid, or solid, the molecules move, with an average energy due to heat which is proportional to the temperature above absolute zero, however erratically the speed and energy may vary from time to time and from molecule to molecule.
Energy of mechanical motion is not the only energy in our universe. The electromagnetic waves of radio and light also have energy. Electromagnetic waves are generated by changing currents of electricity. Atoms are positively charged nuclei surrounded by negative electrons, and molecules are made up of atoms. When the molecules of a substance vibrate with the energy of heat, relative motions of the charges in them can generate electromagnetic waves, and these waves have frequencies which include those of what we call radio, heat, and light waves. A hot body is said to radiate electromagnetic waves, and the electromagnetic waves that it emits are called radiation.

The rate at which a body which is held at a given temperature radiates radio, heat, and light waves is not the same for all substances. Dark substances emit more radiation than shiny substances. Thus, silver, which is called shiny because it reflects most of any waves of radio, heat, or light falling on it, is a poor radiator, while the carbon particles of black ink constitute a good radiator.

When radiation falls on a substance, the fraction that is reflected rather than absorbed is different for radiation of different frequencies, such as radio waves and light waves. There is a very general rule, however, that for radiation of a given frequency, the amount of radiation a substance emits at a given temperature is directly proportional to the fraction of any radiation falling on it which is absorbed rather than reflected. It is as if there were a skin around each substance which allowed a certain fraction of any radiation falling on it to pass through and reflected the rest, and as if the fraction that passed through the skin were the same for radiation either entering or leaving the substance.

If this were not so, we might expect a curious and unnatural (as we know the laws of nature) phenomenon.
Let us imagine a completely closed box or furnace held at a constant temperature. Let us imagine that we suspend two bodies inside the furnace. Suppose (contrary to fact) that the first of these bodies reflected radiation well, absorbing little, and that it also emitted radiation strongly, while the second absorbed radiation well, reflecting little, but emitted radiation poorly. Suppose that both bodies started out at the same temperature. The first would absorb less radiation and emit more radiation than the second, while the second would absorb more radiation and emit less radiation than the first. If this were so, the second body would become hotter than the first.

This is not the case, however; all bodies in a closed box or furnace whose walls are held at a constant, uniform temperature attain just exactly the same temperature as the walls of the furnace, whether the bodies are shiny, reflecting much radiation and absorbing little, or whether they are dark, reflecting little radiation and absorbing much. This can be so only if the ability to absorb rather than reflect radiation and the ability to emit radiation go hand in hand, as they always do in nature.

Not only do all bodies inside such a closed furnace attain the same temperature as the furnace; there is also a characteristic intensity of radiation in such an enclosure. Imagine a part of the radiation inside the enclosure to strike one of the walls. Some will be reflected back to remain radiation in the enclosure. Some will be absorbed by the walls. In turn, the walls will emit some radiation, which will be added to that reflected away from the walls. Thus, there is a continual interchange of radiation between the interior of the enclosure and the walls.

If the radiation in the interior were very weak, the walls would emit more radiation than the radiation which struck and was absorbed by them.
If the radiation in the interior were very strong, the walls would receive and absorb more radiation than they emitted. When the electromagnetic radiation lost to the walls is just equal to that supplied by the walls, the radiation is said to be in equilibrium with its material surroundings. It has an energy which increases with temperature, just as the energy of motion of the molecules of a gas, a liquid, or a solid increases with temperature.

The intensity of radiation in an enclosure does not depend on how absorbing or reflecting the walls of the enclosure are; it depends only on the temperature of the walls. If this were not so and we made a little hole joining the interior of a shiny, reflecting enclosure with the interior of a dull, absorbing enclosure at the same temperature, there would have to be a net flow of radiation through the hole from one enclosure to another at the same temperature. This never happens.

We thus see that there is a particular intensity of electromagnetic radiation, such as light, heat, and radio waves, which is characteristic of a particular temperature. Now, while electromagnetic waves travel through vacuum, air, or insulating substances such as glass, they can be guided by wires. Indeed, we can think of the signal sent along a pair of telephone wires either in terms of the voltage between the wires and the current of electrons which flows in the wires, or in terms of a wave made up of electric and magnetic fields between and around the wires, a wave which moves along with the current. As we can identify electrical signals on wires with electromagnetic waves, and as hot bodies radiate electromagnetic waves, we should expect heat to generate some sort of electrical signals. J. B. Johnson, who discovered the electrical fluctuations caused by heat, described them, not in terms of electromagnetic waves but in terms of a fluctuating voltage produced across a resistor.
Once Johnson had found and measured these fluctuations, another physicist was able to find a correct theoretical expression for their magnitude by applying the principles of statistical mechanics. This second physicist was none other than H. Nyquist, who, as we saw in Chapter II, also contributed substantially to the early foundations of information theory. Nyquist's expression for what is now called either Johnson noise or thermal noise is

$\overline{V^2} = 4kTRW$ (10.1)

Here $\overline{V^2}$ is the mean square noise voltage, that is, the average value of the square of the noise voltage, across the resistor. k is Boltzmann's constant:

$k = 1.37 \times 10^{-23}$ joule/degree

T is the temperature of the resistor in degrees Kelvin, which is the number of Celsius or centigrade degrees (which are 9/5 as large as Fahrenheit degrees) above absolute zero. Absolute zero is −273 centigrade or −459 Fahrenheit. R is the resistance of the resistor measured in ohms. W is the band width of the noise in cycles per second.

Obviously, the band width W depends only on the properties of our measuring device. If we amplify the noise with a broad-band amplifier we get more noise than if we use a narrow-band amplifier of the same gain. Hence, we would expect more noise in a television receiver, which amplifies signals over a band width of several million cycles per second, than in a radio receiver, which amplifies signals having a band width of several thousand cycles per second.

We have seen that a hot resistor produces a noise voltage. If we connect another resistor to the hot resistor, electric power will flow to this second resistor. If the second resistor is cold, the power will heat it. Thus, a hot resistor is a potential source of noise power. What is the most noise power N that it can supply? The power is

$N = kTW$ (10.2)

In some ways, 10.2 is more satisfactory than 10.1. For one thing, it has fewer terms; the resistance R no longer appears.
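Equations 10.1 and 10.2 are easy to evaluate numerically. The sketch below uses the value of Boltzmann's constant quoted in the text; the function names and the example resistor, temperature, and band widths are illustrative choices, not figures from the text. The last check relies on a standard circuit result that the text does not spell out: a source of resistance R delivers the most power to a load of the same resistance, and that power is the mean square voltage divided by 4R.

```python
import math

K_BOLTZMANN = 1.37e-23  # joule/degree, the value quoted in the text

def mean_square_noise_voltage(temperature_k, resistance_ohm, bandwidth_hz):
    """Equation 10.1: mean square Johnson noise voltage, 4kTRW."""
    return 4.0 * K_BOLTZMANN * temperature_k * resistance_ohm * bandwidth_hz

def available_noise_power(temperature_k, bandwidth_hz):
    """Equation 10.2: greatest noise power a hot resistor can supply, kTW."""
    return K_BOLTZMANN * temperature_k * bandwidth_hz

# Illustrative case: a 1,000 ohm resistor at 293 Kelvin, observed over 10,000 cycles per second.
v2 = mean_square_noise_voltage(293.0, 1000.0, 1.0e4)
print(math.sqrt(v2))  # root-mean-square noise voltage in volts

# Power into a matched load (assumed standard result): V^2 / (4R), which equals kTW.
print(v2 / (4.0 * 1000.0), available_noise_power(293.0, 1.0e4))

# Noise power grows in proportion to band width: a 4,000,000 cycle TV band
# against a 5,000 cycle radio band (both widths chosen for illustration).
print(available_noise_power(293.0, 4.0e6) / available_noise_power(293.0, 5.0e3))
```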
For another thing, its form is suitable for application to somewhat different situations.

For instance, suppose that we have a radio telescope, a big parabolic reflector which focuses radio waves into a sensitive radio receiver. I have indicated such a radio telescope in Figure X-1.

[Fig. X-1. Labels in figure: sun, moon, radio telescope]

Suppose we point the radio telescope at different celestial or terrestrial objects, so as to receive the electromagnetic noise which they radiate because of their temperature. We find that the radio noise power received is given by 10.2, where T is the temperature of the object at which the radio telescope points.

If we point the telescope down at water or at smooth ground, what it actually sees is the reflection of the sky, but if we point it at things which don't reflect radio waves well, such as leafy trees or bushes, we get a noise corresponding to a temperature around 290 Kelvin (about 62 Fahrenheit), the temperature of the trees.

If we point the radio telescope at the moon and if the telescope is directive enough to see just the moon and not the sky around it, we get about the same noise, which corresponds not to the temperature of the very surface of the moon but to the temperature a fraction of an inch down, for the substance of the moon is somewhat transparent to radio waves.

If we point the telescope at the sun, the amount of noise we obtain depends on the frequency to which we tune the radio receiver. If we tune the receiver to a frequency around 10 million cycles per second (a wave length of 30 meters), we get noise corresponding to a temperature of around a million degrees Kelvin; this is the temperature of the tenuous outer corona of the sun. The corona is transparent to radio waves of shorter wave lengths, just as the air of the earth is.
Thus, if we tune the radio receiver to a frequency of around 10 billion cycles per second, we receive radiation corresponding to the temperature of around 8,000 Kelvin, the temperature a little above the visible surface. Just why the corona is so much hotter than the visible surface which lies below it is not known.

The radio noise from the sky is also different at different frequencies. At frequencies above a few billion cycles per second the noise corresponds to a temperature of 2 to 4 Kelvin. At lower frequencies the noise is greater and increases steadily as the frequency is lowered. The Milky Way, particular stars, and island universes or galaxies in collision all emit large amounts of radio noise. The heavens are not at a uniform temperature, and we cannot regard the heavens as radiating noise according to equation 10.2.

Nonetheless, Johnson or thermal noise constitutes a minimum noise which we must accept, and additional noise sources only make the situation worse. The fundamental nature of Johnson noise has led to its being used as a standard in the measurement of the performance of radio receivers.

As we have noted, a radio receiver adds a certain noise to the signals it receives. It also amplifies any noise that it receives. We can ask, how much amplified Johnson noise would just equal the noise the receiver adds? We can specify this noise by means of an equivalent noise temperature $T_n$. This equivalent noise temperature $T_n$ is a measure of the noisiness of the radio receiver. The smaller $T_n$ is the better the receiver is.

We can interpret the noise temperature $T_n$ in the following way.
If we had an ideal noiseless receiver with just the same gain and band width as the actual receiver and if we added Johnson noise corresponding to the temperature $T_n$ to the signal it received, then the ratio of signal power to noise power would be the same for the ideal receiver with the Johnson noise added to the signal as for the actual receiver. Thus, the noise temperature $T_n$ is a just measure of the noisiness of the receiver. Sometimes another measure based on $T_n$ is used; this is called the noise figure NF. In terms of $T_n$ the noise figure is

$NF = 1 + \frac{T_n}{293}$ (10.3)

The noise figure was defined for use here on earth, where every signal has mixed with it noise corresponding to a temperature of around 293 Kelvin. The noise figure is the ratio of the total output noise, including noise due to Johnson noise for a temperature of 293 Kelvin at the input and noise produced in the receiver, to the amplified Johnson noise alone.

Of course, the equivalent noise temperature $T_n$ of a radio receiver depends on the nature and perfection of the radio receiver, and the lowest attainable noise figure depends on the frequency of operation. However, Table XIV below gives some rough figures for various sorts of receivers.

TABLE XIV

Type of Receiver                                                          Equivalent Noise Temperature, Degrees Kelvin
Radio or TV receiver                                                      30,000
6,000,000,000 cycle per second receiver using Maser amplifier             20
Good 6,000,000,000 cycle per second receiver not using Maser amplifier    3,000

The effective temperatures of radio receivers and the temperatures of the objects at which their antennas are directed are very important in connection with communication theory, because noise determines the power required to send messages. Johnson noise is Gaussian noise, to which equation 9.3 applies.
Thus, ideally, in order to transmit C bits per second, we must have a signal power P related to the noise power N by a relation that was derived in the preceding chapter:

$C = W \log (1 + P/N)$ (9.3)

If we use expression 10.2 for noise, this becomes

$C = W \log \left(1 + \frac{P}{kTW}\right)$ (10.4)

Let us assume a given signal power P. If we make W very small, C will become very small. However, if we make W larger and larger, C does not become larger and larger without limit, but rather it approaches a limiting value. When P/kTW becomes very small compared with unity, 10.4 becomes

$C = \frac{P}{0.693\,kT} = 1.44\,\frac{P}{kT}$ (10.5)

We can also write this

$P = 0.693\,kTC$ (10.6)

Relation 10.6 says that, even when we use a very wide band width, we need at least a power 0.693 kT joule per second to send one bit per second, so that on the average we must use an energy of 0.693 kT joule for each bit of information we transmit. We should remember, however, that equation 9.3 holds only for an ideal sort of encoding in which many characters representing many bits of information are encoded together into a long stretch of signal. Most practical communication systems require much more energy per bit, as we noted in Chapter IX.

Let us now see what the implications of expression 10.6 are for some unusual communication systems. Suppose that we are on a space ship near Mars and want to send English text back to Earth. Ideally, something like 1 binary digit per letter or 5.5 binary digits per word would suffice. If we want to send text at a common teletypewriter speed of 60 words per minute (which is 1 word per second) we will need to send 5.5 binary digits per second, so this will be the value of C, the channel capacity.

If the signal which reaches our receiver comes from cold, cold space, the only necessary noise will correspond to the temperature of space. If we use a frequency of thousands of megacycles per second, we can take this as being around 4 Kelvin. Thus, we can use 10.6 to calculate the power $P_R$ which must be received. C is 5.5, and T is 4.
The required received power turns out to be

$P_R = 2 \times 10^{-22}$ watt

Of course, we must transmit much more power than this, for not all the power transmitted will be intercepted by the receiving antenna. Let us consider the special case in which the transmitted power is sent out almost uniformly in all directions. At a distance L from the transmitter, it will have spread evenly over a sphere of radius L and surface $4\pi L^2$. Suppose that the receiving antenna is a concave, parabolic reflector of diameter D and area $\pi D^2/4$. Then the ratio of the power transmitted $P_T$ to the power received by the antenna $P_R$ will be

$\frac{P_T}{P_R} = \frac{4\pi L^2}{\pi D^2/4} = \frac{16 L^2}{D^2}$ (10.7)

Now imagine that the space ship is 30 million miles or about $1.5 \times 10^{11}$ ft from earth. Imagine also that the diameter D of the antenna is 150 ft. Then the ratio of transmitter power to receiver power will be

$\frac{P_T}{P_R} = 1.6 \times 10^{19}$

If the required receiver power is $2 \times 10^{-22}$ watt, the transmitter power must thus be about

$P_T = 0.003$ watt

Thus, ideally if we used a 150-ft-diameter receiving antenna, we could transmit English text back from the vicinity of Mars at a speed of 60 words a minute by using only about three thousandths of a watt!

What power would we actually have to use? If we encoded the text letter by letter using 5 binary digits per letter, a simple and common practice, we would thus increase the power required by a factor of 5. If we used the best known receiver, the maser, which has a noise temperature of around 20 Kelvin, instead of the 4 Kelvin we assumed for sky noise alone, we would for this reason require another factor of five times. If we transmitted the binary digits one at a time by turning a radio transmitter either on or off, we would raise the power requirement by another factor of perhaps forty times because of this inefficient method of modulation or encoding. Thus, an actual system might call for a power of a thousand times the ideal, or 3 watts.
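The arithmetic of this example can be retraced step by step. The sketch below uses the constants and figures quoted in the text (k = 1.37 × 10⁻²³ joule/degree, T = 4 Kelvin, C = 5.5 bits per second, L = 1.5 × 10¹¹ ft, D = 150 ft); the function names are illustrative, and the natural logarithm of 2 stands in for the rounded 0.693. It also checks that the capacity of 10.4 creeps up toward the limit of 10.5 as the band width is widened.

```python
import math

K_BOLTZMANN = 1.37e-23  # joule/degree, the value quoted in the text

def min_received_power(temperature_k, bits_per_second):
    """Relation 10.6: P = 0.693 kTC, the least power for C bits per second."""
    return math.log(2.0) * K_BOLTZMANN * temperature_k * bits_per_second

def thermal_capacity(power_w, temperature_k, bandwidth):
    """Relation 10.4 in bits per second: C = W log2(1 + P/kTW)."""
    x = power_w / (K_BOLTZMANN * temperature_k * bandwidth)
    return bandwidth * math.log1p(x) / math.log(2.0)

def spreading_ratio(distance, dish_diameter):
    """P_T / P_R for power spread over a sphere and caught by a dish of diameter D."""
    return (4.0 * math.pi * distance ** 2) / (math.pi * dish_diameter ** 2 / 4.0)

p_received = min_received_power(4.0, 5.5)   # about 2e-22 watt
ratio = spreading_ratio(1.5e11, 150.0)      # about 1.6e19
p_ideal = ratio * p_received                # about 0.003 watt
p_practical = p_ideal * 5 * 5 * 40          # the three penalty factors named in the text

print(p_received, ratio, p_ideal, p_practical)

# With the received power fixed, widening the band raises C only up to the 5.5 limit.
for w in (1.0, 10.0, 1.0e3, 1.0e9):
    print(w, thermal_capacity(p_received, 4.0, w))
```

The three penalty factors multiply to a thousand, which is how the ideal few thousandths of a watt becomes roughly 3 watts.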
Further, if we didn't use a maser receiver, we might have to raise the power by another factor of 10, to 30 watts.

Suppose, now, that we wanted to receive messages from a space ship when it was 30 million miles away and between us and the sun. A 150-ft antenna would be directive enough to see only the sun and the space ship. The sun's temperature is about two thousand times that of the cold sky, so even ideally we would need about 6 watts, and practically we might need several hundred watts.

Of course, in an actual interplanetary communication system, we would use a large directive antenna at the transmitter out in space. We would thus send the transmitted power back toward earth in a narrow beam. This would cut down the power required to perhaps a ten-thousandth to a millionth of the power computed above. Thus, even television transmission would be possible between Mars and the earth, if only we could get to Mars to televise something!

I have used this example partly because I find it striking and interesting. Partly, I have used it because it is in such a case that it is most important to cut the power down as much as possible. Both power and powerful radio transmitters will be very expensive far away from earth.

The above example illustrates the wide difference between the restrictions imposed by the physical universe and the sort of thing we are able to accomplish with present schemes of encoding and with existing radio receivers. In an era of space exploration and perhaps of space travel, it will be worth while to try to approach more nearly the limiting efficiency of communication allowed by nature.

Let us now turn to another aspect of limitations which the laws of physics impose on our ability to communicate. We have already considered electromagnetic waves traveling freely in space and electromagnetic waves guided by a pair of wires.
Electromagnetic waves can also be sent through pipes or tubes called wave guides; that is, they can if the wave length is smaller than about twice the diameter of the pipe or tube. When the wave length is much smaller than the diameter of the wave guide, electromagnetic waves can travel through the wave guide in many different spatial patterns or modes. Ideally, each of these modes can travel independently without interfering with the others, and so we could launch and receive many independent messages in the same frequency range by using these independent modes. This possibility is of little practical importance, however, for imperfections of construction do cause the modes to interact. Further, in a practical system it is best to use the mode which transmits electromagnetic waves with the least loss of power, that is, with the least attenuation, and to get rid of the rest.

The existence of the many modes is, however, important in illustrating a theoretical point. Imagine that we send electromagnetic energy of very short wave length through a wave guide. Suppose that we put across the wave guide a transparency or picture so made that "light" parts of it transmit electromagnetic waves freely and "dark" parts absorb some of the electromagnetic energy. It can be shown that the strength of the various modes on the far side of the picture is a representation of the lightness and darkness of the picture. In shining light through a transparent object, we similarly set up a complicated pattern of electromagnetic modes, each of which carries some information concerning the object viewed. Thus, we can relate the idea of forming an image of an object by means of light of a given frequency or wave length to the idea of transmitting information by means of a number of independent communication channels which are the different modes of propagation. This matter has been explored to some degree.

Here, however, we encounter a difficulty.
The expressions for Johnson noise, equations 10.1 and 10.2, are classical or pre-quantum-theory equations. In dealing with radio waves, these are under most circumstances quite accurate enough, but they are very inaccurate in dealing with light waves, which have frequencies of around five hundred million megacycles per second.

Radiation comes in little packets, or quanta. Each has an energy E given by

$$E = hf \tag{10.8}$$

Here f is the frequency in cycles per second and h is Planck's constant

$$h = 6.63 \times 10^{-34}\ \text{joule-second}$$

Usually, quantum effects become important when hf is comparable to or larger than kT. Thus, the frequency f above which our classical expressions will be clearly in error is

$$f = 2.07 \times 10^{10}\,T \tag{10.9}$$

For a temperature of 3° Kelvin this is a frequency of about 60,000 megacycles per second, corresponding to a wave length of 1/2 centimeter, which is in the microwave radio range. For a temperature of 300° K (room temperature) the frequency is 6 million megacycles, and the wave length 0.005 cm, which lies in the long infrared. Visible light has a frequency of around 500 million megacycles per second and a wave length of around $6 \times 10^{-5}$ cm.

What limitations do quantum effects put on communication? The answer is that we don't know exactly. Today, over ten years after the invention of information theory in its present form, the physicists haven't provided a complete answer to this very fundamental question. We can say a little about the matter, however.

Classically, we can regard a signal, however faint it may be, as a smoothly varying current, voltage, or electric or magnetic field. This is disturbed by the presence of Johnson noise, but the noise is merely a smoothly varying unpredictable quantity added to a smoothly varying signal. According to quantum theory, a signal will be to some degree unpredictable even when we add no noise.
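The threshold frequencies quoted above follow from setting hf equal to kT. A minimal Python sketch of that calculation is given here; the function names are illustrative assumptions, and the constants are present-day values, so the coefficient comes out near $2.08 \times 10^{10}$ rather than exactly the $2.07 \times 10^{10}$ of equation 10.9.

```python
BOLTZMANN_K = 1.380649e-23          # joule per kelvin
PLANCK_H = 6.62607015e-34           # joule-second
SPEED_OF_LIGHT_CM_S = 2.99792458e10 # centimeters per second

def quantum_threshold_hz(temperature_k: float) -> float:
    """Frequency at which h*f equals k*T; above it the classical
    Johnson-noise expressions become clearly inaccurate."""
    return BOLTZMANN_K * temperature_k / PLANCK_H

def wavelength_cm(frequency_hz: float) -> float:
    """Free-space wave length corresponding to a frequency."""
    return SPEED_OF_LIGHT_CM_S / frequency_hz

for temperature_k in (3.0, 300.0):
    f = quantum_threshold_hz(temperature_k)
    print(f"T = {temperature_k:5.0f} K: f = {f / 1e6:.3g} megacycles, "
          f"wave length = {wavelength_cm(f):.3g} cm")

# Ratio h*f / k*T for visible light (about 5e14 cycles per second) at room
# temperature; a value far above 1 means quantum effects dominate.
visible_ratio = PLANCK_H * 5e14 / (BOLTZMANN_K * 300.0)
print(f"hf/kT for visible light at 300 K: {visible_ratio:.0f}")
```

For visible light at room temperature the ratio hf/kT is far greater than one, which is why the classical expressions fail so badly for light waves.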
Thus, we can't send a signal having an energy of less than 1 quantum, that is, hf. If we send 1 quantum, we can't specify exactly both its frequency and the time it will arrive at the receiver. Heisenberg's uncertainty principle forbids this.

In considering quantum effects, one noise we have to contend with is an admixture with the signal of quanta of thermal origin. This corresponds to Johnson noise. We see, however, that, even if these noise quanta are absent, there will be some uncertainty in the received signal, while in the classical case there was not.

We can at this point answer a few questions concerning the limitations imposed on communication by quantum effects. For instance, how many quanta do we need to use per bit of information transmitted? Despite the fact that we cannot at will send just 1 quantum and be sure that we have sent 1 and not none, it turns out that, in the absence of interfering quanta which constitute thermal noise, we can on the average send an unlimited number of bits per quantum if only we take long enough in doing so. We can do this, for instance, by trying to send the quantum in one of a very large number of different time intervals or at one of a very large number of different frequencies, thus increasing the choice the sender of the message has as to how he shall send a quantum. Complicated encoding of the messages to be transmitted can be used to avoid errors in the over-all system even in the presence of occasional errors of transmission due to quantum uncertainty.

We might also ask, on the average how much power do we need per bit of information transmitted? Again, if we have no interfering quanta, we can make this power as small as we like, both by sending many bits per quantum, as outlined above, and by using a low frequency so that the energy per quantum is small.
In fact, if we use very low frequencies in signaling (this limits the rate at which we can signal), expression 10.6 applies in the quantum as well as in the classical case, for the quantum behavior approaches classical behavior for low frequencies.

However, in actual communication systems, we may want to use frequencies for which quantum effects are important. We have no exact expressions which take quantum effects into account in dealing with high-frequency signals mixed with noise. The one thing we can be sure of is that things will be somewhat worse than in the classical, nonquantum, Johnson-noise case. But just how much worse they will be we do not yet know.

From the point of view of information theory, the most interesting relation between physics and information theory lies in the evaluation of the unavoidable limitations imposed by the laws of physics on our ability to communicate. In a very fundamental sense, this is concerned with the limitations imposed by Johnson noise and quantum effects. It also, however, includes limitations imposed by atmospheric turbulence and by fluctuations in the ionosphere, which can distort a signal in a way quite different from adding noise to it. Many other examples of this sort of relation of physics to information theory could be unearthed.

Physicists have thought of a connection between physics and communication theory which has nothing to do with the fundamental problem that communication theory set out to solve, that is, the possibilities and limitations of efficient encoding in transmitting information over a noisy channel. Physicists propose to use the idea of the transmission of information in order to show the impossibility of what is called a perpetual-motion machine of the second kind. As a matter of fact, this idea preceded the invention of communication theory in its present form, for L. Szilard put forward such ideas in 1929.
Some perpetual-motion machines purport to create energy; this violates the first law of thermodynamics, that is, the conservation of energy.

Other perpetual-motion machines purport to convert the disorganized energy of heat in matter or radiation which is all at the same temperature into ordered energy, such as the rotation of a flywheel. The rotating flywheel could, of course, be used to drive a refrigerator which would cool some objects and heat others. Thus, this sort of perpetual motion could, without the use of additional organized energy, transfer the energy of heat from cold material to hot material.

The second law of thermodynamics can be variously stated: that heat will not flow from a cold body to a hot body without the expenditure of organized energy or that the entropy of a system never decreases. The second sort of perpetual-motion machine violates the second law of thermodynamics.

One of the most famous perpetual-motion machines of this second kind was invented by James Clerk Maxwell. It makes use of a fictional character called Maxwell's demon.

I have pictured Maxwell's demon in Figure X-2. He inhabits a divided box and operates a small door connecting the two chambers of the box. When he sees a fast molecule heading toward the door from the far side, he opens the door and lets it into his side. When he sees a slow molecule heading toward the door from his side he lets it through. He keeps slow molecules from entering his side and fast molecules from leaving his side. Soon, the gas in his side is made up of fast molecules. It is hot, while the gas on the other side is made up of slow molecules and it is cool. Maxwell's demon makes heat flow from the cool chamber to the hot chamber. I have shown him operating the door with one hand and thumbing his nose at the second law of thermodynamics with his other hand.

Maxwell's demon has been a real puzzler to those physicists who have not merely shrugged him off.
The best general objection we can raise to him is that, since the demon's environment is at thermal equilibrium, the only light present is the random electromagnetic radiation corresponding to thermal noise, and this is so chaotic that the demon can't use it to see what sort of molecules are coming toward the door.

We can think of other versions of Maxwell's demon. What about putting a spring door between the two chambers, for instance? A molecule hitting such a door from one side can open it and go through; one hitting it from the other side can't open it at all. Won't we end up with all the molecules and their energy on the side into which the spring door opens?

Fig. X-2

One objection which can be raised to the spring door is that, if the spring is strong, a molecule can't open the door, while, if the spring is weak, thermal energy will keep the door continually flapping, and it will be mostly open. Too, a molecule will give energy to the door in opening it. Physicists are pretty well agreed that such mechanical devices as spring doors or delicate ratchets can't be used to violate the second law of thermodynamics.

Arguing about what will and what won't work is a delicate business. An ingenious friend fooled me completely with his machine until I remembered that any enclosure at thermal equilibrium must contain random electromagnetic radiation as well as molecules. However, there is one simple machine which, although it is frictionless, ridiculous, and certainly inoperable in any practical sense, is, I believe, not physically impossible in the very special sense in which physicists use this expression. This machine is illustrated in Figure X-3.

The machine makes use of a cylinder C and a frictionless piston P. As the piston moves left or right, it raises one of the little pans p and lowers the other. The piston has a door in it which can be opened or closed.
The cylinder contains just one molecule M. The whole device is at a temperature T. The molecule will continually gain and lose energy in its collisions with the walls, and it will have an average energy proportional to the temperature.

When the door in the piston is open, no work will be done if we move the piston slowly to the right or to the left.

Fig. X-3

We start by centering the piston with the door open. We clamp the piston in the center and close the door. We then observe which side of the piston the molecule is on. When we have found out which side of the piston the molecule is on, we put a little weight from low shelf $S_1$ onto the pan on the same side as the molecule and unclamp the piston. The repeated impact of the molecule on the piston will eventually raise the weight to the higher shelf $S_2$, and we take the weight off and put it on this higher shelf. We then open the door in the piston, center it, and repeat the process. Eventually, we will have lifted an enormous number of little weights from the lower shelves $S_1$ to the upper shelves $S_2$. We have done organized work by means of disorganized thermal energy!

How much work have we done? It is easily shown that the average force F which the molecule exerts on the piston is

$$F = \frac{kT}{L} \tag{10.10}$$

Here L is the distance from the piston to the end of the cylinder on the side containing the molecule. When we allow the molecule to push against the piston and slowly drive it to the end of the cylinder, so that the distance is doubled, the most work W that the molecule can do is

$$W = 0.693\,kT \tag{10.11}$$

Actually, in lifting a constant weight the work done will be less, but 10.11 represents the limit.

Did we get this free? Not quite! When we have centered the piston and closed the door it is equally likely that we will find the molecule in either half of the cylinder.
In order to know which pan to put the weight on, we need one bit of information, specifying which side the molecule is on. To make the machine run we must receive this information in a system which is at a temperature T. What is the very least energy needed to transmit one bit of information at the temperature T? We have already computed this; from equation 10.6 we see that it is exactly 0.693 kT joule, just equal to the most energy the machine can generate. We should remember that this applies in the quantum case if we signal slowly, using very low frequencies. Thus, we use up all the output of the machine in transmitting enough information to make the machine run!

It's useless to argue about the actual, the attainable, as opposed to the limiting efficiency of such a machine; the important thing is that even at the very best we could do no more than break even.

We have now seen in one simple case that the transmission of information in the sense of communication theory can enable us to convert thermal energy into mechanical energy. The bit which measures amount of information used is the unit in terms of which the entropy of a message source is measured in communication theory. The entropy of thermodynamics determines what part of existing thermal energy can be turned into mechanical work. It seems natural to try to relate the entropy of thermodynamics and statistical mechanics with the entropy of communication theory.

The entropy of communication theory is a measure of the uncertainty as to what message, among many possible messages, a message source will actually produce on a given occasion. If the source chooses a message from among m equally probable messages, the entropy in bits per message is the logarithm to the base 2 of m; in this case it is clear that such messages can be transmitted by means of log m binary digits per message.
More generally, the importance of the entropy of communication theory is that it measures directly the average number of binary digits required to transmit messages produced by a message source.

The entropy of statistical mechanics is the uncertainty as to what state a physical system is in. It is assumed in statistical mechanics that all states of a given total energy are equally probable. The entropy of statistical mechanics is Boltzmann's constant times the logarithm to the base e of the number of possible states. This entropy has a wide importance in statistical mechanics. One matter of importance is that the free energy, which we will call F.E., is given by

$$\text{F.E.} = E - HT \tag{10.12}$$

Here E is the total energy, H is the entropy, and T is the temperature. The free energy is the part of the total energy which, ideally, can be turned into organized energy, such as the energy of a lifted weight.

In order to understand the entropy of statistical mechanics, we have to say what a physical system is, and we will do this by citing a few examples. A physical system can be a crystalline solid, a closed vessel containing water and water vapor, a container filled with gas, or any other substance or collection of substances. We will consider such a system when it is at equilibrium, that is, when it has settled down to a uniform temperature and when any physical or chemical changes that may tend to occur at this temperature have gone as far as they will go.

As a particular example of a physical system, we will consider an idealized gas made up of a lot of little, infinitely small particles, whizzing around every which way in a container.

The state of such a system is a complete description, or as complete a description as the laws of physics allow, of the positions and velocities of all of these particles.
According to classical mechanics (Newton's laws of motion), each particle can have any velocity and energy, so there is an uncountably infinite number of states, as there is such an uncountable infinity of points in a line or a square. According to quantum mechanics, there is an infinite but countable number of states. Thus, the classical case is analogous to the difficult communication theory of continuous signals, while the more exact quantum case is analogous to the communication theory of discrete signals which are made up of a countable set of distinct, different symbols. We have dealt with the theory of discrete signals at length in this book.

According to quantum mechanics, a particle of an idealized gas can move with only certain energies. When it has one of these allowed energies, it is said to occupy a particular energy level. How large will the entropy of such a gas be? If we increase the volume of the gas, we increase the number of energy levels within a given energy range. This increases the number of states the system can be in at a given temperature, and hence it increases the entropy. Such an increase in entropy occurs if a partition confining a gas to a portion of a container is removed and the gas is allowed to expand suddenly into the whole container.

If the temperature of a gas of constant volume is increased, the particles can occupy energy levels of higher energy, so more combinations of energy levels can be occupied; this increases the number of states, and the entropy increases.

If a gas is allowed to expand against a slowly moving piston and no heat is added to the gas, the number of energy levels in a given energy range increases, but the temperature of the gas decreases just enough so as to keep the number of states and the entropy the same.

We see that for a given temperature, a gas confined to a small volume has less entropy than the same gas spread through a larger volume.
In the case of the one-molecule gas of Figure X-3, the entropy is less when the door is closed and the molecule is confined to the space on one side of the piston. At least, the entropy is less if we know which side of the piston the molecule is on.

We can easily compute the decrease in entropy caused by halving the volume of an ideal, one-molecule, classical gas at a given temperature. In halving the volume we halve the number of states, and the entropy changes by an amount

$$k \log_e \tfrac{1}{2} = -0.693\,k$$

The corresponding change in free energy is the negative of T times this change in entropy, that is,

$$0.693\,kT$$

This is just the work that, according to 10.11, we can obtain by halving the volume of the one-molecule gas and then letting it expand against the piston until the volume is doubled again. Thus, computing the change in free energy is one way of obtaining 10.11.

In reviewing our experience with the one-molecule heat engine in this light, we see that we must transmit one bit of information in order to specify on which side of the piston the molecule is. We must transmit this information against a background of noise corresponding to the uniform temperature T. To do this takes 0.693 kT joule of energy.

Because we now know that the molecule is definitely on a particular side of the piston, the entropy is 0.693 k less than it would be if we were uncertain as to which side of the piston the molecule was on. This reduction of entropy corresponds to an increase in free energy of 0.693 kT joule. This free energy we can turn into work by allowing the piston to move slowly to the unoccupied end of the cylinder while the molecule pushes against it in repeated impacts. At this point the entropy has risen to its original value, and we have obtained from the system an amount of work which, alas, is just equal to the minimum possible energy required to transmit the information which told us on which side of the piston the molecule was.
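The balance sheet of the one-molecule engine can be checked numerically. The sketch below, in Python, is an illustration under stated assumptions: it integrates the force of equation 10.10 over a doubling of the distance by a simple midpoint rule, and compares the result with the free-energy change and with the minimum energy per bit, taken as kT times the natural logarithm of 2. The function names and the choice of 300° K are hypothetical.

```python
import math

BOLTZMANN_K = 1.380649e-23  # joule per kelvin

def expansion_work(temperature_k: float, steps: int = 100_000) -> float:
    """Work done by the force F = kT/L as L doubles (midpoint-rule integral).
    Only the ratio of final to initial length matters, so L runs from 1 to 2."""
    dl = 1.0 / steps
    total = 0.0
    for i in range(steps):
        length = 1.0 + (i + 0.5) * dl
        total += BOLTZMANN_K * temperature_k / length * dl
    return total

def entropy_change_on_halving() -> float:
    """Entropy change k*ln(1/2) when the number of states is halved."""
    return BOLTZMANN_K * math.log(0.5)

def min_energy_per_bit(temperature_k: float) -> float:
    """Least energy to transmit one bit against thermal noise: 0.693 kT."""
    return BOLTZMANN_K * temperature_k * math.log(2.0)

temperature_k = 300.0
work = expansion_work(temperature_k)
free_energy_gain = -temperature_k * entropy_change_on_halving()
bit_cost = min_energy_per_bit(temperature_k)

print(f"work from expansion  : {work:.4e} J")
print(f"free-energy increase : {free_energy_gain:.4e} J")
print(f"cost of one bit      : {bit_cost:.4e} J")
```

All three quantities agree to within the accuracy of the numerical integration, which is the break-even result described in the text.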
Let us now consider a more complicated case. Suppose that a physical system has at a particular temperature a total of m states. Suppose that we divide these states into n equal groups. The number of states in each of these groups will be m/n.

Suppose that we regard the specification as to which one of the n groups of states contains the state that the system is in as a message source. As there are n equally likely groups of states, the communication-theory entropy of the source is log n bits. This means that it will take log n binary digits to specify the particular group of states which contains the state the system is actually in. To transmit this information at a temperature T requires at least

$$0.693\,kT \log n = kT \log_e n$$

joule of energy. That is, the energy required to transmit the message is proportional to the communication-theory entropy of the message source.

If we know merely that the system is in one of the total of m states, the entropy is

$$k \log_e m$$

If we are sure that the system is in one particular group of states containing only m/n states (as we are after transmission of the information as to which state the system is in), the entropy is

$$k \log_e \frac{m}{n} = k\,(\log_e m - \log_e n)$$

The change in entropy brought about by information concerning which one of the n groups of states the system is in is thus

$$-k \log_e n$$

The corresponding increase in free energy is

$$kT \log_e n$$

But this is just equal to the least energy necessary to transmit the information as to which group of states contains the state the system is in, the information that led to the decrease in entropy and the increase in free energy.

We can regard any process which specifies something concerning which state a system is in as a message source. This source generates a message which reduces our uncertainty as to what state the system is in. Such a source has a certain communication-theory entropy per message. This entropy is equal to the number of binary digits necessary to transmit a message generated by the source.
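The general case of m states divided into n equal groups can be tabulated in the same way. This Python sketch is illustrative only: the function names and the sample values of m, n, and T are assumptions, and logarithms written without a base in the text are taken to the base 2.

```python
import math

BOLTZMANN_K = 1.380649e-23  # joule per kelvin

def source_entropy_bits(groups: int) -> float:
    """Communication-theory entropy of a choice among n equally likely groups."""
    return math.log2(groups)

def transmission_energy(groups: int, temperature_k: float) -> float:
    """Least energy to transmit the choice: 0.693 kT per bit."""
    return BOLTZMANN_K * temperature_k * math.log(2.0) * source_entropy_bits(groups)

def physical_entropy(states: float) -> float:
    """Statistical-mechanical entropy k*ln(number of equally probable states)."""
    return BOLTZMANN_K * math.log(states)

def free_energy_increase(states: int, groups: int, temperature_k: float) -> float:
    """Gain in free energy when the system is known to lie in one group."""
    entropy_change = physical_entropy(states / groups) - physical_entropy(states)
    return -temperature_k * entropy_change  # equals kT*ln(n)

m_states, n_groups, temperature_k = 1024, 8, 300.0
bits = source_entropy_bits(n_groups)
cost = transmission_energy(n_groups, temperature_k)
gain = free_energy_increase(m_states, n_groups, temperature_k)

print(f"message entropy      : {bits:.0f} bits")
print(f"energy to transmit   : {cost:.4e} J")
print(f"free-energy increase : {gain:.4e} J")
```

Whatever values of m and n are chosen, the energy spent on the message and the free energy gained come out equal.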
It takes a particular energy per binary digit to transmit the message against a noise corresponding to the temperature T of the system. The message reduces our uncertainty as to what state the system is in, thus reducing the entropy (of statistical mechanics) of the system. The reduction of entropy increases the free energy of the system. But the increase in free energy is just equal to the minimum energy necessary to transmit the message which led to the increase of free energy, an energy proportional to the entropy of communication theory.

This, I believe, is the relation between the entropy of communication theory and that of statistical mechanics. One pays a price for information which leads to a reduction of the statistical-mechanical entropy of a system. This price is proportional to the communication-theory entropy of the message source which produces the information. It is always just high enough so that a perpetual-motion machine of the second kind is impossible.

We should note, however, that a message source which generates messages concerning the state of a physical system is one very particular and peculiar kind of message source. Sources of English text or of speech sounds are much more common. It seems irrelevant to relate such entropies to the entropy of physics, except perhaps through the energy required to transmit a bit of information under highly idealized conditions.

All this concern about relating the entropies of physics and communication theory seems to me to be a tempest in a teapot. No one doubts the second law of thermodynamics. If, however, such a study inspired physicists to discover and study the quantum analog of equation 10.4, it would be worthwhile, for such a relation is a conspicuously missing part of communication theory.

To summarize, in this chapter we have considered some of the problems of communicating electrically in our actual physical world.
We have seen that various physical phenomena, including lightning and automobile ignition systems, produce electrical disturbances or noise which are mixed with the electrical signals we use for the transmission of messages. Such noise is a source of error in the transmission of signals, and it limits the rate at which we can transmit information when we use a particular signal power and band width.

The noise emitted by hot bodies (and any body is hot to a degree if its temperature is greater than absolute zero) is a particularly simple, universal, unavoidable noise which sets a natural limit on radio transmission systems. We can express this limit according to the classical laws of physics. This expression is in error for high frequencies and low temperatures. We do not as yet have a general quantum-mechanical formulation of this limitation.

The use of the term entropy in both physics and communication theory has raised the question of the relation of the two entropies. It can be shown in a simple case that the limitation imposed by thermal noise on the transmission of information results in the failure of a machine designed to convert the chaotic energy of heat into the organized energy of a lifted weight. Such a machine, if it succeeded, would violate the second law of thermodynamics. More generally, suppose we regard a source of information as to what state a system is in as a message source. The information-theory entropy of this source is a measure of the energy needed to transmit a message from the source in the presence of the thermal noise which is necessarily present in the system. The energy used in transmitting such a message is as great as the increase in free energy due to the reduction in physical entropy which the message brings about.

While various physicists have sought various uses for information theory in statistical mechanics, as far as I know they haven't come up with anything very useful or startling.
I wish they'd get around to the physical limitations imposed on information transmission by quantum effects, and perhaps when they do they will find some other unsolved problems highly pertinent to information theory.

CHAPTER XI

Cybernetics

SOME WORDS HAVE a heady quality; they conjure up strong feelings of awe, mystery, or romance. Exotic used to be Dorothy Lamour in a sarong. Just what it connotes currently I don't know, but I am sure that its meaning, foreign, is pale by comparison. Palimpsest makes me think of lost volumes of Solomon's secrets or of other invaluable arcane lore, though I know that the word means nothing more than a manuscript erased to make room for later writing.

Sometimes the spell of a word or expression is untainted by any clear and stable meaning, and through all the period of its currency its magic remains secure from commonplace interpretations. Too, élan vital and id are, I think, examples of this.

I don't believe that cybernetics is quite such a word, but it does have an elusive quality as well as a romantic aura. The subtitle of Wiener's book, Cybernetics, is Control and Communication in the Animal and the Machine. Wiener derived the word from the Greek for steersman. Since the publication of Wiener's book in 1948, cybernetics has gained a wide currency. Further, if there is cybernetics, then someone must practice it, and cyberneticist has been anonymously coined to designate such a person.

What is cybernetics?
If we are to judge from Wiener's book it includes at least information theory, with which we are now reasonably familiar; something that might be called smoothing, filtering, detection and prediction theory, which deals with finding the presence of and predicting the future value of signals, usually in the presence of noise; and negative feedback and servomechanism theory, which Wiener traces back to an early treatise on the governor (the device that keeps the speed of a steam engine constant) published by James Clerk Maxwell in 1868. We must, I think, also include another field which may be described as automata and complicated machines. This includes the design and programming of digital computers.

Finally, we must include any phenomena of life which resemble anything in this list or which embody similar processes. This brings to mind at once certain behavioral and regulatory functions of the body, but Wiener goes much further. In his second autobiographical volume, I Am a Mathematician, he says that sociology and anthropology are primarily sciences of communication and therefore fall under the general head of cybernetics, and he includes, as a special branch of sociology, economics as well.

One could doubt Wiener's sincerity in all this only with difficulty. He has a grand view of the importance of a statistical approach to the whole world of life and thought. For him, a current which stems directly from the work of Maxwell, Boltzmann, and Gibbs sweeps through his own to form a broad philosophical sea in which we find even the ethics of Kierkegaard.

The trouble is that each of the many fields that Wiener draws into cybernetics has a considerable scope in itself. It would take many thousands of words to explain the history, content, and prospects of any one of them. Lumped together, they constitute not so much an exciting country as a diverse universe of overwhelming magnitude and importance.
Thus, few men of science regard themselves as cyberneticists. Should you set out to ask, one after another, each person listed in American Men of Science what his field is, I think that few would reply cybernetics. If you persisted and asked, "Do you work in the field of cybernetics?" a man concerned with communication, or with complicated automatic machines such as computers, or with some parts of experimental psychology or neurophysiology would look at you and speculate on your background and intentions. If he decided that you were a sincere and innocent outsider, who would in any event never get more than a vague idea of his work, he might well reply, "yes."

So far, in this country the word cybernetics has been used most extensively in the press and in popular and semiliterary, if not semiliterate, magazines. I cannot compete with these in discussing the grander aspects of cybernetics. Perhaps Wiener has done that best himself in I Am a Mathematician. Even the more narrowly technical content of the fields ordinarily associated with the word cybernetics is so extensive that I certainly would never try to explain it all in one book, even a much larger book than this.

In this one chapter, however, I propose to try to give some small idea of the nature of the different technical matters which come to mind when cybernetics is mentioned. Such a brief résumé may perhaps help the reader in finding out whether or not he is interested in cybernetics and indicate to him what sort of information he should seek in order to learn more about it.

Let us start with the part of cybernetics that I have called smoothing, filtering, and prediction theory, which is an extremely important field in its own right. This is a highly mathematical subject, but I think that some important aspects of it can be made pretty clear by means of a practical example.
Suppose that we are faced with the problem of using radar data to point a gun so as to shoot down an airplane. The radar gives us a sequence of measurements of position each of which is a little in error. From these measurements we must deduce the course and the velocity of the airplane, so that we can predict its position at some time in the future, and by shooting a shell to that position, shoot the plane down. Suppose that the plane has a constant velocity and altitude. Then the radar data on its successive locations might be the crosses of Figure XI-1. We can by eye draw a line AB, which we would guess to represent the course of the plane pretty well. But how are we to tell a machine to do this? If we tell a computing machine, or "computer," to use just the last and next-to-last pieces of radar data, represented by the points L and NL, it can only draw a line through these points, the dashed line A'B'. This is clearly in error. In some way, the computer must use earlier data as well.

[Fig. XI-1]

The simplest way for the computer to use the data would be to give an equal weight to all points. If it did this and fitted a straight line to all the data taken together, it might get a result such as that shown in Figure XI-2. Clearly, the airplane turned at point T, and the straight line AB that the computer computed has little to do with the path of the plane. We can seek to remedy this by giving more importance to recent data than to older data. The simplest way to do this is by means of linear prediction. In making a linear prediction, the computer takes each piece of radar data (a number representing the distance north or the distance east from the radar, for instance) and multiplies it by another number. This other number depends on how recent the piece of data is; it will be a smaller number for an old piece of data than for a recent one.
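The weighting just described, together with the summing of the products, might be sketched as follows in Python. This is only an illustration under stated assumptions: the geometric fall-off of the weights, the `decay` parameter, and the function names are not taken from the text, which requires only that older data receive smaller multipliers than recent data. The weights are normalized to sum to 1, so the output is a recency-weighted estimate of one coordinate rather than a full extrapolation of the course.

```python
def recency_weights(n, decay=0.5):
    """Return n multipliers, oldest first (assumed geometric fall-off).

    Each piece of data gets `decay` times the weight of the next newer one,
    and the weights are scaled so that they sum to 1.
    """
    raw = [decay ** (n - 1 - k) for k in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def linear_predict(data, decay=0.5):
    """Multiply each datum by a weight that depends only on its age, then add."""
    weights = recency_weights(len(data), decay)
    return sum(w * x for w, x in zip(weights, data))


if __name__ == "__main__":
    # Hypothetical "distance north" readings, oldest first.
    readings = [10.0, 10.4, 11.1, 11.5, 12.2]
    print(linear_predict(readings))
```

A `decay` close to 1 uses old data heavily and responds slowly to a change; a small `decay` follows new data quickly but largely discards the old.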
The computer then adds up all the products it has obtained and so produces a predicted piece of data (for instance, the distance north or east of the radar at some future time).

[Fig. XI-2]

The result of such prediction might be as shown in Figure XI-3. Here a linear method has been used to estimate a new position and direction each time a new piece of radar data, represented by a cross, becomes available. Until another piece of data becomes available, the predicted path is taken as a straight line proceeding from the estimated location in the estimated direction. We see that it takes a long time for the computer to take into account the fact that the plane has turned at the point T, despite the fact that we are sure of this by the time we have looked at the point next after T. A linear prediction can make good use of old data, but, if it does this, it will be slow to respond to new data which is inconsistent with the old data, as the data obtained after an airplane turns will be. Or a linear prediction can be quick to take new data strongly into account, but in this case it will not use old data effectively, even when the old data is consistent with the new data.

[Fig. XI-3]

To predict well even when circumstances change (as when the airplane turns) we must use nonlinear prediction. Nonlinear prediction includes all methods of prediction in which we don't merely multiply each piece of data used by a number depending on how old the data is and then add the products. As a very simple example of nonlinear prediction, suppose that we have two different linear predictors, one of which takes into account the last 100 pieces of data received, and the other of which takes into account only the last ten pieces of data received. Suppose that we use each predictor to estimate the next piece of data which will be received. Suppose that we compare this next piece of data with the output of each predictor.
Suppose that we make use of predictions based on 100 past pieces of data only when, three times in a row, such predictions agree with each new piece of data better than predictions based on ten past pieces of data. Otherwise, we assume that the aircraft is maneuvering in such a way as to make long-past data useless, and we use predictions based on ten past pieces of data. This way of arriving at a final prediction is nonlinear because the prediction is not arrived at simply by multiplying each past piece of data by a number which depends only on how old the data is. Instead, the use we make of past data depends on the nature of the data received. More generally, there are endless varieties of nonlinear prediction. In fact, nonlinear prediction, and other nonlinear processes as well, are the overwhelming total of all very diverse means after the simplest category, linear prediction and other linear processes, have been excluded. A great deal is known about linear prediction, but very little is known about nonlinear prediction. This very special example of predicting the position of an airplane has been used merely to give a concrete sense of something which might well seem almost meaningless if it were stated in more abstract terms. We might, however, restate the broader problem, which has been introduced in a more general way. Let us imagine a number of possible signals. These signals might consist of things as diverse as the possible paths of airplanes or the possible different words that a man may utter. Let us also imagine some sort of noise or distortion. Perhaps the radar data is inexact, or perhaps the man speaks in a noisy room. We are required to estimate some aspect of the correct signal: the present or future position of the airplane, the word the man just spoke, or the word that he will speak next. In making this judgment we have some statistical knowledge of the signal.
This might concern what airplane paths are most likely, or how often turns are made, or how sharp they are. It might include what words are most common and how the likelihood of their occurrence depends on preceding words. Let us suppose that we also have similar statistics concerning noise and distortion. We see that we are considering exactly the sort of data that are used in communication theory. However, given a source of data and a noisy channel, the communication theorist asks how he can best encode messages from the source for transmission over the channel. In prediction, given a set of signals distorted by noise, we ask, how do we best detect the true signal or estimate or predict some aspect of it, such as its value at some future time? The armory of prediction consists of a general theory of linear prediction, worked out by Kolmogoroff and Wiener, and mathematical analyses of a number of special nonlinear predictors. I don't feel that I can proceed very profitably beyond this statement, but I can't resist giving an example of a theoretical result (due to David Slepian, a mathematician) which I find rather startling. Let us consider the case of a faint signal which may or may not be present in a strong noise. We want to determine whether or not the signal is present. The noise and the signal might be voltages or sound pressures. We assume that the noise and the signal have been combined simply by adding them together. Suppose further that the signal and the noise are ergodic (see Chapter III) and that they are band limited; that is, they contain no frequencies outside of a specified frequency range. Suppose further that we know exactly the frequency spectrum of the noise, that is, what fraction of the noise power falls in every small range of frequencies. Suppose that the frequency spectrum of the signal is different from this.
Slepian has shown that if we could measure the over-all voltage (or sound pressure) of the signal plus noise exactly for every instant in any interval of time, however short the interval is, we could infallibly tell whether or not the signal was present along with the noise, no matter how faint the signal might be. This is a sound theoretical, not a useful practical, conclusion. However, it has been a terrible shock to a lot of people who had stated quite positively that, if the signal was weak enough (and they stated just how weak), it could not be detected by examining the signal plus noise for any particular finite interval of time. Before leaving this general subject, I should explain why I described it in terms of filtering and smoothing as well as prediction and detection. If the noise mixed with a signal has a frequency spectrum different from that of the signal, we will help to separate the signal from the noise by using an electrical filter which cuts down on the frequencies which are strongly present in the noise with respect to the frequencies which are strongly present in the signal. If we use a filter which removes most or all high frequency components (which vary rapidly with time), the output will not vary so abruptly with time as the input; we will have smoothed the combination of signal and noise. So far, we have been talking about operations which we perform on a set of data in order to estimate a present or future signal or to detect a signal. This is, of course, for the purpose of doing something. We might, for instance, be flying an airplane in pursuit of an enemy plane. We might use a radar to see the enemy plane. Every time we take an observation, we might move the controls of the plane so as to head toward the enemy. A device which acts continually on the basis of information to attain a specified goal in the face of changes is called a servomechanism.
Here we have an important new element, for the radar data measures the position of the enemy plane with respect to our plane, and the radar data is used in determining how the position of our plane is to be changed. The radar data is fed back in such a way as to alter the nature of radar data which will be obtained later (because the data are used to alter the position of the plane from which new radar data are taken). The feedback is called negative feedback, because it is so used as to decrease rather than to increase any departure from a desired behavior. We can easily think of other examples of negative feedback. The governor of a steam engine measures the speed of the engine. This measured value is used in opening or closing the throttle so as to keep the speed at a predetermined value. Thus, the result of the measurement of speed is fed back so as to change the speed. The thermostat on the wall measures the temperature of the room and turns the furnace off or on so as to maintain the temperature at a constant value. When we walk carrying a tray of water, we may be tempted to watch the water in the tray and try to tilt the tray so as to keep the water from spilling. This is often disastrous. The more we tilt the tray to avoid spilling the water, the more wildly the water may slosh about. When we apply feedback so as to change a process on the basis of its observed state, the over-all situation may be unstable. That is, instead of reducing small deviations from the desired goal, the control we exert may make them larger. This is a particularly hazardous matter in feedback circuits. The thing we do to make corrections most complete and perfect is to make the feedback stronger. But this is the very thing that tends to make the system unstable. Of course, an unstable system is no good.
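How stronger feedback can turn correction into instability may be seen in a toy model, sketched here in Python. The model is an assumption made for illustration and does not come from the text: the deviation from the goal is corrected at each step in proportion to the deviation observed one step earlier, so that the correction always acts on slightly stale information. The names `simulate_feedback` and `late_peak` are hypothetical.

```python
def simulate_feedback(gain, steps=50, start=1.0):
    """Deviation from the goal under delayed proportional correction.

    Assumed toy model: x[t+1] = x[t] - gain * x[t-1].
    Returns the whole history of deviations.
    """
    history = [start, start]
    for _ in range(steps):
        previous, current = history[-2], history[-1]
        history.append(current - gain * previous)
    return history


def late_peak(history, window=10):
    """Largest absolute deviation over the last `window` steps."""
    return max(abs(x) for x in history[-window:])


if __name__ == "__main__":
    for gain in (0.3, 1.2):
        print(gain, late_peak(simulate_feedback(gain)))
```

In this model a weak correction (gain 0.3) lets the deviation die away, while a strong one (gain 1.2) produces oscillations that grow from step to step, much as with the tray of water.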
An unstable system can result in such behavior as an airplane or missile veering wildly instead of following the target, the temperature of a room rising and falling rapidly, an engine racing or coming to a stop, or an amplifier producing a singing output of high amplitude when there is no input. The stability of negative-feedback systems has been studied extensively, and a great deal is known about linear negative-feedback systems, in which the present amplitude is the sum of past amplitudes multiplied by numbers depending only on remoteness from the present. Linear negative-feedback systems are either stable or unstable, regardless of the input signal applied. Nonlinear feedback systems can be stable for some inputs but unstable for others. A shimmying car is an example of a nonlinear system. It can be perfectly stable at a given speed on a smooth road, and yet a single bump can start a shimmy which will persist indefinitely after the bump has been passed. Oddly enough, most of the early theoretical work on negative-feedback systems was done in connection with a device which has not yet been described. This is the negative feedback amplifier, which was invented by Harold Black in 1927. The gain of an amplifier is the ratio of the output voltage to the input voltage. In telephony and other electronic arts, it is important to have amplifiers which have a very nearly constant gain. However, vacuum tubes and transistors are imperfect devices. Their gain changes with time, and the gain can depend on the strength of the signal. The negative feedback amplifier reduces the effect of such changes in the gain of vacuum tubes or transistors. We can see very easily why this is so by examining Figure XI-4. At the top we have an ordinary amplifier with a gain of ten times. If we put in 1 volt, as shown by the number to the left, we get out 10 volts, as shown by the number to the right.
Suppose the gain of the amplifier is halved, so that the gain is only five times, as shown next to the top. The output also falls to one half, or 5 volts, in just the same ratio as the gain fell. The third drawing from the top shows a negative feedback amplifier designed to give a gain of ten times. The upper box has a high gain of one hundred times. The output of this box is connected to a very accurate voltage-dividing box, which contains no tubes or transistors and does not change with time or signal level. The input to the upper box consists of the input voltage of 1 volt less the output of the lower box, which is 0.09 times the output voltage of 10 volts; this is, of course, 0.9 volt. Now, suppose the tubes or transistors in the upper box change so that they give a gain of only fifty times instead of one hundred times; this is shown at the bottom of Figure XI-4. The numbers given in the figure are only approximate, but we see that when the gain of the upper box is cut in half the output voltage falls only about 10 per cent. If we had used a higher gain in the upper box the effect would have been even less.

[Fig. XI-4: four amplifier diagrams, top to bottom. The third shows 1 volt in, 0.1 volt into the ×100 box, 10 volts out, and 0.9 volt returned through the ×0.09 box. The bottom shows 0.1819 volt into a ×50 box, 9.09 volts out, and 0.8181 volt returned through the ×0.09 box.]

The importance of negative feedback can scarcely be overestimated. Negative feedback amplifiers are essential in telephonic communication. The thermostat in your home is an example of negative feedback. Negative feedback is used to control chemical processing plants and to guide missiles toward airplanes. The automatic pilot of an aircraft uses negative feedback in keeping the plane on course. In a somewhat broader sense, I use negative feedback from eye to hand in guiding my pen across the paper, and negative feedback from ear to tongue and lips in learning to speak or in imitating the voice of another. The animal organism makes use of negative feedback in many other ways.
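Returning for a moment to the amplifier of Figure XI-4, the figures quoted for it can be checked with a little arithmetic. If the upper box multiplies its input by A and the lower box returns a fraction β of the output, then output = A × (input − β × output), which gives output = A × input / (1 + Aβ). This relation is the usual one for an idealized feedback loop; it is not written out in the text, and the function and parameter names in the Python sketch below are illustrative only.

```python
def closed_loop_output(v_in, forward_gain, feedback_fraction):
    """Output voltage of an idealized negative feedback amplifier.

    Solves v_out = forward_gain * (v_in - feedback_fraction * v_out).
    """
    return forward_gain * v_in / (1.0 + forward_gain * feedback_fraction)


if __name__ == "__main__":
    full = closed_loop_output(1.0, 100.0, 0.09)    # upper box at full gain
    halved = closed_loop_output(1.0, 50.0, 0.09)   # upper box gain cut in half
    print(full, halved)
    print("fractional drop in output:", (full - halved) / full)
```

With a gain of one hundred in the upper box the output is 10 volts; with fifty it is a little over 9 volts, a fall of roughly one tenth, whereas the amplifier without feedback lost half its output.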
It is by negative feedback that the organism maintains its temperature despite changes in outside temperature, and that it maintains constant chemical properties of the blood and tissues. The ability of the body to maintain a narrow range of conditions despite environmental changes has been called homeostasis. W. Ross Ashby, one of the few self-acknowledged cyberneticists, built a machine called a homeostat to demonstrate features of adjustment to environment which he believes to be characteristic of life. The homeostat is provided with a variety of feedback circuits and with two means for changing them. One is under the control of the homeostat; the other is under the control of a person who acts as the machine's "environment." If the machine's circuits are so altered by changes of its "environment" as to make it unstable, it readjusts its circuits by trial and error so as to attain stability again. We may if we wish liken this behavior of the homeostat to that of a child who first learns to walk upright without falling and then learns to ride a bicycle without falling, or to many other adjustments we make in life. In his book Cybernetics, Wiener puts great emphasis on negative feedback as an element of nervous control and on its failure as an explanation of disabilities, such as tremors of the hand, which are ascribed to failures of a negative feedback system of the body. We have so far discussed three constituents of cybernetics: information theory, detection and prediction, including smoothing and filtering, and negative feedback, including servomechanisms and negative feedback amplifiers. We usually also associate electronic computers and similar complex devices with cybernetics. The word automata is sometimes used to refer to such complicated machines.
One can find many precursors of today's complicated machines in the computers, automata, and other mechanisms of earlier centuries, but one would add little to his understanding of today's complex devices by studying these precursors. Human beings learn by doing and by thinking about what they have done. The opportunities for doing in the field of complicated machines have been enhanced immeasurably beyond those of previous centuries, and the stimulus to thought has been wonderful to behold. Recent advances in complicated machines might well be traced to the invention of automatic telephone switching late in the last century. Early telephone switching systems were of a primitive, step-by-step form, in which a mechanism set up a new section of a link in a talking path as each digit was dialed. From this, switching systems have advanced to become common-control systems. In a common-control switching system, the dialed number does not operate switches directly. It is first stored, or represented electrically or mechanically, in a particular portion of the switching system. Electrical apparatus in another portion of the switching system then examines different electrical circuits that could be used to connect the calling party to the number called, until it finds one that is not in use. This free circuit is then used to connect the calling party to the called party. Modern telephone switching systems are of bewildering complexity and overwhelming size. Linked together to form a nationwide telephone network which allows dialing calls clear across the country, they are by far the most complicated construction of man. It would take many words to explain how they perform even a few of their functions. Today, a few pulls of a telephone dial will cause telephone equipment to seek out the most economical available path to a distant telephone, detouring from city to city if direct paths are not available.
The equipment will establish a connection, ring the party, time the call, and record the charge in suitable units, and it will disconnect the circuits when a party hangs up. It will also report malfunctioning of its parts to a central location, and it continues to operate despite the failure of a number of devices. One important component of telephone switching systems is the electric relay. The principal elements of a relay are an electromagnet, a magnetic bar to which various movable contacts are attached, and fixed contacts which the movable contacts can touch, thus closing circuits. When an electric current is passed through the coil of the electromagnet of the relay, the magnetic bar is attracted and moves. Some moving contacts move away from the corresponding fixed contacts, opening circuits; other moving contacts are brought into contact with the corresponding fixed contacts, closing circuits. In the thirties, G. R. Stibitz of the Bell Laboratories applied the relays and other components of the telephone art to build a complex calculator, which could add, subtract, multiply, and divide complex numbers. During World War II, a number of more complicated relay computers were built for military purposes by the Bell Laboratories, while, in 1941, Howard Aiken and his co-workers built their first relay computer at Harvard. An essential step in increasing the speed of computers was taken shortly after the war when J. P. Eckert and J. W. Mauchly built the Eniac, a vacuum tube computer, and more recently transistors have been used in place of vacuum tubes. Thus, it was an essential part of progress in the field of complex machines that it became possible to build them and that they were built, first by using relays and then by using vacuum tubes and transistors.
The building of such complex devices, of course, involved more than the existence of the elements themselves; it involved their interconnection to do particular functions such as multiplication and division. Stibitz's and Shannon's application of Boolean algebra, a branch of mathematical logic, to the description and design of relay circuits has been exceedingly important in this connection. Thus, the existence of suitable components and the art of interconnecting them to carry out particular functions provided, so to speak, the body of the complicated machine. The organization, the spirit, of the machine is equally essential, though it would scarcely have evolved in the absence of the body. Stibitz's complex calculator was almost spiritless. The operator sent it pairs of complex numbers by teletype, and it cogitated and sent back the sum, difference, product, or quotient. By 1943, however, he had made a relay computer which received its instructions in sequence by means of a long paper tape, or program, which prescribed the numbers to be used and the sequences of operations to be performed. A step forward was taken when it was made possible for the machine to refer back to an earlier part of the program tape on completing a part of its over-all task or to use subsidiary tapes to help it in its computations. In this case the computer had to make a decision that it had reached a certain point and then act on the basis of the decision. Suppose, for instance, that the computer was computing the value of the following series by adding up term after term: [series missing from source]. We might program the computer so that it would continue adding terms until it encountered a term which was less than 1/1,000,000 and then print out the result and go on to some other calculation. The computer could decide what to do next by subtracting the latest term computed from 1/1,000,000.
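The decision by subtraction can be sketched in Python as follows. The series itself does not survive in this copy of the text, so the geometric series 1 + 1/2 + 1/4 + ... used in the example is purely a stand-in, and the names `sum_until_small` and `max_terms` are hypothetical. The sketch also assumes that each term is added to the running sum before it is tested, so that the first term smaller than 1/1,000,000 is included in the result.

```python
def sum_until_small(term, threshold=1.0 / 1_000_000, max_terms=10_000):
    """Add terms term(0), term(1), ... until one falls below the threshold.

    The stopping decision is made from the sign of (threshold - latest term):
    a negative difference means "compute another term", otherwise stop.
    Returns the sum and the number of terms added.
    """
    total = 0.0
    for n in range(max_terms):
        latest = term(n)
        total += latest
        difference = threshold - latest
        if difference < 0:
            continue            # term still too large: go on adding
        return total, n + 1     # difference not negative: print and move on
    raise RuntimeError("no term fell below the threshold")


if __name__ == "__main__":
    # Stand-in series (an assumption): 1 + 1/2 + 1/4 + ...
    total, count = sum_until_small(lambda n: 0.5 ** n)
    print(total, count)
```

The point of the sketch is only that a single subtraction and a test of sign are enough to let the machine choose between repeating a part of its program and moving on.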
If the answer to this subtraction was negative, it would compute another term and add it to the rest; if the answer was positive, it could print out the sum arrived at and refer to the program for further instructions. The next big step in the history of computers is usually attributed to John von Neumann, who made extensive use of early computers in carrying out calculations concerning atomic bombs. Even early computers had memories, or stores, in which the numbers used in intermediate steps of a computation were retained for further processing and in which answers were stored prior to printing them out. Von Neumann's idea was to put the instructions, or program, of the machine, not on a separate paper tape, but right into the machine's memory. This made the instructions easily and flexibly available to the machine and made it possible for the machine to modify parts of its instructions in accordance with the results of its computations. In desk calculating machines, decimal digits are stored by wheels which can assume any of ten distinct positions of rotation. Today's complex computing machines store binary numbers in their memories. Each digit of a binary number is represented by magnetizing a little magnetic ring, or core, in one direction or the other. The computer's memory is made up of groups of cores. Each group can store all the digits of a multidigit number, and all the digits of the number are read into or out of the cores of such a group simultaneously in a few millionths of a second. A particular binary number called an address is assigned to each such group of cores; by means of this, the group is designated and called into use. The word address is used to refer to such a group of cores. Today's large-scale computers can store hundreds of thousands of binary digits in magnetic cores and can store even more digits as + or − pulses, recorded on magnetic tapes or drums.
Besides memory, computers have various special parts, such as arithmetic units, which can add or multiply. When some such operation is to be performed on two numbers, they are first transferred from the memory addresses where they are stored into a register, a temporary storage space. The operation is then performed, and the result transferred to an appropriate address in the memory. The user of the computer prepares a program in terms of a hundred or more different commands. By using a sequence of such commands, the programmer can make the machine do literally anything, provided only that the programmer knows clearly what he wants done. That is, he must specify all the steps needed to accomplish the end. Also, of course, the task must be one which the machine can do in an acceptable length of time. Table XV shows a set of commands used to make a hypothetical computer add up a number of pairs of numbers and store the sums so obtained.

TABLE XV

Address   Instruction          Commentary
1         START                Start register is address modifying register. This instruction sets it to 0.
2         CLEAN                Sets add register to 0.
3         UPPIT 10 + START     Puts first number in add register.
4         UPPIT 13 + START     Adds second number to add register.
5         DELVR 16 + START     Stores result.
6         UPSRT 1              Increases start register by 1.
7         IFSRT 9, 4           Tests start register. If <4, goes on. If >4, goes to 9.
8         [illegible]          Transfers back to instruction 2.
9         STOPP                Stops.
10-15     DATEH 6              Reserves block of 6 locations for data.
16-18     ANSIR 3              Reserves block of 3 locations for answers.

In mathematical terms, this program forms the sums $c_i = a_i + b_i$, $i = 1, 2, 3$, where $a_i$ is located in address $9 + i$, $b_i$ is located in $12 + i$, and $c_i$ is stored in $15 + i$. The program starts at address 1 and comes to rest at address 9. We have noted that a skilled programmer can program a computer to do anything, provided that he knows clearly what he wants done.
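The effect of the program of Table XV on the memory can be imitated in a few lines of Python. This is a sketch of what the program accomplishes, not an interpreter for the hypothetical machine's commands: the memory is modeled as a dictionary keyed by address, the loop simply runs over the three pairs, and the function name and the sample data are assumptions.

```python
def run_pair_sums(memory, pairs=3):
    """Form c_i = a_i + b_i for i = 1..pairs, using the table's address layout.

    a_i is read from address 9 + i, b_i from 12 + i, and c_i is stored
    in 15 + i. `memory` maps addresses to numbers and is updated in place.
    """
    for i in range(1, pairs + 1):
        add_register = 0                  # clear the add register
        add_register += memory[9 + i]     # put the first number in it
        add_register += memory[12 + i]    # add the second number
        memory[15 + i] = add_register     # store the result
    return memory


if __name__ == "__main__":
    # Hypothetical data in addresses 10-15; answers appear in 16-18.
    memory = {10: 1, 11: 2, 12: 3, 13: 10, 14: 20, 15: 30}
    run_pair_sums(memory)
    print([memory[address] for address in (16, 17, 18)])
```

The index i plays the part of the start register: the same three instructions serve every pair because the addresses they refer to are modified on each pass.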
Suppose that one has an explicit statement of some mathematical task in terms of certain standard words or equations. Suppose that this statement really tells completely what is to be done. A programmer can write a program, called a compiler, which will cause the computer to examine the statement and then write a program which will make the computer do the task in question. When the program which the compiler causes the computer to write is fed to the computer, the computer will carry out the required task. Writing programs is a lengthy and uncongenial task. An engineer or scientist who has a suitable compiler available can specify what he desires to be done compactly in terms of a sequence of allowed words and equations. By means of the compiler, he can make the computer translate his statement of the problem into the long, detailed, and obscure (to a human being) sequence of instructions which will cause the computer to make the calculations called for. The best-known compiler is Fortran, which is used to convert instructions written in a symbolism closely resembling standard mathematical notation into computer programs. The Blodi compiler converts a description of a circuit diagram into a program which causes the computer to imitate the operation of the circuit described. The Janet compiler converts specifications of the notes of a musical composition in terms of the pitch, duration, and quality of each note into a program which causes the computer to generate a magnetic tape which, when played, produces the sounds specified. Compilers are very useful to programmers in making computers carry out a wide variety of complicated tasks. The binary digits stored in the memory of a computer can be used to specify numbers, but they can also specify or encode words, musical notes, or logical operations.
Thus, besides their use in performing complicated mathematical calculations, computers have been used to make a concordance of the Revised Standard Bible, to simulate the operation of a telephone switching system, to recognize spoken digits from 0 to 9, to play checkers and to learn to improve their game, to play chess, to prove theorems in geometry and symbolic logic, to create unusual musical sounds, and to compose music according to the rules of first species counterpoint. We can get some faint idea how such tasks can be performed. Binary numbers can be assigned to letters in such a way that an arrangement of words in alphabetical order corresponds to an arrangement of numbers in increasing order. This makes it possible to sort and arrange the numbers representing words so as to arrange the words in alphabetical order. The differences between numbers assigned to notes of the scale can be made to correspond to the musical intervals between the notes, so that allowing or forbidding certain musical intervals becomes equivalent to allowing or forbidding certain numerical differences. However, we should not delude ourselves into believing that complicated uses of computers can be explained in a few words. A talented person with a master's degree in mathematics can attain a fair understanding of programming after a few years' training and experience. An exceptionally talented person can program a computer to do really new and difficult things. While in principle a computer can be programmed to do anything which the programmer understands in detail, programmers don't really understand some tasks they would like to assign to computers.
Thus, things which a computer has not done so far include recognizing spoken digits as accurately as a human being does, satisfactorily translating from one language into another, playing checkers or chess as well or as fast as an expert, identifying an important or interesting theorem, and composing very interesting music. The use of computers toward such ends has, however, greatly stimulated human thought concerning the nature of the recognition of human words, the structure of various languages, the strategy of winning at games, and the structure of music. When new knowledge so arrived at is put to use in programming the larger and faster computers of the future, it is hard to foresee what their limitations may be. Further, the programming of computers to solve complicated and unusual problems has given us a new and objective criterion of understanding. Today, if a man says that he understands how a human being behaves in a given situation or how to solve a certain mathematical or logical problem, it is fair to insist that he demonstrate his understanding by programming a computer to imitate the behavior or to accomplish the task in question. If he is unable to do this, his understanding is certainly incomplete, and it may be completely illusory. Will computers be able to think? This is a meaningless question unless we say what we mean by to think. Marvin Minsky, a freewheeling mathematician who is much interested in computers and complex machines, proposed the following fable. A man beats everyone else at chess. People say, "How clever, how intelligent, what a marvelous mind he has, what a superb thinker he is." The man is asked, "How do you play so that you beat everyone?" He says, "I have a set of rules which I use in arriving at my next move." People are indignant and say, "Why that isn't thinking at all; it's just mechanical."
Minsky's conclusion is that people tend to regard as thinking only such things as they don't understand. I will go even further and say that people frequently regard as thinking almost any grammatical jumbling together of "important" words. At times I'd settle for a useful, problem-solving type of "thinking," even if it was mechanical. In any event, it seems likely that philosophers and humanists will manage to keep the definition of thinking perpetually applicable to human beings and a step ahead of anything a machine ever manages to do. If this makes them happy, it doesn't offend me at all. I do think, however, that it is probably impossible to specify a meaningful and explicitly defined goal which a man can attain and a computer cannot, even including the "imitation game" proposed by A. M. Turing, a British logician, in 1936.

In this game a man is in communication, say by teletype, with either a computer or a man, he doesn't know which. The man tries by means of questions to discover whether he is in touch with a man or a machine; the computer is programmed to deceive the man. Certainly, however, a computer programmed to play the imitation game with any chance of success is far beyond today's computers and today's art of programming, and it belongs to a very distant future, if to any.

We have seen that cybernetics is a very broad field indeed. It includes communication theory, to which we are devoting a whole book. It includes the complicated field of smoothing and prediction, which is so important in radar and in many other military applications. When we try to estimate the true position or the future position of an airplane on the basis of imperfect radar data, we are, according to Wiener, dealing with cybernetics. Even in using an electrical filter to separate noise of one frequency from signals of another frequency, we are invoking cybernetics.
It is in this general field that the contribution of Wiener himself has been greatest, and he has worked out a general theory of prediction by means of linear devices, which make a prediction merely by multiplying each piece of data by a number which is smaller the older the data is and adding the products.

Another part of cybernetics is negative feedback. A thermostat makes use of negative feedback when it measures the temperature of a room and starts or stops the furnace in order to make the temperature conform to a specified value. The autopilots of airplanes use negative feedback in manipulating the controls in order to keep the compass and altimeter readings at assigned values. Human beings use negative feedback in controlling the motions of their hands to achieve certain ends.

Negative feedback devices can be unstable; the effect of the output can sometimes be to make the behavior diverge widely from the desired goal. Wiener attributes tremors and some other malfunctioning of the human being to improper functioning of negative feedback mechanisms.

Negative feedback can also be used in order to make the large output signal of an amplifier conform closely in shape to the small input. Negative feedback amplifiers were extremely important in communication systems long before the day of cybernetics.

Finally, cybernetics has laid claim to the whole field of automata or complex machines, including telephone switching systems, which have been in existence for many years, and electronic computers, which have been with us only since World War II.

If all this is so, cybernetics includes most of the essence of modern technology, excluding the brute production and use of power. It includes our knowledge of the organization and function of man as well. Cybernetics almost becomes another word for all of the most intriguing problems of the world. As we have seen, Wiener includes sociological, philosophical, and ethical problems among these.
Thus, even if a man acknowledged being a cyberneticist, that wouldn't give us much of a clue concerning his field of competence, unless he was a universal genius. Certainly, it would not necessarily indicate that he had much knowledge of information theory.

Happily, as I have noted, few scientists would acknowledge themselves as cyberneticists, save perhaps in talking to those whom they regard as hopelessly uninformed. Thus, if cybernetics is overextensive or vague, the overextension or vagueness will do no real harm. Indeed, cybernetics is a very useful word, for it can help to add a little glamor to a person, to a subject, or even to a book. I certainly hope that its presence here will add a little glamor to this one.

CHAPTER XII

Information Theory and Psychology

I HAVE READ a good deal more about information theory and psychology than I can or care to remember. Much of it was a mere association of new terms with old and vague ideas. Presumably the hope was that a stirring in of new terms would clarify the old ideas by a sort of sympathetic magic.

Some attempted applications of information theory in the field of experimental psychology have, however, been at least reasonably well informed. They have led to experiments which produced valid data. It is hard to draw any conclusions from these data that are both sweeping and certain, but the data do form a basis or at least an excuse for interesting speculations.

In this chapter, I propose to discuss some experiments concerning information theory and psychology which are at least down-to-earth enough to grapple with. Naturally I have chosen these largely on the basis of my personal interest and background, but one has to impose some limitations in order to say anything coherent about a broad and less than pellucid field.
It seems to me that an early reaction of psychologists to information theory was that, as entropy is a wonderful and universal measure of amount of information and as human beings make use of information, in some way the difficulty of a task, perhaps the time a man takes to accomplish a set task, must be proportional to the amount of information involved.

This idea is very clearly illustrated in some experiments reported by Ray Hyman, an experimental psychologist, in the Journal of Experimental Psychology in 1953. Here I shall describe only one of several of the experiments Hyman made.

A number of lights were placed before a subject, as psychologists call an experimentee or laboratory human animal. Each light was labeled with a monosyllabic "name" with which the subject became familiar. After a warning signal, one of the several lights flashed, and the subject thereafter spoke the name of the light as soon as he could. The time interval between the flashing of the light and the speaking of the name was measured.

Sometimes one out of eight lights flashed, the light being chosen at random with equal probabilities. In this case, the information conveyed in enabling the subject to identify the light correctly was log 8, or 3 bits. Sometimes one among 7 lights flashed (2.81 bits), sometimes one among 6 (2.58 bits), sometimes one out of 5 (2.32 bits), one out of 4 (2.00 bits), one out of 3 (1.58 bits), one out of 2 (1 bit), or one out of 1 (0 bits). The average response time, or latency, between the lighting of the light and the speaking of its name was plotted against number of bits, as shown in Figure XII-1.

Clearly, there is a certain latency, or response time, even when only one light is used, the choice among lights is certain, and the information conveyed as to which light is lighted is zero. When more lights are used, the increase in latency is proportional to the information conveyed.
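The bit counts quoted for Hyman's lights can be reproduced with a short Python sketch. The function names are illustrative and not from the text. The straight-line latency model simply restates the proportionality described above; its intercept and slope are left as parameters because no particular fitted values are assumed here.

```python
import math

def bits_for_equiprobable(n_alternatives):
    """Information, in bits, conveyed by one choice among n equally likely lights."""
    return math.log2(n_alternatives)

def linear_latency(bits, base_latency, seconds_per_bit):
    """Hypothetical straight-line model: a fixed latency plus an increment per bit.

    Both parameters are supplied by the caller; they are assumptions of this
    sketch, not measured constants.
    """
    return base_latency + seconds_per_bit * bits

if __name__ == "__main__":
    for n in range(8, 0, -1):
        print(f"one out of {n}: {bits_for_equiprobable(n):.2f} bits")
```

Running the loop prints the same sequence of values as the paragraph above, from 3 bits for eight lights down to 0 bits for a single light.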
Such an increase of latency with the logarithm of the number of alternatives had in fact been noted by a German psychologist, J. Merkel, in 1885. It is certainly a strikingly accurate, reproducible, and significant aspect of human response.

We note from Figure XII-1 that the increase in latency is about 0.15 second per bit. Some unwary psychologists have jumped to the conclusion that it takes 0.15 second for a human being to respond to 1 bit of information; therefore, the information capacity of a human being is about 1/0.15, or about 7 bits per second. Have we discovered a universal constant of human perception or of human thought?

[Fig. XII-1. Experiment A. Horizontal axis: bits per stimulus presentation.]

Clearly, in Hyman's experiment the increase in latency is proportional to the uncertainty of the stimulus measured in bits. However, various experiments by various experimenters give somewhat different rates of increase in seconds per bit. Moreover, data published by G. H. Mowbray and M. V. Rhoades in 1959 show that, after much practice, a subject's performance tends to change so that there is little or no effect of information content on latency. It appears that human beings may have different ways of handling information, a way used in learning, in which number of alternatives is very important, and a way used after much learning, in which number of alternatives, up to a fairly large number, makes little difference. Further, in one sort of experiment, in which a subject depresses one or more keys on which his fingers rest in response to a vibration of the key, it appears that there may be little increase in latency with amount of information right from the start.
Moreover, even if the latency were a constant plus an increment proportional to information content, one could not reasonably assert that this showed that a significant information rate can be obtained by dividing the increased time by the number of bits. We will see that this can lead to fantastic information rates in the sort of experiment which I shall describe next.

H. Quastler made early information-rate experiments in which subjects played random sequences of notes or chords or read lists of randomly chosen words as rapidly as possible, and J. C. R. Licklider did early work on both reading and pointing speed. Before we heard of this work, J. E. Karlin and I embarked on an extensive series of experiments on reading lists of words, which of all experiments gives the highest observed information rate, a rate which is much higher than, for instance, sending Morse code or typing.

Suppose the "sender" of the message chooses an alphabet of, say, 16 words and makes up a list by choosing words among these randomly and with equal probabilities. Then, the amount of choice in designating each word is log 16, or 4 bits. The subject "transmits" the information, translating it into a new form, speech rather than print, by reading the list aloud. If he can read at a rate of 4 words a second, for instance, he transmits information at a rate of 4 × 4, or 16 bits per second.

Figure XII-2 shows data from three subjects. The words were chosen from the 500 most common words in English. We see that while the reading rate drops somewhat in going from 2 to 4 word vocabularies (or from 1 to 2 bits per word), it is almost constant for vocabularies or alphabets containing from 4 to 256 words (from 2 to 8 bits per word).

Let us now remember the alleged means for getting an information rate from such data as Hyman's, that is, noting the increase in time with increase in bits per stimulus. Consider the dotted average data curve of Figure XII-2.
In going from 2 bits per stimulus to 8 bits per stimulus the reading rate doesn't decrease at all; that is, the change in reading time per word is 0, despite an increase of 6 in the number of bits per word. If we divide 6 by 0, we get an information rate of infinity! Of course, this is ridiculous, but it is scarcely more ridiculous than deducing an information rate from such data as Hyman's by dividing increase in number of bits by increase in latency.

Directly from Figure XII-2, we can see that as reader A reads 8-bit words at a rate of 3.8 per second, he manages to transmit information at a rate of 8 × 3.8, or about 30 bits per second. Moreover, when the words put in the list are chosen randomly from a 5,000 word dictionary (12.3 bits per word), he manages to read them at a rate of 2.7 per second, giving a higher information rate of 33 bits per second.

It is clear that no unique information rate can be used to describe the performance of a human being. He can transmit (and, we shall see, respond to or remember) information better under some circumstances than under others. We can best consider him as an information-handling channel or device having certain built-in limitations and properties. He is a very flexible device; he can handle information quite well in a variety of forms, but he handles it best if it is properly encoded, properly adjusted to his capabilities.

What are his capabilities? We see from Figure XII-2 that he is slowed down only a little by increasing complexity. He can read a list of words chosen randomly from an alphabet of 256 about as fast as words chosen from an alphabet of 4. He isn't very speedy compared with machines, and in order to make him perform well we must give him a complex task. This is just what we might have expected.
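The arithmetic behind these reading rates is simple enough to sketch in Python. The function names below are illustrative, and the second function exists only to show how the criticized "increase in bits divided by increase in time" calculation misbehaves when the time does not change.

```python
import math

def information_rate(vocabulary_size, words_per_second):
    """Bits per second when words are drawn at random, with equal
    probabilities, from a vocabulary of the given size."""
    bits_per_word = math.log2(vocabulary_size)
    return bits_per_word * words_per_second

def naive_rate_from_increments(increase_in_bits, increase_in_seconds):
    """The questionable calculation: extra bits divided by extra time.

    Returns infinity when the time does not increase at all, which is the
    absurd result described above.
    """
    if increase_in_seconds == 0:
        return math.inf
    return increase_in_bits / increase_in_seconds

if __name__ == "__main__":
    print(information_rate(16, 4.0))     # the 16-word example
    print(information_rate(256, 3.8))    # reader A, 8-bit words
    print(information_rate(5000, 2.7))   # reader A, 5,000 word dictionary
    print(naive_rate_from_increments(6, 0))
```

The first three calls correspond to the 16, roughly 30, and roughly 33 bits per second quoted in the text; the last call shows the "infinite" rate obtained by the naive division.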
Complexity eventually does slow him down, however, as we see from the points for an alphabet consisting of all the words in a 5,000 word dictionary. Perhaps there is an optimum alphabet or vocabulary, which has quite a number of bits per word, but not so many words as to slow a man down unduly.

Partly to help in finding such a vocabulary, Karlin and I measured reading rate as a function of both number of syllables and "familiarity," that is, whether the word came from the first thousand in order of commonness of occurrence or familiarity, from the tenth thousand, or from the nineteenth thousand. The results are shown in Figure XII-3.

We see that while an increase in number of syllables slows down reading speed, a decrease in familiarity has just as pronounced an effect. Thus, a vocabulary of familiar one syllable words would seem to be a good choice. Using the 2,500 most common monosyllables (2,500 words means 11.3 bits per word) as a "preferred vocabulary," a reader attained a reading speed of 3.7 words per second, giving an information rate of 42 bits per second.

[Fig. XII-3. (a) Reader A. Reading rate against syllables per word, for familiarity ranks 1-1,000, 10,000-11,000, and 19,000-20,000.]

"Scrambled prose," that is, words chosen with the same probabilities as in nontechnical prose but picked at random without grammatical connection, also gave a high information rate. The entropy is about 11.8 bits per word, the highest reading rate was 3.7 words per second, and the corresponding information rate is 44 bits per second.

Perhaps one could gain a little by improving the alphabet, but I don't think one would gain much. At any rate, these experiments gave the highest information rate which has been demonstrated. It is a rate slow by the standards of electrical communication, but it does represent a tremendous number of binary choices, around 2,500 a minute!

What, we may ask, limits the rate?
Is it reading through each word letter by letter? In this case the Chinese, who have a single sign for each word, might be better off. But Chinese who read both English and Chinese with facility read randomized lists of common Chinese characters and randomized lists of the equivalent English words at almost exactly the same speed.

Is the limitation a mechanical one? Figure XII-4 shows rates for several tasks. A man can repeat a memorized phrase over twice as fast as he can read randomized words from the preferred list, and he can read prose appreciably faster. It appears that the limitation on reading rate is mental rather than mechanical.

[Fig. XII-4. Rates for readers A, B, and C on four tasks: repetitive phrase, prose, scrambled prose, and preferred list.]

So far, it appears that we cannot characterize a human being by means of a particular information rate. While the difficulty of a task ultimately increases with its information content, the difficulty depends markedly on how well the task is tailored to human abilities. The human being is very flexible in ability, but he has to strain and slow down to do unusual things. And he is quite good at complexity but only fair at speed.

One way of tailoring a task to human abilities is by deliberate, thoughtful experiments. This is analogous to the process of so encoding messages from a message source as to attain the highest possible rate of information transmission over a noisy channel. This was discussed in Chapter VIII, and the highest attainable rate was called the channel capacity. The "preferred list" of the 2,500 most frequently used monosyllables was devised in a deliberate effort to attain a high information rate in reading aloud randomized lists of words.

We may note, however, that choosing words at random with the probabilities of their occurrence in English text gives as high or a little higher information rate.
Have the words of the English language and their frequencies of occurrence been in some way fitted to human abilities by a long process of unconscious experiment and evolution?

We have seen in Chapter V that the probability of occurrence of a word in English text is very nearly inversely proportional to its rank. That is, the hundredth most common word occurs about a hundredth as frequently as the most common word, and so on. Figure V-2 illustrates this relation, which was first pointed out by George Kingsley Zipf, who ascribed it to a principle of least effort.

Clearly, Zipf's law cannot be entirely correct in this simple form. We saw in Chapter V that word probabilities cannot be inversely proportional to the rank of the word for all words; if they were, the sum of the probabilities of all words would be greater than unity. There have been various attempts to modify, derive, and explain Zipf's law, and we will discuss these somewhat later. However, we will at first regard Zipf's law in its original and simplest form as an approximate description of an aspect of human behavior in generating language, a description which Zipf arrived at empirically by examining the statistics of actual text.

Zipf, as we have noted, associated his law with a principle of least effort. Attempts have been made to identify the effort or "cost" of producing text with the number of letters in text. However, most linguists regard language primarily as the spoken language, and it seems unlikely that speaking, reading, or writing habits are dictated primarily by the numbers of letters used in words. In fact, we noted in the information-rate experiments which we just considered that reading rates are about the same for common Chinese ideographs and for the equivalent words in English written out alphabetically. Further, we have noted from Figure XII-3 that commonness or familiarity has an influence on reading time as great as does number of syllables.
Could we not, perhaps, take reading time as a measure of effort? We might think, for instance, that common words are more easily accessible to us, that they can be recognized or called forth with less effort or cost than uncommon words. Perhaps the human brain is so organized that a few words can be stored in it in such a fashion that they can be recognized and called forth easily and that many more can be stored in a fashion which makes their use less easy. We might believe that reading time is a measure of accessibility, ease of use, of cost.

We might imagine, further, that in using language, human beings choose words in such a way as to transmit as much information as possible for a given cost. If we identify cost with time of utterance, we would then say that human beings choose words in such a way as to convey as much information as possible in a given time of speaking or in a given time of reading aloud.

It is an easy mathematical task to show that if a speaking time $t_r$ is associated with the rth word in order of commonness, then for a message composed of randomly chosen words the information rate will be greatest if the rth word is chosen with a probability p(r) given by

$$p(r) = 2^{-c t_r} \tag{12.1}$$

Here c is a constant chosen to make the sum of the probabilities for all words add up to unity. This mathematical relation says that words with a long reading time will be used less frequently than words with a short reading time, and it gives the exact relation which must hold if the information rate is to be maximized.

Now, if Zipf's law holds, the probability of occurrence of the rth word in order of commonness must be given by

$$p(r) = \frac{A}{r} \tag{12.2}$$

Here A is another constant. Thus, from 12.1 and 12.2 we must have

$$\frac{A}{r} = 2^{-c t_r} \tag{12.3}$$

By using a relation given in the Appendix, this relation can be re-expressed

$$t_r = a + b \log r \tag{12.4}$$

Here a and b are constants which must be determined by examining the relation of the reading time $t_r$ and the order of commonness or rank of a word, r. If Zipf's law is true and if the information rate is maximized for words chosen randomly and independently with probabilities given by Zipf's law, then relation 12.4 should hold for experimental data.

Of course, words aren't chosen randomly and independently in constructing English text, and hence we cannot say that word probabilities in accord with relation 12.1 actually would maximize information transmission per unit time. Nonetheless, it would be interesting to know whether predictions based on a random and independent choice of words do hold for the reading of actual English text.

Benoit Mandelbrot, a mathematician much interested in linguistic problems, has considered this matter in connection with reading-time data taken by D. H. Howes, an experimental psychologist. R. R. Riesz, an experienced experimenter in the field of psychophysics, and I have also attempted to compare equation 12.4 with human behavior.

There is a difficulty in making such a comparison. It seems fairly clear that reading speed is limited by word recognition, not by word utterance. A man may be uttering a long familiar word while he is recognizing a short, unfamiliar word. To get around this difficulty it seemed best to do some averaging by measuring the total time of utterance for three successive words and then comparing this with the sum of the times for the words computed by means of 12.4.

Riesz ingeniously and effectively did this and obtained the data of Figure XII-5. In the test, a subject read a paragraph as fast as possible. Certainly, a straight line according to 12.4 fits the data as well as any curve would.
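The step from 12.3 to 12.4 can be checked numerically. Taking logarithms to the base 2 of both sides of 12.3 gives $t_r = a + b \log_2 r$ with $a = -(\log_2 A)/c$ and $b = 1/c$. The Python sketch below assumes a finite vocabulary so that the constant A can be computed, and it picks an arbitrary value of c; both choices are illustrative assumptions, not values from the experiments.

```python
import math

def zipf_constant(n_words):
    """A such that the probabilities A / r sum to 1 over ranks 1..n_words
    (a finite vocabulary is assumed)."""
    return 1.0 / sum(1.0 / r for r in range(1, n_words + 1))

def reading_time(rank, A, c):
    """Relation 12.4 with a = -log2(A) / c and b = 1 / c."""
    a = -math.log2(A) / c
    b = 1.0 / c
    return a + b * math.log2(rank)

def probability_from_time(t, c):
    """Relation 12.1: the probability that goes with a reading time t."""
    return 2.0 ** (-c * t)

if __name__ == "__main__":
    n_words = 5000          # hypothetical vocabulary size
    c = 20.0                # hypothetical constant, in bits per second
    A = zipf_constant(n_words)
    for rank in (1, 10, 100, 1000):
        t = reading_time(rank, A, c)
        print(rank, t, probability_from_time(t, c), A / rank)
```

For each rank, the last two printed columns agree, which is just relation 12.3 restated: times that grow as the logarithm of the rank give back the Zipf probabilities.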
But the points are too scattered to prove that 12.4 really holds. Moreover, we should expect such a scatter, for the rank r corresponds to commonness of occurrence in prose from a variety of sources, but we have used it as indicating the subject's experience with and familiarity with the word. Also, as we see from Figure XII-3, word length may be expected to have some effect on reading time. Finally, we have disregarded relations among successive words.

This sort of experiment is extremely exasperating. One can see other experiments which he might do, but they would be time consuming, and there seems little chance that they would establish anything of general significance in a clear-cut way. Perhaps a genius will unravel the situation some day, but the wary psychologist is more apt to seek a field in which his work promises a definite, unequivocal outcome.

[Fig. XII-5. Vertical axis: reading time in seconds.]

The foregoing work does at least suggest that word usage may be governed by economy of effort and that economy of effort may be measured as economy of time. We still wonder, however, whether this is the outcome of a trained ability to cope with the English language or whether language somehow becomes adapted to the mental abilities of people.

What about the number of words we use, for instance? People sometimes measure the vocabulary of a writer by the total number of different words in his works and the vocabulary of an individual by the number of different words he understands. However, rare and unusual words make up a small fraction of spoken or written English. What about the words that constitute most of language? How numerous are these?
One might assert that the number of words used should reflect the complexity of life and that we would need more in Manhattan than in Thule (before the Air Force, of course). But, we always have the choice of using either different words or combinations of common words to designate particular things. Thus, I can say either "the blonde girl," "the redheaded girl," "the brunette girl" or I can say "the girl with light hair," "the girl with red hair," "the girl with dark hair." In the latter case, the words with, light, dark, red, and hair serve many other purposes, while blonde, redheaded, and brunette are specialized by contrast.

Thus, we could construct an artificial language with either fewer or more common words than English has, and we could use it to say the same things that we say in English. In fact, we can if we wish regard the English alphabet of twenty six letters as a reduced language into which we can translate any English utterance.

Perhaps, however, all languages tend to assume a basic size of vocabulary which is dictated by the capabilities and organization of the human brain rather than by the seeming complexity of the environment. To this basic language, clever and adaptable people can, of course, add as many special and infrequently used words as they desire to or can remember.

Zipf has studied just this matter by means of the graphs illustrating his law. Figure XII-6 (see note 1) shows frequency (number of times a word is used) plotted against rank (order of commonness) for 260,430 running words of James Joyce's Ulysses (curve A) and for 43,989 running words from newspapers (curve B). The straight line C illustrates Zipf's idealized curve or "law."

[Fig. XII-6. Frequency against rank.]

1 Reproduced from George Kingsley Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley Publishing Company, Reading, Mass., 1949.
Clearly, the heights of A and B are determined merely by the number of words in the sample; the slope of the curve and its constancy with length of sample are the important things. The steps at the lower right of the curves, of course, reflect the fact that infrequent words can occur once, twice, thrice, and so on in the sample but not 1.5 or 2.67 times.

When we idealize such curves to a 45° line, as in curve C, we note that more is involved than the mere slope of the line. We start our frequency measurement with words which occur only once; that is, the lower left hand corner of the graph represents a frequency of occurrence of 1. Similarly, the rank scale starts with 1, the rank assigned to the most frequently used word. Thus, vertical and horizontal scales start at 1, and equal distances along them were chosen in the first place to represent equal increases in number. We see that the 45° Zipf-law line tells us that the number of different words in the sample must equal the number of occurrences of the most frequently used word.

We can go further and say that if Zipf's law holds in this strict and primitive form, a number of words equal to the square root of the number of different words in the passage will make up half of all the words in the sample. In Figure XII-7 the number N of different words and the number V of words constituting half the passage are plotted against the number L of words in the passage.

[Fig. XII-7. Horizontal axis: L in thousands of words.]

Here is vocabulary limitation with a vengeance. In the Joyce passage, about 170 words constitute half the text. And Figure XII-6 assures us that the same thing holds for newspaper writing!

Zipf gives curves which indicate that his law holds well for Gothic, if one counts as a word anything spaced as a word.
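The square-root rule stated above can be explored with a small Python sketch. It assumes the strict form of Zipf's law, with the probability of the rth word proportional to 1/r over a finite vocabulary, and counts how many of the commonest words are needed to reach half of the running text. The vocabulary size used in the demonstration is a hypothetical round number. The rule follows from approximating the sum of 1/r by a logarithm, so an exact sum should be expected to agree with the square root only in order of magnitude.

```python
import math

def words_covering_half(n_different):
    """Smallest number of top-ranked words whose combined probability is at
    least one half, when p(r) is proportional to 1 / r for r = 1..n_different."""
    total = sum(1.0 / r for r in range(1, n_different + 1))
    running = 0.0
    for rank in range(1, n_different + 1):
        running += 1.0 / rank
        if running >= total / 2.0:
            return rank
    return n_different

if __name__ == "__main__":
    n_different = 10_000    # hypothetical number of different words
    print("words covering half the text:", words_covering_half(n_different))
    print("square root of vocabulary:   ", math.isqrt(n_different))
```

Either way, the point of the passage stands: under Zipf's law a very small part of the vocabulary accounts for half of everything written.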
It holds fairly well for Yiddish and for a number of Old German and Middle High German authors, though some irregularities occur at the upper left part of the curve. Curves for Norwegian tend to be steeper at the lower right than at the upper left, and Plains Cree gives a line having only about three-fourths the slope of the Zipf's law line. This means a greater number of different words in a given length of text, that is, a larger vocabulary. Chinese characters give a curve which zooms up at the left, indicating a smaller vocabulary.

Nonetheless, the remarkable thing is the similarity exhibited by all languages. The implication is that the variety and probability distribution of words is pretty much the same for many, if not all, written languages. Perhaps languages do necessarily adapt themselves to a pattern dictated by the human mental abilities, by the structure and organization of the human brain. Perhaps everyone notices and speaks about roughly the same number of features in his environment. An Eskimo in the bleak land of the north manages this by distinguishing by word and in thought among many types of snow; an Arab in the desert has a host of words concerning camels and their equipage. And perhaps all of these languages adapt themselves in such a way as to minimize the effort involved in human communication.

Of course, we don't really know whether or not these things are so. Zipf's data have been criticized. I find it impossible to believe that the number of different words is entirely dictated by length of sample, regardless of author. Certainly the frequency with which the occurs cannot change with the length of sample, as Zipf's law in its simple form implies.

It is said that Zipf's law holds best for sample sizes of around 120,000 words, that for smaller samples one finds too many words that occur only once, and that for larger samples too few words occur only once.
It seems most reasonable to assume that only the multiple authorship gave the newspaper the same vocabulary as Joyce.

So far, our approach to Zipf's law has been that of taking it as an approximate description of experimental data and asking where this leads us. There is another approach to Zipf's law. One can attempt to show that it must be so on the basis of simple assumptions concerning the generation of text.

While various workers have given proposed derivations, showing that Zipf's law follows from certain assumptions, Benoit Mandelbrot, a mathematician who was mentioned earlier, did the first satisfactory work and appears to have carried such work furthest.

Mandelbrot gives two derivations. In the first, he assumes that text is produced as a sequence of letters and spaces chosen randomly but with unequal probabilities, as in the first-order approximation to English text of Chapter III. This allows an infinite number of different "words" composed of sequences of letters separated from other sequences by spaces. On the basis of this assumption only, Mandelbrot shows that the probability of occurrence of the rth of these "words" in order of commonness must be given by

$$p(r) = P (r + V)^{-B} \tag{12.5}$$

The constants B and V can be computed if the probabilities of the various letters and of the space are known. B must be greater than 1. P must be such as to make the sum of p(r) over all "words" equal to unity.

We see that if V were very small and B were very nearly equal to 1, 12.5 would be practically the same as Zipf's original law. Instead of the straight, 45° line of Figure XII-6, equation 12.5 gives a curve which is less steep at the upper left and steeper at the lower right. Such a curve fits data on much actual text better than Zipf's original law does. It has been asserted, however, that the lengths of the "words" produced by the random process described don't correspond to the length of words as found in typical English text.
Further, language certainly has nonrandom features. Words get shortened as their usage becomes more common. Thus, taxi and cab came from taxicab, and cab in turn came from cabriolet. Can we say that the fact that the random production of letters leads to the production of "words" which obey Zipf's law explains Zipf's law? It seems to me that we can assert this only if we can show how the forces which do shape language imitate this random process. In his second derivation of a modified form of Zipf's law as a consequence of certain initial assumptions, Mandelbrot assumes that word frequencies are such as to maximize the information for a given cost. As a simple case, he assumes that each letter has a particular cost and that the cost of each word (that is, of each sequence of letters ending in a space) is the sum of the costs of its letters. This leads him to the same expression as the other derivation, that is, to equation 12.5. The interpretation of the different symbols is different, however. The constant B can be less than unity if the total number of allowable words is finite. Regardless of the meaning of the constants P, V, and B in expression 12.5, we can, if we wish, merely give them such values as will make the curve defined by 12.5 best fit statistical data derived from actual text. Certainly, we can fit actual data better this way than we can if we assume that V = 0 and B = 1 (corresponding to Zipf's original law). In fact, by so choosing the values of P, V, and B, equation 12.5 can be made to fit available data very well in all but a few exceptional cases. In the cases of modern Hebrew of around 1930 and Pennsylvania Dutch, which is a mixture of languages, a value of B smaller than 1 gives the best fit.
According to Mandelbrot, the wealth of vocabulary is measured chiefly by the value of B; if B is much greater than 1, a few words are used over and over again; if B is nearer to 1, a greater variety of words is used. Mandelbrot observes that as a child grows, B decreases from values around 1.6 to values around 1.15, or to a value around 1 if the child happens to be James Joyce. Certainly, equation 12.5 fits data better than Zipf's original law does. It overcomes the objection that, according to Zipf's original law, the probability of the word "the" should depend on the length of the sample of text. This does not mean, however, that Mandelbrot's explanation or derivation of equation 12.5 is necessarily correct. Further, it is possible that some other mathematical expression would fit data concerning actual text even better. A much more thorough study would be necessary to settle such questions. Zipf's law holds for many other data than those concerning word usage. For instance, in most countries it holds for population of cities plotted against rank in size. Thus, the tenth largest city has about a tenth the population of the largest city, and so on. However, the fact that the law holds in different cases may be fortuitous. The inverse-square law holds for gravitational attraction and also for intensity of light at different distances from the sun, yet these two instances of the law cannot be derived from any common theory. It is clear that our ability to receive and handle information is influenced by inherent limitations of the human nervous system. George A. Miller's law of 7 plus-or-minus 2 is an example. This states that after a short period of observation, a person can remember and repeat the names of from 5 to 9 simple, familiar objects, such as binary or decimal digits, letters, or familiar words. By means of a tachistoscope, a brightly illuminated picture can be shown to a human subject for a very short time.
If he is shown a number of black beans, he can give the number correctly up to perhaps as many as 9 beans. Thus, one flash can convey a number 0 through 9, or 10 possibilities in all. The information conveyed is log 10, or 3.3 bits. If the subject is shown a sequence of binary digits, he can recall correctly perhaps as many as 7, so that 7 bits of information are conveyed. If the subject is shown letters, he can remember perhaps 4 or 5, so that the information is as much as 5 log 26 bits, or 23 bits. The subject can remember perhaps 3 or 4 short, common words, somewhat fewer than 7 minus 2. If these are chosen from the 500 most common words, the information is 3 log 500, or 27 bits. As in the case of the reading rate experiments, the gain due to greater complexity outweighs the loss due to fewer items, and the information increases with increasing complexity. Now, both Miller's 7 plus-or-minus-2 rule and the reading rate experiments have embarrassing implications. If a man gets only 27 bits of information from a picture, can we transmit by means of 27 bits of information a picture which, when flashed on a screen, will satisfactorily imitate any picture? If a man can transmit only about 40 bits of information per second, as the reading rate experiments indicate, can we transmit TV or voice of satisfactory quality using only 40 bits per second? In each case I believe the answer to be no. What is wrong? What is wrong is that we have measured what gets out of the human being, not what goes in. Perhaps a human being can in some sense only notice 40 bits a second worth of information, but he has a choice as to what he notices. He might, for instance, notice the girl or he might notice the dress. Perhaps he notices more, but it gets away from him before he can describe it. Two psychologists, E. Averbach and G. Sperling, studied this problem in similar manners. Each projected a large number (16 or 18) of letters tachistoscopically.
A fraction of a second later they gave the subject a signal by means of a pointer or tone which indicated which of the letters he should report. If he could unfailingly report any indicated letter, all the letters must have "gotten in," since the letter which was indicated was chosen randomly. The results of these experiments seem to show that far more than 7 plus-or-minus-2 items are seen and stored briefly in the organism, for a few tenths of a second. It appears that 7 plus-or-minus-2 of these items can be transferred to a more permanent memory at a rate of about one item each hundredth of a second, or less than a tenth of a second for all items. This other memory can retain the transferred items for several seconds. It appears that it is the size limitation of this longer-term memory which gives us the 7 plus-or-minus-2 figure of Miller. Human behavior and human thought are fascinating, and one could go on and on in seeking relations between information theory and psychology. I have discussed only a few selected aspects of a broad field. One can still ask, however, is information theory really highly important in psychology, or does it merely give us another way of organizing data that might as well have been handled in some other manner? I myself think that information theory has provided psychologists with a new and important picture of the process of communication and with a new and important measure of the complexity of a task. It has also been important in stirring psychologists up, in making them re-evaluate old data and seek new data. It seems to me, however, that while information theory provides a central, universal structure and organization for electrical communication, it constitutes only an attractive area in psychology. It also adds a few new and sparkling expressions to the vocabulary of workers in other areas.
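The tachistoscope figures quoted earlier in this chapter all follow from one rule: a choice among n equally likely possibilities conveys log n bits, the logarithm being taken to the base 2. The short Python check below is only a sketch. It assumes, as the text tacitly does, that every item is equally likely and chosen independently of the others; the helper name `bits` is introduced here for illustration.

```python
import math


def bits(choices, items=1):
    """Information, in bits, of `items` independent selections,
    each made among `choices` equally likely possibilities."""
    return items * math.log2(choices)


beans = bits(10)                 # a number from 0 through 9
binary_digits = bits(2, items=7)  # 7 binary digits
letters = bits(26, items=5)      # 5 letters of the alphabet
words = bits(500, items=3)       # 3 of the 500 most common words

for name, value in [("beans", beans), ("binary digits", binary_digits),
                    ("letters", letters), ("words", words)]:
    print(f"{name:14s} {value:5.1f} bits")
```

The four values come out near 3.3, 7, 23, and 27 bits, in agreement with the figures given in the text: fewer items are recalled as the items grow more complex, yet the total information rises.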
CHAPTER XIII
Information Theory and Art

SOME MONTHS AGO when a competent modern composer and professor of music visited the Bell Laboratories, he was full of the news that musical sounds and, in fact, whole musical compositions can be reduced to a series of numbers. This was of course old stuff to us. By using pulse code modulation, one can represent any electric or acoustic wave form by means of a sequence of sample amplitudes. We had considered something that the composer didn't appreciate. In order to represent fairly high-quality music, with a band width of 15,000 cycles per second, one must use 30,000 samples per second, and each one of these must be specified to an accuracy of perhaps one part in a thousand. We can do this by using three decimal digits (or about ten binary digits) to designate the amplitude of each sample. A composer could exercise complete freedom of choice among sounds simply by specifying a sequence of 30,000 three-digit decimal numbers a second. This would allow him to choose from among a number of twenty-minute compositions which can be written as 1 followed by 108 million 0's, an inconceivably large number. Putting it another way, the choice he could exercise in composing would be 300,000 bits per second. Here we sense what is wrong. We have noted that by the fastest demonstrated means, that is, by reading lists of words as rapidly as possible, a human being demonstrates an information rate of no more than 40 bits per second. This is scarcely more than a ten-thousandth of the rate we have allowed our composer. Further, it may be that a human being can make use of, can appreciate, information only at some rate even less than 40 bits per second. When we listen to an actor, we hear highly redundant English uttered at a rather moderate speed. The flexibility and freedom that a composer has in expressing a composition as a sequence of sample amplitudes is largely wasted.
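The arithmetic behind these figures is easily checked. The Python fragment below is merely a worked restatement of the numbers given above; the variable names are chosen here for illustration, and the 40 bits per second is the reading-rate figure quoted in the text.

```python
import math

BAND_WIDTH_CPS = 15_000          # band width of fairly high-quality music
LEVELS_PER_SAMPLE = 1_000        # accuracy of one part in a thousand
DECIMAL_DIGITS_PER_SAMPLE = 3    # three decimal digits give 1,000 levels
HUMAN_RATE_BITS_PER_SECOND = 40  # fastest demonstrated human rate

# Two samples per cycle of the highest frequency to be represented.
samples_per_second = 2 * BAND_WIDTH_CPS

# Whole binary digits needed to tell 1,000 levels apart.
bits_per_sample = math.ceil(math.log2(LEVELS_PER_SAMPLE))

bit_rate = samples_per_second * bits_per_sample

# A twenty-minute composition written out in decimal digits.
seconds = 20 * 60
decimal_digits = seconds * samples_per_second * DECIMAL_DIGITS_PER_SAMPLE

ratio = HUMAN_RATE_BITS_PER_SECOND / bit_rate

print(samples_per_second, "samples per second")
print(bits_per_sample, "binary digits per sample")
print(bit_rate, "bits per second")
print(decimal_digits, "decimal digits in twenty minutes")
print("human rate is 1 part in", round(1 / ratio), "of the composer's rate")
```

The number of distinct twenty-minute compositions is 10 raised to the power `decimal_digits`, that is, 1 followed by 108 million 0's.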
This flexibility and freedom allow him to produce a host of "compositions" which to any human auditor will sound indistinguishable and uninteresting. Mathematically, white Gaussian noise, which contains all frequencies equally, is the epitome of the various and unexpected. It is the least predictable, the most original of sounds. To a human being, however, all white Gaussian noise sounds alike. Its subtleties are hidden from him, and he says that it is dull and monotonous. If a human being finds monotonous that which is mathematically most various and unpredictable, what does he find fresh and interesting? To be able to call a thing new, he must be able to distinguish it from that which is old. To be distinguishable, sounds must be to a degree familiar. We can tell our friends apart, we can appreciate their particular individual qualities, but we find much less that is distinctive in strangers. We can, of course, tell a Chinese from our Caucasian friends, but this does not enable us to enjoy variety among Chinese. To do that we have to learn to know and distinguish among many Chinese. In the same way, we can distinguish Gaussian noise from Romantic music, but this gives us little scope for variety, because all Gaussian noise sounds alike to us. Indeed, to many who love and distinguish among Romantic composers, most eighteenth-century music sounds pretty much alike. And to them Grieg's Holberg Suite may sound like eighteenth-century music, which it resembles only superficially. Even to those familiar with eighteenth-century music, the choral music of the sixteenth century may seem monotonous and undistinguishable. I know, too, that this works in reverse order, for some partisans of Mozart find Verdi monotonous, and to those for whom Verdi affords tremendous variety much modern music sounds like undistinguishable noise. Of course a composer wants to be free and original, but he also wants to be known and appreciated.
If his audience can't tell one of his compositions from another, they certainly won't buy recordings of many different compositions. If they can't tell his compositions from those of a whole school of composers, they may be satisfied to let one recording stand for the lot. How, then, can a composer make his compositions distinctive to an audience? Only by keeping their entropy, their information rate, their variety within the bounds of human ability to make distinctions. Only when he doles his variety out at a rate of a very few bits per second can he expect an audience to recognize and appreciate it. Does this mean that the calculating composer, the information-theoretic composer so to speak, will produce a simple and slow succession of randomly chosen notes? Of course not, not any more than a writer produces a random sequence of letters. Rather, the composer will make up his composition of larger units which are already familiar in some degree to listeners through the training they have received in listening to other compositions. These units will be ordered so that, to a degree, a listener expects what comes next and isn't continually thrown off the track. Perhaps the composer will surprise the listener a bit from time to time, but he won't try to do this continually. To a degree, too, the composer will introduce entirely new material sparingly. He will familiarize the listener with this new material and then repeat the material in somewhat altered forms. To use the analogy of language, the composer will write in a language which the listener knows. He will produce a well-ordered sequence of musical words in a musically grammatical order. The words may be recognizable chords, scales, themes, or ornaments. They will succeed one another in the equivalents of sentences or stanzas, usually with a good deal of repetition. They will be uttered by the familiar voices of the orchestra.
If he is a good composer, he will in some way convey a distinct and personal impression to the skilled listener. If he is at least a skillful composer, his composition will be intelligible and agreeable. Of course, none of this is new. Those quite unfamiliar with information theory could have said it, and they have said it in other words. It does seem to me, however, that these facts are particularly pertinent to a day in which composers, and other artists as well, are faced with a multitude of technical resources which are tempting, exasperating, and a little frightening. Their first temptation is certainly to choose too freely and too widely. M. V. Mathews of the Bell Laboratories was intrigued by the fact that an electronic computer can create any desired wave form in response to a sequence of commands punched into cards. He devised a program such that he could specify one note by each card as to wave form, duration, pitch, and loudness. Delighted with the freedom this afforded him, he had the computer reproduce rapid rhythmic passages of almost unplayable combinations, such as three notes against four with unusual patterns of accent. These ingenious exercises sounded, simply, chaotic. Very skillful composers, such as Varèse, can evoke an impression of form and sense by patching together all sorts of recorded and modified sounds after the fashion of musique concrète. Many appealing compositions utilizing electronically generated sounds have already been produced. Still, the composer is faced with difficulties when he abandons traditional resources. The composer can choose to make his compositions much simpler than he would if he were writing more conventionally, in order not to lose his audience. Or he and others can try to educate an audience to remember and distinguish among the new resources of which they avail themselves. Or the composer can choose to remain unintelligible and await vindication from posterity.
Perhaps there are other alternatives; certainly there are if the composer has real genius. Does information theory have anything concrete to offer concerning the arts? I think that it has very little of serious value to offer except a point of view, but I believe that the point of view may be worth exploring in the brief remainder of this chapter. In Chapters III, VI, and XII we considered language. Language consists of an alphabet or vocabulary of words and of grammatical rules or constraints concerning the use of words in grammatical text. We learned to distinguish between the features of text which are dictated by the vocabulary and the rules of grammar and the actual choice exercised by the writer or speaker. It is only this element of choice which contributes to the average amount of information per word. We saw that Shannon has estimated this to be between 3.3 and 7.2 bits per word. It must also be this choice which enables a writer or speaker to convey meaning, whatever that may be. The vocabulary of a language is large, although we have seen in Chapter XII that a comparatively few words make up the bulk of any text. The rules of grammar are so complicated that they have not been completely formulated. Nonetheless, most people have a large vocabulary, and they know the rules of grammar in the sense that they can recognize and write grammatical English. It is reasonable to assume a similarly surprisingly large knowledge of musical elements and of relations among them on the part of the person who listens to music frequently, attentively, and appreciatively. Of course, it is not necessary that the listener be able to formulate his knowledge for him to have it, any more than the writer of grammatical English need be able to formulate the rules of English grammar. He need not even be able to write music according to the rules, any more than a mute who understands speech can speak.
He can still in some sense know the rules and make use of his knowledge in listening to music. Such a knowledge of the elements and rules of the music of a particular nation, era, or school is what I have referred to as "knowing the language of music" or of a style of music. However much the rules of music may or may not be based on physical laws, a knowledge of a language of music must be acquired by years of practice, just as the knowledge of a spoken language is. It is only by means of such a knowledge that we can distinguish the style and individuality of a composition, whether literary or musical. To the untutored ear, the sounds of music will seem to be examples chosen not from a restricted class of learned sounds but from all the infinity of possible sounds. To the untutored ear, the mechanical workings of the rules of music will seem to represent choice and variety. Thus, the apparent complexity of music will overwhelm the untutored auditor or the auditor familiar only with some other language of music. We should note that we can write sense while violating the rules of grammar to a degree (me heap big injun). We might liken the intelligibility of this sentence to an English-speaking person to our ability to appreciate music which is somewhat strange but not entirely foreign to our experience. We should also note that we can write nonsense while obeying the rules of grammar carefully (the alabaster word spoke silently to the purple). It is this second possibility to which I wish to address myself in a moment. I will remark first, however, that while one can of course both write sense and obey the rules while doing so, he often exposes his inadequacies to the public gaze by thus being intelligible. It is no news that we can dispense with sense almost entirely while retaining a conventional vocabulary and some or many rules.
Thus, Mozart provided posterity with a collection of assorted, numbered bars in 3/8 time, together with a set of rules (Koechel 294D). By throwing dice to obtain a sequence of random numbers and choosing successive bars by means of the rules, even the nonmusical amateur can "compose" an almost endless number of little waltzes which sound like somewhat disorganized Mozart. An example is shown in Figure XIII-1. Joseph Haydn, Maximilian Stadler, and Karl Philipp Emanuel Bach are said to have produced similar random music. In more recent times, John Cage has used random processes in the choice of sequences of notes. In ignorance of these illustrious predecessors, in 1949 M. E. Shannon (Claude Shannon's wife) and I undertook the composition of some very primitive statistical or stochastic music. First we made a catalog of allowed chords on roots I-VI in the key of C. Actually, the catalog covered only root I chords; the others were derived from these by rules. By throwing three specially made dice and by using a table of random numbers, a number of compositions were produced.

[Fig. XIII-1]

In these compositions, the only rule of chord connection was that two succeeding chords have a common tone in the same voice. This let the other voices jump around in a wild and rather unsatisfactory manner. It would correspond to the use of a simple and consistent but incorrect digram probability in the construction of synthetic text, as illustrated in Chapter III. While the short-range structure of these compositions was very primitive, an effort was made to give them a plausible and reasonably memorable, longer-range structure. Thus, each composition consisted of eight measures of four quarter notes each. The long-range structure was attained by making measures 5 and 6 repeat measures 1 and 2, while measures 3 and 4 differed from measures 7 and 8. Thus, the compositions were primitive rondos.
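The scheme described so far can be imitated in a few lines of Python. What follows is a loose sketch and not a reconstruction of the 1949 procedure: the chords are reduced to plain triads on roots I through VI, the rule of a common tone "in the same voice" is weakened to the two triads merely sharing a scale degree, a pseudo-random generator stands in for the dice, and nothing forces measures 3 and 4 to differ from measures 7 and 8. The names `TRIADS`, `shares_tone`, and `compose` are invented for the illustration.

```python
import random

# Triads on roots I..VI of a major key, written as sets of scale
# degrees (1 = tonic). Root 1 gives {1, 3, 5}, root 6 gives {6, 1, 3}.
TRIADS = {root: {root, (root + 1) % 7 + 1, (root + 3) % 7 + 1}
          for root in range(1, 7)}


def shares_tone(a, b):
    """True if the triads on roots a and b have a scale degree in common."""
    return bool(TRIADS[a] & TRIADS[b])


def compose(rng):
    """Return 32 chord roots: eight measures of four chords each,
    with measures 5 and 6 repeating measures 1 and 2."""
    chords = [None] * 32
    for i in range(32):
        measure = i // 4                      # 0-based measure number
        if measure in (4, 5):                 # measures 5 and 6
            chords[i] = chords[i - 16]        # copy measures 1 and 2
            continue
        options = [r for r in TRIADS
                   if i == 0 or shares_tone(chords[i - 1], r)]
        if i == 15:
            # The chord before the repeat must also connect to chord 1.
            options = [r for r in options if shares_tone(r, chords[0])]
        chords[i] = rng.choice(options)
    return chords


piece = compose(random.Random(1949))
for m in range(8):
    print("measure", m + 1, piece[4 * m:4 * m + 4])
```

Even in this reduced form the two levels of structure are visible: the common-tone rule constrains only neighboring chords, while the repetition of measures 1 and 2 is imposed from outside, on the composition as a whole.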
Further, it was specified that chords 1, 16, and 32 have root I and chords 15 and 31 have either root IV or root V, in order to give the effect of a cadence. Although the compositions are formally rondos, they resemble hymns. I have reproduced one as Figure XIII-2. As all hymns should have titles and words, I have provided these by nonrandom means. The other compositions sound much like the one given. Clearly, they are all by the same composer. Still, after a few hearings they can be recognized as different. I have even managed to grow fond of them through hearing them too often. They must grate on the ears of an uncorrupted musician. In 1951, David Slepian, an information theorist of whom we have heard before, took another tack. Following some early work by Shannon, he evoked such statistical knowledge of music as lay latent in the breasts of musically untrained mathematicians who were near at hand. He showed such a subject a quarter bar, a half bar, or three half bars of a "composition" and asked the subject to add a sensible succeeding half bar. He then showed another subject an equal portion including that added half bar and got another half bar, and so on. He told the subjects the intended styles of the compositions.

[Figure: score of the hymn RANDOM, with three verses of words. (1) When once my random thoughts I turned / On Bethlehem and on that star / Which drew three wise men from afar / I wondered what the wise men learned (2) That Christmas day, did they foresee / The Child grown Christ, and Christ denied / And Christ betrayed and crucified / And risen Christ the Deity? (3) Or did they learn of us who sing / The day when angels sang before / And, laying down the gifts they bore / Foretell our gifts and caroling?]
Fig. XIII-2

In Figure XIII-3, I show two samples: a fragment of a chorale in which each half bar was added on the basis of the preceding half bar and a fragment of a "romantic composition," in which each half bar was added on the basis of the preceding three half bars. It seems to me surprising that these "compositions" hang together as well as they do, despite the inappropriate and inadmissible chords and chord sequences which appear. The distinctness of the styles is also arresting; apparently the mathematicians had quite different ideas of what was appropriate in a chorale and what was appropriate in a romantic composition.

[Fig. XIII-3: two score fragments, labeled CHORALE and ROMANTIC]

Slepian's experiment shows the remarkable flexibility of the human being as well as some of his fallibility. True stochastic processes are apt to be more consistent but duller. A number have been used in the composition of music. There is no doubt that a computer supplied with adequate statistics describing the style of a composer could produce random music with a recognizable similarity to a composer's style. The nursery-tune style demonstrated by Pinkerton and the diversity of styles evoked by Hiller and Isaacson, which I will describe presently, illustrate this possibility. In 1956, Richard C. Pinkerton published in the Scientific American some simple schemes for writing tunes. He showed how a note could be chosen on the basis of its probability of following the particular preceding note and how the probabilities changed with respect to the position of the note in the bar. Using probabilities derived from nursery tunes, he computed the entropy per note, which he found to be 2.8 bits. I feel sure that this is quite a bit too high, because only digram probabilities were considered.
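The kind of computation involved can be sketched in Python. The fragment below is not Pinkerton's procedure and uses none of his data: the sample tune is made up for the illustration, the dependence of the probabilities on position in the bar is ignored, and the function names are invented here. It counts how often each note follows each other note, estimates the entropy per note from those digram counts, and then draws a new tune from the same counts.

```python
import math
import random
from collections import Counter, defaultdict


def digram_counts(notes):
    """Count, for each note, how often each other note follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(notes, notes[1:]):
        counts[prev][nxt] += 1
    return counts


def entropy_per_note(counts):
    """Conditional entropy of the next note given the preceding note,
    in bits, estimated from digram counts."""
    total = sum(sum(followers.values()) for followers in counts.values())
    h = 0.0
    for followers in counts.values():
        row_total = sum(followers.values())
        for n in followers.values():
            h -= (n / total) * math.log2(n / row_total)
    return h


def generate(counts, start, length, rng):
    """Choose each note according to how often it followed the
    preceding note in the sample."""
    tune = [start]
    while len(tune) < length:
        followers = counts.get(tune[-1])
        if not followers:          # no recorded successor: stop early
            break
        notes, weights = zip(*followers.items())
        tune.append(rng.choices(notes, weights=weights)[0])
    return tune


# A made-up sample tune, not taken from any published study.
sample = list("CDECDEGEDCDEDCGC")
counts = digram_counts(sample)
print("entropy per note:", round(entropy_per_note(counts), 2), "bits")
print("new tune:", "".join(generate(counts, "C", 16, random.Random(7))))
```

Because only the immediately preceding note is consulted, any constraint that reaches back further goes uncounted, and the estimate of entropy comes out higher than it would if longer sequences were taken into account.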
He also presented a simple finite-state machine which could be used to generate banal tunes, much as the machine of Figure III-1 generates "sentences." In 1957, F. B. Brooks, Jr., A. L. Hopkins, Jr., P. G. Neumann, and W. V. Wright published an account of the statistical composition of music on the basis of an extensive statistical study of hymn tunes. In 1956, the Burroughs Corporation announced that they had used a computer to generate music, and, in 1957, it was announced that Dr. Martin Klein and Dr. Douglas Bolitho had used the Datatron computer to write "popular" melodies. Jack Owens set words to one, and it was played over the ABC network as Push Button Bertha. No doubt many others have done similar things. It remained, however, for L. A. Hiller, Jr., and L. M. Isaacson of the University of Illinois to make a really serious experiment with computer music. Hiller and Isaacson succeeded in formulating the rules of four-part, first-species counterpoint in such a way that a computer could choose notes randomly and reject them if they violated the rules. Because the rules involve, except in connection with the concluding cadence, only direct relations among three successive notes, the music tends to wander, but over a short range it sounds surprisingly good. A sample is shown in Figure XIII-4.¹ Hiller and Isaacson went on to demonstrate that they could use the computer to generate interesting rhythmic and dynamic patterns and to generate "Markoff-chain" music, in which successive note selection depended on probability functions computed from tables derived from various considerations of overtones or harmonics. In this case they generated a coda according to a simple prescription.

[Fig. XIII-4: score excerpt, marked CODA]

¹ Reproduced from L. A. Hiller, Jr., and L. M. Isaacson, Illiac Suite for String Quartet, New Music, 1957, by permission of Theodore Presser Company, Bryn Mawr, Pa.

As it stands, this music, which was brought together and published as the Illiac Suite for String Quartet, has a good deal of local structure but is weak and wandering as a whole. The imposition of some simple pattern or repetition might have helped considerably. This could be of a strictly deterministic nature, as in the case of the prescribed repetitions in a rondo, or it could be of the nature of Chomsky's grammar, which we have considered in Chapter VI. It is clear, however, that it is foolish to try to attain long-range structure simply by relating a note to the immediately preceding notes by digram, trigram, and higher probabilities. The relation must be among parts of the composition, not simply among notes. The work of Hiller and Isaacson does demonstrate conclusively that a computer can take over many musical chores which only human beings had been able to do before. A composer, and especially an unskilled composer, might very well rely on a computer for much routine musical drudgery. The composer could merely guide the main pattern of the composition and let the computer fill in details of harmony and counterpoint, according to a specification of style or period. Further, the computer could be used to try out proposed new rules of composition, such as new rules of counterpoint or harmony, with whose use and consequences the composer might have little experience and familiarity. In these days we hear that cybernetics will soon give us machines which learn. If they learn in a complicated enough sense of the word, why couldn't they learn what we like, even when we don't know ourselves? Thus, by rewarding or punishing a computer for the success or failure of its efforts, we might so condition the computer that when we pressed a button marked Spanish, classical, rock-and-roll, sweet, etc., it would produce just what we wanted in connection with the terms.
Such thoughts are intriguing, but they are of course nonsense in our day and will probably remain so for a long time to come. Music is not all of art. I began with music because it offers an apt means for illustrating in an unusual context some ideas derived from information theory. We could just as well draw our illustrations from the use of language. Indeed, experiments with the stochastic production of text have been perhaps more widely cultivated than experiments with music. A professor at the Grand Academy of Lagado showed Captain Lemuel Gulliver a word frame consisting of lettered blocks mounted on shafts. The professor turned these at random and sought new wisdom in the patterns of letters which appeared. Here we see just the wrong application of a stochastic process in the generation of text. Certainly, this will not give us new knowledge. Who would take the uncorroborated word of a random process? There are all too many unsubstantiated statements available; what we need to know is what is so and what isn't. Nonetheless, a stochastic process can produce some interesting effects. In Chapter III we noted Shannon's approximations to English text. These were made by using letter digram and trigram frequencies and a table of random numbers. We have seen that they contain some interesting "words." To me, deamy has a pleasant sound; I would take "it's a deamy idea" in a complimentary sense. On the other hand, I'd hate to be denounced as ilonasive. I would not like to be called grocid; perhaps it reminds me of gross, groceries, and gravid. Pondenome, whatever it may be, is at least dignified. I repeat Shannon's second-order word approximation here:

The head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time who ever told the problem for an unexpected.

I find this disquieting.
I feel that the English writer is in mortal peril, yet I cannot come to his aid because the latter part of the message is garbled. In seeking less garbled material, as I noted in Chapter VI, I wrote three grammatically connected words in a column from the top down on a slip of paper. I showed them to a friend, asked him to make up a sentence in which they occurred, and then to add the next word in this sentence. I then folded over the top word of the four I now had and showed the visible three to another friend and got another word from him. After canvassing twenty friends, I had the following:

When morning broke after an orgy of wild abandon he said her head shook vertically aligned in a sequence of words signifying what . . .

Later examples are:

One day when I went to what was Dionysus thinking of women without men go off half way decent impression . . .

I forget whether he went on and on. Finally he stipulated that this must stop immediately after this. The last time I saw him when she lived. It happened one frosty look of trees waving gracefully against the wall. You never can . . .

We see that a seeming order of meaning persists over groups of far more than four words. Eventually, however, the text wanders. The long-range wandering is of course attributable to the fact that there is no long-range, persistent purpose or meaning guiding the choice of words. We sometimes see a similar quality in the utterances of schizophrenics. W. H. Hudson illustrates a more gradual wandering admirably by a character in his The Purple Land: Uncle Anselmo never gets to the end of a story because he continually wanders on to new ground. One can add a certain amount of long-range order by writing, in view, at the bottom of the slip of paper to which people add words, a title which indicates what the passage is supposed to be about. Dr. Donald A.
Dunn of the Stanford Electronics Laboratory has kindly supplied me with some examples in which the person adding a word saw only one preceding word, together with the title:

Men and Women

Eve loved intensely sentimental or not sufficient tonight wherever you may die before yesterday again and whatever m'love misbehaves. The seduced are compatible unusual family life seemed wonderful experience for tenderness forever yours. Orphans frequently visited his promiscuity and infidelity despite hate and love for tomorrow sex ain't nothing.

In the following examples, which were produced at the Bell Laboratories, the person adding a word saw three preceding words as well as the title:

About Life

Life has many good and wise men seldom condemn halfwits lightly! You wonder why not. Human feelings but savage tribes found . . .

Engineers

It is frequently said that they knew why forces might affect salaries. However, all scientists can't imagine . . .

Housecleaning

First empty the furniture of the master bedroom and bath. Toilets are to be washed after polishing doorknobs the rest of the room. Washing windows semiannually is to be taken by small aids such as husbands are prone to omit soap powder.

Murder Story

When I killed her I stabbed Claude between his powerful jaws clamped tightly together. Screaming loudly despite fatal consequences in the struggle for life ebbing as he coughed hollowly spitting blood from his ears.

I think that it is hard to read such material without amusement. I feel a little admiration as well. I would never write, "It happened one frosty look of trees waving gracefully against the wall." I almost wish I could.

Poor poets endlessly rhyme love with dove, and they are constrained by their highly trained mediocrity never to produce a good line. In some sense, a stochastic process can do better; it at least has a chance. I wish I had hit on deamy, but I never would have.
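The word-by-word experiments described above can be imitated mechanically. What follows is only a minimal sketch, not Shannon's procedure nor the slip-of-paper game itself: it assumes a tiny made-up corpus (`CORPUS`) and chooses each next word at random from the words that followed the current word in that corpus, which is the idea behind a second-order word approximation. The names `build_followers` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

# Tiny illustrative corpus (an assumption; a long English text would serve far better).
CORPUS = (
    "the head of the attack on the writer is the character of the point "
    "and the method of the letters is the problem of the time"
)

def build_followers(text):
    """Map each word to the list of words that follow it somewhere in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Random walk: each word is drawn from those seen after the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        choices = followers.get(out[-1])
        if not choices:  # dead end: this word never had a successor in the corpus
            break
        out.append(rng.choice(choices))
    return out

followers = build_followers(CORPUS)
passage = generate(followers, "the", 12)
print(" ".join(passage))
```

Because repeated successors are kept in each list, common word pairs are chosen more often than rare ones, so the walk follows the pair frequencies of the corpus. Like the examples in the text, the output is locally plausible and wanders in the long run.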
Will a computer produce text of any literary merit by means of grammatical rules and a sequence of random numbers? It might produce fresh and amusing "words" and amusing short passages of some shock value. One can of course imagine a machine designed to write detective novels and equipped with settings for hard-boiled, puzzle, character, suspense, and so on, but such a device seems to me to be very far away.

The visual arts can be used to illustrate the same points which have been made in connection with music and language. A completely random visual pattern, like a completely random acoustic wave or a completely random sequence of letters, is mathematically the most surprising, the least predictable of all possible patterns. Alas, a completely random pattern is also the dullest of all patterns, and to a human being one random pattern looks just like another. Figure XIII-5, which is an array of 10,000 randomly black or white dots, illustrates this.

Fig. XIII-5

Bela Julesz, who works in the field of perception, caused an electronic computer to produce this random array of dots as a part of his studies of stereoscopic vision and of the meaning of pattern. He also programmed the computer to remove some of the randomness from such a random pattern. He did this by making the computer examine successively various sets of five points located at the tips and at the center of an X, as shown by the points marked X in Figure XIII-6 (other points are marked O). If the center point was the same (black or white) as either points 1 and 4 or points 2 and 3, it was changed (from black to white or from white to black). This tends to remove any black or white diagonals, except when points 1 and 4 are black and points 2 and 3 are white or vice versa. As we can see from Figure XIII-7, making a pattern less random in this way alters and improves its appearance profoundly.
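The rule just described is simple enough to try out. The sketch below is an illustration under stated assumptions, not Julesz's actual program: it assumes that points 1 and 2 are the upper-left and upper-right tips of the X and points 3 and 4 the lower-left and lower-right tips, that every interior dot is examined once as a center in row-by-row order, and that changes take effect immediately. The function names are hypothetical.

```python
import random

def random_dots(size, seed=0):
    """size x size array of random white (0) and black (1) dots."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]

def matches_a_diagonal(grid, r, c):
    """True if the center dot has the same color as both ends of either diagonal."""
    center = grid[r][c]
    p1, p2 = grid[r - 1][c - 1], grid[r - 1][c + 1]  # assumed: upper tips of the X
    p3, p4 = grid[r + 1][c - 1], grid[r + 1][c + 1]  # assumed: lower tips of the X
    return (p1 == center == p4) or (p2 == center == p3)

def reduce_diagonals(grid):
    """Apply the rule once to every interior dot, working on a copy of the array."""
    out = [row[:] for row in grid]
    n = len(out)
    for r in range(1, n - 1):
        for c in range(1, n - 1):
            if matches_a_diagonal(out, r, c):
                out[r][c] = 1 - out[r][c]  # black becomes white, white becomes black
    return out

def count_diagonals(grid):
    """Number of interior dots that still lie on a three-dot diagonal of one color."""
    n = len(grid)
    return sum(matches_a_diagonal(grid, r, c)
               for r in range(1, n - 1) for c in range(1, n - 1))

dots = random_dots(100)          # 10,000 dots, as in Figure XIII-5
smoothed = reduce_diagonals(dots)
before, after = count_diagonals(dots), count_diagonals(smoothed)
print(before, after)
```

Some diagonals survive a single pass, both in the exceptional case the text mentions and because later changes to the lower tips can re-create a diagonal through a dot already examined, but the count should drop substantially.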
An unpredictable (random) component is desirable for the sake of variety or surprise, but some orderliness is necessary if a pattern is to be pleasing. This exploitation of both order and randomness is in fact old to art. The kaleidoscope offers a charming effect by giving to a random arrangement of bits of colored glass a sixfold symmetry.

Fig. XIII-6

Fig. XIII-7

Many years ago Marcel Duchamp, who painted Nude Descending a Staircase, allowed a number of threads to fall on pieces of black cloth and then framed and preserved them. Jean Tinguely, the Swiss artist, has produced, by means of a machine, partly ordered, partly random colored designs of considerable merit; I derive continuing pleasure from one which hangs in my office. I saved for years a pile of solder droppings which I intended to mount on a block of ebony and present to the Museum of Modern Art. Finally, I lost both the solder and the desire to do so.

All of this has given me a sort of minimum philosophy of art, which I will not, I hasten to assure the reader, blame on information theory. It is a minimum philosophy because it says nothing about the talent or genius which alone can make art worth while.

Successful art requires the appreciation of an audience as well as the talent of the artist. Audiences are influenced by things other than the object of art before them. If a person sets his mind against it, anything will leave him cold. A desire to appreciate can, on the other hand, lead to one's liking even poor works. I like the hymn-like compositions that Betty Shannon and I made. Authors sometimes prefer inferior works of their own. Both small coteries and large groups can be led to appreciate sincerely things which are for a time the fashion but which have little long-range appeal and which probably have little merit.
Among other things, audiences want to have a sense of authorship, a sense of an individual, in connection with works of art. To bring appreciation to an artist, his work must have enough consistency so that it is recognizable as his. How let down the sincere appreciator must be if he always has to look at the label or wait for the announcer in order to know that the painting or music is the product of his favorite artist. Suppose that one artist had actually produced in succession the masterpieces we now accept as the works of a number of great artists with diverse styles, long before the artists lived. This would astonish us, but we could scarcely appreciate him as an artist, however much we might admire the individual paintings. Picasso is eminently recognizable, but he is disquieting. He has been skillful in many styles, and yet he escapes our final judgment by going from one style to another. How much easier it is to appreciate Matisse.

To be appreciated by an audience, art must be intelligible to the audience. Even a good joke in Chinese will amuse few Americans, and certainly ten jokes in Chinese will be no more amusing than one. To a degree, to be appreciated art must be in a language familiar to the audience; otherwise no matter how great the variety may be, the audience will have an impression of monotony, of sameness. We can be surprised repeatedly only by contrast with that which is familiar, not by chaos.

Some artists adopt a language taught to their audience by earlier masters. Brahms was one of these. Other artists teach something of a new language to their audiences, as the impressionists did. Certainly, the language of art changes with time, and we should be grateful to the artists who teach us new words. However, we should not doubt the originality of such artists as Bach and Handel, who spoke ringingly in a language of the past.

While a language with intelligible words and relations between words is necessary in art, it is not sufficient.
Mechanical sameness is dull and disappointing. I prefer the surprises of stochastic prose to the vapid verses of Owen Meredith. Perhaps in some age of bad art, man will be forced to stochastic art as an alternative to the stale product of human artisans.

So much for information theory and art.

CHAPTER XIV

Back to Communication Theory

SURELY, IT IS WONDERFUL if a new idea contributes to the solution of a broad range of problems. But, first of all, to be worthy of notice a new idea must have some solid and clearly demonstrated value, however narrow that value may be.

An information theorist has criticized me for exploring in this book possible applications of information theory in fields of language, psychology, and art. To him, the relation between such subjects and information theory seems marginal or even dubious. Why distract the reader from the clearly demonstrated value and importance of information theory by discussing matters concerning which no clear value or importance can be demonstrated?

Partly, in writing this book I have felt an obligation to the reader to discuss relations between information theory in its solid and narrow sense and various fields with which it has been connected in the writings of others. Partly, I believe that information theory is useful in helping us in talking sense or at least in keeping from talking nonsense in connection with some linguistic, artistic, and psychological problems. However, there is a danger in overemphasizing such matters in a book on information theory.

It would certainly be wrong to assert or to believe that information theory is valuable chiefly because of wide-ranging connections with a variety of fields such as language, cybernetics, psychology, and art. To believe this would be to repeat mistakes which have been made in connection with other important discoveries.
Thus, in Newton's day his work was beclouded by controversy and philosophy, and for many years thereafter it was associated in people's minds with a putative universality which confused its real nature. Einstein, however, could see more clearly. He said: "Reason, of course, is weak when measured against its never ending task." Einstein then described Newton's contribution to this task of understanding and observed, "and with that, the goal was reached, the science of celestial mechanics was born, confirmed a thousand times over by Newton himself and those who came after him."

It is fair to add that since Newton's day, Newtonian mechanics has been useful in solving or contributing to the solution of problems that never entered the minds of Newton and his contemporaries, but it has not solved all problems of science, as some optimistic philosophers expected it to.

To me the indubitably valuable content of information theory seems clear and simple. It embraces the ideas of the information rate or entropy of an ergodic message source, the information capacity of noiseless and noisy channels, and the efficient encoding of messages produced by the source, so as to approach errorless transmission at a rate approaching the channel capacity. The world of which information theory gives us an understanding of clear and present value is that of electrical communication systems and, especially, that of intelligently designing such systems.

It seems to me wise at the close of this book to turn away from the broad, speculative possibilities (or impossibilities?) of information theory and to ask the following question: Beyond the things already described in this book, what have information theorists done and what are they doing that is mathematically sound, well founded, compelling? What, in other words, have they done that qualifies as sound science which we must accept rather than as intriguing speculation that we have the privilege of arguing about?
Here we find a broad range of work. To explain all of it fully to the reader would take another book. Thus, this chapter will be a brief summary of some of the work of information theorists since the publication of Shannon's original paper. Its purpose is to acquaint the reader with the scope of information theory in its narrow sense and, perhaps, to entice him into following such activities in greater detail.

One thing that information theorists have sought is some application of the entropy or information rate of a message source to a problem other than that of encoding and transmission of information. Ambitious men want to bring meaning into the picture somehow, but a more modest worker is willing to settle for any application which is meaningful and rigorously correct.

The only application of information rate to a problem other than efficient encoding which has been given so far and which meets these criteria was advanced by J. L. Kelly, Jr., in 1956.¹ It concerns gambling on chance events in which the bettor has inside information as to the outcome of the event bet upon. We might imagine, for instance, that the dice are already thrown (or the race run) and that the favored bettor knows this and has received some knowledge of the outcome, but the person with whom he bets doesn't know this and gives the bettor fair odds on the basis of the chance of the outcome.

The information which the bettor receives is doled out to him in bits, that is, yes-or-no answers to questions. His informant could, for instance, inform the bettor completely concerning whether a coin tossed had turned up heads or tails by sending him one bit of information. Or the informant could narrow for the bettor the possible outcomes of the cast of a die from 6 to 3 by using one bit of information to tell the bettor whether the outcome was odd or even.
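The betting situation described above can be worked through numerically. The sketch below is illustrative only and assumes a particular case not spelled out in the text: a fair coin, even-money odds, a tip that is correct with probability `q`, and a bettor who stakes a fixed fraction `f` of his capital on the tipped outcome at every bet. It searches for the fraction that maximizes the average number of doublings of capital per bet and compares that maximum with the information, in bits per bet, that such a tip conveys, 1 - H(q). In this case the two agree, which is the content of Kelly's result. All names are hypothetical.

```python
import math

def doubling_rate(f, q):
    """Average log2 growth of capital per bet when staking the fraction f.

    A correct tip (probability q) multiplies capital by 1 + f;
    a wrong tip multiplies it by 1 - f.
    """
    return q * math.log2(1 + f) + (1 - q) * math.log2(1 - f)

def tip_information(q):
    """Bits per bet conveyed by a tip that is right with probability q: 1 - H(q)."""
    return 1 + q * math.log2(q) + (1 - q) * math.log2(1 - q)

q = 0.75                                          # assumed reliability of the informant
fractions = [k / 1000 for k in range(1000)]       # candidate stakes 0.000 ... 0.999
best_f = max(fractions, key=lambda f: doubling_rate(f, q))
best_rate = doubling_rate(best_f, q)
info_rate = tip_information(q)

n_bets = 100
typical_growth = 2 ** (n_bets * best_rate)        # growth factor implied by the rate

print(best_f, best_rate, info_rate, typical_growth)
```

The best stake found is the fraction 2q - 1. Staking everything (f = 1) would give a larger expected gain on a single bet, but one wrong tip would then wipe out the bettor's capital.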
Following this introduction, I can best explain Kelly's result by quoting the abstract of his paper:

If the input symbols to a communication channel represent the outcomes of a chance event on which bets are available at odds consistent with their probabilities (i.e., "fair" odds), a gambler can use the knowledge given him by the received symbols to cause his money to grow exponentially. The maximum exponential rate of growth of the gambler's capital is equal to the rate of transmission of information over the channel. This result is generalized to include the case of arbitrary odds. Thus we find a situation in which the transmission rate is significant even though no coding is contemplated. Previously this quantity was given significance only by a theorem of Shannon's which asserted that, with suitable encoding, binary digits could be transmitted over the channel at this rate with an arbitrarily small probability of error.

¹ "New Interpretation of Information Rate," Bell System Technical Journal, Vol. 35 (July, 1956), pp. 917-926.

Numerically the factor by which the gambler's initial capital is increased after N bets is

$2^{NR}$

Here R is the average number of bits of information transmitted to the bettor per bet.

If this seems a trivial application of the amount of information in bits, the reader should meditate on the fact that it is the only mathematically established interpretation, other than those concerned with the rate of generation of probable messages and their efficient encoding for transmission, that anyone has discovered.

In advancing information theory, one may seek a new use for information theory rather than a new interpretation of information rate. Thus, in 1949, C. E. Shannon published a long paper entitled "Communication Theory of Secrecy Systems."
² It is doubtful whether this paper has helped substantially in the deciphering of messages, but it has provided, for the first time, a well organized theory of cryptography and cryptanalysis, and it is highly regarded by the expert cryptanalysts. It would be hopeless to try to go into the details of Shannon's work here, but I will try to give an idea of some of its content.

The cryptanalyst who lays hands on a message enciphered by an unknown means is ignorant of two things: the message itself and a specification of the means used to encipher it, which we may call the key. Sometimes, the cryptanalyst may know the general scheme of encipherment. To take a ridiculously simple example, he might know that a simple substitution cipher had been used, that is, for each letter of the alphabet some other letter had been substituted according to a fixed scheme.

² Ibid., Vol. 28 (October, 1949), pp. 656-715.

The cryptanalyst may have a short or a long enciphered message to work with. If the message had only three letters in it, say QXD, these might stand for AND, or BET, or any other English word made up of three different letters. As the message becomes longer, however, the number of possible English texts which could have been encrypted by means of a simple substitution cipher to give the particular message at hand decreases; if the enciphered message is long enough, there will be only one possible source message.

Shannon expressed this decrease of uncertainty as to what message might have been enciphered so as to give the message in question as a change in the equivocation. The equivocation $H_y(x)$ of Chapter VIII gives the uncertainty of what message was enciphered by the general means in question in order to give the received enciphered message. Shannon was able to compute in the case of various ciphers how the equivocation decreases as the number of characters in the message increases.
When the equivocation approaches zero, only one message could have been enciphered to give the enciphered message, and, in principle, the message can be deciphered uniquely.

What other sorts of problems have confronted or now confront information theorists? Some of these problems concern the sampling theorem. Information theorists use the sampling theorem in order to represent a smoothly varying, band-limited signal by means of a sequence of numbers; the sample numbers are the amplitudes of the signal taken every 1/(2W) seconds, where W is the band width of the signal.

The samples which represent a given band-limited signal are not unique; they can be taken at various times. Thus, in Figure XIV-1, either the vertical solid lines or the vertical dashed lines are samples which legitimately represent the function, and samples could have been taken at many other locations. In fact, the samples don't even have to be equally spaced in time, provided that, on the average, there are 2W samples per second!

Fig. XIV-1

A band-limited signal is represented uniquely by 2W samples per second only when all samples from the infinite past to the infinite future are used. Sometimes we would like to talk about a piece of band-limited signal or about a band-limited signal which is almost zero except for some specified range of time, and we would like to describe such a portion of a signal or a signal of limited duration handily in terms of samples.

Our first thought might be, can we merely specify a short signal or a portion of a signal by specifying the values of a finite sequence of samples and saying nothing about samples before or after these?
Alas, specifying such a finite set of samples does not specify just one band-limited signal; many different band-limited signals can be passed through a finite sequence of samples, and, if the signals are very large outside of the range of the specified samples, they can be very different within the range of the specified samples.

This failing, we might say, let us specify certain successive sample values and make all preceding and succeeding samples be zero. Surely, we may think, the band-limited signal so specified will conform closely to the sample values where these are not zero and will be small wherever the samples are specified as zero.

Suppose, for instance, that we insist that all of a set of equally spaced samples after a time $t_0$ are zero, while the samples before the time $t_0$ are nonzero, as shown by the dots in Figure XIV-2. Because the samples are specified for all times past and future, they do specify a unique band-limited signal. Will this signal be nearly zero for times after $t_0$?

Fig. XIV-2

Alas, H. O. Pollak, of the Bell Laboratories, has shown that this need not be so. Suppose we ask, what part of the total energy of the band-limited signal passing through such samples is carried by the part of the wave which occurs ten seconds, or twenty minutes, or fifty years after $t_0$? Remember, all the samples are zero after $t_0$. The surprising answer is that almost half of the energy of the signal can be carried by the part that occurs later than any specified time after the samples become zero. Thus, the signal can be zero at all the samples after $t_0$ and still be large in between them.

Efforts to use the sampling theorem rigorously to represent signals of limited length are in mathematical trouble, and mathematicians are trying to find some way out.
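A small calculation shows how a band-limited signal can vanish at every sample after $t_0$ and still be large between those samples. The sketch below does not reproduce Pollak's result on energy. It assumes one particular choice of samples, made up for illustration (1,000 samples alternating between +1 and -1 before $t_0$, and zero from $t_0$ onward), and builds the signal through them by the usual sampling-theorem interpolation. Time is measured in sample intervals, u = 2Wt, with $t_0$ at u = 0; the function names are hypothetical.

```python
import math

def sinc(x):
    """sin(pi x) / (pi x), with the limiting value 1 at x = 0."""
    if x == 0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def signal(u, samples):
    """Band-limited signal through the given samples.

    u is time in sample intervals (u = 2W * t); samples maps an integer
    sample index to its value, and every sample not listed is zero.
    """
    return sum(value * sinc(u - n) for n, value in samples.items())

# Assumed example: alternating +1 / -1 samples before t0 (index 0), zero afterward.
samples = {n: (1.0 if n % 2 == 0 else -1.0) for n in range(-1000, 0)}

at_samples = [signal(n, samples) for n in range(0, 5)]        # on the zero samples
between = [signal(n + 0.5, samples) for n in range(0, 5)]     # halfway between them

print(at_samples)
print(between)
```

With this choice the contributions of all the earlier samples have the same sign at each midpoint, so they add instead of cancelling: the signal is zero (to rounding error) at the samples, yet between them it exceeds the size of any single sample.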
Work by Pollak and Slepian indicates that neither samples nor sine waves are the most appropriate way to represent band-limited functions of finite duration, and these mathematicians have used a more appropriate group of functions called prolate spheroidal functions for this purpose.

One puzzling matter about information theory may be illustrated by the following example. Suppose that in telegraphy we let a positive pulse represent a dot and a negative pulse represent a dash. Suppose that some practical joker reverses connections so that when a positive pulse is transmitted a negative pulse is received and when a negative pulse is transmitted a positive pulse is received. Because no uncertainty has been introduced, information theory says that the rate of transmission of information is just the same as before. Yet we feel that some damage has been done to the communication system. The damage would be even more appalling if, in a teletypewriter link, we consistently printed out W for A, K for B, and so on, in a completely scrambled fashion.

This bothered Shannon, and he has worked out a theory to cover the situation. In this theory, he establishes a fidelity criterion. Thus, he might assign a given penalty for substituting a consonant for a vowel and a lesser penalty for substituting one vowel for another. He can then assess the damage done to a message by either consistent or random errors. When the damage is done by the random errors of a noisy channel, he shows in principle how to minimize it, and he shows how many bits per second are required to transmit the signal with a given degree of fidelity.

Shannon has also done a considerable amount of work concerning the transmission of messages over networks in which one message may interfere with another message. The simplest case is that of transmission of messages in both directions over the same channel between two points, A and B.
As a very special case, we will assume that the circuit acts the same from B to A as from A to B. Suppose that we plot the channel capacity for transmission from A to B against the channel capacity for transmission from B to A, as shown in Figure XIV-3. We can imagine two very simple cases. In one case, transmission from B to A does not interfere with transmission from A to B, and transmission from A to B does not interfere with transmission from B to A. In this case, the curve consists of the horizontal solid line giving the channel capacity from B to A and the vertical solid line giving the channel capacity from A to B.

Or we can imagine that at one time we can transmit in one direction only, either from A to B or from B to A. Then if we are transmitting from A to B one-third of the time, we can transmit from B to A only two-thirds of the time, and so on. The sum of the channel capacity from B to A and the channel capacity from A to B must be a constant, and the result is the dashed 45° line of Figure XIV-3. In an intermediate case, in which there is some interference between transmission in the two directions, we will get a curve roughly of the form of the dotted line of Figure XIV-3.

Fig. XIV-3 (bits per second, A to B, plotted against bits per second, B to A; curves for no interference, an intermediate case, and complete interference)

The study of efficient encoding continues to command the attention of information theorists. In the case of discrete channels, information theorists continually uncover the most efficient code for correcting A errors in a sequence of B digits, and they continually seek a systematic way of finding best codes, but they have not yet succeeded in this.

Information theorists also seek best codes for transmitting information over a noisy continuous channel.
In 1959, Shannon published a long paper in which he arrived at upper and lower bounds on the attainable error rates for codes of various complexity (that is, length) used in signaling over a continuous channel with Gaussian noise.

Further, engineers who wish to improve electrical communication continually try to find new encoding and transmission schemes which are simple enough to be useful. Partly, they try to encode television and voice signals into as few binary digits per second as they can; the approaches they use have been indicated in Chapter VII. Such efficient encoding will grow in importance as the digital transmission of signals (as in pulse code modulation) becomes more common. It will grow in importance as the encrypting of signals in order to obtain privacy or secrecy becomes more common, for secrecy is best attained by digital means.

Engineers also look for simple and efficient error-correcting codes which will be useful in correcting the multiple errors which occur in the transmission of digital signals over existing telephone circuits. The use of digital transmission in transmitting text and in transmitting business and technical data is growing by leaps and bounds, both in military and in civilian applications. Telephone circuits go almost everywhere. The many future possibilities of data transmission may be realized far, far more quickly if data can be so encoded that it can be transmitted over existing telephone circuits with a satisfactorily low error rate.

Finally, as we have noted in Chapters IX and X, engineers seek new methods of modulation which are better than AM and FM in enabling them to send signals long distances with low powers. When we come to use satellites to relay television and telephone messages from continent to continent we will almost surely use a broad-band method of modulation which calls for only a hundredth of the transmitter power we would need if we used AM.
Thus, a 2-watt microwave transmitter aboard the satellite will be sufficient to relay a television signal from here to London or Paris.

Perhaps the reader finds such matters picayune and unexciting compared with the broad philosophical vistas which information theory seems to open to us. Can an informed understanding, a loving appreciation of the nature, virtues and distinctions among the French impressionists or the Dutch genre painters ever be so meaningful as a sudden and bewildering confrontation with a new and strange world of art, such as the Japanese?

Yet, the connoisseur who pursues with devotion the details of a field may well have as much insight and as sound values as the rapturous dilettante. There is some intellectual obligation to appreciate a field for what it is rather than for the reactions it excites in the minds of the uninformed. I hope that this book has its exciting aspects, but I also hope that it won't lead the reader to a view of information theory widely different from that held by informed workers in the field. Hence, it is perhaps well to end in a sober vein.

APPENDIX: On Mathematical Notation

THE READER WILL FIND a fairly liberal use of mathematical notation in this book, including a number of equations. This may incline him to say the book is full of mathematics. Of course it is. Communication theory is a mathematical theory, and, as this book is an exposition of communication theory, it is bound to contain mathematics. The reader should not, however, confuse the mathematics with the notation used. The book could contain just as much mathematics and not include one symbol or equality sign.

The Babylonians and the Indians managed quite a lot of mathematics, including parts of algebra, without the aid of anything more than words and sentences. Mathematical notation came much later. Its purpose is to make mathematics easier, and it does for anyone who becomes familiar with it.
It replaces long strings of words which would have to be used over and over again with simple signs. It provides convenient names for quantities that we talk about. It presents relations concisely and graphically to the eye, so that one can see at a glance the relations among quantities which would otherwise be strewn through sentences that the eye would be perplexed to comprehend as a whole.

The use of mathematical notation merely expresses or represents mathematics, just as letters represent words or notes represent music. Mathematical notation can represent nonsense or nothing, just as jumbled letters or jumbled notes can represent nothing. Crackpots often write tracts full of mathematical notation which stands for no mathematics at all.

In this book I have tried to put all the important ideas into words in sentences. But, because it is simpler and easier to understand things written concisely in mathematical notation, I have in most cases put statements into mathematical notation also. I have to a degree explained this throughout the book, but here I summarize and enlarge on these explanations. I have also ventured to include a few simple related matters which are not used elsewhere in this book, in the hope that these may be of some general use or interest to the reader.

The first thing to be noted is that letters can stand for numbers and for other things as well. Thus, in Chapter V, $B_j$ stands for a group or sequence of symbols or characters, a group of letters perhaps; j signifies which group. For the first group of letters, j might be 1, and that first group might be AAA, for instance. For another value of j, say, 121, the group of letters might be ZQE.

We often have occasion to add, subtract, multiply, or divide numbers. Sometimes we represent the numbers by letters. Examples of the notations for these operations are:

Addition

    2 + 3        a + d

We read a + d as "a plus d."
We may interpret a + d as the sum of the number represented by a and the number represented by d.

Subtraction

    5 - 4        q - r

We read q - r as "q minus r."

Multiplication

    3 × 5 or 3 · 5 or (3)(5)        u × v or u · v or uv

If we did not use parentheses to separate 3 and 5 in (3)(5), we would interpret the two digits as 35 (thirty-five). We can use parentheses to distinguish any quantities we want to multiply. We could write uv as (u)(v), but we don't need to. We read (3)(5) as 3 times 5, but we read uv as "uv" with no pause between the u and v, rather than as "u times v."

Division

    6 ÷ 3 or $\frac{6}{3}$ or 6/3        $\frac{1}{p}$ or 1/p

We ordinarily read 1/p as "1 over p" rather than as "1 divided by p."

Quantities included in parentheses are treated as one number; thus

    $\frac{(2 + 4)}{3} = \frac{6}{3} = 2$
    (4 + 8)(2) = (12)(2) = 24
    (a + b)c = ac + bc

We read (a + b) either as "a plus b" or as "the quantity a plus b," if just saying "a plus b" might lead to confusion. Thus, if we said "c times a plus b" we might mean ca + b, though we would read ca + b as "ca plus b." If we say "c times the quantity a plus b," it is clear that we mean c(a + b).

The idea of a probability is used frequently in this book. We might say, for instance, that in a string of symbols the probability of the jth symbol is p(j). We read this "p of j."

The symbols might be words, numbers, or letters. We can imagine that the symbols are tabulated; various values of j can be taken as various numbers which refer to the symbols. Table XVI shows one way in which the numbers j can be assigned to the letters of the alphabet.

TABLE XVI

    Value of j    Corresponding Letter
    1             E
    2             T
    3             A
    4             O
    5             N
    6             R
    etc.

When we wish to refer to the probability of a particular letter, N for instance, we could, I suppose, refer to this as p(5), since 5 refers to N in the above table. We'd ordinarily simply write p(N), however.

What is this probability? It is the fraction of the number of
letters in a long passage which are the letter in question. Thus, out of a million letters, close to 130,000 will be E's, so 130,000 Sometimes we speak of probabilities of two things occurring together, either in sequence or simultaneously. For instance, x may stand for the letter we send and/ for the letter we receive. p(x 9 y) is the probability of sending x and receiving y. We read this "p of x, y (we represent the comma by a pause). For instance, we might send the particular letter W and receive the particular letter B. The probability of this particular event would be written /?(W, B). Other particular examples of p(x, y) are p(A, A), p(Q, S), p(E, E), etc. p(x } y} stands for all such instances. We also have conditional probabilities. For instance, if I transmit x, what is the probability of receiving /? We write this conditional probability p x (y)- We read this "/> sub x of 7." Many authors write such a conditional probability p(y \ x\ which can be read as, "the probability of y given x." I have used the same notation which Shannon used in his original paper on communication theory. Let us now write down a simple mathematical relation and interpret it: p(x,y) = p(x)p x (y) That is, the probability of encountering x and y together is the probability of encountering x times the probability of encounter- ing/, when we do encounter x. Or it may seem clearer to say that 282 Symbols, Signals and Noise the number of times we find x and y together must be the number of times we find x times the fraction of times that/, rather than some other letter, is associated with x. We frequently want to add many things up; we represent this by means of the summation sign 2, which is the Greek letter sigma. Suppose that 7 stands for an integer, so thaty may be 0, 1, 2, 3, 4, 5, etc. Suppose we want to represent 0+1+2 + 3+4 + 5 + 6 + 74-8 which of course is equal to 36. We write this 7 = 8 We read this, "the sum ofy fromy equals toy equals 8." The 2 sign means sum. 
The j = 0 at the bottom means to start with 0, and the j = 8 at the top means to stop with 8. The j to the right of the sign means that what we are summing is just the integers themselves.

We might have a number of quantities for which j merely acts as a label. These might be the probabilities of various letters, for instance, according to Table XVII.

TABLE XVII

Value of j    Letter Referred to    Probability of Letter, p(j)
1             E                     .13105
2             T                     .10468
3             A                     .08151
4             O                     .07995
5             N                     .07098
6             R                     .06882
7             I                     .06345
8             S                     .06101
9             H                     .05259
10            D                     .03788
11            L                     .03389
12            F                     .02924
13            C                     .02758
14            M                     .02536
15            U                     .02459
16            G                     .01994
17            Y                     .01982
18            P                     .01982
19            W                     .01539
20            B                     .01440
21            V                     .00919
22            K                     .00420
23            X                     .00166
24            J                     .00132
25            Q                     .00121
26            Z                     .00077

If we wanted to sum these probabilities for all letters of the alphabet we would write

$$\sum_{j=1}^{26} p(j)$$

We read this "the sum of p(j) from j equals 1 to 26." This quantity is of course equal to 1. The fraction of times A occurs per letter plus the fraction of times B occurs per letter, and so on, is the fraction of times per letter that any letter at all occurs, and one letter occurs per letter.

If we just write

$$\sum_{j} p(j)$$

it means to sum for all values of j, that is, for all j that represent something. We read this, "the sum of p(j) over j." If j is a letter of the alphabet, then we will sum over, that is, add up, twenty-six different probabilities.

Sometimes we have an expression involving two letters, such as i and j. We may want to sum with respect to one of these indices. For instance, p(i, j) might be the probability of letter i occurring followed by letter j, as p(Q, V) would be the probability of encountering the sequence QV. We could write, for instance

$$\sum_{j} p(i, j)$$

We read this, "the sum of p of i, j with respect to (or, over) j." This says, let j assume every possible value and add the probabilities. We note that

$$\sum_{j} p(i, j) = p(i)$$

This reads, "the sum of p of i, j over j equals p of i."
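The summation notation above can be tried out numerically. What follows is a minimal sketch in Python, offered as an illustration rather than as anything taken from the book. All variable and function names are hypothetical, the letter probabilities are copied from Table XVII, and the two-letter joint table `p_joint` is invented data chosen only to make the arithmetic easy to follow.

```python
import math

# Sum of j from j = 0 to j = 8: the summand is the integer j itself.
sum_of_integers = sum(j for j in range(0, 9))  # range(0, 9) yields 0, 1, ..., 8

# Sum of p(j) from j = 1 to 26, with p(j) copied from Table XVII.
letters = "ETAONRISHDLFCMUGYPWBVKXJQZ"
table_xvii = [
    .13105, .10468, .08151, .07995, .07098, .06882, .06345, .06101, .05259,
    .03788, .03389, .02924, .02758, .02536, .02459, .01994, .01982, .01982,
    .01539, .01440, .00919, .00420, .00166, .00132, .00121, .00077,
]
p = dict(zip(letters, table_xvii))
sum_of_probabilities = sum(p.values())  # close to 1; the printed entries are rounded

# Hypothetical joint probabilities p(i, j) for a toy two-letter alphabet
# (made-up numbers, not from the book): p(i, j) is the probability of
# letter i occurring followed by letter j.
alphabet = "AB"
p_joint = {("A", "A"): 0.1, ("A", "B"): 0.3,
           ("B", "A"): 0.4, ("B", "B"): 0.2}


def marginal(i):
    """Sum of p(i, j) over j, which should equal p(i)."""
    return sum(p_joint[(i, j)] for j in alphabet)


def conditional(i, j):
    """p_i(j): the fraction of occurrences of i that are followed by j."""
    return p_joint[(i, j)] / marginal(i)


print(sum_of_integers)                   # 36
print(round(sum_of_probabilities, 3))    # 1.0
print(round(marginal("A"), 3))           # 0.4
# Product rule with i, j in place of x, y: p(i, j) = p(i) * p_i(j)
print(math.isclose(p_joint[("A", "B")],
                   marginal("A") * conditional("A", "B")))  # True
```

The last check restates the earlier relation between joint and conditional probabilities, with i and j in place of x and y.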
If we add up the probabilities of a letter followed by every possible letter we get just the probability of the letter, since every time the letter occurs it is followed by some letter.

Besides addition, subtraction, multiplication, and division we also want to represent a number or quantity multiplied by itself some number of times. We do this by writing the number of times the quantity is to be multiplied by itself above and to the right of the quantity; this number is called an exponent.

$2^1 = 2$    "2 to the first (or 2 to the first power) equals 2." 1 is the exponent.
$2^2 = 4$    "2 squared (or 2 to the second) equals 4." 2 is the exponent.
$2^3 = 8$    "2 cubed (or 2 to the third) equals 8." 3 is the exponent.
$2^4 = 16$    "2 to the fourth equals sixteen." 4 is the exponent.

We can let the exponent be a letter, n; thus, $2^n$, which we read "2 to the n," means multiply 2 by itself n times. $a^n$, which we read "a to the n," means multiply a by itself n times. To get consistent mathematical results we must say

$$a^0 = 1$$

"a to the zero equals 1," regardless of what number a may be.

Mathematics also allows fractional and negative exponents. We should particularly note that

$$a^{-n} = \frac{1}{a^n} \quad \text{or} \quad 1/a^n$$

We read $a^{-n}$ as "a to the minus n." We read $1/a^n$ as "one over a to the n."

It is also worth noting that

$$a^n a^m = a^{(n+m)}$$

"a to the n, a to the m equals a to the n plus m." Thus

$$2^3 \times 2^2 = 8 \times 4 = 32 = 2^5$$

or

$$4^{1/2} \times 4^{1/2} = 4^1 = 4$$

A quantity raised to the 1/2 power is the square root of the quantity:

$$4^{1/2} = \text{the square root of } 4 = 2$$

It is convenient to represent large numbers by means of the powers of 10 or some other number:

$$3.5 \times 10^6 = 3{,}500{,}000$$

This is read "three point five times ten to the sixth (or ten to the six)."

The only other mathematical function which is referred to extensively in this book is the logarithm. Logarithms can have different bases. Except in instances specifically noted in Chapter X, all the logarithms in this book have the base 2.
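The rules for exponents given above can be checked with a few lines of code. The sketch below is in Python and is only an illustration under stated assumptions: the helper `power` is a hypothetical name, and it handles whole-number exponents of zero or more by repeated multiplication starting from 1. Starting from 1 is one way to see why a to the zero is taken to be 1. Negative and fractional exponents are handled here by Python's built-in `**` operator.

```python
def power(a, n):
    """Return a to the n for a whole number n >= 0 by repeated multiplication."""
    result = 1               # with no factors at all the product is 1, so a**0 == 1
    for _ in range(n):
        result *= a          # multiply in one more factor of a
    return result


print(power(2, 1), power(2, 2), power(2, 3), power(2, 4))  # 2 4 8 16
print(power(7, 0))                                         # 1
print(power(2, 3) * power(2, 2) == power(2, 3 + 2))        # True: 8 x 4 = 32
print(2 ** -3 == 1 / 2 ** 3)                               # True: a to the minus n is 1 over a to the n
print(4 ** 0.5)                                            # 2.0, the square root of 4
print(3.5 * 10 ** 6)                                       # 3500000.0
```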
The logarithm to the base 2 of a number is the power to which 2 must be raised to equal the number. The logarithm of any number x is written log x and read "log x." Thus, the definition of the logarithm to the base 2, as given above, is expressed mathematically by:

$$2^{\log x} = x$$

That is, "2 to the log x equals x." As an example

$$\log 8 = 3$$

$$2^3 = 8$$

Other logarithms to the base 2 are

x     log x
1     0
2     1
4     2
8     3
16    4
32    5
64    6

Some important properties of logarithms should be noted:

$$\log ab = \log a + \log b$$

$$\log a/b = \log a - \log b$$

$$\log a^c = c \log a$$

As a special case of the last relation,

$$\log 2^m = m \log 2 = m$$

Except in information theory, logarithms to the base 2 are not used. More commonly, logarithms to the base 10 or the base e (e = 2.718 approximately) are used. Let us for the moment write the logarithm of x to the base 2 as $\log_2 x$, the logarithm to the base 10 as $\log_{10} x$, and the logarithm to the base e as $\log_e x$. It is useful to note that

$$\log_2 x = (\log_2 10)(\log_{10} x)$$

$$\log_2 x = 3.32 \log_{10} x$$

$$\log_2 x = (\log_2 e)(\log_e x)$$

$$\log_2 x = 1.44 \log_e x$$

The logarithm to the base e is called the natural logarithm. It has a number of simple and important mathematical properties. For instance, if x is much smaller than 1, then approximately

$$\log_e (1 + x) = x$$

Use is made of this approximation in Chapter X. In the text of the book, by log x we always mean $\log_2 x$.

Glossary

ADDRESS: In a computer, a number designating a part of the memory used to store a number, also the part of the memory which is used to store a number.

ALPHABET: The alphabet, the alphabet plus the space, any given set of symbols or signals from which messages are constructed.

AMPLITUDE: Magnitude, intensity, height. The amplitude of a sine wave is its greatest departure from zero, its greatest height above or below zero.

ATTENUATION: Decrease in the amplitude of a sine wave during transmission.

AUTOMATON: A complicated and ingenious machine.
Elaborate clocks which parade figures on the hour, automatic telephone switching systems, and electronic computers are automata.

AXIS: One of a number of mutually perpendicular lines which constitute a coordinate system.

BAND: A range or strip of frequencies.

BAND LIMITED: Having no frequencies lying outside of a certain band of frequencies.

BAND WIDTH: The width of a band of frequencies, measured in cps.

BINARY DIGIT: A 0 or a 1. 0 and 1 are the binary digits.

BIT: The choice between two equally probable possibilities.

BLOCK: A sequence of symbols, such as letters or digits.

BLOCK ENCODING: Encoding a message for transmission, not letter by letter or digit by digit, but, rather, encoding a sequence of symbols together.

BOLTZMANN'S CONSTANT: A constant important in radiation and other thermal phenomena. Boltzmann's constant is designated by the letter k. $k = 1.37 \times 10^{-23}$ joules per degree Celsius (centigrade).

BROWNIAN MOTION: Erratic motion of very small particles caused by the impacts of the molecules of a liquid or gas.

CAPACITOR: An electrical device or circuit element which is made up of two metal sheets, usually of thin metal foil, separated by a thin dielectric (insulating) layer. A capacitor stores electric charge.

CAPACITY: The capacity of a communication channel is equal to the number of bits per second which can be transmitted by means of the channel.

CHANNEL VOCODER: A vocoder in which the speech is analyzed by measuring its energy in a number of fixed frequency ranges or bands.

CHECK DIGITS: Symbols sent in addition to the number of symbols in the original message, in order to make it possible to detect the presence of or correct errors in transmission.

CLASSICAL: Prequantum or prerelativistic.

COMMAND: One of a number of elementary operations a computer can carry out, e.g., add, multiply, print out, and so on.

COMPLICATED MACHINE: An automaton.
CONTACT: A piece of metal which can be brought into contact with another piece of metal (another contact) in order to close an electric circuit.

COORDINATE: A distance of a point in a space from the origin in a direction parallel to an axis. In three-dimensional space, how far up or down, east or west, north or south a point is from a specified origin.

CORE (magnetic): A closed loop of magnetic material linked by wires. Cores are used in the memory of an electronic computer. Magnetization one way around the core means 1; magnetization the other way around the core means 0.

CPS: Cycles per second, the terms in which frequency is measured.

CYCLE: A complete variation of a sine wave, from maximum, to minimum, to maximum again.

DELAY: The difference between the time a signal is received and the time it was sent.

DETECTION THEORY: Theory concerning when the presence of a signal can be determined even though the signal is mixed with a specified amount of noise.

DIGRAM PROBABILITY: The probability that a particular letter will follow another particular letter.

DIMENSION: The number of numbers or coordinates necessary to specify the position in a space is the number of dimensions in the space. The space of experience has three dimensions: up-down, east-west, north-south.

DIODE: A device which will conduct electricity in one direction but not in the other direction.

DISCRETE SOURCE: A message source which produces a sequence of symbols such as letters or digits, rather than an electric signal which may have any value at a given time.

DISTORTIONLESS: Transmission is distortionless if the attenuation is the same for sine waves of all frequencies and if the delay is the same for sine waves of all frequencies.

DOUBLE-CURRENT TELEGRAPHY: Telegraphy in which use is made of three distinct conditions: no current, current flowing into the wire, and current flowing out of the wire.

ELECTROMAGNETIC WAVE: A wave made up of changing electric and magnetic fields.
Light and radio waves are electromagnetic waves.

ENERGY LEVEL: According to quantum mechanics, a particle (atom, electron) cannot have just any energy, but only one of many particular energies. A particle is in a particular energy level when it has the energy and motion characteristic of that energy level.

ENSEMBLE: All of an infinite number of things taken together, such as all the messages that a given message source can produce.

ENTROPY: The entropy of communication theory, measured in bits per symbol or bits per second, is equal to the average number of binary digits per symbol or per second which are needed in order to transmit messages produced by the source. In communication theory, entropy is interpreted as average uncertainty or choice, e.g., the average uncertainty as to what symbol the source will produce next or the average choice the source has as to what symbol it will produce next. The entropy of statistical mechanics measures the uncertainty as to which of many possible states a physical system is actually in.

EQUIVOCATION: The uncertainty as to what symbols were transmitted when the received symbols are known.

ERGODIC: A source of text is ergodic if each ensemble average, taken over all messages the source can produce, is the same as the corresponding average taken over the length of a message. See Chapter III.

FILTER: An electrical network which attenuates sinusoidal signals of some frequencies more than it attenuates sinusoidal signals of other frequencies. A filter may transmit one band of frequencies and reject all other frequencies.

FINITE-STATE MACHINE: A machine which has only a finite number of different states or conditions. A switch which can be set at any of ten positions is a very simple finite-state machine. A pointer which can be set at any of an infinite number of positions is not a finite-state machine.
FM: Frequency modulation, representing the amplitude of a signal to be transmitted by the frequency of the wave which is transmitted.

FORMANT: In speech sounds, there is much energy in a few ranges of frequency. Strong energy in a particular range of frequencies in a speech sound constitutes a formant. There are two or three principal formants in speech.

FREQUENCY: The reciprocal of the period of a sine wave; the number of peaks per second.

GALVANOMETER: A device used to detect or measure weak electric currents.

GAUSSIAN NOISE: Noise in which the chance that the intensity measured at any time has a certain value follows one very particular law.

HYPERCUBE: The multidimensional analog of a cube.

HYPERSPHERE: The multidimensional analog of a sphere.

INDUCTOR: An electric device or circuit element made up of a coil of highly conducting wire, usually copper. The coil may be wound on a magnetic core. An inductor resists changes in electric current.

INPUT SIGNAL: The signal fed into a transmission system or other device.

JOHNSON NOISE: Electromagnetic noise emitted from hot bodies; thermal noise.

JOULE: A measure or amount of energy or work.

LATENCY: Interval of time between a stimulus and the response to it.

LINE SPEED: The rate at which distinct, different current values can be transmitted over a telegraph circuit.

LINEAR: An electric circuit or any system or device is linear if the response to the sum of two signals is the sum of the responses which would have been obtained had the signals been applied separately. If the output of a device at a given time can be expressed as the sum of products of inputs at previous times and constants which depend only on remoteness in time, the device is necessarily linear.

LINEAR PREDICTION: Prediction of the future value of a signal by means of a linear device.

MAP: To assign on one diagram a point corresponding to every point on another diagram.
MAXWELL'S DEMON: A hypothetical and impossible creature who, without expenditure of energy, can see a molecule coming in a gas which is all at one temperature and act on the basis of this information.

MEMORY: The part of an electronic computer which stores or remembers numbers.

MESSAGE: A string of symbols; an electric signal.

MESSAGE SOURCE: A device or person which generates messages.

NEGATIVE FEEDBACK: The use of the output of a device to change the input in such a way as to reduce the difference between the input and a prescribed input.

NEGATIVE FEEDBACK AMPLIFIER: An amplifier in which negative feedback is used in order to make the output very nearly a constant times the input, despite imperfections in the tubes or transistors used in the amplifier.

NETWORK: An interconnection of resistors, capacitors, and inductors.

NOISE: Any undesired disturbance in a signaling system, such as random electric currents in a telephone system. Noise is observed as static or hissing in radio receivers and as "snow" in TV.

NOISE TEMPERATURE: The temperature a body would have to have in order to emit Johnson noise of an intensity equal to the intensity of an observed or computed noise.

NONLINEAR PREDICTION: Prediction of the future value of a signal by means of a nonlinear device, that is, any device which is not linear.

ORIGIN: The point at which the axes of a coordinate system intersect.

OUTPUT SIGNAL: The signal which comes out of a transmission system or device.

PERIOD: The time interval between two successive peaks of a sine wave.

PERIODIC: Repeating exactly and regularly time after time.

PERPETUAL MOTION: Obtaining limitless mechanical energy or work contrary to physical laws. Perpetual-motion machines of the first kind would generate energy without source. Perpetual-motion machines of the second kind would turn the unavailable energy of the heat of a body which is all at one temperature into ordered mechanical work or energy.
PHASE: A measure of the time at which a sine wave reaches its greatest height. The phase angle between two sine waves of the same frequency is proportional to the fraction of the period separating their peak values.

PHASE SHIFT: Delay measured as a fraction of the period rather than as a time difference.

PHASE SPACE: A multidimensional space in which the velocity and the position of each particle of a physical system is represented by distance parallel to a separate axis.

PHONEME: A class of allied speech sounds, the substitution of one of which for another in a word will not cause a change in meaning. The sounds of b and p are different phonemes, the substitution of one of which for another can change the meaning of a word.

POTENTIAL THEORY: The mathematical study of certain equations and their solutions. The results apply to gravitational fields, to certain aspects of electric and magnetic fields, and to certain aspects of the flow of air and liquids.

POWER: Rate of doing work or of expending energy. A watt is 1 joule per second.

PROBABILITY: In mathematics, a number between 0 and 1 associated with an event. In applications this number is the fraction of times the event occurs in many independent repetitions of an experiment. E.g., the probability that an ideal, unbiased coin will turn up heads is .5.

QUANTUM: A small, discrete amount of energy and, especially, of electromagnetic energy.

QUANTUM THEORY: Physical theory that takes into account the fact that energy and other physical quantities are observed in discrete amounts.

RADIATE: To emit electromagnetic waves.

RADIATION: Electromagnetic waves emitted from a hot body (anything above absolute zero temperature).

RANDOM: Unpredictable.

REDUNDANT: A redundant signal contains detail not necessary to determine the intent of the sender. If each digit of a number is sent twice (110011 instead of 101) the signal or message is redundant.
REGISTER: In a computer, a special memory unit into which numbers to be operated on (to be added, for instance) are transferred.

RELAY: An electrical device consisting of an electromagnet, a magnetic bar which moves when the electromagnet is energized, and pairs of contacts which open or close when the bar moves.

RESISTOR: An electrical device or circuit element which may be a coil of fine poorly conducting wire, a thin film of poorly conducting material, such as carbon, or a rod of poorly conducting material. A resistor resists the flow of electric current.

SAMPLE: The value or magnitude of a continuously varying signal at a particular specified time.

SAMPLING THEOREM: A signal of band width W cps is perfectly specified or described by its exact values at 2W equally spaced times per second.

SERVOMECHANISM: A device which acts on the basis of information received to change the information which will be received in the future in accordance with a specific goal. A thermostat which measures the temperature of a room and controls the furnace to keep the temperature at a given value is a servomechanism.

SIGN: In medicine, something which a physician can observe, such as an elevated temperature. In linguistics, a pictograph or other imitative drawing.

SIGNAL: Any varying electric current deliberately transmitted by an electrical communication system.

SINE WAVE: A smooth, never-ending rising and falling mathematical curve. A plot vs. time of the height of a crank attached to a shaft which rotates at a constant speed is a sine wave.

SINGLE-CURRENT TELEGRAPHY: Telegraphy in which use is made of two distinct conditions: no current and current flowing into or out of the wire.

SPACE: A real or imaginary region in which the position of an object can be specified by means of some number of coordinates.

STATIONARY: A machine, or process, or source of text is stationary, roughly, if its properties do not change with time. See Chapter III.
STATISTICAL MECHANICS: Provides an explanation of the laws of thermodynamics in terms of the average motions of many particles or the average vibrations of a solid.

STATISTICS: In mathematical theories, we can specify or assign probabilities to various events. In judging the bias of an actual coin, we collect data as to how many times heads and tails turn up, and on the basis of these data we make a somewhat imperfect statistical estimate of the probability that heads will turn up. Statistics are estimates of probability on the basis of data. More loosely, "the statistics of a message source" refers to all the probabilities which describe or characterize the source.

STOCHASTIC: A machine or any process which has an output, such as letters or numbers, is stochastic if the output is in part dependent on truly random or unpredictable events.

STORE: Memory.

SUBJECT: A human animal on which psychological experiments are carried out.

SYMBOL: A letter, digit, or one of a group of agreed upon marks. Linguists distinguish a symbol, whose association with meaning or objects is arbitrary, from a sign, such as a pictograph of a waterfall.

SYMPTOM: In medicine, something that the physician can know only through the patient's testimony, such as a headache, as opposed to a sign.

SYSTEM: In engineering, a collection of components or devices intended to perform some over-all functions, such as a telephone switching system. In thermodynamics and statistical mechanics, a particular collection of material bodies and radiation which is under consideration, such as the gas in a container.

TESSERACT: The four-dimensional analog of a cube, a hypercube of four dimensions.

THEOREM: A statement whose truth has been demonstrated by an argument based on definitions and on assumptions which are taken to be true.

THERMAL NOISE: Johnson noise.

THERMODYNAMICS: The branch of science dealing with the transformation of heat into mechanical work and related matters.
TOTAL ENERGY: The total energy of a signal is its average power times its duration. TRANSISTOR: An electronic device making use of electron flow m a solid, which can amplify signals and perform other functions. VACUUM TUBE: An electronic device making use of electron flow in a vacuum, which can amplify signals and perform other functions. VOCODER: A speech transmission system in which a machine at the trans- mitting end produces a description of the speech; the speech itself is not transmitted, but the description is transmitted, and the description is used to control an artificial speaking machine at the receiving end which imitates the original speech. WATT: A power of 1 joule per second, WAVEGUIDE: A metal tube used to transmit and guide very short electro- magnetic waves. WHITE NOISE: Noise in which all frequencies in a given band have equal powers. ZIPF'S LAW: An empirical rule that the number of occurrences of a word in a long stretch of text is the reciprocal of the order of frequency of occurrence. For example, the hundredth most frequent word occurs approximately 1/100 as many times as the most frequent word. Index Abbott's Flatland, 166 Absolute zero, 188, 292 Acoustics, 126; continuous sources in, 59; network theory in, 6 Addition modulo, 161 Addresses, denned, 222, 287; illustrated, 223 Aerodynamics, 20; potential theory in, 6 Aiken, Howard, 220 Algebra, 278; Boolean, 221 Alphabet, denned, 287; see also Letters of alphabet Ambiguity, in sentences, 113-114 Amplification, by radio receivers, 188- 191 Amplifiers, 294; broad-band and nar- row-band, 188; gain in, 2 17; Maser, 191, 194; negative feedback, 216- 218,227 Amplitudes, 131; defined, 31, 287; in pulse code modulation, 132-133; hi samples of band-limited signals, 171-174; see also Attenuation Antennas, 184-185, 192; in interplane- tary communication, 193-194 Approximations, see Word approxima- tions Arithmetic, as mathematical theory, 7, 8; units in computers, 222 Ashby, G. 
Ross, 21 8 Attenuation, 195; defined, 33, 287; in distortionless transmission, 289; by filters, 289; frequency and, 33-34; number of current values and, 38 Automata, 209, 227; defined, 219, 287; examples of, 287 Averages, ensemble, 58-59, 60; time, 58-59, 60 Averback, E., 249 Axes, defined, 287; in multidimensional spaces, 167-169 Ayer, A. J., on importance of commu- nication, 1 Babbitt, Milton, xi Band limited, defined, 287; signals, 170- 182,272-274 Band width, 131, 173-175, 188-189, 192; amount of information trans- missible over, 40, 44; channel capacity and, 178; defined, 38, 287; power and, 178; represented by amplitude, 171 Bands, defined, 287; line speed and, 38 Bell, Alexander Graham, 30 Berkeley, Bishop, 116, 117, 119 Binary digits, 206; addition of, 161-162; alternative number of patterns de- termined by, 71, 73-74; contracted to "bit," 98; computers and, 222, 224, 225; defined, 287; encoding of text in, 74-75, 76-77, 78-80, 83-86, 88-90, 94-98; errors in transmis- sion of, 148-150, 157-163; not necessarily same as "bit," 98-100; stored in computers, 222, 224; in transmission of speech, 148-150, 157-163; "tree of choice" of, 73- 74,99 295 296 Index Binary system of notation, decimal sys- tem and, 69-70, 72-73, 76-77; octal system and, 71, 73 Bit rate, defined, 100 Bits, 8, 66; as contraction of "binary digit," 98; defined, 202, 287; as measurement of entropy, 80-86, 88-94, 98-100; not necessarily same as "binary digit," 98-100; in psychological experiments, 230- 231; per quantum, 197 Black, Harold, 216 Blake, William, 117 Block encoding, 77, 90, 177, 182; check digits in, 159-161; defined, 75, 287; error in, 149, 156-157 Blocks, defined, 75, 287; "distance" between, 161-162; entropy and, 90-93, 94, 97; Huffman code and, 97, 101; length of, 101-103, 127 Blodi compiler, 224 Bodies, forces on, 2-3; hot, 186, 207, 290, 292; in motion, 2 Bolitho, Douglas, 259 Boltzmann, 209 Boltzmann's constant, 188, 202; defined, 287 Boolean algebra, 22 1 "Breaking," in 
modulation systems, 181 Breakthrough, 140 Bricker, P. D., xi Brooks, F.B. Jr., 259 Brownian motion, 29; defined, 185, 288 Cables, linearity of, 33; insulation of, 29; transatlantic, 26, 29, 138, 143; voltage in, 29 Cage, John, 255 Campbell, G. A., 30 Capacitors, 33; defined, 5, 288 Capacity, channel, 97, 98, 106, 155-156, 158-159, 164, 176, 275-276; de- fined, 288; information, 97 Caraot,N.L.S.,20 Channel capacity, 107; band width and, 178; for continuous channel plus noise, 176-177; denned, 97, 106, 164; entropy less than, 98, 106; errors in transmission and, 155- 156; measurement of, 164; with messages in two directions, 275- Channel capacity (Continued} 276; of symmetrical and unsym- metrical binary channels, 158-159, 164-165 Channel vocoder, 138; defined, 288; illustrated, 137 Channels, capacity of, 97, 98, 106, 155- 156, 158-159, 164, 176, 275-276; error-free, 163; noisy, 107, 145-165, 170-182, 276; symmetrical binary, 157-159, 164-165 Check digits, 159-161, 165; defined, 288 Checker-playing computers, 224, 225 Cherry, Colin, 118 Chess-playing computers, 224, 225 Choice, in finite-state machines, 54-56; in language, 253-254; in message sources, 62, 79-80, 81; see also Bits, Freedom of choice Chomsky, Noam, 112-115, 260; Syntac- tic Structures, 113n. 
Ciphers, 64, 271-272 Circuits, accurate transmission by, 43; contacts in, 288; linear, 33, 43-44; relay, 220, 221; undersea, 25 Classical, defined, 288 Codes, in cryptography, 64, 118, 271- 272; error-correcting, 159-163, 165, 276; Huffman, 94-97, 99, 100, 101, 105; in telegraphy, 24-29; see also Encoding, Morse code Coding, see Encoding Commands, 222; defined, 288; illus- trated, 223 Communication, aim of, 79; as encod- ing of messages, 78; interplanetary, 192-194; quantum effects and, 196- 198; see also Language Communication theory (Information theory), 18, 126, 268-269; art and, 250-267; ergodic sources and, 60- 61, 63; as general theory, 8-9; as mathematical theory, ix-x, 9, 18, 60-61, 63, 278; multidimensional geometry in, 170, 181, 183; origins of, 1, 20-44; physics and, 24, 198; psychology and, 229-249; useful- ness of, 8-9, 269 Companding, defined, 132 Compilers, 224 Complicated machines, see Automata Index 297 Computers, 66, 209, 219, 287; cores in, 222, 288; Datatron, 259; decisions of, 221, as finite-state machines, 62; grammar for, 115; literary work by, 264; memories (stores) of, 221-222, 290; music by, 224, 225, 250, 253, 259-261; prediction by, 210-212; programming, 221-225; relay, 220-221; transistor, 220; "understanding" in, 124; uses of, 224-226, vacuum tube, 220; visual arts and, 264-266 Contacts, defined, 288; in relays, 292 Continuous signals, encoding of, 66-68, 78, 131-143, 276; entropy of, 131; frequency of, 67, 131; noise and, 170-182, 276; theory of, 170-182, 203 Coordinates, 167-169; defined, 288 Cores, magnetic, addresses in, 222; defined, 288 Costello, P.M., xi Cps, defined, 3 1,288 Cryptography, theory of, 271-272; see also Codes Currents, electric, detection of, 27, 290; induced by magnetic field, 29n., Max- well on, 5n.; in telephony, 30; values of, 28, 36-38, 44; see also Double- current telegraphy, Noise, Signals, Single-current telegraphy Cybernetics, 41; described, 208-210, 226-227; etymology of, 208 Cycles, Carnot, 20; defined, 
31, 288 Datatron computer, 259 Decimal system of notation, 69, 72-73 Delay, defined, 33, 288; in distortionless transmission, 289; see also Phase shift Detection theory, 208, 215; defined, 288 Dielectrics, in capacitors, 5, 288 Digram probabilities, 51, 56, 57, 92-93; defined, 50, 288 Dimensions, defined, 288; four, 166- 167; infinity of, 167-170; phase spaces as, 167; two, 166; up-down, east-west, north-south, 288; see also One-to-one mapping Diodes, defined, 288 Discrete signals, theory of (Fundamental theorem of the noiseless channel) 98,106, 150-159,163,164,165,203 Discrete sources, 66, 67; defined, 59, 289; noise and, 150-156 Distance, between blocks in code, 161- 162 Distortionless, defined, 34, 289 DNA (Desoxyribonucleic acid), 65 Double-current telegraphy, defined, 27, 289 Dudley, Homer, 136 Dunn, Donald A., 263 Eckert, J. P., 220 Edison, Thomas, quadruplex telegraphy of, 27-29, 38 Efficient encoding, 75-76, 77, 104, 106, 146-147, 276; by block encoding, 101-103, 156-157; for continuous signals, 131-143, 276; by Huffman code, 94-98, 101, 105; in Morse code, 43, 131; number of binary digits needed in, 78-80, 88; prin- ciples of, 142; for TV transmission, 131, 139-143, 276; for voice trans- mission, 131; see also Encoding Effort, economy of, in language, 238- 239, 242 Einstein, Albert, 145, 167; on Brownian motion, 185; on Newton, 269 Electromagnetic waves, 5n.; defined, 289; generation of, 185-186; on wires, 187-188 Electronic digital computers, see Com- puters Encoding, 8, 76, 78; "best" way of, 76, 77, 78-79; into binary notation, 74-75, 76-77, 78-80, 83-86, 88-90, 94-98; channel capacity and, 97- 98; of continuous signals, 66-68, 78, 131-143, 276; cryptographic, 64-65; dangers in, 143-144; of English text, 56, 74-75, 76-80, 88, 94-98, 101-106, 127-129; in FM, 65, 178-179; of genetic information, 64-65; Hagelbarger's method of, 162-163; by Huffman code, 94-97, 100, 105, 128; into Morse code, 24- 27, 65; of nonstationary sources, 59n.; noise and, 42, 44, 
144, 276; by patterns of pulses and spaces, 68-69, 78; physical phenomena and, 184; quantum uncertainty and, 197; of speech, 65, 131-139, 276; in telephony, 65, 276-277; word by word, 93, 143; see also Block encoding, Blocks, Efficient encoding
Encrypting of signals, 276
Energy, 171; electromagnetic, 185; as formants, 290; free, 202, 204-206, 207; of mechanical motion, 185; organized, 202; ratio, in signal and noise samples, 175-179, 182; thermal and mechanical, 202; see also Quanta, Total energy
Energy level, defined, 203, 289
English text, encoding of, 56, 74-75, 76-80, 88, 94-98, 101-106, 127-129; see also Letters of alphabet, Word approximations, Words
Eniac, 220
Ensembles, 41, 42; defined, 57, 289
Entropy, 105; per block, 91, 106; channel capacity and, 98, 106; in communication theory, 23-24, 80, 202, 206, 207; conditional, 153; of continuous signals, 131; defined, 66, 80, 202, 289; estimates of, 91, 106; of finite-state machines, 93-94, 153; formulas for, 81, 84, 85; grammar and, 110, 115; highest possible, 97, 106; Huffman code and, 94-98; human behavior and, 229; of idealized gas, 203-205; per letter of English text, 101-103, 111, 130; measured in "bits," 80-81; message sources and, 23, 81-85, 88-94, 105, 206; in model of noisy communication system, 152-155; per note of music, 258-259; in physics (statistical mechanics and thermodynamics), 21-23, 80, 198, 202, 206, 207; probability and, 81-86; reversibility and, 21-22; of speech, 139; per symbol, 91, 94, 105
Equivocation, defined, 154, 289; in enciphered message, 272; entropy and, 155, 164; in symmetrical binary channels, 155, 164
Ergodic, defined, 289
Ergodic sources, defined, 59, 63; English writers as, 63; as mathematical models, 59-61, 63, 79, 172; probability in, 90, 172
Errors, correction of, 159-163, 165; detection of, 149-150; in noisy channels, 147-165; reduction through redundancy of, 149-150, 163, 164-165; in transmission of binary digits, 148-150,
157-163; see also Check digits, Noise
Exponents, defined, 284
Fano, 94
Feedback, see Negative feedback
Fidelity criterion, 131, 274
Filtering, 208, 210, 215
Filters, 227; defined, 41, 289; in smoothing, 215; in vocoders, 136-138
Finite-state machines, defined, 289; description of, 54-56; electronic digital computers as, 62; entropy of, 93-94, 153; grammar and, 103-104, 111-112, 114-115; illustrated, 55; men as, 62, 103; randomness in, 62
Flatland, 166
Flicker, in motion pictures and TV, 141
FM, "breaking" in, 181; defined, 290; encoding in, 65, 178-179
Foot-pound, 171
Formants, defined, 290
Fortran computer, 224
Fourier, Joseph, analyzes sine waves, 30-34, 43-44
Freedom of choice, 48; entropy and, 81; see also Choice
Frequency, 131; defined, 31, 290; of input and output sine waves, 33; of letters in English, 47-54, 56, 63; quantum effects and, 196; in speech, 135-138; of TV signals, 67; of voice signals, 67; of white noise, 173
Frequency modulation, see FM
Friedman, William F., 48
Fundamental theorem of the noiseless channel, 106; analogous to quantum theory, 203; reasoning of, 150-159, 163, 164, 165; stated, 98, 156
Gabor, Dennis, "Theory of Communication," 43
Gain, in amplifiers, 217-218
Galvanometers, defined, 27, 290
Gambling, information rate and, 270-271
Games, as illustrations of theorems and proofs, 10-14; played by computers, 224, 225, 226
Gas, 62, 293; as example of physical system, 203-206; ideal expansion of, 20n.; motion of molecules in, 185
Gaussian noise, 177, 276; defined, 173-174, 276; monotony of, 251; same as Johnson noise, 192
Genetics, information theory and, 64-65
Geometry, 4; computers and, 224, 225; Euclidean, 7, 14; multidimensional, 166-170; in problems of continuous signals, 181, 182
Gilbert, E. N., xi
Golay, Marcel J.
E., 159
Governors, as servomechanisms, 209, 215-216
Grammar, 253-254; in Chomsky, 112-115; entropy and, 110, 115; finite-state machines and, 103-104, 111-112, 114-115; meaning and, 114-116, 118; phrase-structure, 115; rules of, 109-110
Guilbaud, G. T., 52
Hagelbarger, D. W., 162
Hamming, R. W., 159
Harmon, L. D., 120
Hartley, R. V. L., 42; "Transmission of Information," 39-40, 176
Heat, motion and, 185-186; waves, 186, 187; see also Temperature, Thermodynamics
Heaviside, Oliver, 30
Heisenberg's uncertainty principle, 197
Hertz, Heinrich Rudolf, 5
Hex (a game), 10-13
Hiller, L. A. Jr., 259
Homeostasis, defined, 218
Hopkins, A. L. Jr., 259
Horsepower, defined, 171
Hot bodies, 186, 207, 290, 292
Howes, D. H., 240
Huffman code, 94-97, 99, 101, 105, 128; prefix property in, 100
Hydrodynamics, 20
Hyman, Ray, experiment on information rate by, 230, 232, 234
Hypercubes, defined, 290
Hyperquantization, defined, 132, 142; effects of errors in, 149
Hyperspace, 170; amplitudes as coordinates of point in, 172-175
Hyperspheres, defined, 290; volume of, 170, 173
Ideas, general, 119-120
Imaginary numbers, 170
Inductors, 33; defined, 5, 290
Infinite sets, 16
Information, definitions of, 24; as handled by man, 234; learning and, 230-237, 248-249; measurement of amount of, 80; uncertainty and, 24; per word, 254
Information and Control, 1
Information capacity, 97
Information rate, 91; in man, 230-237, 238-239, 250-251; see also Entropy
Information theory, see Communication theory
Input signals, defined, 32, 290
Institute of Radio Engineers, 1
Insulation, of cables, 29; in capacitors, 5, 288
Interference, current values and, 38; intersymbol, 29; see also Noise
Intersymbol interference, defined, 29
Isaacson, L. M., 259
Janet compiler, 224
Jenkins, H. M., xi
Johnson, J. B., 188
Johnson noise (Thermal noise), 196-199; defined, 290; formulas for, 188-189, 195; same as Gaussian noise, 192; as standard for measurement, 190
Joule, defined, 171, 290
Julesz, Bela, 264
Karlin, J.
E., 232
Kelly, J. L. Jr., 270
Kelvin, Lord, 30
Kelvin degrees, defined, 188
Klein, Martha, 259
Kolmogoroff, A. N., 41, 42, 44, 214
Language, basic, 242; choice in, 253-254; classification in, 122-123; as code of communication, 118; defined, 253; economy of effort in, 238-239, 242; emotions and, 116-117; euphony in, 116-117; everyday vs. scientific, 3-4, 121; experience as basis for, 123, 242, 245; human brain and, 242, 245; meaning in, 114-124; random and nonrandom factors in, 246; understanding and, 117, 123-124; see also Grammar, Words, Zipf's law
Latency, defined, 290; as proportional to information conveyed, 230-232
Learning, experiments in, 230-237, 248-249; by machines, 261
Least effort, Zipf's law and, 238-239
Letters of alphabet, binary digit encoding of, 74-75, 78, 128-129; entropy per, 101-103, 111, 130; frequency of, 47-54, 56, 63; predictability of, 47-52; sequences of, 49-54, 63, 79; suppression of, 48
Licklider, J. C. R., 232
Line speed, defined, 38, 290
Linear prediction, 211-212, 227; defined, 290
Linearity, defined, 32-33, 290; of circuits, 33, 43-44
Liquids, motion of molecules in, 185
Logarithms, bases of, 285; explained, 36, 82, 285-286; Nyquist's relation and, 36-38
Machines, desk calculating, 222; finite-state, 54-56, 62, 93-94, 103-104, 111-112, 114-115, 153, 289; learning by, 261; pattern-matching, 120; perpetual-motion, 198-202, 206, 291; translation, 54, 123, 225; Turing, 114; see also Computers
Man, behavior of, 126, 225; emotions in, 116-117; entropy and, 229; as finite-state machine, 62, 103; information rate in, 230-237, 238-239, 250-251; memory in, 248-249; as message source, 61, 103-104; negative feedback in, 218, 227; see also Language
Mandelbrot, Benoit, xi, 240, 246-247
Map, defined, 290
Mapping, continuous, 16, 179; one-to-one, 14-15, 16-17, 179
Mars, transmission from vicinity of, 192-194
Maser amplifier, 191, 194
Mathematical models, function of, 45-47; ergodic stationary sources as,
59-61, 63, 79, 172; in network theory, 46; to produce text, 56
Mathematics, intuitionist, 16; new tools in, 182; notation in, 278-279; potential theory in, 6; purpose of, 17; theorems and proofs in, 9-17; in words and sentences, 278-279
Mathews, M. V., xi, 253
Mauchly, J. W., 220
Maxwell, James Clerk, 198, 209; Electricity and Magnetism, 5n., 20; equations of, 5-6, 18
Maxwell's demon, 198-200; defined, 290; illustrated, 199
Meaning, language and, 114-124
Memories, 221-222; addresses in, 222, 287; defined, 290
Memory, in man, 248-249
Merkel, J., 230
Message sources, 8; basis of knowledge of, 61; choice in, 62, 79-80, 81; defined, 206, 290; discrete, 59, 66, 67, 150-156, 289; entropy and, 23, 81-85, 88-94, 105, 206; ergodic, 57-61, 63, 79, 90, 172; intermittent, 91; natural structure of, 128; nonstationary, 59n.; simultaneous, 80, 91-92; stationary, 57-59, 172-173; statistics of, 293; as "tossed" coins, 81-85; see also Signals
Messages, defined, 290; increase and decrease of number of, 81; see also Signals
Miller, George A., 248
Minsky, Marvin, 226
Modes, 195
Modulation, improved systems of, 179, 277; noise and, 180-181; see also FM, Pulse code modulation
Molecules, Maxwell's demon and, 198-200; motion of, 185-186
Morse, Samuel F. B., 24-25, 42, 43
Morse code, 24-25, 40, 65, 97, 129, 131; limitations on speed in, 25-27
Motion, Aristotle on, 2; Brownian, 29, 185, 288; energy of, 185; of molecules, 185-186; Newton's laws of, 2-3, 4-5, 8, 18, 20, 203
Mowbray, G.
H., 231
Multiplex transmission, defined, 132
Music, 65, 66; appreciation of, 251-252; composition of, 250-253; electronic composition of, 224, 225, 250, 253, 259-261; Janet compiler in, 224; rules of, 254-255; statistical composition of, 255-259
Nash, see Hex
Negative feedback, 209, 227; in animal organisms, 218; defined, 215, 291; as element of nervous control, 219, 227; linear, 216; nonlinear, 216; unstability in, 216, 227; uses of, 218
Negative feedback amplifiers, 216-218, 227; defined, 291; illustrated, 217
Nervous system, as finite-state machine, 62; negative feedback and, 219, 227
Network theory, 8, 18; in acoustics, 6; defined, 5-6; generality of, 6-7; linearity in, 33; mathematical models in, 46
Networks, 5; defined, 5, 291; as filters, 289; simultaneous transmission of messages over, 275-276
Neumann, P. G., 259
Newman, James R., ix, xi; World of Mathematics, x
Newton, Isaac, 14, 269; laws of, 2-3, 4-5, 8, 18, 20, 145, 203
Noise, 43, 207; attempts to overcome, 29, 146-165, 170-182; "breaking" to, 181; causes of, 184-185; continuous signals and, 170-182; in discrete communication systems, 150-156; electromagnetic, 188-190; encoding and, 42, 44, 144, 276; extraction of signals from, 42, 44; number of current values and, 38; power required with, 192; in radar, 41-42, 213-214; in radio, 145, 184-185, 188-191, 207, 291; from random error in samples, 131-132; as "snow" in TV, 144, 146, 291; in telegraphy, 38, 147; in telephony, 145, 162, 291; in teletypewriter transmission, 146; see also Gaussian noise, Johnson noise, White noise
Noise figure, formula for, 191
Noise temperature, defined, 291; as measure of noisiness of receiver, 190-192
Noiseless channel, theorem of, 98, 106, 150-159, 163, 164, 165, 203
Nonlinear prediction, 213-214; defined, 291
Numbers, imaginary, 170; see also Binary digits, Binary system of notation
Nyquist, Harry, "Certain Factors Affecting Telegraph Speed," 35-39, 176, 182; "Certain Topics in Telegraph
Transmission Theory," 39; on Johnson noise, 188
Off-on signals, encoding by, 68-77; see also Single-current telegraphy
One-molecule heat engine, 200-201, 204
One-to-one mapping, 14-15, 16-17, 179
Origin, defined, 291; see also Points
Output signals, defined, 32, 291
Parabolic reflectors, 189, 193
Particles, indistinguishability of, 7; energy level of, 289
Period, defined, 31, 291
Periodic, defined, 31, 291
Perpetual motion, defined, 291; machines, 198-202, 206
Phase, defined, 31, 291
Phase angle, defined, 291
Phase shift, defined, 33, 291; see also Delay
Phase space, 167; defined, 291
Philosophy, Greek, 125; as reassurance, 117
Phoneme, defined, 291; vocoders, 138
Phrase-structure grammar, 115
Physics, 125-126; communication theory and, 24, 198; theoretical, 4
Pinkerton, Richard C., 258
Pitch, in speech, 135
Plosives, 135-136
Poe, Edgar Allan, The Gold Bug, 64; rhythm of, 116-117
Poincaré, Henri, 30
Points, in multidimensional space, 167-169, 172, 179, 181, 182; see also Origin
Pollack, H. O., 273-274
Potential theory, 6; defined, 292
Power, average, 172-173; band width and, 178; concentrated in short pulses, 177; defined, 292; as measure of strength of signal and noise, 171; noise, 175-179, 182, 189, 192-193; ratio (signal power to noise power), 175-179, 182, 191; receiver, 193; signal, 173-179, 182, 189, 192; transmitter, 193
Predictability, of chance events, 46-47; of letters of alphabet, 47-52, 56, 283; randomness and, 61-62; see also Probability
Prediction, linear, 211-212, 227, 290; nonlinear, 213-214, 291; from radar data, 210-214, 227
Prefix property, defined, 100
Probability, 293; conditional, 51, 281; defined, 292; entropy and, 81-86; of letters in English text, 47-52, 56, 283; in model of noisy communications system, 151-153; of next word or letter in message, 61-62; notation for, 280-284; in word order, 86-87; see also Digram probabilities, Predictability, Trigram probabilities
Programs (for computers), 62, 221-225; decisions in,
221; compilers for, 224
Prolate spheroidal functions, 274
Proofs, by computers, 224, 225; constructive, 13, 15-16; illustrated by "hex," 10-13; illustrated by mapping, 14-17; nature of, 9-17
Psychology, communication theory and, 229-249
Pulse code modulation, 132, 138, 142, 147, 276; in musical composition, 250
Pupin, Michael, 30
Quadruplex telegraphy, 27-29, 38
Quanta, defined, 292; energy of, 196; transmission of bits and, 197
Quantization, defined, 68, 132; in TV signals, 142; see also Hyperquantization
Quantum theory, 125-126, 203; defined, 292; energy level in, 289; unpredictability of signals in, 196
Quastler, H., 232
Radar, noise in, 41-42, 213-214; prediction by, 210-214, 227
Radiate, defined, 186, 292
Radiation, defined, 186, 292; equilibrium of, 187; rates of, 186; see also Quanta
Radiators, good and poor, 186
Radio, noise in, 145, 184-185, 188-191, 207, 291
Radio telescopes, 189-190
Radio waves, as encoding of sounds of speech, 76; see also Electromagnetic waves
Random, defined, 292
Ratio, of signal power to noise power, 175-179, 182, 191
Reading speed, 232-237, 238-241; recognition and, 240
Receivers, efficient, 184-185; measurement of performance of, 190-192
Recognition, and reading speed, 240
Redundancy, 144; defined, 39, 143, 292; as means of reducing error, 149-150, 163, 164-165
Registers, 222; defined, 292
Relays, in circuits, 220, 221; defined, 292; malfunction of, 147
Resistors, 33; defined, 5, 292; hot, 188-189; as source of noise power, 189
Response time, see Latency
Reversibility, indicated by entropy, 21-
Rhoades, M. V., 231
Riesz, R. R., 240
RNA (Ribonucleic acid), 65
Runyon, J.
P., xi
Samples, 78; of band-limited signals, 171-175, 272-274; defined, 292; fidelity criterion and, 131; interval between, 66-68; random errors in, 131-132
Sampling theorem, defined, 66-67; problems in, 272-274
Satellites, 277
Scanning, in TV, 140
Science, history of, 19-21; informed ignorance in, 108; meaning of words in, 3-4, 18, 121
Secrecy, through encrypting of signals, 276; see also Codes
Sentences, ambiguous, 113-114; generation of, 112-115
Servomechanisms, 209; defined, 215, 292
Shannon, Claude E., xi, 43, 87-89, 93, 94, 98, 129-130, 146, 147, 221, 254, 270, 274, 276; "Communication Theory of Secrecy Systems," 271-272; estimate of entropy per letter of English text by, 101-103, 111, 130; fundamental theorem of the noiseless channel of, 98, 106, 150-159, 163, 164, 165, 203; "Mathematical Theory of Communication," ix, 1, 9, 41-42; use of multidimensional geometry by, 170, 173
Shannon, M. E. (Betty), 255, 266
Shepard, R. N., xi
Signals, analog, 140; band-limited, 170-182, 272-274; broad-band, 143; choice of "best" sort of, 42; continuous, 66-68, 78, 131-143, 170-182, 203, 276; defined, 196, 293; demodulated, 181; discrete, 66, 67, 98, 106, 150-159, 163, 164, 165, 203; encrypting of, 276; energy of, 172-175, 182, 294; faint, 214; future values of, 209; input and output, 32, 290-291; as points in multidimensional space, 172, 179, 181, 182; in quantum theory, 196; representation of, by samples, 66-68; unpredictability of, 196; on wires, 187-188; see also Messages, Modulation, Redundancy, Samples
Signs, 119; defined, 292-293
Sine waves, amplitude of, 31, 287; cycles in, 31, 288; defined, 31, 293; Fourier's analysis of, 30-34, 43-44; period of, 31, 291; see also Frequency
Single-current telegraphy (off-on telegraphy), defined, 27, 293
Slepian, David, xi, 162, 214, 256, 274
Smoothing, 208, 210; filters in, 215
Sources, see Message sources
Space, defined, 293; dimensions in, 166-170; Euclidean, 169; function, 181; phase, 167, 291;
signal, 181
Space ships, transmission from, 192-194
Speech, encoding of, 65, 131-139, 276; entropy of, 139; frequencies in, 135-138; pitch in, 135, 136; recognition of, by computers, 224, 225; samples in representation of, 67-68; voiced and voiceless, 135-136; see also Language, Words
Speed, line, 38; reading, 232-237, 238-241; of transmission, 24-27, 36-38, 44, 130-131, 155-156, 164
Sperling, G., 249
Stadler, Maximilian, 255
Stars, radio noise from, 190
States, 203-206, 207; defined, 203; see also Finite-state machines
Stationary, defined, 57-59, 293
Statistical mechanics, defined, 293; entropy in, 22, 202, 206, 207
Statistics, defined, 293; of a message source, 293
Stibitz, G. R., 220, 221
Stochastic, 60, 63; defined, 56, 293; music, 255-259
Stores, 221-222; defined, 293
Subject, defined, 230, 293
Summation sign, explained, 282-284
Switching systems, 219-220, 224, 227, 287
Syllables, reading speed and, 234-236
Symbols, as current values, 28, 36-38; defined, 293; probability of occurrence of, 88-94; primary and secondary, 40; selection of, 39-40
Symptoms, 119; defined, 293
Systems, 202-203; defined, 293; deterministic, 46-47; switching, 219-220, 224, 227
Systems of notation, binary, 69-77; decimal, 69-70; octal, 71
Szilard, L., 21, 198
Telegraphy, noise in, 38, 147; Nyquist on, 35-39, 176, 182; as origin of theory of communication, 20; quadruplex, 27-29, 38; speed of, 25-27, 36-38; and telephony on same wire, 38; see also Double-current telegraphy, Single-current telegraphy
Telephony, 126; continuous sources in, 59; currents in, 30; encoding in, 65, 276-277; intercontinental relay in, 277; military, 132; multiplex transmission in, 132; negative feedback in, 218; noise in, 145, 162, 291; rates of transmission in, 130-131; switching systems in, 219-220, 224, 227, 287; and telegraphy on same wire, 38
Telescopes, radio, 189-190
Teletypes, errors in type-setting, 149; noise in, 146
Television, approximation of picture
signal in, 141; color, 140, 142; efficient encoding in, 139-143, 276; errors in transmission over, 132; intercontinental relay in, 277; interplanetary, 194; quantization in, 132; rates of transmission over, 130-131; scanning in, 140; "snow" in, 144, 146, 291
Temperature, in degrees Kelvin, 188; radiation and, 186-187; radio noise power and, 189-190; of sun, 190
Tessaracts, defined, 167, 294
Theorems, defined, 294; illustrated by "hex," 10-13; illustrated by mapping, 14-17; nature of, 9-17; proven by computers, 124, 126, 224, 225
Theory, classification of, 7-8; function of, 4, 18, 125; general vs. narrow, 4-7; physical vs. mathematical, 6-8; potential, 6, 292; of relativity, 145; unified field, 5; see also Communication theory, Network theory, Quantum theory
Thermal equilibrium, 199-200
Thermal noise, see Johnson noise
Thermodynamics, 19, 293; defined, 294; entropy in, 21-23, 198, 202, 207; laws of, 198-199, 206, 207
Thermostats, 216, 227, 292
"Thinking," by computers, 226
Thomson, William (Lord Kelvin), 30
Tinguely, Jean, 266
Total energy, free energy in, 202; of signal, 172-175, 294
Transistors, 33; in computers, 220; defined, 294; malfunction of, 147
Translating machines, 54, 123, 225
Transmission, accurate, 127; of continuous signals in the presence of noise, 170-182; distortionless, 34, 289; error-free, 163-164; length of time of, 40, 175; microwave, 277; of Morse code, 24-27; multiplex, 132; "predictors" in, 129-130; rates of, 24-27, 36-38, 44, 130-131, 155-156, 164; of samples, 131-132; of TV signals, 139-143, 194; in two directions, 33, 275-276; from vicinity of Mars, 192-194
"Tree of choice," binary digits and, 73-74, 99
Trigram probabilities, 51-52, 93
Tuller, W. G., 43
Turing, A.
M., 226
Turing machines, 114
Uncertainty, Heisenberg's principle of, 197; information as, 24; of message received, 79-80, 105, 163-164; see also Entropy
Understanding, language and, 117, 123-124
Unstability of feedback systems, 216, 227
Vacuum tubes, 33; in computers, 220; defined, 294; malfunction of, 147
Visual arts, appreciation of, 266-267; computers and, 264-267
Vocabulary, size and complexity of, 242-245
Vocoders, defined, 294; described, 136-142; illustrated, 137; types of, 138
Voice, see Speech
Voltage, in cables, 29; in detection of faint signals, 214-215
Volume, in multidimensional geometry, 169-173
Von Neumann, John, 221-222
Watt, defined, 171, 294
Wave guides, 195, 294
Waves, heat, 186, 187; light, 186, 187, 196; radio, 76; of speech sounds, 133-135; see also Electromagnetic waves, Sine waves
White noise, defined, 173, 294; monotony of, 251
Wiener, Norbert, 41, 44, 227; Cybernetics, 41, 208, 219; I Am a Mathematician, 209, 210
Wolman, Eric, xi
Word approximations, to English text, 48-54, 86, 90, 110-111, 246, 261-264; first-order, 49, 53, 86, 246, 261; fourth-order, 110-111; to Latin, 52; second-order, 51, 53, 90, 262; third-order, 51-52, 90; zero-order, 49
Word-by-word encoding, 93, 143
Words, associations to, 118-119; binary-digit encoding of, 75, 77, 78, 86-88, 129; determined by qualities, 119-121; sequences of, 53-56, 79, 110-111, 122; as used in science, 3-4, 18, 121; see also Language, Zipf's law
Wright, W. V., 259
Zipf's law, 238-239; defined, 294; illustrated, 87, 243; other data and, 247

About the Author

Dr. John R. Pierce was born in Des Moines, Iowa, in 1910 and spent his early life in the Midwest. He received his undergraduate education at the California Institute of Technology in Pasadena, and his M.A. and Ph.D. degrees in electrical engineering from the same institution. Dr.
Pierce has been with the Bell Telephone Laboratories since 1936, and is at present Director of Research in Communications Principles at their laboratory in Murray Hill, New Jersey. He lives with his wife and two children in Berkeley Heights, New Jersey. Dr. Pierce's writings have appeared in Scientific American, The Atlantic Monthly, Coronet, and several science fiction magazines. His books include Man's World of Sound (with E. E. David); Electrons, Waves and Messages; Waves and the Ear (with W. A. van Bergeijk and E. E. David); and two highly technical books, Traveling-Wave Tubes and Theory and Design of Electron Beams (the latter has appeared in a Russian translation). Dr. Pierce is a member of several scientific societies, including the National Academy of Sciences, and is a fellow of the Institute of Radio Engineers. He has received the following awards: Eta Kappa Nu, Outstanding Young Electrical Engineer, 1942; Morris Liebmann Memorial Prize, 1947; Stuart Ballantine Medal, 1960.