Data, Randomness, and (More) Noise

My recent post concerning randomness elicited more than the usual number of comments, so perhaps we could profitably consider some more features of data, randomness, and noise.

We usually think of data as some collection of symbols (bytes, in the case of computers) that contains useful information, or at least information as distinct from noise. And we usually think of noise as random corruption of the data. However, we need to be careful how we define information and noise. This came home to me many years ago when I knew a radio astronomer who was discovering useful information about the cosmos by analyzing radio frequency noise. That is, one man's noise was another man's signal. Noise is a slippery concept.

If I am playing with my grandchildren, their childish cries and babbling can be very enjoyable, but if I am trying to listen to a symphony, the same output from them is definitely noise. A similar concept applies to weeds. What is a weed? If I have a border of roses around my house, and thanks to bird droppings we have a stalk of wheat or a tomato popping up, the wheat is a weed (I might keep the tomato). But if a rose should somehow take root in a wheat field, then the rose is a weed. In the same sense, what is noise and what is signal depends on what use we intend to make of the incoming data.

This is different from the information content in the incoming data. My grandchildren might distract me from hearing the symphony, but they do not decrease the information content coming from the speakers. Of course, if I were to re-record the symphony in my living room with a microphone which also picks up the children at play, then I have added two types (at least) of noise. The first is the signal from the children which we can assume is unwanted by most listeners, and the other is truly random noise introduced by the recording process. (Non-random noise is also generated, but that is another story.)

Of course, your computer has no idea what you will do with a file that you read from the hard drive, so it cannot allow harmless noise to pass: it does not know what is harmless noise and what is critical. Therefore more care must be taken in opening a document on a computer than you might take if you copied a letter manually and sent it to a friend. A typo or two probably would not interfere with transmitting the desired message. In fact, inserting your own personal quirks might actually constitute additional information, since it tells something about you; as in: yes, that letter must be from John; he always misspells "separate" as "seperate."
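To make that concrete, here is a minimal sketch of content-blind error detection, assuming a CRC-32 checksum stored alongside the data (a common arrangement in file formats and filesystems; the message bytes below are placeholders):

```python
# The computer cannot judge which corruption is "harmless,"
# so it flags any change at all.
import zlib

data = b"Dear John, let's meet on Tuesday."
stored_checksum = zlib.crc32(data)   # computed when the data was written

corrupted = bytearray(data)
corrupted[10] ^= 0x01                # a single flipped bit of "noise"

if zlib.crc32(bytes(corrupted)) != stored_checksum:
    print("Corruption detected: data no longer matches its checksum.")
```

Note that the checksum says nothing about which byte changed or whether the change matters to a human reader; it only reports that the data is no longer what was written.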

A great deal of effort has been put into optimizing error detection and correction codes to enable data to be stored, recalled, transmitted, and re-stored without introducing unacceptable errors. The tradeoff is often between accuracy and bandwidth. In common speech, we often use twice the bandwidth needed for the information we wish to transmit. You can show this with a simple game. Have someone prepare a message of 100 or so words in normal colloquial English. Assume you have no knowledge of what is to be transmitted, but you know when the person is ready to transmit. Then ask if the first letter is an "A." The sender can only respond yes or no (i.e. 1 bit). Keep asking about different letters until you get the right one, then go on to the next letter, and so on. (Limit the punctuation and other symbols to keep things simple; spaces and periods are necessary.) How many guesses will it take per letter to send the whole message?

People who have not tried this are often amazed to learn that, for messages of reasonable length, most guesses are correct! That is, it takes fewer than two guesses per letter on average. Try it. This gives you an indication of the redundancy of English.
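For the curious, here is a minimal sketch of the game in code, assuming a simple bigram character model as the guesser. The training corpus and message are placeholders, and a human guesser who actually knows English does far better than this toy model:

```python
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz ."

def train_bigrams(corpus):
    """Count, for each character, which characters tend to follow it."""
    follows = defaultdict(Counter)
    prev = " "
    for ch in corpus.lower():
        if ch in ALPHABET:
            follows[prev][ch] += 1
            prev = ch
    return follows

def guesses_per_letter(message, follows):
    """Play the yes/no game, guessing in order of bigram likelihood."""
    total = count = 0
    prev = " "
    for ch in message.lower():
        if ch not in ALPHABET:
            continue
        # Rank candidate letters by how often each followed `prev` in training.
        ranked = sorted(ALPHABET, key=lambda c: -follows[prev][c])
        total += ranked.index(ch) + 1  # each guess costs one yes/no answer
        count += 1
        prev = ch
    return total / count

# Placeholder corpus; any decent chunk of English text will do.
corpus = ("the quick brown fox jumps over the lazy dog. "
          "she sells sea shells by the sea shore. "
          "pack my box with five dozen liquor jugs.")
message = "have someone prepare a message in normal colloquial english."
print(f"{guesses_per_letter(message, train_bigrams(corpus)):.2f} guesses per letter")
```

The better the guesser's model of English, the fewer yes/no answers each letter costs; that gap between the naive 5 bits per character and what a good guesser needs is exactly the redundancy the game reveals.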

Comment from Doug McFarlane (http://twitter.com/KodeSource):

    Sending out the request takes bandwidth, i.e., "Is the next letter an 'A'?"

    But if both sides use an agreed-upon question-and-response algorithm, then the sender only needs to send the responses from the algorithm, based on English language patterns.

    If your "2 bits per letter" is accurate, and normal English takes, say, a minimum of 5 bits per letter (2^5 = 32 unique characters), that is quite a savings.

    Then again, text is known for compressing well anyway, so I'm not sure if this saves space overall. But again, an interesting read.
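For what it's worth, the commenter's back-of-the-envelope comparison works out as below. This is only a sketch: the 2-bits-per-letter figure is the article's informal estimate from the game, not a measured rate, and the 100-character message length is a placeholder.

```python
# Rough arithmetic for the comparison: a 32-symbol alphabet needs 5 bits
# per fixed-width character, versus roughly two 1-bit yes/no answers per
# letter when sender and receiver share the guessing algorithm.
chars = 100                      # placeholder message length
fixed_bits = chars * 5           # ceil(log2(32)) = 5 bits per character
guessing_bits = chars * 2        # ~2 one-bit answers per character
print(f"fixed-width: {fixed_bits} bits, shared guesser: {guessing_bits} bits")
print(f"savings: {1 - guessing_bits / fixed_bits:.0%}")
```

As the commenter notes, a general-purpose text compressor exploits the same redundancy, so the scheme's real appeal is illustrative rather than practical.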