2. DATA AND COMPUTERS
• In the not-so-distant past, computers dealt almost exclusively with numeric and textual data, but
now computers are truly multimedia devices, dealing with a vast array of information categories.
• Computers store, present, and help us modify many different types of data, including:
• Numbers
• Text
• Audio
• Images and graphics
• Video
3. DATA AND COMPUTERS
• We cannot discuss data representation without also talking about data compression, which reduces the amount of space needed to store a piece of data. In the past we needed to keep data small because of storage limitations.
• Today, computer storage is relatively cheap; but now we have an even more
pressing reason to shrink our data: the need to share it with others. The Web and
its underlying networks have inherent bandwidth restrictions that define the
maximum number of bits or bytes that can be transmitted from one place to
another in a fixed amount of time.
4. DATA AND COMPUTERS
• The compression ratio gives an indication of how much compression occurs. The
compression ratio is the size of the compressed data divided by the size of
the original data. The values could be in bits or characters or whatever is
appropriate as long as both values are measuring the same thing. The ratio
should result in a number between 0 and 1.
• The closer the ratio is to zero, the tighter the compression.
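• A small sketch of this calculation in Python (the byte counts below are made up for illustration):

```python
def compression_ratio(compressed_size, original_size):
    """Compressed size divided by original size; closer to 0 means tighter compression."""
    return compressed_size / original_size

# Example: a 3,500-byte file compressed down to 1,200 bytes.
print(compression_ratio(1200, 3500))   # about 0.34, a fairly tight compression
```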
5. DATA AND COMPUTERS
• A data compression technique can be lossless, which means the data can be
retrieved without losing any of the original information. Or it can be lossy, in
which case some information is lost in the process of compaction. Although we
never want to lose information, in some cases the loss is acceptable. When
dealing with data representation and compression, we always face a
tradeoff between accuracy and size.
6. ANALOG AND DIGITAL INFORMATION
• The natural world, for the most part, is continuous and infinite. A number line is continuous,
with values growing infinitely large and small. That is, you can always come up with a number
larger or smaller than any given number. And the numeric space between two integers is
infinite.
• For instance, any number can be divided in half. But the world is not just infinite in a
mathematical sense. The spectrum of colors is a continuous rainbow of infinite shades. Objects
in the real world move through continuous and infinite space. Theoretically, you could always
close the distance between you and a wall by half, and you would never actually reach the
wall.
7. ANALOG AND DIGITAL INFORMATION
• Information can be represented in one of two ways: analog or digital. Analog data is a continuous
representation, analogous to the actual information it represents. Digital data is a discrete
representation, breaking the information up into separate elements.
• A mercury thermometer is an analog device. The mercury rises in a continuous flow in the tube in
direct proportion to the temperature. We calibrate and mark the tube so that we can read the
current temperature, usually as an integer such as 75 degrees Fahrenheit. However, a mercury
thermometer is actually rising in a continuous manner between degrees.
• So at some point in time the temperature is actually 74.568 degrees Fahrenheit, and the mercury is accurately indicating that, even if our markings are not fine enough to note such small changes.
8. ANALOG AND DIGITAL INFORMATION
• Analog information is directly proportional to the continuous, infinite world around us.
• Computers, therefore, cannot work well with analog information. So instead, we
digitize information by breaking it into pieces and representing those pieces
separately.
• Each of the representations we discuss in this chapter has found an appropriate way
to take a continuous entity and separate it into discrete elements. Those discrete
elements are then individually represented using binary digits.
9. BINARY REPRESENTATIONS
• As we undertake the details of representing particular types of data, it’s important to
remember the inherent nature of using binary.
• One bit can be either 0 or 1. There are no other possibilities. Therefore, one bit can
represent only two things.
• For example, if we wanted to classify a food as being either sweet or sour, we would
need only one bit to do it. We could say that if the bit is 0, the food is sweet, and if the
bit is 1, the food is sour.
• But if we want to have additional classifications (such as spicy), one bit is not sufficient.
10. BINARY REPRESENTATIONS
• To represent more than two things, we need multiple bits. Two bits can
represent four things because there are four combinations of 0 and 1 that can
be made from two bits: 00, 01, 10, and 11. So, for instance, if we want to represent which of four possible gears a car is in (park, drive, reverse, or neutral), we need only two bits. Park could be represented by 00, drive by 01, reverse by 10, and neutral by 11.
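• A small sketch of this mapping in Python, along with the general rule for how many bits a given number of classifications requires:

```python
import math

# Two bits give four distinct codes, enough for the four gears.
gear_codes = {"00": "park", "01": "drive", "10": "reverse", "11": "neutral"}
print(gear_codes["10"])            # reverse

# More generally, n bits can distinguish 2**n things, so representing k things
# needs ceil(log2(k)) bits.
print(math.ceil(math.log2(4)))     # 2 bits suffice for 4 gears
print(math.ceil(math.log2(5)))     # a fifth classification would push us to 3 bits
```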
11. REPRESENTING NUMERIC DATA
• Numeric values are the most prevalent type of data used in a computer
system. Unlike other types of data, there may seem to be no need to come up
with a clever mapping between binary codes and numeric data. Since binary
is a number system, there is a natural relationship between the numeric
information and the binary values that we store to represent them. This is true,
in general, for positive integer data. However, there are other issues
regarding the representation of numeric information to consider at this point.
12. REPRESENTING NEGATIVE VALUES
• In the traditional decimal system, a sign (+ or −) is placed before a number’s value, though the positive sign is often assumed. The sign represents the ordering, and the digits represent the magnitude of the number. The classic number line looks something like this, in which a negative sign means the number is to the left of zero and a positive sign means it is to the right of zero.
13. FIXED-SIZED NUMBERS
• If we allow only a fixed number of values, we can represent numbers as just integer values, where half of them represent negative numbers. The sign is determined by the magnitude of the number. For example, if the maximum number of decimal digits we can represent is two, we can let 1 through 49 represent the positive numbers 1 through 49 and let 50 through 99 represent the negative numbers −50 through −1.
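• A minimal sketch of this two-digit scheme in Python (a decimal analogue of the complement representations computers actually use):

```python
# Codes 0-49 stand for themselves; codes 50-99 stand for -50 through -1.
def encode(n):
    assert -50 <= n <= 49
    return n if n >= 0 else n + 100

def decode(code):
    return code if code < 50 else code - 100

print(encode(-3))    # 97
print(decode(97))    # -3

# Addition needs no special sign handling as long as we keep only two digits:
print(decode((encode(20) + encode(-3)) % 100))   # 17
```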
14. REPRESENTING TEXT
• A text document can be decomposed into paragraphs, sentences, words, and
ultimately individual characters. To represent a text document in digital form,
we simply need to be able to represent every possible character that may
appear. The document is the continuous (analog) entity, and the separate
characters are the discrete (digital) elements that we need to represent and
store in computer memory.
15. REPRESENTING TEXT
• There are a finite number of characters to represent. So the general
approach for representing characters is to list them all and assign each a
binary string. To store a particular letter, we store the appropriate bit string.
• So what characters do we have to worry about? There are the 26 letters in
the English language. But uppercase and lowercase letters have to be treated
separately, so that’s really 52 unique characters.
16. REPRESENTING TEXT
• Various punctuation characters also have to be represented, as do the numeric
digits (the actual characters “0”, “1”, through “9”). Even the space character must
have a representation.
• And what about languages other than English? The list of characters we may want
to represent starts to grow quickly once you begin to think about it. Keep in mind
that, as we discussed earlier in this chapter, the number of unique things
(characters in this case) we want to represent determines how many bits we’ll need
to represent any one of them.
17. REPRESENTING TEXT
• A character set is simply a list of characters and the codes used to represent
each one.
• There have been several character sets used over the years, though a few
have dominated. By agreeing to use a particular character set, computer
manufacturers have made the processing of text data easier.
• We explore two character sets in the following sections: ASCII and Unicode.
18. THE ASCII CHARACTER SET
• ASCII stands for American Standard Code for Information Interchange.
• The ASCII character set originally used seven bits to represent each character,
allowing for 128 unique characters. The eighth bit in each character byte was
originally used as a check bit, which helped ensure proper data transmission.
• Later ASCII evolved so that all eight bits were used to represent a character. This eight-bit version is formally called the Latin-1 Extended ASCII character set.
19. THE ASCII CHARACTER SET
• The extended ASCII set allows for 256 characters and includes accented letters
as well as several additional special symbols.
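• A quick way to see these codes is Python’s ord() and chr() functions, which expose the numeric value behind each character; for the basic Latin characters these match the ASCII/Latin-1 assignments:

```python
# Numeric codes behind a few familiar characters.
print(ord('A'), ord('a'), ord('0'), ord(' '))   # 65 97 48 32
print(chr(65), chr(97))                          # A a
print(format(ord('A'), '08b'))                   # 01000001, the 8-bit pattern stored for 'A'
```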
21. THE UNICODE CHARACTER SET
• The extended version of the ASCII character set provides 256 characters, which is enough for
English but not enough for international use. This limitation gave rise to the Unicode character set,
which has a much stronger international influence.
• The goal of the people who created Unicode is nothing less than to represent every character in
every language used in the entire world, including all of the Asian ideograms. It also represents
many additional special-purpose characters such as scientific symbols. To accomplish this, the
Unicode character set uses 16 bits per character.
• Therefore, the Unicode character set can represent 2^16, or over 65 thousand, characters. Compare that to the 256 characters represented in the extended ASCII set.
22. THE UNICODE CHARACTER SET
• The Unicode character set is gaining popularity and is used by many programming
languages and computer systems today. However, the character set itself is still
evolving.
• For consistency, Unicode was designed to be a superset of ASCII. That is, the first
256 characters in the Unicode character set correspond exactly to the extended
ASCII character set, including the codes used to represent them. Therefore,
programs that assume ASCII values are unaffected even if the underlying system
embraces the Unicode approach.
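• A small illustration of this superset relationship, assuming a Unicode-aware language such as Python:

```python
# The first 256 Unicode code points coincide with extended ASCII (Latin-1),
# so code that assumes ASCII values keeps working.
print(ord('A'))         # 65, the same code in ASCII and Unicode
print(ord('é'))         # 233, still within the Latin-1 range
print(ord('日'))        # 26085, well beyond what 8 bits can hold
print(hex(ord('日')))   # 0x65e5
```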
24. TEXT COMPRESSION
• Alphabetic information (text) is a fundamental type of data. Therefore, it is important that we find ways to store and transmit text efficiently between one computer and another. The following sections examine three types of text compression:
keyword encoding
run-length encoding
Huffman encoding
25. KEYWORD ENCODING
• Consider how often you use words such as “the”, “and”, “which”, “that”, and “what”. If these words took up less space (that is, had fewer characters), our documents would shrink in size.
• Even though the savings on each word would be small, such words are used so often in a typical document that the combined savings would add up quickly.
• One fairly straightforward method of text compression is called keyword encoding, in which frequently used words are replaced with a single character. To decompress the document, you reverse the process: replace the single characters with the appropriate full word.
27. KEYWORD ENCODING
• The human body is composed of many independent systems, such as the
circulatory system, the respiratory system, and the reproductive system. Not
only must all systems work independently, they must interact and cooperate as
well. Overall health is a function of the well-being of separate systems, as
well as how these separate systems work in concert.
28. KEYWORD ENCODING
• The human body is composed of many independent systems, such ^ ~
circulatory system, ~ respiratory system, + ~ reproductive system. Not only &
each system work independently, they & interact + cooperate ^ %. Overall
health is a function of ~ %-being of separate systems, ^ % ^ how # separate
systems work in concert.
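• A rough sketch of keyword encoding in Python. The substitution table below is inferred from the example above (the symbols ~, ^, +, &, %, and # standing for “the”, “as”, “and”, “must”, “well”, and “these”); a real scheme must respect word boundaries so that, for example, the “the” inside “they” is not replaced:

```python
import re

# Substitution table inferred from the example above (an assumption, since the
# slide that defines it is not included): each frequent word maps to one symbol.
substitutions = {"as": "^", "the": "~", "and": "+", "must": "&", "well": "%", "these": "#"}

def keyword_encode(text):
    # \b matches whole words only, so "the" inside "they" is left alone.
    pattern = re.compile(r"\b(" + "|".join(substitutions) + r")\b")
    return pattern.sub(lambda m: substitutions[m.group(1)], text)

def keyword_decode(text):
    symbols = {symbol: word for word, symbol in substitutions.items()}
    return re.sub("[" + re.escape("".join(symbols)) + "]",
                  lambda m: symbols[m.group(0)], text)

original = ("Not only must all systems work independently, "
            "they must interact and cooperate as well.")
encoded = keyword_encode(original)
print(encoded)                               # Not only & all systems ... cooperate ^ %.
print(keyword_decode(encoded) == original)   # True
print(len(encoded) / len(original))          # compression ratio for this one sentence
```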
29. RUN-LENGTH ENCODING
• In some situations, a single character may be repeated over and over again in a long sequence. This type of repetition doesn’t generally take place in English text, but often occurs in large data streams, such as DNA sequences. A text compression technique called run-length encoding capitalizes on these situations. Run-length encoding is sometimes called recurrence coding.
30. RUN-LENGTH ENCODING
• In run-length encoding, a sequence of repeated characters is replaced by a
flag character, followed by the repeated character, followed by a single digit
that indicates how many times the character is repeated. For example,
consider the following string of seven repeated ‘A’ characters: AAAAAAA
• If we use the ‘*’ character as our flag, this string would be encoded as:
• *A7
31. RUN-LENGTH ENCODING
• The flag character is the indication that the series of three characters (including the flag) should be decoded into the appropriate repetitious string. All other text is treated regularly. Therefore, the following encoded string:
• *n5*x9ccc*h6 some other text *k8eee
• would be decoded into the following original text:
• nnnnnxxxxxxxxxccchhhhhh some other text kkkkkkkkeee
• The original text contains 51 characters, and the encoded string contains 35 characters, giving us a compression ratio in this example of 35/51, or approximately 0.68.
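• A sketch of this scheme in Python, using the same flag convention (only runs longer than three characters are encoded, and a run is capped at nine because the count is a single digit):

```python
FLAG = '*'

def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        run = 1
        while i + run < len(text) and text[i + run] == text[i] and run < 9:
            run += 1
        if run > 3:                              # shorter runs are cheaper left alone
            out.append(FLAG + text[i] + str(run))
        else:
            out.append(text[i] * run)
        i += run
    return ''.join(out)

def rle_decode(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == FLAG:
            out.append(text[i + 1] * int(text[i + 2]))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

original = "nnnnnxxxxxxxxxccchhhhhh some other text kkkkkkkkeee"
encoded = rle_encode(original)
print(encoded)                            # *n5*x9ccc*h6 some other text *k8eee
print(rle_decode(encoded) == original)    # True
print(len(encoded) / len(original))       # 35/51, about 0.69
```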
32. HUFFMAN ENCODING
• Another text compression technique, called Huffman encoding, is named after its
creator, Dr. David Huffman. Why should the character ‘X’, which is seldom used in
text, take up the same number of bits as the blank, which is used very frequently?
Huffman codes address this question by using variable-length bit strings to
represent each character. That is, a few characters may be represented by five
bits, and another few by six bits, and yet another few by seven bits, and so forth.
This approach is contrary to the idea of a character set, in which each character
is represented by a fixed-length bit string (such as 8 or 16).
34. HUFFMAN ENCODING
• Then the word DOORBELL would be encoded in binary as:
• 1011110110111101001100100
• If we used a fixed-size bit string to represent each character (say, 8 bits), then the binary form of the original string would be 8 characters times 8 bits, or 64 bits. The Huffman encoding for that string is 25 bits long, giving a compression ratio of 25/64, or approximately 0.39.
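• The code table behind the 25-bit DOORBELL string above is on a slide not included here, but the standard Huffman construction can be sketched in Python. Building codes only from the letters of DOORBELL itself, as below, gives different (and here even shorter) bit patterns, so this is an illustration of the algorithm rather than a reproduction of the slides’ table:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Standard Huffman construction: repeatedly merge the two least-frequent
    groups, prefixing their codes with 0 and 1."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {character: code-so-far}).
    heap = [(count, i, {ch: ""}) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in group1.items()}
        merged.update({ch: "1" + code for ch, code in group2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("DOORBELL")
encoded = "".join(codes[ch] for ch in "DOORBELL")
print(codes)
print(encoded, len(encoded), "bits, versus", 8 * len("DOORBELL"), "bits at 8 bits per character")
```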
35. REPRESENTING AUDIO INFORMATION
• We perceive sound when a series of air compressions vibrate a membrane in our ear, which
sends signals to our brain. Thus a sound is defined in nature by the wave of air that interacts
with our eardrum.
• A stereo sends an electrical signal to a speaker to produce sound. This signal is an analog
representation of the sound wave. The voltage in the signal varies in direct proportion to the
sound wave. The speaker receives the signal and causes a membrane to vibrate, which in turn
vibrates the air (creating a sound wave), which in turn vibrates the eardrum. The created
sound wave is hopefully identical to the one that was captured initially, or at least good
enough to please the listener.
36. REPRESENTING AUDIO INFORMATION
• To represent audio information on a computer, we must digitize the sound wave,
somehow breaking it into discrete, manageable pieces. One way to accomplish this
is to actually digitize the analog representation of the sound. That is, take the
electric signal that represents the sound wave and represent it as a series of
discrete numeric values. An analog signal varies in voltage continuously. To digitize
the signal we periodically measure the voltage of the signal and record the
appropriate numeric value. This process is called sampling. Instead of a continuous
signal, we end up with a series of numbers representing distinct voltage levels.
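• A minimal sketch of sampling in Python; the 440 Hz tone, the 8,000-samples-per-second rate, and the one-millisecond duration are all assumptions chosen for illustration:

```python
import math

# Measure an "analog" signal (here a 440 Hz sine wave) at regular intervals
# and keep only the discrete numeric readings.
sample_rate = 8000      # samples per second
frequency = 440         # concert-A tone
duration = 0.001        # sample the first millisecond

samples = [
    math.sin(2 * math.pi * frequency * (n / sample_rate))
    for n in range(int(duration * sample_rate))
]
print(samples)          # 8 discrete values standing in for a continuous wave
```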
37. AUDIO FORMATS
• Over the past few years there have been several popular formats for audio information, including WAV, AU, AIFF, VQF, and MP3. All of these are based on the storage of voltage values sampled from analog signals, but all format the details of the information in different ways and all use various compression techniques to one extent or another.
38. THE MP3 AUDIO FORMAT
• MP3 is short for MPEG-2, audio layer 3 file. MPEG is an acronym for the Moving Picture Experts Group, which is an international committee that develops standards for digital audio and video compression.
• MP3 employs both lossy and lossless compression. First it analyzes the frequency spread and compares it to mathematical models of human psychoacoustics (the study of the interrelation between the ear and the brain), then it discards information that cannot be heard by humans. Then the bit stream is compressed using a form of Huffman encoding to achieve additional compression.
• There are many software tools available on the Web to help you create MP3 files. These tools generally
require that the recording be stored in some other common format, such as WAV, before that data is
converted into MP3 format, significantly reducing the file size.
39. REPRESENTING IMAGES AND GRAPHICS
• Images such as photographs and graphics such as line drawings have common
issues when it comes to their representation and compression. We first look at
the general idea of representing color, then turn to the various techniques for
digitizing and representing visual information.
40. REPRESENTING COLOR
• Color is our perception of the various frequencies of light that reach the
retinas of our eyes. Our retinas have three types of color photoreceptor cone
cells that respond to different sets of frequencies. These photoreceptor
categories correspond to the colors of red, green, and blue. All other colors
perceptible by the human eye can be made by combining various amounts of
these three colors.
41. REPRESENTING COLOR
• Therefore, color is often expressed in a computer as an RGB (red-green-blue) value, which is actually three numbers that indicate the relative contribution of each of these three primary colors. If each number in the triple is given on a scale of 0 to 255, then 0 means no contribution of that color, and 255 means full contribution of that color.
42. REPRESENTING COLOR
• For example, an RGB value of (255, 255, 0) maximizes the contribution of red and green, and minimizes the contribution of blue, which results in a bright yellow.
• The concept of RGB values gives rise to a three-dimensional “color space.”
• The amount of data used to represent a color is called the color depth. It is usually expressed as the number of bits used to represent the color. HiColor is a term that indicates a 16-bit color depth: five bits are used for each number in an RGB value, and the extra bit is sometimes used to represent transparency. TrueColor indicates a 24-bit color depth. Therefore, each number in an RGB value gets eight bits, which gives the range 0 to 255 for each. This results in the ability to represent over 16.7 million unique colors.
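• A small sketch of these color depths in Python (packing the components in red-green-blue order is one common convention):

```python
# TrueColor packs the three RGB components into 24 bits, 8 bits each.
def pack_rgb(r, g, b):
    return (r << 16) | (g << 8) | b

bright_yellow = pack_rgb(255, 255, 0)
print(hex(bright_yellow))    # 0xffff00
print(2 ** 24)               # 16777216 distinct TrueColor values ("over 16.7 million")

# HiColor gives each component only 5 bits (values 0-31), trading range for space.
print(2 ** 15)               # 32768 colors from 15 of the 16 bits
```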
43. DIGITIZED IMAGES AND GRAPHICS
• A photograph is an analog representation of an image. It is continuous across its
surface, with shades of one color blending into another. Digitizing a picture is
the act of representing it as a collection of individual dots called pixels, a term
that stands for picture elements. Each pixel is composed of a single color. The
number of pixels used to represent a picture is called the resolution. If enough
pixels are used (high resolution), and are then presented in the proper order
side by side, the human eye can be fooled into thinking it’s viewing a continuous
picture.
44. DIGITIZED IMAGES AND GRAPHICS
• The storage of image information on a pixel-by-pixel basis is called a raster-graphics
format. There are several popular raster file formats in use, including bitmap (BMP),
GIF, and JPEG.
• A bitmap file is one of the most straightforward graphic representations.
• In addition to a few administrative details, a bitmap file contains the pixel color values of the image from left to right and top to bottom. A bitmap file supports 24-bit TrueColor, though usually the color depth can be specified to reduce the file size. A bitmap file may be compressed using run-length encoding as described earlier.
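• A minimal sketch of raster-order storage in Python; the 2×2 image is made up, and a real BMP file adds header details and has its own conventions for row and channel order:

```python
# Pixel values written left to right, top to bottom.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: red, green
    [(0, 0, 255), (255, 255, 0)],    # bottom row: blue, yellow
]

raster = [channel for row in image for pixel in row for channel in pixel]
print(raster)
# [255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 0]
```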
45. DIGITIZED IMAGES AND GRAPHICS
• The GIF format (Graphics Interchange Format), developed by CompuServe in 1987, uses
indexed color exclusively to reduce file size, which limits the number of available colors to 256.
If even fewer colors are required, the color depth can usually be specified to fewer bits. GIF
files are best used for graphics and images with few colors, and are therefore considered
optimal for line art.
47. REPRESENTING VIDEO
• Video information is one of the most complex types of information to capture
and compress to get a result that makes sense to the human eye.
• Video clips contain the equivalent of many still images, each of which must be compressed. The Web is full of video clips that are choppy and hard to follow. This situation will likely improve over the next few years, depending on the evolving sophistication of video compression techniques, which are referred to as video codecs.
48. VIDEO CODECS
• Codec stands for COmpressor/DECompressor. A video codec refers to the
methods used to shrink the size of a movie to allow it to be played on a
computer or over a network.
• Most codecs are block oriented, meaning that each frame of a video is
divided into rectangular blocks. The codecs differ in how the blocks are
encoded. Some video codecs are accomplished completely in software, while
others require special hardware.
49. VIDEO CODECS
• Video codecs employ two types of compression: temporal and spatial. Temporal
compression looks for differences between consecutive frames. If most of an image
in two frames hasn’t changed, why should we waste space to duplicate all of the
similar information? A keyframe is chosen as the basis to compare the differences,
and its entire image is stored. For consecutive images, only the changes (called
delta frames) are stored.
• Temporal compression is effective in video that changes little from frame to frame,
such as a scene that contains little movement.
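• A minimal sketch of the keyframe/delta idea in Python, using made-up pixel values:

```python
# Store a keyframe in full, then only the pixels that changed in the next frame.
keyframe   = [10, 10, 10, 200, 200, 10, 10, 10]
next_frame = [10, 10, 10, 200, 201, 10, 10, 10]   # almost identical

delta = [(i, new) for i, (old, new) in enumerate(zip(keyframe, next_frame)) if old != new]
print(delta)                             # [(4, 201)] -- one changed pixel instead of eight values

# Reconstructing the frame from the keyframe plus the delta:
reconstructed = list(keyframe)
for i, value in delta:
    reconstructed[i] = value
print(reconstructed == next_frame)       # True
```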
50. VIDEO CODECS
• Spatial compression removes redundant information within a frame. This problem is
essentially the same as that faced when compressing still images. Spatial video
compression often groups pixels into blocks (rectangular areas) that have the same color,
such as a portion of a clear blue sky.
• Instead of storing each pixel, the color and the coordinates of the area are stored (see the sketch at the end of this slide). This idea is similar to run-length encoding.
• Various video codecs are popular today, including Sorenson, Cinepak, MPEG, and Real Video.
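• A minimal sketch of the spatial-compression idea described above, using a made-up 4×4 block of sky:

```python
# Replace a block of identically colored pixels with the color plus the block's
# position and size, rather than storing every pixel.
SKY_BLUE = (135, 206, 235)
block = {"color": SKY_BLUE, "top_left": (0, 0), "width": 4, "height": 4}

stored_values = 3 + 2 + 2            # color channels + corner coordinates + dimensions
uncompressed_values = 4 * 4 * 3      # 16 pixels, 3 channels each
print(stored_values, "values instead of", uncompressed_values)
```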