Aviraj --floating point representation and arithmetic.pptx

FLOATING POINT REPRESENTATION
AND ARITHMETIC
Presentation Topic
B.Sc. CS (H) I Year, Computer Organization and Architecture
SIR CHHOTU RAM ENGG. INSTITUTE & TECH. CCS UNIV. CAMPUS, MEERUT

INTRODUCTION
•Objective: To understand how to represent floating point numbers
in the computer and how to perform arithmetic with them.
•Approximate arithmetic
–Finite Range
–Limited Precision
•Topics
–IEEE format for single and double precision floating point numbers
–Floating point addition
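
The two limits named above (finite range, limited precision) can be seen directly in any language that uses IEEE doubles. The following is a minimal sketch in Python, which is an assumption of this example and not part of the original slides; the variable names are illustrative only.

```python
import sys

# Limited precision: 0.1 and 0.2 have no exact binary representation,
# so the rounded sum is not the same double as the rounded value of 0.3.
sum_matches = (0.1 + 0.2 == 0.3)
print(sum_matches)                  # False

# Finite range: the largest finite double, and what happens beyond it.
largest = sys.float_info.max
overflowed = largest * 2            # exceeds the range and becomes infinity
print(overflowed == float("inf"))   # True
```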

FLOATING POINT
• An IEEE floating point representation consists of
–A Sign Bit (no surprise)
–An Exponent (“times 2 to the what?”)
–Mantissa (“Significand”), which is assumed to be 1.xxxxx
(thus, one bit of the mantissa is implied as 1)
–This is called a normalized representation
• So a stored mantissa of all 0s is interpreted as 1.0, and a
stored mantissa of 1111 is interpreted as 1.1111 (binary)

FLOATING POINT STANDARD
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
–Portability issues for scientific code
• Now almost universally adopted
• Two representations
–Single precision (32-bit)
–Double precision (64-bit)

IEEE Floating-Point Format
• S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
• Normalize significand: 1.0 ≤ |significand| < 2.0
– Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly
(hidden bit)
– Significand is Fraction with the “1.” restored
• Exponent: excess representation: actual exponent + Bias
– Ensures exponent is unsigned
– Single: Bias = 127; Double: Bias = 1023
• Field layout: S | Exponent | Fraction
– Exponent: single 8 bits, double 11 bits
– Fraction: single 23 bits, double 52 bits
• Value: $x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
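
As a rough illustration of the format, the sketch below splits a single-precision value into its S, Exponent, and Fraction fields and rebuilds the value as (-1)^S × (1 + Fraction) × 2^(Exponent - Bias). The function names `decode_single` and `value_from_fields` are hypothetical, Python's standard `struct` module is assumed, and only normalized numbers are handled (zero, denormals, infinities, and NaN are outside what the slides cover).

```python
import struct

def decode_single(x):
    """Round x to single precision and return (S, Exponent, Fraction) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, still biased
    fraction = bits & 0x7FFFFF         # 23 stored bits, hidden 1 not included
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, bias=127, fraction_bits=23):
    """Rebuild a normalized value from its fields (no special cases handled)."""
    significand = 1 + fraction / 2 ** fraction_bits   # restore the hidden "1."
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

fields = decode_single(6.5)            # 6.5 = 1.625 x 2^2
print(fields)                          # (0, 129, 5242880)
print(value_from_fields(*fields))      # 6.5
```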

FLOATING-POINT EXAMPLE
• Represent –0.75
– –0.75 = $(-1)^{1} \times 1.1_2 \times 2^{-1}$
– S = 1
– Fraction = $1000\ldots00_2$
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = $01111110_2$
• Double: –1 + 1023 = 1022 = $01111111110_2$
• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
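
The encoding of –0.75 can be checked mechanically. This sketch assumes Python's standard `struct` module; the helper names `single_bits` and `double_bits` are hypothetical and simply print the sign, exponent, and fraction fields separated by spaces.

```python
import struct

def single_bits(x):
    """Return the 32-bit pattern of x as 'S EEEEEEEE F...F' (1, 8, 23 bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

def double_bits(x):
    """Return the 64-bit pattern of x as 'S E...E F...F' (1, 11, 52 bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = format(bits, "064b")
    return f"{s[0]} {s[1:12]} {s[12:]}"

print(single_bits(-0.75))   # sign 1, exponent 01111110, fraction 1000...0
print(double_bits(-0.75))   # sign 1, exponent 01111111110, fraction 1000...0
```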

FLOATING-POINT EXAMPLE
• What number is represented by the single-precision float
1 10000001 01000…00?
– S = 1
– Fraction = $01000\ldots00_2$
– Exponent = $10000001_2$ = 129
• $x = (-1)^{1} \times (1 + 0.01_2) \times 2^{(129 - 127)}$
$= (-1) \times 1.25 \times 2^{2}$
$= -5.0$
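
The same decoding can be confirmed by handing the bit pattern to the machine. A minimal sketch, assuming Python's standard `struct` module; the variable names are illustrative only.

```python
import struct

# S = 1, Exponent = 10000001, Fraction = 01 followed by 21 zeros (23 bits total)
pattern = "1" + "10000001" + "01" + "0" * 21
bits = int(pattern, 2)

# Reinterpret the 32-bit integer as a single-precision float.
(value,) = struct.unpack(">f", struct.pack(">I", bits))

print(hex(bits))   # 0xc0a00000
print(value)       # -5.0
```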

Representation of Floating Point Numbers
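•IEEE 754 double precision (64 bits; the bit positions below refer to the high-order 32-bit word)
–Bit 31: sign s
–Bits 30 to 20: biased exponent E (11 bits)
–Bits 19 to 0: upper 20 bits of the normalized mantissa; the low-order word holds the remaining 32 bits (52 stored bits plus the implicit leading 1 give 53 bits of precision)
•Value: $(-1)^{s} \times F \times 2^{E-1023}$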
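
The bit positions on this slide (31, 30 to 20, 19 to 0) describe the high-order 32-bit word of a double; the low-order word carries the remaining 32 mantissa bits. The sketch below follows that two-word view, assuming Python's standard `struct` module. The names `double_fields` and `double_value` are hypothetical, and only normalized values are handled.

```python
import struct

def double_fields(x):
    """Return (s, E, mantissa) using the two-word layout of a 64-bit double."""
    high, low = struct.unpack(">II", struct.pack(">d", x))
    sign = high >> 31                            # bit 31 of the high word
    biased_exponent = (high >> 20) & 0x7FF       # bits 30..20 of the high word
    mantissa = ((high & 0xFFFFF) << 32) | low    # bits 19..0 plus the low word
    return sign, biased_exponent, mantissa

def double_value(sign, biased_exponent, mantissa):
    """Evaluate (-1)^s * F * 2^(E - 1023) with F = 1 + mantissa / 2^52."""
    f = 1 + mantissa / 2 ** 52                   # restore the implicit 53rd bit
    return (-1) ** sign * f * 2.0 ** (biased_exponent - 1023)

print(double_fields(-0.75))                      # (1, 1022, 2251799813685248)
print(double_value(*double_fields(-0.75)))       # -0.75
```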

FLOATING POINT ARITHMETIC
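• fl(x) = nearest floating point number to x
• Relative error (precision = s digits, base β)
– $|x - \mathrm{fl}(x)| / |x| \le \tfrac{1}{2}\beta^{1-s}$; for β = 2 this is $2^{-s}$
• Arithmetic
– $x \oplus y = \mathrm{fl}(x + y) = (x + y)(1 + \varepsilon)$ for $|\varepsilon| < u$
– $x \otimes y = \mathrm{fl}(x \times y) = (x \times y)(1 + \varepsilon)$ for $|\varepsilon| < u$
– Here u is the unit roundoff, i.e. the relative error bound above
• ULP (Unit in the Last Place): the smallest possible increment or decrement
that can be made to a given value using the machine's FP arithmetic.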
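
These bounds can be checked numerically for IEEE doubles. The sketch below is an illustration only: it assumes Python 3.9 or later (for `math.ulp`), round-to-nearest arithmetic, and uses exact rational arithmetic from the standard `fractions` module to measure the errors. The variable names are not from the slides.

```python
import math
import sys
from fractions import Fraction

s = sys.float_info.mant_dig        # 53 significand bits for a double
u = Fraction(1, 2 ** s)            # unit roundoff for base 2: 2^-s

# Representation error: the literal 0.1 is fl(1/10).
x = Fraction(1, 10)
rep_err = abs(x - Fraction(0.1)) / abs(x)
print(rep_err <= u)                # True

# Addition error: fl(a + b) = (a + b)(1 + eps) with |eps| < u.
a, b = 0.1, 0.2
exact_sum = Fraction(a) + Fraction(b)      # exact sum of the two doubles
eps = (Fraction(a + b) - exact_sum) / exact_sum
print(abs(eps) < u)                # True

# ULP: the spacing between 1.0 and the next larger double.
print(math.ulp(1.0) == 2.0 ** -52) # True
```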

FLOATING POINT ADDITION
• Consider a 4-digit decimal example
– $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
• 1. Align decimal points
– Shift number with smaller exponent
– $9.999 \times 10^{1} + 0.016 \times 10^{1}$
• 2. Add significands
– $9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
• 3. Normalize result & check for over/underflow
– $1.0015 \times 10^{2}$
• 4. Round and renormalize if necessary
– $1.002 \times 10^{2}$
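
The four steps can be sketched on integer significands. This is a simplified model and not how hardware does it: the function name `add_decimal_fp` is hypothetical, both operands are assumed positive and normalized, digits shifted out during alignment are simply dropped as on the slide, rounding is assumed to be round-half-up, and the over/underflow check is omitted. A significand of 9999 with exponent 1 stands for 9.999 × 10^1.

```python
def add_decimal_fp(sig_a, exp_a, sig_b, exp_b, digits=4):
    """Add two positive decimal floats given as (integer significand, exponent)."""
    # Step 1: align by shifting the operand with the smaller exponent right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b //= 10 ** (exp_a - exp_b)        # 1610 becomes 16, i.e. 0.016 x 10^1

    # Step 2: add the significands.
    total = sig_a + sig_b                  # 9999 + 16 = 10015, i.e. 10.015 x 10^1
    exp = exp_a

    # Step 3: normalize if the sum has more digits than the format allows.
    extra = len(str(total)) - digits
    if extra > 0:
        exp += extra                       # 1.0015 x 10^2
        # Step 4: round half up to the allowed number of digits.
        total = (total + 5 * 10 ** (extra - 1)) // 10 ** extra
        # Renormalize if rounding carried into a new digit (kept to mirror
        # step 4; it cannot trigger for a sum of two 4-digit operands).
        if total == 10 ** digits:
            total //= 10
            exp += 1
    return total, exp

print(add_decimal_fp(9999, 1, 1610, -1))   # (1002, 2), i.e. 1.002 x 10^2
```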
