begin
    using Pkg
    Pkg.activate(mktempdir())
    Pkg.add("PyPlot")
    Pkg.add("PlutoUI")
    using PlutoUI, PyPlot
end
Number representation
Apart from the concrete names of the Julia library functions, everything in this chapter is valid for all modern programming languages and computer systems.
All data in computers are stored as sequences of bits. For concrete number types, the bitstring function returns this information as a sequence of 0 and 1. The sizeof function returns the number of bytes in the binary representation.
Integer numbers
Int16
T_int=Int16
1
i=T_int(1)
2
sizeof(i)
"0000000000000001"
bitstring(i)
Positive integer numbers are stored via their representation in the binary system. For a negative number $n$, the two's complement $2^N - |n|$ (with $N$ the number of bits) is stored. typemin and typemax return the smallest and largest numbers which can be represented in a given number type.
-32768
32767
32767
typemin(T_int),typemax(T_int),2^(8*sizeof(T_int)-1)-1
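For example (an illustrative check of the two's complement storage): the two's complement of 1 is $2^{16}-1$, i.e. all 16 bits are set.

bitstring(T_int(-1))   # expected: "1111111111111111"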
Integer arithmetic is exact as long as the result stays within the representable range; if that range is exceeded, the result silently wraps around:
10
3+7
-32759
typemax(T_int)+T_int(10)
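If an overflow should raise an error instead of silently wrapping around, Julia's checked arithmetic can be used. A minimal sketch:

begin
    using Base.Checked: checked_add
    checked_add(T_int(3), T_int(7))            # 10, as before
    # checked_add(typemax(T_int), T_int(10))   # would throw an OverflowError
end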
Floating point numbers
How does this work for floating point numbers?
0.30000000000000004
0.1+0.2
But this should be 0.3. What is happening ???
Real number representation
Let us think about the representation of real numbers. Usually we write them as decimal fractions and cut the representation off if the number of digits is infinite.
Any real number $x\in\mathbb{R}$ can be expressed via the representation formula
$$x = \pm \sum_{i=0}^{\infty} d_i \beta^{-i}\, \beta^e$$
with base $\beta\in\mathbb{N},\ \beta\geq 2$, significand (or mantissa) digits $d_i\in\{0,\dots,\beta-1\}$ and exponent $e\in\mathbb{Z}$. The representation is infinite for periodic decimal numbers and irrational numbers.
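As an illustration, the first significand digits and the exponent of a positive number in a given base can be computed as follows. This is a hedged sketch; digits_of is a hypothetical helper introduced only for this example:

# extract the first n significand digits d_0,d_1,... and the exponent e of x>0 in base β
function digits_of(x; β=10, n=10)
    e = floor(Int, log(β, x))     # exponent such that β^e ≤ x < β^(e+1)
    s = x / float(β)^e            # significand in [1, β)
    d = zeros(Int, n)
    for i in 1:n
        d[i] = floor(Int, s)      # next digit
        s = (s - d[i]) * β
    end
    return d, e
end;

digits_of(1/3)        # expected ([3,3,3,3,3,3,3,3,3,3], -1): the decimal representation is periodic
digits_of(0.1, β=2)   # expected ([1,1,0,0,1,1,0,0,1,1], -4): 0.1 is periodic in binary, too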
Scientific notation
The scientific notation of real numbers is derived from this representation in the case of $\beta=10$. Consider, e.g., $x=$ 6.022e23. Then
$$x = +\bigl(6\cdot 10^{0} + 0\cdot 10^{-1} + 2\cdot 10^{-2} + 2\cdot 10^{-3}\bigr)\cdot 10^{23},$$
i.e. $d=(6,0,2,2,0,\dots)$ and $e=23$. This representation is not unique: e.g. 0.6022e24, i.e. $x = 0.6022\cdot 10^{24}$, describes the same number.
IEEE754 standard
This is the current standard format for storing floating point numbers. It was developed in the 1980s.
$\beta = 2$, therefore $d_i\in\{0,1\}$
Truncation to fixed finite size: $x = \pm \sum_{i=0}^{t-1} d_i \beta^{-i}\, \beta^e$
$t$: significand (mantissa) length
Normalization: assume $d_0=1$ to save one bit for the storage of the significand. This requires a normalization step after operations which adjusts significand and exponent of the result.
$k$: exponent size. The $k$ exponent bits determine the range $L\leq e\leq U$ of admissible exponents.
Extra bit for sign
Storage size: $(t-1) + k + 1$ bits
Standardized for most modern languages
Hardware support usually for 64bit and 32bit
| precision | Julia | C/C++ | k | t | bits |
|-----------|---------|-------------|----|-----|------|
| quadruple | n/a | long double | 15 | 113 | 128 |
| double | Float64 | double | 11 | 53 | 64 |
| single | Float32 | float | 8 | 24 | 32 |
| half | Float16 | n/a | 5 | 11 | 16 |
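The $k$ and $t$ columns for the types available in Julia can be cross-checked with the built-in precision function, which returns $t$ (counting the hidden bit). A hedged sketch:

for T in (Float64, Float32, Float16)
    t = precision(T)
    k = 8*sizeof(T) - t    # the sign bit and the t-1 stored mantissa bits leave k exponent bits
    println(T, ": k=", k, ", t=", t, ", bits=", 8*sizeof(T))
end
# expected output: Float64: k=11, t=53, bits=64, and so on, matching the table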
See also the Julia documentation on floating point numbers, 0.30000000000000004.com, Wikipedia and the links therein.
The storage sequence is: Sign bit, exponent, mantissa.
Storage layout for a normalized Float32 number ($d_0=1$):
bit 1: sign
bits 2...9: the $k=8$ exponent bits; the value $e + 2^{k-1}-1 = e + 127$ is stored, so no sign bit is needed for the exponent
bits 10...32: the $t-1=23$ mantissa bits $d_1\dots d_{23}$; $d_0=1$ is not stored ("hidden bit")
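For example (an illustrative check, with the field boundaries annotated by hand), Float32(1.0) has sign bit 0, stored exponent $0+127$ = 0b01111111 and an all-zero mantissa:

bitstring(1.0f0)   # expected "00111111100000000000000000000000": 1 sign bit, 8 exponent bits, 23 mantissa bits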
Julia provides functions to obtain the significand and the exponent of a floating point number:
2.0
x0=2.0
1.0
1
significand(x0),exponent(x0)
We can calculate the length $k$ of the exponent field from the maximum representable floating point number by taking the base-2 logarithm of its exponent:
exponent_length(T::Type{<:AbstractFloat})=Int(log2(exponent(floatmax(T))+1)+1);
The size $t$ of the significand is calculated from the overall size of the representation minus the size of the exponent and the sign bit, plus 1 for the "hidden bit".
significand_length(T::Type{<:AbstractFloat})=8*sizeof(T)-exponent_length(T)-1+1;
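These helpers should reproduce the $k$ and $t$ columns of the table above, e.g.:

exponent_length(Float64), significand_length(Float64)   # expected (11, 53)
exponent_length(Float16), significand_length(Float16)   # expected (5, 11)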
This allows us to define a more readable variant of the bitstring representation for floats.
The sign bit is the first bit in the representation:
signbit(x::AbstractFloat)=bitstring(x)[1:1];
Next comes the exponent:
exponent_bits (generic function with 1 method)
exponent_bits(x::AbstractFloat)=bitstring(x)[2:exponent_length(typeof(x))+1]
And finally, the significand:
significand_bits(x::AbstractFloat)=bitstring(x)[exponent_length(typeof(x))+2:8*sizeof(x)];
Put them together:
floatbits(x::AbstractFloat)=signbit(x)*"_"*exponent_bits(x)*"_"*significand_bits(x);
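A quick usage example; the same bit pattern appears again in the Float16 examples below:

floatbits(one(Float16))   # expected "0_01111_0000000000"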
Julia floating point types
Float16
T=Float16
Type Float16:
size of exponent: 5
size of significand: 11
Float16(0.1)
x=T(0.1)
Binary representation: 0_01011_1001100110
Exponent: e=-4, stored as e+15=11
$d_0=1$ is assumed implicitly.
Numbers which are exactly represented in the decimal system may not be exactly represented in the binary system! Such numbers are always rounded to a finite approximate value.
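For instance, the Float16 value stored for 0.1 can be inspected by converting it to a wider type (a hedged illustration):

Float64(T(0.1))   # expected 0.0999755859375, the Float16 closest to 0.1

The next cell adds two such rounded values; the result visibly differs from 0.3: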
Float16(0.2998)
x_per=T(0.1)+T(0.2)
"0_01101_0011001100"
floatbits(x_per)
Floating point limits
Due to the finite size of the representation:
there are minimal and maximal possible numbers which can be represented
there is symmetry with respect to 0 because of the sign bit
smallest positive denormalized number:
6.0e-8
"0_00000_0000000001"
nextfloat(zero(T)), floatbits(nextfloat(zero(T)))
smallest positive normalized number:
6.104e-5
"0_00001_0000000000"
floatmin(T),floatbits(floatmin(T))
largest positive normalized number:
6.55e4
"0_11110_1111111111"
floatmax(T), floatbits(floatmax(T))
Largest representable number:
Inf
"0_11111_0000000000"
6.55e4
"0_11110_1111111111"
typemax(T),floatbits(typemax(T)),prevfloat(typemax(T)), floatbits(prevfloat(typemax(T)))
Machine precision
There cannot be more than $2^{64}$ floating point numbers, so almost all real numbers have to be approximated.
Let $x$ be an exact value and $\tilde x$ be its approximation. Then
$$\left|\frac{\tilde x - x}{x}\right| < \epsilon$$
is the best accuracy estimate we can get, where
$\epsilon = \beta^{1-t}$ (truncation)
$\epsilon = \frac{1}{2}\beta^{1-t}$ (rounding)
Also: $\epsilon$ is the smallest representable number such that $1+\epsilon > 1$.
Relative errors show up in particular when
subtracting two close numbers
adding smaller numbers to larger ones
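Both effects can be seen in this hedged Float64 sketch:

(2.0^53 + 1.0) - 2.0^53   # 0.0: the 1.0 is lost, since eps(2.0^53) == 2.0
(1.0 + 1.0e-15) - 1.0     # ≈ 1.11e-15 instead of 1.0e-15: the rounding error of the sum dominates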
How do operations work? E.g. addition:
Adjust the exponent of the number to be added: until both exponents are equal, add 1 to its exponent and shift its mantissa to the right bit by bit
Add both numbers
Normalize the result
The smallest number one can add to 1 can have at most $t$ bit shifts of the mantissa until the mantissa becomes 0, so its exponent cannot be smaller than $-t+1$.
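The alignment step can be sketched using significand and exponent (hedged and purely illustrative; the hardware works directly on the bit representation):

begin
    a = Float16(3.0)                   # significand 1.5, exponent 1
    b = Float16(0.25)                  # significand 1.0, exponent -2
    sa, ea = significand(a), exponent(a)
    sb, eb = significand(b), exponent(b)
    sb_aligned = sb * 2.0^(eb - ea)    # shift b's mantissa right by ea-eb=3 bits
    s = sa + sb_aligned                # add the aligned significands: 1.625
    Float16(s * 2.0^ea)                # renormalize: Float16(3.25) == a + b
end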
Machine epsilon
Smallest floating point number $\epsilon$ such that $1+\epsilon > 1$ in floating point arithmetic.
In exact math it is true that from $1+\epsilon = 1$ it follows that $\epsilon = 0$ and vice versa. In floating point computations this is not true:
Float16(0.000977)
ϵ=eps(T)
"0_00101_0000000000"
floatbits(ϵ)
1.0
"0_01111_0000000000"
"0_01111_0000000000"
one(T)+ϵ/2,floatbits(one(T)+ϵ/2), floatbits(one(T))
1.001
"0_01111_0000000001"
one(T)+ϵ,floatbits(one(T)+ϵ)
0.000977
"0_00101_0000000000"
nextfloat(one(T))-one(T),floatbits(nextfloat(one(T))-one(T))
Density of floating point numbers
How dense are floating point numbers on the real axis?
# Estimate the number of floating point numbers per unit interval around x
# by stepping sample_size representable numbers to the left and to the right of x.
function fpdens(x::AbstractFloat; sample_size=1000)
    xleft=x
    xright=x
    for i=1:sample_size
        xleft=prevfloat(xleft)
        xright=nextfloat(xright)
    end
    return prevfloat(2.0*sample_size/(xright-xleft))
end;
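A quick plausibility check (hedged): near $x=1$ the spacing of Float64 numbers is eps(1.0), so the density should be roughly 1/eps(1.0):

fpdens(1.0)   # expected ≈ 4.5e15 ≈ 1/eps(1.0)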
0.0 … Inf (truncated display of X: in Float16, the smallest entries underflow to 0.0 and the largest overflow to Inf)
X=T(10.0) .^collect(-10:T(0.1):10)
begin
    fig=PyPlot.figure()
    PyPlot.loglog(X,fpdens.(X))
    PyPlot.title("$(eltype(X)) numbers per unit interval")
    PyPlot.grid()
    PyPlot.xlabel("x")
    fig
end