begin
    using Pkg
    Pkg.activate(mktempdir())
    Pkg.add("PyPlot")
    Pkg.add("PlutoUI")
    using PlutoUI, PyPlot
end
Number representation
Apart from the concrete names of the Julia library functions, everything in this chapter is valid for all modern programming languages and computer systems.
All data in computers are stored as sequences of bits. For concrete number types, the bitstring function returns this information as a sequence of 0 and 1. The sizeof function returns the number of bytes in the binary representation.
Integer numbers
Int16
T_int=Int16
1
i=T_int(1)
2
sizeof(i)
"0000000000000001"
bitstring(i)
Positive integer numbers are stored via their representation in the binary system. For a negative number $n$, the two's complement $2^N - |n|$ (with $N$ the number of bits) is stored. typemin and typemax return the smallest and largest numbers which can be represented in a given number type.
-32768
32767
32767
typemin(T_int),typemax(T_int),2^(8*sizeof(T_int)-1)-1
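For example (an illustrative check of the two's complement storage): the two's complement of 1 is $2^{16}-1$, i.e. all 16 bits are set.

bitstring(T_int(-1))   # expected: "1111111111111111"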
Integer arithmetic is exact as long as the result stays within the representable range; if that range is exceeded, the result silently wraps around:
10
3+7
-32759
typemax(T_int)+T_int(10)
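If an overflow should raise an error instead of silently wrapping around, Julia's checked arithmetic can be used. A minimal sketch:

begin
    using Base.Checked: checked_add
    checked_add(T_int(3), T_int(7))            # 10, as before
    # checked_add(typemax(T_int), T_int(10))   # would throw an OverflowError
end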
Floating point numbers
How does this work for floating point numbers?
0.30000000000000004
0.1+0.2
But this should be 0.3. What is happening ???
Real number representation
Let us think about the representation of real numbers. Usually we write them as decimal fractions and cut the representation off if the number of digits is infinite.
Any real number $x\in\mathbb{R}$ can be expressed via the representation formula
$$x = \pm \sum_{i=0}^{\infty} d_i \beta^{-i}\, \beta^e$$
with base $\beta\in\mathbb{N},\ \beta\geq 2$, significand (or mantissa) digits $d_i\in\{0,\dots,\beta-1\}$ and exponent $e\in\mathbb{Z}$. The representation is infinite for periodic decimal numbers and irrational numbers.
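As an illustration, the first significand digits and the exponent of a positive number in a given base can be computed as follows. This is a hedged sketch; digits_of is a hypothetical helper introduced only for this example:

# extract the first n significand digits d_0,d_1,... and the exponent e of x>0 in base β
function digits_of(x; β=10, n=10)
    e = floor(Int, log(β, x))     # exponent such that β^e ≤ x < β^(e+1)
    s = x / float(β)^e            # significand in [1, β)
    d = zeros(Int, n)
    for i in 1:n
        d[i] = floor(Int, s)      # next digit
        s = (s - d[i]) * β
    end
    return d, e
end;

digits_of(1/3)        # expected ([3,3,3,3,3,3,3,3,3,3], -1): the decimal representation is periodic
digits_of(0.1, β=2)   # expected ([1,1,0,0,1,1,0,0,1,1], -4): 0.1 is periodic in binary, too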
Scientific notation
The scientific notation of real numbers is derived from this representation in the case of $\beta=10$. Consider, e.g., $x=$ 6.022e23. Then
$$x = +\bigl(6\cdot 10^{0} + 0\cdot 10^{-1} + 2\cdot 10^{-2} + 2\cdot 10^{-3}\bigr)\cdot 10^{23},$$
i.e. $d=(6,0,2,2,0,\dots)$ and $e=23$. This representation is not unique: e.g. 0.6022e24, i.e. $x = 0.6022\cdot 10^{24}$, describes the same number.
IEEE754 standard
This is the current standard format for storing floating point numbers. It was developed in the 1980s.
$\beta = 2$, therefore $d_i\in\{0,1\}$
Truncation to fixed finite size: $x = \pm \sum_{i=0}^{t-1} d_i \beta^{-i}\, \beta^e$
$t$: significand (mantissa) length
Normalization: assume $d_0=1$ to save one bit for the storage of the significand. This requires a normalization step after operations which adjusts significand and exponent of the result.
$k$: exponent size. The $k$ exponent bits determine the range $L\leq e\leq U$ of admissible exponents.
Extra bit for sign
Storage size: $(t-1) + k + 1$ bits
Standardized for most modern languages
Hardware support usually for 64bit and 32bit
| precision | Julia | C/C++ | k | t | bits |
|-----------|---------|-------------|----|-----|------|
| quadruple | n/a | long double | 15 | 113 | 128 |
| double | Float64 | double | 11 | 53 | 64 |
| single | Float32 | float | 8 | 24 | 32 |
| half | Float16 | n/a | 5 | 11 | 16 |
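The $k$ and $t$ columns for the types available in Julia can be cross-checked with the built-in precision function, which returns $t$ (counting the hidden bit). A hedged sketch:

for T in (Float64, Float32, Float16)
    t = precision(T)
    k = 8*sizeof(T) - t    # the sign bit and the t-1 stored mantissa bits leave k exponent bits
    println(T, ": k=", k, ", t=", t, ", bits=", 8*sizeof(T))
end
# expected output: Float64: k=11, t=53, bits=64, and so on, matching the table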
See also the Julia documentation on floating point numbers, 0.30000000000000004.com, Wikipedia and the links therein.
The storage sequence is: Sign bit, exponent, mantissa.
Storage layout for a normalized Float32 number ($d_0=1$):
bit 1: sign
bits 2...9: the $k=8$ exponent bits; the value $e + 2^{k-1}-1 = e + 127$ is stored, so no sign bit is needed for the exponent
bits 10...32: the $t-1=23$ mantissa bits $d_1\dots d_{23}$; $d_0=1$ is not stored ("hidden bit")
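For example (an illustrative check, with the field boundaries annotated by hand), Float32(1.0) has sign bit 0, stored exponent $0+127$ = 0b01111111 and an all-zero mantissa:

bitstring(1.0f0)   # expected "00111111100000000000000000000000": 1 sign bit, 8 exponent bits, 23 mantissa bits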
Julia provides functions to obtain the significand and the exponent of a floating point number:
2.0
x0=2.0
1.0
1
significand(x0),exponent(x0)
We can calculate the length $k$ of the exponent field from the maximum representable floating point number by taking the base-2 logarithm of its exponent:
exponent_length(T::Type{<:AbstractFloat})=Int(log2(exponent(floatmax(T))+1)+1);
The size $t$ of the significand is calculated from the overall size of the representation minus the size of the exponent and the sign bit, plus 1 for the "hidden bit".
significand_length(T::Type{<:AbstractFloat})=8*sizeof(T)-exponent_length(T)-1+1;
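These helpers should reproduce the $k$ and $t$ columns of the table above, e.g.:

exponent_length(Float64), significand_length(Float64)   # expected (11, 53)
exponent_length(Float16), significand_length(Float16)   # expected (5, 11)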
This allows us to define a more readable variant of the bitstring representation for floats.
The sign bit is the first bit in the representation:
signbit(x::AbstractFloat)=bitstring(x)[1:1];
Next comes the exponent:
exponent_bits (generic function with 1 method)
exponent_bits(x::AbstractFloat)=bitstring(x)[2:exponent_length(typeof(x))+1]
And finally, the significand:
significand_bits(x::AbstractFloat)=bitstring(x)[exponent_length(typeof(x))+2:8*sizeof(x)];
Put them together:
floatbits(x::AbstractFloat)=signbit(x)*"_"*exponent_bits(x)*"_"*significand_bits(x);
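A quick usage example; the same bit pattern appears again in the Float16 examples below:

floatbits(one(Float16))   # expected "0_01111_0000000000"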
Julia floating point types
Float16
T=Float16
Type Float16:
size of exponent: 5
size of significand: 11
Float16(0.1)
x=T(0.1)
Binary representation: 0_01011_1001100110
Exponent: e=-4, stored as e+15=11
$d_0=1$ is assumed implicitly.
Numbers which are exactly represented in the decimal system may not be exactly represented in the binary system! Such numbers are always rounded to a finite approximate value.
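For instance, the Float16 value stored for 0.1 can be inspected by converting it to a wider type (a hedged illustration):

Float64(T(0.1))   # expected 0.0999755859375, the Float16 closest to 0.1

The next cell adds two such rounded values; the result visibly differs from 0.3: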
Float16(0.2998)
x_per=T(0.1)+T(0.2)
"0_01101_0011001100"
floatbits(x_per)
Floating point limits
Due to the finite size of the representation:
there are minimal and maximal possible numbers which can be represented
there is symmetry with respect to 0 because of the sign bit
smallest positive denormalized number:
6.0e-8
"0_00000_0000000001"
nextfloat(zero(T)), floatbits(nextfloat(zero(T)))
smallest positive normalized number:
6.104e-5
"0_00001_0000000000"
floatmin(T),floatbits(floatmin(T))
largest positive normalized number:
6.55e4
"0_11110_1111111111"
floatmax(T), floatbits(floatmax(T))
Largest representable number:
Inf
"0_11111_0000000000"
6.55e4
"0_11110_1111111111"
typemax(T),floatbits(typemax(T)),prevfloat(typemax(T)), floatbits(prevfloat(typemax(T)))
Machine precision
There cannot be more than $2^{64}$ floating point numbers, so almost all real numbers have to be approximated.
Let $x$ be an exact value and $\tilde x$ be its approximation. Then
$$\left|\frac{\tilde x - x}{x}\right| < \epsilon$$
is the best accuracy estimate we can get, where
$\epsilon = \beta^{1-t}$ (truncation)
$\epsilon = \frac{1}{2}\beta^{1-t}$ (rounding)
Also: $\epsilon$ is the smallest representable number such that $1+\epsilon > 1$.
Relative errors show up in particular when
subtracting two close numbers
adding smaller numbers to larger ones
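Both effects can be seen in this hedged Float64 sketch:

(2.0^53 + 1.0) - 2.0^53   # 0.0: the 1.0 is lost, since eps(2.0^53) == 2.0
(1.0 + 1.0e-15) - 1.0     # ≈ 1.11e-15 instead of 1.0e-15: the rounding error of the sum dominates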
How do operations work? E.g. addition:
Adjust the exponent of the number to be added: until both exponents are equal, add 1 to its exponent and shift its mantissa to the right bit by bit
Add both numbers
Normalize the result
The smallest number one can add to 1 can have at most $t$ bit shifts of the mantissa until the mantissa becomes 0, so its exponent cannot be smaller than $-t+1$.
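The alignment step can be sketched using significand and exponent (hedged and purely illustrative; the hardware works directly on the bit representation):

begin
    a = Float16(3.0)                   # significand 1.5, exponent 1
    b = Float16(0.25)                  # significand 1.0, exponent -2
    sa, ea = significand(a), exponent(a)
    sb, eb = significand(b), exponent(b)
    sb_aligned = sb * 2.0^(eb - ea)    # shift b's mantissa right by ea-eb=3 bits
    s = sa + sb_aligned                # add the aligned significands: 1.625
    Float16(s * 2.0^ea)                # renormalize: Float16(3.25) == a + b
end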
Machine epsilon
Smallest floating point number $\epsilon$ such that $1+\epsilon > 1$ in floating point arithmetic.
In exact math it is true that from $1+\epsilon = 1$ it follows that $\epsilon = 0$ and vice versa. In floating point computations this is not true:
Float16(0.000977)
ϵ=eps(T)
"0_00101_0000000000"
floatbits(ϵ)
1.0
"0_01111_0000000000"
"0_01111_0000000000"
one(T)+ϵ/2,floatbits(one(T)+ϵ/2), floatbits(one(T))
1.001
"0_01111_0000000001"
one(T)+ϵ,floatbits(one(T)+ϵ)
0.000977
"0_00101_0000000000"
nextfloat(one(T))-one(T),floatbits(nextfloat(one(T))-one(T))
Density of floating point numbers
How dense are floating point numbers on the real axis?
# Estimate the number of floating point numbers per unit interval around x
# by stepping sample_size representable numbers to the left and to the right of x.
function fpdens(x::AbstractFloat; sample_size=1000)
    xleft=x
    xright=x
    for i=1:sample_size
        xleft=prevfloat(xleft)
        xright=nextfloat(xright)
    end
    return prevfloat(2.0*sample_size/(xright-xleft))
end;
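A quick plausibility check (hedged): near $x=1$ the spacing of Float64 numbers is eps(1.0), so the density should be roughly 1/eps(1.0):

fpdens(1.0)   # expected ≈ 4.5e15 ≈ 1/eps(1.0)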
0.0 … Inf (truncated display of X: in Float16, the smallest entries underflow to 0.0 and the largest overflow to Inf)
X=T(10.0) .^collect(-10:T(0.1):10)
begin
    fig=PyPlot.figure()
    PyPlot.loglog(X,fpdens.(X))
    PyPlot.title("$(eltype(X)) numbers per unit interval")
    PyPlot.grid()
    PyPlot.xlabel("x")
    fig
end