Representing and manipulating information


This lecture will cover contents from Chapter 4 of the book.

## How do we "see" things?

### 1.2 Everything is a bit
 
- Each bit is `0` or `1`
- By encoding/interpreting sets of bits in various ways
  - Computers determine what to do (instructions)
  - … and represent and manipulate numbers, sets, strings, etc…
- Why bits?  Electronic Implementation
  - Easy to store with bistable elements.
  - Reliably transmitted on noisy and inaccurate wires. 





<figure
  
>
  <picture>
    <!-- Auto scaling with imagemagick -->
    <!--
      See https://www.debugbear.com/blog/responsive-images#w-descriptors-and-the-sizes-attribute and
      https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Responsive_images for info on defining 'sizes' for responsive images
    -->
    
      
        <source
          class="responsive-img-srcset"
          
            srcset="/assets/img/courses/csc231/03-data-representation/data_01-480.webp 480w,/assets/img/courses/csc231/03-data-representation/data_01-800.webp 800w,/assets/img/courses/csc231/03-data-representation/data_01-1400.webp 1400w,"
            type="image/webp"
          
          
            sizes="95vw"
          
        >
      
    
    <img
      src="/assets/img/courses/csc231/03-data-representation/data_01.png"
      
      
        width="50%"
      
      
        height="auto"
      
      
      
        alt="Electronic representation of bits"
      
      
      
        data-zoomable
      
      
        loading="lazy"
      
      onerror="this.onerror=null; $('.responsive-img-srcset').remove();"
    >
  </picture>

  
</figure>


### 1.3 Encoding byte values 

- Byte = 8 bits
- Binary: `0000 0000` to `1111 1111`. 
- Decimal: `0` to `255`. 
- Hexadecimal: `00` to `FF`. 
  - Base 16 number representation
  - Uses the characters `0` to `9` and `A` to `F`. 
- Example: 15213 (decimal) = 0011 1011 0110 1101 (binary) = 3B6D (hex)

| Hex | Decimal | Binary |         Binary to Decimal Calculation         |
| --- | ------- | ------ | --------------------------------------------- |
|  0  |    0    |  0000  | 0 * $2^3$ + 0 * $2^2$ + 0 * $2^1$ + 0 * $2^0$ |
|  1  |    1    |  0001  | 0 * $2^3$ + 0 * $2^2$ + 0 * $2^1$ + 1 * $2^0$ |
|  2  |    2    |  0010  | 0 * $2^3$ + 0 * $2^2$ + 1 * $2^1$ + 0 * $2^0$ |
|  3  |    3    |  0011  | 0 * $2^3$ + 0 * $2^2$ + 1 * $2^1$ + 1 * $2^0$ |
|  4  |    4    |  0100  | 0 * $2^3$ + 1 * $2^2$ + 0 * $2^1$ + 0 * $2^0$ |
|  5  |    5    |  0101  | 0 * $2^3$ + 1 * $2^2$ + 0 * $2^1$ + 1 * $2^0$ |
|  6  |    6    |  0110  | 0 * $2^3$ + 1 * $2^2$ + 1 * $2^1$ + 0 * $2^0$ |
|  7  |    7    |  0111  | 0 * $2^3$ + 1 * $2^2$ + 1 * $2^1$ + 1 * $2^0$ |
|  8  |    8    |  1000  | 1 * $2^3$ + 0 * $2^2$ + 0 * $2^1$ + 0 * $2^0$ |
|  9  |    9    |  1001  | 1 * $2^3$ + 0 * $2^2$ + 0 * $2^1$ + 1 * $2^0$ |
|  A  |   10    |  1010  | 1 * $2^3$ + 0 * $2^2$ + 1 * $2^1$ + 0 * $2^0$ |
|  B  |   11    |  1011  | 1 * $2^3$ + 0 * $2^2$ + 1 * $2^1$ + 1 * $2^0$ |
|  C  |   12    |  1100  | 1 * $2^3$ + 1 * $2^2$ + 0 * $2^1$ + 0 * $2^0$ |
|  D  |   13    |  1101  | 1 * $2^3$ + 1 * $2^2$ + 0 * $2^1$ + 1 * $2^0$ |
|  E  |   14    |  1110  | 1 * $2^3$ + 1 * $2^2$ + 1 * $2^1$ + 0 * $2^0$ |
|  F  |   15    |  1111  | 1 * $2^3$ + 1 * $2^2$ + 1 * $2^1$ + 1 * $2^0$ |


- [Google Spreadsheet demonstrating conversion process](https://docs.google.com/spreadsheets/d/16yW8yDfDTxBiH-PkIddm1Cg4kYE4k-56GLTa6xDD_WU/edit?usp=sharing)

### 1.4 How are data represented? 

| C data type | typical 32-bit (bytes) | typical 64-bit (bytes) | x86_64 (bytes) |
| ----------- | ---------------------- | ---------------------- | -------------- |
| char        | 1              | 1              | 1       |  
| short       | 2              | 2              | 2       |  
| int         | 4              | 4              | 4       |  
| long        | 4              | 8              | 8       |  
| float       | 4              | 4              | 4       |  
| double      | 8              | 8              | 8       |  
| pointer     | 4              | 8              | 8       |  

## Bit-level operations in C

- Boolean algebra developed by George Boole in 19th century
- Algebraic representation of logic: encode `True` as `1` and `False` as `0`. 
- Operations: `AND` (`&`), `OR` (`|`), `XOR` (`^`), `NOT` (`~`).

| A | B | A&B  | A\|B  | A^B | ~A |
| - | - | ---- | ---- | --- | -- | 
| 0 | 0 | 0    | 0    | 0   | 1  |
| 0 | 1 | 0    | 1    | 1   | 1  |
| 1 | 0 | 0    | 1    | 1   | 0  |
| 1 | 1 | 1    | 1    | 0   | 0  |  

- General Boolean algebra
  - Operate on bit vectors
  - Operation applied bitwise. 
  - All properties of boolean algebra apply.  





<figure
  
>
  <picture>
    <!-- Auto scaling with imagemagick -->
    <!--
      See https://www.debugbear.com/blog/responsive-images#w-descriptors-and-the-sizes-attribute and
      https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Responsive_images for info on defining 'sizes' for responsive images
    -->
    
      
        <source
          class="responsive-img-srcset"
          
            srcset="/assets/img/courses/csc231/03-data-representation/data_02-480.webp 480w,/assets/img/courses/csc231/03-data-representation/data_02-800.webp 800w,/assets/img/courses/csc231/03-data-representation/data_02-1400.webp 1400w,"
            type="image/webp"
          
          
            sizes="95vw"
          
        >
      
    
    <img
      src="/assets/img/courses/csc231/03-data-representation/data_02.png"
      
      
        width="50%"
      
      
        height="auto"
      
      
      
        alt="bitwise boolean operations"
      
      
      
        data-zoomable
      
      
        loading="lazy"
      
      onerror="this.onerror=null; $('.responsive-img-srcset').remove();"
    >
  </picture>

  
</figure>


- Operation and notation  
  - Boolean operations: `&`, `|`, `^`, `~`.
  - Shift operations:
    - Left Shift: `x << y`
      - Shift bit-vector x left y positions
      - Throw away extra bits on left
      - Fill with 0’s on right
    - Right Shift: `x >> y`
      - Shift bit-vector x right y positions
      - Throw away extra bits on right
      - Logical shift (for unsigned values)
        - Fill with 0’s on left
      - Arithmetic shift (for signed values)
        - Replicate most significant bit on left
    - Undefined Behavior
      - Shift amount < 0 or ≥ word size
  - Apply to any "integral" data type: long, int, short, char, unsigned
  - View arguments as bit vectors. 
  - Arguments applied bit-wise. 
  - Mathematical operations:
    - Bit-wise with carry
    - $0 + 0 = 0$
    - $0 + 1 = 1$
    - $1 + 0 = 1$
    - $1 + 1 = 0$ and carry $1$ to the next bit operation 
    (or add 1 to left of the most significant bit position)


- Inside your `csc231` directory, create another directory called `03-data` and change 
into this directory.
- Create a file named `bitwise_demo.c` with the following contents:

<script src="https://gist.github.com/linhbngo/d1e9336a82632c528ea797210ed0f553.js?file=bitwise_demo.c"></script>

- Compile and run `bitwise_demo.c`.
- Confirm that the binary printouts match the corresponding decimal printouts and the expected bitwise operations. 

## Encoding integers

### 3.1 Mathematical equation

$X=\sum_{i=0}^{w-1}x_{i}*2^{i}$

### 3.2 What about negative numbers?

$X=-x_{w-1} * 2^{w-1} + \sum_{i=0}^{w-2}x_{i}*2^{i}$

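| Unsigned | Binary | 2’s complement | 1’s complement |
| -------- | ------ | -------------- | -------------- |
| 0        | 0000   | 0              | 0              |
| 1        | 0001   | 1              | 1              |
| 2        | 0010   | 2              | 2              |
| 3        | 0011   | 3              | 3              |
| 4        | 0100   | 4              | 4              |
| 5        | 0101   | 5              | 5              |
| 6        | 0110   | 6              | 6              |
| 7        | 0111   | 7              | 7              |
| 8        | 1000   | -8             | -7             |
| 9        | 1001   | -7             | -6             |
| 10       | 1010   | -6             | -5             |
| 11       | 1011   | -5             | -4             |
| 12       | 1100   | -4             | -3             |
| 13       | 1101   | -3             | -2             |
| 14       | 1110   | -2             | -1             |
| 15       | 1111   | -1             | 0              |

|             | Decimal | Hex   | Binary            |
| ----------- | ------- | ----- | ----------------- |
| short int x | 15213   | 3B 6D | 00111011 01101101 |
| short int y | -15213  | C4 93 | 11000100 10010011 |

- 5-bit 2's complement encoding of 10 and -10:

|     | -16 | 8 | 4 | 2 | 1 | Sum               |
| --- | --- | - | - | - | - | ----------------- |
| 10  | 0   | 1 | 0 | 1 | 0 | 8 + 2 = 10        |
| -10 | 1   | 0 | 1 | 1 | 0 | -16 + 4 + 2 = -10 |

- 6-bit 2's complement encoding of 10 and -10:

|     | -32 | 16 | 8 | 4 | 2 | 1 | Sum                    |
| --- | --- | -- | - | - | - | - | ---------------------- |
| 10  | 0   | 0  | 1 | 0 | 1 | 0 | 8 + 2 = 10             |
| -10 | 1   | 1  | 0 | 1 | 1 | 0 | -32 + 16 + 4 + 2 = -10 |

- Bit-by-bit contributions for the 16-bit encodings of 15213 and -15213:

| Weight | Bit of 15213 | Contribution | Bit of -15213 | Contribution |
| ------ | ------------ | ------------ | ------------- | ------------ |
| 1      | 1            | 1            | 1             | 1            |
| 2      | 0            | 0            | 1             | 2            |
| 4      | 1            | 4            | 0             | 0            |
| 8      | 1            | 8            | 0             | 0            |
| 16     | 0            | 0            | 1             | 16           |
| 32     | 1            | 32           | 0             | 0            |
| 64     | 1            | 64           | 0             | 0            |
| 128    | 0            | 0            | 1             | 128          |
| 256    | 1            | 256          | 0             | 0            |
| 512    | 1            | 512          | 0             | 0            |
| 1024   | 0            | 0            | 1             | 1024         |
| 2048   | 1            | 2048         | 0             | 0            |
| 4096   | 1            | 4096         | 0             | 0            |
| 8192   | 1            | 8192         | 0             | 0            |
| 16384  | 0            | 0            | 1             | 16384        |
| -32768 | 0            | 0            | 1             | -32768       |
| Sum    |              | 15213        |               | -15213       |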
### 3.3 Numeric ranges

- Unsigned values for `w-bit` word
  - UMin = 0
  - UMax = $2^{w} - 1$
- 2's complement values for `w-bit` word
  - TMin = $-2^{w-1}$
  - TMax = $2^{w-1} - 1$
  - -1: 111..1

- Values for different word sizes:

|      | 8 (1 byte)    | 16 (2 bytes)      | 32 (4 bytes)             | 64 (8 bytes)                         |
| ---- | ---- | ------- | -------------- | -------------------------- |
| UMax | 255  | 65,535  | 4,294,967,295  | 18,446,744,073,709,551,615 |
| TMax | 127  | 32,767  | 2,147,483,647  | 9,223,372,036,854,775,807  |
| TMin | -128 | -32,768 | -2,147,483,648 | -9,223,372,036,854,775,808 | 

- Observations
  - abs(TMin) = TMax + 1
    - Asymmetric range
  - UMax = 2 * TMax + 1

- C programming
  - `#include <limits.h>`
  - Declares constants: `ULONG_MAX`, `LONG_MAX`, `LONG_MIN`
  - Platform specific



- Write a C program called `numeric_ranges.c` that prints out the 
value of `ULONG_MAX`, `LONG_MAX`, `LONG_MIN`. Also answer the following 
question: If we multiply `LONG_MIN` by -1, what do we get?
- Note: You need to search for the correct format string specifiers. 

:::{dropdown} Solution

<script src="https://gist.github.com/linhbngo/d1e9336a82632c528ea797210ed0f553.js?file=numeric_ranges.c"></script>
:::
 

### Conversions (casting)

:::{image} fig/03-data-representation/data_04.png
:alt: 2’s complement to unsigned
:class: bg-primary mb-1
:height: 200px
:align: center
:::

:::{image} fig/03-data-representation/data_05.png
:alt: unsigned to 2’s
:class: bg-primary mb-1
:height: 200px
:align: center
:::

:::{image} fig/03-data-representation/data_06.png
:alt: expanding
:class: bg-primary mb-1
:height: 200px
:align: center
:::

:::{image} fig/03-data-representation/data_07.png
:alt: truncating
:class: bg-primary mb-1
:height: 200px
:align: center
:::

## Addition, multiplication, and negation (of integers)


- Mathematical operations:
  - Bit-wise with carry
  - $0 + 0 = 0$
  - $0 + 1 = 1$
  - $1 + 0 = 1$
  - $1 + 1 = 0$ and carry $1$ to the next bit operation (or add 1 to left of the most significant bit position)
- This works for both unsigned and 2's complement notation
- Example 1: 4-bit unsigned $2+6=8$

$$
\begin{array}{r}
0010 \\
+\ 0110 \\
\hline
1000
\end{array}
$$

- Example 2: 4-bit unsigned $11+12=23$ (the true sum needs 5 bits)

$$
\begin{array}{r}
1011 \\
+\ 1100 \\
\hline
10111
\end{array}
$$

- Example 3: 4-bit signed $5-7=5+(-7)=-2$
  - Positive to negative conversion in 2's complement: flip the bits and add 1. 
  - $7$: $0111$
  - $-7$: $1000 + 1=1001$

$$
\begin{array}{r}
0101 \\
+\ 1001 \\
\hline
1110
\end{array}
$$

$1110 = (-1)(1)(8) + (1)(4) + (1)(2) + (0)(1) = -8 + 4 + 2 = -2$


- Unsigned addition: given `w`-bit operands
  - True sum can have `w + 1` bits (carry bit). 
  - Carry bit is discarded. 
- Implementation:
  - s = (u + v) mod 2<sup>w</sup>



- Create a file named `unsigned_addition.c` with the following contents:

<script src="https://gist.github.com/linhbngo/d1e9336a82632c528ea797210ed0f553.js?file=unsigned_addition.c"></script>

- Compile and run `unsigned_addition.c`.
- Confirm that calculated values are correct. 



- Two's complement (signed) addition has almost the same bit-level behavior as unsigned addition
  - True sum of `w`-bit operands will have `w+1` bits, but
  - Carry bit is discarded. 
  - Remaining bits are treated as a 2's complement integer. 
- Overflow behavior is different
  - $TAdd_{w}(u, v) = u + v + 2^{w}$ if $u + v < TMin_{w}$ (**Negative Overflow**)
  - $TAdd_{w}(u, v) = u + v$ if $TMin_{w} \leq u + v \leq TMax_{w}$
  - $TAdd_{w}(u, v) = u + v - 2^{w}$ if $u + v > TMax_{w}$ (**Positive Overflow**)



- Create a file named `signed_addition.c` with the following contents:

<script src="https://gist.github.com/linhbngo/d1e9336a82632c528ea797210ed0f553.js?file=signed_addition.c"></script>

- Compile and run `signed_addition.c`.
- Confirm that calculated values are correct. 



- Compute product of `w`-bit numbers x and y. 
- Exact results can be bigger than `w` bits. 
  - Unsigned: up to `2w` bits: $0 \leq x * y \leq (2^{w} - 1)^{2}$
  - 2's complement (negative): up to `2w - 1` bits: $x * y \geq (-2^{w-1}) * (2^{w-1} - 1) = -2^{2w-2} + 2^{w-1}$
  - 2's complement (positive): up to `2w` bits: $x * y \leq 2^{2w-2}$
- To maintain exact results:
  - Need to keep expanding word size with each product computed. 
  - Is done by software if needed ([arbitrary precision arithmetic packages](https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software)).
- **Trust your compiler**: Modern compilers will most likely select an efficient instruction sequence
to multiply on the target CPU. 



- Power-of-2 multiply with left shift
  - $u << k$ gives $u * 2^{k}$
  - True product has `w + k` bits: discard the top `k` bits. 
- Unsigned power-of-2 divide with right shift
  - $u >> k$ gives $\lfloor u / 2^{k} \rfloor$
  - Use logical shift.
- Signed power-of-2 divide with shift
  - Use arithmetic shift.
  - x ≥ 0: $x >> k$ gives $\lfloor x / 2^{k} \rfloor$
  - x < 0: $(x + (1 << k) - 1) >> k$ gives $\lceil x / 2^{k} \rceil$
  - C statement: `(x < 0 ? x + (1 << k) - 1 : x) >> k`



- Negate through complement and increment:
  - `~x + 1 == -x`



- Implement a C program called `negation.c` that implements and validates
the identity `~x + 1 == -x`. The program should take a command-line argument:
a number of type `short` to be negated. 
- What happens if you try to negate `-32768`?

:::{dropdown} Solution
<script src="https://gist.github.com/linhbngo/d1e9336a82632c528ea797210ed0f553.js?file=negation.c"></script>
:::

Byte-oriented memory organization

Word-oriented memory organization

:::{image} fig/03-data-representation/data_09.png
:alt: byte ordering example
:class: bg-primary mb-1
:height: 100px
:align: center
:::

## Fractional binary numbers (float and double)


- What is 1011.101<sub>2</sub>?
  - 8 + 0 + 2 + 1 + 1/2 + 0/4 + 1/8 = 11.625
- Can only exactly represent numbers of the form x/2<sup>k</sup>
- Limited range of numbers within the `w`-bit word size. 



- [IEEE Standard 754](https://standards.ieee.org/standard/754-2019.html)
  - Established in 1985 as uniform standard for floating point arithmetic
  - Supported by all major CPUs
  - Some CPUs don’t implement IEEE 754 in full, for example, early GPUs, Cell BE processor
- Driven by numerical concerns
  - Nice standards for rounding, overflow, underflow
  - Hard to make fast in hardware (Numerical analysts predominated over hardware 
designers in defining standard).



<iframe width="560" height="315" src="https://www.youtube.com/embed/5tJPXYA0Nec" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Ariane 5 explodes on maiden voyage: $500 million lost. 
  - [64-bit floating point number assigned to 16-bit integer](http://sunnyday.mit.edu/nasa-class/Ariane5-report.html)
  - Caused the rocket to get an incorrect value of horizontal velocity and crash. 

<iframe width="560" height="315" src="https://www.youtube.com/embed/_Dbd3z8t9qc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Patriot Missile defense system misses Scud: 28 dead
  - [Velocity is a real number that can be expressed as a whole number and a decimal 
  (e.g., 3750.2563...miles per hour). Time is kept continuously by the system’s internal clock 
  in tenths of seconds but is expressed as an integer or whole number (e.g., 32,33, 34...). 
  The longer the system has been running, the larger the number representing time. To predict 
  where the Scud will next appear, both time and velocity must be expressed as real numbers. 
  Because of the way the Patriot computer performs its calculations and the fact that its 
  registers are only 24 bits long, the conversion of time from an integer to a real number 
  cannot be any more precise than 24 bits. This conversion results in a loss of precision 
  causing a less accurate time calculation. The effect of this inaccuracy on the range gate’s 
  calculation is directly proportional to the target’s velocity and the length of time 
  the system has been running. Consequently, performing the conversion after the Patriot has 
  been running continuously for extended periods causes the range gate to shift away from 
  the center of the target, making it less likely that the target, in this case a Scud, 
  will be successfully intercepted.](https://www.gao.gov/assets/220/215614.pdf)



- Numerical form: (-1)<sup>s</sup>M2<sup>E</sup>
  - Sign bit `s` determines whether the number is negative or positive. 
  - Significand `M` is normally a fractional value in the range [1.0, 2.0).
  - Exponent `E` weights the value by a power of two. 
- Encoding
  - Most significant bit is sign bit `s`. 
  - `exp` field encodes `E` (but is not equal to `E`)
  - `frac` field encodes `M` (but is not equal to `M`)

:::{image} fig/03-data-representation/data_10.png
:alt: floating encoding
:class: bg-primary mb-1
:height: 100px
:align: center
:::

- Single precision: 32 bits

:::{image} fig/03-data-representation/data_11.png
:alt: 32-bit encoding
:class: bg-primary mb-1
:height: 100px
:align: center
:::

- Double precision: 64 bits

:::{image} fig/03-data-representation/data_12.png
:alt: 64-bit encoding
:class: bg-primary mb-1
:height: 100px
:align: center
:::

## Floating operations


- Compute exact result. 
- Make it fit into desired precision. 
  - Possible overflow if exponent too large
  - Possible round to fit into `frac`

- Rounding modes

|                        | 1.40 | 1.60 | 1.50 | 2.50 | -1.50 |
| ---------------------- | ---- | ---- | ---- | ---- | ----- |
| Towards zero           | 1    | 1    | 1    | 2    | -1    |
| Round down             | 1    | 1    | 1    | 2    | -2    |
| Round up               | 2    | 2    | 2    | 3    | -1    |
| Nearest even (default) | 1    | 2    | 2    | 2    | -2    |

- Nearest even
  - Default rounding mode; hard to get any other mode without dropping into assembly. 
  - C99 has support for rounding mode management (`<fenv.h>`).
- All others are statistically biased
  - Sum of set of positive numbers will consistently be over- or under-estimated. 



- $(-1)^{s_1}M_{1}2^{E_1} * (-1)^{s_2}M_{2}2^{E_2}$  
- Exact result: $(-1)^{s}M2^{E}$
  - $s = s_{1} \oplus s_{2}$ (XOR of the sign bits)
  - $M = M_{1}*M_{2}$
  - $E = E_{1}+E_{2}$
- Correction
  - If M >= 2, shift M right, increment E. 
  - If E out of range, overflow. 
  - Round M to fit `frac` precision
- Implementation: Biggest chore is multiplying significands.  



- $(-1)^{s_1}M_{1}2^{E_1} + (-1)^{s_2}M_{2}2^{E_2}$ 
- Exact result: $(-1)^{s}M2^{E}$
  - Sign s, significand M: result of signed align and add
  - E = E<sub>1</sub> (assuming E<sub>1</sub> > E<sub>2</sub>)
- Correction
  - If M >= 2, shift M right, increment E. 
  - If M < 1, shift M left k positions, decrement E by k. 
  - Overflow if E out of range
  - Round M to fit `frac` precision
- Implementation: the main work is aligning the binary points and adding the signed significands.  



- Compare to those of [Abelian group](https://en.wikipedia.org/wiki/Abelian_group) 
(a group in which the result of applying the group operation to two group elements 
does not depend on the order in which they are written):
  - Closed under addition? **Yes** (but may generate infinity or NaN)
  - Commutative? **Yes**
  - Associative? **No**
    - Overflow and inexactness of rounding
    - In single precision: (3.14+1e10)-1e10 = 0, 3.14+(1e10-1e10) = 3.14
  - 0 is additive identity? **Yes**
  - Every element has additive inverse? **Almost**
    - Except for infinities and NaN
- Monotonicity?
  - **Almost**
  - Except for infinities and NaN



- Compare to those of [Abelian group](https://en.wikipedia.org/wiki/Abelian_group) 
(a group in which the result of applying the group operation to two group elements 
does not depend on the order in which they are written):
  - Closed under multiplication? **Yes** (but may generate infinity or NaN)
  - Commutative? **Yes**
  - Associative? **No**
    - Overflow and inexactness of rounding
    - In single precision: (1e20 * 1e20) * 1e-20 = inf, 1e20 * (1e20 * 1e-20) = 1e20
  - 1 is multiplicative identity? **Yes**
  - Multiplication distributes over addition? **No**
    - Overflow and inexactness of rounding
    - In single precision: 1e20 * (1e20 - 1e20) = 0.0, 1e20 * 1e20 - 1e20 * 1e20 = NaN
- Monotonicity?
  - **Almost**
  - Except for infinities and NaN



- C guarantees two levels
  - `float`: single precision
  - `double`: double precision
- Conversion/casting
  - Casting between int, float, and double changes bit representation
  - double/float to int
    - Truncates fractional part
    - Like rounding toward zero
    - Not defined when out of range or NaN: Generally sets to TMin
  - int to double
    - Exact conversion, as long as int has ≤ 53 bit word size
  - int to float
    - Will round according to rounding mode