notes/data-rep.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195

\chapter{Data representation}

\section{Metadata}

Metadata means data about data. It is the information stored in a file
that is not part of the main information, but instead important
properties and data of the file, such as the author name of a PDF
document. Although we do not take it into account when calculating
file size, it is important to realise that in the real world, it would
be there.

\section{Character encoding}

The \textit{American Standard Code for Information Interchange}
(ASCII) is a character encoding standard. It is comprised of
a \textit{character set} which contains characters and numerical
representations of each of them, so that they can be processed and
stored by a computer. Standard ASCII uses 7 bits to represent each
character and can thusly store 128 characters (127 characters and 1
null character). The letter A in ASCII is character 64 ($0100000_2$),
allowing for easier human representation of the rest of the alphabet,
which increments from 64.

Extended ASCII uses 8 bits to store each character and thus contains
256 characters (255 characters and 1 null character), allowing for
more than the limited alphabet standard ASCII contains by including
more symbols and some accented characters. This however, is quite
limited in comparison to Unicode which is a significantly larger set
which allows for character encoding in many languages and many symbols
such as emoji characters. Unicode is also compatible with many more
devices. The drawback to extended ASCII and even moreso to Unicode is
that they require more space to encode characters.

\section{Images}

A \textit{bitmap} (such as jpg, png) image is made up of rows of
pixels, each with a colour value.  The resolution of the image, the
$\mbox{width}\times\mbox{height}$ of pixels, determines the detail the
image may contain. The colour depth is the number of bits of
information that each colour is encoded with.  As with 7 bit ASCII,
the number of colours that a pixel can possibly have is determined by:

\begin{equation}
    \mbox{\textit{Number of colours}} = 2^{\mbox{\footnotesize Bit\ depth}}
\end{equation}

For example a 4 bit depth would mean that each pixel could store one
of 16 colours. As this is all of the information an image file stores,
we can deduce that:

\begin{equation}
    \mbox{\textit{File size}} =
    \mbox{width}\times\mbox{height}\times\mbox{bit depth}
\end{equation}

A \textit{vector} image does not use pixels, but instead mathematical
descriptions of shapes to form an image. This means that whilst
zooming into a bitmap image we lose quality as resolution decreases,
this is not the case for vector images. However, vector images could
not represent a photograph and are generally used for design,
animation, or games.

\section{Sound}

A \textit{microphone} transforms vibrations in the air (sound) into an
analogue signal (\textit{voltage}). Digital circuitry can take a
\textit{sample} of this analogue signal at set intervals, thus
producing a close representation of the original. The
\textit{resolution} of a sound file is based on the number of possible
values a single sample can have, much like colour depth, where the
resolution is measured in bits. The \textit{sample rate} (measured in
Hertz (Hz), meaning `per second') is the number of samples taken per
second. By multiplying the sample rate by the time of the recording in
seconds, we have the number of samples taken. Then by multiplying this
by the number of bits each sample requires (the resolution) we have
the file size.

\begin{equation}
    \mbox{\textit{File size}} =
    \mbox{sample rate}\times\mbox{resolution}\times\mbox{time in
    seconds}
\end{equation}

\section{Data compression}

If files such as image, sound, or perhaps video were stored
uncompressed then they would take a lot of space on hard drives and
require longer to transfer over a network. Therefore, data compression
is generally used which involves using algorithms to reduce the size
of a file. \textit{Lossless compression} simply reduces the size of a
file and we will examine two example of this. \textit{Lossy
compression}
\footnote{\url{https://en.wikipedia.org/wiki/Data_compression}}
involves removing data such as removing quiet tones from an audio
file.

\subsection{Run length encoding}

Run length encoding is a form of compression where repeated values are
compressed. For example, take the binary
\begin{align*}
    111110000001110000
\end{align*}
by compressing the runs where a zero or a one is repeated, it can
become
\begin{align*}
    (5,1),(6,0),(3,1)(4,0)
\end{align*}

\section{Huffman encoding}

Huffman describes a method of encoding text. Suppose we have the text
\begin{align*}
    repetitive
\end{align*}
We want to create a Huffman tree by writing out each character and the
number of times it occurs at the bottom. Then we join together the two
lowest numbers and add their number of occurrences. We keep doing this
until there is one number left. This ensures that each the letters
that occur the most are closest to the top of the tree. Each branch is
then labelled zero or one, depending on whether it goes left or right
respectively. Each letter then has a new code, based on the path from
the top of the tree to it.

%% requires package \usepackage[edges]{forest}
%\begin{forest}
%    for tree={
%        grow=south,
%        s sep=7mm
%    }
%    [rpveit (10)
%    [rpve (6)
%    [e (3)]
%    [rpv (3)
%        [ rp(2)
%        [r (1)]
%        [p (1)]
%        ]
%        [v (1)]
%        ]
%        ]
%        [it (4)
%        [i (2)]
%        [t (2)]
%        ]
%        ]
%        ]
%\end{forest}

\begin{tikzpicture}
    \node at (0,0) (top) {rpveit (10)};

    \node at (-2, -1.5) (2a) {rpve (6)};
    \node at (4, -3) (2b) {it (4)};

    \node at (-1, -3) (3a) {rpv (3)};
    \node at (-2, -4.5) (3b) {rp (2)};

    \node at (-6, -6) (e) {e (1)};
    \node at (-4, -6) (r) {r (1)};
    \node at (-1, -6) (p) {p (1)};
    \node at (1, -6) (v) {v (1)};
    \node at (3, -6) (i) {i (2)};
    \node at (5, -6) (t) {t (2)};


    \draw (top) to node[above]{$0$} (2a);
    \draw (3a) to node[above left]{$0$} (3b);

    \draw (top) to node[above]{$1$} (2b);
    \draw (2a) to node[above right]{$1$} (3a);

    \draw (3b) to node[above]{$0$} (r);
    \draw (2b) to node[above left]{$0$} (i);
    \draw (3b) to node[above]{$0$} (r);
    \draw (2a) to node[left]{$0$} (e);

    \draw (3b) to node[above right]{$1$} (p);
    \draw (2b) to node[above right]{$1$} (t);
    \draw (3a) to node[above]{$1$} (v);
\end{tikzpicture}
Here, the letter e has the code $00$, the letter r has the code
$0100$, the letter p has the code $0101$, and so on, as these are the
paths from the top of the tree to the letter in question. If we were
encoding with standard ASCII (7 bits per character) then the size of
the sting would be $7\times 10 = 70\ bits$ (as there are 10
characters).

Using the Huffman tree we have generated we can encode $repetitive$ as 
\begin{align*}
    0100000101001110111001100
\end{align*}
This uses $25\ bits$, which is a $(70-25)\div(70)\times 100 \approx
64.3\%$ saving.