\chapter{Data representation} \section{Metadata} Metadata means data about data. It is the information stored in a file that is not part of the main information, but instead important properties and data of the file, such as the author name of a PDF document. Although we do not take it into account when calculating file size, it is important to realise that in the real world, it would be there. \section{Character encoding} The \textit{American Standard Code for Information Interchange} (ASCII) is a character encoding standard. It is comprised of a \textit{character set} which contains characters and numerical representations of each of them, so that they can be processed and stored by a computer. Standard ASCII uses 7 bits to represent each character and can thusly store 128 characters (127 characters and 1 null character). The letter A in ASCII is character 64 ($0100000_2$), allowing for easier human representation of the rest of the alphabet, which increments from 64. Extended ASCII uses 8 bits to store each character and thus contains 256 characters (255 characters and 1 null character), allowing for more than the limited alphabet standard ASCII contains by including more symbols and some accented characters. This however, is quite limited in comparison to Unicode which is a significantly larger set which allows for character encoding in many languages and many symbols such as emoji characters. Unicode is also compatible with many more devices. The drawback to extended ASCII and even moreso to Unicode is that they require more space to encode characters. \section{Images} A \textit{bitmap} (such as jpg, png) image is made up of rows of pixels, each with a colour value. The resolution of the image, the $\mbox{width}\times\mbox{height}$ of pixels, determines the detail the image may contain. The colour depth is the number of bits of information that each colour is encoded with. As with 7 bit ASCII, the number of colours that a pixel can possibly have is determined by: \begin{equation} \mbox{\textit{Number of colours}} = 2^{\mbox{\footnotesize Bit\ depth}} \end{equation} For example a 4 bit depth would mean that each pixel could store one of 16 colours. As this is all of the information an image file stores, we can deduce that: \begin{equation} \mbox{\textit{File size}} = \mbox{width}\times\mbox{height}\times\mbox{bit depth} \end{equation} A \textit{vector} image does not use pixels, but instead mathematical descriptions of shapes to form an image. This means that whilst zooming into a bitmap image we lose quality as resolution decreases, this is not the case for vector images. However, vector images could not represent a photograph and are generally used for design, animation, or games. \section{Sound} A \textit{microphone} transforms vibrations in the air (sound) into an analogue signal (\textit{voltage}). Digital circuitry can take a \textit{sample} of this analogue signal at set intervals, thus producing a close representation of the original. The \textit{resolution} of a sound file is based on the number of possible values a single sample can have, much like colour depth, where the resolution is measured in bits. The \textit{sample rate} (measured in Hertz (Hz), meaning `per second') is the number of samples taken per second. By multiplying the sample rate by the time of the recording in seconds, we have the number of samples taken. Then by multiplying this by the number of bits each sample requires (the resolution) we have the file size. \begin{equation} \mbox{\textit{File size}} = \mbox{sample rate}\times\mbox{resolution}\times\mbox{time in seconds} \end{equation} \section{Data compression} If files such as image, sound, or perhaps video were stored uncompressed then they would take a lot of space on hard drives and require longer to transfer over a network. Therefore, data compression is generally used which involves using algorithms to reduce the size of a file. \textit{Lossless compression} simply reduces the size of a file and we will examine two example of this. \textit{Lossy compression} \footnote{\url{https://en.wikipedia.org/wiki/Data_compression}} involves removing data such as removing quiet tones from an audio file. \subsection{Run length encoding} Run length encoding is a form of compression where repeated values are compressed. For example, take the binary \begin{align*} 111110000001110000 \end{align*} by compressing the runs where a zero or a one is repeated, it can become \begin{align*} (5,1),(6,0),(3,1)(4,0) \end{align*} \section{Huffman encoding} Huffman describes a method of encoding text. Suppose we have the text \begin{align*} repetitive \end{align*} We want to create a Huffman tree by writing out each character and the number of times it occurs at the bottom. Then we join together the two lowest numbers and add their number of occurrences. We keep doing this until there is one number left. This ensures that each the letters that occur the most are closest to the top of the tree. Each branch is then labelled zero or one, depending on whether it goes left or right respectively. Each letter then has a new code, based on the path from the top of the tree to it. %% requires package \usepackage[edges]{forest} %\begin{forest} % for tree={ % grow=south, % s sep=7mm % } % [rpveit (10) % [rpve (6) % [e (3)] % [rpv (3) % [ rp(2) % [r (1)] % [p (1)] % ] % [v (1)] % ] % ] % [it (4) % [i (2)] % [t (2)] % ] % ] % ] %\end{forest} \begin{tikzpicture} \node at (0,0) (top) {rpveit (10)}; \node at (-2, -1.5) (2a) {rpve (6)}; \node at (4, -3) (2b) {it (4)}; \node at (-1, -3) (3a) {rpv (3)}; \node at (-2, -4.5) (3b) {rp (2)}; \node at (-6, -6) (e) {e (1)}; \node at (-4, -6) (r) {r (1)}; \node at (-1, -6) (p) {p (1)}; \node at (1, -6) (v) {v (1)}; \node at (3, -6) (i) {i (2)}; \node at (5, -6) (t) {t (2)}; \draw (top) to node[above]{$0$} (2a); \draw (3a) to node[above left]{$0$} (3b); \draw (top) to node[above]{$1$} (2b); \draw (2a) to node[above right]{$1$} (3a); \draw (3b) to node[above]{$0$} (r); \draw (2b) to node[above left]{$0$} (i); \draw (3b) to node[above]{$0$} (r); \draw (2a) to node[left]{$0$} (e); \draw (3b) to node[above right]{$1$} (p); \draw (2b) to node[above right]{$1$} (t); \draw (3a) to node[above]{$1$} (v); \end{tikzpicture} Here, the letter e has the code $00$, the letter r has the code $0100$, the letter p has the code $0101$, and so on, as these are the paths from the top of the tree to the letter in question. If we were encoding with standard ASCII (7 bits per character) then the size of the sting would be $7\times 10 = 70\ bits$ (as there are 10 characters). Using the Huffman tree we have generated we can encode $repetitive$ as \begin{align*} 0100000101001110111001100 \end{align*} This uses $25\ bits$, which is a $(70-25)\div(70)\times 100 \approx 64.3\%$ saving.