diff options
| author | Mohit Agarwal <mohit.agarwal@sky.com> | 2021-10-12 19:44:59 +0100 |
|---|---|---|
| committer | Mohit Agarwal <mohit.agarwal@sky.com> | 2021-10-12 19:44:59 +0100 |
| commit | d6de8c993dbf5522cc2e1b1a6491fd424981ab58 (patch) | |
| tree | 31eac9d804eb1f46cc48f4700140f5cb4c3a2932 /notes/data-rep.tex | |
| parent | 21b74cee1648bad2b9bbc2995fe79018c49a2457 (diff) | |
Writing notes
Diffstat (limited to 'notes/data-rep.tex')
| -rw-r--r-- | notes/data-rep.tex | 195 |
1 files changed, 195 insertions, 0 deletions
diff --git a/notes/data-rep.tex b/notes/data-rep.tex new file mode 100644 index 0000000..f2f5ef4 --- /dev/null +++ b/notes/data-rep.tex @@ -0,0 +1,195 @@ +\chapter{Data representation} + +\section{Metadata} + +Metadata means data about data. It is the information stored in a file +that is not part of the main information, but instead important +properties and data of the file, such as the author name of a PDF +document. Although we do not take it into account when calculating +file size, it is important to realise that in the real world, it would +be there. + +\section{Character encoding} + +The \textit{American Standard Code for Information Interchange} +(ASCII) is a character encoding standard. It is comprised of +a \textit{character set} which contains characters and numerical +representations of each of them, so that they can be processed and +stored by a computer. Standard ASCII uses 7 bits to represent each +character and can thusly store 128 characters (127 characters and 1 +null character). The letter A in ASCII is character 64 ($0100000_2$), +allowing for easier human representation of the rest of the alphabet, +which increments from 64. + +Extended ASCII uses 8 bits to store each character and thus contains +256 characters (255 characters and 1 null character), allowing for +more than the limited alphabet standard ASCII contains by including +more symbols and some accented characters. This however, is quite +limited in comparison to Unicode which is a significantly larger set +which allows for character encoding in many languages and many symbols +such as emoji characters. Unicode is also compatible with many more +devices. The drawback to extended ASCII and even moreso to Unicode is +that they require more space to encode characters. + +\section{Images} + +A \textit{bitmap} (such as jpg, png) image is made up of rows of +pixels, each with a colour value. The resolution of the image, the +$\mbox{width}\times\mbox{height}$ of pixels, determines the detail the +image may contain. The colour depth is the number of bits of +information that each colour is encoded with. As with 7 bit ASCII, +the number of colours that a pixel can possibly have is determined by: + +\begin{equation} + \mbox{\textit{Number of colours}} = 2^{\mbox{\footnotesize Bit\ depth}} +\end{equation} + +For example a 4 bit depth would mean that each pixel could store one +of 16 colours. As this is all of the information an image file stores, +we can deduce that: + +\begin{equation} + \mbox{\textit{File size}} = + \mbox{width}\times\mbox{height}\times\mbox{bit depth} +\end{equation} + +A \textit{vector} image does not use pixels, but instead mathematical +descriptions of shapes to form an image. This means that whilst +zooming into a bitmap image we lose quality as resolution decreases, +this is not the case for vector images. However, vector images could +not represent a photograph and are generally used for design, +animation, or games. + +\section{Sound} + +A \textit{microphone} transforms vibrations in the air (sound) into an +analogue signal (\textit{voltage}). Digital circuitry can take a +\textit{sample} of this analogue signal at set intervals, thus +producing a close representation of the original. The +\textit{resolution} of a sound file is based on the number of possible +values a single sample can have, much like colour depth, where the +resolution is measured in bits. The \textit{sample rate} (measured in +Hertz (Hz), meaning `per second') is the number of samples taken per +second. By multiplying the sample rate by the time of the recording in +seconds, we have the number of samples taken. Then by multiplying this +by the number of bits each sample requires (the resolution) we have +the file size. + +\begin{equation} + \mbox{\textit{File size}} = + \mbox{sample rate}\times\mbox{resolution}\times\mbox{time in + seconds} +\end{equation} + +\section{Data compression} + +If files such as image, sound, or perhaps video were stored +uncompressed then they would take a lot of space on hard drives and +require longer to transfer over a network. Therefore, data compression +is generally used which involves using algorithms to reduce the size +of a file. \textit{Lossless compression} simply reduces the size of a +file and we will examine two example of this. \textit{Lossy +compression} +\footnote{\url{https://en.wikipedia.org/wiki/Data_compression}} +involves removing data such as removing quiet tones from an audio +file. + +\subsection{Run length encoding} + +Run length encoding is a form of compression where repeated values are +compressed. For example, take the binary +\begin{align*} + 111110000001110000 +\end{align*} +by compressing the runs where a zero or a one is repeated, it can +become +\begin{align*} + (5,1),(6,0),(3,1)(4,0) +\end{align*} + +\section{Huffman encoding} + +Huffman describes a method of encoding text. Suppose we have the text +\begin{align*} + repetitive +\end{align*} +We want to create a Huffman tree by writing out each character and the +number of times it occurs at the bottom. Then we join together the two +lowest numbers and add their number of occurrences. We keep doing this +until there is one number left. This ensures that each the letters +that occur the most are closest to the top of the tree. Each branch is +then labelled zero or one, depending on whether it goes left or right +respectively. Each letter then has a new code, based on the path from +the top of the tree to it. + +%% requires package \usepackage[edges]{forest} +%\begin{forest} +% for tree={ +% grow=south, +% s sep=7mm +% } +% [rpveit (10) +% [rpve (6) +% [e (3)] +% [rpv (3) +% [ rp(2) +% [r (1)] +% [p (1)] +% ] +% [v (1)] +% ] +% ] +% [it (4) +% [i (2)] +% [t (2)] +% ] +% ] +% ] +%\end{forest} + +\begin{tikzpicture} + \node at (0,0) (top) {rpveit (10)}; + + \node at (-2, -1.5) (2a) {rpve (6)}; + \node at (4, -3) (2b) {it (4)}; + + \node at (-1, -3) (3a) {rpv (3)}; + \node at (-2, -4.5) (3b) {rp (2)}; + + \node at (-6, -6) (e) {e (1)}; + \node at (-4, -6) (r) {r (1)}; + \node at (-1, -6) (p) {p (1)}; + \node at (1, -6) (v) {v (1)}; + \node at (3, -6) (i) {i (2)}; + \node at (5, -6) (t) {t (2)}; + + + + \draw (top) to node[above]{$0$} (2a); + \draw (3a) to node[above left]{$0$} (3b); + + \draw (top) to node[above]{$1$} (2b); + \draw (2a) to node[above right]{$1$} (3a); + + \draw (3b) to node[above]{$0$} (r); + \draw (2b) to node[above left]{$0$} (i); + \draw (3b) to node[above]{$0$} (r); + \draw (2a) to node[left]{$0$} (e); + + \draw (3b) to node[above right]{$1$} (p); + \draw (2b) to node[above right]{$1$} (t); + \draw (3a) to node[above]{$1$} (v); +\end{tikzpicture} +Here, the letter e has the code $00$, the letter r has the code +$0100$, the letter p has the code $0101$, and so on, as these are the +paths from the top of the tree to the letter in question. If we were +encoding with standard ASCII (7 bits per character) then the size of +the sting would be $7\times 10 = 70\ bits$ (as there are 10 +characters). + +Using the Huffman tree we have generated we can encode $repetitive$ as +\begin{align*} + 0100000101001110111001100 +\end{align*} +This uses $25\ bits$, which is a $(70-25)\div(70)\times 100 \approx +64.3\%$ saving. |
