aboutsummaryrefslogtreecommitdiff
path: root/notes/data-rep.tex
diff options
context:
space:
mode:
Diffstat (limited to 'notes/data-rep.tex')
-rw-r--r--notes/data-rep.tex195
1 files changed, 195 insertions, 0 deletions
diff --git a/notes/data-rep.tex b/notes/data-rep.tex
new file mode 100644
index 0000000..f2f5ef4
--- /dev/null
+++ b/notes/data-rep.tex
@@ -0,0 +1,195 @@
+\chapter{Data representation}
+
+\section{Metadata}
+
+Metadata means data about data. It is the information stored in a file
+that is not part of the main information, but instead important
+properties and data of the file, such as the author name of a PDF
+document. Although we do not take it into account when calculating
+file size, it is important to realise that in the real world, it would
+be there.
+
+\section{Character encoding}
+
+The \textit{American Standard Code for Information Interchange}
+(ASCII) is a character encoding standard. It is comprised of
+a \textit{character set} which contains characters and numerical
+representations of each of them, so that they can be processed and
+stored by a computer. Standard ASCII uses 7 bits to represent each
+character and can thusly store 128 characters (127 characters and 1
+null character). The letter A in ASCII is character 64 ($0100000_2$),
+allowing for easier human representation of the rest of the alphabet,
+which increments from 64.
+
+Extended ASCII uses 8 bits to store each character and thus contains
+256 characters (255 characters and 1 null character), allowing for
+more than the limited alphabet standard ASCII contains by including
+more symbols and some accented characters. This however, is quite
+limited in comparison to Unicode which is a significantly larger set
+which allows for character encoding in many languages and many symbols
+such as emoji characters. Unicode is also compatible with many more
+devices. The drawback to extended ASCII and even moreso to Unicode is
+that they require more space to encode characters.
+
+\section{Images}
+
+A \textit{bitmap} (such as jpg, png) image is made up of rows of
+pixels, each with a colour value. The resolution of the image, the
+$\mbox{width}\times\mbox{height}$ of pixels, determines the detail the
+image may contain. The colour depth is the number of bits of
+information that each colour is encoded with. As with 7 bit ASCII,
+the number of colours that a pixel can possibly have is determined by:
+
+\begin{equation}
+ \mbox{\textit{Number of colours}} = 2^{\mbox{\footnotesize Bit\ depth}}
+\end{equation}
+
+For example a 4 bit depth would mean that each pixel could store one
+of 16 colours. As this is all of the information an image file stores,
+we can deduce that:
+
+\begin{equation}
+ \mbox{\textit{File size}} =
+ \mbox{width}\times\mbox{height}\times\mbox{bit depth}
+\end{equation}
+
+A \textit{vector} image does not use pixels, but instead mathematical
+descriptions of shapes to form an image. This means that whilst
+zooming into a bitmap image we lose quality as resolution decreases,
+this is not the case for vector images. However, vector images could
+not represent a photograph and are generally used for design,
+animation, or games.
+
+\section{Sound}
+
+A \textit{microphone} transforms vibrations in the air (sound) into an
+analogue signal (\textit{voltage}). Digital circuitry can take a
+\textit{sample} of this analogue signal at set intervals, thus
+producing a close representation of the original. The
+\textit{resolution} of a sound file is based on the number of possible
+values a single sample can have, much like colour depth, where the
+resolution is measured in bits. The \textit{sample rate} (measured in
+Hertz (Hz), meaning `per second') is the number of samples taken per
+second. By multiplying the sample rate by the time of the recording in
+seconds, we have the number of samples taken. Then by multiplying this
+by the number of bits each sample requires (the resolution) we have
+the file size.
+
+\begin{equation}
+ \mbox{\textit{File size}} =
+ \mbox{sample rate}\times\mbox{resolution}\times\mbox{time in
+ seconds}
+\end{equation}
+
+\section{Data compression}
+
+If files such as image, sound, or perhaps video were stored
+uncompressed then they would take a lot of space on hard drives and
+require longer to transfer over a network. Therefore, data compression
+is generally used which involves using algorithms to reduce the size
+of a file. \textit{Lossless compression} simply reduces the size of a
+file and we will examine two example of this. \textit{Lossy
+compression}
+\footnote{\url{https://en.wikipedia.org/wiki/Data_compression}}
+involves removing data such as removing quiet tones from an audio
+file.
+
+\subsection{Run length encoding}
+
+Run length encoding is a form of compression where repeated values are
+compressed. For example, take the binary
+\begin{align*}
+ 111110000001110000
+\end{align*}
+by compressing the runs where a zero or a one is repeated, it can
+become
+\begin{align*}
+ (5,1),(6,0),(3,1)(4,0)
+\end{align*}
+
+\section{Huffman encoding}
+
+Huffman describes a method of encoding text. Suppose we have the text
+\begin{align*}
+ repetitive
+\end{align*}
+We want to create a Huffman tree by writing out each character and the
+number of times it occurs at the bottom. Then we join together the two
+lowest numbers and add their number of occurrences. We keep doing this
+until there is one number left. This ensures that each the letters
+that occur the most are closest to the top of the tree. Each branch is
+then labelled zero or one, depending on whether it goes left or right
+respectively. Each letter then has a new code, based on the path from
+the top of the tree to it.
+
+%% requires package \usepackage[edges]{forest}
+%\begin{forest}
+% for tree={
+% grow=south,
+% s sep=7mm
+% }
+% [rpveit (10)
+% [rpve (6)
+% [e (3)]
+% [rpv (3)
+% [ rp(2)
+% [r (1)]
+% [p (1)]
+% ]
+% [v (1)]
+% ]
+% ]
+% [it (4)
+% [i (2)]
+% [t (2)]
+% ]
+% ]
+% ]
+%\end{forest}
+
+\begin{tikzpicture}
+ \node at (0,0) (top) {rpveit (10)};
+
+ \node at (-2, -1.5) (2a) {rpve (6)};
+ \node at (4, -3) (2b) {it (4)};
+
+ \node at (-1, -3) (3a) {rpv (3)};
+ \node at (-2, -4.5) (3b) {rp (2)};
+
+ \node at (-6, -6) (e) {e (1)};
+ \node at (-4, -6) (r) {r (1)};
+ \node at (-1, -6) (p) {p (1)};
+ \node at (1, -6) (v) {v (1)};
+ \node at (3, -6) (i) {i (2)};
+ \node at (5, -6) (t) {t (2)};
+
+
+
+ \draw (top) to node[above]{$0$} (2a);
+ \draw (3a) to node[above left]{$0$} (3b);
+
+ \draw (top) to node[above]{$1$} (2b);
+ \draw (2a) to node[above right]{$1$} (3a);
+
+ \draw (3b) to node[above]{$0$} (r);
+ \draw (2b) to node[above left]{$0$} (i);
+ \draw (3b) to node[above]{$0$} (r);
+ \draw (2a) to node[left]{$0$} (e);
+
+ \draw (3b) to node[above right]{$1$} (p);
+ \draw (2b) to node[above right]{$1$} (t);
+ \draw (3a) to node[above]{$1$} (v);
+\end{tikzpicture}
+Here, the letter e has the code $00$, the letter r has the code
+$0100$, the letter p has the code $0101$, and so on, as these are the
+paths from the top of the tree to the letter in question. If we were
+encoding with standard ASCII (7 bits per character) then the size of
+the sting would be $7\times 10 = 70\ bits$ (as there are 10
+characters).
+
+Using the Huffman tree we have generated we can encode $repetitive$ as
+\begin{align*}
+ 0100000101001110111001100
+\end{align*}
+This uses $25\ bits$, which is a $(70-25)\div(70)\times 100 \approx
+64.3\%$ saving.