main.tex

\documentclass[11pt,a4paper,openright,twoside]{book}
\def\myauthor{Adrià Arrufat}
\def\mytitle{Multiple transforms for video coding}
\def\usepdfs{0} % whether to use pre-generated pdfs with 'make figures'
\def\useINSAcover{0} % whether to use the INSA front and back covers

\usepackage[british,french]{babel} % locale settings
\usepackage{geometry} % page configuration
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
% font configuration
\usepackage[T1]{fontenc}
\usepackage{mathptmx}
\usepackage[scaled]{helvet}
\usepackage{courier}
% mathematical symbols
\usepackage{amsmath, amssymb}
\usepackage{tocbibind} % see http://www.howtotex.com/packages/how-to-add-bibliography-and-more-to-table-of-contents/
\usepackage[Lenny]{fncychap} % style for chapters
\ChTitleVar{\Huge\sf\bfseries}
\usepackage{titlesec,titletoc} % to modify titles and add partial tocs
\titleformat*{\section}{\Large\bfseries\sffamily}
\titleformat*{\subsection}{\large\bfseries\sffamily}
\titleformat*{\subsubsection}{\bfseries\sffamily}
\usepackage[printonlyused,withpage]{acronym} % options: printonlyused, withpage
\usepackage{imakeidx} % allows customizing the index in the makeindex command
\usepackage{emptypage} % removes headers and footers from empty pages
\usepackage{booktabs}  % nice looking tables
\usepackage{siunitx} % consistent units and number support
\sisetup{number-unit-product = \ } % adds a space between numbers and units
\sisetup{round-mode=places,round-precision=2} % set precision to 2 decimal places
\usepackage{subfig} % subfloat environment
% \usepackage{flafter} % make sure floats appear after they have been referenced
\usepackage[section, above, below]{placeins} % prevents float from bein far
\usepackage{nicefrac} % nicer fractions
\usepackage{contour} % for inexisting bold symbols
\contourlength{0.01em}
\usepackage{enumitem} % for custom labels in enumerate environments
\usepackage{algorithm2e} % for describing algorithms
\usepackage{multirow} % for columns that expland among multiple rows in tables
\usepackage{tabularx} % for fixed width tables
\usepackage{diagbox} % for diagonal separator in tables
\usepackage{fancyhdr} % customise headers and footers
\usepackage{pdfpages} % to include external pages from pdfs

% lines and paragraphs
% \usepackage{setspace}
% \setstretch{1}
% \parskip=\smallskipamount
% \setlength{\parindent}{0pt}

\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{pgfplots,tikz}
\usetikzlibrary{shapes,arrows,fit,calc,decorations.markings,intersections}
\usepgfplotslibrary{fillbetween}
\usepackage{intcalc, numprint}
\usepackage{bookmark} % needed for \bookmarksetup{startatroot}
\usepackage{hyperref}
\hypersetup{
	unicode=true,
	pdfencoding=auto,
	colorlinks=true,
	citecolor=blue,
	filecolor=red,
	linkcolor=Blue,
	urlcolor=blue,
	linktoc=all,
	pdfauthor={\myauthor},
	pdftitle={\mytitle},
	pdfsubject={video coding},
	pdfkeywords={video, image, transform, coding, compression, phd, thesis},
	pdfinfo={
		CreationDate={D:20150317093606},
		%ModDate={...}
	},
}
% \urlstyle{same} % do not use monospaced fonts in urls

% Define a TODO command
\definecolor{yellowish}{rgb}{1,1,0.5}
\definecolor{redish}{rgb}{1,0.25,0.25}
\providecommand{\todo}[1]{
	\begin{center}
		\colorbox{yellowish}{
			\begin{minipage}{0.95\linewidth}
				\textbf{\color{redish}{TODO}:} #1
			\end{minipage}
		}
	\end{center}
}

% Define a partial ToC to use at the beginning of every chapter
\providecommand{\chaptertoc}{
	\startcontents[chapters]
	\hrule
	\vspace{1em}
	\printcontents[chapters]{}{1}{{\sf\large\bfseries Contents}}
	\vspace{1em}
	\hrule
}

% import mathematical and colour definitions
\input{./custom_defs.tex}

\numberwithin{equation}{section} % equations referred to sections
\numberwithin{figure}{section} % figures referred to sections
\numberwithin{table}{section} % tables referred to sections

\makeindex[options=-s index-alph-group.ist]

\title{\Huge\bf\mytitle}
\author{\myauthor}

\begin{document}

\selectlanguage{british}

\frontmatter
% front cover
\ifthenelse{\useINSAcover = 1}
{
	\includepdf[pages={2}]{./cover/cover-mtvc-thesis-INSA-UEB.pdf}
	\cleardoublepage
	\includepdf[pages={4}]{./cover/cover-mtvc-thesis-INSA-UEB.pdf}
}{\maketitle}

\chapter{Acknowledgements}
\label{cha:acknowledgements}

First of all, I would like to thank my PhD supervisor, Pierrick Philippe from
Orange Labs.
He has transmitted me the enthusiasm in the daily work by flooding me with new
ideas and challenges every day.
I feel that I am very lucky and honoured to have been able to work with him
and his team and I hope that our paths will cross again during our
professional and personal lives.

I also want to thank the rest of the team at Orange Labs: Gordon Clare and
Félix Henry for their invaluable help at the beginning, which allowed me to
dive into the code and get a better understanding on video coding.
A special mention goes to Patrick Boissonade, with whom I shared the office
during these years.
No matter how complicated seemed a technical difficulty, he always managed to
impress me with his knowledge on everything: from coding and optimisation on
different architectures, to system administration.
I also want to thank him the patience he has shown every time I made the
cluster crash with my experiments and how he came up with a way of finding the
issue and solving it in record time.
Without all these efforts, the results of my work during the last three years
would be very different.

Other co-workers at Orange also made my stay a lot more pleasant with the
interesting discussions we had at lunch time about almost any topic.
These discussions have allowed me to get to know better my team and other
people such as Patrick Gioia and Stéphane Pateux.

I also feel very grateful to Didier Gaubil, our team manager, for being always
available whenever I needed him.
Travelling feels so much safer in the knowledge that someone like him is in
charge and will know what to do in case of an emergency.

I have to thank Olivier Déforges for his advice, dedication and support,
especially when deadlines for publications approached.
Moreover, having attended to two conferences with him, allowed me to get to
be more confident and to know him better.

A special mention for Hendrik Vorwerk, whose work during his intern-ship
served as an invaluable starting point for my results, and without which I
would have struggled to achieve the same results.

The members of the jury (Christine Guillemot, Béatrice Pesquet-Popescu,
Mathias Wien, Fernando Pereira and Philippe Salembier) deserve a distinctive
mention as well for having accepted to review, assisted to my PhD defence and
given constructive feedback.

On a side note, I must mention that all the experiments have been carried out
using free and open-source software, as such I thank the Internet and Linux
community for making all this kind of knowledge available on-line.

Last but not least, I want to thank my parents, for bearing me on endless
phone conversations almost every day and having supported me through all
these years.

\chapter{French summary}
\label{cha:french_summary}

\selectlanguage{french}

Les codeurs vidéo état de l'art utilisent des transformées pour assurer une
représentation compacte du signal.
L'étape de transformée constitue le domaine dans lequel s'effectue la
compression, pourtant peu de variabilité dans les transformées est observé
dans la littérature: habituellement, une fois que la taille d'un bloc est
sélectionné, la transformée est figée, habituellement de type Transformée en
Cosinus Discrète (TCD).

D'autres transformées autres que cette transformée, qui constitue le choix de
facto, ont récemment reçu une attention en application de codage vidéo.
Par exemple, dans la dernière norme de compression vidéo appelée HEVC (High
Efficiency Video Coding, codage vidéo à haute efficacité), la Transformée en
Sinus Discrète (TSD) est également utilisée pour traiter les blocs issus de la
prédiction pour les tailles $4\times4$.
De plus, pour ces blocs particuliers, HEVC a le choix complémentaire de ne pas
transformer le bloc, par utilisation du signal transformSkip.
Ce fait révèle l'intérêt croissant pour étendre les choix entre transformées
pour accommoder les insatiables besoins en compression vidéo.

Cette thèse se concentre sur l'amélioration des performances en codage vidéo
par l'utilisation de multiples transformées.
Les résultats sont présentés pour le codage des images Intra, c'est-à-dire des
images qui ne sont codées qu'à partir de données locales à celle-ci.
Dans cette configuration la norme de compression HEVC (publiée en 2013), qui
représente la solution la plus aboutie en la matière, améliore la performance
de compression du précédent standard appelé AVC (publié en 2003) de 22\%.

HEVC obtient cette amélioration par la démultiplication des alternatives de
codage comme l'utilisation de plusieurs tailles de bloc (4, 8, 16, 32 et 64)
et modes de prédiction (35 modes) pour générer le signal résiduel (différence
entre les pixels de l'image originale et l'image issue de la prédiction) qui
est ensuite transformé par une transformée donnée selon la taille
sélectionnée.
L'objectif pour le codeur est de trouver le meilleur compromis entre la
distorsion apportée par la quantification et le débit nécessaire pour
transmettre les valeurs approximées.
On se rend compte que HEVC investit une part importante dans la génération de
résidus, mais peu d'alternatives existent quant à la transformée.

Cette thèse est motivée par le fait que l'utilisation de plusieurs
transformées permet d'obtenir une représentation plus parcimonieuse du signal
que dans le cas d'une seule transformée.
Comme ce thème est relativement peu abordé en codage vidéo, cette thèse tente
de combler le vide pour considérer des transformées autres que la transformée
en cosinus discrète.

Pour ce faire, un aspect de cette thèse concerne la conception de transformées
en utilisant deux techniques qui sont détaillées dans ce manuscrit.
L'approche traditionnelle à base de transformées de Karhunen-Loève (KLT) et
une transformée optimisée débit distorsion nommée RDOT.
La KLT est une transformée qui a pour vocation à minimiser la distorsion sous
une hypothèse de haute résolution au travers d'une allocation de bit optimale,
cela implique une décorrélation du signal dans le domaine transformé.
La RDOT quant à elle, essaie de rendre le signal le plus parcimonieux possible
tout en limitant la quantité de distorsion induite par la quantification.

La première approche basée transformée multiples est au travers d'une
technique nommée MDDT (Mode Dependent Directional Transform).
Celle-ci consiste à utiliser une transformée adaptée, par le biais d'une KLT
ou d'une RDOT, pour chaque mode de prédiction intra.
Par une utilisation de transformées séparables, un petit gain est observé par
rapport à HEVC (de l'ordre de 0.5\% du débit est économisé).
Néanmoins, l'utilisation de transformées non-séparables révèle des gains
tangibles de l'ordre de 2.4\% lorsque les transformées sont adaptées au
travers de la RDOT.
Ce gain est plus favorable que celui observé lorsque les transformées sont
construites à partir de l'approche KLT: celle-ci n'améliore HEVC que de
1.8\%.
Les résultats de cette étude sont résumées dans l'article intitulé
``Non-separable mode dependent transforms for intra coding in HEVC'' présenté
à la conférence VCIP 2014.
Ce chapitre conclut que les transformées basées sur la RDOT ont de meilleures
performances que celles basée KLT.

Dans l'objectif d'étendre l'approche MDDT, le chapitre suivant décrit une
approche nommée MDTC (Mode-Dependent Transform Competition) dans la quelle
chaque mode de prédiction est équipé de plusieurs transformées.
Lors du codage, ces transformées entrent en compétition de la même façon que
les modes de prédiction et tailles de blocs sont sélectionnés.
Ce système apporte des gains de l'ordre de 7\% pour des transformées
non-séparables et 4\% pour les transformées séparables, en comparaison avec
HEVC.
Les résultats de ce chapitre sont publiés dans l'article ``Mode-dependent
transform competition for HEVC'' publié lors de la conférence ICIP 2015.
Néanmoins la complexité de tels systèmes est notoire, à la fois en ressources
de calcul et en espace de stockage: un facteur de 10 en temps de codage et la
complexité de décodage est accrue de 40\% par rapport à HEVC.
Le stockage des transformées requiert en outre plus de 300 kilo-octets.

En conséquence les chapitres suivants de la thèse développent des approches
permettant de simplifier visant à simplifier les systèmes MDTC tout en
conservant dans la mesure du possible l'amélioration en débit.
Comme les transformées non-séparables apportent les gains les plus
prometteurs, le chapitre 5, présente une approche plus simple permettant
d'utiliser néanmoins les transformées non-séparables.
Ces travaux ont été publiés dans la référence ``Image coding with incomplete
transform competition for HEVC'' présentée à la conférence ICIP 2015.
L'approche développée consiste à ne plus utiliser l'ensemble des vecteurs de
base lors de la transformation, mais de ne conserver que la première base.
Un ensemble de transformées incomplète est ainsi produit et utilisé en
complément de la transformée HEVC qui conserve sa base complète.
Des gains en compression de l'ordre de 1\% sont observés avec cette technique,
avec une complexité au décodeur notablement abaissée par rapport aux
précédentes approches: elle devient même plus faible que celle de HEVC.

Finalement, une procédure de construction de systèmes MDTC à basse complexité
est présentée.
Ces travaux sont repris dans la publication ``Low complexity transform
competition for HEVC'' acceptée à la conférence ICASSP 2016.
Cette approche à basse complexité s'appuie sur trois composantes qui sont
évaluées: tout d'abord une sélection du nombre adéquat de transformées par
mode est effectuée, ce qui permet de réduire le nombre de transformées et
limiter l'espace de stockage et la complexité de codage.
De plus des symétries entre modes de prédiction sont exploitées pour réduire
la ROM d'un facteur 3.
Pour terminer l'utilisation de transformées trigonométriques (DTT, Discrete
Trigonometric Transforms) est motivé par existence d'algorithmes rapides.
L'ensemble de ces contributions réunies permet de proposer un système d'une
complexité d'encodage de 50\% accrue par rapport à l'état de l'art avec une
complexité ajoutée, au niveau décodage et stockage, mineure.

En conclusion les résultats de cette thèse montrent que les transformées
multiples apportent des gains significatifs en comparaison avec le plus récent
standard de codage vidéo.
Des gains très substantiels par rapport à HEVC sont apportés si l'on néglige
les aspects complexité.
Néanmoins pour des systèmes réalistes des gains tangibles sont obtenus pour
des complexités compétitives.

\selectlanguage{british}

% \setcounter{tocdepth}{5}
\tableofcontents
\cleardoublepage
\chapter{List of Acronyms}
\label{cha:glossary}
\input{./acronyms.tex}
\cleardoublepage
\listoffigures
\cleardoublepage
\listoftables
\cleardoublepage

\chapter{General introduction}
\label{cha:general_intoduction}
% \addcontentsline{toc}{chapter}{\protect\numberline{}General introduction}

\section*{Context}
\label{sec:context}
\addcontentsline{toc}{section}{\protect\numberline{}Context}

Nowadays, video services play a major role in information exchanges around the
world.
Despite the progress achieved in the last years with video coding standards,
improvements are still required as new formats emerge:
as \ac{HFR}, \ac{HDR} and \ac{HD} formats become more and more common, new
needs for video compression appear that must exploit properties in these
domains to achieve higher compression rates.

All these formats are made realistic in terms of service deployment thanks to
the fact that around every 10 years, the coding efficiency doubles for
equivalent quality.
In 2003, the H.264/\acs{MPEG}-4 \acs{AVC} standard was defined, providing
a compression rate of around 50\% with regards \acs{MPEG}-2 video, defined
in 1993.
In January 2013, the \acs{HEVC} standard was released, which outperforms
H.264/\acs{MPEG}-4 \acs{AVC} by 50\% in terms of bitrate savings for
equivalent perceptual quality~\cite{sullivan-12-overview-hevc}.

\bigskip

The work carried out in this thesis started in November 2012, with the
\acs{HEVC} standard almost completely defined.
Consequently, the focus has been put on improving \acs{HEVC} with
new techniques that could be adopted in a future standard, tentatively for
around 2020.
Recently, \acs{ITU} and \acs{ISO}, through their respective groups \acs{VCEG}
and \acs{MPEG}, have started working towards a possible future video coding
standard for that time frame.

\bigskip

Being at the beginning of the post-\acs{HEVC} era, the first steps in this
thesis strive to achieve important bitrate savings over \acs{HEVC} by
relaxing complexity constraints.

This thesis is strongly connected to the standardisation context.
The first exploratory direction points towards finding new techniques
regarding the role of transforms in video coding, such as different
transform design methods and the usage of multiple transforms adapted to the
nature of video coding signals.
Then, the studies move towards making these new techniques involving multiple
transforms admissible in a standardisation context, which implies having
reasonable impact on standardisation aspects, such as complexity, especially
on the decoder side.

\bigskip

Accordingly, this thesis has been organised into the following Chapters:

\begin{enumerate}
	[labelindent=3.8em,leftmargin=!,label={\bf Chapter \arabic{enumi}}]
	\item starts with an introduction to the basics of video coding and some
		essential concepts for this thesis on modern video coding standards.
		The focus is quickly put on the transform stage and its crucial role
		in the video coding scheme.

	\item contains a detailed study on the transform role inside video coding
		applications.

		As mentioned, the main motivation of this thesis is to improve video
		coding by making use of multiple transforms.
		In order to conceive those transforms, two design methods are studied:
		the \ac{KLT} and a \ac{RDOT}.

		The \ac{KLT} defines a well-known transform design method to conceive
		transforms that minimise the distortion under the high-resolution
		quantisation hypothesis and to provide optimal bit-allocation via
		signal decorrelation in the transform domain.

		The \ac{RDOT}, presented in details in this thesis, describes another
		design method that tries to output a signal which is as sparse as
		possible while minimising the distortion introduced by the
		quantisation.

	\item compares the \ac{KLT} and \ac{RDOT} design methods introduced in the
		previous Chapter by using multiple transforms in a modified version of
		\ac{HEVC} through the \ac{MDDT} technique, where one adapted transform
		(\ac{KLT} or \ac{RDOT}) is provided per \acl{IPM}.

		This experiment questions the optimality of the \ac{KLT} for video
		signals.
		Moreover, the impact of transform separability on video coding in
		terms of bitrate savings and coding complexity is reconsidered.

	\item extends the \ac{MDDT} system by introducing the \ac{MDTC} system.
		The main idea is to provide several transforms in each \acl{IPM}.
		These transforms compete against each other in the same way that block
		sizes and \aclp{IPM} do.
		\ac{MDTC} systems are able to offer notable bitrate savings at the
		expense of encoding and decoding complexity and the storage
		requirements for the used transforms.
		Therefore, the following Chapters of the thesis contain approaches on
		simplifying the \ac{MDTC} system whilst keeping bitrate improvements.

	\item explores a new way of simplifying non-separable \ac{MDTC} systems,
		which provided the most promising results in the previous Chapter, by
		using incomplete transforms.

		Incomplete transforms are low complexity transforms where, instead of
		using all the basis vectors for the non-separable transforms, only the
		first one is retained.
		They are used as companions of the default \acs{HEVC} transforms.

	\item presents different methods for reducing the storage requirements of
		\ac{MDTC} systems.
		The proposed methods are based on the fact that not all \aclp{IPM}
		present a uniform behaviour inside \ac{HEVC} in terms of usage
		frequency and signalling.
		Therefore, not all \aclp{IPM} need the same number of transforms,
		which allows for a reduction of signalling, encoding complexity and
		storage.

		Moreover, symmetries observed in \aclp{IPM} are exploited to further
		reduce the storage requirements while having low impact on the
		bitrate savings.

	\item replaces the \ac{RDOT}-based \ac{MDTC} system with another one based
		on \acp{DTT}.

		\acp{DTT} can be expressed compactly via an analytical formula, which
		reduces notably the storage requirements.
		And, since they are backed up by fast algorithms, they make suitable
		candidates for low complexity systems that have no impact on the
		decoding complexity compared to \acs{HEVC}.

	\item is a summary of the work carried out in this thesis in order to
		provide a better overview of the achievements of systems making use of
		multiple transforms.
		Since many systems offering various trade-offs between complexity and
		performance are presented in this thesis, the most relevant are
		grouped in this Chapter for an easier comparison.

	\item concludes on the thesis results and presents some perspectives for
		future work on multiple transforms for video coding.
\end{enumerate}

\section*{Thesis contributions}
\label{sec:thesis_contributions}
\addcontentsline{toc}{section}{\protect\numberline{}Thesis contributions}

Work carried out in this thesis contains several objectives and contributions,
amongst which:
\begin{itemize}
	\item Reconsider the separability on transforms for video coding.
	\item Revisit the use of the \acs{KLT} for video coding.
	\item Define a metric to be able to rank transforms and allow learning and
		classification algorithms.
	\item Design video coding systems that make use of multiple transforms.
\end{itemize}

The thesis ends with a summary of the thesis objectives, followed by the
conclusions and the perspectives for future work based on the use of multiple
transforms for video coding.

\mainmatter

\acresetall % reset all acronym expansions

\chapter{Video coding fundamentals}
\label{cha:video_coding_fundamentals}
\chaptertoc

\section{Introduction to video coding}
\label{sec:introduction_to_video_coding}

The purpose of video coding is to compress video streams, which consist of a
sequence of images that, at some point, will be either transmitted or stored.
Compressing means reducing the quantity of information so that the amount of
bits required to represent that information is low enough to enable the use of
video-based applications.

For instance, a video in \ac{HD} format ($1920 \times 1080$)
at a frame rate of 25 images per second and 8 bits to represent each one
of the \ac{RGB} channels, requires:
\[
	\frac{1920\times\SI{1080}{pix}}{\SI{1}{image}}
	\times \frac{\SI{25}{images}}{\SI{1}{s}}
	\times \frac{\SI{3}{channels}}{\SI{1}{pix}}
	\times \frac{\SI{8}{bits}}{\SI{1}{channel}}
	\approx \SI{1.2}{\giga bit/\second}
\]

Through this simple example, it is obvious that video coding is compulsory
to stream or even store video files:
the amount of bitrate reported is this example is beyond the limit of current
computing and service architectures and networks.

Depending on the target quality, it is common to have compression rates
ranging from 10 to 1000.
For content providers, being able to reduce the size of the content they
deliver the opportunity to increase the number of contents they can store, as
well as the number of subscribers they can reach using the same resources for
storage and network capacity.

In 2013, two thirds of the \ac{IP} traffic was due to video
streaming, and this trend is foreseen to continuously increase, reaching up to
84\% of the \acs{IP} traffic by 2018~\cite{cisco-13-vni-forecast}.
Such forecasts highlight the need to continue the research on new video
coding techniques.

As a network operator, being also an \ac{ISP} that delivers video services
such as live TV streaming or video on demand services, Orange is, therefore,
particularly motivated by coding schemes allowing better trade-offs between
the quantity and quality of served videos and the amount of reachable clients.

\section{The video coding system}
\label{sec:the_video_coding_system}
\index{video coding system}

The video coding system describes a work-flow to process with video sequences.
It is composed of several stages:
from the video acquisition at the source and to the video display.
A good understanding of each part is crucial in order to be able to take the
proper decisions when delivering a video coding solution.

This section presents a scheme containing the most important concepts
used in state-of-the-art video codecs.
Figure~\ref{fig:video_coding_system} describes a way in which a complete video
coding system can be organised as a block diagram.
The composing blocks are explained more in details in the following
subsections.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/video_coding_system.tex}}
	{\includegraphics{./figures/video_coding_system.pdf}}
	\caption{Video coding system}
	\label{fig:video_coding_system}
\end{figure}

\subsection{Pre-processing}
\label{sub:pre_processing}

Once the digital video has been captured at the source, which may be of many
different sorts such as natural scenes or synthetic computer-generated
content, it may need to be processed in order to be encoded.
Usually, the pre-processing stage can include some filtering, scaling and
colour space conversions on the raw (uncompressed) video sequence.

\subsubsection{Up-sampling and down-sampling}
\label{ssub:up-sampling_and_down-sampling}

The video captured at the source might not have the desired resolution.
Consequently, scaling operations should be applied at source to assure the
minimum impact to quality.

The sequences used in this thesis are centred around the \ac{HD} formats (720p
and 1080p).
However, other resolutions will also be considered, such as WVGA
($800\times480$) and WQXGA ($2560\times1600$).

\subsubsection{Colour space conversions}
\label{ssub:colour_space_conversions}
\index{colour space}
\index{HVS}
\index{YUV}
\index{luma}
\index{chroma}
\index{RGB}
\index{Y'CbCr}

A colour space is an abstract mathematical model specified by primary colours
that is used to describe a representation of a picture, usually as a set of
tristimulus values.

Colour space conversions are used to change the colour representation of the
content to better fit the \ac{HVS} and to decorrelate the components for
better coding efficiency.

Often, the colour space at the source is in the \ac{RGB} format, but since the
\ac{HVS} is more sensible to light variations than to colour variations, the
grey scale version of the image, which contains the light information (luma)
is separated from the colour information
(chroma)~\cite{poynton-95-color-space}.
The family colour spaces where the luma is separated from chroma are usually
referred to as YUV colour spaces.
This way the colour information can be sub-sampled without any visual
degradation.

As a result, in this thesis, the improvements on video coding quality will
focus the luma component, since it plays a more important role in perceptual
quality and represents an important part in the final bitstream.

\subsection{Encoding}
\label{sub:encoding}

This stage converts the pre-processed raw video sequence into a coded video
stream (called bitstream) to ease the storage and transmission.
It is, at this point, where the bitrate reduction takes place.
The amount of bits required to represent the video stream is reduced by
limiting the number of redundancies in the video and by introducing some
approximations such as the quantisation step, while limiting the impact of
these approximations on the perceived quality.

A deep look at the inners of a widely used coding scheme is provided
in \S\ref{sec:the_hybrid_video_coding_scheme}.

\subsection{Transmission}
\label{sub:transmission}
\index{random access}

The transmission stage represents the channel through which the encoded
bitstream is made available to the decoder.
The channel can be a physical storage, such as optical discs, or any other
transmission channel: wired/wireless connections with 1 to 1 (unicast) or 1 to
many (multicast/broadcast) transmissions.

Depending on the application and the channel, the behaviour of the encoder may
vary:
a video that is encoded for storage purposes will not have a real time
constraint, present in streaming applications.
For example, the \ac{RA} technique, i.e.\ the ability to access to a
particular piece of a video sequence, needs to be guaranteed for some
applications like TV, while the latency needs to be kept as low as possible to
enable services like surveillance or video conferencing systems.
This latency limitations need to be taken into account at the encoding stage.

\subsection{Decoding}
\label{sub:decoding}

As the stream is received by the decoder, it is buffered and used to
reconstruct the encoded data into the appropriate format, as signalled
by the encoder.
\acs{MPEG} and \acs{ITU} video coding standards define two key points:
the bitstream conveying the
compressed video data and the bit-exact decoding process, aiming at recovering
the sequence of images.
\acs{MPEG} (the \acs{JCT} 1/SC 29/WG 11 of the \acs{ISO}/\acs{IEC}
organisation) and \acs{VCEG} (the question 6 of ITU-T SG 16) are the main
organisations specifying video coding algorithms.
The most recent video coding specifications include the \acs{MPEG}-4 part 10
standard / \acs{ITU} H.264 recommendation, known as \ac{AVC}~\cite{itu-03-avc}
or \acs{MPEG}-H Part 2 / \acs{ITU} H.265, called
\acf{HEVC}~\cite{itu-13-hevc}.
The encoder must comply with this specification by generating a decodable
stream, there is no other normative behaviour defined by a video coding
standard.

\subsection{Post-processing}
\label{sub:post_processing}

The post-processing stage performs operations for image enhancement and
display adaptation, such as converting back to the original colour
space and to the display format.

\section{The hybrid video coding scheme}
\label{sec:the_hybrid_video_coding_scheme}
\index{hybrid video coding scheme}
\index{MPEG}
\index{H.261}
\index{AVC}
\index{H.264}
\index{HEVC}
\index{H.265}

State-of-the-art video coding standards such as H.264/\acs{MPEG}-4 \acs{AVC}
and \acs{HEVC} use a hybrid video coding scheme.
The overall coding structure appeared in H.261, in 1988.
Since then, all video coding standards and recommendations issued by
\acs{VCEG} and the \acs{MPEG} use this coding structure~\cite{wien-15-hevc}.

The hybrid video coding scheme is named after its use of both temporal
prediction and transform coding techniques for the prediction error.
A basic structure of the hybrid video coding scheme is presented in
figure~\ref{fig:hybrid_video_coding_scheme}.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/hybrid_video_coding_scheme.tex}}
	{\includegraphics{./figures/hybrid_video_coding_scheme.pdf}}
	\caption{Hybrid video coding scheme}
	\label{fig:hybrid_video_coding_scheme}
\end{figure}

The hybrid video coding scheme provides an efficient way of compressing
a video signal into a bitstream of the smallest possible size for a given
level of fidelity.
The key features to achieve such a small bitstream are the signal
prediction and the transformation and quantisation of the prediction error.

A decoder is included in the encoder, represented inside a blue box, to be
able to perform its coding decisions based on what the decoder would do
while decoding a bitstream.

The building blocks of the hybrid video coding scheme are explained in
the following subsections.

\subsection{Partitioning}
\label{sub:partitioning}
\index{partitioning}
\index{quad-tree}
\index{CTB}
\index{CB}
\index{PU}
\index{TU}

In order to process the video frames, they are exhaustively partitioned into
non-overlapping blocks.
Sub-partitioning a picture allows for a better matching to the spatial
distribution of energy.
The subdivision blocks ease, as well,  the succeeding stages of prediction and
transform:
the blocks can be processed, under some constraints, independently so that a
parallel processing is made possible.
The partitioning does not necessarily imply same-sized blocks, allowing
rectangular blocks of different sizes to be used, as illustrated in
figure~\ref{fig:part_orig_pred_res_image} (a).
This is due to the quad-tree partitioning that is implemented in \ac{HEVC}.
The figure provides a partitioning for a certain level of quantisation.
It can be seen that the picture has been divided into uniform regions.

The optimal choice of the block size is left to the encoder, through \acp{CTB}
in \ac{HEVC}.
Each \ac{CTB} can be split into four \acp{CB}, which can also be split into
four \acp{PU}, which can finally be split into four \acp{TU}.

\subsection{Prediction}
\label{sub:prediction}
\index{prediction}

Instead of coding the blocks coming from the original source picture directly,
the encoder computes an estimation of the block samples, which is then
subtracted from the original pixels, generating the residual block.
This technique is known as \ac{DPCM} in the literature~\cite{cutler-52-dpcm}.

Those block estimations are carried out by the prediction module, using some
information from previously processed blocks.
This way, predictable information present in the original blocks is removed
and the energy of the resulting signal is lowered so that it requires fewer
bits for a given distortion.
This is a result of the source coding
theory~\cite{jayant-84-digital-coding-waveforms}.
In other terms, the prediction mechanism aims at removing inter-block
redundancies in the video signal.

Predictions must be performed the same way at both encoder and decoder side,
and thus computed inside the blue box in
figure~\ref{fig:hybrid_video_coding_scheme}, referring to the decoder.
For this reason, the encoder uses reconstructed blocks (blocks that have
already been encoded) as the input data to compute the predictions, as these
blocks are equivalent to those the decoder will handle.

Commonly, predictions are of two types, depending on the origin of
the prediction source:
\begin{itemize}
	\item Intra prediction, also called spatial prediction, for those blocks
		predicted using information within the same picture.
	\item Inter prediction, also called temporal prediction, for those blocks
		predicted from pictures other than the one under consideration.
\end{itemize}

Figure~\ref{fig:part_orig_pred_res_image} (b) provides an example of an intra
predicted picture using the partitioning from
figure~\ref{fig:part_orig_pred_res_image} (a).
Figure~\ref{fig:part_orig_pred_res_image} (c) displays the residual picture (the
difference between the original and predicted pictures).
This image evidences the parts of the picture that could not be predicted
properly and will have to be encoded and transmitted.

\begin{figure}[tb]
	\centering
	\subfloat[Partitioning]
	{\includegraphics[width=0.5\linewidth]{./figures/partitioning-orig-all-001.png}}
	\\
	\subfloat[Predicted Picture]
	{\includegraphics[width=0.5\linewidth]{./figures/pred_image-all-001.png}}
	\\
	\subfloat[Residual Picture]
	{\includegraphics[width=0.5\linewidth]{./figures/res_image-all-001.png}}
	\caption{Example of a picture at different encoding points for \acs{AI}
	main configuration at QP 32}
	\label{fig:part_orig_pred_res_image}
\end{figure}

\subsubsection{Intra prediction}
\label{ssub:intra_prediction}
\index{intra prediction}

Intra prediction, sometimes referred to as spatial prediction, is used to
eliminate spatial redundancies by removing the correlation within local
regions of a picture.
The basic principle of intra prediction is based on the fact that the texture
of a picture region is similar to the texture of its neighbourhood and can be
predicted from there.
Pictures coded using this technique exclusively are named I-slices.

Different models of predictions can be used through projections of adjacent
decoded blocks.
These models include directional projections, gradient projections (called
Planar) and the projection of the mean value (called DC).
The \acp{IPM} are used to derive the predictions of the current block from its
available boundaries, formed by reconstructed blocks.

\subsubsection{Inter prediction}
\label{ssub:inter_prediction}
\index{inter prediction}

Inter prediction, or temporal prediction, takes advantage of the fact that
temporally close pictures share many similarities, and that some of their
component regions will move as a whole.
Since the encoding order is not necessarily the same as the viewing one, inter
predictions can have their origins in either past or future frames, and also
combine both origins.
This feature facilitates movement tracking across frames.
Pictures that are predicted using only one picture either from the past or
from the future are called predicted pictures or P-slices.
Bi-predicted pictures or B-slices have prediction origins in two different
pictures.

At the encoder side, an extra operation, called motion estimation, is
carried out.
This stage searches the best matching area in the reference picture for the
current prediction block.
It is one of the most complex parts of video coders in terms of computational
requirements.
Once a good prediction has been found, a motion vector is created, indicating
the offset that has to be applied in the block from the reference picture.

\subsection{Transform}
\label{sub:transform}
\index{transform}
\index{energy compaction}

The transform stage reduces the remaining correlations from the residual
block, computed as the difference between the original and the predicted
blocks.
The goal of the transform is to concentrate the residual signal into as few
coefficients as possible in the transform domain.
In the spatial domain, the residual signal is spread among the samples of the
blocks, while the objective of transform domains is to concentrate as much as
possible the residual signal into a few transform domain coefficients
exhibiting a large amplitude and the rest might be considered negligible.

This idea of the energy compaction is the main property of the transform
stage.

Most of the transforms used in standardised video coding schemes belong to the
\ac{DTT} family.
Amongst those, the \ac{DCT} of type II has received a considerable amount of
attention in the past and is the \emph{de facto} standard transform used in
\acs{ITU} and \acs{MPEG} codecs since \acs{MPEG}-1/H.261.

Additional choices were introduced recently, especially in \ac{HEVC}, where
the \ac{DST} of type VII was adopted.

Provided that the subject of this thesis is centred around the
transforms for video coding, the transform stage will be explained
thoroughly in Chapter~\ref{cha:transform_coding}.

\subsection{Quantisation}
\label{sub:quantisation}
\index{QP}
\index{quantisation}
\index{lossy}
\index{lossless}

The quantisation, applied in the transform domain, is used as an approximation
operator, reducing the amount of possible output values.
In standards like \ac{HEVC}, the quantisation is scalar:
each coefficient is approximated independently from its neighbouring values.
In these coding schemes, the quantisation step is controlled by a \ac{QP} that
discards any coefficient whose energy level is below a certain threshold.
High energy coefficients are also affected by the quantisation.
It is worth noticing that the quantisation is the only non-reversible step in
the whole hybrid video coding scheme, which induces lossy video coding.
Lossless (or near-lossless) video coding can be attained by not using
quantisation in the process.

\subsection{Loop filters}
\label{sub:loop_filters}
\index{Loop filters}

Loop filters have the goal of improving the quality of the reconstructed
picture for display purposes.
Being located within the loop, these filters have a strong impact on the
overall performance of the video coding scheme.

These filters can be classified into two classes, depending on their target
region.
The fist class is applied to a region of a picture or even a complete picture,
whereas filters from the second class operate on a local spatial domain of the
picture.

A detailed explanation on how these filters improve the picture quality is
explained in \S 2.4.6 and \S 9 of~\cite{wien-15-hevc}.

\subsection{Entropy coding}
\label{sub:entropy_coding}
\index{entropy coding}
\index{CABAC}
\index{scanning}

The last operation consists in reducing the amount of bits transmitted through
the use of an entropy code.
This is a lossless operation, as such the bit allocation performed during this
stage is reversible: no approximation operation is performed at this stage.

Once the transform coefficients have been quantised, they are scanned to make
sure they are sorted in a way that will make the entropy coder work
efficiently.

The scanning operation is a conversion from a 2D array, containing the
quantised transformed coefficients, towards a 1D vector containing the same
values sorted in a way that facilitates a compact transmission.
An appropriate scanning is crucial for efficient entropy
coding~\cite{ye-08-intra-directional-scanning-mddt}.

Signalling is also conveyed into the bitstream at this point, and the
entropy coder ensures a correct binarisation while using the adequate
number of bits.
In \acs{HEVC}, the entropy coding is named \ac{CABAC}.

\subsection{Intra coding in \acs{HEVC}}
\label{sub:intra_coding_in_hevc}
\index{PU}
\index{TU}
\index{scanning}

This subsection explains some particularities of the intra coding inside
\ac{HEVC}, as the work carried out in this thesis focuses on the intra
coding and take advantage of them.

With regards to previous standards, such as H.264/\acs{MPEG}-4 \ac{AVC}, which
has 9 different prediction modes for intra coding, \ac{HEVC} has an improved
prediction system, with 35 different prediction modes.
The upper-left part of figure~\ref{fig:mdcs} illustrates them.
A detailed explanation on how predictions are derived from the block
boundaries using those prediction modes can be found in Chapter 6
of~\cite{wien-15-hevc}.

Depending on the selected \ac{IPM}, residuals present different patterns, and
so do their transformed coefficients.
The top-right part of figure~\ref{fig:mdcs} presents the average \ac{HEVC}
$4\times4$ residuals by scanning mode, together with their average
representation in transform domain through the \acs{DST}-VII.
The average residual profiles have lower (dark) values near the available
borders, which increase with the distance from the boundaries: residuals
issued from horizontal and vertical \acp{IPM} only have the left and upper
borders available, respectively, whereas the remaining \acp{IPM} tend to have
both borders available.
It can also be observed that the scanning patterns match reasonably well the
transformed coefficients in each case, sorting them in an increasing order.

These patterns in the transform domain determine different scanning orders (in
different colours), as presented in the lower part of figure~\ref{fig:mdcs}.
An adapted mode to each pattern ensures, in average, a correct order of the
coefficients that will group all the null values together.
The patterns are described for $4\times4$ blocks, and the same pattern is used
on higher block sizes, which are recursively split into 4 sub-blocks until
size $4\times4$ is reached~\cite{sole-12-transform-coefficient-coding}.

\begin{figure}[tb]
	\centering
	\begin{minipage}{0.48\textwidth}

		\ifthenelse{\usepdfs = 0}
		{\input{./figures/pred-directions.tex}}
		{\includegraphics{./figures/pred-directions.pdf}}
	\end{minipage}
	\begin{minipage}{0.48\textwidth}
		\centering
		\small
		Average profile for residuals (prediction direction)
		\begin{tabular}[H]{ccc}
			\includegraphics[width=0.25\textwidth]{figures/resids-scan-diag.png}
			&
			\includegraphics[width=0.25\textwidth]{figures/resids-scan-horz.png}
			&
			\includegraphics[width=0.25\textwidth]{figures/resids-scan-vert.png}
			\\
			\color{red}{diagonal} & \color{greenish}{vertical} & \color{blue}{horizontal} \\
			& & \\
		\end{tabular}
		\\
		\small
		Average profile for coefficients (scanning mode)
		\begin{tabular}[H]{ccc}
			\includegraphics[width=0.25\textwidth]{figures/coeffs-scan-diag.png}
			&
			\includegraphics[width=0.25\textwidth]{figures/coeffs-scan-horz.png}
			&
			\includegraphics[width=0.25\textwidth]{figures/coeffs-scan-vert.png}
			\\
			\color{red}{diagonal} & \color{greenish}{horizontal} & \color{blue}{vertical} \\
		\end{tabular}
	\end{minipage}
	\subfloat[Diagonal scanning]
	{\ifthenelse{\usepdfs = 0}
	{\input{figures/mdcs-diag.tex}}
	{\includegraphics{figures/mdcs-diag.pdf}}}
	\subfloat[Horizontal scanning]
	{\ifthenelse{\usepdfs = 0}
	{\input{figures/mdcs-horz.tex}}
	{\includegraphics{figures/mdcs-horz.pdf}}}
	\subfloat[Vertical scanning]
	{\ifthenelse{\usepdfs = 0}
	{\input{figures/mdcs-vert.tex}}
	{\includegraphics{figures/mdcs-vert.pdf}}}
	\caption[\acs{HEVC} intra prediction modes and scannings for $4\times4$\acsp{TU}]
	{\acs{HEVC} \aclp{IPM} referred to their scanning patterns together with
	some average residual examples in the transform domain for $4\times4$ \acsp{TU}}
	\label{fig:mdcs}
\end{figure}

\section{Encoder control}
\label{sec:encoder_control}
\index{encoder control}
\index{Lagrange}

The encoder control is used by essentially all blocks in the diagram
from figure~\ref{fig:hybrid_video_coding_scheme}.
This set of operations allows the encoder to take decisions related to coding
based on the application requirements.
These decisions include the block sizes and the prediction to use.
For each block size and type of prediction, the encoder computes the
distortion, using its own decoder, and estimates the rate as illustrated in
figure~\ref{fig:rate_distortion_scheme}.

The appropriate set of choices, mainly among the prediction modes and block
sizes, is performed by comparing the Lagrange values, which balance the
distortion and rates obtained for each coding choice.
This behaviour is completely described in the standard
specification~\cite{itu-13-hevc}.

Depending on the coding configuration defined in the \acs{JCT-VC} \ac{CTC},
the encoder may also decide whether to use references from the current picture
or from other pictures.
The \ac{AI} configuration encodes each picture independently from the rest,
whereas in the \ac{RA} configuration, the pictures conforming the sequences
are organised into \acp{GOP}, formed by an I-slice, and several P and
B-slices.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/rate_distortion_scheme.tex}}
	{\includegraphics{./figures/rate_distortion_scheme.pdf}}
	\caption{Rate-distortion scheme of a transform-based codec}
	\label{fig:rate_distortion_scheme}
\end{figure}

\subsection{Distortion measures}
\label{sub:distortion_measures}

This section presents the two distortion measures for objective quality
used in this thesis.
Although other measures exist, such as \ac{SSIM}~\cite{wang-04-ssim},
which is designed to improve on traditional methods such as the \acs{MSE} and
the \acs{PSNR}, explained in detail below.

The reason for using the traditional methods over other ones is to be able to
provide more consistent comparisons with existing techniques present in
standardisation, which use the \acs{BD} measurements based on the \acs{PSNR},
even if they have can be inconsistent with the \ac{HVS}.

\subsubsection{Mean squared error}
\label{ssub:mean_squared_error}
\index{MSE}

The \ac{MSE} is the average of the square difference between two
signals.
For two-dimensional signals, such as images, the \ac{MSE} can be computed as:
\begin{equation}
	MSE_{I,K} = \frac{1}{m\,n} \sum\limits_{i=0}^{m-1} \sum\limits_{j=0}^{n-1}
	{\left[ I(i,j) - K(i,j) \right]} ^2
	\label{eqn:mse}
\end{equation}
Where $I$ and $K$ are two images of $m \times n$ pixels.
Usually, $I$ is the reference image and $K$ the coded image.

\subsubsection{Peak signal-to-noise ratio}
\label{ssub:peak_signal_to_noise_ratio}
\index{PSNR}

The \ac{PSNR} is an objective measure of quality that computes the ratio
between the maximum possible value of a signal and the energy of the noise
that affects the fidelity of its approximation.
It is usually defined in a logarithmic scale to cope with the wide range that
signals might have.
Defining the \ac{PSNR} in terms of the \ac{MSE} from~\eqref{eqn:mse}, it can
be expressed as:
\begin{equation}
	PSNR = 10 \log_{10} \left(\frac{MAX_I^2}{MSE_{I,K}}\right)
	= 20 \log_{10} (MAX_I) - 10 \log_{10} (MSE_{I,K})
	\label{eqn:psnr}
\end{equation}
For 8-bit depth images, which are the main format considered in this thesis,
the maximum pixel values writes: $MAX = 2^{8} - 1 = 255$.

\subsection{Rate-distortion optimisation}
\label{sub:rate_distortion_optimisation}
\index{RDO}
\index{Lagrange}

In order to carry out the most sensible decision each time, the encoder can
make use of a \ac{RDO} criterion~\cite{sullivan-98-rdo-video-compression}.

Each time the encoder has to make a decision about choosing a
particular block size for the partitioning or a prediction mode, it
checks the distortion that decision might cause as well as an estimation
of the bitrate needed.
The encoder performs this computation iteratively on the same block,
exploring different coding possibilities and finally selects the one that
provides the best score in terms of rate and distortion.  This is called the
\ac{RDO} loop.
For instance, in \ac{HEVC}, it allows choosing the best \ac{PU} size
(4,8,16,32,64), \ac{TU} size (4,8,16,32), \ac{IPM} (0,1,\ldots,34), prediction
source (intra, inter), amongst others.

The trade-off between the distortion and the rate is commonly expressed
using a Lagrange measure.

\begin{equation}
	J(\lambda) = \text{Distortion} + \lambda \text{Rate}
	\label{eqn:lagrangian_rdo}
\end{equation}

As seen in the previous subsections, computing the distortion is
reasonably straightforward.
However, estimating the bitrate is a bit more delicate, since the whole
entropy coder cannot be run each time the encoder explores the different
possibilities for a block because it has a huge impact on complexity.
As a consequence, an estimation of the bitrate is often used in the \ac{RDO}
loop.

\section{Bj{\o}ntegaard Delta measurements}
\label{sub:bjontegaard_delta_measurements}
\index{BD-rate}
\index{BD-PSNR}

Comparing two video coding techniques objectively might be complicated,
as both the distortion and the bitrate savings have to be taken into account
jointly.

Metrics introduced by Gisle Bjøntegaard, know as \ac{BD} measurements have
become the current \emph{de facto} procedure to objectively compare the result
of two encodings~\cite{VCEG-M33,VCEG-AI11}.
Two different metrics are defined and displayed in
figure~\ref{fig:bdsnr_bdrate}:
\begin{enumerate}[label = (\alph{enumi})]
	\item \ac{BD}-\ac{PSNR}: computes the relative quality improvement
		in \si{\decibel}.
	\item \ac{BD}-rate: computes the relative savings in bitrate for an
		equivalent distortion in percent.
\end{enumerate}

The two encodings in figure~\ref{fig:bdsnr_bdrate}, labelled as REF and EXP,
represent the reference and the experiment encodings, respectively.

\begin{figure}[tb]
	\centering
	\subfloat[\acs{BD}-\acs{PSNR}]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/bd_psnr_plot.tex}}
	{\includegraphics{./figures/bd_psnr_plot.pdf}}}
	\hfill
	\subfloat[\acs{BD}-rate]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/bd_rate_plot.tex}}
	{\includegraphics{./figures/bd_rate_plot.pdf}}}
	\caption{Schematic representation of rate-distortion plots using the
	\acs{BD} measurements}
	\label{fig:bdsnr_bdrate}
\end{figure}

The \ac{BD}-rate measurement is extensively used in this thesis to
appreciate the performance of the proposed systems, as it is the metric used
by \acs{JCT-VC} in standardisation.

\section{Conclusions}
\label{sec:conclusions_video_coding}

This Chapter has presented the motivation for video compression as well as
some general concepts concerning an overview of the video coding system.

Current video coders exploit redundancies existing within images of the video
sequence via predictions.
These predictions can be either spatial or temporal, depending respectively on
whether the prediction source is the same image or another image.
Unpredictable parts of the image, called residuals, are then passed through a
transformation step in order to concentrate the transformed residual into as
few coefficients as possible.

The appropriate block size is selected for the prediction and the transform
sizes, named \acp{PU} and \acp{TU}, respectively.
A lot of flexibility is allowed for the prediction and sizes while the set of
choices for a given codec is rather limited for the transform.

This thesis will focus on the extension of the choices for the transform
stage.
Currently, the decision of the prediction source (intra or inter), the block
sizes and \acp{IPM} are based on a single transform, usually the \ac{DCT}-II.
However, a single transform is unlikely to provide optimal signal compaction
for all kinds of possible signals.

This thesis focuses on the extension of the choices for the transform
stage, as set of transforms will be provided such that the encoder is able to
better adapt the transform to the varying nature of the prediction residuals.

Next Chapter explores the details of the transforms used in video coding,
namely their design principles and a comparison between two design approaches.

\chapter{Transform coding}
\label{cha:transform_coding}
\chaptertoc

\section{Introduction to transforms}
\label{sec:introduction_to_transforms}
\index{transform}
\index{transform coding}
\index{rotation}

In the previous Chapter, transforms were identified as an important part in
current video coding standards.
This Chapter studies the design and properties that make transforms useful in
video coding.
Although this Chapter is mainly a summary of the literature, a detailed
mathematical analysis on the $\ell_0$ norm is presented in
\S\ref{sub:the_lagrange_multiplier}.

A transform is a mathematical transfer function of a signal from a
representation domain to another.
The high energy compaction offered by transform process has led this technique
to be part of all the international video coding standards.

Transforms allow reducing existing signal correlations in the spatial domain,
leading to a more decorrelated signal in the transform domain and ensuring a
more compact representation.
This is of great importance for the upcoming stages of scanning and entropy
coding.

Transforms can be very abstract since they tend to work in $N$-dimensional
spaces, where $N$ represents the number of residual samples processed by the
transform.
Typical values vary from $N=4\times4$ to $N=32\times32$ using powers of two in
modern video codecs, such as \ac{HEVC}.

However, restraining ourselves to the two-dimensional case, one of the most
visual and representative example of transforms are rotations.
The example in figure~\ref{fig:transform_rotation} helps to visualise the
transform role in signal coding.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/transform_rotation_plot.tex}}
	{\includegraphics{./figures/transform_rotation_plot.pdf}}
	\caption{Simple transform performing a rotation}
	\label{fig:transform_rotation}
\end{figure}

Whereas on the left signal (in blue), both coordinates are needed to describe
the signal accurately, on the right signal (in red), one coordinate, namely
the vertical dimension is enough to provide an equally accurate signal
representation, since the horizontal dimension remains constant.

\subsection{Block transforms}
\label{sub:block_transforms}
\index{block transforms}
\index{JPEG}
\index{HEVC}
\index{AVC}

Block-based coding is widely adopted in image/video systems, such as
\acs{JPEG}~\cite{jpeg}, H.264/\acs{MPEG}-4 \ac{AVC} and \ac{HEVC}.

In these systems, the image to be transformed is split into non-overlapping
blocks, and each one is treated and transformed
independently~\cite{xu-09-intra-predictive-transforms}.
This provides an advantage of being less expensive in terms of computing that
other kinds of transforms, such as wavelet transforms used in
\acs{JPEG}-2000~\cite{jpeg2000}.

Since \ac{HEVC} uses block transforms only, they will be used in this thesis
to build up new systems.

\subsection{Orthogonal transforms}
\label{sub:orthogonal_transforms}
\index{orthogonal transforms}

Transforms used in image processing and video coding systems are orthogonal.
Orthogonal matrices are square matrices whose rows and columns are orthogonal
unit vectors, also known as orthonormal vectors, with:
\begin{equation}
	\A^T\A = \A\A^T = \I
\end{equation}
As a consequence, the inverse matrix of an orthogonal matrix is its
transposed version:
\begin{equation}
	\A^T = \A^{-1}
\end{equation}

This property offers some benefits:
\begin{itemize}
	\item Fast computation of inverse transform with no need to store it
		separately.
	\item Re-use of fast algorithms for both direct and inverse transform
		applications.
	\item Energy preservation.
\end{itemize}

\subsection{Separability}
\label{sub:separability}
\index{separable transforms}

Image and video coding deal with image blocks, which are two-dimensional
signals and, consequently, use transforms able to process those signals.

The straightforward approach to work with those signals is to use
non-separable transforms.
These transforms take the residual samples from a block previously reshaped
into a single-dimensional signal.
For instance, a $4\times4$ block becomes a $16\times1$ vector, as illustrated
in figure~\ref{fig:block_linearisation}.
Afterwards the transform is applied normally:
\begin{equation}
	\X = \A \, \x
\end{equation}
Where $\x$ is an $N \times N$ block, reshaped into an $N^2\times1$ vector
and $\A$ is an $N^2 \times N^2$ matrix.
The main disadvantage of this approach is the number of calculations
required to obtain the transformed signal: for an $N \times N$ block, the
number of operations required to transform it in a non-separable way is:
$N^4$ multiplications and $N^2(N^2-1)$ additions.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/block_linearisation.tex}}
	{\includegraphics{./figures/block_linearisation.pdf}}
	\caption{Example of a $4\times4$ block linearisation}
	\label{fig:block_linearisation}
\end{figure}
\index{block linearisation}

On the other hand, any linear correlation can be exploited between any sample
within a block:
complex relationships between distant samples can be handled regardless their
position in the block.

Due to the high amount of operations needed to transform a block using
non-separable transforms, separable transforms are widely used in video
coding.
A block is transformed separately using horizontal and vertical transforms
$\A_h, \A_v$ for its rows and columns, respectively, as:
\begin{equation}
	\X = \A_v {\left(\A_h \, \x^T\right)}^T = \A_v \, \x \, \A_h^T
	\label{eqn:sep_transform}
\end{equation}
The operation inside the parenthesis transforms the rows of $\x$, and the
outer part, the columns of the result.
By performing the horizontal and vertical transforms separately, the number of
operations required is reduced to $2N^3$ multiplications and $2N^2(N-1)$
additions.

However, this reduction in complexity comes at a price:
non-separable transforms are able to exploit any correlation amongst samples
within a block, whereas separable transforms can only decorrelate samples that
share the same row or column.
In other words, separable transforms might be less efficient than their
non-separable counterparts.
The impact in performance due to separability will be studied in detail
in Chapter~\ref{cha:mddt}.

If a transform can be expressed by the Kronecker product, denoted as
$\otimes$, of two separate transforms for horizontal and vertical directions,
it might be considered to be separable.
Moreover, a separable 2D transform can be equivalently be represented by a 1D
operation as shown in~\eqref{eqn:kronecker}.
\begin{equation}
	\mathbf{A}\otimes\mathbf{B} =
	\begin{bmatrix}
		a_{11} \mathbf{B} & \cdots & a_{1n}\mathbf{B} \\
		\vdots & \ddots & \vdots \\
		a_{m1} \mathbf{B} & \cdots & a_{mn} \mathbf{B}
	\end{bmatrix}
	\label{eqn:kronecker}
\end{equation}
Where $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{B}$ is a $p \times
q$ matrix.
More explicitly:
\begin{equation}
	\mathbf{A}\otimes\mathbf{B} =
	\begin{bmatrix}
		a_{11} b_{11} & a_{11} b_{12} & \cdots & a_{11} b_{1q} & \cdots & \cdots & a_{1n} b_{11} & a_{1n} b_{12} & \cdots & a_{1n} b_{1q} \\
		a_{11} b_{21} & a_{11} b_{22} & \cdots & a_{11} b_{2q} & \cdots & \cdots & a_{1n} b_{21} & a_{1n} b_{22} & \cdots & a_{1n} b_{2q} \\
		\vdots & \vdots & \ddots & \vdots & & & \vdots & \vdots & \ddots & \vdots \\
		a_{11} b_{p1} & a_{11} b_{p2} & \cdots & a_{11} b_{pq} & \cdots & \cdots & a_{1n} b_{p1} & a_{1n} b_{p2} & \cdots & a_{1n} b_{pq} \\
		\vdots & \vdots & & \vdots & \ddots & & \vdots & \vdots & & \vdots \\
		\vdots & \vdots & & \vdots & & \ddots & \vdots & \vdots & & \vdots \\
		a_{m1} b_{11} & a_{m1} b_{12} & \cdots & a_{m1} b_{1q} & \cdots & \cdots & a_{mn} b_{11} & a_{mn} b_{12} & \cdots & a_{mn} b_{1q} \\
		a_{m1} b_{21} & a_{m1} b_{22} & \cdots & a_{m1} b_{2q} & \cdots & \cdots & a_{mn} b_{21} & a_{mn} b_{22} & \cdots & a_{mn} b_{2q} \\
		\vdots & \vdots & \ddots & \vdots & & & \vdots & \vdots & \ddots & \vdots \\
		a_{m1} b_{p1} & a_{m1} b_{p2} & \cdots & a_{m1} b_{pq} & \cdots & \cdots & a_{mn} b_{p1} & a_{mn} b_{p2} & \cdots & a_{mn} b_{pq} 
	\end{bmatrix}
\end{equation}
\index{Kronecker product}

\subsection{Transform design}
\label{sub:transform_design}
\index{transform design}

Since the objective of the transforms is to be able to represent the signal
with as few coefficients as possible while minimising the distortion
introduced by the quantisation, transform design methods need to consider
trade-offs between the distortion and the number of bits needed to represent
those signals in the transform domain, as shown previously in
\S\ref{sub:rate_distortion_optimisation}.

Most transform design methods use the \ac{MSE} from~\eqref{eqn:mse} to
evaluate the distortion introduced by a quantiser applied in the transform
domain.
However, different transform designs exist depending on how the bitrate is
estimated and modelled.

Next sections will study different transform design approaches based on
different modellings of the rate constraint, namely the \acs{KLT} (\S
\ref{sec:klt}) and the \acs{RDOT} (\S \ref{sec:rdot}).

\section{The Karhunen-Loève transform}
\label{sec:klt}
\index{KLT}

The components of source signals within a residual block tend to be correlated
amongst them.
This correlation is expressed in terms of a correlation matrix containing the
linear inter sample correlations.
For two $N$-dimensional signals $\x,\y$, the covariance signal between them
can be computed as:
\begin{align}
	\nonumber
	\C_{\x,\y} & = \E\left\{\x\y^T\right\} =
	\E\left\{
	\begin{bmatrix}
		x_0 \\ x_1 \\ \vdots \\ x_{N-1}
	\end{bmatrix}
	\begin{bmatrix}
		y_0 & y_1 & \cdots & y_{N-1}
	\end{bmatrix}
	\right\} \\
	& = \E\left\{
	\begin{bmatrix}
		x_0y_0 & x_0y_1 & \hdots & x_{0}y_{N-1} \\
		x_1y_0 & x_1y_1 & \hdots & x_{1}y_{N-1} \\
		\vdots & \vdots & \ddots & \vdots \\
		x_{N-1}y_0 & x_{N-1}y_1 & \hdots & x_{N-1}y_{N-1}
	\end{bmatrix}
	\right\} \\ \nonumber
	& =
	\begin{bmatrix}
		\E\left\{x_0y_0\right\} & \E\left\{x_0y_1\right\} & \hdots & \E\left\{x_{0}y_{N-1}\right\} \\
		\E\left\{x_1y_0\right\} & \E\left\{x_1y_1\right\} & \hdots & \E\left\{x_{1}y_{N-1}\right\} \\
		\vdots & \vdots & \ddots & \vdots \\
		\E\left\{x_{N-1}y_0\right\} & \E\left\{x_{N-1}y_1\right\} & \hdots & \E\left\{x_{N-1}y_{N-1}\right\}
	\end{bmatrix}
\end{align}
\index{covariance matrix}
\index{correlation matrix}
If $\x=\y$, the covariance matrix is the correlation matrix of $\x$.

It is possible to select an orthogonal matrix $\A$ that will make $\X=\A\x$
have pairwise uncorrelated components in the transform
domain~\cite{gersho-92-vector-quantization}.
The \ac{KLT} is defined as the linear orthogonal transform that reduces the
redundancy by a maximum decorrelation of the data, so that the signal can be
stored more efficiently~\cite{rao-01-transform-data-compression-book}.

In this section, the \ac{KLT} is presented under its well-known optimal
condition:
the high-resolution quantisation assumption~\cite{goyal-00-high-resolution}.
The high-resolution assumption states that the number of quantisation levels
is high and the quantisation step size is small enough to consider the
\ac{PDF} constant for each quantisation interval.

Under this condition, the \ac{KLT} is the transform that achieves optimal
bit allocation for the quantisation of transform coefficients by distributing
their variances in such a way as to minimise their geometric
mean~\cite{jayant-84-digital-coding-waveforms}, while minimising the overall
distortion~\cite{gersho-92-vector-quantization}.

The \ac{KLT} decorrelates the signal in the transform domain, that is the
correlation function of the signal in the transform domain $\C_\X$ is a
diagonal matrix, which can be computed as follows.
Let $\x$ be a zero-mean process and $\A$ an orthogonal transform, then, in
the transform domain:
\begin{equation}
	\X = \A \x \qquad \text{s.t. } \A \A^T = \I
\end{equation}
The covariance matrix in the transform domain is expressed as:
\begin{equation}
	\C_\X = \E\left\{\X \X^T\right\} = \A \E\left\{\x\x^T\right\}\A^T =
	\A\C_\x\A^T
\end{equation}
Or equivalently:
\begin{equation}
	\A^T\C_\X = \C_\x\A^T
\end{equation}
And since $\C_\X$ is diagonal:
\begin{equation}
	\C_\x \a_i = \lambda_i\a_i
\end{equation}
Where:
\begin{itemize}
	\item $\a_i$'s are the eigenvectors of $\C_\x$.
	\item $\lambda_i$'s are the eigenvalues of $\C_\x$.
\end{itemize}
\index{eigen values}
\index{eigen vectors}

As a result, the \ac{KLT} is a transform whose base vectors are the
eigenvectors of the correlation matrix of the input signal.

\subsection{Particular case on natural images: the \acs{DCT}}
\label{sub:particular_case_dct}
\index{DCT}
\index{Markov}
\index{AR}
\index{Toeplitz matrix}

One of the most used transforms in image and video coding is the
\acf{DCT}.
In this section, the \ac{DCT} is justified over a particular kind
of signals: natural images.
The statistics of pixels in natural images match closely a first order
\ac{AR} process.
A first order \ac{AR} model, also known as Markov-1 process, is a stochastic
process that can be generated through the following regression formula:
\begin{equation}
	x(n) = \rho x(n-1) + w(n)
	\label{eqn:first_order_ar_model}
\end{equation}
Where $\rho$ is the correlation coefficient between two adjacent samples
and $w(n)$ is a white noise with zero mean, whose variance is related to
the variance of $x(n)$ $\sigma_x^2$ as:
\begin{equation}
	\sigma_w^2 = \E\left\{ w(n) w(n) \right\} =
	\left(1-\rho^2\right)\sigma_x^2
\end{equation}
The correlation matrix of this process takes the form of a Toeplitz
matrix~\cite{akansu-12-toeplitz-approximation}:
\begin{equation}
	\R_x = \sigma_x^2
	\begin{pmatrix}
		1          & \rho       & \rho^2     & \cdots & \rho^{N-1} \\
		\rho       & 1          & \rho       & \cdots & \rho^{N-2} \\
		\rho^2     & \rho       & 1          & \cdots & \rho^{N-3} \\
		\vdots     & \vdots     & \vdots     & \ddots & \vdots     \\
		\rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1
	\end{pmatrix}
	\label{eqn:toeplitz_matrix}
\end{equation}
The \ac{KLT} for this kind of processes, that is, the eigenvectors of
$\R_\x$, tends to the \ac{DCT} as $\rho\to1$~\cite{britanak-06-dct-and-dst}.
The \ac{DCT}-II can be expressed compactly as:
\begin{equation}
	{\left[C_{N}^{II} \right]}_{n,k} =
	\sqrt{\frac{2}{N}}\epsilon_k\cos\left(\frac{\pi(2n+1)k}{2N}\right)
	\quad
	n,k=0,\dots,N-1
	\label{eqn:dct_ii}
\end{equation}

\begin{equation}
	\epsilon_k =
	\begin{cases}
		\frac{1}{\sqrt{2}} & k = 0 \\
		1 & \text{otherwise}
	\end{cases}
\end{equation}

The fact that the \ac{DCT} approximates the \ac{KLT} for image signals and its
efficient implementation has made it the preferred choice in image and video
coding algorithms to decorrelate the signals and provide optimal bit
allocation~\cite{sole-12-transform-coefficient-coding}.

\begin{figure}[tb]
	\centering
	\subfloat[\acs{DCT}-II $4\times4$]
	{\includegraphics[width=0.3\linewidth]{./figures/dct4-bases.png}}
	\hspace{0.2\linewidth}
	\subfloat[\acs{DST}-VII $4\times4$]
	{\includegraphics[width=0.3\linewidth]{./figures/dst4-bases.png}}
	\caption{Transforms used in \acs{HEVC} for $4\times4$ blocks}
	\label{fig:dct_dst}
\end{figure}

\subsection{Particular case on prediction residuals: the \acs{DST}}
\label{sub:particular_case_dst}
\index{DST}
\index{intra prediction}
\index{Toeplitz matrix}

Although the \ac{DCT} has been proved to be nearly the \ac{KLT} for natural
images, it is used in predictive transform coding based video standards.
This kind of coding leads to signals whose nature differs from that of
natural images: prediction residuals.
More precisely, prediction residuals resulting from intra prediction tend to
have particular properties.
Those residuals are computed using predictions from already
reconstructed blocks, usually available on the left and upper borders (in
grey) of the current residual (in white), as shown in
figure~\ref{fig:pred_scheme}.a.
An example of an average intra prediction residual is provided in
figure~\ref{fig:pred_scheme}.b.
It can be noticed that the energy of the residual is lower (dark) near
the borders where reconstructed blocks are available, and that the error
gets higher (lighter) as one moves away from the boundaries.
These properties motivated a study on this particular kind of blocks,
resulting in a transform that performs better on them than the \ac{DCT}:
the \acf{DST}~\cite{han-10-spatial-adaptive-transform}.
\begin{figure}[tb]
	\centering
	\subfloat[The \aclp{IPM} for the current block (white) from previously
	reconstructed blocks (grey)]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/pred-scheme.tex}}
	{\includegraphics{./figures/pred-scheme.pdf}}}
	\hspace{0.2\linewidth}
	\subfloat[Average intra prediction residual]
	{\includegraphics[width=0.3\linewidth]{./figures/resids-scan-diag.png}}
	\caption[Example of \aclp{IPM} on a prediction residual]
	{Prediction scheme showing all \aclp{IPM} and a prediction
	residual example}
	\label{fig:pred_scheme}
\end{figure}

Intra prediction residuals present a correlation matrix that can be modelled
using a Toeplitz tridiagonal matrix, as the one
in~\eqref{eqn:tridiagonal_matrix}~\cite{han-10-spatial-adaptive-transform,
yueh-05-eigenvalues-tridiagonal}.
The eigenvalues of that correlation matrix can be expressed in a closed form
using the \ac{DST}-VII from~\eqref{eqn:dst_vii}.

\begin{equation}
	\tilde\R_x =
	\begin{pmatrix}
		1+\rho^2 & -\rho    & 0        & 0      & \hdots & 0            \\
		-\rho    & 1+\rho^2 & -\rho    & 0      & \hdots & 0            \\
		0        & -\rho    & 1+\rho^2 & -\rho  & \hdots & 0            \\
		0        & \ddots   & \ddots   & \ddots & \ddots & \vdots       \\
		\vdots   & \ddots   & \ddots   & \ddots & \ddots & -\rho        \\
		0        & \hdots   & 0        & 0      & -\rho  & 1+\rho^2-\rho
	\end{pmatrix}
	\label{eqn:tridiagonal_matrix}
\end{equation}
\index{tridiagonal matrix}

\begin{equation}
	{\left[S_{N}^{VII} \right]}_{n,k} =
	\frac{2}{\sqrt{2N+1}}\sin\left(\frac{\pi(2n-1)k}{2N+1}\right),
	\quad
	n,k = 1, \dots, N
	\label{eqn:dst_vii}
\end{equation}

Figure~\ref{fig:dct_dst}.b presents the \ac{DST}-VII bases for
$4\times4$ blocks.
The similarity between an average residual block from
figure~\ref{fig:pred_scheme}.b can be spotted by comparing it to the first
base vector from the \ac{DST}.

The use of the \ac{DST} in \ac{HEVC} over the \ac{DCT} on $4\times4$
blocks leads to a bitrate reduction of 1\% on
average~\cite{sullivan-12-overview-hevc} for intra blocks.
An explanation on the adoption of the \ac{DST} into \ac{HEVC} is detailed in
Chapter~\ref{cha:mddt}, more precisely in \S\ref{sub:dst_and_mddt}.

\subsubsection{Note on the \acs{KLT}}
\label{ssub:note_on_the_klt}

The \ac{KLT} is often presented as the optimal transform, sometimes even for
all possible sources of signals.
However, it has been proved to be suboptimal in the transform coding / bit
allocation sense in some cases~\cite{effros-04-suboptimal-klt}.
For this reason, next section studies another kind of transform design that
leads to optimal transforms under different conditions.

\section{The rate-distortion optimised transform}
\label{sec:rdot}
\index{RDOT}

As discussed before, transforms try to find a compact representation of the
signal in the transform domain.
The \ac{KLT}, under certain assumptions, provides optimal bit allocation for
transformed coefficients and signal decorrelation.

Sezer proposes an alternative kind of transform to optimise the trade-off
between the distortion and the rate from~\eqref{eqn:lagrangian_rdo} by taking
into account the sparsity of the output signal in their
design~\cite{sezer-11-phd,sezer-08-sparse-orthonormal-transforms}.
This transform, named \ac{RDOT}, unlike the \ac{KLT}, does not make use
of the high-resolution assumption.
Instead, the bitrate constraint is modelled via the sparsity of the
transformed coefficients, by the introduction of the $\ell_0$ norm, which
counts the number of non-zero coefficients in a vector.
The proposed \ac{RDOT} can be expressed as:
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\sum_{\forall i} \min\limits_{\c_i}
	\left(
	{\Vert \x_i - \A^T\c_i \Vert}_2^2 + \lambda{\Vert \c_i \Vert}_0
	\right)
	\label{eqn:rdot-nsep}
\end{equation}
Where $\x_i$ are the input signals, i.e.\ a block of the training set and
$\c_i$ are its quantised transformed coefficients using the transform $\A$.
$\A^T$ is its transposed matrix, as $\A$ is chosen orthogonal.
The constraint in the cost function is the $\ell_0$ norm of the coefficients,
i.e.\ the number of non-zero coefficients.
Finally, $\lambda$ is the Lagrange multiplier that tunes the constraint.

The fact that this transform design strives to obtain sparse signals in the
transform domain seems to be adapted to the state-of-the-art video coding
standard, \ac{HEVC}.
There are numerous syntax elements in \ac{HEVC} that deal with sparsity:
for a given transformed residual, the position of the last non-zero value is
signalled, meaning that, instead of explicitly transmitting all the following
zeroes, it is indicated that from that point onwards, all the values are zero.
More generally, flags indicating whether the individual coefficients are
significant or not are also part of the standard syntax.
Therefore, increasing the number of zeros of the quantised coefficients in the
transform domain seems a good objective.

A thorough study of~\eqref{eqn:rdot-nsep} analysing its properties and
consequences is detailed below.

\subsection{The \acs{RDOT} metric}
\label{sub:the_rdot_metric}
\index{RDOT metric}

The value that \ac{RDOT} minimises is expressed in~\eqref{eqn:rdot_metric} for
a single signal.
This metric depends exclusively on the quantisation step $\Delta$ and the
used transform $\A$.
It is shown in \S\ref{sub:the_lagrange_multiplier} that the Lagrange
multiplier $\lambda$ is tied to the quantisation step $\Delta$.
\begin{equation}
	J (\lambda) =
	{\Vert \x - \A^T \c \Vert}_2^2 + \lambda{\Vert \c \Vert}_0
	\label{eqn:rdot_metric}
\end{equation}
The first part of the equation represents the distortion introduced by
the quantisation.
The second term serves as a rate constraint, by ensuring that the number of
significant values in the transform domain is minimised together with the
distortion.

The minimisation of the metric is done in two steps, carried out iteratively
until convergence:
\begin{enumerate}
	\item Finding the optimal coefficients for a given transform.
	\item Updating the transform for the optimal coefficients.
\end{enumerate}

These two steps are detailed in the subsections below.
\subsubsection{Optimal coefficients for a given transform}
\label{ssub:optimal_coefficients_for_a_given_transform}

The optimal coefficients that introduce the minimum distortion for a given
quantisation step $\Delta$ are obtained by transforming the signal and
hard-thresholing them:
\begin{equation}
	\c = \lfloor \X \rfloor = \lfloor \A \x \rfloor
\end{equation}
The threshold value is tightly related to the Lagrange multiplier $\lambda$,
as demonstrated in \S\ref{sub:the_lagrange_multiplier}:
\begin{equation}
	\c[n] =
	\begin{cases}
		\X[n] & \vert \X[n] \vert \ge \displaystyle \frac{\Delta}{2} \\
		0     & \text{otherwise} \\
	\end{cases}
	\label{eqn:hard_threshold}
\end{equation}

\subsubsection{Optimal transform for given coefficients}
\label{ssub:optimal_transform_for_given_coefficients}

Once the optimal coefficients have been found, the transform $\A$, chosen
orthogonal, is updated to provide the mapping between $\x$ and $\c$ while
minimising the reconstruction error.
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(
	\sum_{\forall i}{\Vert \x_i - \A^T\c_i\Vert}^2
	\right)
	\qquad \text{s.t. } \A\A^T = \I
\end{equation}
Since the expression is a scalar value, it can be rewritten using the trace
(the sum of a matrix diagonal):
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(\sum_{\forall i}\tr\left( 
	{\left(\x_i - \A^T\c_i\right)}^T\left( \x_i - \A^T\c_i\right)
	\right)\right)
\end{equation}
Operating:
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(\sum_{\forall i}\tr\left( 
	\x_i^T\x_i -\x_i^T\A^T\c_i -\c_i^T\A\x_i + \c_i^T\A\A^T\c_i
	\right)\right)
\end{equation}
Since the trace is a linear operator and $\A\A^T=\I$:
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(\sum_{\forall i}
	\tr\left(\x_i^T\x_i\right)
	-\tr\left(\x_i^T\A^T\c_i\right)
	-\tr\left(\c_i^T\A\x_i\right)
	+\tr\left(\c_i^T\c_i \right)
	\right)
\end{equation}
Making use of the cyclic property of the trace and removing
$\A$-independent terms:
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(\sum_{\forall i}
	-2\tr\left(\c_i\x_i^T\A^T\right)
	\right)
\end{equation}
Defining $\Y=\displaystyle\sum_{\forall i}\c_i\x_i^T$ and its \ac{SVD}
$\Y=\U\Lambdab^{\nicefrac{1}{2}}\V^T$, where $\U$ and $\V$
are orthogonal and $\Lambdab$ is a positive semi-definite diagonal matrix.
The equation rewrites as follows:
\begin{equation}
	\A_{opt} = \arg\min\limits_{\A}
	\left(
	-2\tr\left(\U\Lambdab^{\nicefrac{1}{2}}\V^T\A^T\right)
	\right)
\end{equation}
Minimising a negative expression is equivalent to maximising its positive
version, and re-arranging the terms using the trace cyclic property:
\begin{equation}
	\A_{opt} = \arg\max\limits_{\A}
	\left(
	2\tr\left(\Lambdab^{\nicefrac{1}{2}}\V^T\A^T\U\right)
	\right)
\end{equation}
Let $\Pb=\V^T\A^T\U$.
Since $\V$, $\A$ and $\U$ are orthogonal, so is $\Pb$.
The equation is now:
\begin{equation}
	\A_{opt} = \arg\max\limits_{\A}
	\left(
	\tr\left(\Lambdab^{\nicefrac{1}{2}}\Pb\right)
	\right)
\end{equation}
Since $\Lambdab$ is a diagonal matrix whose entries are non-negative by
definition and $\Pb$ is orthogonal, the maximisation is achieved when
$\Pb=\I$:
\begin{equation}
	\V^T\A_{opt}^T\U=\I\quad \Rightarrow \quad \A_{opt} = \U\V^T
\end{equation}

Summing up, the optimal transform is obtained using the \ac{SVD} of the
covariance matrix between the output signal (the transformed and quantised
coefficients) and the input signal (the prediction residuals).

\subsection{Separable \acs{RDOT} design}
\label{sub:separable_rdot_design}
\index{separable RDOT}
\index{RDOT}

The methods for transform design and learning presented in the previous
sections are non-separable.
This means that the input block is linearised and the transform is applied in
one single step.
Due to computational complexity issues, non-separable transforms are hardly
used in performing solutions.
A lower complexity approach involves separable transforms, which may not be
able to concentrate the energy in the transform domain as well as their
non-separable counterparts, as explained in \S\ref{sub:separability}.
In order to design and learn separable transforms, the design and learning
methods have to be adapted.
In the case of the \ac{KLT}, it is straightforward to see that one can learn a
horizontal \ac{KLT} to transform the rows of the signal and a vertical
\ac{KLT} for the columns.
However, the \ac{RDOT} algorithm needs further tuning in order to obtain
separable transforms.
A possible way of learning a separable \ac{RDOT} was also proposed by
Sezer~\cite{sezer-11-phd} and validated by independent
researches~\cite{sole-09-sparsity-optimisation-separable-transforms}.
The proposed method consists in updating each one of the horizontal and
vertical transforms separately, also chosen to be orthogonal transforms.

The separable transformation of a block $\x$ has been previously defined as a
two step transformation in~\eqref{eqn:sep_transform}:
\begin{equation*}
  \X = {\A_v\left( \A_h \x^T \right)}^T = \A_v \x \A_h^T
\end{equation*}
Where $\x$ is the two-dimensional block to transform, $\A_h$ is the horizontal
transform, used to transform the rows of $\x$ and $A_v$ is the vertical
transform, used to transform the resulting columns.

The equation to be optimised using separable transforms reads as follows:
\begin{equation}
	\A_{v_{opt}}, \A_{h_{opt}} = \arg\min\limits_{\A_v,\A_h}
	\left(
	\sum_{\forall i} \min\limits_{\c_i}{\Vert \x_i - \A_v^T\c_i\A_h \Vert}_2^2
	+ {\Vert \c_i \Vert}_0
	\right)
	\label{eqn:rdot-sep}
\end{equation}

This minimisation problem can be solved in a similar way to the non-separable
version.

\subsubsection{Optimal coefficients for a given transform}

As in the non-separable problem, the optimal coefficients $\c_i$ are found by
hard-thresholding the components of $\X_i=\A_v\x_i\A_h^T$ with
$\nicefrac{\Delta}{2}$.

\subsubsection{Optimal vertical transform for given coefficients}
The vertical transform needs to be updated accordingly:
\begin{equation}
	\A_{v_{opt}} = \text{arg}\min\limits_{\A_v}
	\left(
	\sum_{\forall i} {\Vert \x_i - \A_v^T\c_i\A_h \Vert}_2^2
	\right)
	\qquad \text{s.t. } \A_v\A_v^T = \I
\end{equation}
Expanding the expression and grouping the terms as in the non-separable
the covariance matrix $\Y$ can be defined as:
\begin{equation}
\Y = \sum_{\forall i} \c_i\A_h\x_i =
\U\Lambdab^{1/2}\V^T
\end{equation}
Then the optimal transform is given by:
\begin{equation}
  \A_{v_{opt}} = \U\V^T
\end{equation}

\subsubsection{Optimal coefficients for updated vertical transform}

As in the non-separable problem, the optimal coefficients $\c_i$ are
determined by hard-thresholding the components of $\X=\A_{v_{opt}}\x_i\A_h^T$
with $\nicefrac{\Delta}{2}$. However, this time the optimal vertical
transform from the previous step is used.

\subsubsection{Optimal horizontal transform for given coefficients}

Similarly, the horizontal transform needs to be updated accordingly:
\begin{equation}
	\A_{h_{opt}} = \text{arg}\min\limits_{\A_h}
	\left(
	\sum_{\forall i} {\Vert \x_i - \A_{v_{opt}}^T\c_i\A_h \Vert}_2^2
	\right)
	\qquad \text{s.t. } \A_h\A_h^T = \I
\end{equation}
Expanding the expression and grouping the terms as in the non-separable
the covariance matrix $\Y$ can be defined as:
\begin{equation}
	\Y = \sum_{\forall i} \c_i^T\A_{v_{opt}}\x_i =
\U\Lambdab^{1/2}\V^T
\end{equation}
The optimal transform is given by:
\begin{equation}
	\A_{h_{opt}} = \U\V^T
\end{equation}

Hence, the optimal transform are obtained in two iteratively alternated steps
until convergence.

\subsection{The Lagrange multiplier and the zero norm}
\label{sub:the_lagrange_multiplier}
\index{generalised normal distribution}
\index{generalised Gaussian distribution}
\index{exponential power distribution}
\index{Lagrange muliplier}
\index{zero norm}

Video coding residuals distribution can be modelled using a \ac{GGD}, also
known as generalised normal distribution or exponential power
distribution~\cite{lam-00-dct-coefficient-distribution,
yovanof-96-analysis-dct-coefficients}.
For this reason, in order to obtain the optimal Lagrange multiplier in a
reasonably general case, a \ac{GGD} is used to represent the residuals
\ac{PDF}.

Figure~\ref{fig:probability_density_functions} illustrates the \acp{PDF}
of a set of residuals transformed with the \ac{DCT} and the same residuals
transformed with an adapted \ac{RDOT}.  As expected, the \ac{RDOT},
represented by the red curve, presents higher sparsity than the generic
\ac{DCT}, in black:
its value is above the \ac{DCT} on zero, meaning that the coefficients have
more zeros when using the \ac{RDOT}.
Using different transforms, modifies the resulting \ac{PDF}, but since the
used transforms are orthogonal, the variance remains unaltered.
Additionally, since residuals are computed as the difference between predicted
and original blocks, they are prediction errors.
These errors deviate evenly above and below from the actual value, evidenced
by their zero-mean \ac{PDF}, which is symmetrical and centred around zero.
\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/pdfs_plot.tex}}
	{\includegraphics{./figures/pdfs_plot.pdf}}
	\caption[Different residuals \acsp{PDF} compared to Laplace and Normal
	distributions]
	{\acsp{PDF} of the residuals with different transforms
	compared to Laplace and normal distributions}
	\label{fig:probability_density_functions}
\end{figure}
\index{probability density function}
\index{PDF}
Figure~\ref{fig:probability_density_functions} also includes the
\ac{PDF} of a Laplace distribution and a normal distribution.
Those two distributions are particular cases of a \ac{GGD}.

The centred \ac{GGD} \ac{PDF} can be expressed compactly as:
\index{gamma function}
\begin{equation}
	\GGD(\sigma,\gamma,x)=
	a\e^{-{\left(b\vert x \vert\right)}^\gamma}
	\label{eqn:ggd}
\end{equation}
Where:
\begin{align}
	b &= \frac{1}{\sigma}\sqrt{
	\frac{\Gamma\left(\nicefrac{3}{\gamma}\right)}
	{\Gamma \left(\nicefrac{1}{\gamma}\right)}} \\
	a &= \frac{b\gamma}{2\Gamma \left(\nicefrac{1}{\gamma}\right)}
\end{align}
and $\Gamma(z)$ is the gamma function, defined as:
\begin{equation}
	\Gamma(z) = \int_0^\infty t^{z-1}\e^{-t}\d t
\end{equation}
\index{Laplace distribution}
\index{Gaussian distribution}
\index{normal distribution}
\index{uniform distribution}
The Laplace and normal or Gaussian \acp{PDF} are achieved with
$\gamma=1$ and $\gamma=2$, respectively.
The uniform distribution can even be reached by making $\gamma\to\infty$.
However, from figure~\ref{fig:probability_density_functions} one could
approximate the video coding residuals \ac{PDF} with
$1\le\gamma\le2$~\cite{li-08-laplacian-modeling-dct-coefficients}.

In Sezer's work~\cite{sezer-11-phd,sezer-08-sparse-orthonormal-transforms},
the optimal value of the Lagrange multiplier $\lambda$ has been claimed
to be straightforward to obtain.
Nevertheless, no analytical way of proving its optimality has been found in
the literature.
Hence, a detailed study has been carried out below.

In order to find the optimal $\lambda$ from~\eqref{eqn:rdot_metric},
which describes the trade-off between the distortion and the rate,
$J(\lambda)$ has to be derived.
The problem will be tackled in two separate steps:
\begin{enumerate}
	\item Compute the distortion analytically and derive it.
	\item Compute the rate constraint and derive it.
\end{enumerate}
It has been decided to normalise the~\eqref{eqn:rdot_metric} by $N$ (the
signal dimension) to simplify the equations.
This scaling factor does not affect the solution.

\subsubsection{Derivation of the distortion function}
\label{ssub:derivation_of_the_distortion_function}

The distortion introduced by the hard-thresholding
from~\eqref{eqn:hard_threshold} can be expressed as:
\begin{equation}
	D
	= \frac{1}{N} \int_{-\infty}^{\infty} N {(x - \hat x)}^2 \P_X (x)\d x
	= \int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}}
	\label{eqn:int_distortion}
	x^2\P_X (x) \d x
\end{equation}
The integration intervals have been reduced to where the quantised values
differ from the original ones, that is, the values that have been affected by
the hard-thresholding from~\eqref{eqn:hard_threshold}.

Substituting $P_X(x)$ with the residuals \ac{PDF}:
\begin{align}
	D
	&=\int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}}
	x^2 a \e^{-{\left(b\vert x \vert\right)}^\gamma} \d x \\
	&=2 \int_0^{\nicefrac{\Delta}{2}}
	x^2 a \e^{-{\left(b x \right)}^\gamma} \d x\\
	&=
		2a \frac{\Gamma\left(\frac{3}{\gamma}\right)-
		\Gamma\left(
		\frac{3}{\gamma},{\left(\frac{b\Delta}{2}\right)}^\gamma
		\right)}{b^3\gamma}
	\label{eqn:distortion}
\end{align}
Where $\Gamma(a,z)$ is the incomplete upper gamma function, defined as:
\index{incomplete upper gamma function}
\begin{equation}
	\Gamma(a,z)=\int_z^\infty t^{a-1}\e^{-t}\d t
\end{equation}
Deriving~\eqref{eqn:distortion} in $\Delta$:
\begin{equation}
	\frac{\d D}{\d\Delta} =
	\frac{\Delta^2 a \e^{{(-b\nicefrac{\Delta}{2})}^\gamma}}{4}
	\label{eqn:diff_distortion}
\end{equation}

\subsubsection{Derivation of the zero norm function}
\label{ssub:derivation_of_the_zero_norm_function}

By definition, the $\ell_0$ norm is the total number of non-zero
elements in a vector.
Consequently, the constraint can be expressed as:
\begin{equation}
	R 
	= \frac{1}{N} N \P_X\left(\vert X \vert \ge \frac{\Delta}{2}\right)
	= 1-P_X\left(\vert X \vert < \frac{\Delta}{2}\right)
	= 1-\int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}}\P_X(x)\d x
	\label{eqn:int_rate}
\end{equation}
Substituting $P_X(x)$ with the residuals \ac{PDF}:
\begin{align}
	R
	&=1-\int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}}
	a \e^{-{\left(b\vert x \vert\right)}^\gamma} \d x \\
	&=1-2\int_0^{\nicefrac{\Delta}{2}}
	a \e^{-{\left(b x \right)}^\gamma} \d x \\
	&=1+2a\frac{\Gamma\left(
		\frac{1}{\gamma},{\left(\frac{b\Delta}{2}\right)}^\gamma\right)-
		\Gamma\left(\frac{1}{\gamma}\right)}
		{b\gamma}
	\label{eqn:rate}
\end{align}
Deriving~\eqref{eqn:rate} in $\Delta$:
\begin{equation}
	\frac{\d R}{\d\Delta} =
	-a\e^{{\left(\nicefrac{-b\Delta}{2}\right)}^\gamma}
	\label{eqn:diff_rate}
\end{equation}
\subsubsection{Optimal Lagrange multiplier}
\label{ssub:optimal_lagrange_multiplier}

With both the distortion~\eqref{eqn:diff_distortion} and the
constraint~\eqref{eqn:diff_rate} derived, the optimal Lagrange
multiplier can be found as:
\begin{equation}
	\frac{\d J(\lambda)}{\d \Delta}
	= \frac{\d D}{\d \Delta} +
	\lambda \frac{\d R}{\d \Delta} = 0 \\
	\label{eqn:diff_lagrange_multiplier}
\end{equation}
Substituting both derivatives:
\begin{equation}
	\frac{\Delta^2 a \e^{{(-b\nicefrac{\Delta}{2})}^\gamma}}{4}
	- \lambda
	a\e^{{\left(\nicefrac{-b\Delta}{2}\right)}^\gamma} = 0
	\quad \Rightarrow \quad \boxed{\lambda = \frac{\Delta^2}{4}}
\end{equation}

This proves how the Lagrange multiplier is only related to the quantisation
step.
In other words, once this value is fixed, so is the optimal balance between
the distortion and the rate constraint.

An important consequence of using the $\ell_0$ norm is that the optimal
Lagrange multiplier is independent from the data's \ac{PDF} (it does
not depend on $\sigma$ neither on $\gamma$), meaning that the optimal
Lagrange multiplier remains the same, no matter which transform has been
used.

In fact, these results can be generalised to any \ac{PDF}, making the
$\ell_0$ norm a robust approximation of the rate.
This property makes the $\ell_0$ norm suitable for iterative learning methods
where the transform changes at each iteration and so does the \ac{PDF} of the
training data in the transform domain.

\subsection{Independence from the \acs{PDF}}
\label{sub:independence_from_the_pdf}
\index{PDF}
\index{probability density function}

In the previous subsection, for a given quantisation step, the value of
the optimal Lagrange multiplier $\lambda$ has been found for the particular
case of the residuals displaying \ac{PDF} that can be modelled after a
\ac{GGD}.
By using the fundamental theorem of calculus, which relates integrals
and derivatives of a function, one can generalise that conclusion for
any continuous \ac{PDF}.

Let $f(x)$ be the residuals \ac{PDF}.
The distortion is computed reusing~\eqref{eqn:int_distortion}:
\begin{equation}
	D = \int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}} x^2 f(x) \d x
\end{equation}
Deriving the distortion with respect to $\Delta$:
\begin{equation}
	\frac{\d D}{\d \Delta} =
	\frac{\Delta^2}{8}\left[
	f\left(\frac{\Delta}{2}\right)+f\left(-\frac{\Delta}{2}\right)\right]
\end{equation}
The rate constraint from~\eqref{eqn:int_rate} is computed as
follows with the generic \ac{PDF}
$f(x)$:
\begin{equation}
	R = 1 - \int_{\nicefrac{-\Delta}{2}}^{\nicefrac{\Delta}{2}} f(x) \d x\\
\end{equation}
Again, deriving with respect to $\Delta$:
\begin{equation}
	\frac{\d R}{\d \Delta} =
	-\frac{1}{2}\left[
	f\left(\frac{\Delta}{2}\right)+
	f\left(-\frac{\Delta}{2}\right)\right]
\end{equation}
Combining previous equations using~\eqref{eqn:diff_lagrange_multiplier},
is made clear that the optimal Lagrange multiplier $\lambda$ does not depend
on the residuals \ac{PDF}:
\begin{equation}
	\frac{\Delta^2}{8}\left[
	f\left(\frac{\Delta}{2}\right)+f\left(-\frac{\Delta}{2}\right)\right]
	- \lambda
	\frac{1}{2}\left[
	f\left(\frac{\Delta}{2}\right)+
	f\left(-\frac{\Delta}{2}\right)\right] = 0
\end{equation}
\begin{equation}
	\boxed{\lambda = \frac{\Delta^2}{4}}
\end{equation}

This fact makes the $\ell_0$ norm suitable even for distributions that cannot
be modelled after a \ac{GGD}.
The independence from the \ac{PDF} comforts the choice of the $\ell_0$
norm over other models that might provide a more realistic and accurate
approximation of the rate, such as the entropy.

An example of the $\ell_0$ norm independence from the \ac{PDF} is provided
below.
Consider the following scenario: some residuals \acp{PDF} that follow \ac{GGD}
with different exponents $\gamma$.
Consider, as well, a quantisation step $\Delta=14.256$, issued from a \ac{QP}
27 in \ac{HEVC}.
It has been demonstrated previously, that $J(\lambda)$
from~\eqref{eqn:rdot_metric} achieves its minimum value at
$\lambda=\frac{\Delta^2}{4}\approx50.81$, independently from the
residuals \ac{PDF} when modelling the rate with the $\ell_0$ norm.

However, if instead of using the $\ell_0$ norm, entropy (H) is used, it can no
longer be assumed that the optimal value of $\lambda$ providing the best
trade-off between distortion and rate does not depend on the residuals
\ac{PDF}.
Due to the complexity of the calculations involved then using the entropy
together with a uniform quantiser, the dependency to the \ac{PDF} will be
evidenced through the example from figure~\ref{fig:j_lambda_qp}.
It shows that the value $J(\lambda)$ is plotted for various \acp{GGD} at
different \ac{QP} values, which served as the hard-threshold value.
When using the $\ell_0$ norm, $J(\lambda)$ reaches its minimum value at
QP 27, as expected.
On the other hand, using the entropy leads to a minimum value of
$J(\lambda)$ that is \ac{PDF}-dependent.
An unwanted consequence of this dependence is that, for an iterative
learning algorithm, after updating the transform at each iteration, the
new \ac{PDF} would have to be estimated to find the new optimal Lagrange
multiplier $\lambda$, thus complicating the whole learning process and
leading to probable instabilities.
\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/j_qp27_plot.tex}}
	{\includegraphics{./figures/j_qp27_plot.pdf}}
	\caption
	[$J(\lambda)$ for different \acsp{GGD} modelling the rate with
	the $\ell_0$ norm and the entropy]
	{$J(\lambda)$ for different \acsp{GGD} modelling the rate with
	the $\ell_0$ norm and the entropy (H)}
	\label{fig:j_lambda_qp}
\end{figure}

\subsection{Rate-distortion improvement through the learning}
\label{sub:rate_distortion_improvement_through_the_learning}

Assuming the source \ac{PDF} can be modelled after a \ac{GGD} with a
given variance $\sigma^2$ and exponent $\gamma$, the impact of the
learning in terms of the $J(\lambda)$ from~\eqref{eqn:rdot_metric} can
be illustrated with the following example.
By learning an adapted transform over a set of residuals, the amount of
transformed coefficients mapped to zero increases, increasing its
kurtosis (the distribution ``looks more sharp''), hence reducing the
exponent $\gamma$.
Figure~\ref{fig:lambda_zero_norm_dist} illustrates the relationship
between the distortion and the rate model, at different \acp{QP}.
\acp{QP} 27 and 32 are highlighted, corresponding to quantisation steps
$\Delta = 14.256$ and $\Delta = 25.504$ in \ac{HEVC}, respectively.

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/dist_zero_qp_plot.tex}}
	{\includegraphics{./figures/dist_zero_qp_plot.pdf}}
	\caption{Distortion (\acs{MSE}) and $\ell_0$ norm with $\lambda_{opt}$ at
	\acsp{QP} 27 and 32 for different \acsp{PDF}}
	\label{fig:lambda_zero_norm_dist}
\end{figure}

As the exponent $\gamma$ decreases, so do both terms of $J(\lambda)$,
the distortion and the $\ell_0$ norm.
It can also be seen that since the Lagrange multiplier value $\lambda$
does not change, neither does the slope of a tangent line to the circled
points, corresponding to the optimal trade-off between the distortion
and the rate constraint.
The amount of improvement in each direction is weighted by $\lambda$: at
\ac{QP} 27 there is more room for sparsity improvement than there is for
reducing the distortion, compared to \ac{QP} 32.

\section{Conclusions}
\label{sec:conclusions_transforms}

This chapter has introduced different types of transforms and some
basic concepts, such as separability and its relation to computational
complexity.

Two different approaches of transform design have been studied:
the \ac{KLT} and the \ac{RDOT}.
The \ac{KLT} defines the transform providing optimal bit allocation under the
high-resolution hypothesis.
The \ac{DCT} and \ac{DST} used in \ac{HEVC} have been obtained from the
\ac{KLT} for particular kinds of signals.

The \ac{RDOT} finds the optimal balance between the distortion introduced by
the quantiser and a rate constraint, expressed in terms of signal sparseness.
A detailed study on its design, based on the $\ell_0$ norm, has been carried
out to justify the appropriateness of the approach, lacking in current
literature.

In the next Chapter, both transform designs will be tested in a real
scenario to evaluate their performances in video coding.


\chapter{The mode-dependent directional transforms}
\label{cha:mddt}
\chaptertoc

\section{Introduction}
\label{sec:mddt_introduction}

The previous Chapter has introduced two different transform design approaches.
In order to test their performances for video coding, a technique called
\ac{MDDT} is developed in this section.
The principle of the \ac{MDDT} lies on using an adapted transform learnt
specifically for each \acf{IPM}.

The \ac{MDDT} technique was proposed during the \ac{HEVC} standardisation
phase, but it was finally discarded in favour of a new transform, the
\ac{DST} as explained in \S\ref{sub:dst_and_mddt}.

\subsection{Motivation and principles of the \acs{MDDT}}
\label{sub:mddt_motivation}
\index{MDDT}

Depending on the selected prediction mode, transformed coefficients might
present different patterns, making low and high values appear at different
block positions.
This heterogeneity can be harmful for the entropy coding, which is one of the
reasons why different scanning patterns exist in
\ac{HEVC}~\cite{sole-12-transform-coefficient-coding}.
These scanning patterns, presented in the lower part of
figure~\ref{fig:mdcs} depend only on the \ac{IPM} used to compute the
residual.

The \ac{MDDT} was motivated by the fact that residuals issued from different
\acp{IPM} present notable differences.
An example of these differences is presented in
figure~\ref{fig:residual_differences} for $4\times4$ and $8\times8$ prediction
residuals.
An adapted transform for each \ac{IPM} can provide good signal compaction by
specialisation, which means having an adapted transform for each one of the 35
\acp{IPM}.

\begin{figure}[tb]
	\centering
	\subfloat[$4\times4$ --- \acs{IPM} 6]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s4-p06.png}}
	\hfill
	\subfloat[$4\times4$ --- \acs{IPM} 10]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s4-p10.png}}
	\hfill
	\subfloat[$4\times4$ --- \acs{IPM} 18]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s4-p18.png}}
	\hfill
	\subfloat[$4\times4$ --- \acs{IPM} 26]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s4-p26.png}}
	\\
	\subfloat[$8\times8$ --- \acs{IPM} 6]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s8-p06.png}}
	\hfill
	\subfloat[$8\times8$ --- \acs{IPM} 10]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s8-p10.png}}
	\hfill
	\subfloat[$8\times8$ --- \acs{IPM} 18]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s8-p18.png}}
	\hfill
	\subfloat[$8\times8$ --- \acs{IPM} 26]
	{\includegraphics[width=0.2\linewidth]{./figures/avg-residual-s8-p26.png}}
	\caption{Average residual profiles for different \acsp{IPM}}
	\label{fig:residual_differences}
\end{figure}

A possible implementation of the \ac{MDDT} inside a video coding scheme is
illustrated in figure~\ref{fig:mddt_enc}.
This encoder works in the same way as the hybrid encoder from
figure~\ref{fig:hybrid_video_coding_scheme}, with the exception that, in this
case, each \ac{IPM} will be tested in the \ac{RDO} loop with its corresponding
transform.
Three examples of residuals are shown:
for \acp{IPM} 10, 18 and 26, which are obtained as the differences between the
predictions derived from their corresponding \acp{IPM} and the original image.
The \ac{IPM} 10 stands for a horizontal prediction from left blocks, 18 for
the diagonal, from upper and left blocks and 26 for the vertical prediction
from the upper blocks.
When inside the \ac{RDO} loop, each intra prediction residual is tested using
the assigned transform.
Then, the \ac{IPM} offering the best rate-distortion trade-off is selected and
signalled into the bitstream.

On the decoder side (figure~\ref{fig:mddt_dec}), the only needed information
to decode the block is represented in coloured blocks:
the transformed coefficients and the used \ac{IPM}.
No additional information has to be sent to the decoder, with regards to the
hybrid video coding scheme:
the signalled \ac{IPM} allows the decoder to know which transform to use to
convert the transformed coefficients back to the pixel domain, without having
to perform any extra calculations or take any decisions.

One of the positive aspects of the \ac{MDDT} is the fact that, despite having
more than one transform for a particular block size, only the transform
corresponding to the selected \ac{IPM} is used to transform that residual.
This means that, the \ac{IPM} conditions the transform, so there is no extra
step in the \ac{RDO} loop for the transform.
However, there might be an increase of the complexity on both \ac{MDDT}
encoder and decoder due not to the number of transforms, but to the possible
lack of fast algorithms for the adapted transforms implementation.

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/mddt_enc.tex}}
	{\includegraphics{./figures/mddt_enc.pdf}}
	\caption{Encoding scheme for the \acs{MDDT}}
	\label{fig:mddt_enc}
\end{figure}

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/mddt_dec.tex}}
	{\includegraphics{./figures/mddt_dec.pdf}}
	\caption{Decoding scheme for the \acs{MDDT}}
	\label{fig:mddt_dec}
\end{figure}

\subsection{The \acs{DST} as a simplification of the \acs{MDDT}}
\label{sub:dst_and_mddt}
\index{DST}
\index{KTA}
\index{TMuC}
\index{MDDT}
\index{HEVC}
\index{KLT}

The \ac{DST} has been presented as the \ac{KLT} for intra prediction residuals
in \S\ref{sub:particular_case_dct}.
The way of generating the prediction residuals leads to a particular kind of
signals with properties that differ from those of natural images.
As a result, the \ac{DCT} is no longer a good approximation of the \ac{KLT}.
However, it was not straightforward to realise that the \ac{KLT} for those
signals is the \ac{DST}:
many studies, detailed below, were carried out in order to find better
performing transform for these signals.

Over the \ac{HEVC} standardisation phase, various techniques to improve the
performance of H.264/\acs{MPEG}-4 \ac{AVC} were explored and designated as
\ac{KTA}~\cite{VCEG-AA06}.
Amongst those techniques, in order to improve the coding performance of intra
prediction residuals, an adapted transform for each \ac{IPM} (at that time 9,
which evolved into 35 in \ac{HEVC}), was included in the \ac{KTA} software.
The adapted transforms were \acp{KLT}, and the resulting technique was called
\ac{MDDT}.
It became a core component of the \ac{TMuC}~\cite{JCTVC-A204} due to its
performance improvements over the previous standard.

During the \ac{JCT-VC} meetings, many efforts were done to yield the first
implementation of the \ac{MDDT}, which was non-separable, more
lightweight~\cite{VCEG-AG11}.

One of the first attempts was to make the \acp{KLT}
separable~\cite{JCTVC-B024}.
A fast algorithm was derived for the $4\times4$ transform by analysing
the correlation matrix of the intra prediction residuals, which can be
modelled using tridiagonal
matrices~\cite{yueh-05-eigenvalues-tridiagonal}.
In order to further reduce the complexity, rotational transforms were
also considered in~\cite{JCTVC-C096}.
The main idea was to use the \ac{DCT} followed by a secondary transform
implemented in the form of Given's
rotations~\cite{yang-04-matrix-decomposition} to improve the coding
efficiency.

The \ac{DST} was theoretically proved to be the optimal transform, in
terms of \ac{KLT} approximation, of the intra prediction residuals
in~\cite{JCTVC-C108, JCTVC-D033}.
It was used together with the \ac{DCT} as combinations of horizontal and
vertical transforms, depending on the \ac{IPM}.
However, it was not adopted in the \ac{TMuC} until the design of a fast
algorithm of the \ac{DST}~\cite{saxena-13-fast-transforms-intra-coding}, which
introduced limited increase in the decoding time with regards to the \ac{DCT}.
It was only then when it fully replaced the \ac{MDDT}~\cite{JCTVC-D048,
JCTVC-E125, JCTVC-F283, JCTVC-G108}.
Finally, after many \acp{CE}, cross-checks between companies and combinations
of \ac{DST}, \ac{DCT} on different block sizes and luma and chroma channels,
the whole system was simplified by using the \ac{DST} for all $4\times4$ luma
intra prediction residuals, and the \ac{DCT} for the rest of the block sizes,
due to the lack of fast algorithms and limited gains, in both luma and
chroma~\cite{JCTVC-J0021}.

In the following section, the \ac{MDDT} is re-implemented using both
transforms designs from \S\ref{sec:klt} (\ac{KLT})
and \S\ref{sec:rdot} (\ac{RDOT}), in both separable and non-separable manners,
to empirically compare its performance with regards to the current standard,
\ac{HEVC} and unveil the merits of the transform designs.

\section{Design and implementation of \acs{MDDT} systems}
\label{sec:mddt_design_and_implementation}
\index{MDDT}

The preliminary work carried out for this section has been published in the
form of an article~\cite{arrufat-14-mddt-rdot}.

\subsection{\acs{MDDT} system learning}
\label{sub:mddt_system_learning}
\index{RDOT metric}
\index{KLT}

The transform designs presented in the previous Chapter, namely the \ac{KLT}
and the \ac{RDOT}, are tested in this Section, through the \ac{MDDT}
technique.
This experiment reveals the ability of each design approach to adapt to a
particular kind of signals and to fit the video coding demands.

To design a transform, a learning set is needed.
Since transforms are being designed to improve upon the \acs{MDDT}, the
learning set has been built with intra prediction residuals only, issued from
\ac{HEVC} encodings at four \acp{QP} (22, 27, 32, 37), coming from classes B
and C from the \ac{HEVC} test set, defined by the \ac{JCT-VC} in the
\ac{CTC}~\cite{bossen-12-common-test-conditions}.
These 2 classes contain sequences of $1920\times1080$ and $832\times480$,
respectively, and cover various frame rates of 24, 30, 50 and 60 frames per
second.
As a result, the number of the residuals used for the learning set exceeds the
96 million for $4\times4$ \acp{TU} and 140 million for $8\times8$ \acp{TU}.
Originally, the \ac{MDDT} was designed to work with $4\times4$ blocks, but in
this thesis, $8\times8$ blocks are also considered.

These residuals, grouped by the \ac{IPM} they are issued from, are used to
compute an adapted transform for each block size and for each \ac{IPM} using
both \ac{KLT} and \ac{RDOT} design approaches in separable and non-separable
versions.

To illustrate the results of the iterative learning algorithm for the
\ac{RDOT}, presented in \S\ref{sec:rdot},
figure~\ref{fig:rdot_metric_learning} shows the value of the \ac{RDOT} metric
during the learning phase, averaged across all \acp{IPM}, for separable and
non-separable designs.

\begin{figure}[tb]
	\centering
	\subfloat[$4\times4$ transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/rdot_learning_4_plot.tex}}
	{\includegraphics{./figures/rdot_learning_4_plot.pdf}}}
	\\	
	\subfloat[$8\times8$ transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/rdot_learning_8_plot.tex}}
	{\includegraphics{./figures/rdot_learning_8_plot.pdf}}}
	\caption[\acs{RDOT} metric during separable and non-separable transform
	learnings]
	{Average \acs{RDOT} metric evolution during different transform
	learnings: separable (S) and non-separable (NS)}
	\label{fig:rdot_metric_learning}
\end{figure}

The learning algorithm of the \ac{RDOT} converges smoothly for both separable
and non-separable designs.
As expected, the non-separable \ac{RDOT} presents a lower metric value than
the separable version, since non-separable transforms are able to exploit
linear correlations between any pair of pixels within a block, allowing for
sparser signals in the transform domain.

Furthermore, figure~\ref{fig:rdot_metric_learning} also contains the \ac{RDOT}
metric evaluated on the \ac{KLT}, both separable and non-separable, and the
default \ac{HEVC} transforms.
The use of this metric confirms some points stated in
\S\ref{sub:particular_case_dst} in the \ac{BD}-rate domain that justified the
use of the \ac{DST}-VII over the \ac{DCT}-II for $4\times4$ blocks:
the value of the metric using the \ac{DST}-VII is notably lower than that of
the \ac{DCT}-II, and the fact that the \ac{DST} for these blocks is close to
the separable \ac{KLT} is visible in terms of the \ac{RDOT} metric.

Some examples of $4\times4$ learnt \acp{RDOT} are presented in
figure~\ref{fig:rdot_4x4_bases}.
Although it is complicated to compare transform bases visually, non-separable
transforms have captured the main direction for diagonal \acp{IPM} (2, 6, 18).
These directions can be observed in the base vectors, especially in the first
ones, as their patters exhibit the direction of the \ac{IPM}.
Separable transforms are not able to capture accurately those directions by
construction, since they combine transforms operating row and column wise.

\begin{figure}[tb]
	\centering
	\subfloat[sep. \acs{IPM} 2]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_sep_rdot_s4_p02.png}}
	\hfill
	\subfloat[sep. \acs{IPM} 6]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_sep_rdot_s4_p06.png}}
	\hfill
	\subfloat[sep. \acs{IPM} 10]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_sep_rdot_s4_p10.png}}
	\hfill
	\subfloat[sep. \acs{IPM} 18]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_sep_rdot_s4_p18.png}}
	\hfill
	\subfloat[sep. \acs{IPM} 26]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_sep_rdot_s4_p26.png}}
	\\
	\subfloat[n-sep. \acs{IPM} 2]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_nsep_rdot_s4_p02.png}}
	\hfill
	\subfloat[n-sep. \acs{IPM} 6]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_nsep_rdot_s4_p06.png}}
	\hfill
	\subfloat[n-sep. \acs{IPM} 10]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_nsep_rdot_s4_p10.png}}
	\hfill
	\subfloat[n-sep. \acs{IPM} 18]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_nsep_rdot_s4_p18.png}}
	\hfill
	\subfloat[n-sep. \acs{IPM} 26]
	{\includegraphics[width=0.15\linewidth]{./figures/mddt_nsep_rdot_s4_p26.png}}
	\caption{$4\times4$ separable and non-separable \acs{RDOT} for different \acsp{IPM}}
	\label{fig:rdot_4x4_bases}
\end{figure}

After having confirmed that the learning has output better performing
transforms in terms of distortion and sparseness, their implementation on top
of \ac{HEVC} is tested and discussed in the next section.

\subsection{\acs{MDDT} results on video coding}
\label{sub:mddt_results_on_video_coding}

The \ac{MDDT} systems designed in the previous Section have been used in
\ac{HEVC} using the common test conditions, defined by the
\ac{JCT-VC}~\cite{bossen-12-common-test-conditions}, for \ac{AI} and \ac{RA}
coding configurations.
The common test conditions consist in encoding each sequence at four different
\ac{QP} values (22, 27, 32, 37) and computing the average bitrate savings
with regards to \ac{HEVC} using the \ac{BD}-rate metric.

\subsubsection{Bitrate savings}
\label{ssub:mddt_bit_rate_savings}

Tables~\ref{tab:mddt_ai} and~\ref{tab:mddt_ra} contain the performances of
different \ac{MDDT} systems for \ac{AI} and \ac{RA} coding configurations,
respectively, for the different sequence classes of the \ac{HEVC} test set
using the \acs{BD}-rate metric.
The last line of the table represents the average bitrate savings for all
sequences (it does not correspond to the averaged bitrate savings per class).

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\linewidth}{X|rr|rr|rr|rr|rr|rr}
		\multicolumn{1}{c}{} &
		\multicolumn{4}{c|}{$4\times4$} &
		\multicolumn{4}{c|}{$8\times8$} &
		\multicolumn{4}{c}{$4\times4$ \& $8\times8$} \\
		\cline{2-13}
		\multicolumn{1}{c}{} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c|} {non-sep} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c|} {non-sep} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c} {non-sep} \\
		\hline
		Cl. & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT \\
		\hline\hline
		\centering A   &  0.08 &  0.12 & -0.18 & -0.17 & -0.96 & -0.98 & -1.60 & -1.56 & -0.88 & -0.90 & -1.63 & -1.60 \\
		\centering B   & -0.07 & -0.20 & -0.50 & -0.80 & -0.40 & -0.76 & -1.46 & -2.47 & -0.42 & -0.93 & -1.65 & -2.77 \\
		\centering C   & -0.18 & -0.60 & -1.46 & -2.47 & -0.42 & -0.82 & -2.40 & -4.03 & -0.51 & -1.33 & -3.09 & -5.15 \\
		\centering D   & -0.12 & -0.58 & -1.17 & -2.04 & -0.23 & -0.56 & -1.17 & -2.08 & -0.27 & -1.06 & -1.88 & -3.41 \\
		\centering E   &  0.39 &  0.35 & -0.07 & -0.45 & -0.92 & -1.20 & -2.11 & -2.91 & -0.57 & -0.95 & -1.83 & -3.00 \\
		\centering F   &  0.10 & -0.78 & -0.57 & -2.04 &  0.09 & -0.22 & -1.12 & -2.37 &  0.25 & -0.93 & -1.20 & -3.51 \\
		\hline\hline
		\centering Av. &  0.02 & -0.31 & -0.67 & -1.34 & -0.45 & -0.74 & -1.62 & -2.55 & -0.39 & -1.02 & -1.87 & -3.23 \\
	\end{tabularx}
	\caption{Average bitrate savings (\%) for each \acs{HEVC} Class in \acs{AI}}
	\label{tab:mddt_ai}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\linewidth}{X|rr|rr|rr|rr|rr|rr}
		\multicolumn{1}{c}{} &
		\multicolumn{4}{c|}{$4\times4$} &
		\multicolumn{4}{c|}{$8\times8$} &
		\multicolumn{4}{c}{$4\times4$ \& $8\times8$} \\
		\cline{2-13}
		\multicolumn{1}{c}{} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c|} {non-sep} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c|} {non-sep} &
		\multicolumn{2}{c|} {sep} &
		\multicolumn{2}{c} {non-sep} \\
		\hline
		Cl. & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT & KLT & RDOT \\
		\hline\hline
		\centering A   &  0.03 &  0.05 & -0.03 & -0.07 & -0.29 & -0.42 & -0.71 & -0.80 & -0.27 & -0.37 & -0.70 & -0.87 \\
		\centering B   &  0.00 & -0.10 & -0.26 & -0.51 & -0.24 & -0.46 & -0.86 & -1.42 & -0.25 & -0.55 & -0.99 & -1.63 \\
		\centering C   & -0.08 & -0.36 & -0.77 & -1.41 & -0.20 & -0.39 & -1.18 & -2.22 & -0.23 & -0.70 & -1.57 & -2.88 \\
		\centering D   & -0.03 & -0.25 & -0.50 & -1.05 & -0.08 & -0.25 & -0.53 & -1.05 & -0.11 & -0.50 & -0.88 & -1.73 \\
		\centering E   &  0.10 & -0.19 & -0.36 & -0.87 & -0.31 & -0.74 & -1.02 & -2.04 & -0.27 & -0.91 & -1.14 & -2.47 \\
		\centering F   &  0.07 & -0.59 & -0.29 & -1.26 &  0.08 & -0.10 & -0.50 & -1.42 &  0.18 & -0.65 & -0.56 & -2.21 \\
		\hline\hline
		\centering Av. &  0.01 & -0.24 & -0.37 & -0.85 & -0.17 & -0.38 & -0.80 & -1.47 & -0.16 & -0.60 & -0.97 & -1.93 \\
	\end{tabularx}
	\caption{Average bitrate savings (\%) for each \acs{HEVC} Class in \acs{RA}}
	\label{tab:mddt_ra}
\end{table}

Three system configurations are reported:
a \ac{MDDT} system operating on $4\times4$ only \acp{TU}, $8\times8$ only
\acp{TU} and jointly on both sizes.
The first column of both tables refers to the separable \ac{KLT}-based
\ac{MDDT} for $4\times4$ \acp{TU}, which, as demonstrated
in~\cite{jain-75-nearest-neighbors, jain-76-klt-random-process}, corresponds
to the use of a \ac{DST}.
The average bitrate savings with regards to \ac{HEVC} for this system are
insignificant since \ac{HEVC} already uses the \ac{DST} for those blocks.

The differences between both transform learning approaches (\ac{KLT} and
\ac{RDOT}) can be observed by looking at each pair of columns.
Results are very consistent with what was anticipated in the \ac{RDOT} metric
domain:
the \ac{RDOT}-based \ac{MDDT} outperforms the \ac{KLT} in every case
(separability, \ac{TU} size, class and coding configuration).

It is also worth-noticing the impact of separability:
non-separable configurations provide systematically higher bitrate savings
than the separable ones, especially for those systems involving $8\times8$
\acp{TU}.

For the combined $4\times4$ and $8\times8$ system in \ac{AI} the bitrate
savings are over 3\% and almost 2\% in \ac{RA}.
The \ac{KLT} systems are about one point below.
The detailed performances of the combined systems are included in
table~\ref{tab:detailed_mddt_bd_rate}.
Despite having used classes B and C for the transform learning, consistent
bitrate savings are achieved across different resolutions among classes.
Moreover, classes D and E (not included in the learning set) present higher
bitrate savings than class B.
It is also worth noticing that the \ac{RDOT} systems do not present losses for
any sequence, which is not the case for the \ac{KLT}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\textwidth}{c|X|rr|rr|rr|rr}
		\multicolumn{2}{c}{} &
		\multicolumn{8}{c}{$4\times4$ \& $8\times8$} \\
		\cline{3-10}
		\multicolumn{2}{c}{} &
		\multicolumn{4}{c|}{AI} &
		\multicolumn{4}{c}{RA} \\
		\cline{3-10}
		\multicolumn{2}{c}{} &
		\multicolumn{2}{c|}{sep} &
		\multicolumn{2}{c|}{non-sep} &
		\multicolumn{2}{c|}{sep} &
		\multicolumn{2}{c}{non-sep} \\
		\cline{2-10}
		\multicolumn{1}{c}{} & {Sequence} &
		{KLT} & {RDOT} & {KLT} & {RDOT} &
		{KLT} & {RDOT} & {KLT} & {RDOT} \\
		\hline\hline
		\multirow{5}{0.10\textwidth}{\centering Class A \scriptsize($2560\times1600$)}
		& NebutaFestival       & -0.55 & -0.37 & -0.72 &  -0.44 & -0.08 & -0.05 & -0.08 & -0.11 \\
		& PeopleOnStreet       & -1.17 & -1.31 & -2.53 &  -2.50 & -0.45 & -0.52 & -1.14 & -1.11 \\
		& SteamLocTrain        & -0.50 & -0.41 & -0.60 &  -0.55 &  0.03 &  0.00 & -0.11 & -0.30 \\
		& Traffic              & -1.33 & -1.49 & -2.66 &  -2.90 & -0.58 & -0.91 & -1.47 & -1.96 \\
		\cline{2-10}
		& Average              & -0.88 & -0.90 & -1.63 &  -1.60 & -0.27 & -0.37 & -0.70 & -0.87 \\
		\hline\hline
		\multirow{6}{0.10\textwidth}{\centering Class B \scriptsize($1920\times1080$)}
		& BasketballDrive      &  0.12 & -0.59 & -1.12 &  -2.82 & -0.25 & -0.42 & -1.24 & -1.85 \\
		& BQTerrace            &  0.27 & -0.85 & -2.19 &  -4.62 & -0.05 & -0.63 & -1.28 & -2.45 \\
		& Cactus               & -0.92 & -1.47 & -2.28 &  -3.41 & -0.34 & -0.78 & -1.19 & -2.16 \\
		& Kimono1              & -0.38 & -0.47 & -0.95 &  -1.11 & -0.03 & -0.21 & -0.31 & -0.48 \\
		& ParkScene            & -1.19 & -1.28 & -1.69 &  -1.92 & -0.58 & -0.72 & -0.91 & -1.23 \\
		\cline{2-10}
		& Average              & -0.42 & -0.93 & -1.65 &  -2.77 & -0.25 & -0.55 & -0.99 & -1.63 \\
		\hline\hline
		\multirow{5}{0.10\textwidth}{\centering Class C \scriptsize($832\times480$)}
		& BasketballDrill      & -1.20 & -1.77 & -7.92 & -11.88 & -0.55 & -0.94 & -3.79 & -6.34 \\
		& BQMall               &  0.03 & -1.14 & -0.83 &  -2.58 &  0.12 & -0.58 & -0.33 & -1.47 \\
		& PartyScene           & -0.04 & -1.15 & -1.00 &  -2.56 & -0.07 & -0.68 & -0.74 & -1.75 \\
		& RaceHorses           & -0.84 & -1.26 & -2.60 &  -3.57 & -0.40 & -0.58 & -1.42 & -1.97 \\
		\cline{2-10}
		& Average              & -0.51 & -1.33 & -3.09 &  -5.15 & -0.23 & -0.70 & -1.57 & -2.88 \\
		\hline\hline
		\multirow{5}{0.10\textwidth}{\centering Class D \scriptsize($416\times240$)}
		& BasketballPass       & -0.11 & -0.94 & -1.47 &  -3.12 & -0.12 & -0.52 & -0.96 & -1.78 \\
		& BlowingBubbles       & -0.05 & -0.97 & -1.80 &  -3.49 & -0.07 & -0.52 & -0.93 & -1.90 \\
		& BQSquare             &  0.09 & -1.00 & -0.91 &  -2.69 &  0.08 & -0.53 & -0.24 & -1.40 \\
		& RaceHorses           & -1.00 & -1.34 & -3.36 &  -4.34 & -0.34 & -0.43 & -1.40 & -1.84 \\
		\cline{2-10}
		& Average              & -0.27 & -1.06 & -1.88 &  -3.41 & -0.11 & -0.50 & -0.88 & -1.73 \\
		\hline\hline
		\multirow{4}{0.10\textwidth}{\centering Class E \scriptsize($1280\times720$)}
		& FourPeople           & -1.19 & -1.51 & -2.41 &  -3.20 & -0.59 & -1.23 & -1.52 & -2.54 \\
		& Johnny               & -0.33 & -0.75 & -1.52 &  -2.82 & -0.12 & -0.78 & -0.97 & -2.37 \\
		& KristenAndSara       & -0.18 & -0.59 & -1.56 &  -2.99 & -0.12 & -0.73 & -0.94 & -2.50 \\
		\cline{2-10}
		& Average              & -0.57 & -0.95 & -1.83 & -3.00 & -0.27 & -0.91 & -1.14 & -2.47 \\
		\hline\hline
		\multirow{5}{0.10\textwidth}{\centering Class F \scriptsize(various resolutions)}
		& BasketDrillText      & -0.84 & -1.80 & -6.14 &  -9.96 & -0.48 & -0.88 & -3.15 & -5.57 \\
		& ChinaSpeed           &  0.44 & -0.81 & -0.02 &  -1.81 &  0.17 & -0.43 & -0.08 & -0.95 \\
		& SlideEditing         &  0.68 & -0.57 &  0.98 &  -0.46 &  0.52 & -0.76 &  0.77 & -0.76 \\
		& SlideShow            &  0.73 & -0.55 &  0.37 &  -1.81 &  0.50 & -0.53 &  0.23 & -1.55 \\
		\cline{2-10}
		& Average              &  0.25 & -0.93 & -1.20 &  -3.51 & 0.18  & -0.65 & -0.56 & -2.21 \\
		\hline\hline
		All Sequences
		& Overall          & -0.39 & -1.02 & -1.87 &  -3.23 & -0.16 & -0.60 & -0.97 & -1.93 \\
	\end{tabularx}
	\caption{Detailed bitrate savings (\%) for combined $4\times4$ \& $8\times8$
	\acs{MDDT} systems}
	\label{tab:detailed_mddt_bd_rate}
\end{table}

A remarkable point stands out of the test set: the \emph{BasketballDrill}
sequence from class C, with bitrate savings of almost 12\% using the
non-separable \ac{RDOT} in \ac{AI}.
This is due to the fact that this sequence presents strong directional
patterns that cannot be dealt with separable transforms.
Almost all the performance is lost when using separable systems.
For illustrative purposes, figure~\ref{fig:detailed_mddt_bd_rate} represents
graphically the results from table~\ref{tab:detailed_mddt_bd_rate}.
The behaviour in the \ac{RA} coding configuration is very similar to the
\ac{AI}, with the bitrate savings in \ac{RA} being around two thirds of those
in \ac{AI}.

\begin{figure}[tp]
	\centering
	\subfloat[\acs{AI} coding configuration]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/mddt_perf_ai_plot.tex}}
	{\includegraphics{./figures/mddt_perf_ai_plot.pdf}}}

	\subfloat[\acs{RA} coding configuration]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/mddt_perf_ra_plot.tex}}
	{\includegraphics{./figures/mddt_perf_ra_plot.pdf}}}

	\caption{Bitrate savings for combined $4\times4$ \& $8\times8$ \acs{MDDT}
	systems}
	\label{fig:detailed_mddt_bd_rate}
\end{figure}

As a reminder, all transforms learnt for the presented \ac{MDDT} systems have
used residuals issued from \ac{AI} coding configurations and have only been
enabled for intra coded residuals in \ac{HEVC}.
Nonetheless, around two thirds of the bitrate savings achieved for the
\ac{AI} coding configurations have been achieved in \ac{RA}.
This is due to the fact that I-frames, and intra coded blocks in \ac{RA} are
of higher quality, and serve, therefore, as better references to derive the
temporal predictions.

A final comment on the designed transforms regarding the scanning of the
transformed coefficients:
the transform base vectors for each \ac{IPM} have been sorted taking into
account the scanning that \ac{HEVC} performs implicitly, illustrated in
figure~\ref{fig:mdcs}.
For non-separable transforms, base vectors can be placed in any order, but
separable transforms does not have this freedom, as such, an adapted scanning
matrix is needed to make sure the transform coefficients are properly sorted.

\subsubsection{Coding complexity}
\label{ssub:mddt_coding_complexity}

Regarding the complexity of the system, the increase in the encoding and
decoding time is only due to the lack of fast algorithms for \acp{KLT} and
\acp{RDOT}.
Due to the way the quad-tree partitioning works in \ac{HEVC}, if, for
instance, a $16\times16$ \ac{TU} provides a better rate-distortion trade-off
than splitting it into $8\times8$ \acp{TU}, the $4\times4$ will not be
explored.
Since the \ac{MDDT} system improves $4\times4$ and $8\times8$ \acp{TU}, it
becomes more likely that splitting a $16\times16$ \ac{TU} into four $8\times8$
\acp{TU} works better than not splitting.
In that case, the $4\times4$ \acp{TU} will also be explored, which will lead
to an increase in complexity with regards to \ac{HEVC}, as more coding
possibilities are being explored.
Table~\ref{tab:mddt_summary} summarises the coding complexity for \ac{MDDT}
systems using separable and non-separable transforms for $4\times4$ and
$8\times8$ \acp{TU}.
The complexity for \ac{KLT} and \ac{RDOT} based systems is equivalent, since
transforms are performed as generic matrix multiplications.
As seen in \S\ref{sub:separability}, the number of operations required for a
non-separable transform is about $N$ times the amount for separable
transforms, with $N$ being the \ac{TU} size.
For $4\times4$ \acp{TU}, the complexity in both coding and decoding times
remains approximately the same as that of \ac{HEVC}.
However, for larger \acp{TU}, the differences due to fast algorithms start
becoming more noticeable.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|rr|rr|rr}
		\multicolumn{1}{c}{}
		& \multicolumn{2}{c|}{$4\times4$}
		& \multicolumn{2}{c|}{$8\times8$}
		& \multicolumn{2}{c}{$4\times4$ \& $8\times8$} \\
		\multicolumn{1}{c}{}
		& \multicolumn{1}{c}{sep} & \multicolumn{1}{c|}{non-sep}
		& \multicolumn{1}{c}{sep} & \multicolumn{1}{c|}{non-sep}
		& \multicolumn{1}{c}{sep} & \multicolumn{1}{c}{non-sep} \\
		\hline\hline
		Y \acs{BD}-rate & -0.31\% & -1.34\% & -0.74\% & -2.55\% & -1.02\% & -3.23\% \\
		Encoding  & 101\% & 102\% & 107\% & 111\% & 108\% & 112\% \\
		Decoding  & 101\% & 102\% & 103\% & 115\% & 105\% & 120\% \\
		\acs{ROM} & \SI{1.64}{\kilo B} & \SI{8.75}{\kilo B} &
				\SI{6.56}{\kilo B} & \SI{140.00}{\kilo B} &
				\SI{8.20}{\kilo B} & \SI{148.75}{\kilo B}\\
	\end{tabular}
	\caption{Summary of \acs{RDOT}-based \acs{MDDT} systems compared to
	\acs{HEVC} in \acs{AI}}
	\label{tab:mddt_summary}
\end{table}

\subsubsection{\acs{MDDT} storage requirements}
\label{ssub:mddt_storage_requirements}
\index{ROM}

The transforms used in the \ac{MDDT} system are, not describable with a
mathematical formula unlike the \ac{DCT} and the \ac{DST}.
As such, additional memory is required to store these transforms and make them
available to the encoder and the decoder.
Each transform coefficient has been quantised to 1 byte by rounding it to the
nearest integer.
More advanced quantisation techniques could have been implemented that would
lead to lower degradations~\cite{britanak-06-dct-and-dst}.
The rounding method does not guarantee that the transform remains orthogonal
and that the energy preservation property in the \ac{RDOT} is retained,
although experiments done with this quantisation show that observed losses are
in the order of 0.2 \ac{BD}-rate points.
Taking all this into account, the storage requirements for a non-separable
transform for $N\times N$ \acp{TU} is:
\begin{equation}
	\acs{ROM}_{\text{non-sep}} =
	N^2 \cdot N^2 \cdot \frac{\SI{1}{\kilo B}}{\SI{1024}{B}} \cdot M
	\label{eqn:rom_nsep}
\end{equation}
The needed \acs{ROM} for separable transforms is:
\begin{equation}
	\acs{ROM}_{\text{sep}} =
	3 \cdot N \cdot N \cdot \frac{\SI{1}{\kilo B}}{\SI{1024}{B}} \cdot M
	\label{eqn:rom_sep}
\end{equation}
Where the number of \acp{IPM} $M$ is 35 for the \ac{MDDT} system.
Sorting the transform base vectors in the proper order is essential to assist
the entropy coding~\cite{ye-08-intra-directional-scanning-mddt}.
For this reason, there is a factor 3 for the \acs{ROM} requirements in the
separable case: horizontal transform, vertical transform and scanning matrix
for each pair of transforms.
The non-separable transforms do not need a scanning matrix, since the base
vectors can be sorted to output the signal with an appropriate coefficients
order.
Table~\ref{tab:mddt_summary} contains the \acs{ROM} requirements for the
\ac{MDDT} systems in both separable and non-separable designs.
This table illustrates the drawback of using non-separable transforms
regarding the storage.
Even the \ac{MDDT} system using only non-separable transforms for $4\times4$
\acp{TU} needs more \acs{ROM} than the complete separable \ac{MDDT} system
making use of both sizes.

\section{Conclusions}
\label{sec:mddt_conclusions}

Transform designs presented in Chapter~\ref{cha:transform_coding}, namely the
\ac{KLT} and the \ac{RDOT}, have been evaluated through the \ac{MDDT}, a
technique for intra coded blocks that consists in designing an adapted
transform per \ac{IPM}.
During the learning phase, the \ac{RDOT} metric has presented coherent results
with the knowledge acquired from literature regarding the \ac{KLT}, \ac{DCT}
and \ac{DST}.
These transforms have been evaluated using the \ac{RDOT} metric and have been
confirmed by the results obtained in the \ac{BD}-rate domain:
for intra prediction residuals, the \ac{DST}-VII has a better \ac{BD}-rate
score than the \ac{DCT}-II, and so does using the \ac{RDOT} metric.

The designed transforms have been tested in a modified version of \ac{HEVC}
using the \ac{MDDT} technique.
Non-separable transforms present a significant improvement in terms of
bitrate savings over their separable counterparts.
Moreover, the \ac{RDOT} design approach has provided better results in terms
of \ac{BD}-rate than the \ac{KLT} all over the tested sequences, with no
losses for any of them.
Bitrate savings between around 2--3\% are achieved when using the
\ac{RDOT}-based \ac{MDDT}, which validate the transfom design over the
\ac{KLT}-based design, providing bitrate savings around 1 point below.
Consequently, the \ac{KLT} will no longer be considered in the upcoming
sections, only systems designed using the \ac{RDOT} metric will be considered.

The complexity of the \ac{MDDT} systems remains comparable to that of
\ac{HEVC}, since the modification only implies that default transform is
replaced with an adapted one.
Therefore, the increase in complexity is only due to the fact that transforms
are implemented as matrix multiplications due to the lack of fast algorithms
for general orthogonal transforms.

The following Chapter presents an improvement of the \ac{MDDT}, which will
allow transforms to be more adapted to intra prediction residuals and provide
new coding alternatives to \ac{HEVC}.

\chapter{The mode-dependent transform competition system}
\label{cha:mdtc}
\chaptertoc

\section{Introduction}
\label{sec:introduction_mdtc}
\index{MDDT}
\index{MDTC}

Chapter~\ref{cha:mddt} revisited the existing \acf{MDDT} technique, its
origins and motivation, and summarised how it was discarded in the final
\ac{HEVC} standard in favour of a more simplified approach using the \ac{DST}.
However, the \ac{MDDT} technique has good potential in bitrate savings when
used with \acp{RDOT}, especially in their non-separable design.

This Chapter is focused on improving upon the \ac{MDDT} technique by
increasing the number of available transforms.
The main idea is to provide a fixed number of transforms in each \acf{IPM}
that compete against each other in the \ac{RDO} loop, in the same ways as
block sizes and \acp{IPM} do.
This implies that, for a given block size and \ac{IPM}, there is no longer a
unique transform, but a set of them, and the encoder selects the one that
provides the best trade-off in terms of rate-distortion.
This evolution of the \ac{MDDT} system has been named \ac{MDTC}
system~\cite{arrufat-15-mdtc}.

Transform competition is not an entirely new concept in \ac{HEVC}.
A rudimentary form of competition is present for $4\times4$ intra predicted
luma residuals:
the choice between using the \ac{DST}-VII and not using a transform at all
already exists.
This behaviour is controlled by the
\texttt{transform\_skip\_flag}~\cite{JCTVC-F077, JCTVC-H0208}.

Some work on transform competition has been carried out
in~\cite{arrufat-14-transform-competition-rdot,fengzou-13-rdot-lloyd-intra}.
The main difference is that, in this Chapter, each \ac{IPM} has its own
transform sets instead of a common set for all modes.
The approach from~\cite{fengzou-13-rdot-lloyd-intra} uses the same transforms
for all \acp{IPM}.
The block size and \ac{IPM} decision is based on the default \ac{HEVC}
transforms, then the residual is consequently transformed with the transform
that provides the best trade-off in terms of rate-distortion.

Further work on different mode-dependent tools for video coding have been
explored in~\cite{ma-13-mode-dependent-tools-video-coding}, notably ways of
simplifying and improving the \ac{MDDT} system by adding re-orderings in the
residuals and additional transforms after the main transform.

Early work on \ac{MDTC} systems closely related to this Chapter has been
presented in~\cite{arrufat-15-mdtc}.

\section{Multiple transform design using the \acs{RDOT} metric}
\label{sec:multiple_transform_design}

\subsection{\acs{HEVC} reproducibility}
\label{sub:mdtc_hevc_reproducibility}
\index{HEVC}

Since the main purpose of the transforms in \ac{HEVC} is to compact the energy
of the signal in order to increase the bitrate savings, the design process is
developed as follows:
the default \ac{HEVC} transforms for $4\times4$ and $8\times8$ luma blocks
(the \ac{DST} and \ac{DCT}, respectively) are kept and a number of additional
transforms are learnt to capture those residuals for which \ac{HEVC}
transforms are not adapted to.
This conservative approach has been made to guarantee that the original
\ac{HEVC} coding choices are still available when using the \ac{MDTC} system.

\subsection{The learning algorithm}
\label{sub:mdtc_learning_algorithm}

The chosen transform design is the \ac{RDOT}, since, as demonstrated in the
previous Chapter, it provides significant improvements in terms of bitrate
savings with respect to the \ac{KLT}.
However, in this Chapter, more than one \ac{RDOT} will be learnt per \ac{IPM},
using the same learning set:
residuals issued from an \ac{AI} \ac{HEVC} coding of classes B and C from the
\ac{HEVC} test set, grouped by \ac{IPM}.
For each \ac{IPM}, $2^N$ transforms are learnt, in addition to the \ac{HEVC}
default transforms (for compatibility reasons).
Algorithm~\ref{alg:clustering} describes how the learning has been carried out
in each \ac{IPM}.
Assuming the desired outputs are $2^N$ additional transforms, the performed
steps are:
\begin{enumerate}
	\item Initial random classification of the residuals into $1+2^N$ classes.
	\item For the $2^N$ classes that are not assigned to the \ac{HEVC}
		transform, learn a separable or non-separable transform, depending on
		the desired configuration.
	\item Evaluate each residual using the \ac{RDOT} metric and assign it to
		the transform that minimises the value.
	\item Repeat steps 2 and 3 until convergence.
\end{enumerate}
A conceptual example for $N=1\Rightarrow2^N=2$ additional transforms is provided in
figure~\ref{fig:clustering}.This figure displays the residuals and how a they
would be assigned to the transform that provides the lowest value of the
\acs{RDOT} metric.
The algorithm is subject to improvements, since the initial conditions are
difficult to define.
Consequently, in order to increase the confidence level of not ending up into
a local minimum, several runs per learning are done with different
initialisations.

\begin{algorithm}
	\small
	\SetKwData{append}{append}
	\SetKwInOut{Input}{input}
	\SetKwInOut{Output}{output}
	\Input{Residuals $\x$ from a given \acl{IPM}}
	\Output{Set of $2^N$ \acp{RDOT} $A_n$}
	\BlankLine%
	Initial random classification into $1+2^N$ classes
	\BlankLine%
	\While{!convergence}
	{
			\For{$n=1$ \KwTo{} $2^N$}
			{
				Learn a \ac{RDOT} on $\text{Class}_n$
				using~\eqref{eqn:rdot-nsep} or~\eqref{eqn:rdot-sep}, depending
				on separability
			}
			\ForEach{block $\x$}
			{
				\For{$n=0$ \KwTo{} $2^N$}
				{
					$\delta_n =
					{\Vert \x - \A^T_n\c\Vert}^2 + \lambda{\Vert\c\Vert}_0$
				}
				$\displaystyle n^* = \text{arg}\min\limits_n(\delta_n)$\\
				$\text{Class}_{n^*}$.\append($\x$)
			}
	}
	\caption{Multiple transform design}
	\label{alg:clustering}
\end{algorithm}

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/clustering.tex}}
	{\includegraphics{./figures/clustering.pdf}}
	\caption{Clustering and transform learning for a given set of residuals}
	\label{fig:clustering}
\end{figure}

In order to confirm the coherence of the learnings, the \ac{RDOT} metric has
been evaluated for different learning configurations, depending on the $N$
value for $4\times4$ and $8\times8$ blocks.
Figure~\ref{fig:mdtc_rdot_metric_ntransforms} presents the averaged \ac{RDOT}
metric value for all \acp{IPM} residuals when using an increasing number of
transforms for $4\times4$ and $8\times8$ residuals.
The starting point in~\ref{fig:mdtc_rdot_metric_ntransforms}.a
and~\ref{fig:mdtc_rdot_metric_ntransforms}.b coincides with the \ac{RDOT}
metric evaluated on \ac{HEVC} default transforms, which corresponds to the one
shown in figure~\ref{fig:rdot_metric_learning}, from the previous Chapter.

The \ac{RDOT} metric decreases with the number of transforms, but it stagnates
when the number of transforms is high.

Furthermore, there is an important gap in the \ac{RDOT} metric between
separable and non-separable transforms.
For $4\times4$ blocks, the value achieved with separable transforms, is
achieved with half the number of non-separable transforms.
The gap is even more important for $8\times8$ blocks.

\begin{figure}[tb]
	\centering
	\subfloat[\acs{RDOT} metric evolution with the number of $4\times4$
	transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/rdot_ntransforms_4_plot.tex}}
	{\includegraphics{./figures/rdot_ntransforms_4_plot.pdf}}}
	\\
	\subfloat[\acs{RDOT} metric evolution with the number of $8\times8$
	transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/rdot_ntransforms_8_plot.tex}}
	{\includegraphics{./figures/rdot_ntransforms_8_plot.pdf}}}
	\caption[\acs{RDOT} metric for different separable and non-separable
	transform learnings]
	{Average \acs{RDOT} metric evolution for the learning set depending
	on the number of transforms}
	\label{fig:mdtc_rdot_metric_ntransforms}
\end{figure}

\section{The \acs{MDTC} system in video coding}
\label{sec:performances_of_the_mdtc_system}

This section presents the performances of the \ac{MDTC} systems designed in
the previous section on the full \ac{HEVC} test set.
Systems are evaluated in terms of bitrate savings, encoding complexity and
decoding complexity.

\subsection{Signalling the transforms in the bitstream}
\label{sub:mdtc_signalling}

The main reason for learning $1+2^N$ transform configurations, with the first
one being the default \ac{HEVC} transform for the target block size, has been
backward compatibility, as mentioned in the previous Section.
Since the default transforms are kept, the encoder is able to choose them and
reproduce the same coding choices as \ac{HEVC} in case they provide the best
trade-off.
Moreover, this configuration also favours the signalling of the transforms.
The chosen signalling strategy is to use a flag, which, when enabled,
indicates that a different transform from the default \ac{HEVC} is used for
that \ac{TU} size.
The flag is conveyed though a \ac{CABAC} encoder and uses a context for each
\ac{TU} size to help the entropy coding predict its value and to have
negligible impact on performances in case it is never enabled.
If the flag is enabled, a code word of fixed length ($N$ bits) is used to
signal the selected transform by the encoder among the $2^N$ remaining.
This signalling strategy, apart from keeping backward compatibility with
\ac{HEVC}, favours a simple approach that matches with the fact that, during
the learning phase, all additional transforms were used almost uniformly.

\subsection{Performances of different configurations}
\label{sub:mdtc_performances}

Different \ac{MDTC} systems have been considered and tested, corresponding to
different number of additional transforms used for $4\times4$ and $8\times8$
\acp{TU}.
All tests have been carried out following the common test conditions,
stipulated in~\cite{bossen-12-common-test-conditions}.

Figure~\ref{fig:mdtc_bdrate_ntransforms} illustrates the relationship between
the \ac{BD}-rate and the number of transforms on the \ac{HEVC} test set for
the \ac{AI} configuration.
The \ac{BD}-rate decreases approximately with a logarithmic law on the number
of additional transforms.
The behaviour in the \ac{BD}-rate domain is close to the one observed in the
\ac{RDOT} metric domain in figure~\ref{fig:mdtc_rdot_metric_ntransforms},
despite the fact that signalling has not been taken into account during the
learnings.
Unlike during the transform learnings, additional transforms have been
signalled during encoding tests to provide a decodable stream.
This signalling counterbalances the sparseness increase in the transformed
residuals and contributes to the saturation in \ac{BD}-rate when the number of
transforms is high.

For $4\times4$ transforms, bitrate savings of almost 2.5\% can be achieved,
whereas 6\% are obtained when using only $8\times8$ transforms on the
\ac{HEVC} test set.

\begin{figure}[tb]
	\centering
	\subfloat[\acs{BD}-rate evolution with the number of additional $4\times4$
	transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/bdrate_ntransforms_4_plot.tex}}
	{\includegraphics{./figures/bdrate_ntransforms_4_plot.pdf}}}
	\\
	\subfloat[\acs{BD}-rate evolution with the number additional of $8\times8$
	transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/bdrate_ntransforms_8_plot.tex}}
	{\includegraphics{./figures/bdrate_ntransforms_8_plot.pdf}}}
	\caption[\acs{BD}-rate for different separable and non-separable
	transform sets]
	{Average \acs{BD}-rate evolution for the learning set depending
	on the number of transforms}
	\label{fig:mdtc_bdrate_ntransforms}
\end{figure}

In order to find out the minimum and maximum performances of the combined
\ac{MDTC} systems, two systems have been evaluated on the full \ac{HEVC} test
set in \ac{AI} and \ac{RA} coding configurations, named:
\begin{itemize}
	\item Low complexity system: one additional transform is set per \ac{IPM}
		for both $4\times4$ and $8\times8$ \acp{TU}.
	\item High performance system: 16 additional transforms per \ac{IPM} for
		$4\times4$ \acp{TU} and 32 for $8\times8$ \acp{TU}.
\end{itemize}
The number of transforms per block size has been chosen to guarantee that
enough residuals are available for transform learning in each \ac{IPM} and
that the obtained performances are not counterbalanced by the transform
signalling.
The performances of the low complexity and high performance systems are
summarised in table~\ref{tab:bd_rate_mdtc}.
The first thing to notice is that almost no losses are observed with regards
to \ac{HEVC} on any sequence: only \emph{SteamLocomotiveTrain} presents small
losses in the \ac{RA} coding configuration.
On the one hand, the low complexity system is able to achieve average bitrate
savings of around 3.8\% with non-separable transforms, and around 2.4\% with
separable transforms.
On the other hand, the high complexity system provides average bitrate
savings of over 4\% for its separable version, and over 7\% for the
non-separable one.
Around two thirds of the gain observed in the \ac{AI} coding configuration are
obtained in the \ac{RA}.

Graphical representations of table~\ref{tab:bd_rate_mdtc} are shown in
figures~\ref{fig:mdtc_low_comp} and~\ref{fig:mdtc_high_perf} for the
low complexity and high performance \ac{MDTC} systems, respectively.

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\textwidth}{c|X|rr|rr|rr|rr}
		\multicolumn{2}{c}{} &
		\multicolumn{4}{c|}{$4\times4$: 1+1 --- $8\times8$: 1+1} &
		\multicolumn{4}{c}{$4\times4$: 1+16 --- $8\times8$: 1+32} \\
		\cline{3-10}
		\multicolumn{2}{c}{} &
		\multicolumn{2}{c|}{sep.} &
		\multicolumn{2}{c|}{non-sep.} &
		\multicolumn{2}{c|}{sep.} &
		\multicolumn{2}{c}{non-sep.} \\
		\cline{2-10}
		\multicolumn{1}{c}{} & {Sequence} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c|}{ \acs{RA}} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c|}{ \acs{RA}} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c|}{ \acs{RA}} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c}{ \acs{RA}} \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class A \scriptsize($2560\times1600$)}
		& NebutaFestival         & -0.34 & -0.05 &  -0.52 & -0.04 & -1.06 & -0.08 & -1.15  &  -0.13 \\
		& PeopleOnStreet         & -1.40 & -0.48 &  -2.76 & -1.16 & -4.21 & -1.49 & -5.65  &  -2.27 \\
		& SteamLocTrain          & -0.36 &  0.33 &  -0.46 & -0.10 & -0.60 &  0.23 & -0.68  &   0.03 \\
		& Traffic                & -1.68 & -1.25 &  -3.07 & -2.32 & -4.52 & -3.73 & -6.06  &  -5.12 \\
		\cline{2-10} &
		Average                  & -0.94 & -0.36 &  -1.70 & -0.90 & -2.60 & -1.27 & -3.38  &  -1.87 \\
		\hline
		\hline
		\multirow{6}{2cm}{\centering Class B \scriptsize($1920\times1080$)}
		& BasketballDrive        & -1.17 & -0.35 &  -3.15 & -1.76 & -3.22 & -0.72 & -5.52  &  -2.36 \\
		& BQTerrace              & -1.80 & -1.08 &  -5.11 & -2.83 & -4.70 & -2.65 & -9.22  &  -4.93 \\
		& Cactus                 & -1.95 & -1.15 &  -3.75 & -2.42 & -5.22 & -3.08 & -10.92 &  -7.68 \\
		& Kimono1                & -0.46 & -0.25 &  -1.06 & -0.50 & -1.10 & -0.79 & -1.80  &  -1.18 \\
		& ParkScene              & -1.68 & -1.09 &  -2.20 & -1.50 & -4.61 & -3.22 & -5.27  &  -3.69 \\
		\cline{2-10} &
		Average                  & -1.41 & -0.78 &  -3.05 & -1.80 & -3.77 & -2.09 & -6.55  &  -3.97 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class C \scriptsize($832\times480$)}
		& BasketballDrill        & -2.09 & -1.32 & -12.83 & -7.06 & -5.92 & -3.53 & -25.06 & -14.80 \\
		& BQMall                 & -2.04 & -1.20 &  -3.27 & -1.91 & -4.79 & -2.79 & -6.21  &  -3.64 \\
		& PartyScene             & -2.18 & -1.40 &  -3.31 & -2.29 & -4.88 & -3.21 & -6.19  &  -4.30 \\
		& RaceHorses             & -1.71 & -0.70 &  -3.99 & -2.16 & -4.25 & -1.59 & -6.75  &  -3.24 \\
		\cline{2-10} &
		Average                  & -2.00 & -1.15 &  -5.85 & -3.36 & -4.96 & -2.78 & -11.05 & -6.49 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class D \scriptsize($416\times240$)}
		& BasketballPass         & -1.68 & -0.82 &  -3.71 & -1.97 & -3.91 & -1.82 & -6.15  &  -3.14 \\
		& BlowingBubbles         & -1.96 & -1.21 &  -4.21 & -2.35 & -4.30 & -2.69 & -6.73  &  -3.95 \\
		& BQSquare               & -2.26 & -1.21 &  -3.72 & -2.00 & -4.58 & -2.74 & -6.12  &  -3.61 \\
		& RaceHorses             & -1.57 & -0.58 &  -4.61 & -2.03 & -3.85 & -1.54 & -7.13  &  -3.20 \\
		\cline{2-10} &
		Average                  & -1.87 & -0.95 &  -4.06 & -2.09 & -4.16 & -2.20 & -6.53  & -3.47 \\
		\hline
		\hline
		\multirow{4}{2cm}{\centering Class E \scriptsize($1280\times720$)}
		& FourPeople             & -1.83 & -1.94 &  -3.42 & -3.27 & -4.56 & -5.18 & -6.33  &  -6.70 \\
		& Johnny                 & -1.29 & -1.61 &  -2.96 & -2.86 & -3.34 & -4.16 & -5.21  &  -5.78 \\
		& KristenAndSara         & -1.34 & -1.75 &  -3.49 & -3.26 & -3.82 & -4.47 & -6.01  &  -6.17 \\
		\cline{2-10} &
		Average                  & -1.49 & -1.77 &  -3.29 & -3.13 & -3.91 & -4.60 & -5.85  & -6.22 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class F \scriptsize(various resolutions)}
		& BasketDrillText        & -2.37 & -1.51 & -11.01 & -6.31 & -6.24 & -3.71 & -21.64 & -13.15 \\
		& ChinaSpeed             & -2.02 & -1.22 &  -2.87 & -1.58 & -4.25 & -2.71 & -4.86  &  -3.03 \\
		& SlideEditing           & -2.04 & -2.18 &  -1.82 & -1.97 & -4.80 & -5.05 & -3.67  &  -4.06 \\
		& SlideShow              & -2.53 & -2.32 &  -3.49 & -2.96 & -5.66 & -5.61 & -6.03  &  -5.91 \\
		\cline{2-10} &
		Average                  & -2.24 & -1.81 &  -4.80 & -3.21 & -5.24 & -4.27 & -9.05  & -6.54 \\
		\hline
		\hline
		All sequences &
		Overall                  & -1.66 & -1.10 &  -3.78 & -2.36 & -4.10 & -2.76 & -7.10  &  -4.67 \\
	\end{tabularx}
	\caption{Y \acs{BD}-rate (\%) for low complexity and high performance \acs{MDTC} systems}
	\label{tab:bd_rate_mdtc}
\end{table}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/mdtc_low_comp_plot.tex}}
	{\includegraphics{./figures/mdtc_low_comp_plot.pdf}}
	\caption{\acs{BD}-rate for low complexity \acs{MDTC} system}
	\label{fig:mdtc_low_comp}
\end{figure}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/mdtc_high_perf_plot.tex}}
	{\includegraphics{./figures/mdtc_high_perf_plot.pdf}}
	\caption{\acs{BD}-rate for high performance \acs{MDTC} system}
	\label{fig:mdtc_high_perf}
\end{figure}

\ac{MDTC} systems overcome the \ac{MDDT} systems by a significant amount,
even in their low complexity configuration, around 1 point of \ac{BD}-rate is
gained.

There are some sequences that stand out, notably the \emph{BasketballDrill},
which already showed high gains using the \ac{MDDT} technique (see
\S\ref{ssub:mddt_bit_rate_savings}).
This sequence has the particularity of presenting a large amount of directional
patterns, which lead to bitrate savings of over 25\%.
The \ac{BD}-rate curves for this sequence are presented in
figure~\ref{fig:mdtc_bdrate_bdrill_mdtc}.
For low bitrates, corresponding to \acp{QP} of 32 and 37, the bitstream size
is almost kept untouched, but the \ac{PSNR} is substantially improved.
For higher bitrates, the improvements are present in both axes.
This behaviour has been observed for most of the sequences.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/bdrate_bdrill_wvga_mdtc_plot.tex}}
	{\includegraphics{./figures/bdrate_bdrill_wvga_mdtc_plot.pdf}}
	\caption[\acs{BD}-rate curves for \acs{MDTC} system on
	\emph{BasketballDrill}]
	{\acs{BD}-rate curves of \emph{BasketballDrill} for the non-separable high
	performance \acs{MDTC} system.
	This curve represents a bitrate improvement of 25.06\% over \acs{HEVC}.}
	\label{fig:mdtc_bdrate_bdrill_mdtc}
\end{figure}

An example of visual improvements are provided in
figure~\ref{fig:mdtc_bdrill_visual}.
The first sub-figure displays the original block, which has not been coded
yet, and the two other sub-figures the result of coding that block using
\ac{HEVC} and the non-separable high performance \ac{MDTC}, respectively.
The comparison is pertinent, as both sequences have the same bitrate at
\ac{QP} 37 (around \SI{3.3}{\mega bps}, see
figure~\ref{fig:mdtc_bdrate_bdrill_mdtc}).
It is easy to see the improvements made by the \ac{MDTC} system along the
diagonal patterns of the image:
The lines are cut when coding with \ac{HEVC} but they remain continuous using
\ac{MDTC} thanks to the non-separability of the used transforms.
For the rest of the sequences, since the bitrate savings are much lower, the
visual improvements are, in general,  not easy to spot.
Nevertheless, no specific artefacts have been noticed when using the \ac{MDTC}
technique for either systems.

\begin{figure}[tb]
	\centering
	\subfloat[Original crop]
	{\includegraphics[width=0.25\linewidth]{./figures/bdrill_orig_crop.png}}
	\hfill
	\subfloat[\acs{HEVC} at \acs{QP} 37]
	{\includegraphics[width=0.25\linewidth]{./figures/bdrill_hevc_qp37_crop.png}}
	\hfill
	\subfloat[\acs{MDTC} at \acs{QP} 37]
	{\includegraphics[width=0.25\linewidth]{./figures/bdrill_mdtc_qp37_crop.png}}
	\caption[Example of a $100\times100$ block from \emph{BasketballDrill}
	coded at \acs{QP} 37]
	{A $100\times100$ block from \emph{BasketballDrill} encoded at \acs{QP} 37
	with \acs{HEVC} and the non-separable high performance \acs{MDTC} system}
	\label{fig:mdtc_bdrill_visual}
\end{figure}

\subsection{Coding complexity}
\label{sub:mdtc_coding_complexity}

In this section, the complexity of the \ac{MDTC} systems is analysed.
Contrary to \ac{MDDT} systems, presented in Chapter~\ref{cha:mddt}, \ac{MDTC}
systems explore many more coding alternatives, which leads to significant
increments in terms of encoding time:
up to 20 times the complexity of the reference \ac{HEVC} encodings for the
high performance \acs{MDTC} system.
The separable version presents a much lower complexity, even if it is 8
times higher than the one of \ac{HEVC}.

The decoding time is increased due to the lack of efficient fast algorithms
for the designed \acp{RDOT}, as in the \ac{MDDT} systems.
Nevertheless, in this case, the default \ac{HEVC} transforms are still
available, making it possible that the \ac{MDTC} decoder becomes less complex
that the \ac{MDDT} in some cases.
On the other hand, the decoding complexity can increase slightly with the
number of transforms, since it becomes less likely to use the default
\ac{HEVC} transform.

For the same decoding time, the low complexity \ac{MDTC} system provides
higher bitrate savings (around 0.6 \ac{BD}-rate points) than the \ac{MDDT}
using $4\times4$ and $8\times8$ \acp{TU}, with the only burden being on the
encoding time.

Table~\ref{tab:mdtc_summary} contains the complexity figures for the low
complexity and high performance \ac{MDTC} systems.
A graphical version of the table is presented in
figure~\ref{fig:four_way_mdtc_comparison_mdtc}.
All axes in the four diagrams have the same scaling factor in order to ease
the comparison among them.
Using this representation, \acs{HEVC} would be a dot in the centre of the
diagram, and an ideal system would only have one arrow downwards, meaning that
the \acs{BD}-rate has been improved without any additional coding complexity
or \acs{ROM} storage requirements.

\begin{table}
	\centering
	\small
	\begin{tabular}{c|cc|cc}
		\multicolumn{1}{c}{} &
		\multicolumn{2}{c|}{\multirow{2}{2cm}{\centering $4\times4$: 1+1 $8\times8$: 1+1}} &
		\multicolumn{2}{c}{\multirow{2}{2cm}{\centering $4\times4$: 1+16 $8\times8$: 1+32}} \\
		\multicolumn{1}{c}{} & & & & \\
		\cline{2-5}
		\multicolumn{1}{c}{} & sep. & non-sep. & sep. & non-sep. \\
		\hline
		\hline
		Y \acs{BD}-rate & -1.66\% & -3.78\% & -4.10\% & -7.10\% \\
		Enc. Time & 150\% & 200\% & 800\% & 2000\% \\
		Dec. Time & 103\% & 110\% & 105\% & 120\% \\
		\acs{ROM} & \SI{8.20}{\kilo B} & \SI{148.75}{\kilo B} &
			\SI{236.25}{\kilo B} & \SI{4.51}{\mega B} \\
	\end{tabular}
	\caption{Summary of \acs{RDOT}-based \acs{MDTC} systems compared to \acs{HEVC}}
	\label{tab:mdtc_summary}
\end{table}

\subsection{Storage requirements}
\label{sub:mdtc_storage_requirements}
\index{ROM}

The storage requirements for the \ac{MDTC} systems are computed as explained
in~\eqref{eqn:rom_nsep} and~\eqref{eqn:rom_sep}, with the exception that, for
the \ac{MDTC} systems, the total \acs{ROM} is affected by the number of
transforms in each \ac{TU} size.
The quantisation of the transform coefficient has remained unchanged:
each coefficient is rounded to the closest integer, quantised to 1 byte.
The low complexity \ac{MDTC} system has exactly the same \acs{ROM}
requirements as the \ac{MDDT}, since only one transform is used per \ac{IPM}.
On the other hand, the high performance \ac{MDTC} system has storage
requirements largely superior: more than 4.5 MB are required for the
non-separable version.
The actual values of required \acs{ROM} are presented in
table~\ref{tab:mdtc_summary} and in
figure~\ref{fig:four_way_mdtc_comparison_mdtc}.

\begin{figure}[tb]
	\def\scale{0.26}
	\def\encmax{2000}
	\def\decmax{130}
	\def\bdrmax{-8}
	\def\rommax{8192}
	\centering
	\def\bdr{-1.66}
	\def\enc{150}
	\def\dec{103}
	\def\rom{8.20}
	\subfloat[Separable low complexity \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}
	\def\bdr{-3.78}
	\def\enc{200}
	\def\dec{110}
	\def\rom{148.75}
	\subfloat[Non-separable low complexity \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}

	\def\bdr{-4.10}
	\def\enc{800}
	\def\dec{105}
	\def\rom{236.25}
	\subfloat[Separable high performance \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}
	\def\bdr{-7.10}
	\def\enc{2000}
	\def\dec{120}
	\def\rom{4618.24}
	\subfloat[Non-separable high performance \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}
	\caption{Graphical comparison of the four proposed \acs{MDTC} systems}
	\label{fig:four_way_mdtc_comparison_mdtc}
\end{figure}

\section{Conclusions}
\label{sec:mdtc_conclusions}

A systematic procedure based on~\cite{sezer-08-sparse-orthonormal-transforms}
to learn multiple transforms using the \ac{RDOT} metric is described in this
Chapter.
Despite the learning algorithm being subject to some improvements, the
potential of using multiple non-separable transforms in video coding is
unveiled:
average bitrate savings of more than 7\% can be achieved in \ac{AI}
configurations, with peaks of up to 25\% for some sequences containing large
amounts of diagonal patterns.
The interest of using multiple separable transforms has also been shown, with
bitrate savings of around 4\% on average in \ac{AI}.

However, in order to achieve those gains, the encoder complexity is affected
by a factor of 8--20, depending on separability.
The decoding time using separable transforms remains reasonable, with only 5\%
increase, but up to 20\% when using non-separable transforms.
Storage requirements are also important, if not prohibitive, especially for
non-separable transforms.

\ac{MDTC} systems have also proved to be able to provide different trade-offs
in terms of bitrate savings, complexity and storage requirements.
Bitrate savings of 3.78\% can be achieved with twice the encoding complexity,
10\% added decoding complexity and around \SI{300}{\kilo B} of \acs{ROM}.
In the separable version, savings of 1.66\% are obtained at 50\% added
complexity, 3\% added decoding complexity and around \SI{8}{\kilo B} of
\acs{ROM}.

Next Chapters will explore different ways of simplifying the \ac{MDTC} systems
in three important dimensions of codec design:
encoding complexity, decoding complexity and storage
requirements while trying to keep most part of the bitrate savings.

\chapter{Incomplete transforms}
\label{cha:incomplete_transforms}
\chaptertoc

\section{Introduction}
\label{sec:it_introduction}
\index{incomplete transforms}

Chapters~\ref{cha:mddt} and~\ref{cha:mdtc} have presented the potential of
non-separable transforms in terms of bitrate savings for video coding.
However, they come at a cost:
the amount of algorithmic operations needed to transform a block is highly
increased with respect to a transform having a fast implementation.
Besides, the amount of \acs{ROM} needed to store the transforms themselves is
important, sometimes in the order of several MB.

This Chapter presents an attempt at making non-separable transforms usable for
video coding applications.
The work presented here is mainly based on the one published
in~\cite{arrufat-15-inc-transforms}.

\section{Motivations of incomplete transforms}
\label{sec:it_motivations}

\subsection{Forcing sparse data representation}
\label{sub:it_forcing_sparse_data_representation}
\index{sparse representation}

Sparse data representation has been an important field of study in the last
years thanks to its countless applications in many domains, in which,
compression and feature extraction stand out.
Sparse representation focuses on finding the most compact representation for a
given signal~\cite{huang-06-sparse-representation}.
Amongst them, the K-\acs{SVD} is one way of designing overcomplete
dictionaries to achieve sparse data
representation~\cite{aharon-06-overcomplete-sparse-dict}.
Also, the \ac{RDOT} systems presented in this thesis are part of this
ensemble.
\begin{itemize}
	\item The \ac{RDOT} metric contains a rate constraint turned into a
		measure of the signal sparsity in the transform domain (the number of
		coefficients set to zero).
	\item The \ac{MDTC} system provides an overcomplete representation, as
		\ac{HEVC} core transforms are complemented with $2^N$ transforms.
		As such, $1+2^N$ sets of coefficients are computed on the encoding
		side.
		A selection amongst those is made through the transform signalling and
		the coefficient coding.
\end{itemize}

The usage of multiple complementary transforms to provide sparse
representations has been addressed in Chapter~\ref{cha:mdtc}, with the
\ac{MDTC} systems, where high computational requirements were pointed out.
In this Chapter, a low complexity solution for sparse data representation is
proposed.
The approach is based on a standard orthogonal transform, the \acf{DCT}, in
competition with multiple elementary sparse transforms, named \emph{incomplete
transforms}.
The competition exists in the sense that the encoder selects, for each
\ac{TU}, the transform that provides the best signal representation in the
distortion-sparsity plane.
The effectiveness of this approach is measured using the \ac{RDOT} metric,
defined in~\ref{eqn:rdot-nsep}.

Incomplete transforms can be considered as a special case of \acfp{RDOT},
introduced in \S\ref{sec:rdot}, in which
only one base vector is retained and considered.
Consequently, a signal transformed using an incomplete transform has only one
coefficient different from zero in the transform domain.
In order to be able to represent any signal within a given distortion,
incomplete transforms are conceived to work as companions of a main orthogonal
transform, such as the \ac{DCT} for image coding.

To illustrate a case where incomplete transforms might be useful,
figure~\ref{fig:it_diagram} presents a 2D scenario, where the small dots
symbolise the 2D signals to be transformed.
The main transform, whose base vectors are $v_0$ and $v_1$, is able to
represent the signal efficiently, as $v_0$ follows the main direction of the
dark dots.
By construction, $v_1$ is orthogonal to $v_0$.

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/it_diagram.tex}}
	{\includegraphics{./figures/it_diagram.pdf}}
	\caption[Illustration of the incomplete transform concepts]
	{Illustration of the incomplete transform concepts.
	An additional basis vector ($w_0$ is added to assist an orthogonal
	transform ($v_0,v_1$))}
	\label{fig:it_diagram}
\end{figure}

However, there exists a secondary direction that cannot be represented
compactly using the $(v_0,v_1)$ base vectors:
both axes are required to describe the coordinates of those points.
Just by adding an extra axis, $w_0$, adapted to this secondary direction, an
effective and sparser representation of those dots can be achieved.

Therefore, the dots plotted in this space can be efficiently represented thanks
to the union of two transforms.
One transform is complete, the second one, which can be conceived as a
complete transform, is restricted to only one base vector, the principal
component:
this way, the compactness is guaranteed, as only one transform coefficient
needs to be transmitted.

If only one adapted transform had been used in figure~\ref{fig:it_diagram}
to adapt to all those points, such as the \acl{KLT}, the main axis would have
been placed somewhere between $v_0$ and $w_0$.

\subsection{Complexity analysis}
\label{sub:it_complexity_analysis}

Incomplete transforms are proposed in this work to provide a low complexity
approach to the non-separable \ac{MDTC} systems.
A remarkable consequence of using incomplete transforms is a decrease in
complexity when decoding a signal, since there will be one coefficient
different from zero, decoding implies multiplying the base vector by the
transformed coefficient.

In image coding, a separable two-dimensional transform is expressed as
in~\eqref{eqn:sep_transform}.
Assuming the image is composed of $8\times8$ blocks, $\x$ stands for the
$8\times8$ pixels, and $\X$ their representation in the transform domain.
$\A$ is the $8\times8$ 1D transform.
The usual transform used in image coding is the \ac{DCT}-II, whose fast
algorithm requires in the order of 12 multiplications and 29 additions per
$8\times1$ vector.
As $8$ vectors per block need to be processed both for the vertical and
horizontal transform, processing an $8\times8$ block requires a total of 192
multiplications and 464 additions.
This number of operations is identical for the inverse transform.

For an incomplete transform, only one axis needs to be processed:
each axis being formed by 64 values in this example.
Consequently, only 64 multiplications and 63 additions are also needed to
transform the input block $\x$ into the transform domain, that is, a simple
correlation of the first base vector with the block $\x$.
For the inverse transformation, only 64 multiplications are needed, since the
first and only transform coefficient multiplies the first basis vector to
recover the spatial domain samples.

As a result, in this case, the incomplete transforms can be applied with a
number of operations of approximately one third of the cost of the fastest
separable transforms.
This complexity reduction benefits both the encoder and the decoder.

It is also worth noticing that those incomplete transforms are chosen
non-separable and, therefore, able to exploit any linear correlation amongst
pixels within a block.

\section{Design of incomplete transforms}
\label{sec:it_design_of_incomplete_transforms}

The incomplete transform design is based upon the \ac{RDOT} model proposed
in~\cite{sezer-08-sparse-orthonormal-transforms} and detailed in
\S\ref{sec:rdot}.
The original method describes a way of iteratively deriving one optimal
transform for some given training data and an initial transform by using a
metric that includes a sparsity constraint.
This section adapts the learning method for incomplete transforms.

\subsection{Incomplete transform learning}
\label{sub:it_incomplete_transform_learning}

The learning phase of incomplete transforms uses the same method presented in
\S\ref{sec:rdot}, particularly in~\eqref{eqn:rdot-nsep}.
The design of an incomplete transform takes one extra step apart from those
required in the \ac{RDOT} learning:
after the hard-thresholding of the coefficients, only the first coefficient is
kept.
Consequently, the $\ell_0$ norm of the transform coefficients is always equal
to one for any incomplete transform.
Therefore, for a given set of training signals, one obtains a transform
consisting of one meaningful base vector.
The remaining vectors, albeit constituting a complete transform with the first
base vector, are useless for the aim of the application.

For consistency with previous Chapters, the learning set consists of
prediction residuals issued from an \ac{HEVC} \ac{AI} coding for $4\times4$
and $8\times8$ blocks, the same one used in Chapters~\ref{cha:mddt}
and~\ref{cha:mdtc}.

\subsubsection{Multiple incomplete transforms}
\label{ssub:it_multiple_incomplete_transforms}

In Chapter~\ref{cha:mdtc} the use of multiple transforms inside each \ac{IPM}
was introduced.
A decision has been made to keep the default \ac{HEVC} transform in each
\ac{TU} size for compatibility reasons.
Besides, since incomplete transforms are designed with a strong sparsity
constraint on the quantised coefficients, a complete transform is needed to
guarantee that all signals can be expressed in the transform domain for a
chosen level of fidelity.
By using the same design as in the previous Chapter (one main transform plus
$2^N$ incomplete transforms), signals would be able to have a fall-back
transform when incomplete transforms do not provide the desired level of
fidelity in terms of distortion.

As a result, the learning algorithm for non-separable transforms has remained
almost untouched from algorithm~\ref{alg:clustering}, described in
\S\ref{sub:mdtc_learning_algorithm}, only the design of the transform has been
modified by just keeping the first coefficient after the hard-thresholding.
After that, residuals are evaluated using the \ac{RDOT} metric to assign them
to the best transform.

To illustrate the effectiveness of the learning process,
figure~\ref{fig:it_num_zero} presents how an increase on the number of
incomplete transforms is able to provide a more sparse representation of
the signal in the transform domain.
This illustration is based on the \ac{IPM} 6 for $8\times8$ blocks.

\begin{figure}[tp]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/it_num_zero_plot.tex}}
	{\includegraphics{./figures/it_num_zero_plot.pdf}}
	\caption{Percentage of non-zero coefficients referred to the \acs{DCT}}
	\label{fig:it_num_zero}
\end{figure}

To evaluate this, the average number of significant coefficients is computed
for different coding configurations (from 1 to 32 incomplete transforms used
in conjunction with the \ac{DCT}).
The results are presented relative to the reference system, which consists of
a coding system using the \ac{DCT} alone, such as the \ac{HEVC}.
As the number of incomplete transforms is increased, the proportion of
significant coefficients is decreased to 66\% of its original value for a
similar distortion:
this validates the fact that more sparse representations can be achieved with
the joining of incomplete transforms to the traditional \ac{DCT}.

The result of a learning experiment is shown in
figure~\ref{fig:it_32_transforms}, where 32 $8\times8$ incomplete transforms
are presented.
\begin{figure}[tp]
	\centering
	\includegraphics[width=0.8\linewidth]{./figures/it_32_transforms.png}
	\caption{List of 32 incomplete transforms for $8\times8$ blocks and
	\acs{IPM} 6}
	\label{fig:it_32_transforms}
\end{figure}
For each one, only the first base vector is displayed, since it is the only
one delivering significant transformed coefficients.
In the case of two-dimensional signals, such as images, incomplete transforms can
be directly interpreted as texture patterns against whom the input signal is
matched.
The learning set in this experiment is made of prediction residuals extracted
from a directional mode from the \ac{HEVC} coding scheme.
The selected mode (\ac{IPM} 6) corresponds to an angular prediction of
approximately $+\SI{26}{\degree}$.
Accordingly, the blocks selected by \ac{HEVC} for this mode mostly present a
directional pattern following that prediction.
It can be observed how the incomplete transforms have patterns containing that
particular direction, each exhibiting a particular band-shaped pattern.
Note that the \ac{DCT} requires a significant number of coefficients to
represent such directional, and inherently non-separable, patterns because
none of the \ac{DCT} bases expresses that direction.

\section{Incomplete transforms in video coding}
\label{sec:it_video_coding}

To evaluate the performance of the approach, a set of $4\times4$ and
$8\times8$ incomplete transforms is designed for each \ac{HEVC} \ac{IPM}.
At the encoder, for each block, the best \ac{IPM}/transform pair is selected,
exactly in the same way as the \ac{MDTC} system from Chapter~\ref{cha:mdtc}.

\subsection{Signalling of incomplete transforms in the bitstream}
\label{sub:it_signalling}

The signalling scheme for incomplete transforms is inherited from the
\ac{MDTC} system:
a flag indicates whether the default \ac{HEVC} transform is used.
If that is not the case a fixed length codeword indicates which incomplete
transform has been chosen by the encoder for that \ac{IPM}.

Incomplete transforms have a special feature that generic \acp{RDOT} used in
the \ac{MDTC} system do not have:
they always output a signal that has only one coefficient different from zero.
Consequently, the residual signal bitstream syntax can be reduced:
the significance map (i.e.\ the location of the significant coefficients) does
not need to be transmitted, since the position of the last element different
from zero is known when using incomplete transforms.
This fact can be taken advantage of by removing the
\texttt{last\_sig\_coeff\_x} and \texttt{last\_sig\_coeff\_y} prefix and
suffix syntax elements from the conveyed bitstream~\cite{JCTVC-G704}.

\subsection{Performances of different configurations}
\label{sub:it_performances}

Experiments have been performed following the common test conditions described
in~\cite{bossen-12-common-test-conditions} for the \ac{HEVC} test set.
Incomplete transforms have been enabled for $4\times4$ and $8\times8$
\acp{TU} in order to make the system comparable to the \ac{MDTC} systems from
the previous section.
In particular, 8 additional transforms have been designed for $4\times4$
\acp{TU} and 32 for $8\times8$ \acp{TU}.

The results of encoding the \ac{HEVC} test set using incomplete transforms are
detailed in table~\ref{tab:bd_rate_it}.
Performances are much lower than those presented in the previous Chapter for
the \ac{MDTC} system.
However, as explained in \S\ref{sub:it_coding_complexity} and
\S\ref{sub:it_storage_requirements}, incomplete transforms requirements are
also much lower than a \ac{MDTC} system.

When incomplete transforms are enabled only for $4\times4$ \acp{TU}, the
system presents modest performances (around 0.4\% of bitrate savings) for
both \ac{AI} and \ac{RA} coding configurations.
The system starts presenting more substantial performances when enabling
incomplete transforms for $8\times8$ \acp{TU}, providing 1\% of average
bitrate savings.
If incomplete transforms are enabled for both \ac{TU} sizes, the average
bitrate savings achieve 1.4\% for the \ac{AI} coding configuration and 1\%
for \ac{RA}.

Sequences that performed exceptionally well using the \ac{MDTC} system due to
their non-separable nature, such as \emph{BasketballDrill} present bitrate
savings of more that 8\% using this approach.

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\textwidth}{c|X|rr|rr|rr}
		\multicolumn{2}{c}{} &
		\multicolumn{2}{c|}{$4\times4$} &
		\multicolumn{2}{c|}{$8\times8$} &
		\multicolumn{2}{c}{$4\times4$ \& $8\times8$} \\
		\cline{2-8}
		\multicolumn{1}{c}{} & {Sequence} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c|}{ \acs{RA}} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c|}{ \acs{RA}} &
		\multicolumn{1}{c}{ \acs{AI}} & \multicolumn{1}{c}{ \acs{RA}} \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class A ($2560\times1600$)}
		& NebutaFestival         &  0.01 &  0.00 &  0.00 & -0.04 &  0.01 & -0.03 \\
		& PeopleOnStreet         &  0.00 &  0.02 & -0.64 & -0.32 & -0.86 & -0.37 \\
		& SteamLocomotiveTrain   &  0.04 & -0.12 &  0.03 &  0.16 &  0.03 &  0.19 \\
		& Traffic                &  0.02 & -0.24 & -0.68 & -0.62 & -0.88 & -1.04 \\
		\cline{2-8} &
		Average                  &  0.02 & -0.08 & -0.32 & -0.20 & -0.43 & -0.31 \\
		\hline
		\hline
		\multirow{6}{2cm}{\centering Class B ($1920\times1080$)}
		& BasketballDrive        & -0.10 &  0.02 & -1.07 & -0.72 & -1.31 & -0.78 \\
		& BQTerrace              & -0.36 & -0.39 & -1.06 & -0.76 & -1.41 & -1.10 \\
		& Cactus                 & -0.18 & -0.32 & -1.38 & -0.87 & -1.73 & -1.25 \\
		& Kimono1                &  0.07 &  0.02 & -0.31 & -0.17 & -0.32 & -0.23 \\
		& ParkScene              &  0.13 &  0.08 & -0.15 & -0.16 & -0.19 & -0.24 \\
		\cline{2-8} &
		Average                  & -0.09 & -0.12 & -0.80 & -0.54 & -0.99 & -0.72 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class C ($832\times480$)}
		& BasketballDrill        & -3.21 & -1.95 & -6.22 & -2.91 & -8.22 & -4.35 \\
		& BQMall                 & -0.10 & -0.19 & -0.66 & -0.44 & -1.03 & -0.77 \\
		& PartyScene             & -0.10 & -0.29 & -0.18 & -0.15 & -0.40 & -0.54 \\
		& RaceHorses             & -0.31 & -0.23 & -0.44 & -0.37 & -0.85 & -0.51 \\
		\cline{2-8} &
		Average                  & -0.93 & -0.67 & -1.88 & -0.97 & -2.63 & -1.54 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class D ($416\times240$)}
		& BasketballPass         & -0.18 & -0.06 & -0.47 & -0.31 & -0.88 & -0.55 \\
		& BlowingBubbles         & -0.15 & -0.37 & -0.21 & -0.07 & -0.52 & -0.49 \\
		& BQSquare               & -0.32 & -0.37 & -0.25 & -0.19 & -0.59 & -0.51 \\
		& RaceHorses             & -0.37 & -0.09 & -0.32 & -0.12 & -0.86 & -0.37 \\
		\cline{2-8} &
		Average                  & -0.26 & -0.22 & -0.31 & -0.17 & -0.71 & -0.48 \\
		\hline
		\hline
		\multirow{4}{2cm}{\centering Class E ($1280\times720$)}
		& FourPeople             & -0.03 & -0.46 & -1.07 & -1.03 & -1.36 & -1.58 \\
		& Johnny                 & -0.13 & -0.75 & -1.25 & -1.16 & -1.56 & -1.84 \\
		& KristenAndSara         & -0.31 & -0.59 & -1.15 & -0.77 & -1.56 & -1.44 \\
		\cline{2-8} &
		Average                  & -0.16 & -0.60 & -1.16 & -0.99 & -1.49 & -1.62 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class F (various resolutions)}
		& BasketDrillText        & -2.54 & -1.67 & -4.83 & -2.33 & -6.49 & -3.66 \\
		& ChinaSpeed             & -0.40 & -0.29 & -0.45 & -0.26 & -0.90 & -0.61 \\
		& SlideEditing           & -0.23 & -0.24 & -0.51 & -0.61 & -0.71 & -0.81 \\
		& SlideShow              & -0.73 & -0.76 & -0.81 & -0.83 & -1.51 & -1.64 \\
		\cline{2-8} &
		Average                  & -0.97 & -0.74 & -1.65 & -1.01 & -2.40 & -1.68 \\
		\hline
		\hline
		All sequences &
		Overall                  & -0.40 & -0.38 & -1.00 & -0.63 & -1.42 & -1.02 \\
	\end{tabularx}
	\caption{Y \acs{BD}-rate (\%) for \acs{MDTC} systems using incomplete
	transforms}
	\label{tab:bd_rate_it}
\end{table}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/it_perf_4_8_plot.tex}}
	{\includegraphics{./figures/it_perf_4_8_plot.pdf}}
	\caption{\acs{BD}-rate for incomplete transforms enabled for $4\times4$
	and $8\times8$ \acsp{TU}}
	\label{fig:it_perf_4_8}
\end{figure}

Figure~\ref{fig:mdtc_bdrate_bdrill_inc_tr} shows the \ac{BD}-rate curves of
standard \ac{HEVC} and the \ac{MDTC} system using incomplete transforms on
$4\times4$ and $8\times8$ on the \emph{BasketballDrill} sequence.
The bitrate savings are over 8\% and it can be seen that the system behaves
virtually in the same way as the \ac{MDTC} systems presented in the previous
Chapter for all bitrate ranges.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/bdrate_bdrill_wvga_inc_tr_plot.tex}}
	{\includegraphics{./figures/bdrate_bdrill_wvga_inc_tr_plot.pdf}}
	\caption[\acs{BD}-rate curves for incomplete transforms on
	\emph{BasketballDrill}]
	{\acs{BD}-rate curves of \emph{BasketballDrill} for the \acs{MDTC} system
	using incomplete transforms.
	This curve represents a bitrate improvement of 8.22\% over \acs{HEVC}.}
	\label{fig:mdtc_bdrate_bdrill_inc_tr}
\end{figure}

\subsection{Coding complexity}
\label{sub:it_coding_complexity}

Despite using non-separable incomplete transforms, which are notably less
complex than regular non-separable and even separable transforms, the encoding
complexity is increased.
This is due to the fact that incomplete transforms are used as a simplified
\ac{MDTC} system, where many transforms are tested for a same block size.
This fact makes the encoder explore many coding possibilities that are
unavailable to \ac{HEVC}.
That being said, the increase in encoding complexity is lower than for a
full-fledged \ac{MDTC} system, since testing of incomplete transforms is much
faster.

On the decoder side, the situation changes in favour of the incomplete
transforms with regards to \ac{HEVC}.
Since all coding choices are made by the encoder and the decoder only has to
apply the signalled transform.
Therefore, the decoder is of equal complexity, whenever the \ac{DCT}/\ac{DST}
is chosen, or lower, if incomplete transforms are used, than that of
\ac{HEVC}.

The actual results for encoding and decoding complexity are contained in
table~\ref{tab:it_complexity}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{l|r|r|r}
		& \multicolumn{1}{c|}{$4\times4$}
		& \multicolumn{1}{c|}{$8\times8$}
		& \multicolumn{1}{c}{$4\times4$ \& $8\times8$} \\
		\hline\hline
		Encoding & 210\% & 250\% & 340\% \\
		Decoding &  97\% &  99\% & 100\% \\
	\end{tabular}
	\caption{Relative average complexity of incomplete transform systems to
	\acs{HEVC}}
	\label{tab:it_complexity}
\end{table}


\subsection{Storage requirements}
\label{sub:it_storage_requirements}
\index{ROM}

Due to the similarities to the \ac{MDTC} system, the non-separable incomplete
transforms also need to be stored.
However, in this case, the storage requirements are divided by $N^2$, with $N$
being the \ac{TU} size for which the transforms are conceived, since only one
base vector is stored.
Assuming each transform coefficient is quantised to 1 byte and rounded to the
nearest integer, as in the \ac{MDDT} and \ac{MDTC} systems, this leads to
storage requirements of:
\begin{itemize}
	\item $4\times4$ transforms: \SI{16}{B} per transform.
	\item $8\times8$ transforms: \SI{64}{B} per transform.
\end{itemize}
For the system using 8 $4\times4$ transforms and 32 $8\times8$ transforms, the
global storage requirements are \SI{74.375}{\kilo B}, as there are 35
\acp{IPM}.

\section{Conclusions}
\label{sec:it_conclusions}

Incomplete transforms provide a simplified version of the \ac{MDTC} system,
presented in Chapter~\ref{cha:mdtc}.
These complementary transforms are able to increment the sparsity of the
signal in the transform domain:
the number of non-zero coefficients can be reduced to around two thirds of
its original value.

Experimental results on \ac{HEVC} prove that the sparsity improvements can
lead to bitrate savings of 1.4\% with peaks of up to 8\%, while keeping
the decoding complexity lower than that of \ac{HEVC}.
As a result, incomplete transforms constitute an example of application for
non-separable transforms that can lead to bitrate savings for reasonable
encoding complexity and storage amounts.
On the decoder side, incomplete transforms require fewer operations per block
than \ac{HEVC} transforms, leading to a decrease on the decoding complexity.

Summing up, \ac{MDTC} systems using incomplete transforms represent a more
lightweight alternative to \ac{RDOT}-based \ac{MDTC} systems presented in
Chapter~\ref{cha:mdtc} that are able to provide improvements over \ac{HEVC}
across a wide range of bitrate with low decoding complexity.

\chapter{Simplifications of \acs{MDTC} systems for storage requirements}
\label{cha:real_world_system}
\chaptertoc

\section{Introduction}
\label{sec:rw_introduction}
\index{ROM}

Chapter~\ref{cha:mdtc} introduced the \acf{MDTC} system for video coding
schemes.
Significant bitrate savings were obtained with regards to \ac{HEVC},
especially when using the non-separable high-performance system, where many
non-separable transforms were provided in each \acf{IPM}:
up to 7\% on average for the \ac{AI} coding configuration.

This Chapter takes a look at different approaches to simplify the
system presented in Chapter~\ref{cha:mdtc} in terms of storage requirements
while trying to have minimum impact on its bitrate savings.

\subsection{Discarding non-separable transforms}
\label{sub:discarding_non_separable_transforms}

In order to conceive a low complexity system that makes use of transform
competition, some concessions are required.
This Chapter proposes simplifications to the areas leading to significant
complexity increases in \ac{MDTC} systems, especially, to the storage
requirements.

The first approach for simplification is to suppress the use of non-separable
transforms, even though they provide notably better performances than their
separable counterparts, as seen in Chapters~\ref{cha:mddt} and~\ref{cha:mdtc}.
Non-separable transforms can definitively not accommodate a system with
neither storage restrictions, nor coding complexity limitations.
For this reason, this Chapter will only consider systems composed by separable
transforms.
As a consequence of ditching non-separable transforms, the average bitrate
savings achievable on the \ac{HEVC} using separable high performance \ac{MDTC}
systems are around 4\%, which will be used as a reference when comparing the
simplified systems.

Furthermore, separable transform present an increase of around 5\% in decoding
complexity, instead of the 20\% added by non-separable transforms.

\section{Source of the learning set}
\label{sec:rw_source_of_the_learning_set}

\subsection{Completely independent sets}
\label{sub:rw_completely_independent_sets}

A criticism that could be made to systems presented in
Chapters~\ref{cha:mddt},~\ref{cha:mdtc} and~\ref{cha:incomplete_transforms} is
that the testing set includes the learning set:
classes B and C have served as the learning set and the resulting transforms
to build the \ac{MDDT} and \ac{MDTC} systems have been tested against those
classes.
That being said, performances on classes not included in the learning set (A,
D, E and F) have similar levels of performances, as described previously on
their respective system tables (see tables~\ref{tab:detailed_mddt_bd_rate},
\ref{tab:bd_rate_mdtc} and~\ref{tab:bd_rate_it}).

To remove any doubt about the effect of the learning data set, a completely
independent data set has been chosen and used to learn the transforms.
In order to make all results available and reproducible, the sequence of
choice from which residuals have been output is \emph{Tears of Steel}, a
Creative Commons short film~\cite{blender-tearsofsteel}.
The film features computed generated imagery and at a resolution of
$1920\times800$ at a frame rate of \SI{24}{\hertz} and a length of
\SI{12}{\minute} and \SI{14}{\second}.
Although it is composed of more that 130 sub-sequences, all of them share the
same video post-processing, frame rate, resolution, making a source of
residuals less heterogeneous than classes B and C from the \ac{HEVC} test set,
but providing a larger number of residuals (around as twice as many, in
total):
\begin{itemize}
	\item $4\times4$ residuals: over 142 million.
	\item $8\times8$ residuals: over 343 million.
\end{itemize}

The resulting transforms have served to build new \ac{MDTC} systems that have
been compared to the previous ones on different sequence sets.

Additionally, another independent set of sequences has been used to test the
performances of both learning sets (classes B and C, and \emph{Tears of
Steel}).
This set consists of 59 sequences at different resolutions ($1920\times1080$,
$1280\times720$, $832\times480$) and various frame rates (25, 30, 50, 60, 100,
120, 240 frames per second).
Each sequence has been limited to 5 frames and all the tests are performed
in the \ac{AI} configuration.

From now on, the different sets will be referred to as:
\begin{itemize}
	\item BC: classes B and C from the \ac{HEVC} test set, used as the
		learning set in the previous Chapters.
	\item ADEF: classes A, D, E and F from the \ac{HEVC} test set.
	\item 59seq: 5 frames of 59 heterogeneous (private) sequences for testing.
	\item ToS: the \emph{Tears of Steel} publicly available short film.
\end{itemize}

\subsection{Impact on the coding performances over the learning sets}
\label{sub:rw_performances_new_data_set}

This section presents the results of different \ac{MDTC} systems learnt with
BC and ToS, and then tested against BC, ADEF, 59seq and ToS.

Various additional transform configurations have been learnt for $4\times4$
and $8\times8$ \acp{TU}.
The resulting performances are contained in
tables~\ref{tab:residuals_independence_4}
and~\ref{tab:residuals_independence_8} for $4\times4$ and $8\times8$ separable
transforms, respectively.
The table headers read as follows: Test set | Learning set.
The value before the pipe sign corresponds to the set of sequences that have
been tested and the value after the pipe to the learning set that has been
used.
Moreover, since the ToS learning set provides a larger number of residuals,
more transforms have been learnt for the $8\times8$ blocks.
Bitrate savings are present for all transform configurations from both
residuals sources on all test sets, and they coherently increase with the
number of transforms.

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\linewidth}{r|cc|cc|cc|cc}
		\# tr. & BC | BC & BC | ToS & ADEF | BC & ADEF | ToS & 59seq | BC &
		59seq | ToS & ToS | BC & ToS | ToS \\
		\hline\hline
		1  & -0.69 & -0.61 & -0.85 & -0.95 & -0.48 & -0.48 & -0.13 & -0.17 \\
		2  & -0.93 & -0.85 & -1.16 & -1.21 & -0.66 & -0.66 & -0.21 & -0.25 \\
		4  & -1.08 & -1.05 & -1.32 & -1.54 & -0.76 & -0.80 & -0.23 & -0.31 \\
		8  & -1.23 & -1.14 & -1.48 & -1.68 & -0.86 & -0.91 & -0.27 & -0.34 \\
		16 & -1.28 & -1.22 & -1.65 & -1.85 & -0.93 & -0.96 & -0.30 & -0.34 \\
		32 & -1.28 & -1.25 & -1.76 & -1.89 & -0.93 & -0.99 & -0.27 & -0.35 \\
	\end{tabularx}
	\caption{\acs{BD}-rate (\%) for different testing and learning sets for
	$4\times4$ blocks}
	\label{tab:residuals_independence_4}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\linewidth}{r|cc|cc|cc|cc}
		\# tr. & BC | BC & BC | ToS & ADEF | BC & ADEF | ToS & 59seq | BC &
		59seq | ToS & ToS | BC & ToS | ToS \\
		\hline\hline
		1   & -0.97 & -0.68 & -0.82 & -0.74 & -0.93 & -0.83 & -0.75 & -1.00 \\
		2   & -1.43 & -1.11 & -1.20 & -1.13 & -1.38 & -1.32 & -1.01 & -1.33 \\
		4   & -1.91 & -1.44 & -1.55 & -1.47 & -1.84 & -1.70 & -1.30 & -1.59 \\
		8   & -2.35 & -1.81 & -1.91 & -1.83 & -2.15 & -2.08 & -1.48 & -1.84 \\
		16  & -2.81 & -2.19 & -2.27 & -2.12 & -2.50 & -2.46 & -1.69 & -2.05 \\
		32  & -3.32 & -2.51 & -2.63 & -2.43 & -2.84 & -2.76 & -1.84 & -2.17 \\
		64  & -3.88 & -2.79 & -3.01 & -2.66 & -3.10 & -3.04 & -1.93 & -2.28 \\
		128 & ---   & -3.07 & ---   & -2.92 & ---   & -3.30 & ---   & -2.38 \\
	\end{tabularx}
	\caption{\acs{BD}-rate (\%) for different testing and learning sets for
	$8\times8$ blocks}
	\label{tab:residuals_independence_8}
\end{table}

To better understand the impact of the learning set on the different test
sets, figure~\ref{fig:residuals_independence} shows the results from
tables~\ref{tab:residuals_independence_4}
and~\ref{tab:residuals_independence_8}.
For the $4\times4$ case, there are few differences in performances between
both learning sets, BC and ToS.
Only classes ADEF (in blue) present some preference for the Tears of
Steel data set.
For the $8\times8$ case, there seems to be some over-learning for the
sequences used during the learning phase (BC and ToS), since they present
notable differences in performances when using their own learning set.
The rest of test sets present similar performance levels for both learning
sets.

\begin{figure}[tb]
	\centering
	\subfloat[Additional $4\times4$ transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/residuals_independence_4_plot.tex}}
	{\includegraphics{./figures/residuals_independence_4_plot.pdf}}}
	\\
	\subfloat[Additional $8\times8$ transforms]
	{\ifthenelse{\usepdfs = 0}
	{\input{./figures/residuals_independence_8_plot.tex}}
	{\includegraphics{./figures/residuals_independence_8_plot.pdf}}}
	\\
	\caption{\acs{BD}-rate for different testing and learning sets}
	\label{fig:residuals_independence}
\end{figure}

This comes as a reassuring fact, since the residuals from ToS have less
variability in terms of camera filters, frame rate, resolution and there are
some computer generated textures, whereas the rest of sequences are
exclusively composed of natural images issued from different sources, frame
rates and resolutions.

Table~\ref{tab:cross_bdrate_learn} shows the impact of changing the learning
set to design the high performance \ac{MDTC} system: losses of 0.45 \% points
are observed for the \ac{AI} coding configuration for the \ac{HEVC} test set,
leading to bitrate savings of 3.65\% instead of 4.1\%.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|cc|c}
		& \multicolumn{2}{c|}{\acs{HEVC} test set} & 59seq \\
		\diagbox{Learn}{Test}
		& \acs{AI} & \acs{RA} & \acs{AI} \\
		\hline\hline
		BC  & -4.10\% & -2.76\% & -3.54\% \\
		ToS & -3.65\% & -2.62\% & -3.36\% \\
	\end{tabular}
	\caption[Bitrate savings for separable high performance \acs{MDTC} using
	different test sets]
	{Bitrate savings for separable high performance \acs{MDTC} using
	different test sets. This system uses 16 additional $4\times4$ transforms
	and 32 additional $8\times8$ transforms.}
	\label{tab:cross_bdrate_learn}
\end{table}

Finally, as the 59seq set will be serving as the new reference system, against
which new improvements will be measured, the separable high performance
\ac{MDTC} system has been evaluated on this test set.
The results are also presented in table~\ref{tab:cross_bdrate_learn}.
Performances of the high performance \ac{MDTC} system are almost equivalent on
the 59seq testing set, independently of the used learning set.

In this section, it has been proved that the transforms learnt and tested in
previous Chapters were subject to some over-learning, leading to slightly
better performances than when using a learning set completely independent from
the test set.
This over-learning reassures the fact that the designed transforms are able to
adapt to the learning set, while still showing that there is little dependence
to it.
As a matter of fact, it can be concluded that changing the learning set does
not affect the overall performance:
the order of improvement is learning set independent although some small local
differences appear.

\section{Non-homogeneous \acs{MDTC} systems}
\label{sec:non_homogeneous_mdtc_systems}
\index{non-homogeneous MDTC systems}

In Chapter~\ref{cha:mdtc}, the \ac{MDTC} system was introduced.
The \ac{MDTC} system was conceived as an evolved \ac{MDDT} system, in which,
instead of only one transform, a fixed number of transforms was provided
inside each \ac{IPM}.
However, as exemplified in figure~\ref{fig:hevc_ipm_usage}, \ac{HEVC} does not
use all \acp{IPM} equally:
\ac{HEVC} favours the signalling of the \acp{IPM} that are most often used
during the encoding of a sequence in terms of signalling.
This technique is named \acp{MPM} and it is explained in detail
in~\cite{wien-15-hevc}.
As a result, it might be counter productive to add a high number of transforms
to those modes that are hardly chosen, making them even less cost efficient,
since the extra transforms need to be signalled, as well.
Using a different number of transforms inside each \ac{IPM} does not require
any additional signalling to the transform index, since the encoder will know
how many transforms are available for each \ac{IPM}.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/ipm_hist_bdrill_qp22_4_8_plot.tex}}
	{\includegraphics{./figures/ipm_hist_bdrill_qp22_4_8_plot.pdf}}
	\caption[\acs{IPM} usage in \acs{HEVC}]
	{\acs{IPM} usage in \acs{HEVC} on \emph{BasketballDrive} for \acs{QP} 22}
	\label{fig:hevc_ipm_usage}
\end{figure}

Consequently, next Section proposes a systematic way of generating \ac{MDTC}
systems taking into account the storage requirements.
Since the \acs{ROM} required to store the transforms is directly related to
the number of transforms, and the number of transforms to the encoding
complexity, by limiting the amount of \acs{ROM}, the system might also
simplified in terms of encoding time.

This section proposes a gradient descent algorithm to generate \ac{MDTC} with
\acs{ROM} constraints and a non-homogeneous number of transforms in its
\acp{IPM}.

\subsection{Preparation phase}
\label{sub:rw_preparation_phase}

In order to build a system with a different number of transforms in each
\ac{IPM}, additional transforms for $4\times4$ and $8\times8$ \acp{TU} have
been learnt.
For the $4\times4$ \ac{TU}, up to 32 additional transforms are available, and
up to 128 for $8\times8$ \ac{TU}.

As a reference point, tables~\ref{tab:homogeneous_mdtc_4}
and~\ref{tab:homogeneous_mdtc_8} contain different separable homogeneous
\ac{MDTC} systems using $4\times4$ and $8\times8$ transforms, respectively.
Systems have been labelled using the following convention: {4tN--8tM}.
Where 4 and 8 stand for the transform sizes, t for the type, in this case s,
meaning separable transforms, and N and M represent the number of additional
transforms added in each \ac{IPM} for the respective transform size.
The tables have information for each system about its \acs{ROM} requirements,
the bitrate savings and the encoding complexity with regards to \ac{HEVC}.
Consistent bitrate savings are observed when increasing the number of
transforms for both \ac{TU} sizes, even if the $4\times4$ transform
performances seem to saturate with 16 and 32 additional transforms.

\begin{table}[tb]
	\centering
	\begin{tabular}{r|r|r|r}
	System & \multicolumn{1}{c|}{\acs{ROM}} & Y \acs{BD}-rate & Complexity \\
	\hline\hline
	 4s1--8s0 &  \SI{1.64}{\kilo B} & -0.49\% & 127\% \\
	 4s2--8s0 &  \SI{3.28}{\kilo B} & -0.67\% & 141\% \\
	 4s4--8s0 &  \SI{6.56}{\kilo B} & -0.82\% & 167\% \\
	 4s8--8s0 & \SI{13.13}{\kilo B} & -0.93\% & 219\% \\
	4s16--8s0 & \SI{26.25}{\kilo B} & -0.98\% & 322\% \\
	4s32--8s0 & \SI{52.50}{\kilo B} & -1.01\% & 527\% \\
	\end{tabular}
	\caption{Summary of $4\times4$ homogeneous \acs{MDTC} systems on
	59seq | ToS}
	\label{tab:homogeneous_mdtc_4}
\end{table}

\begin{table}[tb]
	\centering
	\begin{tabular}{r|r|r|r}
	System & \multicolumn{1}{c|}{\acs{ROM}} & Y \acs{BD}-rate & Complexity \\
	\hline\hline
	  4s0--8s1 &   \SI{6.56}{\kilo B} & -0.84\% &  127\% \\
	  4s0--8s2 &  \SI{13.13}{\kilo B} & -1.33\% &  142\% \\
	  4s0--8s4 &  \SI{26.25}{\kilo B} & -1.72\% &  171\% \\
	  4s0--8s8 &  \SI{52.50}{\kilo B} & -2.10\% &  229\% \\
	 4s0--8s16 & \SI{105.00}{\kilo B} & -2.48\% &  345\% \\
	 4s0--8s32 & \SI{210.00}{\kilo B} & -2.79\% &  577\% \\
	 4s0--8s64 & \SI{420.00}{\kilo B} & -3.07\% & 1041\% \\
	4s0--8s128 & \SI{840.00}{\kilo B} & -3.33\% & 1970\% \\
	\end{tabular}
	\caption{Summary of $8\times8$ homogeneous \acs{MDTC} systems on
	59seq | ToS}
	\label{tab:homogeneous_mdtc_8}
\end{table}

\subsection{Determination of number of transforms per \acs{IPM}}
\label{sub:rw_determination_number_transforms_ipm}

In order to improve current homogeneous \ac{MDTC} systems in the \acs{ROM} ---
Y \acs{BD}-rate plane, algorithm~\ref{alg:vmdtc_iter} presents a systematic
way of designing non-homogeneous \ac{MDTC} systems, using a gradient approach.
The algorithm starts with an unmodified version of \ac{HEVC}.
Then, it encodes the 59seq test set enabling one transform by \ac{IPM} at a
time.
After that, it chooses the \ac{IPM} providing the best \acs{BD}-rate/\acs{ROM}
trade-off, computed as the quotient between the differences in \acs{BD}-rate
and \acs{ROM} from the configuration chosen in the previous iteration.
That configuration is then used as the new base system and the transforms are
doubled each time for the \ac{IPM} that gives the best ratio.

\begin{algorithm}
	\small
	\SetKwData{encode}{encode}
	\SetKwData{computeBestRatio}{computeBestRatio}
	\SetKwInOut{Input}{testSet}
	\SetKwInOut{Output}{nTransf[35]}
	\Input{59seq}
	\Output{0}
	\BlankLine%
	\BlankLine%
	tmpTransf[35]: 0
	\BlankLine%
	/* Initial encoding of the test set using HEVC */
	\BlankLine%
	\encode(nTransf, testSet)
	\BlankLine%
	\While{1}
	{
		/* increment the transforms per IPM */
		\BlankLine%
		\For{$i=0$ \KwTo{} $34$}
		{
			\eIf{nTransf[i] == 0}
			{
				tmpTransf[i] = 1
			}
			{
				tmpTransf[i] = 2 * nTransf[i]
			}
		}
		\BlankLine%
		/* encode the test set using the 35 configurations */
		\BlankLine%
		results = \encode(tmpTransf, testSet)
		\BlankLine%
		/* find the IPM providing the best BD-rate/ROM ratio */
		\BlankLine%
		bestIPM = \computeBestRatio(results)
		\BlankLine%
		/* update the transform configuration */
		\BlankLine%
		nTransf[bestIPM] = tmpTransf[bestIPM]
	}
	\caption{Iterative non-homogeneous \acs{MDTC} design}
	\label{alg:vmdtc_iter}
\end{algorithm}

Figure~\ref{fig:vmdtc_iter} displays different \ac{MDTC} systems on the
\acs{ROM} --- Y \ac{BD}-rate plane.
The main systems that are compared are the homogeneous and non-homogeneous
(issued from the iterative algorithm) systems using $4\times4$ and $8\times8$
transforms.
Each point in this figure requires encoding 59 sequences at 4 \acp{QP} for
each one of the 35 \acp{IPM}, leading to a total of \num{8260} encodings.

Systems that combine both transform sizes are also presented in the figure.
Iterative non-homogeneous \ac{MDTC} systems have more flexibility in terms of
the number of transforms enabled in each \ac{IPM} and, therefore, in their
storage requirements.
For this reason, iterative systems provide a lot more granularity than
homogeneous ones, allowing for better \acs{ROM}--\ac{BD}-rate trade-offs
and more selectable \acs{ROM} operating points.

Red lines in figure~\ref{fig:vmdtc_iter} represent the $4\times4$ only
\ac{MDTC} systems.
Homogeneous $4\times4$ systems start saturating in terms of bitrate savings
after 8 additional transform in each \ac{IPM}.
The non-homogeneous $4\times4$ system is able to achieve the same performances
as the homogeneous system, with less \acs{ROM}.

The interest of the non-homogeneous systems is clearer when using $8\times8$
transforms (blue lines), since the storage requirements are higher.
In this case, a factor of almost two can be observed in terms of \acs{ROM} for
the same \ac{BD}-rate in homogeneous and non-homogeneous systems.
Moreover, it can be noticed that, for the $8\times8$ systems, the bitrate
savings are proportional, in a logarithmic scale, to the \acs{ROM}
requirements and, consequently, to the number of transforms.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_iter_plot.tex}}
	{\includegraphics{./figures/vmdtc_iter_plot.pdf}}
	\caption{Different \acs{MDTC} systems on the \acs{ROM} --- Y \acs{BD}-rate
	plane}
	\label{fig:vmdtc_iter}
\end{figure}

Finally, some interesting configurations have been built combining $4\times4$
and $8\times8$ systems, in both homogeneous and non-homogeneous ways,
represented using green lines.
Again, for these systems, especially on low \acs{ROM} configurations, a factor
of two in terms of storage requirements is observed.

Each one of the points from the learnt systems has two coordinates: \acs{ROM}
and \ac{BD}-rate.
This means that, for a given \acs{ROM} constraint, one can find the
combination of the $4\times4$ and $8\times8$ systems that offers the best
trade-off in terms of storage requirements and bitrate savings.
Consequently, the optimal combinations of non-homogeneous \ac{MDTC} systems
have been found by setting a \acs{ROM} specification and selecting the
combination of heterogeneous $4\times4$ and $8\times8$ systems that provide
the best \ac{BD}-rate without exceeding the specified \acs{ROM}.
The results of all system combinations can be found in the scatter plot of
figure~\ref{fig:vmdtc_combined}.
It can bee seen that the points present a lower boundary, which corresponds to
the best combinations for a given \acs{ROM}.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_systems_combined_plot.tex}}
	{\includegraphics{./figures/vmdtc_systems_combined_plot.pdf}}
	\caption[All possible combinations of $4\times4$ and $8\times8$
	non-homogeneous systems]
	{All possible combinations of $4\times4$ and $8\times8$
	non-homogeneous systems based on independent selection of the appropriate
	number of $4\times4$ and $8\times8$ transforms.
	The lower boundary shows the hypothetical bitrate savings that could be
	obtained if there were no penalty when combining independently designed
	$4\times4$ and $8\times8$ transforms.}
	\label{fig:vmdtc_combined}
\end{figure}

The selected \acs{ROM} requirements for the systems presented in
figure~\ref{fig:vmdtc_iter} are detailed in
table~\ref{tab:non_hom_mdtc}.
Information about their actual \acs{ROM} requirements, \ac{BD}-rate and
encoding complexity are provided.
There are some remarkable points, such as the system at \SI{8}{\kilo B},
which can achieve bitrate savings of 1.8\% with an encoding complexity
slightly higher than twice of \ac{HEVC}'s.
Another interesting point is the \SI{64}{\kilo B} system, being able to
provide bitrate savings of 3\% at roughly seven times the complexity of
\ac{HEVC}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{r|r|r|r}
	System & \multicolumn{1}{c|}{\acs{ROM}} & Y \acs{BD}-rate & Complexity \\
	\hline\hline
	  \SI{1}{\kilo B} &   \SI{0.98}{\kilo B} & -0.74\% & 122\% \\
	  \SI{2}{\kilo B} &   \SI{1.97}{\kilo B} & -1.02\% & 131\% \\
	  \SI{4}{\kilo B} &   \SI{3.98}{\kilo B} & -1.39\% & 157\% \\
	  \SI{8}{\kilo B} &   \SI{7.97}{\kilo B} & -1.82\% & 200\% \\
	 \SI{12}{\kilo B} &  \SI{12.00}{\kilo B} & -2.10\% & 221\% \\
	 \SI{16}{\kilo B} &  \SI{15.61}{\kilo B} & -2.22\% & 245\% \\
	 \SI{24}{\kilo B} &  \SI{23.95}{\kilo B} & -2.52\% & 345\% \\
	 \SI{32}{\kilo B} &  \SI{31.78}{\kilo B} & -2.68\% & 387\% \\
	 \SI{48}{\kilo B} &  \SI{47.39}{\kilo B} & -2.85\% & 520\% \\
	 \SI{64}{\kilo B} &  \SI{63.89}{\kilo B} & -3.00\% & 660\% \\
	 \SI{96}{\kilo B} &  \SI{95.53}{\kilo B} & -3.27\% & 851\% \\
	\SI{128}{\kilo B} & \SI{127.41}{\kilo B} & -3.41\% & 931\% \\
	\end{tabular}
	\caption{Non-homogeneous \acs{MDTC} systems with \acs{ROM} constraints}
	\label{tab:non_hom_mdtc}
\end{table}

Nonetheless, learning the $4\times4$ and $8\times8$ non-homogeneous \ac{MDTC}
systems independently and then combining them to design the complete system is
a suboptimal approach.
The suboptimality can be spotted by comparing figures~\ref{fig:vmdtc_iter}
and~\ref{fig:vmdtc_combined}:
the lower boundary of figure~\ref{fig:vmdtc_combined} is lower than the actual
measurements from figure~\ref{fig:vmdtc_iter}.
This is due to the fact that, when combining both transform sizes, there is an
overlap between them in the coded blocks, making the performances of the
combined $4\times4$ and $8\times8$ systems to be a bit lower than the sum of
the corresponding separate systems.

Moreover, presented iterated \ac{MDTC} systems have only the choice to use 1
to 32 additional transforms for $4\times4$ \acs{TU} and 1 to 128 for
$8\times8$ \acp{TU}.
This is another source of suboptimality once the maximum number of transforms
has been enabled in one \ac{IPM}, since the system lacks the choice to double
the number of transforms for that \ac{IPM}.

As a final observation, non-homogeneous systems learnt iteratively with
\acs{ROM} constraints present a pattern in the number of transforms enabled in
each \ac{IPM}.
Those \acp{IPM} that are most used in \ac{HEVC} (0, 1, 10 and 26) tend to make
use of a higher number of transforms than the rest of them, which matches
closely the \ac{IPM} use in \ac{HEVC} from figure~\ref{fig:hevc_ipm_usage}:
systems tend to use more transforms on \acsp{IPM} 0, 1, 10, 26 and around
them.
A detailed table breaking down each \ac{MDTC} system listed in
table~\ref{tab:non_hom_mdtc} is presented in
appendix~\ref{cha:transform_usage_in_mdtc_systems}, particularly
tables~\ref{tab:config_nonsym_mdtc_4} and~\ref{tab:config_nonsym_mdtc_8}, for
$4\times4$ and $8\times8$ \acp{TU}, respectively.

\section{Symmetrical \acs{MDTC} systems}
\label{sec:sym_mdtc}
\index{symmetrical MDTC systems}

\subsection{Simplifications taking advantage of \acs{IPM} symmetries}
\label{sub:simplifications_taking_advantage_of_ipm_symmetries}

In order to reduce the \acs{ROM} impact of the transforms, one can take
advantage of geometrical symmetries existing amongst \ac{HEVC} \acp{IPM}:
within the 35 \acp{IPM} in \ac{HEVC}, symmetries in prediction residuals can
be observed for directional modes (\acp{IPM} from 2 to 34).
Figure~\ref{fig:sym_pred_directions} contains a schematic version of the
\ac{HEVC} \acp{IPM} that illustrates the proposed symmetrical relations.

\begin{figure}[t]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/sym-pred-directions.tex}}
	{\includegraphics{./figures/sym-pred-directions.pdf}}
	\caption{The 35 \acsp{IPM} in \acs{HEVC} and their symmetries}
	\label{fig:sym_pred_directions}
\end{figure}

It can be stated that residuals issued from the first half (2--17) are closely
related to the transposed version of the second half (19--34).
Figure~\ref{fig:symmetry_transposed} illustrates the average $4\times4$
residuals profile and how the first half of the directional \acp{IPM} relates
to the second one.

\begin{figure}[tb]
	\centering
	\def\scalefactor{0.125}
	\small
	\npdigits{4}{0}
	\foreach \direction in {2,3,...,17}
	{
		\def\size{4}
		\numprint{\direction}
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		\numprint{\directiont}
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		%
		\numprint{\directiont}$^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		\hfill
		\def\size{8}
		\numprint{\direction}
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		\numprint{\directiont}
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		%
		\numprint{\directiont}$^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		\par
	}
	\caption{Average $4\times4$ and $8\times8$ residual profiles showing the
	symmetries around \acs{IPM} 18}
	\label{fig:symmetry_transposed}
\end{figure}

These symmetries can be taken one step further and be applied inside the first
half.
Although it might seem counter-intuitive at first, this decision can be
justified geometrically:
around \ac{IPM} 10, which is purely horizontal, it seems natural to think that
residuals from \ac{IPM} 9 (slightly diagonal up) and \ac{IPM} 11 (slightly
diagonal down) can be related via a horizontal mirroring operation
(top-bottom) or reflection.
Since the second half is related to the first half transposed, this property
applies around \ac{IPM} 26, through a vertical mirroring (left-right).
Figures~\ref{fig:symmetry_transposed_mirror_4}
and~\ref{fig:symmetry_transposed_mirror_8} contain illustrative examples of
average profiles and they relations using these imposed symmetries.

\begin{figure}[tb]
	\centering
	\begin{minipage}{0.6\textwidth}
	\def\size{4}
	\def\scalefactor{0.125}
	\small
	\foreach \direction in {2}
	{
	    \numprint{\direction}
	    \includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		$\quad\directiont^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		\pgfmathtruncatemacro{\directionm}{20-\direction}
		%
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directionm$
		\raisebox{\depth}{\scalebox{1}[-1]{ % vertical mirror
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directionm.png}
		}}
		\par
	}
	\foreach \direction in {3,4,5,6,7,8,9}
	{
	    $\direction$
	    \includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		$\quad\directiont^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		%
		\pgfmathtruncatemacro{\directionm}{20-\direction}
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directionm$
		\raisebox{\depth}{\scalebox{1}[-1]{ % vertical mirror
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directionm.png}
		}}
		%
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directiontm^T$ % rotation 90 = transposition + mirror
		\includegraphics[scale=\scalefactor,angle=90]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiontm.png}
		\par
}
\end{minipage}
	\caption[Average $4\times4$ residual profiles showing imposed symmetries]
	{Average $4\times4$ residual profiles showing imposed symmetries with
	transposition and vertical mirroring}
	\label{fig:symmetry_transposed_mirror_4}
\end{figure}

\begin{figure}[tb]
	\centering
	\begin{minipage}{0.6\textwidth}
	\def\size{8}
	\def\scalefactor{0.125}
	\small
	\foreach \direction in {2}
	{
	    \numprint{\direction}
	    \includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		$\quad\directiont^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		\pgfmathtruncatemacro{\directionm}{20-\direction}
		%
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directionm$
		\raisebox{\depth}{\scalebox{1}[-1]{ % vertical mirror
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directionm.png}
		}}
		\par
	}
	\foreach \direction in {3,4,5,6,7,8,9}
	{
	    $\direction$
	    \includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\direction.png}
		%
		\pgfmathtruncatemacro{\directiont}{36-\direction}
		$\quad\directiont^T$
		\reflectbox{               % mirror +
		\rotatebox[origin=c]{-90}{ % rotation = transposition
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiont.png}
		}}
		%
		\pgfmathtruncatemacro{\directionm}{20-\direction}
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directionm$
		\raisebox{\depth}{\scalebox{1}[-1]{ % vertical mirror
		\includegraphics[scale=\scalefactor]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directionm.png}
		}}
		%
		\pgfmathtruncatemacro{\directiontm}{52-\directiont}
		$\quad\updownarrow\directiontm^T$ % rotation 90 = transposition + mirror
		\includegraphics[scale=\scalefactor,angle=90]{./figures/residuals_profile_\size x\size/residuals-\size x\size-\directiontm.png}
		\par
}
\end{minipage}
	\caption[Average $8\times8$ residual profiles showing imposed symmetries]
	{Average $8\times8$ residual profiles showing imposed symmetries with
	transposition and horizontal mirroring}
	\label{fig:symmetry_transposed_mirror_8}
\end{figure}

When both symmetries are exploited, the number of transforms of an \ac{MDTC}
system is no longer affected by a factor of 35 (the number of \acp{IPM}), but
by a factor of 11 (the number of basic \acp{IPM}), as presented in
table~\ref{tab:relations_ipm_residuals}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|ccc}
	\bf IPM &
	${(\cdot)}^T$ &
	\bf $\updownarrow(\cdot)$ &
	\bf $\updownarrow{(\cdot)}^T$ \\[1ex]
	\hline\hline
	\bf 0       & ---           & ---    & ---    \\
	\bf 1       & ---           & ---    & ---    \\
	\bf 2       & 34            & 18     & ---    \\
	\bf 3       & 33            & 17     & 19     \\
	\bf 4       & 32            & 16     & 20     \\
	\bf 5       & 31            & 15     & 21     \\
	\bf 6       & 30            & 14     & 22     \\
	\bf 7       & 29            & 13     & 23     \\
	\bf 8       & 28            & 12     & 24     \\
	\bf 9       & 27            & 11     & 25     \\
	\bf 10      & 26            & ---    & ---    \\
	\end{tabular}
	\caption{Symmetrical relations among \ac{IPM} residuals}
	\label{tab:relations_ipm_residuals}
\end{table}

Using these symmetries between \acp{IPM} implies manipulating the residual
before the transform stage.
Before transmitting a residual, the encoder will consider the following cases,
depending on the \ac{IPM}:
\begin{equation}
	\X =
	\begin{cases}
		\A \, \x & \;\;\, 0 \le \ac{IPM} \le 10 \\
		\A \, \updownarrow\x & 11 \le \ac{IPM} \le 18 \\
		\A \, \updownarrow\x^T & 19 \le \ac{IPM} \le 25 \\
		\A \, \x^T & 26 \le \ac{IPM} \le 34 \\
	\end{cases}
\end{equation}
Where $\A$ is a transform designed for the basic IPM set, $\x$ is the current
residual and $\updownarrow$ represents the horizontal mirroring operator
(top-bottom).
Mirroring and transposing operations are used to make residuals compatible
with the transforms learnt for the basic IPM set.
These operations are only different ways of re-arranging the residual pixels
consistently, which come at no computational cost.

\subsection{Symmetries performances on video coding}
\label{sub:symmetries_performances_on_video_coding}

Tables~\ref{tab:sym4} and~\ref{tab:sym8} show the impact of symmetries on
$4\times4$ and $8\times8$ \ac{MDTC} systems, respectively in the \acs{ROM} and
\ac{BD}-rate plane.
As demonstrated before, for the full-symmetrical \ac{MDTC} systems, \acs{ROM}
is decreased to around one third of its original value.
Bitrate savings also experiment some losses, tending to decrease as the
number of transforms increases, since symmetrical systems are more constrained
than non-symmetrical systems.
The encoding complexity for symmetrical system is equivalent to the one for
non-symmetrical ones, detailed in tables~\ref{tab:homogeneous_mdtc_4}
and~\ref{tab:homogeneous_mdtc_8}, as the number of transforms per \ac{IPM}
remains unchanged.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|rr|rr}
		& \multicolumn{2}{c|}{non-symmetrical}
		& \multicolumn{2}{c}{symmetrical} \\
		\# tr & \multicolumn{1}{c}{\acs{ROM}} &
		\multicolumn{1}{c|}{Y \acs{BD}-rate} &
		\multicolumn{1}{c}{\acs{ROM}} &
		\multicolumn{1}{c}{Y \acs{BD}-rate} \\
		\hline \hline
		 1 &  \SI{1.64}{\kilo B} & -0.49\% &  \SI{0.52}{\kilo B} & -0.50\% \\
		 2 &  \SI{3.28}{\kilo B} & -0.67\% &  \SI{1.03}{\kilo B} & -0.63\% \\
		 4 &  \SI{6.56}{\kilo B} & -0.82\% &  \SI{2.06}{\kilo B} & -0.72\% \\
		 8 & \SI{13.13}{\kilo B} & -0.93\% &  \SI{4.13}{\kilo B} & -0.84\% \\
		16 & \SI{26.25}{\kilo B} & -0.98\% &  \SI{8.25}{\kilo B} & -0.91\% \\
		32 & \SI{52.50}{\kilo B} & -1.01\% & \SI{16.50}{\kilo B} & -0.93\% \\
	\end{tabular}
	\caption{Symmetries impact on $4\times4$ transforms on the 59seq test set}
	\label{tab:sym4}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|rr|rr}
		& \multicolumn{2}{c|}{non-symmetrical}
		& \multicolumn{2}{c}{symmetrical} \\
		\# tr & \multicolumn{1}{c}{\acs{ROM}} &
		\multicolumn{1}{c|}{Y \acs{BD}-rate} &
		\multicolumn{1}{c}{\acs{ROM}} &
		\multicolumn{1}{c}{Y \acs{BD}-rate} \\
		\hline \hline
		  1 &   \SI{6.56}{\kilo B} & -0.84\% &   \SI{2.06}{\kilo B} & -0.76\% \\
		  2 &  \SI{13.13}{\kilo B} & -1.33\% &   \SI{4.13}{\kilo B} & -1.24\% \\
		  4 &  \SI{26.25}{\kilo B} & -1.72\% &   \SI{8.25}{\kilo B} & -1.48\% \\
		  8 &  \SI{52.50}{\kilo B} & -2.10\% &  \SI{16.50}{\kilo B} & -1.97\% \\
		 16 & \SI{105.00}{\kilo B} & -2.48\% &  \SI{33.00}{\kilo B} & -2.22\% \\
		 32 & \SI{210.00}{\kilo B} & -2.79\% &  \SI{66.00}{\kilo B} & -2.69\% \\
		 64 & \SI{420.00}{\kilo B} & -3.07\% & \SI{132.00}{\kilo B} & -2.99\% \\
		128 & \SI{840.00}{\kilo B} & -3.33\% & \SI{264.00}{\kilo B} & -3.25\% \\
	\end{tabular}
	\caption{Symmetries impact on $8\times8$ transforms on the 59seq test set}
	\label{tab:sym8}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{rr|rr}
		\multicolumn{2}{c|}{non-symmetrical} & \multicolumn{2}{c}{symmetrical} \\
		\multicolumn{1}{c}{\acs{ROM}} & \multicolumn{1}{c|}{Y \acs{BD}-rate} &
		\multicolumn{1}{c}{\acs{ROM}} & \multicolumn{1}{c}{Y \acs{BD}-rate} \\
		\hline \hline
		\SI{236.25}{\kilo B} & -3.36\% & \SI{74.25}{\kilo B} & -3.50\% \\
	\end{tabular}
	\caption[Symmetries impact of the high performance \acs{MDTC} system on
	the 59seq test set]
	{Symmetries impact of the high performance \acs{MDTC} system on the 59seq
	test set.
	This system uses 16 additional transform for $4\times4$ \acp{TU} and 32
	for $8\times8$ \acp{TU}}
	\label{tab:sym_mdtc}
\end{table}

Table~\ref{tab:sym_mdtc} unveils impact of symmetries for the high performance
\ac{MDTC} system, defined in \S\ref{sub:mdtc_performances}, on the 59seq test
set.
As a reminder, the high performance \ac{MDTC} system uses 16 additional
transforms for the $4\times4$ \acp{TU} and 32 for the $8\times8$ \acp{TU}.
The symmetrical \acs{MDTC} system presents slightly better performances for
the 59seq test set than the non-symmetrical one.
Consequently, exploiting intra prediction residual symmetries can be
considered to have no global impact on bitrate savings.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_sym_plot.tex}}
	{\includegraphics{./figures/vmdtc_sym_plot.pdf}}
	\caption{Symmetry impact on the \acs{ROM} --- Y \acs{BD}-rate plane for
	different \acs{MDTC} systems}
	\label{fig:vmdtc_sym}
\end{figure}

\section{Performances of non-homogeneous symmetrical \acs{MDTC} systems}
\label{sec:performances_non_homogeneous_symmetrical_mdtc}

The previous two Sections have explored new methods to reduce the storage
requirements of \ac{MDTC} systems.
The approaches make use of:
\begin{itemize}
	\item Heterogeneous repartition of the number of transforms per
		\ac{IPM}.
	\item Symmetries in prediction residuals.
\end{itemize}

These techniques are not mutually exclusive, as such, this section presents
\ac{MDTC} systems that combine both of them, resulting in non-homogeneous
symmetrical \ac{MDTC} systems.
These systems use a different number of transforms per \ac{IPM} but
guaranteeing the symmetrical relations established in \S\ref{sec:sym_mdtc}.
Building this kind of systems requires the iterative approach presented in
\S\ref{sec:non_homogeneous_mdtc_systems}, described in
algorithm~\ref{alg:vmdtc_iter}, except that, instead of enabling transforms
for one \ac{IPM} at a time, transforms will be enabled for one of the basic
\acp{IPM} (0--10) and their symmetrical ones.

Results of the iterations of these non-homogeneous symmetrical systems are
presented in figures~\ref{fig:vmdtc_iter_sym_rom}
and~\ref{fig:vmdtc_iter_sym_complx}.
The first figure compares the systems in the \acs{ROM} --- Y \acs{BD}-rate
plane with the designed systems from previous sections making use of one
technique only.
The second figure places the systems in the complexity --- Y \acs{BD}-rate
plane to ensure that systems remain comparable in terms of encoding times.

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_iter_sym_rom_plot.tex}}
	{\includegraphics{./figures/vmdtc_iter_sym_rom_plot.pdf}}
	\caption{\acs{MDTC} systems making use of non-homogeneous transform
	repartition and symmetries}
	\label{fig:vmdtc_iter_sym_rom}
\end{figure}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_iter_sym_cmplx_plot.tex}}
	{\includegraphics{./figures/vmdtc_iter_sym_cmplx_plot.pdf}}
	\caption[Symmetry impact on the complexity and \acs{BD}-rate of
	non-homogeneous \acs{MDTC} systems]
	{Symmetry impact on the coding complexity and \acs{BD}-rate of
	non-homogeneous \acs{MDTC} systems}
	\label{fig:vmdtc_iter_sym_complx}
\end{figure}

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{r|r|r|r}
	System & \multicolumn{1}{c|}{\acs{ROM}} & Y \acs{BD}-rate & Complexity \\
	\hline\hline
	  \SI{1}{\kilo B} &   \SI{0.98}{\kilo B} & -0.89\% &  126\% \\
	  \SI{2}{\kilo B} &   \SI{1.92}{\kilo B} & -1.25\% &  150\% \\
	  \SI{4}{\kilo B} &   \SI{3.98}{\kilo B} & -1.71\% &  175\% \\
	  \SI{8}{\kilo B} &   \SI{7.88}{\kilo B} & -2.16\% &  226\% \\
	 \SI{12}{\kilo B} &  \SI{11.62}{\kilo B} & -2.39\% &  262\% \\
	 \SI{16}{\kilo B} &  \SI{15.84}{\kilo B} & -2.51\% &  379\% \\
	 \SI{24}{\kilo B} &  \SI{23.67}{\kilo B} & -2.83\% &  425\% \\
	 \SI{32}{\kilo B} &  \SI{31.92}{\kilo B} & -2.97\% &  492\% \\
	 \SI{48}{\kilo B} &  \SI{48.00}{\kilo B} & -3.13\% &  698\% \\
	 \SI{64}{\kilo B} &  \SI{63.84}{\kilo B} & -3.28\% &  776\% \\
	 \SI{96}{\kilo B} &  \SI{88.88}{\kilo B} & -3.45\% & 1084\% \\
	\SI{128}{\kilo B} & \SI{123.38}{\kilo B} & -3.56\% & 1322\% \\
	\end{tabular}
	\caption{Non-homogeneous symmetrical \acs{MDTC} systems with \acs{ROM}
	constraints}
	\label{tab:non_hom_sym_mdtc}
\end{table}

As in the non-symmetrical \ac{MDTC} systems,
appendix~\ref{cha:transform_usage_in_mdtc_systems} details all the symmetrical
\ac{MDTC} systems from table~\ref{tab:non_hom_sym_mdtc} in
tables~\ref{tab:config_sym_mdtc_4} and~\ref{tab:config_sym_mdtc_8} for
$4\times4$ and $8\times8$ \acp{TU}, respectively.
The same pattern is observed:
more transforms are used in Planar, DC, horizontal and vertical modes.

\section{Conclusions}
\label{sec:rw_conclusions}

\ac{MDTC} systems presented in Chapter~\ref{cha:mdtc} proved the fact that
using multiple transforms competing against each other inside every \ac{IPM}
is able to provide bitrate savings of up to 7\% when using non-separable
transforms, and up to 4\% when using separable transforms.

Nonetheless, the storage requirements for those transforms is not negligible.
Therefore, this Chapter has presented two approaches to reduce the needed
\acs{ROM} for the transforms.

The first technique proposes a non-homogeneous distribution of the number of
transforms per \ac{IPM}.
In Chapter~\ref{cha:mdtc} all \acp{IPM} were using the same number of
transforms, which could be sub-optimal, since not all \acp{IPM} are used
equally in \ac{HEVC}.
Around a factor of two in terms of \acs{ROM} has been saved when using
non-homogeneous \ac{MDTC} systems for the same bitrate savings.

The second technique exploited existing symmetries to be able to re-use
transforms from different \ac{IPM} by pre-processing residuals with
transposition and mirroring operations.
This technique allows saving around two thirds of the \acs{ROM} presented for
the \ac{MDTC} systems from Chapter~\ref{cha:mdtc}.

Finally, the techniques summarised above have been combined to provide better
\acs{ROM} --- Y \ac{BD}-rate trade-offs.
When combining both techniques, storage requirements can be reduced up to 75\%
of the values presented in Chapter~\ref{cha:mdtc} for some systems.

Regarding complexity, using a non-homogeneous number of transforms per
\ac{IPM} decreases the decoding time, since fewer transforms need to be
tested.
On the other hand, symmetries allow putting more transforms without having an
impact to the \acs{ROM} requirements.
As a result, the encoding complexity remains comparable to the one presented
in Chapter~\ref{cha:mdtc}.

\chapter{\acs{MDTC} using discrete trigonometric transforms}
\label{cha:dtt}
\chaptertoc

\section{Introduction}
\label{sec:introduction}

Chapter~\ref{cha:mdtc} has unveiled the potential of gains achievable through
the \ac{MDTC} technique.
Nevertheless, the presented system has a level of complexity which is too
high to be used in commercial applications.
As a result, the Chapters~\ref{cha:incomplete_transforms}
and~\ref{cha:real_world_system} propose two different approaches to make the
system less complex.

Chapter~\ref{cha:incomplete_transforms} simplifies the system by designing
transforms that make use of only one base vector.
This makes transforming a block less complex than with a regular transform,
even if the incomplete transform is non-separable.
Benefits of this approach are observed on the decoder, being less complex than
that of \ac{HEVC}, while still providing bitrate savings.

Chapter~\ref{cha:real_world_system} proposes two ways of reducing the
complexity of separable \ac{MDTC} systems, especially the storage
requirements:
\begin{itemize}
	\item Using a different number of transforms in each \ac{IPM}.
	\item Taking advantage of the symmetries that might exist within different
		\acp{IPM}.
\end{itemize}
Both approaches provide notable savings in the \acs{ROM} required to store the
transforms.
Besides, the encoder complexity is slightly decreased using non-homogeneous
systems, since there are fewer transforms to test in each \ac{IPM}.

However, even if the main motivation of the approaches above is to reduce the
complexity of the high performance \ac{MDTC} system presented in
Chapter~\ref{cha:mdtc}, the improvements have only been focused on the
\acs{ROM} axis, leaving the encoder and decoder complexities decreases to be a
side effect of the storage constraint.

This Chapter proposes a simplification approach that takes into account the
storage requirements, the encoding complexity and the decoding complexity at
the same time.
It is also worth-noticing that the systems described in this Chapter are in
early stages of development and, therefore, not as mature as the systems
presented previously.

The experiments presented in this Chapter served as a base for the work
published in~\cite{arrufat-16-low-complexity}.

\section{The discrete trigonometric transform family (\acsp{DCT} \& \acsp{DST})}
\label{sec:the_dtt_family}
\index{DTT}
\index{DCT}
\index{DST}

Chapter~\ref{cha:mddt} has highlighted the appropriateness of the \ac{RDOT}
design method over the \ac{KLT} for video coding.
However, for the sake of simplicity, a different family of transforms is
considered in this Chapter: the \acfp{DTT}.

\acsp{DTT} are orthogonal transforms based on trigonometric functions.
This family of transforms consists of 8 types (I to VIII) of \acfp{DCT} and
\acfp{DST}~\cite{rao-01-transform-data-compression-book,
puschel-08-algorithms-dct-dst}.

Historically, the \ac{DCT}-II has been the \emph{de facto} standard transform
for image and video coding applications.
Recently, other transforms from the \ac{DTT} family are starting to arise the
interest in video coding applications:
\begin{itemize}
	\item The \ac{DST}-VII is used in \ac{HEVC} for $4\times4$ intra
		prediction luma residuals.
	\item The \ac{DST}-III has been proposed for inter-layer prediction
		residuals in scalable video
		coding~\cite{guo-14-transform-inter-layer-scalable}.
\end{itemize}

The interest of the \acp{DTT} in this Chapter is motivated by the existing
fast algorithms for transform implementation, which are notably less complex
than a full matrix multiplication required by generic block transforms.
The algorithmic complexity for a 1D \ac{DTT} is in the order of $N\log_2(N)$
instead of the $N^2$, where $N$ stands for the transform
size~\cite{puschel-08-algorithms-dct-dst}.
Table~\ref{tab:comparison_dtt_separable} compares the number of operations
required for a regular separable 2D transform and a \ac{DTT} using a fast
algorithm.
Results are presented for both $4\times4$ and $8\times8$ \acp{TU}.

Since \acp{DTT} coefficients can be computed using an analytical formula,
their storage requirement is negligible.
However, a scanning matrix is still required to sort the transformed
coefficients in a globally decreasing order to ease the entropy coding stage.
As a result, the necessary storage amount per transform is $N^2$ bytes.
For \acp{RDOT}, the required memory is $3N^2$ bytes per transform, as the
horizontal and vertical transforms are stored along with a dedicated scanning
pattern.
Storage values are compared in table~\ref{tab:comparison_dtt_separable}.

Nevertheless, these transforms are a restrained subset of the orthogonal
transform class, as such, their performance in coding gains is expected to be
lower than that of \acp{RDOT}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|cc|cc}
		\multirow{2}{2cm}{\diagbox{Size}{Type}} &
		\multicolumn{2}{c|}{Reg.\ sep.\ transf.} &
		\multicolumn{2}{c}{\acs{DTT}} \\
		& Operations & \acs{ROM} & Operations & \acs{ROM} \\
		\hline
		$4\times4$ &  128 &  48 B &  64 & 16 B \\
		$8\times8$ & 1024 & 192 B & 192 & 64 B \\
	\end{tabular}
	\caption{Comparison of \acs{ROM} and complexity between \acsp{DTT} and
	regular separable transforms}
	\label{tab:comparison_dtt_separable}
\end{table}

To sum up, \acp{DTT} are able to divide the storage requirements of separable
transforms by a factor of 3 and present a lower computational complexity due
to fast algorithms.
This Chapter explores whether the \acs{ROM} and complexity reductions balance
the expected losses in bitrate savings regarding separable \acp{RDOT}.

The formal definition of the 8 types of \ac{DCT} and \ac{DST} can be found in
tables~\ref{tab:dcts_def} and~\ref{tab:dsts_def}, respectively.
Some of them have scaling factors, namely $\epsilon_n, \epsilon_n$, that vary
depending on the position in the matrix to ensure the energy preservation
property.

Transforms in the \ac{DTT} family present strong relations amongst them and
their inverse transforms~\cite{reznik-13-relationship-dct-dst,
saxena-13-fast-transforms-intra-coding}, detailed in
table~\ref{tab:dtt_relationships}.
Since all inverse \acsp{DCT} and \acsp{DST} can be expressed in terms of a
direct transform, there are 16 unique 1D transforms in total.
Combining them as a horizontal and vertical transforms leads to a total of 256
unique 2D transforms, which are able to capture different kinds of residual
patterns.
Some of these transforms are displayed in figure~\ref{fig:some_dtts}.

As an observation, using an orthogonal transform $\A$ for the rows of a block
and another orthogonal transform $\B$ for the resulting columns does not
result in the same signal in the transform domain as if the operations are
carried out the other way around if $\B\neq\A$.
Nonetheless, both resulting signals can be expressed in terms of the other.
Assuming $\x$ is the input signal and $\X$ stands for its representation in
the transform domain:
\begin{align}
	\X = \B{\left(\A\x^T\right)}^T = \B \x \A^T \\
	\X = \A{\left(\B\x^T\right)}^T = \A \x \B^T
\end{align}
If instead of transforming $\x$ directly, a transposition is applied before:
\begin{align}
	\X = \B{\left(\A{\left(\x^T\right)}^T\right)}^T = \B \x^T \A^T =
	{\left(\A\x\B^T\right)}^T\\
	\X = \A{\left(\B{\left(\x^T\right)}^T\right)}^T = \A \x^T \B^T =
	{\left(\B\x\A^T\right)}^T
\end{align}
Meaning that a transposition in the spatial domain is equivalent to invert the
order in which the row and column transforms are applied and then transpose
the result.

Additionally, if a different scanning is allowed per transform, transposing or
not a signal in the spatial domain become equivalent operations, since the
only difference is the order in which the transform coefficients are output,
but their values remain the same, resulting in fewer combinations than 256.

In order to expand the number of transforms available, three geometrical
operations have been considered:
\begin{itemize}
	\item Residual transposition: $\x^T$
	\item Horizontal mirroring: $\updownarrow\x$
	\item Vertical mirroring: $\overset{\leftrightarrow}{\x}$
\end{itemize}
These operators can be combined, resulting into $2^3=8$ different operations.
By applying these pixel permutations to the residual before the transform
stage, the number of 2D transforms can be increased from 256 to 2048.

This kind of relations will be taken into account when designing
\acs{DTT}-based \acs{MDTC} systems in the next Section.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{l|l}
		Direct transform & Inverse transform \\
		\hline\hline
		\hspace{0.5cm} DCT-I    & \hspace{0.75cm} DCT-I    \\
		\hspace{0.5cm} DCT-II   & \hspace{0.75cm} DCT-III  \\
		\hspace{0.5cm} DCT-III  & \hspace{0.75cm} DCT-II   \\
		\hspace{0.5cm} DCT-IV   & \hspace{0.75cm} DCT-IV   \\
		\hspace{0.5cm} DCT-V    & \hspace{0.75cm} DCT-V    \\
		\hspace{0.5cm} DCT-VI   & \hspace{0.75cm} DCT-VII  \\
		\hspace{0.5cm} DCT-VII  & \hspace{0.75cm} DCT-VI   \\
		\hspace{0.5cm} DCT-VIII & \hspace{0.75cm} DCT-VIII \\
		\hline
		\hspace{0.5cm} DST-I    & \hspace{0.75cm} DST-I    \\
		\hspace{0.5cm} DST-II   & \hspace{0.75cm} DST-III  \\
		\hspace{0.5cm} DST-III  & \hspace{0.75cm} DST-II   \\
		\hspace{0.5cm} DST-IV   & \hspace{0.75cm} DST-IV   \\
		\hspace{0.5cm} DST-V    & \hspace{0.75cm} IDST-V   \\
		\hspace{0.5cm} DST-VI   & \hspace{0.75cm} DST-VII  \\
		\hspace{0.5cm} DST-VII  & \hspace{0.75cm} DST-VI   \\
		\hspace{0.5cm} DST-VIII & \hspace{0.75cm} DST-VIII \\
	\end{tabular}
	\caption{Relationships between the different members of the \acs{DTT}
	family}
	\label{tab:dtt_relationships}
\end{table}

\begin{figure}[tb]
	\centering
	\subfloat[\acs{DCT}-II --- \acs{DCT}-II]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_017_DCT_IIxDCT_II.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DST}-VII]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_238_DST_VIIxDST_VII.png}}
	\hfill
	\subfloat[\acs{DCT}-IV --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_051_DCT_IVxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DCT}-III --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_035_DCT_IIIxDCT_IV.png}}

	\subfloat[\acs{DST}-VII --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_227_DST_VIIxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DST}-V --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_067_DCT_VxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DCT}-V]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_228_DST_VIIxDCT_V.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DST}-II]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_105_DCT_VIIxDST_II.png}}

	\subfloat[\acs{DST}-VII --- \acs{DCT}-V]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_228_DST_VIIxDCT_V.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_227_DST_VIIxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DST}-I]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_232_DST_VIIxDST_I.png}}
	\hfill
	\subfloat[\acs{DST}-VII --- \acs{DST}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_235_DST_VIIxDST_IV.png}}

	\subfloat[\acs{DCT}-III --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_035_DCT_IIIxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DCT}-V --- \acs{DCT}-III]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_066_DCT_VxDCT_III.png}}
	\hfill
	\subfloat[\acs{DCT}-V --- \acs{DCT}-IV]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_067_DCT_VxDCT_IV.png}}
	\hfill
	\subfloat[\acs{DCT}-II --- \acs{DCT}-III]
	{\includegraphics[width=0.20\linewidth]
	{./figures/dtts_4/dtt_018_DCT_IIxDCT_III.png}}

	\caption{$4\times4$ \acs{DTT} combination examples}
	\label{fig:some_dtts}
\end{figure}

\begin{table}[tb]
	\centering
	\begin{tabular}{rll}
		\acs{DCT}-I: &
		$\displaystyle{\left[C_{N}^{I} \right]}_{n,k} =
		\sqrt{\frac{2}{N-1}}\epsilon_n\epsilon_k\cos\left(\frac{\pi\,n\,k}{N-1}\right)$
		&
		$\displaystyle\epsilon_n,\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n, k = 0 \\
			\frac{1}{\sqrt{2}} & n, k = N-1 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DCT}-II: &
		$\displaystyle{\left[C_{N}^{II} \right]}_{n,k} =
		\sqrt{\frac{2}{N}}\epsilon_k\cos\left(\frac{\pi(2n+1)k}{2N}\right)$
		&
		$\displaystyle\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & k = 0 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DCT}-III: &
		$\displaystyle{\left[C_{N}^{III} \right]}_{n,k} =
		\sqrt{\frac{2}{N}}\epsilon_n\cos\left(\frac{\pi(2k+1)n}{2N}\right)$
		&
		$\displaystyle\epsilon_n =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n = 0 \\
			1 & \text{otherwise}
		\end{cases}$\\
		\acs{DCT}-IV: &
		$\displaystyle{\left[C_{N}^{IV} \right]}_{n,k} =
		\sqrt{\frac{2}{N}}\cos\left(\frac{\pi(2n+1)(2k+1)}{4N}\right)$ \\
		\acs{DCT}-V: &
		$\displaystyle{\left[C_{N}^{V} \right]}_{n,k} =
		\frac{2}{\sqrt{2(N-1)+1}}\epsilon_n\epsilon_k\cos\left(\frac{2\pi\,n\,k}{2N-1}\right)$
		&
		$\displaystyle\epsilon_n,\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n, k = 0 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DCT}-VI: &
		$\displaystyle{\left[C_{N}^{VI} \right]}_{n,k} =
		\frac{2}{\sqrt{2(N-1)}}\epsilon_n\epsilon_k\cos\left(\frac{\pi(2n+1)k}{2N-1}\right)$
		&
		$\displaystyle\epsilon_n =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n = N-1 \\
			1 & \text{otherwise}
		\end{cases}
		\quad
		\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & k = 0 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DCT}-VII: &
		$\displaystyle{\left[C_{N}^{VII} \right]}_{n,k} =
		\frac{2}{\sqrt{2(N-1)}}\epsilon_n\epsilon_k\cos\left(\frac{\pi(2k+1)n}{2N-1}\right)$
		&
		$\displaystyle\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & k = N-1 \\
			1 & \text{otherwise}
		\end{cases}
		\quad
		\epsilon_n =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n = 0 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DCT}-VIII: &
		$\displaystyle{\left[C_{N}^{VIII} \right]}_{n,k} =
		\frac{2}{\sqrt{2(N-1)}}\cos\left(\frac{\pi(2n+1)(2k+1)}{4N-2}\right)$
	\end{tabular}
	\caption{\acs{DCT} definitions of size $N$, where $n,k=0,\dots,N-1$}
	\label{tab:dcts_def}
\end{table}

\begin{table}[tb]
	\centering
	\begin{tabular}{rll}
		\acs{DST}-I: &
		$\displaystyle{\left[S_{N}^{I} \right]}_{n,k} =
		\sqrt{\frac{2}{2N+1}}\sin\left(\frac{\pi(n+1)(k+1)}{N+1}\right)$ \\
		\acs{DST}-II: &
		$\displaystyle{\left[S_{N}^{II} \right]}_{n,k} =
		\sqrt{\frac{2}{2N}}\epsilon_k\sin\left(\frac{\pi(2n+1)(k+1)}{2N}\right)$
		&
		$\displaystyle\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & k = N-1 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DST}-III: &
		${\left[S_{N}^{III} \right]}_{n,k} =
		\sqrt{\frac{2}{2N}}\epsilon_n\sin\left(\frac{\pi(2n+1)(k+1)}{2N}\right)$
		&
		$\displaystyle\epsilon_n =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n = N-1 \\
			1 & \text{otherwise}
		\end{cases}$ \\
		\acs{DST}-IV: &
		$\displaystyle{\left[S_{N}^{IV} \right]}_{n,k} =
		\frac{2}{\sqrt{2N}}\sin\left(\frac{\pi(2n+1)(2k+1)}{4N}\right)$ \\
		\acs{DST}-V: &
		$\displaystyle{\left[S_{N}^{V} \right]}_{n,k} =
		\frac{2}{\sqrt{2N+1}}\sin\left(\frac{2\pi(n+1)(k+1)}{2N+1}\right)$ \\
		\acs{DST}-VI: &
		$\displaystyle{\left[S_{N}^{VI} \right]}_{n,k} =
		\frac{2}{\sqrt{2N+1}}\sin\left(\frac{\pi(2n+1)(k+1)}{2N+1}\right)$ \\
		\acs{DST}-VII: &
		$\displaystyle{\left[S_{N}^{VII} \right]}_{n,k} =
		\frac{2}{\sqrt{2N+1}}\sin\left(\frac{\pi(2k+1)(n+1)}{2N+1}\right)$ \\
		\acs{DST}-VIII: &
		$\displaystyle{\left[S_{N}^{VIII} \right]}_{n,k} =
		\frac{2}{\sqrt{2(N-1)+1}}\epsilon_n\epsilon_k\sin\left(\frac{\pi(2n+1)(2k+1)}{2((2N-1)+1)}\right)$
		&
		$\displaystyle\epsilon_n,\epsilon_k =
		\begin{cases}
			\frac{1}{\sqrt{2}} & n, k = N-1 \\
			1 & \text{otherwise}
		\end{cases}$
	\end{tabular}
	\caption{\acs{DST} definitions of size $N$, where $n,k=0,\dots,N-1$}
	\label{tab:dsts_def}
\end{table}

\section{Design of \acs{DTT}-based \acs{MDTC} systems}
\label{sec:design_dtt_based_mdtc_systems}
\index{DTT-based MDTC systems}

The \ac{RDOT} metric, defined in \S\ref{eqn:rdot_metric}, has proved to be a
good way of designing transforms that offer a balance between the distortion
introduced by the quantisation and the sparsity of transformed coefficients.
Nonetheless, the \ac{RDOT} metric has also been used in
Chapters~\ref{cha:mddt} and~\ref{cha:mdtc} to measure the appropriateness of a
given transform to compactly represent a residual in the rate-distortion
plane.
By using this metric, one is able to evaluate the performance of a transform
on a residual and define a set of residuals for which a transform gives the
best rate-distortion trade-off.
In practice, this metric is able to rate a transform, i.e.\ its
appropriateness with respect to a set of residuals.

Due to the fact that \ac{MDTC} systems learnt using the \acs{RDOT} metric
lead to significant bitrate savings over \ac{HEVC}, it is also used in this
Chapter to design a \ac{DTT}-based \ac{MDTC} system.
The learning method is based on Algorithm~\ref{alg:clustering} for the
classification part:
residuals from the learning set are assigned to the transform that provide the
lowest \acs{RDOT} metric value.
The main difference is that, when using \acp{DTT}, transforms are already
learnt, so the learning phase of the algorithm can be skipped.
Instead, all transforms will be tested against the learning set, and only the
set of transforms that provide the lowest \ac{RDOT} metric will be
retained.

As explained in the previous Section, in order to expand the transform space
when using the \ac{DTT} family, which consists of 8 types of \ac{DCT}, 8 types
of \ac{DST} and their inverse versions, 8 spatial pixel permutations have also
been considered.
The chosen permutations are combinations of transposed and mirrored
(vertically and horizontally) versions of residuals so as not to break pixel
correlations and relative pixel distances within a block.
Using \acp{DTT} together with the proposed residual modifications leads to a
total of 2048 possible transform combinations.
In order to find the optimal \acs{DTT}-based \ac{MDTC} system, one would have
to compute the best set of $N$ transforms from the 2048 available giving the
lowest value of the \ac{RDOT} metric.
This results into
\begin{align}
	\begin{pmatrix}
		2048 \\
		N
	\end{pmatrix}
\end{align}
possible transform combinations.
Fortunately, due to the nature of the \ac{DTT} family, there are many
redundancies within the transform combinations.
For instance, for $4\times4$ transforms, the 2048 list reduced to 256
unique transforms, if a re-ordering is allowed in the transform domain.
As an example, for $N=4$ additional transforms, the number of combinations
that need to be tested is:
\begin{align}
	\begin{pmatrix}
		256 \\
		4
	\end{pmatrix} =
	\num{174792640}\approx\num{1.75e8}
\end{align}
instead of:
\begin{align}
	\left(
	\begin{matrix}
		2048 \\
		4
	\end{matrix}
	\right) =
	\num{730862190080}\approx\num{7.31e11}
\end{align}
Meaning that the number of possible combinations is reduced by more than a
factor of 4000.

In order to find a good combination of transforms and permutations, all
combinations are put into a list, which is iterated through.
Then, the first $N$ transforms are used to classify the residuals into $N$
groups, that is, each residual is assigned to the transform that gives the
lowest value of the \ac{RDOT} metric defined in~\eqref{eqn:rdot_metric}.
Afterwards, the transform $N+1$ tentatively replaces each transform and, if it
makes the global \acs{RDOT} metric lower, it replaces the outperformed
transform, otherwise it is discarded.
The same step is carried out for the next transform until the end of the list,
using a different number of transforms per \ac{IPM} and exploiting prediction.

This approach does not provide the global optimal and the classification
algorithm is subject to be improved.
Nevertheless, this simple method allows determining the validity of the
approach.
Furthermore, the classification algorithm becomes even less optimal when the
number of additional transforms increases.

Since \acp{DTT} have a lower complexity than regular separable transforms, the
main objective in this Chapter is to build low complexity \ac{MDTC} systems,
with high emphasis in the decoding time and \acs{ROM} requirements.

\section{Performances of \acs{DTT}-based \acs{MDTC} systems}
\label{sec:performances_dtt_based_mdtc_systems}

In the previous Chapter, two approaches are proposed to improve the existing
separable \ac{MDTC} systems in the \acs{ROM} --- Y \ac{BD}-rate plane:
using a different number of transforms per \ac{IPM} and exploiting prediction
residuals symmetries to re-use the same transforms amongst symmetrical
\acp{IPM}.
\ac{MDTC} systems that make use of both techniques at the same time present
the best trade-off in the three axes: \acs{ROM}, \ac{BD}-rate and even
encoding complexity.

As in Chapter~\ref{cha:real_world_system}, in order to build iterated
non-homogeneous symmetrical \ac{MDTC} systems, several configurations have
been previously learnt per \ac{IPM} in both $4\times4$ and $8\times8$ \ac{TU}
sizes.
These configurations enable from 1 to 16 additional transforms for $4\times4$
\acp{TU} and from 1 to 32 for $8\times8$ \acp{TU} in steps of powers of 2.

Due to the intention of designing low complexity systems, only configurations
of 1, 2 and 4 kB have been assembled.
Table~\ref{tab:non_hom_sym_dtt_mdtc} contains the iterated non-homogeneous
DTT-based \acs{MDTC} systems for 1, 2 and 4 kB.
A detailed description of the number of transforms used in each \ac{IPM} for
$4\times4$ and $8\times8$ \acp{TU} can be found in
appendix~\ref{cha:transform_usage_in_mdtc_systems}, more precisely in
table~\ref{tab:config_sym_dtt}.

A comparison with previously designed systems (the symmetrical and
non-symmetrical versions of the iterated non-homogeneous \acs{MDTC} systems)
is presented in figures~\ref{fig:vmdtc_dtt_iter_sym_rom}
and~\ref{fig:vmdtc_dtt_iter_sym_cmplx} for \acs{ROM} and encoding complexity,
respectively.
The first figure reveals how the \ac{DTT}-based \acs{MDTC} systems outperform
the \ac{RDOT}-based \acs{MDTC} systems in the \acs{ROM} --- Y \acs{BD}-rate
plane.
The sub-optimality of the learning algorithm can be observed as the number of
additional \acp{DTT} increases:
the \ac{DTT}-based \ac{MDTC} system gets closer to the \ac{RDOT}-based
\ac{MDTC} systems.
Due to the low storage requirements of the \acp{DTT}, more transforms can be
fit for a given \acs{ROM} constraint, leading to more coding possibilities,
and thus to more encoding complexity.
Moreover, the \acp{DTT} have been implemented as separable transforms inside
the encoder, therefore, there is room for improvement in terms of encoding
complexity once they make use of fast algorithms.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{c|c|c|c}
	System & \acs{ROM} & Y \acs{BD}-rate & Complexity \\
	\hline\hline
	  \SI{1}{\kilo B} & \SI{0.98}{\kilo B} & -1.28\% & 164\% \\
	  \SI{2}{\kilo B} & \SI{1.97}{\kilo B} & -1.54\% & 177\% \\
	  \SI{4}{\kilo B} & \SI{3.97}{\kilo B} & -1.83\% & 230\% \\
	\end{tabular}
	\caption{Non-homogeneous symmetrical \acs{DTT}-based \acs{MDTC} systems with
	\acs{ROM} constraints}
	\label{tab:non_hom_sym_dtt_mdtc}
\end{table}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_dtt_iter_sym_rom_plot.tex}}
	{\includegraphics{./figures/vmdtc_dtt_iter_sym_rom_plot.pdf}}
	\caption[\acs{DTT}-based \acs{MDDT} systems in the \acs{ROM} --- Y
	\acs{BD}-rate plane]
	{\acs{DTT}-based \acs{MDDT} systems in the \acs{ROM} --- Y
	\acs{BD}-rate plane compared to other non-homogeneous \acs{MDTC} systems}
	\label{fig:vmdtc_dtt_iter_sym_rom}
\end{figure}

\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/vmdtc_dtt_iter_sym_cmplx_plot.tex}}
	{\includegraphics{./figures/vmdtc_dtt_iter_sym_cmplx_plot.pdf}}
	\caption[\acs{DTT}-based \acs{MDDT} systems in the Complexity --- Y
	\acs{BD}-rate plane]
	{\acs{DTT}-based \acs{MDDT} systems in the Complexity --- Y
	\acs{BD}-rate plane compared to other non-homogeneous \acs{MDTC} systems}
	\label{fig:vmdtc_dtt_iter_sym_cmplx}
\end{figure}

The system that offers the best performances in terms of \acs{BD}-rate,
labelled as 4 kB, presents a particular usage of transforms from the \ac{DTT}
family.
Tables~\ref{tab:dtt_combinations_4} and~\ref{tab:dtt_combinations_8} detail
the different transform combinations used for $4\times4$ and $8\times8$
\ac{TU} sizes, respectively, as well as the number of occurrences.

For $4\times4$ blocks, the 4 kB system uses a total of 38 additional
transforms (including all \acp{IPM}), which are combinations of different
\acp{DTT}.
Despite having 256 \acp{DTT} from where to choose, only 15 different
transforms are used with different levels of repetition.
The most used transforms are the \acs{DCT}-IV--\acs{DCT}-IV,
\acs{DCT}-III--\acs{DCT}-IV, the \acs{DST}-VII--\acs{DCT}-IV and the
\acs{DCT}-V--\acs{DCT}-IV pairs.
For $8\times8$ blocks, the system uses 55 additional transforms, of which only
20 are unique.
From these unique transforms, there are two combinations that stand out: the
\acs{DST}-VII--\acs{DST}-VII and the \acs{DST}-VII--\acs{DCT}-V pairs.
Moreover, the \acs{DST}-VII and the \acs{DCT}-V appear frequently in the
other combinations.
The usage of these transforms is summarised  in
tables~\ref{tab:dtt_combinations_4} and~\ref{tab:dtt_combinations_8} for
$4\times4$ and $8\times8$ \acp{TU}, respectively.

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{cc|c}
		Horizontal transform & Vertical transform & Number \\
		\hline\hline
		DCT-I   & DCT-III & 1 \\
		DCT-II  & DCT-IV  & 1 \\
		DCT-III & DCT-III & 1 \\
		DCT-III & DCT-IV  & 6 \\
		DCT-IV  & DCT-IV  & 8 \\
		DCT-V   & DCT-III & 1 \\
		DCT-V   & DCT-IV  & 5 \\
		DCT-V   & DST-II  & 1 \\
		DST-I   & DCT-IV  & 1 \\
		DST-V   & DST-V   & 1 \\
		DST-VII & DCT-III & 1 \\
		DST-VII & DCT-IV  & 6 \\
		DST-VII & DCT-V   & 2 \\
		DST-VII & DST-II  & 2 \\
		DST-VII & DST-V   & 1 \\
	\end{tabular}
	\caption{The different $4\times4$ transform combinations for the 4
	kB \acs{DTT}-based \acs{MDTC} system}
	\label{tab:dtt_combinations_4}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabular}{cc|c}
		Horizontal transform & Vertical transform & Number \\
		\hline\hline
		DCT-II  & DCT-IV  & 1  \\
		DCT-II  & DST-V   & 1  \\
		DCT-III & DCT-IV  & 2  \\
		DCT-III & DST-VII & 1  \\
		DCT-IV  & DCT-IV  & 2  \\
		DCT-V   & DCT-III & 2  \\
		DCT-V   & DCT-VII & 1  \\
		DCT-V   & DST-II  & 1  \\
		DCT-V   & DST-III & 1  \\
		DCT-V   & DST-IV  & 2  \\
		DCT-V   & DST-V   & 1  \\
		DCT-V   & DST-VI  & 1  \\
		DST-I   & DCT-IV  & 1  \\
		DST-VII & DCT-III & 3  \\
		DST-VII & DCT-IV  & 4  \\
		DST-VII & DCT-V   & 10 \\
		DST-VII & DST-I   & 4  \\
		DST-VII & DST-III & 1  \\
		DST-VII & DST-IV  & 3  \\
		DST-VII & DST-VII & 12 \\
	\end{tabular}
	\caption{The different $8\times8$ transform combinations for the 4
	kB \acs{DTT}-based \acs{MDTC} system}
	\label{tab:dtt_combinations_8}
\end{table}

\section{Conclusions}
\label{sec:dtt_conclusions}

Chapter~\ref{cha:real_world_system} presented some ways of reducing the
storage requirements by using a different number of transforms in each
\ac{IPM} and by exploiting intra prediction residuals symmetries.

This Chapter goes one step further on simplifying \ac{MDTC} systems:
instead of using \ac{RDOT}-based \ac{MDTC} systems, \acp{RDOT} are replaced by
\acp{DTT}.
The coefficients of these trigonometric transforms can be deduced
analytically, hence the only storage requirement is the scanning pattern.
Although fast algorithms exist for \acp{DTT}, this Chapter has implemented
them as regular separable transforms.
As a result, the encoding and decoding times presented here do not represent
the values that an actual implementation would have.

Finally, the \ac{DCT}-based \ac{MDTC} systems explored in this Chapter are
preliminary, but still very competitive and promising with regards to the more
complex approaches using separable \acp{RDOT}.
As such, coding complexity is subject to further improvements once \acp{DTT}
are implemented using fast algorithms.
Moreover, revisiting the learning algorithm to fix its sub-optimality can lead
to even better performances of the \ac{DTT}-based \ac{MDTC} systems.

\chapter{Proposed \acs{MDTC} configurations}
\label{cha:proposed_mdtc_configurations}
\chaptertoc

\section{Motivation}
\label{sec:summary_motivation}

Numerous alternative \acs{MDTC} systems have been designed and presented
throughout this thesis.
As a result, this Chapter presents several interesting alternatives that offer
different levels of trade-offs between the provided \acs{BD}-rate and the
added complexity.

In Chapters~\ref{cha:real_world_system} and~\ref{cha:dtt}, various systems
have been designed with complete independence from the \ac{HEVC} test set
proposed in the \ac{CTC}~\cite{bossen-12-common-test-conditions}.
In order to provide results in well specified test conditions with a known set
of test sequences, this Chapter evaluates the most interesting systems in
terms of trade-off between complexity (\acs{ROM} and encoding time) and
bitrate savings on the \ac{HEVC} test set.

\section{Retained systems}
\label{sec:retained_systems}

Some interesting systems in terms of bitrate savings, encoding time, decoding
time and storage requirements have been selected to cover different complexity
and performance trade-offs.
The retained systems have been designed using a non-homogeneous number of
transforms per \ac{IPM} (\S\ref{sec:non_homogeneous_mdtc_systems}) and
exploiting existing residual symmetries (\S\ref{sec:sym_mdtc}).
Depending on the desired performance and complexity trade-off, the proposed
systems are:
\begin{itemize}
	\item \SI{4}{\kilo B} \acs{DTT}-based \acs{MDTC}
	\item \SI{16}{\kilo B} \acs{RDOT}-based \acs{MDTC}
	\item \SI{32}{\kilo B} \acs{RDOT}-based \acs{MDTC}
	\item \SI{64}{\kilo B} \acs{RDOT}-based \acs{MDTC}
	\item \SI{128}{\kilo B} \acs{RDOT}-based \acs{MDTC}
\end{itemize}

Table~\ref{tab:summary_final_systems} contains a detailed summary of the
bitrate savings obtained with each system in both \acs{AI} and \acs{RA}
coding configurations, as well as the encoding complexity.
The actual results per sequence of the \ac{HEVC} test set for \acs{AI} and
\acs{RA} are provided in tables~\ref{tab:final_systems_ai}
and~\ref{tab:final_systems_ra}, respectively.
Moreover, a graphical representation of the \acs{AI} coding configuration
bitrate savings is shown in figure~\ref{fig:final_systems_ai} and in
figure~\ref{fig:final_systems_ra} for \ac{RA}.

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{0.95\linewidth}{X|r|rrrr||r|rrrr}
		\multicolumn{1}{c}{} &
		\multicolumn{5}{c||}{\acs{AI}} &
		\multicolumn{5}{c}{\acs{RA}} \\
		\cline{2-11}
		\multicolumn{1}{c}{} &
		\multicolumn{1}{c|} {\acs{DTT}} &
		\multicolumn{4}{c||} {\acs{RDOT}} &
		\multicolumn{1}{c|} {\acs{DTT}} &
		\multicolumn{4}{c} {\acs{RDOT}} \\
		\multicolumn{1}{c}{} &
		\SI{4}{\kilo B} &
		\SI{16}{\kilo B} &
		\SI{32}{\kilo B} &
		\SI{64}{\kilo B} &
		\SI{128}{\kilo B} &
		\SI{4}{\kilo B} &
		\SI{16}{\kilo B} &
		\SI{32}{\kilo B} &
		\SI{64}{\kilo B} &
		\SI{128}{\kilo B} \\
		\hline\hline
		Class A    & -1.55 & -2.07 & -2.45 & -2.73 & -3.11 & -0.67 & -0.98 & -1.19 & -1.33 & -1.54 \\
		Class B    & -1.74 & -2.27 & -2.74 & -3.05 & -3.28 & -1.01 & -1.37 & -1.63 & -1.80 & -1.96 \\
		Class C    & -2.02 & -3.00 & -3.36 & -3.73 & -3.85 & -1.23 & -1.94 & -2.09 & -2.31 & -2.45 \\
		Class D    & -2.05 & -2.96 & -3.26 & -3.62 & -3.87 & -1.09 & -1.60 & -1.76 & -1.97 & -2.09 \\
		Class E    & -1.84 & -2.66 & -3.19 & -3.65 & -3.90 & -2.41 & -3.46 & -3.95 & -4.46 & -4.86 \\
		Class F    & -1.97 & -4.34 & -4.40 & -4.66 & -4.91 & -1.78 & -3.73 & -3.73 & -3.93 & -4.22 \\
		\hline
        Worst      & -0.57 & -0.58 & -0.64 & -0.64 & -0.61 &  0.21 &  0.20 &  0.27 &  0.26 &  0.23 \\
        Best       & -2.92 & -5.16 & -5.29 & -5.60 & -5.92 & -2.68 & -5.11 & -5.14 & -5.43 & -5.74 \\
        Median     & -1.98 & -3.03 & -3.40 & -3.75 & -3.95 & -1.28 & -2.09 & -2.24 & -2.44 & -2.62 \\
		\hline
        Complexity &   229 &   372 &   481 &   761 &  1297 &   106 &  114  &   119 &   133 &   163 \\
		\hline
        Mean       & -1.86 & -2.87 & -3.21 & -3.55 & -3.79 & -1.31 & -2.09 & -2.29 & -2.53 & -2.73 \\
	\end{tabularx}
	\caption[Summary of the retained \acs{MDTC} systems referred to \acs{HEVC}]
	{Summary of the retained \acs{MDTC} systems referred to \ac{HEVC} (\%).
	Y \acs{BD}-rates are presented per Class, as well as their mean, median, best and worst values.
	The encoding complexity of each system is also provided.}
	\label{tab:summary_final_systems}
\end{table}

\begin{figure}[tb]
	\def\scale{0.40}
	\def\encmax{1300}
	\def\decmax{110}
	\def\bdrmax{-4}
	\def\rommax{128}
	\centering

	\def\bdr{-1.86}
	\def\enc{229}
	\def\dec{100}
	\def\rom{3.97}
	\subfloat[4 kB \acs{DTT}-based \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}
	\def\bdr{-2.87}
	\def\enc{372}
	\def\dec{105}
	\def\rom{15.84}
	\subfloat[16 kB \acs{RDOT}-based \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}

	\def\bdr{-3.21}
	\def\enc{481}
	\def\dec{105}
	\def\rom{31.92}
	\subfloat[32 kB \acs{RDOT}-based \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}
	\def\bdr{-3.55}
	\def\enc{761}
	\def\dec{105}
	\def\rom{63.84}
	\subfloat[64 kB \acs{RDOT}-based \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}

	\def\bdr{-3.79}
	\def\enc{1297}
	\def\dec{105}
	\def\rom{123.38}
	\subfloat[128 kB \acs{RDOT}-based \acs{MDTC}]{\input{./scripts/four-way-diagram.tex}}

	\caption{Graphical comparison of the five retained \acs{MDTC} systems}
	\label{fig:four_way_mdtc_comparison}
\end{figure}
\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/summary_retained_systems_ai_plot.tex}}
	{\includegraphics{./figures/summary_retained_systems_ai_plot.pdf}}
	\caption{Bar chart graphically summarising the retained \acs{MDTC} systems
	in \acs{AI}}
	\label{fig:final_systems_ai}
\end{figure}
\begin{figure}[tb]
	\centering
	\ifthenelse{\usepdfs = 0}
	{\input{./figures/summary_retained_systems_ra_plot.tex}}
	{\includegraphics{./figures/summary_retained_systems_ra_plot.pdf}}
	\caption{Bar chart graphically summarising the retained \acs{MDTC} systems
	in \acs{RA}}
	\label{fig:final_systems_ra}
\end{figure}

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\textwidth}{c|X|r|rrrr}
		\multicolumn{2}{c}{} &
		\multicolumn{1}{c|}{\acs{DTT}} &
		\multicolumn{4}{c}{\acs{RDOT}} \\
		\cline{2-7}
		\multicolumn{1}{c}{} & {Sequence} &
		\SI{4}{\kilo B} &
		\SI{16}{\kilo B} &
		\SI{32}{\kilo B} &
		\SI{64}{\kilo B} &
		\SI{128}{\kilo B} \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class A ($2560\times1600$)}
		& NebutaFestival         & -0.77 & -0.77 & -0.95 & -1.16 & -1.40 \\
		& PeopleOnStreet         & -2.59 & -3.36 & -3.95 & -4.46 & -5.23 \\
		& SteamLocomotiveTrain   & -0.57 & -0.58 & -0.64 & -0.64 & -0.61 \\
		& Traffic                & -2.27 & -3.59 & -4.25 & -4.68 & -5.20 \\
		\cline{2-7} &
		Average                  & -1.55 & -2.07 & -2.45 & -2.73 & -3.11 \\
		\hline
		\hline
		\multirow{6}{2cm}{\centering Class B ($1920\times1080$)}
		& BasketballDrive        & -1.33 & -1.68 & -2.17 & -2.54 & -2.63 \\
		& BQTerrace              & -1.47 & -2.22 & -2.68 & -3.04 & -3.24 \\
		& Cactus                 & -2.33 & -3.02 & -3.59 & -4.02 & -4.19 \\
		& Kimono1                & -0.63 & -0.85 & -1.00 & -1.07 & -1.30 \\
		& ParkScene              & -2.92 & -3.57 & -4.26 & -4.59 & -5.04 \\
		\cline{2-7} &
		Average                  & -1.74 & -2.27 & -2.74 & -3.05 & -3.28 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class C ($832\times480$)}
		& BasketballDrill        & -1.39 & -2.47 & -2.54 & -2.83 & -2.99 \\
		& BQMall                 & -2.30 & -3.12 & -3.68 & -4.09 & -4.36 \\
		& PartyScene             & -2.43 & -3.50 & -3.85 & -4.22 & -4.40 \\
		& RaceHorses             & -1.96 & -2.92 & -3.37 & -3.77 & -3.65 \\
		\cline{2-7} &
		Average                  & -2.02 & -3.00 & -3.36 & -3.73 & -3.85 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class D ($416\times240$)}
		& BasketballPass         & -1.77 & -2.69 & -3.14 & -3.47 & -3.73 \\
		& BlowingBubbles         & -2.25 & -3.04 & -3.35 & -3.70 & -3.95 \\
		& BQSquare               & -2.10 & -3.50 & -3.62 & -3.94 & -4.24 \\
		& RaceHorses             & -2.10 & -2.63 & -2.93 & -3.36 & -3.54 \\
		\cline{2-7} &
		Average                  & -2.05 & -2.96 & -3.26 & -3.62 & -3.87 \\
		\hline
		\hline
		\multirow{4}{2cm}{\centering Class E ($1280\times720$)}
		& FourPeople             & -2.31 & -3.26 & -3.88 & -4.42 & -4.74 \\
		& Johnny                 & -1.47 & -2.17 & -2.66 & -3.08 & -3.24 \\
		& KristenAndSara         & -1.74 & -2.53 & -3.03 & -3.45 & -3.72 \\
		\cline{2-7} &
		Average                  & -1.84 & -2.66 & -3.19 & -3.65 & -3.90 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class F (various resolutions)}
		& BasketDrillText        & -1.82 & -3.42 & -3.43 & -3.73 & -3.96 \\
		& ChinaSpeed             & -1.58 & -3.84 & -3.98 & -4.24 & -4.45 \\
		& SlideEditing           & -2.01 & -4.93 & -4.90 & -5.06 & -5.31 \\
		& SlideShow              & -2.46 & -5.16 & -5.29 & -5.60 & -5.92 \\
		\cline{2-7} &
		Average                  & -1.97 & -4.34 & -4.40 & -4.66 & -4.91 \\
		\hline
		\hline
		All sequences &
		Overall                  & -1.86 & -2.87 & -3.21 & -3.55 & -3.79 \\
	\end{tabularx}
	\caption{Y \acs{BD}-rate (\%) for proposed \acs{MDTC} systems in \acs{AI}}
	\label{tab:final_systems_ai}
\end{table}

\begin{table}[tb]
	\centering
	\small
	\begin{tabularx}{\textwidth}{c|X|r|rrrr}
		\multicolumn{2}{c}{} &
		\multicolumn{1}{c|}{\acs{DTT}} &
		\multicolumn{4}{c}{\acs{RDOT}} \\
		\cline{2-7}
		\multicolumn{1}{c}{} & {Sequence} &
		\SI{4}{\kilo B} &
		\SI{16}{\kilo B} &
		\SI{32}{\kilo B} &
		\SI{64}{\kilo B} &
		\SI{128}{\kilo B} \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class A ($2560\times1600$)}
		& NebutaFestival         & -0.07 & -0.08 & -0.09 & -0.09 & -0.11 \\
		& PeopleOnStreet         & -0.99 & -1.27 & -1.52 & -1.71 & -2.03 \\
		& SteamLocomotiveTrain   &  0.21 &  0.20 & 0.27  & 0.26  &  0.23 \\
		& Traffic                & -1.84 & -2.78 & -3.41 & -3.81 & -4.25 \\
		\cline{2-7} &
		Average                  & -0.67 & -0.98 & -1.19 & -1.33 & -1.54 \\
		\hline
		\hline
		\multirow{6}{2cm}{\centering Class B ($1920\times1080$)}
		& BasketballDrive        & -0.25 & -0.35 & -0.51 & -0.57 & -0.51 \\
		& BQTerrace              & -0.97 & -1.52 & -1.69 & -1.87 & -2.07 \\
		& Cactus                 & -1.44 & -1.94 & -2.28 & -2.52 & -2.63 \\
		& Kimono1                & -0.41 & -0.57 & -0.68 & -0.81 & -0.99 \\
		& ParkScene              & -1.95 & -2.46 & -3.00 & -3.26 & -3.61 \\
		\cline{2-7} &
		Average                  & -1.01 & -1.37 & -1.63 & -1.80 & -1.96 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class C ($832\times480$)}
		& BasketballDrill        & -1.18 & -2.08 & -1.99 & -2.25 & -2.61 \\
		& BQMall                 & -1.29 & -1.95 & -2.28 & -2.51 & -2.63 \\
		& PartyScene             & -1.67 & -2.60 & -2.76 & -2.98 & -3.10 \\
		& RaceHorses             & -0.77 & -1.15 & -1.33 & -1.52 & -1.47 \\
		\cline{2-7} &
		Average                  & -1.23 & -1.94 & -2.09 & -2.31 & -2.45 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class D ($416\times240$)}
		& BasketballPass         & -0.80 & -1.18 & -1.45 & -1.71 & -1.71 \\
		& BlowingBubbles         & -1.44 & -2.10 & -2.21 & -2.39 & -2.57 \\
		& BQSquare               & -1.28 & -2.11 & -2.12 & -2.35 & -2.56 \\
		& RaceHorses             & -0.83 & -1.01 & -1.25 & -1.43 & -1.52 \\
		\cline{2-7} &
		Average                  & -1.09 & -1.60 & -1.76 & -1.97 & -2.09 \\
		\hline
		\hline
		\multirow{4}{2cm}{\centering Class E ($1280\times720$)}
		& FourPeople             & -2.68 & -3.83 & -4.47 & -5.06 & -5.53 \\
		& Johnny                 & -2.12 & -3.10 & -3.52 & -4.00 & -4.34 \\
		& KristenAndSara         & -2.43 & -3.46 & -3.85 & -4.32 & -4.71 \\
		\cline{2-7} &
		Average                  & -2.41 & -3.46 & -3.95 & -4.46 & -4.86 \\
		\hline
		\hline
		\multirow{5}{2cm}{\centering Class F (various resolutions)}
		& BasketDrillText        & -1.28 & -2.36 & -2.39 & -2.59 & -2.85 \\
		& ChinaSpeed             & -1.08 & -2.39 & -2.36 & -2.50 & -2.72 \\
		& SlideEditing           & -2.13 & -5.11 & -5.14 & -5.22 & -5.56 \\
		& SlideShow              & -2.64 & -5.06 & -5.02 & -5.43 & -5.74 \\
		\cline{2-7} &
		Average                  & -1.78 & -3.73 & -3.73 & -3.93 & -4.22 \\
		\hline
		\hline
		All sequences &
		Overall                  & -1.31 & -2.09 & -2.29 & -2.53 & -2.73 \\
	\end{tabularx}
	\caption{Y \acs{BD}-rate (\%) for proposed \acs{MDTC} systems in \acs{RA}}
	\label{tab:final_systems_ra}
\end{table}

\section{Conclusions}
\label{sec:summary_conclusions}

This Chapter summarises and groups the most representative separable \ac{MDTC}
systems designed in this thesis and tests them against the \ac{HEVC} test set
in order to provide results for a well-known sequence set.

The retained systems prove the interest of using multiple transforms for video
coding.
Different levels of performances are provided depending on the allowed
complexity on the encoder side.

The iterative design described in Chapter~\ref{cha:real_world_system} has
allowed creating a continuum of systems, satisfying different \acs{ROM}
requirements.
Although the system design has been carried out for different \acs{ROM}
constraint, a system could be conceived for any point.
This means that systems using more than 128 kB are possible and can provide
higher bitrate savings.

\chapter{Conclusions and future work}
\label{cha:conclusions_and_future_work}

\section*{Thesis objectives}
\label{sec:thesis_objectives}

The objective of this thesis is to provide a set of tools and systematic
methods of designing multiple transforms to be used in video coders.
In order to design performing learning algorithms, the \ac{RDOT} metric, based
on the $\ell_0$ norm was selected.
The appropriateness of this design method is justified over the traditional
\ac{KLT} in Chapter~\ref{cha:transform_coding}.

The methodology for learning adapted transforms based on the \ac{RDOT} metric
and presented in Chapters~\ref{cha:transform_coding},~\ref{cha:mddt}
and~\ref{cha:mdtc} consists of the following steps:
\begin{enumerate}
	\item Select a set of learning sequences.
	\item Encode the sequences with \ac{HEVC} using an \ac{AI} configuration.
	\item Extract the residuals (difference between original and predicted
		blocks) for the desired \ac{TU} sizes, grouped by \ac{IPM}.
	\item Learn adapted set of transforms for each \ac{TU} size and \ac{IPM}
		using the \acs{RDOT} metric from~\eqref{eqn:rdot_metric}:
		\begin{enumerate}
			\item \acs{MDDT} (Chapter~\ref{cha:mddt}).
			\item \acs{MDTC} (Chapter~\ref{cha:mdtc}).
			\item Incomplete transforms
				(Chapter~\ref{cha:incomplete_transforms}).
			\item \acp{DTT} (Chapter~\ref{cha:dtt}).
		\end{enumerate}
	\item Simplify the system storage requirements with methods proposed in
		Chapter~\ref{cha:real_world_system}.
\end{enumerate}

By having a look at the steps described above, one can see that is fairly
straightforward to extend the usage of multiple transforms for video coding
the following ways:
\begin{itemize}
	\item Learning transforms for \ac{TU} sizes other than $4\times4$ and
		$8\times8$, such as $16\times16$ and $32\times32$.
	\item Learning transforms for inter prediction residuals by encoding in
		\ac{RA} instead of \ac{AI}.
\end{itemize}

\section*{Conclusions}
\label{sec:final_conclusions}

This thesis has proved that state-of-the-art video coding standards, such as
\ac{HEVC}, can be improved by designing multiple transforms adapted to video
coding.

The design of the transforms is carried out off-line through a learning.
Despite the fact that off-line learning techniques might have known issues
such as over learning, if the learning set is chosen wisely (using several
resolutions, frame rates and sources), the resulting transforms can be used
outside the sequences of the learning set with good performances.
This way, hardware implementations of the transforms are possible, since
transforms coefficients are known.

Moreover, off-line learning has been possible thanks to the \ac{RDOT} design
introduced in~\cite{sezer-08-sparse-orthonormal-transforms}.
The \ac{RDOT} metric, based on the $\ell_0$ norm, has served as a good
approximation of the one implemented in \ac{RDO} loop  from \ac{HEVC}.

The following points should be kept in mind:
\begin{itemize}
	\item Every 10 years, a new video codec is standardised halving the
		bitrate required for an equivalent perceived quality.
	\item \acs{HEVC} provided 23\% of bitrate savings with regards
		H.264/\acs{MPEG}-4 \acs{AVC}
		in \acs{AI} coding configuration with about twice the
		complexity~\cite{JCTVC-M0329}.
	\item A video standard is the result of hard-work and negotiations between
		many engineers and companies during years.
\end{itemize}

With those points stated, the results in this thesis have proved the interest
in using multiple transforms for video coding.

The first attempt is to compare the performances of the \ac{KLT} against the
\ac{RDOT} through the \ac{MDDT} technique.
Improvements of over 1 \ac{BD}-rate point are obtained when using the
\ac{RDOT} design, which takes into account the sparsity of the signal in the
transform domain.

In order to put the use of multiple transforms further, the \ac{MDTC} is born
as an evolution of the \ac{MDDT} where more than one transform is available in
each \ac{IPM}.
Depending on the desired complexity and bitrate savings trade-off, this
technique alone is able to obtain bitrate savings of 7\%, for non-separable
transforms, and around 4\% for separable transforms.

However, to obtain those levels of performance, first obtained in
Chapter~\ref{cha:mdtc}, the encoding complexity and the storage requirements
are too high for a commercial system.
For this reason, Chapters~\ref{cha:real_world_system} and~\ref{cha:dtt}
provide different systematic ways to reduce the storage requirements for the
\ac{MDTC} systems.
As a side effect, the encoding complexity is also reduced.

Chapter~\ref{cha:dtt} presents the \ac{DTT}-based \ac{MDTC} systems as a
simplification of the \ac{MDTC} systems used throughout this thesis.
Despite the early stages of the work, \acp{DTT} have proved their interest by
being a very competitive approach in terms of bitrate savings, complexity and
storage requirements.
However, the learning algorithm for these systems is known to be sub-optimal
when using numerous \acp{DTT}.
For this reason, improvements in the learning algorithm are essential to
continue improving the \ac{DTT} approach.

The last Chapter groups some of the most representative \ac{MDTC} systems
presented in this thesis.
Thanks to the learning methods and simplifications presented in
Chapter~\ref{cha:real_world_system} and~\ref{cha:dtt}, a continuum of systems
can be designed which are able to satisfy different complexity and bitrate
savings trade-offs.

\section*{Perspectives for future work}
\label{sec:perspectives_for_future_work}

Although encouraging results have been proved when using multiple transforms
for video coding, there is still room for improvement in many areas of the
techniques proposed that have been carried out in this thesis.

\subsection*{Learning set}
\label{sub:learning_set}

In Chapters~\ref{cha:mddt},~\ref{cha:mdtc}
and~\ref{cha:incomplete_transforms}, the test set that was used included in
the learning set to be in the best possible conditions and validate the
approach that the use of multiple transforms in video coding provide bitrate
savings.
Due to the promising results obtained in those Chapters, in order to make sure
there was no over-learning, the learning set is replaced by a completely
independent set of sequences in Chapter~\ref{cha:real_world_system}.
The new learning set provides many more residuals from which to learn the
transforms, but is less heterogeneous, since all frames come from the same
sequence, which implies same frame rate, resolution, filters, etc.

Chapter~\ref{cha:real_world_system} shows that there can be a small impact
due to transform over-learning to the learning set.
However, the over-learning effect impact could be lowered by choosing a more
diversified learning set in terms of resolution, frame rate and contents.
The choice of \emph{Tears of Steel} short film was made to make a point:
even with a learning set that is nothing like the test set, good results can
be achieved.
Moreover, due to the fact that the short film is publicly available in raw
format, makes the results reproducible by any party.

Therefore, using a learning set made up from sequences coming from different
sources, at different resolutions, frame rates and contents can lead to better
performing transforms.

\subsection*{Signalling}
\label{sub:signalling}

The current way of indicating the chosen transform to the decoder is via a
very simplistic approach:
during the \ac{RDO} loop, the encoder tests all transforms in each \ac{TU} and
selects the one that provides the best trade-off in terms of rate-distortion.
Then, a flag is used to indicate that a transform different from those of
\ac{HEVC} is used, and signalled with a fixed length codeword.
This approach has the advantage of being very flexible, allowing a \ac{PU}
that has been split into four \acp{TU} to have different transforms.

However, if transforms were signalled one level upwards, that is, at a \ac{PU}
level instead of per \ac{TU}, transform signalling would be reduced by a
factor of 4, although all transforms in a \ac{PU} would be identical.
It seems reasonable to think that more bitrate savings could be achieved by
using this new signalling scheme.

Another improvement in signalling might come by not using fixed length
codewords, but favouring some transforms instead with shorter codewords.
This approach will also allow using a number of transforms different from a
power of 2.
By signalling some transforms using fewer bits, they would be used more often
than those with longer code words, leading to possible bitrate savings.
Nevertheless, for this approach to work properly, the signalling cost of the
transform should be taken into account during the learning step in order to
minimise the mismatch between the learning and the testing phases.

\subsection*{Transform coefficients quantisation}
\label{sub:transform_coefficients_quantisation}

All the transforms implemented in this thesis are integer versions of
orthogonal transforms, whose coefficients are floating point numbers.
As previously explained, each transform coefficient has been scaled and
rounded to the nearest integer fitting into 1 byte range.
This method is far from being optimal, and thus a thorough study on how
transform coefficients should be quantised could lead to some improvements in
terms of rate-distortion.

\subsection*{Coding complexity}
\label{sub:coding_complexity}

The presented solutions to reduce the storage requirements for \ac{MDTC}
systems in Chapter~\ref{cha:real_world_system} have reduced the encoding
complexity as a side effect:
smartly reducing the number of transforms to test on the encoder side leads to
modest decreases in the encoding time, since less coding alternatives are
explored.

However, no specific attention or effort has been put to reduce the encoding
time of the proposed solutions.
This is due to, in part, the fact that the encoder is not standardised, only
the decoder, and focusing on the decoder time seemed more sensible to prove
the viability of a new technique, such as the use of multiple transforms.

Currently, the proposed \ac{MDTC} systems perform an exhaustive search to find
the best combination of block size (\ac{PU} and \ac{TU}), \ac{IPM} and
transform.
A possible way of simplifying the approach would be not to test the complete
set of transforms by implementing a fast decision mode.
Another way would be to remove the transform selection from the \ac{RDOQ} (see
\S8.1.7 from~\cite{wien-15-hevc}) and select it by computing directly the
\ac{RDOT} metric value for all available transforms.

\subsection*{Further study of \acs{DTT}-based \acs{MDTC} systems}
\label{sub:further_study_of_dtt_based_mdtc_systems}

The \ac{DCT}-based \ac{MDTC} systems introduced in Chapter~\ref{cha:dtt} have
come late in the work presented in this thesis.
Accordingly, these systems have not been studied with the same level of detail
as the \ac{RDOT}-based \ac{MDTC} systems.

Despite the low level of maturity of the approach using \acp{DTT}, these
systems have presented very promising trade-offs in the \acs{ROM} --- Y
\ac{BD}-rate plane.
As a consequence, a thorough study on the \ac{DTT} family, their properties
and fast algorithm implementations might lead to improvements in the encoding
and decoding complexities.
Besides, the algorithm used to design the \ac{MDTC} systems is sub-optimal,
since the computational complexity of choosing the best set of $N$ transforms
amongst the 256 available unique 2D transforms is high, and grows with $N$.
Improvements to the learning algorithm would lead to better choices regarding
the transforms used in each \ac{IPM}.

\appendix

\chapter{Publications}
\label{cha:publications}

\section*{Conference papers}
\label{sec:conference_papers}

\subsection*{Non-separable mode-dependent transforms for intra coding in HEVC}
\subsubsection*{IEEE VCIP 2014}
\paragraph{\textsf{Authors:}}
	Adrià Arrufat, Pierrick Philippe and Olivier Déforges
\subsubsection*{Abstract}

Transform coding plays a crucial role in video coders.
Recently, additional transforms based on the \acs{DST} and the \acs{DCT} have
been included in the latest video coding standard, \acs{HEVC}.
Those transforms were introduced after a thoroughly analysis of the video
signal properties.
In this paper, we design additional transforms by using an alternative
learning approach.
The appropriateness of the design over the classical \acs{KLT} learning is
also shown.
Subsequently, the additional designed transforms are applied to the latest
\acs{HEVC} scheme.
Results show that coding performance is improved compared to the standard.
Additional results show that the coding performance can be significantly
further improved by using non-separable transforms.
Bitrate reductions in the range of 2\% over \acs{HEVC} are achieved with
those proposed transforms.

\paragraph{Note}
\label{par:note}
This publication received the \emph{Best Paper Award} at IEEE VCIP 2014, which
has held in Valletta, Malta in December 2014.

\subsection*{Rate-distortion optimised transform competition for intra coding in
HEVC}
\paragraph{\textsf{Authors:}}
	Adrià Arrufat, Pierrick Philippe and Olivier Déforges
\subsubsection*{IEEE VCIP 2014}
\label{sub:ieee_vcip_2014_tc}
\subsubsection*{Abstract}
\label{ssub:tc_abstract}

State of the art video coders are based on prediction and transform coding.
The transform decorrelates the signal to achieve high compression levels.
In this paper we propose improving the performances of the latest video coding
standard, \acs{HEVC}, by adding a set of \acp{RDOT}.
The transform design is based upon a cost function that incorporates a bit
rate constraint.
These new \acp{RDOT} compete against classical \acs{HEVC} transforms in the
\ac{RDO} loop in the same way as prediction modes and block sizes, providing
additional coding possibilities.
Reductions in \acs{BD}-rate of around 2\% are demonstrated when making these
transforms available in \acs{HEVC}.

\subsection*{Mode-dependent transform competition in HEVC}
\subsubsection*{IEEE ICIP 2015}
\paragraph{\textsf{Authors:}}
	Adrià Arrufat, Pierrick Philippe and Olivier Déforges
\subsubsection*{Abstract}

Transform coding plays a key role in state-of-the-art video coders, such as
\acs{HEVC}.
However, transforms used in current solutions do not cover the varieties
of video coding signals.
This work presents an adaptive transform design method that enables the use of
multiple transforms in \acs{HEVC}.
A different transform set is learnt for each \acl{IPM}, allowing the video
encoder to perform better decisions regarding block sizes, prediction modes
and transforms.
Different systems are proposed to accommodate trade-offs between complexity
and performance.
Bitrate reductions in the range of 2\% to 7\% are reported, depending on
complexity.

\subsection*{Image coding with incomplete transforms for HEVC}
\subsubsection*{IEEE ICIP 2015}
\paragraph{\textsf{Authors:}}
	Adrià Arrufat, Pierrick Philippe and Anne-Flore Perrin
\subsubsection*{Abstract}

Overcomplete transforms have received considerable attention over the past
years.
However, they often suffer from a complexity burden.
In this paper, a low complexity approach is provided, where an orthonormal
basis is complemented with a set of incomplete transforms: those incomplete
transforms include a reduced number of basis vectors that allow a reduction on
the coding complexity and ensure a certain level of sparsity.
The solution has been implemented in the \acs{HEVC} standard and coding
gains of around 1\% on average are reported while reducing the decoder
complexity in about 5\%.

\subsection*{Low complexity transform competition for HEVC}
\subsubsection{IEEE ICASSP 2016}
\paragraph{\textsf{Authors:}}
	Adrià Arrufat, Pierrick Philippe, Kevin Reuzé and Olivier Déforges
\subsubsection*{Abstract}

The use of multiple transforms in video coding can lead to substantial
bitrate savings.
However, these savings come at the expense of increased coding complexity
and storage requirements, which challenge the usability of this approach.
In this paper, a systematic procedure is proposed to design low complexity
systems making use of transform competition.
Multiple trade-offs accommodating the complexity are unveiled and it is
demonstrated that they can keep a certain level of performance.
Compared to the \acs{HEVC} standard, some of them provide bitrate savings
around 2\% with a 50\% increase in the encoding time, using less than
\SI{4}{\kilo B} of extra \acs{ROM} and no added decoding complexity.

\clearpage

\section*{Patent applications}
\label{sec:patent_applications}

\begin{enumerate}
	\item Procédé de codage et de décodage d'images, dispositif de codage et
		de décodage d'images et programmes d'ordinateur correspondants,
		patent application INPI 1457768, August 2014
	\item Procédé de codage et de décodage d'images, dispositif de codage et
		de décodage d'images et programmes d'ordinateur correspondants,
		patent application INPI 155394, April 2015.
	\item Procédé de codage et de décodage d'images, dispositif de codage et
		de décodage d'images et programmes d'ordinateur correspondants,
		patent application INPI 1558066 August 2015
	\item Procédé de codage et de décodage d'images, dispositif de codage et
		de décodage d'images et programmes d'ordinateur correspondants,
		patent pending, October 2015
	\item Procédé de codage et de décodage d'images, dispositif de codage et
		de décodage d'images et programmes d'ordinateur correspondants,
		patent pending November 2015.
\end{enumerate}

\numberwithin{table}{chapter} % tables referred to chapters

\chapter{Transform usage in \acs{MDTC} systems}
\label{cha:transform_usage_in_mdtc_systems}

This appendix contains tables describing used \ac{MDTC} systems from
Chapters~\ref{cha:real_world_system} and~\ref{cha:dtt}.
The \emph{System} top row represents the system label and upper \acs{ROM}
boundary in kB.

\section*{Non-symmetrical \acs{MDTC} systems}
\label{sec:non_symmetrical_mdtc_systems}

\begin{table}[ht]
	\centering
	\small
	\def\arraystretch{0.85}
	\begin{tabular}{c|rrrrrrrrrrrr}
		\diagbox{\acs{IPM}}{System} &
		1 & 2 & 4 & 8  & 12 & 16 & 24 & 32 & 48 & 64 & 96 & 128 \\
		\hline
		0 & 2 & 2 & 8 & 16 & 16 & 16 & 32 & 32 & 32 & 32 & 32 & 32 \\
		1 & 2 & 2 & 2 & 8  & 8  & 8  & 8  & 8  & 8  & 8  & 8  & 8 \\
		2 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 2  & 2  & 1  & 4  & 4 \\
		3 & 0 & 0 & 0 & 1  & 1  & 1  & 2  & 2  & 4  & 2  & 8  & 8 \\
		4 & 0 & 0 & 0 & 0  & 0  & 0  & 1  & 1  & 4  & 1  & 4  & 8 \\
		5 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 4  & 4  & 1  & 4  & 4 \\
		6 & 0 & 0 & 0 & 1  & 1  & 2  & 2  & 4  & 4  & 2  & 8  & 8 \\
		7 & 0 & 0 & 0 & 2  & 2  & 2  & 2  & 2  & 4  & 2  & 8  & 8 \\
		8 & 0 & 0 & 2 & 2  & 2  & 2  & 2  & 2  & 2  & 2  & 2  & 4 \\
		9 & 1 & 1 & 1 & 1  & 1  & 1  & 1  & 1  & 1  & 1  & 4  & 4 \\
		10 & 1 & 1 & 2 & 4  & 4  & 4  & 8  & 8  & 8  & 8  & 8  & 16 \\
		11 & 0 & 0 & 1 & 1  & 2  & 2  & 2  & 2  & 2  & 2  & 4  & 4 \\
		12 & 1 & 1 & 1 & 1  & 1  & 1  & 1  & 1  & 1  & 1  & 2  & 2 \\
		13 & 1 & 1 & 1 & 1  & 1  & 1  & 1  & 1  & 2  & 1  & 4  & 4 \\
		14 & 0 & 0 & 1 & 1  & 1  & 1  & 2  & 2  & 8  & 2  & 8  & 8 \\
		15 & 0 & 0 & 0 & 0  & 0  & 0  & 1  & 1  & 4  & 1  & 8  & 8 \\
		16 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 1  & 2  & 0  & 8  & 8 \\
		17 & 0 & 0 & 1 & 2  & 2  & 2  & 2  & 2  & 2  & 2  & 4  & 4 \\
		18 & 0 & 0 & 0 & 0  & 1  & 1  & 2  & 2  & 2  & 2  & 4  & 4 \\
		19 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 1  & 2  & 0  & 2  & 4 \\
		20 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 2  & 2  & 2  & 4  & 4 \\
		21 & 0 & 0 & 0 & 1  & 2  & 2  & 2  & 8  & 16 & 4  & 32 & 32 \\
		22 & 0 & 0 & 0 & 0  & 0  & 0  & 2  & 2  & 2  & 2  & 4  & 4 \\
		23 & 0 & 0 & 0 & 0  & 0  & 0  & 2  & 2  & 2  & 2  & 4  & 4 \\
		24 & 0 & 0 & 1 & 1  & 2  & 2  & 2  & 2  & 2  & 2  & 2  & 2 \\
		25 & 0 & 0 & 0 & 1  & 1  & 1  & 1  & 2  & 2  & 2  & 4  & 4 \\
		26 & 1 & 1 & 1 & 4  & 8  & 8  & 16 & 16 & 16 & 16 & 16 & 32 \\
		27 & 0 & 1 & 2 & 2  & 2  & 2  & 4  & 8  & 8  & 4  & 8  & 8 \\
		28 & 0 & 0 & 0 & 1  & 2  & 2  & 2  & 4  & 4  & 2  & 8  & 8 \\
		29 & 0 & 0 & 0 & 1  & 2  & 2  & 2  & 2  & 2  & 2  & 2  & 2 \\
		30 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 0  & 0  & 0  & 4  & 4 \\
		31 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 1  & 1  & 1  & 8  & 8 \\
		32 & 0 & 0 & 1 & 2  & 2  & 2  & 4  & 4  & 4  & 4  & 4  & 8 \\
		33 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 0  & 4  & 0  & 4  & 4 \\
		34 & 0 & 0 & 0 & 0  & 0  & 0  & 1  & 2  & 4  & 1  & 4  & 4 \\
	\end{tabular}
	\caption{Transform repartition for non-symmetrical \acs{MDTC} systems on
	$4\times4$ \acsp{TU}}
	\label{tab:config_nonsym_mdtc_4}
\end{table}

\begin{table}[ht]
	\centering
	\small
	\def\arraystretch{0.85}
	\begin{tabular}{c|rrrrrrrrrrrr}
		\diagbox{\acs{IPM}}{System} &
		1 & 2 & 4 & 8  & 12 & 16 & 24 & 32 & 48 & 64 & 96 & 128 \\
		\hline
		0  & 1 & 4 & 4 & 4 & 4 & 8  & 32 & 32 & 64 & 128 & 128 & 128 \\
		1  & 0 & 0 & 1 & 4 & 8 & 16 & 16 & 32 & 64 & 64  & 128 & 128 \\
		2  & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 0  & 1  & 1   & 1   & 4 \\
		3  & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 0  & 1  & 1   & 1   & 2 \\
		4  & 0 & 0 & 0 & 1 & 1 & 1  & 1  & 1  & 1  & 1   & 2   & 8 \\
		5  & 0 & 0 & 0 & 0 & 0 & 1  & 1  & 1  & 1  & 2   & 4   & 16 \\
		6  & 0 & 0 & 0 & 1 & 1 & 1  & 2  & 2  & 2  & 2   & 2   & 8 \\
		7  & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 1  & 1  & 1   & 4   & 4 \\
		8  & 0 & 0 & 1 & 1 & 1 & 1  & 1  & 2  & 4  & 8   & 8   & 8 \\
		9  & 0 & 0 & 1 & 1 & 2 & 2  & 2  & 2  & 4  & 8   & 8   & 16 \\
		10 & 1 & 2 & 2 & 4 & 4 & 4  & 8  & 16 & 16 & 32  & 32  & 64 \\
		11 & 0 & 0 & 1 & 1 & 2 & 2  & 2  & 4  & 8  & 8   & 8   & 16 \\
		12 & 0 & 0 & 0 & 0 & 2 & 2  & 2  & 2  & 2  & 2   & 2   & 4 \\
		13 & 0 & 0 & 0 & 1 & 1 & 2  & 2  & 2  & 2  & 2   & 2   & 4 \\
		14 & 0 & 0 & 0 & 1 & 1 & 1  & 1  & 1  & 1  & 2   & 2   & 4 \\
		15 & 0 & 0 & 0 & 0 & 0 & 1  & 1  & 1  & 1  & 1   & 4   & 16 \\
		16 & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 1  & 1  & 1   & 1   & 2 \\
		17 & 0 & 0 & 0 & 0 & 0 & 0  & 1  & 1  & 1  & 2   & 2   & 4 \\
		18 & 0 & 0 & 0 & 0 & 1 & 1  & 2  & 2  & 2  & 2   & 2   & 4 \\
		19 & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 0   & 0   & 2 \\
		20 & 0 & 0 & 0 & 0 & 0 & 1  & 1  & 1  & 1  & 1   & 2   & 4 \\
		21 & 0 & 0 & 0 & 1 & 1 & 2  & 2  & 2  & 2  & 2   & 2   & 2 \\
		22 & 0 & 0 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 1   & 2   & 8 \\
		23 & 0 & 0 & 0 & 0 & 1 & 1  & 2  & 2  & 2  & 2   & 2   & 2 \\
		24 & 0 & 0 & 0 & 1 & 2 & 2  & 2  & 2  & 2  & 2   & 2   & 8 \\
		25 & 0 & 0 & 1 & 1 & 2 & 2  & 2  & 2  & 2  & 2   & 4   & 4 \\
		26 & 1 & 2 & 2 & 4 & 8 & 8  & 8  & 8  & 8  & 8   & 64  & 64 \\
		27 & 0 & 0 & 1 & 1 & 2 & 4  & 4  & 8  & 8  & 16  & 16  & 32 \\
		28 & 0 & 0 & 0 & 1 & 1 & 1  & 1  & 1  & 1  & 2   & 4   & 4 \\
		29 & 0 & 0 & 0 & 0 & 1 & 1  & 1  & 1  & 1  & 1   & 1   & 4 \\
		30 & 0 & 0 & 0 & 0 & 0 & 0  & 1  & 2  & 2  & 2   & 2   & 2 \\
		31 & 0 & 0 & 0 & 0 & 0 & 0  & 1  & 1  & 1  & 1   & 1   & 2 \\
		32 & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 1  & 2  & 2   & 2   & 8 \\
		33 & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 1  & 1  & 2   & 2   & 8 \\
		34 & 0 & 0 & 0 & 0 & 0 & 0  & 0  & 0  & 0  & 0   & 2   & 16 \\
	\end{tabular}
	\caption{Transform repartition for non-symmetrical \acs{MDTC} systems on
	$8\times8$ \acsp{TU}}
	\label{tab:config_nonsym_mdtc_8}
\end{table}

\clearpage

\section*{Symmetrical \acs{MDTC} systems}
\label{sec:symmetrical_mdtc_systems}

\begin{table}[h]
	\centering
	\small
	\def\arraystretch{0.85}
	\begin{tabular}{c|rrrrrrrrrrrr}
		\diagbox{\acs{IPM}}{System} &
		1 & 2 & 4 & 8  & 12 & 16 & 24 & 32 & 48 & 64 & 96 & 128 \\
		\hline
		0  & 0 & 2 & 2 & 2  & 2  & 16 & 16 & 16 & 32 & 16 & 32 & 32 \\
		1  & 0 & 2 & 4 & 8  & 8  & 8  & 8  & 8  & 32 & 32 & 32 & 32 \\
		2  & 0 & 0 & 0 & 4  & 4  & 4  & 4  & 4  & 8  & 8  & 8  & 8 \\
		3  & 0 & 1 & 1 & 1  & 1  & 8  & 8  & 8  & 8  & 8  & 16 & 16 \\
		4  & 0 & 1 & 1 & 2  & 2  & 8  & 8  & 8  & 32 & 16 & 32 & 32 \\
		5  & 1 & 2 & 2 & 2  & 2  & 4  & 4  & 4  & 16 & 4  & 16 & 16 \\
		6  & 0 & 0 & 0 & 0  & 0  & 4  & 4  & 4  & 4  & 4  & 4  & 4 \\
		7  & 0 & 1 & 1 & 2  & 2  & 2  & 2  & 2  & 8  & 4  & 8  & 8 \\
		8  & 1 & 1 & 1 & 1  & 1  & 1  & 1  & 1  & 4  & 4  & 4  & 4 \\
		9  & 1 & 1 & 1 & 2  & 2  & 2  & 2  & 2  & 16 & 2  & 16 & 16 \\
		10 & 2 & 2 & 4 & 16 & 16 & 32 & 32 & 32 & 32 & 32 & 32 & 32 \\
	\end{tabular}
	\caption{Transform repartition for symmetrical \acs{MDTC} systems on
	$4\times4$ \acsp{TU}}
	\label{tab:config_sym_mdtc_4}
\end{table}

\begin{table}[h]
	\centering
	\small
	\def\arraystretch{0.85}
	\begin{tabular}{c|rrrrrrrrrrrr}
		\diagbox{\acs{IPM}}{System} &
		1 & 2 & 4 & 8  & 12 & 16 & 24 & 32 & 48 & 64 & 96 & 128 \\
		\hline
		0  & 1 & 2 & 4 & 4 & 16 & 16 & 32 & 32 & 32 & 32 & 128 & 128 \\
		1  & 0 & 0 & 1 & 4 & 8  & 8  & 16 & 32 & 32 & 64 & 64  & 128 \\
		2  & 0 & 0 & 0 & 0 & 2  & 2  & 2  & 2  & 8  & 16 & 16  & 32 \\
		3  & 0 & 0 & 0 & 2 & 4  & 4  & 4  & 4  & 16 & 16 & 16  & 32 \\
		4  & 0 & 0 & 1 & 2 & 2  & 2  & 2  & 4  & 4  & 4  & 16  & 16 \\
		5  & 0 & 0 & 0 & 2 & 2  & 2  & 2  & 4  & 8  & 16 & 16  & 32 \\
		6  & 0 & 0 & 1 & 2 & 4  & 4  & 4  & 4  & 8  & 8  & 8   & 32 \\
		7  & 0 & 0 & 2 & 2 & 2  & 2  & 2  & 2  & 4  & 8  & 16  & 64 \\
		8  & 0 & 1 & 2 & 2 & 2  & 2  & 8  & 16 & 16 & 16 & 16  & 16 \\
		9  & 2 & 2 & 4 & 4 & 4  & 4  & 16 & 16 & 16 & 64 & 64  & 64 \\
		10 & 1 & 2 & 2 & 8 & 8  & 16 & 16 & 32 & 64 & 64 & 64  & 64 \\
	\end{tabular}
	\caption{Transform repartition for symmetrical \acs{MDTC} systems on
	$8\times8$ \acsp{TU}}
	\label{tab:config_sym_mdtc_8}
\end{table}

\section*{Symmetrical \acs{DTT}-based \acs{MDTC} systems}
\label{sec:symmetrical_dtt_based_mdtc_systems}

\begin{table}[h]
	\centering
	\small
	\def\arraystretch{0.85}
	\begin{tabular}{c|rrr|rrr}
		& \multicolumn{3}{c|}{$4\times4$} & \multicolumn{3}{c}{$8\times8$} \\
		\diagbox{\acs{IPM}}{System} & 1 & 2 & 4 & 1 & 2 & 4\\
		\hline
		0  & 8 & 8 & 8 & 2 & 4 & 16 \\
		1  & 2 & 4 & 4 & 2 & 4 & 16 \\
		2  & 0 & 0 & 0 & 0 & 2 & 4 \\
		3  & 4 & 4 & 4 & 1 & 1 & 2 \\
		4  & 2 & 2 & 2 & 0 & 1 & 4 \\
		5  & 2 & 2 & 2 & 1 & 2 & 4 \\
		6  & 0 & 1 & 1 & 1 & 1 & 1 \\
		7  & 2 & 2 & 2 & 0 & 0 & 0 \\
		8  & 2 & 2 & 2 & 0 & 0 & 0 \\
		9  & 1 & 1 & 1 & 0 & 0 & 0 \\
		10 & 4 & 8 & 8 & 2 & 8 & 8 \\
	\end{tabular}
	\caption{Transform repartition for symmetrical \acs{DTT}-based \acs{MDTC}
	systems on $4\times4$ and $8\times8$ \acsp{TU}}
	\label{tab:config_sym_dtt}
\end{table}

\backmatter

\bibliographystyle{abbrv} % numbers
%\bibliographystyle{apalike} % Surname et al. 2014
%\bibliographystyle{alpha} % ABC14
\bibliography{refs}

\printindex
\label{cha:index}
\addcontentsline{toc}{chapter}{Index}

% jury
\ifthenelse{\useINSAcover = 1}
{
\thispagestyle{empty}
\mbox{}
\includepdf[pages={5}]{./cover/attestation-reussite+jury.pdf}}{}
\cleardoublepage

% back cover
\ifthenelse{\useINSAcover = 1}
{
\thispagestyle{empty}
\mbox{}
\includepdf[pages={3}]{./cover/cover-mtvc-thesis-INSA-UEB.pdf}}{}

\end{document}