-
Notifications
You must be signed in to change notification settings - Fork 1
/
main.tex
6839 lines (6153 loc) · 291 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[11pt,a4paper,openright,twoside]{book}
\def\myauthor{Adrià Arrufat}
\def\mytitle{Multiple transforms for video coding}
\def\usepdfs{0} % whether to use pre-generated pdfs with 'make figures'
\def\useINSAcover{0} % whether to use the INSA front and back covers
\usepackage[british,french]{babel} % locale settings
\usepackage{geometry} % page configuration
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
% font configuration
\usepackage[T1]{fontenc}
\usepackage{mathptmx}
\usepackage[scaled]{helvet}
\usepackage{courier}
% mathematical symbols
\usepackage{amsmath, amssymb}
\usepackage{tocbibind} % see http://www.howtotex.com/packages/how-to-add-bibliography-and-more-to-table-of-contents/
\usepackage[Lenny]{fncychap} % style for chapters
\ChTitleVar{\Huge\sf\bfseries}
\usepackage{titlesec,titletoc} % to modify titles and add partial tocs
\titleformat*{\section}{\Large\bfseries\sffamily}
\titleformat*{\subsection}{\large\bfseries\sffamily}
\titleformat*{\subsubsection}{\bfseries\sffamily}
\usepackage[printonlyused,withpage]{acronym} % options: printonlyused, withpage
\usepackage{imakeidx} % allows customizing the index in the makeindex command
\usepackage{emptypage} % removes headers and footers from empty pages
\usepackage{booktabs} % nice looking tables
\usepackage{siunitx} % consistent units and number support
\sisetup{number-unit-product = \ } % adds a space between numbers and units
\sisetup{round-mode=places,round-precision=2} % set precision to 2 decimal places
\usepackage{subfig} % subfloat environment
% \usepackage{flafter} % make sure floats appear after they have been referenced
\usepackage[section, above, below]{placeins} % prevents float from bein far
\usepackage{nicefrac} % nicer fractions
\usepackage{contour} % for inexisting bold symbols
\contourlength{0.01em}
\usepackage{enumitem} % for custom labels in enumerate environments
\usepackage{algorithm2e} % for describing algorithms
\usepackage{multirow} % for columns that expland among multiple rows in tables
\usepackage{tabularx} % for fixed width tables
\usepackage{diagbox} % for diagonal separator in tables
\usepackage{fancyhdr} % customise headers and footers
\usepackage{pdfpages} % to include external pages from pdfs
% lines and paragraphs
% \usepackage{setspace}
% \setstretch{1}
% \parskip=\smallskipamount
% \setlength{\parindent}{0pt}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{pgfplots,tikz}
\usetikzlibrary{shapes,arrows,fit,calc,decorations.markings,intersections}
\usepgfplotslibrary{fillbetween}
\usepackage{intcalc, numprint}
\usepackage{bookmark} % needed for \bookmarksetup{startatroot}
\usepackage{hyperref}
\hypersetup{
unicode=true,
pdfencoding=auto,
colorlinks=true,
citecolor=blue,
filecolor=red,
linkcolor=Blue,
urlcolor=blue,
linktoc=all,
pdfauthor={\myauthor},
pdftitle={\mytitle},
pdfsubject={video coding},
pdfkeywords={video, image, transform, coding, compression, phd, thesis},
pdfinfo={
CreationDate={D:20150317093606},
%ModDate={...}
},
}
% \urlstyle{same} % do not use monospaced fonts in urls
% Define a TODO command
\definecolor{yellowish}{rgb}{1,1,0.5}
\definecolor{redish}{rgb}{1,0.25,0.25}
\providecommand{\todo}[1]{
\begin{center}
\colorbox{yellowish}{
\begin{minipage}{0.95\linewidth}
\textbf{\color{redish}{TODO}:} #1
\end{minipage}
}
\end{center}
}
% Define a partial ToC to use at the beginning of every chapter
\providecommand{\chaptertoc}{
\startcontents[chapters]
\hrule
\vspace{1em}
\printcontents[chapters]{}{1}{{\sf\large\bfseries Contents}}
\vspace{1em}
\hrule
}
% import mathematical and colour definitions
\input{./custom_defs.tex}
\numberwithin{equation}{section} % equations referred to sections
\numberwithin{figure}{section} % figures referred to sections
\numberwithin{table}{section} % tables referred to sections
\makeindex[options=-s index-alph-group.ist]
\title{\Huge\bf\mytitle}
\author{\myauthor}
\begin{document}
\selectlanguage{british}
\frontmatter
% front cover
\ifthenelse{\useINSAcover = 1}
{
\includepdf[pages={2}]{./cover/cover-mtvc-thesis-INSA-UEB.pdf}
\cleardoublepage
\includepdf[pages={4}]{./cover/cover-mtvc-thesis-INSA-UEB.pdf}
}{\maketitle}
\chapter{Acknowledgements}
\label{cha:acknowledgements}
First of all, I would like to thank my PhD supervisor, Pierrick Philippe from
Orange Labs.
He has transmitted me the enthusiasm in the daily work by flooding me with new
ideas and challenges every day.
I feel that I am very lucky and honoured to have been able to work with him
and his team and I hope that our paths will cross again during our
professional and personal lives.
I also want to thank the rest of the team at Orange Labs: Gordon Clare and
Félix Henry for their invaluable help at the beginning, which allowed me to
dive into the code and get a better understanding on video coding.
A special mention goes to Patrick Boissonade, with whom I shared the office
during these years.
No matter how complicated seemed a technical difficulty, he always managed to
impress me with his knowledge on everything: from coding and optimisation on
different architectures, to system administration.
I also want to thank him the patience he has shown every time I made the
cluster crash with my experiments and how he came up with a way of finding the
issue and solving it in record time.
Without all these efforts, the results of my work during the last three years
would be very different.
Other co-workers at Orange also made my stay a lot more pleasant with the
interesting discussions we had at lunch time about almost any topic.
These discussions have allowed me to get to know better my team and other
people such as Patrick Gioia and Stéphane Pateux.
I also feel very grateful to Didier Gaubil, our team manager, for being always
available whenever I needed him.
Travelling feels so much safer in the knowledge that someone like him is in
charge and will know what to do in case of an emergency.
I have to thank Olivier Déforges for his advice, dedication and support,
especially when deadlines for publications approached.
Moreover, having attended to two conferences with him, allowed me to get to
be more confident and to know him better.
A special mention for Hendrik Vorwerk, whose work during his intern-ship
served as an invaluable starting point for my results, and without which I
would have struggled to achieve the same results.
The members of the jury (Christine Guillemot, Béatrice Pesquet-Popescu,
Mathias Wien, Fernando Pereira and Philippe Salembier) deserve a distinctive
mention as well for having accepted to review, assisted to my PhD defence and
given constructive feedback.
On a side note, I must mention that all the experiments have been carried out
using free and open-source software, as such I thank the Internet and Linux
community for making all this kind of knowledge available on-line.
Last but not least, I want to thank my parents, for bearing me on endless
phone conversations almost every day and having supported me through all
these years.
\chapter{French summary}
\label{cha:french_summary}
\selectlanguage{french}
Les codeurs vidéo état de l'art utilisent des transformées pour assurer une
représentation compacte du signal.
L'étape de transformée constitue le domaine dans lequel s'effectue la
compression, pourtant peu de variabilité dans les transformées est observé
dans la littérature: habituellement, une fois que la taille d'un bloc est
sélectionné, la transformée est figée, habituellement de type Transformée en
Cosinus Discrète (TCD).
D'autres transformées autres que cette transformée, qui constitue le choix de
facto, ont récemment reçu une attention en application de codage vidéo.
Par exemple, dans la dernière norme de compression vidéo appelée HEVC (High
Efficiency Video Coding, codage vidéo à haute efficacité), la Transformée en
Sinus Discrète (TSD) est également utilisée pour traiter les blocs issus de la
prédiction pour les tailles $4\times4$.
De plus, pour ces blocs particuliers, HEVC a le choix complémentaire de ne pas
transformer le bloc, par utilisation du signal transformSkip.
Ce fait révèle l'intérêt croissant pour étendre les choix entre transformées
pour accommoder les insatiables besoins en compression vidéo.
Cette thèse se concentre sur l'amélioration des performances en codage vidéo
par l'utilisation de multiples transformées.
Les résultats sont présentés pour le codage des images Intra, c'est-à-dire des
images qui ne sont codées qu'à partir de données locales à celle-ci.
Dans cette configuration la norme de compression HEVC (publiée en 2013), qui
représente la solution la plus aboutie en la matière, améliore la performance
de compression du précédent standard appelé AVC (publié en 2003) de 22\%.
HEVC obtient cette amélioration par la démultiplication des alternatives de
codage comme l'utilisation de plusieurs tailles de bloc (4, 8, 16, 32 et 64)
et modes de prédiction (35 modes) pour générer le signal résiduel (différence
entre les pixels de l'image originale et l'image issue de la prédiction) qui
est ensuite transformé par une transformée donnée selon la taille
sélectionnée.
L'objectif pour le codeur est de trouver le meilleur compromis entre la
distorsion apportée par la quantification et le débit nécessaire pour
transmettre les valeurs approximées.
On se rend compte que HEVC investit une part importante dans la génération de
résidus, mais peu d'alternatives existent quant à la transformée.
Cette thèse est motivée par le fait que l'utilisation de plusieurs
transformées permet d'obtenir une représentation plus parcimonieuse du signal
que dans le cas d'une seule transformée.
Comme ce thème est relativement peu abordé en codage vidéo, cette thèse tente
de combler le vide pour considérer des transformées autres que la transformée
en cosinus discrète.
Pour ce faire, un aspect de cette thèse concerne la conception de transformées
en utilisant deux techniques qui sont détaillées dans ce manuscrit.
L'approche traditionnelle à base de transformées de Karhunen-Loève (KLT) et
une transformée optimisée débit distorsion nommée RDOT.
La KLT est une transformée qui a pour vocation à minimiser la distorsion sous
une hypothèse de haute résolution au travers d'une allocation de bit optimale,
cela implique une décorrélation du signal dans le domaine transformé.
La RDOT quant à elle, essaie de rendre le signal le plus parcimonieux possible
tout en limitant la quantité de distorsion induite par la quantification.
La première approche basée transformée multiples est au travers d'une
technique nommée MDDT (Mode Dependent Directional Transform).
Celle-ci consiste à utiliser une transformée adaptée, par le biais d'une KLT
ou d'une RDOT, pour chaque mode de prédiction intra.
Par une utilisation de transformées séparables, un petit gain est observé par
rapport à HEVC (de l'ordre de 0.5\% du débit est économisé).
Néanmoins, l'utilisation de transformées non-séparables révèle des gains
tangibles de l'ordre de 2.4\% lorsque les transformées sont adaptées au
travers de la RDOT.
Ce gain est plus favorable que celui observé lorsque les transformées sont
construites à partir de l'approche KLT: celle-ci n'améliore HEVC que de
1.8\%.
Les résultats de cette étude sont résumées dans l'article intitulé
``Non-separable mode dependent transforms for intra coding in HEVC'' présenté
à la conférence VCIP 2014.
Ce chapitre conclut que les transformées basées sur la RDOT ont de meilleures
performances que celles basée KLT.
Dans l'objectif d'étendre l'approche MDDT, le chapitre suivant décrit une
approche nommée MDTC (Mode-Dependent Transform Competition) dans la quelle
chaque mode de prédiction est équipé de plusieurs transformées.
Lors du codage, ces transformées entrent en compétition de la même façon que
les modes de prédiction et tailles de blocs sont sélectionnés.
Ce système apporte des gains de l'ordre de 7\% pour des transformées
non-séparables et 4\% pour les transformées séparables, en comparaison avec
HEVC.
Les résultats de ce chapitre sont publiés dans l'article ``Mode-dependent
transform competition for HEVC'' publié lors de la conférence ICIP 2015.
Néanmoins la complexité de tels systèmes est notoire, à la fois en ressources
de calcul et en espace de stockage: un facteur de 10 en temps de codage et la
complexité de décodage est accrue de 40\% par rapport à HEVC.
Le stockage des transformées requiert en outre plus de 300 kilo-octets.
En conséquence les chapitres suivants de la thèse développent des approches
permettant de simplifier visant à simplifier les systèmes MDTC tout en
conservant dans la mesure du possible l'amélioration en débit.
Comme les transformées non-séparables apportent les gains les plus
prometteurs, le chapitre 5, présente une approche plus simple permettant
d'utiliser néanmoins les transformées non-séparables.
Ces travaux ont été publiés dans la référence ``Image coding with incomplete
transform competition for HEVC'' présentée à la conférence ICIP 2015.
L'approche développée consiste à ne plus utiliser l'ensemble des vecteurs de
base lors de la transformation, mais de ne conserver que la première base.
Un ensemble de transformées incomplète est ainsi produit et utilisé en
complément de la transformée HEVC qui conserve sa base complète.
Des gains en compression de l'ordre de 1\% sont observés avec cette technique,
avec une complexité au décodeur notablement abaissée par rapport aux
précédentes approches: elle devient même plus faible que celle de HEVC.
Finalement, une procédure de construction de systèmes MDTC à basse complexité
est présentée.
Ces travaux sont repris dans la publication ``Low complexity transform
competition for HEVC'' acceptée à la conférence ICASSP 2016.
Cette approche à basse complexité s'appuie sur trois composantes qui sont
évaluées: tout d'abord une sélection du nombre adéquat de transformées par
mode est effectuée, ce qui permet de réduire le nombre de transformées et
limiter l'espace de stockage et la complexité de codage.
De plus des symétries entre modes de prédiction sont exploitées pour réduire
la ROM d'un facteur 3.
Pour terminer l'utilisation de transformées trigonométriques (DTT, Discrete
Trigonometric Transforms) est motivé par existence d'algorithmes rapides.
L'ensemble de ces contributions réunies permet de proposer un système d'une
complexité d'encodage de 50\% accrue par rapport à l'état de l'art avec une
complexité ajoutée, au niveau décodage et stockage, mineure.
En conclusion les résultats de cette thèse montrent que les transformées
multiples apportent des gains significatifs en comparaison avec le plus récent
standard de codage vidéo.
Des gains très substantiels par rapport à HEVC sont apportés si l'on néglige
les aspects complexité.
Néanmoins pour des systèmes réalistes des gains tangibles sont obtenus pour
des complexités compétitives.
\selectlanguage{british}
% \setcounter{tocdepth}{5}
\tableofcontents
\cleardoublepage
\chapter{List of Acronyms}
\label{cha:glossary}
\input{./acronyms.tex}
\cleardoublepage
\listoffigures
\cleardoublepage
\listoftables
\cleardoublepage
\chapter{General introduction}
\label{cha:general_intoduction}
% \addcontentsline{toc}{chapter}{\protect\numberline{}General introduction}
\section*{Context}
\label{sec:context}
\addcontentsline{toc}{section}{\protect\numberline{}Context}
Nowadays, video services play a major role in information exchanges around the
world.
Despite the progress achieved in the last years with video coding standards,
improvements are still required as new formats emerge:
as \ac{HFR}, \ac{HDR} and \ac{HD} formats become more and more common, new
needs for video compression appear that must exploit properties in these
domains to achieve higher compression rates.
All these formats are made realistic in terms of service deployment thanks to
the fact that around every 10 years, the coding efficiency doubles for
equivalent quality.
In 2003, the H.264/\acs{MPEG}-4 \acs{AVC} standard was defined, providing
a compression rate of around 50\% with regards \acs{MPEG}-2 video, defined
in 1993.
In January 2013, the \acs{HEVC} standard was released, which outperforms
H.264/\acs{MPEG}-4 \acs{AVC} by 50\% in terms of bitrate savings for
equivalent perceptual quality~\cite{sullivan-12-overview-hevc}.
\bigskip
The work carried out in this thesis started in November 2012, with the
\acs{HEVC} standard almost completely defined.
Consequently, the focus has been put on improving \acs{HEVC} with
new techniques that could be adopted in a future standard, tentatively for
around 2020.
Recently, \acs{ITU} and \acs{ISO}, through their respective groups \acs{VCEG}
and \acs{MPEG}, have started working towards a possible future video coding
standard for that time frame.
\bigskip
Being at the beginning of the post-\acs{HEVC} era, the first steps in this
thesis strive to achieve important bitrate savings over \acs{HEVC} by
relaxing complexity constraints.
This thesis is strongly connected to the standardisation context.
The first exploratory direction points towards finding new techniques
regarding the role of transforms in video coding, such as different
transform design methods and the usage of multiple transforms adapted to the
nature of video coding signals.
Then, the studies move towards making these new techniques involving multiple
transforms admissible in a standardisation context, which implies having
reasonable impact on standardisation aspects, such as complexity, especially
on the decoder side.
\bigskip
Accordingly, this thesis has been organised into the following Chapters:
\begin{enumerate}
[labelindent=3.8em,leftmargin=!,label={\bf Chapter \arabic{enumi}}]
\item starts with an introduction to the basics of video coding and some
essential concepts for this thesis on modern video coding standards.
The focus is quickly put on the transform stage and its crucial role
in the video coding scheme.
\item contains a detailed study on the transform role inside video coding
applications.
As mentioned, the main motivation of this thesis is to improve video
coding by making use of multiple transforms.
In order to conceive those transforms, two design methods are studied:
the \ac{KLT} and a \ac{RDOT}.
The \ac{KLT} defines a well-known transform design method to conceive
transforms that minimise the distortion under the high-resolution
quantisation hypothesis and to provide optimal bit-allocation via
signal decorrelation in the transform domain.
The \ac{RDOT}, presented in details in this thesis, describes another
design method that tries to output a signal which is as sparse as
possible while minimising the distortion introduced by the
quantisation.
\item compares the \ac{KLT} and \ac{RDOT} design methods introduced in the
previous Chapter by using multiple transforms in a modified version of
\ac{HEVC} through the \ac{MDDT} technique, where one adapted transform
(\ac{KLT} or \ac{RDOT}) is provided per \acl{IPM}.
This experiment questions the optimality of the \ac{KLT} for video
signals.
Moreover, the impact of transform separability on video coding in
terms of bitrate savings and coding complexity is reconsidered.
\item extends the \ac{MDDT} system by introducing the \ac{MDTC} system.
The main idea is to provide several transforms in each \acl{IPM}.
These transforms compete against each other in the same way that block
sizes and \aclp{IPM} do.
\ac{MDTC} systems are able to offer notable bitrate savings at the
expense of encoding and decoding complexity and the storage
requirements for the used transforms.
Therefore, the following Chapters of the thesis contain approaches on
simplifying the \ac{MDTC} system whilst keeping bitrate improvements.
\item explores a new way of simplifying non-separable \ac{MDTC} systems,
which provided the most promising results in the previous Chapter, by
using incomplete transforms.
Incomplete transforms are low complexity transforms where, instead of
using all the basis vectors for the non-separable transforms, only the
first one is retained.
They are used as companions of the default \acs{HEVC} transforms.
\item presents different methods for reducing the storage requirements of
\ac{MDTC} systems.
The proposed methods are based on the fact that not all \aclp{IPM}
present a uniform behaviour inside \ac{HEVC} in terms of usage
frequency and signalling.
Therefore, not all \aclp{IPM} need the same number of transforms,
which allows for a reduction of signalling, encoding complexity and
storage.
Moreover, symmetries observed in \aclp{IPM} are exploited to further
reduce the storage requirements while having low impact on the
bitrate savings.
\item replaces the \ac{RDOT}-based \ac{MDTC} system with another one based
on \acp{DTT}.
\acp{DTT} can be expressed compactly via an analytical formula, which
reduces notably the storage requirements.
And, since they are backed up by fast algorithms, they make suitable
candidates for low complexity systems that have no impact on the
decoding complexity compared to \acs{HEVC}.
\item is a summary of the work carried out in this thesis in order to
provide a better overview of the achievements of systems making use of
multiple transforms.
Since many systems offering various trade-offs between complexity and
performance are presented in this thesis, the most relevant are
grouped in this Chapter for an easier comparison.
\item concludes on the thesis results and presents some perspectives for
future work on multiple transforms for video coding.
\end{enumerate}
\section*{Thesis contributions}
\label{sec:thesis_contributions}
\addcontentsline{toc}{section}{\protect\numberline{}Thesis contributions}
Work carried out in this thesis contains several objectives and contributions,
amongst which:
\begin{itemize}
\item Reconsider the separability on transforms for video coding.
\item Revisit the use of the \acs{KLT} for video coding.
\item Define a metric to be able to rank transforms and allow learning and
classification algorithms.
\item Design video coding systems that make use of multiple transforms.
\end{itemize}
The thesis ends with a summary of the thesis objectives, followed by the
conclusions and the perspectives for future work based on the use of multiple
transforms for video coding.
\mainmatter
\acresetall % reset all acronym expansions
\chapter{Video coding fundamentals}
\label{cha:video_coding_fundamentals}
\chaptertoc
\section{Introduction to video coding}
\label{sec:introduction_to_video_coding}
The purpose of video coding is to compress video streams, which consist of a
sequence of images that, at some point, will be either transmitted or stored.
Compressing means reducing the quantity of information so that the amount of
bits required to represent that information is low enough to enable the use of
video-based applications.
For instance, a video in \ac{HD} format ($1920 \times 1080$)
at a frame rate of 25 images per second and 8 bits to represent each one
of the \ac{RGB} channels, requires:
\[
\frac{1920\times\SI{1080}{pix}}{\SI{1}{image}}
\times \frac{\SI{25}{images}}{\SI{1}{s}}
\times \frac{\SI{3}{channels}}{\SI{1}{pix}}
\times \frac{\SI{8}{bits}}{\SI{1}{channel}}
\approx \SI{1.2}{\giga bit/\second}
\]
Through this simple example, it is obvious that video coding is compulsory
to stream or even store video files:
the amount of bitrate reported is this example is beyond the limit of current
computing and service architectures and networks.
Depending on the target quality, it is common to have compression rates
ranging from 10 to 1000.
For content providers, being able to reduce the size of the content they
deliver the opportunity to increase the number of contents they can store, as
well as the number of subscribers they can reach using the same resources for
storage and network capacity.
In 2013, two thirds of the \ac{IP} traffic was due to video
streaming, and this trend is foreseen to continuously increase, reaching up to
84\% of the \acs{IP} traffic by 2018~\cite{cisco-13-vni-forecast}.
Such forecasts highlight the need to continue the research on new video
coding techniques.
As a network operator, being also an \ac{ISP} that delivers video services
such as live TV streaming or video on demand services, Orange is, therefore,
particularly motivated by coding schemes allowing better trade-offs between
the quantity and quality of served videos and the amount of reachable clients.
\section{The video coding system}
\label{sec:the_video_coding_system}
\index{video coding system}
The video coding system describes a work-flow to process with video sequences.
It is composed of several stages:
from the video acquisition at the source and to the video display.
A good understanding of each part is crucial in order to be able to take the
proper decisions when delivering a video coding solution.
This section presents a scheme containing the most important concepts
used in state-of-the-art video codecs.
Figure~\ref{fig:video_coding_system} describes a way in which a complete video
coding system can be organised as a block diagram.
The composing blocks are explained more in details in the following
subsections.
\begin{figure}[tb]
\centering
\ifthenelse{\usepdfs = 0}
{\input{./figures/video_coding_system.tex}}
{\includegraphics{./figures/video_coding_system.pdf}}
\caption{Video coding system}
\label{fig:video_coding_system}
\end{figure}
\subsection{Pre-processing}
\label{sub:pre_processing}
Once the digital video has been captured at the source, which may be of many
different sorts such as natural scenes or synthetic computer-generated
content, it may need to be processed in order to be encoded.
Usually, the pre-processing stage can include some filtering, scaling and
colour space conversions on the raw (uncompressed) video sequence.
\subsubsection{Up-sampling and down-sampling}
\label{ssub:up-sampling_and_down-sampling}
The video captured at the source might not have the desired resolution.
Consequently, scaling operations should be applied at source to assure the
minimum impact to quality.
The sequences used in this thesis are centred around the \ac{HD} formats (720p
and 1080p).
However, other resolutions will also be considered, such as WVGA
($800\times480$) and WQXGA ($2560\times1600$).
\subsubsection{Colour space conversions}
\label{ssub:colour_space_conversions}
\index{colour space}
\index{HVS}
\index{YUV}
\index{luma}
\index{chroma}
\index{RGB}
\index{Y'CbCr}
A colour space is an abstract mathematical model specified by primary colours
that is used to describe a representation of a picture, usually as a set of
tristimulus values.
Colour space conversions are used to change the colour representation of the
content to better fit the \ac{HVS} and to decorrelate the components for
better coding efficiency.
Often, the colour space at the source is in the \ac{RGB} format, but since the
\ac{HVS} is more sensible to light variations than to colour variations, the
grey scale version of the image, which contains the light information (luma)
is separated from the colour information
(chroma)~\cite{poynton-95-color-space}.
The family colour spaces where the luma is separated from chroma are usually
referred to as YUV colour spaces.
This way the colour information can be sub-sampled without any visual
degradation.
As a result, in this thesis, the improvements on video coding quality will
focus the luma component, since it plays a more important role in perceptual
quality and represents an important part in the final bitstream.
\subsection{Encoding}
\label{sub:encoding}
This stage converts the pre-processed raw video sequence into a coded video
stream (called bitstream) to ease the storage and transmission.
It is, at this point, where the bitrate reduction takes place.
The amount of bits required to represent the video stream is reduced by
limiting the number of redundancies in the video and by introducing some
approximations such as the quantisation step, while limiting the impact of
these approximations on the perceived quality.
A deep look at the inners of a widely used coding scheme is provided
in \S\ref{sec:the_hybrid_video_coding_scheme}.
\subsection{Transmission}
\label{sub:transmission}
\index{random access}
The transmission stage represents the channel through which the encoded
bitstream is made available to the decoder.
The channel can be a physical storage, such as optical discs, or any other
transmission channel: wired/wireless connections with 1 to 1 (unicast) or 1 to
many (multicast/broadcast) transmissions.
Depending on the application and the channel, the behaviour of the encoder may
vary:
a video that is encoded for storage purposes will not have a real time
constraint, present in streaming applications.
For example, the \ac{RA} technique, i.e.\ the ability to access to a
particular piece of a video sequence, needs to be guaranteed for some
applications like TV, while the latency needs to be kept as low as possible to
enable services like surveillance or video conferencing systems.
This latency limitations need to be taken into account at the encoding stage.
\subsection{Decoding}
\label{sub:decoding}
As the stream is received by the decoder, it is buffered and used to
reconstruct the encoded data into the appropriate format, as signalled
by the encoder.
\acs{MPEG} and \acs{ITU} video coding standards define two key points:
the bitstream conveying the
compressed video data and the bit-exact decoding process, aiming at recovering
the sequence of images.
\acs{MPEG} (the \acs{JCT} 1/SC 29/WG 11 of the \acs{ISO}/\acs{IEC}
organisation) and \acs{VCEG} (the question 6 of ITU-T SG 16) are the main
organisations specifying video coding algorithms.
The most recent video coding specifications include the \acs{MPEG}-4 part 10
standard / \acs{ITU} H.264 recommendation, known as \ac{AVC}~\cite{itu-03-avc}
or \acs{MPEG}-H Part 2 / \acs{ITU} H.265, called
\acf{HEVC}~\cite{itu-13-hevc}.
The encoder must comply with this specification by generating a decodable
stream, there is no other normative behaviour defined by a video coding
standard.
\subsection{Post-processing}
\label{sub:post_processing}
The post-processing stage performs operations for image enhancement and
display adaptation, such as converting back to the original colour
space and to the display format.
\section{The hybrid video coding scheme}
\label{sec:the_hybrid_video_coding_scheme}
\index{hybrid video coding scheme}
\index{MPEG}
\index{H.261}
\index{AVC}
\index{H.264}
\index{HEVC}
\index{H.265}
State-of-the-art video coding standards such as H.264/\acs{MPEG}-4 \acs{AVC}
and \acs{HEVC} use a hybrid video coding scheme.
The overall coding structure appeared in H.261, in 1988.
Since then, all video coding standards and recommendations issued by
\acs{VCEG} and the \acs{MPEG} use this coding structure~\cite{wien-15-hevc}.
The hybrid video coding scheme is named after its use of both temporal
prediction and transform coding techniques for the prediction error.
A basic structure of the hybrid video coding scheme is presented in
figure~\ref{fig:hybrid_video_coding_scheme}.
\begin{figure}[tb]
\centering
\ifthenelse{\usepdfs = 0}
{\input{./figures/hybrid_video_coding_scheme.tex}}
{\includegraphics{./figures/hybrid_video_coding_scheme.pdf}}
\caption{Hybrid video coding scheme}
\label{fig:hybrid_video_coding_scheme}
\end{figure}
The hybrid video coding scheme provides an efficient way of compressing
a video signal into a bitstream of the smallest possible size for a given
level of fidelity.
The key features to achieve such a small bitstream are the signal
prediction and the transformation and quantisation of the prediction error.
A decoder is included in the encoder, represented inside a blue box, to be
able to perform its coding decisions based on what the decoder would do
while decoding a bitstream.
The building blocks of the hybrid video coding scheme are explained in
the following subsections.
\subsection{Partitioning}
\label{sub:partitioning}
\index{partitioning}
\index{quad-tree}
\index{CTB}
\index{CB}
\index{PU}
\index{TU}
In order to process the video frames, they are exhaustively partitioned into
non-overlapping blocks.
Sub-partitioning a picture allows for a better matching to the spatial
distribution of energy.
The subdivision blocks ease, as well, the succeeding stages of prediction and
transform:
the blocks can be processed, under some constraints, independently so that a
parallel processing is made possible.
The partitioning does not necessarily imply same-sized blocks, allowing
rectangular blocks of different sizes to be used, as illustrated in
figure~\ref{fig:part_orig_pred_res_image} (a).
This is due to the quad-tree partitioning that is implemented in \ac{HEVC}.
The figure provides a partitioning for a certain level of quantisation.
It can be seen that the picture has been divided into uniform regions.
The optimal choice of the block size is left to the encoder, through \acp{CTB}
in \ac{HEVC}.
Each \ac{CTB} can be split into four \acp{CB}, which can also be split into
four \acp{PU}, which can finally be split into four \acp{TU}.
\subsection{Prediction}
\label{sub:prediction}
\index{prediction}
Instead of coding the blocks coming from the original source picture directly,
the encoder computes an estimation of the block samples, which is then
subtracted from the original pixels, generating the residual block.
This technique is known as \ac{DPCM} in the literature~\cite{cutler-52-dpcm}.
Those block estimations are carried out by the prediction module, using some
information from previously processed blocks.
This way, predictable information present in the original blocks is removed
and the energy of the resulting signal is lowered so that it requires fewer
bits for a given distortion.
This is a result of the source coding
theory~\cite{jayant-84-digital-coding-waveforms}.
In other terms, the prediction mechanism aims at removing inter-block
redundancies in the video signal.
Predictions must be performed the same way at both encoder and decoder side,
and thus computed inside the blue box in
figure~\ref{fig:hybrid_video_coding_scheme}, referring to the decoder.
For this reason, the encoder uses reconstructed blocks (blocks that have
already been encoded) as the input data to compute the predictions, as these
blocks are equivalent to those the decoder will handle.
Commonly, predictions are of two types, depending on the origin of
the prediction source:
\begin{itemize}
\item Intra prediction, also called spatial prediction, for those blocks
predicted using information within the same picture.
\item Inter prediction, also called temporal prediction, for those blocks
predicted from pictures other than the one under consideration.
\end{itemize}
Figure~\ref{fig:part_orig_pred_res_image} (b) provides an example of an intra
predicted picture using the partitioning from
figure~\ref{fig:part_orig_pred_res_image} (a).
Figure~\ref{fig:part_orig_pred_res_image} (c) displays the residual picture (the
difference between the original and predicted pictures).
This image evidences the parts of the picture that could not be predicted
properly and will have to be encoded and transmitted.
\begin{figure}[tb]
\centering
\subfloat[Partitioning]
{\includegraphics[width=0.5\linewidth]{./figures/partitioning-orig-all-001.png}}
\\
\subfloat[Predicted Picture]
{\includegraphics[width=0.5\linewidth]{./figures/pred_image-all-001.png}}
\\
\subfloat[Residual Picture]
{\includegraphics[width=0.5\linewidth]{./figures/res_image-all-001.png}}
\caption{Example of a picture at different encoding points for \acs{AI}
main configuration at QP 32}
\label{fig:part_orig_pred_res_image}
\end{figure}
\subsubsection{Intra prediction}
\label{ssub:intra_prediction}
\index{intra prediction}
Intra prediction, sometimes referred to as spatial prediction, is used to
eliminate spatial redundancies by removing the correlation within local
regions of a picture.
The basic principle of intra prediction is based on the fact that the texture
of a picture region is similar to the texture of its neighbourhood and can be
predicted from there.
Pictures coded using this technique exclusively are named I-slices.
Different models of predictions can be used through projections of adjacent
decoded blocks.
These models include directional projections, gradient projections (called
Planar) and the projection of the mean value (called DC).
The \acp{IPM} are used to derive the predictions of the current block from its
available boundaries, formed by reconstructed blocks.
\subsubsection{Inter prediction}
\label{ssub:inter_prediction}
\index{inter prediction}
Inter prediction, or temporal prediction, takes advantage of the fact that
temporally close pictures share many similarities, and that some of their
component regions will move as a whole.
Since the encoding order is not necessarily the same as the viewing one, inter
predictions can have their origins in either past or future frames, and also
combine both origins.
This feature facilitates movement tracking across frames.
Pictures that are predicted using only one picture either from the past or
from the future are called predicted pictures or P-slices.
Bi-predicted pictures or B-slices have prediction origins in two different
pictures.
At the encoder side, an extra operation, called motion estimation, is
carried out.
This stage searches the best matching area in the reference picture for the
current prediction block.
It is one of the most complex parts of video coders in terms of computational
requirements.
Once a good prediction has been found, a motion vector is created, indicating
the offset that has to be applied in the block from the reference picture.
\subsection{Transform}
\label{sub:transform}
\index{transform}
\index{energy compaction}
The transform stage reduces the remaining correlations from the residual
block, computed as the difference between the original and the predicted
blocks.
The goal of the transform is to concentrate the residual signal into as few
coefficients as possible in the transform domain.
In the spatial domain, the residual signal is spread among the samples of the
blocks, while the objective of transform domains is to concentrate as much as
possible the residual signal into a few transform domain coefficients
exhibiting a large amplitude and the rest might be considered negligible.
This idea of the energy compaction is the main property of the transform
stage.
Most of the transforms used in standardised video coding schemes belong to the
\ac{DTT} family.
Amongst those, the \ac{DCT} of type II has received a considerable amount of
attention in the past and is the \emph{de facto} standard transform used in
\acs{ITU} and \acs{MPEG} codecs since \acs{MPEG}-1/H.261.
Additional choices were introduced recently, especially in \ac{HEVC}, where
the \ac{DST} of type VII was adopted.
Provided that the subject of this thesis is centred around the
transforms for video coding, the transform stage will be explained
thoroughly in Chapter~\ref{cha:transform_coding}.
\subsection{Quantisation}
\label{sub:quantisation}
\index{QP}
\index{quantisation}
\index{lossy}
\index{lossless}
The quantisation, applied in the transform domain, is used as an approximation
operator, reducing the amount of possible output values.
In standards like \ac{HEVC}, the quantisation is scalar:
each coefficient is approximated independently from its neighbouring values.
In these coding schemes, the quantisation step is controlled by a \ac{QP} that
discards any coefficient whose energy level is below a certain threshold.
High energy coefficients are also affected by the quantisation.
It is worth noticing that the quantisation is the only non-reversible step in
the whole hybrid video coding scheme, which induces lossy video coding.
Lossless (or near-lossless) video coding can be attained by not using
quantisation in the process.
\subsection{Loop filters}
\label{sub:loop_filters}
\index{Loop filters}
Loop filters have the goal of improving the quality of the reconstructed
picture for display purposes.
Being located within the loop, these filters have a strong impact on the
overall performance of the video coding scheme.
These filters can be classified into two classes, depending on their target
region.
The fist class is applied to a region of a picture or even a complete picture,
whereas filters from the second class operate on a local spatial domain of the
picture.
A detailed explanation on how these filters improve the picture quality is
explained in \S 2.4.6 and \S 9 of~\cite{wien-15-hevc}.
\subsection{Entropy coding}
\label{sub:entropy_coding}
\index{entropy coding}
\index{CABAC}
\index{scanning}
The last operation consists in reducing the amount of bits transmitted through
the use of an entropy code.
This is a lossless operation, as such the bit allocation performed during this
stage is reversible: no approximation operation is performed at this stage.
Once the transform coefficients have been quantised, they are scanned to make
sure they are sorted in a way that will make the entropy coder work
efficiently.
The scanning operation is a conversion from a 2D array, containing the
quantised transformed coefficients, towards a 1D vector containing the same
values sorted in a way that facilitates a compact transmission.
An appropriate scanning is crucial for efficient entropy
coding~\cite{ye-08-intra-directional-scanning-mddt}.
Signalling is also conveyed into the bitstream at this point, and the
entropy coder ensures a correct binarisation while using the adequate
number of bits.
In \acs{HEVC}, the entropy coding is named \ac{CABAC}.
\subsection{Intra coding in \acs{HEVC}}
\label{sub:intra_coding_in_hevc}
\index{PU}
\index{TU}
\index{scanning}
This subsection explains some particularities of the intra coding inside
\ac{HEVC}, as the work carried out in this thesis focuses on the intra
coding and take advantage of them.
With regards to previous standards, such as H.264/\acs{MPEG}-4 \ac{AVC}, which
has 9 different prediction modes for intra coding, \ac{HEVC} has an improved
prediction system, with 35 different prediction modes.
The upper-left part of figure~\ref{fig:mdcs} illustrates them.
A detailed explanation on how predictions are derived from the block
boundaries using those prediction modes can be found in Chapter 6
of~\cite{wien-15-hevc}.
Depending on the selected \ac{IPM}, residuals present different patterns, and
so do their transformed coefficients.
The top-right part of figure~\ref{fig:mdcs} presents the average \ac{HEVC}
$4\times4$ residuals by scanning mode, together with their average
representation in transform domain through the \acs{DST}-VII.
The average residual profiles have lower (dark) values near the available
borders, which increase with the distance from the boundaries: residuals
issued from horizontal and vertical \acp{IPM} only have the left and upper
borders available, respectively, whereas the remaining \acp{IPM} tend to have
both borders available.
It can also be observed that the scanning patterns match reasonably well the
transformed coefficients in each case, sorting them in an increasing order.
These patterns in the transform domain determine different scanning orders (in
different colours), as presented in the lower part of figure~\ref{fig:mdcs}.
An adapted mode to each pattern ensures, in average, a correct order of the
coefficients that will group all the null values together.
The patterns are described for $4\times4$ blocks, and the same pattern is used
on higher block sizes, which are recursively split into 4 sub-blocks until
size $4\times4$ is reached~\cite{sole-12-transform-coefficient-coding}.
\begin{figure}[tb]
\centering
\begin{minipage}{0.48\textwidth}
\ifthenelse{\usepdfs = 0}
{\input{./figures/pred-directions.tex}}
{\includegraphics{./figures/pred-directions.pdf}}
\end{minipage}
\begin{minipage}{0.48\textwidth}