\title{Deep learning on biomedical data with Boltzmann machines}
\author{Stefan Lenz}
\date{\today}
\documentclass[12pt]{article}
\usepackage[ngerman,english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage[toc,page,title]{appendix}
\usepackage[times,inconsolata]{Rd}
\usepackage[parfill]{parskip}
\bibliographystyle{agsm}
\usepackage{amsmath,amssymb}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[nobiblatex]{xurl}
\usepackage{epstopdf}
\usepackage[table]{xcolor}
\usepackage{tabularx}
\usepackage{makecell}
\usepackage{placeins}
\usepackage{setspace}
\usepackage{lscape}
\usepackage[shortlabels]{enumitem}
\usepackage[a4paper,%bindingoffset=0.2in,
left=2.5cm,right=2.5cm,top=2.1cm,bottom=2.5cm]{geometry}
\usepackage{helvet}
\usepackage{pdfpages}
\renewcommand{\familydefault}{\sfdefault}
\usepackage{titlesec}
\usepackage{rotating}
\setcounter{secnumdepth}{4}
\usepackage{afterpage}
\usepackage{caption}
\captionsetup[table]{skip=5pt}
\usepackage[bottom]{footmisc}
\usepackage{scalerel}
\titleformat{\paragraph}
{\normalfont\normalsize\bfseries}{\theparagraph}{1em}{}
\titlespacing*{\paragraph}{0pt}{3.25ex plus 1ex minus .2ex}{1.5ex plus .2ex}
\newcommand{\inlinecode}[1]{\texttt{#1}}
\newcommand{\sigm}{\mathrm{sigm}}
\newcommand{\ELBO}{\mathrm{ELBO}}
\newcommand{\apkg}[1]{\emph{#1}}
\newcommand{\proglang}[1]{#1} % irrelevant; only appears in copied references
\newcommand{\rightpageref}[1]{\hfill $\rightsquigarrow$ p.\ \pageref{#1}}
\newcommand{\rightpagerefs}[2]{\hfill $\rightsquigarrow$ pp.\ \pageref{#1}, \pageref{#2}}
\interfootnotelinepenalty=10000
\DeclareRobustCommand{\bbone}{\text{\usefont{U}{bbold}{m}{n}1}}
\DeclareMathOperator{\EX}{\mathbb{E}}
\newcommand{\circlenum}[1]{\raisebox{.5pt}{\textcircled{\raisebox{-.9pt} {#1}}}}
\newcommand{\intrnv}{\int_{\mathbb{R}^{n_{\scaleto{V}{2.5pt}}}}}
\lstdefinelanguage{Julia}
{morekeywords={begin,abstract,break,catch,const,continue,do,else,elseif,
end,export,false,finally,for,function,global,if,import,let,local,
macro,module,quote,return,struct,true,try,type,mutable,primitive,
using,while},
sensitive=true,
alsoother={$},%$
morecomment=[l]{\#},
morestring=[s]{"}{"},
morestring=[m]{'}{'},
}[keywords,comments,strings]
\lstdefinelanguage{R}
{morekeywords={},
sensitive=true,
alsoother={.},
morecomment=[l]{\#},
morestring=[s]{"}{"},
morestring=[m]{'}{'},
}[keywords,comments,strings]
\lstset{
language = Julia,
basicstyle = \ttfamily,
keywordstyle = \bfseries,
%commentstyle = \itshape,% does not work
showstringspaces = false,
frame = single
}
\lstset{
language = R,
basicstyle = \ttfamily,
keywordstyle = \bfseries,
%commentstyle = \itshape,% does not work
showstringspaces = false,
frame = single,
escapeinside={\%*}{*)}
}
\def\backtick{\`{}}
\lstdefinestyle{rlststyle}{literate={`}{\backtick}1, escapechar=@}
\begin{document}
\includepdf[pages=-]{includes/Deckblatt}
%\maketitle
\tableofcontents
\newpage
\onehalfspacing
\setlength{\emergencystretch}{3em}
\selectlanguage{english}
\section*{Abstract}
%But since these data sets can be very heterogeneous compared to the widely used image data, the choice of hyperparameters for training is especially challenging.
%Therefore the package puts a strong focus on monitoring and evaluating the learning process.
%Primary evaluation criterion is the likelihood of the model, which can be estimated using annealed importance sampling (AIS). We present our approach for adapting the AIS algorithm on multimodal deep Boltzmann machines in detail in the article.
%Additionally to likelihood estimation, this package offers convenience methods to monitor a number of other statistics and it allows to easily extend the monitoring of the training.
Deep Boltzmann machines (DBMs) are considered a promising approach for deep learning on data of small sample size, which is common in biomedical data sets.
To make DBMs available for experimentation, an easily extensible implementation of DBMs was created in the form of the \apkg{BoltzmannMachines} Julia package, which has been registered in the official Julia package repository.
The implementation offers several unique features, in particular for evaluation and monitoring of the training of DBMs.
This is especially important for biomedical data sets, where finding good hyperparameters is hard due to the diversity of the data.
In addition to small sample sizes, data protection constraints may pose an additional challenge for analyzing biomedical data, in particular if the data is distributed across different sites and cannot be pooled for the analysis.
One approach for conducting privacy-preserving analyses in such a setting is to generate synthetic data, which captures the important structure of the original data of the individuals and, at the same time, can be used with fewer data protection constraints than the original data.
For this, DBMs are compared with other popular generative approaches, namely generative adversarial networks (GANs) and variational autoencoders (VAEs) with respect to generative performance and disclosure risk.
In addition to these neural network approaches, simpler methods such as multiple imputation by chained equations (MICE) and independent marginals (IM) are also considered as references.
The experiments show the feasibility of DBMs as a generative approach in distributed settings, even with small sample sizes at different sites.
VAEs also showed a performance comparable to DBMs in this setting, while the other approaches performed considerably worse.
Finally, a ready-to-use implementation of DBMs as a distributed approach is presented.
This software is built in the form of R packages that integrate into the DataSHIELD infrastructure, which provides a framework for conducting privacy-preserving analyses on distributed cohorts.
Since DataSHIELD is based on the R programming language, and there is no DBM implementation in R, a language interface was necessary to bring the functionality of the \apkg{BoltzmannMachines} package to R.
There are two existing packages, \apkg{JuliaCall} and \apkg{XRJulia}, that also make it possible to connect R with Julia, but they did not offer all necessary features.
Therefore, a new solution was created, which provides convenient interactive handling and, at the same time, puts a special focus on the stability that is needed for a production environment.
The resulting \apkg{JuliaConnectoR} R package has been officially registered in the Comprehensive R Archive Network (CRAN).
More generally, this approach of wrapping external deep learning functionality also proves on the technical side that advanced deep learning algorithms can be used for distributed privacy-preserving analyses in DataSHIELD.
\clearpage
\section{Introduction}
Deep learning algorithms have revolutionized image and speech recognition and are the state-of-the-art in many fields for prediction models \citep{goodfellow_deep_2016}.
Advances in bioinformatics have also been made using deep learning \citep{min_deep_2017}.
Yet, there are some challenges that prevent deep learning from being used in biomedical research.
One particularly big challenge is limited data \citep{min_deep_2017}.
While the amount of data in fields such as image and speech recognition has increased dramatically with the content available on the internet, the amount of usable genetic or medical data remains comparatively small.
The problem of small sample sizes is particularly prevalent when dealing with genetic variant data
because there is a high number of genetic variants, and, at the same time, the sequencing of genomes is still relatively expensive.
Moreover, sharing genomic data can threaten the privacy of individuals because it can potentially be used for identifying individuals and predicting their risks of diseases or for discovering other personal traits \citep{bonomi_privacy_2020}.
Therefore, genomic data of humans is subject to data protection in many cases, and cannot be made available in a simple way.
This results in data sets with relatively few samples compared to the number of dimensions in data sets of genetic variants.
Also for clinical data, the availability of high-quality data is one of the key challenges \citep{machine_learning_review_nejm}.
Although there is an abundance of routine care data, sample sizes are often small when it comes to curated data or data from clinical studies.
Additionally, general privacy restrictions make it difficult to use health information of humans for research.
This further limits the amount of accessible data.
For example, the European General Data Protection Regulation (GDPR) requires an explicit consent for analyzing the health care data of individuals, which makes it hard to collect large amounts of data that can be used for various research purposes \citep{rumbold_effect_2017, shabani_re-identifiability_2019}.
Thus, approaches that can yield good results with only a limited amount of data are needed in many cases when applying deep learning on biomedical data.
Unsupervised learning is particularly interesting if there is little or no labeled data.
In contrast to prediction models, which require labeled data as input, unsupervised learning techniques aim at capturing the structure of the input data without a specific prediction task.
One class of unsupervised deep learning approaches are {\em generative models}, which are able to generate new data according to a distribution that is learned from input data \citep{sejnowski_unsupervised_1999}.
Generative models have made their appearance in mainstream media when an artist trained a generative model on images of paintings, used it to create new paintings, and sold one of these paintings for \$432,000 at the auction house Christie's \citep{cohn_ai_2018}.
The generative approach used there is called generative adversarial network (GAN) \citep{goodfellow_generative_2014}.
Other generative approaches in the field of deep learning are variational autoencoders (VAEs) \citep{Kingma2013} and deep Boltzmann machines (DBMs) \citep{salakhutdinov2009DBMs}.
In a comparison of GANs, VAEs, and DBMs on genetic variant data of different rather small sample sizes, DBMs showed a good generative performance \citep{nussberger_synthetic_2020}.
Besides generating data, DBMs and VAEs can be used for dimensionality reduction by analyzing the activation in higher-level layers of the network, which encode features of increasing complexity in the data.
This has, e.g., been applied to find patterns in gene expression data \citep{ding_interpretable_2018, hess_exploring_2020}.
Unsupervised learning is also called self-supervised learning if the model is designed to predict parts of the input from other parts \citep{lecun_selfsupervised_2019}.
One advantage of DBMs is that it is particularly easy to perform such predictions of parts of data based on other parts.
This can be achieved via conditional sampling, i.e., drawing from the learned distribution conditioned on certain values of input variables \citep{salakhutdinov2015generativemodels}.
This flexibility in conditional sampling distinguishes DBMs from standard VAEs and GANs, where such conditioning is not directly possible.
Variants of VAEs and GANs, called {\em conditional VAEs} \citep{cvae} and {\em conditional GANs} \citep{cgan}, have been proposed to address this.
The latter has also been applied to gene expression data \citep{wang_conditional_2018}.
There, the prediction of gene activation from so-called landmark genes, which are suspected to contain most of the information of the genome, has been improved.
For this, and in conditional VAEs and conditional GANs in general, the variables that are used as conditions need to be specified in the training.
DBMs have the advantage that the designation of certain variables as conditions is not necessary and conditions can be set on arbitrary variables.
Conditional sampling in DBMs could therefore be used to simulate the regulation of arbitrary gene expression patterns based on a DBM model that has been trained on genetic expression data.
Such a model that has learned the overall structure of a data set is then also able to utilize the information about the whole distribution when making inferences about smaller subgroups.
Another possible application of generative models is to create synthetic data for privacy-preserving analyses.
Synthetic data is data that has properties similar to original data but that is not linked to individual samples of the original data.
In the same way as the previously mentioned generated paintings are not simply modified versions of original paintings but entirely new fabrications,
synthetic patient data is not created from single patients but from the distribution of patient characteristics that has been learned by a generative model.
The advantage of synthetic data is that it can be used with less data protection restrictions than the original data as it does not contain individual-level data of real individuals.
This can be useful in settings where data is distributed and sharing the data of individuals is prohibited due to privacy concerns.
For example, in the MIRACUM consortium \citep{prokosch_miracum_2018}, which includes ten German University hospitals,
one of the goals is to jointly analyze data that is distributed across multiple sites.
However, pooling the data of all patients for joint analysis is not possible because of data protection laws.
Therefore, alternative ways of analyzing the data must be sought.
One way to conduct analyses across distributed data is provided by the DataSHIELD software \citep{gaye_datashield,budin-ljosne_datashield}.
This has been used in several multicenter projects conducting epidemiological studies \citep{doiron_datashield_2013, pastorino_datashield_2019, oluwagbemigun_datashield_2020}.
With DataSHIELD, many common statistical analyses can be conducted.
Generating synthetic data in DataSHIELD has also already been explored for enabling analyses that were previously not possible \citep{bonofiglio2020}.
The question here is whether deep learning techniques can also be brought to DataSHIELD for creating synthetic data, and whether they pose an increased risk of privacy violations.
%Multimodal DBMs have been used by Srivastava et al. for image captions \citep{srivastava2012multimodal}.
%Here we want to broaden the scope a little bit and present an extensible approach for creating deep Boltzmann architectures with different types of visible units and flexible partitioned hidden layers. Training of such architectures ....
%Those architectures are not only useful for putting data of different types in one model but also in cases where a natural partitioning of the data can be derived from domain knowledge. With partitioning one can greatly reduce the number of parameters in the model. This allows training models on data sets where the sample size would otherwise be too small to fit a full model.
%Idee: aus motivation von Bioinf App not fail: VAE and GAN backpropagation, DBMs harder to train and the algorithm is complex.
%Idee: DBMs multimodal , e. g. clinical, gene expression, and single nucleotide polymorphism (SNP) data
Deep Boltzmann machines thus exhibit promising qualities for deep learning on biomedical data, where limited sample sizes and privacy restrictions are common challenges.
An additional big challenge for deep learning on biomedical data is finding good hyperparameters for the optimization algorithms \citep{min_deep_2017}.
Since biomedical data sets can be very diverse, default values for hyperparameters may often not be the best choice and the ability to determine good hyperparameters is especially important there.
As the success of the learning depends critically on good hyperparameters, a special emphasis needs to be put on evaluating the learning.
\subsection{Aims of the thesis}
Small sample sizes and privacy restrictions are major obstacles for performing deep learning on biomedical data \citep{min_deep_2017}.
Using synthetic data is a possibility for analyzing distributed data sets that cannot be pooled due to privacy restrictions \citep{manriquevallier_bayesian_2018, quick_generating_2018, goncalves}.
DBMs are a promising approach for modeling the distribution of high-dimensional data of small sample size \citep{hess2017partitioned, nussberger_synthetic_2020}.
It is to be shown that DBMs can be used for creating synthetic data that capture the structure of the original data while also ensuring that using DBMs for this purpose does not pose an intolerable privacy risk.
More generally, this work aims to investigate how DBMs can be used as generative models for deep learning on biomedical data and how they can become an accessible and practically usable tool for this purpose.
In a first step, the algorithms for training and evaluating DBMs need to be implemented in a way that they are easy to use and suitable for experimentation (Section \ref{bmpart}).
This implementation can then be used to examine the hypothesis that DBMs can produce useful synthetic data in scenarios with small sample size and also with distributed data (Section \ref{simuexp} and Section \ref{realexp}).
For employing DBMs in settings with sensitive health information, the disclosure risk of synthetic data created by DBMs must be investigated as well.
For this purpose, DBMs are compared with VAEs, GANs, and simpler approaches, such as multiple imputation by chained logistic regression models \citep{mice}, with respect to their generative performance and with respect to the disclosure risk of synthetic data that has been generated by these approaches (Section \ref{realexp}).
After this, DBMs can be considered for generating synthetic data in the DataSHIELD infrastructure for distributed privacy-preserving analyses.
Since DataSHIELD is based on the R programming language, the algorithms for training and evaluating DBMs must also be made available in R (Section \ref{juliaconnectorpart}).
Finally, a solution for applying DBMs via DataSHIELD can be provided (Section \ref{dsBoltzmannMachinesImpl}).
\clearpage
\section{Deep Boltzmann machines}\label{bmpart}
\subsection{Theoretical background}\label{bmtheory}
Boltzmann machines \citep{ackley_boltzmann_1985} are a special kind of neural network.
Artificial neural networks have been developed for solving complex problems by imitating structures of the nervous system of humans and animals \citep{mcculloch_logical_1943}.
Boltzmann machines have been derived from models in statistical physics, which have some interesting parallels with networks of neurons.
In physics, Ising models aim at modeling the magnetic spin in lattices of atoms or molecules in ferromagnetic crystals \citep{isingmodel}.
In Ising models the probability of the magnetic spin of molecules in a lattice of molecules is determined by an energy function that assigns an energy to each possible configuration of magnetic spins in the lattice.
Configurations with a lower energy have a higher probability of occurring.
This connection between energy and probability also holds for Boltzmann machines.
It makes Boltzmann machines a type of so-called ``energy-based'' models \citep{ranzato_ebm}.
The analogue of a lattice of molecules with magnetic spin is a network of neurons in the nervous system, where each cell has a neural activation.
The analogy goes further.
Neural networks receive input from data in special nodes.
The information about the data can then be learned by the network.
In the physics analogue, this corresponds to parts of the lattice that are influenced by an external magnetic field.
In both cases, the information about the input is encapsulated in the parameters of the model, which influence the energy function and thereby determine the probability of the activations in the network.
With the information contained in the model parameters, a trained Boltzmann machine can be used for generating new samples according to the distribution that the model has learned from the original data.
With this ability, Boltzmann machines can be employed as {\em generative models}.
The following parts of this section give an overview of the mathematical definitions surrounding Boltzmann machines that will be necessary to understand how Boltzmann machines can be trained and used.
This also includes the definition of some terminology that is commonly used in the literature about Boltzmann machines.
\subsubsection{Definition of Boltzmann machines}\label{basicbmproperties}
A Boltzmann machine model with parameters $\theta$ defines a probability distribution $p(x)$ for a random variable $x = (v, h)$ on a probability space $\Omega$ via the energy function $E(v, h; \theta)$:
\begin{equation}
p(x) = p(v, h) = \frac{e^{-E(v,h)}}{Z}.
\label{eqn:probbm}
\end{equation}
The normalizing factor $Z$, the so called \emph{partition function}, is defined as $Z = \int_{\Omega} e^{-E(x)} dx$.
In case of a discrete probability distribution, this can be written as $Z = \sum_{v,h}e^{-E(v,h)}$ where the sum goes over all possible realizations of $v$ and $h$.
The term in the numerator, $p^*(v,h) = e^{-E(v,h)}$, is called the \emph{unnormalized probability}.
The probability space can be divided into dimensions of observed variables (subsumed in vector $v$) and hidden/latent variables (in $h$), corresponding to visible and hidden nodes in the graphical representation, see Figure \ref{fig:bmsoverview}.
The so called \emph{free energy}, a notation also inspired by physics, is defined as
\[
F(v) = - \log \sum_h e^{-E(v, h)}.
\]
With this definition, we can rewrite the formula of the partition function as $Z = \sum_v e^{-F(v)}$.
That is useful because the formula for the free energy of restricted Boltzmann machines can be simplified by using the layerwise structure (see e.g.~\cite{martens_representational_2013}, Appendix A.1).
Thus the complexity of calculating the free energy becomes linear regarding the number of hidden nodes, as can be seen in the formulas for the free energy in the different types of models that are described below.
If the partition function $Z$ is given, the log-likelihood
\begin{equation}
\log p(v) = - F(v) - \log Z
\label{eqn:pRBMfreeenergy}
\end{equation}
can therefore be calculated efficiently using the free energy. The free energy is also used for calculating unnormalized probabilities $p^*(v) = e^{-F(v)}$, which is, e.g., used in the annealed importance sampling algorithm (see \ref{methodAIS}) for estimating the partition function via a stochastic algorithm.
With these basic properties defined, we can go further to take a look at the details of specific types of Boltzmann machines.
Here only the special cases of restricted Boltzmann machines and (multimodal) deep Boltzmann machines are considered.
These models have restrictions on their parameterization compared to a general Boltzmann machine.
The restrictions correspond to a layered design of their graphs.
For an intuitive overview of the architectures of the different types that are considered here, see Figure \ref{fig:bmsoverview}.
\begin{figure}[h]
\centering
\includegraphics[scale=3.]{images/BMsOverview.eps}
\caption{Graph view on different types of Boltzmann machines. Visible units (i.e., input units) are depicted as nodes with doubled circle lines. Hidden units are simple circles.
{\bf A}: General Boltzmann machine, with all nodes connected to each other. {\bf B}: Restricted Boltzmann machine (RBM). Each of the lines corresponds to a weight, i.e., an entry in the weight matrix $W$, like in the formulas in (\ref{eqn:energyformularbm}) and (\ref{eqn:energyformulagbrbm}).
{\bf C}: Deep Boltzmann machine. From a graph perspective, this is simply a stack of RBMs.
{\bf D} and {\bf E}: Multimodal/partitioned deep Boltzmann machines. In a multimodal DBM, the different partitions of the visible nodes may also have different distributions.}
\label{fig:bmsoverview}
\end{figure}
\subsubsection{Basic properties of restricted Boltzmann machines}\label{rbmtypes}
\emph{Restricted Boltzmann machines} \citep{smolensky1986foundations}, or RBMs, consist of two layers, a visible layer $v$, which receives the input data, and a hidden layer $h$, which encodes latent features of the data. For a graphical depiction, see Figure \ref{fig:bmsoverview}B.
The nodes, also called ``units'' in neural network terminology, play the role of neurons.
The values of the nodes correspond to the activation of neurons.
The nodes are divided into several layers.
The nodes inside the same layer are not connected to each other, and therefore, the network has the form of a bipartite graph \citep{diestelgraph}.
Although other distributions are also possible for the hidden nodes \citep{hinton_practical_2012}, only RBMs with binary hidden nodes following a Bernoulli distribution are considered here.
The types of RBMs presented in the following may, however, differ in the distribution of their visible nodes, which makes them suitable for modeling different types of input data.
\paragraph{Bernoulli distributed visible nodes}
The most basic model in this class is the restricted Boltzmann machine with Bernoulli distributed nodes, most of the time simply called a restricted Boltzmann machine. Its energy function is of the form
\begin{equation}
E(v,h) = - a^T v - b^T h - v^T W h. \label{eqn:energyformularbm}
\end{equation}
The parameters of the model are $\theta = (W, a, b)$ with \emph{weight matrix} $W$, vector $a$ as the \emph{visible bias}, and $b$ as the \emph{hidden bias}.
The weights correspond to the connections between the visible and hidden units in a weighted graph, as shown in Figure \ref{fig:bmsoverview}.
The bias variables are usually not depicted in graph views of neural networks, but they are equivalent to adding nodes to the model that always have the value one. The visible bias corresponds to the weights of the connections to an additional node that is connected to all nodes of the visible layer, and the hidden bias corresponds to the weights of the connections to an additional node that is connected to all nodes of the hidden layer. The bias variables serve to set a basic level of activation of the nodes, which is then modified by the input from the connected units.
To see the mapping from the parameters in the formula to the network view, see Figure \ref{fig:rbmweights}.
\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/rbmweights.pdf}
\caption{The nodes and parameters of a restricted Boltzmann machine in the graph view. The second visible node and the first node in the hidden layer are connected with weight $w_{21}$ from the weight matrix $W$. The base-level activity of the nodes $v_2$ and $h_1$ is determined by the biases $a_2$ and $b_1$, respectively.}
\label{fig:rbmweights}
\end{figure}
The formulas for the conditional distributions $p(h | v)$ and $p(v | h)$ can be derived from (\ref{eqn:energyformularbm}). (For a detailed derivation see, e.g., \cite{krizhevsky2009tinyimagesthesis}, Section 1.4).
Employing the sigmoid function $\sigm(x) = \frac{1}{1+ e^{-x}}$, the resulting conditional distributions can be written as
\begin{equation}
p(v_i | h) = \sigm ((a + W h)_i)
\quad \text{and}\quad
p(h_i | v) = \sigm ((b + W^T v)_i).
\label{eqn:condprobrbm}
\end{equation}
The free energy is
\begin{equation}
F(v) = - a^T v - \sum_{j=1}^{n_H} \log \left (1 + e^{(W^T v + b)_j}\right).
\label{eqn:freenergy_rbm}
\end{equation}
The trick for deriving the formula of the free energy is to ``analytically sum out'' the hidden units \citep{sala2012anefficient}.
The master's thesis of \cite{krizhevsky2009tinyimagesthesis} is a very good resource for looking up the complete derivations of these formulas. This holds not only for the theory of RBMs with Bernoulli distributed nodes but also in particular for RBMs with Gaussian distributed visible nodes, which are presented next.
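Before moving on, the formulas above can be made concrete with a small code sketch. The following minimal Julia sketch shows how the conditional probabilities (\ref{eqn:condprobrbm}) and the free energy (\ref{eqn:freenergy_rbm}) of an RBM with Bernoulli distributed nodes could be computed; the function names are chosen for illustration and are not necessarily those of the \apkg{BoltzmannMachines} package.
\begin{lstlisting}[language=Julia]
# Parameters: weight matrix W (n_V x n_H), visible bias a, hidden bias b
sigm(x) = 1 ./ (1 .+ exp.(-x))

# conditional probabilities p(h | v) and p(v | h)
hiddenpotential(v, W, b) = sigm(b .+ transpose(W) * v)
visiblepotential(h, W, a) = sigm(a .+ W * h)

# free energy F(v), with the hidden layer summed out analytically
freeenergy(v, W, a, b) =
    - sum(a .* v) - sum(log.(1 .+ exp.(transpose(W) * v .+ b)))
\end{lstlisting}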
\paragraph{Gaussian distributed visible nodes}\label{gaussianrbm}
One approach for modeling continuous values for $v$ are restricted Boltzmann machines with Gaussian distributions of the visible variables and Bernoulli distributed hidden variables. \cite{krizhevsky2009tinyimagesthesis} defines the energy of the model as:
\begin{equation}
E(v,h) = \sum_{i=1}^{n_V}\frac{(v_i - a_i)^2}{2\sigma_i^2} - b^T h - \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} \frac{v_i w_{ij} h_j}{\sigma_i}
\label{eqn:energyformulagbrbm}
\end{equation}
The number of visible and hidden nodes is denoted as $n_V$ and $n_H$, respectively. The parameters of this model are $\theta = (W, a, b, \sigma)$ with weight matrix $W$, visible bias $a$, hidden bias $b$, and standard deviation $\sigma$. The role of $\sigma$ as standard deviation becomes clearer by looking at the conditional distributions $p(v_i | h)$, which are equal to the normal distributions $\mathcal{N}(a_i + \sigma_i(Wh)_i, \sigma_i^2)$. For the hidden nodes, one can show that $p(h_j | v) = \sigm \left (b_j + \sum_{i=1}^{n_V} w_{ij} \frac{v_i}{\sigma_i} \right )$.
The free energy is
\begin{equation}
F(v) = \sum_{i=1}^{n_V}\frac{(v_i - a_i)^2}{2\sigma_i^2} + \sum_{j=1}^{n_H} \log \left (1 + e^{b_j + \sum_{i=1}^{n_V} \frac{v_i}{\sigma_i} w_{ij}} \right)
\label{eqn:freenergy_gbrbm}
\end{equation}
\citep{krizhevsky2009tinyimagesthesis}.
\cite{cho2011improved} proposed a different parameterization for the energy function to improve the training of RBMs with Gaussian visible nodes:
\begin{equation}
E(v,h) = \sum_{i=1}^{n_V}\frac{(v_i - a_i)^2}{2\sigma_i^2} - b^T h - \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} \frac{v_i w_{ij} h_j}{\sigma_i^2}
\label{eqn:energyformulagbrbm2}
\end{equation}
In this model, the distribution of the visible nodes $v$ conditioned on the hidden nodes $h$ is the multivariate normal distribution $\mathcal{N}(a + Wh, \sigma^2)$.
The conditional distribution of the hidden nodes is $p(h_j | v) = \sigm \left (b_j + \sum_{i=1}^{n_V} w_{ij} \frac{v_i}{\sigma_i^2} \right )$, and the free energy is
\begin{equation*}
F(v) = \sum_{i=1}^{n_V}\frac{(v_i - a_i)^2}{2\sigma_i^2} + \sum_{j=1}^{n_H} \log \left (1 + e^{b_j + \sum_{i=1}^{n_V} \frac{v_i}{\sigma_i^2} w_{ij}} \right).
\end{equation*}
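To make the sampling in this parameterization concrete, the following minimal Julia sketch draws from the conditional distributions $p(h \mid v)$ and $p(v \mid h)$ of an RBM with Gaussian visible nodes. The function names are hypothetical and chosen for illustration only.
\begin{lstlisting}[language=Julia]
sigm(x) = 1 ./ (1 .+ exp.(-x))

# draw h given v:  p(h_j = 1 | v) = sigm(b_j + sum_i w_ij v_i / sigma_i^2)
function samplehidden(v, W, b, sigma)
    p = sigm(b .+ transpose(W) * (v ./ sigma .^ 2))
    float.(rand(length(p)) .< p)         # independent Bernoulli draws
end

# draw v given h:  v | h is normally distributed with mean a + W h
samplevisible(h, W, a, sigma) = a .+ W * h .+ sigma .* randn(length(a))
\end{lstlisting}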
% _maybe cite:
%\citep{melchior_gbrbm}, \citep{cho_gaussiandbm}
% _maybe mention alternative: use continuous as Gaussian hard., verweis auf entsprechenden results teil
\subsubsection{Gibbs sampling in restricted Boltzmann machines}\label{gibbssamplingrbm}
In the previous sections concerning RBMs with different input distributions, one could see that it is possible to specify formulas for the conditional probabilities that, in all cases, can be evaluated with the algorithmic complexity of a matrix multiplication.
However, the formula for the simple probability $p(v)$ of a sample $v$ requires calculating the partition function $Z$ (see Equation (\ref{eqn:probbm})), and calculating $Z$ is not feasible for most practical scenarios, as the number of summands in $Z$ grows exponentially with the number of nodes in the model (see also later in \ref{methodExactloglik}).
This means that using RBMs as generative models, i.e., drawing samples according to the distribution captured in its parameters, is not as straightforward as calculating $p(v)$ and sampling according to the distribution.
Due to the layered structure of RBMs and the closed form (\ref{eqn:condprobrbm}) of the conditional distributions $p(v \mid h)$ and $p(h \mid v)$, which is fast to evaluate, it becomes possible to efficiently apply a Markov chain Monte Carlo technique \citep{mcmc_handbook_2011} called \emph{Gibbs sampling} \citep{gibbssamplingorig} for approximating the distribution of the RBM.
The Gibbs sampling algorithm for RBMs can be formulated as follows\footnote{The tilde over the variable symbol is used here and in the following to denote that the variable stands for a concrete realization and is not a random variable.}:
\begin{enumerate}
\item Start with an arbitrary $\tilde{v}^{(1)}$.
\item Draw $\tilde{h}^{(1)}$ according to $p(h \mid \tilde{v}^{(1)})$.
\item Draw $\tilde{v}^{(2)}$ according to $p(v \mid \tilde{h}^{(1)})$.
\item Repeat steps 2 and 3 until convergence.
\item Result: After $n$ steps, $\tilde{v}^{(n)}$ and $\tilde{h}^{(n)}$ have been drawn according to the joint distribution $p(v,h)$ of the model.
\end{enumerate}
This works because the iteration forms a Markov chain that has the distribution $p(v,h)$ of the model as its equilibrium distribution.
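A minimal Julia sketch of this procedure, assuming the helper functions \inlinecode{hiddenpotential} and \inlinecode{visiblepotential} from the sketch in Section \ref{rbmtypes}, could look as follows; it is only an illustration and not the implementation of the \apkg{BoltzmannMachines} package.
\begin{lstlisting}[language=Julia]
# draw a vector of independent Bernoulli variables with probabilities p
bernoulli(p) = float.(rand(length(p)) .< p)

function gibbssample(v, W, a, b; nsteps = 1000)
    h = bernoulli(hiddenpotential(v, W, b))
    for step in 1:nsteps
        v = bernoulli(visiblepotential(h, W, a))
        h = bernoulli(hiddenpotential(v, W, b))
    end
    v, h   # approximately distributed according to p(v, h)
end
\end{lstlisting}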
\paragraph{Conditional sampling}\label{condsamplingrbm}
The Gibbs sampling algorithm can also be easily modified for sampling conditionally on the activations of input nodes by clamping the activations of the nodes that are to be conditioned on and running the Gibbs sampling algorithm in the rest of the network \citep{srivastava2012multimodal}.
More formally, to draw from the probability $p(v, h \mid \tilde{v}_C)$ that is conditioned on the activations $\tilde{v}_C$ of a set $C$ of visible nodes, the Gibbs sampling algorithm above can be run with $v_C$ set to $\tilde{v}_C$ in step 1 and after each sampling step 3.
This, of course, works analogously for hidden nodes as conditions.
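For illustration, the Gibbs sampling sketch above could be modified for conditional sampling by clamping the visible nodes with indices in a set \inlinecode{C} to fixed values \inlinecode{vC}. This is again only a sketch with hypothetical names, reusing \inlinecode{bernoulli}, \inlinecode{hiddenpotential}, and \inlinecode{visiblepotential} from the previous sketches.
\begin{lstlisting}[language=Julia]
function gibbssampleclamped(v, W, a, b, C, vC; nsteps = 1000)
    v = copy(v)
    v[C] .= vC                           # impose the condition
    h = bernoulli(hiddenpotential(v, W, b))
    for step in 1:nsteps
        v = bernoulli(visiblepotential(h, W, a))
        v[C] .= vC                       # re-impose the condition
        h = bernoulli(hiddenpotential(v, W, b))
    end
    v, h
end
\end{lstlisting}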
\subsubsection{Training of restricted Boltzmann machines}\label{rbmtraining}
The goal of the training procedure of restricted Boltzmann machines is to maximize the likelihood $\prod_{k=1}^n p(\tilde{v}^{(k)})$ for a given data set $(\tilde{v}^{(1)}, \dots, \tilde{v}^{(n)})$.
Maximizing the likelihood with respect to the parameters $\theta$ is equal to maximizing the log-likelihood
\[
\sum_{k=1}^{n} \log p_\theta(\tilde{v}^{(k)}) = \sum_{k=1}^{n} \left( \log \sum_h e^{-E_\theta(\tilde{v}^{(k)}, h)} - \log \sum_v \sum_h e^{-E_\theta(v, h)} \right)
\]
with respect to $\theta$.
In the next steps, it is examined how this function of $\theta$ can be optimized.
For simplicity of notation, the $\theta$ is left out from $p_\theta$ and $E_\theta$ in the following derivations.
\paragraph{Deriving the gradient of the log-likelihood}
For finding an optimum of $\sum_{k=1}^{n} \log p_\theta(\tilde{v}^{(k)})$, the gradient $\nabla_{\!\theta} \sum_{k=1}^{n} \log p_\theta(\tilde{v}^{(k)})$ needs to be determined.
The gradient of the log-likelihood for a single observation
$\tilde{v} $ with respect to the parameter set $\theta$ is given by
\begin{align}
\nabla_{\!\theta} \log p(\tilde{v}) &= \nabla_{\!\theta} \log \sum_h e^{-E(\tilde{v}, h)} - \log \sum_v \sum_h e^{-E(v, h)} \nonumber \\
&= \frac{\nabla_{\!\theta} \sum_h e^{-E(\tilde{v}, h)}}{\sum_h e ^{-E(\tilde{v}, h)}} - \frac{\nabla_{\!\theta} \sum_v \sum_h e^{-E(v,h)}}{\sum_v \sum_h e^{-E(v,h)}} \label{eqn:derivedlog}\\
&= \frac{\sum_h e^{-E(\tilde{v}, h)} \nabla_{\!\theta} (-E(\tilde{v}, h))} {\sum_h e^{-E(\tilde{v}, h)}} - \frac{\sum_v \sum_h e^{-E(v, h)} \nabla_{\!\theta} (-E(v, h))}{\sum_v \sum_h e^{-E(v, h)}} \label{eqn:derivedexp}\\
&= \EX_{P_\text{data}} \nabla_{\!\theta} (- E(\tilde{v},h)) - \EX_{P_\text{model}} \nabla_{\!\theta} (-E(v,h)) \label{eqn:nablaresult}
\end{align}
In (\ref{eqn:derivedlog}) and (\ref{eqn:derivedexp}), the chain rule is used, together with the derivative of the logarithm and the exponential function, respectively. The resulting terms can be rewritten in (\ref{eqn:nablaresult}) as the expectation of the gradient given the distribution of the data and the expectation of the gradient given the distribution of the model.
As can be seen in Equation (\ref{eqn:derivedexp}), the calculation of the gradient involves sums over lots of possible combinations of activations of nodes.
The first term can be calculated in the types of RBMs that have been introduced here.
For this, one can use the expected value of the conditional probability $p(h\mid\tilde{v})$ and plug it in:
\begin{equation}
\EX_{P_\text{data}} \nabla_{\!\theta} E(\tilde{v},h) = \nabla_{\!\theta} E(\tilde{v}, \EX(p(h \mid \tilde{v})))
\label{eqn:plugexpectedh}
\end{equation}
In the second term of Equation (\ref{eqn:nablaresult}), the sum goes over all nodes in the network.
Calculating this term is technically not feasible in normal cases, as the number of summands grows exponentially with the number of nodes.
This means that it is not possible to simply set the gradient of the log-likelihood to zero and solve the resulting equation to find an optimum. Finding an optimum directly via gradient descent \citep{gradientdescent}, i.e., taking small steps in the direction of the gradient, is also not possible, because this would likewise require calculating the exact gradient in each iteration step.
Via Gibbs sampling (see Section \ref{gibbssamplingrbm}), however, the term can be estimated.
With this, it becomes possible to use the estimated gradients in gradient descent, or here rather ``gradient ascent'' as we would like to find a (local) maximum.
\paragraph{Batch gradient optimization and the role of mini-batches}
For the practical implementation, another stochastic approximation of the gradient for the full data set $\sum_{k=1}^n \log p(\tilde{v}^{(k)})$ is used.
Usually a form of {\em batch gradient optimization} \citep{bottou_optimization_2018} is performed.
For this, the samples are split into batches $B_l$, usually of approximately equal sizes $b_l$.
In each optimization step $t$, the parameters are updated as follows:
\[
\theta^{(t+1)} = \theta^{(t)} + \frac{\epsilon}{b_l} \sum_{\tilde{v} \in B_l} \nabla_{\!\theta} \log p ( \tilde{v}; \theta^{(t)})
\]
The step size is determined by the {\em learning rate} $\epsilon > 0$.
Each iteration $t$ calculates the gradient of a different batch $B_l$ until all samples have been used.
The mini-batches are usually reused multiple times for the learning.
The process of using all samples/mini-batches once for updating the parameters is called a {\em (training) epoch}.
(In the formula above, the first training epoch is over once each batch $B_l$ has been used for an update.)
Training usually consists of many epochs and the step size is kept small.
For example, a hundred epochs and a learning rate of 0.001 could be a viable combination.
The number of epochs and the learning rate are the most important hyperparameters for the training.
Mini-batches were at first introduced for performance reasons.
It is not necessary to have the exact gradient when taking only small steps in its direction.
Thus, it is sufficient to calculate the gradient on small subsets of the data, which is noisy but can still point in the right direction.
Using mini-batches can also be advantageous for another reason: the variance of the steps becomes higher with a smaller batch size. This leads to exploring more of the surface of the likelihood function in the high-dimensional space and can therefore lead to finding better local optima in some scenarios \citep{bengio2012practical}.
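The overall structure of such a batch gradient optimization can be sketched as follows in Julia. Here, \inlinecode{loglikgrad} is a placeholder for a (stochastic) estimator of the mean log-likelihood gradient of a mini-batch, and the parameters are assumed to be collected in one array \inlinecode{theta}; an actual implementation would typically update weights and biases separately.
\begin{lstlisting}[language=Julia]
function trainloop!(theta, data, loglikgrad;
        epochs = 100, learningrate = 0.001, batchsize = 10)
    n = size(data, 1)                    # samples in the rows of data
    for epoch in 1:epochs
        for batchstart in 1:batchsize:n
            batchend = min(batchstart + batchsize - 1, n)
            batch = data[batchstart:batchend, :]
            # gradient ascent step on the mini-batch
            theta .+= learningrate .* loglikgrad(theta, batch)
        end
    end
    theta
end
\end{lstlisting}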
\paragraph{Contrastive divergence}
For performing batch gradient optimization, the Gibbs sampling procedure is also modified for RBM training to speed up the learning process.
In other applications, thousands of iterations are used for Gibbs sampling \citep{gibbssamplingorig}. For training RBMs, a shortcut is used, which is called {\em contrastive divergence} (CD) \citep{cdorig, perpinan_contrastive_2005}.
In CD, the number of Gibbs sampling steps is reduced drastically. In most cases, even only one step is used, which is denoted with $\text{CD}_1$ \citep{hinton_practical_2012}.
Instead of using a random starting point, CD uses the original sample $\tilde{v}$, for which the gradient shall be computed, as starting point $\tilde{v}^{(1)}$ of the sampling procedure. This should ensure that the starting point is already close to the desired distribution.
Another modification of the Gibbs sampling procedure used for training of RBMs is {\em persistent contrastive divergence} (PCD).
Similar to CD, this procedure also uses only very few steps. Here, the state of the Gibbs sampling chain from the last update is reused as the starting point. \cite{hinton_practical_2012} recommends PCD over $\text{CD}_1$ or even $\text{CD}_{10}$.
\paragraph{Gradients for different types of restricted Boltzmann machines}\label{rbmgradients}
In an RBM with Bernoulli distributed nodes, Equation (\ref{eqn:nablaresult}) leads to the following formulas for the gradients with respect to the weights and biases:
\begin{align*}
\frac{\partial}{\partial W} \log p(\tilde{v}) &= \EX_{P_\text{data}} \tilde{v} h^T - \EX_{P_\text{model}} v h^T \\
\frac{\partial}{\partial a} \log p(\tilde{v}) &= \EX_{P_\text{data}}
\tilde{v} - \EX_{P_\text{model}} v = \tilde{v} - \EX_{P_\text{model}} v\\
\frac{\partial}{\partial b} \log p(\tilde{v}) &= \EX_{P_\text{data}} h - \EX_{P_\text{model}} h
\end{align*}
In an RBM with Gaussian visible nodes \citep{krizhevsky2009tinyimagesthesis}, the gradients are:
\begin{align*}
\frac{\partial}{\partial W} \log p(\tilde{v}) &= \frac{1}{\sigma_i} \left ( \EX_{P_\text{data}} \tilde{v} h^T - \EX_{P_\text{model}} v h^T \right) \\
\frac{\partial}{\partial a} \log p(\tilde{v}) &= \frac{1}{\sigma_i^2} \left(\tilde{v} - \EX_{P_\text{model}} v \right)\\
\frac{\partial}{\partial b} \log p(\tilde{v}) &= \EX_{P_\text{data}} h - \EX_{P_\text{model}} h \\
\frac{\partial}{\partial \sigma_{i}} \log p(\tilde{v}) &= \EX_{P_\text{data}} \left( \frac{(\tilde{v}_i - a_i)^2}{\sigma_i^3} - \sum_{j=1}^{n_H} h_j \frac{w_{ij} \tilde{v}_i}{\sigma_i^2}\right) \\ & \quad \quad - \EX_{P_\text{model}} \left(\frac{(v_i - a_i)^2}{\sigma_i^3} -\sum_{j=1}^{n_H} h_j \frac{w_{ij} v_i}{\sigma_i^2} \right)
\end{align*}
In the alternative formulation of \cite{cho2011improved}, the gradients are:
\begin{align*}
\frac{\partial}{\partial W} \log p(\tilde{v}) &= \frac{1}{\sigma_i^2} \left ( \EX_{P_\text{data}} \tilde{v} h^T - \EX_{P_\text{model}} v h^T \right) \\
\frac{\partial}{\partial a} \log p(\tilde{v}) &= \frac{1}{\sigma_i^2} \left(\tilde{v} - \EX_{P_\text{model}} v \right)\\
\frac{\partial}{\partial b} \log p(\tilde{v}) &= \EX_{P_\text{data}} h - \EX_{P_\text{model}} h \\
\frac{\partial}{\partial \sigma_{i}} \log p(\tilde{v}) &= \frac{1}{\sigma_i^3} \Bigg( \EX_{P_\text{data}} \left( (\tilde{v}_i - a_i)^2 - 2\sum_{j=1}^{n_H} h_j w_{ij} \tilde{v}_i \right) \\ & \quad \quad - \EX_{P_\text{model}} \left((v_i - a_i)^2 -2\sum_{j=1}^{n_H} h_j w_{ij} v_i \right) \Bigg)
\end{align*}
In the RBM with the dummy variables for encoding categorical variables, the gradients are the same as in the one with Bernoulli distributed variables.
Using the gradients specified above, it is possible to run the gradient optimization for training the different types of RBMs.
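As an illustration of how these gradients are used, the following sketch performs one $\text{CD}_1$ update for an RBM with Bernoulli distributed nodes on a mini-batch, employing the helper functions \inlinecode{hiddenpotential}, \inlinecode{visiblepotential}, and \inlinecode{bernoulli} from the previous sketches. It is a simplified sketch and not the implementation of the \apkg{BoltzmannMachines} package.
\begin{lstlisting}[language=Julia]
function cd1update!(W, a, b, vdata; learningrate = 0.001)
    for k in 1:size(vdata, 1)            # samples in the rows of vdata
        v0 = vdata[k, :]
        hprob0 = hiddenpotential(v0, W, b)   # data term
        h0 = bernoulli(hprob0)
        v1 = bernoulli(visiblepotential(h0, W, a))  # one Gibbs step
        hprob1 = hiddenpotential(v1, W, b)   # CD estimate of model term
        W .+= learningrate .* (v0 * transpose(hprob0) .-
                               v1 * transpose(hprob1))
        a .+= learningrate .* (v0 .- v1)
        b .+= learningrate .* (hprob0 .- hprob1)
    end
    W, a, b
end
\end{lstlisting}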
\subsubsection{Deep belief networks}\label{dbns}
The next step in the evolution towards deep Boltzmann machines is given by deep belief networks (DBNs) \citep{hinton_reducing_2006}, which were invented to make better use of RBMs for dimensionality reduction.
Detecting higher level features in data is hard using shallow networks such as RBMs.
The idea of DBNs is to stack RBMs on top of each other (see Figure \ref{fig:dbn}) to be able to model features of increasing complexity with an increased depth of the network.
\begin{figure}[h]
\centering
\includegraphics[scale=0.7]{images/dbn.pdf}
\caption{Training a deep belief network (DBN) and using it as generative model.
{\bf A}: The DBN is trained as a stack of RBMs. {\bf B}: In a DBN as generative model, only the top RBM is used for Gibbs sampling, then the activation is passed in a deterministic way downwards to the visible layer.}
\label{fig:dbn}
\end{figure}
For training a DBN, the first RBM is trained with the original data vectors $\tilde{v}^{(i)}$ with $i = 1, \dots, n$.
For training the higher layers, a deterministic activation $f_k(v) := p_k(h|v)$, defined via the conditional probability in the $k$-th RBM model, is passed through the lower layers after they have been trained:
The second RBM is trained with the data set of all $f_1(\tilde{v}^{(i)})$, the third RBM can then be trained with the data set $f_2(f_1(\tilde{v}^{(i)}))$, and so on.
As shown in Figure \ref{fig:dbn}B, only the last RBM in the network can be used for sampling, which limits the generative capabilities of a DBN.
The ability to generate data from complex distributions is furthermore reduced if the DBN is used for dimension reduction because in this case the top RBM is also the smallest network.
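For illustration, the greedy layerwise training of a DBN can be sketched as follows, where \inlinecode{trainrbm} stands for any RBM training routine (e.g., using $\text{CD}_1$ as sketched above) that returns a model with fields \inlinecode{W} and \inlinecode{b}, and \inlinecode{hiddenpotential} is used as the deterministic activation $f_k$. Names and structure are hypothetical.
\begin{lstlisting}[language=Julia]
function traindbn(data, nhiddens, trainrbm)
    rbms = []
    layerinput = data                    # samples in the rows
    for nhidden in nhiddens
        rbm = trainrbm(layerinput, nhidden)
        push!(rbms, rbm)
        # pass the deterministic activations f_k(v) = p_k(h | v)
        # on as training data for the next RBM
        layerinput = mapslices(v -> hiddenpotential(v, rbm.W, rbm.b),
                layerinput; dims = 2)
    end
    rbms
end
\end{lstlisting}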
\subsubsection{Deep Boltzmann machines}\label{dbmprobs}
If the network architecture of a DBN is treated as a Boltzmann machine, we get a \emph{deep Boltzmann machine}, where the complete network can be utilized for generating samples.
A deep Boltzmann machine, as defined by \cite{salakhutdinov2009DBMs}, with $n_L$ hidden layers has parameters
\[
\theta = \left (W^{(1)}, \dots, W^{(n_L)}, a, b^{(1)}, \dots, b^{(n_L)} \right).
\]
The energy is defined as
\[
E(v, h^{(1)}, \dots, h^{(n_L)}) = - a^T v - v^T W^{(1)} h^{(1)} - \sum_{k=1}^{n_L} (b^{(k)})^T h^{(k)} - \sum_{k=2}^{n_L} (h^{(k-1)})^T W^{(k)}h^{(k)}.
\]
For the connection between the formula and the resulting network architecture see Figure \ref{fig:dbmweights}.
\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{images/dbmweights.pdf}
\caption{The nodes and parameters of a deep Boltzmann machine in the graph view}
\label{fig:dbmweights}
\end{figure}
A deep Boltzmann machine has Bernoulli distributed nodes.
Due to the layer-wise structure, the conditional probability of the visible nodes $p(v \mid h) = p(v \mid h^{(1)})$ depends only on the first hidden layer and can be calculated like in an RBM with Bernoulli distributed nodes.
The conditional probability of the final hidden layer $p \left( h^{(n_L)} \mid v, h^{(1)}, \dots, h^{(n_L -1)} \right) = p \left( h^{(n_L)} \mid h^{(n_L -1)} \right)$ can similarly be calculated by knowing only $h^{(n_L-1)}$.
For the intermediate layers, the conditional probabilities depending on the neighboring layers can be calculated as
\begin{equation}
p\left(h^{(1)} \mid v, h^{(2)}\right) = \sigm \bigg( b^{(1)} + (W^{(1)})^T v + W^{(2)} h^{(2)} \bigg)
\label{dbmcondprobfirsthidden}
\end{equation}
for the first hidden layer and
\begin{equation}
p\left(h^{(k)} \mid h^{(k-1)}, h^{(k+1)} \right) = \sigm \bigg( b^{(k)} + (W^{(k)})^T h^{(k-1)} + W^{(k+1)} h^{(k+1)} \bigg)
\label{dbmcondprobintermediate}
\end{equation}
for all other intermediate hidden layers.
These conditional probabilities can be used to perform Gibbs sampling in deep Boltzmann machines (see Figure \ref{fig:dbmsampling}), similar to the Gibbs sampling algorithm in restricted Boltzmann machines (as described before in \ref{gibbssamplingrbm}).
\begin{figure}[h]
\centering
\includegraphics[scale=1.0]{images/dbmsampling.pdf}
\caption{Gibbs sampling step in a deep Boltzmann machine. The visible layer is sampled using only the activation of the first hidden layer from the previous Gibbs sampling step, and the last hidden layer is sampled using only the activation of the penultimate hidden layer. The intermediate layers are sampled by using the activations of the two neighboring layers from the previous Gibbs sampling step. }
\label{fig:dbmsampling}
\end{figure}
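A sketch of one such Gibbs sampling step in a DBM with weight matrices \inlinecode{Ws}, visible bias \inlinecode{a}, and hidden biases \inlinecode{bs} is given below; all layer activations are taken from the previous step, as depicted in Figure \ref{fig:dbmsampling}. The code is only an illustration under these assumptions and reuses \inlinecode{sigm} and \inlinecode{bernoulli} from the earlier sketches.
\begin{lstlisting}[language=Julia]
function dbmgibbsstep(v, hs, Ws, a, bs)
    newv = bernoulli(sigm(a .+ Ws[1] * hs[1]))
    newhs = similar(hs)
    for k in eachindex(hs)
        # input from the layer below (visible layer for k = 1) ...
        input = bs[k] .+ transpose(Ws[k]) * (k == 1 ? v : hs[k-1])
        if k < length(hs)
            input .+= Ws[k+1] * hs[k+1]  # ... and from the layer above
        end
        newhs[k] = bernoulli(sigm(input))
    end
    newv, newhs
end
\end{lstlisting}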
\subsubsection{Training of deep Boltzmann machines}\label{dbmtraining}
In context of DBMs, the DBN learning algorithm (see \ref{dbns}) is called greedy layerwise pre-training.
This pre-training helps the algorithm for training a general Boltzmann machine to find a better optimum.
An intermediate layer in a DBM receives input from its neighboring layers (see Equation (\ref{dbmcondprobintermediate}) and Figure \ref{fig:dbmsampling}), but during pre-training it only gets input from the layer below. To account for this, the DBN training is modified for pre-training a DBM and the total input that the intermediate layers receive (i.e., the argument of the $\sigm$ function in the formulas for the conditional probabilities) is doubled \citep{salakhutdinov2009DBMs}.
Using the pre-trained DBM as starting point, the fine-tuning of the weights can then be performed by the algorithm for training a general Boltzmann machine
\citep{salakhutdinov2009DBMs, salakhutdinov2015generativemodels}.
This algorithm is also based on gradient descent.
In RBMs, it is possible to calculate $p(h \mid v)$ because all the hidden nodes depend only on the visible nodes.
This allowed calculating $\EX_{P_{\text{data}}} \nabla_{\!\theta} E(\tilde{v},h)$ from $\EX p(h \mid \tilde{v})$ (see Equation (\ref{eqn:plugexpectedh})).
In DBMs this is not possible anymore as the hidden layers are connected with each other and therefore no simple formula for $p(h | v)$ can be derived.
This means that calculating the derivative of the log-likelihood in deep Boltzmann machines needs an additional technique.
\cite{salakhutdinov2009DBMs} proposed a variational approach for this.
The true distribution $p(h|v)$ is replaced with an approximation $q(h|v)$.
Instead of optimizing the log-likelihood, a lower bound is optimized, for which the gradient can be computed.
This \emph{variational lower bound} or \emph{evidence lower bound} (ELBO) \citep{blei_variational_2017} for the likelihood is
\[
\log p(v) \geq \sum_h q(h|v) \log p(v,h) + \mathcal{H}(q)
\]
$\mathcal{H}(q)$ denotes the entropy of the distribution $q$.
Here a ``mean-field approximation'' $q$ is used, where all hidden nodes are assumed to be independent.
The distribution $q$ for this approach is given by $q(h) = \prod_{k=1}^{n_L} \prod_{j=1}^{n_{H,k}} q(h_j^{(k)})$ with $q(h_j^{(k)} = 1) = \mu_j^{(k)}$ for some fixed values $\mu_j^{(k)}$ for each node $j$ in a hidden layer $k$.
($n_L$ denotes the number of hidden layers and $n_{H,k}$ the number of nodes in hidden layer $k$.)
This leads to the lower bound of the log-likelihood, which will be optimized in the fine-tuning algorithm:
\[
\log p(v) \geq - E(v, \mu) - \log Z + \mathcal{H}(q) =: \ELBO(v, \mu)
\]
$E$ is the energy in the DBM and $\mathcal{H}(q) = - \sum_j \left( \mu_j \log \mu_j + (1- \mu_j) \log ( 1- \mu_j) \right)$ is the entropy of $q$ \citep{sala2012anefficient, salakhutdinov2015generativemodels}.
To calculate $\mu$ for a given data vector $\tilde{v}$, the following equations for a fix-point iteration can be used, which are analogous to (\ref{dbmcondprobfirsthidden}) and (\ref{dbmcondprobintermediate}).
\begin{align*}
\mu^{(1)}(t+1)&= \sigm \bigg( b^{(1)} + (W^{(1)})^T \tilde{v} + W^{(2)} \mu ^{(2)}(t) \bigg) \\
\mu^{(k)}(t+1) &= \sigm \bigg( b^{(k)} + (W^{(k)})^T \mu^{(k-1)}(t) + W^{(k+1)} \mu^{(k+1)}(t) \bigg) \\
&\quad\quad (\text{for} \; k=2,\dots, n_L-1) \\
\mu^{(n_L)}(t+1)&= \sigm \bigg( b^{(n_L)} + (W^{(n_L)})^T \mu^{(n_L-1)}(t) \bigg)
\end{align*}
For $t \rightarrow \infty$, the series $\mu(t)$ converges to the value of $\mu$ that can be used for calculating the gradient for the lower bound.
As a starting value for $\mu$, activation in the network is induced by the input using a single forward pass, treating the DBM as a DBN (see \ref{dbns}).
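The fixed-point iteration can be sketched in Julia as follows, with \inlinecode{mus[k]} holding $\mu^{(k)}$ and a fixed number of iterations in place of a convergence check; the names are for illustration only, and \inlinecode{sigm} is assumed from the earlier sketches.
\begin{lstlisting}[language=Julia]
function meanfield(v, mus, Ws, bs; iterations = 10)
    for t in 1:iterations
        newmus = similar(mus)
        for k in eachindex(mus)
            input = bs[k] .+ transpose(Ws[k]) * (k == 1 ? v : mus[k-1])
            if k < length(mus)
                input .+= Ws[k+1] * mus[k+1]
            end
            newmus[k] = sigm(input)
        end
        mus = newmus                     # update all layers at once
    end
    mus
end
\end{lstlisting}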
Analogously to Equation (\ref{eqn:nablaresult}) for calculating the gradient of the log-likelihood in RBMs, the gradient of the variational lower bound of the log-likelihood can be calculated as
\begin{align*}
\nabla_{\theta} \ELBO(\tilde{v}, \mu) &= \nabla_{\theta} ( - E(\tilde{v}, \mu) - \log Z ) \\
&= \nabla_{\theta} (- E(\tilde{v}, \mu)) - \EX_{P_\text{model}} \nabla_{\theta} (-E(v, h)).
\end{align*}
This results in the following gradients for the different types of model parameters:
\begin{align*}
\frac{\partial}{\partial W^{(1)}} \ELBO(\tilde{v}, \mu) &= \tilde{v} (\mu^{(1)})^T - \EX_{P_\text{model}} v (h^{(1)})^T \\
\frac{\partial}{\partial W^{(k)}} \ELBO(\tilde{v}, \mu) &= \mu^{(k-1)} (\mu^{(k)})^T - \EX_{P_\text{model}} h^{(k-1)} (h^{(k)})^T &\quad(k = 2, \dots, n_L)\\
\frac{\partial}{\partial a} \ELBO(\tilde{v}, \mu) &= \tilde{v} - \EX_{P_\text{model}} v \\
\frac{\partial}{\partial b^{(k)}} \ELBO(\tilde{v}, \mu) &= \mu^{(k)} - \EX_{P_\text{model}} h^{(k)} &\quad ( k = 1, \dots, n_L)
\end{align*}
The expected values under the distribution of the model ($\EX_{P_\text{model}}$) are estimated like in the RBM by running a Gibbs sampler (see Figure \ref{fig:dbmsampling}).
This Gibbs sampler is run with a number of parallel persistent chains, which are called ``fantasy particles'' \citep{salakhutdinov2009DBMs}. The results are averaged over the number of fantasy particles to get better estimations for the expected values in the distribution of the model.
For the convergence of the gradient optimization, it is necessary to decrease the learning rate $\epsilon_t$ over time such that the series $\sum_{t=1}^\infty \epsilon_t^2$ converges, e.g., $\epsilon_t = \frac{c}{d+t}$ with constants $c, d > 0$ \citep{sala2012anefficient}.
Since the training procedure is very complex and depends on many hyperparameters, it is necessary to test whether the training works well by examining the training objective.
Although the likelihood is not the best criterion for all kinds of applications \citep{theis_note_2015}, it is essential to have a way to get a value for the (lower bound of the) likelihood as primary optimization criterion.
Being able to inspect the learning process is important for finding the best choice of hyperparameters and also for ensuring the quality of the software implementation of the learning algorithm.
This leads us to the next section, where we take a closer look at methods for calculating and estimating the likelihood, and the lower bound of the likelihood, in the case of DBMs.
\subsubsection{Evaluating restricted and deep Boltzmann machines}\label{evaluatingbms}
A special challenge for unsupervised learning in general is the difficulty of evaluating the performance.
In supervised training, the classification accuracy is the natural evaluation criterion, which is also easy to implement.
In unsupervised training with a well investigated class of data such as images, there is already much experience available for choosing the model architecture and the hyperparameters. If models are to be trained on very diverse data, the problem of finding good hyperparameters is exacerbated as parameter tuning can pose a different challenge for each data set.
For finding good hyperparameters, an objective evaluation criterion is needed.
In case of images or natural language, the generative abilities of a model can be tested by simply looking at the generated images or sentences to see whether these are proper samples. In the case of data from a patient record or genetic data, this approach is not feasible.
If there are no other evaluation criteria, one indicator for successful learning in Boltzmann machines remains the model likelihood, which is an inherent property of the model and is therefore applicable in all cases of data. The difficulty of calculating the likelihood lies in its dependency on the partition function (see Equation (\ref{eqn:probbm})).
In most cases, the likelihood cannot be calculated exactly and can only be estimated by stochastic algorithms such as annealed importance sampling (AIS).
% So we also want to detail the extension of AIS on multimodal deep Boltzmann machines in this article.
\paragraph{Exact calculation of the partition function}
\label{methodExactloglik}
As mentioned in Section \ref{rbmtraining}, the exact calculation of partition functions is only computationally feasible for very small models as its complexity grows exponentially. Exploiting the layerwise structure allows a faster exact calculation of $Z$ such that the computation time does not grow exponentially with the number of all nodes but only grows exponentially with the number of elements in a subset of the nodes. It is possible to utilize the formula for the free energy in restricted Boltzmann machines (see (\ref{eqn:freenergy_rbm}) and (\ref{eqn:freenergy_gbrbm})), where the hidden layer is summed out analytically.
With this it is possible to reduce the number of summands.
The complexity for calculating the partition function for all the different types of models described here is then still $\mathcal{O}(2^n)$, but with an $n$ that is smaller than the number of nodes:
By using the formulas for the free energy and the symmetry of restricted Boltzmann machines with binary nodes, $n = \min(n_V, n_H)$, with $n_V$ and $n_H$ being the number of visible and hidden nodes, respectively.
In RBMs with one layer of Gaussian nodes and one layer of binary nodes, $n$ is the number of binary nodes, since the contribution of the Gaussian nodes can be integrated out analytically.
In the case of a deep Boltzmann machine, it is possible to sum out every other layer, similar to the calculation of the free energy in restricted Boltzmann machines. This way, $n$ can be reduced to the total number of nodes in the layers that are not summed out, see Figure \ref{aissummingout}.
\begin{figure}[h!]
\centering
\includegraphics[scale=2.5]{images/AISsummingout.pdf}
\caption{Possibilities for summing out layers for calculating the likelihood. For calculating the likelihood in a more efficient way, the layers that are crossed out can be summed out analytically. {\bf A:} Summing out the odd layers in this DBM leaves $2^{(4 + 4)} = 256$ summands that still have to be added up in order to calculate the likelihood. {\bf B:} Summing out the even layers in this DBM leaves $2^{(4 + 3)} = 128$ summands. {\bf C:} Summing out the even layers in this multimodal DBM leaves $2^{(6 + 3)} = 512$ summands.}
\label{aissummingout}
\end{figure}
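As an illustration, the partition function of a small RBM with Bernoulli distributed nodes can be computed exactly by enumerating only the visible configurations and summing out the hidden layer via the free energy. The following Julia sketch (with illustrative names, not the interface of the \apkg{BoltzmannMachines} package) computes $\log Z$ in this way:
\begin{verbatim}
using LinearAlgebra

# Exact log partition function of a binary RBM with visible bias a,
# hidden bias b and weight matrix W (size n_V x n_H).
# Only feasible for small n_V, since all 2^n_V visible configurations are
# enumerated; the hidden layer is summed out analytically via the free energy.
function exactlogpartitionfunction(W, a, b)
    nvisible = length(a)
    logsummands = Float64[]
    for bits in 0:(2^nvisible - 1)
        v = [Float64((bits >> i) & 1) for i in 0:(nvisible - 1)]
        # -F(v): negative free energy of the visible configuration v
        push!(logsummands, a' * v + sum(log1p.(exp.(b .+ W' * v))))
    end
    maxlog = maximum(logsummands)
    maxlog + log(sum(exp.(logsummands .- maxlog)))  # stable log-sum-exp
end
\end{verbatim}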
\paragraph{Estimating partition functions with annealed importance sampling (AIS)}\label{methodAIS}
For annealed importance sampling, we need a sequence of intermediate distributions
$p_0, \dots, p_K$ with
$p_0 = p_A$ and $p_K = p_B$. The ratio $\frac{Z_B}{Z_A}$ is then estimated by the mean of a number of so-called {\em importance weights}.
Each importance weight is determined by sampling a new chain of values $x^{(1)}, \dots, x^{(K)}$ and then calculating the product of the ratios of unnormalized probabilities
\[
\prod_{k=1}^K \frac{p^*_k(x^{(k)})}{p^*_{k-1}(x^{(k)})}.
\]
Each of the importance weights is an unbiased estimator of $\frac{Z_B}{Z_A}$; taking the mean of many importance weights reduces the variance of the estimate.
The model for the distribution $p_B$ can be chosen in a way that the partition function $Z_B$ can be calculated exactly.
This way, AIS can be used to estimate $Z_A$ by estimating the ratio $\frac{Z_B}{Z_A}$.
The $x^{(k)}$ are produced by iterative Gibbs sampling: $x^{(1)}$ is sampled from $p_0$, and $x^{(k)}$ (for $k = 2, \dots, K$) is obtained by running a Gibbs chain that is initialized with $x^{(k-1)}$ in the intermediate model with distribution $p_{k-1}$. An appropriate choice for the intermediate models are Boltzmann machines with the energy functions $E_k$ chosen such that
\[
E_k(x) = (1 - \beta_k) E_A(x) + \beta_k E_B(x)
\]
and therefore
\[
p_k^*(x) = p_A^*(x)^{1-\beta_k} p_B^*(x)^{\beta_k}.
\]
The factors $\beta_k$ with $0 = \beta_0 < \beta_1 < ... < \beta_K = 1$ are called (inverse) temperatures \citep{salakhutdinov2008learning}.
The choice of the temperatures, the number of importance weights and also the number of Gibbs sampling steps for the transition are hyperparameters for the AIS algorithm.
With AIS it is possible to get a direct estimate of the partition function $Z_A$ of a model $A$ by annealing between the model $A$ and a null model $B$ in which all weights are zero.
$Z_A$ can be computed from the estimate of $\frac{Z_B}{Z_A}$ because the partition function $Z_B$ of the null model can be calculated exactly. This approach is shown in Figure \ref{figTwotypesais}A.
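The following Julia sketch illustrates the computation of the logarithm of one importance weight for a binary RBM, annealing from the null model to the full model as described above, with the visible bias kept fixed along the path. The temperature schedule and all names are illustrative assumptions:
\begin{verbatim}
using LinearAlgebra

sigm(x) = 1 ./ (1 .+ exp.(-x))

# One AIS run for a binary RBM with weights W, visible bias a and hidden bias b.
# The run anneals from the null model (zero weights and zero hidden bias, same
# visible bias) to the full model and returns the log of one importance weight,
# which estimates log(Z_model) - log(Z_nullmodel).
function aislogimportanceweight(W, a, b; betas = 0.0:0.001:1.0)
    # log of the unnormalized probability of v in the intermediate model
    logpstar(v, beta) = a' * v + sum(log1p.(exp.(beta .* (b .+ W' * v))))

    v = Float64.(rand(length(a)) .< sigm(a))  # exact sample from the null model
    logw = 0.0
    for k in 2:length(betas)
        logw += logpstar(v, betas[k]) - logpstar(v, betas[k - 1])
        # Gibbs transition in the intermediate model with tempered parameters
        h = Float64.(rand(length(b)) .< sigm(betas[k] .* (b .+ W' * v)))
        v = Float64.(rand(length(a)) .< sigm(a .+ betas[k] .* (W * h)))
    end
    logw
end
\end{verbatim}
The mean of the exponentiated log weights from many such runs, computed in a numerically stable way, then serves as the estimate of the ratio of the partition functions.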
It is also possible to compare the likelihood of two models instead of calculating it directly.
For this, only the ratio $\frac{Z_B}{Z_A}$ between two full models needs to be estimated.
There are two practical approaches for constructing intermediate models for directly annealing from one full model to another full model (see also Figure \ref{figTwotypesais}, B and C):
\begin{enumerate}
\item Combining (adding) corresponding weights of two models of the same size to get an intermediate model of the same size.
\item Constructing a larger model by putting the two models next to each other and connecting their nodes, increasing the energy in one part of the combined model while reducing it in the other part \citep{theis2011deepbelief}.
\end{enumerate}
\begin{figure}[h!]
\centering
\includegraphics[scale=3.5]{images/twotypesais.pdf}
\caption{Different approaches for performing AIS.
The ``tempered'' weights, i.e., the weights of the intermediate models, are depicted in grey. Black lines correspond to the original weights of a model. Missing lines between nodes indicate weights that are zero.
{\bf A:} Annealing from a null model, where all weights are zero, to the full model.
{\bf B:} Annealing from one RBM model with three visible nodes and two hidden nodes to another model of the same size.
The weights of the first model are depicted as continuous lines, the weights of the second as dashed lines. Here, model one is gradually morphed into model two. {\bf C:}
The same two RBMs are compared again by annealing from a model that is equivalent to the first model to a model that is equivalent to the second model via a combined, bigger model. For this approach, the two models do not necessarily have to be of the same size.}
\label{figTwotypesais}
\end{figure}
The approach with a combined model (see Figure \ref{figTwotypesais}C) is more flexible and makes it possible to estimate the ratio of the partition functions of two arbitrary Boltzmann machines of the same type that are defined on the same visible layer.
However, it requires sampling in a Boltzmann machine that has as many hidden nodes as the two models together.
In a DBM, it is possible to use only the states of every other layer from samples generated by running a Gibbs chain \citep{salakhutdinov2008learning}. The unnormalized probabilities of these states can be calculated by analytically summing out the remaining layers.
For this, we can use the fact that the states in one layer are independent of the states of each non-adjacent layer to get the unnormalized probability $p_k^*(\tilde{h})$ for the subset $\tilde{h} = \left(h^{(1)}, h^{(3)}, \dots\right)$ of a generated sample $x = \left( v, h^{(1)}, h^{(2)}, \dots \right)$ in an intermediate model.
The layers that can be analytically summed out are the same as the ones shown in Figure \ref{aissummingout}.
To calculate the ratio $\frac{p_k^*(x^{(k)})}{p_{k-1}^*(x^{(k)})}$, one can derive the formula for $p^*(\tilde{h})$ in a DBM similarly to the free energy $F(v) = - \log p^*(v)$ in RBMs \citep{sala2012anefficient}. For a DBM with three hidden layers, the formula is
\begin{align}
p^*( \tilde{h} ) =& e^{(b^{(1)})^T h^{(1)}} \prod_j (1+ \exp((a + W^{(1)} h^{(1)})_j)) \; \cdot \nonumber \\
&e^{(b^{(3)})^T h^{(3)}} \prod_{j} \left( 1 + \exp \left( (b^{(2)} + (W^{(2)})^T h^{(1)} + W^{(3)} h^{(3)})_j \right) \right).
\label{eqn:aisunnormalizedprob}
\end{align}
Here the visible layer $v$ and the second hidden layer $h^{(2)}$ have been summed out analytically, as shown in Figure \ref{aissummingout}B.
%Note that the model for Gibbs sampling, used as the Markov transition operator, can differ from the model that is used to evaluate the unnormalized probability $p_k^*(h)$ as long as the probability distribution $p_k(h)$ of the hidden nodes is the same.
%This trick can be used to fit AIS for GBRBMs in the same schema as RBMs with Bernoulli distributed nodes.
It can be noted that the term in the first line of Equation (\ref{eqn:aisunnormalizedprob}) is equal to the unnormalized probability of the hidden nodes in an RBM with Bernoulli distributed nodes.
This procedure can be generalized for AIS on multimodal DBMs.
If we have the unnormalized probability $p^*(h)$ for each type of RBM that receives the data input, it becomes possible to calculate the unnormalized probability of sampled hidden values in a multimodal DBM in the same way as for a standard DBM with only Bernoulli distributed nodes.
The formula for the unnormalized probability for the respective RBM type (see Section \ref{unnormalizedprobsrbm}) can then be used for summing out the visible units in Equation (\ref{eqn:aisunnormalizedprob}) by substituting the term in the first line with the product of the unnormalized probabilities for all RBMs in the visible layer.
\paragraph{Calculating or estimating likelihoods in deep Boltzmann machines}
For a restricted Boltzmann machine, the likelihood can be calculated using Equation (\ref{eqn:pRBMfreeenergy}) if the partition function is known. This is not so easily possible in a DBM, where computing the unnormalized probability of a visible vector requires summing over the configurations of the hidden nodes, which is of exponential complexity.
Estimating the likelihood of DBMs is possible using AIS by constructing a smaller DBM for each sample and estimating its partition function.
The smaller DBM is constructed by removing the visible layer and incorporating the contribution of the sample to the energy of the first RBM, consisting only of the visible and the first hidden layer, into the bias of the new visible layer, which was the first hidden layer of the original model.
The partition function of this smaller model is then the unnormalized probability of the sample in the original model \citep{salakhutdinov2009DBMs}.
In a setting with a very large sample size, estimating the actual likelihood with this procedure may be too expensive. But if the sample size is small enough, it can be affordable to estimate the likelihood instead of falling back on the lower bound.
\paragraph{Alternative criteria for monitoring the training of RBMs}\label{alternativebmevaluations}
The AIS algorithm is very complex to implement.
Although it becomes possible to estimate the true optimization criterion with AIS in reasonable time, it is still compute-intensive, depending on the required exactness of the estimation.
Thus, other statistics are often used for quickly checking the training progress.
The free energy cannot be used for comparing different models because it does not include the normalization by $Z$.
It can, however, be used to compare how well the same model fits different data sets.
One application for this is monitoring the overfitting by comparing the training data set and a test data set \citep{hinton_practical_2012}.
\label{reconstructionerror}
Another popular statistic, which behaves similarly to the likelihood in RBMs in most cases, is the {\em reconstruction error} \citep{hinton_practical_2012}.
For defining this, one first needs to define the term {\em reconstruction} in an RBM.
The reconstruction \citep{hinton_practical_2012} for a sample $\tilde{v}$ in an RBM is calculated by using the conditional probabilities as deterministic ``activation potential''.
With $f_{hidden}(\tilde{v}) := p(h|\tilde{v})$ and $f_{visible}(\tilde{h}) := p(v|\tilde{h})$, the reconstruction can be defined as $r(\tilde{v}) := f_{visible}(f_{hidden}(\tilde{v}))$.
(For continuously valued variables, the expected value can be used instead of the probability.)
The reconstruction error is then the distance between the sample $\tilde{v}$ and its reconstruction $r(\tilde{v})$, e.g., measured as absolute distance or quadratic distance.
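For an RBM with Bernoulli distributed visible and hidden nodes, this can be written in a few lines, as in the following Julia sketch using the quadratic distance (names are illustrative):
\begin{verbatim}
using LinearAlgebra

sigm(x) = 1 ./ (1 .+ exp.(-x))

# Quadratic reconstruction error of a sample v in a binary RBM with weights W,
# visible bias a and hidden bias b, using the conditional probabilities as
# deterministic activation potentials.
function reconstructionerror(W, a, b, v)
    hiddenpotential = sigm(b .+ W' * v)              # p(h_j = 1 | v)
    reconstruction = sigm(a .+ W * hiddenpotential)  # visible potential
    sum(abs2, v .- reconstruction)
end
\end{verbatim}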
Calculating the reconstruction error is a very fast and simple technique for monitoring the training progress for RBMs, DBNs and for greedy layer-wise pre-training of DBMs.
Although the reconstruction error can serve as a rough proxy statistic for the likelihood, it does not replace the likelihood entirely, as it is not the actual optimization target, and it is strongly influenced by the {\em mixing rate} of the Markov chain for Gibbs sampling, i.e., how fast the Markov chain reaches its equilibrium distribution. If the distribution in the Markov chain changes very slowly, the reconstruction error will be low even if the distributions of the samples and the model differ substantially \citep{hinton_practical_2012}.
\subsubsection{Multimodal deep Boltzmann machines}
Figure \ref{fig:bmsoverview} shows different possible architectures for multimodal Boltzmann machines \citep{srivastava2012multimodal}.
Multimodal DBMs can be considered a generalization of DBMs.
They were introduced to make it possible to combine different types of input data in a single model.
The training procedure of multimodal DBMs can be generalized as follows:
For the pre-training of a multimodal DBM, the different layers are trained by stacking RBMs in the same way as for pre-training a DBM.
The only differences are that the RBMs at the lowest layer may have different input distributions, and that RBMs in higher layers may be partitioned, i.e., some connections are always zero.
For the fine-tuning, the gradients in the different layers are calculated in the same way as the gradients in the respective RBM types, using the mean-field approximations and the fantasy particles of the DBM fine-tuning algorithm as activations of the respective visible and hidden nodes.
Gibbs sampling can also be performed in the same way, using the conditional probabilities in the RBMs.
\subsection{Theoretical developments}\label{bmaddtheory}
\subsubsection{A new approach for modeling categorical data}\label{methodsoftmax0}
% _maybe auch beschreiben, dass man bernoulli und categoriale variablen in einem haben kann
Categorical values are common in biomedical data.
For most applications in machine learning, categorical data is usually encoded in dummy variables \citep{hastie_elements}.
It would be possible to use the binary dummy variables as input to a restricted or deep Boltzmann machine with Bernoulli distributed visible units as well.
But when sampling from such a Boltzmann machine model, all combinations of visible nodes have a positive probability. This can be seen from the formula of the conditional probability (\ref{eqn:condprobrbm}) and the fact that the values of the sigmoid function are strictly positive.
Therefore, the resulting data is not properly encoded in general because illegal combinations of the values of dummy variables can occur.
This means that sampled values cannot be mapped to the original categories any more.
Using dummy variables as input to Boltzmann machines with Bernoulli distributed visible nodes also makes it more difficult to learn higher-level patterns, as the Boltzmann machine first has to learn the pattern that results from the dummy encoding by itself. Hence, it is advisable to use a Boltzmann machine that has knowledge about the encoding built into its energy function and probability distribution, as described in the following.
For encoding categorical variables, the most popular encoding used by the machine learning frameworks \apkg{TensorFlow} \citep{tensorflow}, \apkg{scikit-learn} \citep{scikit-learn} or \apkg{Flux} \citep{flux} is the so-called ``one-hot encoding'', which encodes a variable with $k$ categories in a binary vector of $k$ components, where exactly one component is one and all others are zero.
An advantage of this is that all categories are treated equally.
Here, I tried a slightly different variant.
A categorical variable with $k$ categories is encoded in $k-1$ binary dummy variables.
For example, this is the encoding of the values for a variable with four categories:
\begin{table}[h!]
\centering
\begin{tabular}{cc}
Categorical value & Dummy encoding \\
\hline
1 & 0 0 0 \\
2 & 1 0 0 \\
3 & 0 1 0 \\
4 & 0 0 1 \\
\end{tabular}
\caption{Dummy encoding with reference category for a categorical variable with four categories}\label{dummenc}
\end{table}
This variant is the technique for creating dummy variables for categorical variables in regression models \citep{faraway_regression}.
One category is used as reference category for the interpretation of the regression coefficients.
This allows the model to be parsimonious with parameters.
As already pointed out in the introduction, this may therefore help with deep learning on genetic data in particular.
Consider genetic variant data that contains the values 0/1/2 for each defined location on the human genome to encode that there is no deviation from the reference genome (0), a deviation on one chromosome (1), or a deviation on both chromosomes (2).
An RBM receiving input with dummy encoding using the natural reference category 0 needs only two thirds of the parameters that an RBM needs which receives the same data in one-hot encoding.
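The following Julia sketch illustrates this dummy encoding with a reference category; the function name and the convention that categories are numbered $1, \dots, k$ with category 1 as the reference are illustrative assumptions:
\begin{verbatim}
# Encode a vector of categorical values 1, ..., ncategories into k-1 binary
# dummy variables per observation, using category 1 as the reference category
# (compare the dummy-encoding table above).
function dummyencode(values::Vector{Int}, ncategories::Int)
    encoded = zeros(Float64, length(values), ncategories - 1)
    for (row, value) in enumerate(values)
        if value > 1
            encoded[row, value - 1] = 1.0
        end
    end
    encoded
end

# Example: genetic variants coded as 0/1/2 can be shifted to 1/2/3, so that
# "no deviation from the reference genome" becomes the reference category:
# dummyencode([1, 2, 3, 1], 3) returns a 4x2 matrix with rows 00, 10, 01, 00.
\end{verbatim}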
The energy function $E$ for RBMs with categorical data is the same as for Bernoulli distributed visible nodes (see Equation (\ref{eqn:energyformularbm})).
The difference is that not all combinations of visible nodes are allowed. Thus the partition function $Z = \sum_{v} \sum_{h} e^{-E(v,h)}$ and the probability
$p(v) = \frac{\sum_h e^{-E(v,h)}}{Z}$ change because the sum over all possible states of $v$ in the formula for $Z$ contains fewer summands than in the case of a Bernoulli distribution, where all combinations of activations are possible.
For deriving the formulas for an RBM that handles the dummy encoding as exemplified in Table \ref{dummenc}, let us denote with $C_k$ the index set of the dummy variables that belong to the same categorical variable as the dummy variable $k$.
The visible nodes of the RBM can cover multiple categorical variables and multiple sets of dummy variables, which may differ in the number of categories.
If the number of categories equals two for all categorical variables, this model is the same as an RBM with Bernoulli distributed nodes.
Following the notation in \cite{krizhevsky2009tinyimagesthesis}, let us write $E(v_k = 1, v_{i \neq k}, h)$ for the energy of the combination of a visible vector $v$ with $v_k = 1$ and a hidden vector $h$. Similarly, let us define $E(v_{i \in C_k} = 0, v_{i \notin C_k}, h)$ for the energy of a hidden vector $h$ and a visible vector $v$ where all dummy variables that belong to $C_k$ have the value zero.
From the previously defined dummy encoding it follows that if one dummy variable in the set is one, all others in the set are zero.
The sum over all possible combinations of $v$ can therefore be split into the part where all dummy variables in $C_k$ are zero and, for each $c \in C_k$, a part where $v_c = 1$, which implies that all other dummy variables in the set are zero.
This allows to split the formula for the unnormalized probability of the hidden nodes into the following sums:
\begin{align}
p^*(h) &= \sum_v \exp (-E(v,h)) \nonumber \\
&= \sum_{v_{i \notin C_k}} \exp (-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h)) + \sum_{c \in C_k} \sum_{v_{i \notin C_k}} \exp ( - E(v_c = 1, v_{i \notin C_k}, h))
%E(v_k=1, v_{i\neq k}, h) = E(v_k = 1, v_{i \notin C_k}, h).
\end{align}
The notation $\sum_{v_{i \notin C_k}}$ here indicates that the sum goes over all possible combinations of values for the visible nodes that do not belong to $C_k$.
The input from the hidden nodes can further be extracted from $ \sum_{v_{i \notin C_k}} e^{-E(v_c = 1, v_{i \notin C_k}, h)}$ by regrouping the summands:
\begin{align}
&\sum_{v_{i \notin C_k}} \exp ( - E(v_c = 1, v_{i \notin C_k}, h)) = \nonumber \\
&\quad = \sum_{v_{i \notin C_k}} \exp \left( \sum_{i \notin C_k} \sum_j v_i h_j w_{ij} + \sum_{i \notin C_k} a_i v_i + a_c + \sum_j h_j w_{cj} +\sum_j b_j h_j \right) \nonumber \\
&\quad = \exp \left( (Wh)_c + a_c \right) \sum_{v_{i \notin C_k}} \exp \left(-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h) \right)
\label{eqn:sumvksoftmax}
\end{align}
The unnormalized probability $p^*(h)$ of the hidden nodes can now be rewritten as:
\begin{align}
p^*(h) =& \sum_{v_{i \notin C_k}} \exp (-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h)) \; + \nonumber \\
&\quad \sum_{c \in C_k} \exp \left( (Wh)_c + a_c \right) \sum_{v_{i \notin C_k}} \exp \left(-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h) \right)
\label{eqn:unnormalizedhiddensplit}
\end{align}
With this, the conditional probability of a dummy variable being equal to one can finally be derived as follows:
\begin{align*}
p(v_k = 1 \mid h) &= \frac{p(v_k = 1, h)}{p(h)} \\
&= \frac{p^*(v_k = 1, h)}{p^*(h)}\\
&= \frac{\sum_{v_{i \notin C_k}} e^{-E(v_k = 1, v_{i \notin C_k}, h)}}{p^*(h)} \\
%&= \frac{\sum_{v_{i \notin C_k}} \exp \left( \sum_{i \notin C_k} v_i h_j w_{ij} + \sum_{i \notin C_k} a_i v_i + a_k + \sum h_j w_{kj} +\sum_j h_j b_j \right)}{p^*(h)} \\
&\stackrel{(\ref{eqn:sumvksoftmax})}{=} \frac{\exp \left( (Wh)_k + a_k \right) \sum_{v_{i \notin C_k}} \exp \left(-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h) \right)}{p^*(h)}\\
&\stackrel{(\ref{eqn:unnormalizedhiddensplit})}{=} \frac{\exp((Wh)_k + a_k)}{1 + \sum_{c \in C_k} \exp ((Wh)_c + a_c)}
%&= \frac{\exp \left( (Wh)_k + a_k \right) \sum_{v_{i \notin C_k}} \exp \left(-E(v_{i \in C_k} = 0, v_{i \notin C_k}, h) \right)}{\sum_{u_{i\notin C_k)}
\end{align*}
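The resulting conditional distribution is a softmax with an implicit zero logit for the reference category. A minimal Julia sketch for computing the conditional probabilities of the dummy variables of one categorical variable, given the hidden vector, could look as follows (names are illustrative; \texttt{C} contains the indices of the dummy variables belonging to the categorical variable):
\begin{verbatim}
using LinearAlgebra

# Conditional probabilities p(v_c = 1 | h) for all dummy variables c in the
# index set C of one categorical variable, in an RBM with weights W and
# visible bias a. The probability of the reference category is one minus
# the sum of the returned probabilities.
function conditionalprobs(W, a, h, C)
    unnormalized = exp.((W * h .+ a)[C])
    unnormalized ./ (1 + sum(unnormalized))
end
\end{verbatim}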
\subsubsection{Generalizing annealed importance sampling for multimodal DBMs} \label{unnormalizedprobsrbm}
To generalize the evaluation of the lower bound of the likelihood to multimodal DBMs, the AIS algorithm for estimating the partition function can be generalized as well.
This can be done by ``summing out'' or ``integrating out'' the different types of visible nodes (see Figure \ref{aissummingout} C).
In Section \ref{methodAIS}, AIS for binary DBMs is described.
In Equation (\ref{eqn:aisunnormalizedprob}) the binary visible layer is summed out by using the formula
\[
p^*(h) = e^{b^T h} \prod_j (1+ \exp((a + W h)_j))
\]
for the unnormalized probability of the hidden units in the RBM at the input layer.
Due to the symmetry of hidden and visible nodes in an RBM with only Bernoulli distributed nodes, this formula for $p^*(h)$ can be derived analogously to the free energy $F(v) = - \log p^*(v)$.
To adapt Equation (\ref{eqn:aisunnormalizedprob}) to a multimodal DBM, the formulas for the unnormalized probabilities $p^*(h)$ of the different RBMs at the input layer of the multimodal DBM need to be derived.
To change the type of the input layer, these formulas can then be used in place of the first line in Equation (\ref{eqn:aisunnormalizedprob}).
Since I have not found formulas for the unnormalized probability of the hidden nodes in RBMs with Gaussian visible nodes in the literature, I derive them here in detail.
The derivations for $p^*(h)$ for RBMs with Gaussian visible nodes use the fact that the integral over the density function of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is equal to one:
\begin{equation} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{ -\frac{(x - \mu)^2}{2 \sigma^2}} dx = 1
\label{eqn:densitynormal}
\end{equation}
With that, the unnormalized probability of the hidden nodes in a {\bf Gaussian RBM with original parameterization} (see Equation (\ref{eqn:energyformulagbrbm})) can be calculated as
\begingroup
\allowdisplaybreaks
\begin{align*}
p^*(h) &= \intrnv e^{-E \left(v,h \right)} dv \\
&= \intrnv \exp \left( -\sum_{i=1}^{n_V}\frac{(v_i - a_i)^2}{2\sigma_i^2} + b^T h + \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} \frac{v_i}{\sigma_i}h_j w_{ij} \right) dv\\
&= e^{b^T h} \intrnv \exp \left( - \sum_{i=1}^{n_V} \frac{v_i^2 -2 a_i v_i + a_i^2 - 2 v_i (Wh)_i \sigma_i}{2 \sigma_i^2} \right) dv \\
&= e^{b^T h} \intrnv \exp \left(
- \sum_{i=1}^{n_V} \frac{{\left( v_i - \left( (Wh)_i \sigma_i + a_i \right) \right)}^2}{2\sigma_i^2} + \sum_{i=1}^{n_V} \frac{1}{2}(Wh)_i^2 + (Wh)_i \frac{a_i}{\sigma_i} \right ) dv \\
\begin{split}
&= \exp \left(b^T h + \sum_{i=1}^{n_V} \frac{1}{2}(Wh)_i^2 + (Wh)_i \frac{a_i}{\sigma_i} \right ) \cdot \\
& \quad \quad \intrnv \exp \left ( - \sum_{i=1}^{n_V} \frac{{\left( v_i - ((Wh)_i \sigma_i + a_i) \right)}^2}{2\sigma_i^2} \right) dv
\end{split} \\
& \stackrel{(\ref{eqn:densitynormal})}{=} \exp \left( b^T h + \sum_{i=1}^{n_V} \frac{1}{2}(Wh)_i^2 + (Wh)_i \frac{a_i}{\sigma_i} \right ) \prod_{i=1}^{n_V}\left(\sqrt{2\pi} \sigma_i\right). \\
\end{align*}
\endgroup
For {\bf Cho's alternative parameterization} (see Equation (\ref{eqn:energyformulagbrbm2})), the unnormalized probability calculates analogously as
\begingroup
\allowdisplaybreaks
\begin{align*}
p^*(h) &= \intrnv e^{-E \left(v,h \right)} dv \\
&= \intrnv \exp \left( - \sum_{i=1}^{n_V} \frac{(v_i - a_i)^2}{2\sigma_i^2} + b^T h + \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} h_j w_{ij} \frac{v_i}{\sigma_i^2} \right) dv \\
&= e^{b^T h} \intrnv \exp\left( - \sum_{i=1}^{n_V} \frac{(v_i - a_i)^2 - 2 v_i (Wh)_i}{2 \sigma_i^2} \right) dv \\
&= e^{b^T h} \intrnv \exp \left( - \sum_{i=1}^{n_V} \frac{\left(v_i - ((Wh)_i + a_i) \right)^2}{2 \sigma_i^2} + \sum_{i=1}^{n_V} \frac{(Wh)_i^2 + 2 a_i (Wh)_i}{2\sigma_i^2} \right) dv\\
&= \exp \left( b^T h + \sum_{i=1}^{n_V} \frac{(Wh)_i^2 + 2 a_i (Wh)_i}{2\sigma_i^2} \right) \cdot \\
& \quad \quad \intrnv \exp \left(- \sum_{i=1}^{n_V} \frac{\left(v_i - ((Wh)_i + a_i) \right)^2}{2 \sigma_i^2} \right) dv\\
&\stackrel{(\ref{eqn:densitynormal})}{=} \exp \left( b^T h + \sum_{i=1}^{n_V} \frac{\frac{1}{2}(Wh)_i^2 + (Wh)_i a_i}{\sigma_i^2} \right ) \prod_{i=1}^{n_V}\left(\sqrt{2\pi} \sigma_i \right).
\end{align*}
\endgroup
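For use in AIS, it is advisable to work with $\log p^*(h)$ in order to avoid numerical overflow. A Julia sketch of the resulting formula for the original parameterization, with illustrative names, could look as follows:
\begin{verbatim}
using LinearAlgebra

# log p*(h) for a Gaussian-Bernoulli RBM in the original parameterization,
# with weights W, visible bias a, hidden bias b and standard deviations sigma.
function loggaussianpstar(W, a, b, sigma, h)
    wh = W * h
    b' * h + sum(0.5 .* wh .^ 2 .+ wh .* a ./ sigma) +
        sum(log.(sqrt(2pi) .* sigma))
end
\end{verbatim}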
For the RBM for {\bf categorical variables with the dummy encoding with reference level} from Section \ref{methodsoftmax0}, the unnormalized probability can be derived similar to the RBM with Bernoulli distributed nodes. At first, the energy function can be rewritten in the same way as with the RBM with Bernoulli distributed nodes:
\begin{align}
e^{-E(v,h)} &= \exp \left(\sum_j b_j h_j + \sum_{i,j} w_{ij} v_i h_j + \sum_i a_i v_i \right) \nonumber \\
&= \exp \left( \sum_j b_j h_j + \sum_i v_i \left( a_i + \sum_j w_{ij} h_j \right) \right) \nonumber \\
&= e^{\sum_j b_j h_j} \prod_i \underbrace{e^{v_i (a_i + \sum_j w_{ij} h_j)}}_{(*)}
\label{eqn:freeenergytrick}
\end{align}
\begin{equation*}
(*) = \left\{
\begin{array}{ll}
1 & \text{for } v_i = 0 \\
e^{a_i +\sum_j w_{ij} h_j} & \text{for } v_i = 1
\end{array} \right.
\end{equation*}
When multiplying out the product in Equation (\ref{eqn:freeenergytrick}) and summing over all valid combinations of $v$, it can be seen that $p^*(h)$ can be written as
\begin{align*}
p^*(h) &= \sum_v e^{-E(v,h)} \\
&= e^{\sum_j b_j h_j} \prod_{C} \left( 1 + \sum_{i \in C} e^{\sum_j w_{ij} h_j + a_i} \right).
\end{align*}
Here, the product runs over all index sets $C$ of dummy variables, where each index set belongs to one categorical variable.
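Analogously to the other cases, the logarithm of this unnormalized probability can be computed as in the following Julia sketch, where \texttt{groups} is assumed to be a vector of index vectors, one per categorical variable (names illustrative):
\begin{verbatim}
using LinearAlgebra

# log p*(h) for an RBM with categorical visible variables in dummy encoding
# with reference category. groups[l] contains the indices of the dummy
# variables that belong to the l-th categorical variable.
function logcategoricalpstar(W, a, b, h, groups)
    wh = W * h .+ a
    b' * h + sum(log1p(sum(exp.(wh[C]))) for C in groups)
end
\end{verbatim}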
\clearpage
\subsection{The \apkg{BoltzmannMachines} Julia package}\label{bmsjl}
To be able to experiment with the algorithms for training and evaluating deep Boltzmann machines, an accessible and easily customizable implementation was needed.
The Julia package \apkg{BoltzmannMachines}, which is one of the main contributions of this thesis, serves this purpose.
The package offers a thorough and well-tested implementation of the algorithms described in Section \ref{bmtheory} and Section \ref{bmaddtheory}, with several unique features that are not covered by other published software.
It is the only known software that allows a flexible and easy composition of partitioned and multimodal DBMs, including different types of visible nodes, and it has an extensible design that makes the integration of different types of RBMs possible.
It is also the only known published implementation of AIS with DBMs that works with diverse architectures of DBMs.
Much emphasis is also put on evaluating RBMs and DBMs and allowing an inspection of the learning process.
For this purpose, many possibilities for convenient and flexible monitoring of different criteria are integrated in the functionality of the package.
Furthermore, the monitoring is completely customizable and supports user-defined evaluations as well.
The package is registered in the official package repository of the Julia language and it is also available as open source software from GitHub\footnote{\url{https://github.com/stefan-m-lenz/BoltzmannMachines.jl}}.
A detailed documentation of the functions and data types can be found in Appendix \ref{BMDoku}.
\subsubsection{Feature overview and comparison with existing implementations}\label{BMfeatures}
At the time of the decision whether to start a new implementation, several software implementations for RBMs and DBNs already existed.
Yet, there was no user-friendly software solution for training and evaluating DBMs.
The original implementation of DBM training\footnote{\url{http://www.cs.toronto.edu/~rsalakhu/DBM.html}} was written in MATLAB.
It consists of a collection of scripts and it is hard to re-use because it does not encapsulate the code in functions.
Therefore, the decision for creating a new software implementation of the algorithm was made.
The new implementation was designed to be very flexible and extensible for exploring new ideas, such as partitioning of DBMs and the generalization on multimodal DBMs.
This led to the implementation of a range of features for experimenting with restricted and deep Boltzmann machines.
This section describes the features and characteristics of the existing software implementations for training and evaluating RBMs, DBNs, and DBMs.
Table \ref{tab:FeatureOverviewBoltzmann} summarizes the information about the existing solutions and compares them with the features of the \apkg{BoltzmannMachines} package.
\begin{table}[h!]
\centering
\caption{\label{tab:FeatureOverviewBoltzmann} Comparison of features and basic characteristics of different software implementations of restricted and deep Boltzmann machines.}
\begin{tabular}{p{4.4cm} p{1.1cm} p{1.1cm} p{1.1cm} p{1.1cm} p{1.25cm} p{1.25cm}}
\hline \\ [3.9ex]\\
& \begin{rotate}{25}\apkg{BoltzmannMachines} \end{rotate} &
\begin{rotate}{25} \apkg{lucastheis/deepbelief} \end{rotate} &
\begin{rotate}{25} \apkg{Boltzmann} \end{rotate} &
\begin{rotate}{25} \apkg{yell/boltzmann-machines} \end{rotate} &
\begin{rotate}{25} \apkg{scikit-learn} \end{rotate} &
\begin{rotate}{25} \apkg{darch} \end{rotate} \\
\hline
\\\\[-4\medskipamount]
Compared version & 1.2 & - & 0.7 & - & 0.23 & 0.12 \\
Language & Julia & Python & Julia & Python & Python & R \\
Registered package & Yes & No & Yes & No & Yes & Yes \\
License & MIT & MIT & MIT & MIT & 3-BSD & GPL-3 \\
Bernoulli RBM & Yes & Yes & Yes & Yes & Yes & Yes \\
Categorical input & Yes & No & No & No & No & No \\
Gaussian RBM (Hinton) & Yes & Yes & No & Yes & No & No \\
Gaussian RBM (Cho) & Yes & No & No & No & No & No \\
DBN training & Yes & Yes & Yes & Yes & No & Yes \\
DBM training & Yes & No & No & Yes & No & No \\
Multimodal DBM & Yes & No & No & No & No & No \\
Exact likelihood & Yes & No & No & No & No & No \\
AIS for RBMs & Yes & Yes & No & Yes & No & No \\
AIS for DBMs & Yes & No & No & Yes & No & No \\
\\[-2\medskipamount]
\hline