Commit b3bd8c2

revising Bernadette comments
1 parent 541e4dc commit b3bd8c2

4 files changed

+77
-50
lines changed

ADictML_English.pdf

1.09 KB
Binary file not shown.

ADictML_Glossary_English.tex

Lines changed: 49 additions & 30 deletions
@@ -405,7 +405,7 @@
405405
description={An attack\index{attack} on an \gls{ml} system refers to an intentional action—either
406406
active or passive—that compromises the system's integrity, availability, or confidentiality.
407407
Active attacks involve perturbing components such as \glspl{dataset} (via \gls{datapoisoning})
408-
or communication links between \glspl{device} in a \gls{fl} setting. Passive attacks,
408+
or communication links between \glspl{device} within an \gls{ml} application. Passive attacks,
409409
such as \glspl{privattack}, aim to infer \glspl{sensattr} without modifying the system.
410410
Depending on their goal, we distinguish between \glspl{dosattack}, \gls{backdoor} attacks, and \glspl{privattack}.
411411
\\
@@ -1748,7 +1748,7 @@
17481748
is \gls{diffpriv}. The relations between different measures of privacy leakage have been
17491749
studied in the literature (see \cite{InfThDiffPriv}).
17501750
\\
1751-
See also: \gls{ml}, \gls{dataset}, \gls{prediction}, \gls{datapoint}, \gls{feature}, \gls{probmodel}, \gls{data}, \gls{mutualinformation}, \gls{diffpriv}. },
1751+
See also: \gls{privattack}, \gls{gdpr}, \gls{mutualinformation}, \gls{diffpriv}. },
17521752
first={privacy leakage},
17531753
text={privacy leakage}
17541754
}
@@ -2714,7 +2714,7 @@
27142714
description={Degree of belonging\index{degree of belonging} is a number that indicates the extent to which a \gls{datapoint}
27152715
belongs to a \gls{cluster} \cite[Ch. 8]{MLBasics}. The degree of belonging can be
27162716
interpreted as a soft \gls{cluster} assignment. \Gls{softclustering} methods can
2717-
encode the degree of belonging by a real number in the interval $[0,1]$.
2717+
encode the degree of belonging with a real number in the interval $[0,1]$.
27182718
\Gls{hardclustering} is obtained as the extreme case when the degree of belonging
27192719
only takes on values $0$ or $1$.
27202720
\\
@@ -2906,11 +2906,12 @@
29062906

29072907
\newglossaryentry{vcdim}
29082908
{name={Vapnik–Chervonenkis dimension (VC dimension)},
2909-
description={The\index{Vapnik–Chervonenkis dimension (VC dimension)} VC dimension of an infinite \gls{hypospace} is a widely-used measure
2910-
for its size. We refer to the literature (see \cite{ShalevMLBook}) for a precise definition of VC dimension
2911-
as well as a discussion of its basic properties and use in \gls{ml}.
2909+
description={The\index{Vapnik–Chervonenkis dimension (VC dimension)} VC dimension
2910+
is a widely used measure for the size of an infinite \gls{hypospace}. We refer to
2911+
the literature (see \cite{ShalevMLBook}) for a precise definition of VC dimension
2912+
as well as a discussion of its basic properties and use in \gls{ml}.
29122913
\\
2913-
See also: \gls{hypospace}, \gls{ml}.},
2914+
See also: \gls{effdim}, \gls{hypospace}, \gls{ml}.},
29142915
first={Vapnik–Chervonenkis dimension (VC dimension)},
29152916
text={VC dimension}
29162917
}
@@ -3305,16 +3306,28 @@
33053306
}
33063307

33073308
\newglossaryentry{datapoint}
3308-
{name={data point}, plural={data points},
3309-
description={A\index{data point} \gls{data} point is any object that conveys information \cite{coverthomas}. \Gls{data} points might be
3310-
students, radio signals, trees, forests, images, \glspl{rv}, real numbers, or proteins. We characterize \gls{data} points
3311-
using two types of properties. One type of property is referred to as a \gls{feature}. \Glspl{feature} are properties of a
3312-
\gls{data} point that can be measured or computed in an automated fashion.
3313-
A different kind of property is referred to as a \gls{label}. The \gls{label} of
3314-
a \gls{data} point represents some higher-level fact (or quantity of interest). In
3315-
contrast to \glspl{feature}, determining the \gls{label} of a \gls{data} point typically
3316-
requires human experts (or domain experts). Roughly speaking, \gls{ml} aims to predict
3317-
the \gls{label} of a \gls{data} point based solely on its \glspl{feature}.
3309+
{name={data point},
3310+
plural={data points},
3311+
description={
3312+
A\index{data point} \gls{data} point is any object that conveys information~\cite{coverthomas}.
3313+
Examples include students, radio signals, trees, images, \glspl{rv}, real numbers,
3314+
or proteins. \Gls{data} points are typically described by two types of properties (or attributes):
3315+
\begin{itemize}
3316+
\item \Glspl{feature} are measurable or computable properties of a \gls{data} point. These
3317+
attributes can be automatically extracted or computed using sensors, computers, or other
3318+
\gls{data} collection systems. For a \gls{data} point representing a patient, one \gls{feature}
3319+
could be the body weight.
3320+
\item \Glspl{label} are higher-level facts (or quantities of interest)
3321+
associated with the \gls{data} point. Determining the \glspl{label} of a \gls{data} point
3322+
usually requires human expertise or domain knowledge. For a \gls{data} point representing a patient,
3323+
a cancer diagnosis provided by a physician would serve as the \gls{label}.
3324+
\end{itemize}
3325+
The distinction between \glspl{feature} and \glspl{label} is not always clear-cut.
3326+
A property that is considered a \gls{label} in one setting (e.g., a cancer diagnosis)
3327+
may be treated as a \gls{feature} in another—particularly if reliable automation (e.g.,
3328+
via image analysis) allows it to be computed without human intervention.
3329+
\Gls{ml} broadly aims to predict the \gls{label} of a \gls{data} point based
3330+
on its \glspl{feature}.
33183331
\\
33193332
See also: \gls{data}, \gls{rv}, \gls{feature}, \gls{label}, \gls{ml}.},
33203333
first={data point},
@@ -4056,7 +4069,7 @@
40564069
matrix $\mQ \in \mathbb{R}^{\nrfeatures \times \nrfeatures}$ with
40574070
\gls{evd} (or spectral decomposition),
40584071
$$ \mQ = \sum_{\featureidx=1}^{\nrfeatures} \eigval{\featureidx} \vu^{(\featureidx)} \big( \vu^{(\featureidx)} \big)^{T}.$$
4059-
Here, we use the ordered (in increasing fashion) \glspl{eigenvalue}
4072+
Here, we use the \glspl{eigenvalue} sorted in ascending order,
40604073
\begin{equation}
40614074
\nonumber
40624075
\eigval{1} \leq \ldots \leq \eigval{\nrfeatures}.
@@ -4135,14 +4148,20 @@
41354148

41364149
\newglossaryentry{cm}
41374150
{name={confusion matrix},
4138-
description={Consider\index{confusion matrix} \glspl{datapoint}, which are characterized
4139-
by \glspl{feature} $\featurevec$ and \gls{label} $\truelabel$, having values from the finite
4140-
\gls{labelspace} $\labelspace = \{1, \ldots, \nrcluster\}$. For a given \gls{hypothesis} $\hypothesis$,
4141-
the confusion matrix is a $\nrcluster \times \nrcluster$ matrix with rows representing the elements of
4142-
$\labelspace$. The columns of a confusion matrix correspond to the \gls{prediction} $\hypothesis(\featurevec)$.
4143-
The $(\clusteridx,\clusteridx')$-th entry of the confusion matrix is the fraction of
4144-
\glspl{datapoint} with \gls{label} $\truelabel\!=\! \clusteridx$ and resulting in a \gls{prediction} $\hypothesis(\featurevec)\!=\!\clusteridx'$.
4145-
\\
4151+
description={Consider\index{confusion matrix} \glspl{datapoint} characterized
4152+
by \glspl{feature} $\featurevec$ and corresponding \glspl{label} $\truelabel$.
4153+
The \glspl{label} take values in a finite \gls{labelspace} $\labelspace = \{1, \ldots, \nrcluster\}$.
4154+
For a given \gls{hypothesis} $\hypothesis$, the confusion matrix is a
4155+
$\nrcluster \times \nrcluster$ matrix where each row corresponds to a different
4156+
value of the true \gls{label} $\truelabel \in \labelspace$ and each column to a
4157+
different value of the \gls{prediction} $\hypothesis(\featurevec) \in \labelspace$.
4158+
The $(\clusteridx,\clusteridx')$-th entry of the confusion matrix represents the fraction of
4159+
\glspl{datapoint} with true \gls{label} $\truelabel = \clusteridx$ that are predicted as
4160+
$\hypothesis(\featurevec) = \clusteridx'$. The main diagonal of the confusion matrix
4161+
contains the fractions of correctly classified \glspl{datapoint} (i.e., those for which
4162+
$\truelabel = \hypothesis(\featurevec)$). The off-diagonal entries contain the fractions of
4163+
\glspl{datapoint} that are misclassified by $\hypothesis$.
4164+
\\
41464165
See also: \gls{label}, \gls{labelspace}, \gls{hypothesis}, \gls{classification}.},
41474166
first={confusion matrix},text={confusion matrix} }
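The revised confusion matrix entry can be illustrated with a minimal sketch. It assumes labels are integers in {1, ..., k} and that each entry is normalized by the total number of data points, which is one reading of "fraction"; normalizing each row by the number of data points with that true label is a common alternative. The function name `confusion_matrix`, its parameters, and the toy labels are hypothetical and do not come from the glossary.

```python
def confusion_matrix(true_labels, predicted_labels, num_labels):
    """Return a num_labels x num_labels matrix (list of lists) of fractions.

    Rows index the true label, columns index the predicted label.
    Assumption: labels are integers in {1, ..., num_labels} and entries
    are normalized by the total number of data points.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one data point is required")

    counts = [[0] * num_labels for _ in range(num_labels)]
    for y, y_hat in zip(true_labels, predicted_labels):
        # Shift from 1-based labels to 0-based matrix indices.
        counts[y - 1][y_hat - 1] += 1

    n = len(true_labels)
    return [[c / n for c in row] for row in counts]


# Toy data (made up for illustration): six data points, three label values.
true_labels = [1, 1, 2, 2, 3, 3]
predicted_labels = [1, 2, 2, 2, 3, 1]

cm = confusion_matrix(true_labels, predicted_labels, num_labels=3)

# The main diagonal holds the fractions of correctly classified data points,
# so its sum is the overall accuracy of the hypothesis on this toy data.
accuracy = sum(cm[i][i] for i in range(3))
```

Under this normalization all entries sum to one, and the off-diagonal entries sum to the misclassification rate.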
41484167

@@ -4165,11 +4184,11 @@
41654184
description={DBSCAN\index{density-based spatial clustering of applications with
41664185
noise (DBSCAN)} refers to a \gls{clustering} \gls{algorithm} for \glspl{datapoint}
41674186
that are characterized by numeric \glspl{featurevec}.
4168-
Like \gls{kmeans} and \gls{softclustering} via \gls{gmm}, also DBSCAN uses the Euclidean
4187+
Like \gls{kmeans} and \gls{softclustering} via \gls{gmm}, DBSCAN also uses the Euclidean
41694188
distances between \glspl{featurevec} to determine the \glspl{cluster}. However, in contrast to \gls{kmeans}
41704189
and \gls{gmm}, DBSCAN uses a different notion of similarity between \glspl{datapoint}.
41714190
DBSCAN considers two \glspl{datapoint} as similar if they are connected
4172-
via a sequence (i.e., path) of close-by intermediate \glspl{datapoint}.
4191+
via a sequence (i.e., path) of nearby intermediate \glspl{datapoint}.
41734192
Thus, DBSCAN might consider two \glspl{datapoint} as similar (and therefore belonging
41744193
to the same \gls{cluster}) even if their \glspl{featurevec} have a large Euclidean distance.
41754194
\\
@@ -5252,7 +5271,7 @@
52525271
More formally, a decision tree is a directed \gls{graph} containing a root node that reads
52535272
in the \gls{featurevec} $\featurevec$ of a \gls{datapoint}. The root node then forwards
52545273
the \gls{datapoint} to one of its child nodes based on some elementary test on the \glspl{feature} $\featurevec$.
5255-
If the receiving child node is not a leaf node, i.e., it has itself child nodes,
5274+
If the receiving child node is not a leaf node, i.e., it has child nodes itself,
52565275
it represents another test. Based on the test result, the \gls{datapoint} is forwarded
52575276
to one of its descendants. This testing and forwarding of the \gls{datapoint} is continued
52585277
until the \gls{datapoint} ends up in a leaf node without any children.
@@ -5996,7 +6015,7 @@
59966015
{name={local dataset}, plural={local datasets},
59976016
description={The\index{local dataset} concept of a local \gls{dataset} is
59986017
in between the concept of a \gls{datapoint} and a \gls{dataset}. A local \gls{dataset} consists of several
5999-
individual \glspl{datapoint}, which are characterized by \glspl{feature} and \glspl{label}.
6018+
individual \glspl{datapoint}, characterized by \glspl{feature} and \glspl{label}.
60006019
In contrast to a single \gls{dataset} used in basic \gls{ml} methods, a local \gls{dataset} is also
60016020
related to other local \glspl{dataset} via different notions of similarity. These similarities
60026021
might arise from \glspl{probmodel} or communication infrastructure and
