|
405 | 405 | description={An attack\index{attack} on an \gls{ml} system refers to an intentional action—either
|
406 | 406 | active or passive—that compromises the system's integrity, availability, or confidentiality.
|
407 | 407 | Active attacks involve perturbing components such as \glspl{dataset} (via \gls{datapoisoning})
|
408 |
| - or communication links between \glspl{device} in a \gls{fl} setting. Passive attacks, |
| 408 | + or communication links between \glspl{device} within an \gls{ml} application. Passive attacks, |
409 | 409 | such as \glspl{privattack}, aim to infer \glspl{sensattr} without modifying the system.
|
410 | 410 | Depending on their goal, we distinguish between \glspl{dosattack}, \gls{backdoor} attacks, and \glspl{privattack}.
|
411 | 411 | \\
|
|
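For concreteness, a minimal sketch of one active attack named above — \gls{datapoisoning} via label flipping. The function name, flip fraction, and data are illustrative assumptions, not part of the glossary:

```python
# Hedged sketch of an active attack: label-flipping data poisoning.
# All names and values here are illustrative assumptions.
import numpy as np

def flip_labels(y, fraction=0.2, seed=0):
    """Return a poisoned copy of binary labels y with a random fraction flipped."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip 0 <-> 1
    return y_poisoned

y_clean = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
print(flip_labels(y_clean))  # two of the ten labels are flipped
```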
1748 | 1748 | is \gls{diffpriv}. The relations between different measures of privacy leakage have been
|
1749 | 1749 | studied in the literature (see \cite{InfThDiffPriv}).
|
1750 | 1750 | \\
|
1751 |
| - See also: \gls{ml}, \gls{dataset}, \gls{prediction}, \gls{datapoint}, \gls{feature}, \gls{probmodel}, \gls{data}, \gls{mutualinformation}, \gls{diffpriv}. }, |
| 1751 | + See also: \gls{privattack}, \gls{gdpr}, \gls{mutualinformation}, \gls{diffpriv}. }, |
1752 | 1752 | first={privacy leakage},
|
1753 | 1753 | text={privacy leakage}
|
1754 | 1754 | }
|
|
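As a concrete illustration of bounding privacy leakage via \gls{diffpriv}, here is a minimal sketch of the standard Laplace mechanism; the function and parameter names are assumptions for illustration:

```python
# Minimal sketch: the Laplace mechanism releases a noisy query answer that
# satisfies epsilon-differential privacy, thereby bounding privacy leakage.
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, seed=None):
    """Add Laplace noise with scale sensitivity/epsilon to a query result."""
    rng = np.random.default_rng(seed)
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Release a counting query (sensitivity 1) under epsilon = 0.5.
print(laplace_mechanism(value=42.0, sensitivity=1.0, epsilon=0.5, seed=0))
```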
2714 | 2714 | description={Degree of belonging\index{degree of belonging} is a number that indicates the extent to which a \gls{datapoint}
|
2715 | 2715 | belongs to a \gls{cluster} \cite[Ch. 8]{MLBasics}. The degree of belonging can be
|
2716 | 2716 | interpreted as a soft \gls{cluster} assignment. \Gls{softclustering} methods can
|
2717 |
| - encode the degree of belonging by a real number in the interval $[0,1]$. |
| 2717 | + encode the degree of belonging with a real number in the interval $[0,1]$. |
2718 | 2718 | \Gls{hardclustering} is obtained as the extreme case when the degree of belonging
|
2719 | 2719 | only takes on values $0$ or $1$.
|
2720 | 2720 | \\
|
|
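A short sketch of degrees of belonging in practice, assuming scikit-learn's GaussianMixture as the \gls{softclustering} method (an illustrative choice, not prescribed by the glossary):

```python
# Sketch: degrees of belonging in [0, 1] via soft clustering with a GMM;
# hard clustering is recovered by assigning each point to its most likely cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
degrees = gmm.predict_proba(X)  # each row sums to 1, entries in [0, 1]
hard = degrees.argmax(axis=1)   # extreme case: 0/1 degrees of belonging
print(degrees[0], hard[0])
```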
2906 | 2906 |
|
2907 | 2907 | \newglossaryentry{vcdim}
|
2908 | 2908 | {name={Vapnik–Chervonenkis dimension (VC dimension)},
|
2909 |
| - description={The\index{Vapnik–Chervonenkis dimension (VC dimension)} VC dimension of an infinite \gls{hypospace} is a widely-used measure |
2910 |
| - for its size. We refer to the literature (see \cite{ShalevMLBook}) for a precise definition of VC dimension |
2911 |
| - as well as a discussion of its basic properties and use in \gls{ml}. |
| 2909 | + description={The\index{Vapnik–Chervonenkis dimension (VC dimension)} VC dimension |
| 2910 | + is a widely used measure for the size of an infinite \gls{hypospace}. We refer to |
| 2911 | + the literature (see \cite{ShalevMLBook}) for a precise definition of VC dimension |
| 2912 | + as well as a discussion of its basic properties and use in \gls{ml}. |
2912 | 2913 | \\
|
2913 |
| - See also: \gls{hypospace}, \gls{ml}.}, |
| 2914 | + See also: \gls{effdim}, \gls{hypospace}, \gls{ml}.}, |
2914 | 2915 | first={Vapnik–Chervonenkis dimension (VC dimension)},
|
2915 | 2916 | text={VC dimension}
|
2916 | 2917 | }
|
|
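To make this size measure concrete, a small sketch (assuming scikit-learn's LinearSVC is available) checking that linear classifiers on the plane realize every labeling of three points in general position, consistent with their VC dimension of $d + 1 = 3$ for $d = 2$:

```python
# Illustrative sketch (assumed setup, not from the glossary): linear
# classifiers can shatter 3 non-collinear points in the plane.
import itertools
import numpy as np
from sklearn.svm import LinearSVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # not collinear

shattered = True
for labels in itertools.product([0, 1], repeat=3):
    if len(set(labels)) == 1:
        continue  # constant labelings are trivially realizable via the bias
    clf = LinearSVC(C=1e6).fit(points, labels)  # near-hard-margin fit
    if (clf.predict(points) != np.array(labels)).any():
        shattered = False
print("3 points shattered by linear classifiers:", shattered)  # expect: True
```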
3305 | 3306 | }
|
3306 | 3307 |
|
3307 | 3308 | \newglossaryentry{datapoint}
|
3308 |
| -{name={data point}, plural={data points}, |
3309 |
| - description={A\index{data point} \gls{data} point is any object that conveys information \cite{coverthomas}. \Gls{data} points might be |
3310 |
| - students, radio signals, trees, forests, images, \glspl{rv}, real numbers, or proteins. We characterize \gls{data} points |
3311 |
| - using two types of properties. One type of property is referred to as a \gls{feature}. \Glspl{feature} are properties of a |
3312 |
| - \gls{data} point that can be measured or computed in an automated fashion. |
3313 |
| - A different kind of property is referred to as a \gls{label}. The \gls{label} of |
3314 |
| - a \gls{data} point represents some higher-level fact (or quantity of interest). In |
3315 |
| - contrast to \glspl{feature}, determining the \gls{label} of a \gls{data} point typically |
3316 |
| - requires human experts (or domain experts). Roughly speaking, \gls{ml} aims to predict |
3317 |
| - the \gls{label} of a \gls{data} point based solely on its \glspl{feature}. |
| 3309 | +{name={data point}, |
| 3310 | + plural={data points}, |
| 3311 | + description={ |
| 3312 | + A\index{data point} \gls{data} point is any object that conveys information~\cite{coverthomas}. |
| 3313 | + Examples include students, radio signals, trees, images, \glspl{rv}, real numbers, |
| 3314 | + or proteins. \Gls{data} points are typically described by two types of properties (or attributes): |
| 3315 | +\begin{itemize} |
| 3316 | + \item \Glspl{feature} are measurable or computable properties of a \gls{data} point. These |
| 3317 | + attributes can be automatically extracted or computed using sensors, computers, or other |
| 3318 | + \gls{data} collection systems. For a \gls{data} point representing a patient, one \gls{feature} |
| 3319 | + could be the body weight. |
| 3320 | + \item \Glspl{label} are higher-level facts (or quantities of interest) |
| 3321 | + associated with the \gls{data} point. Determining the \glspl{label} of a \gls{data} point |
| 3322 | + usually requires human expertise or domain knowledge. For a \gls{data} point representing a patient, |
| 3323 | + a cancer diagnosis provided by a physician would serve as the \gls{label}. |
| 3324 | +\end{itemize} |
| 3325 | + The distinction between \glspl{feature} and \glspl{label} is not always clear-cut. |
| 3326 | + A property that is considered a \gls{label} in one setting (e.g., a cancer diagnosis) |
| 3327 | + may be treated as a \gls{feature} in another—particularly if reliable automation (e.g., |
| 3328 | + via image analysis) allows it to be computed without human intervention. |
| 3329 | + \Gls{ml} broadly aims to predict the \gls{label} of a \gls{data} point based |
| 3330 | + on its \glspl{feature}. |
3318 | 3331 | \\
|
3319 | 3332 | See also: \gls{data}, \gls{rv}, \gls{feature}, \gls{label}, \gls{ml}.},
|
3320 | 3333 | first={data point},
|
|
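A minimal sketch of the feature/label distinction in code, mirroring the patient example in the entry above (all names and values are invented for illustration):

```python
# Sketch: data points represented by a feature matrix X (machine-measurable
# properties) and a label vector y (expert-provided facts). Values are made up.
import numpy as np

# Three patients; features: body weight [kg] and age [years].
X = np.array([[72.5, 34.0],
              [88.0, 51.0],
              [61.2, 42.0]])
# Labels: physician-provided diagnosis (1 = cancer, 0 = healthy).
y = np.array([0, 1, 0])
print(X.shape, y.shape)  # (3, 2) (3,)
```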
4056 | 4069 | matrix $\mQ \in \mathbb{R}^{\nrfeatures \times \nrfeatures}$ with
|
4057 | 4070 | \gls{evd} (or spectral decomposition),
|
4058 | 4071 | $$ \mQ = \sum_{\featureidx=1}^{\nrfeatures} \eigval{\featureidx} \vu^{(\featureidx)} \big( \vu^{(\featureidx)} \big)^{T}.$$
|
4059 |
| - Here, we use the ordered (in increasing fashion) \glspl{eigenvalue} |
| 4072 | + Here, we use the \glspl{eigenvalue} ordered in ascending order |
4060 | 4073 | \begin{equation}
|
4061 | 4074 | \nonumber
|
4062 | 4075 | \eigval{1} \leq \ldots \leq \eigval{\nrfeatures}.
|
|
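A quick sketch of the spectral decomposition above using NumPy, whose `eigh` routine returns the \glspl{eigenvalue} of a symmetric matrix already in ascending order (the matrix here is random, purely for illustration):

```python
# Sketch: eigendecomposition Q = sum_j lambda_j u_j u_j^T of a symmetric matrix,
# with eigenvalues in ascending order as assumed in the entry above.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = (A + A.T) / 2  # symmetrize to obtain a real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(Q)  # eigh: ascending eigenvalues
Q_rec = sum(lam * np.outer(u, u) for lam, u in zip(eigvals, eigvecs.T))
assert np.allclose(Q, Q_rec)  # reconstruction from the spectral decomposition
print(eigvals)  # lambda_1 <= ... <= lambda_n
```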
4135 | 4148 |
|
4136 | 4149 | \newglossaryentry{cm}
|
4137 | 4150 | {name={confusion matrix},
|
4138 |
| - description={Consider\index{confusion matrix} \glspl{datapoint}, which are characterized |
4139 |
| - by \glspl{feature} $\featurevec$ and \gls{label} $\truelabel$, having values from the finite |
4140 |
| - \gls{labelspace} $\labelspace = \{1, \ldots, \nrcluster\}$. For a given \gls{hypothesis} $\hypothesis$, |
4141 |
| - the confusion matrix is a $\nrcluster \times \nrcluster$ matrix with rows representing the elements of |
4142 |
| - $\labelspace$. The columns of a confusion matrix correspond to the \gls{prediction} $\hypothesis(\featurevec)$. |
4143 |
| - The $(\clusteridx,\clusteridx')$-th entry of the confusion matrix is the fraction of |
4144 |
| - \glspl{datapoint} with \gls{label} $\truelabel\!=\! \clusteridx$ and resulting in a \gls{prediction} $\hypothesis(\featurevec)\!=\!\clusteridx'$. |
4145 |
| - \\ |
| 4151 | + description={Consider\index{confusion matrix} \glspl{datapoint} characterized |
| 4152 | + by \glspl{feature} $\featurevec$ and corresponding \glspl{label} $\truelabel$. |
| 4153 | + The labels take values in a finite \gls{labelspace} $\labelspace = \{1, \ldots, \nrcluster\}$. |
| 4154 | + For a given \gls{hypothesis} $\hypothesis$, the confusion matrix is a |
| 4155 | + $\nrcluster \times \nrcluster$ matrix where each row corresponds to a different |
| 4156 | + value of the true \gls{label} $\truelabel \in \labelspace$ and each column to a |
| 4157 | + different value of the \gls{prediction} $\hypothesis(\featurevec) \in \labelspace$. |
| 4158 | + The $(\clusteridx,\clusteridx')$-th entry of the confusion matrix represents the fraction of |
| 4159 | + \glspl{datapoint} with true \gls{label} $\truelabel = \clusteridx$ that are predicted as |
| 4160 | + $\hypothesis(\featurevec) = \clusteridx'$. The main diagonal of the confusion matrix |
| 4161 | + contains the fractions of correctly classified \glspl{datapoint} (i.e., those for which |
| 4162 | + $\truelabel = \hypothesis(\featurevec)$). The off-diagonal entries contain the fractions of |
| 4163 | + \glspl{datapoint} that are misclassified by $\hypothesis$. |
| 4164 | + \\ |
4146 | 4165 | See also: \gls{label}, \gls{labelspace}, \gls{hypothesis}, \gls{classification}.},
|
4147 | 4166 | first={confusion matrix},text={confusion matrix} }
|
4148 | 4167 |
|
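A small sketch of the row-normalized confusion matrix described above, hand-rolled for clarity (labels are 0-indexed here, whereas the entry uses $1, \ldots, \nrcluster$):

```python
# Sketch: entry (k, k') = fraction of data points with true label k that the
# hypothesis predicts as k'. Diagonal = correct, off-diagonal = misclassified.
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1.0
    row_sums = C.sum(axis=1, keepdims=True)
    return C / np.maximum(row_sums, 1.0)  # guard against empty classes

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(confusion_matrix(y_true, y_pred, n_classes=3))
```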
|
4165 | 4184 | description={DBSCAN\index{density-based spatial clustering of applications with
|
4166 | 4185 | noise (DBSCAN)} refers to a \gls{clustering} \gls{algorithm} for \glspl{datapoint}
|
4167 | 4186 | that are characterized by numeric \glspl{featurevec}.
|
4168 |
| - Like \gls{kmeans} and \gls{softclustering} via \gls{gmm}, also DBSCAN uses the Euclidean |
| 4187 | + Like \gls{kmeans} and \gls{softclustering} via \gls{gmm}, DBSCAN also uses the Euclidean |
4169 | 4188 | distances between \glspl{featurevec} to determine the \glspl{cluster}. However, in contrast to \gls{kmeans}
|
4170 | 4189 | and \gls{gmm}, DBSCAN uses a different notion of similarity between \glspl{datapoint}.
|
4171 | 4190 | DBSCAN considers two \glspl{datapoint} as similar if they are connected
|
4172 |
| - via a sequence (i.e., path) of close-by intermediate \glspl{datapoint}. |
| 4191 | + via a sequence (i.e., path) of nearby intermediate \glspl{datapoint}. |
4173 | 4192 | Thus, DBSCAN might consider two \glspl{datapoint} as similar (and therefore belonging
|
4174 | 4193 | to the same \gls{cluster}) even if their \glspl{featurevec} have a large Euclidean distance.
|
4175 | 4194 | \\
|
|
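An illustrative sketch (assuming scikit-learn's DBSCAN) of this path-based notion of similarity: the two endpoints of a chain of points are far apart in Euclidean distance, yet land in the same cluster:

```python
# Sketch: DBSCAN places two points with a large Euclidean distance in the same
# cluster because they are connected via a path of nearby intermediate points.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.linspace(0.0, 5.0, 51).reshape(-1, 1)  # consecutive points 0.1 apart
labels = DBSCAN(eps=0.15, min_samples=2).fit_predict(X)
print(labels[0] == labels[-1])  # True: endpoints ~5 apart share a cluster
```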
5252 | 5271 | More formally, a decision tree is a directed \gls{graph} containing a root node that reads
|
5253 | 5272 | in the \gls{featurevec} $\featurevec$ of a \gls{datapoint}. The root node then forwards
|
5254 | 5273 | the \gls{datapoint} to one of its child nodes based on some elementary test on the \glspl{feature} $\featurevec$.
|
5255 |
| - If the receiving child node is not a leaf node, i.e., it has itself child nodes, |
| 5274 | + If the receiving child node is not a leaf node, i.e., it has child nodes itself, |
5256 | 5275 | it represents another test. Based on the test result, the \gls{datapoint} is forwarded
|
5257 | 5276 | to one of its descendants. This testing and forwarding of the \gls{datapoint} is continued
|
5258 | 5277 | until the \gls{datapoint} ends up in a leaf node without any children.
|
|
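A minimal sketch of the forwarding procedure just described, with an assumed nested-dictionary tree representation (illustrative only, not the glossary's notation):

```python
# Sketch: a data point's feature vector x is forwarded by elementary feature
# tests until it reaches a leaf node without children, which holds the prediction.
def predict(node, x):
    while "children" in node:  # non-leaf nodes represent a test
        branch = 0 if x[node["feature"]] <= node["threshold"] else 1
        node = node["children"][branch]
    return node["label"]  # leaf nodes carry the prediction

tree = {
    "feature": 0, "threshold": 0.5,
    "children": [
        {"label": "A"},
        {"feature": 1, "threshold": 2.0,
         "children": [{"label": "B"}, {"label": "C"}]},
    ],
}
print(predict(tree, x=[0.7, 3.1]))  # -> C
```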
5996 | 6015 | {name={local dataset}, plural={local datasets},
|
5997 | 6016 | description={The\index{local dataset} concept of a local \gls{dataset} is
|
5998 | 6017 | in between the concepts of a \gls{datapoint} and a \gls{dataset}. A local \gls{dataset} consists of several
|
5999 |
| - individual \glspl{datapoint}, which are characterized by \glspl{feature} and \glspl{label}. |
| 6018 | + individual \glspl{datapoint}, characterized by \glspl{feature} and \glspl{label}. |
6000 | 6019 | In contrast to a single \gls{dataset} used in basic \gls{ml} methods, a local \gls{dataset} is also
|
6001 | 6020 | related to other local \glspl{dataset} via different notions of similarity. These similarities
|
6002 | 6021 | might arise from \glspl{probmodel} or communication infrastructure and
|
|