Update 01-Intro-High-Dimensional-Data.md
qualiaMachine authored Aug 15, 2024
1 parent 7fa2c35 commit 0d77aea
Showing 1 changed file with 12 additions and 5 deletions.
17 changes: 12 additions & 5 deletions _episodes/01-Intro-High-Dimensional-Data.md
@@ -263,11 +263,18 @@ human gene expression e.g.
{:.challenge}

## High-Dimensional Data
What is high-dimensional data? Unfortunately, there isn't a precise definition. Oftentimes, when people use the term, they are referring to data that has so many features that it is difficult to determine which features are relevant to the research question (dozens or more). In a modeling context, however, high-dimensional data is usually defined as a dataset where the number of features approaches or exceeds the number of observations.

The "curse of dimensionality" generally refers to the issues that arise when dealing with data in high-dimensional spaces, where distances between data points become less meaningful and the data becomes more sparse. This can lead to challenges in terms of computational complexity, overfitting in machine learning models, difficulties in visualization, and the need for specialized techniques to handle such data effectively.
What is high-dimensional data? Unfortunately, there isn’t a precise definition. Often, when people use the term, they are referring to the specific problems and headaches that arise when working with data that has many features, also known as dimensions (typically dozens or more). These problems are generally referred to as the "curse of dimensionality".

So, whether the term "high-dimensional data" is used to describe datasets with too many features to interpret easily or datasets where the number of features approaches or exceeds the number of observations, the underlying challenges are the same: increased complexity and greater difficulty in analysis and modeling.
### Curse of Dimensionality

The “curse of dimensionality” refers to the challenges that arise when dealing with data in high-dimensional spaces. These challenges include:

- **Overfitting in Models:** Machine learning models are prone to overfitting when the number of features approaches or exceeds the number of observations in the data. In this context, what is considered "high-dimensional" is relative to the number of observations in your data.
- **Increased Computational Complexity:** As dimensions increase, so do memory and processing requirements, which can make analysis slower or impractical.
- **Visualization Challenges:** Visualizing data with many features becomes challenging, as humans can easily comprehend only up to three dimensions.
- **Increased Sparsity:** As dimensions increase, the volume of the space grows exponentially, making data points more spread out and less dense.
- **Reduced Meaningfulness of Distance:** As dimensions increase, the concept of distance between data points becomes less intuitive and less useful for distinguishing between different points.
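
The last point in the list can be checked with a small simulation. The sketch below is illustrative only and is not part of the lesson's own code: it assumes points drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance. Under these assumptions the ratio should shrink as dimensions are added, meaning all points look roughly equally far away.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one query point
    to n_points random points in the unit hypercube."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# Compare the contrast across increasing dimensionality.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(500, n_dims), 3))
```

A large value means nearest and farthest neighbors are easy to tell apart; a value near zero means distance carries little information for separating points.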

Throughout this workshop, we'll see how these challenges, or "curses," apply to our research goals and explore strategies to address them.
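
As one example of such a "curse", the sparsity point above can be sketched as follows. This is a hypothetical illustration under simple assumptions: uniform random points, each axis split into two equal bins, and a helper name (`occupied_fraction`) invented for this example. It reports what fraction of the resulting grid cells contain at least one data point.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_axis=2, seed=0):
    """Fraction of grid cells that contain at least one random point."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    occupied = set()
    for _ in range(n_points):
        # Identify the cell by the bin index of the point along each axis.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** n_dims


# With a fixed sample size, coverage collapses as dimensions are added.
for n_dims in (2, 5, 10, 20):
    print(n_dims, occupied_fraction(1000, n_dims))
```

With 1,000 points, at most 1,000 cells can ever be occupied, while the number of cells doubles with every added dimension, so the occupied fraction is bounded by 1000 / 2^d and falls toward zero.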

# End of part 1
In part two, we'll start exploring a new dataset.
