Merged
4 changes: 2 additions & 2 deletions source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -57,11 +57,9 @@ Data
widgets/data/applydomain
widgets/data/purgedomain
widgets/data/rank
widgets/data/correlations
widgets/data/color
widgets/data/featurestatistics
widgets/data/melt
widgets/data/neighbors
widgets/data/unique
widgets/data/groupby

Expand Down Expand Up @@ -150,6 +148,8 @@ Unsupervised
:maxdepth: 1

widgets/unsupervised/PCA
widgets/unsupervised/neighbors
widgets/unsupervised/correlations
widgets/unsupervised/correspondenceanalysis
widgets/unsupervised/distancemap
widgets/unsupervised/distances
Binary file removed source/widgets/data/images/Correlations-Example.png
Binary file not shown.
Binary file removed source/widgets/data/images/Correlations-links.png
Binary file not shown.
Binary file removed source/widgets/data/images/Correlations-stamped.png
Binary file not shown.
Binary file not shown.
Binary file removed source/widgets/data/images/neighbours-example1.png
Binary file not shown.
Binary file removed source/widgets/data/images/neighbours-example2.png
Binary file not shown.
Binary file removed source/widgets/data/images/neighbours-stamped.png
Binary file not shown.
42 changes: 0 additions & 42 deletions source/widgets/data/neighbors.md

This file was deleted.

21 changes: 10 additions & 11 deletions source/widgets/unsupervised/PCA.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,20 +10,19 @@ PCA linear transformation of input data.
**Outputs**

- Transformed Data: PCA transformed data
- Components: [Eigenvectors](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors).
- Data: original data with PCA components as meta variables
- Components: [Eigenvectors](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors)
- PCA: PCA to use as Scorer in [Rank](../data/rank.md)

[Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) computes the PCA linear transformation of the input data. It outputs either a transformed dataset with weights of individual instances or weights of principal components.

![](images/PCA-stamped.png)
![](images/PCA-stamped.png){width=500px}

1. Select how many principal components you wish in your output. It is best to choose as few as possible with variance covered as high as possible. You can also set how much variance you wish to cover with your principal components.
2. You can normalize data to adjust the values to common scale. If checked, columns are divided by their standard deviations.
1. Select how many principal components you wish in your output. It is best to choose as few components as possible while covering as much variance as possible (parameter *Explained variance*). You can also set how much variance you wish to cover with your principal components.
2. You can normalize data to adjust the values to common scale. If checked, columns are divided by their standard deviations. One can also set how many components to display in the graph.
3. When *Apply Automatically* is ticked, the widget will automatically communicate all changes. Alternatively, click *Apply*.
4. Press *Save Image* if you want to save the created image to your computer.
5. Produce a report.
6. Principal components graph, where the red (lower) line is the variance covered per component and the green (upper) line is cumulative variance covered by components.

The number of components of the transformation can be selected either in the *Components Selection* input box or by dragging the vertical cutoff line in the graph.
The principal components graph, called a scree plot, shows the red (lower) line, representing the variance covered per component, and the green (upper) line, representing the cumulative variance covered by components. The number of components of the transformation can be selected either in the *Components* input box or by dragging the vertical cutoff line in the graph.
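
The two lines of the scree plot can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the widget's implementation: the function names and the eigenvalues are hypothetical, and the eigenvalues are assumed to be already sorted from largest to smallest.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return (per-component ratio, cumulative ratio) for PCA eigenvalues."""
    total = sum(eigenvalues)
    ratios = [value / total for value in eigenvalues]
    return ratios, list(accumulate(ratios))


def components_for(eigenvalues, target):
    """Smallest number of leading components whose cumulative variance reaches target."""
    _, cumulative = explained_variance(eigenvalues)
    for count, covered in enumerate(cumulative, start=1):
        # Small tolerance so floating-point rounding does not add a component.
        if covered >= target - 1e-12:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues (variance per component), largest first.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
ratios, cumulative = explained_variance(eigenvalues)  # lower and upper line
n_components = components_for(eigenvalues, 0.9)       # cutoff for 90% variance
```

Here `ratios` corresponds to the per-component (lower) line and `cumulative` to the cumulative (upper) line; `components_for` mimics choosing the cutoff by the amount of variance to cover.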

Preprocessing
-------------
Expand All @@ -39,8 +38,8 @@ Examples

**PCA** can be used to simplify visualizations of large datasets. Below, we used the *Iris* dataset to show how we can improve the visualization of the dataset with PCA. The transformed data in the [Scatter Plot](../visualize/scatterplot.md) show a much clearer distinction between classes than the default settings.

![](images/PCAExample.png)
![](images/PCA-Example1.png)

The widget provides two outputs: transformed data and principal components. Transformed data are weights for individual instances in the new coordinate system, while components are the system descriptors (weights for principal components). When fed into the [Data Table](../data/datatable.md), we can see both outputs in numerical form. We used two data tables in order to provide a more clean visualization of the workflow, but you can also choose to edit the links in such a way that you display the data in just one data table. You only need to create two links and connect the *Transformed data* and *Components* inputs to the *Data* output.
PCA can also be used as a scorer for the [Rank](../data/rank.md) widget. We used the *iris* data for this example. The data is passed both to Rank and to PCA. PCA passes the Scorer output to the Rank widget. Rank now shows feature scores for the first two principal components.

![](images/PCAExample2.png)
![](images/PCA-Example2.png)
Original file line number Diff line number Diff line change
Expand Up @@ -15,22 +15,23 @@ Compute all pairwise attribute correlations.

**Correlations** computes Pearson or Spearman correlation scores for all pairs of features in a dataset. These methods can only detect monotonic relationships.

![](images/Correlations-stamped.png)
![](images/Correlations-stamped.png){width=400px}

1. Correlation measure:
- Pairwise [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) correlation.
- Pairwise [Spearman](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) correlation.
2. Select the attribute for computing correlations. Useful for large datasets.
2. Filter for finding attribute pairs.
3. A list of attribute pairs with correlation coefficient. Press *Finished* to stop computation for large datasets.
4. Access widget help and produce report.

Press *Finished* to stop computation for large datasets.
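
To make the difference between the two measures concrete, here is a minimal pure-Python sketch. It is not the widget's code; the helper names `pearson`, `ranks` and `spearman` are made up for this illustration, and Spearman is computed as the Pearson correlation of average ranks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = mean_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# A monotonic but non-linear relationship (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [value ** 3 for value in x]
linear_score = pearson(x, y)      # below 1, the relationship is not linear
monotonic_score = spearman(x, y)  # 1, the relationship is perfectly monotonic
```

For data like this, Spearman reports a perfect correlation while Pearson does not, because Pearson measures how well a straight line fits.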

Example
-------

Correlations can be computed only for numeric (continuous) features, so we will use *housing* as an example data set. Load it in the [File](file.md) widget and connect it to **Correlations**. Positively correlated feature pairs will be at the top of the list and negatively correlated will be at the bottom.
Correlations can be computed only for numeric (continuous) features, so we will use *brown-selected* as an example data set. Load it in the [File](file.md) widget and connect it to **Correlations**. Positively correlated feature pairs will be at the top of the list and negatively correlated will be at the bottom.

![](images/Correlations-links.png)

Go to the most negatively correlated pair, DIS-NOX. Now connect [Scatter Plot](../visualize/scatterplot.md) to **Correlations** and set two outputs, Data to Data and Features to Features. Observe how the feature pair is immediately set in the scatter plot. Looks like the two features are indeed negatively correlated.
Select the most correlated feature pair. Now connect [Scatter Plot](../visualize/scatterplot.md) to **Correlations** and set two outputs, Data to Data and Features to Features. Observe how the feature pair is immediately set in the scatter plot. Looks like the two features are indeed positively correlated.

![](images/Correlations-Example.png)
18 changes: 10 additions & 8 deletions source/widgets/unsupervised/distancefile.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,21 +7,23 @@ Loads an existing distance file.

- Distance File: distance matrix

![](images/DistanceFile-stamped.png)
![](images/DistanceFile-stamped.png){width=400px}

1. Choose from a list of previously saved distance files.
2. Browse for saved distance files.
3. Reload the selected distance file.
4. Information about the distance file (number of points,
labelled/unlabelled).
5. Browse documentation datasets.
6. Produce a report.
Browse for saved distance files.
Reload the selected distance file.
2. If *Treat triangular matrices as symmetric* is checked, triangular matrices will be mirrored over diagonal.
3. Browse documentation datasets.

The simplest way to prepare a distance file is to use Excel. The widget currently processes only single-sheet workbooks. The matrix can be either rectangular, or upper- or lower-triangular, with labels given for columns (immediately above) or rows (immediately to the left) or both. Empty cells are treated as zeros. If the matrix is triangular and only one set of labels is given or both sets are equal, the other half can be filled automatically, making the matrix symmetric.

![](images/DistanceFile-Excel.png){width=400px}

Above is an example of an upper-triangular matrix.
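
The automatic filling of the other half can be pictured with the following sketch. It only illustrates the idea under stated assumptions and is not the widget's loader: the function `symmetrize` is hypothetical, empty cells are represented as `None`, and the matrix is assumed to be square.

```python
def symmetrize(matrix):
    """Mirror a triangular matrix over its diagonal.

    Empty cells (None) are treated as zeros, as described above.
    """
    size = len(matrix)
    filled = [[0.0 if cell is None else float(cell) for cell in row] for row in matrix]
    result = [row[:] for row in filled]
    for i in range(size):
        for j in range(i + 1, size):
            upper, lower = filled[i][j], filled[j][i]
            if lower == 0.0:
                result[j][i] = upper  # copy the upper triangle down
            elif upper == 0.0:
                result[i][j] = lower  # copy the lower triangle up
    return result


# Hypothetical upper-triangular distance matrix with empty lower cells.
upper_triangular = [
    [0, 1, 2],
    [None, 0, 3],
    [None, None, 0],
]
symmetric = symmetrize(upper_triangular)
```

The same function handles lower-triangular input, since it copies whichever half is filled into the empty one.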

Example
-------

When you want to use a custom-set distance file that you've saved before, open the **Distance File** widget and select the desired file with the *Browse* icon. This widget loads the existing distance file. In the snapshot below, we loaded the transformed *Iris* distance matrix from the [Save Distance Matrix](../unsupervised/savedistancematrix.md) example. We displayed the transformed data matrix in the [Distance Map](../unsupervised/distancemap.md) widget. We also decided to display a distance map of the original *Iris* dataset for comparison.
When you want to use a custom-set distance file that you've saved before, open the **Distance File** widget and select the desired file with the *Browse* icon. This widget loads the existing distance file. In the snapshot below, we loaded the test square matrix. We displayed the matrix in the [Distance Matrix](../unsupervised/distancematrix.md) widget.

![](images/DistanceFile-Example.png)
28 changes: 11 additions & 17 deletions source/widgets/unsupervised/distancemap.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,41 +12,35 @@ Visualizes distances between items.
- Data: instances selected from the matrix
- Features: attributes selected from the matrix

The **Distance Map** visualizes distances between objects. The visualization is the same as if we printed out a table of numbers, except that the numbers are replaced by colored spots.
The **Distance Map** visualizes distances between objects. The visualization is the same as if we printed out a table of numbers, except that the numbers are replaced by colored spots. Conceptually, it is similar to the [Heat Map](../visualize/heatmap.md) widget.

Distances are most often those between instances ("*rows*" in the [Distances](../unsupervised/distances.md) widget) or attributes ("*columns*" in Distances widget). The only suitable input for **Distance Map** is the [Distances](../unsupervised/distances.md) widget. For the output, the user can select a region of the map and the widget will output the corresponding instances or attributes. Also note that the **Distances** widget ignores discrete values and calculates distances only for continuous data, thus it can only display distance map for discrete data if you [Continuize](../data/continuize.md) them first.
Distances are most often those between instances ("*rows*" in the [Distances](distances.md) widget) or attributes ("*columns*" in Distances widget). The two suitable inputs for **Distance Map** are the [Distances](distances.md) and the [Distance File](distancefile.md) widget. For the output, the user can select a region of the map and the widget will output the corresponding instances or attributes. Also note that the **Distances** widget ignores discrete values and calculates distances only for continuous data, thus it can only display distance map for discrete data if you [Continuize](../data/continuize.md) them first.

The snapshot shows distances between columns in the *heart disease* data, where smaller distances are represented with light and larger with dark orange. The matrix is symmetric and the diagonal is a light shade of orange - no attribute is different from itself. Symmetricity is always assumed, while the diagonal may also be non-zero.
The snapshot shows distances between columns in the *heart_disease* data, where smaller distances are represented with blue and larger with yellow/white. The matrix is symmetric and the diagonal is blue - no attribute is different from itself. Symmetricity is always assumed, while the diagonal may also be non-zero.

![](images/DistanceMap-stamped.png)
![](images/DistanceMap-stamped.png){width=500px}

1. *Element sorting* arranges elements in the map by
- None (lists instances as found in the dataset)
- **Clustering** (clusters data by similarity)
- **Clustering with ordered leaves** (maximizes the sum of similarities of adjacent elements)
2. *Colors*
- **Colors** (select the color palette for your distance map)
- **Low** and **High** are thresholds for the color palette (low for instances or attributes with low distances and high for instances or attributes with high distances).
- **Range**: Define the low and high thresholds for the color palette (low for instances or attributes with low distances and high for instances or attributes with high distances).
3. Select *Annotations*.
4. If *Send Selected Automatically* is on, the data subset is communicated automatically, otherwise you need to press *Send Selected*.
5. Press *Save Image* if you want to save the created image to your computer.
6. Produce a report.

Normally, a color palette is used to visualize the entire range of distances appearing in the matrix. This can be changed by setting the low and high threshold. In this way we ignore the differences in distances outside this interval and visualize the interesting part of the distribution.
Normally, a color palette is used to visualize the entire range of distances appearing in the matrix. This can be changed by setting the low and high threshold. In this way, we ignore the differences in distances outside this interval and visualize the interesting part of the distribution.
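
One way to think about the thresholds is as clipping before the distance is mapped onto the palette. The sketch below assumes a simple linear mapping; the function name `palette_position` is hypothetical, and the widget's actual color mapping may differ.

```python
def palette_position(distance, low, high):
    """Map a distance to a position in [0, 1] on the color palette.

    Distances below `low` or above `high` are clipped, so differences
    outside the interval are no longer visible.
    """
    if high <= low:
        raise ValueError("high threshold must be greater than low threshold")
    clipped = min(max(distance, low), high)
    return (clipped - low) / (high - low)


# Hypothetical distances and thresholds.
distances = [0.0, 2.0, 3.5, 5.0, 10.0]
positions = [palette_position(d, low=2.0, high=5.0) for d in distances]
```

With these thresholds, 0.0 and 2.0 get the same color, as do 5.0 and 10.0, while the whole palette is spent on the interval between 2.0 and 5.0.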

Below, we visualized the most correlated attributes (distances by columns) in the *heart disease* dataset by setting the color threshold for high distances to the minimum. We get a predominantly black square, where attributes with the lowest distance scores are represented by a lighter shade of the selected color schema (in our case: orange). Beside the diagonal line, we see that in our example *ST by exercise* and *major vessels colored* are the two attributes closest together.
Below, we visualized the most correlated attributes (distances by columns) in the *heart_disease* dataset by lowering the color threshold for high distances. We get a predominantly white square, where attributes with the lowest distance scores are represented by blue. We see that, beside the diagonal line, *age* and *major vessels colored* are the two attributes closest together.

![](images/DistanceMap-Highlighted.png)
![](images/DistanceMap-Threshold.png){width=400px}

The user can select a region in the map with the usual click-and-drag of the cursor. When a part of the map is selected, the widget outputs all items from the selected cells.

Examples
--------
Example
-------

The first workflow shows a very standard use of the **Distance Map** widget. We select 70% of the original *Iris* data as our sample and view the distances between rows in **Distance Map**.

![](images/DistanceMap-Example1.png)

In the second example, we use the *heart disease* data again and select a subset of women only from the [Scatter Plot](../visualize/scatterplot.md). Then, we visualize distances between columns in the **Distance Map**. Since the subset also contains some discrete data, the [Distances](../unsupervised/distances.md) widget warns us it will ignore the discrete features, thus we will see only continuous instances/attributes in the map.
The workflow shows a very standard use of the **Distance Map** widget. We select the *Iris* data and view the distances between rows in **Distance Map**.

![](images/DistanceMap-Example.png)