initial upload #1

Merged
merged 10 commits into from
Mar 24, 2020
Changes from 6 commits
23 changes: 23 additions & 0 deletions .github/workflows/npm-publish.yml
@@ -0,0 +1,23 @@
name: npm-publish

on:
  push:
    branches:
      - master

jobs:
  npm-publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@master
      - name: Set up Node
        uses: actions/setup-node@master
        with:
          node-version: '12.x'
      - name: Publish
        if: github.ref == 'refs/heads/master'
        uses: mikeal/merge-release@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          NPM_AUTH_TOKEN: ${{ secrets.NPM_AUTH_TOKEN }}
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Greene Laboratory

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
240 changes: 238 additions & 2 deletions README.md
@@ -1,2 +1,238 @@
# hclust
Agglomerative hierarchical clustering in JavaScript
### hclust
[Agglomerative hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) in JavaScript

Adapted from [hcluster.js](https://github.com/cmpolis/hcluster.js) by [@ChrisPolis](https://twitter.com/chrispolis)

---

### Usage

#### Browser

```html
<script src="hclust.min.js"></script>
<script>
hclust.clusterData(...);
hclust.euclideanDistance(...);
hclust.avgDistance(...);
hclust.minDistance(...);
hclust.maxDistance(...);
</script>
```

#### Node

`npm install @greenelab/hclust`

or

`yarn add @greenelab/hclust`

then

```javascript
import { clusterData } from '@greenelab/hclust';
import { euclideanDistance } from '@greenelab/hclust';
import { avgDistance } from '@greenelab/hclust';
import { minDistance } from '@greenelab/hclust';
import { maxDistance } from '@greenelab/hclust';
```

---

### `clusterData({ data, key, distance, linkage, onProgress })`

#### Parameters

**`data`**

The data you want to cluster, in the format:

```javascript
[
...
[ ... 1, 2, 3 ...],
[ ... 1, 2, 3 ...],
[ ... 1, 2, 3 ...],
...
]
```

or if `key` parameter is specified:

```javascript
[
...
{ someKey: [ ... 1, 2, 3 ...] },
{ someKey: [ ... 1, 2, 3 ...] },
{ someKey: [ ... 1, 2, 3 ...] },
...
]
```

The entries in the outer array can be considered the `rows` and the entries within each `row` array can be considered the `cols`.
Each `row` should have the same number of `cols`.

*Default value:* `[]`

**`key`**

A `string` key to specify which values to extract from the `data` array.
If omitted, `data` is assumed to be an array of arrays.
If specified, `data` is assumed to be an array of objects, each with a key that contains the values for that `row`.

*Default value:* `''`
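
The relationship between the two accepted input shapes can be sketched as follows. This is only an illustration: `getRows` is a hypothetical helper, not a function exported by this package, and the library's internal extraction logic may differ.

```javascript
// Hypothetical helper (not part of hclust): normalizes both accepted
// input shapes into a plain array of numeric rows.
function getRows(data, key = '') {
  // With a key, each entry is an object holding its row under that key;
  // without one, each entry is assumed to already be the row itself.
  return data.map((datum) => (key ? datum[key] : datum));
}

// Made-up example data in both shapes
const asArrays = [
  [1, 2, 3],
  [4, 5, 6]
];
const asObjects = [
  { name: 'a', values: [1, 2, 3] },
  { name: 'b', values: [4, 5, 6] }
];

console.log(getRows(asArrays));
console.log(getRows(asObjects, 'values'));
```

Both calls yield the same two rows, which is why every `row` needs the same number of `cols` regardless of which shape is used.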

**`distance`**

A function to calculate the distance between two equal-dimension vectors, used in calculating the distance matrix, in the format:

```javascript
function (arrayA, arrayB) { return someNumber; }
```

The function receives two equal-length arrays of numbers.

*Default value:* `euclideanDistance` from this `hclust` package
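
As an illustration of the expected signature, a custom distance function might look like the sketch below. `manhattanDistance` is a hypothetical example and is not shipped with this package.

```javascript
// Hypothetical custom distance (not part of hclust): Manhattan distance,
// the sum of absolute differences between corresponding entries.
function manhattanDistance(arrayA, arrayB) {
  let sum = 0;
  for (let i = 0; i < arrayA.length; i++) {
    sum += Math.abs(arrayA[i] - arrayB[i]);
  }
  return sum;
}

console.log(manhattanDistance([1, 2, 3], [4, 0, 3])); // 5
```

Based on the parameter list above, such a function would presumably be passed as `clusterData({ data, distance: manhattanDistance })`.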

**`linkage`**

A function to calculate the distance between pairs of clusters, used in determining linkage criterion, in the format:

```javascript
function (arrayA, arrayB, distanceMatrix) { return someNumber; }
```

The function receives two sets of indexes and the distance matrix computed between each datum and every other datum.

*Default value:* `avgDistance` from this `hclust` package
*Other built-in values:* `minDistance` and `maxDistance` from this `hclust` package
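
A custom linkage function with this signature could be sketched as follows. `medianPairDistance` is a hypothetical example (the median of all pairwise distances between two clusters), not a built-in, and it assumes the index sets refer to rows and columns of the distance matrix.

```javascript
// Hypothetical custom linkage (not part of hclust): the median of all
// pairwise distances between members of the two clusters.
// Assumption: arrayA and arrayB hold indexes into distanceMatrix.
function medianPairDistance(arrayA, arrayB, distanceMatrix) {
  const distances = [];
  for (const a of arrayA) {
    for (const b of arrayB) {
      distances.push(distanceMatrix[a][b]);
    }
  }
  distances.sort((x, y) => x - y);
  const middle = Math.floor(distances.length / 2);
  // Odd count: take the middle value; even count: average the two middle values
  return distances.length % 2
    ? distances[middle]
    : (distances[middle - 1] + distances[middle]) / 2;
}

// Made-up symmetric distance matrix for four data points
const distanceMatrix = [
  [0, 1, 4, 6],
  [1, 0, 3, 5],
  [4, 3, 0, 2],
  [6, 5, 2, 0]
];

console.log(medianPairDistance([0, 1], [2, 3], distanceMatrix)); // 4.5
```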

**`onProgress`**

A function that is called several times throughout clustering, and is provided the current progress through the clustering, in the format:

```javascript
function (progress) { }
```

The function receives the progress as a fraction between `0` and `1`.

*Default value:* an internal function that logs the progress with `console.log`

**Note:** [`postMessage`](https://developer.mozilla.org/en-US/docs/Web/API/Worker/postMessage) is called in the same places as `onProgress`, if the script is running as a [web worker](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers).
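
Because the callback may fire many times, one option is to throttle it before updating a UI. The sketch below is one possible approach; `makeProgressReporter` and its `step` parameter are hypothetical names, not part of this package.

```javascript
// Hypothetical factory (not part of hclust): wraps a reporting function so it
// only fires when progress has advanced by at least `step`, or has reached 1.
function makeProgressReporter(report, step = 0.1) {
  let lastReported = -Infinity;
  return function onProgress(progress) {
    if (progress - lastReported >= step || progress === 1) {
      lastReported = progress;
      // Convert the 0..1 fraction into a whole-number percentage string
      report(`${Math.round(progress * 100)}%`);
    }
  };
}

// Simulate a series of progress calls with made-up values
const messages = [];
const onProgress = makeProgressReporter((message) => messages.push(message), 0.25);
[0, 0.1, 0.3, 0.5, 0.9, 1].forEach((value) => onProgress(value));

console.log(messages); // [ '0%', '30%', '90%', '100%' ]
```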

#### Returns

**`clusters`**

The resulting cluster tree, in the format:

```javascript
{
indexes: [ ... Number, Number, Number ... ],
height: Number,
children: [ ... {}, {}, {} ... ]
}

```
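
To show how such a tree might be consumed, the sketch below cuts a hand-built tree at a height threshold. The tree is made-up data in the documented shape, not real output, and `cutAtHeight` is a hypothetical helper. It assumes that leaves have an empty (or missing) `children` array and that `height` grows toward the root, neither of which is confirmed by the text above.

```javascript
// Hand-built example in the documented shape (not real hclust output).
// Assumption: leaves have an empty `children` array and a height of 0.
const tree = {
  indexes: [2, 0, 1],
  height: 3,
  children: [
    { indexes: [2], height: 0, children: [] },
    {
      indexes: [0, 1],
      height: 1,
      children: [
        { indexes: [0], height: 0, children: [] },
        { indexes: [1], height: 0, children: [] }
      ]
    }
  ]
};

// Hypothetical helper (not part of hclust): returns the groups of indexes
// obtained by keeping every subtree whose height is at or below the threshold.
function cutAtHeight(node, threshold) {
  const children = node.children || [];
  if (node.height <= threshold || children.length === 0) {
    return [node.indexes];
  }
  return children.flatMap((child) => cutAtHeight(child, threshold));
}

console.log(cutAtHeight(tree, 2)); // [ [ 2 ], [ 0, 1 ] ]
console.log(cutAtHeight(tree, 0.5)); // [ [ 2 ], [ 0 ], [ 1 ] ]
```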

**`distances`**

The computed distance matrix, in the format:

```javascript
[
...
[ ... Number, Number, Number ...],
[ ... Number, Number, Number ...],
[ ... Number, Number, Number ...]
...
]
```

**`order`**

The new order of the data, in terms of original data array indexes, in the format:

```javascript
[ ... Number, Number, Number ... ]
```

Equivalent to `clusters.indexes` and `clustersGivenK[1]`.
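
Since `order` holds indexes into the original data array, a reordered copy can be built without touching the input. The values below are made up for illustration and are not real clustering output.

```javascript
// Made-up input data and a made-up `order` result (not real hclust output)
const data = [
  { name: 'a', values: [1, 2, 3] },
  { name: 'b', values: [9, 9, 9] },
  { name: 'c', values: [1, 2, 4] }
];
const order = [0, 2, 1];

// Build a new array in clustered order; `data` itself is left unchanged
const reordered = order.map((index) => data[index]);

console.log(reordered.map((datum) => datum.name)); // [ 'a', 'c', 'b' ]
console.log(data.map((datum) => datum.name)); // [ 'a', 'b', 'c' ]
```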

**`clustersGivenK`**

A list of tree slices in terms of original data array indexes, where index = K, in the format:

```javascript
[
[], // K = 0
[ [] ], // K = 1
[ [], [] ], // K = 2
[ [], [], [] ], // K = 3
[ [], [], [], [] ], // K = 4
[ [], [], [], [], [] ] // K = 5
...
]
```
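
One way to use a slice is to turn it into a per-row cluster label for a chosen K. The sketch below uses made-up data in the documented shape; `labelsForK` is a hypothetical helper, not part of this package.

```javascript
// Hypothetical helper (not part of hclust): converts the slice for a given K
// into an array where labels[rowIndex] is the cluster number of that row.
function labelsForK(clustersGivenK, k) {
  const labels = [];
  clustersGivenK[k].forEach((cluster, clusterIndex) => {
    cluster.forEach((rowIndex) => {
      labels[rowIndex] = clusterIndex;
    });
  });
  return labels;
}

// Made-up slices for four rows (not real hclust output)
const clustersGivenK = [
  [], // K = 0
  [[2, 0, 1, 3]], // K = 1
  [[2, 0], [1, 3]], // K = 2
  [[2], [0], [1, 3]] // K = 3
];

console.log(labelsForK(clustersGivenK, 2)); // [ 0, 1, 0, 1 ]
console.log(labelsForK(clustersGivenK, 3)); // [ 1, 2, 0, 2 ]
```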

---

### `avgDistance(arrayA, arrayB, distanceMatrix)`

Calculates the average distance between pairs of clusters.

---

### `minDistance(arrayA, arrayB, distanceMatrix)`

Calculates the smallest distance between pairs of clusters.

---

### `maxDistance(arrayA, arrayB, distanceMatrix)`

Calculates the largest distance between pairs of clusters.
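
The three built-in criteria can be illustrated with the sketch below. It assumes they follow the textbook single, complete, and average linkage definitions over all pairs of members. The `sketch*` functions and `pairDistances` are hypothetical stand-ins, not the package's actual implementation.

```javascript
// Hypothetical helper: every distance between a member of one cluster
// and a member of the other, looked up in the distance matrix.
function pairDistances(arrayA, arrayB, matrix) {
  const distances = [];
  for (const a of arrayA) {
    for (const b of arrayB) {
      distances.push(matrix[a][b]);
    }
  }
  return distances;
}

// Textbook definitions (assumed, not confirmed, to match the built-ins)
const sketchMin = (a, b, m) => Math.min(...pairDistances(a, b, m));
const sketchMax = (a, b, m) => Math.max(...pairDistances(a, b, m));
const sketchAvg = (a, b, m) => {
  const distances = pairDistances(a, b, m);
  return distances.reduce((sum, d) => sum + d, 0) / distances.length;
};

// Made-up symmetric distance matrix for three data points
const matrix = [
  [0, 2, 8],
  [2, 0, 4],
  [8, 4, 0]
];

console.log(sketchMin([0, 1], [2], matrix)); // 4
console.log(sketchMax([0, 1], [2], matrix)); // 8
console.log(sketchAvg([0, 1], [2], matrix)); // 6
```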

---

### Comparison with [hcluster.js](https://github.com/cmpolis/hcluster.js)

- This package does not duplicate items from the original dataset in the results.
Results are given in terms of indexes, either with respect to the original dataset or the distance matrix.
- This package uses more modern JavaScript syntax and practices to make the code cleaner and simpler.
- This package provides an `onProgress` callback and calls `postMessage` for use in [web workers](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers).
Because clustering can take a long time with large data sets,
you may want to run it as a web worker so the browser doesn't freeze for a long time, and you may need a callback so you can give users visual feedback on its progress.
- This package makes some performance optimizations, such as removing unnecessary loops through big sets.
It has been tested on modern operating systems (Windows, Mac, Linux, iOS, Android), devices (desktop, laptop, mobile), browsers (Chrome, Firefox, Safari), contexts (main thread, web worker), and hosting locations (local, online).
The results vary widely and are likely sensitive to the specifics of hardware, CPU cores, browser implementation, etc.
In general, this package is faster than `hcluster.js`, by an amount that varies with the environment, and on average it is never slower.
Chrome seems to see the largest gains (up to 10x when the number of rows is high), while Firefox seems to see none.
- This package does not touch the input data object, whereas the `hcluster.js` package does.
D3 often expects you to mutate data objects directly, which is now typically considered bad practice in JavaScript.
Instead, this package returns the useful data from the clustering algorithm (including the distance matrix), and allows you to mutate or not mutate the data object depending on your needs.
In the future, a simple option could be added to instruct the algorithm to mutate the data object, if users can provide good use cases for what information is needed for constructing various D3 visualizations.

---

### Making changes to the library

1. [Install Node](https://nodejs.org/en/download/)
2. [Install Yarn](https://classic.yarnpkg.com/en/docs/install)
3. Clone this repo and navigate to it in your command terminal
4. Run `yarn install` to install this package's dependencies
5. Make desired changes to `./src/hclust.js`
6. Run `npm run test` to automatically rebuild the library and run the test suite
7. Run `npm run build` to just rebuild the library, and output the compiled contents to `./build/hclust.min.js`
8. Commit changes to repo if necessary. *Make sure to run the build command before committing; it won't happen automatically.*

---

### Similar Libraries

- [cmpolis/hcluster.js](https://github.com/cmpolis/hcluster.js)
- [harthur/clustering](https://github.com/harthur/clustering)
- [mljs/hclust](https://github.com/mljs/hclust)
- [math-utils/hierarchical-clustering](https://github.com/math-utils/hierarchical-clustering)
17 changes: 17 additions & 0 deletions build/hclust.min.js

Some generated files are not rendered by default.

34 changes: 34 additions & 0 deletions package.json
@@ -0,0 +1,34 @@
{
  "name": "@greenelab/hclust",
  "version": "0.0.0-dev",
  "description": "Agglomerative hierarchical clustering in JavaScript",
  "keywords": [
    "hierarchy",
    "hierarchical",
    "cluster",
    "clustering",
    "agglomerative",
    "data",
    "tree"
  ],
  "author": "Vincent Rubinetti",
  "license": "MIT",
  "repository": "git+https://github.com/greenelab/hclust.git",
  "main": "./build/hclust.min.js",
  "module": "./build/hclust.min.js",
  "scripts": {
    "test": "bash ./scripts/build.sh && bash ./scripts/test.sh",
    "build": "bash ./scripts/build.sh"
  },
  "devDependencies": {
    "@babel/cli": "^7.8.4",
    "@babel/core": "^7.8.6",
    "@babel/preset-env": "^7.8.6",
    "babel-preset-minify": "^0.5.1",
    "eslint": "^6.8.0",
    "eslint-config-google": "^0.14.0",
    "eslint-plugin-html": "^6.0.0",
    "jest": "^25.1.0"
  },
  "browserslist": "> 0.1%, not dead"
}
3 changes: 3 additions & 0 deletions scripts/build.sh
@@ -0,0 +1,3 @@
rm -rf build
mkdir build
npx babel ./src/hclust.js --out-file ./build/hclust.min.js
1 change: 1 addition & 0 deletions scripts/test.sh
@@ -0,0 +1 @@
jest ./test/test.js --notify