Members of a K-means cluster #3

Open
Jonathan-Abrahams opened this issue Nov 27, 2017 · 9 comments

Comments

@Jonathan-Abrahams

Hi,

I am struggling to find the right documentation that details which rows or columns have been clustered into a specific K-means cluster.

Is this feature available? Or how would you suggest is the best way to go about doing this?

I have been looking in the notebooks of the examples but cannot find it. The cytof notebook does detail a similar process, but it is more complicated.

@cornhundred
Contributor

Hi,

This feature is available, but the documentation does not currently discuss it. The downsample method returns this information as a NumPy array. See the example below:

ds_data = net.downsample(axis='row', ds_type='kmeans', num_samples=5)

This array (referred to as ds_data above) has the same length as the original rows/columns, and the integer in each element refers to the cluster that the corresponding row/column has been assigned to. Please see this quick example notebook, which goes into more detail. We will also update the documentation to address this. Thank you for bringing it to our attention. Let us know if this works or if you have any other questions.
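As a minimal sketch of how those labels could be turned into a list of members per cluster: the snippet below assumes the returned array behaves like a sequence of integer cluster ids aligned with the row names. The helper `members_by_cluster`, the variable names, and the sample values are hypothetical and only for illustration. They are not output from Clustergrammer.

```python
from collections import defaultdict


def members_by_cluster(row_names, labels):
    """Group row names by the cluster id assigned to each row.

    `labels` can be any sequence of integers (for example a NumPy array)
    with one entry per row name.
    """
    if len(row_names) != len(labels):
        raise ValueError("row_names and labels must have the same length")
    clusters = defaultdict(list)
    for name, label in zip(row_names, labels):
        clusters[int(label)].append(name)
    return dict(clusters)


# Illustrative values only (hypothetical gene names and cluster ids)
row_names = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
labels = [0, 2, 0, 1, 2]

clusters = members_by_cluster(row_names, labels)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

With real data, `labels` would be the array returned by the downsample call and `row_names` the row labels of the matrix, in the same order.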

@Jonathan-Abrahams
Author

Jonathan-Abrahams commented Nov 28, 2017

Thank you for your fast response!

That's very helpful and does solve my original query. I am making good progress on applying this to my own data.

Now I am wondering what the best way to display this on the heatmap would be?

@cornhundred
Contributor

Great, I'm glad that helped. I updated the example notebook to show how the K-means cluster IDs can be overlaid on the original data by adding an additional row category (see below).

[Image: screenshot taken 2017-11-28 at 10:33:23 AM]
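A rough sketch of the row-category idea follows. It assumes that a category is attached by turning each row label into a tuple of the form `(name, 'Title: value')`. That label format is an assumption here and should be checked against the example notebook. The helper `add_cluster_category`, the category title, and the sample values are hypothetical.

```python
def add_cluster_category(row_names, labels, title="K-means cluster"):
    """Pair each row name with a category string naming its cluster.

    Assumption: categories are encoded as (name, 'Title: value') tuples.
    """
    if len(row_names) != len(labels):
        raise ValueError("row_names and labels must have the same length")
    return [
        (name, f"{title}: {int(label)}")
        for name, label in zip(row_names, labels)
    ]


# Illustrative values only (hypothetical gene names and cluster ids)
row_names = ["gene_a", "gene_b", "gene_c"]
labels = [1, 0, 1]

labeled_rows = add_cluster_category(row_names, labels)
for row in labeled_rows:
    print(row)
```

The resulting tuples could then replace the original row labels of the data before it is visualized, so that the matrix keeps its full size and only gains one extra category per row.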

Let us know if that answered your question and if you have any other questions.

@Jonathan-Abrahams
Author

My dataset consists of 1000 bacterial strains and data relating to their ~3000 genes. My primary motivation for downsampling is to simplify the heatmap to a manageable size.

The solution you have proposed does not address this motivation, as the resulting heatmap is the same size as the original dataset. I can see useful tips in your update on modifying labels and adding columns. I am sure I will be able to incorporate these at a later date.

My ideal solution would almost be the reverse of the one you proposed: the K-means clustered heatmap, with details as to which gene is represented by which K-means cluster. I can see many problems with what I am proposing, and I am trying it out myself. This may also go against my aim of simplifying the data. What do you think?

I have been able to plot heatmaps of individual K-means clusters, but this is not nearly as elegant as is possible, I'm sure!

It seems as though having gene names in one column beside the K-means clusters would be a messy (and probably impossible) way to show such information.

@cornhundred
Contributor

cornhundred commented Nov 28, 2017

I see, it sounds like your matrix is ~1,000 columns/strains by ~3,000 rows/genes and you are looking to reduce the size of your dataset to something more manageable. It will probably be difficult to show the gene list of the downsampled clusters (and this is not currently supported by Clustergrammer).

I would recommend a couple of things based on our experience with similar datasets.

We used Clustergrammer to visualize the Cancer Cell Line Encyclopedia, which is ~1,000 columns/cancer cell lines by ~20,000 rows/genes (see CCLE Notebook). We first filtered for the top 1,000 most variable genes and then downsampled our cell lines to obtain 100 cell-line clusters (downsampling also keeps track of the most common category in each cluster). So if you can filter your genes down (based on variance or sum), then something like this might be useful. The MNIST notebook also does something similar. If you can add some category to your genes, then this would be tracked with the downsampling, but it is not exactly what you are asking for.
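To illustrate the filtering step, here is a minimal sketch of keeping the top N most variable rows, written in plain Python so that it runs on its own. It is not Clustergrammer's implementation. The helper `top_variable_rows` and the small matrix are hypothetical, and population variance is used as the ranking, which is an assumption about what "most variable" means here.

```python
from statistics import pvariance


def top_variable_rows(matrix, row_names, n_top):
    """Keep the n_top rows with the largest variance, in original order."""
    variances = [pvariance(row) for row in matrix]
    # Rank row indices from most to least variable
    ranked = sorted(range(len(matrix)), key=lambda i: variances[i], reverse=True)
    # Restore the original row order among the rows that are kept
    keep = sorted(ranked[:n_top])
    return [row_names[i] for i in keep], [matrix[i] for i in keep]


# Illustrative values only: 4 hypothetical genes measured in 4 strains
matrix = [
    [1.0, 1.0, 1.0, 1.0],    # constant row, variance 0
    [0.0, 4.0, 0.0, 4.0],    # variance 4
    [1.0, 2.0, 1.0, 2.0],    # variance 0.25
    [0.0, 10.0, 0.0, 10.0],  # variance 25
]
row_names = ["gene_a", "gene_b", "gene_c", "gene_d"]

kept_names, kept_rows = top_variable_rows(matrix, row_names, n_top=2)
print(kept_names)
```

The filtered matrix would then be passed on to the downsampling step, so that K-means runs on fewer, more informative rows.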

Finally, the next version of Clustergrammer will be built in WebGL, which can handle much larger datasets like these. Here's a very simple visualization of a random matrix of 1,000 rows by 1,000 columns built in WebGL to demonstrate how much data can be handled. We can keep you up to date on this progress.

@Jonathan-Abrahams
Author

Jonathan-Abrahams commented Dec 15, 2017

Unfortunately, the link to the notebook you made is now dead!

@cornhundred
Contributor

Which notebook? Can you provide the link? The links I checked on this thread still appear to work.

@Jonathan-Abrahams
Author

You are right!

I should have checked today.

I am certain they were down yesterday.

@cornhundred
Contributor

No problem. Did the approaches we recommended work out?
