Implement an initial prototype of the system #1
Comments
How is the value 'r' in the formula for the sense-feature matrix calculated? For example given
and
then Java#NP#4 and Windows#NP#1 have Linux#NP in common, so the value should be > 0 (if I got it right).
In the original work of Chris (and as implemented in jobimtext.org), the similarity of two objects (in our case, of two senses) is simply the number of common features. As far as I remember, here https://github.com/tudarmstadt-lt/noun-sense-induction-scala the number of common features is normalized by the number of features per word (per sense in your case, i.e. 200): Implementation is here: Note that normalization is not really needed because most of the words (senses) have the same number of features.
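For illustration, a minimal Scala sketch of this idea: similarity as the number of common features, optionally normalized by the number of features kept per sense. The feature sets below are made up; only the shared Linux#NP comes from the example above.

```scala
// Minimal sketch: similarity of two senses = number of common features,
// optionally normalized by the number of features per sense (e.g. 200).
object SenseSimilaritySketch {
  def commonFeatures(a: Set[String], b: Set[String]): Int = (a intersect b).size

  def normalized(a: Set[String], b: Set[String], featuresPerSense: Int = 200): Double =
    commonFeatures(a, b).toDouble / featuresPerSense

  def main(args: Array[String]): Unit = {
    // Hypothetical feature sets; only Linux#NP is taken from the example above.
    val java4    = Set("Linux#NP", "JVM#NP", "Oracle#NP")
    val windows1 = Set("Linux#NP", "Microsoft#NP", "OS#NP")
    println(commonFeatures(java4, windows1)) // 1, i.e. > 0 as expected
    println(normalized(java4, windows1))     // 0.005
  }
}
```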
Found some problematic lines while testing: The computation of the sense-feature matrix for senses-n30-1600k takes around 40 minutes and most of my memory at the moment. I am going to check if I can adapt it for Spark.
I have slightly cleaner versions of the files here (some null bytes and other weird characters were removed): http://panchenko.me/data/joint/senses-wiki-n30-1600k.csv.gz Feel free to use these and apply any extra filters on top of them. As to this filtering, it is better to make a standalone class/utility for the cleanup (it takes the senses file as input and outputs a cleaned version of the file). In this way other programs dealing with the senses will benefit from the cleaned version.
I looked through your current code. Looks good. Here are a couple of comments:
It's far from being clean code, but it works for a first start :) At the moment it loads all data into memory (~4.6 GB) and then performs the calculation: for each cluster and each word in the cluster, look up all senses of the word and calculate the similarity. Taking n as the number of clusters, m as the maximum number of words per cluster and s as the maximal number of senses for one word, it should run in approximately O(n * m * s) similarity computations. For senses-n30-1600k we have: I think the similarity matrix output file was something like 2.9 GB.
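For reference, here is a compact sketch of that loop (types and field names are illustrative, not the actual code):

```scala
// Illustrative types; the real sense file has its own fields and parsing.
case class Sense(word: String, senseId: Int, clusterWords: Seq[String], features: Set[String])

object InMemorySimilaritySketch {
  def similarities(senses: Seq[Sense]): Iterator[(Sense, Sense, Int)] = {
    // Index all senses by their word so the senses of a cluster word can be looked up directly.
    val sensesByWord: Map[String, Seq[Sense]] = senses.groupBy(_.word)
    for {
      sense       <- senses.iterator                                     // n clusters/senses
      clusterWord <- sense.clusterWords.iterator                         // <= m words per cluster
      candidate   <- sensesByWord.getOrElse(clusterWord, Nil).iterator   // <= s senses per word
      common       = (sense.features intersect candidate.features).size
      if common > 0
    } yield (sense, candidate, common)   // roughly O(n * m * s) similarity computations
  }
}
```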
So this is something like an inverted index of senses given a word, right?
Kind of. The idea was to use the same datatype for the sense word and the words in the cluster. I am currently trying to implement the same algorithm in Spark, which seems to be mostly a problem of finding the right data structure and operations on the RDD.
OK. As to Spark, I would rather recommend using the current implementation and extending it if needed. We will have to implement some other things that are not yet implemented with Spark, e.g. the graph clustering, which is local now.
Useful commands:
Frequency dictionary:
MapReduce for similarity computation:
Another tip for your project: use Gzip to store all big files in the project, and use readers/writers that can read Gzip. So instead of sim.txt use sim.txt.gz, and so on. This saves a great deal of space and comes at very little CPU cost. Editors like VIM, as well as other CLI tools, can open Gzip files without any problem.
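For example, reading and writing Gzip files directly is easy with the standard JDK streams (a small sketch; file names are just examples):

```scala
import java.io.{FileInputStream, FileOutputStream, OutputStreamWriter, PrintWriter}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

object GzipIO {
  // Read a .gz file line by line without unpacking it first.
  def readLinesGz(path: String): Iterator[String] =
    Source.fromInputStream(new GZIPInputStream(new FileInputStream(path)), "UTF-8").getLines()

  // Write lines directly into a .gz file, e.g. sim.txt.gz instead of sim.txt.
  def writeLinesGz(path: String, lines: Iterator[String]): Unit = {
    val out = new PrintWriter(new OutputStreamWriter(
      new GZIPOutputStream(new FileOutputStream(path)), "UTF-8"))
    try lines.foreach(line => out.println(line)) finally out.close()
  }
}
```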
vim replace:
top 100000, sorted:
number of lines in a file:
Some output from the clustering:
How do I have to interpret the cluster labels (third column)?
'agate' does not seem to be an appropriate label.
But as we discovered, your implementation seems to limit the choice of candidates to the related words, doesn't it? What about the Spark implementation of the similarity? Did you calculate the similarity of senses with that one? It would be nice to compare the results. As to the label, for the moment we are not interested in it.
Please create a bash script with the commands you used to perform the clustering, for reproducibility in the future and for easier automation (e.g. for meta-parameter optimization). Something like:
These results look reasonable. We will look into more of them tomorrow.
I started some pipelines with different settings on frink.
The ddt-news-*-closure input files are not in exactly the same format as the previous ones: I will take some more time to inspect the data and add Markov Chain Clustering in the next few days.
OK, great. By the way, we need to postpone our meeting on Friday -- I will be travelling. Can we meet next Tuesday instead at 14:00? Please upload new clustering results, so I can still inspect them. |
Motivation
We need to develop a prototype of the system that builds structured topics (creates a model) and is able to label new texts according to these topics. This prototype is supposed to have all the minimal functionality of the system (input/output) and to be implemented with the most straightforward set of algorithms. The goal is to make an initial validation of the idea and then improve the prototype gradually. In this step we do no evaluation; that will be done later (during the "official" 6-month period reserved for writing the thesis).
Implementation
The prototype will build structured topics out of sense clusters, i.e. it will cluster word senses, which are provided as input. The system will rely on existing developments of the LT team in order to efficiently compute a similarity graph of senses.
The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):
1. Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText sense clusters:
2. Create a sense-feature matrix out of these files in the following format:
for instance
Here 0.75 is 1/(r+1)^0.33, where r is the rank of the cluster word, from 1 to n (see the weight sketch at the end of this step).
To be more precise, the system should generate the same output as this package: https://github.com/tudarmstadt-lt/noun-sense-induction
This includes two additional files:
sense-id<TAB>freq
and cluster-word<TAB>freq.
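A minimal sketch of the rank-based weight from this step; the sense ID, cluster words and tab-separated layout below are assumptions for illustration only:

```scala
// weight(r) = 1 / (r + 1)^0.33, where r is the rank of the cluster word (1..n).
object FeatureWeightSketch {
  def featureWeight(rank: Int): Double = 1.0 / math.pow(rank + 1, 0.33)

  def main(args: Array[String]): Unit = {
    // Hypothetical cluster words of one sense, already ordered by rank.
    val clusterWords = Seq("Linux#NP", "Unix#NP", "Solaris#NP")
    clusterWords.zipWithIndex.foreach { case (word, i) =>
      val r = i + 1
      // Hypothetical sense-feature row: sense-id <TAB> feature <TAB> weight
      println(f"Java#NP#1\t$word\t${featureWeight(r)}%.4f")
    }
  }
}
```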
3. For each sense, retrieve the top 200 most similar senses, i.e. calculate a similarity graph of senses. You will need to adapt this project to do so efficiently using the Apache Spark framework: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. To understand how it works, read
The framework outputs a file in the format:
for each sense, it would generate 200 nearest neighbours.
To get started with Spark, just look at the examples: http://spark.apache.org/
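For orientation only, here is a minimal Spark sketch of building a sense similarity graph from common-feature counts with a top-200 cut-off. This is not the noun-sense-induction-scala implementation; the input layout (one sense-id<TAB>feature pair per line) is assumed, and the feature pruning a real run would need is omitted:

```scala
import org.apache.spark.sql.SparkSession

object SenseGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sense-graph-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Input: one "sense-id<TAB>feature" pair per line (assumed layout of the matrix from step 2).
    val senseFeatures = sc.textFile(args(0)).map { line =>
      val cols = line.split("\t")
      (cols(0), cols(1))
    }

    val neighbours = senseFeatures
      .map { case (sense, feature) => (feature, sense) }
      .groupByKey()                                    // all senses sharing a feature
      // NOTE: very frequent features should be pruned, otherwise this pair expansion explodes.
      .flatMap { case (_, senses) =>
        for (a <- senses; b <- senses if a != b) yield ((a, b), 1)
      }
      .reduceByKey(_ + _)                              // similarity = number of common features
      .map { case ((a, b), common) => (a, (b, common)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._2).take(200))      // keep the 200 nearest neighbours per sense

    neighbours
      .flatMap { case (sense, nns) => nns.map { case (nn, sim) => s"$sense\t$nn\t$sim" } }
      .saveAsTextFile(args(1))

    spark.stop()
  }
}
```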
4. Cluster the graph of sense similarities using the Chinese Whispers and/or Markov Chain Clustering algorithm. Use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively, you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/
In your case, instead of words you will cluster word senses.
http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
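As a rough illustration of the algorithm itself (not the linked implementation), Chinese Whispers on a small in-memory graph can be sketched like this: every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbours.

```scala
import scala.util.Random

object ChineseWhispersSketch {
  // graph: node -> (neighbour -> edge weight), here e.g. sense -> similar senses with scores
  def cluster(graph: Map[String, Map[String, Double]], iterations: Int = 20): Map[String, String] = {
    var labels: Map[String, String] = graph.keys.map(n => n -> n).toMap // each node is its own class
    val rnd = new Random(42)
    for (_ <- 1 to iterations) {
      for (node <- rnd.shuffle(graph.keys.toSeq)) {
        val neighbours = graph.getOrElse(node, Map.empty)
        if (neighbours.nonEmpty) {
          // Sum the edge weights per neighbouring class and adopt the heaviest one.
          val best = neighbours.toSeq
            .groupBy { case (nb, _) => labels.getOrElse(nb, nb) }
            .map { case (label, edges) => label -> edges.map(_._2).sum }
            .maxBy(_._2)._1
          labels = labels.updated(node, best)
        }
      }
    }
    labels
  }
}
```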
5. Write a classification module that would use the structured topics, i.e. clusters of senses, to annotate text documents. The module should
To implement this module you should use an ElasticSearch index: one topic would be one document, and an input document would be used as the search query. The retrieval system then returns a list of documents (topics) ranked by their TF-IDF score (see the retrieval sketch below).
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
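As an in-memory stand-in for the ElasticSearch index (illustration only, all names hypothetical): each topic, i.e. a cluster of senses, is one document, the input text is the query, and topics are ranked by a TF-IDF-style score.

```scala
object TopicClassifierSketch {
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // topics: topic id -> terms of the topic (e.g. words of its sense clusters)
  def rankTopics(topics: Map[String, Seq[String]], text: String): Seq[(String, Double)] = {
    val n = topics.size.toDouble
    // Document frequency: in how many topics does each term occur?
    val df: Map[String, Double] =
      topics.values.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    val queryTerms = tokenize(text)
    topics.toSeq.map { case (topicId, terms) =>
      val tf = terms.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
      val score = queryTerms
        .map(t => tf.getOrElse(t, 0.0) * math.log(n / (1.0 + df.getOrElse(t, 0.0))))
        .sum
      topicId -> score
    }.sortBy(-_._2)
  }
}
```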