Home
Welcome to the libclustER wiki!
At the moment the documentation is a bit lacking, but the comments in the code do a fairly good job of telling you what the functions do. Still, you might only care about the clustering algorithms or the accuracy measures (or just getting started), so let's do a quick walk-through of the relevant functions.
Clustering. To be more specific: duplicate detection / entity resolution via graph clustering.
Data loading isn't strictly necessary, since all of the algorithms take igraph objects; you only need to get an igraph object loaded in one of its supported formats. But if you are looking at CSV-form data of graphs and their true clusters (like those found here, or in this repo's dataset folder), then you'll find the following load functions useful.
`load.true.cluster <- function(file.path="./../data/5k.csv", index.correction=FALSE)`
- Purpose: loads the 'true' or 'ground truth' clusters.
- `file.path=...` is a character string path to a file. The file should contain clusters in the form of a minimal edgelist: not every edge needs to be present, just enough to recreate each cluster.
- `index.correction=...` is a Boolean. This parameter is for backwards compatibility with older versions of the igraph library (vertex indexing used to start at 0, but now starts at 1).
- Returns: a list of vectors, representing a clustering.
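For intuition, here is a purely hypothetical minimal edgelist describing two ground-truth clusters, {1, 2, 3} and {4, 5}. The edge 1,3 can be omitted since the other two edges already connect that cluster (the exact identifier format in the bundled CSVs may differ):

```
1,2
2,3
4,5
```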
`load.graph.csv <- function(edge.path="./../data/scores_5k_weightedjaccardbm25.csv", node.path="./../data/5k.csv", sparse=FALSE, knnk=100, index.correction=FALSE)`
- Purpose: loads graph data from a CSV-format file.
- `node.path=...` is a file path as a character string, the same as the above function's `file.path`. The only reason it's used is so that the igraph vertices can be initialized beforehand, making loading faster.
- `edge.path=...` is a file path as a character string. The file is expected to store a weighted graph in CSV form (i.e., vertex_id1, vertex_id2, edge_weight).
- `sparse=FALSE, knnk=100` are used to sparsify the graph edges so that every node retains only its top k nearest neighbors. This was originally meant for use in the MCL algorithm, but ended up not being used.
- Returns: an igraph object.
The usage should be pretty straightforward. Just in case you're very new to the R programming language: follow the default settings in the function headers. They will actually work if nothing is passed to the functions (assuming you followed the instructions and are running them from the src directory).
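As a minimal sketch of that, run from the src directory so the default relative paths resolve:

```r
source("libcluster.R")
g   <- load.graph.csv()       # returns an igraph object
gtc <- load.true.cluster()    # returns a list of vectors (the ground-truth clustering)
```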
Now that you've played with loading igraph objects into your R session, you probably want to try out some clustering algorithms. Here are the headers of the clustering algorithms. Any parameter called `g` or `graph` is an igraph object. They all return a list of vectors (the returned clustering).
- `articulation.point.clustering <- function(graph)`
- `partition.clustering <- function(g, index.correction=FALSE)`
  - `index.correction=...` is used for index correction, for the same reason stated earlier regarding igraph vertex indexing changes.
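Both are called with just the graph; a quick sketch, assuming `g` was loaded as above:

```r
art <- articulation.point.clustering(graph=g)
pc  <- partition.clustering(g=g)   # index.correction left at its default
```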
- `affinity.propagation.cluster <- function(g, maxits=500, convits=10, pref=NA, q=NA, normalize=TRUE, median.default.case=TRUE)`
  - `maxits=500` decides the maximum number of iterations.
  - `convits=10` decides the number of iterations used to decide convergence.
  - `pref=NA, q=NA`: pref decides the type of preference, and q is for manually setting preferences. If `"uniform"` is passed to `pref`, then all preferences are set to the value of `q`; otherwise `pref` takes the same values as the centrality measures used in the library (see the vertex ordering functions).
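An illustrative sketch of the uniform-preference case (the values below are arbitrary, not tuned):

```r
# All vertex preferences are set to q because pref="uniform"
apc <- affinity.propagation.cluster(g, maxits=1000, convits=20,
                                    pref="uniform", q=0.5)
length(apc)   # number of clusters found
```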
- `merge.center.cluster <- function(g, index.correction=FALSE)`
- `markov.cluster <- function(g, r=2, k=100, prune=FALSE, limit=0.01)`
  - `r=2` is the granularity constant used in the MCL clustering algorithm.
  - `k=100` is the number of top edges retained per vertex in the graph approximation used in the MCL algorithm.
  - `prune=FALSE` decides whether the approximation is used or not.
  - `limit=0.01` decides when convergence has been reached (used to avoid unstable convergence, where circular hopping between states occurs as a result of numerical errors).
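A sketch with the approximation switched on (the values are illustrative):

```r
# Retain each vertex's top 50 edges and prune during iteration
mc <- markov.cluster(g, r=2, k=50, prune=TRUE, limit=0.01)
```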
- `center.cluster <- function(g, mode = "arbitrary", order = 1, index.correction=FALSE)`
  - `mode = "arbitrary"` decides whether a centrality measure will be used to pick centers or whether arbitrary assignment will occur.
  - `order = 1` is used for higher-order centrality measure calculations.
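A sketch using the defaults (arbitrary center assignment):

```r
cc <- center.cluster(g)   # mode="arbitrary", order=1
```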
- `star.cluster <- function(g, method='mean', overlap=TRUE, no.subsets=TRUE)`
  - `method='mean'` decides the centrality measure used:
    - mean - average weight of a vertex's neighborhood of edges
    - degree - number of edges in a vertex's neighborhood
    - markov - sum of Markov steady-state probabilities
    - evcent - principal eigenvector component values
    - kleinberg - authority score from the HITS algorithm
    - sum - the strength of a vertex (sum of weights over its neighborhood)
    - etc... (there are more)
  - `overlap=TRUE` decides whether the returned clusters are allowed to overlap or whether they form a strict partition.
  - `no.subsets=TRUE` decides whether clusters are allowed to be subsets of one another. (In the case of Star this never happens, but it can happen in the case of Min Cut unless specified otherwise.)
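For example, picking a different centrality measure from the list above and forcing a strict partition (illustrative choices):

```r
sc <- star.cluster(g, method='degree', overlap=FALSE)
```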
- `min.cut.cluster <- function(graph, method='mean', overlap=TRUE, order=1, alpha=0.1, no.subsets=TRUE)`
  - See the star clustering algorithm for the shared parameters.
  - `alpha=0.1` is the parameter used to set up the weights of the edges connecting the pseudo-node that sinks network flow.
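A sketch with a smaller sink-edge weight (the alpha value is illustrative):

```r
mcc <- min.cut.cluster(g, method='mean', alpha=0.05, no.subsets=TRUE)
```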
- `ricochet.cluster <- function(g, method='mean', algo="sr")`
  - `method='mean'` is the centrality measure setting (if you feel like using it).
  - `algo="sr"` decides the specific algorithm:
    - sr - sequential ricochet
    - bsr - balanced sequential ricochet
    - cr - concurrent ricochet
    - ocr - ordered concurrent ricochet
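For instance, to run the balanced sequential variant (algo code taken from the list above):

```r
rc <- ricochet.cluster(g, algo="bsr")
```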
All measures assume that you pass the produced clustering first and the ground truth, or 'true', clustering second.
- `precision <- function(a.clusters, ground.truth.clusters)`
- `recall <- function(a.clusters, ground.truth.clusters)`
- `f1.measure <- function(a.clusters, ground.truth.clusters)`
- `clustering.precision <- function(a.clusters, ground.truth.clusters)`
- `penalized.clustering.precision <- function(a.clusters, ground.truth.clusters)`
- `variation.of.information <- function(a.clusters, ground.truth.clusters)`
- `normalized.variation.of.information <- function(a.clusters, ground.truth.clusters)`
- `k.measure <- function(a.clusters, ground.truth.clusters, total.num.records=...`
  - The third parameter is computed automatically and doesn't need to be given, but the call is faster if it is.
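A quick sketch, assuming `pc` is a clustering produced by one of the algorithms above and `gtc` the loaded ground truth:

```r
precision(pc, gtc)
recall(pc, gtc)
f1.measure(pc, gtc)
k.measure(pc, gtc)   # third parameter computed automatically when omitted
```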
Here we load up the default data from within the src directory. We look at the igraph object, cluster the graph, check the number of clusters, and check the precision and recall.
Let's first load the library from the src directory.
> source("libcluster.R")
Then load a graph.
> g = load.graph.csv()
Let's take a look at the graph.
> g
IGRAPH U-W- 4674 650459 --
+ attr: weight (e/n)
So, we now know that this igraph object is undirected and weighted (U-W-) and contains 4674 vertices and 650459 edges. Cool. The graph has one attribute called weight; the `attr` line tells us that the weight attribute applies to the edges of the graph (`e`) and is of type numerical (`n`).
Let's load up the ground truth clustering.
> gtc = load.true.cluster()
Now that we have the ground truth clustering and the igraph object, we can produce some clusters using one of the clustering algorithms.
> pc = partition.clustering(g=g)
How many clusters are in this clustering?
> length(pc)
[1] 1
Not a lot. Makes sense, after a quick look at the number of edges and nodes in our graph.
How many should there be?
> length(gtc)
[1] 500
A lot more.
So how good is our produced clustering?
> precision(pc, gtc)
[1] 0.004065041
> recall(pc, gtc)
[1] 1
It could be better.
Had enough tinkering for now? Quit and save your session at the prompt.
> q()
You: I notice a lot of code in `test.R`. What does it do and why do I care about it?
The file `test.R` is for testing the accuracy (`measures.R`) of the different clustering algorithms. You can essentially learn how every function works by reading through `test.R`. You can also use it as a template for writing tests for your own clustering algorithms.
Load up libclustER:
> source("libcluster.R")
Run one of the high-level functions (the ones with the `data.sources.run.` prefix) to begin running tests, such as
> data.sources.run.center()
or
> data.sources.run.partition()
For an overview of the files used in conjunction with `test.R`, and for reproducing the results in `output.csv`, see here.