GraphZip is a scalable method for mining interesting patterns in graph streams, based on the Lempel-Ziv class of compression algorithms.
This repository contains a reference implementation of the GraphZip algorithm, written in Python.
For more detailed information regarding the algorithm, see our paper.
Python 3 and python-igraph are required.
Once you have a working version of Python 3, install python-igraph with:
$ pip install python-igraph
Note: python-igraph
is not to be confused with the igraph
package (renamed to jgraph
).
Run GraphZip directly from the command line with:
$ python graphzip.py graph_file [-n NUM_FILES] [-d] [-a ALPHA] [-t THETA] [-o OUTFILE]
# Run GraphZip on `test.graph` with a batch size of `5` and a dictionary size of `10`:
python graphzip.py test.graph -a 5 -t 10
# Run GraphZip on files `1.graph` through `100.graph` located in directory `test_graphs/`, using a batch size of 5 and the default dictionary size:
python graphzip.py test_graphs -n 100 -a 5
Using -n
turns on multi-file mode, and GraphZip will treat graph_file
as a directory holding NUM_FILES
sequential graph stream files, labelled 1.graph
to [NUM_FILES].graph
.
Using -d
will make GraphZip read the graph file(s) as directed graphs.
Batch size and dictionary size, respectively (hyperparameters of the GraphZip model).
By default, the pattern dictionary is dumped to stdout; use -o
to save the dictionary to a specified file.
The correct format for .graph
files is:
% '%' to start comments
v [vertex id (int)] [vertex label (int)]
e [source vertex id (int)] [target vertex id (int)] [edge label (int)]
For example:
% example 3-clique with vertex labels "100", "999" and edge labels "1", "2", "3"
v 1 100
v 2 999
v 3 100
e 1 2 1
e 1 3 2
e 2 3 3
Vertices must be defined prior to being referenced by an edge (or the vertex label would be unknown).
In the case of processing a graph stream over sequential .graph
files, having Compressor
's label_history_per_file
property set to False
and add_implicit_vertices
set to True
(the default) allows edges to reference vertices declared in previous files. For further details see the inline comments.
Several example experiments are located in the unit tests directory under tests/test_expr.py
.
Run the algorithm on several example graphs (located in data/
) by navigating to the root project directory then running:
$ python -m unittest
To run a specific test use the format:
$ python -m unittest test.[test_file].[TestSuite].[test_function]
For example, to run the 4-clique test with 20% coverage located in the TestGraphZipSubgen
suite:
$ python -m unittest test.test_expr.TestGraphZipSubgen.test_4PATH_20
For details on specific examples, see the files under the test/
directory.
If you find GraphZip useful in your research, please consider citing:
C. Packer and L. Holder, GraphZip: Mining Graph Streams using Dictionary-based Compression. SIGKDD Workshop on Mining and Learning in Graphs (MLG), August 2017.
@inproceedings{PacHol:mlg17,
title = {{GraphZip}: Mining Graph Streams using Dictionary-based Compression},
author = {Charles Packer and Larry Holder},
booktitle = {Proceedings of the 13th International Workshop on Mining and Learning with Graphs (MLG)},
year = {2017}
}
MIT License - see LICENSE.txt for details.