
Commit

Update doc (#40)
* Update README.md

* Update

* update img

* update

* update

* update

* update

* update

* change transE-l1 to l2

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update doc
aksnzhy authored Apr 2, 2020
1 parent 287fbd4 commit 0f7805a
Showing 5 changed files with 101 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -15,6 +15,7 @@ Knowledge graphs (KGs) are data structures that store information about differen
To install the latest version of DGL-KE run:

```
pip install dgl
pip install dglke
```

Binary file added docs/images/dist_train.png
Binary file added docs/images/metis.png
101 changes: 100 additions & 1 deletion docs/source/dist_train.rst
@@ -1,2 +1,101 @@
Distributed Training on Large Data
-----------------------------------------
------------------------------------

In the previous sections, we saw that using multiple GPUs within a machine can accelerate training. The speedup, however, is limited by the number of GPUs installed in that machine. The ever-growing size of knowledge graphs requires computation capable of scaling to graphs with millions of nodes and billions of edges.

DGL-KE supports distributed training, which allows users to train embeddings on a cluster of machines. In this tutorial, we demonstrate how to set up a four-machine cluster and how to use DGL-KE to train KG embeddings on it.

Architecture of Distributed Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DGL-KE adopts the *parameter-server* architecture for distributed training.

.. image:: ../images/dist_train.png
:width: 400

In this architecture, the entity embeddings and relation embeddings are stored in the DGL-KVStore. Trainer processes pull the latest model from the KVStore and push the computed gradients back to it. All processes train the KG embeddings *asynchronously*.

METIS Partition Algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^

In distributed training, all the training processes communicate with the KVStore through the network, which can introduce a large amount of network traffic and overhead. DGL-KE uses the `METIS graph partition`__ algorithm to mitigate this problem. For a cluster of ``P`` machines, we split the graph into ``P`` partitions with METIS and assign each partition (all of its entities and the triplets incident to those entities) to one machine, as shown in the following figure.

.. __: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview


.. image:: ../images/metis.png
:width: 400

The majority of the triplets are in the diagonal blocks. We co-locate the embeddings of the entities with the triplets in the diagonal block by specifying a proper data partitioning in the distributed KVStore. When a trainer process samples triplets in the local partition, most of the entity embeddings accessed by the batch fall in the local partition and, thus, there is little network communication to access entity embeddings from other machines.


Distributed Training by DGL-KE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Distributed training on DGL-KE usually involves three steps:

1. Partition a knowledge graph.
2. Copy partitioned data to remote machines.
3. Invoke the distributed training job by ``dglke_dist_train``.

Here we demonstrate how to train KG embeddings on the ``FB15k`` dataset using 4 machines. Note that ``FB15k`` is just a small dataset used as a toy demo. Interested users can try the same workflow on ``Freebase``, which contains *86M* nodes and *338M* edges.
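
Only the dataset name changes for larger graphs; for instance, the partition step in Step 2 below would look roughly like the following (assuming ``Freebase`` is available as a built-in dataset, and noting that a graph of that size likely needs more partitions and machines)::

dglke_partition --dataset Freebase -k 4 --data_path ~/my_task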

**Step 1: Prepare your machines**

Assume that we have four machines with the following IP addresses::

machine_0: 172.31.24.245
machine_1: 172.31.24.246
machine_2: 172.31.24.247
machine_3: 172.31.24.248

Make sure that *machine_0* has permission to *ssh* to all the other machines.
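
One common way to grant that permission is key-based authentication, for example (standard OpenSSH tooling; adjust the key path and IP addresses to your own cluster)::

# on machine_0: generate a key pair if one does not exist yet
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
# copy the public key to every other machine
ssh-copy-id 172.31.24.246
ssh-copy-id 172.31.24.247
ssh-copy-id 172.31.24.248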

**Step 2: Prepare your data**

Create a new directory called ``my_task`` on machine_0::

mkdir my_task

We use the built-in ``FB15k`` dataset as a demo and partition it into ``4`` parts::

dglke_partition --dataset FB15k -k 4 --data_path ~/my_task

Note that, in this demo, we have 4 machines, so we set ``-k`` to 4. After this step, you should see 4 new directories called ``partition_0``, ``partition_1``, ``partition_2``, and ``partition_3`` in your ``FB15k`` dataset folder.
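
A quick way to confirm that the partitions were created (the exact location depends on where the built-in dataset was downloaded under ``--data_path``)::

find ~/my_task -maxdepth 2 -type d -name 'partition_*'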

Create a new file called ``ip_config.txt`` in ``my_task``, and write the following contents into it::

172.31.24.245 30050 8
172.31.24.246 30050 8
172.31.24.247 30050 8
172.31.24.248 30050 8

Each line in ``ip_config.txt`` is the KVStore configuration for one machine. For example, ``172.31.24.245 30050 8`` means that, on ``machine_0``, the IP is ``172.31.24.245``, the base port is ``30050``, and we start ``8`` server processes on this machine. Note that you can change the number of servers on each machine based on its capabilities. In our environment, each instance has ``48`` cores, and we dedicate ``8`` cores to the KVStore and ``40`` cores to the worker processes.
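
If you prefer, the file can also be written in a single shell command on ``machine_0`` (a plain heredoc; adjust the IPs, ports, and server counts to your own cluster)::

cat > ~/my_task/ip_config.txt <<EOF
172.31.24.245 30050 8
172.31.24.246 30050 8
172.31.24.247 30050 8
172.31.24.248 30050 8
EOF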

After that, we can copy the ``my_task`` directory to all the remote machines::

scp -r ~/my_task 172.31.24.246:~
scp -r ~/my_task 172.31.24.247:~
scp -r ~/my_task 172.31.24.248:~
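
With more machines, a small shell loop avoids repeating the command (assuming the same IP addresses as above)::

for ip in 172.31.24.246 172.31.24.247 172.31.24.248; do
    scp -r ~/my_task ${ip}:~
done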


**Step 3: Launch distributed jobs**

Run the following command on ``machine_0`` to start a distributed task::

dglke_dist_train --path ~/my_task --ip_config ~/my_task/ip_config.txt \
--num_client_proc 16 --model_name TransE_l2 --dataset FB15k --data_path ~/my_task --hidden_dim 400 \
--gamma 19.9 --lr 0.25 --batch_size 1000 --neg_sample_size 200 --max_step 500 --log_interval 100 \
--batch_size_eval 16 --test -adv --regularization_coef 1.00E-09 --num_thread 1

We have already seen most of these options in previous sections. Here are the new options to know about.

``--path`` indicates the absolute path of our workspace. All the logs and trained embeddings will be stored in this path.

``--ip_config`` is the absolute path of ``ip_config.txt``.

``--num_client_proc`` has the same behavior as ``--num_proc`` in single-machine training.

All the other options are the same as in single-machine training. EC2 users may also need to set ``--ssh_key`` to point to the right *ssh* key.
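
For example, assuming the private key lives at ``~/.ssh/id_rsa``, the launch command above simply gains one extra flag (all other options unchanged)::

dglke_dist_train --ssh_key ~/.ssh/id_rsa --path ~/my_task --ip_config ~/my_task/ip_config.txt \
--num_client_proc 16 --model_name TransE_l2 --dataset FB15k --data_path ~/my_task ...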

If you do not set the ``--no_save_embed`` option, the trained KG embeddings will be stored on ``machine_0`` by default.
Binary file added metis.png
