Commit 96d143a

Merge pull request #690 from usc-isi-i2/dev
for release 1.5.0
2 parents 9e43506 + 0601293 commit 96d143a

137 files changed, +49418 -10205 lines changed

.github/workflows/run-tests.yml

Lines changed: 2 additions & 2 deletions
@@ -11,12 +11,12 @@ jobs:
       - name: Install Python
         uses: actions/setup-python@v2
         with:
-          python-version: '3.8'
+          python-version: '3.9'
       - name: Setup conda
         uses: s-weigand/setup-conda@v1
         with:
           update-conda: true
-          python-version: '3.8'
+          python-version: '3.9'
           conda-channels: anaconda, conda-forge
       - name: Setup env
         run: |

Makefile

Lines changed: 6 additions & 0 deletions
@@ -32,3 +32,9 @@ coverage:
 
 download-spacy-model:
 	python3 -m spacy download en_core_web_sm
+
+update-documents:
+	for name in docs/*/*.md; do \
+		echo python kgtk/utils/update_documentation.py --md $${name} --summary; \
+		python kgtk/utils/update_documentation.py --md $${name} --summary; \
+	done
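
The new `update-documents` target loops over every Markdown file exactly one directory below `docs/` and runs `kgtk/utils/update_documentation.py` on each one. The loop can be sketched in Python as below. This is only an illustration: the helper name `build_update_commands` is hypothetical, and nothing is assumed about the script beyond the `--md` and `--summary` flags visible in the Makefile.

```python
from pathlib import Path
from typing import List


def build_update_commands(repo_root: str) -> List[List[str]]:
    """Return one argv list per docs/*/*.md file, mirroring the Makefile loop.

    Hypothetical helper: it only builds the commands and does not run them.
    """
    root = Path(repo_root)
    commands = []
    # "docs/*/*.md" matches Markdown files exactly one directory below docs/.
    for md_file in sorted(root.glob("docs/*/*.md")):
        commands.append([
            "python",
            "kgtk/utils/update_documentation.py",
            "--md", md_file.relative_to(root).as_posix(),
            "--summary",
        ])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, cwd=repo_root, check=True)` to reproduce what the Makefile target does.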

README.md

Lines changed: 27 additions & 0 deletions
@@ -50,6 +50,33 @@ you encounter problems with your installation, or are interested in a detailed
 explanation of these commands, [read more about the installation procedure
 here](docs/KGTK-Installation-Procedure-Details.md).
 
+### Installation issues on Macbooks with M1 chip
+Running `pip install -e .` (development mode) throws an error about 3 libraries,
+1. thinc
+2. blis
+3. tokenizers
+
+Fixed the `thinc` issue by ,
+
+a. commenting out [this line in requirements.txt](https://github.com/usc-isi-i2/kgtk/blob/dev/requirements.txt#L11)
+
+b. running `pip install thinc-apple-ops`
+
+Fixed the tokenizers issue by running the following commands in the conda environment
+```
+# download and install Rust. Follow the on screen instructions
+
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+source "$HOME/.cargo/env"
+
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python/
+pip install setuptools_rust
+python setup.py install
+
+```
+continue installing `kgtk`, `pip install -e .`
+
 ### Installing KGTK with Docker
 
 Please refer to [this document](docs/install-with-docker.md) for installing KGTK with Docker
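
The M1 workaround added to the README only applies on Apple Silicon Macs. A minimal sketch of how a setup script might detect that case before applying it is shown below. Both function names are hypothetical and not part of KGTK; the check relies on `platform.system()` reporting `Darwin` and `platform.machine()` reporting `arm64` for a native ARM Python build.

```python
import platform


def is_apple_silicon(system: str, machine: str) -> bool:
    """True when the (system, machine) pair describes an Apple Silicon Mac."""
    return system == "Darwin" and machine == "arm64"


def m1_workaround_applies() -> bool:
    """Check the running interpreter (hypothetical helper, not part of KGTK).

    A Python build running under Rosetta reports x86_64 and is therefore
    not detected by this check.
    """
    return is_apple_silicon(platform.system(), platform.machine())
```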

docs/analysis/community-detection.md

Lines changed: 237 additions & 296 deletions
Large diffs are not rendered by default.

docs/analysis/connected_components.md

Lines changed: 29 additions & 13 deletions
@@ -6,10 +6,15 @@ This command will find the connected components in a KGTK edge file. The output
 
 ## Usage
 ```
-Usage: kgtk connected-components [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [--properties PROPERTIES] [--undirected] [--strong]
+usage: kgtk connected-components [-h] [-i INPUT_FILE] [-o OUTPUT_FILE]
+                                 [--properties PROPERTIES] [--undirected]
+                                 [--strong]
                                  [--cluster-name-method {cat,hash,first,last,shortest,longest,numbered,prefixed,lowest,highest}]
-                                 [--cluster-name-separator CLUSTER_NAME_SEPARATOR] [--cluster-name-prefix CLUSTER_NAME_PREFIX]
-                                 [--cluster-name-zfill CLUSTER_NAME_ZFILL] [--minimum-cluster-size MINIMUM_CLUSTER_SIZE] [-v]
+                                 [--cluster-name-separator CLUSTER_NAME_SEPARATOR]
+                                 [--cluster-name-prefix CLUSTER_NAME_PREFIX]
+                                 [--cluster-name-zfill CLUSTER_NAME_ZFILL]
+                                 [--minimum-cluster-size MINIMUM_CLUSTER_SIZE]
+                                 [-v [optional True|False]]
 
 Find all the connected components in an undirected or directed Graph.
 
@@ -19,25 +24,36 @@ kgtk --expert connected-components --help
 optional arguments:
   -h, --help            show this help message and exit
   -i INPUT_FILE, --input-file INPUT_FILE
-                        The KGTK file to find connected components in. (May be omitted or '-' for stdin.)
+                        The KGTK file to find connected components in. (May be
+                        omitted or '-' for stdin.)
   -o OUTPUT_FILE, --output-file OUTPUT_FILE
-                        The KGTK output file. (May be omitted or '-' for stdout.)
+                        The KGTK output file. (May be omitted or '-' for
+                        stdout.)
   --properties PROPERTIES
-                        A comma separated list of properties to traverse while finding connected components, by default all properties will be considered
-  --undirected          Specify if the input graph is undirected, default FALSE
-  --strong              Treat graph as directed or not, independent of its actual directionality.
+                        A comma separated list of properties to traverse while
+                        finding connected components, by default all
+                        properties will be considered
+  --undirected          Specify if the input graph is undirected, default
+                        FALSE
+  --strong              Treat graph as directed or not, independent of its
+                        actual directionality.
   --cluster-name-method {cat,hash,first,last,shortest,longest,numbered,prefixed,lowest,highest}
-                        Determine the naming method for clusters. (default=Method.HASH)
+                        Determine the naming method for clusters.
+                        (default=Method.HASH)
   --cluster-name-separator CLUSTER_NAME_SEPARATOR
-                        Specify the separator to be used in cat and hash cluster name methods. (default=+)
+                        Specify the separator to be used in cat and hash
+                        cluster name methods. (default=+)
   --cluster-name-prefix CLUSTER_NAME_PREFIX
-                        Specify the prefix to be used in the prefixed and hash cluster name methods. (default=CLUS)
+                        Specify the prefix to be used in the prefixed and hash
+                        cluster name methods. (default=CLUS)
   --cluster-name-zfill CLUSTER_NAME_ZFILL
-                        Specify the zfill to be used in the numbered and prefixed cluster name methods. (default=4)
+                        Specify the zfill to be used in the numbered and
+                        prefixed cluster name methods. (default=4)
   --minimum-cluster-size MINIMUM_CLUSTER_SIZE
                         Specify the minimum cluster size. (default=2)
 
-  -v, --verbose         Print additional progress messages (default=False).
+  -v [optional True|False], --verbose [optional True|False]
+                        Print additional progress messages (default=False).
 ```
 ***OPTIONS***:
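
The help text above describes finding connected components, dropping clusters below `--minimum-cluster-size`, and naming the rest with a prefix and a zero-filled number. A minimal union-find sketch of that idea follows. It is not KGTK's implementation: the function name, the undirected treatment of edges, the ordering of clusters, and the exact `prefix + zero-filled number` name format (numbered from 1) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def connected_components(
    edges: Iterable[Tuple[str, str]],
    minimum_cluster_size: int = 2,
    prefix: str = "CLUS",
    zfill: int = 4,
) -> Dict[str, List[str]]:
    """Group nodes into undirected connected components (illustrative only)."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        # Register unseen nodes as their own root, then walk to the root
        # while halving the path to keep later lookups short.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for node1, node2 in edges:
        root1, root2 = find(node1), find(node2)
        if root1 != root2:
            parent[root2] = root1

    groups: Dict[str, List[str]] = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)

    # Keep clusters that meet the size threshold, in a deterministic order.
    kept = sorted(
        sorted(members)
        for members in groups.values()
        if len(members) >= minimum_cluster_size
    )
    # Assumed name format: prefix followed by a zero-filled counter.
    return {
        prefix + str(number).zfill(zfill): members
        for number, members in enumerate(kept, start=1)
    }
```

With the assumed defaults, the edges `a-b`, `b-c`, and `x-y` yield two clusters, while a node that only has a self-loop is dropped because its cluster has size 1.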
docs/analysis/graph_embeddings.md

Lines changed: 25 additions & 139 deletions
@@ -53,57 +53,33 @@ The algorithm is defined with the `operator` (`-op`) parameter. By default, it i
 ## Usage
 You can call the functions directly with given args as
 ```
-usage: kgtk graph-embeddings [-h] [-i INPUT_FILE_PATH] [-o OUTPUT_FILE_PATH]
-                             [-l] [-T] [-ot] [-r True|False] [-d] [-s]
+usage: kgtk graph-embeddings [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [-l] [-T]
+                             [-ot] [-r True|False] [-d] [-s]
                              [-c dot|cos|l2|squared_l2]
-                             [-op linear|diagonal|complex_diagonal|translation]
-                             [-e] [-b True|False] [-w] [-bs]
+                             [-op RESCAL|DistMult|ComplEx|TransE] [-e]
+                             [-b True|False] [-w] [-bs]
                              [-lf ranking|logistic|softmax] [-lr] [-ef]
                              [-dr True|False] [-ge True|False]
+                             [--no-output-header [True|False]]
                              [-v [optional True|False]]
-                             [--column-separator COLUMN_SEPARATOR]
-                             [--input-format INPUT_FORMAT]
-                             [--compression-type COMPRESSION_TYPE]
-                             [--error-limit ERROR_LIMIT]
-                             [--use-mgzip [optional True|False]]
-                             [--mgzip-threads MGZIP_THREADS]
-                             [--gzip-in-parallel [optional True|False]]
-                             [--gzip-queue-size GZIP_QUEUE_SIZE]
-                             [--mode {NONE,EDGE,NODE,AUTO}]
-                             [--force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]]
-                             [--header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--skip-header-record [optional True|False]]
-                             [--unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--initial-skip-count INITIAL_SKIP_COUNT]
-                             [--every-nth-record EVERY_NTH_RECORD]
-                             [--record-limit RECORD_LIMIT]
-                             [--tail-count TAIL_COUNT]
-                             [--repair-and-validate-lines [optional True|False]]
-                             [--repair-and-validate-values [optional True|False]]
-                             [--blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--fill-short-lines [optional True|False]]
-                             [--invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
-                             [--truncate-long-lines [TRUNCATE_LONG_LINES]]
-                             [--whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}]
+
+Generate graph embedding in kgtk tsv format, here we use PytorchBigGraph as low-level implementation
 
 optional arguments:
   -h, --help            show this help message and exit
-  -i INPUT_FILE_PATH, --input-file INPUT_FILE_PATH
-                        The KGTK input file. (default=-)
-  -o OUTPUT_FILE_PATH, --output-file OUTPUT_FILE_PATH
-                        The KGTK output file. (default=-).
+  -i INPUT_FILE, --input-file INPUT_FILE
+                        The KGTK input file. (May be omitted or '-' for
+                        stdin.)
+  -o OUTPUT_FILE, --output-file OUTPUT_FILE
+                        The KGTK output file. (May be omitted or '-' for
+                        stdout.)
   -l , --log            Setting the log path [Default: None]
   -T , --temporary_directory
                         Sepecify the directory location to store temporary
                         file
   -ot , --output_format
-                        Outputformat for embeddings [Default: w2v] Choice: kgtk
-                        | w2v | glove
+                        Outputformat for embeddings [Default: w2v] Choice:
+                        kgtk | w2v | glove
   -r True|False, --retain_temporary_data True|False
                         When opearte graph, some tempory files will be
                         generated, set True to retain these files
@@ -122,8 +98,9 @@ optional arguments:
   -op RESCAL|DistMult|ComplEx|TransE, --operator RESCAL|DistMult|ComplEx|TransE
                         The transformation to apply to the embedding of one of
                         the sides of the edge (typically the right-hand one)
-                        before comparing it with the other one. It reflects
-                        which model that embedding uses. [Default:ComplEx]
+                        before comparing it with the other one. It
+                        reflectswhich model that embedding uses.
+                        [Default:ComplEx]
   -e , --num_epochs     The number of times the training loop iterates over
                         all the edges.[Default:100]
   -b True|False, --bias True|False
@@ -145,111 +122,20 @@ optional arguments:
                         The fraction of edges withheld from training and used
                         to track evaluation metrics during training.
                         [Defalut:0.0 training all edges ]
-  -dr True|False, --dynamic_relaitons True|False
+  -dr True|False, --dynamic_relations True|False
                         Whether use dynamic relations (when graphs with a
                         large number of relations) [Default: True]
   -ge True|False, --global_emb True|False
                         Whether use global embedding, if enabled, add to each
                         embedding a vector that is common to all the entities
                         of a certain type. This vector is learned during
                         training.[Default: False]
+  --no-output-header [True|False]
+                        When true, do not write a header to the output file
+                        (default=False).
 
   -v [optional True|False], --verbose [optional True|False]
                         Print additional progress messages (default=False).
-
-File options:
-  Options affecting processing.
-
-  --column-separator COLUMN_SEPARATOR
-                        Column separator (default=<TAB>).
-  --input-format INPUT_FORMAT
-                        Specify the input format (default=None).
-  --compression-type COMPRESSION_TYPE
-                        Specify the compression type (default=None).
-  --error-limit ERROR_LIMIT
-                        The maximum number of errors to report before failing
-                        (default=1000)
-  --use-mgzip [optional True|False]
-                        Execute multithreaded gzip. (default=False).
-  --mgzip-threads MGZIP_THREADS
-                        Multithreaded gzip thread count. (default=3).
-  --gzip-in-parallel [optional True|False]
-                        Execute gzip in parallel. (default=False).
-  --gzip-queue-size GZIP_QUEUE_SIZE
-                        Queue size for parallel gzip. (default=1000).
-  --mode {NONE,EDGE,NODE,AUTO}
-                        Determine the KGTK file mode
-                        (default=KgtkReaderMode.AUTO).
-
-Header parsing:
-  Options affecting header parsing.
-
-  --force-column-names FORCE_COLUMN_NAMES [FORCE_COLUMN_NAMES ...]
-                        Force the column names (default=None).
-  --header-error-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a header error is detected.
-                        Only ERROR or EXIT are supported
-                        (default=ValidationAction.EXIT).
-  --skip-header-record [optional True|False]
-                        Skip the first record when forcing column names
-                        (default=False).
-  --unsafe-column-name-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a column name is unsafe
-                        (default=ValidationAction.REPORT).
-
-Pre-validation sampling:
-  Options affecting pre-validation data line sampling.
-
-  --initial-skip-count INITIAL_SKIP_COUNT
-                        The number of data records to skip initially
-                        (default=do not skip).
-  --every-nth-record EVERY_NTH_RECORD
-                        Pass every nth record (default=pass all records).
-  --record-limit RECORD_LIMIT
-                        Limit the number of records read (default=no limit).
-  --tail-count TAIL_COUNT
-                        Pass this number of records (default=no tail
-                        processing).
-
-Line parsing:
-  Options affecting data line parsing.
-
-  --repair-and-validate-lines [optional True|False]
-                        Repair and validate lines (default=False).
-  --repair-and-validate-values [optional True|False]
-                        Repair and validate values (default=False).
-  --blank-required-field-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a line with a blank node1,
-                        node2, or id field (per mode) is detected
-                        (default=ValidationAction.EXCLUDE).
-  --comment-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a comment line is detected
-                        (default=ValidationAction.EXCLUDE).
-  --empty-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when an empty line is detected
-                        (default=ValidationAction.EXCLUDE).
-  --fill-short-lines [optional True|False]
-                        Fill missing trailing columns in short lines with
-                        empty values (default=False).
-  --invalid-value-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a data cell value is invalid
-                        (default=ValidationAction.COMPLAIN).
-  --long-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a long line is detected
-                        (default=ValidationAction.COMPLAIN).
-  --prohibited-list-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a data cell contains a
-                        prohibited list (default=ValidationAction.COMPLAIN).
-  --short-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a short line is detected
-                        (default=ValidationAction.COMPLAIN).
-  --truncate-long-lines [TRUNCATE_LONG_LINES]
-                        Remove excess trailing columns in long lines
-                        (default=False).
-  --whitespace-line-action {PASS,REPORT,EXCLUDE,COMPLAIN,ERROR,EXIT}
-                        The action to take when a whitespace line is detected
-                        (default=ValidationAction.EXCLUDE).
-
 ```
 ## Examples
 
@@ -274,7 +160,7 @@ The output_file.tsv may look like:
 ### Example 2
 Running with more specific parameters (TransE algorithm and 200-dimensional vectors):
 ```
-kgtk graph-embeddings
+kgtk graph-embeddings \
     --input-file input_file.tsv \
     --output-file output_file.tsv \
     --dimension 200 \
@@ -296,7 +182,7 @@ The `output_file.tsv` may look like:
 ### Example 3
 Using glove format to generate graph embeddings
 ```
-kgtk graph-embeddings
+kgtk graph-embeddings \
     --input-file input_file.tsv \
     --output-file output_file.tsv \
     --output_format glove
@@ -313,7 +199,7 @@ The `output_file.tsv` may look like:
 ### Example 4
 Using kgtk format to generate graph embeddings
 ```
-kgtk graph-embeddings
+kgtk graph-embeddings \
     --input-file input_file.tsv \
     --output-file output_file.tsv \
     --output_format kgtk --no-output-headers
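
The usage text in this file's diff replaces the `-op` choices `linear|diagonal|complex_diagonal|translation` with `RESCAL|DistMult|ComplEx|TransE`. In the PyTorch-BigGraph documentation those model names correspond, in that order, to the four operator names. A small sketch of such a lookup is shown below. The dictionary and function names are hypothetical, and how KGTK performs this translation internally is an assumption, not something the diff shows.

```python
# Model name accepted by `-op` -> PyTorch-BigGraph operator name.
# The pairing follows the PyTorch-BigGraph docs; the names below are
# hypothetical and not taken from KGTK's source.
MODEL_TO_PBG_OPERATOR = {
    "RESCAL": "linear",
    "DistMult": "diagonal",
    "ComplEx": "complex_diagonal",
    "TransE": "translation",
}


def to_pbg_operator(model_name: str) -> str:
    """Translate a model name into a PyTorch-BigGraph operator name."""
    try:
        return MODEL_TO_PBG_OPERATOR[model_name]
    except KeyError:
        choices = "|".join(MODEL_TO_PBG_OPERATOR)
        raise ValueError(
            f"unknown operator {model_name!r}; expected one of {choices}"
        ) from None
```

Keeping the mapping in one dictionary means the list of valid `-op` choices and the translation cannot drift apart.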
