-
Notifications
You must be signed in to change notification settings - Fork 60
build_faiss command #601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
build_faiss command #601
Conversation
|
Hello @nicklein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-05-19 22:08:31 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nicklein I added some comments. Please make the changes.
I'll take another closer look at the code logic wise after you make the required changes
kgtk/cli/build_faiss.py
Outdated
|
|
||
| # REQUIRED # | ||
| # Related to input file | ||
| parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a add_input_file function, please use it
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L45
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
kgtk/cli/build_faiss.py
Outdated
| help='Input file containing the embeddings for which a Faiss index will be created.') | ||
|
|
||
| # Related to output | ||
| parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a add_output_file() function
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L50
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
kgtk/cli/build_faiss.py
Outdated
| required=True, help="Output .idx file where the index fill be saved.", | ||
| metavar="INDEX_FILE_OUT") | ||
|
|
||
| parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the secondary options should only have -, instead of underscore.
--index_to_node_file_out should be --index-to-node-file-out.
Also as a general rule, do not use stop words like to in the parameter name, This review comment applies to all the subsequent parameters.
The dest parameter is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason to avoid stop words simply for brevity? I am struggling to think of a name for this that omits 'to' and isn't ambiguous. This parameter specifies the path where a kgtk file will be saved. The kgtk file contains a mapping of index to corresponding node. Do you have a suggestion for this parameter name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain a bit more as to what exactly this file contains? mapping of what index ?how is this different from the output file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you use a faiss index to search for nearest neighbors, it returns distances and corresponding indexes/IDs of the nearest neighbors, rather than the names or embeddings of the nearest neighbors. This file would allow you to look up the entity name that corresponds to the index/ID. Here's an example of what the file would look like:
Input embedding file:
q1 <embedding>
q2 <embedding>
q3 <embedding>
output index_to_node file:
node1 label node2
0 index_to_node q1
1 index_to_node q2
2 index_to_node q3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I could use the word ID rather than index here to avoid confusion. Then I could call this 'node_id_file_out' to avoid 'to'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see, using the stop word to makes sense here. I would call this --faiss-id-to-node-mapping-file As you can, brevity is not the issue, meaningful name is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
kgtk/cli/build_faiss.py
Outdated
| except SystemExit as e: | ||
| raise KGTKException("Exit requested") | ||
| except KGTKException as e: | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise the caught exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| """ | ||
| Train and populate a faiss index that can compute nearest neighbors of given embeddings. | ||
| """ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename the command and the file in cli to build-faiss-index`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| def build_faiss(embeddings_file, embeddings_format, no_input_header, index_file_out, index_to_node_file_out, | ||
| max_train_examples, workers, index_string, metric_type, p=None, verbose=False): | ||
|
|
||
| # validate input file path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use KGTKReader to read the input file as it is an edge file. It handles a lot of exceptions and cases. Take a look at the cat command for example.
kgtk/graph_embeddings/build_faiss.py
Outdated
|
|
||
| # validate metric type and translate to a faiss metric | ||
| metrics = { | ||
| "Inner_product": faiss.METRIC_INNER_PRODUCT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure this is case insensitive
kgtk/graph_embeddings/build_faiss.py
Outdated
| else: | ||
| raise KGTKException("Unrecognized value for metric_type parameter: {}.".format(metric_type) + | ||
| "Please choose one of {}.".format(" | ".join(list(metrics.keys())))) | ||
| if metric_type == "Lp" and p is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is p? Use a descriptive name please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Lp distance metric is specified, then the user also needs to specify the value of p that they want to use, e.g. L1, L2, L...
I'll change it to metric_arg though since that is what Faiss calls it.
| print("Writing index-to-node file...") | ||
| with open(embeddings_file, 'r') as f_in: | ||
| with open(index_to_node_file_out, 'w+') as f_out: | ||
| f_out.write("node1\tlabel\tnode2\n") # header |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use KGTKWriter instead
| # Load training examples for index training | ||
| train_vecs = [] | ||
| if verbose: | ||
| print("Loading training vectors...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print to an error_file, see example in cat. No printing to standard out
|
@nicklein ETA on the requested changes? |
New kgtk command for building a faiss index. Intended for use with graph embeddings.