[KGE] Add embedding export / unenumeration utils #341
base: main
Conversation
exporter = export.EmbeddingExporter(
    export_dir=embedding_dir,
    file_prefix=f"{rank}_of_{world_size}_embeddings_",
    min_shard_size_threshold_bytes=1_000_000_000,  # 1GB threshold for sharding
This seems like a lot; any reason we are using 1GB here? IIRC, we found the best results with 200-500 MB shards for downstream distributed reads.
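A minimal sketch of the suggested change, assuming min_shard_size_threshold_bytes is the right knob to turn; the 300 MB figure is just an illustrative value inside the 200-500 MB range mentioned above:

exporter = export.EmbeddingExporter(
    export_dir=embedding_dir,
    file_prefix=f"{rank}_of_{world_size}_embeddings_",
    # ~300 MB shards; 200-500 MB reportedly reads best in downstream
    # distributed reads (illustrative value, not a measured optimum)
    min_shard_size_threshold_bytes=300_000_000,
)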
    embeddings_table_node_id_field: str,
    unenumerated_embeddings_table: str,
    enumerator_mapping_table: str,
):
Can the following function be re-used here instead: gigl.src.post_process.utils.unenumeration._unenumerate_single_inferred_asset()? It could also be moved to some shared place for better visibility than in the post processor.
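A hedged sketch of the reuse; the helper's real signature is not shown in this diff, so the keyword names below are guesses based on the surrounding code:

from gigl.src.post_process.utils.unenumeration import (
    _unenumerate_single_inferred_asset,  # could be promoted to a shared, public util
)

# Reuse the existing helper instead of a parallel implementation;
# argument names here are assumptions, not the actual signature.
_unenumerate_single_inferred_asset(
    enumerated_assets_table=enumerated_embeddings_table,
    enumerator_mapping_table=enumerator_mapping_table,
)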
logger.info(f"{rank_prefix_str} Initialized TrainPipelineSparseDist for inference.")

# Run inference in no_grad context to save memory and improve performance
with torch.no_grad():
Suggestion / non-blocking: there is also with torch.inference_mode(), which is faster than no_grad. It wasn't working with GNN inference for some reason (@mkolodner-sc), but maybe it works here.
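For reference, a sketch of the swap; the body shown is a placeholder for the existing inference loop:

# inference_mode() disables autograd tracking and also skips tensor
# version-counter bookkeeping, so it is typically faster than no_grad().
with torch.inference_mode():
    run_inference_loop()  # placeholder for the existing inference body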
def infer_and_export_embeddings(
    applied_task_identifier: AppliedTaskIdentifier,
    rank_prefix_str: str,
This feels like a weird argument. Any reason it cannot be inferred automatically?
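A minimal sketch of deriving the prefix inside the function instead of threading it through every call; this assumes torch.distributed is already initialized at this point, and the exact prefix format is a guess:

import torch.distributed as dist

# Derive the log prefix from the process group instead of taking it
# as a parameter (assumes dist.init_process_group() already ran).
rank = dist.get_rank() if dist.is_initialized() else 0
rank_prefix_str = f"[rank {rank}]"  # format is an assumption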
logger.info(
    f"""{rank_prefix_str} Running inference for edge type {edge_type} on
    src node type {edge_type.src_node_type} and dst node type {edge_type.dst_node_type}."""
)
Use textwrap.dedent?
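For reference, a sketch of the dedent version; it assumes the message content stays the same:

import textwrap

logger.info(
    textwrap.dedent(
        f"""\
        {rank_prefix_str} Running inference for edge type {edge_type} on
        src node type {edge_type.src_node_type} and dst node type {edge_type.dst_node_type}.
        """
    )
)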
)
unenumerate_embeddings_table(
    enumerated_embeddings_table=enum_src_node_embedding_table,
    embeddings_table_node_id_field=export._NODE_ID_KEY,
This uses another module's private constant, _NODE_ID_KEY.
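One hedged option: expose a public alias in the export module and import that instead; the alias name is an assumption:

# In the export module: public alias so callers need not reach for
# the private name (alias name is an assumption).
NODE_ID_KEY = _NODE_ID_KEY

# At the call site:
embeddings_table_node_id_field=export.NODE_ID_KEY,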
QUALIFY RANK() OVER (PARTITION BY mapping.{original_node_id_field} ORDER BY RAND()) = 1
"""

bq_utils = BqUtils(project=get_resource_config().project)
We have been trying to prevent coupling of get_resource_config() in low-level / leaf APIs and function calls. It is better to dereference the resource config as high in the stack as we can and pass those values through. Not doing this leads to usability and extensibility challenges.
Recommendation: inject project as a parameter.
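A sketch of the injection; the parameter list mirrors the diff above, and the return type is an assumption:

def unenumerate_embeddings_table(
    enumerated_embeddings_table: str,
    embeddings_table_node_id_field: str,
    unenumerated_embeddings_table: str,
    enumerator_mapping_table: str,
    project: str,  # injected instead of read via get_resource_config()
) -> None:
    bq_utils = BqUtils(project=project)
    ...

# Caller, higher in the stack, dereferences the resource config once:
# unenumerate_embeddings_table(..., project=get_resource_config().project)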
)

# Process destination nodes for this edge type
infer_and_export_node_embeddings(
Opportunity to reduce complexity here by having infer_and_export_node_embeddings automatically export both src and dst node embeddings. We are essentially calling infer_and_export_node_embeddings with the same arguments but different values of is_src.
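A sketch of the consolidation; the wrapper name is hypothetical and assumes is_src is the only argument that differs between the two calls:

def infer_and_export_edge_endpoint_embeddings(**shared_kwargs) -> None:
    # Hypothetical wrapper: export embeddings for both endpoints of the
    # edge type so call sites do not repeat near-identical calls.
    for is_src in (True, False):
        infer_and_export_node_embeddings(is_src=is_src, **shared_kwargs)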
    world_size: int,
    device: torch.device,
    kge_config: HeterogeneousGraphSparseEmbeddingConfig,
    model_and_loss: Union[
Why do we need the loss if we are just doing inference?
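A hedged sketch of narrowing the parameter for inference; the other parameter names mirror the diff, and the torch.nn.Module annotation is an assumption about what the model half of the pair is:

def infer_and_export_embeddings(
    rank_prefix_str: str,
    world_size: int,
    device: torch.device,
    kge_config: HeterogeneousGraphSparseEmbeddingConfig,
    model: torch.nn.Module,  # just the model; unwrap (model, loss) at the call site
) -> None:
    ...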
Adds some utils used to write embeddings to BigQuery and unenumerate the resulting tables for KGE.
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: YES