feat: Add NCBI Datasets API Integration (56 Tools) #40
Integrates high-coverage NCBI Datasets API tools with auto-generated tool classes, wrappers, and JSON configs, supporting gene, genome, taxonomy, and virus queries. Introduces OpenAPI-driven discovery and code generation scripts, enabling maintenance automation and parameter synchronization. Ensures all tool schemas and parameters remain up to date with the evolving NCBI Datasets OpenAPI spec, minimizing manual drift. Provides an extensive, parametrized test suite for functionality, error handling, rate limits, and OpenAPI compliance, supporting robust, future-proof integration. Lays groundwork for continuous tool API maintenance and easy coverage extension as NCBI adds endpoints.
Introduces support for retrieving taxonomy dataset reports using NCBI taxon identifiers, including function, tool class, JSON schema, and integration into the tool universe. Enhances automation and code generation logic to handle flexible path parameters for endpoints that accept both single values and arrays. Improves coverage of NCBI Datasets API tools, enabling users to access richer taxonomic metadata across various taxonomic ranks.
Updates the parameter-building logic to use string concatenation that properly separates conditional parameter blocks with newlines. Prevents formatting issues in generated query code, ensuring parameters are correctly added when present.
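A minimal sketch of the kind of generator logic this commit describes, assuming the generator emits one `if` block per optional parameter. The function names and the sample parameters are hypothetical and are not taken from the PR's scripts. Joining the blocks with an empty string would fuse the end of one block with the `if` of the next, which is the sort of formatting issue a newline separator avoids.

```python
def build_param_block(param_names):
    """Return Python source that adds each optional query parameter when present."""
    blocks = []
    for name in param_names:
        blocks.append(
            f"if {name} is not None:\n"
            f"    params[{name!r}] = {name}"
        )
    # A newline separator keeps every conditional block on its own lines.
    return "\n".join(blocks)


def render_function(func_name, param_names):
    """Render a small function whose body is the generated parameter block."""
    signature = ", ".join(f"{p}=None" for p in param_names)
    body = build_param_block(param_names)
    # Indent the generated block so it sits inside the function body.
    indented = "\n".join("    " + line for line in body.splitlines())
    return (
        f"def {func_name}({signature}):\n"
        f"    params = {{}}\n"
        f"{indented}\n"
        f"    return params\n"
    )


if __name__ == "__main__":
    # Hypothetical parameter names, used only for illustration.
    print(render_function("build_query", ["page_size", "page_token"]))
```

Executing the rendered source is a quick way to confirm that the generated code is syntactically valid before it is written to disk.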
Adds support for additional flexible path parameters such as locus tags, assembly names, bioprojects, biosample IDs, proteins, tax IDs, and WGS accessions, enabling single values or lists for these inputs. Improves parameter description logic by extracting the first word from descriptions or falling back to parameter names, enhancing auto-generated documentation clarity. Updates response construction to include path parameters for better context. These changes improve tool flexibility and generated API documentation quality.
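The two behaviours above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: it assumes multiple identifiers are sent as one comma-separated path segment, and both helper names are hypothetical.

```python
from urllib.parse import quote


def format_path_param(value):
    """Accept a single identifier or an iterable of them; return one path segment."""
    if isinstance(value, (str, int)):
        items = [value]
    else:
        items = list(value)
    if not items:
        raise ValueError("at least one identifier is required")
    # Assumption: multiple values are joined with commas in the URL path.
    # Each item is percent-encoded on its own so the separators stay literal.
    return ",".join(quote(str(item), safe="") for item in items)


def describe_param(name, description):
    """Use the first word of the description, falling back to the parameter name."""
    words = (description or "").split()
    return words[0] if words else name
```

With this shape, `format_path_param(59067)` and `format_path_param(["59067", 50615])` both produce a value that can be substituted into a path template, so a caller does not need to know whether the endpoint was declared with a scalar or an array parameter.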
Introduces new auto-generated tools for NCBI Datasets API endpoints that provide dataset reports by gene ID, accession, taxon, locus tag, and for viruses and genomes by various identifiers. Updates initialization, lazy loading, and exports to support these tools and registers their schemas and Python client functions. Enables broader and more granular access to NCBI Datasets metadata, allowing easier integration and improved flexibility for downstream consumers.
Adds comprehensive integration with the NCBI Datasets API, introducing 56 new tools for accessing gene data, genome assemblies, taxonomy information, virus genomes, organelle data, and biosample records. This update includes auto-generated tool classes, detailed documentation, and a maintenance guide, enhancing the API's usability and flexibility for researchers. Additionally, known test failures are documented to improve testing transparency.
Combined NCBI Datasets tools with upstream's new tools (OLS, ClinVar, literature search tools). Updated type annotations, imports, lazy proxies, and __all__ list to include both sets of tools.
Tried to do my best here, but let me know if I missed anything that I can fix!
Looks good to me, thank you! I will test these tools on my side and merge them ASAP!
Hi @benjibromberg, sorry for the delay in merging. Could you please remove the tools that fail in tests for now? We can open another pull request to figure out why some tools are not working and how to fix them. Thank you!
Summary
This PR adds comprehensive integration with the NCBI Datasets API v2, providing
56 tools for accessing gene data, genome assemblies, taxonomy information,
virus genomes, organelle data, and biosample records. The integration uses an
OpenAPI-driven approach where the OpenAPI specification serves as the single
source of truth for all parameters, endpoints, and validation.
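To illustrate what treating the specification as the source of truth can look like, here is a minimal sketch that derives a JSON-schema-style parameter block from one OpenAPI operation. The function name and the sample operation are hypothetical, and the sketch does not resolve `$ref` entries, which a real generator working on the full spec would need to handle.

```python
def parameters_from_operation(operation):
    """Build a JSON-schema-style parameter description from an OpenAPI operation dict."""
    properties = {}
    required = []
    for param in operation.get("parameters", []):
        schema = param.get("schema", {})
        properties[param["name"]] = {
            # Default to "string" when the spec entry carries no explicit type.
            "type": schema.get("type", "string"),
            "description": param.get("description", ""),
        }
        if param.get("required", False):
            required.append(param["name"])
    return {"type": "object", "properties": properties, "required": required}


# Hypothetical operation fragment, shaped like an OpenAPI 3 operation object.
EXAMPLE_OPERATION = {
    "operationId": "example_gene_report",
    "parameters": [
        {
            "name": "gene_ids",
            "in": "path",
            "required": True,
            "schema": {"type": "array"},
            "description": "NCBI gene IDs",
        },
        {"name": "page_size", "in": "query", "schema": {"type": "integer"}},
    ],
}
```

Because the tool configuration is computed from the spec rather than typed by hand, a new or renamed parameter in the spec shows up in the configuration the next time the generator runs.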
Features
56 Tool Classes: Complete coverage of NCBI Datasets API endpoints
100% OpenAPI Parameter Coverage: All parameters from the OpenAPI
specification are implemented in each tool
Automated Generation System: Configuration files and test definitions
are auto-generated from the OpenAPI specification, ensuring easy updates
when NCBI releases new API versions
Comprehensive Test Suite: 447 tests total (408 passing, 91.3% pass rate)
Known failures documented in KNOWN_TEST_FAILURES.md
Test runtime of about 4 minutes for NCBI tests, due to rate limiting (0.25s delay between tests) to respect NCBI API limits
Complete Documentation:
docs/tools/ncbi_datasets_tools.rst (774 lines)
src/tooluniverse/data/specs/ncbi/README.md
examples/ncbi_datasets_tool_example.py
Technical Implementation
OpenAPI-Driven Architecture
The integration follows a specification-driven approach:
OpenAPI Specification:
src/tooluniverse/data/specs/ncbi/openapi3.docs.yaml
Auto-Generation Scripts:
scripts/discover_and_generate.py: Discovers endpoints and generates tool classes
scripts/update_ncbi_json_from_openapi.py: Updates JSON configurations from spec
Tool Classes: All 56 tools in src/tooluniverse/ncbi_datasets_tool.py (BaseTool)
Function Wrappers: 56 wrapper functions in src/tooluniverse/tools/
Test Results
Test Runtime Impact: This PR adds 447 tests to the test suite, which
extends the overall test runtime by approximately 4 minutes (~228 seconds).
Each test includes a 0.25s delay to respect NCBI API rate limits (5-10
requests/second), ensuring reliable test execution without hitting API
throttling.
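The figures above can be checked with a short calculation. It assumes the tests run sequentially and that each test issues a single request; both assumptions are mine and are not stated in the PR.

```python
def delay_budget(tests, delay_s, reported_runtime_s):
    """Split a reported runtime into fixed delay time and everything else."""
    delay_total = tests * delay_s
    return {
        "delay_total_s": delay_total,
        # Time left for network calls and assertions once delays are subtracted.
        "other_time_s": reported_runtime_s - delay_total,
        # Upper bound on request rate if each test sends one request.
        "max_requests_per_s": 1 / delay_s,
    }


budget = delay_budget(tests=447, delay_s=0.25, reported_runtime_s=228)
```

Under those assumptions the delays account for 111.75 s of the reported 228 s, and the ceiling of 4 requests per second stays below the 5 requests per second quoted for unauthenticated access.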
Known Failures: Documented in src/tooluniverse/data/specs/ncbi/KNOWN_TEST_FAILURES.md. These are upstream NCBI API issues.
Tests are kept active to detect when NCBI fixes these issues.
Upstream Compatibility
Merge Tested: Successfully merged with upstream/main
src/tooluniverse/__init__.py (resolved)
Files Changed
Core Implementation
src/tooluniverse/ncbi_datasets_tool.py: 56 tool classes
src/tooluniverse/data/ncbi_datasets_tools.json: Tool configurations
src/tooluniverse/tools/ncbi_datasets_*.py: 56 wrapper functions
src/tooluniverse/__init__.py: Updated imports and exports (4 locations)
Specifications and Maintenance
src/tooluniverse/data/specs/ncbi/: Complete directory
openapi3.docs.yaml: Official OpenAPI specification
README.md: Maintenance guide for contributors
KNOWN_TEST_FAILURES.md: Documentation of known API issues
scripts/discover_and_generate.py: Auto-generation script
scripts/update_ncbi_json_from_openapi.py: JSON config updater
Tests
tests/tools/test_ncbi_datasets_tool.py: Comprehensive test suite
Documentation
docs/tools/ncbi_datasets_tools.rst: Complete user documentation (774 lines)
examples/ncbi_datasets_tool_example.py: 13 working examples
API Key Support
Tools support optional API key authentication via the NCBI_API_KEY environment variable for enhanced rate limits (10 rps vs 5 rps default). See docs/tools/ncbi_datasets_tools.rst for setup instructions.
Usage Example
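A minimal sketch of how a caller might pick up the optional key, not the PR's wrapper code. The helper names are hypothetical, and the `api-key` header name is an assumption that should be checked against the NCBI Datasets documentation. The rate values are the ones quoted above.

```python
import os

DEFAULT_RPS = 5    # rate quoted above for requests without a key
API_KEY_RPS = 10   # rate quoted above for requests with a key


def _api_key(env=None):
    """Read NCBI_API_KEY from the given mapping, or from the process environment."""
    env = os.environ if env is None else env
    return env.get("NCBI_API_KEY", "").strip()


def build_headers(env=None):
    """Return request headers, adding the key only when one is configured."""
    headers = {"Accept": "application/json"}
    key = _api_key(env)
    if key:
        # Assumption: the service reads the key from an `api-key` header.
        headers["api-key"] = key
    return headers


def allowed_rps(env=None):
    """Pick the request-rate ceiling that matches the current configuration."""
    return API_KEY_RPS if _api_key(env) else DEFAULT_RPS
```

Passing a plain dictionary as `env` makes both helpers easy to exercise in tests without touching the real environment.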
Maintenance
Future updates to the NCBI Datasets API can be easily integrated by:
Updating openapi3.docs.yaml with the new specification
Running python src/tooluniverse/data/specs/ncbi/scripts/discover_and_generate.py
See src/tooluniverse/data/specs/ncbi/README.md for detailed maintenance instructions.
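After regenerating, a parity check along these lines could confirm that a tool's configuration still matches the specification. This is a hypothetical helper for illustration and is not part of the PR's scripts or test suite.

```python
def find_parameter_drift(spec_params, config_params):
    """Compare parameter names from the spec with those in a tool configuration."""
    spec, config = set(spec_params), set(config_params)
    return {
        # Declared by the spec but absent from the generated configuration.
        "missing_from_config": sorted(spec - config),
        # Present in the configuration but no longer declared by the spec.
        "not_in_spec": sorted(config - spec),
    }


def is_in_sync(spec_params, config_params):
    """True when both sides declare exactly the same parameter names."""
    drift = find_parameter_drift(spec_params, config_params)
    return not drift["missing_from_config"] and not drift["not_in_spec"]
```

An empty result in both lists for every tool is one way to express the 100% parameter coverage goal as a check that can run after each spec update.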
Related Issues
This PR adds a new API integration following the OpenAPI-driven approach
documented in the maintenance guide. The integration is complete and ready
for review.
Checklist
__init__.py updated in all 4 required locations