@pankajbaid567

Add three new CLI subcommands under 'openml datasets':

  • openml datasets list: List datasets with optional filtering
  • openml datasets info: Display detailed dataset information
  • openml datasets search: Search datasets by name (case-insensitive)

Features:

  • Support for multiple filter options (tag, status, size, instances, features, classes)
  • Output formatting (table/json) with verbose mode
  • Pagination support (offset, size)
  • Comprehensive test suite with mocked API calls
  • Proper error handling

Addresses ESoC 2025 goal of improving user experience of the dataset catalogue.

Related to issue #1503: "Add CLI Commands for browsing and searching OpenML datasets"

What does this PR implement/fix?

This PR adds three new CLI subcommands under openml datasets to improve the user experience of the dataset catalogue:

  • openml datasets list - List datasets with optional filtering (tag, status, data_name, number_instances, number_features, number_classes, pagination, output format)
  • openml datasets info <dataset_id> - Display detailed information about a specific dataset including qualities, features, and metadata
  • openml datasets search <query> - Search datasets by name with case-insensitive matching
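As an illustration of how the three subcommands could hang off a `datasets` command group, here is a minimal `argparse` sketch. It is not the PR's actual parser code: `build_parser` is a hypothetical name, and the `--status` and `--offset` flags are assumed from the filter and pagination options listed in this description rather than taken from the implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser layout: openml -> datasets -> list | info | search."""
    parser = argparse.ArgumentParser(prog="openml")
    commands = parser.add_subparsers(dest="command", required=True)

    datasets = commands.add_parser("datasets", help="Browse and search datasets")
    actions = datasets.add_subparsers(dest="action", required=True)

    # "list": optional filters, pagination, and output formatting.
    list_cmd = actions.add_parser("list", help="List datasets")
    list_cmd.add_argument("--tag")
    list_cmd.add_argument("--status")  # assumed flag name
    list_cmd.add_argument("--offset", type=int)  # assumed flag name
    list_cmd.add_argument("--size", type=int)
    list_cmd.add_argument("--number-instances")  # e.g. "100..1000"
    list_cmd.add_argument("--format", choices=["table", "json"], default="table")
    list_cmd.add_argument("--verbose", action="store_true")

    # "info": one positional dataset ID.
    info_cmd = actions.add_parser("info", help="Show details for one dataset")
    info_cmd.add_argument("dataset_id", type=int)

    # "search": one positional query string.
    search_cmd = actions.add_parser("search", help="Search datasets by name")
    search_cmd.add_argument("query")
    return parser


args = build_parser().parse_args(["datasets", "list", "--size", "10"])
print(args.action, args.size)
```

Nesting a second `add_subparsers` call under `datasets` keeps the new commands grouped and leaves room for further dataset actions later.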

Why is this change necessary? What is the problem it solves?

Currently, users must write Python code to browse or search OpenML datasets, even for simple tasks like listing available datasets or finding a specific dataset. This creates a barrier to entry and makes the dataset catalogue less accessible. Adding CLI commands allows users to interact with the dataset catalogue directly from the command line without writing code.

This directly addresses the ESoC 2025 goal of "Improving user experience of the dataset catalogue in AIoD and OpenML".

How can I reproduce the issue this PR is solving and its solution?

Before (requires Python code):

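import openml

# Recent openml-python releases return a pandas DataFrame from list_datasets(),
# so iterate over its rows rather than over dict items.
datasets = openml.datasets.list_datasets(size=10)
for _, dataset in datasets.iterrows():
    print(f"{dataset['did']}: {dataset['name']}")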

After (CLI commands):

# List first 10 datasets
openml datasets list --size 10

# Search for iris datasets
openml datasets search iris

# Get detailed info about a dataset
openml datasets info 61

# List datasets with a specific tag as a table, with verbose output
openml datasets list --tag study_14 --format table --verbose

# Filter by number of instances
openml datasets list --number-instances "100..1000"
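
The case-insensitive name matching behind `openml datasets search` can be sketched as below. This is an assumption about the behavior, not the PR's code: it treats the query as a substring, `search_by_name` is a hypothetical helper, and the sample records are invented for the example.

```python
def search_by_name(records: list[dict], query: str) -> list[dict]:
    """Return records whose 'name' contains the query, ignoring case."""
    needle = query.casefold()
    return [r for r in records if needle in str(r.get("name", "")).casefold()]


# Invented sample records; real rows would come from the dataset listing.
records = [
    {"did": 1, "name": "iris"},
    {"did": 2, "name": "Iris-Demo"},
    {"did": 3, "name": "anneal"},
]
matches = search_by_name(records, "IRIS")
print([r["did"] for r in matches])
```

`str.casefold()` is used instead of `lower()` because it also normalizes some non-ASCII case variants, which may matter for dataset names outside plain English.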

Implementation Details:

  • Added three new functions in openml/cli.py: datasets_list(), datasets_info(), datasets_search()
  • Added helper function _format_output() for consistent output formatting (table/JSON)
  • Integrated into main CLI parser with proper argument handling
  • Added comprehensive test suite in tests/test_openml/test_cli.py (11 test cases)
  • Uses existing openml.datasets.list_datasets() and openml.datasets.get_dataset() functions - no changes to core API
  • Follows existing CLI patterns (similar to configure command)
  • All tests use mocked API calls to avoid requiring server connections
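
To make the formatting and mocking points above concrete, here is a rough sketch of a table/JSON formatter exercised with a mocked listing function. It is not the PR's `_format_output()` implementation: `format_output`, `datasets_list`, and the injected `list_fn` parameter are hypothetical names chosen for this example, and the row shown is sample data.

```python
import json
from unittest import mock


def format_output(rows: list[dict], fmt: str = "table") -> str:
    """Render rows as JSON or as a plain fixed-width table (hypothetical helper)."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if not rows:
        return "No datasets found."
    columns = list(rows[0])
    # Each column is as wide as its header or its longest cell.
    widths = {c: max(len(c), *(len(str(r.get(c, ""))) for r in rows)) for c in columns}
    header = "  ".join(c.ljust(widths[c]) for c in columns)
    rule = "  ".join("-" * widths[c] for c in columns)
    body = ["  ".join(str(r.get(c, "")).ljust(widths[c]) for c in columns) for r in rows]
    return "\n".join([header, rule, *body])


def datasets_list(list_fn, size: int, fmt: str) -> str:
    """Fetch rows through an injected listing function, then format them."""
    return format_output(list_fn(size=size), fmt)


# A Mock stands in for the server-backed listing call, so nothing hits the network.
fake_list = mock.Mock(return_value=[{"did": 61, "name": "iris"}])
output = datasets_list(fake_list, size=10, fmt="table")
fake_list.assert_called_once_with(size=10)
print(output)
```

Passing the listing function in as an argument is only one way to make the command testable. Patching `openml.datasets.list_datasets` with `mock.patch` would achieve the same isolation without changing the function signature.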

Related to issue openml#1486