You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As part of #245, basic auth support was added for the fetching of raw data files. During a review of using mlcroissant for data loading functionality, we realized that there's a data exfiltration vulnerability. The exploit could look something like this:
Bad actor hosts a properly formatted Croissant JSON-LD. Maybe from a reputable-seeming location like a public GCS/S3 URL.
The distribution property in the JSON-LD contains contentUrl values that point to a server under their control.
Bad actor publishes an example notebook of how to use Croissant to load data. It would:
Describe how to set the CROISSANT_BASIC_AUTH_* values to access a private dataset on kaggle.com, where victims would provide their own username and API key.
Use some means to load the private Kaggle data--either via the load script or by doing something like mlc.Dataset(private_kaggle_dataset_url).records(record_set=name_of_tabular_file)--which would use the Kaggle credentials in the requests for data. Note that this step isn't necessary for exfiltration, but establishes a reason for why the victim's credentials would be needed in the first place.
Without clearing environment variables, then load data from the from the bad actor's Croissant. Because of the presence of the same values in the environment variables, the victim's credentials would be sent to the server under the attacker's control. The attacker could store those credentials and use them at a later time. If the attacker's server returns data that allows the notebook to complete as expected, this could go undetected for a while, allowing the attacker to amass credentials for several victims.
Kaggle is the example here, but this could be the case for other data providers that support basic auth as a means for retrieving data behind ACLs.
Solutions:
We could introduce a 3rd new environment variable to serve as a safeguard. Something like CROISSANT_BASIC_AUTH_HOST could be used to ensure that requests for data are only permitted to use the CROISSANT_BASIC_AUTH_(USERNAME|PASSWORD) for URLs that match on that domain.
This is maybe preferred since we already have environment-based authentication and the implementation is simple
Authentication values could be accepted at the point of loading data so that each instantiation/invocation supplies the values rather than reading from the environment.
Aside from being more complicated, this approach is also slightly riskier: it would require changing more of the code (higher chance of regression) and because of the top-down approach of passing in values, there's a higher risk of missing a path/branch.
The text was updated successfully, but these errors were encountered:
As part of #245, basic auth support was added for the fetching of raw data files. During a review of using
mlcroissant
for data loading functionality, we realized that there's a data exfiltration vulnerability. The exploit could look something like this:distribution
property in the JSON-LD containscontentUrl
values that point to a server under their control.CROISSANT_BASIC_AUTH_*
values to access a private dataset onkaggle.com
, where victims would provide their own username and API key.mlc.Dataset(private_kaggle_dataset_url).records(record_set=name_of_tabular_file)
--which would use the Kaggle credentials in the requests for data. Note that this step isn't necessary for exfiltration, but establishes a reason for why the victim's credentials would be needed in the first place.Solutions:
CROISSANT_BASIC_AUTH_HOST
could be used to ensure that requests for data are only permitted to use theCROISSANT_BASIC_AUTH_(USERNAME|PASSWORD)
for URLs that match on that domain.The text was updated successfully, but these errors were encountered: