[/datasetio] drop columns not specified by dataset schema for huggingface provider (#611)

# What does this PR do?

**Why**
- Hugging Face datasets can contain extra, unused columns, and some of
these columns (e.g. images) cannot be serialized to JSON for datasetio
HTTP requests.
- It is also inefficient to create a new dataset that contains only a
subset of the columns.

**Solution**
- drop columns not specified by dataset schema
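
A minimal sketch of the idea, using plain Python dicts rather than the Hugging Face `datasets` API. The rows, the schema, and the helper names `drop_unspecified_columns` and `is_json_serializable` are hypothetical and exist only for illustration; the `bytes` value stands in for an image column.

```python
import json

# Hypothetical rows: "image" holds raw bytes, standing in for an image column.
rows = [
    {"question": "What is 2 + 2?", "answer": "4", "image": b"\x89PNG"},
    {"question": "Capital of France?", "answer": "Paris", "image": b"\x89PNG"},
]

# Hypothetical schema: only its keys matter for column selection.
dataset_schema = {"question": "string", "answer": "string"}


def drop_unspecified_columns(rows, schema):
    """Keep only the columns named in the schema; return rows unchanged if no schema."""
    if not schema:
        return rows
    keep = list(schema.keys())
    return [{name: row[name] for name in keep} for row in rows]


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False if it raises TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


filtered = drop_unspecified_columns(rows, dataset_schema)
```

With the extra `image` column present, `json.dumps(rows)` raises `TypeError`; after filtering by the schema keys, the rows serialize cleanly. An empty or missing schema leaves the rows untouched, mirroring the `if dataset_def.dataset_schema:` guard in this change.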

## Test Plan

Tested with script:
https://gist.github.com/yanxi0830/23be5725e0d82d79e24cc5dd1d21b571


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
Xi Yan 2024-12-12 10:23:09 -08:00 committed by GitHub
parent b7cb06f004
commit 8b45d147df


@@ -21,14 +21,19 @@ DATASETS_PREFIX = "datasets:"
 def load_hf_dataset(dataset_def: Dataset):
     if dataset_def.metadata.get("path", None):
-        return hf_datasets.load_dataset(**dataset_def.metadata)
-    df = get_dataframe_from_url(dataset_def.url)
-    if df is None:
-        raise ValueError(f"Failed to load dataset from {dataset_def.url}")
-    dataset = hf_datasets.Dataset.from_pandas(df)
+        dataset = hf_datasets.load_dataset(**dataset_def.metadata)
+    else:
+        df = get_dataframe_from_url(dataset_def.url)
+        if df is None:
+            raise ValueError(f"Failed to load dataset from {dataset_def.url}")
+        dataset = hf_datasets.Dataset.from_pandas(df)
+    # drop columns not specified by schema
+    if dataset_def.dataset_schema:
+        dataset = dataset.select_columns(list(dataset_def.dataset_schema.keys()))
     return dataset