# Huggingface Integration
Loadax provides integration with Huggingface Datasets, allowing you to load datasets from the Huggingface hub and use them with loadax.
## Loading a Dataset
To load a dataset from the Huggingface hub, use the `from_hub` method of the `HuggingFaceDataset` class. This method takes the path to the dataset on the Huggingface hub, and optionally the name and split of the dataset. The dataset is lazily loaded from the Huggingface cache.
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
```
Alternatively, you can construct a `HuggingFaceDataset` directly from a Huggingface `Dataset` object. This is useful if you want to do some preprocessing and store the results in the Huggingface dataset cache.
```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
train_dataset = HuggingFaceDataset(train_data)
```
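For instance, a minimal sketch of this preprocessing pattern using the `datasets` library's `map` method (the lowercasing transform is purely illustrative):

```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")

# Preprocess with datasets' map(); the result is written to the Huggingface
# dataset cache, so the transform only runs once across repeated runs.
train_data = train_data.map(lambda example: {"text": example["text"].lower()})

train_dataset = HuggingFaceDataset(train_data)
```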
## Sharding a Dataset
Huggingface datasets natively support sharding, so there is no need to wrap them in a `ShardedDataset`. Instead, you can use the `split_dataset_by_node` method to get the shard of the dataset for a given node. This method takes the world size and the rank of the node and returns the corresponding shard. The shards are contiguous and consistent for a given `world_size`.
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
shard = dataset.split_dataset_by_node(world_size=2, rank=0)
```
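Each node calls the method with the same `world_size` and its own `rank`; together the shards cover the full split. A short sketch continuing the example above:

```python
# Both nodes pass the same world_size; the shards are contiguous and
# together cover the entire train split.
shard_rank0 = dataset.split_dataset_by_node(world_size=2, rank=0)
shard_rank1 = dataset.split_dataset_by_node(world_size=2, rank=1)

# Shards behave like any loadax dataset, e.g. fetch the first example.
first_example = shard_rank0.get(0)
```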
## HuggingFaceDataset

Bases: `Shardable[Example]`, `Dataset[Example]`

A dataset that integrates with Hugging Face's `datasets` library.
Any Huggingface-compatible dataset can be loaded with loadax, letting you leverage the rich ecosystem of datasets, tooling, and efficient Arrow-backed tables.
If you are loading large datasets in a multi-host environment, it is important to think about the order in which you load data for sharding. If you intend to shard your data such that each host is not fully replicated, you will need to decide how to split the dataset.

A Huggingface dataset is sharded at load time if you use `HuggingFaceDataset.from_hub(..., num_shards=n, shard_id=i)`; otherwise you will want to pre-shard it yourself.
Examples:

```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
# shard() returns a new dataset rather than sharding in place, and the
# shard index keyword in the datasets library is `index`.
train_data = train_data.shard(num_shards=2, index=0)

train_dataset = HuggingFaceDataset(train_data)
data = train_dataset.get(0)
print(data)
```
Alternatively, you can use `ShardedDataset` to wrap a `HuggingFaceDataset`. This performs the same sharding algorithm as the `datasets` library, but depending on your usage it may not avoid the extra network overhead of loading all shards.
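A hypothetical sketch of that wrapping, assuming `ShardedDataset` is constructed from the dataset plus `num_shards` and `shard_id` arguments (the import path and constructor signature here are assumptions; check the loadax API reference):

```python
from loadax.dataset.huggingface import HuggingFaceDataset
from loadax.dataset.sharded_dataset import ShardedDataset  # assumed import path

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Assumed constructor arguments; verify against the ShardedDataset docs.
shard = ShardedDataset(dataset, num_shards=2, shard_id=0)
```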
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `Dataset` | HuggingFace Dataset | *required* |
Source code in `src/loadax/dataset/huggingface.py`
### from_hub `staticmethod`

```python
from_hub(path: str, name: str | None = None, split: str | None = None) -> HuggingFaceDataset[Example]
```
Load a HuggingFace dataset from the HuggingFace hub.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | The path to the dataset on the HuggingFace hub. | *required* |
| `name` | `str \| None` | The name of the dataset on the HuggingFace hub. | `None` |
| `split` | `str \| None` | The split of the dataset on the HuggingFace hub. | `None` |
Returns:

| Type | Description |
| --- | --- |
| `HuggingFaceDataset[Example]` | The HuggingFace dataset. |
Examples:
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb")
```
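The optional `name` and `split` parameters are forwarded when loading from the hub; for example, to load just the test split:

```python
# `split` restricts loading to a single split; `name` selects a dataset
# configuration when the hub repository defines several.
test_dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="test")
```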
Source code in `src/loadax/dataset/huggingface.py`
### split_dataset_by_node
Split the dataset into shards.
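Given the parameters and return type documented below, the expected signature is:

```python
split_dataset_by_node(world_size: int, rank: int) -> Dataset[Example]
```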
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `world_size` | `int` | The number of nodes. | *required* |
| `rank` | `int` | The rank of the current node. | *required* |
Returns:

| Type | Description |
| --- | --- |
| `Dataset[Example]` | The shard of the dataset for the current node. |
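For example, mirroring the usage shown in the sharding guide above:

```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# On a two-node setup, rank 0 receives the first contiguous shard.
shard = dataset.split_dataset_by_node(world_size=2, rank=0)
```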
Source code in `src/loadax/dataset/huggingface.py`