Huggingface Integration

Loadax provides integration with Huggingface Datasets. This integration allows you to load datasets from the Huggingface hub and use them with loadax.

Loading a Dataset

To load a dataset from the Huggingface hub, use the from_hub method of the HuggingFaceDataset class. This method takes the path to the dataset on the Huggingface hub, along with an optional configuration name and split. The dataset is lazily loaded from the Huggingface cache.

from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
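Once loaded, individual examples can be retrieved by index with get (shown in the API reference below):

from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Fetch a single example; for imdb this is a dict with "text" and "label" fields.
example = dataset.get(0)
print(example)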

Alternatively, you can construct a HuggingFaceDataset directly from a Huggingface Dataset object. This is useful if you want to perform preprocessing and store the results in the Huggingface dataset cache.

from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
train_dataset = HuggingFaceDataset(train_data)
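For example, you can run a preprocessing pass with the datasets library's map before wrapping; the lowercasing step below is just an illustrative stand-in for your own preprocessing:

from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")

# Apply a preprocessing step; the datasets library caches the result on disk.
# Replace the lowercasing with your own cleaning or tokenization logic.
train_data = train_data.map(lambda example: {"text": example["text"].lower()})

train_dataset = HuggingFaceDataset(train_data)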

Sharding a Dataset

Huggingface datasets natively support sharding, so there is no need to wrap them in a ShardedDataset. Instead, you can use the split_dataset_by_node method to get the shard of the dataset for a given node. This method takes the world size and the rank of the node and returns that node's shard. The shards are contiguous and consistent for a given world_size.

from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
shard = dataset.split_dataset_by_node(world_size=2, rank=0)
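In a multi-host JAX setup, the rank and world size typically come from the JAX distributed runtime. A minimal sketch, assuming jax.distributed has already been initialized on each host:

import jax
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Each host receives a contiguous, non-overlapping shard of the dataset.
shard = dataset.split_dataset_by_node(
    world_size=jax.process_count(),
    rank=jax.process_index(),
)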

HuggingFaceDataset

Bases: Shardable[Example], Dataset[Example]

A dataset that integrates with Hugging Face's datasets library.

Any huggingface compatible dataset can be loaded with loadax to leverage the rich ecosystem of datasets, tooling, and efficient arrow-backed tables.

If you are loading large datasets in a multi-host environment, it is important to think about how the data is loaded and split for sharding. If you intend to shard your data so that each host does not hold a full replica, you will need to decide how to split the dataset.

A huggingface dataset is sharded when loaded if using HuggingFaceDataset.from_hub(..., num_shards=n, shard_id=i). Otherwise you will want to pre-shard it yourself.

Examples:

from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
train_data = train_data.shard(num_shards=2, index=0)

train_dataset = HuggingFaceDataset(train_data)

data = train_dataset.get(0)
print(data)

Alternatively, you can wrap a HuggingFaceDataset in a ShardedDataset. This applies the same sharding algorithm as the datasets library, but depending on your usage it may not avoid the extra network overhead of loading all shards.

Parameters:

dataset (Dataset): HuggingFace Dataset. Required.
Source code in src/loadax/dataset/huggingface.py
def __init__(
    self,
    dataset: HFDataset,
):
    """Initialize a huggingface dataset that has already been loaded.

    Any huggingface compatible dataset can be loaded with loadax to leverage
    the rich ecosystem of datasets, tooling, and efficient arrow-backed tables.

    If you are loading large datasets in a multi-host environment it is important
    to think about the order you load data for sharding. If you intend to shard
    your data such that each host is not fully replicated you will need to
    identify how to split the dataset.

    A huggingface dataset is sharded when loaded if using
    `HuggingFaceDataset.from_hub(..., num_shards=n, shard_id=i)`. Otherwise
    you will want to pre-shard it yourself.

    Examples:
        ```python
        from loadax.dataset.huggingface import HuggingFaceDataset
        import datasets as hf_datasets

        train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
        train_data = train_data.shard(num_shards=2, index=0)

        train_dataset = HuggingFaceDataset(train_data)

        data = train_dataset.get(0)
        print(data)
        ```

    Alternatively you can use ShardedDataset to wrap a HuggingFaceDataset. This
    will perform the same sharding algorithm as the datasets library however based
    on your usage it may not prevent the extra network overhead of loading all
    shards.

    Args:
        dataset: HuggingFace Dataset
    """
    self._dataset = dataset

dataset property

dataset: Dataset

The underlying HuggingFace dataset.
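The property is useful for dropping back down to the datasets API, for example to inspect features or run further processing before re-wrapping. A small sketch:

from loadax.dataset.huggingface import HuggingFaceDataset

train_dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Access the wrapped Hugging Face dataset directly.
hf_dataset = train_dataset.dataset
print(hf_dataset.features)

# Further processing yields a plain Hugging Face dataset,
# which can be wrapped again for use with loadax.
positives = HuggingFaceDataset(hf_dataset.filter(lambda example: example["label"] == 1))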

from_hub staticmethod

from_hub(path: str, name: str | None = None, split: str | None = None) -> HuggingFaceDataset[Example]

Load a HuggingFace dataset from the HuggingFace hub.

Parameters:

path (str): The path to the dataset on the HuggingFace hub. Required.
name (str | None): The name of the dataset on the HuggingFace hub. Default: None.
split (str | None): The split of the dataset on the HuggingFace hub. Default: None.

Returns:

HuggingFaceDataset[Example]: The HuggingFace dataset.

Examples:

from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb")
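For datasets with multiple configurations, the configuration is selected with name. A sketch, assuming the GLUE benchmark hosted at nyu-mll/glue on the hub:

from loadax.dataset.huggingface import HuggingFaceDataset

# "name" selects the configuration (here the sst2 subset of GLUE),
# and "split" selects which split to load.
dataset = HuggingFaceDataset.from_hub("nyu-mll/glue", name="sst2", split="train")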
Source code in src/loadax/dataset/huggingface.py
@staticmethod
def from_hub(
    path: str,
    name: str | None = None,
    split: str | None = None,
) -> "HuggingFaceDataset[Example]":
    """Load a HuggingFace dataset from the HuggingFace hub.

    Args:
        path: The path to the dataset on the HuggingFace hub.
        name: The name of the dataset on the HuggingFace hub.
        split: The split of the dataset on the HuggingFace hub.

    Returns:
        HuggingFaceDataset[Example]: The HuggingFace dataset.

    Examples:
        ```python
        from loadax.dataset.huggingface import HuggingFaceDataset

        dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb")
        ```
    """
    dataset = load_dataset(
        path=path, name=name, split=split, trust_remote_code=True
    )
    dataset.set_format(type="numpy")

    assert isinstance(
        dataset, HFDataset
    ), f"loaded dataset must be a Dataset, got {type(dataset)}"

    logger.info(f"Loaded HF dataset with length: {len(dataset)}")
    return HuggingFaceDataset[Example](dataset)

split_dataset_by_node

split_dataset_by_node(world_size: int, rank: int) -> Dataset[Example]

Split the dataset into shards.

Parameters:

world_size (int): The number of nodes. Required.
rank (int): The rank of the current node. Required.

Returns:

Dataset[Example]: The shard of the dataset for the current node.
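A short sketch of how each rank obtains its own contiguous shard (two ranks shown for illustration):

from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Each rank receives a distinct, contiguous chunk of the full dataset.
shard_0 = dataset.split_dataset_by_node(world_size=2, rank=0)
shard_1 = dataset.split_dataset_by_node(world_size=2, rank=1)

print(shard_0.get(0))
print(shard_1.get(0))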

Source code in src/loadax/dataset/huggingface.py
def split_dataset_by_node(self, world_size: int, rank: int) -> Dataset[Example]:
    """Split the dataset into shards.

    Args:
        world_size (int): The number of nodes.
        rank (int): The rank of the current node.

    Returns:
        Dataset[Example]: The shard of the dataset for the current node.
    """
    from datasets.distributed import (
        split_dataset_by_node as hf_split_dataset_by_node,
    )

    dataset = hf_split_dataset_by_node(self._dataset, rank, world_size)
    assert isinstance(dataset, HFDataset)
    return HuggingFaceDataset[Example](dataset)