# Huggingface Integration
Loadax provides integration with Huggingface Datasets, allowing you to load datasets from the Huggingface hub and use them with loadax.
## Loading a Dataset
To load a dataset from the Huggingface hub, use the `from_hub` method of the `HuggingFaceDataset` class. This method takes the path to the dataset on the Huggingface hub, and optionally the name and split of the dataset. The dataset is lazily loaded from the Huggingface cache.
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
```
Alternatively, you can construct a `HuggingFaceDataset` directly from a Huggingface `Dataset` object. This is useful if you want to do some preprocessing and store the results in the Huggingface dataset cache.
```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
train_dataset = HuggingFaceDataset(train_data)
```
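For instance, a minimal sketch of this preprocessing pattern using the `datasets` library's `map` method (the lowercasing transform is purely illustrative):

```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")

# Preprocess with datasets' map(); the result is written to the Huggingface
# dataset cache, so the transform only runs once across repeated runs.
train_data = train_data.map(lambda example: {"text": example["text"].lower()})

train_dataset = HuggingFaceDataset(train_data)
```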
## Sharding a Dataset
Huggingface datasets natively support sharding, so there is no need to wrap them in a `ShardedDataset`. Instead, you can use the `split_dataset_by_node` method to get the shard of the dataset for a given node. This method takes the world size and the rank of the node and returns the corresponding shard. The shards are contiguous and consistent for a given `world_size`.
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")
shard = dataset.split_dataset_by_node(world_size=2, rank=0)
```
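Each node calls the method with the same `world_size` and its own `rank`; together the shards cover the full split. A short sketch continuing the example above:

```python
# Both nodes pass the same world_size; the shards are contiguous and
# together cover the entire train split.
shard_rank0 = dataset.split_dataset_by_node(world_size=2, rank=0)
shard_rank1 = dataset.split_dataset_by_node(world_size=2, rank=1)

# Shards behave like any loadax dataset, e.g. fetch the first example.
first_example = shard_rank0.get(0)
```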
## HuggingFaceDataset

Bases: `Shardable[Example]`, `Dataset[Example]`

A dataset that integrates with Hugging Face's `datasets` library.
Any Huggingface-compatible dataset can be loaded with loadax, letting you leverage the rich ecosystem of datasets, tooling, and efficient Arrow-backed tables.
If you are loading large datasets in a multi-host environment, it is important to think about the order in which you load data for sharding. If you intend to shard your data such that each host is not fully replicated, you will need to decide how to split the dataset.

A Huggingface dataset is sharded at load time if you use `HuggingFaceDataset.from_hub(..., num_shards=n, shard_id=i)`; otherwise you will want to pre-shard it yourself.
Examples:

```python
from loadax.dataset.huggingface import HuggingFaceDataset
import datasets as hf_datasets

train_data = hf_datasets.load_dataset("stanfordnlp/imdb", split="train")
# shard() returns a new dataset rather than sharding in place, and the
# shard index keyword in the datasets library is `index`.
train_data = train_data.shard(num_shards=2, index=0)

train_dataset = HuggingFaceDataset(train_data)
data = train_dataset.get(0)
print(data)
```
Alternatively, you can use `ShardedDataset` to wrap a `HuggingFaceDataset`. This performs the same sharding algorithm as the `datasets` library, but depending on your usage it may not avoid the extra network overhead of loading all shards.
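A hypothetical sketch of that wrapping, assuming `ShardedDataset` is constructed from the dataset plus `num_shards` and `shard_id` arguments (the import path and constructor signature here are assumptions; check the loadax API reference):

```python
from loadax.dataset.huggingface import HuggingFaceDataset
from loadax.dataset.sharded_dataset import ShardedDataset  # assumed import path

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# Assumed constructor arguments; verify against the ShardedDataset docs.
shard = ShardedDataset(dataset, num_shards=2, shard_id=0)
```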
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `Dataset` | HuggingFace Dataset | *required* |
Source code in `src/loadax/dataset/huggingface.py`
### from_hub `staticmethod`

```python
from_hub(path: str, name: str | None = None, split: str | None = None) -> HuggingFaceDataset[Example]
```
Load a HuggingFace dataset from the HuggingFace hub.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | The path to the dataset on the HuggingFace hub. | *required* |
| `name` | `str \| None` | The name of the dataset on the HuggingFace hub. | `None` |
| `split` | `str \| None` | The split of the dataset on the HuggingFace hub. | `None` |
Returns:

| Type | Description |
| --- | --- |
| `HuggingFaceDataset[Example]` | The HuggingFace dataset. |
Examples:
```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb")
```
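The optional `name` and `split` parameters are forwarded when loading from the hub; for example, to load just the test split:

```python
# `split` restricts loading to a single split; `name` selects a dataset
# configuration when the hub repository defines several.
test_dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="test")
```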
Source code in `src/loadax/dataset/huggingface.py`
### split_dataset_by_node
Split the dataset into shards.
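Given the parameters and return type documented below, the expected signature is:

```python
split_dataset_by_node(world_size: int, rank: int) -> Dataset[Example]
```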
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `world_size` | `int` | The number of nodes. | *required* |
| `rank` | `int` | The rank of the current node. | *required* |
Returns:

| Type | Description |
| --- | --- |
| `Dataset[Example]` | The shard of the dataset for the current node. |
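For example, mirroring the usage shown in the sharding guide above:

```python
from loadax.dataset.huggingface import HuggingFaceDataset

dataset = HuggingFaceDataset.from_hub("stanfordnlp/imdb", split="train")

# On a two-node setup, rank 0 receives the first contiguous shard.
shard = dataset.split_dataset_by_node(world_size=2, rank=0)
```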
Source code in `src/loadax/dataset/huggingface.py`