Dataloader

The Dataloader is the main interface for loading data into your training loop. It is responsible for loading data efficiently from a dataset, grouping it into batches, allocating it to the appropriate devices, and handling the other details that make up proper data loading.

The Dataloader spawns background workers that prefetch data into a cache, then fills batches from the cache as items become available. Because the workers run in the background, loading does not block the main thread, which is important for training loops. Loadax takes care of the parallelization details for you, so your data loading is fast, reliable, and simple. The background cache fills out of order because it uses multithreading to load data in parallel, but the batches themselves are always yielded in order: loadax guarantees deterministic batch ordering, and the background workers load batches in the order in which they are requested.
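The ordering guarantee described above can be illustrated with a minimal, self-contained sketch. This is not loadax's implementation; `load_item` and `ordered_batches` are hypothetical names, and the sketch only assumes the general technique of submitting loads to a thread pool and consuming the results in submission order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def load_item(index: int) -> int:
    """Hypothetical loader: simulates I/O with variable latency."""
    # Random sleep means workers finish in a non-deterministic order.
    time.sleep(random.uniform(0.0, 0.01))
    return index * 10


def ordered_batches(indices, batch_size: int, num_workers: int):
    """Yield batches in request order even though items finish out of order."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit every load up front; completion order is arbitrary.
        futures = [executor.submit(load_item, i) for i in indices]
        batch = []
        # Reading futures in submission order restores determinism.
        for future in futures:
            batch.append(future.result())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            # Emit the final, possibly incomplete, batch.
            yield batch


batches = list(ordered_batches(range(5), batch_size=2, num_workers=2))
print(batches)
# [[0, 10], [20, 30], [40]]
```

The workers may complete in any order, but the consumer only ever waits on the next future in line, so the batch contents do not depend on thread timing.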

Creating a dataloader
from loadax import Dataloader, SimpleDataset

dataset = SimpleDataset([1, 2, 3, 4, 5])
dataloader = Dataloader(
    dataset=dataset,
    batch_size=2,
    num_workers=2,
    prefetch_factor=2,
)
for batch in dataloader:
    print(batch)

#> [1, 2]
#> [3, 4]
#> [5]

Bases: Generic[Example]

Dataloader that loads batches in the background or synchronously.

Example
from loadax.experimental.dataset.simple import SimpleDataset
from loadax.experimental.loader import Dataloader

dataset = SimpleDataset([1, 2, 3, 4, 5])
dataloader = Dataloader(
    dataset=dataset,
    batch_size=2,
    num_workers=2,
    prefetch_factor=2,
    drop_last=False,
)
for batch in dataloader:
    print(batch)

#> [1, 2]
#> [3, 4]
#> [5]

Parameters:

Name | Type | Description | Default
dataset | Dataset | The dataset to load data from. | required
batch_size | int | The size of each batch. | required
num_workers | int | The number of workers to use for parallel data loading. If 0, data will be loaded synchronously. | 0
prefetch_factor | int | The prefetch factor to use for prefetching. If 0, no prefetching will occur. | 0
drop_last | bool | Whether to drop the last incomplete batch. | False
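
To make the interaction between batch_size and drop_last concrete, here is a small sketch in plain Python. It does not use loadax, and `iter_batches` is a hypothetical helper; it only assumes that drop_last discards a trailing batch that has fewer than batch_size items, as the parameter description states.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Sequence[T], batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    """Hypothetical helper: split items into consecutive batches."""
    for start in range(0, len(items), batch_size):
        batch = list(items[start : start + batch_size])
        # Only the final batch can be short; skip it when drop_last is set.
        if drop_last and len(batch) < batch_size:
            return
        yield batch


kept = list(iter_batches([1, 2, 3, 4, 5], batch_size=2))
dropped = list(iter_batches([1, 2, 3, 4, 5], batch_size=2, drop_last=True))
print(kept)     # [[1, 2], [3, 4], [5]]
print(dropped)  # [[1, 2], [3, 4]]
```

Dropping the last batch is mainly useful when downstream code expects every batch to have the same shape.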
Source code in src/loadax/dataloader/loader.py
def __init__(
    self,
    dataset: Dataset[Example],
    batch_size: int,
    num_workers: int = 0,
    prefetch_factor: int = 0,
    *,
    drop_last: bool = False,
):
    """A dataloader that can load data in the background or synchronously.

    Example:
        ```python
        from loadax.experimental.dataset.simple import SimpleDataset
        from loadax.experimental.loader import Dataloader

        dataset = SimpleDataset([1, 2, 3, 4, 5])
        dataloader = Dataloader(
            dataset=dataset,
            batch_size=2,
            num_workers=2,
            prefetch_factor=2,
            drop_last=False,
        )
        for batch in dataloader:
            print(batch)

        #> [1, 2]
        #> [3, 4]
        #> [5]
        ```

    Args:
        dataset (Dataset): The dataset to load data from.
        batch_size (int): The size of each batch.
        num_workers (int): The number of workers to use for parallel data loading.
            If 0, data will be loaded synchronously.
        prefetch_factor (int): The prefetch factor to use for prefetching.
            If 0, no prefetching will occur.
        drop_last (bool): Whether to drop the last incomplete batch.
    """
    self.dataset = dataset
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.prefetch_factor = prefetch_factor
    self.drop_last = drop_last

Bases: Generic[Example]

Iterator for the dataloader.

Parameters:

Name | Type | Description | Default
dataloader | Dataloader | The dataloader to iterate over. | required
Source code in src/loadax/dataloader/loader.py
def __init__(self, dataloader: "Dataloader[Example]"):
    """Iterator for the dataloader.

    Args:
        dataloader (Dataloader): The dataloader to iterate over.
    """
    self.dataloader = dataloader
    self.current_index = 0
    self.buffer = Queue(maxsize=max(1, self.dataloader.prefetch_factor))
    self.exception = None

    if self.dataloader.num_workers > 0:
        self.executor = ThreadPoolExecutor(max_workers=self.dataloader.num_workers)
        self.stop_event = threading.Event()
        self.prefetch_thread = threading.Thread(target=self._prefetch_worker)
        self.prefetch_thread.start()
    else:
        self.executor = None
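
The constructor above sets up a bounded queue, a stop event, and a prefetch thread. The body of `_prefetch_worker` is not shown here, so the following is only a sketch of how such a producer and consumer pair could cooperate. The names `prefetch` and `_DONE` are hypothetical, and error propagation to the consumer (the role `self.exception` presumably plays) is left out.

```python
import threading
from queue import Full, Queue

_DONE = object()  # Sentinel marking the end of the stream.


def prefetch(batches, prefetch_factor: int):
    """Yield batches produced by a background thread through a bounded queue."""
    # Same sizing rule as the constructor above: at least one slot.
    buffer = Queue(maxsize=max(1, prefetch_factor))
    stop_event = threading.Event()

    def put(item) -> bool:
        # Retry with a timeout so a full queue never blocks shutdown.
        while not stop_event.is_set():
            try:
                buffer.put(item, timeout=0.05)
                return True
            except Full:
                continue
        return False

    def worker():
        try:
            for batch in batches:
                if not put(batch):
                    return
        finally:
            # Always try to signal completion so the consumer can exit.
            put(_DONE)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    try:
        while True:
            item = buffer.get()
            if item is _DONE:
                return
            yield item
    finally:
        # Runs on exhaustion or early close: stop and wait for the producer.
        stop_event.set()
        thread.join()


result = list(prefetch(iter([[1, 2], [3, 4], [5]]), prefetch_factor=2))
print(result)
# [[1, 2], [3, 4], [5]]
```

The bounded queue provides backpressure: the producer can run at most `max(1, prefetch_factor)` batches ahead of the training loop, which caps memory use.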

progress

progress() -> Progress

Get the progress of the dataloader.

Source code in src/loadax/dataloader/loader.py
def progress(self) -> Progress:
    """Get the progress of the dataloader."""
    total_items = len(self.dataloader.dataset)
    processed_items = min(self.current_index, total_items)
    return Progress(processed_items, total_items)
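
The clamping in `progress` can be sketched in isolation. The `Progress` class below is a stand-in defined only for this example: the source shows that it is constructed from two positional values, but its field names (`items_processed`, `items_total`) and the helper `progress_of` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progress:
    """Stand-in for loadax's Progress type; field names are assumed."""

    items_processed: int
    items_total: int


def progress_of(current_index: int, total_items: int) -> Progress:
    # Clamp so the reported count never exceeds the dataset size, even if
    # the index has advanced past the end.
    return Progress(min(current_index, total_items), total_items)


print(progress_of(4, 5))  # Progress(items_processed=4, items_total=5)
print(progress_of(6, 5))  # Progress(items_processed=5, items_total=5)
```

The `min` guards against the index overshooting the dataset length, for example if it advances by a full batch size while the final batch is shorter.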