Dataloader

The Dataloader is the main interface for loading data into your training loop. It defines how data is loaded efficiently from a dataset, grouped into batches, and allocated to the appropriate devices.

The Dataloader works by spawning background workers that prefetch data into a cache, then filling batches from the cache as items become available. Because the work happens in background workers, loading does not block the main thread, which is important for training loops. Loadax takes care of the parallelization details for you, so your data loading is fast, reliable, and simple. The background cache may be filled out of order, because it uses multithreading to load data in parallel; however, the batches themselves are always delivered in order. This is because loadax ensures deterministic ordering of batches, and the background workers load batches in the order in which they are requested.
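The ordering guarantee described above can be illustrated with a small standalone sketch. It does not use loadax's internals: `load_item` and `ordered_batches` are hypothetical names, and the standard-library `ThreadPoolExecutor` stands in for the background workers. Items may finish loading in any order, but results are collected in submission order, so the batches come out deterministically.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def load_item(value):
    """Hypothetical loader: sleeps a random amount to simulate uneven I/O."""
    time.sleep(random.uniform(0.0, 0.01))
    return value


def ordered_batches(items, batch_size, num_workers):
    """Yield batches in dataset order even though items load out of order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(items), batch_size):
            chunk = items[start:start + batch_size]
            # Submit the whole batch; workers may finish in any order.
            futures = [pool.submit(load_item, item) for item in chunk]
            # Reading results in submission order restores determinism.
            yield [future.result() for future in futures]


batches = list(ordered_batches([1, 2, 3, 4, 5], batch_size=2, num_workers=2))
print(batches)
#> [[1, 2], [3, 4], [5]]
```

Collecting `future.result()` in submission order is only one way to get deterministic output; the library may use a different mechanism.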

Creating a dataloader

from loadax import Dataloader, SimpleDataset

dataset = SimpleDataset([1, 2, 3, 4, 5])
dataloader = Dataloader(
    dataset=dataset,
    batch_size=2,
    num_workers=2,
    prefetch_factor=2,
)
for batch in dataloader:
    print(batch)

#> [1, 2]
#> [3, 4]
#> [5]

Bases: Generic[Example]

Dataloader that loads batches in the background or synchronously.

Example

from loadax.experimental.dataset.simple import SimpleDataset
from loadax.experimental.loader import Dataloader

dataset = SimpleDataset([1, 2, 3, 4, 5])
dataloader = Dataloader(
    dataset=dataset,
    batch_size=2,
    num_workers=2,
    prefetch_factor=2,
    drop_last=False,
)
for batch in dataloader:
    print(batch)

#> [1, 2]
#> [3, 4]
#> [5]

Parameters:

Name	Type	Description	Default
`dataset`	`Dataset`	The dataset to load data from.	required
`batch_size`	`int`	The size of each batch.	required
`num_workers`	`int`	The number of workers to use for parallel data loading. If 0, data will be loaded synchronously.	`0`
`prefetch_factor`	`int`	The prefetch factor to use for prefetching. If 0, no prefetching will occur.	`0`
`drop_last`	`bool`	Whether to drop the last incomplete batch.	`False`
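As a rough illustration of how `batch_size` and `drop_last` interact, the sketch below computes how many batches a dataset of a given size would produce. `num_batches` is a hypothetical helper written for this example, not part of loadax.

```python
import math


def num_batches(dataset_size, batch_size, drop_last=False):
    """Hypothetical helper: number of batches for a dataset of this size."""
    if drop_last:
        # The trailing incomplete batch is discarded.
        return dataset_size // batch_size
    # The trailing incomplete batch is kept as a smaller batch.
    return math.ceil(dataset_size / batch_size)


print(num_batches(5, 2))
#> 3
print(num_batches(5, 2, drop_last=True))
#> 2
```

With five items and `batch_size=2`, the default keeps the final batch `[5]`, which matches the example output shown earlier.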

Source code in src/loadax/dataloader/loader.py

def __init__(
    self,
    dataset: Dataset[Example],
    batch_size: int,
    num_workers: int = 0,
    prefetch_factor: int = 0,
    *,
    drop_last: bool = False,
):
    """A dataloader that can load data in the background or synchronously.

    Example:
        ```python
        from loadax.experimental.dataset.simple import SimpleDataset
        from loadax.experimental.loader import Dataloader

        dataset = SimpleDataset([1, 2, 3, 4, 5])
        dataloader = Dataloader(
            dataset=dataset,
            batch_size=2,
            num_workers=2,
            prefetch_factor=2,
            drop_last=False,
        )
        for batch in dataloader:
            print(batch)

        #> [1, 2]
        #> [3, 4]
        #> [5]
        ```

    Args:
        dataset (Dataset): The dataset to load data from.
        batch_size (int): The size of each batch.
        num_workers (int): The number of workers to use for parallel data loading.
            If 0, data will be loaded synchronously.
        prefetch_factor (int): The prefetch factor to use for prefetching.
            If 0, no prefetching will occur.
        drop_last (bool): Whether to drop the last incomplete batch.
    """
    self.dataset = dataset
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.prefetch_factor = prefetch_factor
    self.drop_last = drop_last

Bases: Generic[Example]

Iterator for the dataloader.

Parameters:

Name	Type	Description	Default
`dataloader`	`Dataloader`	The dataloader to iterate over.	required

Source code in src/loadax/dataloader/loader.py

def __init__(self, dataloader: "Dataloader[Example]"):
    """Iterator for the dataloader.

    Args:
        dataloader (Dataloader): The dataloader to iterate over.
    """
    self.dataloader = dataloader
    self.current_index = 0
    self.buffer = Queue(maxsize=max(1, self.dataloader.prefetch_factor))
    self.exception = None

    if self.dataloader.num_workers > 0:
        self.executor = ThreadPoolExecutor(max_workers=self.dataloader.num_workers)
        self.stop_event = threading.Event()
        self.prefetch_thread = threading.Thread(target=self._prefetch_worker)
        self.prefetch_thread.start()
    else:
        self.executor = None
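The constructor above sets up a bounded `Queue` and a prefetch thread. The following standalone sketch shows the general bounded-buffer pattern under some assumptions: `prefetch` and the `_DONE` sentinel are hypothetical names, and the code does not reproduce `_prefetch_worker`, whose body is not shown here.

```python
import threading
from queue import Queue

_DONE = object()  # hypothetical sentinel marking the end of the stream


def prefetch(batches, prefetch_factor):
    """Yield batches produced by a background thread through a bounded queue."""
    # Same sizing rule as the constructor: at least one slot.
    buffer = Queue(maxsize=max(1, prefetch_factor))

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(_DONE)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()

    while True:
        batch = buffer.get()  # blocks until a batch is available
        if batch is _DONE:
            break
        yield batch
    thread.join()


result = list(prefetch([[1, 2], [3, 4], [5]], prefetch_factor=2))
print(result)
#> [[1, 2], [3, 4], [5]]
```

The bounded queue keeps the producer at most `prefetch_factor` batches ahead of the consumer, which caps memory use while still hiding loading latency.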

progress

progress() -> Progress

Get the progress of the dataloader.

Source code in src/loadax/dataloader/loader.py

def progress(self) -> Progress:
    """Get the progress of the dataloader."""
    total_items = len(self.dataloader.dataset)
    processed_items = min(self.current_index, total_items)
    return Progress(processed_items, total_items)
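The clamping in `progress()` can be seen in isolation with the sketch below. The `Progress` class here is a hypothetical stand-in with assumed field names (`items_processed`, `items_total`); the library's actual type may differ. The `min` call matters if the index can overshoot the dataset size, for example if it advances in whole-batch steps (an assumption).

```python
from typing import NamedTuple


class Progress(NamedTuple):
    """Hypothetical stand-in for the library's Progress type."""

    items_processed: int
    items_total: int


def progress(current_index, total_items):
    # Clamp so the reported count never exceeds the dataset size.
    return Progress(min(current_index, total_items), total_items)


p = progress(current_index=6, total_items=5)
print(p.items_processed, p.items_total)
#> 5 5
```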