grain.sources module

APIs for reading data from various file formats.

List of Members

class grain.sources.RandomAccessDataSource(*args, **kwargs)

Interface for datasets where storage supports efficient random access.

This Protocol defines the contract for any custom data source injected into the PyGrain pipeline. Implementations do not need to inherit from this class directly; they only need to implement the required structural methods (__len__ and __getitem__).

Notes:

Checkpointing: If used with DataLoader, __repr__ must also be implemented to support checkpointing.

Multiprocessing: If used with multiprocessing, the instance must be fully picklable.

Example

Implementing a minimal, checkpoint-safe custom data source:

from grain.sources import RandomAccessDataSource

class MyInMemorySource:
  def __init__(self, data: list):
    self._data = data
  def __len__(self) -> int:
    return len(self._data)
  def __getitem__(self, index: int):
    return self._data[index]
  def __repr__(self) -> str:
    # Required for PyGrain checkpointing with DataLoader
    return f"MyInMemorySource(size={len(self)})"

source = MyInMemorySource(["a", "b", "c"])
# source satisfies the RandomAccessDataSource protocol.
assert isinstance(source, RandomAccessDataSource)
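
The multiprocessing note above requires the instance to be picklable, which is easy to break when a source holds an open file handle or a lock. The following is a minimal sketch, not part of the library: LineFileSource and its helper methods are hypothetical names, and the approach (dropping runtime state in __getstate__ and rebuilding it in __setstate__) is one common pattern rather than a documented requirement.

```python
import os
import pickle
import tempfile
import threading


class LineFileSource:
    """Hypothetical source: one record per line of a text file."""

    def __init__(self, path: str):
        self._path = path
        # Record the byte offset of every line once, up front.
        self._offsets = []
        position = 0
        with open(path, "rb") as f:
            for line in f:
                self._offsets.append(position)
                position += len(line)
        self._init_runtime_state()

    def _init_runtime_state(self) -> None:
        # Locks and file handles cannot be pickled, so they live here.
        self._lock = threading.Lock()
        self._file = None

    def __len__(self) -> int:
        return len(self._offsets)

    def __getitem__(self, index: int) -> bytes:
        # The lock serializes seek + read on the shared handle.
        with self._lock:
            if self._file is None:
                self._file = open(self._path, "rb")
            self._file.seek(self._offsets[index])
            return self._file.readline().rstrip(b"\n")

    def __repr__(self) -> str:
        return f"LineFileSource(path={self._path!r})"

    def __getstate__(self) -> dict:
        # Pickle only plain data; skip the lock and the open handle.
        return {"_path": self._path, "_offsets": self._offsets}

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        self._init_runtime_state()

    def close(self) -> None:
        with self._lock:
            if self._file is not None:
                self._file.close()
                self._file = None


# Write a small temporary file to read from.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")

source = LineFileSource(tmp.name)
first = source[0]  # Opens the file handle inside the original instance.

# Round-trip through pickle, as a worker process would receive it.
clone = pickle.loads(pickle.dumps(source))
records = [clone[i] for i in range(len(clone))]

source.close()
clone.close()
os.remove(tmp.name)
```

Without __getstate__, pickling this instance would fail because of the lock, so the round trip doubles as a quick picklability check.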

__getitem__(index)

Returns the value for the given index.

This method must be thread-safe and deterministic.

Note that a number of sources take SupportsIndex instead of int for index. Such sources will still support int index and pass the isinstance check with this protocol, but all new source implementations should use int directly.

Parameters:

index (int) – An integer in [0, len(self)-1].

Returns:

The corresponding record. File data sources often return the raw bytes, but records can be any Python object.

Return type:

T
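
The thread-safety and determinism requirements on __getitem__ can be smoke-tested before a source is used in a pipeline. The sketch below is illustrative only: check_deterministic_reads and SquaresSource are hypothetical names, not library APIs, and a passing check does not prove thread safety. It only shows that concurrent reads matched a sequential pass in this run.

```python
from concurrent.futures import ThreadPoolExecutor


def check_deterministic_reads(source, num_threads: int = 4) -> bool:
    """Returns True if concurrent reads match one sequential pass."""
    indices = list(range(len(source)))
    expected = [source[i] for i in indices]

    def read_all(_):
        # Every worker reads every index from the shared instance.
        return [source[i] for i in indices]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(read_all, range(num_threads)))
    return all(result == expected for result in results)


class SquaresSource:
    """Hypothetical stateless source: record i is i squared."""

    def __len__(self) -> int:
        return 100

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < len(self):
            raise IndexError(index)
        return index * index


ok = check_deterministic_reads(SquaresSource())
```

A source whose __getitem__ depends only on the index and immutable state, as here, satisfies both requirements trivially. Sources with shared mutable state (caches, file handles) are the ones worth checking.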

__len__()

Returns the total number of records in the data source.

Returns:

The total count of accessible records.

Return type:

int

class grain.sources.ArrayRecordDataSource(*args, **kwargs)

Data source for ArrayRecord files.

Parameters:
  • paths (array_record.python.array_record_data_source.PathLikeOrFileInstruction | Sequence[array_record.python.array_record_data_source.PathLikeOrFileInstruction])

  • reader_options (dict[str, str] | None)

__init__(paths, reader_options=None)

Creates a new ArrayRecordDataSource object.

See array_record.ArrayRecordDataSource for more details.

Parameters:
  • paths (array_record.python.array_record_data_source.PathLikeOrFileInstruction | Sequence[array_record.python.array_record_data_source.PathLikeOrFileInstruction]) – A single path/FileInstruction or list of paths/FileInstructions.

  • reader_options (dict[str, str] | None) – A dict[str, str] of options passed when creating a reader. For example, {"index_storage_option": "in_memory"} keeps the reader indices in memory, whereas {"index_storage_option": "offloaded"} stores the indices on disk to reduce memory usage.
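
As a rough illustration of the two reader_options values described above, the sketch below wraps the constructor in a small helper. The helper name open_array_record_source and its low_memory flag are hypothetical, and the import is deferred so the snippet loads even where grain is not installed. Only the constructor signature and option strings shown in this section are assumed.

```python
# Reader options named in the parameter description above.
IN_MEMORY_OPTIONS = {"index_storage_option": "in_memory"}
OFFLOADED_OPTIONS = {"index_storage_option": "offloaded"}


def open_array_record_source(paths, low_memory: bool = False):
    """Hypothetical helper: picks reader options based on a memory budget."""
    # Imported lazily; requires grain (and ArrayRecord support) at call time.
    from grain.sources import ArrayRecordDataSource

    options = OFFLOADED_OPTIONS if low_memory else IN_MEMORY_OPTIONS
    return ArrayRecordDataSource(paths, reader_options=options)
```

Calling the helper with a single path or a list of paths and low_memory=True would request the offloaded index, trading some lookup speed for lower memory usage.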

class grain.sources.SharedMemoryDataSource(elements=None, *, name=None)

Simple in-memory data source for sequences that is sharable among multiple processes.

Note

Storable values are limited to the built-in types int, float, bool, str (less than 10M bytes each), bytes (less than 10M bytes each), and None. Unlike the built-in list type, these lists cannot change their overall length (i.e. no append, insert, etc.).

Parameters:
  • elements (Sequence[Any] | None)

  • name (str | None)

__init__(elements=None, *, name=None)

Creates a new SharedMemoryDataSource object.

Parameters:
  • elements (Sequence[Any] | None) – The elements for the sharable list.

  • name (str | None) – The name of the datasource.
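
The type and length restrictions in the note above match those documented for Python's standard-library multiprocessing.shared_memory.ShareableList. Assuming SharedMemoryDataSource behaves similarly (an assumption, not something this page confirms), the constraints can be explored with the standard-library class alone:

```python
from multiprocessing import shared_memory

# Only int, float, bool, str, bytes, and None are storable.
shared = shared_memory.ShareableList([1, 2.5, True, "text", b"raw", None])
try:
    length = len(shared)

    # Existing slots can be overwritten in place...
    shared[0] = 42
    snapshot = list(shared)

    # ...but the list cannot grow: there is no append or insert.
    has_append = hasattr(shared, "append")
    has_insert = hasattr(shared, "insert")
finally:
    # Release the shared memory block when done.
    shared.shm.close()
    shared.shm.unlink()
```

Because the length is fixed at creation, all records should be collected before the shared list is constructed.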

class grain.sources.RangeDataSource(start, stop, step)

Range data source, similar to Python's built-in range() function.

Parameters:
  • start (int)

  • stop (int)

  • step (int)

__init__(start, stop, step)

Parameters:
  • start (int)

  • stop (int)

  • step (int)
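
To make the analogy with range() concrete, here is a minimal stand-in written against the RandomAccessDataSource contract (__len__, __getitem__, and __repr__ for checkpointing). RangeLikeSource is a hypothetical class for illustration; the real RangeDataSource may differ, for example in how it treats out-of-range indices.

```python
class RangeLikeSource:
    """Hypothetical stand-in: record i is the i-th value of range()."""

    def __init__(self, start: int, stop: int, step: int):
        self._range = range(start, stop, step)

    def __len__(self) -> int:
        return len(self._range)

    def __getitem__(self, index: int) -> int:
        # Delegates to range indexing, which raises IndexError when out of bounds.
        return self._range[index]

    def __repr__(self) -> str:
        r = self._range
        return f"RangeLikeSource(start={r.start}, stop={r.stop}, step={r.step})"


source = RangeLikeSource(start=0, stop=10, step=3)
values = [source[i] for i in range(len(source))]
```

Here values holds the same sequence that list(range(0, 10, 3)) would produce.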