Releases: pytorch/data
TorchData 0.4.0 Beta Release
TorchData 0.4.0 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Performance
- Documentation
- Future Plans
- Beta Usage Note
Highlights
We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.
TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:
- The DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial here (a short sketch follows this list).
- AWSSDK is integrated to support listing/loading files from AWS S3.
- Added support for reading from TFRecord and Hugging Face Hub.
- DataLoader2 became available in prototype mode. For more details, please check our future plans.
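As a minimal sketch of the DataLoader compatibility (assuming PyTorch 1.12 and TorchData 0.4.0; the data is made up for the example), shuffle determinism is controlled by DataLoader and elements are distributed across workers at the sharding_filter boundary:

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Shuffle before sharding so that every worker sees a distinct shard of the shuffled data.
datapipe = IterableWrapper(range(100)).shuffle().sharding_filter()

# DataLoader toggles the Shuffler and applies dynamic sharding across the 2 workers.
dataloader = DataLoader(datapipe, batch_size=10, shuffle=True, num_workers=2)

for batch in dataloader:
    print(batch)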
Backwards Incompatible Change
DataPipe
Updated Multiplexer (functional API mux) to stop merging multiple DataPipes whenever the shortest one is exhausted (pytorch/pytorch#77145)
Please use MultiplexerLongest (functional API mux_longest) to achieve the previous functionality (a short sketch follows the comparison below).
0.3.0:
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13

0.4.0:
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
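A minimal sketch of the replacement, using the same DataPipes as above; mux_longest keeps yielding from the remaining DataPipes after the shortest one is exhausted:

>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux_longest(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]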
Enforced a single valid iterator per IterDataPipe, with or without multiple outputs (pytorch/pytorch#70479, pytorch/pytorch#75995)
If you need to reference the same IterDataPipe multiple times, please apply .fork() on the IterDataPipe instance (a short sketch follows the comparisons below).
IterDataPipe with a single output

0.3.0:
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]

0.4.0:
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp) # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2) # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
IterDataPipe with multiple outputs

0.3.0:
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically share the same reference as `it1`
# doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipe
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

0.4.0:
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1) # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
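A minimal sketch of the recommended workaround: fork the IterDataPipe once and work with the child DataPipes instead of referencing the original multiple times (this mirrors the zip example above):

>>> source_dp = IterableWrapper(range(10))
>>> dp1, dp2 = source_dp.fork(num_instances=2)
>>> zip_dp = dp1.zip(dp2)
>>> list(zip_dp)
[(0, 0), (1, 1), ..., (9, 9)]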
Deprecations
DataPipe
Deprecated the functional APIs open_file_by_fsspec and open_file_by_iopath of IterDataPipe (pytorch/pytorch#78970, pytorch/pytorch#79302)
Please use open_files_by_fsspec and open_files_by_iopath instead (a short sketch follows the comparison below).
0.3.0:
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec() # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath() # No Warning

0.4.0:
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
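A minimal sketch with the renamed APIs (assuming file_path points to an existing file); the behavior is the same, just without the deprecation warning:

>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_files_by_fsspec() # Renamed API, no warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_files_by_iopath() # Renamed API, no warning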
Deprecated the argument drop_empty_batches of Filter (functional API filter); it will be removed in a future release (pytorch/pytorch#76060)
0.3.0:
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)

0.4.0:
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
New Features
DataPipe
- Added utility to visualize DataPipe graphs (#330)

IterDataPipe
- Added Bz2FileLoader with functional API of load_from_bz2 (#312)
- Added BatchMapper (functional API: map_batches) and FlatMapper (functional API: flat_map) (#359) (a short sketch follows this list)
- Added support for WebDataset-style archives (#367)
- Added MultiplexerLongest with functional API of mux_longest (#372)
- Added ZipperLongest with functional API of zip_longest (#373)
- Added MaxTokenBucketizer with functional API of max_token_bucketize (#283)
- Added S3FileLister (functional API: list_files_by_s3) and `S...
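As a quick illustration of two of the new IterDataPipes, here is a minimal sketch of flat_map and zip_longest (the inputs are made up for the example, and zip_longest is assumed to fill missing values with None by default):

>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper([1, 2, 3])
>>> list(dp.flat_map(lambda x: [x, x * 10]))
[1, 10, 2, 20, 3, 30]
>>> short_dp = IterableWrapper(range(3))
>>> long_dp = IterableWrapper(range(10, 15))
>>> list(short_dp.zip_longest(long_dp))
[(0, 10), (1, 11), (2, 12), (None, 13), (None, 14)]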
TorchData 0.3.0 Beta Release
0.3.0 Release Notes
We are delighted to present the Beta release of TorchData. This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundles too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes” that work well out of the box with PyTorch’s DataLoader.
- Highlights
- What are DataPipes?
- Usage Example
- New Features
- Documentation
- Usage in Domain Libraries
- Future Plans
- Beta Usage Note
Highlights
We are releasing DataPipes - there are Iterable-style DataPipes (IterDataPipe) and Map-style DataPipes (MapDataPipe).
What are DataPipes?
Early on, we observed widespread confusion between the PyTorch DataSets which represented reusable loading tooling (e.g. TorchVision's ImageFolder), and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.
DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:
import json

from torchdata.datapipes.iter import IterDataPipe


class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        # Each element of the source DataPipe is a (file name, stream) pair
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        return len(self.source_datapipe)
You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
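For instance, a minimal sketch of such chaining (the file names are hypothetical; IterableWrapper and FileOpener are built-in DataPipes) could look like:

from torchdata.datapipes.iter import FileOpener, IterableWrapper

# Hypothetical JSON files on local disk
datapipe = IterableWrapper(["labels.json", "metadata.json"])
datapipe = FileOpener(datapipe, mode="r")    # yields (file name, stream) pairs
datapipe = JsonParserIterDataPipe(datapipe)  # yields (file name, parsed JSON) pairs

for file_name, parsed in datapipe:
    print(file_name, parsed)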
Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes.
Usage Example
In this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, and parse and return the CSV content. The full example with a detailed explanation is included in the example folder.
url_dp = IterableWrapper([URL])
# The on-disk caching setup for the compressed archive is elided here;
# see the source file for the full code example.
cache_compressed_dp = GDriveReader(cache_compressed_dp)
# cache_decompressed_dp = ... # See source file for full code example
# Opens and loads the content of the TAR archive file.
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
# Filters for specific files based on the file name.
cache_decompressed_dp = cache_decompressed_dp.filter(
lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
)
# Saves the decompressed file onto disk.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp, mode="b")
# Parses content of the decompressed CSV file and returns the result line by line.
return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))
New Features
[Beta] IterDataPipe
We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe.
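As a minimal sketch of chaining a couple of the built-in IterDataPipes (the data is made up for the example):

from torchdata.datapipes.iter import IterableWrapper

# Shuffle the source, then group elements into batches of 4.
dp = IterableWrapper(range(10)).shuffle().batch(batch_size=4)
for batch in dp:
    print(batch)  # e.g. [3, 7, 0, 9], then [5, 1, 8, 2], then [6, 4]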
[Beta] MapDataPipe
Similar to IterDataPipe, we have a more limited number of MapDataPipes available, covering different functionalities. Support for more MapDataPipes will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.
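As a minimal sketch of the Map-style building blocks (the values are made up for the example):

from torchdata.datapipes.map import SequenceWrapper

# Wrap an indexable sequence as a MapDataPipe and apply a transformation per index.
map_dp = SequenceWrapper(range(10)).map(lambda x: x * 2)
print(map_dp[5])    # 10
print(len(map_dp))  # 10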
Documentation
The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones.
Usage in Domain Libraries
In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes, and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.
Future Plans
There will be a new version of DataLoader in the next release. At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader and into DataPipes. At the same time, the current/old version of DataLoader will still be available, and you can use DataPipes with it as well.
Beta Usage Note
This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.