Version 1.0.0 Release Candidate 1
Breaking changes:
- `WordSplitter` has been replaced by `Tokenizer`. See #3361. (This also requires a config file change; see below.)
- Models are now available at https://github.com/allenai/allennlp-models.
- `Model.decode()` was renamed to `Model.make_output_human_readable()`.
- `trainer.cuda_device` no longer takes a list; use `distributed.cuda_devices` instead. See #3529.
- TODO: As of #3529, dataset readers used in the distributed setting need to shard instances to separate processes internally. We should fix this before releasing a final version.
- Dataset caching is now handled entirely with a parameter passed to the dataset reader, not with command-line arguments. If you used the caching arguments to `allennlp train`, instead just add a `"cache_directory"` key to your dataset reader parameters.
- Sorting keys in the bucket iterator have changed: now you just give the name of the field you want to sort by, and that's it. We also implemented a heuristic to figure out the right sorting key, so in almost all cases you can just remove the `sorting_keys` parameter from your config file and our code will do the right thing. Cases where our heuristic might get it wrong are if you have a `ListField[TextField]` and you want the size of the list to be the padding key, or if you have an `ArrayField` with a long but constant dimension that shouldn't be considered in sorting.
- The argument order to `Embedding()` changed (because we made an argument optional and had to move it; pretty unfortunate, sorry). It used to be `Embedding(num_embeddings, embedding_dim)`; now it's `Embedding(embedding_dim, num_embeddings)`, as shown in the sketch below.
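For example, here is a minimal sketch of the new `Embedding` argument order (the dimensions are arbitrary):

```python
from allennlp.modules.token_embedders import Embedding

# Before 1.0, the positional order was Embedding(num_embeddings, embedding_dim):
# embedding = Embedding(100, 25)

# In 1.0, embedding_dim comes first; keyword arguments sidestep the ambiguity entirely.
embedding = Embedding(embedding_dim=25, num_embeddings=100)
```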
Config file changes:
In 1.0, we made some significant simplifications to how `FromParams` works, so you basically never see it in your code and don't ever have to think about adding custom `from_params` methods. We just follow argument annotations and have the config file always exactly match the argument names in the constructors that are called.
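As a rough illustration, here is a minimal sketch using a made-up `FromParams` class (the class and its arguments are hypothetical; the point is only that the config keys mirror the constructor):

```python
from allennlp.common import FromParams, Params


class FeedForwardConfigDemo(FromParams):  # hypothetical class, for illustration only
    def __init__(self, input_dim: int, hidden_dim: int, dropout: float = 0.0) -> None:
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.dropout = dropout


# The config keys are exactly the constructor's argument names; no custom
# from_params method is needed.
demo = FeedForwardConfigDemo.from_params(
    Params({"input_dim": 10, "hidden_dim": 20, "dropout": 0.1})
)
```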
This meant that we needed to remove cases where we previously allowed mismatches between argument names and keys that show up in config files. There are a number of places that were affected by this:
- The way `Vocabulary` options are specified in config files has changed. See #3550. If you want to load a vocabulary from files, you should specify `"type": "from_files"`, and use the key `"directory"` instead of `"directory_path"`.
- When instantiating a `BasicTextFieldEmbedder` `from_params`, you used to be able to have embedder names be top-level keys in the config file (e.g., `"embedder": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}`). We changed this a long time ago to prefer wrapping them in a `"token_embedders"` key, and this is now required (e.g., `"embedder": {"token_embedders": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}}`).
- The `TokenCharactersEncoder` now requires you to specify the `vocab_namespace` for the underlying embedder. It used to default to `"token_characters"`, matching the `TokenCharactersIndexer` default, but making that work required some custom magic that wasn't worth the complexity. So instead of `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25}, "encoder": {...}}`, you need to change this to `"token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25, "vocab_namespace": "token_characters"}, "encoder": {...}}`.
- Regularization now needs another key in a config file. Instead of specifying regularization as `"regularizer": [[regex1, regularizer_params], [regex2, regularizer_params]]`, it now must be specified as `"regularizer": {"regexes": [[regex1, regularizer_params], [regex2, regularizer_params]]}`.
- Initialization was changed in a similar way to regularization. Instead of specifying initialization as `"initializer": [[regex1, initializer_params], [regex2, initializer_params]]`, it now must be specified as `"initializer": {"regexes": [[regex1, initializer_params], [regex2, initializer_params]]}`. Also, you used to be able to have `initializer_params` be `"prevent"`, to prevent initialization of matching parameters. This is now done with a separate key passed to the initializer: `"initializer": {"regexes": [..], "prevent_regexes": [regex1, regex2]}`. See the sketch after this list.
- `num_serialized_models_to_keep` and `keep_serialized_model_every_num_seconds` used to be able to be passed as top-level parameters to the `trainer`, but now they must always be passed to the `checkpointer` instead. For example, if you had `"trainer": {"num_serialized_models_to_keep": 1}`, it now needs to be `"trainer": {"checkpointer": {"num_serialized_models_to_keep": 1}}`.
- Tokenizer specification changed because of #3361. Instead of something like `"tokenizer": {"word_splitter": {"type": "spacy"}}`, you now just do `"tokenizer": {"type": "spacy"}` (more technically: the `WordTokenizer` has now been removed, with the things we used to call `WordSplitters` now just moved up to be top-level `Tokenizers` themselves).
- The `namespace_to_cache` argument to `ElmoTokenEmbedder` has been removed as a config file option. You can still pass `vocab_to_cache` to the constructor of this class, but this functionality is no longer available from a config file. If you used this and are really put out by this change, let us know, and we'll see what we can do.
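As a sketch of the new initializer and regularizer format in code (the particular regexes, the `xavier_uniform` initializer, and the `l2` regularizer are just examples):

```python
from allennlp.common import Params
from allennlp.nn import InitializerApplicator, RegularizerApplicator

# Both applicators now expect their (regex, params) pairs under a "regexes" key.
initializer = InitializerApplicator.from_params(Params({
    "regexes": [["weight.*", {"type": "xavier_uniform"}]],
    # "prevent" is no longer a special parameter value; list the regexes to skip instead.
    "prevent_regexes": ["bias.*"],
}))
regularizer = RegularizerApplicator.from_params(Params({
    "regexes": [["weight.*", {"type": "l2", "alpha": 0.01}]],
}))
```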
Changes:
- `allennlp make-vocab` and `allennlp dry-run` are deprecated, replaced with a `--dry-run` flag which can be passed to `allennlp train`. We did this so that people might find it easier to actually use the dry-run features, as they don't require any config changes etc.
- When constructing objects using our `FromParams` pipeline, we now inspect superclass constructors when the concrete class has a `**kwargs` argument, adding arguments from the superclass. This means, e.g., that we can add parameters to the base `DatasetReader` class that are immediately accessible to all subclasses, without code changes. All that this requires is that you take a `**kwargs` argument in your constructor and call `super().__init__(**kwargs)`. See #3633, and the sketch after this list.
- The order of the `num_embeddings` and `embedding_dim` constructor arguments for `Embedding` has changed. Additionally, passing a `pretrained_file` now initializes the embedding regardless of whether it was constructed from a config file or directly instantiated in code. See #3763.
- Our default logging behavior is now much less verbose.
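Here is a minimal sketch of the `**kwargs` pattern (the reader name and its `max_length` argument are made up; `lazy` is an argument of the base `DatasetReader`):

```python
from typing import Iterable

from allennlp.common import Params
from allennlp.data import DatasetReader, Instance


@DatasetReader.register("kwargs_demo")  # hypothetical reader, for illustration only
class KwargsDemoReader(DatasetReader):
    def __init__(self, max_length: int = 512, **kwargs) -> None:
        # Forward anything we don't declare ourselves to the base class.
        super().__init__(**kwargs)
        self.max_length = max_length

    def _read(self, file_path: str) -> Iterable[Instance]:
        return []


# "lazy" belongs to the base DatasetReader, but FromParams now finds it by
# inspecting the superclass constructor through **kwargs.
reader = DatasetReader.from_params(
    Params({"type": "kwargs_demo", "max_length": 128, "lazy": True})
)
```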
Notable bug fixes
There are far too many small fixes to be listed here, but these are some notable fixes:
- NER interpretations never actually reproduced the results in the original AllenNLP Interpret paper. This version finds and fixes the problem: the loss-masking code that makes the model compute correct gradients for a single span prediction had never been checked in. See #3971.
Upgrading Guide
Iterators -> DataLoaders
AllenNLP now uses PyTorch's API for data iteration, rather than our own custom one. This means that the `train_data`, `validation_data`, `iterator` and `validation_iterator` arguments to the `Trainer` are now deprecated, having been replaced with `data_loader` and `validation_dataloader`.
Previous config files which looked like:
```
{
    "iterator": {
        "type": "bucket",
        "sorting_keys": [["tokens", "num_tokens"]],
        "padding_noise": 0.1
        ...
    }
}
```
Now become:
```
{
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            // sorting keys are no longer required! They can be inferred automatically.
            "padding_noise": 0.1
        }
    }
}
```
Multi-GPU
AllenNLP now uses `DistributedDataParallel` for parallel training, rather than `DataParallel`. With `DistributedDataParallel`, each worker (GPU) runs in its own process. As such, each process also has its own `Trainer`, which now takes a single GPU ID only.
Previous config files which looked like:
```
{
    "trainer": {
        "cuda_device": [0, 1, 2, 3],
        "num_epochs": 20,
        ...
    }
}
```
Now become:
```
{
    "distributed": {
        "cuda_devices": [0, 1, 2, 3],
    },
    "trainer": {
        "num_epochs": 20,
        ...
    }
}
```
In addition, if it is important that your dataset is correctly sharded such that one epoch strictly corresponds to one pass over the data, your dataset reader should contain the following logic to read instances on a per-worker basis:
```python
# Inside your dataset reader's _read() method: each distributed worker
# keeps only the instances assigned to its rank.
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()

for idx, inputs in enumerate(data_file):
    if idx % world_size == rank:
        yield self.text_to_instance(inputs)
```
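For context, here is a minimal sketch of where that logic might live inside a reader; the reader name and the `MetadataField` placeholder in `text_to_instance` are purely illustrative assumptions, not part of the changes above:

```python
from typing import Iterable

import torch.distributed as dist

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import MetadataField


@DatasetReader.register("sharded_lines")  # hypothetical reader, for illustration only
class ShardedLinesReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        # Fall back to a single worker when not running in distributed mode.
        is_distributed = dist.is_available() and dist.is_initialized()
        rank = dist.get_rank() if is_distributed else 0
        world_size = dist.get_world_size() if is_distributed else 1

        with open(file_path) as data_file:
            for idx, line in enumerate(data_file):
                # Each worker keeps every world_size-th line, offset by its rank.
                if idx % world_size == rank:
                    yield self.text_to_instance(line.strip())

    def text_to_instance(self, text: str) -> Instance:
        # Placeholder field only; a real reader would build TextFields, labels, etc.
        return Instance({"source": MetadataField(text)})
```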