This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Version 1.0.0 Release Candidate 1

Pre-release
@dirkgr dirkgr released this 16 Apr 23:15
· 381 commits to master since this release

Breaking changes:

  • WordSplitter has been replaced by Tokenizer. See #3361. (this also requires a config file change; see below)
  • Models are now available at https://github.com/allenai/allennlp-models.
  • Model.decode() was renamed to Model.make_output_human_readable()
  • trainer.cuda_device no longer takes a list; use distributed.cuda_devices instead. See #3529.
  • TODO: As of #3529 dataset readers used in the distributed setting need to shard instances to separate processes internally. We should fix this before releasing a final version.
  • Dataset caching is now handled entirely with a parameter passed to the dataset reader, not with command-line arguments. If you used the caching arguments to allennlp train, instead just add a "cache_directory" key to your dataset reader parameters.
  • Sorting keys in the bucket iterator have changed; now you just give a field name that you want to sort by, and that's it. We also implemented a heuristic to figure out what the right sorting key is, so in almost all cases you can just remove the sorting_keys parameter from your config file and our code will do the right thing. Some cases where our heuristic might get it wrong are if you have a ListField[TextField] and you want the size of the list to be the padding key, or if you have an ArrayField with a long but constant dimension that shouldn't be considered in sorting.
  • The argument order to Embedding() changed (because we made an argument optional and had to move it; pretty unfortunate, sorry). It used to be Embedding(num_embeddings, embedding_dim). Now it's Embedding(embedding_dim, num_embeddings).

Config file changes:

In 1.0, we made some significant simplifications to how FromParams works, so you basically never see it in your code and don't ever have to think about adding custom from_params methods. We just follow argument annotations and have the config file always exactly match the argument names in the constructors that are called.

This meant that we needed to remove cases where we previously allowed mismatches between argument names and keys that show up in config files. There are a number of places that were affected by this:

  • The way Vocabulary options are specified in config files has changed. See #3550. If you want to load a vocabulary from files, you should specify "type": "from_files", and use the key "directory" instead of "directory_path".
  • When instantiating a BasicTextFieldEmbedder from_params, you used to be able to have embedder names be top-level keys in the config file (e.g., "embedder": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}). We changed this a long time ago to prefer wrapping them in a "token_embedders" key, and this is now required (e.g., "embedder": {"token_embedders": {"elmo": ELMO_PARAMS, "tokens": TOKEN_PARAMS}}).
  • The TokenCharactersEncoder now requires you to specify the vocab_namespace for the underlying embedder. It used to default to "token_characters", matching the TokenCharactersIndexer default, but making that work required some custom magic that wasn't worth the complexity. So instead of "token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25}, "encoder": {...}}, you need to change this to: "token_characters": {"type": "character_encoding", "embedding": {"embedding_dim": 25, "vocab_namespace": "token_characters"}, "encoder": {...}}
  • Regularization now needs another key in a config file. Instead of specifying regularization as "regularizer": [[regex1, regularizer_params], [regex2, regularizer_params]], it now must be specified as "regularizer": {"regexes": [[regex1, regularizer_params], [regex2, regularizer_params]]}.
  • Initialization was changed in a similar way to regularization. Instead of specifying initialization as "initializer": [[regex1, initializer_params], [regex2, initializer_params]], it now must be specified as "initializer": {"regexes": [[regex1, initializer_params], [regex2, initializer_params]]}. Also, you used to be able to have initializer_params be "prevent", to prevent initialization of matching parameters. This is now done with a separate key passed to the initializer: "initializer": {"regexes": [..], "prevent_regexes": [regex1, regex2]}.
  • num_serialized_models_to_keep and keep_serialized_model_every_num_seconds used to be able to be passed as top-level parameters to the trainer, but now they must always be passed to the checkpointer instead. For example, if you had "trainer": {"num_serialized_models_to_keep": 1}, it now needs to be "trainer": {"checkpointer": {"num_serialized_models_to_keep": 1}}.
  • Tokenizer specification changed because of #3361. Instead of something like "tokenizer": {"word_splitter": {"type": "spacy"}}, you now just do "tokenizer": {"type": "spacy"} (more technically: the WordTokenizer has now been removed, with the things we used to call WordSplitters now just moved up to be top-level Tokenizers themselves).
  • The namespace_to_cache argument to ElmoTokenEmbedder has been removed as a config file option. You can still pass vocab_to_cache to the constructor of this class, but this functionality is no longer available from a config file. If you used this and are really put out by this change, let us know, and we'll see what we can do.
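
Several of the key changes listed above are mechanical, so they can be applied to a config loaded as a plain Python dict. The following is a minimal sketch of that idea, not a tool shipped with AllenNLP: every function name here is hypothetical, and it assumes the old values have exactly the shapes shown in the bullets (for example, a tokenizer whose only key is "word_splitter").

```python
import copy

# Hypothetical helpers for illustration only; none of these are AllenNLP APIs.

CHECKPOINTER_KEYS = (
    "num_serialized_models_to_keep",
    "keep_serialized_model_every_num_seconds",
)


def migrate_regularizer(old):
    """Wrap an old-style list of [regex, params] pairs in a "regexes" key."""
    return {"regexes": old} if isinstance(old, list) else old


def migrate_initializer(old):
    """Like migrate_regularizer, but move "prevent" entries to "prevent_regexes"."""
    if not isinstance(old, list):
        return old  # assume it is already in the new format
    new = {"regexes": [[regex, params] for regex, params in old if params != "prevent"]}
    prevented = [regex for regex, params in old if params == "prevent"]
    if prevented:
        new["prevent_regexes"] = prevented
    return new


def migrate_trainer(old):
    """Move the checkpointing options from the trainer into a "checkpointer" key."""
    new = copy.deepcopy(old)
    for key in CHECKPOINTER_KEYS:
        if key in new:
            new.setdefault("checkpointer", {})[key] = new.pop(key)
    return new


def migrate_tokenizer(old):
    """Promote {"word_splitter": {...}} to a top-level tokenizer config.

    Assumption: any other keys of the old tokenizer config are dropped.
    """
    if "word_splitter" in old:
        return copy.deepcopy(old["word_splitter"])
    return copy.deepcopy(old)


print(migrate_trainer({"num_serialized_models_to_keep": 1}))
print(migrate_tokenizer({"word_splitter": {"type": "spacy"}}))
```

Each helper returns a new dict and leaves its input untouched, so the old and new configs can be compared side by side before anything is written back to disk.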

Changes:

  • allennlp make-vocab and allennlp dry-run are deprecated, replaced with a --dry-run flag which can be passed to allennlp train. We did this so that people might find it easier to actually use the dry run features, as they don't require any config changes etc.
  • When constructing objects using our FromParams pipeline, we now inspect superclass constructors when the concrete class has a **kwargs argument, adding arguments from the superclass. This means, e.g., that we can add parameters to the base DatasetReader class that are immediately accessible to all subclasses, without code changes. All that this requires is that you take a **kwargs argument in your constructor, and you call super().__init__(**kwargs). See #3633.
  • The order of the num_embeddings and embedding_dim constructor arguments for Embedding has changed. Additionally, passing a pretrained_file now initializes the embedding regardless of whether it was constructed from a config file or directly instantiated in code. See #3763.
  • Our default logging behavior is now much less verbose.
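
To illustrate the **kwargs behavior described above, here is a simplified sketch of the idea using only the standard library. It is not the actual FromParams implementation: constructor_parameters, BaseReader and MyReader are hypothetical names, and the real pipeline also handles type annotations, defaults and registrable types.

```python
import inspect


def constructor_parameters(cls):
    """Collect constructor parameter names for `cls`.

    While a constructor accepts **kwargs, keep walking up the class
    hierarchy and add the parameters of the superclass constructors too.
    """
    params = []
    for klass in cls.__mro__:
        if klass is object:
            break
        if "__init__" not in vars(klass):
            continue  # this class inherits its constructor unchanged
        has_kwargs = False
        for name, param in inspect.signature(klass.__init__).parameters.items():
            if name == "self":
                continue
            if param.kind is inspect.Parameter.VAR_KEYWORD:
                has_kwargs = True
            elif param.kind is not inspect.Parameter.VAR_POSITIONAL and name not in params:
                params.append(name)
        if not has_kwargs:
            break  # no **kwargs, so nothing is forwarded further up
    return params


class BaseReader:  # hypothetical stand-in for a base class such as DatasetReader
    def __init__(self, cache_directory: str = None):
        self.cache_directory = cache_directory


class MyReader(BaseReader):
    def __init__(self, max_length: int = 100, **kwargs):
        super().__init__(**kwargs)  # forwards cache_directory to the base class
        self.max_length = max_length


print(constructor_parameters(MyReader))  # → ['max_length', 'cache_directory']
```

The point of the pattern is that a parameter added to the base class later becomes visible for MyReader without editing MyReader itself.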

Notable bug fixes

There are far too many small fixes to be listed here, but these are some notable fixes:

  • NER interpretations have never actually reproduced the result in the original AllenNLP Interpret paper. This version finds and fixes that problem, which was that the loss masking code to make the model compute correct gradients for a single span prediction was not checked in. See #3971.

Upgrading Guide

Iterators -> DataLoaders

AllenNLP now uses PyTorch's API for data iteration, rather than our own custom one. This means that the train_data, validation_data, iterator and validation_iterator arguments to the Trainer are now deprecated, having been replaced with data_loader and validation_dataloader.

Previous config files which looked like:

{
  "iterator": {
    "type": "bucket",
    "sorting_keys": [["tokens", "num_tokens"]],
    "padding_noise": 0.1
    ...
  }
}

Now become:

{
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      // sorting keys are no longer required! They can be inferred automatically.
      "padding_noise": 0.1
    }
  }
}
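
The same rewrite can be sketched as a small function over a config loaded as a Python dict. This is an illustration under assumptions, not an AllenNLP utility: migrate_iterator is a hypothetical name, only the "bucket" iterator type shown above is handled, and sorting_keys is simply dropped so that it can be inferred.

```python
import copy


def migrate_iterator(config):
    """Return a copy of `config` with a bucket "iterator" moved under "data_loader"."""
    new = copy.deepcopy(config)
    iterator = new.pop("iterator", None)
    if iterator is None:
        return new  # nothing to migrate
    if iterator.get("type") != "bucket":
        raise ValueError("this sketch only handles the bucket iterator")
    iterator.pop("sorting_keys", None)  # now inferred automatically
    new["data_loader"] = {"batch_sampler": iterator}
    return new


old = {"iterator": {"type": "bucket", "sorting_keys": [["tokens", "num_tokens"]], "padding_noise": 0.1}}
print(migrate_iterator(old))
```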

Multi-GPU

AllenNLP now uses DistributedDataParallel for parallel training, rather than DataParallel. With DistributedDataParallel, each worker (GPU) runs in its own process. As such, each process also has its own Trainer, which now takes a single GPU ID only.

Previous config files which looked like:

{
  "trainer": {
    "cuda_device": [0, 1, 2, 3],
    "num_epochs": 20,
    ...
  }
}

Now become:

{
  "distributed": {
    "cuda_devices": [0, 1, 2, 3],
  },
  "trainer": {
    "num_epochs": 20,
    ...
  }
}
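
For configs handled programmatically, the move from trainer.cuda_device to distributed.cuda_devices can be sketched as below. The function name migrate_cuda_devices is hypothetical, and the sketch assumes that a single integer cuda_device is still valid on the trainer and should be left alone.

```python
import copy


def migrate_cuda_devices(config):
    """Move a list-valued trainer.cuda_device to distributed.cuda_devices."""
    new = copy.deepcopy(config)
    trainer = new.get("trainer", {})
    devices = trainer.get("cuda_device")
    if isinstance(devices, list):
        del trainer["cuda_device"]
        new.setdefault("distributed", {})["cuda_devices"] = devices
    return new


old = {"trainer": {"cuda_device": [0, 1, 2, 3], "num_epochs": 20}}
print(migrate_cuda_devices(old))
```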

In addition, if it is important that your dataset is correctly sharded such that one epoch strictly corresponds to one pass over the data, your dataset reader should contain the following logic to read instances on a per-worker basis:

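import torch.distributed as dist

# Inside your DatasetReader subclass:
def _read(self, file_path):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    with open(file_path) as data_file:
        for idx, inputs in enumerate(data_file):
            if idx % world_size == rank:  # keep only this worker's share
                yield self.text_to_instance(inputs)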
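
The effect of the modulo check can be verified without any GPUs by simulating the workers in one process. In this sketch, shard is a hypothetical helper that takes rank and world_size as plain arguments instead of reading them from torch.distributed.

```python
def shard(lines, rank, world_size):
    """Yield only the lines that the worker with this rank should read."""
    for idx, line in enumerate(lines):
        if idx % world_size == rank:
            yield line


lines = [f"example {i}" for i in range(10)]
world_size = 4
shards = [list(shard(lines, rank, world_size)) for rank in range(world_size)]

# Every line is read by exactly one worker, so one epoch is one pass over the data.
assert sorted(sum(shards, [])) == sorted(lines)
print([len(s) for s in shards])  # → [3, 3, 2, 2]
```

Note that the shards can differ in size by one instance when the number of instances is not a multiple of the number of workers.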