
A Simple Deep Learning Model Config

02 January, 2020 - 4 min read

One odd thing I'll remember from the world of ML in 2019 was the increase in experimentation frameworks. These were either standalone libraries, often targeting a specific deep learning framework, or part of a larger framework.

For example, gin is a tool to facilitate parameter configuration. And within a larger framework, allennlp implements an internal model registry.

As another good example, the models in HuggingFace's transformer package follow the convention of:

bert_config = transformers.BertConfig()
bert_for_classification = transformers.TFBertForSequenceClassification(bert_config)

I like all of these tools, though during model development there's often a grey area where doing too much configuration makes it harder to adapt the model as you learn new information. At the same time, having an object that holds the model configuration, provides validation, and simplifies a model that takes 25 arguments can be a good thing.

So I'd like to share a middle ground I've found useful: it works well for local development, is lightweight, and also extends well to production settings.

pydantic's BaseSettings

Having been turned onto pydantic a few months ago, I instantly found it useful. The BaseModel is very handy for development where you want to have schemas for pieces of data being read or written, job parameters, request bodies, etc.
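
As a quick sketch of what that looks like (the field names here are made up for illustration), a BaseModel describing a job's parameters might be:

from pydantic import BaseModel

class JobParams(BaseModel):
    """Hypothetical parameters for a training job."""
    dataset_path: str
    batch_size: int = 32

# Validation and type coercion happen at construction time.
params = JobParams(dataset_path="data/train.csv", batch_size="64")
print(params.batch_size)  # 64, coerced from the string to an int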

One related class is BaseSettings. It's very similar to BaseModel, but adds the ability to default values from associated environment variables, while otherwise keeping a lot of the "goodness" of BaseModel.

Example Config

As a small example, here's a dummy config for a model where epochs is the only configuration option.

from pydantic import BaseSettings, Field

class FancySettings(BaseSettings):
    """Describing the fancy settings in the docstring."""
    epochs: int = Field(100, env="FANCY_EPOCHS", description="The number of epochs", gt=0)

Even though it's small, there are a lot of good things to point out:

  • By setting the FANCY_EPOCHS environment variable, epochs is set. This is minor initially, but application configuration via environment variables is a good practice, so we're on the right track from the start.
  • I haven't shown how the descriptions and docstrings are used, but just within the code having the description is helpful documentation.
  • gt stands for greater than: epochs must be greater than 0 (see the sketch after this list). Sit back and let the validation run over you.
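
To make the first and last points concrete, here's a minimal sketch (the values are made up) of the environment variable and the gt check in action:

import os
from pydantic import ValidationError

# Configure through the environment rather than code.
os.environ["FANCY_EPOCHS"] = "200"
print(FancySettings().epochs)  # 200

# A value that violates gt=0 raises a ValidationError.
try:
    FancySettings(epochs=0)
except ValidationError as err:
    print(err)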

The culmination of these things is the JSON schema that comes with the settings.

>>> print(FancySettings.schema_json())
{
  "title": "FancySettings",
  "description": "Describing the fancy settings in the docstring.",
  "type": "object",
  "properties": {
    "epochs": {
      "title": "Epochs",
      "description": "The number of epochs",
      "default": 100,
      "exclusiveMinimum": 0,
      "env": "FANCY_EPOCHS",
      "env_names": ["fancy_epochs"],
      "type": "integer"
    }
  },
  "additionalProperties": false
}

Again, this is useful immediately as it gives a way to describe what input should be passed.

Through the use of custom validators, the logic can get more complex, as in the sketch below.
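
For instance, here's a minimal sketch of a custom validator (the rule itself is hypothetical, just to show the mechanics):

from pydantic import BaseSettings, Field, validator

class FancierSettings(BaseSettings):
    """Settings with a custom validation rule."""
    epochs: int = Field(100, env="FANCY_EPOCHS", gt=0)

    @validator("epochs")
    def epochs_multiple_of_ten(cls, value):
        # Hypothetical rule: keep epoch counts checkpoint-friendly.
        if value % 10 != 0:
            raise ValueError("epochs must be a multiple of 10")
        return value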

Dealing with Objects

Given the messy internal details of a model implementation, and how object choices themselves are often hyperparameters (e.g. tokenizer choice), being able to configure the Python objects can be helpful. Though I'm not sure I totally like it...

For example, in allennlp once the model is registered, it's accessible by name.

# Grab the part of the `config` that defines the model
model_params = config.pop("model")

# Find out which model subclass we want
model_name = model_params.pop("type")

# Instantiate that subclass with the remaining model params
model = Model.by_name(model_name).from_params(model_params)

(from allennlp docs)

Something similar can be done with the BaseSettings approach.

from pydantic import BaseSettings, PyObject

class ObjSet(BaseSettings):
    """Settings."""
    tokenizer: PyObject = 'gcgc.tokenizer.KmerTokenizer'
    settings: PyObject = 'gcgc.tokenizer.KmerTokenizerSettings'

if __name__ == "__main__":
    obj = ObjSet()
    tokenizer = obj.tokenizer(settings=obj.settings())
    print(tokenizer.encode("ATCG"))

which will print [1, 2, 3, 4], demonstrating the objects being created and used.

Note that gcgc is a package (that I wrote) for tokenizing DNA. The configured classes must be importable from Python or an error is thrown.
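
Since this is a BaseSettings, the object choice itself can also be swapped through the environment. A minimal sketch, assuming pydantic's default behavior of matching a field to an env var of the same name (the alternative class path is hypothetical):

import os

# Point the tokenizer field at a different importable class without a code change.
# If the path can't be imported, constructing ObjSet raises a ValidationError.
os.environ["TOKENIZER"] = "my_package.tokenizers.CharTokenizer"

obj = ObjSet()  # obj.tokenizer is now my_package.tokenizers.CharTokenizer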

Conclusion

I think model configuration is useful for reproducibility and productionalization, though it can become burdensome if done too early or with too heavy a hand. For these reasons I like pydantic and its BaseSettings for configuring models.