Thermite: techniques to move complexity away from business logic in Go applications

Will Alexander
Published in DSC Engineering
Aug 10, 2021


At Dollar Shave Club, we deploy a cloud platform mainly built on Kubernetes running on Amazon Web Services. We use a number of AWS services to run our platform, and last year we moved our Docker image hosting from Quay to Amazon Elastic Container Registry. We needed a reasonable method of cleaning up unused images from our registry in order to stay under ECR’s 10,000 image-per-repo limit, as some of our busiest repos had up to 13,000 images.

ECR’s lifecycle policies can automatically remove least-recently-pushed images, but we needed an additional layer of verification. An image may have been pushed some time ago but still be targeted by a Kubernetes pod running in our cluster; if such an image were spuriously removed, it could cause errors should any of the pods referencing it restart and re-attempt to pull the image. This led us to develop a tool that would remove only the old, expired images that were also not present in any Kubernetes deployment in our platform. We called it Thermite after the destructive reaction fueled by iron oxide (rust), a material that builds up with age, and we recently made it open source. Feedback and contributions are welcomed and appreciated!

Design

The core logic of Thermite is pretty simple:

  • Check each ECR repository in a given registry for a given tag key, whose value indicates the “prune period” — the number of days that must pass after an image has been pushed before it is eligible for removal
  • If the tag is missing or invalid, do not remove any images from the repository
  • Otherwise, remove any images from the repository which were both a) pushed longer ago than the prune period and b) not present in a given Kubernetes cluster at the time
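In Go terms, the per-image decision boils down to something like the following sketch (illustrative only; these names are hypothetical, not Thermite’s actual code):

import "time"

// shouldPrune reports whether an image is eligible for removal: its
// repository's prune period must have elapsed since the image was
// pushed, and no workload in the cluster may reference it.
func shouldPrune(pushedAt time.Time, prunePeriod time.Duration, inUse bool) bool {
	expired := time.Since(pushedAt) > prunePeriod
	return expired && !inUse
}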

Our infrastructure software engineering projects are built with the Go programming language. Go is a popular language for cloud software, and is the implementation language for both Docker and Kubernetes, allowing us a great degree of control and automation via their respective Go packages. With the addition of the AWS SDK for Go, we had all the tools we needed to implement Thermite.

The remainder of this article discusses some of the patterns used to accomplish Thermite’s simple logic in an appropriately simple way, following our conventions and best practices as well as meeting common expectations for open-source projects. The connecting thread between the topics is an emphasis on preferring convention over custom code and moving complexity into the environment, so that our Go types, packages, and projects follow the UNIX philosophy of “do one thing and do it well” while still providing the adaptability, configurability, and observability required of a production cloud platform service, and supporting the additional release methodology around a community-oriented open-source project.

This article will assume familiarity with the Go language and its idioms, but I will do my best to explain the core concepts of each section.

Use packages for business logic

Go’s boundary of visibility is a package — Go code can only reference code within its own package, or exported code from packages that are explicitly imported and referenced. All Go code must exist within a package, whose name is declared at the beginning of the source file containing the code. There is a reserved package name, “main”, which must contain the entrypoint for the compiled program (a function also called “main”).

package main

func main() {
}

This is where most Go projects start from, and for simpler uses, it’s tempting to just wire everything together in the main package and split it out later. However, by identifying the logical boundaries of our code early on, we’ll make it easier to test and add the additional tooling we need down the road.

Determining appropriate package boundaries is never an exact science, and books can and have been written on how to organize code. Package descriptions (a comment added by convention preceding the package declaration) are a good litmus test; you should be able to summarize the purpose of the package clearly in a single line, and should organize your packages until they can each be described in a similarly clear fashion.

In any case, begin by implementing your core logic inside a package. A common convention for Go projects is to keep package directories under pkg/ in the repo, and import them in the main package as needed. This will allow you to separate the specifics of command-line argument parsing, input, output, and configuration from your business logic.
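For a hypothetical project, that layout might look like:

myproject/
├── main.go        // package main: flag parsing, I/O, and wiring only
└── pkg/
    └── widget/    // package widget: the core business logic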

In the case of Thermite, there were two clear packages that made up the core functionality: census, which surveys a Kubernetes cluster for the images its workloads are currently running, and prune, which removes expired images from ECR repositories.

A third package, thermite, is responsible for the minor logic of passing the surveyed data as a whitelist to the prune package, allowing the code for the CLI itself to be completely separate from any business logic.
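Per the litmus test above, each package’s purpose fits in a single line. The doc comments read roughly like this (paraphrased; the exact comments live in the repository):

// Package census surveys a Kubernetes cluster for the container images
// its workloads are currently running.
package census

// Package prune removes expired container images from Amazon ECR
// repositories.
package prune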

Inject dependencies as testable interfaces

When we move a project’s code into subpackages, we enable dependency injection: we can have consumers of the package provide implementations of any dependencies our package might have at runtime, rather than creating the concrete implementation ourselves within the code. This means that the initialization, configuration, and testing of these implementations exists outside of the package, and we only have to concern ourselves with the intersection of our code and the dependency’s API. Without dependency injection, the core functionality would become conflated with initialization and configuration logic for specific dependencies that might change over time.

This is only one half of the equation, however. If we required a specific, concrete implementation of dependencies, we would avoid having to write brittle configuration code, but we would still require an instance of that implementation to be created by any consumer of the package. This might be acceptable for dependencies like Kubernetes, where one standard, first-party client implementation exists, but requiring a concrete type prevents consumers from substituting any alternative client implementation, and severely limits our ability to use Go’s built-in testing framework to verify our code.

Therefore, when creating subpackages for our projects, we want to inject dependencies as interfaces they can satisfy. Go’s interfaces name a set of method signatures, and any Go type which satisfies all of the method signatures for the interface can be used as an instance of that interface. Some Go packages will declare an interface or a collection of interfaces satisfied by the package’s concrete type(s). If not, declare an interface with the relevant functionality in such a way that the dependency can be faked when unit testing. For simple dependencies, this might just mean declaring a subset of the methods of the concrete dependency; for more complex dependencies, this might mean declaring custom method signatures and a method of wrapping the concrete dependency to satisfy them.

In the case of Thermite, we are lucky: the Kubernetes Go client package includes the kubernetes.Interface interface, which bundles all the various Kubernetes API methods together, and the ECR Go client package includes a subpackage with a similar ecriface.ECRAPI interface. Both of these interfaces are satisfied by their respective concrete client implementations. Go interfaces have a useful property here: if we embed an interface in a struct, the struct itself satisfies that interface without implementing all of its methods, and unless one of the missing methods is actually called, no errors will occur. This means that we can use these interfaces as the types of the dependencies we wish to inject, while in our tests we are free to embed the interface and fake or mock only the methods we know our package calls.

For example, when constructing a prune.Client, we accept an ecriface.ECRAPI as the single required dependency:

func NewClient(client ecriface.ECRAPI, …) (*Client, …) {

}

In our unit tests, we declare a mockedClient type which embeds the interface in question, and is initialized with static test data to derive the output from:

type mockedClient struct {
ecriface.ECRAPI

}

We then fake the various interface methods we know will be used in our package:

func (m mockedClient) DescribeRepositoriesWithContext(…) … {

}

and can pass the mockedClient to the constructor in our tests:

client := &mockedClient{

}
gc, err := NewClient(client, …)
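Put together, a self-contained version of the fake might look roughly like this (the fields and faked output here are illustrative, not Thermite’s actual test code):

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/ecr"
	"github.com/aws/aws-sdk-go/service/ecr/ecriface"
)

// mockedClient satisfies ecriface.ECRAPI by embedding it; only the
// methods faked below are safe for the code under test to call.
type mockedClient struct {
	ecriface.ECRAPI
	repos []*ecr.Repository
}

func (m mockedClient) DescribeRepositoriesWithContext(ctx aws.Context, in *ecr.DescribeRepositoriesInput, opts ...request.Option) (*ecr.DescribeRepositoriesOutput, error) {
	return &ecr.DescribeRepositoriesOutput{Repositories: m.repos}, nil
}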

We are then free to populate the mockedClient with test data to allow us to test the core logic of aggregating and filtering images by their push time. In the thermite command itself, we begin a real AWS session and derive an ECR client from it, and pass it to the constructor in the same way:

sess, err := session.NewSessionWithOptions(…)

ecrClient := ecr.New(sess)
pruneClient, … := prune.NewClient(ecrClient, …)

Finally, we declare an interface matching the main functionality of the package, which can be used by any downstream package to easily fake a dependency on our package’s client similarly to how we used the interfaces declared by our Kubernetes and ECR dependencies:

type GarbageCollector interface {
PruneRepo(ctx context.Context, name string, until time.Time, excluded ...string) (pruned []string, err error)
PruneAllRepos(ctx context.Context, until time.Time, excluded ...string) (pruned []string, err error)
}

func (gc *Client) PruneAllRepos(ctx context.Context, until time.Time, excluded ...string) (pruned []string, err error) {

}

func (gc *Client) PruneRepo(ctx context.Context, name string, until time.Time, excluded ...string) (pruned []string, err error) {

}

We use this interface, along with a similar interface declared in the census package, in tests for the thermite package to ensure it correctly passes the census data to the garbage collector.
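For example, a test can substitute a canned garbage collector along these lines (a minimal sketch):

import (
	"context"
	"time"
)

// fakeGC satisfies GarbageCollector with fixed results.
type fakeGC struct {
	pruned []string
}

func (f *fakeGC) PruneRepo(ctx context.Context, name string, until time.Time, excluded ...string) ([]string, error) {
	return f.pruned, nil
}

func (f *fakeGC) PruneAllRepos(ctx context.Context, until time.Time, excluded ...string) ([]string, error) {
	return f.pruned, nil
}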

Use functional options to keep initialization simple

Dave Cheney’s functional options pattern describes one method for allowing optional configuration parameters for Go types with constructors, where these parameters are expressed as functions on that type which modify the relevant internal state. Constructors for our types can then take a variadic number of these functional options as their final arguments and apply them transparently — if no optional configuration parameters are needed, they can simply be omitted entirely, rather than using nil pointers or custom constructors for each combination of optional parameters that might be necessary.

In the case of the prune package, we define these functions’ type explicitly:

type Option func(gc *Client)

This definition makes the functional options’ purpose clearer in the signature of the Client constructor:

func NewClient(…, opts ...Option) (*Client, error) {

gc := &Client{

}

for _, opt := range opts {
opt(gc)
}
return gc, nil
}

We then implement various top-level functions that return Options. These functions can take zero parameters for boolean options (e.g. WithRemoveImages, which configures the client to remove images it finds eligible rather than simply logging them), or can take additional parameters to configure more complex options (e.g. WithPeriodTagKey, which configures the key string in an ECR repository’s resource tags whose value should be taken as the repository’s prune period):

func WithRemoveImages() Option {
return func(gc *Client) {
gc.removeImages = true
}
}
func WithPeriodTagKey(key string) Option {
return func(gc *Client) {
gc.periodTagKey = key
}
}

Note here that each of these options has a sane default (in this case, “false” and “thermite:prune-period”) that the option overrides. More complex options may require additional logic to establish defaults or act conditionally. For example, the prune.WithStatsdClient option overrides the default value for the Client’s statsd field, which is a statsd.NoOpClient that throws away any statsd metrics it receives.
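A sketch of such an option, assuming the Datadog statsd client’s ClientInterface (which the library’s NoOpClient satisfies):

import "github.com/DataDog/datadog-go/statsd"

// WithStatsdClient overrides the Client's statsd field, whose default
// (established in NewClient) is a no-op client: &statsd.NoOpClient{}.
func WithStatsdClient(client statsd.ClientInterface) Option {
	return func(gc *Client) {
		gc.statsd = client
	}
}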

The named functional options approach leads to slightly more boilerplate off the bat, but has several nice properties:

  • Additional configuration options can be added with a single new functional option, without needing to write additional constructors or other initialization code outside the definition of the option
  • Calls to the constructor that don’t specify any options retain the visual simplicity of a “default” constructor, only passing the required dependencies common to all usages of the package
  • Calls to the constructor that specify options are clearly labelled and organized, without the ambiguity that can arise with unnamed arguments to the constructor or nil config pointers
  • Lists of options can be built up programmatically and passed as variadic arguments, useful for translating command-line flags and arguments into appropriately-constructed types as they are parsed (see the sketch below)
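For example, the command-line layer might assemble options from parsed flags like this (the flag variables here are hypothetical):

var (
	removeImages bool   // e.g. bound to a Cobra boolean flag
	periodTagKey string // e.g. bound to a Cobra string flag
)

opts := []prune.Option{}
if removeImages {
	opts = append(opts, prune.WithRemoveImages())
}
if periodTagKey != "" {
	opts = append(opts, prune.WithPeriodTagKey(periodTagKey))
}
pruneClient, err := prune.NewClient(ecrClient, opts...)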

Move deployment complexity to the environment

No matter how clean and well-organized the code in our subpackages is, we need to be able to deploy it into our production infrastructure, as well as develop and use it locally. This can lead to a significant amount of boilerplate devoted to supporting different use cases (for example, running as a local binary vs. a Docker image vs. a Kubernetes pod, or passing authentication secrets via environment variables vs. Kubernetes Secrets vs. Vault secrets). Fortunately, each of Thermite’s command-line flags (which we define using Cobra) has an easy default for us to use.

The Datadog tracer initialization code checks for the standard environment variables that specify the address of the Datadog Agent to send APM spans to (DD_AGENT_HOST and DD_TRACE_AGENT_PORT). If these environment variables exist, the tracer and profiler are started. A top-level span is then created with tracer.StartSpanFromContext; if the Agent address has not been specified, the tracer is never started and the span is effectively a no-op. By using default environment variables and taking advantage of the Datadog package’s API, we avoid adding conditional complexity around Datadog configuration in the rest of our codebase.
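In outline, that gating looks something like this (a sketch assuming dd-trace-go v1; the span name is hypothetical):

import (
	"context"
	"os"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func run(ctx context.Context) {
	// Only start the tracer when an Agent address is configured; spans
	// created while the tracer is not running are no-ops.
	if os.Getenv("DD_AGENT_HOST") != "" {
		tracer.Start()
		defer tracer.Stop()
	}
	span, ctx := tracer.StartSpanFromContext(ctx, "thermite.run")
	defer span.Finish()
	_ = ctx // the rest of the run would use this context
}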

This approach becomes especially useful for the AWS and Kubernetes Go clients, both of which can pick up authentication details from local dotfiles and environment variables, the same mechanism by which their official CLIs (aws and kubectl) authenticate. Previous Dollar Shave Club projects have often used bespoke configuration code and sets of CLI flags to pull authentication secrets for these clients from Vault and initialize the clients with them, often with extra logic for passing secrets via environment variables or other simpler alternatives during development. However, enabling shared environment configuration when initializing the clients means that most of this code is already written for us in the client package, and that the remaining complexity around deployment can be moved into the Helm chart for the application.

Kubernetes clients in Go can easily be created when running inside the cluster they are meant to connect to via rest.InClusterConfig. However, supporting out-of-cluster configs (i.e. when running the application locally and targeting a remote cluster) is more complex. Luckily, clientcmd.NewNonInteractiveDeferredLoadingClientConfig, when passed clientcmd.NewDefaultClientConfigLoadingRules, will automatically load and use the shared configuration present on the machine (controlled by the kubeconfig files specified by the KUBECONFIG environment variable). By initializing the Kubernetes client in this way, we transparently support both simple in-cluster deployment and local connections to a remote cluster, using the existing configuration that might be present on a user’s machine.

loadingRules := clientcmd.NewDefaultClientConfigLoadingRules()
kubeConfig := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(loadingRules, nil)
config, … := kubeConfig.ClientConfig()

clientset, … := kubernetes.NewForConfig(config)
censusClient, … := census.NewDefaultClient(clientset, …)

The ECR client can be initialized from existing shared configuration in a similar way. When we create an AWS session in Thermite (from which the ECR client is derived), we simply enable the SharedConfigState option:

sess, … := session.NewSessionWithOptions(session.Options{
SharedConfigState: session.SharedConfigEnable,
})

ecrClient := ecr.New(sess)
pruneClient, … := prune.NewClient(ecrClient, …)

Using these options allows us to run our application locally with our preexisting environments, with a minimum of extra Go code dedicated to the specifics of authentication, but we must now find a way to deploy our application to production securely using the same codebase.

At Dollar Shave Club, we use Helm charts to deploy our applications. Helm charts describe templated, configurable Kubernetes resources that allow us to deploy applications locally and remotely using simple configuration options. We will use several features of Helm to automatically generate Kubernetes resources with the authentication and configuration options required for our chosen method of deployment.

The Kubernetes authentication for our deployment is less involved than AWS, as the application will be deployed inside our cluster and thus automatically provisioned with credentials. We still need to add a ClusterRole to the deployment describing the permissions required for our application (in this case, the ability to list CronJobs, Jobs, DaemonSets, Deployments, and StatefulSets). A ClusterRole is similar to a regular Kubernetes Role, but applies to the entire cluster instead of a specific namespace. We also need to add a ServiceAccount for the deployment and a ClusterRoleBinding to associate it with the ClusterRole. This ServiceAccount is then used in the PodSpec of our deployment (which runs as a CronJob, regularly pruning old images on a configurable schedule). Specific code snippets here will probably add confusion as they are intertwined with other specifics of the Helm chart, but feel free to peruse the chart code or deploy into a testing cluster to understand how the Kubernetes RBAC resources are generated.

For our AWS configuration, we need to query secrets from our production Vault instance. We typically generate a set of locked-down AWS robot users and access keys for each of our production services, and store these access keys in Vault. It might seem self-evident that interacting with Vault’s API over the network will require some sort of specific client code in our Go codebase, but thanks to the Vault Agent injector, we can automatically mount Vault secrets as temporary files inside our Kubernetes containers. This article won’t go over the steps for enabling the injector in your cluster, but in brief, the injector adds a sidecar container to pods in the cluster, which is responsible for formatting secrets according to a template in the deployment’s metadata and mounting them in the primary container. In our case, we store the AWS access key ID and secret access key as separate key-value pairs under a single secret, and our template renders these into the format of an AWS credentials file. Our PodSpec then configures the AWS_SHARED_CREDENTIALS_FILE environment variable to point to the location where the Vault injector will mount the templated secret. This is all that is necessary for Thermite to successfully authenticate.

As a side note, Thermite also supports using Kubernetes Secrets to manually manage AWS authentication, which is useful for local development and testing where an injector-enabled test cluster might be difficult and brittle to spin up. The Helm chart can be configured to use a pre-existing named Secret, which should contain AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables that will be added to the environment of the deployed container.

These examples may be specific to the tools we use and the platform we run, but the general principle stands: move deployment complexity to the project scaffolding, away from the application code. Tools like Helm can aid with this, but all of these approaches are possible without Helm if you’re willing to accept less configurability and some copy-pasting.

The UNIX philosophy

Since its inception over 50 years ago, UNIX (or more accurately, the open POSIX standard based on UNIX’s design) has become the lingua franca of computer infrastructure. Servers (virtual and physical), cloud computing instances, containers, and more are almost invariably based on POSIX-compliant operating systems like Linux or BSD. UNIX’s dominance can partly be attributed to the growth of Linux as a free, fully-featured, and performant alternative to costly proprietary server operating systems, but I believe this misses the full picture. UNIX has managed to remain the dominant paradigm for as long as it has due to its carefully-considered design, at the core of which is something referred to as the “UNIX philosophy”:

Do one thing and do it well.

Discussions of the UNIX philosophy usually go on to discuss examples of specific, simplistic UNIX commands that can be combined together using pipes to achieve a wide range of emergent goals. By providing a set of simple, broadly-applicable tools and a method to easily connect and combine them with other tools, the UNIX paradigm has remained flexible and relevant over its 50 years of existence. The UNIX philosophy is not just limited to simple command-line tools and shell scripting, however: it gives us a general guideline on how to engineer software systems in a maintainable, approachable, and flexible way.

The Go language was designed by three programmers, two of whom were members of the UNIX team at Bell Labs and one of whom designed and implemented the original UNIX operating system. Go is known as a language that emphasizes simplicity, and Go projects written with the UNIX philosophy in mind are best equipped to take advantage of that simplicity to produce quality software. Each of the techniques, tips, and tricks discussed in this article allows us to make the various layers of our application and its environment do one thing well.

  • Our project’s subpackages each have a single purpose with inputs, outputs, and behavior that we can easily describe, test and verify. We don’t combine unrelated functionality in a single type or package, and we make sure that our dependencies are injected and fakeable and that our packages can be used in a similar fashion downstream.
  • Constructors in our package’s types do default initialization well, leaving any necessary knobs and options to be implemented in their own units of code. We don’t try to make a single, monolithic constructor with a combinatorial explosion of possible permutations of options to apply, nor a bloated package with every possible permutation as its own constructor.
  • Our command-line code does option parsing well, with the bare minimum of logic to initialize and configure each of our required dependencies before passing control to our business logic. We avoid mirroring the specific details of possible deployment approaches, and instead search for solutions that are broadly applicable and can be configured during deployment rather than polluting our codebase. For Kubernetes and AWS, we let their respective packages do environment configuration loading well, rather than creating a half-baked reimplementation based on our specific requirements.
  • Our command itself does one thing well, accomplishing a single purpose with a well-defined, easily parsed textual output (a list of Docker image identifiers removed from ECR) configured wherever possible with standard environment variables and files.

While cloud infrastructure projects are a far cry from the sort of cute text-processing one-liners usually used to showcase the UNIX philosophy, we can apply the principle of limiting and externalizing the complexity of each layer of our projects to write better, cleaner Go code with a larger focus on business logic and less need for repetitive scaffolding code.

Some of the examples given in this article are quite specific to the Thermite project, while others are specific to our needs and platform at Dollar Shave Club, so I have avoided going into minute technical detail. However, I am happy to go into more depth on any of these topics, give more concrete examples, or discuss anything about Go software design with anyone who reaches out, so please do!
