The Philosophy of Infrastructure & SRE

Benjamen Keroack
Published in DSC Engineering · Dec 19, 2018

--

This is an internal document used to explain the purpose and philosophy behind the Infrastructure Engineering and SRE Team at Dollar Shave Club. We use this when on-boarding new engineers into DSC Engineering.

We believe that publishing this provides a window into how we have built and run the technical infrastructure behind DSC and is a (perhaps) unique take on the subject.

Welcome to DSC Engineering!

You’re reading this because you want an overview of what our team — Infrastructure Engineering and SRE — does, and how it relates to you and your job. You might be thinking that we are DSC’s version of DevOps, or Ops or something else along those lines. It’s certainly understandable why you might have that idea coming from other companies, but we want to be very clear at the outset…

We are not DevOps. We are not Ops.

We are software engineers who have built a software delivery pipeline and related tools that you get to use. We hope you like them! If not, or if we could do anything better, we’re open to feedback so please share your thoughts with us.

Of course we all share responsibility for the uptime of DSC systems, but rather than running your applications for you, we act as consultants and partners who help you design your systems so that they run well and reliably on the platform we’ve created.

But why aren’t we DevOps or Ops?

Why not DevOps?

“DevOps” is one of those terms that has as many definitions as there are people who use it, but our understanding of DevOps is that it refers to a team of people who understand software engineering but do mostly ops work¹. We do everything we can to limit our ops work because we don’t think anyone should have to do ops work. We build systems that make the need for traditional ops work largely disappear, and distribute the remaining ops work in a way that’s more scalable.

Why not Ops?

When we say “traditional ops work” we mean provisioning servers, configuring operating systems, manually scaling applications up or down, writing service startup scripts and the like. We don’t like spending time on that work because computers can do it, and with software systems like Kubernetes, we can put the power and flexibility to run your application the way you want directly into your hands.

A simplified explanation of Kubernetes (K8s) that I like is that it allows you, the application developer, to write a specification of how you want your application to run in production, in a way that both humans and machines can understand.

This is information that, in the old ops world, you would have to explain in a ticket that you submit to the ops team:

“My application uses Java and it requires JRE 1.7. It’s a web service that listens on port 3000. It needs at least 4GB of RAM and it will write logs to /var/log/myapp.log. Please run 8 instances horizontally behind a load balancer.”

Instead of all that, in K8s you encode this information in YAML as K8s API objects like Deployments and Services (or more specifically, as a Helm Chart containing templates that produce those API objects).
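
To make that concrete, here is a minimal sketch of what the ticket above might look like as K8s API objects. The names and image are hypothetical, and a real chart would template most of these values:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp                      # hypothetical service name
    spec:
      replicas: 8                      # run 8 instances horizontally
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: myorg/myapp:1.0.0 # image with JRE 1.7 baked in
              ports:
                - containerPort: 3000  # listens on port 3000
              resources:
                requests:
                  memory: 4Gi          # needs at least 4GB of RAM
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
    spec:
      selector:
        app: myapp
      ports:
        - port: 80
          targetPort: 3000             # load-balances across the replicas

Note that there is no /var/log/myapp.log here: per the container conventions described below, logs go to stdout, which is also what lets you tail them later with kubectl logs.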

Interfaces

In both the old ops world and in DevOps, the (dev)ops team needs to know a great deal about the applications they run. They need to understand how each application runs, the dependencies it needs, how it might fail, the load it’s expected to handle and many more details. This sometimes motivates ops teams to restrict the technologies that app teams are allowed to use² to make the problem more tractable. We don’t want to do that, yet we also can’t possibly know that level of detail about every application that runs on our platform.

Therefore we use several interfaces that allow us to abstract away application detail. Just like interfaces in software design and programming languages, these interfaces encapsulate detail we don’t need to care about.

Container Image

The first interface is also the most fundamental: the Docker image.

To run your software on our platform, it must be contained (no pun intended) within a Docker container image. Full stop. No exceptions.

We ask that you also follow some conventions when constructing your container:

  • Configure your application with environment variables or CLI flags.
  • Always log to stdout (avoid writing to the local filesystem at all, if possible).
  • If your application is a network service, make the port on which it listens configurable.

We also ask that you make your image build stateless, so that the GitHub repository is all we need to build your image³. This lets us use our Furan service to build your software artifact.
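
As a sketch of what a stateless build can look like (a hypothetical Go service; the paths and names are made up), a multi-stage Dockerfile keeps the compile step inside the image build itself, so the repository is the only input:

    # Build stage: all build tooling lives inside the image build,
    # so no build server with implicit state is required (see footnote 3).
    FROM golang:1.11 AS build
    WORKDIR /src
    COPY . .
    RUN CGO_ENABLED=0 go build -o /myapp ./cmd/myapp

    # Runtime stage: configuration via environment variables, logs to stdout,
    # and the listen port is configurable per the conventions above.
    FROM alpine:3.8
    COPY --from=build /myapp /myapp
    ENV PORT=3000
    CMD ["/myapp"]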

A big advantage of container images is that the artifact you run on your machine — or in your environment of choice — is the same one that runs in production. This reduces the likelihood of “works locally but fails in production” issues.

Helm Chart

The next interface is Helm Charts. Charts describe how to run your application container, the resources it needs and any supporting containers (sidecars). Charts provide a powerful templating system that allows substantial flexibility in configuring applications for different environments.

By convention, we want you to put your Helm chart in your application git repository at .helm/charts/<chart name>⁴. This lets it work well with our deployment machinery as well as DQAs.

If your application is in a private repo, put the chart variables in .helm/releases/<env name>.yml. If it’s a public repo and you don’t want to expose sensitive configuration values⁵, please put the chart variables in <link to private git repository>.
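
Putting those conventions together, a hypothetical service repository might contain something like the following (names and values are illustrative only):

    # .helm/charts/myapp/         <- the chart itself (Chart.yaml, templates/, values.yaml)
    # .helm/releases/staging.yml  <- per-environment chart variables, for example:

    replicaCount: 2
    image:
      repository: myorg/myapp
      tag: latest
    service:
      port: 3000
    resources:
      requests:
        memory: 512Mi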

Kubernetes: The Real Deal

So let’s say you go through all the above, and create a container image and Helm chart for your shiny new application. What does this get you?

It gets you (or at least your team manager, depending upon you and your team) full operational control over your app in production, and total visibility into what it’s doing. You won’t have to put in tickets to us to reboot the server, or to scale it up or down, or to have SSH access for a few minutes. By using kubectl, the Swiss Army knife of K8s tools, or by modifying your chart, you can do all of these things and more. You can tail logs (kubectl logs), open a shell and run a Rails console (kubectl exec), access a private network service (kubectl port-forward) or whatever else you might need to do in your day.

You’re also fully empowered to scale your application up and down as you see fit. But how do you know the scale at which it should run? Even more importantly, how do you know your application is functioning properly at all?

Site Reliability Engineering (SRE)

Site Reliability Engineers are experts in SRE methodology pioneered by Google for running production systems. We can use SRE concepts to help answer those questions.

The depth and scope of SRE is too big for this brief overview, but some key points that are relevant to this discussion are:

  • You should instrument your application and capture metrics that tell you how your application is running and what errors are occurring. Error rate and request latency are two critical metrics that must be captured for all network services.
  • All applications will have some errors sometimes. 100% reliability is neither feasible nor worth pursuing in most cases.
  • Reliability is not a subjective, hand-wavey side effect but rather is an empirical property of your software that is tracked with metrics that serve as indicators and is made actionable with transparent objectives and agreements.
  • Your team should come up with an SLA (Service Level Agreement) with your business stakeholders, and use that to determine an acceptable error budget for your application. This is essentially a measure of how reliable your service needs to be from both a business and technical standpoint. As long as you are meeting your SLA and staying under your error budget, you are free to develop features and deploy. If you’re not meeting your SLA, you have to stop feature work and focus on stability until you’re back under your error budget.
  • Performance and scaling are also part of your application SLA. If you’ve instrumented the work your application does (for example, requests per second and latency), you should be able to easily see when your application needs to scale up or down. This is simply a matter of adjusting the replica count in your chart, or (in the more advanced case) utilizing a K8s Horizontal Pod Autoscaler (see the sketch after this list).
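
To make a couple of those points concrete: a 99.9% availability SLA over a 30-day month works out to an error budget of roughly 43 minutes of downtime, and the Horizontal Pod Autoscaler case might look something like this sketch (names and thresholds are hypothetical):

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: myapp
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp                      # the Deployment produced by your chart
      minReplicas: 4
      maxReplicas: 20
      targetCPUUtilizationPercentage: 70 # scale out when average CPU exceeds 70%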

We expect all application authors to instrument their applications and utilize monitoring and alerting for appropriate metrics. For tips and guidance on how to best follow SRE principles, please reach out to us.

On-Call

At DSC, all engineers share the burden of being on-call for production systems. The engineering organization as a whole determines which team owns which systems, and each team manager decides who should be on their rotation schedule. For every on-call schedule there is at least one person on primary duty and another person on secondary duty as a backup.

See <internal link to On-Call Expectations document>.

Infrastructure Engineering and SRE are on-call for the systems we run (see below), but we are not the primary on-call for most applications. We wouldn’t be much use even if we were, because the app teams, not us, have the expertise needed to fix critical outages. That said, if an issue arises where we could be of use to an application team debugging a critical issue, we always have at least one team member on-call and ready to help.

Systems We Run

Just like application teams, we run a number of systems and are the first point of contact for questions about using them or for operational issues.

Kubernetes

Our primary area of focus and expertise is in running a large, production-grade Kubernetes cluster. Our goal is to provide a flexible, reliable and performant computing platform to all of DSC Engineering, such that nobody has to worry about “servers” or any other system-level concerns.

CDN

We manage the CDN (Content Delivery Network) layer that regulates public Internet traffic entering our Kubernetes platform. This involves several repositories of configuration that are treated as independent application codebases:

<links to private repositories related to CDN configuration>

In addition, we run a related service called Guardian which provides rate-limiting, IP blacklisting and other security features for traffic entering our systems from the public Internet.

Databases & Data stores

We manage application SQL databases (via AWS RDS): MySQL and PostgreSQL. We prefer the Amazon Aurora versions of each due to some nice features like storage autoscaling, automatic cluster failover and low-latency read replicas.

In addition we have a production-grade Cassandra (ScyllaDB) cluster available for use, as well as other non-SQL data stores like ElasticSearch and Kafka.

Vault

To help you follow security best practices, we run a production-grade Hashicorp Vault cluster where you can securely store and access sensitive application configuration values. All application secrets (like database passwords, encryption keys or third party API keys) must be stored in Vault.

To help your application interact with Vault, we’ve written several tools and libraries (all of which are open source):

Dynamic QA Environments (Acyl)

This is a very important system that you will interact with on a daily basis. Acyl is a service that creates full testing environments (“DQAs”, Dynamic QA Environments) on demand and tears them down when you are done with them. With DQAv2 (aka Nitro), application developers have full control over environment configuration. See <internal link to DQAv2 User Guide> for details.

We don’t manage your individual environment configurations or the lifecycle of any particular testing environment, but we do run the service itself and we provide SLAs around uptime. If you have any questions about what caused an error in one of these environments, please reach out.

Furan

Furan is our open source Docker image building service. This is used by all systems that require dynamic image builds, such as Acyl and the deployment systems.

Deployment Systems

We manage the legacy Jenkins system that is the primary UI for running application deployments to K8s. At this point Jenkins is little more than a script runner, and the scripts it uses are managed by us through Terraform. We are actively working on an automated deployment service that, when complete, will allow us to finally retire Jenkins for good. Stay tuned for news about this!

Conclusion

We hope this overview of our team and what we do has been helpful. Please don’t hesitate to reach out if you have any questions, or if you want to help. We always welcome contributions from outside teams, and we’re happy to discuss anything infrastructure or SRE-related.

Again, welcome to DSC!

Footnotes

  1. There’s a secondary definition of DevOps that emphasizes the goal of DevOps Engineers working closely with application developers to make sure they have effective processes and automation that works for the whole team. This applies to our team as well as part of our consulting/advisory role.
  2. “We can run anything as long as it’s a .jar”
  3. In some communities (C++, Go and sometimes others) people occasionally want to build their software artifact outside of the container, and then COPY it in the Dockerfile. The problem is that this requires a build server (sometimes just the engineer’s local machine) with implicit state (build tools, libraries, etc.) and therefore produces a non-reproducible image build.
  4. We’ve chosen not to use Helm Chart repositories deliberately as they don’t provide enough benefit for the added complexity.
  5. By “sensitive values”, we mean values that are not quite secrets (like encryption keys, which must be stored in Vault) but also shouldn’t be exposed in a public repository. If in doubt, ask around before making something public.
