The Crucible, Duke’s data science accelerator, faces a challenge common to many organizations: figuring out how to easily deploy and manage cloud-native stateless applications in a secure environment. In early 2020, we set out to solve this challenge with what we’ve dubbed the Crucible App Platform.
The requirements for the Crucible are similar to those encountered by many small, agile organizations:
- Cloud first: For many use cases, cloud vendors provide cost savings as well as better security and agility than on-premises infrastructure. The cloud also provides access to a wide range of managed services such as machine learning as a service (e.g., AzureML for Microsoft, SageMaker for Amazon Web Services (AWS), and CloudML for Google Cloud Platform).
- Agility: Ability to continuously deploy and manage stateless cloud-native apps (see also: 12-factor apps). Examples of these apps are REST or GraphQL APIs, web apps, or background workers.
- Secure by default: Defines “guardrails” for apps, making it easier to adhere to certain security and regulatory requirements, such as HIPAA.
- Automation: Minimize the number of manual operational tasks for provisioning common infrastructure.
- Code first configuration management: Both infrastructure and app configuration should be represented in code. Change management happens via source control.
Build vs. Buy
Given the common set of requirements, one would assume that there’s a cost-effective, off-the-shelf commercial offering. Here’s a summary of the various offerings that we considered:
Heroku – By far the best offering from a simplicity perspective. However, we would have needed Heroku Enterprise in order to meet our security and regulatory requirements. We also wanted access to our cloud vendor’s services for data-intensive machine learning (ML). Because our apps would have run in Heroku while the ML services ran in our cloud, this split would have reduced performance and increased network costs between the two.
AWS Fargate – Amazon has a containerized app platform as a service called Fargate. It’s closer to a lightweight “Kubernetes as a service” than it is a “containers as a service”, although it does not offer nearly the flexibility that Kubernetes offers. Deployment time was quite slow even for small apps (on the order of minutes) as opposed to other solutions (on the order of seconds). It also takes a surprising amount of effort to configure a single app. Finally, there was no out-of-the-box way to manage best practices (note: recently this may have changed with the preview release of AWS Proton). AWS Fargate is also about twice the cost for the same CPU and memory as an AWS VM. Additionally, we were leaning toward using Azure as our cloud of choice for its healthcare-related services and ease of management. Because we did not want to be locked into AWS, and since we did not love our experience with Fargate, we opted against it.
Pivotal Cloud Foundry – Cloud Foundry is effectively an “on-premises private cloud” with a managed cloud service that more closely mirrors Heroku. We did not take a close look at Pivotal Cloud Foundry, as prior experience shows that you need a large install base to justify its high costs. In fact, there are anecdotes of chief technology officers eventually moving away from Cloud Foundry due to growing costs over time. Additionally, we believe that it would have presented networking challenges similar to those of Heroku Enterprise.
Red Hat OpenShift – OpenShift is an Enterprise Kubernetes solution. We did not closely consider OpenShift for two reasons. First, although Duke already has a relationship with Red Hat OpenShift, the overall sentiment was that it was not serving our needs in a way that justified its high costs. Second, while OpenShift provides some nice functionality, it has many limitations and also adds a lot of complexity. If you work in the Kubernetes space, you’ll be familiar with the many footnotes in documentation such as “*does not work on OpenShift” or “*perform the following workaround for OpenShift.”
Azure App Services – Azure App Services is essentially an application-centric version of AWS Fargate. While it is much easier to get up and running, its main downside is high cost: in order to meet security and compliance requirements, one must spend a staggering $277 per month per core of compute, with no support for fractional cores. The compute is not very elastic, and similar to AWS Fargate, it’s also not vendor-agnostic. Vendor lock-in for Azure was not a deal-breaker, but benefits must be strong in order to bear the risk of lock-in.
Google CloudRun – Google’s CloudRun is the closest thing to a Heroku-like user experience for cloud-native apps. CloudRun has many advantages relative to many of the other offerings; however, it was clear early on that Duke was not planning to sign a business service agreement with Google Cloud.
Build and “Buy”
Because no solution was a clear “buy” for us, we instead adopted a hybrid “build and buy” option. The “buy” in this case was to leverage existing open-source components such as Kubernetes as much as possible. While open source is free of licensing fees, it still costs money in the sense that it takes investment on our end to create an end-to-end solution. Riser, an experimental platform built on top of Kubernetes, was not implemented at Duke, but we referenced it throughout the design and development of the Crucible App Platform to reduce our costs and accelerate our time to market. Riser was designed with the Crucible’s exact problem space in mind.
The Crucible App Platform is built on a set of open-source components. While we have configured these components to work specifically with Azure, these components are vendor-neutral and can run on any cloud.
- Kubernetes provides the foundation
- Knative Serving builds on top of Kubernetes by providing higher-level abstractions for common app use cases
- Istio is a service mesh that provides zero-trust network security
- Sealed Secrets enables secret management in a GitOps environment
- Flux enables GitOps on Kubernetes (Note: only used for Git syncing. We do not use other Flux features such as Docker image management)
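As an illustration of what zero-trust networking can look like with Istio, the sketch below requires mTLS across the mesh and denies all traffic to a namespace until an explicit allow rule is added. It assumes Istio’s root namespace is `istio-system` and reuses the `demo` namespace from the sample app in this post. The resource names are hypothetical, and this is not necessarily how the Crucible App Platform configures Istio.

```yaml
# Sketch only: not taken from the Crucible App Platform's actual configuration.
# Require mTLS for every workload in the mesh
# (assumes the Istio root namespace is istio-system).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# An AuthorizationPolicy with an empty spec matches no requests, so all
# traffic to workloads in this namespace is denied until an ALLOW policy
# is added for specific callers.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  namespace: demo
spec: {}
```

Starting from deny-all and adding narrow allow rules is one common way to express zero trust declaratively; the policies live in source control like any other Kubernetes configuration.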
The Crucible App Platform offers the following key features:
- Easier deployment and management of stateless cloud-native apps to Kubernetes
  - Reduces the YAML that a developer needs to write for app configuration
  - Routing and managing traffic with blue/green deployment between one or more point-in-time deployment snapshots (revisions)
  - Scaling automatically based on traffic rather than resource usage (the Kubernetes resource-based autoscaler is still available if desired)
- Secure by default
  - Zero-trust networking
  - End-to-end mutual Transport Layer Security (mTLS)
  - Restricted Kubernetes Pod spec
- Infrastructure as code
  - Kubernetes and app configuration are stored in source control and applied via GitOps
  - Azure-specific infrastructure is stored in source control and applied via Terraform
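To illustrate the traffic-based autoscaling mentioned in the list above: Knative Serving reads autoscaling settings from annotations on the revision template. The following is a minimal sketch that reuses the testdummy demo app from this post. The target of 10 concurrent requests is an arbitrary assumption, not a platform default.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: testdummy
  namespace: demo
spec:
  template:
    metadata:
      annotations:
        # Knative Pod Autoscaler (the default class): scales on request load.
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        # Aim for roughly 10 in-flight requests per pod (assumed value).
        autoscaling.knative.dev/target: "10"
        # For resource-based scaling instead, switch to the Kubernetes HPA:
        # autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
        # autoscaling.knative.dev/metric: cpu
    spec:
      containers:
        - image: crucibleplatform.azurecr.io/demo/testdummy:0.1.1
          ports:
            - containerPort: 8000
```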
Developer Workflow
Easier deployment and management of stateless cloud-native apps to Kubernetes
Deploying an app is straightforward:
- Create a Dockerfile for your app
- Create a Knative configuration
- Configure your GitLab pipeline (provides automatic deploys on commit)
- Commit to source control
Here’s an example of our “testdummy” demo app:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: testdummy
  namespace: demo
spec:
  template:
    spec:
      containers:
        - image: crucibleplatform.azurecr.io/demo/testdummy:0.1.1
          ports:
            - containerPort: 8000
              protocol: TCP
This is all the Kubernetes configuration that we need in order to get secure ingress, end-to-end mTLS, autoscaling, and much more. Of course, real-world apps require additional configuration, such as secrets and environment variables, but this still represents a huge reduction in boilerplate. The platform also enforces guardrails: you can’t run your container in privileged mode (i.e., access host devices), expose arbitrary ports, expose unencrypted endpoints, and so on.
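As a hedged sketch of the additional configuration a real-world app might need, the manifests below add an environment variable and a secret to testdummy. The variable names, the secret name `testdummy-secrets`, and its key are hypothetical, and the encrypted value is a placeholder for output from the `kubeseal` CLI.

```yaml
# SealedSecret: safe to commit to Git. The Sealed Secrets controller
# decrypts it in the cluster into a regular Secret named testdummy-secrets.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: testdummy-secrets   # hypothetical name
  namespace: demo
spec:
  encryptedData:
    # Placeholder: replace with ciphertext generated by kubeseal.
    api-token: "REPLACE_WITH_KUBESEAL_OUTPUT"
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: testdummy
  namespace: demo
spec:
  template:
    spec:
      containers:
        - image: crucibleplatform.azurecr.io/demo/testdummy:0.1.1
          ports:
            - containerPort: 8000
          env:
            # Plain, non-sensitive setting (hypothetical variable).
            - name: LOG_LEVEL
              value: "info"
            # Sensitive value read from the decrypted Secret at runtime.
            - name: API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: testdummy-secrets
                  key: api-token
```

Because only the encrypted form is committed, the secret can travel through the same GitOps flow as the rest of the app configuration.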
After this is committed to the app’s repo, our CI/CD runner picks up changes, builds the Docker image, and publishes to the Azure Container Registry. The Crucible App Platform is not opinionated about what CI/CD runner you use. At the Crucible, we developed a simple shared GitLab pipeline to build our Docker containers and publish them to the Azure Container Registry.
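For readers who want a starting point, here is a minimal sketch of a GitLab CI job that builds an image and pushes it to the registry. It is not the Crucible’s shared pipeline. `ACR_USERNAME` and `ACR_PASSWORD` are hypothetical CI/CD variables, the Docker image tags are assumptions, and the job assumes a runner that permits Docker-in-Docker.

```yaml
stages:
  - build

build-image:
  stage: build
  image: docker:24
  services:
    # Docker-in-Docker; requires a runner configured to allow it.
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
    # Tag each image with the short commit SHA so every build is traceable.
    IMAGE: crucibleplatform.azurecr.io/demo/testdummy:$CI_COMMIT_SHORT_SHA
  script:
    # ACR_USERNAME and ACR_PASSWORD are hypothetical masked CI/CD variables.
    - echo "$ACR_PASSWORD" | docker login crucibleplatform.azurecr.io --username "$ACR_USERNAME" --password-stdin
    - docker build --tag "$IMAGE" .
    - docker push "$IMAGE"
  rules:
    # Only build and publish from the default branch.
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```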
Once the Docker image is pushed, the Knative configuration is pushed to the Kubernetes State Repository. This repository contains all Kubernetes configuration states. We do not use kubectl apply in any part of the process. Instead, we use Flux, our GitOps controller, which watches the Git repo and applies any changes. This has multiple benefits:
- Easy to audit what changed, when, and by whom
- Security: CI/CD runner needs no access to the Kubernetes control plane
- Ability to easily roll back configuration simply by reverting a commit
Once Flux applies the changes, the app is deployed. Deployments in Knative are similar to Kubernetes deployments. After pulling the Docker image, Knative attempts to start the container and performs an initial health check (readiness probe). If no health check is specified, Knative still tests to ensure that your app is listening properly on the defined port. Knative continues to route traffic to the existing deployment until your new deployment is healthy. For more sophisticated scenarios, developers can manually define traffic routing for A/B or canary-style rollouts.
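The sketch below shows how an explicit readiness probe and a canary-style traffic split might look in a Knative Service. The `/healthz` path, the `0.1.2` image tag, and the revision names are hypothetical, and the example assumes a revision named `testdummy-v1` already exists.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: testdummy
  namespace: demo
spec:
  template:
    metadata:
      # Naming the revision lets the traffic block refer to it explicitly.
      name: testdummy-v2
    spec:
      containers:
        - image: crucibleplatform.azurecr.io/demo/testdummy:0.1.2
          ports:
            - containerPort: 8000
          # Knative waits for this probe to pass before routing traffic.
          readinessProbe:
            httpGet:
              path: /healthz   # hypothetical health endpoint
  traffic:
    # Keep most traffic on the existing revision.
    - revisionName: testdummy-v1
      percent: 90
    # Send a small share to the new revision as a canary.
    - revisionName: testdummy-v2
      percent: 10
```

The percentages must add up to 100. Shifting the split is then just another commit, so a rollout can be promoted or reverted through the same GitOps flow.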
Once our app is deployed, accessing it is easy. Depending on which environment we deployed to (e.g., dev or prod), the platform assigns the app a URL in the following format: <appName>.<namespace>.<env>.<internalDomain>. For example, if your internal domain were “myplatform.duke.net”, the testdummy app in the dev environment would be reachable at https://testdummy.demo.dev.myplatform.duke.net. One of the platform’s strong opinions is that all apps get their own CNAME and are exposed via HTTPS on port 443. This makes it easy for developers to discover where their app is hosted and keeps apps secure by default. Developers do not have to worry about configuring TLS correctly or concern themselves with other ingress settings. HTTP on port 80 redirects to 443 by default. What about HTTP/2 or gRPC? No problem: a simple declarative configuration change to the port enables this.
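As a sketch of that port change: in Knative Serving, naming the container port `h2c` tells the platform to speak HTTP/2 cleartext to the container, which is what gRPC servers typically need. The example assumes the app itself serves HTTP/2 on port 8000.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: testdummy
  namespace: demo
spec:
  template:
    spec:
      containers:
        - image: crucibleplatform.azurecr.io/demo/testdummy:0.1.1
          ports:
            # The port name selects the protocol: "h2c" enables HTTP/2 (and gRPC).
            - name: h2c
              containerPort: 8000
```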
The Crucible App Platform provides a good foundation for easily and securely delivering stateless cloud-native apps, with a lot of opportunity for improvement and growth. Security in particular is not a destination, but a journey. We would love to see broader adoption at Duke to encourage continued investment in this platform. If you’re on the Duke network, feel free to take a look at our GitLab repo, which contains both developer and operator documentation.