Creating Azure Kubernetes Service (AKS) the Right Way

IntelliJ IDEA with Terraform Plugin.

Introduction

About seven months ago I was asked to create a short Azure AKS PoC for a project using ARM and what was then a preview version of AKS. You can read more about that in my previous blog post “Running Azure Kubernetes Service (AKS)”. ARM was a requirement, so I couldn’t use my Infrastructure as Code tool of choice, Terraform.

Azure Infrastructure Using ARM vs. Terraform

I don’t have much experience with ARM, but I have quite extensive experience using Terraform on the AWS side. When using ARM to create Azure resources I always felt that this would be easier if only I could use Terraform. Now I had a chance to see how Terraform handles Azure resources, and I must say the experience was pretty good. Terraform is really nice to work with on the Azure side as well. With a good editor that understands HashiCorp Configuration Language (HCL) and the semantics of HCL entities and their relationships (like IntelliJ IDEA with the Terraform plugin), writing Azure resources was a breeze. Just like on the AWS side, you can make your Azure cloud infra configuration modular. Terraform also provides a plan phase in which you can check which configuration changes are about to be made (new resources to be created, existing resources to be updated, etc.) before you apply them.

ARM / json.
Terraform / hcl.

The Solution

The overall solution is pretty simple: create an Azure storage account to store the Terraform state, write the Azure AKS configuration in a modular manner using Terraform, and deploy the infra to Azure incrementally as you write new resource configurations.

1. Create Azure Storage Account for Terraform Backend

You don’t need this step for small PoCs, but I like to follow best practices. Here the best practice is to store the Terraform state in a place where many developers can access it (i.e. in the cloud, naturally). Terraform takes care of locking the state so that several developers cannot run cloud infra updates concurrently and break the coherence of the infra. On the AWS side the natural place for the Terraform backend is S3; on the Azure side it is Blob Storage. I created a simple script to automate this part: create-azure-storage-account.sh.
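Once the storage account exists, Terraform has to be told to use it. The following is a minimal sketch of what such a backend configuration might look like. The resource group, storage account, container and key names are illustrative assumptions, not the values used by the script above.

```hcl
# Minimal sketch of an azurerm backend configuration.
# All names below are hypothetical placeholders.
terraform {
  backend "azurerm" {
    resource_group_name  = "example-tfstate-rg"    # resource group holding the storage account
    storage_account_name = "exampletfstate"        # 3-24 lowercase letters and digits
    container_name       = "tfstate"               # blob container for state files
    key                  = "dev.terraform.tfstate" # blob name for this environment's state
  }
}
```

After adding or changing a backend block you need to run “terraform init” so that Terraform connects to the backend before the first plan.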

2. Create Azure Infra Code Using Terraform

The next step is to create the Azure cloud infra code using Terraform (see the terraform directory). I usually create a “main” configuration that defines which modules make up the environment: the file env-def.tf. If you look at that file you can see that it creates the resource group for the infra, the Azure Container Registry (ACR), a couple of public IPs and the Azure Kubernetes Service (AKS). The actual resources are provided as Terraform modules in the modules directory. Once the main definition (env-def.tf) and the modules are ready, you can create the environments (e.g. development, integration-testing, performance-testing, production) separately just by injecting the environment-specific values into the main configuration module; see dev.tf for an example of the development environment. If I later wanted to create, say, a production environment, I could easily refactor certain parameters (e.g. vm_sizes) out of env-def.tf and into the environment files.
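The layering described above can be sketched roughly as follows. This is not the repository's actual code: the module paths, variable names, values and the “name” output of the resource group module are all assumptions made for illustration, and two files are shown in one listing.

```hcl
# ----- dev.tf (environment file; all values are illustrative) -----
module "env_def" {
  source   = "../modules/env-def" # hypothetical path to the main definition
  prefix   = "example"
  env      = "dev"
  location = "westeurope"
  vm_size  = "Standard_B2s" # a cheaper size for development
}

# ----- modules/env-def/env-def.tf (main definition) -----
variable "prefix" { type = string }
variable "env" { type = string }
variable "location" { type = string }
variable "vm_size" { type = string }

locals {
  # Shared naming prefix so every resource name carries the environment
  name_prefix = "${var.prefix}-${var.env}"
}

module "resource_group" {
  source   = "../resource-group" # hypothetical module
  name     = "${local.name_prefix}-rg"
  location = var.location
}

module "aks" {
  source              = "../aks" # hypothetical module
  name_prefix         = local.name_prefix
  resource_group_name = module.resource_group.name # assumes the module exposes a "name" output
  location            = var.location
  vm_size             = var.vm_size
}
```

With this shape, a production environment would be another small file like dev.tf that passes different values into the same main definition.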

3. Deploy Cloud Infra!

You don’t have to write the whole cloud infra in one shot before applying it to the cloud provider. Actually you shouldn’t, and I never do. I create the cloud infra incrementally, resource by resource: I run “terraform plan”, and if everything looks good I apply the new resources with “terraform apply”. Working in small steps like this is pretty nice. Once you have one environment ready (e.g. the development environment), it is pretty easy to apply the same code to other environments, which will be as close to exact copies of your development environment as you want them to be. You will probably just want to use cheaper resource types (e.g. VM sizes) in development to save some project money.
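As a rough illustration of this incremental style, the sketch below adds two resources in two separate rounds. The resource names and location are hypothetical, and the provider block assumes azurerm provider 2.x or later, where the empty features block is required (newer versions may also need a subscription id to be configured).

```hcl
# Assumption: azurerm provider 2.x or later.
provider "azurerm" {
  features {}
}

# Increment 1: write only the resource group, then run
# "terraform plan" followed by "terraform apply".
resource "azurerm_resource_group" "main" {
  name     = "example-dev-rg" # hypothetical name
  location = "westeurope"
}

# Increment 2: add the container registry and run "terraform plan" again.
# The plan should list only this one resource as new.
resource "azurerm_container_registry" "main" {
  name                = "exampledevacr" # hypothetical; must be globally unique and alphanumeric
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Basic"
}
```

Because each round changes only a little, the plan output stays short enough to read properly before applying.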

Experiences

Service Principal Hassle

I spent quite a lot of time trying to create a service principal to run the Terraform commands with, so that this service principal would be authorized to create other service principals (AKS needs a service principal of its own to be able to create the virtual machines for the Kubernetes cluster infra). This turned out not to be an easy task. I could have created the right kind of service principal outside Terraform and injected its id and secret into the Terraform configuration, but that would have been a bit of an ugly solution: best practice is not to split your cloud infra across various scripts but to define the whole cloud infra in one configuration that you can apply with one command.
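For reference, creating the AKS service principal inside the Terraform configuration might look something like the sketch below. It assumes the azuread provider in its 2.x line (argument names differ in other major versions), the names are hypothetical, and it is not necessarily how the example repository does it. The identity running Terraform still needs sufficient directory permissions to create applications and service principals, which is the hard part described above.

```hcl
# Assumption: azuread provider 2.x; other major versions rename some arguments.
terraform {
  required_providers {
    azuread = {
      source  = "hashicorp/azuread"
      version = "~> 2.0"
    }
  }
}

# Application registration that backs the service principal (hypothetical name)
resource "azuread_application" "aks" {
  display_name = "example-dev-aks"
}

# The service principal AKS will act as
resource "azuread_service_principal" "aks" {
  application_id = azuread_application.aks.application_id
}

# Generated secret for the service principal
resource "azuread_service_principal_password" "aks" {
  service_principal_id = azuread_service_principal.aks.object_id
}

# Values an AKS cluster definition would consume as client id and secret
output "aks_client_id" {
  value = azuread_application.aks.application_id
}

output "aks_client_secret" {
  value     = azuread_service_principal_password.aks.value
  sensitive = true
}
```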

Terraform Hassle

Not everything is a bed of roses with Terraform. Terraform state can become corrupted, which is a real hassle if you have a big environment (not to mention if that environment is production). Sometimes Terraform fails and gives cryptic error messages. The usual procedure is to wait a couple of minutes and run “terraform apply” again. If the second run succeeds, some previous cloud resource was probably not yet finalized when Terraform tried to create another resource that depended on it. This happened to me when creating AKS and the service principal. So, if you try the example in my GitHub account, don’t be alarmed if the first “terraform apply” command fails.

Update 2019-01-09: Major Hassle with Authenticating AKS to Pull Images from ACR

I must admit now that I published the original version of this blog post a bit too early :-) . After publishing it I tried to use the Terraform configuration as it was then to deploy the Simple Server single-node Kubernetes deployment to that AKS. It didn’t work: AKS was not authorized to pull images from ACR. When examining the failed pod (using the kubectl describe command…) I saw that the image pull had failed: “Failed to pull image … unauthorized: authentication required…”. It took me quite some time to figure out how to fix this. In this kind of situation it is a good idea to create a working reference infra, e.g. using the Portal or a command line tool. So I created everything (AKS, ACR, public IPs, service principal, role assignment etc.) manually, step by step, using az cli, deployed the Simple Server single-node version there, and tested the deployment by curling the Simple Server API. The reference infra worked fine. Now I had a working reference infra for examining what was wrong with my Terraform configuration. I compared the created Azure resources side by side (the reference infra created by az cli and the infra created by my Terraform code). Finally I figured out the issues and was able to fix them. During the roughly eight hours I spent on the Terraform code I also refactored it quite extensively (e.g. moved the service principal to the aks module where it belongs, moved the role assignment to the acr module where it belongs, introduced locals etc.).
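The post does not spell out the exact fixes, but a role assignment of the kind mentioned above could be sketched as follows. The variable names are hypothetical, and the use of the built-in “AcrPull” role is an assumption about what the assignment should grant, not a quote from the repository.

```hcl
# Sketch of a role assignment that lets the AKS service principal pull from ACR.
# Both inputs are hypothetical module variables.
variable "acr_id" {
  type        = string
  description = "Resource ID of the Azure Container Registry"
}

variable "aks_sp_object_id" {
  type        = string
  description = "Object ID of the service principal that AKS runs as"
}

resource "azurerm_role_assignment" "aks_acr_pull" {
  scope                = var.acr_id           # limit the grant to this one registry
  role_definition_name = "AcrPull"            # built-in role allowing image pulls (assumed choice)
  principal_id         = var.aks_sp_object_id # who receives the role
}
```

Scoping the assignment to the registry itself, rather than to the whole resource group, keeps the grant as narrow as the image pull requires.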

Conclusions

With Terraform it was pretty easy to create the Azure AKS infra and the related Azure resources. If I have the choice, I will use Terraform rather than ARM in my future Azure projects.

I’m a Software architect and developer. Currently implementing systems on AWS / GCP / Azure / Docker / Kubernetes using Java, Python, Go and Clojure.