Azure Kubernetes Service (AKS) is Microsoft Azure's managed Kubernetes offering, which lets you stand up a Kubernetes cluster quickly. Combined with other Azure services and features, it simplifies daily operations and makes it easy to build elastic business applications. This article is a hands-on experiment demonstrating the basic steps of an elastic deployment. The scenario: when a virtual machine behind AKS shuts down unexpectedly, automatic failover is achieved through Kubernetes configuration.
Reading this article requires a working knowledge of basic Kubernetes concepts and operations, as well as AKS fundamentals and deployment.
Basic Deployment
Application
The demo application source code is here:
https://github.com/xfsnow/pipelines-javascript-docker/
Please fork it to your own GitHub account first.
This application is a very simple Node.js application that outputs the current time and hostname, demonstrating which backend instance serves each request behind the load balancer. There is only one main file.
The output content is the following line:
res.send('Hello world! Now is '+now+'.\nRunning on '+os.hostname()+'.');
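For reference, here is a minimal sketch of what that main file might look like. The Express-based structure, date formatting, and port are assumptions for illustration; only the res.send line above comes from the repository.
// Minimal sketch of the app's main file (assumed structure).
const express = require('express');
const os = require('os');
const app = express();

app.get('/', (req, res) => {
  // Format the current time as "YYYY-MM-DD hh:mm:ss.mmm" (assumed formatting)
  const now = new Date().toISOString().replace('T', ' ').replace('Z', '');
  res.send('Hello world! Now is ' + now + '.\nRunning on ' + os.hostname() + '.');
});

// Listen on the port targeted by the liveness probe below (assumed to be 80)
app.listen(process.env.PORT || 80);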
The main Kubernetes deployment file is manifests/hello-deployment.yml, which deploys a Deployment with 4 replicas. The purpose of each setting is explained in comments in the file. The core mechanism is the following tolerations:
# Tolerations control how long a pod may remain bound to a node in a bad
# state; with tolerationSeconds: 1, pods are evicted one second after the
# node becomes unreachable or not-ready
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 1
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 1
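Without explicit tolerations, Kubernetes adds default ones with tolerationSeconds of 300, so pods would stay bound to an unreachable node for 5 minutes before being evicted; setting the value to 1 second makes failover nearly immediate.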
And:
# Health check: probe the application's running state over HTTP
livenessProbe:
  httpGet:
    port: 80
    scheme: HTTP
    path: /
  # Delay before the first probe, so a slow startup is not misjudged as a failure
  initialDelaySeconds: 10
  # Probe interval after that
  periodSeconds: 1
  # A probe taking longer than this is counted as a failure
  timeoutSeconds: 3
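The 1-second probe interval is deliberately aggressive so this experiment detects failures quickly; production deployments usually use a larger periodSeconds and pair the livenessProbe with a readinessProbe.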
There is also a hello-service.yml that deploys a load balancer, exposing a public IP for external access. The load balancer performs its own health checks: when it detects that a backend pod is unavailable, it takes the pod out of rotation and later adds the newly created replacements.
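For reference, a minimal sketch of what such a Service manifest typically looks like; the name and selector label here are assumptions, so check the actual file in the repository.
apiVersion: v1
kind: Service
metadata:
  # Matches the service name shown by kubectl get services later
  name: helloworld
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    # Assumed label; it must match the pod template labels in hello-deployment.yml
    app: helloworld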
Azure Resources
An ACR instance stores the container images. Note that, at the time of writing, only the China East 2 region supports the az acr build command, which builds from local source and pushes directly to ACR. So we create the ACR resource in China East 2, while the AKS cluster can be in any Azure China region.
# Define environment variables
REGION_NAME=ChinaEast2
RESOURCE_GROUP=aksResilience
AKS_CLUSTER_NAME=aksResilience
# ACR names must be globally unique, hence the random suffix
ACR_NAME=acr$RANDOM
# Create resource group
az group create --location $REGION_NAME --name $RESOURCE_GROUP
# Create ACR
az acr create --resource-group $RESOURCE_GROUP --name $ACR_NAME --location $REGION_NAME --sku Standard
List the results:
az acr list -o table
NAME      RESOURCE GROUP    LOCATION    SKU       LOGIN SERVER         CREATION DATE         ADMIN ENABLED
--------  ----------------  ----------  --------  -------------------  --------------------  ---------------
acr11044  aksResilience     chinaeast2  Standard  acr11044.azurecr.cn  2021-02-05T03:12:36Z  False
Create AKS cluster:
az aks create \
--resource-group $RESOURCE_GROUP \
--name $AKS_CLUSTER_NAME \
--location $REGION_NAME \
--node-count 2 \
--enable-addons monitoring \
--generate-ssh-keys
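The monitoring add-on enables Azure Monitor container insights, and --generate-ssh-keys creates SSH key files automatically if they do not already exist.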
This step is slow; please wait patiently until the command returns.
List AKS clusters:
az aks list -o table
Name           Location     ResourceGroup    KubernetesVersion    ProvisioningState    Fqdn
-------------  -----------  ---------------  -------------------  -------------------  ------
aksResilience  chinanorth2  aksResilience    1.18.14              Succeeded
(This sample output is from a cluster deployed in China North 2; your Location column will show whatever region you chose in $REGION_NAME.)
Attach the ACR to the AKS cluster so that image pulls from ACR to the cluster are authenticated automatically:
az aks update \
--name $AKS_CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--attach-acr $ACR_NAME
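Under the hood, --attach-acr grants the cluster's kubelet identity the AcrPull role on the registry, which is why no separate image pull secret is needed.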
This step is also slow; please wait patiently until the command returns.
At this point the Azure resources are in place: one AKS cluster, backed by one virtual machine scale set containing two virtual machines.
List the virtual machine scale set behind the AKS cluster. Note that the scale set is created automatically by Azure in a separate, automatically generated resource group (the MC_ prefix below):
az vmss list -o table
Name                         ResourceGroup                               Location     Zones    Capacity    Overprovision    UpgradePolicy
---------------------------  ------------------------------------------  -----------  -------  ----------  ---------------  ---------------
aks-nodepool1-40474697-vmss  MC_AKSRESILIENCE_AKSRESILIENCE_CHINANORTH2  chinanorth2           2           False            Manual
Finally list the instances in the virtual machine scale set:
az vmss list-instances \
-g MC_AKSRESILIENCE_AKSRESILIENCE_CHINANORTH2 \
-n aks-nodepool1-40474697-vmss \
-o table \
--query "[].{instanceId:instanceId, Name:name, State:provisioningState}"
InstanceId    Name                           State
------------  -----------------------------  ---------
0             aks-nodepool1-40474697-vmss_0  Succeeded
1             aks-nodepool1-40474697-vmss_1  Succeeded
Deploy Kubernetes
Build the image and push it to the image registry
Pull the source code (use your own fork if you created one):
git clone https://github.com/xfsnow/pipelines-javascript-docker/
First build and push the image:
cd pipelines-javascript-docker/app
az acr build \
--resource-group $RESOURCE_GROUP \
--registry $ACR_NAME \
--image helloworld:1.0 .
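Note that az acr build uploads the build context and runs the build in Azure via ACR Tasks, so no local Docker daemon is required.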
Then get the specific URL of the image registry:
az acr repository show -n $ACR_NAME --repository helloworld --query "registry"
"acr11044.azurecr.cn"
Note down this URL (here acr11044.azurecr.cn). Appending helloworld:1.0 to it gives the complete image pull address, like:
acr11044.azurecr.cn/helloworld:1.0
cd ../manifests
Find this line in hello-deployment.yml:
# Container image uses the ACR service on Azure
image: snowpeak.azurecr.cn/helloworld:1.3
Replace it with your own image address, e.g. acr11044.azurecr.cn/helloworld:1.0.
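For example, you can patch the line in place with sed (a sketch; substitute your own registry name and tag):
sed -i 's|image: snowpeak.azurecr.cn/helloworld:1.3|image: acr11044.azurecr.cn/helloworld:1.0|' hello-deployment.yml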
Connect to the AKS cluster and deploy the image
Connect to the AKS cluster:
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME
First view the nodes in the AKS cluster:
kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-nodepool1-40474697-vmss000000   Ready    agent   142m   v1.18.14   10.240.0.4    <none>        Ubuntu 18.04.5 LTS   5.4.0-1035-azure   docker://19.3.14
aks-nodepool1-40474697-vmss000001   Ready    agent   142m   v1.18.14   10.240.0.5    <none>        Ubuntu 18.04.5 LTS   5.4.0-1035-azure   docker://19.3.14
Then view the pods:
kubectl get pods -o wide
No resources found in default namespace.
Nothing has been deployed yet, so no resources are found.
From the manifests directory we edited above, deploy to the AKS cluster:
kubectl apply -f hello-deployment.yml
kubectl apply -f hello-service.yml
The freshly deployed cluster looks like this:
kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
helloworld-869d58f588-5tmpf   1/1     Running   0          71s   10.244.1.6   aks-nodepool1-40474697-vmss000001   <none>           <none>
helloworld-869d58f588-bt54r   1/1     Running   0          71s   10.244.1.7   aks-nodepool1-40474697-vmss000001   <none>           <none>
helloworld-869d58f588-qh6hn   1/1     Running   0          71s   10.244.0.6   aks-nodepool1-40474697-vmss000000   <none>           <none>
helloworld-869d58f588-w8xbx   1/1     Running   0          71s   10.244.0.7   aks-nodepool1-40474697-vmss000000   <none>           <none>
These are the 4 running pods, 2 on each node.
kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)        AGE
helloworld   LoadBalancer   10.0.116.133   139.217.112.135   80:30754/TCP   7d17h
This is the deployed LoadBalancer service; the 80:30754/TCP mapping shows the Service's port 80 backed by a NodePort on each node, which Azure's load balancer forwards to. The external IP 139.217.112.135 can be accessed directly, returning results like:
curl http://139.217.112.135/
Hello world! Now is 2021-02-04 08:26:52.924.
Running on helloworld-745c979464-q54vd.
Here "Running on helloworld-745c979464-q54vd." is exactly the name of one of the pods above.
If you make continuous requests, you'll see them distributed roughly evenly across the 4 backend pods:
while true; do curl --connect-timeout 2 http://139.217.112.135/; echo -e '\n'; sleep 0.5; done
Hello world! Now is 2021-02-04 08:34:19.523.
Running on helloworld-745c979464-q54vd.
Hello world! Now is 2021-02-04 08:34:20.160.
Running on helloworld-745c979464-ljrcv.
Hello world! Now is 2021-02-04 08:34:20.798.
Running on helloworld-745c979464-97v26.
Hello world! Now is 2021-02-04 08:34:21.427.
Running on helloworld-745c979464-q54vd.
Hello world! Now is 2021-02-04 08:34:22.56.
Running on helloworld-745c979464-l2jpm.
Hello world! Now is 2021-02-04 08:34:22.697.
Running on helloworld-745c979464-ljrcv.
Hello world! Now is 2021-02-04 08:34:23.368.
Running on helloworld-745c979464-l2jpm.
Hello world! Now is 2021-02-04 08:34:24.6.
Running on helloworld-745c979464-ljrcv.
Hello world! Now is 2021-02-04 08:34:24.659.
Running on helloworld-745c979464-l2jpm.
Simulate Failure
We manually stop a virtual machine to simulate a VM failure scenario.
- First, start watching pod status in real time:
kubectl get po -o wide -w
- Open a new window and run the following to observe application availability:
while true; do curl --connect-timeout 2 http://139.217.112.135/; echo -e '\n'; sleep 0.5; done
- Open another window and stop one of the virtual machines, substituting the resource group and scale set name from your own az vmss list output (the values below are from a different run):
az vmss stop -g MC_DEMO_HELLOWORLD_CHINANORTH2 -n aks-agentpool-11798053-vmss --instance-ids 0
- Return to the first window to watch the pod status; you'll see output similar to:
kubectl get po -o wide -w
NAME                         READY   STATUS        RESTARTS   AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
helloworld-9bbdbf45b-2hnz4   1/1     Running       0          119s    10.240.0.152   aks-agentpool-11798053-vmss000001   <none>           <none>
helloworld-9bbdbf45b-64qt4   1/1     Terminating   0          6m41s   10.240.0.106   aks-agentpool-11798053-vmss000000   <none>           <none>
helloworld-9bbdbf45b-gczpn   1/1     Running       0          6m37s   10.240.0.135   aks-agentpool-11798053-vmss000001   <none>           <none>
helloworld-9bbdbf45b-l6dxm   1/1     Terminating   0          5m11s   10.240.0.68    aks-agentpool-11798053-vmss000000   <none>           <none>
helloworld-9bbdbf45b-rhn2m   1/1     Terminating   0          4m45s   10.240.0.42    aks-agentpool-11798053-vmss000000   <none>           <none>
helloworld-9bbdbf45b-smtft   1/1     Running       0          118s    10.240.0.185   aks-agentpool-11798053-vmss000001   <none>           <none>
helloworld-9bbdbf45b-tmdkr   1/1     Running       0          119s    10.240.0.149   aks-agentpool-11798053-vmss000001   <none>           <none>
Pods on the first node are Terminating, while replacement pods are already Running on the second node.
- Go back to the window running curl to check:
while true; do curl --connect-timeout 2 http://139.217.112.135/; echo -e '\n'; sleep 0.5; done
curl: (28) Connection timed out after 2013 milliseconds
...
curl: (28) Connection timed out after 2009 milliseconds
...
Hello world! Now is 2021-02-04 01:40:37.752.
Running on helloworld-9bbdbf45b-2vjxt.
Hello world! Now is 2021-02-04 01:40:38.409.
Running on helloworld-9bbdbf45b-vpvn6.
Hello world! Now is 2021-02-04 01:40:39.57.
Running on helloworld-9bbdbf45b-2vjxt.
Hello world! Now is 2021-02-04 01:40:39.736.
After several timeouts, responses return to normal, and the new pod names appear in the hostnames.
Looking carefully at the timestamps in the output, you can see the outage lasts less than a minute.
- Finally, we restore node operation.
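A sketch of the restore step, assuming az vmss start with the same resource group and scale set names used in the stop command (substitute your own):
az vmss start -g MC_DEMO_HELLOWORLD_CHINANORTH2 -n aks-agentpool-11798053-vmss --instance-ids 0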
Once the node is back, you can see that pods on the originally stopped node go through a normal Terminating phase and eventually disappear. All 4 pods are now running on the second node. You can rebalance pod placement by adjusting the replica count, as shown below.
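For example, a simple way to force rescheduling across both nodes is to scale the Deployment down and back up (a sketch; the Deployment name helloworld matches the manifests above, and this briefly takes the app offline):
kubectl scale deployment helloworld --replicas=0
kubectl scale deployment helloworld --replicas=4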
Resource Cleanup
Delete the resource group and all the resources in it; the experiment is complete:
az group delete --name $RESOURCE_GROUP --yes