Intro
This article focuses on infrastructure deployment using Terraform from a local machine. Before putting a script into a pipeline, I usually test how it behaves from the local environment. Initially I didn’t plan to write this post, because I wanted to cover all the material in another one. But while working on my Terraform script and explaining some of its aspects, I realized it would be better to split the narrative logically rather than mix everything into one pile.
Preparing a Terraform script
The script is a basic infrastructure with an EKS cluster, a VPC, two IAM roles (developer and manager), etc. At the next stage the script will be added to the CI pipeline and all variables will be turned into GitLab CI/CD variables. Only the main components are shown in the diagram; the script itself is at the very end of this section (see The TF-script).
I usually create TF-scripts with everything inside. This may be seen as an anti-pattern and to some extent against best practices 1, but the main reason I do it is something I run into every time – traceability. With many disparate files, each containing one or more somehow interconnected resources, traceability can be lost very quickly as the infrastructure expands. A lot of time is lost during debugging in particular, when an error appears after adding a new resource, or when local variables change and several resources need to be changed accordingly.
Personally, it’s easier for me to have one big script with all the infrastructure inside. Even if it’s the size of a tablecloth, I still find it easier to know that everything is here and that any defect will be found here, rather than jumping through dozens of files looking for bugs.
Naturally, you could argue that a competent IDE easily remedies this, but for me it doesn’t – it’s easier to find whatever I need with Ctrl+F. I use comments to divide the huge script into logical blocks, grouping closely related resources together.
Control flow
To make the script flexible enough, I use conditionals with the count parameter – for example, for the Metrics Server Helm release:
variable "deploy_metrics_server" {
description = "Flag to control metrics server deployment"
type = bool
default = true
}
# Metrics server
resource "helm_release" "metrics_server" {
count = var.deploy_metrics_server ? 1 : 0
name = "metrics-server"
repository = "https://charts.bitnami.com/bitnami"
chart = "metrics-server"
namespace = "metrics-server"
version = "7.2.16"
create_namespace = true
set {
name = "apiService.create"
value = "true"
}
depends_on = [
aws_eks_node_group.general
]
}
If deploy_metrics_server is set to true, count evaluates to 1 and an instance of the Metrics Server is deployed in the cluster; otherwise the resource is not created at all.
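A side effect of gating a resource with count is that it becomes a list, so any reference to it elsewhere has to go through an index or a splat expression. A minimal sketch – the output name here is mine, purely for illustration:
output "metrics_server_status" {
  # one() returns the single element when the release exists (count = 1)
  # and null when the flag disables it (count = 0)
  value = one(helm_release.metrics_server[*].status)
}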
Terraform modules
I decided not to use TF-modules, because several times I have run into a pile of problems when it was time to update a script. A module is essentially a custom assembly of many disparate resources, aimed at convenience and quick deployment, with all the functionality hidden inside. It looks like you can just take it and use it, but when the module’s major version changes, many components can turn out to be incompatible because of conflicting parameters that either have to be added or have to be removed.
For example, I once had a hell of a battle switching the EKS module from version 17 to version 20. It was legacy code that had to be kept alive, and initially there were no problems, but one day I needed to update it: version 17 is simply too old – it has no notion of node groups, only worker groups, and node groups were exactly what I desperately needed. Having failed with such a drastic upgrade, I decided to upgrade sequentially – from 17 to 18, from 18 to 19, from 19 to 20 – but that was painful too, because the module stubbornly refused to work: there were fundamental differences in the authentication methods between versions.
Kubernetes Provider
One of the stumbling blocks was getting credentials for the providers, particularly for Kubernetes:
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = ["eks", "get-token", "--cluster-name", aws_eks_cluster.eks.name]
command = "aws"
}
}
After a couple of hours of trial and error, the reason was found:
“Based on the provided configuration, it seems that you are affected by a bug in Terraform Cloud where in some circumstances when using that authentication method the awscli executable which should be installed on Terraform Cloud agent node gets installed slower making it unavailable at the time of awscli command execution. This is the reason why sometimes the run is successful, but sometimes it fails.”
So the Kubernetes provider should use a token instead:
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
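The token here comes from two data sources that read the cluster after it has been created (they appear again in the full script below):
data "aws_eks_cluster" "eks" {
  name = aws_eks_cluster.eks.name
}

data "aws_eks_cluster_auth" "eks" {
  # Short-lived authentication token for the cluster API
  name = aws_eks_cluster.eks.name
}
Since the token is short-lived, this works well for a plan or apply run from a local machine, where a fresh token is requested on every run.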
The TF-script
# Local variables
locals {
env = "staging"
region = "eu-central-1"
zoneA = "eu-central-1a"
zoneB = "eu-central-1b"
zoneC = "eu-central-1c"
eks_version = "1.31"
eks_name = "test-nest"
}
# Variables
variable "deploy_metrics_server" {
description = "Flag to control metrics server deployment"
type = bool
default = true
}
variable "create_developer_user" {
description = "Flag to control developer user creation"
type = bool
default = true
}
variable "create_manager_user" {
description = "Flag to control manager user creation"
type = bool
default = true
}
# Data
data "aws_eks_cluster" "eks" {
name = aws_eks_cluster.eks.name
}
data "aws_eks_cluster_auth" "eks" {
name = aws_eks_cluster.eks.name
}
# Providers
provider "aws" {
region = local.region
profile = "sobercounsel"
shared_credentials_files = ["~/.aws/credentials"]
}
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.53"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "2.35.0"
}
helm = {
source = "hashicorp/helm"
version = "2.16.1"
}
}
}
provider "helm" {
kubernetes {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
}
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
# Networking
resource "aws_vpc" "aws-vpc" {
cidr_block = "10.0.0.0/16"
enable_dns_support = true
enable_dns_hostnames = true
tags = {
Name = "${local.env}-vpc"
}
}
resource "aws_internet_gateway" "aws-igw" {
vpc_id = aws_vpc.aws-vpc.id
tags = {
Name = "${local.env}-igw"
}
}
resource "aws_subnet" "privateA" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.0.0/19"
availability_zone = local.zoneA
tags = {
Name = "${local.env}-private-${local.zoneA}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "privateB" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.32.0/19"
availability_zone = local.zoneB
tags = {
Name = "${local.env}-private-${local.zoneB}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "privateC" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.64.0/19"
availability_zone = local.zoneC
tags = {
Name = "${local.env}-private-${local.zoneC}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicA" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.96.0/19"
availability_zone = local.zoneA
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneA}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicB" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.128.0/19"
availability_zone = local.zoneB
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneB}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicC" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.160.0/19"
availability_zone = local.zoneC
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneC}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_eip" "aws-eip" {
domain = "vpc"
tags = {
Name = "${local.env}-nat"
}
}
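# A single NAT gateway (placed in publicA) serves all three private subnets.
# That keeps the staging setup cheap, but it is not redundant across AZs.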
resource "aws_nat_gateway" "aws-nat-gw" {
allocation_id = aws_eip.aws-eip.id
subnet_id = aws_subnet.publicA.id
tags = {
Name = "${local.env}-nat"
}
depends_on = [
aws_internet_gateway.aws-igw
]
}
resource "aws_route_table" "aws-rt-private" {
vpc_id = aws_vpc.aws-vpc.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.aws-nat-gw.id
}
tags = {
Name = "${local.env}-private"
}
}
resource "aws_route_table" "aws-rt-public" {
vpc_id = aws_vpc.aws-vpc.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.aws-igw.id
}
tags = {
Name = "${local.env}-public"
}
}
resource "aws_route_table_association" "privateA" {
subnet_id = aws_subnet.privateA.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "privateB" {
subnet_id = aws_subnet.privateB.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "privateC" {
subnet_id = aws_subnet.privateC.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "publicA" {
subnet_id = aws_subnet.publicA.id
route_table_id = aws_route_table.aws-rt-public.id
}
resource "aws_route_table_association" "publicB" {
subnet_id = aws_subnet.publicB.id
route_table_id = aws_route_table.aws-rt-public.id
}
resource "aws_route_table_association" "publicC" {
subnet_id = aws_subnet.publicC.id
route_table_id = aws_route_table.aws-rt-public.id
}
# EKS
resource "aws_iam_role" "eks" {
name = "${local.env}-${local.eks_name}-eks-cluster"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"Service": "eks.amazonaws.com"
}
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "eks" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.eks.name
}
resource "aws_eks_cluster" "eks" {
name = "${local.env}-${local.eks_name}"
version = local.eks_version
role_arn = aws_iam_role.eks.arn
vpc_config {
endpoint_private_access = false
endpoint_public_access = true
subnet_ids = [
aws_subnet.privateA.id,
aws_subnet.privateB.id,
aws_subnet.privateC.id
]
}
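# The "API" authentication mode grants cluster access via aws_eks_access_entry
# resources (see the developer and manager entries below) instead of the
# legacy aws-auth ConfigMap.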
access_config {
authentication_mode = "API"
bootstrap_cluster_creator_admin_permissions = true
}
depends_on = [
aws_iam_role_policy_attachment.eks
]
}
# Nodes
resource "aws_iam_role" "nodes" {
name = "${local.env}-${local.eks_name}-eks-nodes"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ec2.amazonaws.com"
}
}
]
}
POLICY
}
# This policy now includes AssumeRoleForPodIdentity for the Pod Identity Agent
resource "aws_iam_role_policy_attachment" "amazon_eks_worker_node_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.nodes.name
}
resource "aws_iam_role_policy_attachment" "amazon_eks_cni_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.nodes.name
}
resource "aws_iam_role_policy_attachment" "amazon_ec2_container_registry_read_only" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.nodes.name
}
resource "aws_eks_node_group" "general" {
cluster_name = aws_eks_cluster.eks.name
version = local.eks_version
node_group_name = "general"
node_role_arn = aws_iam_role.nodes.arn
subnet_ids = [
aws_subnet.privateA.id,
aws_subnet.privateB.id,
aws_subnet.privateC.id
]
capacity_type = "SPOT"
instance_types = ["t3.small"]
scaling_config {
desired_size = 2
max_size = 10
min_size = 1
}
update_config {
max_unavailable = 1
}
labels = {
role = "general"
}
depends_on = [
aws_iam_role_policy_attachment.amazon_eks_worker_node_policy,
aws_iam_role_policy_attachment.amazon_eks_cni_policy,
aws_iam_role_policy_attachment.amazon_ec2_container_registry_read_only,
]
# Allow external changes without Terraform plan difference
lifecycle {
ignore_changes = [scaling_config[0].desired_size]
}
}
# K8S Roles & Role Bindings
## Developer
resource "kubernetes_cluster_role" "viewer" {
metadata {
name = "viewer"
}
rule {
api_groups = ["*"]
resources = [
"namespaces",
"pods",
"configmaps",
"secrets",
"services"
]
verbs = [
"get",
"list",
"watch"
]
}
}
resource "kubernetes_cluster_role_binding" "viewer-binding" {
metadata {
name = "viewer-binding"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "cluster-admin"
}
subject {
kind = "Group"
name = "viewer-group"
api_group = "rbac.authorization.k8s.io"
}
}
## Manager
resource "kubernetes_cluster_role_binding" "admin-binding" {
metadata {
name = "admin-binding"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "cluster-admin"
}
subject {
kind = "User"
name = "admin"
api_group = "rbac.authorization.k8s.io"
}
subject {
kind = "ServiceAccount"
name = "default"
namespace = "kube-system"
}
subject {
kind = "Group"
name = "manager-group"
api_group = "rbac.authorization.k8s.io"
}
}
# IAM
## Developer
resource "aws_iam_user" "developer" {
count = var.create_developer_user ? 1 : 0
name = "LongView"
}
resource "aws_iam_policy" "developer_eks" {
count = var.create_developer_user ? 1 : 0
name = "AmazonEKSDeveloperPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"eks:DescribeCluster",
"eks:ListClusters"
],
"Resource": "*"
}
]
}
POLICY
}
resource "aws_iam_user_policy_attachment" "developer_eks" {
count = var.create_developer_user ? 1 : 0
user = aws_iam_user.developer[0].name
policy_arn = aws_iam_policy.developer_eks[0].arn
}
resource "aws_eks_access_entry" "developer" {
count = var.create_developer_user ? 1 : 0
cluster_name = aws_eks_cluster.eks.name
principal_arn = aws_iam_user.developer[0].arn
kubernetes_groups = ["viewer-group"]
}
## Manager
data "aws_caller_identity" "current" {}
resource "aws_iam_role" "eks_admin" {
count = var.create_manager_user ? 1 : 0
name = "${local.env}-${local.eks_name}-eks-admin"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"AWS": "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
}
}
]
}
POLICY
}
resource "aws_iam_policy" "eks_admin" {
count = var.create_manager_user ? 1 : 0
name = "AmazonEKSAdminPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"eks:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "eks.amazonaws.com"
}
}
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "eks_admin" {
count = var.create_manager_user ? 1 : 0
role = aws_iam_role.eks_admin[0].name
policy_arn = aws_iam_policy.eks_admin[0].arn
}
resource "aws_iam_user" "manager" {
count = var.create_manager_user ? 1 : 0
name = "WithinReason"
}
resource "aws_iam_policy" "eks_assume_admin" {
count = var.create_manager_user ? 1 : 0
name = "AmazonEKSAssumeAdminPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:AssumeRole"
],
"Resource": "${aws_iam_role.eks_admin[0].arn}"
}
]
}
POLICY
}
resource "aws_iam_user_policy_attachment" "manager" {
count = var.create_manager_user ? 1 : 0
user = aws_iam_user.manager[0].name
policy_arn = aws_iam_policy.eks_assume_admin[0].arn
}
# Best practice: use IAM roles due to temporary credentials
resource "aws_eks_access_entry" "manager" {
count = var.create_manager_user ? 1 : 0
cluster_name = aws_eks_cluster.eks.name
principal_arn = aws_iam_role.eks_admin[0].arn
kubernetes_groups = ["manager-group"]
}
# Metrics server
resource "helm_release" "metrics_server" {
count = var.deploy_metrics_server ? 1 : 0
name = "metrics-server"
repository = "https://charts.bitnami.com/bitnami"
chart = "metrics-server"
namespace = "metrics-server"
version = "7.2.16"
create_namespace = true
set {
name = "apiService.create"
value = "true"
}
depends_on = [
aws_eks_node_group.general
]
}
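Because the boolean flags are ordinary input variables, individual parts of the infrastructure can be switched off for a local test run without touching the script itself – for example with a terraform.tfvars file (the values here are just an illustration):
# terraform.tfvars – toggle optional parts of the infrastructure
deploy_metrics_server = false
create_developer_user = true
create_manager_user   = false
When the script later moves into the CI pipeline, the same flags can be supplied as GitLab CI/CD variables (TF_VAR_deploy_metrics_server and so on).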
There are plenty of videos on YouTube that help with understanding and putting together such a script. Links to one such series can be found in the references.
- here even in the article itself four files are given as an example, where main.tf contains the main infrastructure code, and the others – variables.tf, outputs.tf – hold only a small part of the whole logic, which is not substantially different from my preferred approach
References
- Create AWS VPC using Terraform: AWS EKS Kubernetes Tutorial – Part 1
- Create AWS EKS Cluster using Terraform: AWS EKS Kubernetes Tutorial – Part 2
- Add IAM User & IAM Role to AWS EKS: AWS EKS Kubernetes Tutorial – Part 3
- AWS Load Balancer Controller Tutorial (TLS): AWS EKS Kubernetes Tutorial – Part 6
- Solution: Error getting credentials
- Inconsistent “getting credentials: exec: executable aws failed with exit code 1” errors #2011
- Terraform tips & tricks: loops, if-statements, and gotchas
- AWS EKS cluster + GitLab CI (remote server)
- Terraform Best Practices. Code structure