Intro
This article focuses on infrastructure deployment using Terraform from a local machine. Before putting a script into a pipeline, I usually test how it behaves from the local environment. Initially I didn’t plan to write this post, because I wanted to cover all the material in another one. But while working on my Terraform script and explaining some of its aspects, I realized it would be better to split the narrative logically rather than mix everything into one pile.
Preparing a Terraform script
The script is a basic infrastructure with an EKS cluster, a VPC, two IAM roles (developer and manager), etc. At the next stage the script will be added to the CI pipeline and all variables will be turned into GitLab CI/CD variables. Only the main components are shown in the diagram; the script itself is at the very end of this section (see The TF-script).
I usually create TF-scripts with everything inside. This may be seen as an anti-pattern and to some extent against best practices 1, but the main reason I do it is something I run into every time – traceability. With many disparate files, each containing one or more somehow interconnected resources, traceability can be lost very quickly as the infrastructure expands. A lot of time is lost during debugging in particular, when an error appears after adding a new resource, or when local variables change and several resources need to be changed accordingly.
Personally, it’s easier for me to have one big script with all the infrastructure inside. Even if it’s the size of a tablecloth, I still find it easier to know that everything is here and that any defect will be found here, rather than jumping through dozens of files looking for bugs.
Naturally, you could argue that a competent IDE easily remedies this, but for me it doesn’t – it’s easier to find whatever I need with Ctrl+F. I use comments to divide the huge script into logical blocks, grouping closely related resources together.
Control flow
To make the script flexible enough, I use conditionals with the count parameter – for example, for the Metrics Server Helm release:
variable "deploy_metrics_server" {
description = "Flag to control metrics server deployment"
type = bool
default = true
}
# Metrics server
resource "helm_release" "metrics_server" {
count = var.deploy_metrics_server ? 1 : 0
name = "metrics-server"
repository = "https://charts.bitnami.com/bitnami"
chart = "metrics-server"
namespace = "metrics-server"
version = "7.2.16"
create_namespace = true
set {
name = "apiService.create"
value = "true"
}
depends_on = [
aws_eks_node_group.general
]
}
If deploy_metrics_server is set to true, count evaluates to 1 and an instance of the Metrics Server is deployed in the cluster; otherwise the resource is not created at all.
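A side effect of gating a resource with count is that it becomes a list, so any reference to it elsewhere has to go through an index or a splat expression. A minimal sketch – the output name here is mine, purely for illustration:
output "metrics_server_status" {
  # one() returns the single element when the release exists (count = 1)
  # and null when the flag disables it (count = 0)
  value = one(helm_release.metrics_server[*].status)
}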
Terraform modules
I decided not to use TF-modules, because several times I have run into a pile of problems when it was time to update a script. A module is essentially a custom assembly of many disparate resources, aimed at convenience and quick deployment, with all the functionality hidden inside. It looks like you can just take it and use it, but when the module’s major version changes, many components can turn out to be incompatible because of conflicting parameters that either have to be added or have to be removed.
For example, I once had a hell of a battle switching the EKS module from version 17 to version 20. It was legacy code that had to be kept alive, and initially there were no problems, but one day I needed to update it: version 17 is simply too old – it has no notion of node groups, only worker groups, and node groups were exactly what I desperately needed. Having failed with such a drastic upgrade, I decided to upgrade sequentially – from 17 to 18, from 18 to 19, from 19 to 20 – but that was painful too, because the module stubbornly refused to work: there were fundamental differences in the authentication methods between versions.
Kubernetes Provider
One of the stumbling blocks was getting credentials for the providers, particularly for Kubernetes:
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = ["eks", "get-token", "--cluster-name", aws_eks_cluster.eks.name]
command = "aws"
}
}
After a couple of hours of trial and error, the reason was found:
“Based on the provided configuration, it seems that you are affected by a bug in Terraform Cloud where in some circumstances when using that authentication method the awscli executable which should be installed on Terraform Cloud agent node gets installed slower making it unavailable at the time of awscli command execution. This is the reason why sometimes the run is successful, but sometimes it fails.”
So the Kubernetes provider should use a token instead:
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
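The token here comes from two data sources that read the cluster after it has been created (they appear again in the full script below):
data "aws_eks_cluster" "eks" {
  name = aws_eks_cluster.eks.name
}

data "aws_eks_cluster_auth" "eks" {
  # Short-lived authentication token for the cluster API
  name = aws_eks_cluster.eks.name
}
Since the token is short-lived, this works well for a plan or apply run from a local machine, where a fresh token is requested on every run.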
The TF-script
# Local variables
locals {
env = "staging"
region = "eu-central-1"
zoneA = "eu-central-1a"
zoneB = "eu-central-1b"
zoneC = "eu-central-1c"
eks_version = "1.31"
eks_name = "test-nest"
}
# Variables
variable "deploy_metrics_server" {
description = "Flag to control metrics server deployment"
type = bool
default = true
}
variable "create_developer_user" {
description = "Flag to control developer user creation"
type = bool
default = true
}
variable "create_manager_user" {
description = "Flag to control manager user creation"
type = bool
default = true
}
# Data
data "aws_eks_cluster" "eks" {
name = aws_eks_cluster.eks.name
}
data "aws_eks_cluster_auth" "eks" {
name = aws_eks_cluster.eks.name
}
# Providers
provider "aws" {
region = local.region
profile = "sobercounsel"
shared_credentials_files = ["~/.aws/credentials"]
}
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.53"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "2.35.0"
}
helm = {
source = "hashicorp/helm"
version = "2.16.1"
}
}
}
provider "helm" {
kubernetes {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
}
provider "kubernetes" {
host = data.aws_eks_cluster.eks.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
token = data.aws_eks_cluster_auth.eks.token
}
# Networking
resource "aws_vpc" "aws-vpc" {
cidr_block = "10.0.0.0/16"
enable_dns_support = true
enable_dns_hostnames = true
tags = {
Name = "${local.env}-vpc"
}
}
resource "aws_internet_gateway" "aws-igw" {
vpc_id = aws_vpc.aws-vpc.id
tags = {
Name = "${local.env}-igw"
}
}
resource "aws_subnet" "privateA" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.0.0/19"
availability_zone = local.zoneA
tags = {
Name = "${local.env}-private-${local.zoneA}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "privateB" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.32.0/19"
availability_zone = local.zoneB
tags = {
Name = "${local.env}-private-${local.zoneB}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "privateC" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.64.0/19"
availability_zone = local.zoneC
tags = {
Name = "${local.env}-private-${local.zoneC}"
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicA" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.96.0/19"
availability_zone = local.zoneA
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneA}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicB" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.128.0/19"
availability_zone = local.zoneB
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneB}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_subnet" "publicC" {
vpc_id = aws_vpc.aws-vpc.id
cidr_block = "10.0.160.0/19"
availability_zone = local.zoneC
map_public_ip_on_launch = true
tags = {
Name = "${local.env}-private-${local.zoneC}"
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${local.env}-${local.eks_name}" = "owned"
}
}
resource "aws_eip" "aws-eip" {
domain = "vpc"
tags = {
Name = "${local.env}-nat"
}
}
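# A single NAT gateway (placed in publicA) serves all three private subnets.
# That keeps the staging setup cheap, but it is not redundant across AZs.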
resource "aws_nat_gateway" "aws-nat-gw" {
allocation_id = aws_eip.aws-eip.id
subnet_id = aws_subnet.publicA.id
tags = {
Name = "${local.env}-nat"
}
depends_on = [
aws_internet_gateway.aws-igw
]
}
resource "aws_route_table" "aws-rt-private" {
vpc_id = aws_vpc.aws-vpc.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.aws-nat-gw.id
}
tags = {
Name = "${local.env}-private"
}
}
resource "aws_route_table" "aws-rt-public" {
vpc_id = aws_vpc.aws-vpc.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.aws-igw.id
}
tags = {
Name = "${local.env}-public"
}
}
resource "aws_route_table_association" "privateA" {
subnet_id = aws_subnet.privateA.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "privateB" {
subnet_id = aws_subnet.privateB.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "privateC" {
subnet_id = aws_subnet.privateC.id
route_table_id = aws_route_table.aws-rt-private.id
}
resource "aws_route_table_association" "publicA" {
subnet_id = aws_subnet.publicA.id
route_table_id = aws_route_table.aws-rt-public.id
}
resource "aws_route_table_association" "publicB" {
subnet_id = aws_subnet.publicB.id
route_table_id = aws_route_table.aws-rt-public.id
}
resource "aws_route_table_association" "publicC" {
subnet_id = aws_subnet.publicC.id
route_table_id = aws_route_table.aws-rt-public.id
}
# EKS
resource "aws_iam_role" "eks" {
name = "${local.env}-${local.eks_name}-eks-cluster"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"Service": "eks.amazonaws.com"
}
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "eks" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.eks.name
}
resource "aws_eks_cluster" "eks" {
name = "${local.env}-${local.eks_name}"
version = local.eks_version
role_arn = aws_iam_role.eks.arn
vpc_config {
endpoint_private_access = false
endpoint_public_access = true
subnet_ids = [
aws_subnet.privateA.id,
aws_subnet.privateB.id,
aws_subnet.privateC.id
]
}
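# The "API" authentication mode grants cluster access via aws_eks_access_entry
# resources (see the developer and manager entries below) instead of the
# legacy aws-auth ConfigMap.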
access_config {
authentication_mode = "API"
bootstrap_cluster_creator_admin_permissions = true
}
depends_on = [
aws_iam_role_policy_attachment.eks
]
}
# Nodes
resource "aws_iam_role" "nodes" {
name = "${local.env}-${local.eks_name}-eks-nodes"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ec2.amazonaws.com"
}
}
]
}
POLICY
}
# This policy now includes AssumeRoleForPodIdentity for the Pod Identity Agent
resource "aws_iam_role_policy_attachment" "amazon_eks_worker_node_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.nodes.name
}
resource "aws_iam_role_policy_attachment" "amazon_eks_cni_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.nodes.name
}
resource "aws_iam_role_policy_attachment" "amazon_ec2_container_registry_read_only" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.nodes.name
}
resource "aws_eks_node_group" "general" {
cluster_name = aws_eks_cluster.eks.name
version = local.eks_version
node_group_name = "general"
node_role_arn = aws_iam_role.nodes.arn
subnet_ids = [
aws_subnet.privateA.id,
aws_subnet.privateB.id,
aws_subnet.privateC.id
]
capacity_type = "SPOT"
instance_types = ["t3.small"]
scaling_config {
desired_size = 2
max_size = 10
min_size = 1
}
update_config {
max_unavailable = 1
}
labels = {
role = "general"
}
depends_on = [
aws_iam_role_policy_attachment.amazon_eks_worker_node_policy,
aws_iam_role_policy_attachment.amazon_eks_cni_policy,
aws_iam_role_policy_attachment.amazon_ec2_container_registry_read_only,
]
# Allow external changes without Terraform plan difference
lifecycle {
ignore_changes = [scaling_config[0].desired_size]
}
}
# K8S Roles & Role Bindings
## Developer
resource "kubernetes_cluster_role" "viewer" {
metadata {
name = "viewer"
}
rule {
api_groups = ["*"]
resources = [
"namespaces",
"pods",
"configmaps",
"secrets",
"services"
]
verbs = [
"get",
"list",
"watch"
]
}
}
resource "kubernetes_cluster_role_binding" "viewer-binding" {
metadata {
name = "viewer-binding"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "cluster-admin"
}
subject {
kind = "Group"
name = "viewer-group"
api_group = "rbac.authorization.k8s.io"
}
}
## Manager
resource "kubernetes_cluster_role_binding" "admin-binding" {
metadata {
name = "admin-binding"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "cluster-admin"
}
subject {
kind = "User"
name = "admin"
api_group = "rbac.authorization.k8s.io"
}
subject {
kind = "ServiceAccount"
name = "default"
namespace = "kube-system"
}
subject {
kind = "Group"
name = "manager-group"
api_group = "rbac.authorization.k8s.io"
}
}
# IAM
## Developer
resource "aws_iam_user" "developer" {
count = var.create_developer_user ? 1 : 0
name = "LongView"
}
resource "aws_iam_policy" "developer_eks" {
count = var.create_developer_user ? 1 : 0
name = "AmazonEKSDeveloperPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"eks:DescribeCluster",
"eks:ListClusters"
],
"Resource": "*"
}
]
}
POLICY
}
resource "aws_iam_user_policy_attachment" "developer_eks" {
count = var.create_developer_user ? 1 : 0
user = aws_iam_user.developer[0].name
policy_arn = aws_iam_policy.developer_eks[0].arn
}
resource "aws_eks_access_entry" "developer" {
count = var.create_developer_user ? 1 : 0
cluster_name = aws_eks_cluster.eks.name
principal_arn = aws_iam_user.developer[0].arn
kubernetes_groups = ["viewer-group"]
}
## Manager
data "aws_caller_identity" "current" {}
resource "aws_iam_role" "eks_admin" {
count = var.create_manager_user ? 1 : 0
name = "${local.env}-${local.eks_name}-eks-admin"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"AWS": "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
}
}
]
}
POLICY
}
resource "aws_iam_policy" "eks_admin" {
count = var.create_manager_user ? 1 : 0
name = "AmazonEKSAdminPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"eks:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "eks.amazonaws.com"
}
}
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "eks_admin" {
count = var.create_manager_user ? 1 : 0
role = aws_iam_role.eks_admin[0].name
policy_arn = aws_iam_policy.eks_admin[0].arn
}
resource "aws_iam_user" "manager" {
count = var.create_manager_user ? 1 : 0
name = "WithinReason"
}
resource "aws_iam_policy" "eks_assume_admin" {
count = var.create_manager_user ? 1 : 0
name = "AmazonEKSAssumeAdminPolicy"
policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:AssumeRole"
],
"Resource": "${aws_iam_role.eks_admin[0].arn}"
}
]
}
POLICY
}
resource "aws_iam_user_policy_attachment" "manager" {
count = var.create_manager_user ? 1 : 0
user = aws_iam_user.manager[0].name
policy_arn = aws_iam_policy.eks_assume_admin[0].arn
}
# Best practice: use IAM roles due to temporary credentials
resource "aws_eks_access_entry" "manager" {
count = var.create_manager_user ? 1 : 0
cluster_name = aws_eks_cluster.eks.name
principal_arn = aws_iam_role.eks_admin[0].arn
kubernetes_groups = ["manager-group"]
}
# Metrics server
resource "helm_release" "metrics_server" {
count = var.deploy_metrics_server ? 1 : 0
name = "metrics-server"
repository = "https://charts.bitnami.com/bitnami"
chart = "metrics-server"
namespace = "metrics-server"
version = "7.2.16"
create_namespace = true
set {
name = "apiService.create"
value = "true"
}
depends_on = [
aws_eks_node_group.general
]
}
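Because the boolean flags are ordinary input variables, individual parts of the infrastructure can be switched off for a local test run without touching the script itself – for example with a terraform.tfvars file (the values here are just an illustration):
# terraform.tfvars – toggle optional parts of the infrastructure
deploy_metrics_server = false
create_developer_user = true
create_manager_user   = false
When the script later moves into the CI pipeline, the same flags can be supplied as GitLab CI/CD variables (TF_VAR_deploy_metrics_server and so on).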
There are plenty of videos on YouTube that help with understanding and putting together such a script. Links to one such series can be found in the references.
- here even in the article itself four files are given as an example, where main.tf contains the main infrastructure code, and the others – variables.tf, outputs.tf – hold only a small part of the whole logic, which is not substantially different from my preferred approach
References
- Create AWS VPC using Terraform: AWS EKS Kubernetes Tutorial – Part 1
- Create AWS EKS Cluster using Terraform: AWS EKS Kubernetes Tutorial – Part 2
- Add IAM User & IAM Role to AWS EKS: AWS EKS Kubernetes Tutorial – Part 3
- AWS Load Balancer Controller Tutorial (TLS): AWS EKS Kubernetes Tutorial – Part 6
- Solution: Error getting credentials
- Inconsistent “getting credentials: exec: executable aws failed with exit code 1” errors #2011
- Terraform tips & tricks: loops, if-statements, and gotchas
- AWS EKS cluster + GitLab CI (remote server)
- Terraform Best Practices. Code structure