Terraform: count vs for_each

Both meta-arguments produce N copies of a resource. They look interchangeable in the config. They are not. The difference shows up in the state file — and that's where the production failures live.

What the Terraform state looks like

With count, Terraform indexes resources positionally:

resource "aws_iam_user" "team" {
  count = length(var.usernames)
  name  = var.usernames[count.index]
}

State addresses:

aws_iam_user.team[0]
aws_iam_user.team[1]
aws_iam_user.team[2]

With for_each, Terraform keys resources by string:

resource "aws_iam_user" "team" {
  for_each = toset(var.usernames)
  name     = each.value
}

State addresses:

aws_iam_user.team["alice"]
aws_iam_user.team["bob"]
aws_iam_user.team["carol"]

Same three users, two very different state shapes. The first is positional; the second is named. Everything below follows from that.

The re-indexing trap

Remove bob from the middle of var.usernames and re-plan.

With count, indices shift under your feet:

aws_iam_user.team[0]   alice  -> alice    (no change)
aws_iam_user.team[1]   bob    -> carol    (replace)
aws_iam_user.team[2]   carol  -> (gone)   (destroy)

Terraform sees team[1] as “the user formerly known as bob, now carol” and plans to destroy and recreate it. team[2] is gone. You wanted to delete one user; Terraform plans to destroy two and recreate one.

With for_each, the keyspace is stable:

aws_iam_user.team["alice"]   no change
aws_iam_user.team["bob"]     destroy
aws_iam_user.team["carol"]   no change

Bob goes. Everyone else stays put. The plan does what your brain expected.

Why this matters

The IAM-user example is harmless — users get recreated and life goes on. The shape of the bug is the same when the resources have real side effects:

  • A list of aws_db_instance indexed by count. Remove one from the middle and Terraform plans to destroy and recreate every RDS instance after it. Each replacement is a multi-hour, multi-snapshot affair (the shape is sketched just after this list).
  • Route-table entries or VPC peering links keyed positionally. Re-indexing breaks connectivity for every entry past the one you removed, often silently until traffic hits it.
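
For concreteness, the shape behind the first bullet. A hypothetical sketch; the variable and resource names are made up:

resource "aws_db_instance" "replica" {
  count               = length(var.replica_identifiers)
  identifier          = var.replica_identifiers[count.index]
  replicate_source_db = aws_db_instance.primary.identifier
  # ...
}

Remove the first element of var.replica_identifiers and every replica after it shifts to a new index and a new identifier, and the plan turns into destroy-and-recreate for all of them.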

The common thread: anything with a meaningful identity, stored at a positional address. The plan output reads “destroy 12, create 12” and the apply does what it says.

How do you work around this? Enter moved

You can't just rewrite the resource. State still has [0], [1], [2]; the config now expects ["alice"], ["bob"], ["carol"]. Terraform's plan:

Terraform will perform the following actions:

  # aws_iam_user.team[0] will be destroyed
  # aws_iam_user.team[1] will be destroyed
  # aws_iam_user.team[2] will be destroyed
  # aws_iam_user.team["alice"] will be created
  # aws_iam_user.team["bob"] will be created
  # aws_iam_user.team["carol"] will be created

Plan: 3 to add, 0 to change, 3 to destroy.

Exactly the failure mode this post is about.

moved blocks tell Terraform to rename addresses in state instead of destroying and recreating:

moved {
  from = aws_iam_user.team[0]
  to   = aws_iam_user.team["alice"]
}

moved {
  from = aws_iam_user.team[1]
  to   = aws_iam_user.team["bob"]
}

moved {
  from = aws_iam_user.team[2]
  to   = aws_iam_user.team["carol"]
}

Plan output drops to 0 to add, 0 to change, 0 to destroy — pure address rewrites, applied as part of the next normal run. No destroy, no recreate, no downtime.
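
Abbreviated, that plan reads like this (Terraform prints the full resource body under each moved line; elided here):

  # aws_iam_user.team[0] has moved to aws_iam_user.team["alice"]
  # aws_iam_user.team[1] has moved to aws_iam_user.team["bob"]
  # aws_iam_user.team[2] has moved to aws_iam_user.team["carol"]

Plan: 0 to add, 0 to change, 0 to destroy.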

A few practical notes from doing this on real fleets:

  • The from addresses must match state exactly. Run terraform state list against the workspace before writing the blocks.
  • For lists of more than ~10 items, generate the moved blocks with a quick script over terraform state list output. Hand-writing fifty of them is how typos sneak in and resources end up orphaned.
  • Once the migration has been applied, the moved blocks can be removed from the code. They have no effect on subsequent runs — the addresses already exist at their new keys in state. Removing them costs nothing: terraform plans a no-op and the codebase stays clean. Some teams keep them around as an inline changelog of past refactors, which is fine too. Style call, not a safety call.
  • The migration only covers items that already exist at the old address. New items go in through for_each as normal.
  • For modules being refactored, moved works across module boundaries too — same syntax, fully-qualified addresses on both sides. A sketch follows this list.
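
Something like this, with a hypothetical module name; say the same refactor also pulls the routes into a network module:

moved {
  from = aws_route.peering[0]
  to   = module.network.aws_route.peering["rtb-aaaaaaaaaaaaaaaaa"]
}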

Real cases from production

The same shape of bug shows up across totally different parts of a Terraform codebase — AWS networking primitives one day, Kubernetes manifests rendered by Helm the next. Two cases from a multi-region production fleet.

VPC peering routes

Earlier I said route-table entries keyed positionally are a classic version of this bug. Here is how it actually played out.

Cross-region VPC peerings install routes into every route table on each side. The original module used count:

resource "aws_route" "region_a_to_region_b_peering" {
  provider                  = aws.region_a
  count                     = length(data.aws_route_tables.region_a_vpc.ids)
  route_table_id            = data.aws_route_tables.region_a_vpc.ids[count.index]
  destination_cidr_block    = data.aws_vpc.region_b_vpc.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.region_a_to_region_b_peering.id
}

The trigger was unrelated work elsewhere in the network stack: a project added new private subnets to all production clusters. New subnets meant new route tables, and data.aws_route_tables.region_a_vpc.ids started returning the IDs in a different order. Every aws_route keyed by count re-indexed silently.

The plan came back proposing to destroy and recreate every cross-region peering route across every production cluster. Hundreds of lines of aws_route.<name>[N] will be destroyed and replaced.

Backend applications reach RDS, ElastiCache, OpenSearch, and other regional endpoints over those peering routes, including cluster-to-cluster traffic. A momentary route deletion is not free; even a short gap drops everything in flight, across regions. Not a plan you apply.

The fix was to key by the route table ID itself:

-  count                     = length(data.aws_route_tables.region_a_vpc.ids)
-  route_table_id            = data.aws_route_tables.region_a_vpc.ids[count.index]
+  for_each                  = toset(data.aws_route_tables.region_a_vpc.ids)
+  route_table_id            = each.value

The whole resource, after the change:

resource "aws_route" "region_a_to_region_b_peering" {
  provider                  = aws.region_a
  for_each                  = toset(data.aws_route_tables.region_a_vpc.ids)
  route_table_id            = each.value
  destination_cidr_block    = data.aws_vpc.region_b_vpc.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.region_a_to_region_b_peering.id
}

Route table IDs do not move. Adding new subnets — and therefore new route tables — only adds new entries to the map. Nothing existing changes address.

Paired with one moved block per existing route, per peering, per region:

moved {
  from = aws_route.region_a_to_region_b_peering[0]
  to   = aws_route.region_a_to_region_b_peering["rtb-aaaaaaaaaaaaaaaaa"]
}

moved {
  from = aws_route.region_a_to_region_b_peering[1]
  to   = aws_route.region_a_to_region_b_peering["rtb-bbbbbbbbbbbbbbbbb"]
}
# ...one per route table, across every peering and every region.

The plan dropped from “destroy and recreate everything” to 0 to add, 0 to change, 0 to destroy. Pure state rewrites. Inter-region traffic kept flowing.

ingress-nginx upgrades

The task: ship a major version upgrade of ingress-nginx. The new chart adds a handful of Kubernetes resources alongside the breaking changes called out in the release notes. Routine upgrade work — proactive, of course. Cough. An ingress-nginx CVE knocked on the door.

Side note: ingress-nginx is being retired upstream — in case you missed it. The pattern in this section still applies to whatever you migrate to.

The ingress-nginx manifests are rendered with helm template and applied via Terraform as kubectl_manifest resources:

resource "kubectl_manifest" "ingress_nginx" {
  count     = length(data.kubectl_path_documents.ingress_nginx.documents)
  yaml_body = element(data.kubectl_path_documents.ingress_nginx.documents, count.index)
}
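
For context, the data source feeding that count. A sketch that assumes the chart output is rendered into the module directory; the pattern path is hypothetical:

data "kubectl_path_documents" "ingress_nginx" {
  pattern = "${path.module}/rendered/ingress-nginx/*.yaml"
}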

Plan time: the new chart emits more documents in a different order, and every kubectl_manifest.ingress_nginx[N] past the inserted resource re-indexes silently.

The plan output includes lines like:

kubectl_manifest.ingress_nginx[12] will be destroyed
kubectl_manifest.ingress_nginx[13] will be destroyed
# ...and so on, for every resource past the insertion point.

Among those destroys: the ingress-nginx Deployment and its Service of type LoadBalancer backed by an NLB. Destroying and recreating that Service tears down the NLB and provisions a new one — new ARN, new DNS name. Traffic blackholes until DNS records and any upstream load balancers catch up. For minutes, possibly longer.

Can you afford that in prod? No. Ingress is the chokepoint every external request flows through. A new NLB on every helm upgrade is not a deploy strategy.

The migration is identical in shape to the route-table case:

resource "kubectl_manifest" "ingress_nginx" {
  for_each  = data.kubectl_path_documents.ingress_nginx.manifests
  yaml_body = each.value
}

The manifests attribute is a map keyed by each rendered manifest's API path — the resource's identity in the cluster, not its position in the file. Add or remove resources between upgrades and the keys for everything else don't change.

For kubectl_manifest resources backed by Helm output, for_each is far more resilient to chart refactors than count: keys survive upgrades, positional indices don't.

The moved blocks key into the new addresses by API path:

moved {
  from = kubectl_manifest.ingress_nginx[0]
  to   = kubectl_manifest.ingress_nginx["/api/v1/namespaces/ingress-nginx"]
}

moved {
  from = kubectl_manifest.ingress_nginx[1]
  to   = kubectl_manifest.ingress_nginx["/api/v1/namespaces/ingress-nginx/serviceaccounts/ingress-nginx"]
}

# ...

moved {
  from = kubectl_manifest.ingress_nginx[39]
  to   = kubectl_manifest.ingress_nginx["/api/v1/namespaces/ingress-nginx/services/ingress-nginx-controller"]
}

# ...

moved {
  from = kubectl_manifest.ingress_nginx[42]
  to   = kubectl_manifest.ingress_nginx["/apis/apps/v1/namespaces/ingress-nginx/deployments/ingress-nginx-controller"]
}

# ...one per existing manifest, generated from `terraform state list`.

Same outcome: the destroy plan disappears. With the moved blocks in place, the migration itself is a pure state rewrite (0 to add, 0 to change, 0 to destroy). The next plan, with the upgraded chart applied, shows only in-place updates for resources whose YAML changed and creates for the new ones the chart added — no deletions at all. The NLB stays where it is.

The for_each tradeoff

for_each keys must be known at plan time. Many resource attributes — an EC2 instance's private IP, an RDS endpoint, a randomly generated password, a resource's ARN — are only determined by the cloud provider at creation time. Use any of those as a for_each key on a resource being created in the same plan and Terraform errors out:

The for_each map includes keys derived from resource attributes
that cannot be determined until apply.

count sidesteps this: it only needs the number of items, and a list's length is usually known at plan time even when its values are not.
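
A minimal sketch of both shapes, with hypothetical resources; the two blocks are alternatives, not meant to coexist:

# Errors at plan time: private IPs of instances created in the
# same plan are unknown, so the set of keys is unknown.
resource "aws_route53_record" "per_worker" {
  for_each = toset(aws_instance.worker[*].private_ip)
  # ...
}

# Plans fine: the number of instances is known at plan time
# even when their IPs are not.
resource "aws_route53_record" "per_worker" {
  count = length(aws_instance.worker)
  # ...
}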

In practice this bites only some for_each cases, and even then there are usually workarounds that keep for_each the right call. A topic for another post.

Rule of thumb

Default to for_each when items have identity. Default to count when they do not — fungible replicas, on/off toggles, conditional creation:

# fungible workers, no identity
resource "aws_instance" "worker" {
  count = var.worker_count
  # ...
}

# conditional creation
resource "aws_cloudwatch_metric_alarm" "high_cost" {
  count = var.enable_cost_alarms ? 1 : 0
  # ...
}

for_each = var.enabled ? toset(["this"]) : toset([]) works for a toggle but reads worse than count = var.enabled ? 1 : 0. Pick the one that says what you mean.

When the wrong choice is already in production, moved blocks turn a destructive plan into a state-only change — the difference between a Tuesday afternoon refactor and a Tuesday afternoon incident.

Yes, it is grunt work — lots of terraform state list inspection and dozens of moved blocks for resources nobody is looking at. But if you lean on Terraform heavily and have not hit the re-indexing trap yet, you will, and it tends to land on the day you can least afford it: mid-incident, mid-upgrade, mid-emergency-refactor. The proactive Tuesday version costs a lot less than the reactive 3am one.
