Compute Node Orchestration for Spatial Infrastructure as Code

Compute node orchestration is the discipline of declaring, scaling, and retiring the ephemeral execution layer that runs geospatial workloads — tile renderers, GDAL reprojection batches, vector ETL runners, and OGC geoprocessing workers — entirely as version-controlled code. The recurring operational problem is that generic virtual-machine provisioning treats a compute node as a long-lived pet, while spatial platforms need fleets of disposable, workload-aware workers that boot with a known runtime, attach to a persistent data tier, and disappear without leaving state behind. This work sits within the broader Geospatial Resource Provisioning domain and depends directly on two sibling resource types: the authoritative data tier defined in PostGIS Cluster Provisioning and the durable asset store defined in Object Storage for Raster/Vector Workloads. Orchestrating compute well means treating nodes as a stateless, replaceable layer bolted onto those two persistent dependencies, with every scaling decision, IAM boundary, and network path expressed in Terraform or Pulumi rather than improvised at the console.

Environment Parity and Configuration Drift Mitigation

Parity in compute orchestration begins with the machine image, because a node that boots with the wrong GDAL, PROJ, or GEOS build is an interoperability incident in waiting. Teams should standardize on pre-baked, hardened images — AMIs, VM templates, or container base layers — that pin the entire geospatial runtime: the GDAL release and its driver matrix, PROJ’s transformation grids, GEOS, and any spatial indexing utilities the workload calls. The image is built and promoted through a pipeline, given an immutable identifier, and referenced in code by that identifier rather than by a moving latest tag. When the image identifier is a Terraform variable, every workspace — development, staging, production — provisions from the byte-identical artifact, and a terraform plan or pulumi preview yields the same resource graph regardless of target.

Drift creeps in through three channels that orchestration code must close. The first is launch-template skew: an operator hand-edits a running instance’s user data or instance profile, the next scale-out event launches from the unmodified template, and the fleet ends up heterogeneous. Codifying the launch template and forbidding direct instance mutation through policy keeps the fleet uniform. The second is runtime configuration divergence: kernel sysctl values, file-descriptor limits for tile caches, and JVM heap settings for Java engines drifting between environments. These belong in the baked image or in a versioned cloud-init document, never in a manual SSH session. The third is endpoint resolution: nodes must discover spatial database endpoints and object-store buckets dynamically through service discovery or parameter-store lookups, never through hardcoded IPs, so that a database failover or a region promotion does not silently strand a fleet against a dead address.

Version pinning closes the loop. Every provider, module, and runtime dependency carries an explicit constraint; an unpinned provider can change a default and force-replace a launch template on the next apply, rolling the entire fleet for no declared reason. Only parameterized variables — region identifiers, scaling bounds, instance families, and cost-allocation tags — should differ between environments, and those differences live in tracked .tfvars or Pulumi config, not in code paths.
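A minimal sketch of that split, reusing the variable names from the configuration later on this page. The validation rules and the sample values are illustrative assumptions, not recommended settings:

```hcl
# variables.tf (sketch): the only inputs allowed to differ per environment.
variable "environment" {
  type        = string
  description = "Deployment stage; drives tags, not code paths."

  validation {
    condition     = contains(["development", "staging", "production"], var.environment)
    error_message = "The environment value must be development, staging, or production."
  }
}

variable "geospatial_ami_id" {
  type        = string
  description = "Immutable id of the baked GDAL/PROJ/GEOS image."

  validation {
    # Reject names and aliases such as "latest"; only a concrete AMI id passes.
    condition     = can(regex("^ami-[0-9a-f]{8,17}$", var.geospatial_ami_id))
    error_message = "The geospatial_ami_id value must be a concrete AMI id (ami-...)."
  }
}

variable "worker_instance_type" {
  type        = string
  description = "Instance family and size for render workers."
}

variable "worker_min" {
  type        = number
  description = "Lower scaling bound."
}

variable "worker_max" {
  type        = number
  description = "Upper scaling bound."
}
```

A tracked per-environment file then carries the values. The file name, AMI id, and sizes below are hypothetical placeholders:

```hcl
# production.tfvars (hypothetical values)
environment          = "production"
geospatial_ami_id    = "ami-0123456789abcdef0"
worker_instance_type = "r6i.2xlarge"
worker_min           = 2
worker_max           = 24
```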

CI/CD Validation and Operational Guardrails

Compute orchestration must move through the pipeline as an auditable, state-managed change, not an ad-hoc administrative action. The pipeline runs ordered gates before any apply touches a running fleet: HCL or TypeScript syntax validation, dependency resolution, policy-as-code evaluation, and drift detection against live infrastructure. Each gate has a defined failure behavior — a policy violation halts the run and emits an audit record rather than degrading to a partial apply.

The geospatial-specific checks are what separate a generic deploy gate from a spatial one. Pre-apply validation should confirm that the referenced machine image actually exists and carries the expected GDAL driver set, that the target subnets have routes to both the PostGIS endpoint and the object-store VPC endpoint, and that scaling bounds will not exceed the cloud account’s instance quota for the chosen family. Policy-as-code rules enforce mandatory cost-allocation tags, forbid public IP assignment on worker nodes, and require that every launch template references an approved instance profile. Quota checks matter acutely for spatial fleets because a tile-rendering surge can request dozens of GPU or high-memory instances at once and hit a soft limit mid-scale, leaving a half-provisioned fleet behind a load balancer.
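Two of these checks can be approximated inside Terraform itself. The sketch below assumes the image pipeline writes a `GdalVersion` tag on each AMI, that the AMI is owned by the deploying account, and that `var.expected_gdal_version` exists; all three are hypothetical. The quota code shown is assumed to be the vCPU quota for standard On-Demand families and should be verified for the family in use. A tag only asserts a version, so a full driver-matrix check would still need a boot-time or image-build test.

```hcl
# Fails the plan if the pinned image is missing or lacks the expected runtime tag.
data "aws_ami" "geospatial" {
  owners = ["self"] # assumption: the image is built in this account

  filter {
    name   = "image-id"
    values = [var.geospatial_ami_id]
  }

  lifecycle {
    postcondition {
      # "GdalVersion" is a hypothetical tag written by the image pipeline.
      condition     = try(self.tags["GdalVersion"], "") == var.expected_gdal_version
      error_message = "The pinned AMI does not carry the expected GdalVersion tag."
    }
  }
}

data "aws_ec2_instance_type" "worker" {
  instance_type = var.worker_instance_type
}

data "aws_servicequotas_service_quota" "on_demand_standard" {
  service_code = "ec2"
  quota_code   = "L-1216C47A" # assumed: vCPU quota for standard On-Demand families
}

# Compares the fleet's worst-case vCPU demand with the account quota.
check "fleet_fits_quota" {
  assert {
    condition = (
      var.worker_max * data.aws_ec2_instance_type.worker.default_vcpus
      <= data.aws_servicequotas_service_quota.on_demand_standard.value
    )
    error_message = "The worker_max bound could exceed the account vCPU quota for this family."
  }
}
```

The postcondition stops the run, whereas a `check` block only reports a warning, so a pipeline that wants a hard gate on quota would need to fail on that warning or move the assertion into a precondition. The quota comparison also ignores instances already running elsewhere in the account, so it is an upper-bound sanity check rather than a guarantee.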

Rollback triggers close the safety loop. Health probes that watch render latency, queue depth, and node readiness must be able to abort a deployment and revert the launch-template version if a new image regresses. Wiring these gates into pull-request workflows — the same discipline applied to the data tier in PostGIS Cluster Provisioning — ensures compute changes face the same scrutiny as application code, and the orchestration engine’s behavior under that workflow is shaped by the trade-offs covered in Terraform vs Pulumi for Spatial Infrastructure as Code.

Resource Architecture and Service Integration

A compute node is the transient center of a small constellation of persistent services, and orchestration code is mostly about wiring those edges correctly. Upstream, the fleet draws work from a queue or a load balancer and pulls its runtime from the baked image. Sideways, it authenticates to a spatial database and to object storage. Downstream, it emits telemetry and writes results back to durable storage. Each edge is a provisioning decision with a geospatial cost or risk attached.

The data-tier edge is the most consequential. Workers should resolve the database through the managed endpoint and connection-pooling layer established in PostGIS Cluster Provisioning, so that read-heavy tile generation routes to replicas while write paths reach the primary. The asset edge runs to Object Storage for Raster/Vector Workloads: heavy raster reads, cloud-optimized GeoTIFF range requests, and archival writes all traverse this path, which is why the security group in the configuration below limits HTTPS egress to the object-store VPC endpoint rather than the open internet. Egress is the silent cost driver on this edge: object traffic sent through a NAT gateway is billed per gigabyte processed, so it should stay on the cloud backbone through a VPC endpoint instead.
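One way the object-store path might be declared is an S3 gateway endpoint whose prefix list feeds the worker security group. This is a sketch under assumptions: `var.aws_region` and `var.private_route_table_ids` are hypothetical inputs, and the endpoint may already be owned by a separate networking stack.

```hcl
# Gateway endpoint: adds S3 routes to the private route tables so object
# traffic bypasses the NAT gateway.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids # hypothetical input

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Exposed so worker security groups can target it for HTTPS egress.
output "s3_endpoint_prefix_list_id" {
  description = "Prefix list id of the S3 gateway endpoint."
  value       = aws_vpc_endpoint.s3.prefix_list_id
}
```

The output value is what a variable such as `s3_endpoint_prefix_list_id` would receive when the endpoint lives in another stack.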

The publishing edge connects to the map middleware. When the workload is a Java spatial engine, the node lifecycle and JVM tuning follow GeoServer Deployment Patterns, which govern heap limits, OGC endpoint hardening, and patch rotation without dropping in-flight requests. The network edges — subnet placement, route tables, and security-group reachability to tile clients — are governed by the conventions in VPC Routing for Tile Servers and Security Group Hardening. The most common concrete realization of this whole architecture is the demand-driven render fleet detailed in Auto-Scaling EC2 Instances for WMS Endpoints, where queue depth and request latency drive scale-out instead of static capacity planning.

Runnable Configuration

The following Terraform defines a stateless worker fleet: an encrypted, locked remote state backend, a least-privilege instance role, a launch template pinned to a baked geospatial image, a restrictive security group, and an auto-scaling group whose bounds are parameterized per environment. Provider versions are pinned so an upstream default change cannot force-replace the launch template.

terraform {
  required_version = ">= 1.5"
  required_providers {
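    # Pessimistic constraint: accept 5.x releases, but never a new major
    # version whose changed defaults could force-replace the template.
    aws = { source = "hashicorp/aws", version = "~> 5.0" }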
  }

  # State must be encrypted, versioned, and locked: a concurrent apply
  # against a live fleet can corrupt the launch-template version chain.
  backend "s3" {
    bucket         = "spatial-iac-state-prod"
    key            = "compute-nodes/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "spatial-iac-state-lock"
  }
}

# Least-privilege role: this block declares only the EC2 trust policy.
# Scoped permissions (object reads, DB credential lookup) must be attached
# separately; no admin policies.
resource "aws_iam_role" "spatial_compute" {
  name = "spatial-compute-worker-role"
  assume_role_policy = jsonencode({
    Version   = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_instance_profile" "spatial_compute" {
  name = "spatial-compute-worker-profile"
  role = aws_iam_role.spatial_compute.name
}

# Restrictive rules: ingress only from the ALB, egress only to the
# PostGIS subnet and the S3 VPC endpoint. This keeps tile/raster traffic
# on the cloud backbone and off the public internet.
resource "aws_security_group" "compute_node" {
  name        = "spatial-compute-sg"
  description = "Restrictive ingress/egress for geospatial workers"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = [var.alb_cidr]
  }

  egress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = [var.postgres_subnet_cidr]
  }

  egress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    prefix_list_ids = [var.s3_endpoint_prefix_list_id]
  }

  tags = {
    Environment = var.environment
    CostCenter  = "gis-platform"
    ManagedBy   = "terraform"
  }
}

# Launch template pinned to a baked image carrying the validated
# GDAL/PROJ/GEOS runtime — referenced by immutable id, never "latest".
resource "aws_launch_template" "worker" {
  name_prefix            = "spatial-worker-"
  image_id               = var.geospatial_ami_id
  instance_type          = var.worker_instance_type
  vpc_security_group_ids = [aws_security_group.compute_node.id]

  iam_instance_profile { arn = aws_iam_instance_profile.spatial_compute.arn }

  metadata_options {
    http_tokens   = "required" # enforce IMDSv2
    http_endpoint = "enabled"
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Role        = "spatial-worker"
      Environment = var.environment
      CostCenter  = "gis-platform"
    }
  }
}

# Stateless fleet: bounds are parameterized per environment so scaling
# policy, not static capacity, governs size.
resource "aws_autoscaling_group" "workers" {
  name                = "spatial-worker-asg"
  min_size            = var.worker_min
  max_size            = var.worker_max
  desired_capacity    = var.worker_desired
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [var.worker_target_group_arn]
  health_check_type   = "ELB"

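  launch_template {
    id = aws_launch_template.worker.id
    # Reference the concrete version number: with "$Latest" the group's
    # configuration never changes, so Terraform would not start a refresh.
    version = aws_launch_template.worker.latest_version
  }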

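  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 75 }
  }

  # desired_capacity only seeds the initial size; ignoring later changes
  # stops an apply from undoing what the scaling policy has decided.
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}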

The instance_refresh block matters for spatial fleets specifically: when a new baked image ships, a rolling refresh replaces nodes while keeping three-quarters of render capacity serving traffic, so a tile endpoint never goes dark during an image rotation.

Guardrails Embedded in the Configuration

The configuration encodes four guardrails that survive operator turnover because they live in code rather than in a runbook. State locking through the DynamoDB table prevents two concurrent applies from interleaving launch-template versions — without it, a race can leave the auto-scaling group pointing at a half-written template and boot broken nodes. The locking discipline mirrors the backend strategy in State Backend Selection.

Secret handling keeps database credentials out of state and out of user data: the instance role grants read access to a single secrets-manager entry, and workers fetch the connection string at boot rather than receiving it as a templated variable that would be persisted in state and exposed in plan output. Network isolation is enforced by placing workers in private subnets with no public IP, restricting ingress to the load balancer’s CIDR, and pinning egress to the database subnet and the object-store endpoint prefix list. The design assumes the broader posture set out in IAM Role Mapping for Geospatial Workloads and Security Group Hardening.
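The role in the configuration above declares only a trust policy, so the scoped permissions still have to be attached. A minimal sketch of what that attachment could look like, assuming hypothetical inputs `var.db_secret_arn`, `var.asset_bucket_arn`, and `var.asset_prefix`:

```hcl
# Read-only permissions for workers: one secret, one object prefix.
resource "aws_iam_role_policy" "worker_read_only" {
  name = "spatial-worker-read-only"
  role = aws_iam_role.spatial_compute.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "ReadDbCredentials"
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = var.db_secret_arn # hypothetical input
      },
      {
        Sid      = "ReadSpatialAssets"
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${var.asset_bucket_arn}/${var.asset_prefix}/*" # hypothetical inputs
      }
    ]
  })
}
```

Workers that also write results would need a matching `s3:PutObject` statement on their output prefix. Because egress is pinned to the database subnet and the S3 prefix list, the secrets lookup additionally needs a permitted network path, for example an interface VPC endpoint reachable on port 443; the configuration above does not declare one.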

Stateless sizing is the spatial-specific guardrail. Workers hold no durable data: raster and vector assets stay in object storage, the database holds authoritative spatial state, and a node carries only ephemeral scratch — tile caches and reprojection temp files — sized to the instance’s local storage rather than to a growing dataset. Because nothing persistent lives on the node, the auto-scaling group can terminate and replace instances freely, which is what makes demand-driven scaling safe. Where local scratch sizing or container memory limits are tuned, those values belong in the launch template or task definition alongside the image pin, so the whole runtime profile versions together.
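Because nodes are disposable, a scaling policy can own fleet size. The following is one possible target-tracking policy keyed on requests per worker. It is a sketch only: `var.alb_target_resource_label` and `var.requests_per_worker_target` are hypothetical inputs, and a queue-driven fleet would track a queue-depth metric instead.

```hcl
# Keeps average ALB requests per worker near the target by adding or
# removing instances between worker_min and worker_max.
resource "aws_autoscaling_policy" "requests_per_worker" {
  name                   = "spatial-worker-requests-per-target"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
      resource_label = var.alb_target_resource_label # hypothetical input
    }

    target_value = var.requests_per_worker_target # hypothetical input
  }
}
```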

Troubleshooting and Failure Modes

Machine-image runtime skew. A refreshed AMI ships with a GDAL minor-version bump that drops a raster driver the workload depends on, so newly booted nodes fail every reprojection while older nodes succeed. The symptom is a partial-failure pattern that tracks node age. Resolve it by pinning the GDAL release in the image build, asserting the driver matrix in the pre-apply gate, and using instance_refresh so a bad image can be rolled back as one launch-template version.

Instance-quota exhaustion mid-scale. A render surge requests more instances of a high-memory or GPU family than the account’s service quota allows; the auto-scaling group provisions some nodes, then stalls, leaving the load balancer with insufficient healthy targets and rising latency. Detect it through scaling-activity events showing repeated failed launches. Prevent it with a pre-apply quota check and by spreading capacity across instance families with a mixed-instances policy.
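A mixed-instances variant of the worker group might look like the sketch below. The override instance types are illustrative assumptions, and this form replaces the group's top-level `launch_template` block rather than sitting beside it.

```hcl
resource "aws_autoscaling_group" "workers" {
  name                = "spatial-worker-asg"
  min_size            = var.worker_min
  max_size            = var.worker_max
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [var.worker_target_group_arn]
  health_check_type   = "ELB"

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = aws_launch_template.worker.latest_version
      }

      # Illustrative high-memory alternatives; the group may launch any of them.
      override {
        instance_type = "r6i.2xlarge"
      }
      override {
        instance_type = "r5.2xlarge"
      }
    }

    instances_distribution {
      on_demand_percentage_above_base_capacity = 100 # On-Demand only
    }
  }
}
```

Families that fall under the same service quota share one limit, so overrides within a single quota bucket mainly protect against capacity shortfalls; relief from the quota itself requires overrides that span separately limited families or a quota increase.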

VPC-endpoint policy gap. Workers can reach object storage at the network layer but receive 403 responses because the S3 VPC endpoint policy or the instance role does not grant the specific bucket prefix the raster assets live under. The signature is network reachability with consistent authorization failures. Fix it by aligning the endpoint resource policy, the bucket policy, and the role’s prefix scope, then re-running the apply.

Orphaned state lock after an aborted apply. A pipeline run is killed mid-apply and the DynamoDB lock is never released, so the next deployment blocks indefinitely. The symptom is a hanging apply reporting an existing lock id. Resolve it by validating the holding run is genuinely dead, then force-unlocking that specific lock id — never by disabling locking, which would reopen the race the lock exists to prevent.

Drift from manual instance edits. An operator SSHes into a worker and changes a sysctl or heap value to chase a hot incident; the next scale-out event clones the unmodified template and the fleet becomes heterogeneous, producing intermittent, node-specific failures. Catch it with scheduled drift detection comparing live instances against the launch template, and encode the fix permanently in the baked image so the change applies to every future node.

Related Topics

Auto-Scaling EC2 Instances for WMS Endpoints: metric-driven scaling for the render fleet this page provisions.
PostGIS Cluster Provisioning: the authoritative data tier these stateless workers attach to.
Object Storage for Raster/Vector Workloads: durable asset store nodes read and write over a VPC endpoint.
GeoServer Deployment Patterns: JVM and OGC tuning when the workload is a Java spatial engine.
Geospatial Resource Provisioning: the provisioning domain this compute layer belongs to.