- The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
- Why is it so challenging?
- How does the “Just do it” approach affect pillars?
- How do VPC interface endpoints fit into all this?
- How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
- Total costs
- Summary
- Optimizing Architecture for Cost Savings and Business Continuity
- Why Isn't Cost Enough to Convince the Business?
- High Level Design
- Components Table
- Integrations Table
- Key takeaways
Working as a DevOps engineer is like juggling flaming swords while someone shouts, 'Can you deploy that by Friday?' Or worse, 'By 17:00 Friday.'
Why is it so challenging?
Explaining that your solution should align with the six pillars of the AWS Well-Architected Framework is like asking for a seatbelt in a car that's already halfway down the hill—or opening your umbrella after the rain has passed. You need time, planning, and a roadmap—and nobody wants to hear that when the only goal is “just make it work.”
“Just do it” can be an effective strategy, but of those six pillars, cost optimization and sustainability are usually the first to be sacrificed.
How does the “Just do it” approach affect pillars?
Because in the race to deliver, speed beats everything. Deadlines are sacred.
And what about budgets? Well, they’re not a problem, right up until someone sees the monthly AWS bill and starts panicking. The cost impact stays hidden behind shared billing, and nobody has tagging discipline in the early phase.
Now you're asked to deploy a Graviton instance for a legacy application that doesn't even support ARM. Why wouldn’t you? After all, cost optimization is suddenly the top priority, never mind compatibility.
That’s when suddenly, cost optimization becomes everyone's favorite pillar.
How do VPC interface endpoints fit into all this?
Initially, VPC endpoints are provisioned separately per VPC—because we prioritized speed over cost and, sometimes, even quality or security.
If we have 20 VPCs and create the same endpoints in each one, we pay for them 20 times over, even while the traffic is almost idle. A single VPC endpoint in one Availability Zone provides 10 Gbps of bandwidth and scales automatically up to 100 Gbps, which is enough to handle multiple workloads, even high-throughput data workloads.
For those with a programming background, this is a classic example of violating the ‘Don’t Repeat Yourself’ (DRY) principle.
Because repeating the same setup in every VPC introduces unnecessary costs for a horizontally scalable networking component designed to handle large volumes of traffic efficiently—and doing it multiple times means paying multiple times.
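Before quoting the price list, it is worth measuring how much duplication you actually have. Below is a minimal audit sketch using boto3, assuming AWS credentials and a default region are already configured; it counts how many interface endpoints exist per service across all VPCs visible to the current account.

```python
# Minimal audit sketch: count how often each interface endpoint service is
# duplicated across the VPCs visible to the current credentials.
# Assumes boto3 is installed and AWS credentials/region are configured.
from collections import Counter

import boto3

ec2 = boto3.client("ec2")
service_counts = Counter()

paginator = ec2.get_paginator("describe_vpc_endpoints")
for page in paginator.paginate(
    Filters=[{"Name": "vpc-endpoint-type", "Values": ["Interface"]}]
):
    for endpoint in page["VpcEndpoints"]:
        service_counts[endpoint["ServiceName"]] += 1

# Any service with a count above 1 is provisioned (and billed) more than once.
for service, count in service_counts.most_common():
    if count > 1:
        print(f"{service}: {count} endpoints")
```

Run it per account (or through your organization's credential broker) and the duplicated services stand out immediately.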
According to the documentation: "By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone, and automatically scales up to 100 Gbps."
How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
In an environment with a multi-account strategy, we will use 13 accounts (let's pretend it is an unlucky number) and some randomly generated endpoint services as an example.
| Account | Interface endpoints |
|---|---|
| 1 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager |
| 2 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager, sqs, airflow.api, airflow.env, airflow.ops |
| 3 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager, sagemaker.api, sagemaker.runtime |
| 4 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, sagemaker.runtime, execute-api, secretsmanager, states, sts, acm-pca, glue, athena, macie2, ecs, bedrock-runtime |
| 5 | s3, sts |
| 6 | ssm, ssmmessages, ec2messages, ec2, s3, logs, monitoring, kms, sts |
| 7 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, secretsmanager, elasticfilesystem, codecommit, git-codecommit, glue, athena, application-autoscaling |
| 8 | logs, monitoring, sts, glue, lambda, states, secretsmanager |
| 9 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
| 10 | logs, monitoring, sts, ec2 |
| 11 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, secretsmanager, acm-pca |
| 12 | athena, logs, monitoring, kms, secretsmanager, codecommit, sagemaker.api, sagemaker.runtime, glue, git-codecommit, sts, bedrock-runtime |
| 13 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
If we group the endpoints by frequency, assuming one environment or four environments, the numbers look like this:
| VPC Endpoint | Frequency (x1) | Frequency (x4) |
|---|---|---|
| sts | 14 | 56 |
| logs | 12 | 48 |
| monitoring | 12 | 48 |
| kms | 10 | 40 |
| secretsmanager | 10 | 40 |
| lambda | 9 | 36 |
| ecr.api | 8 | 32 |
| ecr.dkr | 8 | 32 |
| acm-pca | 7 | 28 |
| ec2 | 6 | 24 |
| ssm | 5 | 20 |
| sagemaker.api | 4 | 16 |
| glue | 4 | 16 |
| ssmmessages | 3 | 12 |
| ec2messages | 3 | 12 |
| sagemaker.runtime | 3 | 12 |
| athena | 3 | 12 |
| ssm-contacts | 2 | 8 |
| states | 2 | 8 |
| bedrock-runtime | 2 | 8 |
| s3 | 2 | 8 |
| codecommit | 2 | 8 |
| git-codecommit | 2 | 8 |
| sqs | 1 | 4 |
| airflow.api | 1 | 4 |
| airflow.env | 1 | 4 |
| airflow.ops | 1 | 4 |
| execute-api | 1 | 4 |
| macie2 | 1 | 4 |
| ecs | 1 | 4 |
| elasticfilesystem | 1 | 4 |
| application-autoscaling | 1 | 4 |
| Total | 142 | 568 |
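The frequency table itself is mechanical to produce, and it is worth scripting so the totals stay honest as accounts are added. A short sketch, with the account-to-endpoint mapping abbreviated here (the full version mirrors the first table above):

```python
# Reproduce the frequency table: flatten the per-account endpoint lists and
# count how often each service appears. Only a few accounts are shown here;
# the full 13-account mapping follows the table above.
from collections import Counter

accounts = {
    1: ["ecr.api", "ecr.dkr", "logs", "monitoring", "lambda", "kms", "sts",
        "ssm", "ssmmessages", "ssm-contacts", "ec2", "ec2messages",
        "acm-pca", "secretsmanager"],
    5: ["s3", "sts"],
    10: ["logs", "monitoring", "sts", "ec2"],
    # ... remaining accounts elided for brevity
}

frequency = Counter(ep for endpoints in accounts.values() for ep in endpoints)
total = sum(frequency.values())  # 142 with the full mapping, 568 across 4 envs

for service, count in frequency.most_common():
    print(f"{service}: x1={count}, x4={count * 4}")
print(f"Total: x1={total}, x4={total * 4}")
```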
The total cost calculation for eu-west-2 (London) looks like this:
Total costs for 142 endpoints for 1 environment = $0.011 (per hour) × 3 AZs × 24 hours × 30 days × 142 = $3,373.92
Total costs for 568 endpoints (4 environments) = $3,373.92 × 4 = $13,495.68
Data processing costs for 4 environments = $5.28 (rough estimate)
Total unique VPC endpoints count = 32
Costs for 32 endpoints = $0.011 (per hour) × 3 AZs × 24 hours × 30 days × 32 = $760.32
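To keep the arithmetic reproducible, here is the same calculation as a tiny Python helper; the $0.011 hourly rate is the eu-west-2 interface endpoint price used above, so verify it against current AWS pricing before reusing:

```python
# Monthly interface-endpoint cost, using the eu-west-2 hourly rate of
# $0.011 per endpoint per AZ (verify against current AWS pricing).
HOURLY_RATE = 0.011
AZS = 3
HOURS_PER_MONTH = 24 * 30

def monthly_cost(endpoint_count: int) -> float:
    return HOURLY_RATE * AZS * HOURS_PER_MONTH * endpoint_count

print(monthly_cost(142))      # 3373.92  - one environment, per-VPC endpoints
print(monthly_cost(142 * 4))  # 13495.68 - four environments
print(monthly_cost(32))       # 760.32   - centralized, single shared account
print(monthly_cost(64))       # 1520.64  - centralized, prod + nonprod
```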
A centralized approach, with the VPC endpoints in shared services accounts for prod and nonprod, provides the same scalability and high availability while cutting costs by roughly 89% and reducing the administrative burden. We can go a step further and replace some interface endpoints, such as those for S3 and DynamoDB, with gateway endpoints, which are free of charge; the trade-off is that gateway endpoints cannot be shared across VPCs the way centralized interface endpoints can.
Summary
142 endpoints × 3 AZs × $0.011/hour × 24 hours × 30 days = $3,373.92/month
For 4 environments (568 endpoints): $13,495.68/month
Centralized: 32 unique endpoints, $760.32/month per shared services account, or $1,520.64/month for prod and nonprod together
Savings: ~89%
The costs above are not necessarily a bad thing. You get isolation between environments, and you gather extensive knowledge of how things work and of how to approach stakeholders in order to improve the situation.
Why Isn't Cost Enough to Convince the Business?
The business is only interested in certain things, and I would say nobody cares that the administrative burden would be smaller. So how can you approach this?
When the interface endpoints were first deployed, they were not secured well. This means we now have a lot of networks with inconsistent security standards; each VPC becomes a snowflake. You may want to avoid saying that it is not secure; a more suitable approach would be:
By standardizing the endpoint policies and security groups, you can make sure that sensitive workloads have access only to specific buckets, specific tables, and specific APIs. This improves the security baseline and reduces the blast radius, which in turn reduces the possibility of a data leakage. (How to Sell Optimization Without Saying 'Security Is Bad')
By centralizing and standardizing the interface endpoints we would save roughly 89% of the costs. In Bulgaria, there's a satirical series called 'The Three Fools.' In our case, the fool is the team paying thousands to AWS for redundant endpoints, just because no one paused to rethink the architecture.
Note: security is always a good selling point for the business, and nobody measures it after a change. Controlling fear and risk sells; a good example is the insurance we buy for our houses.
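To make the standardization point concrete, here is a minimal sketch of creating a centralized interface endpoint with a least-privilege endpoint policy attached via boto3. The account ID, role, bucket name, VPC ID, and subnet IDs are hypothetical placeholders:

```python
# Sketch: attach a standardized, least-privilege policy while creating a
# centralized S3 interface endpoint. The account ID, role, bucket, VPC and
# subnet IDs below are hypothetical placeholders.
import json

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111111111111:role/workload-role"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::sensitive-workload-bucket/*",
        }
    ],
}

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-2.s3",
    SubnetIds=["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"],
    PolicyDocument=json.dumps(policy),
    PrivateDnsEnabled=False,  # DNS is handled centrally via a private hosted zone
)
```

The same policy document can be templated once and applied to every centralized endpoint, which is exactly what per-VPC snowflakes make impossible.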
High Level Design
Components Table
| ID | Name | Type | Description |
|---|---|---|---|
| C1 | Interface Endpoints | VPC Interface Endpoints | Provides private access to AWS services (e.g., ssm.eu-west-1.amazonaws.com). |
| C2 | Route 53 Private Hosted Zone | DNS Zone | Hosts private DNS entries for the services. |
| C3 | Route 53 Resolver Inbound Endpoint | DNS Resolver | Accepts DNS queries from the spoke VPC. |
| C4 | Shared Resolver | Route 53 Resolver | Used by EC2 instances in the spoke VPC to resolve private DNS. |
| C5 | AWS RAM | Resource Access Manager | Shares the inbound endpoint and private hosted zone with the spoke VPC. |
| C6 | Cloud WAN Segment Network | Network Routing | Routes traffic between segments (e.g., from spoke to shared services). |
| C7 | Amazon EC2 Instance | Compute | The instance initiating the request to ssm.eu-west-1.amazonaws.com. |
| C8 | Spoke VPC | VPC | Contains the EC2 instance. CIDR: 192.168.20.X. |
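To illustrate how the private hosted zone (C2) reaches the spoke VPC (C8), here is a sketch of the DNS plumbing in boto3. The components table shares the inbound resolver endpoint via AWS RAM; for simplicity, this sketch shows the alternative of associating the private hosted zone with the spoke VPC directly across accounts, and all IDs are hypothetical placeholders:

```python
# Sketch of the DNS plumbing from the components table: a private hosted
# zone for the ssm endpoint, created in the shared-services account, then
# an authorization + association so the spoke VPC (C8) can resolve it.
# All VPC IDs are hypothetical placeholders.
import boto3

r53 = boto3.client("route53")  # credentials for the shared-services account

# C2: private hosted zone for the service's regional DNS name,
# attached to the hub VPC that hosts the interface endpoints.
zone = r53.create_hosted_zone(
    Name="ssm.eu-west-1.amazonaws.com",
    CallerReference="ssm-phz-001",  # must be unique per request
    HostedZoneConfig={"PrivateZone": True},
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0123456789abcdef0"},
)
zone_id = zone["HostedZone"]["Id"]

# Authorize the spoke VPC (owned by another account) to associate with the
# zone; this call runs in the shared-services account...
r53.create_vpc_association_authorization(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0fedcba9876543210"},
)

# ...then the spoke account completes the association, after which EC2
# instances there resolve ssm.eu-west-1.amazonaws.com to the central
# interface endpoint's private IPs.
spoke_r53 = boto3.client("route53")  # credentials for the spoke account
spoke_r53.associate_vpc_with_hosted_zone(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0fedcba9876543210"},
)
```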