How (not) to Burn Money on VPC Endpoints (So You Don't Have To)

  1. The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
  2. Why is it so challenging?
  3. How does the “Just do it” approach affect pillars?
  4. How do VPC interface endpoints fit into all this?
  5. How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
    • Total costs
    • Summary
  6. Optimizing Architecture for Cost Savings and Business Continuity
    • Why Isn't Cost Enough to Convince the Business?
  7. High Level Design
    • Components Table
    • Integrations Table
  8. Key takeaways
The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget


Working as a DevOps engineer is like juggling flaming swords while someone shouts, 'Can you deploy that by Friday?' Or worse, 'By 17:00 Friday.'

Why is it so challenging?


Explaining that your solution should align with the six pillars of the AWS Well-Architected Framework is like asking for a seatbelt in a car that's already halfway down the hill—or opening your umbrella after the rain has passed. You need time, planning, and a roadmap—and nobody wants to hear that when the only goal is “just make it work.”

“Just do it” is an effective strategy, but out of those six pillars, cost optimization and sustainability are usually the first to be sacrificed.

How does the “Just do it” approach affect pillars?


Because in the race to deliver, speed beats everything. Deadlines are sacred.

And what about budgets? Well, they’re not a problem until someone sees the monthly AWS bill and starts panicking. That is because cost impact is often hidden behind shared billing, and nobody has tagging discipline in the early phase.

Now you're asked to deploy a Graviton instance for a legacy application that doesn't even support ARM. Why wouldn’t you? After all, cost optimization is suddenly the top priority, never mind compatibility.

That’s when suddenly, cost optimization becomes everyone's favorite pillar.

How do VPC interface endpoints fit into all this?


Initially, VPC endpoints are provisioned separately per VPC, because we prioritized speed over cost and, sometimes, even quality or security.
If we have 20 VPCs, we create the same endpoints in each one, which multiplies the cost by 20, even though the traffic on most of them is almost idle. One VPC endpoint in one Availability Zone provides 10 Gbps with automatic scaling up to 100 Gbps. This is enough to handle multiple workloads, even high-throughput data workloads.

For those with a programming background, this is a classic example of violating the ‘Don’t Repeat Yourself’ (DRY) principle.
Repeating the same setup in every VPC introduces unnecessary cost for a horizontally scalable networking component that is designed to handle large volumes of traffic efficiently, and doing it multiple times means paying multiple times.

According to the documentation:

By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone, and automatically scales up to 100 Gbps.
How much does it cost? - A gentle overview of provisioning VPC interface endpoints for each new VPC

In an environment with a multi-account strategy, we will use 13 accounts (let's pretend it is an unlucky number) and some randomly generated endpoint services as an example:

Account | Interface endpoints
1  | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager
2  | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager, sqs, airflow.api, airflow.env, airflow.ops
3  | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager, sagemaker.api, sagemaker.runtime
4  | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, sagemaker.runtime, execute-api, secretsmanager, states, sts, acm-pca, glue, athena, macie2, ecs, bedrock-runtime
5  | s3, sts
6  | ssm, ssmmessages, ec2messages, ec2, s3, logs, monitoring, kms, sts
7  | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, secretsmanager, elasticfilesystem, codecommit, git-codecommit, glue, athena, application-autoscaling
8  | logs, monitoring, sts, glue, lambda, states, secretsmanager
9  | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager
10 | logs, monitoring, sts, ec2
11 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, secretsmanager, acm-pca
12 | athena, logs, monitoring, kms, secretsmanager, codecommit, sagemaker.api, sagemaker.runtime, glue, git-codecommit, sts, bedrock-runtime
13 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager

If we group the endpoints by frequency, assuming one environment or four environments, the numbers look like this:

VPC Endpoint            | Frequency (x1) | Frequency (x4)
sts                     | 14             | 56
logs                    | 12             | 48
monitoring              | 12             | 48
kms                     | 10             | 40
secretsmanager          | 10             | 40
lambda                  | 9              | 36
ecr.api                 | 8              | 32
ecr.dkr                 | 8              | 32
acm-pca                 | 7              | 28
ec2                     | 6              | 24
ssm                     | 5              | 20
sagemaker.api           | 4              | 16
glue                    | 4              | 16
ssmmessages             | 3              | 12
ec2messages             | 3              | 12
sagemaker.runtime       | 3              | 12
athena                  | 3              | 12
ssm-contacts            | 2              | 8
states                  | 2              | 8
bedrock-runtime         | 2              | 8
s3                      | 2              | 8
codecommit              | 2              | 8
git-codecommit          | 2              | 8
sqs                     | 1              | 4
airflow.api             | 1              | 4
airflow.env             | 1              | 4
airflow.ops             | 1              | 4
execute-api             | 1              | 4
macie2                  | 1              | 4
ecs                     | 1              | 4
elasticfilesystem       | 1              | 4
application-autoscaling | 1              | 4
Total                   | 132            | 528
Total costs


The calculation of total costs for eu-west-2 (London) would look like this:

Total costs for 132 endpoints for 1 environment = $0.011 (per hour) * 3 AZs * 24 hours * 30 days * 132 = $3,136.32
Total costs for 528 endpoints = $3,136.32 * 4 = $12,545.28
Data processing costs for 4 environments = $5.28 (rough estimate)

Total unique VPC endpoints = 32
Costs for 32 endpoints = $0.011 (per hour) * 3 AZs * 24 hours * 30 days * 32 = $760.32

A centralized approach, with the VPC endpoints in a shared services account for prod and nonprod, can provide the same scalability and high availability while cutting costs by roughly 87% and reducing the administrative burden. Of course, we can go a step further and replace some of the interface endpoints, such as S3 and DynamoDB, with gateway endpoints where we don't need to share them across VPCs (gateway endpoints are free, but they only serve traffic that originates in the VPC they live in) and we want to save money.
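For illustration, here is a minimal sketch of an S3 gateway endpoint in Terraform. It assumes the same VPC and variable names as the interface endpoint example later in this post, plus a hypothetical local value holding the private route table IDs:

# Gateway endpoints for S3 and DynamoDB are free of charge, but they only serve
# traffic that originates inside the VPC they are attached to; they cannot be
# reached over peering, Transit Gateway or Cloud WAN like a shared interface endpoint.
resource "aws_vpc_endpoint" "s3_gateway" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-2.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = local.private_route_table_ids # hypothetical local value
  tags              = merge({ Name = "${var.prefix}-s3-gateway-endpoint" }, var.tags)
}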

Summary


  • 132 endpoints x 3 AZs x $0.011/hour x 24 hours x 30 days = $3,136.32/month


  • For 4 environments (528 endpoints): $12,545.28/month


  • Costs for 32 endpoints across 3 AZs: 0.011 USD/hour × 3 AZs × 24 hours × 30 days × 32 = $760.32/month


  • Savings: ~87%
Optimizing Architecture for Cost Savings and Business Continuity


The costs above are not necessarily a bad thing. You get isolation between environments, and you gather extensive knowledge of how things work and how to approach stakeholders in order to improve the situation.

Why Isn't Cost Enough to Convince the Business?


The business is only interested in certain things. I would say nobody cares that the administrative burden will be smaller. So how can you approach this?


  1. When the interface endpoints were first deployed, they were not secured well. This means we now have a lot of networks with inconsistent security standards; each VPC becomes a snowflake. Avoid saying outright that it is not secure; a more suitable approach would be:
    By standardizing the security policies and security groups, you can make sure that sensitive workloads have access only to specific buckets, specific tables, and specific APIs. This improves the security baseline and reduces the blast radius, which in turn reduces the possibility of a data leakage. (How to Sell Optimization Without Saying 'Security Is Bad'.) See the example endpoint policy below.


  2. By centralizing and standardizing the interface endpoints, we would save ~87% of the costs. In Bulgaria, there's a satirical series called 'The Three Fools.' In our case, the fool is the team paying thousands to AWS for redundant endpoints, just because no one paused to rethink the architecture.

Note: Security is always a good selling point for the business, and nobody measures it after a change. Controlling fear and risk sells; a good example is the insurance we buy for our houses.
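As an illustration of the standardization argument, one possible baseline endpoint policy allows only principals from your own AWS Organization to use the centralized endpoints, instead of the default policy that allows everything. This is a hedged sketch of the vpc_endpoints_policy document referenced in the Terraform example later in this post; the organization ID is a placeholder:

# Only allow requests made by principals from our own AWS Organization;
# everything else is implicitly denied. Per-service policies can be made
# stricter, e.g. only specific S3 buckets, DynamoDB tables or API ARNs
# for sensitive workloads.
data "aws_iam_policy_document" "vpc_endpoints_policy" {
  statement {
    sid       = "AllowRequestsFromOurOrgOnly"
    effect    = "Allow"
    actions   = ["*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["*"]
    }

    condition {
      test     = "StringEquals"
      variable = "aws:PrincipalOrgID"
      values   = ["o-exampleorgid"] # placeholder organization ID
    }
  }
}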

High Level Design






Components Table

ID  | Name                               | Type                    | Description
C1  | Interface Endpoints                | VPC Interface Endpoints | Provides private access to AWS services (e.g., ssm.eu-west-1.amazonaws.com).
C2  | Route 53 Private Hosted Zone       | DNS Zone                | Hosts private DNS entries for the services.
C3  | Route 53 Resolver Inbound Endpoint | DNS Resolver            | Accepts DNS queries from the spoke VPC.
C3  | Shared Resolver                    | Route 53 Resolver       | Used by EC2 instances in the spoke VPC to resolve private DNS.
C4  | AWS RAM                            | Resource Access Manager | Shares the inbound endpoint and private hosted zone with the spoke VPC.
C5  | Cloud WAN Segment Network          | Network Routing         | Routes traffic between segments (e.g., from spoke to shared services).
EC2 | Amazon EC2 Instance                | Compute                 | The instance initiating the request to ssm.eu-west-1.amazonaws.com.
-   | Spoke VPC                          | VPC                     | Contains the EC2 instance. CIDR: 192.168.20.X.
-   | Centralized VPC Endpoints          | VPC                     | Hosts the interface endpoints and inbound resolver. CIDR: 192.168.10.X.
Integrations Table

Step | Integration Description                                                      | Direction         | Protocol/Mechanism
1    | EC2 in spoke VPC wants to resolve ssm.eu-west-1.amazonaws.com.               | Spoke → Shared    | DNS Query via Shared Resolver
2    | Shared Resolver provides IP 192.168.10.4 for the endpoint.                   | Shared → Spoke    | DNS Response
3    | Traffic to 192.168.10.4 is not local, forwarded to Cloud WAN uplink.         | Spoke → Cloud WAN | VPC Route Table / Cloud WAN Routing
4    | Cloud WAN checks if route to another network is permitted.                   | Cloud WAN         | Firewall/Policy Check
5    | If permitted, traffic is routed to shared services VPC.                      | Cloud WAN → Shared | Network Forwarding
I1   | Private hosted zone is associated with the shared resolver and spoke via RAM. | Shared ↔ Spoke    | AWS RAM and Route 53 Association
I4   | RAM shares the inbound resolver with spoke VPC.                              | Shared → Spoke    | AWS Resource Access Manager
I5   | Spoke EC2 sends DNS queries to shared resolver.                              | Spoke → Shared    | DNS

Prerequisite: all VPCs are connected via peering, Transit Gateway, or Cloud WAN.

Hub VPC


First, we need to create a centralized hub VPC that will contain all of the necessary VPC interface endpoints. When you create a VPC endpoint for an AWS service, you can enable private DNS. When enabled, this setting creates an AWS-managed Route 53 private hosted zone (PHZ), which resolves the public AWS service endpoint name to the private IP of the interface endpoint. You need this disabled in order to define a centralized PHZ, reachable through a Route 53 inbound resolver, that can be shared with other accounts.
To do this, disable private DNS in Terraform:


resource "aws_vpc_endpoint" "private_links" {
for_each = toset(local.vpc_endpoints_all)
vpc_id = aws_vpc.main.id
service_name = each.key
vpc_endpoint_type = "Interface"
private_dns_enabled = false
#Disabling private DNS lets us override the default endpoint #resolution and use our own Route 53 hosted zone across accounts
security_group_ids = [aws_security_group.vpc_endpoint[each.key].id]
policy = data.aws_iam_policy_document.vpc_endpoints_policy.json
subnet_ids = local.subnets
tags = merge({ Name = "${var.prefix}-${each.key}-interface-endpoint" }, var.tags)
}

  1. As the next step, we create a Route 53 private hosted zone for each endpoint service and associate it with the centralized hub VPC created above.


  2. Then we create an alias A record in each hosted zone pointing to the VPC endpoint's DNS name. For example, for the STS endpoint
    the name "sts.${data.aws_region.current.name}.amazonaws.com"
    should point to the DNS name of the newly created VPC endpoint for STS.
    This allows spoke VPCs to resolve AWS service endpoints to the centralized VPC interface endpoints via the inbound endpoint.


  3. Then we create a Route 53 Resolver inbound endpoint with a security group and the Do53 protocol, placed in at least two subnets for high availability; it will be used by the spoke VPCs as well. The idea of the inbound resolver endpoint is to route DNS queries from spoke VPCs or other networks to the hub VPC.


  4. As the last step, we share the resolver rule that points at the inbound endpoint with the other accounts, and control who may use it, through Resource Access Manager (RAM); see the sketch after this list.
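Putting steps 1-4 together, a minimal Terraform sketch could look like the following. It is a hedged illustration rather than the exact original setup: only the STS hosted zone and record are shown, the endpoint map key, the resolver security group, the outbound resolver endpoint and the organization ARN are assumptions, and RAM shares the resolver rule because resolver endpoints themselves cannot be shared.

# Steps 1 and 2: one private hosted zone per service, with an apex alias record
# pointing at the centralized interface endpoint (STS shown as an example).
resource "aws_route53_zone" "sts" {
  name = "sts.${data.aws_region.current.name}.amazonaws.com"

  vpc {
    vpc_id = aws_vpc.main.id # hub VPC
  }
}

resource "aws_route53_record" "sts_apex" {
  zone_id = aws_route53_zone.sts.zone_id
  name    = "sts.${data.aws_region.current.name}.amazonaws.com"
  type    = "A"

  alias {
    name                   = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].hosted_zone_id
    evaluate_target_health = true
  }
}

# Step 3: inbound resolver endpoint in at least two subnets for high availability.
resource "aws_route53_resolver_endpoint" "inbound" {
  name               = "${var.prefix}-hub-inbound"
  direction          = "INBOUND"
  security_group_ids = [aws_security_group.resolver.id] # assumed SG allowing TCP/UDP 53 from spoke CIDRs

  ip_address {
    subnet_id = local.subnets[0]
  }

  ip_address {
    subnet_id = local.subnets[1]
  }
}

# A forward rule sends queries for regional amazonaws.com names to the inbound
# endpoint. It must be attached to an outbound resolver endpoint (not shown).
resource "aws_route53_resolver_rule" "amazonaws" {
  name                 = "${var.prefix}-amazonaws-forward"
  domain_name          = "${data.aws_region.current.name}.amazonaws.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id # assumed to exist

  dynamic "target_ip" {
    for_each = aws_route53_resolver_endpoint.inbound.ip_address
    content {
      ip = target_ip.value.ip
    }
  }
}

# Step 4: share the forward rule with the spoke accounts via RAM.
resource "aws_ram_resource_share" "resolver_rules" {
  name                      = "${var.prefix}-resolver-rules"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "amazonaws_rule" {
  resource_arn       = aws_route53_resolver_rule.amazonaws.arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}

resource "aws_ram_principal_association" "spokes" {
  principal          = var.organization_arn # assumed: the AWS Organizations ARN covering the spoke accounts
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}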
Spoke VPCs

  1. Each newly created spoke VPC needs to be associated with the resolver rules that were shared from the hub VPC. Example:

data "aws_route53_resolver_rules" "eu_west_2" {
owner_id = var.resolver_rules[terraform.workspace]
share_status = "SHARED_WITH_ME"
}
resource "aws_route53_resolver_rule_association" "eu_west_2" {
for_each = data.aws_route53_resolver_rules.eu_west_2.resolver_rule_ids
resolver_rule_id = each.value
vpc_id = data.terraform_remote_state.networking.outputs.network.aws_vpc.id
}
Minimizing downtime


Now you would ask: how do we move from the state with decentralized VPC interface endpoints to a state where they are centralized, with as little downtime as possible?

In general, what can be done is to associate the shared resolver rules with the spoke VPC and then destroy the decentralized VPC endpoints in a rolling deployment from development to production, with automated tests via Systems Manager Run Command or Lambda (a sketch of such a check follows below). This guarantees that you first learn what can fail (everything fails, all the time) and document it in Confluence or even a README.md. That gives you confidence for production and makes the big change controllable and more understandable for technical and non-technical people.
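A hedged sketch of what such an automated check could look like: a Systems Manager Command document that verifies, from inside a spoke instance, that a service name now resolves to a private (RFC 1918) address, i.e. to the centralized interface endpoint. The document name and the dig-based test are illustrative assumptions:

# Run this document against spoke instances with SSM Run Command after cutover.
resource "aws_ssm_document" "endpoint_dns_check" {
  name            = "check-centralized-endpoint-dns" # hypothetical name
  document_type   = "Command"
  document_format = "JSON"

  content = jsonencode({
    schemaVersion = "2.2"
    description   = "Verify AWS service endpoints resolve to private IPs via the shared resolver"
    mainSteps = [{
      action = "aws:runShellScript"
      name   = "resolveSts"
      inputs = {
        runCommand = [
          "ip=$(dig +short sts.eu-west-2.amazonaws.com | head -n1)",
          "echo \"sts resolves to $ip\"",
          "echo \"$ip\" | grep -Eq '^(10\\.|192\\.168\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.)' || exit 1"
        ]
      }
    }]
  })
}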

Key Takeaways


  • Rushing architecture decisions often leads to long-term cost explosions.


  • Interface endpoints are scalable—duplicating them per VPC isn’t.


  • Centralizing shared services like VPC endpoints saves money and simplifies security management.


  • To convince stakeholders, lead with security and cost—not technical purity.


  • AWS gives you the tools; architecture is about using them with purpose.
