- The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
- Why is it so challenging?
- How does the “Just do it” approach affect pillars?
- How do VPC interface endpoints fit into all this?
- How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
- Total costs
- Summary
- Optimizing Architecture for Cost Savings and Business Continuity
- Why Isn't Cost Enough to Convince the Business?
- High Level Design
- Components Table
- Integrations Table
- Key takeaways
Working as a DevOps engineer is like juggling flaming swords while someone shouts, 'Can you deploy that by Friday?' Or worse, 'By 17:00 Friday.'
Why is it so challenging?
Explaining that your solution should align with the six pillars of the AWS Well-Architected Framework is like asking for a seatbelt in a car that's already halfway down the hill—or opening your umbrella after the rain has passed. You need time, planning, and a roadmap—and nobody wants to hear that when the only goal is “just make it work.”
“Just do it” can be an effective strategy, but of those six pillars, cost optimization and sustainability are usually the first to be sacrificed.
How does the “Just do it” approach affect pillars?
Because in the race to deliver, speed beats everything. Deadlines are sacred.
And what about budgets? Well, they’re not a problem, right up until someone sees the monthly AWS bill and starts panicking. The cost impact stays hidden behind shared billing, and nobody has tagging discipline in the early phase.
Now you're asked to deploy a Graviton instance for a legacy application that doesn't even support ARM. Why wouldn’t you? After all, cost optimization is suddenly the top priority, never mind compatibility.
That’s when suddenly, cost optimization becomes everyone's favorite pillar.
How do VPC interface endpoints fit into all this?
Initially, VPC endpoints are provisioned separately per VPC—because we prioritized speed over cost and, sometimes, even quality or security.
If we have 20 VPCs and create the same endpoints in each one, we pay for them 20 times over, even while the traffic is almost idle. A single VPC endpoint in one Availability Zone provides 10 Gbps of bandwidth and scales automatically up to 100 Gbps, which is enough to handle multiple workloads, even high-throughput data workloads.
For those with a programming background, this is a classic example of violating the ‘Don’t Repeat Yourself’ (DRY) principle.
Because repeating the same setup in every VPC introduces unnecessary costs for a horizontally scalable networking component designed to handle large volumes of traffic efficiently—and doing it multiple times means paying multiple times.
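Before quoting the price list, it is worth measuring how much duplication you actually have. Below is a minimal audit sketch using boto3, assuming AWS credentials and a default region are already configured; it counts how many interface endpoints exist per service across all VPCs visible to the current account.

```python
# Minimal audit sketch: count how often each interface endpoint service is
# duplicated across the VPCs visible to the current credentials.
# Assumes boto3 is installed and AWS credentials/region are configured.
from collections import Counter

import boto3

ec2 = boto3.client("ec2")
service_counts = Counter()

paginator = ec2.get_paginator("describe_vpc_endpoints")
for page in paginator.paginate(
    Filters=[{"Name": "vpc-endpoint-type", "Values": ["Interface"]}]
):
    for endpoint in page["VpcEndpoints"]:
        service_counts[endpoint["ServiceName"]] += 1

# Any service with a count above 1 is provisioned (and billed) more than once.
for service, count in service_counts.most_common():
    if count > 1:
        print(f"{service}: {count} endpoints")
```

Run it per account (or through your organization's credential broker) and the duplicated services stand out immediately.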
According to the documentation: "By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone, and automatically scales up to 100 Gbps."
How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
In an environment with a multi-account strategy, we will use 13 accounts (let's pretend it is an unlucky number) and some randomly generated endpoint services as an example.
| Account | Interface endpoints |
|---|---|
| 1 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager |
| 2 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager, sqs, airflow.api, airflow.env, airflow.ops |
| 3 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager, sagemaker.api, sagemaker.runtime |
| 4 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, sagemaker.runtime, execute-api, secretsmanager, states, sts, acm-pca, glue, athena, macie2, ecs, bedrock-runtime |
| 5 | s3, sts |
| 6 | ssm, ssmmessages, ec2messages, ec2, s3, logs, monitoring, kms, sts |
| 7 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, secretsmanager, elasticfilesystem, codecommit, git-codecommit, glue, athena, application-autoscaling |
| 8 | logs, monitoring, sts, glue, lambda, states, secretsmanager |
| 9 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
| 10 | logs, monitoring, sts, ec2 |
| 11 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, secretsmanager, acm-pca |
| 12 | athena, logs, monitoring, kms, secretsmanager, codecommit, sagemaker.api, sagemaker.runtime, glue, git-codecommit, sts, bedrock-runtime |
| 13 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
If we group the endpoints by frequency, assuming one environment or four environments, the numbers look like this:
| VPC Endpoint | Frequency (x1) | Frequency (x4) |
|---|---|---|
| sts | 14 | 56 |
| logs | 12 | 48 |
| monitoring | 12 | 48 |
| kms | 10 | 40 |
| secretsmanager | 10 | 40 |
| lambda | 9 | 36 |
| ecr.api | 8 | 32 |
| ecr.dkr | 8 | 32 |
| acm-pca | 7 | 28 |
| ec2 | 6 | 24 |
| ssm | 5 | 20 |
| sagemaker.api | 4 | 16 |
| glue | 4 | 16 |
| ssmmessages | 3 | 12 |
| ec2messages | 3 | 12 |
| sagemaker.runtime | 3 | 12 |
| athena | 3 | 12 |
| ssm-contacts | 2 | 8 |
| states | 2 | 8 |
| bedrock-runtime | 2 | 8 |
| s3 | 2 | 8 |
| codecommit | 2 | 8 |
| git-codecommit | 2 | 8 |
| sqs | 1 | 4 |
| airflow.api | 1 | 4 |
| airflow.env | 1 | 4 |
| airflow.ops | 1 | 4 |
| execute-api | 1 | 4 |
| macie2 | 1 | 4 |
| ecs | 1 | 4 |
| elasticfilesystem | 1 | 4 |
| application-autoscaling | 1 | 4 |
| Total | 142 | 568 |
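The frequency table itself is mechanical to produce, and it is worth scripting so the totals stay honest as accounts are added. A short sketch, with the account-to-endpoint mapping abbreviated here (the full version mirrors the first table above):

```python
# Reproduce the frequency table: flatten the per-account endpoint lists and
# count how often each service appears. Only a few accounts are shown here;
# the full 13-account mapping follows the table above.
from collections import Counter

accounts = {
    1: ["ecr.api", "ecr.dkr", "logs", "monitoring", "lambda", "kms", "sts",
        "ssm", "ssmmessages", "ssm-contacts", "ec2", "ec2messages",
        "acm-pca", "secretsmanager"],
    5: ["s3", "sts"],
    10: ["logs", "monitoring", "sts", "ec2"],
    # ... remaining accounts elided for brevity
}

frequency = Counter(ep for endpoints in accounts.values() for ep in endpoints)
total = sum(frequency.values())  # 142 with the full mapping, 568 across 4 envs

for service, count in frequency.most_common():
    print(f"{service}: x1={count}, x4={count * 4}")
print(f"Total: x1={total}, x4={total * 4}")
```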
The total cost calculation for eu-west-2 (London) looks like this:
Total costs for 142 endpoints for 1 environment = $0.011 (per hour) × 3 AZs × 24 hours × 30 days × 142 = $3,373.92
Total costs for 568 endpoints (4 environments) = $3,373.92 × 4 = $13,495.68
Data processing costs for 4 environments = $5.28 (rough estimate)
Total unique VPC endpoints count = 32
Costs for 32 endpoints = $0.011 (per hour) × 3 AZs × 24 hours × 30 days × 32 = $760.32
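To keep the arithmetic reproducible, here is the same calculation as a tiny Python helper; the $0.011 hourly rate is the eu-west-2 interface endpoint price used above, so verify it against current AWS pricing before reusing:

```python
# Monthly interface-endpoint cost, using the eu-west-2 hourly rate of
# $0.011 per endpoint per AZ (verify against current AWS pricing).
HOURLY_RATE = 0.011
AZS = 3
HOURS_PER_MONTH = 24 * 30

def monthly_cost(endpoint_count: int) -> float:
    return HOURLY_RATE * AZS * HOURS_PER_MONTH * endpoint_count

print(monthly_cost(142))      # 3373.92  - one environment, per-VPC endpoints
print(monthly_cost(142 * 4))  # 13495.68 - four environments
print(monthly_cost(32))       # 760.32   - centralized, single shared account
print(monthly_cost(64))       # 1520.64  - centralized, prod + nonprod
```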
A centralized approach, with the VPC endpoints in shared services accounts for prod and nonprod, provides the same scalability and high availability while cutting costs by roughly 89% and reducing the administrative burden. We can go a step further and replace some interface endpoints, such as those for S3 and DynamoDB, with gateway endpoints, which are free of charge; the trade-off is that gateway endpoints cannot be shared across VPCs the way centralized interface endpoints can.
Summary
142 endpoints × 3 AZs × $0.011/hour × 24 hours × 30 days = $3,373.92/month
For 4 environments (568 endpoints): $13,495.68/month
Centralized: 32 unique endpoints, $760.32/month per shared services account, or $1,520.64/month for prod and nonprod together
Savings: ~89%
The costs above are not necessarily a bad thing. You get isolation between environments, and you gather extensive knowledge of how things work and of how to approach stakeholders in order to improve the situation.
Why Isn't Cost Enough to Convince the Business?
The business is only interested in certain things, and I would say nobody cares that the administrative burden would be smaller. So how can you approach this?
When the interface endpoints were first deployed, they were not secured well. This means we now have a lot of networks with inconsistent security standards; each VPC becomes a snowflake. You may want to avoid saying that it is not secure; a more suitable approach would be:
By standardizing the endpoint policies and security groups, you can make sure that sensitive workloads have access only to specific buckets, specific tables, and specific APIs. This improves the security baseline and reduces the blast radius, which in turn reduces the possibility of a data leakage. (How to Sell Optimization Without Saying 'Security Is Bad')
By centralizing and standardizing the interface endpoints we would save roughly 89% of the costs. In Bulgaria, there's a satirical series called 'The Three Fools.' In our case, the fool is the team paying thousands to AWS for redundant endpoints, just because no one paused to rethink the architecture.
Note: security is always a good selling point for the business, and nobody measures it after a change. Controlling fear and risk sells; a good example is the insurance we buy for our houses.
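To make the standardization point concrete, here is a minimal sketch of creating a centralized interface endpoint with a least-privilege endpoint policy attached via boto3. The account ID, role, bucket name, VPC ID, and subnet IDs are hypothetical placeholders:

```python
# Sketch: attach a standardized, least-privilege policy while creating a
# centralized S3 interface endpoint. The account ID, role, bucket, VPC and
# subnet IDs below are hypothetical placeholders.
import json

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111111111111:role/workload-role"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::sensitive-workload-bucket/*",
        }
    ],
}

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-2.s3",
    SubnetIds=["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"],
    PolicyDocument=json.dumps(policy),
    PrivateDnsEnabled=False,  # DNS is handled centrally via a private hosted zone
)
```

The same policy document can be templated once and applied to every centralized endpoint, which is exactly what per-VPC snowflakes make impossible.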
High Level Design
Components Table
| ID | Name | Type | Description |
|---|---|---|---|
| C1 | Interface Endpoints | VPC Interface Endpoints | Provides private access to AWS services (e.g., ssm.eu-west-1.amazonaws.com). |
| C2 | Route 53 Private Hosted Zone | DNS Zone | Hosts private DNS entries for the services. |
| C3 | Route 53 Resolver Inbound Endpoint | DNS Resolver | Accepts DNS queries from the spoke VPC. |
| C4 | Shared Resolver | Route 53 Resolver | Used by EC2 instances in the spoke VPC to resolve private DNS. |
| C5 | AWS RAM | Resource Access Manager | Shares the inbound endpoint and private hosted zone with the spoke VPC. |
| C6 | Cloud WAN Segment Network | Network Routing | Routes traffic between segments (e.g., from spoke to shared services). |
| C7 | Amazon EC2 Instance | Compute | The instance initiating the request to ssm.eu-west-1.amazonaws.com. |
| C8 | Spoke VPC | VPC | Contains the EC2 instance. CIDR: 192.168.20.X. |
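To illustrate how the private hosted zone (C2) reaches the spoke VPC (C8), here is a sketch of the DNS plumbing in boto3. The components table shares the inbound resolver endpoint via AWS RAM; for simplicity, this sketch shows the alternative of associating the private hosted zone with the spoke VPC directly across accounts, and all IDs are hypothetical placeholders:

```python
# Sketch of the DNS plumbing from the components table: a private hosted
# zone for the ssm endpoint, created in the shared-services account, then
# an authorization + association so the spoke VPC (C8) can resolve it.
# All VPC IDs are hypothetical placeholders.
import boto3

r53 = boto3.client("route53")  # credentials for the shared-services account

# C2: private hosted zone for the service's regional DNS name,
# attached to the hub VPC that hosts the interface endpoints.
zone = r53.create_hosted_zone(
    Name="ssm.eu-west-1.amazonaws.com",
    CallerReference="ssm-phz-001",  # must be unique per request
    HostedZoneConfig={"PrivateZone": True},
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0123456789abcdef0"},
)
zone_id = zone["HostedZone"]["Id"]

# Authorize the spoke VPC (owned by another account) to associate with the
# zone; this call runs in the shared-services account...
r53.create_vpc_association_authorization(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0fedcba9876543210"},
)

# ...then the spoke account completes the association, after which EC2
# instances there resolve ssm.eu-west-1.amazonaws.com to the central
# interface endpoint's private IPs.
spoke_r53 = boto3.client("route53")  # credentials for the spoke account
spoke_r53.associate_vpc_with_hosted_zone(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0fedcba9876543210"},
)
```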