How DevOps Engineers Can Build a Cloud Network Using AWS VPC

Overview
Amazon Virtual Private Cloud (VPC) is your private, isolated section of the AWS cloud where you launch and manage your AWS resources. Think of it as your own data centre in the cloud, but with the flexibility and scalability of AWS.
A VPC is a logically isolated virtual network that you define. You have complete control over your networking environment, including IP address ranges, subnets, route tables, network gateways, and security settings. For DevOps engineers, VPC is crucial because it determines how your applications communicate, how secure they are, and how they connect to the outside world.
Core Components
1. CIDR Blocks (IP Address Range): When you create a VPC, you assign it a CIDR block (e.g., 10.0.0.0/16), which determines the range of IP addresses available. You can have a primary CIDR block and add secondary CIDR blocks if you need more IP space. The CIDR block size can range from /16 (65,536 IPs) to /28 (16 IPs).
2. Subnets: Subnets are subdivisions of your VPC's IP address range, and they exist within a single Availability Zone.
You typically create:
Public Subnets: Have a route to an Internet Gateway, so resources here can communicate directly with the internet (e.g., load balancers, bastion hosts)
Private Subnets: Have no direct route to the internet; used for application servers and databases (more secure)
Database Subnets: Often further isolated private subnets specifically for databases with additional security layers
3. Route Tables: Route tables contain rules (routes) that determine where network traffic is directed. Each subnet is associated with exactly one route table; a subnet without an explicit association uses the main route table. You can have:
Main Route Table: Default for the VPC
Custom Route Tables: For specific routing needs per subnet
Routes define destinations (CIDR blocks) and targets (gateways, NAT devices, VPC peering connections)
4. Internet Gateway (IGW): A horizontally scaled, redundant, and highly available VPC component that allows communication between your VPC and the internet. You attach one IGW per VPC, and public subnets route internet-bound traffic through it.
5. NAT Gateway/NAT Instance: Network Address Translation (NAT) lets resources in private subnets initiate outbound connections to the internet (for updates, external API calls) while remaining unreachable for connections initiated from the internet. NAT Gateway is AWS-managed (preferred), while a NAT Instance is a self-managed EC2 instance.
6. Security Groups: Virtual firewalls at the instance/ENI level that control inbound and outbound traffic. They are:
Stateful: Return traffic is automatically allowed
Applied at the instance level: Can be attached to multiple instances
Default deny: Only explicitly allowed traffic passes through
Support rules based on IP addresses, CIDR blocks, or other security groups
7. Network ACLs (NACLs): Stateless firewalls at the subnet level that control traffic in and out of subnets. Unlike security groups:
Stateless: Must explicitly allow both request and response traffic
Applied at the subnet level: Affects all resources in the subnet
Numbered rules: Processed in order (lowest to highest)
Default allow: Default NACL allows all traffic, custom NACLs deny all by default
8. VPC Peering: A network connection between two VPCs that enables routing using private IP addresses. Works across:
Same AWS account
Different AWS accounts
Different AWS regions (inter-region peering)
Non-transitive: If VPC A peers with B, and B peers with C, A cannot communicate with C unless explicitly peered.
9. VPC Endpoints: Enable private connections between your VPC and AWS services without going through the internet. Two types:
Interface Endpoints: Powered by AWS PrivateLink, support most AWS services, use ENIs with private IPs
Gateway Endpoints: For S3 and DynamoDB only, specified as a route table target
10. Transit Gateway: Acts as a central hub to connect multiple VPCs, on-premises networks, and remote networks. Simplifies network architecture when you have many VPCs (10+).
11. VPN Connections: Secure connection between your on-premises network and AWS VPC using:
Site-to-Site VPN: IPsec connection between your network and AWS
Client VPN: OpenVPN-based managed service for remote user access
Virtual Private Gateway: VPN concentrator on the AWS side
12. Direct Connect: Dedicated network connection from your premises to AWS, bypassing the internet for:
More consistent network performance
Reduced bandwidth costs
Private connectivity to VPC
13. Elastic Network Interfaces (ENI): Virtual network cards that you can attach to instances. Each ENI has:
Primary private IP address
One or more secondary private IP addresses
One Elastic IP address per private IP
One or more security groups
MAC address
14. Flow Logs: Capture information about IP traffic going to and from network interfaces in your VPC. Can be created at VPC, subnet, or ENI level, and sent to CloudWatch Logs, S3, or Kinesis Data Firehose.
15. DHCP Options Sets: Define domain name servers, domain names, NTP servers, and NetBIOS name servers for instances in your VPC.
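To make the route table component more concrete: when several routes match a destination, the most specific one (longest prefix) is used. Below is a minimal Python sketch of that lookup. The route entries and the target names (`pcx-example`, `nat-example`) are placeholders, and the code models only the selection rule, not how AWS implements it.

```python
import ipaddress

# Hypothetical route table for a private subnet: (destination CIDR, target).
ROUTES = [
    ("10.0.0.0/16", "local"),
    ("10.1.0.0/16", "pcx-example"),  # placeholder peering connection
    ("0.0.0.0/0", "nat-example"),    # placeholder NAT Gateway
]

def resolve_target(dest_ip, routes=ROUTES):
    """Return the target of the most specific route that matches dest_ip."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route
    # The longest prefix (largest prefix length) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(resolve_target("10.0.32.15"))  # local
print(resolve_target("10.1.4.20"))   # pcx-example
print(resolve_target("52.94.0.10"))  # nat-example
```

A lookup like this is also a quick way to reason about why a private instance does or does not reach the internet: without the 0.0.0.0/0 entry, the last call would return None.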
Real-World DevOps Use Cases
Use Case 1: Multi-Tier Application Architecture
Scenario: You're deploying a typical three-tier web application (web tier, application tier, database tier) that needs to be secure, scalable, and highly available.
Solution Architecture:
┌─────────────────────────────────────────────────────────────┐
│                      VPC (10.0.0.0/16)                      │
│                                                             │
│     ┌─────────────────────┐     ┌─────────────────────┐     │
│     │ Availability Zone 1 │     │ Availability Zone 2 │     │
│     │                     │     │                     │     │
│     │  ┌────────────────┐ │     │  ┌────────────────┐ │     │
│     │  │ Public Subnet  │ │     │  │ Public Subnet  │ │     │
│     │  │ 10.0.1.0/24    │ │     │  │ 10.0.2.0/24    │ │     │
│     │  │ [ALB]          │ │     │  │ [ALB]          │ │     │
│     │  └────────────────┘ │     │  └────────────────┘ │     │
│     │                     │     │                     │     │
│     │  ┌────────────────┐ │     │  ┌────────────────┐ │     │
│     │  │ Private Subnet │ │     │  │ Private Subnet │ │     │
│     │  │ 10.0.11.0/24   │ │     │  │ 10.0.12.0/24   │ │     │
│     │  │ [App Servers]  │ │     │  │ [App Servers]  │ │     │
│     │  └────────────────┘ │     │  └────────────────┘ │     │
│     │                     │     │                     │     │
│     │  ┌────────────────┐ │     │  ┌────────────────┐ │     │
│     │  │ DB Subnet      │ │     │  │ DB Subnet      │ │     │
│     │  │ 10.0.21.0/24   │ │     │  │ 10.0.22.0/24   │ │     │
│     │  │ [RDS Primary]  │ │     │  │ [RDS Standby]  │ │     │
│     │  └────────────────┘ │     │  └────────────────┘ │     │
│     └─────────────────────┘     └─────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
Why it matters:
High availability across multiple AZs
Security through network isolation (databases never exposed to the internet)
Scalability by adding instances in any tier
Controlled egress: only the NAT Gateways in the public subnets give the private tiers a path to the internet
Use Case 2: Hybrid Cloud with VPN Connection
Scenario: Your organisation is migrating to AWS gradually and needs to maintain connectivity with on-premises data centres. Some services remain on-premises while new services are built in AWS.
Solution:
Set up a Site-to-Site VPN between the corporate data centre and the AWS VPC
Configure routing to allow private IP communication
Use Transit Gateway if connecting multiple VPCs to on-premises
Implement security groups to control which resources can communicate
Why it matters:
Seamless integration during migration
Private, encrypted communication
Access to on-premises databases from cloud applications
Ability to burst to the cloud for additional capacity
Use Case 3: Microservices with Service Mesh
Scenario: You're running a microservices architecture on EKS/ECS where services need to communicate securely, and you want to minimise data transfer costs.
Solution:
Deploy microservices across private subnets in multiple AZs
Use VPC endpoints for AWS services (S3, DynamoDB, ECR, CloudWatch)
Implement security groups per microservice with least privilege
Use AWS App Mesh or Istio for service-to-service communication
Enable VPC Flow Logs for troubleshooting
Why it matters:
Reduced NAT Gateway costs (no internet egress for AWS service calls)
Better security (traffic never leaves the AWS network)
Improved performance (private connectivity)
Fine-grained access control
Use Case 4: Multi-Account Strategy with VPC Peering
Scenario: Your organisation follows AWS best practices with separate accounts for Dev, Staging, and Production. Shared services (monitoring, logging, CI/CD) exist in a central account.
Solution:
Create VPCs in each account
Set up VPC peering connections between accounts
Configure route tables to allow specific traffic flows
Use security groups referencing security groups in peered VPCs
Centralise logging using VPC Flow Logs sent to a central account
Why it matters:
Environmental isolation prevents accidents
Centralised services reduce duplication
Clear security boundaries
Cost allocation per environment
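Because VPC peering is non-transitive, a hub-style layout like this one does not let Dev reach Production through the shared-services VPC. The small Python sketch below illustrates that rule; the VPC names and the set of peering connections are hypothetical.

```python
# Hypothetical peering connections: each environment peers only with "shared".
PEERINGS = {
    frozenset({"shared", "dev"}),
    frozenset({"shared", "staging"}),
    frozenset({"shared", "prod"}),
}

def can_route(src, dst, peerings=PEERINGS):
    """Two VPCs can exchange traffic only over a direct peering connection."""
    return src == dst or frozenset({src, dst}) in peerings

print(can_route("dev", "shared"))  # True
print(can_route("dev", "prod"))    # False: no transit through "shared"
```

If environments do need to talk to each other, that calls for an explicit extra peering connection or a Transit Gateway rather than relying on the hub.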
Best Practices of VPC for DevOps Engineers
Now, let's explore the essential practices that will make your VPC secure, scalable, and cost-effective. These practices are based on real-world scenarios.
1. Plan Your CIDR Blocks Before You Start
The Problem: Many beginners choose CIDR blocks without thinking about future growth or integration with other VPCs.
The Solution:
When planning your VPC CIDR block, consider:
Size: Use /16 for production VPCs (gives you 65,536 IP addresses)
Avoid overlaps: Don't use ranges that might overlap with on-premises networks or other VPCs
Reserve space: Don't allocate all IP space immediately; leave room for expansion
Example IP Allocation Strategy:
VPC: 10.0.0.0/16 (65,536 total IPs)
Reserved/Planned Allocation:
├── 10.0.0.0/20 (4,096 IPs) → Reserved for future use
├── 10.0.16.0/20 (4,096 IPs) → Public subnets
│ ├── 10.0.16.0/24 → Public Subnet AZ-1
│ ├── 10.0.17.0/24 → Public Subnet AZ-2
│ └── 10.0.18.0/24 - 10.0.31.0/24 → Future public subnets
├── 10.0.32.0/20 (4,096 IPs) → Private subnets (app tier)
│ ├── 10.0.32.0/24 → Private Subnet AZ-1
│ ├── 10.0.33.0/24 → Private Subnet AZ-2
│ └── 10.0.34.0/24 - 10.0.47.0/24 → Future private subnets
├── 10.0.48.0/20 (4,096 IPs) → Database subnets
│ ├── 10.0.48.0/24 → DB Subnet AZ-1
│ ├── 10.0.49.0/24 → DB Subnet AZ-2
│ └── 10.0.50.0/24 - 10.0.63.0/24 → Future DB subnets
└── 10.0.64.0/18 (16,384 IPs) → Reserved for major expansion
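A plan like this can be sanity-checked before anything is deployed. The following Python sketch uses the standard `ipaddress` module to confirm that each tier block sits inside the VPC and that no two blocks overlap; the tier names are labels chosen here for illustration.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# Tier blocks copied from the allocation plan above.
tiers = {
    "reserved": "10.0.0.0/20",
    "public": "10.0.16.0/20",
    "private": "10.0.32.0/20",
    "database": "10.0.48.0/20",
    "expansion": "10.0.64.0/18",
}
blocks = {name: ipaddress.ip_network(cidr) for name, cidr in tiers.items()}

# Every block must sit inside the VPC and must not overlap any other block.
names = list(blocks)
for i, name in enumerate(names):
    assert blocks[name].subnet_of(vpc), f"{name} is outside the VPC"
    for other in names[i + 1:]:
        assert not blocks[name].overlaps(blocks[other]), f"{name} overlaps {other}"

# Split the public block into /24 subnets: two in use, the rest spare.
public_subnets = list(blocks["public"].subnets(new_prefix=24))
print(public_subnets[0], public_subnets[1])  # 10.0.16.0/24 10.0.17.0/24
print(len(public_subnets))                   # 16

allocated = sum(b.num_addresses for b in blocks.values())
print(vpc.num_addresses - allocated)         # 32768
```

The last figure shows that the plan still leaves half of the /16 (10.0.128.0/17) completely unassigned.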
Common CIDR Ranges to Use (Non-overlapping):
Production: 10.0.0.0/16
Staging: 10.1.0.0/16
Development: 10.2.0.0/16
DR Region: 10.10.0.0/16
Why This Matters:
You can't change a VPC's primary CIDR block after creation; you can only add secondary blocks
Running out of IPs requires a complex migration to a new VPC
Overlapping CIDRs prevent VPC peering and hybrid connectivity
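Overlap checks are easy to automate. Here is a minimal Python sketch that compares every pair of ranges; the four environment CIDRs are the ones listed above, while the `on-prem` range is a hypothetical corporate network added to show what a conflict looks like.

```python
import ipaddress
from itertools import combinations

networks = {
    "production": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "development": "10.2.0.0/16",
    "dr": "10.10.0.0/16",
    "on-prem": "10.2.128.0/17",  # hypothetical on-premises range
}

def find_overlaps(cidrs):
    """Return sorted name pairs whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(c) for name, c in cidrs.items()}
    return sorted(
        tuple(sorted((a, b)))
        for a, b in combinations(parsed, 2)
        if parsed[a].overlaps(parsed[b])
    )

print(find_overlaps(networks))  # [('development', 'on-prem')]
```

Running a check like this in CI against a central list of allocated ranges is one way to catch conflicts before a peering or VPN request fails.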
2. Always Deploy Across Multiple Availability Zones
The Problem: Single AZ deployment means a single point of failure.
The Solution: Deploy every tier of your application across at least 2 Availability Zones.
What This Protects Against:
Hardware failures in a single data centre
Network issues in one AZ
Planned AWS maintenance
Natural disasters (each AZ is in a separate location)
Real Impact: AWS has had AZ outages. Applications deployed in a single AZ went completely down. Multi-AZ applications continued running with minor performance degradation.
3. Implement Defence in Depth with Multiple Security Layers
The Problem: Relying on only security groups leaves you vulnerable if misconfigured.
The Solution: Use multiple layers of network security:
Layer 1: Network ACLs (Subnet level - Stateless)
↓
Layer 2: Security Groups (Instance level - Stateful)
↓
Layer 3: Host-based Firewall (Optional, for high security)
↓
Layer 4: Application-level Security
Key Differences:
| Feature | Security Group | Network ACL |
| Level | Instance/ENI | Subnet |
| State | Stateful (auto allows return) | Stateless (must allow both) |
| Rules | Allow rules only | Allow and Deny rules |
| Evaluation | All rules evaluated | Rules in order (lowest first) |
| Default | Deny all inbound | Default NACL allows all |
Why This Matters: If someone accidentally opens a security group to 0.0.0.0/0, NACL can still block unwanted traffic.
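The interaction between the two layers can be sketched in a few lines of Python. This is a simplified model of inbound evaluation only (it ignores protocols, port ranges, and the return traffic a stateless NACL must also allow), and the rules shown are hypothetical.

```python
import ipaddress

def nacl_allows(rules, src_ip, port):
    """Evaluate NACL rules in ascending rule-number order; first match wins."""
    ip = ipaddress.ip_address(src_ip)
    for _number, cidr, rule_port, action in sorted(rules, key=lambda r: r[0]):
        if ip in ipaddress.ip_network(cidr) and rule_port in (port, "*"):
            return action == "allow"
    return False  # nothing matched: implicit deny

def sg_allows(rules, src_ip, port):
    """Security groups hold allow rules only; any matching rule permits."""
    ip = ipaddress.ip_address(src_ip)
    return any(
        ip in ipaddress.ip_network(cidr) and rule_port == port
        for cidr, rule_port in rules
    )

# Hypothetical inbound rules for a database subnet.
nacl = [
    (100, "10.0.32.0/20", 5432, "allow"),  # app tier only
    (200, "0.0.0.0/0", "*", "deny"),
]
# A security group that was opened to the world by mistake.
sg = [("0.0.0.0/0", 5432)]

def inbound_permitted(src_ip, port):
    """Traffic must pass the subnet's NACL and the instance's security group."""
    return nacl_allows(nacl, src_ip, port) and sg_allows(sg, src_ip, port)

print(inbound_permitted("10.0.32.15", 5432))   # True
print(inbound_permitted("203.0.113.9", 5432))  # False: stopped at the NACL
```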
4. Use VPC Endpoints to Save Costs and Improve Security
The Problem: Applications in private subnets access AWS services through NAT Gateway, costing money and routing traffic through the internet.
The Solution: Implement VPC Endpoints for AWS services your applications use frequently.
Two Types of VPC Endpoints:
A) Gateway Endpoints (FREE!)
Available for: S3 and DynamoDB only
No hourly charges, no data transfer charges
Added to route tables
B) Interface Endpoints (Paid, but saves NAT costs)
Available for: Most AWS services (ECR, CloudWatch, Secrets Manager, etc.)
Cost: ~$7.30/month per endpoint per AZ ($0.01/hour, us-east-1 pricing)
Data processing: $0.01/GB, compared with $0.045/GB through a NAT Gateway
Cost Comparison Example:
Scenario: App in private subnet pulls Docker images from ECR (100GB/month)
Without VPC Endpoint (using NAT Gateway):
- NAT Gateway hourly: $0.045/hour × 730 hours = $32.85
- Data processed: $0.045/GB × 100GB = $4.50
- Total: $37.35/month
With VPC Endpoint:
- Interface Endpoint hourly: $0.01/hour × 730 hours = $7.30
- Data processed: $0.01/GB × 100GB = $1.00
- Total: $8.30/month
Monthly Savings: $29.05 (if the NAT Gateway is no longer needed)
Annual Savings: $348.60
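The same arithmetic is easy to rerun for your own traffic volumes. The Python sketch below assumes us-east-1 list prices, including the $0.01/GB data processing charge that interface endpoints carry; treat every rate as an assumption and check current pricing for your region. It also computes a break-even point for the common case where the NAT Gateway has to stay for other internet traffic.

```python
HOURS_PER_MONTH = 730

# Assumed us-east-1 list prices in USD.
NAT_HOURLY = 0.045
NAT_PER_GB = 0.045
ENDPOINT_HOURLY_PER_AZ = 0.01
ENDPOINT_PER_GB = 0.01

def nat_monthly(gb):
    """Monthly cost of one NAT Gateway processing `gb` gigabytes."""
    return NAT_HOURLY * HOURS_PER_MONTH + NAT_PER_GB * gb

def endpoint_monthly(gb, azs=1):
    """Monthly cost of one interface endpoint deployed in `azs` AZs."""
    return ENDPOINT_HOURLY_PER_AZ * HOURS_PER_MONTH * azs + ENDPOINT_PER_GB * gb

def break_even_gb(azs=1):
    """Monthly traffic above which the endpoint is cheaper than NAT data
    processing alone, i.e. when the NAT Gateway is kept anyway."""
    fixed = ENDPOINT_HOURLY_PER_AZ * HOURS_PER_MONTH * azs
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)

print(f"NAT Gateway:        ${nat_monthly(100):.2f}")
print(f"Interface endpoint: ${endpoint_monthly(100):.2f}")
print(f"Break-even:         {break_even_gb():.0f} GB/month")
```

Under these assumptions, an endpoint that sits alongside an existing NAT Gateway only pays for itself once a service moves a couple of hundred gigabytes a month, so low-traffic services may not justify one on cost alone.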
When to Use VPC Endpoints:
Your app frequently accesses S3 → Always use (it's free!)
Your app uses DynamoDB → Always use (it's free!)
You pull Docker images from ECR → Use interface endpoint
You send logs to CloudWatch → Use interface endpoint
You have high NAT Gateway bills → Analyse VPC Flow Logs and add endpoints
5. Use Security Group References Instead of CIDR Blocks
The Problem: Hardcoding IP ranges in security groups is brittle and hard to maintain.
Problems with CIDR-based rules (for example, allowing the database port from 10.0.32.0/24):
If you add app servers in a new subnet, you must update the security group
If IP ranges change, rules break
Doesn't scale with Auto Scaling Groups
The Solution: Reference the source security group in the rule instead of a CIDR block. Why This is Better:
Any instance with the "app" security group can access the database
Auto Scaling adds new instances? They automatically get access
Change app subnets? Security groups still work
More intuitive: "app tier can access database tier"
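The difference can be shown with a toy model in Python. The `Instance` type and both helper functions are hypothetical and stand in for how a rule matches its source; they are not an AWS API.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Instance:
    ip: str
    security_groups: frozenset

def allowed_by_cidr(instance, cidr):
    """A CIDR rule matches on the caller's IP address."""
    return ipaddress.ip_address(instance.ip) in ipaddress.ip_network(cidr)

def allowed_by_sg_reference(instance, source_sg):
    """A security group reference matches on the caller's group membership."""
    return source_sg in instance.security_groups

# An app server launched by Auto Scaling into a newly added subnet.
new_app = Instance(ip="10.0.34.7", security_groups=frozenset({"app"}))

print(allowed_by_cidr(new_app, "10.0.32.0/24"))  # False: the rule is stale
print(allowed_by_sg_reference(new_app, "app"))   # True
```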
6. Always Enable VPC Flow Logs
The Problem: Without Flow Logs, troubleshooting network issues and security incidents is nearly impossible.
The Solution:
Enable VPC Flow Logs on day one. They capture metadata about IP traffic (addresses, ports, protocol, bytes, accept/reject), not packet contents, and are invaluable for:
Debugging connectivity issues
Security analysis and forensics
Identifying unusual traffic patterns
Compliance and audit requirements
Cost optimisation (finding unnecessary traffic)
How to Use Flow Logs:
- Find rejected connections (blocked by security groups or NACLs):
# CloudWatch Logs Insights Query
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100
- Find top talkers (highest traffic sources):
fields @timestamp, srcAddr, dstAddr, bytes
| stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20
- Analyse traffic to a specific resource:
fields @timestamp, srcAddr, srcPort, dstPort, protocol, bytes
| filter dstAddr = "10.0.32.15"
| sort @timestamp desc
Cost Consideration:
Flow Logs ingestion into CloudWatch Logs: $0.50 per GB
CloudWatch Logs storage: $0.03 per GB/month
For high-traffic VPCs, consider sending to S3 instead (cheaper)
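If the logs go to S3, the same questions can be answered with a short script. The Python sketch below parses records in the default (version 2) flow log format and repeats the "rejected connections" and "top talkers" analyses. The sample records are made up for illustration.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

def parse_record(line):
    """Split one space-separated record into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))

# Made-up sample records.
SAMPLE = """\
2 123456789012 eni-0abc 10.0.32.15 10.0.48.10 44321 5432 6 10 840 1700000000 1700000060 ACCEPT OK
2 123456789012 eni-0abc 203.0.113.9 10.0.32.15 51514 22 6 3 180 1700000000 1700000060 REJECT OK
2 123456789012 eni-0abc 203.0.113.9 10.0.32.15 51520 3389 6 2 120 1700000000 1700000060 REJECT OK
"""

records = [parse_record(line) for line in SAMPLE.splitlines()]

# Rejected connections, counted per source address.
rejected = [r for r in records if r["action"] == "REJECT"]
print(Counter(r["srcaddr"] for r in rejected).most_common())

# Total bytes per source address ("top talkers").
bytes_by_src = Counter()
for r in records:
    bytes_by_src[r["srcaddr"]] += int(r["bytes"])
print(bytes_by_src.most_common())
```

Real logs also contain NODATA and SKIPDATA records, where most fields are `-`; a production parser would need to skip those before converting `bytes` to an integer.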
7. Implement Proper Tagging Strategy
The Problem: Without consistent tags, you can't track costs, find resources, or automate management.
The Solution: Define and enforce a tagging strategy across all VPC resources. In Terraform:
locals {
  common_tags = {
    # Identification
    Name        = "production-vpc"
    Environment = "production" # production, staging, dev
    Project     = "ecommerce-platform"
    # Management
    ManagedBy = "terraform"
    Owner     = "devops-team"
    Team      = "platform-engineering"
    # Cost Tracking
    CostCenter  = "engineering"
    BillingCode = "CC-1234"
    # Compliance
    Compliance = "pci-dss"      # If applicable
    DataClass  = "confidential" # public, internal, confidential
    # Lifecycle
    CreatedDate    = "2026-01-28"
    BackupSchedule = "daily"
  }
}
# Apply to all resources
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = merge(local.common_tags, {
    Name = "production-vpc"
  })
}
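Enforcement is the part that usually gets skipped. As a rough illustration, the Python sketch below mirrors how Terraform's `merge()` resolves duplicate keys (the later map wins) and flags resources that lack required tags. The `REQUIRED_TAGS` set is an assumed policy, not something AWS or Terraform defines.

```python
# Assumed tag policy for illustration.
REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter", "ManagedBy"}

def merge_tags(common, overrides):
    """Later values win, mirroring how Terraform's merge() resolves duplicates."""
    return {**common, **overrides}

def missing_tags(tags, required=REQUIRED_TAGS):
    """Return required keys that are absent or empty, sorted for stable output."""
    return sorted(k for k in required if not tags.get(k))

common = {
    "Environment": "production",
    "Project": "ecommerce-platform",
    "ManagedBy": "terraform",
    "Owner": "devops-team",
}
vpc_tags = merge_tags(common, {"Name": "production-vpc"})
print(missing_tags(vpc_tags))  # ['CostCenter']
```

A check like this could run against the output of `terraform show -json` in a pipeline, though the wiring for that is left out here.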
Benefits:
Cost allocation reports by project/team
Easy resource filtering and searching
Automated compliance checking
Resource lifecycle management
Better organisation in AWS Console
8. Use Infrastructure as Code for Everything
The Problem: Manually creating VPC components in the console leads to:
Configuration drift
No version control
Difficult disaster recovery
Hard to replicate across environments
The Solution:
Use Terraform (or CloudFormation) for all VPC infrastructure.
Benefits:
Version Control
Reproducibility
Disaster Recovery
Code Review
Documentation: Your Terraform code IS your documentation. It shows exactly what exists and how it's configured.
9. Plan for High Availability with NAT Gateways
The Problem: A single NAT Gateway creates a single point of failure.
The Solution: Deploy a NAT Gateway in each Availability Zone and route each private subnet to the gateway in its own AZ:
AZ-1: Private Subnet → NAT Gateway 1 (in Public Subnet) → Internet Gateway
AZ-2: Private Subnet → NAT Gateway 2 (in Public Subnet) → Internet Gateway
Why This Matters:
If the NAT Gateway in AZ-1 fails, only AZ-1 traffic is affected
AZ-2 continues operating normally
No total outage
Cost vs Availability Trade-off:
1 NAT Gateway: $32/month - Not HA, entire VPC loses internet if it fails
2 NAT Gateways: $64/month - HA, failure limited to one AZ
For production: Always use 2+ NAT Gateways
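The blast radius of each design can be expressed as a tiny Python model. The subnet and gateway names are hypothetical; each mapping stands for "this private subnet's default route points at this NAT Gateway".

```python
def subnets_without_egress(nat_by_subnet, failed_nats):
    """Return private subnets whose default route points at a failed NAT Gateway."""
    return sorted(s for s, nat in nat_by_subnet.items() if nat in failed_nats)

# Option 1: both private subnets share one NAT Gateway.
single_nat = {"private-a": "nat-1", "private-b": "nat-1"}
# Option 2: each private subnet uses the NAT Gateway in its own AZ.
nat_per_az = {"private-a": "nat-1", "private-b": "nat-2"}

print(subnets_without_egress(single_nat, {"nat-1"}))  # ['private-a', 'private-b']
print(subnets_without_egress(nat_per_az, {"nat-1"}))  # ['private-a']
```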
10. Monitor and Set Up Alerts
The Problem: Network issues go unnoticed until they become critical.
The Solution: Set up CloudWatch alarms for key VPC metrics:
Key Metrics to Monitor:
NAT Gateway: PacketsDropCount, BytesInFromDestination, BytesOutToDestination
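VPC Flow Logs: Count of REJECT records (for example via a CloudWatch Logs metric filter)
VPN Connections: TunnelState, TunnelDataIn, TunnelDataOut
Network ACLs: No dedicated CloudWatch metric; denied packets show up as REJECT records in Flow Logs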
Common Pitfalls to Avoid
Now let's look at the mistakes that even experienced DevOps engineers make with VPCs. Learning from these will save you hours of troubleshooting and potential security incidents.
1. Choosing a CIDR Block That's Too Small
Why This is Problematic:
Let's do the math:
/24 CIDR = 256 total IPs
AWS reserves 5 IPs per subnet
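If you create 2 public, 2 private, 2 database subnets (6 subnets):
Six equal subnets fit only as /27 (32 IPs each): 32 - 5 reserved = 27 usable IPs per subnet, 162 in total
At the minimum subnet size of /28 (16 IPs): 16 - 5 reserved = 11 usable IPs per subnet, 66 in total
This means:
You can have at most 11 to 27 addresses per subnet
Those addresses are shared by instances, load balancers, RDS instances, and Lambda ENIs
Little room to add more subnets for expansion
Growing later means adding a secondary CIDR block, which may overlap with peered VPCs or on-premises networks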
Impact on Subnet Sizing:
| CIDR | Total IPs | Reserved | Usable | Good For |
| /28 | 16 | 5 | 11 | Very small subnets only |
| /27 | 32 | 5 | 27 | Small subnets |
| /26 | 64 | 5 | 59 | Medium subnets |
| /25 | 128 | 5 | 123 | Good for most use cases |
| /24 | 256 | 5 | 251 | Standard subnet size |
2. Forgetting About AWS Reserved IPs
AWS reserves 5 IPs in every subnet: the first four addresses and the last one. For a /24 such as 10.0.1.0/24:
.0: Network address
.1: VPC router
.2: DNS server
.3: Future use
.255: Broadcast (not used but reserved)
A /28 subnet (16 IPs) only has 11 usable IPs!
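For subnets smaller than /24 the reserved addresses no longer end in .0 to .3 and .255, which is easy to get wrong by hand. A short Python sketch with the standard `ipaddress` module lists them for any subnet and reproduces the usable counts; the example subnet 10.0.1.16/28 is arbitrary.

```python
import ipaddress

def reserved_addresses(cidr):
    """AWS reserves the first four addresses and the last one in each subnet."""
    net = ipaddress.ip_network(cidr)
    return [net[0], net[1], net[2], net[3], net[-1]]

def usable_ips(cidr):
    """Addresses left for your resources after the five reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - 5

print([str(a) for a in reserved_addresses("10.0.1.16/28")])
# ['10.0.1.16', '10.0.1.17', '10.0.1.18', '10.0.1.19', '10.0.1.31']

for prefix in (28, 27, 26, 25, 24):
    print(f"/{prefix}: {usable_ips(f'10.0.0.0/{prefix}')} usable")
```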
3. Overly Permissive Security Groups
# ❌ NEVER DO THIS
ingress {
  from_port   = 0
  to_port     = 65535
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}
4. Not Using Bastion Hosts or Session Manager
Exposing SSH on your instances directly to the internet is a security risk. Use:
AWS Systems Manager Session Manager (no bastion needed!)
Bastion/Jump host in public subnet with strict security groups
VPN or Direct Connect for admin access
5. Mixing Environment Workloads in the Same VPC
Don't run dev, staging, and prod in the same VPC:
Accidental changes affect all environments
Difficult to implement different security policies
Compliance issues
Cost allocation problems
6. Not Testing Failure Scenarios
Test these scenarios:
NAT Gateway failure
AZ failure
Security group misconfigurations
Route table errors
7. Ignoring VPC Limits
AWS VPC quotas (defaults; most can be increased on request):
5 VPCs per region
200 subnets per VPC
2,500 security groups per region
60 inbound and 60 outbound rules per security group
50 active VPC peering connections per VPC (can be raised to 125)
Plan accordingly and request limit increases early!
Troubleshooting Common Issues
Issue 1: Instances in Private Subnet Can't Access the Internet
Check:
# 1. Verify NAT Gateway is running
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=vpc-xxxxx"
# 2. Check route table has route to NAT Gateway
aws ec2 describe-route-tables --route-table-ids rtb-xxxxx
# 3. Verify subnet association
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xxxxx"
# 4. Check security group allows outbound traffic
aws ec2 describe-security-groups --group-ids sg-xxxxx
Issue 2: Can't SSH to the Instance
Check:
# 1. Security group allows SSH from your IP
aws ec2 describe-security-groups --group-ids sg-xxxxx | grep -A5 IpPermissions
# 2. Instance has public IP (if in public subnet)
aws ec2 describe-instances --instance-ids i-xxxxx | grep PublicIpAddress
# 3. Route table has route to IGW (for public subnet)
aws ec2 describe-route-tables --route-table-ids rtb-xxxxx
# 4. NACL allows SSH traffic
aws ec2 describe-network-acls --network-acl-ids acl-xxxxx
Issue 3: High NAT Gateway Costs
Analyse VPC Flow Logs:
# Example CloudWatch Insights query
fields @timestamp, srcAddr, dstAddr, bytes
| filter dstAddr not like /^10\./ # Traffic leaving VPC
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 20
Solutions:
Implement VPC endpoints for AWS services
Move large data transfers to Direct Connect
Use S3 Transfer Acceleration for uploads
Consolidate outbound traffic
Integration with Other AWS Services
EC2 Integration
resource "aws_instance" "app" {
  ami                    = "ami-xxxxxxxxx"
  instance_type          = "t3.medium"
  subnet_id              = aws_subnet.private[0].id
  vpc_security_group_ids = [aws_security_group.app.id]
  # Uses VPC's DNS
  # Communicates with other instances via private IPs
  # Uses NAT Gateway for internet access
}
RDS Integration
resource "aws_db_subnet_group" "main" {
  name       = "main-db-subnet-group"
  subnet_ids = aws_subnet.database[*].id
}
resource "aws_db_instance" "main" {
  engine                 = "postgres"
  instance_class         = "db.t3.medium"
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.database.id]
  # Other required arguments (allocated_storage, username, password) omitted for brevity
  # Automatically gets private IPs in database subnets
  # Accessible only via security group rules
}
ECS/EKS Integration
resource "aws_ecs_service" "app" {
  name            = "app-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }
  # Tasks run in private subnets
  # Use NAT Gateway for external access
  # Use VPC endpoints for ECR, CloudWatch
}
Cost Optimisation Tips
1. Use Gateway Endpoints (Free!)
S3 Gateway Endpoint: Save NAT Gateway data transfer costs
DynamoDB Gateway Endpoint: Free private access
2. Right-Size NAT Gateways
Development: Single NAT Gateway (~$32/month)
Production: NAT Gateway per AZ (~$64/month for 2 AZs)
Consider NAT Instance for very low traffic (~$4/month for t3.nano)
3. VPC Endpoints vs NAT Gateway Cost Comparison
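| Service | NAT Gateway | Interface Endpoint |
| ECR (100GB/month) | $32.85 + $4.50 data = $37.35 | $7.30 + $1.00 data = $8.30 |
| CloudWatch Logs (10GB/month) | $32.85 + $0.45 data = $33.30 | $7.30 + $0.10 data = $7.40 |
| Both services combined | $32.85 + $4.95 data = $37.80 (one shared NAT Gateway) | $14.60 + $1.10 data = $15.70 |
| Monthly Savings | - | $22.10 (only if the NAT Gateway can be removed) |
Figures assume us-east-1 list prices and each endpoint deployed in a single AZ.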
4. Monitor VPC Flow Logs
Identify and eliminate unnecessary traffic:
Cross-AZ data transfer costs ($0.01/GB)
NAT Gateway data processing costs
Unnecessary internet egress
5. Use Resource Tags for Cost Allocation
tags = {
  CostCenter  = "engineering"
  Project     = "web-app"
  Environment = "production"
}
Then activate these keys as cost allocation tags in the AWS Billing console so they appear in cost reports.
Summary
VPC is the foundation of your AWS infrastructure. Key takeaways:
Plan IP addressing carefully - Use /16 for VPC, leave room for growth
Always use multiple AZs - High availability is not optional
Layer your security - NACLs + Security Groups + host firewalls
Use VPC Endpoints - Save money and improve security
Enable Flow Logs - Essential for troubleshooting and security
Infrastructure as Code - Use Terraform/CloudFormation for everything
Security group references over CIDRs - More flexible and maintainable
Test failure scenarios - Don't wait for production incidents
Master VPC networking, and you'll have a solid foundation for deploying secure, scalable, and cost-effective applications on AWS.