Mastering AWS EC2: A Comprehensive Guide for DevOps Engineers

Overview
Amazon Elastic Compute Cloud (EC2) is AWS's foundational compute service that provides resizable virtual servers in the cloud. For many DevOps engineers, EC2 is the first hands-on experience with AWS, and understanding it deeply is crucial for building scalable, reliable infrastructure.
EC2 allows you to launch virtual machines (called instances) with various configurations of CPU, memory, storage, and networking. You pay only for what you use, can scale up or down based on demand, and have complete control over your computing resources.
Core Concepts
1. Instance Types:
EC2 offers instance families optimised for different workloads. Each instance type provides a specific combination of CPU, memory, storage, and networking capacity.
Instance Families:
General Purpose (T, M series): Balanced CPU, memory, and networking. Good for web servers, small databases, and development environments.
T3/T4g: Burstable performance, cost-effective for variable workloads
M5/M6i: Balanced performance for most workloads
Compute Optimised (C series): High-performance processors for compute-intensive workloads like batch processing, media transcoding, high-performance web servers, and scientific modelling.
C5/C6i: Compute-optimised instances for CPU-bound workloads
Memory Optimised (R, X series): Large memory for in-memory databases, real-time big data analytics, and high-performance databases.
R5/R6i: General memory-intensive workloads
X1/X2: Extreme memory for SAP HANA, big data processing
Storage Optimised (I, D series): High sequential read/write access to large datasets. Good for NoSQL databases, data warehousing, Hadoop/Spark clusters.
I3/I4i: NVMe SSD storage
D2/D3: Dense HDD storage
Accelerated Computing (P, G, F series): GPU or FPGA hardware accelerators for machine learning, graphics processing, and video encoding.
P3/P4: GPU for ML training
G4/G5: GPU for inference and graphics
Inf1: AWS Inferentia for ML inference
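Instance Sizing: Sizes within a family typically run from nano up to 32xlarge or larger, and the available range varies by family. For example, in the T3 family: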
t3.nano → 2 vCPU, 0.5 GB RAM
t3.micro → 2 vCPU, 1 GB RAM
t3.small → 2 vCPU, 2 GB RAM
t3.medium → 2 vCPU, 4 GB RAM
t3.large → 2 vCPU, 8 GB RAM
...
t3.2xlarge → 8 vCPU, 32 GB RAM
2. Amazon Machine Images (AMI):
An AMI is a template that contains the software configuration (operating system, application server, applications) needed to launch an instance. Think of it as a snapshot of a complete system.
AMI Types:
AWS-provided AMIs: Amazon Linux 2, Ubuntu, Windows Server, Red Hat, etc.
Marketplace AMIs: Pre-configured by third parties (e.g., WordPress, Jenkins)
Community AMIs: Shared by AWS users
Custom AMIs: Your own images created from configured instances
AMI Components:
Root volume template (EBS snapshot or instance store)
Launch permissions (which accounts can use it)
Block device mapping (volumes to attach)
3. Instance Purchasing Options:
On-Demand Instances:
Pay by the hour/second with no commitments
Use case: Short-term, unpredictable workloads
Cost: Highest per-hour rate
Reserved Instances (RI):
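Commit to 1 or 3 years for a typical 40-60% discount (AWS advertises up to 72%, depending on term and payment option)
Types: Standard (cannot be exchanged), Convertible (can be exchanged for a different instance family)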
Payment: All upfront, partial upfront, or no upfront
Use case: Steady-state workloads
Savings Plans:
Commit to consistent usage ($/hour) for 1 or 3 years
More flexible than RIs (Compute Savings Plans apply across instance family, OS, and region; EC2 Instance Savings Plans are tied to one family in one region)
Use case: Modern alternative to Reserved Instances
Spot Instances:
Use spare EC2 capacity for up to a 90% discount (there is no bidding; you pay the current Spot price and can optionally set a maximum price)
AWS can terminate with a 2-minute notice when capacity is needed
Use case: Fault-tolerant, flexible workloads (batch jobs, CI/CD, big data)
Dedicated Hosts:
Physical EC2 server dedicated to your use
Use case: Compliance requirements, server-bound licenses
Cost: Most expensive
Dedicated Instances:
Instances run on hardware dedicated to a single customer
Cheaper than Dedicated Hosts, but less control
4. Instance Lifecycle:
Pending → Running → Stopping → Stopped → Terminated
             ↓                    ↓
         Rebooting          (can restart)
Pending: Instance is launching
Running: Instance is running (you're billed)
Stopping: Instance is shutting down (EBS-backed only)
Stopped: Instance is stopped (not billed for compute, only storage)
Terminated: Instance is deleted (cannot be recovered)
Rebooting: Temporary restart (stays on the same host)
5. Storage Options:
Elastic Block Store (EBS):
Network-attached storage that persists independently from the instance
Types:
gp3/gp2: General-purpose SSD (most common)
io2/io1: Provisioned IOPS SSD (high-performance databases)
st1: Throughput optimised HDD (big data, data warehouses)
sc1: Cold HDD (infrequent access)
Can snapshot for backups
Can detach and attach to different instances
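As a rough illustration of the EBS properties listed above, the following Terraform sketch provisions an encrypted gp3 volume and attaches it to an existing instance. The variable names, the device name, and the size are assumptions chosen for illustration; the IOPS and throughput values are the gp3 baseline.

```hcl
variable "instance_id" {
  type        = string
  description = "Hypothetical: ID of an existing instance"
}

variable "availability_zone" {
  type        = string
  description = "Hypothetical: must match the AZ of the instance"
}

# A general-purpose SSD volume that lives independently of any instance
resource "aws_ebs_volume" "data" {
  availability_zone = var.availability_zone
  size              = 100 # GiB (assumed)
  type              = "gp3"
  iops              = 3000 # gp3 baseline
  throughput        = 125  # MiB/s, gp3 baseline
  encrypted         = true

  tags = {
    Name = "app-data"
  }
}

# Attach the volume; it can later be detached and attached elsewhere in the same AZ
resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.data.id
  instance_id = var.instance_id
}
```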
Instance Store:
Physical disk attached to the host machine
Ephemeral (data lost when instance stops/terminates)
Very high IOPS
Use case: Temporary data, caches, buffers
EBS vs Instance Store:
| Feature | EBS | Instance Store |
|---|---|---|
| Persistence | Yes | No (ephemeral) |
| Snapshot | Yes | No |
| Resize | Yes | No |
| Performance | Good | Excellent |
| Cost | Paid separately | Included |
6. Networking:
Elastic Network Interface (ENI):
Virtual network card attached to instances
Has MAC address, private IP(s), security groups
Can attach multiple ENIs to an instance
Can move ENI between instances
Elastic IP (EIP):
Static public IPv4 address
Can associate/disassociate from instances
Charged hourly; since February 2024 AWS bills for every public IPv4 address, whether or not it is associated with a running instance
Limited to 5 per region (soft limit)
Enhanced Networking:
Single Root I/O Virtualisation (SR-IOV) for higher bandwidth, higher PPS, lower latency
Enabled by default on modern instance types
No additional charge
Placement Groups:
Cluster: Instances in a single AZ, low latency (HPC applications)
Spread: Instances on distinct hardware, max 7 per AZ (critical instances)
Partition: Instances in logical partitions, partitions on different racks (distributed systems like Hadoop, Cassandra)
7. Security:
Security Groups:
Virtual firewall at the instance level
Stateful (return traffic automatically allowed)
Default: Deny all inbound, allow all outbound
Can reference other security groups
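To make the "reference other security groups" point concrete, here is a minimal Terraform sketch in which the application tier only accepts traffic that originates from the load balancer's security group. The resource names, the `vpc_id` variable, and port 8080 are assumptions for illustration.

```hcl
variable "vpc_id" { type = string } # hypothetical input

resource "aws_security_group" "alb" {
  name_prefix = "alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from the internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Terraform removes the default allow-all egress rule, so it is declared explicitly
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "App traffic from the ALB security group only"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because security groups are stateful, no extra rule is needed for the responses the application sends back to the load balancer.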
Key Pairs:
Public-private key cryptography for SSH/RDP access
AWS stores the public key; you download the private key
Can create or import your own
IAM Roles for EC2:
Attach IAM role to the instance for AWS API access
Temporary credentials automatically rotated
Better than storing access keys on an instance
8. Monitoring and Management:
CloudWatch Metrics:
Default metrics (5-minute intervals, free):
CPUUtilization
NetworkIn/Out
DiskReadOps/WriteOps
StatusCheckFailed
Detailed monitoring (1-minute intervals, paid)
Systems Manager (SSM):
Manage instances without SSH/RDP
Run commands, patch management, and inventory
Session Manager for shell access
Parameter Store for configuration
User Data:
Script that runs when the instance first boots (by default it does not run again on later restarts)
Used for bootstrap/configuration
Runs as root user
Can be modified when the instance is stopped
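A small sketch of user data in Terraform, assuming an Amazon Linux 2023 AMI (hence `dnf`) and a hypothetical `ami_id` variable. The nginx installation is only a stand-in for whatever bootstrap work an instance needs.

```hcl
variable "ami_id" { type = string } # hypothetical: an Amazon Linux 2023 AMI

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Runs as root at first boot; the <<- form strips the leading indentation
  user_data = <<-EOF
    #!/bin/bash
    set -euo pipefail
    dnf install -y nginx
    systemctl enable --now nginx
  EOF
}
```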
9. Auto Scaling:
Auto Scaling Groups (ASG):
Automatically adjust the number of instances
Maintains desired capacity
Integrates with ELB for health checks
Scaling policies: Target tracking, step, scheduled
Launch Templates:
Versioned template for launching instances
Defines AMI, instance type, key pair, security groups, etc.
Used by Auto Scaling Groups
10. High Availability and Fault Tolerance:
Multi-AZ Deployment:
Deploy instances across multiple Availability Zones
Protect against AZ failures
Use with Auto Scaling and Load Balancers
Elastic Load Balancing:
Distributes traffic across instances
Types: ALB (Layer 7), NLB (Layer 4), GWLB (Gateway Load Balancer, Layer 3)
Health checks to route only to healthy instances
Snapshots:
Point-in-time backup of EBS volumes
Incremental backups stored in S3
Can create an AMI from a snapshot
Can copy across regions
11. Instance Metadata Service (IMDS):
API accessible from within the instance at http://169.254.169.254
Provides information about the instance (instance ID, AMI ID, IAM role credentials)
IMDSv1: Request/response (less secure)
IMDSv2: Session-oriented with token (more secure, recommended)
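One way to enforce IMDSv2 is at launch time. The sketch below assumes a hypothetical `ami_id` variable and shows the `metadata_options` block on an instance; with `http_tokens` set to `required`, requests without a session token are rejected.

```hcl
variable "ami_id" { type = string } # hypothetical input

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # IMDSv2 only; IMDSv1 calls fail
    http_put_response_hop_limit = 1          # keeps the token from travelling beyond the instance
  }
}
```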
12. Hibernation:
Saves RAM contents to the EBS root volume
Instance resumes with the same instance ID, private IPs, and RAM state
Resumes faster than a full boot after stop/start
Use case: Long-running processes, pre-warmed applications
Limitations: Not all instance types, max 60 days of hibernation
13. Elastic Fabric Adapter (EFA):
Network device for HPC and ML workloads
Bypasses the OS kernel for ultra-low latency
Supports MPI (Message Passing Interface)
Use case: Distributed ML training, computational fluid dynamics
14. Nitro System:
AWS's custom hardware and hypervisor
Offloads virtualisation functions to dedicated hardware for better performance and security
Components:
Nitro cards (networking, storage, management)
Nitro security chip
Nitro hypervisor (lightweight)
Most modern instance types use Nitro
Real-World DevOps Use Cases
Use Case 1: Auto-Scaling Web Application with Load Balancer
Scenario: You have a web application that experiences variable traffic throughout the day. During peak hours (9 AM - 5 PM), you need 10 instances; during off-hours, 2 are sufficient. You want to optimise costs while maintaining performance.
Solution Architecture:
                  Internet
                     ↓
              Internet Gateway
                     ↓
          Application Load Balancer
                     ↓
       ┌─────────────┴─────────────┐
       ↓                           ↓
Auto Scaling Group          Auto Scaling Group
(AZ-1: 1-5 instances)       (AZ-2: 1-5 instances)
       ↓                           ↓
  Target Group                Target Group
 (Health Checks)             (Health Checks)
Key Components:
Application Load Balancer (ALB): Distributes traffic across instances in multiple AZs
Auto Scaling Group: Automatically adds/removes instances based on CPU utilisation
Target Groups: Register instances for health monitoring
CloudWatch Alarms: Trigger scaling policies based on metrics
Why it matters:
Cost Optimisation: Pay only for instances you need
High Availability: Multi-AZ deployment protects against failures
Automatic Recovery: Unhealthy instances are automatically replaced
Performance: Scales out during peak demand
Implementation: We'll cover this in the hands-on section.
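Ahead of that, the time-of-day pattern in this scenario can be sketched with scheduled scaling actions. This is a minimal example under assumptions: `asg_name` is a hypothetical variable naming an existing Auto Scaling Group, the schedule runs every day, and the time zone is UTC.

```hcl
variable "asg_name" { type = string } # hypothetical: name of an existing ASG

# 09:00: raise capacity to the peak level of 10 instances
resource "aws_autoscaling_schedule" "peak" {
  scheduled_action_name  = "scale-up-peak"
  autoscaling_group_name = var.asg_name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 10
  recurrence             = "0 9 * * *"
  time_zone              = "Etc/UTC" # assumed; use your users' time zone
}

# 17:00: drop back to the off-hours level of 2 instances
resource "aws_autoscaling_schedule" "off_peak" {
  scheduled_action_name  = "scale-down-off-peak"
  autoscaling_group_name = var.asg_name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 17 * * *"
  time_zone              = "Etc/UTC"
}
```

Scheduled actions handle the predictable part of the load; a metric-based policy can still adjust capacity between those two points.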
Use Case 2: Spot Instances for CI/CD Pipeline
Scenario: Your CI/CD pipeline runs hundreds of build and test jobs daily. These jobs are fault-tolerant (can be retried), and each runs for 10-30 minutes. You're spending $2,000/month on On-Demand instances.
Solution:
Use Spot Instances for build agents, saving up to 90% on compute costs:
Jenkins/GitLab CI Master (On-Demand)
               ↓
          ┌────┴────┐
          ↓         ↓
     Spot Fleet  Spot Fleet
      (AZ-1)      (AZ-2)
       Build       Build
       Agents      Agents
Configuration:
Spot Fleet: Mix of instance types (c5.large, c5.xlarge, c6i.large)
Diversification: Request multiple instance types to reduce interruptions
Fallback: On-Demand instances when Spot is unavailable
Interruption Handling: Save build state, retry on a different instance
Cost Comparison:
On-Demand (c5.xlarge): $0.17/hour
Spot (c5.xlarge): $0.034/hour (80% discount)
Monthly cost for 20 instances running 24/7:
On-Demand: 20 × $0.17 × 730 = $2,482/month
Spot: 20 × $0.034 × 730 = $496/month
Savings: $1,986/month = $23,832/year
Why it matters:
Massive Cost Savings: 80-90% cheaper than On-Demand
Same Performance: Identical instances, just cheaper
Fault Tolerance: CI/CD jobs can handle interruptions
Smart Orchestration: Spot Fleet automatically replaces interrupted instances
Use Case 3: Blue-Green Deployment with AMIs
Scenario: You need to deploy a new version of your application with zero downtime and the ability to quickly roll back if issues occur.
Solution Architecture:
Current State (Blue):
ALB → Target Group (Blue) → ASG with v1.0 AMI
Deployment:
1. Create new AMI with v2.0
2. Create new ASG with v2.0 AMI (Green)
3. Attach to same ALB
4. Gradually shift traffic to Green
5. Monitor health and errors
6. If successful: Terminate Blue ASG
If issues: Shift traffic back to Blue
Final State:
ALB → Target Group (Green) → ASG with v2.0 AMI
Steps:
Golden AMI Creation: Application + dependencies baked into AMI
Green Environment: Launch new ASG with new AMI
Testing: Verify green environment health
Traffic Shift: Use ALB weighted target groups to gradually shift traffic
Monitoring: Watch metrics closely during transition
Rollback or Complete: Quick rollback if issues, otherwise decommission blue
Why it matters:
Zero Downtime: Users never experience an outage
Fast Rollback: Switch back to blue in seconds if needed
Testing in Production: Test with real traffic gradually
Infrastructure as Code: Entire process automated with Terraform/CloudFormation
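The traffic shift step can be sketched with a weighted forward action on the ALB listener. Everything named here is an assumption for illustration: the variables stand in for an existing load balancer and two existing target groups, and the listener uses plain HTTP on port 80 to keep the example short.

```hcl
variable "alb_arn" { type = string }      # hypothetical: existing ALB
variable "blue_tg_arn" { type = string }  # hypothetical: target group for v1.0
variable "green_tg_arn" { type = string } # hypothetical: target group for v2.0

variable "green_weight" {
  type        = number
  default     = 10
  description = "Share of traffic (0-100) sent to green"
}

resource "aws_lb_listener" "web" {
  load_balancer_arn = var.alb_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "forward"

    forward {
      target_group {
        arn    = var.blue_tg_arn
        weight = 100 - var.green_weight
      }

      target_group {
        arn    = var.green_tg_arn
        weight = var.green_weight
      }
    }
  }
}
```

Raising `green_weight` in steps (for example 10, 50, 100) and re-applying shifts traffic gradually; setting it back to 0 is the rollback.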
Best Practices for DevOps
Now that we've built a production-ready auto-scaling infrastructure, let's explore the essential practices that will make your EC2 deployments secure, cost-effective, and maintainable.
1. Always Use IAM Roles Instead of Access Keys
The Problem: Storing AWS credentials (access keys) on EC2 instances creates security risks and management overhead.
The Solution:
Never hardcode credentials. Always attach IAM roles to instances.
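A minimal Terraform sketch of that pattern follows. The role, policy, and bucket names are hypothetical, as is the `ami_id` variable; the point is the chain of role, instance profile, and instance, with a narrowly scoped policy in place of access keys.

```hcl
variable "ami_id" { type = string } # hypothetical input

# Role that EC2 instances are allowed to assume
resource "aws_iam_role" "app" {
  name = "app-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Least privilege: read-only access to one (hypothetical) bucket
resource "aws_iam_role_policy" "read_config" {
  name = "read-app-config"
  role = aws_iam_role.app.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::example-app-config/*"
    }]
  })
}

# The instance profile is the container that hands the role to EC2
resource "aws_iam_instance_profile" "app" {
  name = "app-instance-profile"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.app.name
}
```

The AWS SDKs and CLI on the instance pick up the role's temporary credentials automatically, so nothing has to be written to disk.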
Why This Matters:
Security: No credentials stored on disk
Automatic Rotation: AWS handles credential rotation
Audit Trail: CloudTrail logs all API calls with role information
Least Privilege: Easy to grant minimal permissions
No Key Management: No need to distribute/rotate keys
Real-World Impact: I've seen leaked credentials on GitHub cost companies thousands in unauthorised usage within hours. IAM roles keep long-lived keys off the instance entirely, which removes the most common way this happens.
2. Use Launch Templates, Not Launch Configurations
The Problem: Launch Configurations are legacy and lack modern features.
The Solution:
Always use Launch Templates for new deployments:
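A launch template along these lines might look like the sketch below. The `ami_id` variable, the names, and the instance type are assumptions; the blocks shown correspond to features that Launch Configurations do not offer.

```hcl
variable "ami_id" { type = string } # hypothetical input

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = var.ami_id
  instance_type = "t3.medium"

  # Every change creates a new template version and makes it the default
  update_default_version = true

  # Require IMDSv2 session tokens
  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = "required"
  }

  # T3 unlimited mode: burst beyond the credit balance (billed extra)
  credit_specification {
    cpu_credits = "unlimited"
  }

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "web"
    }
  }
}
```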
Launch Template Advantages:
Versioning: Track changes, rollback easily
Multiple Instance Types: Support for Spot, On-Demand mix
Modern Features: IMDSv2, T3 unlimited, etc.
Version Derivation: A new version can be based on an earlier one, overriding only the parameters that change
Better Spot Support: Works with Spot Fleets and mixed instances
3. Enable Detailed Monitoring for Production Instances
The Problem: Basic monitoring (5-minute intervals) delays problem detection.
The Solution: Enable detailed monitoring (1-minute intervals) for production workloads.
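As an illustration, detailed monitoring is a single argument on the instance, and a one-minute alarm is where it pays off. The `ami_id` variable, the names, and the 80% threshold are assumptions for this sketch.

```hcl
variable "ami_id" { type = string } # hypothetical input

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  monitoring    = true # detailed (1-minute) CloudWatch metrics
}

# Fires after three consecutive 1-minute periods above 80% CPU
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "api-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80

  dimensions = {
    InstanceId = aws_instance.api.id
  }
}
```

With basic monitoring the same alarm could not evaluate more often than every five minutes.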
4. Implement a Proper Tagging Strategy for EC2
The Problem: Without tags, you can't track costs, automate management, or organise resources.
The Solution: Tag everything with a consistent strategy:
locals {
  common_tags = {
    # Identification
    Environment = "production"
    Project     = "webapp"
    Application = "api-server"
  }
}
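One possible way to apply these tags everywhere is the provider's `default_tags` block. This sketch assumes it lives in the same module as the `common_tags` local shown above; the region is an arbitrary choice.

```hcl
provider "aws" {
  region = "us-east-1" # assumed

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = local.common_tags
  }
}
```

Individual resources can still add their own tags (a `Name`, for instance), which are merged with the defaults.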
5. Use Systems Manager Session Manager Instead of SSH
The Problem: SSH requires:
Open port 22 (security risk)
Bastion hosts (additional cost and complexity)
Key management and distribution
Direct network access to instances
The Solution:
Use AWS Systems Manager Session Manager for secure, audited shell access:
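A sketch of what an instance needs for Session Manager, under a few assumptions: the variables are hypothetical inputs, the AMI ships with the SSM Agent (Amazon Linux does), and the private subnet can reach the Systems Manager endpoints through a NAT gateway or VPC endpoints.

```hcl
variable "ami_id" { type = string }            # hypothetical
variable "vpc_id" { type = string }            # hypothetical
variable "private_subnet_id" { type = string } # hypothetical

resource "aws_iam_role" "ssm" {
  name = "ssm-managed-instance"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS managed policy that lets the SSM Agent register and open sessions
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm" {
  name = "ssm-managed-instance"
  role = aws_iam_role.ssm.name
}

resource "aws_security_group" "no_inbound" {
  name_prefix = "ssm-only-"
  vpc_id      = var.vpc_id

  # No ingress blocks at all: port 22 stays closed
  egress {
    description = "HTTPS out to the Systems Manager endpoints"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  subnet_id              = var.private_subnet_id
  iam_instance_profile   = aws_iam_instance_profile.ssm.name
  vpc_security_group_ids = [aws_security_group.no_inbound.id]
}
```

Once the instance registers, a shell can be opened with `aws ssm start-session --target <instance-id>`, which requires the Session Manager plugin for the AWS CLI.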
Benefits:
No SSH port exposed (better security)
No bastion hosts needed (lower cost)
No SSH key management
Full audit trail in CloudTrail
Centralised access control via IAM
Session recording and logging
Works with instances in private subnets
Port forwarding support
6. Implement Multi-AZ Deployment for High Availability
The Problem: A single AZ deployment creates a single point of failure.
The Solution: Always deploy across at least 2 Availability Zones.
When to Use Multi-AZ:
✅ Production applications (always!)
✅ Customer-facing services
✅ Critical internal services
❌ Development environments (single AZ okay)
❌ Batch processing (use Spot across AZs instead)
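In Terraform, spreading an Auto Scaling Group across zones comes down to passing it subnets from more than one AZ. The variables below are hypothetical inputs, and the sizes are placeholders.

```hcl
variable "private_subnet_ids" {
  type        = list(string)
  description = "Hypothetical: one subnet per Availability Zone"
}

variable "launch_template_id" { type = string } # hypothetical
variable "target_group_arn" { type = string }   # hypothetical

resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-"
  vpc_zone_identifier = var.private_subnet_ids # instances are balanced across these AZs
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2

  # Replace instances that fail the load balancer's health check
  health_check_type         = "ELB"
  health_check_grace_period = 300
  target_group_arns         = [var.target_group_arn]

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}
```

A minimum of 2 means that losing one zone still leaves an instance serving traffic while the group launches replacements.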
7. Use Mixed Instances for Cost Optimisation
The Problem: Relying on a single instance type limits flexibility and can be more expensive.
The Solution: Use mixed instance types in Auto Scaling Groups.
When to Use Mixed Instances:
✅ Stateless applications (web servers, workers)
✅ Auto Scaling Groups
✅ Fault-tolerant workloads
❌ Stateful applications (databases)
❌ Applications requiring specific CPU/memory ratios
❌ Regulated workloads requiring dedicated instances
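A mixed instances policy could be sketched as follows. The variables are hypothetical, and the split (two On-Demand instances as a base, Spot for everything above it) and the three instance types are illustrative choices, not recommendations.

```hcl
variable "private_subnet_ids" { type = list(string) } # hypothetical
variable "launch_template_id" { type = string }       # hypothetical

resource "aws_autoscaling_group" "workers" {
  name_prefix         = "workers-"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = 2
  max_size            = 20

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2 # always keep 2 On-Demand
      on_demand_percentage_above_base_capacity = 0 # everything above the base is Spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }

      # Several similarly sized types give Spot more capacity pools to draw from
      override {
        instance_type = "c5.large"
      }

      override {
        instance_type = "c5a.large"
      }

      override {
        instance_type = "c6i.large"
      }
    }
  }
}
```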
Common Pitfalls to Avoid
Now that we've covered best practices, let's examine the mistakes that can cost you time, money, and security. Learn from these to avoid common EC2 traps.
1. Not Using Auto Scaling and Running Fixed Capacity
Running a fixed number of instances means paying for idle capacity off-peak and running short at peak. Use Auto Scaling Groups so that capacity follows demand.
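A target tracking policy is often the simplest starting point. This sketch assumes an existing group whose name is passed in through the hypothetical `asg_name` variable, and a 50% CPU target picked for illustration.

```hcl
variable "asg_name" { type = string } # hypothetical: name of an existing ASG

# Adds or removes instances to hold average CPU near the target
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-50"
  autoscaling_group_name = var.asg_name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = 50
  }
}
```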
2. Choosing the Wrong Instance Type
The Mistake: Using whatever instance type seems "good enough" without proper analysis.
The Fix:
Step 1: Analyse Your Workload
Step 2: Match Instance Type to Workload
| Workload Type | Instance Family | Example |
|---|---|---|
| Web servers (variable CPU) | T3/T3a | t3.medium |
| API servers (steady CPU) | M5/M6i | m5.large |
| Batch processing | C5/C6i | c5.2xlarge |
| In-memory databases | R5/R6i | r5.xlarge |
| Data warehouses | I3/I4i | i3.2xlarge |
| ML training | P3/P4 | p3.8xlarge |
| ML inference | G4/Inf1 | g4dn.xlarge |
Step 3: Use AWS Compute Optimizer
# Get recommendations (free service!)
aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-xxxxx
# Illustrative summary of a recommendation (the actual response is JSON):
# Current: m5.xlarge (4 vCPU, 16 GB) = $0.192/hour
# Recommended: m5.large (2 vCPU, 8 GB) = $0.096/hour
# Reason: Average CPU 15%, Max CPU 32%, Avg Memory 40%
# Savings: $70/month
Step 4: Start Small, Scale Up
3. Not Enabling Termination Protection for Critical Instances
The Mistake:
Running critical instances without termination protection:
# ❌ BAD: No protection
resource "aws_instance" "database" {
  ami           = "ami-xxxxx"
  instance_type = "r5.2xlarge"
  # One accidental click = database gone!
}
The Fix:
Enable termination protection for critical instances:
# ✅ GOOD: Protected critical instance
resource "aws_instance" "database" {
  ami                     = "ami-xxxxx"
  instance_type           = "r5.2xlarge"
  disable_api_termination = true # Cannot terminate via API/Console

  tags = {
    Name        = "production-database"
    Critical    = "true"
    Environment = "production"
  }
}
4. Running Instances in Public Subnets When Not Needed
The Mistake: Placing application servers in public subnets with public IPs:
# ❌ BAD: Application servers exposed to the internet
resource "aws_instance" "app" {
  ami                         = "ami-xxxxx"
  instance_type               = "t3.medium"
  subnet_id                   = aws_subnet.public.id
  associate_public_ip_address = true
  # Now directly accessible from the internet!
  # Constant brute force attacks on SSH
  # Security group mistakes = instant breach
}
Why This is Dangerous:
Increased Attack Surface: Every instance is a target
Brute Force Attacks: SSH/RDP are constantly hammered
Security Group Mistakes: One wrong rule = compromise
Compliance Issues: Violates security frameworks
No Defence in Depth: Single layer of security
When Public Instances Are Okay:
✅ Load Balancers (ALB/NLB)
✅ NAT Instances (if you must)
✅ Bastion Hosts (with heavy restrictions)
✅ VPN servers
❌ Application servers (use private + ALB)
❌ Databases (always private!)
❌ Cache servers (always private!)
❌ Background workers (always private!)
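A safer counterpart to the BAD example above might look like this sketch. The three variables are hypothetical inputs; the security group is assumed to admit traffic only from the load balancer.

```hcl
variable "ami_id" { type = string }                # hypothetical
variable "private_subnet_id" { type = string }     # hypothetical
variable "app_security_group_id" { type = string } # hypothetical: allows traffic from the ALB only

# ✅ GOOD: Application server reachable only through the load balancer
resource "aws_instance" "app" {
  ami                         = var.ami_id
  instance_type               = "t3.medium"
  subnet_id                   = var.private_subnet_id
  associate_public_ip_address = false
  vpc_security_group_ids      = [var.app_security_group_id]
}
```

Outbound access for patches and AWS API calls then goes through a NAT gateway or VPC endpoints, not through a public IP on the instance.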
Summary
EC2 is AWS's foundational compute service, and mastering it is essential for every DevOps engineer. Here are the key takeaways from this comprehensive guide:
Core Concepts to Master
Instance Types: Choose wisely based on workload - T3 for burstable, M5 for balanced, C5 for compute-intensive, R5 for memory-heavy, and I3 for storage-optimised workloads.
Purchasing Options: Mix On-Demand (flexibility), Reserved Instances/Savings Plans (steady workloads, 40-60% savings), and Spot Instances (fault-tolerant workloads, up to 90% savings) for optimal cost efficiency.
Storage: Use gp3 for general purpose (better than gp2), io2 for high-performance databases, st1 for big data, and sc1 for cold storage. Always match storage type to workload requirements.
Networking: Deploy across multiple AZs, use private subnets for applications, implement proper security groups, and leverage VPC endpoints to reduce NAT Gateway costs.
Final Thoughts
EC2 is powerful but requires careful planning and ongoing optimisation. Start with these fundamentals:
Security first - Use IAM roles, private subnets, IMDSv2, and least privilege
High availability - Multi-AZ, Auto Scaling, health checks
Cost optimisation - Right-sizing, Auto Scaling, Spot instances, proper storage
Automation - Infrastructure as Code, golden AMIs, automated backups
Monitoring - CloudWatch metrics, alarms, dashboards, and logs
Master EC2, and you'll have a solid foundation for building reliable, scalable, and cost-effective infrastructure on AWS. The practices covered here apply whether you're running a small startup application or managing enterprise-scale workloads.
Remember: Start simple, monitor everything, optimise continuously, and automate relentlessly.
#AWS #EC2 #DevOps #CloudComputing #AutoScaling #CostOptimization #CloudArchitecture #InfrastructureAsCode #Terraform #SRE




