AWS EC2 Guide for DevOps Engineers

Overview

Amazon Elastic Compute Cloud (EC2) is AWS's foundational compute service, providing resizable virtual servers in the cloud. For a DevOps engineer, EC2 is often the first hands-on experience with AWS, and understanding it deeply is crucial for building scalable, reliable infrastructure.

EC2 allows you to launch virtual machines (called instances) with various configurations of CPU, memory, storage, and networking. You pay only for what you use, can scale up or down based on demand, and have complete control over your computing resources.

Core Concepts

1. Instance Types:

EC2 offers instance families optimised for different workloads. Each instance type provides a specific combination of CPU, memory, storage, and networking capacity.

Instance Families:

General Purpose (T, M series): Balanced CPU, memory, and networking. Good for web servers, small databases, and development environments.
- T3/T4g: Burstable performance, cost-effective for variable workloads
- M5/M6i: Balanced performance for most workloads
Compute Optimised (C series): High-performance processors for compute-intensive workloads like batch processing, media transcoding, high-performance web servers, and scientific modelling.
- C5/C6i: Recent generations of compute-optimised instances
Memory Optimised (R, X series): Large memory for in-memory databases, real-time big data analytics, and high-performance databases.
- R5/R6i: General memory-intensive workloads
- X1/X2: Extreme memory for SAP HANA, big data processing
Storage Optimised (I, D series): High sequential read/write access to large datasets. Good for NoSQL databases, data warehousing, Hadoop/Spark clusters.
- I3/I4i: NVMe SSD storage
- D2/D3: Dense HDD storage
Accelerated Computing (P, G, F, Inf series): GPU, FPGA, or purpose-built ML accelerators for machine learning, graphics processing, and video encoding.
- P3/P4: GPU for ML training
- G4/G5: GPU for inference and graphics
- Inf1: AWS Inferentia for ML inference

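Instance Sizing: Available sizes vary by family. Only the burstable T families start at nano, and larger families extend to 24xlarge, 32xlarge, or beyond. The T3 family, for example: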

t3.nano    → 2 vCPU, 0.5 GB RAM
t3.micro   → 2 vCPU, 1 GB RAM
t3.small   → 2 vCPU, 2 GB RAM
t3.medium  → 2 vCPU, 4 GB RAM
t3.large   → 2 vCPU, 8 GB RAM
...
t3.2xlarge → 8 vCPU, 32 GB RAM

2. Amazon Machine Images (AMI):

An AMI is a template that contains the software configuration (operating system, application server, applications) needed to launch an instance. Think of it as a snapshot of a complete system.

AMI Types:

AWS-provided AMIs: Amazon Linux 2, Ubuntu, Windows Server, Red Hat, etc.
Marketplace AMIs: Pre-configured by third parties (e.g., WordPress, Jenkins)
Community AMIs: Shared by AWS users
Custom AMIs: Your own images created from configured instances

AMI Components:

Root volume template (EBS snapshot or instance store)
Launch permissions (which accounts can use it)
Block device mapping (volumes to attach)

3. Instance Purchasing Options:

On-Demand Instances:

Pay by the hour/second with no commitments
Use case: Short-term, unpredictable workloads
Cost: Highest per-hour rate

Reserved Instances (RI):

Commit to 1 or 3 years for a discount, typically 40-60% (AWS advertises up to 72%)
Types: Standard (instance family is fixed), Convertible (can be exchanged for a different instance family, OS, or tenancy)
Payment: All upfront, partial upfront, or no upfront
Use case: Steady-state workloads

Savings Plans:

Commit to consistent usage ($/hour) for 1 or 3 years
More flexible than RIs (Compute Savings Plans apply across instance family, size, OS, and region)
Use case: Modern alternative to Reserved Instances

Spot Instances:

Use spare EC2 capacity at up to a 90% discount (no bidding; you pay the current Spot price)
AWS can reclaim the instance with a 2-minute notice when capacity is needed
Use case: Fault-tolerant, flexible workloads (batch jobs, CI/CD, big data)

Dedicated Hosts:

Physical EC2 server dedicated to your use
Use case: Compliance requirements, server-bound licenses
Cost: Most expensive

Dedicated Instances:

Instances run on hardware dedicated to a single customer
Cheaper than Dedicated Hosts, but less control

4. Instance Lifecycle:

Pending → Running → Stopping → Stopped → Terminated
           ↓                      ↓
        Rebooting            (can restart)

Pending: Instance is launching
Running: Instance is running (you're billed)
Stopping: Instance is shutting down (EBS-backed only)
Stopped: Instance is stopped (not billed for compute, only storage)
Terminated: Instance is deleted (cannot be recovered)
Rebooting: Temporary restart (stays on the same host)

5. Storage Options:

Elastic Block Store (EBS):

Network-attached storage that persists independently from the instance
Types:
- gp3/gp2: General-purpose SSD (most common)
- io2/io1: Provisioned IOPS SSD (high-performance databases)
- st1: Throughput optimised HDD (big data, data warehouses)
- sc1: Cold HDD (infrequent access)
Can snapshot for backups
Can detach and attach to different instances

Instance Store:

Physical disk attached to the host machine
Ephemeral (data is lost when the instance stops, hibernates, or terminates; it survives a reboot)
Very high IOPS
Use case: Temporary data, caches, buffers

EBS vs Instance Store:

Feature	EBS	Instance Store
Persistence	Yes	No (ephemeral)
Snapshot	Yes	No
Resize	Yes	No
Performance	Good	Excellent
Cost	Paid separately	Included
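
As a rough illustration of the EBS side of this comparison, here is a minimal Terraform sketch of a gp3 volume attached to an existing instance. The variable names, volume size, tag, and device name are assumptions made for the example; the IOPS and throughput values are the gp3 baseline.

```hcl
# Hypothetical inputs: supply your own instance ID and Availability Zone.
variable "instance_id" {
  type = string
}

variable "availability_zone" {
  type = string
}

# gp3 lets you set IOPS and throughput independently of volume size.
resource "aws_ebs_volume" "data" {
  availability_zone = var.availability_zone
  size              = 100 # GiB
  type              = "gp3"
  iops              = 3000 # gp3 baseline
  throughput        = 125  # MiB/s, gp3 baseline
  encrypted         = true

  tags = {
    Name = "app-data"
  }
}

# The volume must live in the same AZ as the instance it attaches to.
resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.data.id
  instance_id = var.instance_id
}
```

Because the volume is its own resource, it outlives the instance and can be snapshotted or re-attached elsewhere, which is exactly what an instance store cannot do.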

6. Networking:

Elastic Network Interface (ENI):

Virtual network card attached to instances
Has MAC address, private IP(s), security groups
Can attach multiple ENIs to an instance
Can move ENI between instances

Elastic IP (EIP):

Static public IPv4 address
Can associate/disassociate from instances
Charged hourly, like every public IPv4 address since February 2024, whether or not it is associated with a running instance
Limited to 5 per region (soft limit)

Enhanced Networking:

Single Root I/O Virtualisation (SR-IOV) for higher bandwidth, higher PPS, lower latency
Enabled by default on modern instance types
No additional charge

Placement Groups:

Cluster: Instances in a single AZ, low latency (HPC applications)
Spread: Instances on distinct hardware, max 7 per AZ (critical instances)
Partition: Instances in logical partitions, partitions on different racks (distributed systems like Hadoop, Cassandra)
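
A placement group is a small amount of configuration. The sketch below assumes a hypothetical `ami_id` variable, group name, and instance type, and uses the cluster strategy; swapping the strategy string gives the other two behaviours.

```hcl
# Hypothetical input: any AMI suitable for your HPC nodes.
variable "ami_id" {
  type = string
}

resource "aws_placement_group" "hpc" {
  name     = "hpc-cluster"
  strategy = "cluster" # alternatives: "spread", "partition"
}

# Both nodes are launched into the same placement group,
# which places them close together inside a single AZ.
resource "aws_instance" "hpc_node" {
  count           = 2
  ami             = var.ami_id
  instance_type   = "c5.xlarge"
  placement_group = aws_placement_group.hpc.name
}
```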

7. Security:

Security Groups:

Virtual firewall at the instance level
Stateful (return traffic automatically allowed)
Default for a new group: Deny all inbound, allow all outbound
Can reference other security groups
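
Referencing one security group from another is easier to see in code. This is a minimal sketch, assuming a hypothetical `vpc_id` variable and an application listening on port 8080: the app group accepts traffic only from members of the load balancer group, not from any IP range.

```hcl
# Hypothetical input: the VPC both groups belong to.
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "alb" {
  name_prefix = "alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from the internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Terraform removes the default allow-all egress rule,
  # so it has to be declared explicitly.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "App traffic from the load balancer only"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Since security groups are stateful, no extra rule is needed for the responses going back to the load balancer.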

Key Pairs:

Public-private key cryptography for SSH/RDP access
AWS stores the public key; you download the private key
Can create or import your own

IAM Roles for EC2:

Attach IAM role to the instance for AWS API access
Temporary credentials automatically rotated
Better than storing access keys on an instance

8. Monitoring and Management:

CloudWatch Metrics:

Default metrics (5-minute intervals, free):
- CPUUtilization
- NetworkIn/Out
- DiskReadOps/WriteOps (instance store volumes only)
- StatusCheckFailed
Detailed monitoring (1-minute intervals, paid)

Systems Manager (SSM):

Manage instances without SSH/RDP
Run commands, patch management, and inventory
Session Manager for shell access
Parameter Store for configuration

User Data:

Script that runs on instance launch (by default, only on the first boot)
Used for bootstrap/configuration
Runs as root user
Can be modified when the instance is stopped
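
Here is a minimal sketch of user data in Terraform. It assumes the hypothetical `ami_id` variable points at an Amazon Linux 2023 image (hence `dnf`); the script and resource names are illustrative only.

```hcl
# Hypothetical input: assumed to be an Amazon Linux 2023 AMI.
variable "ami_id" {
  type = string
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Runs as root on first boot. The <<- form strips the leading
  # indentation, so the shebang ends up at the start of the line.
  user_data = <<-EOF
    #!/bin/bash
    dnf install -y nginx
    systemctl enable --now nginx
  EOF

  # Replace the instance when the script changes, so the new script runs.
  user_data_replace_on_change = true
}
```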

9. Auto Scaling:

Auto Scaling Groups (ASG):

Automatically adjust the number of instances
Maintains desired capacity
Integrates with ELB for health checks
Scaling policies: Target tracking, step, scheduled

Launch Templates:

Versioned template for launching instances
Defines AMI, instance type, key pair, security groups, etc.
Used by Auto Scaling Groups

10. High Availability and Fault Tolerance:

Multi-AZ Deployment:

Deploy instances across multiple Availability Zones
Protect against AZ failures
Use with Auto Scaling and Load Balancers

Elastic Load Balancing:

Distributes traffic across instances
Types: ALB (Layer 7), NLB (Layer 4), GWLB (Layer 3, for third-party virtual appliances)
Health checks to route only to healthy instances

Snapshots:

Point-in-time backup of EBS volumes
Incremental backups stored in S3
Can create an AMI from a snapshot
Can copy across regions

11. Instance Metadata Service (IMDS):

API accessible from within the instance at http://169.254.169.254
Provides information about the instance (instance-id, AMI-id, IAM role credentials)
IMDSv1: Request/response (less secure)
IMDSv2: Session-oriented with token (more secure, recommended)
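
Enforcing IMDSv2 is a per-instance setting. A minimal sketch, assuming a hypothetical `ami_id` variable:

```hcl
# Hypothetical input: any AMI whose software supports IMDSv2.
variable "ami_id" {
  type = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  metadata_options {
    http_endpoint = "enabled"
    # "required" rejects token-less IMDSv1 requests;
    # "optional" would accept both versions.
    http_tokens = "required"
    # A hop limit of 1 keeps the token response from travelling
    # beyond the instance itself.
    http_put_response_hop_limit = 1
  }
}
```

Containers that reach IMDS through an extra network hop may need a hop limit of 2, so treat the value above as a starting point rather than a rule.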

12. Hibernation:

Saves RAM contents to the EBS root volume
Instance resumes with the same instance ID, private IPs, and RAM state
Faster than stop/start
Use case: Long-running processes, pre-warmed applications
Limitations: Not all instance types, max 60 days of hibernation

13. Elastic Fabric Adapter (EFA):

Network device for HPC and ML workloads
Bypasses the OS kernel for ultra-low latency
Supports MPI (Message Passing Interface)
Use case: Distributed ML training, computational fluid dynamics

14. Nitro System:

AWS's custom hardware and hypervisor
Offloads virtualisation to dedicated hardware for better performance and stronger isolation
Components:
- Nitro cards (networking, storage, management)
- Nitro security chip
- Nitro hypervisor (lightweight)
Most modern instance types use Nitro

Real-World DevOps Use Cases

Use Case 1: Auto-Scaling Web Application with Load Balancer

Scenario: You have a web application that experiences variable traffic throughout the day. During peak hours (9 AM - 5 PM), you need 10 instances; during off-hours, 2 are sufficient. You want to optimise costs while maintaining performance.

Solution Architecture:

                    Internet
                       ↓
                 Internet Gateway
                       ↓
              Application Load Balancer
                       ↓
         ┌─────────────┴─────────────┐
         ↓                           ↓
    Auto Scaling Group          Auto Scaling Group
    (AZ-1: 1-5 instances)       (AZ-2: 1-5 instances)
         ↓                           ↓
    Target Group                Target Group
    (Health Checks)             (Health Checks)

Key Components:

Application Load Balancer (ALB): Distributes traffic across instances in multiple AZs
Auto Scaling Group: Automatically adds/removes instances based on CPU utilisation
Target Groups: Register instances for health monitoring
CloudWatch Alarms: Trigger scaling policies based on metrics

Why it matters:

Cost Optimisation: Pay only for instances you need
High Availability: Multi-AZ deployment protects against failures
Automatic Recovery: Unhealthy instances are automatically replaced
Performance: Scales out during peak demand

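Implementation: One Auto Scaling Group spanning both AZs, a target tracking policy on CPU utilisation, and scheduled actions for the known 9 AM - 5 PM peak cover this scenario.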
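
One way this scenario could be wired up in Terraform is sketched below. It is not a complete deployment: the launch template, subnets, and target group are assumed to exist already and are passed in through hypothetical variables, and the 50% CPU target is an arbitrary starting value.

```hcl
# Hypothetical inputs: resources assumed to exist already.
variable "launch_template_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string) # one subnet per Availability Zone
}

variable "target_group_arn" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                      = "web-asg"
  min_size                  = 2
  max_size                  = 10
  desired_capacity          = 2
  vpc_zone_identifier       = var.private_subnet_ids
  target_group_arns         = [var.target_group_arn]
  health_check_type         = "ELB" # replace instances the ALB marks unhealthy
  health_check_grace_period = 300

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  # Scaling policies change desired_capacity at runtime;
  # stop Terraform from resetting it on every apply.
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Keep average CPU across the group near 50%.
resource "aws_autoscaling_policy" "cpu" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50
  }
}

# Pre-scale for the known weekday peak. Recurrence is evaluated in UTC
# unless the time_zone argument is set.
resource "aws_autoscaling_schedule" "peak_start" {
  scheduled_action_name  = "peak-start"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 10
  max_size               = 10
  desired_capacity       = 10
  recurrence             = "0 9 * * 1-5"
}

resource "aws_autoscaling_schedule" "peak_end" {
  scheduled_action_name  = "peak-end"
  autoscaling_group_name = aws_autoscaling_group.web.name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 17 * * 1-5"
}
```

The scheduled actions handle the predictable part of the load, while target tracking absorbs anything unexpected outside the peak window.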

Use Case 2: Spot Instances for CI/CD Pipeline

Scenario: Your CI/CD pipeline runs hundreds of build and test jobs daily. These jobs are fault-tolerant (can be retried), and each runs for 10-30 minutes. You're spending $2,000/month on On-Demand instances.

Solution:

Use Spot Instances for build agents, saving up to 90% on compute costs:

Jenkins/GitLab CI Master (On-Demand)
         ↓
    ┌────┴────┐
    ↓         ↓
Spot Fleet   Spot Fleet
(AZ-1)       (AZ-2)
Build        Build
Agents       Agents

Configuration:

Spot Fleet: Mix of instance types (c5.large, c5.xlarge, c6i.large)
Diversification: Request multiple instance types to reduce interruptions
Fallback: On-Demand instances when Spot is unavailable
Interruption Handling: Save build state, retry on a different instance
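
The configuration above can be approximated in Terraform. Note that this sketch swaps in an Auto Scaling Group with a mixed instances policy instead of a Spot Fleet request, and that the variables, group name, and capacity numbers are assumptions. The On-Demand base is a guaranteed floor, not an automatic fallback: the group does not switch to On-Demand by itself when Spot capacity runs out.

```hcl
# Hypothetical inputs: resources assumed to exist already.
variable "launch_template_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string) # subnets in at least two AZs
}

resource "aws_autoscaling_group" "build_agents" {
  name                = "ci-build-agents"
  min_size            = 1
  max_size            = 20
  desired_capacity    = 4
  vpc_zone_identifier = var.private_subnet_ids

  # Launch a replacement proactively when a Spot instance
  # is at elevated risk of interruption.
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1 # always-on floor
      on_demand_percentage_above_base_capacity = 0 # everything else is Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }

      # Diversify across the instance types listed above.
      override { instance_type = "c5.large" }
      override { instance_type = "c5.xlarge" }
      override { instance_type = "c6i.large" }
    }
  }
}
```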

Cost Comparison:

On-Demand (c5.xlarge): $0.17/hour
Spot (c5.xlarge):      $0.034/hour (80% discount)

Monthly cost for 20 instances running 24/7:
On-Demand: 20 × $0.17 × 730 = $2,482/month
Spot:      20 × $0.034 × 730 = $496/month
Savings:   $1,986/month = $23,832/year

Why it matters:

Massive Cost Savings: 80-90% cheaper than On-Demand
Same Performance: Identical instances, just cheaper
Fault Tolerance: CI/CD jobs can handle interruptions
Smart Orchestration: Spot Fleet automatically replaces interrupted instances

Use Case 3: Blue-Green Deployment with AMIs

Scenario: You need to deploy a new version of your application with zero downtime and the ability to roll back quickly if issues occur.

Solution Architecture:

Current State (Blue):
ALB → Target Group (Blue) → ASG with v1.0 AMI

Deployment:
1. Create new AMI with v2.0
2. Create new ASG with v2.0 AMI (Green)
3. Attach to same ALB
4. Gradually shift traffic to Green
5. Monitor health and errors
6. If successful: Terminate Blue ASG
   If issues: Shift traffic back to Blue

Final State:
ALB → Target Group (Green) → ASG with v2.0 AMI

Steps:

Golden AMI Creation: Application + dependencies baked into AMI
Green Environment: Launch new ASG with new AMI
Testing: Verify green environment health
Traffic Shift: Use ALB weighted target groups to gradually shift traffic
Monitoring: Watch metrics closely during transition
Rollback or Complete: Quick rollback if issues, otherwise decommission blue
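
The traffic shift step could look roughly like this in Terraform. It is a sketch that assumes the ALB, certificate, and both target groups already exist; every variable name here is hypothetical.

```hcl
# Hypothetical inputs: resources assumed to exist already.
variable "alb_arn" {
  type = string
}

variable "certificate_arn" {
  type = string
}

variable "blue_target_group_arn" {
  type = string
}

variable "green_target_group_arn" {
  type = string
}

# Share of traffic sent to green, from 0 to 100.
variable "green_weight" {
  type    = number
  default = 10
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = var.alb_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"

    forward {
      target_group {
        arn    = var.blue_target_group_arn
        weight = 100 - var.green_weight
      }

      target_group {
        arn    = var.green_target_group_arn
        weight = var.green_weight
      }
    }
  }
}
```

Raising `green_weight` step by step (for example 10, 50, 100) moves traffic across gradually, and setting it back to 0 is the rollback.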

Why it matters:

Zero Downtime: Users never experience an outage
Fast Rollback: Switch back to blue in seconds if needed
Testing in Production: Test with real traffic gradually
Infrastructure as Code: Entire process automated with Terraform/CloudFormation

Best Practices for DevOps

With the use cases covered, let's explore the essential practices that will make your EC2 deployments secure, cost-effective, and maintainable.

1. Always Use IAM Roles Instead of Access Keys

The Problem: Storing AWS credentials (access keys) on EC2 instances creates security risks and management overhead.

The Solution:

Never hardcode credentials. Always attach IAM roles to instances.

Why This Matters:

Security: No credentials stored on disk
Automatic Rotation: AWS handles credential rotation
Audit Trail: CloudTrail logs all API calls with role information
Least Privilege: Easy to grant minimal permissions
No Key Management: No need to distribute/rotate keys

Real-World Impact: I've seen leaked credentials on GitHub cost companies thousands in unauthorised usage within hours. IAM roles eliminate this risk.
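
A minimal Terraform sketch of the role-based approach follows. The `ami_id` and `artifact_bucket_arn` variables, the resource names, and the single `s3:GetObject` permission are assumptions chosen to illustrate least privilege; substitute whatever your application actually needs.

```hcl
# Hypothetical inputs for the example.
variable "ami_id" {
  type = string
}

variable "artifact_bucket_arn" {
  type = string
}

# Trust policy: only the EC2 service may assume this role.
data "aws_iam_policy_document" "assume_ec2" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app" {
  name               = "app-instance-role"
  assume_role_policy = data.aws_iam_policy_document.assume_ec2.json
}

# Least privilege: read objects from one bucket, nothing else.
data "aws_iam_policy_document" "read_artifacts" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${var.artifact_bucket_arn}/*"]
  }
}

resource "aws_iam_role_policy" "read_artifacts" {
  name   = "read-artifacts"
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.read_artifacts.json
}

# EC2 attaches roles through an instance profile.
resource "aws_iam_instance_profile" "app" {
  name = "app-instance-profile"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.app.name
}
```

The AWS SDKs and CLI on the instance pick up the role's temporary credentials automatically, so no key ever needs to be written to disk.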

2. Use Launch Templates, Not Launch Configurations

The Problem: Launch Configurations are legacy and lack modern features.

The Solution:

Always use Launch Templates for new deployments.

Launch Template Advantages:

Versioning: Track changes, rollback easily
Multiple Instance Types: Support for Spot, On-Demand mix
Modern Features: IMDSv2, T3 unlimited, etc.
Template Reuse: A new version or template can be based on an existing one
Better Spot Support: Works with Spot Fleets and mixed instances
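
For reference, a small launch template in Terraform might look like the sketch below. The `ami_id` variable, names, and instance type are assumptions; the point is the versioning behaviour and a feature (T3 unlimited) that Launch Configurations cannot express.

```hcl
# Hypothetical input: the golden AMI to launch from.
variable "ami_id" {
  type = string
}

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = var.ami_id
  instance_type = "t3.medium"

  # Every change creates a new template version; this flag also
  # makes the newest version the default one.
  update_default_version = true

  # T3 unlimited: allow bursting beyond the earned CPU credits.
  credit_specification {
    cpu_credits = "unlimited"
  }

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "web"
    }
  }
}

# Handy for pinning an Auto Scaling Group to a known version.
output "launch_template_latest_version" {
  value = aws_launch_template.web.latest_version
}
```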

3. Enable Detailed Monitoring for Production Instances

The Problem: Basic monitoring (5-minute intervals) delays problem detection.

The Solution: Enable detailed monitoring (1-minute intervals) for production workloads.
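
In Terraform, detailed monitoring is a single argument. The sketch below pairs it with a 60-second alarm to show what the finer granularity buys you; the `ami_id` variable, names, and the 80% threshold are assumptions.

```hcl
# Hypothetical input for the example.
variable "ami_id" {
  type = string
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "m5.large"
  monitoring    = true # 1-minute CloudWatch metrics (paid)
}

# With 1-minute datapoints, this alarm can fire after 3 minutes
# of high CPU instead of waiting on 5-minute datapoints.
resource "aws_cloudwatch_metric_alarm" "api_cpu_high" {
  alarm_name          = "api-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80

  dimensions = {
    InstanceId = aws_instance.api.id
  }
}
```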

4. Implement a Proper Tagging Strategy for EC2

The Problem: Without tags, you can't track costs, automate management, or organise resources.

The Solution: Tag everything with a consistent strategy.

locals {
  common_tags = {
    # Identification
    Environment = "production"
    Project     = "webapp"
    Application = "api-server"
  }
}
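
One way to make such tags hard to forget is the AWS provider's `default_tags` block, sketched here. The region, the `ManagedBy` tag, and the `ami_id` variable are assumptions added for the example.

```hcl
# Hypothetical input for the example.
variable "ami_id" {
  type = string
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = "production"
      Project     = "webapp"
      ManagedBy   = "terraform"
    }
  }
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  # Resource-level tags are merged with the provider defaults.
  tags = {
    Name        = "api-server"
    Application = "api-server"
  }
}
```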

5. Use Systems Manager Session Manager Instead of SSH

The Problem: SSH requires:

Open port 22 (security risk)
Bastion hosts (additional cost and complexity)
Key management and distribution
Direct network access to instances

The Solution:

Use AWS Systems Manager Session Manager for secure, audited shell access:

Benefits:

No SSH port exposed (better security)
No bastion hosts needed (lower cost)
No SSH key management
Full audit trail in CloudTrail
Centralised access control via IAM
Session recording and logging
Works with instances in private subnets
Port forwarding support
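
On the infrastructure side, Session Manager mostly needs two things: the managed SSM policy on the instance role and outbound HTTPS. A minimal sketch, assuming the instance role already exists and that its name and the VPC ID arrive through hypothetical variables:

```hcl
# Hypothetical inputs: an existing instance role and VPC.
variable "instance_role_name" {
  type = string
}

variable "vpc_id" {
  type = string
}

# AWS-managed policy that lets the SSM agent register the instance
# and open sessions.
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = var.instance_role_name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_security_group" "ssm_only" {
  name_prefix = "ssm-only-"
  vpc_id      = var.vpc_id

  # No ingress blocks at all: port 22 stays closed.

  egress {
    description = "HTTPS out to the Systems Manager endpoints"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

With the SSM agent running on the instance and the Session Manager plugin installed locally, a shell can then be opened with `aws ssm start-session --target <instance-id>`.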

6. Implement Multi-AZ Deployment for High Availability

The Problem: A single-AZ deployment creates a single point of failure.

The Solution: Always deploy across at least 2 Availability Zones.

When to Use Multi-AZ:

✅ Production applications (always!)
✅ Customer-facing services
✅ Critical internal services
❌ Development environments (single AZ okay)
❌ Batch processing (use Spot across AZs instead)

7. Use Mixed Instances for Cost Optimisation

The Problem: Relying on a single instance type limits flexibility and can be more expensive.

The Solution: Use mixed instance types in Auto Scaling Groups.

When to Use Mixed Instances:

✅ Stateless applications (web servers, workers)
✅ Auto Scaling Groups
✅ Fault-tolerant workloads
❌ Stateful applications (databases)
❌ Applications requiring specific CPU/memory ratios
❌ Regulated workloads requiring dedicated instances

Common Pitfalls to Avoid

Now that we've covered best practices, let's examine the mistakes that can cost you time, money, and security. Learn from these to avoid common EC2 traps.

1. Not Using Auto Scaling and Running Fixed Capacity

Running a fixed number of instances means paying for idle capacity off-peak and running short at peak. Use Auto Scaling Groups so that capacity follows demand.

2. Choosing the Wrong Instance Type

The Mistake: Picking whatever instance type seems "good enough" without analysing the workload.

The Fix:

Step 1: Analyse Your Workload

Step 2: Match Instance Type to Workload

Workload Type	Instance Family	Example
Web servers (variable CPU)	T3/T3a	t3.medium
API servers (steady CPU)	M5/M6i	m5.large
Batch processing	C5/C6i	c5.2xlarge
In-memory databases	R5/R6i	r5.xlarge
Data warehouses	I3/I4i	i3.2xlarge
ML training	P3/P4	p3.8xlarge
ML inference	G4/Inf1	g4dn.xlarge

Step 3: Use AWS Compute Optimizer

# Get recommendations (free service!)
aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-xxxxx

# Illustrative summary of a recommendation (the CLI itself returns JSON):
# Current: m5.xlarge (4 vCPU, 16 GB) = $0.192/hour
# Recommended: m5.large (2 vCPU, 8 GB) = $0.096/hour
# Reason: Average CPU 15%, Max CPU 32%, Avg Memory 40%
# Savings: $70/month

Step 4: Start Small, Scale Up

3. Not Enabling Termination Protection for Critical Instances

The Mistake:

Running critical instances without termination protection:

# ❌ BAD: No protection
resource "aws_instance" "database" {
  ami           = "ami-xxxxx"
  instance_type = "r5.2xlarge"

  # One accidental click = database gone!
}

The Fix:

Enable termination protection for critical instances:

# ✅ GOOD: Protected critical instance
resource "aws_instance" "database" {
  ami           = "ami-xxxxx"
  instance_type = "r5.2xlarge"

  disable_api_termination = true  # Cannot terminate via API/Console

  tags = {
    Name        = "production-database"
    Critical    = "true"
    Environment = "production"
  }
}

4. Running Instances in Public Subnets When Not Needed

The Mistake: Placing application servers in public subnets with public IPs:

# ❌ BAD: Application servers exposed to the internet
resource "aws_instance" "app" {
  ami                         = "ami-xxxxx"
  instance_type               = "t3.medium"
  subnet_id                   = aws_subnet.public.id
  associate_public_ip_address = true

  # Now directly accessible from the internet!
  # Constant brute force attacks on SSH
  # Security group mistakes = instant breach
}

Why This is Dangerous:

Increased Attack Surface: Every instance is a target
Brute Force Attacks: SSH/RDP are constantly hammered
Security Group Mistakes: One wrong rule = compromise
Compliance Issues: Violates security frameworks
No Defence in Depth: Single layer of security

When Public Instances Are Okay:

✅ Load Balancers (ALB/NLB)
✅ NAT Instances (if you must)
✅ Bastion Hosts (with heavy restrictions)
✅ VPN servers
❌ Application servers (use private + ALB)
❌ Databases (always private!)
❌ Cache servers (always private!)
❌ Background workers (always private!)
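
For contrast with the exposed example above, a private placement could be sketched as follows. The subnet, security group, and AMI are assumed to exist and are passed in through hypothetical variables; the security group is expected to allow inbound traffic from the load balancer only.

```hcl
# Hypothetical inputs: resources assumed to exist already.
variable "ami_id" {
  type = string
}

variable "private_subnet_id" {
  type = string
}

variable "app_security_group_id" {
  type = string
}

# ✅ GOOD: Application server reachable only through the load balancer
resource "aws_instance" "app" {
  ami                         = var.ami_id
  instance_type               = "t3.medium"
  subnet_id                   = var.private_subnet_id
  associate_public_ip_address = false
  vpc_security_group_ids      = [var.app_security_group_id]
}
```

Outbound access for patches and package installs then goes through a NAT Gateway or VPC endpoints rather than a public IP on the instance.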

Summary

EC2 is AWS's foundational compute service, and mastering it is essential for every DevOps engineer. Here are the key takeaways from this comprehensive guide:

Core Concepts to Master

Instance Types: Choose wisely based on workload - T3 for burstable, M5 for balanced, C5 for compute-intensive, R5 for memory-heavy, and I3 for storage-optimised workloads.

Purchasing Options: Mix On-Demand (flexibility), Reserved Instances/Savings Plans (steady workloads, 40-60% savings), and Spot Instances (fault-tolerant workloads, up to 90% savings) for optimal cost efficiency.

Storage: Use gp3 for general purpose (better than gp2), io2 for high-performance databases, st1 for big data, and sc1 for cold storage. Always match storage type to workload requirements.

Networking: Deploy across multiple AZs, use private subnets for applications, implement proper security groups, and leverage VPC endpoints to reduce NAT Gateway costs.

Final Thoughts

EC2 is powerful but requires careful planning and ongoing optimisation. Start with these fundamentals:

Security first - Use IAM roles, private subnets, IMDSv2, and least privilege
High availability - Multi-AZ, Auto Scaling, health checks
Cost optimisation - Right-sizing, Auto Scaling, Spot instances, proper storage
Automation - Infrastructure as Code, golden AMIs, automated backups
Monitoring - CloudWatch metrics, alarms, dashboards, and logs

Master EC2, and you'll have a solid foundation for building reliable, scalable, and cost-effective infrastructure on AWS. The practices covered here apply whether you're running a small startup application or managing enterprise-scale workloads.

Remember: Start simple, monitor everything, optimise continuously, and automate relentlessly.
