Mastering AWS EC2: A Comprehensive Guide for DevOps Engineers

Hi there! I’m a DevOps enthusiast, certified in AWS and Terraform, passionate about crafting innovative cloud solutions. From designing scalable CI/CD pipelines to deploying microservices on cloud platforms, I’ve immersed myself in transforming ideas into impactful technologies.

Overview

Amazon Elastic Compute Cloud (EC2) is AWS's foundational compute service, providing resizable virtual servers in the cloud. For many DevOps engineers, EC2 is the first hands-on experience with AWS, and understanding it deeply is crucial for building scalable, reliable infrastructure.

EC2 allows you to launch virtual machines (called instances) with various configurations of CPU, memory, storage, and networking. You pay only for what you use, can scale up or down based on demand, and have complete control over your computing resources.

Core Concepts

1. Instance Types:

EC2 offers instance families optimised for different workloads. Each instance type provides a specific combination of CPU, memory, storage, and networking capacity.

Instance Families:

  • General Purpose (T, M series): Balanced CPU, memory, and networking. Good for web servers, small databases, and development environments.

    • T3/T4g: Burstable performance, cost-effective for variable workloads

    • M5/M6i: Balanced performance for most workloads

  • Compute Optimised (C series): High-performance processors for compute-intensive workloads like batch processing, media transcoding, high-performance web servers, and scientific modelling.

    • C5/C6i: Latest generation compute optimised
  • Memory Optimised (R, X series): Large memory for in-memory databases, real-time big data analytics, and high-performance databases.

    • R5/R6i: General memory-intensive workloads

    • X1/X2: Extreme memory for SAP HANA, big data processing

  • Storage Optimised (I, D series): High sequential read/write access to large datasets. Good for NoSQL databases, data warehousing, Hadoop/Spark clusters.

    • I3/I4i: NVMe SSD storage

    • D2/D3: Dense HDD storage

  • Accelerated Computing (P, G, F series): GPU or FPGA hardware accelerators for machine learning, graphics processing, and video encoding.

    • P3/P4: GPU for ML training

    • G4/G5: GPU for inference and graphics

    • Inf1: AWS Inferentia for ML inference

Instance Sizing: Each family comes in a range of sizes. T3, for example, runs from nano to 2xlarge, while larger families go up to 32xlarge and beyond.

t3.nano    → 2 vCPU, 0.5 GB RAM
t3.micro   → 2 vCPU, 1 GB RAM
t3.small   → 2 vCPU, 2 GB RAM
t3.medium  → 2 vCPU, 4 GB RAM
t3.large   → 2 vCPU, 8 GB RAM
...
t3.2xlarge → 8 vCPU, 32 GB RAM

2. Amazon Machine Images (AMI):

An AMI is a template that contains the software configuration (operating system, application server, applications) needed to launch an instance. Think of it as a snapshot of a complete system.

AMI Types:

  • AWS-provided AMIs: Amazon Linux 2, Ubuntu, Windows Server, Red Hat, etc.

  • Marketplace AMIs: Pre-configured by third parties (e.g., WordPress, Jenkins)

  • Community AMIs: Shared by AWS users

  • Custom AMIs: Your own images created from configured instances

AMI Components:

  • Root volume template (EBS snapshot or instance store)

  • Launch permissions (which accounts can use it)

  • Block device mapping (volumes to attach)
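Since the rest of this guide uses Terraform, here is a minimal sketch of how an AWS-provided AMI might be looked up instead of hardcoding an ID. The name filter is the usual pattern for Amazon Linux 2 on x86_64, and the resource name `example` is purely illustrative.

```hcl
# Look up the most recent Amazon Linux 2 AMI published by AWS.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Hypothetical instance that consumes the looked-up image.
# With no subnet specified, it launches into the default VPC.
resource "aws_instance" "example" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.micro"
}
```

Because `most_recent = true` can resolve to a newer image on a later run, teams that need reproducible builds often pin the AMI ID once it has been tested.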

3. Instance Purchasing Options:

On-Demand Instances:

  • Pay by the hour/second with no commitments

  • Use case: Short-term, unpredictable workloads

  • Cost: Highest per-hour rate

Reserved Instances (RI):

  • Commit to 1 or 3 years for a typical 40-60% discount (AWS advertises up to 72%)

  • Types: Standard (larger discount, instance family is fixed), Convertible (can be exchanged for a different instance family)

  • Payment: All upfront, partial upfront, or no upfront

  • Use case: Steady-state workloads

Savings Plans:

  • Commit to consistent usage ($/hour) for 1 or 3 years

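  • More flexible than RIs: Compute Savings Plans apply across instance families, sizes, operating systems, and regions (EC2 Instance Savings Plans are tied to one family in one region)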

  • Use case: Modern alternative to Reserved Instances

Spot Instances:

  • Run on unused EC2 capacity for up to 90% off the On-Demand price (no bidding required; you pay the current Spot price)

  • AWS can reclaim the instance with a 2-minute notice when it needs the capacity back

  • Use case: Fault-tolerant, flexible workloads (batch jobs, CI/CD, big data)

Dedicated Hosts:

  • Physical EC2 server dedicated to your use

  • Use case: Compliance requirements, server-bound licenses

  • Cost: Most expensive

Dedicated Instances:

  • Instances run on hardware dedicated to a single customer

  • Cheaper than Dedicated Hosts, but less control

4. Instance Lifecycle:

Pending → Running → Stopping → Stopped → Terminated
           ↓                      ↓
        Rebooting            (can restart)
  • Pending: Instance is launching

  • Running: Instance is running (you're billed)

  • Stopping: Instance is shutting down (EBS-backed only)

  • Stopped: Instance is stopped (not billed for compute, only storage)

  • Terminated: Instance is deleted (cannot be recovered)

  • Rebooting: Temporary restart (stays on the same host)

5. Storage Options:

Elastic Block Store (EBS):

  • Network-attached storage that persists independently from the instance

  • Types:

    • gp3/gp2: General-purpose SSD (most common)

    • io2/io1: Provisioned IOPS SSD (high-performance databases)

    • st1: Throughput optimised HDD (big data, data warehouses)

    • sc1: Cold HDD (infrequent access)

  • Can snapshot for backups

  • Can detach and attach to different instances

Instance Store:

  • Physical disk attached to the host machine

  • Ephemeral (data survives a reboot but is lost when the instance stops, hibernates, or terminates)

  • Very high IOPS

  • Use case: Temporary data, caches, buffers

EBS vs Instance Store:

Feature      | EBS             | Instance Store
Persistence  | Yes             | No (ephemeral)
Snapshot     | Yes             | No
Resize       | Yes             | No
Performance  | Good            | Excellent
Cost         | Paid separately | Included
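To make the EBS side of the comparison concrete, the following is a rough Terraform sketch of a gp3 data volume attached to an existing instance. The variables `instance_id` and `availability_zone` are placeholders introduced for this example, and the size, IOPS, and throughput values are simply the gp3 baseline, not a recommendation.

```hcl
# Placeholders for this sketch: an existing instance and its AZ.
variable "instance_id" {
  type = string
}

variable "availability_zone" {
  type = string # must be the same AZ as the instance
}

# A general-purpose SSD volume that persists independently of the instance.
resource "aws_ebs_volume" "data" {
  availability_zone = var.availability_zone
  size              = 100 # GiB
  type              = "gp3"
  iops              = 3000 # gp3 baseline
  throughput        = 125  # MiB/s, gp3 baseline
  encrypted         = true
}

# Attach the volume; it can later be detached and moved to another instance.
resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.data.id
  instance_id = var.instance_id
}
```

Instance store volumes have no equivalent resource: they come with the instance type and disappear with the instance.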

6. Networking:

Elastic Network Interface (ENI):

  • Virtual network card attached to instances

  • Has MAC address, private IP(s), security groups

  • Can attach multiple ENIs to an instance

  • Can move ENI between instances

Elastic IP (EIP):

  • Static public IPv4 address

  • Can associate/disassociate from instances

  • Billed hourly like every public IPv4 address, whether or not it is associated with a running instance (AWS pricing change effective February 2024)

  • Limited to 5 per region (soft limit)

Enhanced Networking:

  • Single Root I/O Virtualisation (SR-IOV) for higher bandwidth, higher PPS, lower latency

  • Enabled by default on modern instance types

  • No additional charge

Placement Groups:

  • Cluster: Instances in a single AZ, low latency (HPC applications)

  • Spread: Instances on distinct hardware, max 7 per AZ (critical instances)

  • Partition: Instances in logical partitions, partitions on different racks (distributed systems like Hadoop, Cassandra)

7. Security:

Security Groups:

  • Virtual firewall at the instance level

  • Stateful (return traffic automatically allowed)

  • Default: Deny all inbound, allow all outbound

  • Can reference other security groups
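The "can reference other security groups" point is easiest to see in code. This is an illustrative Terraform sketch, assuming a hypothetical `vpc_id` variable and an application listening on port 8080; both are assumptions made for the example.

```hcl
variable "vpc_id" {
  type = string # hypothetical VPC for this sketch
}

# Load balancer group: accepts HTTPS from anywhere.
resource "aws_security_group" "alb" {
  name_prefix = "alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from the internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Terraform removes the default allow-all egress rule, so declare it.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Application group: only members of the ALB group may connect.
resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "App traffic only from the ALB security group"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because the rule points at a group rather than an IP range, it keeps working as load balancer nodes come and go. Return traffic needs no rule of its own, since security groups are stateful.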

Key Pairs:

  • Public-private key cryptography for SSH/RDP access

  • AWS stores the public key; you download the private key

  • Can create or import your own

IAM Roles for EC2:

  • Attach IAM role to the instance for AWS API access

  • Temporary credentials automatically rotated

  • Better than storing access keys on an instance

8. Monitoring and Management:

CloudWatch Metrics:

  • Default metrics (5-minute intervals, free):

    • CPUUtilization

    • NetworkIn/Out

    • DiskReadOps/WriteOps

    • StatusCheckFailed

  • Detailed monitoring (1-minute intervals, paid)

Systems Manager (SSM):

  • Manage instances without SSH/RDP

  • Run commands, patch management, and inventory

  • Session Manager for shell access

  • Parameter Store for configuration

User Data:

  • Script that runs at instance launch (by default only on the first boot)

  • Used for bootstrap/configuration

  • Runs as root user

  • Can be modified when the instance is stopped
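As a small illustration of bootstrapping with user data, the sketch below installs and starts Apache on first boot. It assumes the AMI passed in through the hypothetical `ami_id` variable is Amazon Linux 2 (hence `yum` and `httpd`) and a reasonably recent AWS provider for `user_data_replace_on_change`.

```hcl
variable "ami_id" {
  type = string # assumed to be an Amazon Linux 2 AMI
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Runs as root on the first boot only.
  user_data = <<-EOF
    #!/bin/bash
    yum install -y httpd
    systemctl enable --now httpd
  EOF

  # Replace the instance when the script changes, so the new
  # script actually runs instead of being silently ignored.
  user_data_replace_on_change = true
}
```

Anything beyond a few lines of bootstrap is usually better baked into the AMI, which keeps launches fast and repeatable.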

9. Auto Scaling:

Auto Scaling Groups (ASG):

  • Automatically adjust the number of instances

  • Maintains desired capacity

  • Integrates with ELB for health checks

  • Scaling policies: Target tracking, step, scheduled

Launch Templates:

  • Versioned template for launching instances

  • Defines AMI, instance type, key pair, security groups, etc.

  • Used by Auto Scaling Groups

10. High Availability and Fault Tolerance:

Multi-AZ Deployment:

  • Deploy instances across multiple Availability Zones

  • Protect against AZ failures

  • Use with Auto Scaling and Load Balancers

Elastic Load Balancing:

  • Distributes traffic across instances

  • Types: ALB (Layer 7), NLB (Layer 4), GWLB (Gateway Load Balancer, Layer 3)

  • Health checks to route only to healthy instances

Snapshots:

  • Point-in-time backup of EBS volumes

  • Incremental backups stored in S3

  • Can create an AMI from a snapshot

  • Can copy across regions

11. Instance Metadata Service (IMDS):

  • API accessible from within the instance at http://169.254.169.254

  • Provides information about the instance (instance-id, AMI-id, IAM role credentials)

  • IMDSv1: Request/response (less secure)

  • IMDSv2: Session-oriented with token (more secure, recommended)
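One way to enforce IMDSv2 is at launch time. The Terraform sketch below requires session tokens on a single instance; `ami_id` is a placeholder variable, and the hop limit of 1 is a common choice for non-containerised workloads rather than a universal setting.

```hcl
variable "ami_id" {
  type = string # placeholder AMI for this sketch
}

resource "aws_instance" "imdsv2_only" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # reject token-less IMDSv1 requests
    http_put_response_hop_limit = 1          # keep the token on the instance itself
  }
}
```

Workloads running in containers on the instance may need a hop limit of 2 to reach the metadata service.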

12. Hibernation:

  • Saves RAM contents to the EBS root volume

  • Instance resumes with the same instance ID, private IPs, and RAM state

  • Faster than stop/start

  • Use case: Long-running processes, pre-warmed applications

  • Limitations: Not all instance types, max 60 days of hibernation

13. Elastic Fabric Adapter (EFA):

  • Network device for HPC and ML workloads

  • Bypasses the OS kernel for ultra-low latency

  • Supports MPI (Message Passing Interface)

  • Use case: Distributed ML training, computational fluid dynamics

14. Nitro System:

  • AWS's custom hardware and hypervisor

  • Offloads virtualisation work to dedicated hardware, which improves performance and security isolation

  • Components:

    • Nitro cards (networking, storage, management)

    • Nitro security chip

    • Nitro hypervisor (lightweight)

  • Most modern instance types use Nitro

Real-World DevOps Use Cases

Use Case 1: Auto-Scaling Web Application with Load Balancer

Scenario: You have a web application that experiences variable traffic throughout the day. During peak hours (9 AM - 5 PM), you need 10 instances; during off-hours, 2 are sufficient. You want to optimise costs while maintaining performance.

Solution Architecture:

                    Internet
                       ↓
                 Internet Gateway
                       ↓
              Application Load Balancer
                       ↓
         ┌─────────────┴─────────────┐
         ↓                           ↓
    Auto Scaling Group          Auto Scaling Group
    (AZ-1: 1-5 instances)       (AZ-2: 1-5 instances)
         ↓                           ↓
    Target Group                Target Group
    (Health Checks)             (Health Checks)

Key Components:

  • Application Load Balancer (ALB): Distributes traffic across instances in multiple AZs

  • Auto Scaling Group: Automatically adds/removes instances based on CPU utilisation

  • Target Groups: Register instances for health monitoring

  • CloudWatch Alarms: Trigger scaling policies based on metrics
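The components above could be wired together roughly as follows. This is a sketch under stated assumptions, not the full implementation: the launch template, subnets, and target group are passed in as hypothetical variables, and the 50% CPU target is an arbitrary starting point.

```hcl
variable "launch_template_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string) # one subnet per Availability Zone
}

variable "target_group_arn" {
  type = string
}

# A single group spanning both AZs, sized for 2 instances off-peak
# and up to 10 at peak, as in the scenario.
resource "aws_autoscaling_group" "web" {
  name_prefix               = "web-"
  min_size                  = 2
  max_size                  = 10
  desired_capacity          = 2
  vpc_zone_identifier       = var.private_subnet_ids
  target_group_arns         = [var.target_group_arn]
  health_check_type         = "ELB" # replace instances the ALB marks unhealthy
  health_check_grace_period = 300

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  # Let the scaling policy own the instance count after creation.
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Target tracking creates and manages the CloudWatch alarms itself.
resource "aws_autoscaling_policy" "cpu" {
  name                   = "web-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50
  }
}
```

For a traffic pattern as predictable as 9 AM to 5 PM, scheduled actions could be layered on top so capacity is in place before the peak begins.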

Why it matters:

  • Cost Optimisation: Pay only for instances you need

  • High Availability: Multi-AZ deployment protects against failures

  • Automatic Recovery: Unhealthy instances are automatically replaced

  • Performance: Scales out during peak demand

Implementation: We'll cover this in the hands-on section.

Use Case 2: Spot Instances for CI/CD Pipeline

Scenario: Your CI/CD pipeline runs hundreds of build and test jobs daily. These jobs are fault-tolerant (can be retried), and each runs for 10-30 minutes. You're spending $2,000/month on On-Demand instances.

Solution:

Use Spot Instances for build agents, saving up to 90% on compute costs:

Jenkins/GitLab CI Master (On-Demand)
         ↓
    ┌────┴────┐
    ↓         ↓
Spot Fleet   Spot Fleet
(AZ-1)       (AZ-2)
Build        Build
Agents       Agents

Configuration:

  • Spot Fleet: Mix of instance types (c5.large, c5.xlarge, c6i.large)

  • Diversification: Request multiple instance types to reduce interruptions

  • Fallback: On-Demand instances when Spot is unavailable

  • Interruption Handling: Save build state, retry on a different instance

Cost Comparison:

On-Demand (c5.xlarge): $0.17/hour
Spot (c5.xlarge):      $0.034/hour (80% discount)

Monthly cost for 20 instances running 24/7:
On-Demand: 20 × $0.17 × 730 = $2,482/month
Spot:      20 × $0.034 × 730 = $496/month
Savings:   $1,986/month = $23,832/year

Why it matters:

  • Massive Cost Savings: 80-90% cheaper than On-Demand

  • Same Performance: Identical instances, just cheaper

  • Fault Tolerance: CI/CD jobs can handle interruptions

  • Smart Orchestration: Spot Fleet automatically replaces interrupted instances

Use Case 3: Blue-Green Deployment with AMIs

Scenario: You need to deploy a new version of your application with zero downtime and the ability to quickly rollback if issues occur.

Solution Architecture:

Current State (Blue):
ALB → Target Group (Blue) → ASG with v1.0 AMI

Deployment:
1. Create new AMI with v2.0
2. Create new ASG with v2.0 AMI (Green)
3. Attach to same ALB
4. Gradually shift traffic to Green
5. Monitor health and errors
6. If successful: Terminate Blue ASG
   If issues: Shift traffic back to Blue

Final State:
ALB → Target Group (Green) → ASG with v2.0 AMI

Steps:

  1. Golden AMI Creation: Application + dependencies baked into AMI

  2. Green Environment: Launch new ASG with new AMI

  3. Testing: Verify green environment health

  4. Traffic Shift: Use ALB weighted target groups to gradually shift traffic

  5. Monitoring: Watch metrics closely during transition

  6. Rollback or Complete: Quick rollback if issues, otherwise decommission blue
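Step 4 is where most of the risk sits, so here is a minimal Terraform sketch of a weighted forward action. The ALB and both target groups are assumed to exist already and are passed in as hypothetical variables; `green_weight` is a name invented for this example, and a plain HTTP listener is used only to keep the sketch short.

```hcl
variable "alb_arn" {
  type = string
}

variable "blue_target_group_arn" {
  type = string
}

variable "green_target_group_arn" {
  type = string
}

variable "green_weight" {
  type        = number
  default     = 10
  description = "Share of traffic sent to green, from 0 to 100"
}

resource "aws_lb_listener" "web" {
  load_balancer_arn = var.alb_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "forward"

    forward {
      # Blue (v1.0) receives whatever green does not.
      target_group {
        arn    = var.blue_target_group_arn
        weight = 100 - var.green_weight
      }

      # Green (v2.0) starts with a small share of real traffic.
      target_group {
        arn    = var.green_target_group_arn
        weight = var.green_weight
      }
    }
  }
}
```

Raising `green_weight` in steps (for example 10, 50, 100) shifts traffic gradually, and setting it back to 0 is the rollback.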

Why it matters:

  • Zero Downtime: Users never experience an outage

  • Fast Rollback: Switch back to blue in seconds if needed

  • Testing in Production: Test with real traffic gradually

  • Infrastructure as Code: Entire process automated with Terraform/CloudFormation

Best Practices for DevOps

Now that we've built a production-ready auto-scaling infrastructure, let's explore the essential practices that will make your EC2 deployments secure, cost-effective, and maintainable.

1. Always Use IAM Roles Instead of Access Keys

The Problem: Storing AWS credentials (access keys) on EC2 instances creates security risks and management overhead.

The Solution:

Never hardcode credentials. Always attach an IAM role to the instance instead.
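A minimal sketch of what that looks like in Terraform follows. The bucket name `example-artifacts-bucket` and the `ami_id` variable are hypothetical, and the single `s3:GetObject` permission stands in for whatever the application actually needs.

```hcl
variable "ami_id" {
  type = string # placeholder AMI for this sketch
}

# Trust policy: only the EC2 service may assume this role.
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app" {
  name_prefix        = "app-"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

# Least privilege: read-only access to one hypothetical bucket.
data "aws_iam_policy_document" "read_artifacts" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-artifacts-bucket/*"]
  }
}

resource "aws_iam_role_policy" "read_artifacts" {
  name_prefix = "read-artifacts-"
  role        = aws_iam_role.app.id
  policy      = data.aws_iam_policy_document.read_artifacts.json
}

# The instance profile is the wrapper EC2 uses to hand the role to an instance.
resource "aws_iam_instance_profile" "app" {
  name_prefix = "app-"
  role        = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.app.name
}
```

The AWS SDKs and CLI on the instance pick up the role's temporary credentials from the metadata service automatically, so no keys appear in code or configuration.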

Why This Matters:

  • Security: No credentials stored on disk

  • Automatic Rotation: AWS handles credential rotation

  • Audit Trail: CloudTrail logs all API calls with role information

  • Least Privilege: Easy to grant minimal permissions

  • No Key Management: No need to distribute/rotate keys

Real-World Impact: I've seen leaked credentials on GitHub cost companies thousands in unauthorised usage within hours. IAM roles eliminate this risk.

2. Use Launch Templates, Not Launch Configurations

The Problem: Launch Configurations are legacy and lack modern features.

The Solution:

Always use Launch Templates for new deployments.

Launch Template Advantages:

  • Versioning: Track changes, rollback easily

  • Multiple Instance Types: Support for Spot, On-Demand mix

  • Modern Features: IMDSv2, T3 unlimited, etc.

  • Source Templates: A new template or version can start from an existing one

  • Better Spot Support: Works with Spot Fleets and mixed instances
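Several of these advantages show up directly in the resource definition. The following Terraform sketch is illustrative only: `ami_id` and `security_group_ids` are placeholder variables, and the instance type is an arbitrary choice.

```hcl
variable "ami_id" {
  type = string
}

variable "security_group_ids" {
  type = list(string)
}

resource "aws_launch_template" "web" {
  name_prefix            = "web-"
  image_id               = var.ami_id
  instance_type          = "t3.medium"
  vpc_security_group_ids = var.security_group_ids

  # Every change creates a new version; make the newest one the default.
  update_default_version = true

  # Require IMDSv2 session tokens.
  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = "required"
  }

  # T3 unlimited: allow bursting beyond the credit balance (billed extra).
  credit_specification {
    cpu_credits = "unlimited"
  }

  # 1-minute CloudWatch metrics.
  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "web"
    }
  }
}
```

An Auto Scaling Group can then reference the template by ID and version, which makes a rollback a matter of pointing at the previous version.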

3. Enable Detailed Monitoring for Production Instances

The Problem: Basic monitoring (5-minute intervals) delays problem detection.

The Solution: Enable detailed monitoring (1-minute intervals) for production workloads
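In Terraform this is a single flag on the instance. The sketch below pairs it with an alarm that evaluates the 1-minute data; the `ami_id` variable, the alarm name, and the 80% threshold are assumptions chosen for the example.

```hcl
variable "ami_id" {
  type = string # placeholder AMI for this sketch
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  monitoring    = true # detailed (1-minute) monitoring, billed per metric
}

# Fires after three consecutive 1-minute periods above 80% CPU.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "api-high-cpu"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 60
  evaluation_periods  = 3

  dimensions = {
    InstanceId = aws_instance.api.id
  }
}
```

With basic monitoring the same alarm could not react in under five minutes, because that is how often new data points arrive.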

4. Implement a Proper Tagging Strategy for EC2

The Problem: Without tags, you can't track costs, automate management, or organise resources.

The Solution: Tag everything with a consistent strategy

locals {
  common_tags = {
    # Identification
    Environment = "production"
    Project     = "webapp"
    Application = "api-server"
  }
}
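A complementary way to keep tags consistent is the AWS provider's `default_tags` block, which applies a base set of tags to every taggable resource the provider creates. The sketch below assumes the `us-east-1` region and a placeholder `ami_id` variable; the `ManagedBy` tag is an addition made for illustration.

```hcl
variable "ami_id" {
  type = string # placeholder AMI for this sketch
}

provider "aws" {
  region = "us-east-1" # assumed region

  # Applied automatically to every taggable resource.
  default_tags {
    tags = {
      Environment = "production"
      Project     = "webapp"
      ManagedBy   = "terraform"
    }
  }
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  # Resource-level tags are merged with, and override, the defaults.
  tags = {
    Name        = "api-server"
    Application = "api-server"
  }
}
```

Tags only appear in cost reports after they have been activated as cost allocation tags in the billing console.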

5. Use Systems Manager Session Manager Instead of SSH

The Problem: SSH requires:

  • Open port 22 (security risk)

  • Bastion hosts (additional cost and complexity)

  • Key management and distribution

  • Direct network access to instances

The Solution:

Use AWS Systems Manager Session Manager for secure, audited shell access.

Benefits:

  • No SSH port exposed (better security)

  • No bastion hosts needed (lower cost)

  • No SSH key management

  • Full audit trail in CloudTrail

  • Centralised access control via IAM

  • Session recording and logging

  • Works with instances in private subnets

  • Port forwarding support
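On the instance side, Session Manager mainly needs the SSM agent (preinstalled on Amazon Linux and several other AWS-provided AMIs) and an instance role with the `AmazonSSMManagedInstanceCore` managed policy. A rough Terraform sketch, with `ami_id`, `vpc_id`, and `private_subnet_id` as hypothetical inputs:

```hcl
variable "ami_id" {
  type = string # assumed to ship with the SSM agent
}

variable "vpc_id" {
  type = string
}

variable "private_subnet_id" {
  type = string
}

resource "aws_iam_role" "ssm" {
  name_prefix = "ssm-"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS-managed policy that lets the agent register and open sessions.
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm" {
  name_prefix = "ssm-"
  role        = aws_iam_role.ssm.name
}

# No inbound rules at all: port 22 stays closed.
resource "aws_security_group" "no_inbound" {
  name_prefix = "no-inbound-"
  vpc_id      = var.vpc_id

  egress {
    description = "HTTPS out to the Systems Manager endpoints"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "private_app" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  subnet_id              = var.private_subnet_id
  iam_instance_profile   = aws_iam_instance_profile.ssm.name
  vpc_security_group_ids = [aws_security_group.no_inbound.id]
}
```

The private subnet still needs a route to the Systems Manager endpoints, either through a NAT gateway or through VPC interface endpoints for `ssm`, `ssmmessages`, and `ec2messages`.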

6. Implement Multi-AZ Deployment for High Availability

The Problem: A single AZ deployment creates a single point of failure.

The Solution: Always deploy across at least 2 Availability Zones

When to Use Multi-AZ:

  • Production applications (always!)

  • Customer-facing services

  • Critical internal services

  • ❌ Development environments (single AZ okay)

  • ❌ Batch processing (use Spot across AZs instead)
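Multi-AZ starts at the network layer: instances can only spread across zones if there is a subnet in each one. The sketch below creates two private subnets in two different AZs; `vpc_id` and `vpc_cidr` are placeholder variables, and a /16 VPC range is assumed so that each subnet becomes a /24.

```hcl
variable "vpc_id" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16" # assumed VPC range
}

# Ask AWS which zones are currently usable in this region.
data "aws_availability_zones" "available" {
  state = "available"
}

# One private subnet in each of the first two zones.
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = var.vpc_id
  availability_zone = data.aws_availability_zones.available.names[count.index]
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index) # 10.0.0.0/24, 10.0.1.0/24

  tags = {
    Name = "private-${count.index}"
  }
}

# Pass this list to an Auto Scaling Group's vpc_zone_identifier.
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```

An Auto Scaling Group given both subnet IDs balances instances across the two zones and relaunches in the surviving zone if one fails.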

7. Use Mixed Instances for Cost Optimisation

The Problem: Relying on a single instance type limits flexibility and can be more expensive.

The Solution: Use mixed instance types in Auto Scaling Groups

When to Use Mixed Instances:

  • Stateless applications (web servers, workers)

  • Auto Scaling Groups

  • Fault-tolerant workloads

  • ❌ Stateful applications (databases)

  • ❌ Applications requiring specific CPU/memory ratios

  • ❌ Regulated workloads requiring dedicated instances
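A mixed instances policy might look like the sketch below. It assumes an existing launch template and subnets (passed in as hypothetical variables), a recent AWS provider that supports the `price-capacity-optimized` strategy, and an arbitrary split of two On-Demand instances as a baseline with Spot for everything above it.

```hcl
variable "launch_template_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_autoscaling_group" "workers" {
  name_prefix         = "workers-"
  min_size            = 2
  max_size            = 20
  vpc_zone_identifier = var.private_subnet_ids

  # Proactively replace Spot instances at elevated risk of interruption.
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2 # always-on baseline
      on_demand_percentage_above_base_capacity = 0 # everything else is Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }

      # Several interchangeable types give Spot more pools to draw from.
      override {
        instance_type = "c5.large"
      }

      override {
        instance_type = "c5.xlarge"
      }

      override {
        instance_type = "c6i.large"
      }
    }
  }
}
```

The more instance types and Availability Zones the group may use, the less likely a single Spot capacity shortage is to affect it.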

Common Pitfalls to Avoid

Now that we've covered best practices, let's examine the mistakes that can cost you time, money, and security. Learn from these to avoid common EC2 traps.

1. Not Using Auto Scaling and Running Fixed Capacity

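Running a fixed number of instances means paying for idle capacity during quiet periods and falling short during peaks. Put instances in an Auto Scaling Group so capacity follows demand.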

2. Choosing the Wrong Instance Type

The Mistake: Picking whatever instance type seems "good enough" without analysing the workload.

The Fix:

Step 1: Analyse Your Workload

Step 2: Match Instance Type to Workload

Workload Type              | Instance Family | Example
Web servers (variable CPU) | T3/T3a          | t3.medium
API servers (steady CPU)   | M5/M6i          | m5.large
Batch processing           | C5/C6i          | c5.2xlarge
In-memory databases        | R5/R6i          | r5.xlarge
Data warehouses            | I3/I4i          | i3.2xlarge
ML training                | P3/P4           | p3.8xlarge
ML inference               | G4/Inf1         | g4dn.xlarge

Step 3: Use AWS Compute Optimiser

# Get recommendations (free service!)
aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-xxxxx

# Illustrative summary of the findings (the real response is JSON):
# Current: m5.xlarge (4 vCPU, 16 GB) = $0.192/hour
# Recommended: m5.large (2 vCPU, 8 GB) = $0.096/hour
# Reason: Average CPU 15%, Max CPU 32%, Avg Memory 40%
# Savings: $70/month

Step 4: Start Small, Scale Up

3. Not Enabling Termination Protection for Critical Instances

The Mistake:

Running critical instances without termination protection:

# ❌ BAD: No protection
resource "aws_instance" "database" {
  ami           = "ami-xxxxx"
  instance_type = "r5.2xlarge"

  # One accidental click = database gone!
}

The Fix:

Enable termination protection for critical instances:

# ✅ GOOD: Protected critical instance
resource "aws_instance" "database" {
  ami           = "ami-xxxxx"
  instance_type = "r5.2xlarge"

  disable_api_termination = true  # Cannot terminate via API/Console

  tags = {
    Name        = "production-database"
    Critical    = "true"
    Environment = "production"
  }
}

4. Running Instances in Public Subnets When Not Needed

The Mistake: Placing application servers in public subnets with public IPs:

# ❌ BAD: Application servers exposed to internet
resource "aws_instance" "app" {
  subnet_id                   = aws_subnet.public.id
  associate_public_ip_address = true

  # Now directly accessible from internet!
  # Constant brute force attacks on SSH
  # Security group mistakes = instant breach
}

Why This is Dangerous:

  1. Increased Attack Surface: Every instance is a target

  2. Brute Force Attacks: SSH/RDP are constantly hammered

  3. Security Group Mistakes: One wrong rule = compromise

  4. Compliance Issues: Violates security frameworks

  5. No Defence in Depth: Single layer of security

When Public Instances Are Okay:

  • Load Balancers (ALB/NLB)

  • NAT Instances (if you must)

  • Bastion Hosts (with heavy restrictions)

  • VPN servers

  • ❌ Application servers (use private + ALB)

  • ❌ Databases (always private!)

  • ❌ Cache servers (always private!)

  • ❌ Background workers (always private!)
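For contrast with the bad example above, a safer placement could be sketched like this. The variables `ami_id`, `private_subnet_id`, and `app_security_group_id` are placeholders, and the security group is assumed to allow inbound traffic only from the load balancer.

```hcl
variable "ami_id" {
  type = string
}

variable "private_subnet_id" {
  type = string
}

variable "app_security_group_id" {
  type = string # assumed to admit traffic only from the ALB
}

# ✅ GOOD: Application server reachable only through the load balancer
resource "aws_instance" "app" {
  ami                         = var.ami_id
  instance_type               = "t3.medium"
  subnet_id                   = var.private_subnet_id
  associate_public_ip_address = false
  vpc_security_group_ids      = [var.app_security_group_id]
}
```

Outbound access for patching and package installs then goes through a NAT gateway or VPC endpoints, and administrative access goes through Session Manager rather than SSH.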

Summary

EC2 is AWS's foundational compute service, and mastering it is essential for every DevOps engineer. Here are the key takeaways from this comprehensive guide:

Core Concepts to Master

Instance Types: Choose wisely based on workload - T3 for burstable, M5 for balanced, C5 for compute-intensive, R5 for memory-heavy, and I3 for storage-optimised workloads.

Purchasing Options: Mix On-Demand (flexibility), Reserved Instances/Savings Plans (steady workloads, 40-60% savings), and Spot Instances (fault-tolerant workloads, up to 90% savings) for optimal cost efficiency.

Storage: Use gp3 for general purpose (better than gp2), io2 for high-performance databases, st1 for big data, and sc1 for cold storage. Always match storage type to workload requirements.

Networking: Deploy across multiple AZs, use private subnets for applications, implement proper security groups, and leverage VPC endpoints to reduce NAT Gateway costs.

Final Thoughts

EC2 is powerful but requires careful planning and ongoing optimisation. Start with these fundamentals:

  • Security first - Use IAM roles, private subnets, IMDSv2, and least privilege

  • High availability - Multi-AZ, Auto Scaling, health checks

  • Cost optimisation - Right-sizing, Auto Scaling, Spot instances, proper storage

  • Automation - Infrastructure as Code, golden AMIs, automated backups

  • Monitoring - CloudWatch metrics, alarms, dashboards, and logs

Master EC2, and you'll have a solid foundation for building reliable, scalable, and cost-effective infrastructure on AWS. The practices covered here apply whether you're running a small startup application or managing enterprise-scale workloads.

Remember: Start simple, monitor everything, optimise continuously, and automate relentlessly.

#AWS #EC2 #DevOps #CloudComputing #AutoScaling #CostOptimization #CloudArchitecture #InfrastructureAsCode #Terraform #SRE

Essential AWS Services For DevOps Engineer

Part 3 of 16

In this series, I share the top 15 essential AWS services that every DevOps engineer should know. I cover not only what these services are but also how and why they are used in production from a DevOps perspective.
