Monitoring & Observability Stack: See Everything, Miss Nothing

Master CloudWatch, X-Ray, CloudTrail, and beyond to build observable, exam-ready AWS architectures

Updated 2026-02-22

Overview

AWS observability is built on three pillars — metrics, logs, and traces — delivered through an integrated stack of services including CloudWatch (metrics, logs, alarms, dashboards), AWS X-Ray (distributed tracing), CloudTrail (API audit logs), and AWS Config (resource configuration history). Understanding which tool answers which question ('Is it broken?', 'Why is it broken?', 'Who changed it?', 'What changed?') is the core skill tested across Solutions Architect, Developer, DevOps, and SysOps certifications. Exam questions routinely test your ability to select the right observability tool for a given operational or compliance scenario.

Distinguish between CloudWatch, X-Ray, CloudTrail, and AWS Config use cases so you can select the correct service for any operational, compliance, or troubleshooting scenario on the exam.

Patterns & Strategies

Metrics-Driven Alarming (CloudWatch Metrics + Alarms)

Collect numeric time-series data from AWS services and custom applications via CloudWatch Metrics, then set threshold-based or anomaly-detection alarms that trigger SNS notifications, Auto Scaling actions, EC2 actions, or Systems Manager OpsCenter items. With basic monitoring, EC2 publishes metrics every 5 minutes (free); Detailed Monitoring shortens the interval to 1 minute (additional cost). Custom metrics can be published at up to 1-second resolution (High-Resolution Metrics).

✓

When you need to react automatically to operational conditions — CPU spikes, queue depth growth, error rate thresholds, or custom business KPIs. Ideal for Auto Scaling triggers, paging on-call engineers, or driving operational dashboards.

⚠

Standard 5-minute granularity misses short-lived spikes. High-resolution custom metrics add cost: custom metrics are billed per metric per month, and high-resolution alarms are priced above standard alarms. An alarm in INSUFFICIENT_DATA state is easily mistaken for OK, which is a common exam trap.
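As a rough illustration of the alarming pattern above, the sketch below assembles the parameters for a threshold alarm. The parameter names follow the CloudWatch PutMetricAlarm API; the alarm name, queue name, threshold, and SNS topic ARN are made-up placeholders, and handing the result to boto3's `put_metric_alarm` is assumed rather than shown.

```python
def build_queue_depth_alarm(queue_name, topic_arn, threshold=100):
    """Return keyword arguments for a CloudWatch PutMetricAlarm call."""
    return {
        "AlarmName": f"{queue_name}-queue-depth-high",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,              # seconds per evaluation period
        "EvaluationPeriods": 3,     # three consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # "missing" leaves gaps as gaps, so an alarm with no data at all
        # ends up in INSUFFICIENT_DATA rather than OK.
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that pages on-call
    }


# Placeholder queue and topic, for illustration only.
params = build_queue_depth_alarm(
    "orders", "arn:aws:sns:us-east-1:111122223333:ops-alerts"
)
```

With boto3 installed and credentials configured, `boto3.client("cloudwatch").put_metric_alarm(**params)` would be the assumed next step.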

Centralized Log Aggregation (CloudWatch Logs + Log Insights)

Aggregate logs from EC2 (via CloudWatch Agent), Lambda (automatic), ECS, EKS, VPC Flow Logs, Route 53 query logs, and custom applications into CloudWatch Log Groups. Use CloudWatch Logs Insights to run ad-hoc queries across log groups in its purpose-built, pipe-delimited query language. Metric Filters extract numeric signals from log data to create CloudWatch Metrics. Subscription Filters stream logs in real time to Kinesis Data Streams, Kinesis Data Firehose, or Lambda for downstream processing.

✓

When you need to store, search, and analyze application or infrastructure logs without managing log infrastructure. Use Logs Insights for ad-hoc troubleshooting; use Metric Filters to create alarms from log patterns (e.g., alert on ERROR keyword count).

⚠

CloudWatch Logs Insights queries are billed per GB scanned. Long-term log retention in CloudWatch is more expensive than archiving to S3. For large-scale log analytics, OpenSearch Service or Athena on S3 is more cost-effective.
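To make the Logs Insights pattern concrete, here is a minimal sketch that builds the parameters for a query counting ERROR lines in 5-minute buckets over a deliberately narrow time window, since billing is per GB scanned. The field names follow the CloudWatch Logs StartQuery API; the log group name and the helper function are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Logs Insights query: keep lines containing ERROR, count them per 5 minutes.
ERROR_COUNT_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| stats count() as errors by bin(5m)"
)


def build_insights_query(log_group, end, lookback_minutes=60):
    """Return keyword arguments for a StartQuery call over a short window."""
    start = end - timedelta(minutes=lookback_minutes)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": ERROR_COUNT_QUERY,
    }


# Hypothetical log group; a 30-minute window keeps the scanned volume small.
query = build_insights_query(
    "/app/prod", datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc), 30
)
```

The dictionary could then be passed to boto3's `start_query` on a `logs` client, which is an assumed usage and not part of the sketch.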

Distributed Tracing (AWS X-Ray)

Instrument applications with the X-Ray SDK or AWS Distro for OpenTelemetry (ADOT) to generate trace data — segments and subsegments — that map the full request path across microservices, Lambda functions, API Gateway, DynamoDB, SQS, and more. The X-Ray Service Map provides a visual dependency graph with latency and error rates per node. Sampling rules control the percentage of requests traced to manage cost and performance overhead.

✓

When you need to identify latency bottlenecks or error sources in a distributed or microservices architecture. Essential for questions about 'why is my Lambda-to-DynamoDB call slow?' or 'which downstream service is causing 5xx errors in my API?'

⚠

Requires code instrumentation (SDK or agent). Sampling means not every request is traced by default — full tracing at high throughput is expensive. X-Ray does not replace CloudWatch Logs; they are complementary.
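The sampling trade-off can be illustrated with a small simulation. The class below is not the X-Ray SDK; it is a simplified model of a reservoir-plus-rate rule, using the default values as I understand them (trace the first request each second, then 5% of the rest). All names are hypothetical.

```python
import random


class ReservoirSampler:
    """Trace up to `reservoir` requests per second, then `rate` of the rest."""

    def __init__(self, reservoir=1, rate=0.05, rng=None):
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng or random.Random()
        self._second = None   # the wall-clock second currently being counted
        self._used = 0        # reservoir slots consumed in that second

    def should_trace(self, now):
        """Decide whether a request arriving at time `now` (seconds) is traced."""
        second = int(now)
        if second != self._second:
            # A new second starts: refill the reservoir.
            self._second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the fixed percentage.
        return self.rng.random() < self.rate


# 1,000 requests spread over 10 seconds with the default-style rule.
sampler = ReservoirSampler(rng=random.Random(0))
traced = sum(sampler.should_trace(i / 100) for i in range(1000))
```

Raising `rate` toward 1.0 approaches full tracing, which is where cost and overhead grow at high throughput.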

API Audit Trail (AWS CloudTrail)

CloudTrail records every AWS API call made in your account — who called it, from where, when, and with what parameters — as management events (control plane) and optionally data events (S3 object-level, Lambda invocations, DynamoDB item-level). Trails can deliver logs to S3 and optionally to CloudWatch Logs. CloudTrail Lake provides a managed, queryable event data store with SQL-based analysis. By default, management event history is retained for 90 days in the console without a trail.

✓

When you need to answer 'Who did what, when?' — security investigations, compliance audits (PCI-DSS, HIPAA, SOC), detecting unauthorized API calls, or root-causing configuration changes. Always the answer for 'who deleted my S3 bucket?' or 'who modified this IAM policy?'

⚠

Data events (S3, Lambda, DynamoDB) are NOT enabled by default and cost extra per 100K events. Without a configured Trail, you only get 90 days of management event history. CloudTrail is near-real-time, not a real-time streaming solution: log files typically reach S3 within minutes of the API call, with no guaranteed delivery time.
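A short sketch of how the 'who did what, when, from where' answer falls out of a CloudTrail record. The field names (`eventTime`, `eventSource`, `eventName`, `awsRegion`, `sourceIPAddress`, `userIdentity`, `requestParameters`) are standard CloudTrail record fields; the record itself is invented for illustration and heavily trimmed.

```python
import json

# Invented, trimmed-down record; real records carry many more fields.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2026-01-15T09:30:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteBucket",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {"type": "IAMUser",
                   "arn": "arn:aws:iam::111122223333:user/alice"},
  "requestParameters": {"bucketName": "example-bucket"}
}
""")


def summarize(record):
    """Reduce a CloudTrail record to who / what / when / where."""
    identity = record.get("userIdentity", {})
    return {
        # Fall back to the identity type when no ARN is present.
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "where": f'{record["awsRegion"]} from {record["sourceIPAddress"]}',
        "parameters": record.get("requestParameters") or {},
    }


summary = summarize(SAMPLE_RECORD)
```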

Resource Configuration History & Compliance (AWS Config)

AWS Config continuously records the configuration state of AWS resources and evaluates them against Config Rules (managed or custom Lambda-backed). Config provides a timeline of configuration changes for any resource, enabling you to answer 'what did this security group look like 30 days ago?' Config Conformance Packs bundle multiple rules for compliance frameworks. Remediation actions can automatically fix non-compliant resources via Systems Manager Automation.

✓

When you need continuous compliance monitoring, configuration drift detection, or a historical record of resource configurations. The correct answer for 'which EC2 instances are not using approved AMIs?' or 'notify me when any security group opens port 22 to 0.0.0.0/0'.

⚠

AWS Config charges per configuration item recorded and per active Config Rule evaluation. It records configuration state, not API calls — that distinction separates it from CloudTrail on the exam. Config does not prevent changes; it detects and optionally remediates them.
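The detect-after-the-fact nature of Config can be sketched as a pure function over a recorded configuration. This is not a deployable custom rule; it only shows the kind of check such a rule might run. The dictionary shape (`ipPermissions`, `ipProtocol`, `fromPort`, `toPort`, `ipRanges`) is a simplified assumption about how a security group configuration is represented.

```python
def evaluate_security_group(configuration, port=22):
    """Return a compliance verdict for one recorded security group state."""
    for permission in configuration.get("ipPermissions", []):
        protocol = permission.get("ipProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            covers_port = True
        elif protocol == "tcp":
            covers_port = permission["fromPort"] <= port <= permission["toPort"]
        else:
            covers_port = False
        open_to_world = "0.0.0.0/0" in permission.get("ipRanges", [])
        if covers_port and open_to_world:
            return "NON_COMPLIANT"
    return "COMPLIANT"


# Two invented configuration snapshots of the same group over time.
before = {"ipPermissions": [
    {"ipProtocol": "tcp", "fromPort": 22, "toPort": 22,
     "ipRanges": ["10.0.0.0/8"]},
]}
after = {"ipPermissions": [
    {"ipProtocol": "tcp", "fromPort": 22, "toPort": 22,
     "ipRanges": ["0.0.0.0/0"]},
]}
```

Note that the function only reports on a state that already exists, which mirrors the point that Config detects changes and does not block them.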

Unified Operational Visibility (CloudWatch Container Insights, Lambda Insights, Application Insights)

Purpose-built CloudWatch Insights features provide curated dashboards and metrics for specific workload types. Container Insights collects metrics and logs from ECS, EKS, and Kubernetes on EC2. Lambda Insights provides enhanced function-level metrics (memory usage, init duration, cold starts). Application Insights automatically detects and monitors application components (.NET, SQL Server, etc.) and surfaces anomalies. All are built on CloudWatch but require explicit enablement.

✓

When you need deeper, workload-specific observability without building custom dashboards. Use Container Insights for ECS/EKS troubleshooting, Lambda Insights for serverless performance tuning, and Application Insights for enterprise application monitoring.

⚠

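Each Insights feature has additional costs beyond base CloudWatch pricing. On EKS, Container Insights requires the CloudWatch Agent deployed as a DaemonSet (or the CloudWatch Observability add-on); on ECS, it is turned on through a cluster or account setting, with the agent needed only for instance-level metrics on EC2-backed clusters. These are tested as 'which feature enables container-level memory metrics?' type questions.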

Cross-Account & Cross-Region Observability (CloudWatch Observability Access Manager)

CloudWatch cross-account observability allows a monitoring account to view metrics, logs, and traces from multiple source accounts without switching consoles. It is configured through Observability Access Manager: a sink in the monitoring account and a link in each source account. AWS Organizations can be used to onboard accounts in bulk but is not required. CloudWatch cross-region dashboards aggregate data from multiple regions into a single pane. Amazon Managed Grafana and Amazon Managed Service for Prometheus (AMP) extend this to open-source tooling with AWS-managed infrastructure.

✓

In multi-account AWS Organizations environments where a central security or operations team needs unified visibility. Use Managed Grafana for teams already using Grafana dashboards. Use AMP for Kubernetes-native Prometheus metric collection at scale.

⚠

Managed Grafana and AMP have workspace-based pricing independent of CloudWatch. Cross-account observability requires explicit configuration of sharing and linking accounts. Not the default — must be explicitly set up.

Decision Framework

• STEP 1 — What question are you trying to answer?

→ 'Is my system healthy / how is it performing?' → CloudWatch Metrics + Alarms + Dashboards

→ 'What do my application logs say?' → CloudWatch Logs + Logs Insights

→ 'Why is this distributed request slow or failing?' → AWS X-Ray (distributed tracing)

→ 'Who made an API call / who changed this resource?' → AWS CloudTrail

→ 'What was the configuration of this resource, and is it compliant?' → AWS Config

→ 'How do I monitor containers (ECS/EKS)?' → CloudWatch Container Insights

→ 'How do I monitor Lambda cold starts and memory?' → CloudWatch Lambda Insights

• STEP 2 — What is the data type?

→ Numeric time-series → CloudWatch Metrics

→ Free-text log events → CloudWatch Logs

→ Request traces across services → X-Ray

→ API call records → CloudTrail

→ Resource configuration snapshots → AWS Config

• STEP 3 — What is the action needed?

→ Auto-scale or alert → CloudWatch Alarm → SNS / Auto Scaling / EC2 Action

→ Stream logs downstream → CloudWatch Logs Subscription Filter → Kinesis / Lambda

→ Compliance remediation → AWS Config Rule + Systems Manager Automation

→ Security investigation → CloudTrail Lake or CloudTrail + Athena on S3

→ Unified multi-account view → CloudWatch cross-account observability or Managed Grafana

• STEP 4 — Cost sensitivity?

→ Long-term log storage → move logs to S3, query with Athena

→ High-volume tracing → configure X-Ray sampling rules

→ Many Config Rules → evaluate cost per rule evaluation
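Step 1 of the framework above amounts to a lookup from question to service. The sketch below encodes it as a dictionary; the short keys are hypothetical labels chosen for this example, and the values are the answers listed in the framework.

```python
# Hypothetical short keys mapped to the Step 1 answers above.
TOOL_FOR_QUESTION = {
    "health": "CloudWatch Metrics + Alarms + Dashboards",
    "logs": "CloudWatch Logs + Logs Insights",
    "latency": "AWS X-Ray",
    "who": "AWS CloudTrail",
    "config": "AWS Config",
    "containers": "CloudWatch Container Insights",
    "lambda": "CloudWatch Lambda Insights",
}


def pick_tool(question):
    """Return the service for a question key, or raise on an unknown key."""
    try:
        return TOOL_FOR_QUESTION[question]
    except KeyError:
        known = ", ".join(sorted(TOOL_FOR_QUESTION))
        raise ValueError(f"unknown question {question!r}; expected one of: {known}")


audit_tool = pick_tool("who")
compliance_tool = pick_tool("config")
```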

Exam Tips

Critical: CloudTrail vs AWS Config

CloudTrail answers 'WHO did WHAT' (API-level audit); AWS Config answers 'WHAT changed in resource configuration and IS IT COMPLIANT'. These are different tools for different questions — never confuse them on the exam.

Critical: CloudWatch Logs Metric Filters

CloudWatch Logs Metric Filters let you create a CloudWatch Metric from a log pattern (e.g., count of ERROR strings). This metric can then trigger an Alarm — enabling alerting on log content without a third-party tool.
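What a Metric Filter does can be approximated locally: scan log events for a term and emit a count per time bucket, which an alarm can then compare to a threshold. The code below is a simulation of that idea, not the CloudWatch API; the event tuples and the threshold are invented.

```python
from collections import Counter


def error_count_per_minute(events, term="ERROR"):
    """Count events whose message contains `term`, keyed by minute start.

    `events` is an iterable of (epoch_seconds, message) pairs.
    The match is case-sensitive.
    """
    counts = Counter()
    for timestamp, message in events:
        if term in message:
            minute_start = timestamp - timestamp % 60
            counts[minute_start] += 1
    return dict(counts)


# Invented log events spanning two minutes.
events = [
    (1_700_000_040, "INFO request served"),
    (1_700_000_045, "ERROR upstream timeout"),
    (1_700_000_050, "ERROR upstream timeout"),
    (1_700_000_100, "ERROR disk full"),
    (1_700_000_110, "error in lowercase is not matched"),
]
per_minute = error_count_per_minute(events)

# Minutes in which the count would breach a threshold of more than one error.
breaching_minutes = [m for m, n in sorted(per_minute.items()) if n > 1]
```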

Critical: X-Ray instrumentation

X-Ray requires active instrumentation — you must add the X-Ray SDK or use ADOT. Lambda has built-in X-Ray active tracing support you enable with a checkbox, but other services (EC2, ECS) require the X-Ray daemon running alongside your application.

Critical: CloudTrail data events vs management events

CloudTrail data events (S3 object-level operations, Lambda invocations, DynamoDB item-level) are NOT enabled by default. If an exam scenario asks about tracking who accessed a specific S3 object, the answer requires enabling CloudTrail data events for that bucket.

Critical: CloudWatch Alarm states

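A CloudWatch Alarm has three states: OK, ALARM, and INSUFFICIENT_DATA. INSUFFICIENT_DATA occurs when there is not enough data to evaluate the alarm, and it is NOT the same as OK. Actions can be attached to a transition into any of the three states, so an alarm with actions configured only for ALARM stays silent when it enters INSUFFICIENT_DATA. The alarm's missing-data setting (missing, notBreaching, breaching, or ignore) controls how gaps are evaluated.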
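The three states can be illustrated with a deliberately simplified evaluator. Real CloudWatch evaluation has more moving parts (evaluation ranges, M-out-of-N datapoints, the `ignore` setting), so treat this only as a sketch of why missing data does not mean OK. The function and its arguments are hypothetical.

```python
def evaluate_alarm(datapoints, threshold, treat_missing="missing"):
    """Return OK, ALARM, or INSUFFICIENT_DATA for one evaluation window.

    `datapoints` holds one value per evaluation period; None marks a
    period with no data. Only three missing-data modes are modelled.
    """
    if treat_missing not in ("missing", "breaching", "notBreaching"):
        raise ValueError(f"unsupported mode: {treat_missing}")

    verdicts = []
    for value in datapoints:
        if value is not None:
            verdicts.append(value > threshold)
        elif treat_missing == "breaching":
            verdicts.append(True)
        elif treat_missing == "notBreaching":
            verdicts.append(False)
        # mode "missing": the gap stays a gap and contributes nothing

    if not verdicts:
        return "INSUFFICIENT_DATA"
    # ALARM only when every evaluated period breaches the threshold.
    return "ALARM" if all(verdicts) else "OK"


# A stopped instance publishes nothing: three empty periods.
stopped_instance = evaluate_alarm([None, None, None], threshold=80)
```

Here `stopped_instance` evaluates to INSUFFICIENT_DATA, while the same gaps under `treat_missing="notBreaching"` would read as OK, which is the configuration choice that can hide a dead service.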

Critical: EC2 memory and disk metrics

EC2 memory and disk metrics are NOT available in CloudWatch by default; you must install and configure the CloudWatch Agent. This is one of the most frequently tested 'missing metric' scenarios.

Important: CloudWatch detailed monitoring

CloudWatch detailed monitoring for EC2 reduces the metric reporting interval from 5 minutes to 1 minute. This must be explicitly enabled and incurs additional cost. Auto Scaling groups benefit from detailed monitoring for faster scale-out reactions.

Important: VPC Flow Logs

VPC Flow Logs capture IP traffic metadata (source IP, destination IP, port, protocol, bytes, action) but NOT the packet payload. They go to CloudWatch Logs or S3 and are essential for network security analysis — but they are NOT real-time; there is a capture window delay.
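To see that a flow log record is metadata only, the sketch below parses one line in the default (version 2) format. The field order follows the default format; the sample line uses example values and the helper name is hypothetical.

```python
# Field order of the default (version 2) flow log format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

NUMERIC = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_flow_log(line):
    """Split one space-separated flow log line into a dictionary."""
    record = dict(zip(FIELDS, line.split()))
    for key in NUMERIC:
        # NODATA and SKIPDATA records carry "-" in these positions.
        if record[key].isdigit():
            record[key] = int(record[key])
    return record


sample = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
record = parse_flow_log(sample)
window_seconds = record["end"] - record["start"]
```

The record describes an accepted TCP flow (protocol 6) to port 22 over a 60-second capture window; nothing in it contains packet contents.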

Important: CloudWatch Logs Insights pricing

CloudWatch Logs Insights queries are billed per GB of log data scanned, not per query. Narrowing time range and using specific log groups reduces cost. This is frequently tested in cost-optimization scenarios.

Important: EventBridge + CloudTrail integration

Amazon EventBridge (formerly CloudWatch Events) is the recommended way to react to CloudTrail API events in near-real-time (e.g., 'trigger a Lambda when someone calls DeleteSecurityGroup'). CloudWatch Events and EventBridge share the same underlying infrastructure.
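An EventBridge rule for the DeleteSecurityGroup example would use an event pattern along the following lines. The pattern keys (`source`, `detail-type`, `detail.eventSource`, `detail.eventName`) are the usual ones for CloudTrail-sourced events; the `matches` function is a toy matcher written for this sketch that handles exact-value lists only, not the full EventBridge pattern language.

```python
# Event pattern for "someone called DeleteSecurityGroup".
PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["DeleteSecurityGroup"],
    },
}


def matches(pattern, event):
    """Toy matcher: every pattern key must exist and match one listed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Invented event, trimmed to the fields the pattern inspects.
event = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DeleteSecurityGroup",
    },
}
is_match = matches(PATTERN, event)
```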

Important: AWS Config vs preventive controls

AWS Config records configuration changes but does NOT prevent them. To prevent non-compliant changes, you need Service Control Policies (SCPs) in AWS Organizations or IAM policies — Config only detects and optionally remediates after the fact.

Good to know: CloudTrail log analysis

For long-term CloudTrail log analysis at scale, the recommended pattern is CloudTrail → S3 → Athena. CloudTrail Lake is a newer managed alternative that provides SQL querying without the S3/Athena setup, but at higher cost.

Good to know: CloudWatch Synthetics

CloudWatch Synthetics Canaries are configurable scripts that monitor endpoints and APIs on a schedule, simulating user behavior. They are the AWS-native answer for 'how do I detect if my website is down from an external perspective?' — not CloudWatch Alarms on internal metrics.

Common Misconceptions & Traps

Common Mistake

CloudTrail and AWS Config both track changes, so they are interchangeable for compliance and audit use cases.

Correct

CloudTrail records API calls (who called what API, when, from where). AWS Config records the resulting resource configuration state and evaluates compliance rules against it. CloudTrail tells you the action; Config tells you the outcome and whether it's compliant. For 'who deleted this resource?' use CloudTrail. For 'is this security group compliant with our rules?' use Config.

Exam questions deliberately present both as options. The key discriminator is: API call audit = CloudTrail; resource configuration compliance = Config.

Common Mistake

CloudWatch monitors everything by default — if a service exists in AWS, its metrics are automatically visible in CloudWatch.

Correct

Many important metrics are NOT available by default. EC2 memory and disk utilization require the CloudWatch Agent. ECS container-level metrics require Container Insights. Lambda memory usage requires Lambda Insights. RDS Enhanced Monitoring requires explicit enablement. Custom application metrics must be published via the PutMetricData API.

Candidates assume CloudWatch is fully automatic. The exam tests knowledge of which metrics require additional agents or configuration — especially memory utilization on EC2, which is a classic trap.

Common Mistake

AWS X-Ray replaces CloudWatch Logs — once you enable X-Ray, you can see all your application errors and logs in the trace view.

Correct

X-Ray traces show the path and timing of requests across services (latency, errors per service node) but do NOT contain application log messages. CloudWatch Logs contains the actual log output. They are complementary: use X-Ray to identify which service is slow, then use CloudWatch Logs to read the detailed error messages from that service.

Candidates confuse tracing with logging. X-Ray answers 'where is the bottleneck?' — CloudWatch Logs answers 'what exactly happened?'

Common Mistake

A CloudWatch Alarm in INSUFFICIENT_DATA state means everything is fine — there's no problem to alert on.

Correct

INSUFFICIENT_DATA means the alarm cannot evaluate because there is not enough metric data — this often happens with new alarms, after a service stops publishing metrics, or when the metric period has no data points. It is NOT equivalent to OK. An alarm stuck in INSUFFICIENT_DATA may indicate a misconfigured metric name, a stopped EC2 instance, or a metric that simply hasn't published yet.

This is a classic exam trap. Candidates assume INSUFFICIENT_DATA = no problem. In reality it often signals a configuration issue or a dead service.

Common Mistake

Enabling CloudTrail in one region is sufficient to audit all AWS activity in my account.

Correct

CloudTrail trails can be scoped to a single region or configured as multi-region trails. A single-region trail ONLY captures API calls in that region. Global service events (IAM, STS, CloudFront) are logged in US East (N. Virginia), us-east-1, so a single-region trail in any other region misses them. To capture all activity across all regions, create a multi-region trail, or a multi-region AWS Organizations trail to cover every member account as well.

Multi-account and multi-region coverage is a common exam scenario. Missing global service events or regional API calls due to single-region trail configuration is a tested failure mode.

Common Mistake

VPC Flow Logs show the actual content of network packets, making them useful for deep packet inspection.

Correct

VPC Flow Logs only capture metadata: source/destination IP, source/destination port, protocol, bytes transferred, packets, action (ACCEPT/REJECT), and timestamps. They do NOT capture payload content. For deep packet inspection or IDS/IPS capabilities, you need AWS Network Firewall or a third-party appliance.

Candidates over-scope VPC Flow Logs. The exam tests this boundary — Flow Logs for traffic metadata analysis, not content inspection.

Common Mistake

CloudWatch Logs automatically expire after a set period, so you don't need to worry about log storage costs growing indefinitely.

Correct

By default, CloudWatch Log Groups have NEVER EXPIRE retention — logs are kept indefinitely and you are charged for storage. You must explicitly set a retention policy (1 day to 10 years) on each Log Group, or export logs to S3 for cheaper long-term storage. Forgetting to set retention is a common cost optimization failure.

The exam tests cost optimization. The correct answer for reducing CloudWatch Logs costs is to set appropriate retention periods and/or export to S3 with lifecycle policies to Glacier.
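Retention is set per log group and only accepts specific day counts. The helper below picks the shortest accepted value that still satisfies a minimum retention requirement. The list of accepted values reflects the PutRetentionPolicy documentation as I recall it and should be checked against the current API reference; the function name is hypothetical.

```python
# Day counts accepted for log group retention (assumed; verify before use).
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def pick_retention(min_days):
    """Return the smallest allowed retention that is at least `min_days`."""
    if min_days < 1:
        raise ValueError("min_days must be at least 1")
    for days in ALLOWED_RETENTION_DAYS:
        if days >= min_days:
            return days
    raise ValueError("requirement exceeds the 10-year maximum; archive to S3")


# A 45-day requirement rounds up to the next accepted value.
retention = pick_retention(45)
```

The chosen value would be the `retentionInDays` argument to `put_retention_policy` on a boto3 `logs` client, which is an assumed usage not exercised here.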

Memory Tricks

🧠

The Observability Stack = 'MLTAC': Metrics (CloudWatch), Logs (CloudWatch Logs), Traces (X-Ray), Audit (CloudTrail), Configuration (AWS Config)

🧠

CloudTrail = 'WHO called WHAT API WHEN' — think of it as the security camera recording every door swipe

🧠

AWS Config = 'WHAT does the resource LOOK LIKE and IS IT LEGAL?' — think of it as the building inspector checking code compliance

🧠

X-Ray = 'Follow the REQUEST through the maze' — it draws the map of your microservices journey

🧠

INSUFFICIENT_DATA ≠ OK — remember: 'No data is NOT good data'

Common Trap

Assuming CloudWatch automatically collects all metrics (especially EC2 memory/disk) and that CloudTrail captures all events by default (including S3 object-level access) — both require explicit additional configuration that candidates consistently overlook.

CertAI Tutor · 2026-02-22
