AWS Cloud Operations Blog
Automate installing AWS Systems Manager agent on unmanaged Amazon EC2 nodes
Managing a fleet of AWS resources at scale can be challenging. Organizations rely on multiple solutions to automate tasks, collect inventory, patch instances, and maintain security compliance. Organizations need to access instances without opening inbound ports or managing SSH keys. AWS Systems Manager (SSM) simplifies this by serving as a centralized management solution that supports […]
Centralized Multi-Account Application Resilience Assessment Using AWS Resilience Hub
As organizations scale their cloud environments, assessing resilience across multiple AWS accounts and regions presents several challenges. A disaggregated approach where resilience assessments are performed separately for each workload, account, or region can introduce inefficiencies, inconsistencies, and gaps in coverage. AWS Resilience Hub addresses these challenges by centralizing resilience evaluations, while accommodating the complexities […]
Simulating partial failures with AWS Fault Injection Service
Modern distributed systems must be resilient to unexpected disruptions to maintain availability, performance, and stability. Chaos engineering helps teams uncover hidden weaknesses by deliberately injecting faults into a system and observing how it recovers. While traditional testing validates expected behavior, chaos engineering tests system resilience during failures. AWS Fault Injection Service (AWS FIS) is a […]
Observing Agentic AI workloads using Amazon CloudWatch
As the adoption of agentic AI applications continues to grow, ensuring the reliability, performance, and overall observability of these systems becomes increasingly critical. Agentic AI applications, powered by large language models (LLMs) and integrated with various data sources and APIs, can quickly become complex, making it challenging to gain visibility into their inner workings […]
Best practices for utilizing AWS Systems Manager with AWS Fault Injection Service
In today’s cloud-centric world, ensuring the resilience of mission-critical applications is paramount. The ability to withstand and recover from unexpected failures, including degradation of cloud provider services, can mean the difference between seamless operation and costly downtime. This is where the powerful combination of AWS Systems Manager (SSM) and AWS Fault Injection Service (AWS […]
Optimizing Queries with Amazon Managed Prometheus
In today’s cloud-native environments, organizations rely on metrics monitoring to maintain application reliability and performance. Amazon Managed Service for Prometheus serves as a tool for storing and analyzing application and infrastructure metrics. As applications and platforms evolve, teams often discover opportunities to optimize their metrics querying patterns. Common scenarios like expanding service deployments, growing […]
How Indegene Optimizes User Experience with Amazon CloudWatch
In today’s digital healthcare landscape, optimal application performance and user experience are crucial for business success. Indegene, a digital-first life sciences commercialization company, combines deep medical expertise with domain-contextualized technology to help clients accelerate innovation, modernize operations, and improve customer experience. With the world’s top 20 pharma companies among its clientele, Indegene brings an AI-first […]
Exporting a subset of AWS CloudTrail Lake events to Amazon S3
Monitoring and managing your AWS environment is critical to maintaining security and operational excellence. With the availability of AWS CloudTrail Lake data for zero-ETL analysis in Amazon Athena, you can use Athena to query your activity logs in CloudTrail Lake without the operational complexity of moving data or building data processing pipelines. CloudTrail Lake […]
New: AWS CloudTrail Lake Event Enrichment: Add Business Context to AWS Activity Logs
AWS customers use AWS CloudTrail Lake to aggregate and analyze their AWS activity for security, operational troubleshooting, and compliance purposes. However, when investigating security incidents or conducting compliance audits, customers often need additional business context beyond the basic event details – like which team or project owns the affected resources, or what were the properties […]
Visualizing Amazon DynamoDB data with Amazon OpenSearch Service and Amazon Managed Grafana
High-performance applications with unlimited throughput capabilities pose significant monitoring challenges, especially when tracking real-time metrics, utilization, and throttling events across distributed database workloads. Near real-time visibility into metrics is crucial for application performance and cost optimization. AWS allows you to seamlessly integrate multiple services to tackle these operational complexities. With Amazon DynamoDB, you can build […]