AWS Well-Architected
Mobile App
Operational Excellence
The operational excellence pillar focuses on running and monitoring systems, and continually improving processes and procedures.
Key topics include automating changes, responding to events, and defining standards to manage daily operations.
There are four best practice areas for operational excellence in the cloud: Organization, Prepare, Operate, and Evolve.
Organization
You need to understand your organization's priorities, your organizational structure, and how your organization supports your team members, so that they can support your business outcomes.
Organization priorities
Your teams need to have a shared understanding of your entire workload, their role in it, and shared business goals to set the priorities that will enable business success
Well-defined priorities will maximize the benefits of your efforts.
Review your priorities regularly so that they can be updated as your organization's needs change.
Best Practices
Involve key stakeholders, including business, development, and operations teams, to determine where to focus efforts on external customer needs.
This will ensure that you have a thorough understanding of the operations support that is required to achieve your desired business outcomes.
Customers whose needs are satisfied are much more likely to remain customers.
Evaluating and understanding external customer needs will inform how you prioritize your efforts to deliver business value.
- Understand business needs: Business success is enabled by shared goals and understanding across stakeholders, including business, development, and operations teams.
- Review business goals, needs, and priorities of external customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of external customers.
- Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support your shared business goals across internal and external customers.
Involve key stakeholders when determining where to focus efforts on internal customer needs.
Ensure you understand the operations support that is required to achieve business outcomes.
Use your established priorities to focus your improvement efforts where they will have the greatest impact (for example, developing team skills, improving workload performance, reducing costs, automating runbooks, or enhancing monitoring).
Update your priorities as needs change.
- Understand business needs: Business success is enabled by shared goals and understanding across stakeholders including business, development, and operations teams.
- Review business goals, needs, and priorities of internal customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of internal customers.
- Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support shared business goals across internal and external customers.
Ensure that you are aware of guidelines or obligations defined by your organization that may mandate or emphasize specific focus.
Evaluate internal factors, such as organization policy, standards, and requirements.
Validate that you have mechanisms to identify changes to governance. If no governance requirements are identified, ensure that you have applied due diligence to this determination.
Evaluating and understanding the governance requirements that your organization applies to your workload will inform how you prioritize your efforts to deliver business value.
- Understand governance requirements
- Evaluate internal governance factors, such as organizational policy, program policies, issue- or system-specific policies, standards, procedures, baselines, and guidelines.
- Validate that you have mechanisms to identify changes to governance.
- If no governance requirements are identified, ensure that you have applied due diligence to this determination.
- AWS Cloud Compliance
Evaluate external factors, such as regulatory compliance requirements and industry standards, to ensure that you are aware of guidelines or obligations that might mandate or emphasize specific focus.
If no compliance requirements are identified, ensure that you apply due diligence to this determination.
Evaluating and understanding the compliance requirements that apply to your workload will inform how you prioritize your efforts to deliver business value.
- Understand compliance requirements
- Evaluate external factors, such as regulatory compliance requirements and industry standards, to ensure that you are aware of guidelines or obligations that might mandate or emphasize specific focus.
- Understand regulatory compliance requirements: Identify regulatory compliance requirements that you are legally obligated to satisfy.
- Understand industry standards and best practices
- Understand internal compliance requirements that are established by your organization
- AWS Cloud Compliance
- AWS Compliance latest news
- AWS Compliance programs
Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats) and maintain current information in a risk registry.
Include the impact of risks when determining where to focus efforts.
- Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats)
- Maintain a threat model: Establish and maintain a threat model identifying potential threats, planned and in place mitigations, and their priority.
- Review the probability of threats manifesting as incidents, the cost to recover from those incidents and the expected harm caused, and the cost to prevent those incidents.
- Revise priorities as the contents of the threat model change.
- AWS Cloud Compliance
- AWS Latest Security Bulletins
- AWS Trusted Advisor
Evaluate the impact of tradeoffs between competing interests or alternative approaches, to help make informed decisions when determining where to focus efforts or choosing a course of action.
For example, accelerating speed to market for new features may be emphasized over cost optimization, or you may choose a relational database for non-relational data to simplify the effort to migrate a system, rather than migrating to a database optimized for your data type and updating your application.
- Evaluate the impact of tradeoffs between competing interests, to help make informed decisions when determining where to focus efforts
- AWS can help you educate your teams about AWS and its services to increase their understanding of how their choices can have an impact on your workload
- AWS Blog
- AWS Cloud Compliance
- AWS Discussion Forums
- AWS Documentation
- AWS Knowledge Center
- AWS Support
- AWS Support Center
- Amazon Builders Library
- AWS Podcast (official)
Manage benefits and risks to make informed decisions when determining where to focus efforts.
For example, it may be beneficial to deploy a workload with unresolved issues so that significant new features can be made available to customers.
It may be possible to mitigate associated risks, or it may become unacceptable to allow a risk to remain, in which case you will take action to address the risk.
You might find that you want to emphasize a small subset of your priorities at some point in time.
Use a balanced approach over the long term to ensure the development of needed capabilities and management of risk. Update your priorities as needs change.
Identifying the available benefits of your choices, and being aware of the risks to your organization, enables you to make informed decisions.
- Manage benefits and risks
- Identify benefits based on business goals, needs, and priorities. Examples include time-to-market, security, reliability, performance, and cost.
- Identify risks
- Assess benefits against risks and make informed decisions
- Evaluate the value of the benefit against the probability of the risk being realized and the cost of its impact.
Operating model
Your teams must understand their part in achieving business outcomes.
Teams need to understand their roles in the success of other teams, the role of other teams in their success, and have shared goals.
Understanding responsibility, ownership, how decisions are made, and who has authority to make decisions will help focus efforts and maximize the benefits from your teams.
The needs of a team will be shaped by their industry, their organization, the makeup of the team, and the characteristics of their workload.
It is unreasonable to expect a single operating model to be able to support all teams and their workloads.
Best Practices
Understand who has ownership of each application, workload, platform, and infrastructure component, what business value is provided by that component, and why that ownership exists.
Understanding the business value of these individual components and how they support business outcomes informs the processes and procedures applied against them.
Understanding ownership identifies who can approve improvements, implement those improvements, or both.
- Resources have identified owners
- Specify and record owners for resources
- Store resource ownership information with resources using metadata such as tags or resource groups.
- Define who owns an organization, account, collection of resources, or individual components
- Capture ownership in the metadata for the resources
- Use AWS Organizations to structure accounts
- AWS Organizations
- Tagging
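As a minimal sketch of recording ownership as metadata (the ARN, tag keys, and values below are illustrative assumptions, not part of the framework), tags can be applied in bulk with the Resource Groups Tagging API:

```python
# Sketch: recording ownership as tags with the Resource Groups Tagging API.
# The ARN and tag values are placeholders for illustration only.
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

tagging.tag_resources(
    ResourceARNList=[
        "arn:aws:s3:::example-workload-assets"  # hypothetical resource
    ],
    Tags={
        "Owner": "mobile-platform-team",        # team accountable for the resource
        "Contact": "mobile-platform@example.com",
        "BusinessUnit": "mobile-app",
    },
)
```

Storing ownership this way keeps the information with the resource itself, so it stays discoverable through tag-based searches and resource groups.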
Understand who has ownership of the definition of individual processes and procedures, why those specific processes and procedures are used, and why that ownership exists.
Understanding the reasons that specific processes and procedures are used enables identification of improvement opportunities.
Benefits of establishing this best practice: Understanding ownership identifies who can approve improvements, implement those improvements, or both.
- Processes and procedures have identified owners
- Identify process and procedures
- Define who owns the definition of a process or procedure
- Capture ownership in the metadata of the activity artifact: procedures automated in services like AWS Systems Manager (as documents) and AWS Lambda (as functions) support capturing metadata as tags.
- Use AWS Organizations to create tagging policies and ensure ownership and contact information are captured
- AWS Systems Manager (automate procedures)
- AWS Organizations
- Tagging
Understand who has responsibility to perform specific activities on defined workloads and why that responsibility exists.
Understanding who has responsibility to perform activities informs who will conduct the activity, validate the result, and provide feedback to the owner of the activity.
Benefits of establishing this best practice: Understanding who is responsible to perform an activity informs whom to notify when action is needed and who will perform the action, validate the result, and provide feedback to the owner of the activity.
- Capture the responsibility for performing processes and procedures used in your environment
- Identify and document the operations activities conducted in support of your workloads
- Define who is responsible to perform each activity: Identify the team responsible for an activity.
- Make this information discoverable
Understanding the responsibilities of your role and how you contribute to business outcomes informs the prioritization of your tasks and why your role is important.
This enables team members to recognize needs and respond appropriately.
Benefits of establishing this best practice: Understanding your responsibilities informs the decisions you make, the actions you take, and your hand off activities to their proper owners.
- Identify team members roles and responsibilities and ensure they understand the expectations of their role
- Make this information discoverable
Where no individual or team is identified, there are defined escalation paths to someone with the authority to assign ownership or plan for that need to be addressed.
Benefits of establishing this best practice: Understanding who has responsibility or ownership allows you to reach out to the proper team or team member to make a request or transition a task.
Having an identified person who has the authority to assign responsibility or ownership, or to plan to address needs, reduces the risk of inaction and needs not being addressed.
- Provide accessible mechanisms for members of your organization to discover and identify ownership and responsibility
- These mechanisms will enable them to identify who to contact, team or individual, for specific needs.
You are able to make requests to owners of processes, procedures, and resources.
Make informed decisions to approve requests where viable and determined to be appropriate after an evaluation of benefits and risks.
Benefits of establishing this best practice: It's critical that mechanisms exist to request additions, changes, and exceptions in support of teams' activities. Without this option, the current state becomes a constraint on innovation.
- Provide mechanisms for members of your organization to make requests to owners of processes, procedures, and resources in support of their business needs
Have defined or negotiated agreements between teams describing how they work with and support each other (for example, response times, service level objectives, or service level agreements).
Understanding the impact of the team's work on business outcomes, and the outcomes of other teams and organizations, informs the prioritization of their tasks and enables them to respond appropriately.
When responsibility and ownership are undefined or unknown, you are at risk of both not addressing necessary activities in a timely fashion and of redundant and potentially conflicting efforts emerging to address those needs.
Benefits of establishing this best practice: Establishing the responsibilities between teams, the objectives, and the methods for communicating needs eases the flow of requests and helps ensure the necessary information is provided.
This reduces the delay introduced by transition tasks between teams and helps support the achievement of business outcomes.
- Responsibilities between teams are predefined or negotiated
- Specifying the methods by which teams interact, and the information necessary for them to support each other, can help minimize the delay introduced as requests are iteratively reviewed and clarified
- Having specific agreements that define expectations (for example, response time, or fulfillment time) enables teams to make effective plans and resource appropriately.
Organizational culture
Provide support for your team members so that they can be more effective in taking action and supporting your business outcome.
Best practices
Senior leadership clearly sets expectations for the organization and evaluates success.
Senior leadership is the sponsor, advocate, and driver for the adoption of best practices and evolution of the organization
Benefits of establishing this best practice: Engaged leadership, clearly communicated expectations, and shared goals ensures that team members know what is expected of them.
Evaluating success enables identification of barriers to success so that they can be addressed through intervention by the sponsor, advocate, or their delegates.
- Have Executive Sponsorship
- Set expectations: Define and publish goals for your organization, including how they will be measured.
- Track achievement of goals
- Provide the resources necessary to achieve your goals
- Advocate for your teams. Act on behalf of your teams to help address obstacles and remove unnecessary burdens
- Drive and acknowledge best practices that provide quantifiable benefits and recognize the creators and adopters
- Create a culture of continual improvement. Encourage both personal and organizational growth and development
The workload owner has defined guidance and scope empowering team members to respond when outcomes are at risk.
Escalation mechanisms are used to get direction when events are outside of the defined scope.
By testing and validating changes early, you are able to address issues with minimized costs and limit the impact on your customers.
By testing prior to deployment you minimize the introduction of errors.
- Empower your team. Provide your team members the permissions, tools, and opportunity to practice the skills necessary to respond effectively
- Give your team members opportunity to practice the skills necessary to respond
- Perform game days
- Define and acknowledge team members' authority to take action
Team members have mechanisms and are encouraged to escalate concerns to decision makers and stakeholders if they believe outcomes are at risk.
Escalation should be performed early and often so that risks can be identified, and prevented from causing incidents.
- Encourage early and frequent escalation
- Have a mechanism for escalation
- Escalations should include the nature of the risk, the criticality of the workload, who is impacted, what the impact is, and the urgency (that is, when the impact is expected)
- Protect employees who escalate
Mechanisms exist and are used to provide timely notice to team members of known risks and planned events. Necessary context, details, and time (when possible) are provided to support determining if action is necessary, what action is required, and to take action in a timely manner.
For example, providing notice of software vulnerabilities so that patching can be expedited, or providing notice of planned sales promotions so that a change freeze can be implemented to avoid the risk of service disruption.
Planned events can be recorded in a change calendar or maintenance schedule so that team members can identify what activities are pending.
On AWS, AWS Systems Manager Change Calendar can be used to record these details. It supports programmatic checks of calendar status to determine if the calendar is open or closed to activity at a particular point of time.
Operations activities can be planned around specific approved windows of time that are reserved for potentially disruptive activities. AWS Systems Manager Maintenance Windows allows you to schedule activities against instances and other supported resources to automate the activities and make those activities discoverable.
- AWS Systems Manager Change Calendar
- AWS Systems Manager Maintenance Windows
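A short sketch of the programmatic calendar check described above, assuming a hypothetical change calendar named MobileAppChangeFreeze already exists in Systems Manager:

```python
# Sketch: gating a potentially disruptive activity on a Systems Manager Change Calendar.
# The calendar name is a placeholder; create your own calendar document first.
import boto3

ssm = boto3.client("ssm")

response = ssm.get_calendar_state(
    CalendarNames=["MobileAppChangeFreeze"]  # hypothetical calendar document name
)

if response["State"] == "OPEN":
    print("Calendar is open: safe to proceed with the planned maintenance activity.")
else:
    print("Calendar is closed: defer the activity until the change freeze ends.")
```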
Experimentation accelerates learning and keeps team members interested and engaged.
An undesired result is a successful experiment that has identified a path that will not lead to success.
Team members are not punished for successful experiments with undesired results. Experimentation is required for innovation to happen and turn ideas into outcomes.
- Encourage experimentation to support learning and innovation
- Encourage experimentation with technologies that may have applicability now or in the future to the achievement of your business outcomes
- Encourage experimentation with specific goals for team members to reach for, or with technologies that may have applicability in the near future
- Dedicate specific times when team members can be free of their normal responsibilities, so that they can focus on their experiments
- Acknowledge success. Understand that experiments with undesired outcomes are successful and have identified a path that will not lead to success.
Teams must grow their skill sets to adopt new technologies, and to support changes in demand and responsibilities in support of your workloads.
Growth of skills in new technologies is frequently a source of team member satisfaction and supports innovation.
Support your team members pursuit and maintenance of industry certifications that validate and acknowledge their growing skills.
Cross train to promote knowledge transfer and reduce the risk of significant impact when you lose skilled and experienced team members with institutional knowledge.
Provide dedicated structured time for learning.
- Team members are enabled and encouraged to maintain and grow their skill sets
- Provide resources for education
- Provide junior team members access to senior team members as mentors
- Plan for the continuing education needs of your team members
- Provide opportunities for team members to join other teams (temporarily or permanently)
- Support pursuit and maintenance of industry certifications
- AWS Getting Started Resource Center
- AWS Blogs
- AWS Cloud Compliance
- AWS Discussion Forums
- AWS Documentation
- AWS Online Tech Talks
- AWS Events and Webinars
- AWS Knowledge Center
- AWS Support
- AWS Well-Architected Framework
- AWS Podcast (official)
Maintain team member capacity, and provide tools and resources to support your workload needs.
Overtasking team members increases the risk of incidents resulting from human error.
Investments in tools and resources (for example, providing automation for frequently performed activities) can scale the effectiveness of your team, enabling them to support additional activities.
- Resource teams appropriately
- Understand team performance (Measure the achievement of operational outcomes)
- Track changes in output and error rate over time
- Understand impacts on team performance
- Act on behalf of your teams to help address obstacles and remove unnecessary burdens
- Provide the resources necessary for teams to be successful
Leverage cross-organizational diversity to seek multiple unique perspectives.
Use this perspective to increase innovation, challenge your assumptions, and reduce the risk of confirmation bias.
Grow inclusion, diversity, and accessibility within your teams to gain beneficial perspectives.
Organizational culture has a direct impact on team member job satisfaction and retention. Enable the engagement and capabilities of your team members to enable the success of your business.
- Seek diverse opinions and perspectives
- Give voice to underrepresented groups. Rotate roles and responsibilities in meetings
- Provide opportunity for team members to take on roles that they might not otherwise
- Provide a safe and welcoming environment
- Enable team members to participate fully
Prepare
To prepare for operational excellence, you have to understand your workloads and their expected behaviors. You will then be able to design them to provide insight to their status and build the procedures to support them.
Design telemetry
Design your workload so that it provides the information necessary for you to understand its internal state (for example, metrics, logs, events, and traces) across all components in support of observability and investigating issues
Iterate to develop the telemetry necessary to monitor the health of your workload, identify when outcomes are at risk, and enable effective responses. In AWS, you can emit and collect logs, metrics, and events from your applications and workload components to enable you to understand their internal state and health. You can integrate distributed tracing to track requests as they travel through your workload. Use this data to understand how your application and underlying components interact and to analyze issues and performance.
Best Practices
Application telemetry is the foundation for observability of your workload.
Your application should emit telemetry that provides insight into the state of the application and the achievement of business outcomes.
From troubleshooting to measuring the impact of a new feature, application telemetry informs the way you build, operate, and evolve your workload.
Collecting metrics over time can be used to develop baselines and detect anomalies.
- Implementing application telemetry consists of three steps:
- 1) Identifying a location to store telemetry
- 2) Identifying telemetry that describes the state of the application
- 3) Instrumenting the application to emit telemetry
- To identify what telemetry you need, start with the following questions: Is my application healthy? Is my application achieving business outcomes?
- AWS CloudWatch
- AWS SDK
- AWS Builders Library – Instrumenting Distributed Systems for Operational Visibility
- AWS Distro for OpenTelemetry
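One possible way to emit such telemetry from application code is a custom CloudWatch metric; the namespace, metric names, and dimensions below are assumptions for illustration only:

```python
# Sketch: emitting a business-outcome metric from application code with CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_completed_checkout(duration_seconds: float) -> None:
    """Publish one data point per completed checkout so baselines and anomalies can be derived."""
    cloudwatch.put_metric_data(
        Namespace="MobileApp/Business",  # hypothetical custom namespace
        MetricData=[
            {
                "MetricName": "CheckoutCompleted",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
            {
                "MetricName": "CheckoutLatency",
                "Value": duration_seconds,
                "Unit": "Seconds",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
        ],
    )
```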
Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events.
Use this information to help determine when a response is required.
Benefits of establishing this best practice: Understanding what is going on inside your workload enables you to respond if necessary.
- Implement log and metric telemetry: Instrument your workload to emit information about its internal state, status, and the achievement of business outcomes.
- Use this information to determine when a response is required.
- Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events).
- AWS CloudTrail
- AWS CloudWatch
- VPC Flow Logs
Instrument your application code to emit information about user activity, for example, click streams, or started, abandoned, and completed transactions.
Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.
- Design your application code to emit information about user activity (for example, click streams, or started, abandoned, and completed transactions).
- Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required
Design and configure your workload to emit information about the status (for example, reachability or response time) of resources it depends on.
Examples of external dependencies include external databases, DNS, and network connectivity.
Use this information to determine when a response is required.
- Implement dependency telemetry: Design and configure your workload to emit information about the state and status of systems it depends on.
- Some examples include: external databases, DNS, network connectivity, and external credit card processing services.
- AWS CloudWatch Agent with AWS Systems Manager
Implement your application code and configure your workload components to emit information about the flow of transactions across the workload.
Use this information to determine when a response is required and to assist you in identifying the factors contributing to an issue.
- Implement transaction traceability: Design your application and workload to emit information about the flow of transactions across system components, such as transaction stage, active component, and time to complete activity
- Use this information to determine what is in progress, what is complete, and what the results of completed activities are. This helps you determine when a response is required.
- For example, longer than expected transaction response times within a component can indicate issues with that component.
- AWS X-Ray
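A minimal sketch of transaction tracing with the AWS X-Ray SDK for Python, assuming the aws-xray-sdk package is installed and a segment is already open (for example, inside Lambda or behind the X-Ray middleware); the table and function names are hypothetical:

```python
# Sketch: tracing a transaction step and its downstream AWS calls with X-Ray.
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # instrument supported libraries (boto3, requests, etc.) automatically

@xray_recorder.capture("process_order")  # records a timed subsegment for this step
def process_order(order_id: str) -> None:
    # Calls made here are traced as part of the same request, so slow components
    # show up in the service map and trace timeline.
    boto3.client("dynamodb").get_item(
        TableName="Orders",                    # hypothetical table
        Key={"OrderId": {"S": order_id}},
    )
```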
Design for operations
Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing
These accelerate beneficial changes entering production, limit issues deployed, and enable rapid identification and remediation of issues introduced through deployment activities.
In AWS, you can view your entire workload (applications, infrastructure, policy, governance, and operations) as code. It can all be defined in and updated using code. This means you can apply the same engineering discipline that you use for application code to every element of your stack.
Best Practices
Use version control to enable tracking of changes and releases.
- Use version control: Maintain assets in version-controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure).
- Integrate the version control capabilities of your configuration management systems into your procedures.
- AWS CodeCommit
Test and validate changes to help limit and detect errors. Automate testing to reduce errors caused by manual processes, and reduce the level of effort to test.
- Test and validate changes: Changes should be tested and the results validated at all lifecycle stages (for example, development, test, and production).
- Use testing results to confirm new features and mitigate the risk and impact of failed deployments.
- Automate testing and validation to ensure consistency of review, to reduce errors caused by manual processes, and reduce the level of effort.
- AWS CodeBuild
Use configuration management systems to make and track configuration changes.
These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes.
Static configuration management sets values when initializing a resource that are expected to remain consistent throughout the resource’s lifetime.
- Use configuration management systems: Use configuration management systems to track and implement changes, to reduce errors caused by manual processes, and reduce the level of effort.
- AWS AppConfig
- AWS Developer Tools
- AWS OpsWorks
- AWS Systems Manager Change Calendar
- AWS Systems Manager Maintenance Windows
- AWS CloudFormation
- AWS Config
- AWS Elastic Beanstalk
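As one illustration of dynamic configuration management, an application could poll AWS AppConfig at runtime; the application, environment, and profile identifiers below are placeholders:

```python
# Sketch: reading dynamic configuration (for example, feature flags) from AWS AppConfig.
import boto3

appconfig = boto3.client("appconfigdata")

session = appconfig.start_configuration_session(
    ApplicationIdentifier="mobile-app",           # hypothetical AppConfig application
    EnvironmentIdentifier="production",
    ConfigurationProfileIdentifier="feature-flags",
)

result = appconfig.get_latest_configuration(
    ConfigurationToken=session["InitialConfigurationToken"]
)
config_bytes = result["Configuration"].read()     # empty if unchanged since the last poll
print(config_bytes.decode("utf-8") or "configuration unchanged")
```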
Use build and deployment management systems. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- This reduces lead time, enables increased frequency of change, and reduces the level of effort.
- AWS CodeBuild
- AWS CodeDeploy
- AWS Developer Tools
Perform patch management to gain features, address issues, and remain compliant with governance.
Automate patch management to reduce errors caused by manual processes, and reduce the level of effort to patch.
Patch and vulnerability management are part of your benefit and risk management activities.
- Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant with governance policy and vendor support requirements.
- In immutable systems, deploy with the appropriate patch set to achieve the desired result.
- Automate the patch management mechanism to reduce the elapsed time to patch, to reduce errors caused by manual processes, and reduce the level of effort to patch.
- AWS Developer Tools
- AWS Systems Manager Patch Manager
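A hedged sketch of automating a patch run with Systems Manager Patch Manager, assuming instances are tagged into a hypothetical patch group named mobile-app-prod:

```python
# Sketch: triggering a patch scan (or install) run across a tagged fleet.
import boto3

ssm = boto3.client("ssm")

ssm.send_command(
    DocumentName="AWS-RunPatchBaseline",              # managed patching document
    Targets=[{"Key": "tag:PatchGroup", "Values": ["mobile-app-prod"]}],
    Parameters={"Operation": ["Scan"]},               # use "Install" to apply missing patches
    Comment="Scheduled compliance scan for the mobile app fleet",
)
```

In practice this kind of command is usually scheduled through a maintenance window rather than run ad hoc, so patching happens in approved windows.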
Share best practices across teams to increase awareness and maximize the benefits of development efforts.
On AWS, application, compute, infrastructure, and operations can be defined and managed using code methodologies. This allows for easy release, sharing, and adoption.
- Share existing best practices, design standards, checklists, operating procedures, and guidance and governance requirements across teams to reduce complexity and maximize the benefits from development efforts.
- Ensure that procedures exist to request changes, additions, and exceptions to design standards to support continual improvement and innovation.
- Ensure that teams are aware of published content so that they can take advantage of content, and limit rework and wasted effort.
- Share an AWS CodeCommit repository
- Easy authorization of AWS Lambda functions
- Sharing an AMI with specific AWS accounts
Implement practices to improve code quality and minimize defects. Some examples include test-driven development, code reviews, and standards adoption.
- Implement practices to improve code quality to minimize defects and the risk of their being deployed.
- For example, test-driven development, pair programming, code reviews, and standards adoption.
- AWS CodeGuru
Use multiple environments to experiment, develop, and test your workload.
Use increasing levels of controls as environments approach production to gain confidence your workload will operate as intended when deployed.
- Provide developers sandbox environments with minimized controls to enable experimentation.
- Provide individual development environments to enable work in parallel, increasing development agility.
- Implement more rigorous controls in environments approaching production to gain confidence that your workload will operate as intended when deployed.
- Use infrastructure as code and configuration management systems to deploy environments that are configured consistent with the controls present in production to ensure systems operate as expected when deployed.
- AWS CloudFormation
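One way to stand up a consistent pre-production environment is to deploy the same infrastructure-as-code template used for production; the stack name, template URL, and parameter below are assumptions:

```python
# Sketch: creating a staging environment from a shared CloudFormation template.
import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.create_stack(
    StackName="mobile-app-staging",
    TemplateURL="https://s3.amazonaws.com/example-bucket/mobile-app.yaml",  # hypothetical template
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "staging"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Tags=[{"Key": "Owner", "Value": "mobile-platform-team"}],
)

# Wait until the environment is ready before running tests against it.
cloudformation.get_waiter("stack_create_complete").wait(StackName="mobile-app-staging")
```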
Frequent, small, and reversible changes reduce the scope and impact of a change.
This eases troubleshooting, enables faster remediation, and provides the option to roll back a change.
Automate build, deployment, and testing of the workload.
This reduces errors caused by manual processes and reduces the effort to deploy changes.
Apply metadata using Resource Tags and AWS Resource Groups following a consistent tagging strategy to enable identification of your resources
Tag your resources for organization, cost accounting, access controls, and targeting the execution of automated operations activities.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- This reduces lead time, enables increased frequency of change, and reduces the level of effort.
- AWS CodeBuild
- AWS CodeDeploy
Mitigate deployment risks
Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes
Using these practices mitigates the impact of issues introduced through the deployment of changes.
Best Practices
Plan to revert to a known good state, or remediate in the production environment if a change does not have the desired outcome. This preparation reduces recovery time through faster responses.
- Plan for unsuccessful changes: Plan to revert to a known good state (that is, roll back the change), or remediate in the production environment (that is, roll forward the change) if a change does not have the desired outcome.
- When you identify changes that you cannot roll back if unsuccessful, apply due diligence prior to committing the change.
Test changes and validate the results at all lifecycle stages to confirm new features and minimize the risk and impact of failed deployments.
On AWS, you can create temporary parallel environments to lower the risk, effort, and cost of experimentation and testing. Automate the deployment of these environments using AWS CloudFormation to ensure consistent implementations of your temporary environments.
- Test and validate changes: Changes should be tested and the results validated at all lifecycle stages (for example, development, test, and production).
- AWS Cloud9
- AWS CodeDeploy
Use deployment management systems to track and implement change. This reduces errors caused by manual processes and reduces the effort to deploy changes.
Build Continuous Integration/Continuous Deployment (CI/CD) pipelines
- Use deployment management systems: Use deployment management systems to track and implement change. This will reduce errors caused by manual processes, and reduce the level of effort to deploy changes.
- Automate the integration and deployment pipeline from code check-in through testing, deployment, and validation. This reduces lead time, enables increased frequency of change, and further reduces the level of effort.
- AWS CodeDeploy
Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
- Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
- AWS CodeDeploy
- AWS Blue/Green Deployments
Implement changes onto parallel environments, and then transition over to the new environment. Maintain the prior environment until there is confirmation of successful deployment.
Doing so minimizes recovery time by enabling rollback to the previous environment.
- Deploy using parallel environments: Implement changes onto parallel environments, and transition or cut over to the new environment.
- Maintain the prior environment until there is confirmation of successful deployment.
- This minimizes recovery time by enabling rollback to the previous environment
- AWS CodeDeploy
- AWS Blue/Green Deployments
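A sketch of starting such a deployment with AWS CodeDeploy (the deployment group is assumed to be configured for blue/green; application, bucket, and revision names are placeholders):

```python
# Sketch: starting a CodeDeploy deployment with automatic rollback on failure.
import boto3

codedeploy = boto3.client("codedeploy")

response = codedeploy.create_deployment(
    applicationName="mobile-app-backend",             # hypothetical application
    deploymentGroupName="production",                 # group configured for blue/green
    revision={
        "revisionType": "S3",
        "s3Location": {
            "bucket": "example-artifacts",
            "key": "mobile-app-backend-1.4.2.zip",
            "bundleType": "zip",
        },
    },
    description="Release 1.4.2 via blue/green deployment",
    autoRollbackConfiguration={"enabled": True, "events": ["DEPLOYMENT_FAILURE"]},
)
print("Deployment started:", response["deploymentId"])
```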
Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change.
- Use frequent, small, and reversible changes to reduce the scope of a change.
- This results in easier troubleshooting and faster remediation with the option to roll back a change
Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- AWS CodeBuild
- AWS CodeDeploy
Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
- Automate testing of deployed environments to confirm desired outcomes.
- Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
- AWS IAM
- AWS Organizations
Operational readiness and change management
Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload
Manage the flow of change into your environments. You should use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. This will also enable you to find any areas that you need to make plans to address. You will have runbooks that document your routine activities and playbooks that guide your processes for issue resolution. Use a mechanism to manage changes that supports the delivery of business value and helps mitigate risks associated with change.
Best Practices
Have a mechanism to validate that you have the appropriate number of trained personnel to provide support for operational needs.
Train personnel and adjust personnel capacity as necessary to maintain effective support.
You will need to have enough team members to cover all activities (including on-call). Ensure that your teams have the necessary skills to be successful with training on your workload, your operations tools, and AWS.
- Personnel capability: Validate that there are sufficient trained personnel to effectively support the workload.
- Team size: Ensure that you have enough team members to cover operational activities, including on-call duties.
- Review capabilities: Review team size and skills as operating conditions and workloads change, to ensure there is sufficient capability to maintain operational excellence.
- AWS Blogs
- AWS Events and Webinars
- AWS Training and Certification
Use Operational Readiness Reviews (ORRs) to validate that you can operate your workload.
ORR is a mechanism developed at Amazon to validate that teams can safely operate their workloads.
An ORR is a review and inspection process using a checklist of requirements.
- To learn more about ORRs, read the Operational Readiness Reviews (ORR) whitepaper.
A runbook is a documented process to achieve a specific outcome.
Runbooks consist of a series of steps that someone follows to get something done.
Runbooks are an essential part of operating your workload. From onboarding a new team member to deploying a major release, runbooks are the codified processes that provide consistent outcomes no matter who uses them.
- Runbooks can take several forms depending on the maturity level of your organization. At a minimum, they should consist of a step-by-step text document. The desired outcome should be clearly indicated.
- Clearly document necessary special permissions or tools.
- Provide detailed guidance on error handling and escalations in case something goes wrong.
- AWS Systems Manager Automation runbooks
Playbooks are step-by-step guides used to investigate an incident. When incidents happen, playbooks are used to investigate, scope impact, and identify a root cause.
Playbooks are used for a variety of scenarios, from failed deployments to security incidents.
In many cases, playbooks identify the root cause that a runbook is used to mitigate. Playbooks are an essential component of your organization's incident response plans.
- If you are new to the cloud, build playbooks in text form in a central document repository
- As your organization matures, playbooks can become semi-automated with scripting languages like Python.
- Start building your playbooks by listing common incidents that happen to your workload.
- Your text playbooks should be automated as your organization matures.
- AWS Systems Manager Automation runbooks
Evaluate the capabilities of the team to support the workload and the workload's compliance with governance.
- Evaluate the capabilities of the team to support the workload and the workload's compliance with governance
- Evaluate these against the benefits of deployment when determining whether to transition a system or change into production.
- Understand the benefits and risks, and make informed decisions.
Operate
Success is the achievement of business outcomes as measured by the metrics you define. By understanding the health of your workload and operations, you can identify when organizational and business outcomes may become at risk, or are at risk, and respond appropriately.
Understanding workload health
Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action
Your team should be able to understand the health of your workload easily. You will want to use metrics based on workload outcomes to gain useful insights. You should use these metrics to implement dashboards with business and technical viewpoints that will help team members make informed decisions.
Best Practices
- AWS CloudWatch metrics
- AWS Organizations
Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
You should aggregate log data from your application, workload components, services, and API calls to a service such as CloudWatch Logs.
- Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
- AWS Athena
- AWS CloudWatch metrics
- AWS DevOps Guru
- AWS Glue
- AWS Glue Data Catalog
- AWS Health Dashboard
- AWS QuickSight
Establish baselines for metrics to provide expected values as the basis for comparison and identification of under- and over-performing components. Identify thresholds for improvement, investigation, and intervention.
- Establish baselines for workload metrics: Establish baselines for workload metrics to provide expected values as the basis for comparison.
- AWS CloudWatch
Establish patterns of workload activity to identify anomalous behavior so that you can respond appropriately if required.
CloudWatch, through its Anomaly Detection feature, applies statistical and machine learning algorithms to generate a range of expected values that represent normal metric behavior.
- Learn expected patterns of activity for workload: Establish patterns of workload activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.
- AWS DevOps Guru
- AWS CloudWatch Anomaly Detection
Raise an alert when workload outcomes are at risk so that you can respond appropriately if necessary.
Ideally, you have previously identified a metric threshold that you are able to alarm upon or an event that you can use to trigger an automated response.
- Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if required.
- AWS CloudWatch Synthetics
- AWS CloudWatch Log Insights
- AWS CloudWatch Events
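For example, a threshold alarm on a business metric could notify an SNS topic when outcomes are at risk; the metric, threshold, and topic ARN below are illustrative assumptions:

```python
# Sketch: alarming when a workload outcome metric breaches a threshold.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mobile-app-checkout-errors-high",
    Namespace="MobileApp/Business",                    # hypothetical custom namespace
    MetricName="CheckoutErrors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
    AlarmDescription="Checkout error volume indicates business outcomes are at risk",
)
```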
Raise an alert when workload anomalies are detected so that you can respond appropriately if necessary.
Your analysis of your workload metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response.
- Raise an alert when workload anomalies are detected so that you can respond appropriately if required.
- AWS CloudWatch Alarms/Events
- AWS CloudWatch Anomaly Detection
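A sketch of wiring this up with CloudWatch Anomaly Detection, assuming an Application Load Balancer metric as the example; the load balancer dimension is a placeholder:

```python
# Sketch: training an anomaly detection model and alarming when values leave the expected band.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_anomaly_detector(
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/mobile-app/abc123"}],  # placeholder
    Stat="Sum",
)

cloudwatch.put_metric_alarm(
    AlarmName="mobile-app-request-count-anomaly",
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    EvaluationPeriods=3,
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "LoadBalancer", "Value": "app/mobile-app/abc123"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
    ThresholdMetricId="ad1",
    ActionsEnabled=False,  # enable and add AlarmActions once the band looks reasonable
)
```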
Create a business-level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
Validate the effectiveness of KPIs and metrics and revise them if necessary.
- Create a business level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Validate the effectiveness of KPIs and metrics and revise them if necessary.
- AWS CloudWatch dashboards
Understanding operational health
Define, capture, and analyze operations metrics to gain visibility to workload events so that you can take appropriate action
Your team should be able to understand the health of your operations easily. You will want to use metrics based on operations outcomes to gain useful insights. You should use these metrics to implement dashboards with business and technical viewpoints that will help team members make informed decisions.
Best Practices
Identify key performance indicators (KPIs) based on desired business outcomes (for example, new features delivered) and customer outcomes (for example, customer support cases).
- Evaluate KPIs to determine operations success.
Define operations metrics to measure the achievement of KPIs (for example, successful deployments, and failed deployments).
- Define operations metrics to measure the achievement of KPIs.
- Define operations metrics to measure the health of operations and its activities.
- Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of the operations.
- AWS CloudWatch Events
Perform regular, proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
You should aggregate log data from the execution of your operations activities and operations API calls into a service such as CloudWatch Logs.
- Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
- AWS Athena
- AWS CloudWatch
- AWS Glue
- AWS Glue Data Catalog
- AWS QuickSight
Establish baselines for metrics to provide expected values as the basis for comparison and identification of under- and over-performing operations activities.
- Establish baselines for operations metrics to provide expected values as the basis for comparison.
Establish patterns of operations activities to identify anomalous activity so that you can respond appropriately if necessary.
- Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.
Whenever operations outcomes are at risk, an alert must be raised and acted upon. Operations outcomes are any activity that supports a workload in production.
This includes everything from deploying new versions of applications to recovering from an outage
- Start by defining what operations activities are most important to your organization.
- Your organization must define key operations activities and how they are measured so that they can be monitored, improved, and alerted on.
- You need a central location where workload and operations telemetry is stored and analyzed. The same mechanism should be able to raise an alert when an operations outcome is at risk.
- AWS EventBridge
- AWS Systems Manager OpsCenter
Raise an alert when operations anomalies are detected so that you can respond appropriately if necessary.
Your analysis of your operations metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response.
- Raise an alert when operations anomalies are detected so that you can respond appropriately if required.
- AWS DevOps Guru
- AWS CloudWatch Anomaly Detection
Create a business-level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Create a business level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Validate the effectiveness of KPIs and metrics and revise them if necessary
- AWS CloudWatch dashboards
Responding to events
You should anticipate operational events, both planned (for example, sales promotions, deployments, and failure tests) and unplanned (for example, surges in utilization and component failures)
You should use your existing runbooks and playbooks to deliver consistent results when you respond to alerts. Defined alerts should be owned by a role or a team that is accountable for the response and escalations. You will also want to know the business impact of your system components and use this to target efforts when needed. You should perform a root cause analysis (RCA) after events, and then prevent recurrence of failures or document workarounds.
Best Practices
Your organization has processes to handle events, incidents, and problems.
Events are things that occur in your workload but may not need intervention.
Incidents are events that require intervention.
Problems are recurring events that require intervention or cannot be resolved.
You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately
- Track events that happen in your workload, even if no human intervention is required.
- Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching.
- You can use services like Amazon EventBridge or Amazon Simple Notification Service to generate custom events for tracking.
- AWS EventBridge
- AWS SNS
- AWS Health Dashboard
- AWS Systems Manager Incident Manager
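For example, a custom operations event could be published to EventBridge so it is tracked centrally; the event source and detail fields below are assumptions:

```python
# Sketch: publishing a custom operations event (for example, a completed deployment) to EventBridge.
import json
import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "mobile-app.operations",         # hypothetical event source
            "DetailType": "DeploymentCompleted",
            "Detail": json.dumps({
                "service": "mobile-app-backend",
                "version": "1.4.2",
                "status": "SUCCEEDED",
            }),
        }
    ]
)
```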
Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert.
- Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion.
- Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes
- AWS CloudWatch Events
Ensure that when multiple events require intervention, those that are most significant to the business are addressed first.
Impacts can include loss of life or injury, financial loss, or damage to reputation or trust.
- Ensure that when multiple events require intervention, those that are most significant to the business are addressed first
Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation.
Specifically identify owners for each action to ensure effective and prompt responses to operations events.
- Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation
- Escalate an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed
Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and again when the services return to normal operating conditions, to enable users to take appropriate action.
- Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and when the services return to normal operating conditions, to enable users to take appropriate action.
- AWS SES
- AWS SNS
Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest.
- Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team.
- AWS QuickSight
- AWS CloudWatch Dashboards
Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
- Create CloudWatch Events rules to trigger responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation).
- AWS CloudWatch Events
- AWS CloudTrail
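A minimal sketch of an automated response: an EventBridge rule that matches a hypothetical failed-deployment event and invokes a placeholder Lambda function:

```python
# Sketch: routing an operational event to an automated response via EventBridge.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="on-failed-deployment",
    EventPattern=json.dumps({
        "source": ["mobile-app.operations"],
        "detail-type": ["DeploymentCompleted"],
        "detail": {"status": ["FAILED"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="on-failed-deployment",
    Targets=[
        {
            "Id": "notify-and-rollback",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:deployment-responder",  # placeholder
        }
    ],
)
# The target Lambda function also needs a resource-based permission allowing
# events.amazonaws.com to invoke it from this rule (lambda add-permission).
```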
Evolve
Evolution is the continuous cycle of improvement over time. Implement frequent small incremental changes based on the lessons learned from your operations activities and evaluate their success at bringing about improvement.
Learn, share, and improve
It's essential that you regularly provide time for analysis of operations activities, analysis of failures, experimentation, and making improvements
When things fail, you will want to ensure that your team, as well as your larger engineering community, learns from those failures. You should analyze failures to identify lessons learned and plan improvements. You will want to regularly review your lessons learned with other teams to validate your insights.
Best Practices
Regularly evaluate and prioritize opportunities for improvement to focus efforts where they can provide the greatest benefits.
- Implement changes to improve and evaluate the outcomes to determine success
- If the outcomes do not satisfy the goals, and the improvement is still a priority, iterate using alternative courses of action
Review customer-impacting events, and identify the contributing factors and preventative actions.
Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses
- Have a process to identify and document the contributing factors of an incident so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses.
- Communicate root cause as appropriate, tailored to target audiences
Feedback loops provide actionable insights that drive decision making. Build feedback loops into your procedures and workloads.
Feedback loops help you identify issues and areas that need improvement, and they validate the investments made in improvements.
- You need a mechanism to receive feedback from customers and team members. Your operations activities can also be configured to deliver automated feedback.
- Your organization needs a process to review this feedback, determine what to improve, and schedule the improvement.
- Feedback must be added into your software development process
- AWS Systems Manager OpsCenter
Mechanisms exist for your team members to discover the information that they are looking for in a timely manner, access it, and identify that it’s current and complete.
Mechanisms are present to identify needed content, content in need of refresh, and content that should be archived so that it’s no longer referenced.
- Ensure mechanisms exist for your team members to discover the information that they are looking for in a timely manner, access it, and identify that it’s current and complete.
- Maintain mechanisms to identify needed content, content in need of refresh, and content that should be archived so that it’s no longer referenced.
Identify drivers for improvement to help you evaluate and prioritize opportunities.
On AWS, you can aggregate the logs of all your operations activities, workloads, and infrastructure to create a detailed activity history.
You can then use AWS tools to analyze your operations and workload health over time.
- Understand drivers for improvement: You should only make changes to a system when a desired outcome is supported.
- AWS Athena
- AWS QuickSight
- AWS Compliance
- AWS Glue
- AWS Trusted Advisor
Review your analysis results and responses with cross-functional teams and business owners. Use these reviews to establish common understanding, identify additional impacts, and determine courses of action. Adjust responses as appropriate.
- Engage with business owners and subject matter experts to ensure there is common understanding and agreement on the meaning of the data you have collected. Identify additional concerns and potential impacts, and determine courses of action.
Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business.
Use these reviews to identify opportunities for improvement, potential courses of action, and to share lessons learned.
- Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business.
- Engage stakeholders, including the business, development, and operations teams, to validate your findings from immediate feedback and retrospective analysis, and to share lessons learned.
- Use their insights to identify opportunities for improvement and potential courses of action.
- AWS CloudWatch
- AWS CloudWatch metrics
Document and share lessons learned from the operations activities so that you can use them internally and across teams.
You should share what your teams learn to increase the benefit across your organization.
- Have procedures to document the lessons learned from the execution of operations activities and retrospective analysis so that they can be used by other teams
- Have procedures to share lessons learned and associated artifacts across teams. For example, share updated procedures, guidance, governance, and best practices through an accessible wiki.
- Share scripts, code, and libraries through a common repository.
Dedicate time and resources within your processes to make continuous incremental improvements possible.
On AWS, you can create temporary duplicates of environments, lowering the risk, effort, and cost of experimentation and testing.
- Dedicate time and resources within your processes to make continuous incremental improvements possible.
- Implement changes to improve and evaluate the results to determine success.
Security
The security pillar describes how to take advantage of cloud technologies to protect data, systems, and assets in a way that can improve your security posture.
There are six focus areas for Security in the cloud
Security foundations
The security pillar describes how to take advantage of cloud technologies to protect data, systems, and assets in a way that can improve your security posture.
AWS account management and separation
We recommend that you organize workloads in separate accounts and group accounts based on function, compliance requirements, or a common set of controls rather than mirroring your organization’s reporting structure.
In AWS, accounts are a hard boundary. For example, account-level separation is strongly recommended for isolating production workloads from development and test workloads.
Manage accounts centrally: AWS Organizations automates AWS account creation and management, and control of those accounts after they are created.
Set controls centrally: Control what your AWS accounts can do by only allowing specific services, Regions, and service actions at the appropriate level.
Configure services and resources centrally: AWS Organizations helps you configure AWS services that apply to all of your accounts.
Best Practices
Start with security and infrastructure in mind to enable your organization to set common guardrails as your workloads grow. This approach provides boundaries and controls between workloads.
Account level separation is strongly recommended for isolating production environments from development and test environments, or providing a strong logical boundary between workloads that process data of different sensitivity levels, as defined by external compliance requirements (such as PCI-DSS or HIPAA), and workloads that don’t.
- Use AWS Organizations to centrally enforce policy-based management for multiple AWS accounts.
- Consider AWS Control Tower: AWS Control Tower provides an easy way to set up and govern a new, secure, multi-account AWS environment based on best practices.
- AWS Organizations
- AWS Control Tower
There are a number of aspects to securing your AWS accounts, including securing and not routinely using the root user, and keeping your contact information up to date.
You can use AWS Organizations to centrally manage and govern your accounts as you grow and scale your workloads in AWS.
AWS Organizations helps you manage accounts, set controls, and configure services across your accounts.
- Use AWS Organizations to centrally enforce policy-based management for multiple AWS accounts.
- Limit use of the AWS root user: Only use the root user to perform tasks that specifically require it.
- Enable multi-factor-authentication (MFA) for the root user: Enable MFA on the AWS account root user, if AWS Organizations is not managing root users for you.
- Periodically change the root user password.
- Enable notification when the AWS account root user is used.
- Restrict access to newly added Regions (see the example service control policy sketched after the list below).
- Consider AWS CloudFormation StackSets: CloudFormation StackSets can be used to deploy resources including IAM policies, roles, and groups into different AWS accounts and Regions from an approved template.
- AWS Organizations
- AWS Control Tower
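A sketch, using boto3, of creating and attaching a service control policy with AWS Organizations that denies activity outside approved Regions. The policy name, Region list, and target OU ID are placeholders, and a production policy would normally exempt global services such as IAM.

```python
import json
import boto3

org = boto3.client("organizations")

# Example SCP: deny all actions outside the approved Regions.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-west-2"]}},
    }],
}

policy = org.create_policy(
    Name="deny-unapproved-regions",               # placeholder name
    Description="Restrict activity to approved Regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the guardrail to an organizational unit (placeholder OU ID).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid111-exampleouid111",
)
```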
Operating your workloads securely
Operating workloads securely covers the whole lifecycle of a workload from design, to build, to run, and to ongoing improvement.
One of the ways to improve your ability to operate securely in the cloud is by taking an organizational approach to governance
Governance is the way that decisions are guided consistently without depending solely on the good judgment of the people involved.
Best Practices
Based on your compliance requirements and risks identified from your threat model, derive and validate the control objectives and controls that you need to apply to your workload.
Ongoing validation of control objectives and controls help you measure the effectiveness of risk mitigation.
- Identify compliance requirements: Discover the organizational, legal, and compliance requirements that your workload must comply with.
- Identify AWS compliance resources: Identify resources that AWS has available to assist you with compliance.
- AWS Compliance website
To help you define and implement appropriate controls, recognize attack vectors by staying up to date with the latest security threats.
Consume AWS Managed Services to make it easier to receive notification of unexpected or unusual behavior in your AWS accounts. Investigate using AWS Partner tools or third-party threat information feeds as part of your security information flow.
- Subscribe to threat intelligence sources. Regularly review threat intelligence information from multiple sources that are relevant to the technologies used in your workload.
- Consider AWS Shield Advanced service: It provides near real-time visibility into intelligence sources, if your workload is internet accessible.
- AWS Shield
Stay up-to-date with both AWS and industry security recommendations to evolve the security posture of your workload.
AWS Security Bulletins contain important information about security and privacy notifications.
- Follow AWS updates: Subscribe or regularly check for new recommendations, tips and tricks.
- Subscribe to industry news: Regularly review news feeds from multiple sources that are relevant to the technologies that are used in your workload.
- AWS security blog
Establish secure baselines and templates for security mechanisms that are tested and validated as part of your build, pipelines, and processes.
Use tools and automation to test and validate all security controls continuously.
- Automate configuration management: Enforce and validate secure configurations automatically by using a configuration management service or tool.
- AWS Systems Manager
- AWS CloudFormation/CloudFormation Guard
- AWS Config
- AWS CodePipeline (CodeCommit + CodeDeploy)
Use a threat model to identify and maintain an up-to-date register of potential threats. Prioritize your threats and adapt your security controls to prevent, detect, and respond.
Revisit and maintain this in the context of the evolving security landscape.
- Create a threat model: A threat model can help you identify and address potential security threats.
- AWS security bulletins website
Evaluate and implement security services and features from AWS and AWS Partners that allow you to evolve the security posture of your workload.
The AWS Security Blog highlights new AWS services and features, implementation guides, and general security guidance.
What's New with AWS? is a great way to stay up to date with all new AWS features, services, and announcements.
- Plan regular reviews: Create a calendar of review activities that includes compliance requirements, evaluation of new AWS security features and services, and staying up-to-date with industry news.
- Discover AWS services and features: Discover the security features that are available for the services that you are using, and review new features as they are released.
- Define processes for onboarding of new AWS services. Include how you evaluate new AWS services for functionality, and the compliance requirements for your workload.
- Test new services and features as they are released in a non-production environment that closely replicates your production one.
- Implement other defense mechanisms: Implement automated mechanisms to defend your workload, explore the options available.
- AWS Security blog
- AWS security bulletins website
Identity and access management
To use AWS services, you must grant your users and applications access to resources in your AWS accounts.
As you run more workloads on AWS, you need robust identity management and permissions in place to ensure that the right people have access to the right resources under the right conditions.
AWS offers a large selection of capabilities to help you manage your human and machine identities and their permissions.
Identity management
There are two types of identities you need to manage when operating secure AWS workloads: human identities and machine identities.
Human identities: The administrators, developers, operators, and consumers of your applications require an identity to access your AWS environments and applications.
Machine identities: Your workload applications, operational tools, and components require an identity to make requests to AWS services, for example, to read data. These identities include machines running in your AWS environment, such as Amazon EC2 instances or AWS Lambda functions.
Best Practices
Enforce minimum password length, and educate your users to avoid common or reused passwords.
Enforce multi-factor authentication (MFA) with software or hardware mechanisms to provide an additional layer of verification
- Create an AWS Identity and Access Management (IAM) policy to enforce MFA sign-in; a minimal example follows the service list below.
- Enable MFA in your identity provider.
- Configure a strong password policy.
- Rotate credentials regularly.
- AWS IAM Identity Center
- AWS Secrets Manager
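A minimal sketch of an identity-based policy that denies actions (other than MFA self-management) when the request was not MFA-authenticated, created with boto3. The policy name is a placeholder and the statement is simplified compared with AWS's published example.

```python
import json
import boto3

iam = boto3.client("iam")

# Deny everything except MFA setup when the caller did not sign in with MFA.
enforce_mfa = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAllExceptMFASetupWithoutMFA",
        "Effect": "Deny",
        "NotAction": [
            "iam:CreateVirtualMFADevice",
            "iam:EnableMFADevice",
            "iam:ListMFADevices",
            "sts:GetSessionToken",
        ],
        "Resource": "*",
        "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
    }],
}

iam.create_policy(
    PolicyName="require-mfa-for-console-and-api",   # placeholder name
    PolicyDocument=json.dumps(enforce_mfa),
)
```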
For human identities using the AWS Management Console, require users to acquire temporary credentials and federate into AWS. You can do this using the AWS IAM Identity Center user portal.
For users requiring CLI access, ensure that they use AWS CLI v2, which supports direct integration with IAM Identity Center.
For machine identities, you should rely on IAM roles to grant access to AWS.
- Audit and rotate credentials periodically: Periodic validation, preferably through an automated tool, is necessary to verify that the correct controls are enforced.
- Store and use secrets securely: For credentials that are not IAM-related and cannot take advantage of temporary credentials, such as database logins, use a service that is designed to handle management of secrets, such as Secrets Manager.
- Implement least privilege policies: Assign access policies with least privilege to IAM groups and roles to reflect the user's role or function that you have defined.
- Remove unnecessary permissions: Implement least privilege by removing permissions that are unnecessary.
- Consider permissions boundaries: A permissions boundary is an advanced feature for using a managed policy that sets the maximum permissions that an identity-based policy can grant to an IAM entity.
- Consider resource tags for permissions: You can use tags to control access to your AWS resources that support tagging. You can also tag IAM users and roles to control what they can access.
- AWS IAM Identity Center
- AWS Secrets Manager
- AWS Cognito
For workforce and machine identities that require secrets such as passwords to third-party applications, store them with automatic rotation.
Secrets Manager makes it easy to manage, rotate, and securely store encrypted secrets using supported services; a short example follows the service list below.
- Use AWS Secrets Manager: AWS Secrets Manager is an AWS service that makes it easier for you to manage secrets.
- Secrets can be database credentials, passwords, third-party API keys, and even arbitrary text.
- AWS Secrets Manager
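A minimal sketch, using boto3, of storing and later retrieving a database credential with Secrets Manager. The secret name and values are placeholders; automatic rotation would typically be configured separately with a rotation Lambda function.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Store a database credential instead of embedding it in application config.
secrets.create_secret(
    Name="prod/mobile-app/db",                               # placeholder secret name
    SecretString=json.dumps({"username": "app_user", "password": "example-only"}),
)

# At runtime the application fetches the secret; with rotation enabled,
# Secrets Manager keeps this value current without a redeploy.
value = json.loads(secrets.get_secret_value(SecretId="prod/mobile-app/db")["SecretString"])
```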
For workforce identities, rely on an identity provider that enables you to manage identities in a centralized place.
This makes it easier to manage access across multiple applications and services, because you are creating, managing, and revoking access from a single location.
- Centralize administrative access: Create an Identity and Access Management (IAM) identity provider entity to establish a trusted relationship between your AWS account and your identity provider (IdP).
- Centralize application access: Consider Amazon Cognito for centralizing application access. It lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily.
- Remove old IAM users and groups: After you start using an identity provider (IdP), remove IAM users and groups that are no longer required.
- AWS IAM Identity Center
When you cannot rely on temporary credentials and require long-term credentials, audit credentials to ensure that the defined controls (for example, multi-factor authentication (MFA)) are enforced, that credentials are rotated regularly, and that they have the appropriate access level.
- Regularly audit credentials: Use credential reports and AWS Identity and Access Management (IAM) Access Analyzer to audit IAM credentials and permissions.
- Use Access Levels to Review IAM Permissions: To improve the security of your AWS account, regularly review and monitor each of your IAM policies.
- Consider automating IAM resource creation and updates: AWS CloudFormation can be used to automate the deployment of IAM resources, including roles and policies, to reduce human error because the templates can be verified and version controlled.
- AWS IAM Access Analyzer
As the number of users you manage grows, you will need to determine ways to organize them so that you can manage them at scale.
Place users with common security requirements in groups defined by your identity provider, and put mechanisms in place to ensure that user attributes that may be used for access control (for example, department or location) are correct and updated.
Use these groups and attributes to control access, rather than individual users.
- If you are using AWS IAM Identity Center (successor to AWS Single Sign-On) (IAM Identity Center), configure groups: IAM Identity Center provides you with the ability to configure groups of users, and assign groups the desired level of permission.
- Learn about attribute-based access control (ABAC): ABAC is an authorization strategy that defines permissions based on attributes.
- AWS IAM Identity Center
- AWS Attribute-based access control (ABAC)
- AWS Secrets management
Permissions management
Manage permissions to control access to human and machine identities that require access to AWS and your workloads
Permissions control who can access what, and under what conditions. Set permissions to specific human and machine identities to grant access to specific service actions on specific resources.
There are a number of ways to grant access to different types of resources. One way is by using different policy types.
Identity-based policies in IAM attach to IAM identities, including users, groups, or roles. These policies let you specify what that identity can do (its permissions).
Best Practices
Each component or resource of your workload needs to be accessed by administrators, end users, or other components.
Have a clear definition of who or what should have access to each component, choose the appropriate identity type and method of authentication and authorization.
- Have a clear definition of who or what should have access to each component, choose the appropriate identity type and method of authentication and authorization.
- Regular access to AWS accounts within the organization should be provided using federated access or a centralized identity provider.
- When defining access requirements for non-human identities, determine which applications and components need access and how permissions are granted. Using IAM roles built with the least privilege access model is a recommended approach.
- AWS services, such as AWS Secrets Manager and AWS Systems Manager Parameter Store, can help decouple secrets from the application or workload securely in cases where it's not feasible to use IAM roles.
- AWS IAM Identity Center
- AWS Attribute-based access control (ABAC)
- AWS IAM Roles anywhere
- AWS IAM Policies
Grant only the access that identities require by allowing access to specific actions on specific AWS resources under specific conditions.
Rely on groups and identity attributes to dynamically set permissions at scale, rather than defining permissions for individual users.
- Establishing a principle of least privilege ensures that identities are only permitted to perform the most minimal set of functions necessary to fulfill a specific task, while balancing usability and efficiency.
- Use policies to explicitly grant permissions attached to IAM or resource entities, such as an IAM role used by federated identities or machines, or resources.
- There are several AWS capabilities to help you scale permission management and adhere to the principle of least privilege. Attribute-based access control (ABAC) lets you make authorization decisions based on the tags applied to a resource and to the calling IAM principal; a minimal policy sketch follows the service list below.
- IAM Access Analyzer
- IAM Policy Simulator
- AWS Control Tower (GuardRails)
- AWS Verified Access (zero trust)
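A minimal ABAC sketch: an identity-based policy that allows EC2 start/stop actions only when the instance's project tag matches the project tag on the calling principal. The tag key, actions, and policy name are illustrative assumptions.

```python
import json
import boto3

iam = boto3.client("iam")

# Allow start/stop of EC2 instances only when the instance's "project" tag
# matches the "project" tag on the calling principal (ABAC).
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/project": "${aws:PrincipalTag/project}"}
        },
    }],
}

iam.create_policy(
    PolicyName="abac-project-ec2",                 # placeholder name
    PolicyDocument=json.dumps(abac_policy),
)
```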
Establish a process that allows emergency access to your workload in the unlikely event of an issue with an automated process or pipeline.
This will help you rely on least privilege access, but ensure users can obtain the right level of access when they require it.
- Establishing emergency access can take several forms for which you should be prepared. The first is a failure of your primary identity provider. In this case, you should rely on a second method of access with the required permissions to recover. This method could be a backup identity provider or an IAM user.
- You should also be prepared for emergency access where temporary elevated administrative access is needed.
As teams and workloads determine what access they need, remove permissions they no longer use and establish review processes to achieve least privilege permissions.
Continuously monitor and reduce unused identities and permissions.
- Configure AWS Identity and Access Management (IAM) Access Analyzer: AWS IAM Access Analyzer helps you identify the resources in your organization and accounts, such as Amazon Simple Storage Service (Amazon S3) buckets or IAM roles, that are shared with an external entity.
- AWS IAM Access Analyzer
Establish common controls that restrict access to all identities in your organization
For example, you can restrict access to specific AWS Regions, or prevent your operators from deleting common resources, such as an IAM role used for your central security team.
- As you grow and manage additional workloads in AWS, you should separate these workloads using accounts and manage those accounts using AWS Organizations.
- We recommend that you establish common permission guardrails that restrict access to all identities in your organization.
- You can get started by implementing example service control policies, such as preventing users from disabling key services (a sketch follows the service list below).
- We recommend you avoid running workloads in your management account. The management account should be used to govern and deploy security guardrails that will affect member accounts.
- Using a multi-account strategy allows you to have greater flexibility in applying guardrails to your workloads.
- AWS Organizations
- AWS Service Control Policies
- AWS Control Tower
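A sketch, using boto3, of a permission guardrail implemented as a service control policy that prevents member-account identities from disabling key audit services. The policy name and the exact action list are assumptions to adapt to your organization.

```python
import json
import boto3

org = boto3.client("organizations")

# Example guardrail: deny actions that would disable audit and detection services.
guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": [
            "cloudtrail:StopLogging",
            "cloudtrail:DeleteTrail",
            "config:StopConfigurationRecorder",
            "guardduty:DeleteDetector",
        ],
        "Resource": "*",
    }],
}

org.create_policy(
    Name="protect-audit-services",                 # placeholder name
    Description="Prevent disabling CloudTrail, Config, and GuardDuty",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(guardrail),
)
```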
Integrate access controls with operator and application lifecycle and your centralized federation provider.
For example, remove a user’s access when they leave the organization or change roles.
- Implement a user access lifecycle policy for new users joining, job function changes, and users leaving so that only current users have access.
- AWS IAM Access Analyzer
- AWS Attribute-based access control (ABAC)
Continuously monitor findings that highlight public and cross-account access.
Reduce public access and cross-account access to only resources that require this type of access.
- Consider configuring IAM Access Analyzer with AWS Organizations to verify you have visibility through all your accounts.
- You can also use AWS Config to report and remediate resources for any accidental public access configuration, through AWS Config policy checks. Services like AWS Control Tower and AWS Security Hub simplify deploying checks and guardrails across an organization to identify and remediate publicly exposed resources.
- AWS IAM Access Analyzer
- AWS Control Tower (Guardrails)
- AWS Config (managed rules)
- AWS Trusted Advisor
Govern the consumption of shared resources across accounts or within your AWS Organizations.
Monitor shared resources and review shared resource access.
- Govern the consumption of shared resources across accounts or within your AWS Organizations. Monitor shared resources and review shared resource access.
- AWS Resource Access Manager
- VPC endpoints
Detection
Detection enables you to identify a potential security misconfiguration, threat, or unexpected behavior. It’s an essential part of the security lifecycle and can be used to support a quality process, a legal or compliance obligation, and for threat identification and response efforts.
Detection
Detection consists of two parts: detection of unexpected or unwanted configuration changes, and the detection of unexpected behavior
Detection enables you to identify a potential security misconfiguration, threat, or unexpected behavior.
It’s an essential part of the security lifecycle and can be used to support a quality process, a legal or compliance obligation, and for threat identification and response efforts.
Best Practices
Configure logging throughout the workload, including application logs, resource logs, and AWS service logs.
A foundational practice is to establish a set of detection mechanisms at the account level. This base set of mechanisms is aimed at recording and detecting a wide range of actions on all resources in your account.
- Enable logging of AWS services.
- Evaluate and enable logging of operating systems and application-specific logs to detect suspicious behavior.
- Apply appropriate controls to the logs: Logs can contain sensitive information and only authorized users should have access.
- Configure Amazon GuardDuty (an enablement sketch follows the service list below).
- Configure a customized trail in CloudTrail.
- Enable AWS Config.
- Enable AWS Security Hub.
- AWS CloudWatch
- AWS CloudTrail
- AWS EventBridge
- AWS Config
- AWS Security Hub
- AWS GuardDuty
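A minimal sketch, using boto3, of turning on GuardDuty and creating a multi-Region CloudTrail trail. The trail name and bucket name are placeholders, and the bucket must already exist with a bucket policy that allows CloudTrail to write to it.

```python
import boto3

guardduty = boto3.client("guardduty")
cloudtrail = boto3.client("cloudtrail")

# Turn on GuardDuty threat detection in this account and Region.
guardduty.create_detector(Enable=True)

# Record API activity across all Regions into an existing, correctly
# permissioned S3 bucket (names are placeholders).
cloudtrail.create_trail(
    Name="org-audit-trail",
    S3BucketName="example-cloudtrail-logs-111122223333",
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="org-audit-trail")
```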
Security operations teams rely on the collection of logs and the use of search tools to discover potential events of interest, which might indicate unauthorized activity or unintentional change.
A best practice for building a mature security operations team is to deeply integrate the flow of security events and findings into a notification and workflow system such as a ticketing system, a bug or issue system, or other security information and event management (SIEM) system.
- Evaluate log processing capabilities: Evaluate the options that are available for processing logs.
- As a start for analyzing CloudTrail logs, test Amazon Athena.
- Implement centralized logging in AWS: See the following AWS example solution to centralize logging from multiple sources.
- Implement centralized logging with a partner: APN Partners have solutions to help you analyze logs centrally.
- AWS Security Hub
- AWS CloudWatch
- AWS EventBridge
Using automation to investigate and remediate events reduces human effort and error, and enables you to scale investigation capabilities.
In AWS, investigating events of interest and information on potentially unexpected changes into an automated workflow can be achieved using Amazon EventBridge.
Detecting change and routing this information to the correct workflow can also be accomplished using AWS Config Rules and Conformance Packs.
- Implement automated alerting with GuardDuty.
- Develop automated processes that investigate an event and report information to an administrator to save time.
- AWS CloudWatch
- AWS EventBridge
- AWS Security Hub
Create alerts that are sent to and can be actioned by your team. Ensure that alerts include relevant information for the team to take action.
For each detective mechanism you have, you should also have a process, in the form of a runbook or playbook, to investigate
- Discover metrics available for AWS services: Discover the metrics that are available through Amazon CloudWatch for the services that you are using.
- Configure Amazon CloudWatch alarms (a short example follows the service list below).
- AWS CloudWatch
- AWS EventBridge
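A minimal sketch, using boto3, of a CloudWatch alarm on a service metric that notifies an SNS topic so the team can act on it. The metric, dimensions, thresholds, and topic ARN are illustrative placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the load balancer returns too many 5XX errors,
# and notify the on-call SNS topic (ARN is a placeholder).
cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],  # placeholder
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder
)
```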
Infrastructure protection
Infrastructure protection encompasses control methodologies, such as defense in depth, that are necessary to meet best practices and organizational or regulatory obligations. Use of these methodologies is critical for successful, ongoing operations in the cloud.
Protecting networks
Users, both in your workforce and your customers, can be located anywhere. You need to pivot from traditional models of trusting anyone and anything that has access to your network
When you follow the principle of applying security at all layers, you employ a Zero Trust approach.
Zero Trust security is a model where application components or microservices are considered discrete from each other and no component or microservice trusts any other.
Best Practices
Group components that share reachability requirements into layers.
For example, a database cluster in a virtual private cloud (VPC) with no need for internet access should be placed in subnets with no route to or from the internet.
For network connectivity that can include thousands of VPCs, AWS accounts, and on-premises networks, you should use AWS Transit Gateway. It acts as a hub that controls how traffic is routed among all the connected networks, which act like spokes.
- Create subnets in VPC: Create subnets for each layer (in groups that include multiple Availability Zones), and associate route tables to control routing.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
When architecting your network topology, you should examine the connectivity requirements of each component.
For example, determine whether a component requires internet accessibility (inbound and outbound), connectivity to VPCs, edge services, or external data centers.
- Control network traffic in a VPC: Implement VPC best practices to control traffic.
- Control traffic at the edge: Implement edge services, such as Amazon CloudFront, to provide an additional layer of protection and other features.
- Control private network traffic: Implement services that protect your private traffic for your workload.
- AWS Firewall Manager
- AWS Inspector
- AWS CloudFront
- AWS Global Accelerator
- AWS WAF
- AWS Route53
- AWS VPC Peering
- AWS Private Link
- AWS Transit Gateway
- AWS DirectConnect
- AWS Site-to-Site VPN
- AWS Client VPN
Automate protection mechanisms to provide a self-defending network based on threat intelligence and anomaly detection.
For example, intrusion detection and prevention tools that can adapt to current threats and reduce their impact.
A web application firewall is an example of where you can automate network protection, to automatically block requests originating from IP addresses associated with known threat actors
- Automate protection for web-based traffic: AWS offers a solution that uses AWS CloudFormation to automatically deploy a set of AWS WAF rules designed to filter common web-based attacks.
- Consider AWS Partner solutions: AWS Partners offer hundreds of industry-leading products that are equivalent, identical to, or integrate with existing controls in your on-premises environments.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
Inspect and filter your traffic at each layer. You can inspect your VPC configurations for potential unintended access using VPC Network Access Analyzer.
For components transacting over HTTP-based protocols, a web application firewall can help protect from common attacks.
- Configure Amazon GuardDuty: GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads.
- Configure virtual private cloud (VPC) Flow Logs: VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC (a short example follows the service list below).
- Consider VPC traffic mirroring, an Amazon VPC feature that you can use to copy network traffic from an elastic network interface of Amazon Elastic Compute Cloud (Amazon EC2) instances and then send it to out-of-band security and monitoring appliances for content inspection, threat monitoring, and troubleshooting.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
- AWS Transit Gateway
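A minimal sketch, using boto3, of enabling VPC Flow Logs delivered to CloudWatch Logs. The VPC ID, log group, and IAM role ARN are placeholders, and the role must already grant CloudWatch Logs delivery permissions.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture accepted and rejected traffic for the whole VPC into CloudWatch Logs.
ec2.create_flow_logs(
    ResourceIds=["vpc-0abc1234def567890"],                     # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs/prod",                        # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::111122223333:role/vpc-flow-logs",  # placeholder role
)
```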
Protecting compute
Compute resources include EC2 instances, containers, AWS Lambda functions, database services, IoT devices, and more
Each of these compute resource types requires a different approach to securing it.
However, they do share common strategies that you need to consider: defense in depth, vulnerability management, reduction in attack surface, automation of configuration and operation, and performing actions at a distance.
Best Practices
Frequently scan and patch for vulnerabilities in your code, dependencies, and in your infrastructure to help protect against new threats.
You are responsible for patch management for your AWS resources, including Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Machine Images (AMIs), and many other compute resources.
- Configure Amazon Inspector: Amazon Inspector tests the network accessibility of your Amazon Elastic Compute Cloud (Amazon EC2) instances and the security state of the applications that run on those instances.
- Scan source code: Scan libraries and dependencies for vulnerabilities.
- AWS CloudFormation
- AWS CloudFormation Guard
- AWS CodePipeline
- AWS CodeGuru
- AWS Systems Manager
- AWS WAF
Reduce your exposure to unintended access by hardening operating systems and minimizing the components, libraries, and externally consumable services in use.
Start by reducing unused components.
You can find many hardening and security configuration guides for common operating systems and server software. For example, you can start with the Center for Internet Security and iterate.
- Harden operating system: Configure operating systems to meet best practices.
- Harden containerized resources: Configure containerized resources to meet security best practices.
- Implement AWS Lambda best practices.
- AWS Systems Manager
Implement services that manage resources, such as Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Amazon Elastic Container Service (Amazon ECS), to reduce your security maintenance tasks as part of the shared responsibility model.
For example, Amazon RDS helps you set up, operate, and scale a relational database, and automates administration tasks such as hardware provisioning, database setup, patching, and backups.
- Explore available services: Explore, test, and implement services that manage resources, such as Amazon RDS, AWS Lambda, and Amazon ECS.
- AWS Systems Manager
Automate your protective compute mechanisms including vulnerability management, reduction in attack surface, and management of resources.
The automation will help you invest time in securing other aspects of your workload, and reduce the risk of human error.
- Automate configuration management.
- Automate patching of Amazon Elastic Compute Cloud (Amazon EC2) instances.
- Implement intrusion detection and prevention.
- Consider AWS Partner solutions.
- AWS CloudFormation
- AWS Systems Manager Automation
- AWS Systems Manager Patch Manager
Removing the ability for interactive access reduces the risk of human error, and the potential for manual configuration or management.
For example, use a change management workflow to deploy Amazon Elastic Compute Cloud (Amazon EC2) instances using infrastructure-as-code, then manage Amazon EC2 instances using tools such as AWS Systems Manager instead of allowing direct access or through a bastion host.
AWS Systems Manager can automate a variety of maintenance and deployment tasks, using features including automation workflows, documents (playbooks), and the run command.
- Replace console access: Replace console access (SSH or RDP) to instances with AWS Systems Manager Run Command to automate management tasks; a short example follows below.
- AWS Systems Manager Run Command
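A minimal sketch, using boto3, of running an ad-hoc maintenance command through Systems Manager Run Command instead of SSH or a bastion host. The instance ID, command, and comment are placeholders; the instance must be running the SSM agent with an instance profile that allows Systems Manager.

```python
import boto3

ssm = boto3.client("ssm")

# Run an ad-hoc maintenance command without interactive access to the instance.
result = ssm.send_command(
    InstanceIds=["i-0abcd1234efgh5678"],                       # placeholder instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["sudo systemctl restart nginx"]},
    Comment="Restart web server via change ticket CHG-1234",   # placeholder comment
)
print(result["Command"]["CommandId"])
```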
Implement mechanisms (for example, code signing) to validate that the software, code and libraries used in the workload are from trusted sources and have not been tampered with.
For example, you should verify the code signing certificate of binaries and scripts to confirm the author, and ensure it has not been tampered with since created by the author
- Investigate mechanisms: Code signing is one mechanism that can be used to validate software integrity.
- AWS Signer
Data protection
Data classification provides a way to categorize data based on levels of sensitivity, and encryption protects data by way of rendering it unintelligible to unauthorized access.
These methods are important because they support objectives such as preventing mishandling or complying with regulatory obligations.
In AWS, there are a number of different approaches you can use when addressing data protection. The following best practices describe how to use these approaches.
Data classification
Data classification provides a way to categorize organizational data based on criticality and sensitivity in order to help you determine appropriate protection and retention controls.
Best Practices
You need to understand the type and classification of data your workload is processing, the associated business processes, data owner, applicable legal and compliance requirements, where it’s stored, and the resulting controls that are needed to be enforced.
This may include classifications to indicate if the data is intended to be publicly available, if the data is internal use only such as customer personally identifiable information (PII), or if the data is for more restricted access such as intellectual property, legally privileged or marked sensitive, and more.
By carefully managing an appropriate data classification system, along with each workload’s level of protection requirements, you can map the controls and level of access or protection appropriate for the data.
- Consider discovering data using Amazon Macie: Macie recognizes sensitive data such as personally identifiable information (PII) or intellectual property.
- AWS Macie
Protect data according to its classification level. For example, secure data classified as public by using relevant recommendations while protecting sensitive data with additional controls.
By using resource tags, separate AWS accounts per sensitivity (and potentially also for each caveat, enclave, or community of interest), IAM policies, AWS Organizations SCPs, AWS Key Management Service (AWS KMS), and AWS CloudHSM, you can define and implement your policies for data classification and protection with encryption.
- Define your data identification and classification schema.
- Discover available AWS controls.
- Identify AWS compliance resources.
- AWS Macie
- AWS Compliance website
Automating the identification and classification of data can help you implement the correct controls.
Using automation for this instead of direct access from a person reduces the risk of human error and exposure.
You should evaluate using a tool, such as Amazon Macie, that uses machine learning to automatically discover, classify, and protect sensitive data in AWS.
- Use Amazon Simple Storage Service (Amazon S3) Inventory: Amazon S3 inventory is one of the tools you can use to audit and report on the replication and encryption status of your objects.
- Consider Amazon Macie: Amazon Macie uses machine learning to automatically discover and classify data stored in Amazon S3.
- AWS Macie
Your defined lifecycle strategy should be based on sensitivity level as well as legal and organization requirements.
Aspects including the duration for which you retain data, data destruction processes, data access management, data transformation, and data sharing should be considered.
When choosing a data classification methodology, balance usability versus access. You should also accommodate the multiple levels of access and nuances for implementing a secure, but still usable, approach for each level.
- Identify data types: Identify the types of data that you are storing or processing in your workload. That data could be text, images, binary databases, and so forth.
- AWS Macie
Protecting data at rest
Data at rest represents any data that you persist in non-volatile storage for any duration in your workload
This includes block storage, object storage, databases, archives, IoT devices, and any other storage medium on which data is persisted.
Protecting your data at rest reduces the risk of unauthorized access, when encryption and appropriate access controls are implemented.
Best Practices
By defining an encryption approach that includes the storage, rotation, and access control of keys, you can help provide protection for your content against unauthorized users and against unnecessary exposure to authorized users.
AWS Key Management Service (AWS KMS) helps you manage encryption keys and integrates with many AWS services. This service provides durable, secure, and redundant storage for your AWS KMS keys.
- Implement AWS KMS: AWS KMS makes it easy for you to create and manage keys and control the use of encryption across a wide range of AWS services and in your applications.
- Consider AWS Encryption SDK: Use the AWS Encryption SDK with AWS KMS integration when your application needs to encrypt data client-side.
- AWS KMS
- AWS S3 Encryption
You should ensure that the only way to store data is by using encryption.
AWS Key Management Service (AWS KMS) integrates seamlessly with many AWS services to make it easier for you to encrypt all your data at rest.
- Enforce encryption at rest for Amazon Simple Storage Service (Amazon S3): Implement Amazon S3 bucket default encryption (a short example follows the service list below).
- Use AWS Secrets Manager.
- Configure default encryption for new EBS volumes.
- Configure encrypted Amazon Machine Images (AMIs).
- Configure Amazon Relational Database Service (Amazon RDS) encryption.
- Configure encryption in additional AWS services. For the AWS services you use, determine the encryption capabilities.
- AWS KMS
- AWS Secrets Manager
- AWS Encryption SDK
- AWS RDS Encryption
- AWS EBS Encryption
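A minimal sketch, using boto3, of setting SSE-KMS as the default encryption for an S3 bucket. The bucket name and KMS key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Make SSE-KMS the default for every new object written to the bucket.
s3.put_bucket_encryption(
    Bucket="example-mobile-app-data",                          # placeholder bucket
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",  # placeholder key
            },
            "BucketKeyEnabled": True,
        }]
    },
)
```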
Use automated tools to validate and enforce data at rest controls continuously, for example, verify that there are only encrypted storage resources.
You can automate validation that all EBS volumes are encrypted by using AWS Config rules; a short example follows the service list below.
AWS Security Hub can also verify several different controls through automated checks against security standards. Additionally, your AWS Config Rules can automatically remediate noncompliant resources.
- Enforce encryption at rest: You should ensure that the only way to store data is by using encryption.
- AWS Config (Rules)
- AWS Encryption SDK
- AWS Security Hub
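A minimal sketch, using boto3, of deploying the AWS Config managed rule ENCRYPTED_VOLUMES, which flags unencrypted EBS volumes. It assumes a Config configuration recorder is already running in the account; the rule name is a placeholder.

```python
import boto3

config = boto3.client("config")

# Flag any EBS volume that is not encrypted (assumes the Config recorder is already on).
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "ebs-volumes-encrypted",             # placeholder rule name
        "Source": {"Owner": "AWS", "SourceIdentifier": "ENCRYPTED_VOLUMES"},
        "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Volume"]},
    }
)
```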
Enforce access control with least privileges and mechanisms, including backups, isolation, and versioning, to help protect your data at rest. Prevent operators from granting public access to your data.
Different controls including access (using least privilege), backups (see Reliability whitepaper), isolation, and versioning can all help protect your data at rest.
- Enforce access control with least privileges, including access to encryption keys.
- Separate data based on different classification levels.
- Review AWS KMS policies.
- Review Amazon S3 bucket and object permissions.
- Enable Amazon S3 versioning and object lock.
- Amazon S3 inventory is one of the tools you can use to audit and report on the replication and encryption status of your objects.
- Review Amazon EBS and AMI sharing permissions: Sharing permissions can allow images and volumes to be shared to AWS accounts external to your workload.
- AWS Organizations
- AWS KMS
- AWS Config Rules
Keep all users away from directly accessing sensitive data and systems under normal operational circumstances.
For example, use a change management workflow to manage Amazon Elastic Compute Cloud (Amazon EC2) instances using tools instead of allowing direct access or a bastion host. This can be achieved using AWS Systems Manager Automation.
- Implement mechanisms to keep people away from data: Mechanisms include using dashboards, such as Amazon QuickSight, to display data to users instead of directly querying.
- Automate configuration management: Perform actions at a distance, enforce and validate secure configurations automatically by using a configuration management service or tool.
- Avoid use of bastion hosts or directly accessing EC2 instances.
- AWS KMS
- AWS Systems Manager
- AWS QuickSight
- AWS CloudFormation
Protecting data in transit
Data in transit is any data that is sent from one system to another. This includes communication between resources within your workload as well as communication between other services and your end users
By providing the appropriate level of protection for your data in transit, you protect the confidentiality and integrity of your workload’s data.
Best Practices
Store encryption keys and certificates securely and rotate them at appropriate time intervals with strict access control.
The best way to accomplish this is to use a managed service, such as AWS Certificate Manager (ACM); a short example follows the service list below.
- Implement your defined secure key and certificate management solution.
- Use secure protocols that offer authentication and confidentiality, such as Transport Layer Security (TLS) or IPsec, to reduce the risk of data tampering or loss.
- AWS Certificate Manager (ACM)
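A minimal sketch, using boto3, of requesting a public TLS certificate with DNS validation through ACM. The domain names are placeholders; once the validation CNAME record is published (for example, in Route 53), ACM issues and renews the certificate automatically.

```python
import boto3

acm = boto3.client("acm")

# Request a public certificate with DNS validation (domains are placeholders).
response = acm.request_certificate(
    DomainName="api.example.com",
    SubjectAlternativeNames=["*.example.com"],
    ValidationMethod="DNS",
)
print(response["CertificateArn"])
```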
Enforce your defined encryption requirements based on appropriate standards and recommendations to help you meet your organizational, legal, and compliance requirements.
AWS services provide HTTPS endpoints using TLS for communication, thus providing encryption in transit when communicating with the AWS APIs.
- Enforce encryption in transit.
- Use a VPN / IPsec for external connectivity.
- For the AWS services you use, determine the encryption-in-transit capabilities.
- AWS CloudFront
- AWS LoadBalancer
Use tools such as Amazon GuardDuty to automatically detect suspicious activity or attempts to move data outside of defined boundaries.
- Use a tool or detection mechanism to automatically detect attempts to move data outside of defined boundaries, for example, to detect a database system that is copying data to an unrecognized host.
- AWS VPC Flow Logs
- AWS Macie
Verify the identity of communications by using protocols that support authentication, such as Transport Layer Security (TLS) or IPsec.
Using network protocols that support authentication allows trust to be established between the parties. This adds to the encryption used in the protocol to reduce the risk of communications being altered or intercepted.
- Implement secure protocols: Use secure protocols that offer authentication and confidentiality, such as TLS or IPsec, to reduce the risk of data tampering or loss.
- AWS VPN
Incident response
Even with mature preventive and detective controls, your organization should implement mechanisms to respond to and mitigate the potential impact of security incidents.
Putting in place the tools and access ahead of a security incident, then routinely practicing incident response through game days, helps ensure that you can recover while minimizing business disruption.
Prepare
During an incident, your incident response teams must have access to various tools and the workload resources involved in the incident
Make sure that your teams have appropriate pre-provisioned access to perform their duties before an event occurs. All tools, access, and plans should be documented and tested before an event occurs to make sure that they can provide a timely response.
Best Practices
Identify internal and external personnel, resources, and legal obligations that would help your organization respond to an incident.
When you define your approach to incident response in the cloud, in unison with other teams (such as your legal counsel, leadership, business stakeholders, AWS Support Services, and others), you must identify key personnel, stakeholders, and relevant contacts.
- Identify key personnel in your organization: Maintain a contact list of personnel within your organization that you would need to involve to respond to and recover from an incident.
- Identify external partners: Engage with external partners if necessary that can help you respond to and recover from an incident.
- AWS Security Incident Response Guide
Create plans to help you respond to, communicate during, and recover from an incident.
For example, you can start an incident response plan with the most likely scenarios for your workload and organization.
- Educate and train for incident response.
- Document the incident management plan.
- Categorize incidents.
- Standardize security controls.
- Use automation.
- Conduct root cause analysis and action lessons learned.
- AWS Security Incident Response Guide
It’s important for your incident responders to understand when and how the forensic investigation fits into your response plan.
Your organization should define what evidence is collected and what tools are used in the process. Identify and prepare forensic investigation capabilities that are suitable, including external specialists, tools, and automation.
- Identify forensic capabilities: Research your organization's forensic investigation capabilities, available tools, and external specialists.
- AWS Systems Manager
- AWS EventBridge
- AWS Lambda
Verify that incident responders have the correct access pre-provisioned in AWS to reduce the time needed for investigation through to recovery.
- AWS recommends reducing or eliminating reliance on long-lived credentials wherever possible, in favor of temporary credentials and just-in-time privilege escalation mechanisms.
- For most management tasks, as well as incident response tasks, we recommend you implement identity federation alongside temporary escalation for administrative access.
- We recommend the use of temporary privilege escalation in the majority of incident response scenarios.
- The correct way to do this is to use the AWS Security Token Service and session policies to scope access.
- AWS Systems Manager Incident Manager
- AWS IAM Access Analyzer
Ensure that security personnel have the right tools pre-deployed into AWS to reduce the time for investigation through to recovery.
To automate security engineering and operations functions, you can use a comprehensive set of APIs and tools from AWS. You can fully automate identity management, network security, data protection, and monitoring capabilities and deliver them using popular software development methods that you already have in place.
- Ensure that security personnel have the right tools pre-deployed in AWS so that an appropriate response can be made to an incident.
- Implement resource tagging.
- AWS Security Incident Response Guide
Simulate
Practice your incident management plans and procedures during a realistic scenario
The value derived from participating in a simulation activity increases an organization's effectiveness during stressful events.
Best Practices
Game days, also known as simulations or exercises, are internal events that provide a structured opportunity to practice your incident management plans and procedures during a realistic scenario
These events should exercise responders using the same tools and techniques that would be used in a real-world scenario - even mimicking real-world environments. Game days are fundamentally about being prepared and iteratively improving your response capabilities.
- Run game days: Run simulated incident response events (game days) for different threats that involve key staff and management.
- Capture lessons learned: Lessons learned from running game days should be part of a feedback loop to improve your processes.
- AWS Incident Response Guide
- AWS Elastic Disaster Recovery
Iterate
Automate containment and recovery of an incident to reduce response times and organizational impact.
Best Practices
Once you create and practice the processes and tools from your playbooks, you can deconstruct the logic into a code-based solution, which can be used as a tool by many responders to automate the response and remove variance or guess-work by your responders.
This can speed up the lifecycle of a response.
- Build automated containment capability.
- AWS Incident Response Guide
Reliability
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to.
This includes the ability to operate and test the workload through its total lifecycle.
There are four best practice areas for Reliability in the cloud
Foundations
Foundational requirements are those whose scope extends beyond a single workload or project.
Before architecting any system, foundational requirements that influence reliability should be in place.
For example, you must have sufficient network bandwidth to your data center.
Manage service quotas and constraints
For cloud-based workload architectures, there are service quotas (which are also referred to as service limits)
These quotas exist to prevent accidentally provisioning more resources than you need and to limit request rates on API operations so as to protect services from abuse.
There are also resource constraints, for example, the rate that you can push bits down a fiber-optic cable, or the amount of storage on a physical disk.
Best Practices
You are aware of your default quotas and quota increase requests for your workload architecture.
You additionally know which resource constraints, such as disk or network, are potentially impactful.
Service Quotas is an AWS service that helps you manage your quotas for over 100 AWS services from one location.
- Review AWS service quotas in the published documentation and Service Quotas.
- Determine all the services your workload requires by looking at the deployment code.
- Use AWS Config to find all AWS resources used in your AWS accounts.
- You can also use your AWS CloudFormation templates to determine the AWS resources used.
- Determine the service quotas that apply. Use the programmatically accessible information via Trusted Advisor and Service Quotas.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
If you are using multiple AWS accounts or AWS Regions, ensure that you request the appropriate quotas in all environments in which your production workloads run.
Service quotas are tracked per account. Unless otherwise noted, each quota is AWS Region-specific.
In addition to the production environments, also manage quotas in all applicable non-production environments, so that testing and development are not hindered.
- Select relevant accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements.
- Identify service quotas across all relevant accounts, Regions, and Availability Zones. The limits are scoped to account and Region.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Be aware of unchangeable service quotas and physical resources, and architect to prevent these from impacting reliability.
Examples include network bandwidth, AWS Lambda payload size, throttle burst rate for API Gateway, and concurrent user connections to an Amazon Redshift cluster.
- Be aware of fixed service quotas and constraints, and architect around these.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Evaluate your potential usage and increase your quotas appropriately, allowing for planned growth in usage.
For supported services, you can manage your quotas by configuring CloudWatch alarms to monitor usage and alert you to approaching quotas.
These alarms can be triggered from Service Quotas or from Trusted Advisor.
You can also use metric filters on CloudWatch Logs to search and extract patterns in logs to determine if usage is approaching quota thresholds.
- Monitor and manage your quotas: Evaluate your potential usage on AWS, increase your Regional service quotas appropriately, and allow for planned growth in usage.
- Capture current resource consumption (for example, buckets, instances). Use service API operations, such as the Amazon EC2 DescribeInstances API, to collect current resource consumption.
- Capture your current quotas: Use AWS Service Quotas, AWS Trusted Advisor, and the AWS documentation.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Implement tools to alert you when thresholds are being approached.
You can automate quota increase requests by using AWS Service Quotas APIs.
If you integrate your Configuration Management Database (CMDB) or ticketing system with Service Quotas, you can automate the tracking of quota increase requests and current quotas.
- Set up automated monitoring: Implement tools using SDKs to alert you when thresholds are being approached.
- Use Service Quotas and augment the service with an automated quota monitoring solution, such as AWS Limit Monitor or an offering from AWS Marketplace.
- Set up triggered responses based on quota thresholds, using Amazon SNS and AWS Service Quotas APIs.
- Test automation
- APN Partner
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
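Where a quota is adjustable, the increase request itself can be automated through the Service Quotas API. The sketch below is illustrative only: the quota code shown is the EC2 On-Demand Standard vCPU quota, and the target value is an assumption to replace with the figure your monitoring indicates.

```python
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

# Request an increase for an adjustable quota when monitoring shows usage approaching the limit.
# Quota codes can be discovered with list_service_quotas; the code and value below are assumptions.
sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",   # Running On-Demand Standard instances (vCPU limit)
    DesiredValue=512.0,
)
```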
When a resource fails, it might still be counted against quotas until it’s successfully terminated.
Ensure that your quotas cover the overlap of all failed resources with replacements before the failed resources are terminated. You should consider an Availability Zone failure when calculating this gap.
- Ensure that there is enough gap between your service quota and your maximum usage to accommodate for a failover.
- Determine your service quotas, accounting for your deployment patterns, availability requirements, and consumption growth.
- Request quota increases if necessary. Plan for necessary time for quota increase requests to be fulfilled.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Plan your network topology
Workloads often exist in multiple environments. These include multiple cloud environments (both publicly accessible and private) and possibly your existing data center infrastructure
Plans must include network considerations, such as intrasystem and intersystem connectivity, public IP address management, private IP address management, and domain name resolution.
When architecting systems using IP address-based networks, you must plan network topology and addressing in anticipation of possible failures, and to accommodate future growth and integration with other systems and their networks.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a private, isolated section of the AWS Cloud where you can launch AWS resources in a virtual network.
Best Practices
These endpoints and the routing to them must be highly available.
To achieve this, use highly available DNS, content delivery networks (CDNs), API Gateway, load balancing, or reverse proxies.
Amazon Route 53, AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, and Elastic Load Balancing (ELB) all provide highly available public endpoints.
You might also choose to evaluate AWS Marketplace software appliances for load balancing and proxying.
- Ensure that you have highly available connectivity for users of the workload: Amazon Route 53, AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, and Elastic Load Balancing (ELB) all provide highly available public-facing endpoints.
- Ensure that you have a highly available connection to your users.
- Ensure that you are using a highly available DNS to manage the domain names of your application endpoints.
- Ensure that you are using a highly available reverse proxy or load balancer in front of your application.
- APN Partners
- AWS Direct Connect
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS Private Link
- AWS Global Accelerator
- AWS Multi Data Centre
- AWS VPN
Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks.
Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them.
- Ensure that you have highly available connectivity between AWS and on-premises environment.
- Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks.
- Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them.
- APN Partners
- AWS Direct Connect
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS Private Link
- AWS Global Accelerator
- AWS Multi Data Centre
- AWS VPN
Amazon VPC IP address ranges must be large enough to accommodate workload requirements, including factoring in future expansion and allocation of IP addresses to subnets across Availability Zones.
This includes load balancers, EC2 instances, and container-based applications.
- Plan your network to accommodate for growth, regulatory compliance, and integration with others.
- Select relevant AWS accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements.
- Identify your needs for regional VPC deployments.
- Identify the size of the VPCs.
- Determine if you need segregated networking for regulatory requirements.
- APN Partners
- AWS Marketplace for Network Infrastructure
- AWS VPC
If more than two network address spaces (for example, VPCs and on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model, like that provided by AWS Transit Gateway.
- Prefer hub-and-spoke topologies over many-to-many mesh.
- If more than two network address spaces (VPCs, on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model like that provided by AWS Transit Gateway.
- AWS Transit Gateway
- AWS VPC
- APN Partners
The IP address ranges of each of your VPCs must not overlap when peered or connected via VPN.
You must similarly avoid IP address conflicts between a VPC and on-premises environments or with other cloud providers that you use.
You must also have a way to allocate private IP address ranges when needed.
- Monitor and manage your CIDR use. Evaluate your potential usage on AWS, add CIDR ranges to existing VPCs, and create VPCs to allow planned growth in usage.
- Capture current CIDR consumption (for example, VPCs, subnets)
- Capture your current subnet usage.
- APN Partners
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS VPC IPAM
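Before connecting VPCs and on-premises networks, planned CIDR ranges can be checked for overlaps with the Python standard library. A quick sketch; the ranges below are placeholders for your own allocations.

```python
from ipaddress import ip_network

# Hypothetical address plan; substitute the CIDR ranges you intend to allocate.
cidrs = {
    "vpc-prod": "10.0.0.0/16",
    "vpc-dev": "10.1.0.0/16",
    "on-premises": "192.168.0.0/16",
}

networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
names = list(networks)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if networks[a].overlaps(networks[b]):
            print(f"Overlapping ranges: {a} and {b}")
```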
Workload Architecture
A reliable workload starts with upfront design decisions for both software and infrastructure
Your architecture choices will impact your workload behavior across all six Well-Architected pillars.
For reliability, there are specific patterns you must follow
Design your workload service architecture
Build highly scalable and reliable workloads using a service-oriented architecture (SOA) or a microservices architecture
Service-oriented architecture (SOA) is the practice of making software components reusable via service interfaces.
Microservices architecture goes further to make components smaller and simpler.
Best Practices
Workload segmentation is important when determining the resilience requirements of your application. Monolithic architecture should be avoided whenever possible.
Instead, carefully consider which application components can be broken out into microservices. Depending on your application requirements, this may end up being a combination of a service-oriented architecture (SOA) with microservices where possible.
Workloads that can operate statelessly are better suited to being deployed as microservices.
- Choose your architecture type based on how you will segment your workload.
- Choose an SOA or microservices architecture (or in some rare cases, a monolithic architecture).
- AWS API Gateway
- AWS App Mesh
Service-oriented architecture (SOA) builds services with well-delineated functions defined by business needs.
Microservices use domain models and bounded context to limit this further so that each service does just one thing. Focusing on specific functionality enables you to differentiate the reliability requirements of different services, and target investments more specifically.
- Design your workload based on your business domains and their respective functionality. Focusing on specific functionality enables you to differentiate the reliability requirements of different services, and target investments more specifically.
- Decompose your services into the smallest possible components. With a microservices architecture you can separate your workload into components with the minimal functionality to enable organizational scaling and agility.
- AWS API Gateway
Service contracts are documented agreements between teams on service integration and include a machine-readable API definition, rate limits, and performance expectations.
A versioning strategy allows your clients to continue using the existing API and migrate their applications to the newer API when they are ready.
- Provide service contracts per API: Service contracts are documented agreements between teams on service integration and include a machine-readable API definition, rate limits, and performance expectations.
- AWS API Gateway
Design interactions in a distributed system to prevent failures
Distributed systems rely on communications networks to interconnect components, such as servers or services.
Your workload must operate reliably despite data loss or latency in these networks
Components of the distributed system must operate in a way that does not negatively impact other components or the workload.
These best practices prevent failures and improve mean time between failures (MTBF).
Best Practices
Hard real-time distributed systems require responses to be given synchronously and rapidly, while soft real-time systems have a more generous time window of minutes or more for response.
Offline systems handle responses through batch or asynchronous processing.
Hard real-time distributed systems have the most stringent reliability requirements.
- Identify which kind of distributed system is required. Challenges with distributed systems involve latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos.
- AWS EventBridge
- AWS SQS
- AWS Builders Library
Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled.
Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
- Implement loosely coupled dependencies. Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled.
- Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
- AWS EventBridge
- AWS SQS
- AWS Builders Library
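One common way to introduce loose coupling is to put a queue between producer and consumer, so a slow or failing consumer does not block the producer. A minimal sketch with Amazon SQS follows; the queue URL is a placeholder.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/orders"  # placeholder queue URL

# Producer: hand the work to the queue instead of calling the consumer directly.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"order_id": "1234"}))

# Consumer: poll, process, and delete the message only after successful processing.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    order = json.loads(message["Body"])   # replace with real processing
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```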
Systems can fail when there are large, rapid changes in load. For example, if your workload is doing a health check that monitors the health of thousands of servers, it should send the same size payload (a full snapshot of the current state) each time.
Whether no servers are failing, or all of them, the health check system is doing constant work with no large, rapid changes.
- Do constant work so that systems do not fail when there are large, rapid changes in load.
- AWS Builders Library
An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.
An idempotent service makes it easier for a client to implement retries without fear that a request will be erroneously processed multiple times.
- Make all responses idempotent.
- An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.
- AWS Builders Library
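A simple way to approximate idempotency is to record each request ID with a conditional write, so a retried request is detected rather than reprocessed. The sketch below assumes a hypothetical DynamoDB table keyed on request_id.

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb", region_name="us-east-1")
TABLE = "processed-requests"  # hypothetical table with partition key "request_id"

def handle_request(request_id: str) -> str:
    """Record the request ID with a conditional write so duplicates are detected, not reprocessed."""
    try:
        ddb.put_item(
            TableName=TABLE,
            Item={"request_id": {"S": request_id}},
            ConditionExpression="attribute_not_exists(request_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return "duplicate"   # already processed; return the original result instead of redoing work
        raise
    # ...perform the side effect exactly once here...
    return "processed"
```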
Design interactions in a distributed system to mitigate or withstand failures.
Distributed systems rely on communications networks to interconnect components (such as servers or services)
Your workload must operate reliably despite data loss or latency over these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload.
These best practices enable workloads to withstand stresses or failures, more quickly recover from them, and mitigate the impact of such impairments. The result is improved mean time to recovery (MTTR)
Best Practices
When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner.
For example, when a dependency call fails, failover to a predetermined static response.
- Implement graceful degradation to transform applicable hard dependencies into soft dependencies.
- When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. For example, when a dependency call fails, failover to a predetermined static response.
- AWS API Gateway (throttling)
- AWS Builders Library
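A minimal sketch of graceful degradation, assuming a hypothetical recommendations endpoint: when the dependency call fails or times out, the component returns a predetermined static response instead of failing the whole request.

```python
import urllib.request

FALLBACK_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]  # predetermined static response

def get_recommendations(user_id: str) -> list:
    """Call the recommendations dependency, degrading to a static list if it is unhealthy."""
    try:
        url = f"https://recs.example.com/users/{user_id}"   # hypothetical endpoint
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.read().decode().split(",")
    except OSError:            # covers URLError and timeouts
        return FALLBACK_RECOMMENDATIONS
```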
Throttling requests is a mitigation pattern to respond to an unexpected increase in demand.
Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled. The expectation on clients is that they will back off and abandon the request or try again at a slower rate.
- Throttle requests. This is a mitigation pattern to respond to an unexpected increase in demand.
- Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled.
- AWS API Gateway (throttling)
- AWS Builders Library
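Managed services such as API Gateway throttle for you, but the same idea can be applied inside a service. The sketch below is a simple token-bucket throttle: it permits a burst, then a steady rate, and callers that exceed it should be told to retry later (for example with an HTTP 429).

```python
import time

class TokenBucket:
    """Token-bucket throttle: allow a burst of requests, then a steady sustained rate."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reject the request and signal the client to back off

bucket = TokenBucket(rate_per_second=100, burst=200)
if not bucket.allow():
    print("throttled - ask the client to retry later")
```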
Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- Control and limit retry calls. Use exponential backoff to retry after progressively longer intervals.
- Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- AWS API Gateway (throttling)
- AWS Builders Library
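The AWS SDKs already retry with backoff, but calls to your own or third-party services need the same treatment. A sketch of capped exponential backoff with full jitter:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a callable with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # retries exhausted; surface the failure
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))        # full jitter spreads out retry storms
```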
If the workload is unable to respond successfully to a request, then fail fast.
This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources.
- Fail fast and limit queues. If the workload is unable to respond successfully to a request, then fail fast.
- This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources.
- Limit queues: In a queue-based system, when processing stops but messages keep arriving, the message debt can accumulate into a large backlog, driving up processing time.
- AWS Builders Library
Set timeouts appropriately, verify them systematically, and do not rely on default values as they are generally set too high.
This best practice applies to the client-side, or sender, of the request.
- Set both a connection timeout and a request timeout on any remote call, and generally on any call across processes.
- Many frameworks offer built-in timeout capabilities, but be careful as many have default values that are infinite or too high.
- AWS SDK
- AWS API Gateway
- AWS Builders Library
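With the AWS SDKs the client-side timeouts and retry budget can be set explicitly rather than left at defaults. A minimal sketch; the values are assumptions to tune against your own latency requirements.

```python
import boto3
from botocore.config import Config

config = Config(
    connect_timeout=2,                                  # seconds to establish the connection
    read_timeout=5,                                     # seconds to wait for the response
    retries={"max_attempts": 3, "mode": "standard"},    # bound the retry budget
)
dynamodb = boto3.client("dynamodb", config=config)
```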
Services should either not require state, or should offload state so that there is no dependence on data stored locally on disk or in memory between different client requests.
This enables servers to be replaced at will without causing an availability impact. Amazon ElastiCache or Amazon DynamoDB are good destinations for offloaded state.
- Make your applications stateless. Stateless applications enable horizontal scaling and are tolerant to the failure of an individual node.
- Remove state that could actually be stored in request parameters.
- After examining whether the state is required, move any state tracking to a resilient multi-zone cache or data store like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, or a third-party distributed data solution.
- AWS Builders Library
Emergency levers are rapid processes that can mitigate availability impact on your workload.
- Implement emergency levers. These are rapid processes that may mitigate availability impact on your workload.
- They can be operated in the absence of a root cause.
Change Management
Changes to your workload or its environment must be anticipated and accommodated to achieve reliable operation of the workload.
Changes include those imposed on your workload such as spikes in demand, as well as those from within such as feature deployments and security patches
Monitor workload resources
Logs and metrics are powerful tools to gain insight into the health of your workload
You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur
Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.
Best Practices
All components of your workload should be monitored, including the front-end, business logic, and storage tiers.
Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with AWS Health Dashboard.
- Enable logging where available.
- Review all default metrics and explore any data collection gaps.
- Evaluate all the metrics to decide which ones to alert on for each AWS service in your workload.
- Define alerts and the recovery process for your workload after the alert is triggered.
- Explore use of synthetic transactions to collect relevant data about the workload's state.
- AWS Health Dashboard
- AWS CloudWatch Metrics
- AWS X-Ray
- AWS DevOps Guru
Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps.
- Define and calculate metrics (Aggregation).
- Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps
- AWS CloudWatch
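A metric filter turns a log pattern into a metric that dashboards and alarms can use. A short sketch follows; the log group and metric names are placeholders.

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Count occurrences of a log pattern as a custom metric; names below are placeholders.
logs.put_metric_filter(
    logGroupName="/my-app/application",
    filterName="payment-errors",
    filterPattern='"ERROR" "payment"',
    metricTransformations=[{
        "metricName": "PaymentErrorCount",
        "metricNamespace": "MyApp",
        "metricValue": "1",
    }],
)
```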
Ensure that those who need to know receive notifications when significant events occur.
Alerts can be sent to Amazon Simple Notification Service (Amazon SNS) topics, and then pushed to any number of subscribers.
For example, Amazon SNS can forward alerts to an email alias so that technical staff can respond.
- Perform real-time processing and alarming.
- Ensure that those who need to know receive notifications when significant events occur.
- AWS CloudWatch
- AWS SNS
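Building on the metric defined in the sketch above, an alarm can publish to an SNS topic so the right people are notified; the topic ARN and thresholds are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="payment-errors-high",
    Namespace="MyApp",
    MetricName="PaymentErrorCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder topic ARN
)
```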
Use automation to take action when an event is detected, for example, to replace failed components.
- Perform real-time processing and alarming. Ensure that those who need to know receive notifications when significant events occur.
- Use AWS Systems Manager to perform automated actions. AWS Config continually monitors and records your AWS resource configurations.
- Create and execute a plan to automate responses.
- AWS Systems Manager
- AWS EventBridge
Collect log files and metrics histories and analyze these for broader trends and workload insights.
- Use Amazon CloudWatch Logs to send logs to Amazon S3, where you can use Amazon Athena to query the data.
- AWS Builders Library
- AWS CloudWatch
Frequently review how workload monitoring is implemented and update it based on significant events and changes.
Effective monitoring is driven by key business metrics.
Ensure these metrics are accommodated in your workload as business priorities change.
- Create multiple dashboards for the workload.
- You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified to be the most relevant to the projected health of the workload as usage varies.
- You should also have dashboards for various application tiers and dependencies that can be inspected.
- AWS CloudWatch Dashboards
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS Builders Library
Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems to understand how their applications and underlying services are performing.
- Monitor end-to-end tracing of requests through your system.
- AWS X-Ray is a service that collects data about requests that your application serves, and provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization.
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS Builders Library
Design your workload to adapt to changes in demand
A scalable workload provides elasticity to add or remove resources automatically so that they closely match the current demand at any given point in time.
Best Practices
When replacing impaired resources or scaling your workload, automate the process by using managed AWS services, such as Amazon S3 and AWS Auto Scaling.
You can also use third-party tools and AWS SDKs to automate scaling.
- Configure and use AWS Auto Scaling
- Use Elastic Load Balancing. Load balancers can distribute load by path or by network connectivity.
- Use a highly available DNS provider.
- Use the AWS global network to optimize the path from your users to your applications.
- Configure and use Amazon CloudFront or a trusted content delivery network (CDN).
- AWS Partner
- AWS Autoscaling
- AWS Marketplace
- AWS Elastic Load Balancer
- AWS Network Load Balancer
- AWS Application Load Balancer
- AWS CloudFront
- AWS Route 53
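A target tracking policy is often the simplest way to let an Auto Scaling group follow demand. The sketch below keeps average CPU near 50% for an existing group; the group name and target value are assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",          # placeholder: an existing Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,                      # keep average CPU near 50%
    },
)
```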
Scale resources reactively when necessary if availability is impacted, to restore workload availability.
You first must configure health checks and the criteria on these checks to indicate when availability is impacted by lack of resources.
Then either notify the appropriate personnel to manually scale the resource, or trigger automation to automatically scale it.
- Obtain resources upon detection of impairment to a workload. Scale resources reactively when necessary if availability is impacted, to restore workload availability.
- Use scaling plans, which are the core component of AWS Auto Scaling, to configure a set of instructions for scaling your resources.
- AWS Partner
- AWS Auto Scaling
- AWS Marketplace
Scale resources proactively to meet demand and avoid availability impact.
- Obtain resources upon detection that more resources are needed for a workload.
- Scale resources proactively to meet demand and avoid availability impact.
- AWS Auto Scaling
- AWS Marketplace
Adopt a load testing methodology to measure if scaling activity meets workload requirements.
It’s important to perform sustained load testing.
Load tests should discover the breaking point and test the performance of your workload.
- Perform load testing to identify which aspect of your workload indicates that you must add or remove capacity.
- Load testing should have representative traffic similar to what you receive in production. Increase the load while watching the metrics you have instrumented to determine which metric indicates when you must add or remove resources.
Implement change
Controlled changes are necessary to deploy new functionality and to ensure that the workloads and the operating environment are running known, properly patched software
If these changes are uncontrolled, then it makes it difficult to predict the effect of these changes, or to address issues that arise because of them.
Best Practices
Runbooks are the predefined procedures to achieve specific outcomes.
Use runbooks to perform standard activities, whether done manually or automatically.
Examples include deploying a workload, patching a workload, or making DNS modifications.
- Enable consistent and prompt responses to well understood events by documenting procedures in runbooks.
- Use the principle of infrastructure as code (CloudFormation) to define your infrastructure.
- AWS Partner
- AWS Marketplace
- AWS CloudFormation
- AWS CodeCommit
Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back.
These tests are run in a pre-production environment, which is staged prior to production in the pipeline.
- Integrate functional testing as part of your deployment.
- Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back.
- AWS CodePipeline
- AWS CodeBuild
Resiliency tests (using the principles of chaos engineering) are run as part of the automated deployment pipeline in a pre-production environment.
These tests are staged and run in the pipeline in a pre-production environment.
They should also be run in production as part of game days.
- Integrate resiliency testing as part of your deployment.
- Use Chaos Engineering, the discipline of experimenting on a workload to build confidence in the workload’s capability to withstand turbulent conditions in production.
- Resiliency tests inject faults or resource degradation to assess that your workload responds with its designed resilience.
Immutable infrastructure is a model that mandates that no updates, security patches, or configuration changes happen in-place on production workloads.
When a change is needed, the architecture is built onto new infrastructure and deployed into production.
- Deploy using immutable infrastructure. Immutable infrastructure is a model in which no updates, security patches, or configuration changes happen in-place on production systems.
- If any change is needed, a new version of the architecture is built and deployed into production.
- AWS CodeDeploy
- AWS CodePipeline
Deployments and patching are automated to eliminate negative impact.
- Automate your deployment pipeline.
- Deployment pipelines allow you to invoke automated testing and detection of anomalies, and either halt the pipeline at a certain step before production deployment, or automatically roll back a change.
- AWS CodeDeploy
- AWS CodePipeline
- AWS Systems Manager Patch Manager
- AWS Partner
- AWS Marketplace
- AWS SNS
- AWS SES
Failure Management
Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures
Regardless of your cloud provider, there is the potential for failures to impact your workload. Therefore, you must take steps to implement resiliency if you need your workload to be reliable.
Back up data
Back up data, applications, and configuration to meet requirements for recovery time objectives (RTO) and recovery point objectives (RPO).
Best Practices
All AWS data stores offer backup capabilities.
Services such as Amazon RDS and Amazon DynamoDB additionally support automated backup that enables point-in-time recovery (PITR), which allows you to restore to any point in time, typically up to five minutes before the current time.
Many AWS services offer the ability to copy backups to another AWS Region. AWS Backup is a tool that gives you the ability to centralize and automate data protection across AWS services.
- Identify all data sources for the workload.
- Classify data sources based on criticality.
- Use AWS or third-party services to create backups of the data.
- For data that is not backed up, establish a data reproduction mechanism.
- Establish a cadence for backing up data.
- AWS Backup
- AWS DataSync
- AWS Volume Gateway
- AWS EBS Snapshots
- AWS Cross Region Replication
Control and detect access to backups using authentication and authorization, such as AWS IAM.
Prevent and detect if data integrity of backups is compromised using encryption.
- Use encryption on each of your data stores. If your source data is encrypted, then the backup will also be encrypted.
- Implement least privilege permissions to access your backups. Follow best practices to limit the access to the backups, snapshots, and replicas in accordance with security best practices.
- AWS Encryption: EBS, S3, DynamoDB, RDS, EFS, ++
- AWS Backup Encryption
- AWS Marketplace
Configure backups to be taken automatically based on a periodic schedule informed by the Recovery Point Objective (RPO), or by changes in the dataset.
Critical datasets with low data loss requirements need to be backed up automatically on a frequent basis, whereas less critical data where some loss is acceptable can be backed up less frequently.
- Identify data sources that are currently being backed up manually.
- Determine the RPO for the workload.
- Use an automated backup solution or managed service.
- AWS Partner
- AWS Marketplace
- AWS Backup
- AWS Step Functions
- AWS EventBridge
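With AWS Backup, the schedule and retention implied by your RPO can be captured in a backup plan. A hedged sketch: the plan and vault names below are placeholders, and resources would still need to be assigned to the plan (for example by tag) with a backup selection.

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Daily backups retained for 35 days; derive the schedule and retention from your RPO.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-backups",              # placeholder name
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": "Default",
            "ScheduleExpression": "cron(0 3 * * ? *)",  # 03:00 UTC every day
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    }
)
```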
Validate that your backup process implementation meets your recovery time objectives (RTO) and recovery point objectives (RPO) by performing a recovery test.
- Testing backup and restore capability increases confidence in the ability to perform these actions during an outage.
- Periodically restore backups to a new location and run tests to verify the integrity of the data; perform common checks such as confirming that the restored data is complete and usable.
- AWS Partner
- AWS Marketplace
- AWS Backup
- AWS Step Functions
- AWS EventBridge
Use fault isolation to protect your workload
Fault isolated boundaries limit the effect of a failure within a workload to a limited number of components
Components outside of the boundary are unaffected by the failure. Using multiple fault isolated boundaries, you can limit the impact on your workload.
Best Practices
Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions.
These locations can be as diverse as required.
- Use multiple Availability Zones and AWS Regions. Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions.
- If your workload must be deployed to multiple Regions, choose a multi-Region strategy. Most reliability needs can be met within a single AWS Region using a multi-Availability Zone strategy.
- Evaluate AWS Outposts for your workload if it requires low latency to your on-premises data center or has local data processing requirements.
- Determine if AWS Local Zones helps you provide service to your users. If you have low-latency requirements, see if AWS Local Zones is located near your users.
- AWS Local Zones
- AWS Global Tables (DynamoDB)
- AWS Outposts
For high availability, always (when possible) deploy your workload components to multiple Availability Zones (AZs).
For workloads with extreme resilience requirements, carefully evaluate the options for a multi-Region architecture.
- Evaluate your workload and determine whether the resilience needs can be met by a multi-AZ approach (single AWS Region), or if they require a multi-Region approach.
- Implementing a multi-Region architecture to satisfy these requirements will introduce additional complexity, therefore carefully consider your use case and its requirements.
- Resilience requirements can almost always be met using a single AWS Region.
- AWS Local Zones
- AWS Global Tables (DynamoDB)
- AWS Outposts
If components of the workload can only run in a single Availability Zone or in an on-premises data center, you must implement the capability to do a complete rebuild of the workload within your defined recovery objectives.
- Implement self-healing. Deploy your instances or containers using automatic scaling when possible.
- If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
- AWS ECS Events
- AWS EC2 Auto Scaling
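For an instance that must run in a single location, EC2 automatic recovery can be driven by a CloudWatch alarm on the system status check. A sketch; the instance ID, Region, and evaluation settings below are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="recover-web-instance",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],  # built-in EC2 recover action
)
```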
Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests or clients so that the number of impaired requests is limited, and most can continue without error.
Bulkheads for data are often called partitions, while bulkheads for services are known as cells.
- Use bulkhead architectures.
- Evaluate cell-based architecture for your workload.
- AWS Builders Library
Design your workload to withstand component failures
Workloads with a requirement for high availability and low mean time to recovery (MTTR) must be architected for resiliency.
Best Practices
Continuously monitor the health of your workload so that you and your automated systems are aware of degradation or failure as soon as they occur.
- Determine the collection interval for your components based on your recovery goals.
- Configure detailed monitoring for components.
- Create custom metrics to measure business key performance indicators (KPIs).
- Monitor the user experience for failures using user canaries.
- Create custom metrics that track the user's experience.
- Set alarms to detect when any part of your workload is not working properly.
- Create dashboards to visualize your metrics.
- AWS CloudWatch Synthetics
- AWS CloudWatch Dashboards
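CloudWatch Synthetics provides managed canaries, but a user canary can also be as simple as timing a key request and publishing the result as custom metrics. A sketch; the endpoint and namespace below are placeholders.

```python
import time
import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Time one key user request and record success and latency as custom metrics.
start = time.monotonic()
try:
    with urllib.request.urlopen("https://app.example.com/health", timeout=5) as resp:  # placeholder URL
        succeeded = 1 if resp.status == 200 else 0
except Exception:
    succeeded = 0
latency_ms = (time.monotonic() - start) * 1000

cloudwatch.put_metric_data(
    Namespace="MyApp/Canary",
    MetricData=[
        {"MetricName": "CheckSucceeded", "Value": succeeded},
        {"MetricName": "CheckLatency", "Value": latency_ms, "Unit": "Milliseconds"},
    ],
)
```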
Ensure that if a resource failure occurs, healthy resources can continue to serve requests.
For location failures (such as Availability Zone or AWS Region) ensure that you have systems in place to fail over to healthy resources in unimpaired locations.
- Fail over to healthy resources. Ensure that if a resource failure occurs, healthy resources can continue to serve requests.
- For location failures (such as Availability Zone or AWS Region) ensure you have systems in place to fail over to healthy resources in unimpaired locations.
- APN Partner
- AWS MarketPlace
- AWS OpsWorks
- AWS R53
- AWS RDS Read Replicas
- AWS ECS task placement
- AWS Global Accelerator
Upon detection of a failure, use automated capabilities to perform actions to remediate.
- Use Auto Scaling groups to deploy tiers in a workload.
- Implement automatic recovery for EC2 instances that host applications which cannot be deployed in multiple locations and can tolerate a reboot upon failure.
- Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot use automatic scaling or automatic recovery, or when automatic recovery fails.
- APN Partner
- AWS MarketPlace
- AWS OpsWorks
- AWS CloudWatch
- AWS EventBridge
- AWS Systems Manager Automation
- AWS Step Functions
The control plane is used to configure resources, and the data plane delivers services.
Data planes typically have higher availability design goals than control planes and are usually less complex.
When implementing recovery or mitigation responses to potentially resiliency-impacting events, using control plane operations can lower the overall resiliency of your architecture.
- Rely on the data plane and not the control plane when using Amazon Route 53 for disaster recovery.
- Route 53 Application Recovery Controller helps you manage and coordinate failover using readiness checks and routing controls.
- These features continually monitor your application’s ability to recover from failures, and enable you to control your application recovery across multiple AWS Regions, Availability Zones, and on premises.
- APN Partner
- AWS MarketPlace
- AWS Builders Library
- AWS R53
Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails.
You should instead build workloads that are statically stable and operate in only one mode.
- Use static stability to prevent bimodal behavior.
- Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails.
- AWS Builders Library
Notifications are sent upon the detection of significant events, even if the issue caused by the event was automatically resolved.
- Alarm on business key performance indicators (KPIs) when they cross a low threshold.
- Having a low-threshold alarm on your business KPIs helps you know when your workload is unavailable or nonfunctional.
- AWS CloudWatch
- AWS EventBridge
- AWS SNS
Test reliability
After you have designed your workload to be resilient to the stresses of production, testing is the only way to ensure that it will operate as designed, and deliver the resiliency you expect.
Test to validate that your workload meets functional and non-functional requirements, because bugs or performance bottlenecks can impact the reliability of your workload. Test the resiliency of your workload to help you find latent bugs that only surface in production. Exercise these tests regularly.
Best Practices
Enable consistent and prompt responses to failure scenarios that are not well understood, by documenting the investigation process in playbooks.
Playbooks are the predefined steps performed to identify the factors contributing to a failure scenario.
The results from any process step are used to determine the next steps to take until the issue is identified or escalated.
- Use playbooks to identify issues. Playbooks are documented processes to investigate issues.
- Enable consistent and prompt responses to failure scenarios by documenting processes in playbooks.
- AWS Systems Manager Automation
- AWS Systems Manager Run Command
- AWS CloudWatch Alarms
- AWS EventBridge
Review customer-impacting events, and identify the contributing factors and preventative action items.
Use this information to develop mitigations to limit or prevent recurrence.
Develop procedures for prompt and effective responses.
- Establish a standard for your post-incident analysis.
- Good post-incident analysis provides opportunities to propose common solutions for problems with architecture patterns that are used in other places in your systems.
- Have a process to identify and document the contributing factors of an event so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses.
Use techniques such as unit tests and integration tests that validate required functionality.
- Test functional requirements. These include unit tests and integration tests that validate required functionality.
- AWS CodeBuild
- AWS CodePipeline
- AWS CloudWatch Synthetics
Use techniques such as load testing to validate that the workload meets scaling and performance requirements.
- Test scaling and performance requirements. Perform load testing to validate that the workload meets scaling and performance requirements.
Run chaos experiments regularly in environments that are in or as close to production as possible to understand how your system responds to adverse conditions.
- Chaos engineering provides your teams with capabilities to continually inject real world disruptions (simulations) in a controlled way at the service provider, infrastructure, workload, and component level, with minimal to no impact to your customers.
- AWS Fault Injection Simulator
- AWS Resilience Hub
- AWS Marketplace: Gremlin Chaos Engineering Platform
Use game days to regularly exercise your procedures for responding to events and failures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios.
Game days enforce measures to ensure that production events do not impact users.
- Schedule game days to regularly exercise your runbooks and playbooks.
- Game days should involve everyone who would be involved in a production event: business owner, development staff, operational staff, and incident response teams.
Plan for Disaster Recovery (DR)
Having backups and redundant workload components in place is the start of your DR strategy
RTO and RPO are your objectives for restoration of your workload
Set these based on business needs. Implement a strategy to meet these objectives, considering locations and function of workload resources and data
Both Availability and Disaster Recovery rely on the same best practices such as monitoring for failures, deploying to multiple locations, and automatic failover. However Availability focuses on components of the workload, while Disaster Recovery focuses on discrete copies of the entire workload. Disaster Recovery has different objectives from Availability, focusing on time to recovery after a disaster
Best Practices
The workload has a recovery time objective (RTO) and recovery point objective (RPO).
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service.
Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point.
- For the given workload, you must understand the impact of downtime and lost data on your business.
- The impact generally grows larger with greater downtime or data loss, but the shape of this growth can differ based on the workload type.
Define a disaster recovery (DR) strategy that meets your workload's recovery objectives.
Choose a strategy such as: backup and restore; standby (active/passive); or active/active.
- Determine a DR strategy that will satisfy recovery requirements for this workload.
Regularly test failover to your recovery site to ensure proper operation, and that RTO and RPO are met.
- Engineer your workloads for recovery. Regularly test your recovery paths. Recovery-oriented computing identifies the characteristics in systems that enhance recovery.
- These characteristics are: isolation and redundancy, system-wide ability to roll back changes, ability to monitor and determine health, ability to provide diagnostics, automated recovery, modular design, and ability to restart.
- AWS Fault Injection Simulator
Ensure that the infrastructure, data, and configuration are as needed at the DR site or Region.
For example, check that AMIs and service quotas are up to date.
- Ensure that your delivery pipelines deliver to both your primary and backup sites.
- Delivery pipelines for deploying applications into production must distribute to all the specified disaster recovery strategy locations, including dev and test environments.
- Use AWS Config rules to create systems that enforce your disaster recovery strategies and generate alerts when they detect drift.
- Use AWS CloudFormation to deploy your infrastructure. AWS CloudFormation can detect drift between what your CloudFormation templates specify and what is actually deployed.
- AWS Systems Manager Automation
- AWS Config Rules
- AWS CloudFormation
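Drift between a deployed DR stack and its template can be checked programmatically. The sketch below starts a drift detection run for a hypothetical stack and polls until it finishes.

```python
import time
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Start drift detection for the DR stack (name is a placeholder) and wait for the result.
detection_id = cloudformation.detect_stack_drift(StackName="dr-site-stack")["StackDriftDetectionId"]

while True:
    status = cloudformation.describe_stack_drift_detection_status(StackDriftDetectionId=detection_id)
    if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
        break
    time.sleep(5)

print(status.get("StackDriftStatus"))  # IN_SYNC means the deployed stack matches its template
```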
Use AWS or third-party tools to automate system recovery and route traffic to the DR site or Region.
- Use Elastic Disaster Recovery for automated Failover and Failback.
- Elastic Disaster Recovery continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region.
- In the case of a disaster, after choosing to recover using Elastic Disaster Recovery, Elastic Disaster Recovery automates the conversion of your replicated servers into fully provisioned workloads in your recovery Region on AWS.
- AWS Elastic Disaster Recovery
- AWS Systems Manager Automation
- AWS Marketplace
Performance Efficiency
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements, and how to maintain efficiency as demand changes and technologies evolve.
There are four focus areas for Performance Efficiency in the cloud
Selection
The optimal solution for a particular workload varies, and solutions often combine multiple approaches. Well-architected workloads use multiple solutions and enable different features to improve performance.
AWS resources are available in many types and configurations, which makes it easier to find an approach that closely matches your needs. Selection of the right services and resources is key to performance efficiency.
Performance architecture selection
Use a data-driven approach to select the patterns and implementation for your architecture and achieve a cost effective solution
AWS Solutions Architects, AWS Reference Architectures, and AWS Partners can help you select an architecture based on industry knowledge, but data obtained through benchmarking or load testing will be required to optimize your architecture.
Your architecture will likely combine a number of different architectural approaches (for example, event-driven, ETL, or pipeline). The implementation of your architecture will use the AWS services that are specific to the optimization of your architecture's performance. In the following sections we discuss the four main resource types to consider (compute, storage, database, and network).
Best Practices
Learn about and understand the wide range of services and resources available in the cloud.
Identify the relevant services and configuration options for your workload, and understand how to achieve optimal performance.
- Inventory your workload software and architecture for related services: Gather an inventory of your workload and decide which category of products to learn more about.
- Identify workload components that can be replaced with managed services to increase performance and reduce operational complexity.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Use internal experience and knowledge of the cloud, or external resources such as published use cases, relevant documentation, or whitepapers, to define a process to choose resources and services.
You should define a process that encourages experimentation and benchmarking with the services that could be used in your workload
- Select an architectural approach: Identify the kind of architecture that meets your performance requirements.
- Identify constraints, such as the media for delivery (desktop, web, mobile, IoT), legacy requirements, and integrations. Identify opportunities for reuse, including refactoring.
- Consult other teams, architecture diagrams, and resources such as AWS Solution Architects, AWS Reference Architectures, and AWS Partners to help you choose an architecture.
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Workloads often have cost requirements for operation. Use internal cost controls to select resource types and sizes based on predicted resource need.
Determine which workload components could be replaced with fully managed services, such as managed databases, in-memory caches, and ETL services. Reducing your operational workload allows you to focus resources on business outcomes.
- Optimize workload components to reduce cost: Right size workload components and enable elasticity to reduce cost and maximize component efficiency.
- Determine which workload components can be replaced with managed services when appropriate, such as managed databases, in-memory caches, and reverse proxies.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Compute Optimizer
- AWS Samples
- AWS SDK Examples
Maximize performance and efficiency by evaluating internal policies and existing reference architectures and using your analysis to select services and configurations for your workload.
- Deploy your workload using existing policies or reference architectures: Integrate the services into your cloud deployment, then use your performance tests to ensure that you can continue to meet your performance requirements.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Use cloud company resources, such as solutions architects, professional services, or an appropriate partner to guide your decisions. These resources can help review and improve your architecture for optimal performance.
Reach out to AWS for assistance when you need additional guidance or product information. AWS Solutions Architects and AWS Professional Services provide guidance for solution implementation. AWS Partners provide AWS expertise to help you unlock agility and innovation for your business.
- Reach out to AWS resources for assistance: AWS Solutions Architects and Professional Services provide guidance for solution implementation.
- APN Partners provide AWS expertise to help you unlock agility and innovation for your business.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Benchmark the performance of an existing workload to understand how it performs on the cloud. Use the data collected from benchmarks to drive architectural decisions.
Use benchmarking with synthetic tests and real-user monitoring to generate data about how your workload’s components perform. Benchmarking is generally quicker to set up than load testing and is used to evaluate the technology for a particular component. Benchmarking is often used at the start of a new project, when you lack a full solution to load test.
- Monitor performance during development: Implement processes that provide visibility into performance as your workload evolves.
- Integrate into your delivery pipeline: Automatically run load tests in your delivery pipeline.
- Test user journeys: Use synthetic or sanitized versions of production data (remove sensitive or identifying information) for load testing.
- Real-user monitoring: Use CloudWatch RUM to help you collect and view client-side data about your application performance.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
- AWS Samples
- AWS SDK Examples
Deploy your latest workload architecture on the cloud using different resource types and sizes. Monitor the deployment to capture performance metrics that identify bottlenecks or excess capacity.
Use this performance information to design or improve your architecture and resource selection.
- Validate your approach with load testing: Load test a proof-of-concept to find out if you meet your performance requirements.
- You can use AWS services to run production-scale environments to test your architecture.
- Monitor metrics: Amazon CloudWatch can collect metrics across the resources in your architecture.
- Test at scale: Load testing uses your actual workload so you can see how your solution performs in a production environment.
- You can use AWS services to run production-scale environments to test your architecture.
- AWS CloudFormation
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
Compute architecture selection
The optimal compute choice for a particular workload can vary based on application design, usage patterns, and configuration settings
Architectures may use different compute choices for various components and enable different features to improve performance.
Selecting the wrong compute choice for an architecture can lead to lower performance efficiency
Best Practices
Understand how your workload can benefit from the use of different compute options, such as instances, containers and functions.
- Understand the virtualization, containerization, and management solutions that can benefit your workload and meet your performance requirements.
- A workload can contain multiple types of compute solutions.
- Each compute solution has differing characteristics. Based on your workload scale and compute requirements, a compute solution can be selected and configured to meet your needs.
- AWS EC2
- AWS ECS
- AWS EKS
- AWS Lambda
Each compute solution has options and configurations available to you to support your workload characteristics.
Learn how various options complement your workload, and which configuration options are best for your application.
- If your workload has been using the same compute option for more than four weeks and you anticipate that the characteristics will remain the same in the future, you can use AWS Compute Optimizer to provide a recommendation to you based on your compute characteristics.
- If AWS Compute Optimizer is not an option due to lack of metrics, an unsupported instance type, or a foreseeable change in your characteristics, then you must predict your metrics based on load testing and experimentation.
- AWS Compute Optimizer
To understand how your compute resources are performing, you must record and track the utilization of various systems.
This data can be used to make more accurate determinations about resource requirements.
- Identify, collect, aggregate, and correlate compute-related metrics.
- Using a service such as Amazon CloudWatch can make the implementation quicker and easier to maintain.
- In addition to the default metrics recorded, identify and track additional system-level metrics within your workload.
- Record data such as CPU utilization, memory, disk I/O, and network inbound and outbound metrics to gain insight into utilization levels or bottlenecks.
- AWS CloudWatch
- AWS Systems Manager automation
- Amazon Managed Service for Prometheus
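As one data source for right-sizing, historical utilization can be pulled from CloudWatch. The sketch below fetches a week of hourly average CPU for a single instance; the instance ID is a placeholder.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    StartTime=end - datetime.timedelta(days=7),
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```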
Analyze the various performance characteristics of your workload and how these characteristics relate to memory, network, and CPU usage.
Use this data to choose resources that best match your workload's profile.
- Modify your workload configuration by right sizing: To optimize both performance and overall efficiency, determine which resources your workload needs.
- Choose memory-optimized instances for systems that require more memory than CPU, or compute-optimized instances for components that do data processing that is not memory-intensive.
- AWS CloudWatch
- AWS Compute Optimizer
The cloud provides the flexibility to expand or reduce your resources dynamically through a variety of mechanisms to meet changes in demand.
Combined with compute-related metrics, a workload can automatically respond to changes and use the optimal set of resources to achieve its goal.
- Take advantage of elasticity: Elasticity matches the supply of resources you have against the demand for those resources.
- Instances, containers, and functions provide mechanisms for elasticity either in combination with automatic scaling or as a feature of the service.
- Use elasticity in your architecture to ensure that you have sufficient capacity to meet performance requirements at all scales of use.
- AWS EC2 Auto Scaling
- AWS EFS
Use system-level metrics to identify the behavior and requirements of your workload over time.
Evaluate your workload's needs by comparing the available resources with these requirements and make changes to your compute environment to best match your workload's profile
- Use a data-driven approach to optimize resources: To achieve maximum performance and efficiency, use the data gathered over time from your workload to tune and optimize your resources.
- Look at the trends in your workload's usage of current resources and determine where you can make changes to better match your workload's needs.
- AWS Compute Optimizer
Storage architecture selection
The optimal storage solution for a particular system varies based on the kind of access method (block, file, or object), patterns of access (random or sequential), throughput required, frequency of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and durability constraints
Well-architected systems use multiple storage solutions and enable different features to improve performance.
In AWS, storage is virtualized and is available in a number of different types. This makes it easier to match your storage methods with your needs, and offers storage options that are not easily achievable with on-premises infrastructure.
Best Practices
Identify and document the workload storage needs and define the storage characteristics of each location.
Examples of storage characteristics include: shareable access, file size, growth rate, throughput, IOPS, latency, access patterns, and persistence of data. Use these characteristics to evaluate if block, file, object, or instance storage services are the most efficient solution for your storage needs.
- Identify your workload’s most important storage performance metrics and implement improvements as part of a data-driven approach, using benchmarking or load testing.
- Use this data to identify where your storage solution is constrained, and examine configuration options to improve the solution.
- Determine the expected growth rate for your workload and choose a storage solution that will meet those rates.
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS
- AWS FSx for Lustre
- AWS FSx for Windows File Server
- AWS FSx for NetApp ONTAP
- AWS FSx for OpenZFS
- AWS S3 Glacier
- AWS S3
- AWS Snow Family
Evaluate the various characteristics and configuration options and how they relate to storage.
Understand where and how to use provisioned IOPS, SSDs, magnetic storage, object storage, archival storage, or ephemeral storage to optimize storage space and performance for your workload.
- Determine storage characteristics: When you evaluate a storage solution, determine which storage characteristics you require, such as ability to share, file size, cache size, latency, throughput, and persistence of data.
- Then match your requirements to the AWS service that best fits your needs.
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS: Amazon EFS Performance
- AWS FSx for Lustre Performance
- AWS FSx for Windows File Server Performance
- AWS FSx for NetApp ONTAP performance
- AWS FSx for OpenZFS performance
- AWS S3 Glacier: Amazon S3 Glacier Documentation
- AWS S3: Request Rate and Performance Considerations
- AWS Snow Family
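For example, when the characteristics above point to block storage with predictable IOPS and throughput needs, a gp3 EBS volume lets you provision both independently of capacity. A minimal sketch follows; the Availability Zone, sizes, and tag values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# gp3 decouples capacity from performance: IOPS and throughput are set explicitly.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # placeholder AZ
    Size=200,                        # GiB
    VolumeType="gp3",
    Iops=6000,                       # provisioned IOPS
    Throughput=500,                  # MiB/s
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "workload", "Value": "example-db"}],
    }],
)
print(volume["VolumeId"])
```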
Choose storage systems based on your workload's access patterns and configure them by determining how the workload accesses data.
Increase storage efficiency by choosing object storage over block storage. Configure the storage options you choose to match your data access patterns.
- Optimize your storage usage and access patterns: Choose storage systems based on your workload's access patterns and the characteristics of the available storage options.
- Determine the best place to store data that will enable you to meet your requirements while reducing overhead.
- Use performance optimizations and access patterns when configuring and interacting with data based on the characteristics of your storage (for example, striping volumes or partitioning data).
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS
- AWS FSx for Lustre
- AWS FSx for Windows File Server
- AWS FSx for NetApp ONTAP
- AWS FSx for OpenZFS
- AWS S3 Glacier
- AWS S3
- AWS Snow Family
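One way to act on access patterns is an S3 lifecycle rule that moves infrequently read objects to colder storage classes. The sketch below is a minimal example; the bucket name, prefix, and transition days are hypothetical values you would replace with your own.

```python
import boto3

s3 = boto3.client("s3")

# Move log objects to Standard-IA after 30 days and to Glacier after 90 days,
# matching an access pattern that is hot for a month and archival afterwards.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-workload-logs",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```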
Database architecture selection
The optimal database solution for a system varies based on requirements for availability, consistency, partition tolerance, latency, durability, scalability, and query capability
Many systems use different database solutions for various sub-systems and enable different features to improve performance.
Selecting the wrong database solution and features for a system can lead to lower performance efficiency.
Best Practices
Choose your data management solutions to optimally match the characteristics, access patterns, and requirements of your workload datasets.
When selecting and implementing a data management solution, you must ensure that the querying, scaling, and storage characteristics support the workload data requirements.
Learn how various database options match your data models, and which configuration options are best for your use-case
- Define the data characteristics and access patterns of your workload.
- Review all available database solutions to identify which solution supports your data requirements.
- Within a given workload, multiple databases may be selected. Evaluate each service or group of services and assess them individually.
- How is the data structured? (for example, unstructured, key-value, semi-structured, relational)
- Is ACID (atomicity, consistency, isolation, durability) compliance required?
- What consistency model is required?
- What query and result formats must be supported? (for example, SQL, CSV, Parquet, Avro, JSON, etc.)
- What is the proportion of read queries in relation to write queries? Would caching be likely to improve performance?
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Understand the available database options and how they can optimize your performance before you select your data management solution. Use load testing to identify database metrics that matter for your workload.
- Understand your workload data characteristics so that you can configure your database options.
- Run load tests to identify your key performance metrics and bottlenecks. Use these characteristics and metrics to evaluate database options and experiment with different configurations.
- What configuration options are available for the selected databases?
- Is the workload read or write heavy?
- What solutions are available for scaling writes (partition key sharding, introducing a queue, etc.)?
- What are the current or expected peak transactions per second (TPS)? Test using this volume of traffic and this volume +X% to understand the scaling characteristics.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
To understand how your data management systems are performing, it is important to track relevant metrics. These metrics will help you to optimize your data management resources, to ensure that your workload requirements are met, and that you have a clear overview on how the workload performs.
Use tools, libraries, and systems that record performance measurements related to database performance.
- Identify, collect, aggregate, and correlate database-related metrics. Metrics should include both the underlying system that is supporting the database and the database metrics.
- The underlying system metrics might include CPU utilization, memory, available disk storage, disk I/O, and network inbound and outbound metrics while the database metrics might include transactions per second, top queries, average queries rates, response times, index usage, table locks, query timeouts, and number of connections open.
- AWS CloudWatch
- AWS X-Ray
- AWS DevOps Guru
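As a starting point for the metric collection described above, the sketch below pulls average connection counts for one RDS instance from CloudWatch over the past day. The instance identifier is a placeholder, and the same pattern applies to metrics such as ReadLatency, WriteLatency, or FreeableMemory.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average open connections for one RDS instance over the last 24 hours.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```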
Use the access patterns of the workload to decide which services and technologies to use. In addition to non-functional requirements such as performance and scale, access patterns heavily influence the choice of the database and storage solutions.
The first dimension is the need for transactions, ACID compliance, and consistent reads. Not every database supports these, and most NoSQL databases provide an eventual consistency model. The second important dimension is the distribution of writes and reads over time and space.
- Identify and evaluate your data access pattern to select the correct storage configuration.
- Each database solution has options to configure and optimize your storage solution.
- Use the collected metrics and logs and experiment with options to find the optimal configuration. Review the storage options available for each database service using the resources below.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Use performance characteristics and access patterns that optimize how data is stored or queried to achieve the best possible performance.
Measure how optimizations such as indexing, key distribution, data warehouse design, or caching strategies impact system performance or overall efficiency.
- Optimize data storage based on metrics and patterns: Use reported metrics to identify any underperforming areas in your workload and optimize your database components.
- Each database system has different performance related characteristics to evaluate, such as how data is indexed, cached, or distributed among multiple systems.
- Measure the impact of your optimizations.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Network architecture selection
The optimal network solution for a workload varies based on latency, throughput requirements, jitter, and bandwidth.
Physical constraints, such as user or on-premises resources, determine location options. These constraints can be offset with edge locations or resource placement.
On AWS, networking is virtualized and is available in a number of different types and configurations. This makes it easier to match your networking methods with your needs.
AWS offers product features (for example, Enhanced Networking, Amazon EC2 networking optimized instances, Amazon S3 transfer acceleration, and dynamic Amazon CloudFront) to optimize network traffic.
AWS also offers networking features (for example, Amazon Route 53 latency routing, Amazon VPC endpoints, AWS Direct Connect, and AWS Global Accelerator) to reduce network distance or jitter.
Best Practices
Analyze and understand how network-related decisions impact workload performance. The network is responsible for the connectivity between application components, cloud services, edge networks, and on-premises data, and therefore it can have a significant impact on workload performance.
In addition to workload performance, user experience is also impacted by network latency, bandwidth, protocols, location, network congestion, jitter, throughput, and routing rules.
- Identify important network performance metrics of your workload and capture its networking characteristics.
- Define and document requirements as part of a data-driven approach, using benchmarking or load testing.
- Use this data to identify where your network solution is constrained, and examine configuration options that could improve the workload.
- Application Load Balancer
- EC2 Enhanced Networking on Linux
- EC2 Enhanced Networking on Windows
- EC2 Placement Groups
- Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- Network Load Balancer
- Networking Products with AWS
- AWS Transit Gateway
- AWS Route 53
- VPC Endpoints
- VPC Flow Logs
Evaluate networking features in the cloud that may increase performance. Measure the impact of these features through testing, metrics, and analysis.
For example, take advantage of network-level features that are available to reduce latency, packet loss, or jitter.
- Review which network-related configuration options are available to you, and how they could impact your workload.
- Understanding how these options interact with your architecture and the impact that they will have on both measured performance and the performance perceived by users is critical for performance optimization.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 instance network bandwidth
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
When a common network is required to connect on-premises and cloud resources in AWS, ensure that you have adequate bandwidth to meet your performance requirements.
Estimate the bandwidth and latency requirements for your hybrid workload. These numbers will drive the sizing requirements for AWS Direct Connect or your VPN endpoints.
- Develop a hybrid networking architecture based on your bandwidth requirements: Estimate the bandwidth and latency requirements of your hybrid applications.
- Based on your bandwidth requirements, a single VPN or Direct Connect connection might not be enough, and you must architect a hybrid setup to enable traffic load balancing across multiple connections.
- AWS Network Load Balancer
- AWS Networking Products with AWS
- AWS Transit Gateway
- AWS Transitioning to latency-based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
- AWS Site-to-Site VPN
- AWS Direct Connect
- AWS Client VPN
Distribute traffic across multiple resources or services to allow your workload to take advantage of the elasticity that the cloud provides.
You can also use load balancing for offloading encryption termination to improve performance and to manage and route traffic effectively
- Use the appropriate load balancer for your workload: Select the appropriate load balancer for your workload.
- If you must load balance HTTP requests, we recommend Application Load Balancer. For network and transport protocols (layer 4 – TCP, UDP) load balancing, and for extreme performance and low latency applications, we recommend Network Load Balancer.
- Application Load Balancers support HTTPS and Network Load Balancers support TLS encryption offloading.
- AWS Network Load Balancer
- AWS Networking Products with AWS
- AWS Transit Gateway
- AWS Transitioning to latency-based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
- AWS Site-to-Site VPN
- AWS Direct Connect
- AWS Client VPN
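A minimal sketch of the Network Load Balancer recommendation above, including TLS offload at the listener. The subnet, VPC, and certificate ARN values are placeholders, and a real setup would also register targets and configure health checks.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Network Load Balancer for low-latency layer 4 traffic.
nlb = elbv2.create_load_balancer(
    Name="example-nlb",
    Type="network",
    Scheme="internet-facing",
    Subnets=["subnet-0123456789abcdef0"],        # placeholder subnet
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

# Backend targets speak plain TCP; TLS is terminated at the load balancer.
tg = elbv2.create_target_group(
    Name="example-tcp-targets",
    Protocol="TCP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",               # placeholder VPC
    TargetType="instance",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TLS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/example"}],  # placeholder
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)
```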
Make decisions about protocols for communication between systems and networks based on the impact to the workload’s performance.
There is a relationship between latency and bandwidth to achieve throughput. If your file transfer is using TCP, higher latencies will reduce overall throughput. There are approaches to fix this with TCP tuning and optimized transfer protocols; some approaches use UDP.
- Optimize network traffic: Select the appropriate protocol to optimize the performance of your workload.
- There are approaches to fix latency with TCP tuning and optimized transfer protocols, some of which use UDP.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
Use the cloud location options available to reduce network latency or improve throughput.
Use AWS Regions, Availability Zones, placement groups, and edge locations such as AWS Outposts, AWS Local Zones, and AWS Wavelength, to reduce network latency or improve throughput.
- Reduce latency by selecting the correct locations: Identify where your users and data are located.
- Take advantage of AWS Regions, Availability Zones, placement groups, and edge locations to reduce latency.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
Use collected and analyzed data to make informed decisions about optimizing your network configuration. Measure the impact of those changes and use the impact measurements to make future decisions.
Enable VPC Flow Logs for all VPC networks that are used by your workload. VPC Flow Logs are a feature that allows you to capture information about the IP traffic going to and from network interfaces in your VPC.
- Enable VPC Flow Logs: VPC Flow Logs enable you to capture information about the IP traffic going to and from network interfaces in your VPC.
- Enable appropriate metrics for network options: Ensure that you select the appropriate network metrics for your workload. You can enable metrics for VPC NAT gateway, transit gateways, and VPN tunnels.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
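A minimal sketch of enabling VPC Flow Logs to CloudWatch Logs, as recommended above. The VPC ID, log group name, and IAM role ARN are placeholders; the role must allow delivery to CloudWatch Logs.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture accepted and rejected traffic for the whole VPC into CloudWatch Logs.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],      # placeholder VPC
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/example-flow-logs",      # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",  # placeholder role
)
```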
Review
When architecting workloads, there is a finite set of options that you can choose from. However, over time, new technologies and approaches become available that could improve the performance of your workload.
In the cloud, it’s much easier to experiment with new features and services because your infrastructure is code. To adopt a data-driven approach to architecture, you should implement a performance review process.
Evolve your workload to take advantage of new releases
Take advantage of the continual innovation at AWS driven by customer need. We release new Regions, edge locations, services, and features regularly.
Any of these releases could positively improve the performance efficiency of your architecture.
Best Practices
Evaluate ways to improve performance as new services, design patterns, and product offerings become available.
Determine which of these could improve performance or increase the efficiency of the workload through evaluation, internal discussion, or external analysis.
- Document your workload solutions.
- Use a tagging strategy to document owners for each workload component and category.
- Identify news and update sources related to your workload components.
- Document your process for evaluating updates and new services.
- AWS Config
- AWS Tagging
- AWS GitHub
- AWS Skill Builder
- AWS Blog
- What's New with AWS website
Define a process to evaluate new services, design patterns, resource types, and configurations as they become available.
For example, run existing performance tests on new instance offerings to determine their potential to improve your workload.
- Identify the key performance constraints for your workload: Document your workload’s performance constraints so that you know what kinds of innovation might improve the performance of your workload.
- AWS Blog
- What's New with AWS website
- AWS GitHub
- AWS Skill Builder
As an organization, use the information gathered through the evaluation process to actively drive adoption of new services or resources when they become available.
Use the information you gather when evaluating new services or technologies to drive change. As your business or workload changes, performance needs also change.
- Evolve your workload over time: Use the information you gather when evaluating new services or technologies to drive change.
- As your business or workload changes, performance needs also change.
- Use data gathered from your workload metrics to evaluate areas where you can achieve the biggest gains in efficiency or performance, and proactively adopt new services and technologies to keep up with demand.
- AWS Blog
- What's New with AWS website
- AWS GitHub
- AWS Skill Builder
Monitoring
After you implement your architecture you must monitor its performance so that you can remediate any issues before they impact your customers. Monitoring metrics should be used to raise alarms when thresholds are breached.
Monitor your resources to ensure that they are performing as expected
System performance can degrade over time. Monitor system performance to identify degradation and remediate internal or external factors, such as the operating system or application load.
Best Practices
Use a monitoring and observability service to record performance-related metrics.
Examples of metrics include database transactions, slow queries, I/O latency, HTTP request throughput, service latency, and other key data.
- Identify the relevant performance metrics for your workload and record them. This data helps identify which components are impacting overall performance or efficiency of your workload.
- Identify performance metrics: Use the customer experience to identify the most important metrics. For each metric, identify the target, measurement approach, and priority.
- Use these data points to build alarms and notifications to proactively address performance-related issues.
- AWS CloudWatch (monitoring/logging)
- AWS X-Ray
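Alongside the built-in metrics, workload code can publish the performance data it cares about as custom CloudWatch metrics. A minimal sketch follows; the namespace, metric name, and dimension are illustrative rather than prescribed by the guidance above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request_latency(milliseconds: float, endpoint: str) -> None:
    """Publish a single latency observation as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace="ExampleWorkload",                      # illustrative namespace
        MetricData=[{
            "MetricName": "RequestLatency",
            "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
            "Unit": "Milliseconds",
            "Value": milliseconds,
        }],
    )

# Example usage from request-handling code:
record_request_latency(182.5, "/checkout")
```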
In response to (or during) an event or incident, use monitoring dashboards or reports to understand and diagnose the impact. These views provide insight into which portions of the workload are not performing as expected.
- Prioritize experience concerns for critical user stories: When you write critical user stories for your architecture, include performance requirements, such as specifying how quickly each critical story should run.
- For these critical stories, implement additional scripted user journeys to ensure that you know how the user stories perform against your requirements.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
Identify the KPIs that quantitatively and qualitatively measure workload performance. KPIs help to measure the health of a workload as it relates to a business goal.
KPIs allow business and engineering teams to align on the measurement of goals and strategies and how this combines to produce business outcomes.
KPIs should be revisited when business goals, strategies, or end-user requirements change.
- All departments and business teams impacted by the health of the workload should contribute to defining KPIs.
- A single person should drive the collaboration, timelines, documentation, and information related to an organization’s KPIs.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS QuickSight
Using the performance-related key performance indicators (KPIs) that you defined, configure a monitoring system that generates alarms automatically when these measurements are outside expected boundaries.
Amazon CloudWatch can collect metrics across the resources in your architecture. You can also collect and publish custom metrics to surface business or derived metrics.
- Monitor metrics: Amazon CloudWatch can collect metrics across the resources in your architecture.
- You can collect and publish custom metrics to surface business or derived metrics.
- Use CloudWatch or a third-party monitoring service to set alarms that indicate when thresholds are exceeded.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
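Building on the KPI-driven alarming above, the sketch below raises an alarm and notifies an SNS topic when the custom latency metric stays above a target for five minutes. The topic ARN, threshold, and dimension values are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="ExampleWorkload",
    MetricName="RequestLatency",
    Dimensions=[{"Name": "Endpoint", "Value": "/checkout"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500.0,                     # milliseconds; placeholder target
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:perf-alerts"],  # placeholder topic
)
```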
As routine maintenance, or in response to events or incidents, review which metrics are collected.
Use these reviews to identify which metrics were essential in addressing issues and which additional metrics, if they were being tracked, would help to identify, address, or prevent issues.
- Constantly improve metric collection and monitoring: As part of responding to incidents or events, evaluate which metrics were helpful in addressing the issue and which metrics could have helped that are not currently being tracked.
- Use this method to improve the quality of metrics you collect so that you can prevent or more quickly resolve future incidents.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
Use key performance indicators (KPIs), combined with monitoring and alerting systems, to proactively address performance-related issues.
Use alarms to trigger automated actions to remediate issues where possible. Escalate the alarm to those able to respond if automated response is not possible
- Monitor performance during operations: Implement processes that provide visibility into performance as your workload is running.
- Build monitoring dashboards and establish a baseline for performance expectations.
- AWS CloudWatch
- AWS X-Ray
Trade-offs
When you architect solutions, think about trade-offs to ensure an optimal approach. Depending on your situation, you could trade consistency, durability, and space for time or latency, to deliver higher performance.
Using trade-offs to improve performance
When architecting solutions, actively considering trade-offs enables you to select an optimal approach.
Often you can improve performance by trading consistency, durability, and space for time and latency. Trade-offs can increase the complexity of your architecture and require load testing to ensure that a measurable benefit is obtained.
Best Practices
Understand and identify areas where increasing the performance of your workload will have a positive impact on efficiency or customer experience.
For example, a website that has a large amount of customer interaction can benefit from using edge services to move content delivery closer to customers.
- Set up end-to-end tracing to identify traffic patterns, latency, and critical performance areas.
- Monitor your data access patterns for slow queries or poorly fragmented and partitioned data.
- Identify the constrained areas of the workload using load testing or monitoring.
- AWS Builders’ Library
- AWS X-Ray
- AWS CloudWatch RUM
- AWS DevOps Guru
- AWS CloudWatch Synthetics
Research and understand the various design patterns and services that help improve workload performance. As part of the analysis, identify what you could trade to achieve higher performance.
For example, using a cache service can help to reduce the load placed on database systems. However, caching can introduce eventual consistency and requires engineering effort to implement within business requirements and customer expectations.
- Evaluate and review design patterns that would improve your workload performance.
- Improve your workload to model the selected design patterns and use services and the service configuration options to improve your workload performance.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- Amazon Builders’ Library
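The caching trade-off described above is usually implemented as a cache-aside pattern: read from the cache first, fall back to the database on a miss, and accept that cached reads can be slightly stale until the TTL expires. The sketch below assumes a Redis-compatible ElastiCache endpoint and the redis-py package; the endpoint, TTL, and query_database helper are placeholders for your own data access layer.

```python
import json
import redis  # assumes the redis-py package and an ElastiCache (Redis) endpoint

cache = redis.Redis(host="example.cache.amazonaws.com", port=6379)  # placeholder endpoint
TTL_SECONDS = 60  # staleness window accepted as the trade-off

def query_database(product_id: str) -> dict:
    """Stand-in for the real database read."""
    raise NotImplementedError

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # may be up to TTL_SECONDS stale
    item = query_database(product_id)      # cache miss: hit the database
    cache.setex(key, TTL_SECONDS, json.dumps(item))
    return item
```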
When evaluating performance-related improvements, determine which choices will impact your customers and workload efficiency.
For example, if using a key-value data store increases system performance, it is important to evaluate how the eventually consistent nature of it will impact customers.
- Identify tradeoffs: Use metrics and monitoring to identify areas of poor performance in your system.
- Determine how to make improvements, and how tradeoffs will impact the system and the user experience.
- AWS Builders’ Library
- AWS QuickSight KPIs
- AWS CloudWatch RUM
- AWS X-Ray Documentation
As changes are made to improve performance, evaluate the collected metrics and data. Use this information to determine impact that the performance improvement had on the workload, the workload’s components, and your customers.
- A well-architected system uses a combination of performance-related strategies.
- Determine which strategy will have the largest positive impact on a given hotspot or bottleneck.
- For example, sharding data across multiple relational database systems could improve overall throughput while retaining support for transactions and, within each shard, caching can help to reduce the load.
- AWS Builders’ Library
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
Where applicable, use multiple strategies to improve performance.
For example, use caching to prevent excessive network or database calls, read replicas to improve database read rates, sharding or compression to reduce data volumes, and buffering and streaming of results as they become available to avoid blocking.
- Use a data-driven approach to evolve your architecture: As you make changes to the workload, collect and evaluate metrics to determine the impact of those changes.
- Measure the impacts to the system and to the end-user to understand how your tradeoffs impact your workload.
- Use a systematic approach, such as load testing, to explore whether the tradeoff improves performance.
- AWS CloudWatch Synthetics
- AWS Builders’ Library
- AWS ElastiCache
- AWS Database Caching
- AWS CloudWatch RUM
Cost Optimization
Architect workloads with the most effective use of services and resources, to achieve business outcomes at the lowest price point.
There are five focus areas for Cost Optimization in the cloud
Practice Cloud Financial Management
Cloud Financial Management (CFM) enables organizations to realize business value and financial success as they optimize their cost and usage and scale on AWS.
Best Practices
Create a team (Cloud Business Office or Cloud Center of Excellence) that is responsible for establishing and maintaining cost awareness across your organization. The team requires people from finance, technology, and business roles across the organization.
- Establish a Cloud Business Office (CBO) or Cloud Center of Excellence (CCOE) team that is responsible for establishing and maintaining a culture of cost awareness in cloud computing.
- Define key members: You need to ensure that all relevant parts of your organization contribute and have a stake in cost management.
- Define goals and metrics: The function needs to deliver value to the organization in different ways. These goals are defined and continually evolve as the organization evolves.
- Establish regular cadence: The group (finance, technology, and business teams) should come together regularly to review their goals and metrics.
- AWS CCOE Blog
- Creating Cloud Business Office
- Create a Cloud Center of Excellence
Involve finance and technology teams in cost and usage discussions at all stages of your cloud journey. Teams regularly meet and discuss topics such as organizational goals and targets, current state of cost and usage, and financial and accounting practices.
- Establish a partnership between key finance and technology stakeholders to create a shared understanding of organizational goals and develop mechanisms to succeed financially in the variable spend model of cloud computing.
- Define key members: Verify that all relevant members of your finance and technology teams participate in the partnership. Relevant finance members will be those having interaction with the cloud bill. This will typically be CFOs, financial controllers, financial planners, business analysts, procurement, and sourcing.
- Define topics for discussion: Define the topics that are common across the teams, or will need a shared understanding.
- Establish regular cadence: To create a finance and technology partnership, establish a regular communication cadence to create and maintain alignment. The group needs to come together regularly against their goals and metrics.
- AWS News Blog website
Adjust existing organizational budgeting and forecasting processes to be compatible with the highly variable nature of cloud costs and usage. Processes must be dynamic using trend-based or business driver-based algorithms, or a combination of both.
- Update existing budget and forecasting processes: Implement trend-based, business driver-based, or a combination of both in your budgeting and forecasting processes.
- Configure alerts and notifications: Use AWS Budgets Alerts and Cost Anomaly Detection.
- Perform regular reviews with key stakeholders: For example, stakeholders in IT, Finance, Platform, and other areas of the business, to align with changes in business direction and usage.
- AWS Cost Explorer
- AWS Budgets
- AWS Pricing Calculator
- AWS Cost Anomaly Detection
- AWS License Manager
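A minimal sketch of the budget-plus-alert configuration described above: a monthly cost budget with an email notification at 80% of actual spend. The account ID, budget amount, and email address are placeholders.

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                       # placeholder account
    Budget={
        "BudgetName": "monthly-workload-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
    }],
)
```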
Implement cost awareness, create transparency, and accountability of costs into new or existing processes that impact usage, and leverage existing processes for cost awareness. Implement cost awareness into employee training.
- Identify relevant organizational processes: Each organizational unit reviews their processes and identifies processes that impact cost and usage.
- Establish self-sustaining cost-aware culture: Make sure all the relevant stakeholders align with cause-of-change and impact as a cost so that they understand cloud cost.
- Update processes with cost awareness: Each process is modified to be made cost aware. The process may require additional pre-checks, such as assessing the impact of cost, or post-checks validating that the expected changes in cost and usage occurred.
- AWS Cloud Financial Management website
Configure AWS Budgets and AWS Cost Anomaly Detection to provide notifications on cost and usage against targets. Have regular meetings to analyze your workload's cost efficiency and to promote a cost-aware culture.
- Configure AWS Budgets on all accounts for your workload. Set a budget for the overall account spend, and a budget for the workload by using tags.
- Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of the workload.
- AWS Cost Explorer
- AWS Trusted Advisor
- AWS Budgets
- AWS Budgets Best Practices
- Amazon CloudWatch
- AWS CloudTrail
- Amazon S3 Analytics
- AWS Cost and Usage Report
Implement tooling and dashboards to monitor cost proactively for the workload. Regularly review the costs with configured tools or out of the box tools, do not just look at costs and categories when you receive notifications. Monitoring and analyzing costs proactively helps to identify positive trends and allows you to promote them throughout your organization.
- Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of the workload.
- Create and enable daily granularity AWS Budgets for cost and usage so that you can take timely action to prevent potential cost overruns.
- Create an AWS Cost Anomaly Detection monitor for costs.
- Use AWS Cost Explorer, or integrate your AWS Cost and Usage Report (CUR) data with Amazon QuickSight dashboards, to visualize your organization’s costs.
- AWS Budgets
- AWS Cost Explorer
- AWS Cost Anomaly Detection
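For the proactive review described above, the Cost Explorer API can feed a dashboard or a scheduled report. The sketch below sums daily unblended cost by service for the last week; it is a minimal example and assumes the account has Cost Explorer enabled.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=7)

response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"])
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 0:
            print(f"  {service}: ${amount:.2f}")
```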
Consult regularly with experts or AWS Partners to consider which services and features provide lower cost. Review AWS blogs and other information sources.
- Subscribe to blogs
- Subscribe to AWS News
- Follow AWS Price Reductions
- Meet with your account team
- AWS Cost Management website
- What’s New with AWS website
- AWS News Blog
Implement changes or programs across your organization to create a cost-aware culture. It is recommended to start small, then as your capabilities increase and your organization’s use of the cloud increases, implement large and wide ranging programs.
- Report cloud costs to technology teams
- Inform stakeholders or team members about planned changes
- Meet with your account team
- Share success stories
- Training
- AWS Cost Management website
- AWS News Blog
Quantifying business value from cost optimization allows you to understand the entire set of benefits to your organization. Because cost optimization is a necessary investment, quantifying business value allows you to explain the return on investment to stakeholders. Quantifying business value can help you gain more buy-in from stakeholders on future cost optimization investments, and provides a framework to measure the outcomes for your organization’s cost optimization activities.
- Execute cost optimization best practices
- Implement automation, for example Auto Scaling
- AWS Cost Management website
- AWS Cost Explorer
- AWS News Blog
Expenditure and usage awareness
Understanding your organization’s costs and drivers is critical for managing your cost and usage effectively, and identifying cost-reduction opportunities
Organizations typically operate multiple workloads run by multiple teams. These teams can be in different organization units, each with its own revenue stream. The capability to attribute resource costs to the workloads, individual organization, or product owners drives efficient usage behavior and helps reduce waste. Accurate cost and usage monitoring allows you to understand how profitable organization units and products are, and allows you to make more informed decisions about where to allocate resources within your organization. Awareness of usage at all levels in the organization is key to driving change, as change in usage drives changes in cost.
Governance
To manage your costs in the cloud, you must manage your usage through the following governance best practices.
Best Practices
Develop policies that define how resources are managed by your organization. Policies should cover cost aspects of resources and workloads, including creation, modification and decommission over the resource lifetime.
- Meet with team members
- Define locations for your workload
- Define and group services and resources
- Define and group the users by function
- Define the actions
- Define the review period
- Document the policies
- AWS Managed Policies for Job Functions website
- AWS Compliance latest news website
- AWS Compliance programs website
Implement both cost and usage goals for your workload. Goals provide direction to your organization on cost and usage, and targets provide measurable outcomes for your workloads.
- Define expected usage levels: Focus on usage levels to begin with. Engage with the application owners, marketing, and greater business teams to understand what the expected usage levels will be for the workload.
- Define workload resourcing and costs: With the usage levels defined, quantify the changes in workload resources required to meet these usage levels.
- Define business goals: Taking the output from the expected changes in usage and cost, combine this with expected changes in technology, or any programs that you are running, and develop goals for the workload
- Define targets: For each of the defined goals specify a measurable target.
- AWS managed policies for job functions
- AWS multi-account strategy for your AWS Control Tower landing zone
- Control access to AWS Regions using IAM policies
Implement a structure of accounts that maps to your organization. This assists in allocating and managing costs throughout your organization.
- Define separation requirements: Requirements for separation are a combination of multiple factors, including security, reliability, and financial constructs
- Define grouping requirements: Requirements for grouping do not override the separation requirements, but are used to assist management
- Define account structure: Using the separations and groupings, specify an account for each group and ensure that separation requirements are maintained.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
- AWS Control Tower
- AWS Organizations
- Consolidated billing
Implement groups and roles that align to your policies and control who can create, modify, or decommission instances and resources in each group. For example, implement development, test, and production groups. This applies to AWS services and third-party solutions.
- Implement groups: Using the groups of users defined in your organizational policies, implement the corresponding groups, if necessary.
- Implement roles and policies: Using the actions defined in your organizational policies, create the required roles and access policies.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
Implement controls based on organization policies and defined groups and roles. These certify that costs are only incurred as defined by organization requirements: for example, control access to regions or resource types with AWS Identity and Access Management (IAM) policies.
- Implement notifications on spend: Using your defined organization policies, create AWS Budgets to provide notifications when spending is outside of your policies.
- Implement controls on usage: Using your defined organization policies, implement IAM policies and roles to specify which actions users can perform and which actions they cannot perform.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
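As one example of the usage controls above, the sketch below creates an IAM policy that denies actions requested outside approved Regions using the aws:RequestedRegion condition key. The Region list and policy name are illustrative, and a production policy would typically exempt global services.

```python
import json
import boto3

iam = boto3.client("iam")

# Deny any action requested outside the approved Regions.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
        },
    }],
}

iam.create_policy(
    PolicyName="restrict-to-approved-regions",      # illustrative name
    PolicyDocument=json.dumps(policy_document),
)
```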
Track, measure, and audit the lifecycle of projects, teams, and environments to avoid using and paying for unnecessary resources.
- Perform workload reviews: As defined by your organizational policies, audit your existing projects. The amount of effort spent in the audit should be proportional to the approximate risk, value, or cost to the organization.
- AWS Config
- AWS Systems Manager
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
Monitor Cost and Usage
Enable teams to take action on their cost and usage through detailed visibility into the workload
Cost optimization begins with a granular understanding of the breakdown in cost and usage, the ability to model and forecast future spend, usage, and features, and the implementation of sufficient mechanisms to align cost and usage to your organization’s objectives.
Best Practices
Configure the AWS Cost and Usage Report, and Cost Explorer hourly granularity, to provide detailed cost and usage information. Configure your workload to have log entries for every delivered business outcome. Tag resources.
- Configure the cost and usage report: Using the billing console, configure at least one cost and usage report.
- Configure hourly granularity in Cost Explorer: Using the billing console, enable Hourly and Resource Level Data.
- Configure application logging: Verify that your application logs each business outcome that it delivers so it can be tracked and measured.
- AWS Cost and Usage Report (CUR)
- AWS Glue / Athena
- AWS resource tagging
- AWS Cost Explorer
- AWS Budgets
Identify organization categories that could be used to allocate cost within your organization.
- Define your organization categories: Meet with stakeholders to define categories that reflect your organization's structure and requirements
- Define your functional categories: Meet with stakeholders to define categories that reflect the functions that you have within your business.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
Establish the organization metrics that are required for this workload. Example metrics of a workload are customer reports produced, or web pages served to customers.
- Define workload outcomes: Meet with the stakeholders in the business and define the outcomes for the workload.
- Define workload component outcomes: Optionally, if you have a large and complex workload, or can easily break your workload into components (such as microservices) with well-defined inputs and outputs, define metrics for each component.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Configure AWS Cost Explorer and AWS Budgets in line with your organization policies.
- Create a Cost Optimization group: Configure your account and create a group that has access to the required Cost and Usage reports.
- Configure AWS Budgets: Configure AWS Budgets on all accounts for your workload. Set a budget for the overall account spend, and a budget for the workload by using tags.
- Configure AWS Cost Explorer: Configure AWS Cost Explorer for your workload and accounts. Create a dashboard for the workload that tracks overall spend, and key usage metrics for the workload.
- Configure advanced tooling: Optionally, you can create custom tooling for your organization that provides additional detail and granularity.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Define a tagging schema based on organization, and workload attributes, and cost allocation categories. Implement tagging across all resources. Use Cost Categories to group costs and usage according to organization attributes.
- Define a tagging schema: Gather all stakeholders from across your business to define a schema. This typically includes people in technical, financial, and management roles.
- Tag resources: Using your defined cost attribution categories, place tags on all resources in your workloads according to the categories.
- Implement Cost Categories: You can create Cost Categories without implementing tagging. Cost Categories use the existing cost and usage dimensions.
- Automate tagging: To verify that you maintain high levels of tagging across all resources, automate tagging so that resources are automatically tagged when they are created.
- Monitor and report on tagging: To verify that you maintain high levels of tagging across your organization, report and monitor the tags across your workloads.
- AWS CloudFormation Resource Tag
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
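A minimal sketch of applying a cost-allocation tag and then reporting on which resources carry it, to help monitor tagging coverage. The tag keys, values, and instance ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
tagging = boto3.client("resourcegroupstaggingapi")

# Apply cost-allocation tags to a resource at (or after) creation time.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],              # placeholder instance
    Tags=[
        {"Key": "cost-center", "Value": "mobile-app"},
        {"Key": "environment", "Value": "production"},
    ],
)

# Report: list resources carrying the cost-center tag across supported services.
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(TagFilters=[{"Key": "cost-center"}]):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])
```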
Allocate the workload's costs by metrics or business outcomes to measure workload cost efficiency. Implement a process to analyze the AWS Cost and Usage Report with Amazon Athena, which can provide insight and charge back capability.
- Allocate costs to workload metrics: Using the defined metrics and tagging configured, create a metric that combines the workload output and workload cost.
- Use the analytics services such as Amazon Athena and Amazon QuickSight to create an efficiency dashboard for the overall workload, and any components.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Decommission resources
After you manage a list of projects, employees, and technology resources over time, you will be able to identify which resources are no longer being used and which projects no longer have an owner.
Best practices
Define and implement a method to track resources and their associations with systems over their lifetime. You can use tagging to identify the workload or function of the resource.
- Implement a tagging scheme: Implement a tagging scheme that identifies the workload the resource belongs to, verifying that all resources within the workload are tagged accordingly.
- Implement workload throughput or output monitoring: Implement workload throughput monitoring or alarming, triggering on either input requests or output completions.
- AWS Auto Scaling
- AWS Trusted Advisor
- Tagging AWS resources
Implement a process to identify and decommission orphaned resources.
- Create and implement a decommissioning process: Working with the workload developers and owners, build a decommissioning process for the workload and its resources.
- AWS Auto Scaling
- AWS Trusted Advisor
Decommission resources triggered by events such as periodic audits, or changes in usage. Decommissioning is typically performed periodically, and is manual or automated.
- Decommission resources: Using the decommissioning process, decommission each of the resources that have been identified as orphaned.
- AWS Auto Scaling
- AWS Trusted Advisor
Design your workload to gracefully handle resource termination as you identify and decommission non-critical resources, resources that are not required, or resources with low utilization.
- Implement AWS Auto Scaling: For resources that are supported, configure them with AWS Auto Scaling.
- Configure CloudWatch to terminate instances: Instances can be configured to terminate using CloudWatch alarms.
- Implement code within the workload: You can use the AWS SDK or AWS CLI to decommission workload resources.
- AWS Auto Scaling
- AWS Trusted Advisor
- Create Alarms to Stop, Terminate, Reboot, or Recover an Instance
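A minimal sketch of the CloudWatch-driven termination practice above: stop an instance automatically when its average CPU stays below 2% for a day. The instance ID, Region in the action ARN, and thresholds are placeholders, and the terminate action should be reserved for truly disposable resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="stop-idle-instance",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=24,                 # 24 consecutive hours below threshold
    Threshold=2.0,
    ComparisonOperator="LessThanThreshold",
    # Built-in EC2 action: stop the instance when the alarm fires.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:stop"],
)
```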
Cost-effective resources
Using the appropriate services, resources, and configurations for your workloads is key to cost savings
Consider the following when creating cost-effective resources: You can use AWS Solutions Architects, AWS Solutions, AWS Reference Architectures, and APN Partners to help you choose an architecture based on what you have learned.
Evaluate cost when selecting services
Evaluate service costs through the following best practices.
Best Practices
Work with team members to define the balance between cost optimization and other pillars, such as performance and reliability, for this workload.
- Identify organization requirements for cost: Meet with team members from your organization, including those in product management, application owners, development and operational teams, management, and financial roles.
- Prioritize the Well-Architected pillars for this workload and its components; the output is a list of the pillars in order.
- AWS Total Cost of Ownership (TCO) Calculator
Verify every workload component is analyzed, regardless of current size or current costs. The review effort should reflect the potential benefit, such as current and projected costs.
- List the workload components: Build the list of all the workload components. This is used as verification to check that each component was analyzed.
- Prioritize component list: Take the component list and prioritize it in order of effort. This is typically in order of the cost of the component from most expensive to least expensive, or the criticality as defined by your organization’s priorities.
- Perform the analysis: For each component on the list, review the options and services available and choose the option that aligns best with your organizational priorities.
- AWS Pricing Calculator
- AWS Cost Explorer
Look at overall cost to the organization of each component. Look at total cost of ownership by factoring in cost of operations and management, especially when using managed services. The review effort should reflect potential benefit, for example, time spent analyzing is proportional to component cost.
- Using the component list, work through each component from the highest priority to the lowest priority.
- For the higher priority and more costly components, perform additional analysis and assess all available options and their long term impact.
- AWS Total Cost of Ownership (TCO) Calculator
Open-source software eliminates software licensing costs, which can contribute significant costs to workloads. Where licensed software is required, avoid licenses bound to arbitrary attributes such as CPUs, look for licenses that are bound to output or outcomes. The cost of these licenses scales more closely to the benefit they provide.
- Analyze license options: Review the licensing terms of available software. Look for open-source versions that have the required functionality, and whether the benefits of licensed software outweigh the cost.
- Analyze the software provider: Review any historical pricing or licensing changes from the vendor. Look for any changes that do not align to outcomes, such as punitive terms for running on a specific vendor's hardware or platforms.
- AWS Total Cost of Ownership (TCO) Calculator
Factor in cost when selecting all components. This includes using application-level and managed services, such as Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon Simple Email Service (Amazon SES), to reduce overall organization cost. Use serverless and containers for compute, such as AWS Lambda, Amazon Simple Storage Service (Amazon S3) for static websites, and Amazon Elastic Container Service (Amazon ECS). Minimize license costs by using open-source software, or software that does not have license fees: for example, use Amazon Linux for compute workloads, or migrate databases to Amazon Aurora.
- Select each service to optimize cost: Using your prioritized list and analysis, select each option that provides the best match with your organizational priorities.
- AWS Total Cost of Ownership (TCO) Calculator
Workloads can change over time. Some services or features are more cost effective at different usage levels. By performing the analysis on each component over time and at projected usage, the workload remains cost-effective over its lifetime.
- Define predicted usage patterns: Working with your organization, such as marketing and product owners, document what the expected and predicted usage patterns will be for the workload.
- Perform cost analysis at predicted usage: Using the usage patterns defined, perform the analysis at each of these points.
- AWS Total Cost of Ownership (TCO) Calculator
Select the correct resource type, size, and number
By selecting the best resource type, size, and number of resources, you meet the technical requirements with the lowest cost resource
Right-sizing activities take into account all of the resources of a workload, all of the attributes of each individual resource, and the effort involved in the right-sizing operation. Right-sizing can be an iterative process, triggered by changes in usage patterns and external factors, such as AWS price drops or new AWS resource types. Right-sizing can also be a one-off activity if the cost of the right-sizing effort outweighs the potential savings over the life of the workload.
Best Practices
Identify organization requirements and perform cost modeling of the workload and each of its components. Perform benchmark activities for the workload under different predicted loads and compare the costs. The modeling effort should reflect the potential benefit. For example, time spent is proportional to component cost.
- Perform cost modeling: Deploy the workload or a proof-of-concept, into a separate account with the specific resource types and sizes to test. Run the workload with the test data and record the output results, along with the cost data for the time the test was run. Then redeploy the workload or change the resource types and sizes and re-run the test.
- AWS Auto Scaling
- AWS CloudWatch
Select resource size or type based on data about the workload and resource characteristics. For example, compute, memory, throughput, or write intensive. This selection is typically made using a previous (on- premises) version of the workload, using documentation, or using other sources of information about the workload.
- Select resources based on data: Using your cost modeling data, select the expected workload usage level, then select the specified resource type and size.
- AWS Auto Scaling
- AWS CloudWatch
Use metrics from the currently running workload to select the right size and type to optimize for cost. Appropriately provision throughput, sizing, and storage for services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon DynamoDB, Amazon Elastic Block Store (Amazon EBS) (PIOPS), Amazon Relational Database Service (Amazon RDS), Amazon EMR, and networking. This can be done with a feedback loop such as automatic scaling or by custom code in the workload.
- Configure workload metrics: Ensure you capture the key metrics for the workload.
- View rightsizing recommendations: Use the rightsizing recommendations in AWS Compute Optimizer to make adjustments to your workload.
- Select resource type and size automatically based on metrics: Using the workload metrics, manually or automatically select your workload resources.
- AWS Auto Scaling
- AWS CloudWatch
Select the best pricing model
Perform workload cost modeling: Consider the requirements of the workload components and understand the potential pricing models
Perform regular account level analysis: Performing regular cost modeling ensures that opportunities to optimize across multiple workloads can be implemented.
- On-Demand Instances
- Spot Instances
- Commitment discounts - Savings Plans
- Commitment discounts - Reserved Instances/Capacity
- Geographic selection
- Third-party agreements and pricing
Best practices
Analyze each component of the workload. Determine if the component and resources will be running for extended periods (for commitment discounts), or dynamic and short-running (for Spot or On-Demand Instances). Perform an analysis on the workload using the Recommendations feature in AWS Cost Explorer.
- Perform a commitment discount analysis: Using Cost Explorer in your account, review the Savings Plans and Reserved Instance recommendations.
- Analyze workload elasticity: Using the hourly granularity in Cost Explorer, or a custom dashboard, analyze the workload elasticity. Look for regular changes in the number of instances that are running. Short duration instances are candidates for Spot Instances or Spot Fleet.
Resource pricing can be different in each Region. Factoring in Region cost helps ensure that you pay the lowest overall price for this workload.
- Review Region pricing: Analyze the workload costs in the current Region. Starting with the highest costs by service and usage type, calculate the costs in other Regions that are available.
Cost efficient agreements and terms ensure the cost of these services scales with the benefits they provide. Select agreements and pricing that scale when they provide additional benefits to your organization.
- Analyze third-party agreements and terms: Review the pricing in third party agreements. Perform modeling for different levels of your usage, and factor in new costs such as new service usage, or increases in current services due to workload growth.
Permanently running resources should use reserved capacity such as Savings Plans or Reserved Instances. Short-term capacity is configured to use Spot Instances, or Spot Fleet. On-Demand Instances are only used for short-term workloads that cannot be interrupted and do not run long enough for reserved capacity, between 25% and 75% of the period, depending on the resource type.
- Implement pricing models: Using your analysis results, purchase Savings Plans (SPs), Reserved Instances (RIs) or implement Spot Instances.
- Workload review cycle: Implement a review cycle for the workload that specifically analyzes pricing model coverage.
Use Cost Explorer Savings Plans and Reserved Instance recommendations to perform regular analysis at the management account level for commitment discounts.
- Perform a commitment discount analysis: Using Cost Explorer at the management account level, review the Savings Plans and Reserved Instance recommendations.
Plan the data transfer
Efficient use of networking resources is required for cost optimization in the cloud.
Best practices
Gather organization requirements and perform data transfer modeling of the workload and each of its components. This identifies the lowest cost point for its current data transfer requirements.
- Calculate data transfer costs: Use the AWS pricing pages and calculate the data transfer costs for the workload.
- Link costs to outcomes: For each data transfer cost incurred, specify the outcome that it achieves for the workload.
- AWS caching solutions (doc)
- AWS Pricing (doc)
- Amazon EC2 Pricing (doc)
- Amazon VPC pricing (doc)
- Deliver content faster with Amazon CloudFront (doc)
All components are selected, and architecture is designed to reduce data transfer costs. This includes using components such as wide-area-network (WAN) optimization and Multi-Availability Zone (AZ) configurations
- Select components for data transfer: Using the data transfer modeling, focus on where the largest data transfer costs are or where they would be if the workload usage changes.
- AWS caching solutions (doc)
- Deliver content faster with Amazon CloudFront (doc)
Implement services to reduce data transfer. For example, using a content delivery network (CDN) such as Amazon CloudFront to deliver content to end users, caching layers using Amazon ElastiCache, or using AWS Direct Connect instead of VPN for connectivity to AWS.
- Implement services: Using the data transfer modeling, look at where the largest costs and highest volume flows are.
- AWS Direct Connect
- AWS CloudFront
Manage demand and supplying resources
When you move to the cloud, you pay only for what you need. You can supply resources to match the workload demand at the time they’re needed — eliminating the need for costly and wasteful overprovisioning
You can also modify the demand using a throttle, buffer, or queue to smooth the demand and serve it with less resources.
Best Practices
Analyze the demand of the workload over time. Verify that the analysis covers seasonal trends and accurately represents operating conditions over the full workload lifetime. Analysis effort should reflect the potential benefit, for example, time spent is proportional to the workload cost.
- Analyze existing workload data: Analyze data from the existing workload, previous versions of the workload, or predicted usage patterns. Use log files and monitoring data to gain insight on how customers use the workload.
- Forecast outside influence: Meet with team members from across the organization that can influence or change the demand in the workload.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS Cost Explorer
- AWS QuickSight
Buffering and throttling modify the demand on your workload, smoothing out any peaks. Implement throttling when your clients perform retries. Implement buffering to store the request and defer processing until a later time. Verify that your throttles and buffers are designed so clients receive a response in the required time.
- Analyze the client requirements: Analyze the client requests to determine if they are capable of performing retries. For clients that cannot perform retries, buffers will need to be implemented.
- Implement a buffer or throttle: Implement a buffer or throttle in the workload. A queue such as Amazon Simple Queue Service (Amazon SQS) can provide a buffer to your workload components.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS API Gateway
- AWS SQS
- AWS Kinesis
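To make the buffering pattern concrete, here is a minimal sketch of an SQS-backed buffer with boto3: producers enqueue requests instead of processing them synchronously, and a worker drains the queue at its own pace. The queue name and message contents are hypothetical placeholders.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="workload-request-buffer")["QueueUrl"]

# Producer side: buffer the request rather than processing it immediately.
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"order_id": 12345}))

# Worker side: long-poll the buffer and process at a controlled rate.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
).get("Messages", [])
for msg in messages:
    request = json.loads(msg["Body"])
    # ... process the request, then remove it from the buffer ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```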
Resources are provisioned in a planned manner. This can be demand-based, such as through automatic scaling, or time-based, where demand is predictable and resources are provided based on time. These methods result in the least amount of over or under-provisioning.
- Configure time-based scheduling: For predictable changes in demand, time-based scaling can provide the correct number of resources in a timely manner.
- Configure Auto Scaling: To configure scaling based on active workload metrics, use Amazon Auto Scaling.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS SQS
- AWS Kinesis
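The following minimal sketch shows both approaches on an existing EC2 Auto Scaling group with boto3: scheduled actions supply capacity ahead of predictable weekday demand, and a target tracking policy follows actual load. The group name, schedule, and target values are hypothetical placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Time-based: scale out for business hours, scale in overnight (times in UTC).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="mobile-api-asg",
    ScheduledActionName="business-hours-scale-out",
    Recurrence="0 8 * * 1-5",  # 08:00 UTC, Monday through Friday
    MinSize=4, MaxSize=12, DesiredCapacity=6,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="mobile-api-asg",
    ScheduledActionName="overnight-scale-in",
    Recurrence="0 20 * * *",  # 20:00 UTC, every day
    MinSize=1, MaxSize=4, DesiredCapacity=1,
)

# Demand-based: keep average CPU near 50% so capacity follows actual load.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="mobile-api-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```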
Optimize over time
In AWS, you optimize over time by reviewing new services and implementing them in your workload
As AWS releases new services and features, it is a best practice to review your existing architectural decisions to ensure that they remain cost effective. As your requirements change, be aggressive in decommissioning resources, components, and workloads that you no longer require. Consider the following best practices to help you optimize over time.
Best Practices
Develop a process that defines the criteria and process for workload review. The review effort should reflect potential benefit. For example, core workloads or workloads with a value of over 10% of the bill are reviewed quarterly, while workloads below 10% are reviewed annually.
- Define review frequency: Define how frequently the workload and its components should be reviewed.
- Define review thoroughness: Define how much effort is spent on the review of the workload or workload components.
- AWS News Blog
Existing workloads are regularly reviewed according to the defined processes.
- Regularly review the workload: Using your defined process, perform reviews with the frequency specified. Verify that you spend the correct amount of effort on each component.
- Implement new services: If the outcome of the analysis is to implement changes, first perform a baseline of the workload to know the current cost for each output.
- AWS News Blog
- What's New with AWS
Sustainability
Sustainability in the cloud is a continuous effort focused primarily on energy reduction and efficiency across all components of a workload by achieving the maximum benefit from the resources provisioned and minimizing the total resources required.
This effort ranges from the initial selection of an efficient programming language and the adoption of modern algorithms, to the use of efficient data storage techniques, deployment to correctly sized and efficient compute infrastructure, and minimizing requirements for high-powered end-user hardware.
There are six focus areas for Sustainability in the cloud
Region selection
Choose Regions where you will implement your workloads based on both your business requirements and sustainability goals.
Best Practices
User behavior patterns
The way users consume your workloads and other resources can help you identify improvements to meet sustainability goals. Scale infrastructure to continually match user load and ensure that only the minimum resources required to support users are deployed
Align service levels to customer needs.
Position resources to limit the network required for users to consume them.
Remove existing, unused assets.
Identify created assets that are unused and stop generating them. Provide your team members with devices that support their needs with minimized sustainability impact.
Best Practices
Identify periods of low or no utilization and scale down resources to eliminate excess capacity and improve efficiency.
- Use elasticity in your architecture to ensure that the workload can scale down quickly and easily during periods of low user load.
- Verify that the metrics for scaling up or down are validated against the type of workload being deployed. If you are deploying a video transcoding application, 100% CPU utilization is expected and should not be your primary metric. You can use a customized metric (such as memory utilization) for your scaling policy if required; a sketch of such a policy follows this list.
- AWS Auto Scaling
- AWS CloudWatch
- AWS X-Ray
- AWS VPC Flow logs
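The sketch below shows a scaling policy driven by a customized metric (memory utilization published by the CloudWatch agent under the CWAgent namespace) rather than CPU. The Auto Scaling group name, metric dimension, and target value are hypothetical, and the example assumes the CloudWatch agent is already publishing the metric.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="transcoding-asg",
    PolicyName="memory-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            # mem_used_percent is published by the CloudWatch agent (CWAgent).
            "Namespace": "CWAgent",
            "MetricName": "mem_used_percent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": "transcoding-asg"}
            ],
            "Statistic": "Average",
        },
        # Add or remove instances to keep average memory utilization near 60%.
        "TargetValue": 60.0,
    },
)
```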
Define and update Service Level Agreements (SLAs) such as availability or data retention periods to minimize the number of resources required to support your workload while continuing to meet business requirements.
- Define SLAs that support your sustainability goals while meeting your business requirements.
- Redefine SLAs to meet business requirements, not exceed them.
- Make trade-offs that significantly reduce sustainability impacts in exchange for acceptable decreases in service levels.
- Use design patterns that prioritize business-critical functions, and allow lower service levels (such as response time or recovery time objectives) for non-critical functions.
- AWS Service Level Agreements (SLAs) website
Analyze application assets (such as pre-compiled reports, datasets, and static images) and asset access patterns to identify redundancy, underutilization, and potential decommission targets.
Consolidate generated assets with redundant content (for example, monthly reports with overlapping or common datasets and outputs) to remove the resources consumed when duplicating outputs
Decommission unused assets (for example, images of products that are no longer sold) to free consumed resources and reduce the number of resources used to support the workload.
- Manage static assets and remove assets that are no longer required.
- Manage generated assets and stop generating and remove assets that are no longer required.
- Consolidate overlapping generated assets to remove redundant processing.
- Instruct third parties to stop producing and storing assets managed on your behalf that are no longer required.
Analyze network access patterns to identify where your customers are connecting from geographically. Select Regions and services that reduce the distance network traffic must travel to decrease the total network resources required to support your workload.
- Select the Regions for your workload deployment based on the following key elements:
- Your Sustainability goal
- Where your data is located
- Where your users are located
- Other constraints
- Use AWS Local Zones to run workloads like video rendering and graphics-intensive virtual desktop applications.
- Use local caching or AWS Caching Solutions for frequently used resources to improve performance, reduce data movement, and lower environmental impact.
- AWS CloudFront
- AWS ElastiCache
- AWS DynamoDB Accelerator
- AWS Lambda@Edge
Optimize resources provided to team members to minimize the sustainability impact while supporting their needs. For example, perform complex operations, such as rendering and compilation, on highly utilized shared cloud desktops instead of on underutilized high-powered single-user systems.
- Provision workstations and other devices to align with how they’re used.
- Use virtual desktops and application streaming to limit upgrade and device requirements.
- Move processor or memory-intensive tasks to the cloud.
- Evaluate the impact of processes and systems on your device lifecycle, and select solutions that minimize the requirement for device replacement while satisfying business requirements.
- Implement remote management for devices to reduce required business travel.
- AWS Workspaces
- AWS AppStream
- AWS Systems Manager Fleet Manager
Software and Architecture patterns
Implement patterns for performing load smoothing and maintaining consistent high utilization of deployed resources to minimize the resources consumed. Components might become idle from lack of use because of changes in user behavior over time
Revise patterns and architecture to consolidate under-utilized components to increase overall utilization.
Retire components that are no longer required.
Understand the performance of your workload components, and optimize the components that consume the most resources. Be aware of the devices your customers use to access your services, and implement patterns to minimize the need for device upgrades.
Best Practices
Use efficient software designs and architectures to minimize the average resources required per unit of work.
Implement mechanisms that result in even utilization of components to reduce resources that are idle between tasks and minimize the impact of load spikes.
- Queue requests that don’t require immediate processing.
- Increase serialization to flatten utilization across your pipeline.
- Modify the capacity of individual components to prevent idling resources waiting for input.
- Create buffers and establish rate limiting to smooth the consumption of external services.
- Use the most efficient available hardware for your software optimizations.
- Use queue-driven architectures, pipeline management, and On-Demand Instance workers to maximize utilization for batch processing.
- Schedule tasks to avoid load spikes and resource contention from simultaneous execution.
- Schedule jobs during times of day where carbon intensity for power is lowest.
- AWS SQS
- AWS Step Functions
- AWS Lambda
- AWS EventBridge
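As a small illustration of scheduling work away from load spikes, the sketch below creates an EventBridge rule that triggers a batch job during a fixed off-peak window. The rule name, cron expression, and Lambda target ARN are hypothetical placeholders, and the Lambda function would additionally need a resource-based permission allowing EventBridge to invoke it.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Run the batch job at 03:00 UTC every day, away from interactive peak load.
events.put_rule(
    Name="nightly-report-batch",
    ScheduleExpression="cron(0 3 * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="nightly-report-batch",
    Targets=[{
        "Id": "report-worker",
        # Hypothetical Lambda function that performs the batch processing.
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:report-worker",
    }],
)
```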
Monitor workload activity to identify changes in utilization of individual components over time. Remove components that are unused and no longer required, and refactor components with little utilization to limit wasted resources.
- Analyze load (using indicators such as transaction flow and API calls) on functional components to identify unused and underutilized components.
- Retire components that are no longer needed.
- Refactor underutilized components.
- AWS X-Ray
- AWS CloudWatch
Monitor workload activity to identify application components that consume the most resources. Optimize the code that runs within these components to minimize resource usage while maximizing performance.
- Monitor performance as a function of resource usage to identify components with high resource requirements per unit of work as targets for optimization.
- Use a code profiler to identify the areas of code that use the most time or resources as targets for optimization.
- Replace algorithms with more efficient versions that produce the same result.
- Use hardware acceleration to improve the efficiency of blocks of code with long execution times.
- Use the most efficient operating system and programming language for the workload.
- Remove unnecessary sorting and formatting.
- Use data transfer patterns that minimize the resources used based on how frequently the data changes and how it is consumed.
- AWS CloudWatch
- AWS CodeGuru
Understand the devices and equipment your customers use to consume your services, their expected lifecycle, and the financial and sustainability impact of replacing those components.
Implement software patterns and architectures to minimize the need for customers to replace devices and upgrade equipment.
- Inventory the devices your customers use.
- Test using managed device farms with representative sets of hardware to understand the impact of your changes, and iterate development to maximize the devices supported.
- Account for network bandwidth and latency when building payloads, and implement capabilities that help your applications work well on low-bandwidth, high-latency links.
- Pre-process data payloads to reduce local processing requirements and limit data transfer requirements.
- Perform computationally intense activities server-side (such as image rendering), or use application streaming to improve the user experience on older devices.
- Segment and paginate output, especially for interactive sessions, to manage payloads and limit local storage requirements.
- AWS Device Farm
- AWS AppStream
Understand how data is used within your workload, consumed by your users, transferred, and stored. Select technologies to minimize data processing and storage requirements.
- Analyze your data access and storage patterns.
- Store data files in efficient file formats such as Parquet to prevent unnecessary processing (for example, when running analytics) and to reduce the total storage provisioned.
- Use technologies that work natively with compressed data
- Use the database engine that best supports your dominant query pattern.
- Manage your database indexes to ensure index designs support efficient query execution.
- Select network protocols that reduce the amount of network capacity consumed.
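To illustrate the file-format point, the following sketch writes a small dataset as compressed Parquet and reads back a single column. It assumes pandas with the pyarrow engine is installed; the data and file name are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["login", "purchase", "logout"],
    "duration_ms": [120, 5400, 80],
})

# Columnar layout plus snappy compression: smaller files, and analytics engines
# read only the columns a query touches.
df.to_parquet("events.parquet", engine="pyarrow", compression="snappy")

# Reading back a single column avoids scanning the whole dataset.
durations = pd.read_parquet("events.parquet", columns=["duration_ms"])
print(durations)
```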
Data patterns
Implement data management practices to reduce the provisioned storage required to support your workload, and the resources required to use it. Understand your data, and use storage technologies and configurations that best support the business value of the data and how it’s used
Lifecycle data to more efficient, less performant storage when requirements decrease, and delete data that’s no longer required.
Best Practices
Classify data to understand its significance to business outcomes. Use this information to determine when you can move data to more energy-efficient storage or safely delete it.
- Determine requirements for the distribution, retention, and deletion of your data.
- Use tagging on volumes and objects to record the metadata that’s used to determine how it’s managed, including data classification.
- Periodically audit your environment for untagged and unclassified data, and classify and tag the data appropriately.
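A minimal tagging sketch with boto3 is shown below: it records a data classification and retention hint on an EBS volume and an S3 object so that audits and lifecycle automation can act on them. The volume ID, bucket, key, and tag values are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# Tag a volume with its classification and retention requirement.
ec2.create_tags(
    Resources=["vol-0123456789abcdef0"],
    Tags=[
        {"Key": "data-classification", "Value": "internal"},
        {"Key": "retention-days", "Value": "365"},
    ],
)

# Tag an object the same way; S3 object tags can also drive lifecycle rules.
s3.put_object_tagging(
    Bucket="example-reports-bucket",
    Key="reports/2024/05/monthly.parquet",
    Tagging={"TagSet": [{"Key": "data-classification", "Value": "internal"}]},
)
```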
Use storage that best supports how your data is accessed and stored to minimize the resources provisioned while supporting your workload.
For example, Solid State Devices (SSDs) are more energy intensive than magnetic drives and should be used only for active data use cases. Use energy-efficient, archival-class storage for infrequently accessed data.
- Monitor your data access patterns.
- Migrate data to the appropriate technology based on access pattern.
- Migrate archival data to storage designed for that purpose.
- AWS CloudWatch
Manage the lifecycle of all your data and automatically enforce deletion timelines to minimize the total storage requirements of your workload.
- Define lifecycle policies for all your data classification types.
- Set automated lifecycle policies to enforce lifecycle rules.
- Delete unused volumes and snapshots.
- Aggregate data where applicable based on lifecycle rules.
- AWS Config Rules
- AWS S3 Intelligent Tiering
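The sketch below applies an S3 lifecycle configuration that tiers objects to Intelligent-Tiering after 30 days, archives them to Glacier Deep Archive after 180 days, and deletes them after a year; it also cleans up incomplete multipart uploads. The bucket name, prefix, and timings are hypothetical and should reflect your own data classification.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-reports-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "generated-reports-lifecycle",
            "Filter": {"Prefix": "reports/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},
            # Incomplete multipart uploads also consume storage; abort them.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)
```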
To minimize total provisioned storage, create block storage with size allocations that are appropriate for the workload.
Use elastic volumes to expand storage as data grows without having to resize storage attached to compute resources. Regularly review elastic volumes and shrink over-provisioned volumes to fit the current data size.
- Monitor the utilization of your data volumes.
- Use elastic volumes and managed block data services to automate allocation of additional storage as your persistent data grows.
- Set target levels of utilization for your data volumes, and resize volumes outside of expected ranges.
- Size read-only volumes to fit the data.
- Migrate data to object stores to avoid provisioning the excess capacity from fixed volume sizes on block storage.
- AWS EBS Elastic Volumes
- AWS FSx
- AWS EFS
- AWS CloudWatch
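As a sketch of the elastic-volume approach, the snippet below grows an EBS volume when measured utilization crosses a threshold instead of provisioning a large volume up front. The volume ID and utilization value are placeholders (in practice the disk_used_percent metric would come from the CloudWatch agent), and note that EBS volumes can only be grown in place; shrinking requires migrating data to a smaller volume.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume_id = "vol-0123456789abcdef0"
disk_used_percent = 87.0  # stand-in for a CloudWatch agent measurement

if disk_used_percent > 80.0:
    current_size = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["Size"]
    # Grow by roughly 20%, rounded up to the next whole GiB.
    new_size = int(current_size * 1.2) + 1
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)
    # The file system on the instance still needs to be extended afterwards
    # (for example with growpart and resize2fs/xfs_growfs).
```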
Duplicate data only when necessary to minimize total storage consumed. Use backup technologies that deduplicate data at the file and block level.
Limit the use of Redundant Array of Independent Drives (RAID) configurations except where required to meet Service Level Agreements (SLAs).
- Use mechanisms that can deduplicate data at the block and object level.
- Use backup technology that can make incremental backups and deduplicate data at the block, file, and object level.
- Use RAID only when required to meet your SLAs
- Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune verbosity when needed.
- Pre-populate caches only where justified.
- Establish cache monitoring and automation to resize cache accordingly.
- Remove out-of-date deployments and assets from object stores and edge caches when pushing new versions of your workload.
- AWS EBS Snapshots
- AWS CloudWatch logs
- AWS FSx data dedup
- AWS CloudFront
Adopt shared storage and single sources of truth to avoid data duplication and reduce the total storage requirements of your workload.
Fetch data from shared storage only as needed. Detach unused volumes to make more resources available.
- Migrate data to shared storage when the data has multiple consumers.
- Fetch data from shared storage only as needed.
- Delete data as appropriate for your usage patterns, and implement time-to-live (TTL) functionality to manage cached data.
- Detach volumes from clients that are not actively using them
- AWS FSx
- AWS EFS
- AWS S3
Use shared storage and access data from regional data stores to minimize the total networking resources required to support data movement for your workload.
- Store data as close to the consumer as possible
- Partition regionally consumed services so that their Region-specific data is stored within the Region where it is consumed.
- Use block-level duplication instead of file or object-level duplication when copying changes across the network.
- Compress data before moving it over the network.
To minimize storage consumption, only back up data that has business value or is needed to satisfy compliance requirements.
Examine backup policies and exclude ephemeral storage that doesn’t provide value in a recovery scenario.
- Use your data classification to establish what data needs to be backed up.
- Exclude data that you can easily recreate.
- Exclude ephemeral data from your backups.
- Exclude local copies of data, unless the time required to restore that data from a common location exceeds your service level agreements (SLAs).
- AWS Backup
- AWS EBS Snapshots
- AWS RDS Backups
Hardware patterns
Look for opportunities to reduce workload sustainability impacts by making changes to your hardware management practices
Minimize the amount of hardware needed to provision and deploy, and select the most efficient hardware for your individual workload.
Best Practices
Using the capabilities of the cloud, you can make frequent changes to your workload implementations.
Update deployed components as your needs change.
- Enable horizontal scaling, and use automation to scale out as loads increase and to scale in as loads decrease
- Scale using small increments for variable workloads.
- Align scaling with cyclical utilization patterns (for example, a payroll system with intense bi-weekly processing activities) as load varies over days, weeks, months, or years.
- Negotiate service level agreements (SLAs) that allow for a temporary reduction in capacity while automation deploys replacement resources.
- AWS Compute Optimizer
- AWS Auto Scaling
Continually monitor the release of new instance types and take advantage of energy efficiency improvements, including those instance types designed to support specific workloads such as machine learning training, inference, and video transcoding.
- Learn and explore instance types which can lower your workload environmental impact.
- Plan and transition your workload to instance types with the least impact
- Operate and optimize your workload instance.
- AWS Compute Optimizer
- AWS CloudWatch
- AWS Graviton-based instances
- AWS Trainium
- AWS Inferentia
Managed services shift responsibility for maintaining high average utilization and for sustainability optimization of the deployed hardware to AWS.
Use managed services to distribute the sustainability impact of the service across all tenants of the service, reducing your individual contribution.
- Migrate from self-hosted services to managed services.
- Use managed Amazon Relational Database Service (Amazon RDS) instances instead of maintaining your own Amazon RDS instances on Amazon Elastic Compute Cloud (Amazon EC2).
- Use managed container services, such as AWS Fargate, instead of implementing your own container infrastructure.
- AWS Fargate
- AWS DocumentDB
- AWS Elastic Kubernetes Service (EKS)
- AWS Managed Streaming for Apache Kafka (Amazon MSK)
- AWS Redshift
- AWS RDS
Graphics Processing Units (GPUs) can be a source of high-power consumption, and many GPU workloads are highly variable, such as rendering, transcoding, and machine learning training and modeling.
Only run GPU instances for the time needed, and decommission them with automation when not required to minimize resources consumed.
- Use GPUs only for tasks where they’re more efficient than CPU-based alternatives
- Use automation to release GPU instances when not in use
- Use flexible graphics acceleration rather than dedicated GPU instances
- Take advantage of custom-purpose hardware that is specific to your workload
- AWS Inferentia
- AWS Trainium
- AWS Accelerated Computing for EC2 Instances
- AWS EC2 VT1 Instances
- AWS Elastic Graphics
Development and deployment process
Look for opportunities to reduce your sustainability impact by making changes to your development, test, and deployment practices.
Best Practices
Test and validate potential improvements before deploying them to production. Account for the cost of testing when calculating potential future benefit of an improvement.
Develop low-cost testing methods to enable delivery of small improvements.
- Add requirements for sustainability to your development process.
- Allow resources to work in parallel to develop, test, and deploy sustainability improvements
- Test and validate potential sustainability impact improvements before deploying into production.
- Test potential improvements using the minimum viable representative components.
- Deploy tested sustainability improvements to production as they become available.
Up-to-date operating systems, libraries, and applications can improve workload efficiency and enable easier adoption of more efficient technologies.
Up-to-date software might also include features to measure the sustainability impact of your workload more accurately, as vendors deliver features to meet their own sustainability goals.
- Take advantage of agility in the cloud to quickly test how new features can improve your workload to:
- Reduce sustainability impacts
- Gain performance efficiencies
- Remove barriers for a planned improvement
- Improve your ability to measure and manage sustainability impacts
- Inventory your workload software and architecture and identify components that need to be updated.
- You can use AWS Systems Manager Inventory to collect operating system (OS), application, and instance metadata from your Amazon EC2 instances. This helps you quickly understand which instances are running the software and configurations required by your software policy, and which instances need to be updated (see the sketch after this list).
- AWS Systems Manager Patch Manager
- What's New with AWS (website)
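A minimal sketch of querying that inventory with boto3 follows. It lists the applications recorded for one managed instance; the instance ID is a hypothetical placeholder, and the instance must be managed by Systems Manager with inventory collection enabled.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

entries = ssm.list_inventory_entries(
    InstanceId="i-0123456789abcdef0",
    TypeName="AWS:Application",
)
for app in entries["Entries"]:
    # Compare installed versions against your software policy.
    print(app.get("Name"), app.get("Version"))
```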
Use automation and infrastructure-as-code to bring pre-production environments up when needed and take them down when not used.
A common pattern is to schedule periods of availability that coincide with the working hours of your development team members.
Hibernation is a useful tool to preserve the state and rapidly bring instances online only when needed.
- Use automation to maximize utilization of your development and test environments.
- Use automation to manage the lifecycle of your development and test environments.
- Use minimum viable representative environments to develop and test potential improvements.
- Use On-Demand Instances to supplement your developer devices.
- Use automation to maximize the efficiency of your build resources.
- Use instance types with burst capacity, Spot Instances, and other technologies to align build capacity with use.
- AWS Systems Manager Session Manager
- AWS EC2 Burstable performance instances
- AWS CloudFormation
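The sketch below shows the stop/start half of that pattern with boto3: dev-tagged instances are hibernated at night and started again in the morning, with the two functions intended to be invoked on a schedule (for example by EventBridge-triggered Lambda functions or the AWS Instance Scheduler). The tag key and value are hypothetical, and hibernation only works for instances launched with hibernation enabled.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def _dev_instance_ids(state):
    # Find instances tagged environment=dev in the given state.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": [state]},
        ]
    )["Reservations"]
    return [i["InstanceId"] for r in reservations for i in r["Instances"]]

def scale_in_for_the_night():
    ids = _dev_instance_ids("running")
    if ids:
        # Hibernate=True preserves in-memory state for a fast morning start.
        ec2.stop_instances(InstanceIds=ids, Hibernate=True)

def scale_out_for_the_day():
    ids = _dev_instance_ids("stopped")
    if ids:
        ec2.start_instances(InstanceIds=ids)
```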
Managed device farms spread the sustainability impact of hardware manufacturing and resource usage across multiple tenants.
Managed device farms offer diverse device types so you can support older, less popular hardware, and avoid customer sustainability impact from unnecessary device upgrades.
- AWS Device Farm