AWS Well-Architected
Mobile App
Operational Excellence
The operational excellence pillar focuses on running and monitoring systems, and continually improving processes and procedures.
Key topics include automating changes, responding to events, and defining standards to manage daily operations.
There are four best practice areas for operational excellence in the cloud: Organization, Prepare, Operate, and Evolve.
Organization
You need to understand your organization's priorities, your organizational structure, and how your organization supports your team members, so that they can support your business outcomes.
Organization priorities
Your teams need to have a shared understanding of your entire workload, their role in it, and shared business goals to set the priorities that will enable business success
Well-defined priorities will maximize the benefits of your efforts.
Review your priorities regularly so that they can be updated as your organization's needs change.
Best Practices
Involve key stakeholders, including business, development, and operations teams, to determine where to focus efforts on external customer needs.
This will ensure that you have a thorough understanding of the operations support that is required to achieve your desired business outcomes.
Customers whose needs are satisfied are much more likely to remain customers.
Evaluating and understanding external customer needs will inform how you prioritize your efforts to deliver business value.
- Understand business needs: Business success is enabled by shared goals and understanding across stakeholders, including business, development, and operations teams.
- Review business goals, needs, and priorities of external customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of external customers.
- Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support your shared business goals across internal and external customers.
Involve key stakeholders when determining where to focus efforts on internal customer needs.
Ensure you understand the operations support that is required to achieve business outcomes.
Use your established priorities to focus your improvement efforts where they will have the greatest impact (for example, developing team skills, improving workload performance, reducing costs, automating runbooks, or enhancing monitoring).
Update your priorities as needs change.
- Understand business needs: Business success is enabled by shared goals and understanding across stakeholders including business, development, and operations teams.
- Review business goals, needs, and priorities of internal customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of internal customers.
- Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support shared business goals across internal and external customers.
Ensure that you are aware of guidelines or obligations defined by your organization that may mandate or emphasize specific focus.
Evaluate internal factors, such as organization policy, standards, and requirements.
Validate that you have mechanisms to identify changes to governance. If no governance requirements are identified, ensure that you have applied due diligence to this determination.
Evaluating and understanding the governance requirements that your organization applies to your workload will inform how you prioritize your efforts to deliver business value.
- Understand governance requirements
- Evaluate internal governance factors, such as organizational policy, program policies, issue- or system-specific policies, standards, procedures, baselines, and guidelines.
- Validate that you have mechanisms to identify changes to governance.
- If no governance requirements are identified, ensure that you have applied due diligence to this determination.
- AWS Cloud Compliance
Evaluate external factors, such as regulatory compliance requirements and industry standards, to ensure that you are aware of guidelines or obligations that might mandate or emphasize specific focus.
If no compliance requirements are identified, ensure that you apply due diligence to this determination.
Evaluating and understanding the compliance requirements that apply to your workload will inform how you prioritize your efforts to deliver business value.
- Understand compliance requirements
- Evaluate external factors, such as regulatory compliance requirements and industry standards, to ensure that you are aware of guidelines or obligations that might mandate or emphasize specific focus.
- Understand regulatory compliance requirements: Identify regulatory compliance requirements that you are legally obligated to satisfy.
- Understand industry standards and best practices
- Understand internal compliance requirements that are established by your organization
- AWS Cloud Compliance
- AWS Compliance latest news
- AWS Compliance programs
Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats) and maintain current information in a risk registry.
Include the impact of risks when determining where to focus efforts.
- Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats)
- Maintain a threat model: Establish and maintain a threat model identifying potential threats, planned and in place mitigations, and their priority.
- Review the probability of threats manifesting as incidents, the cost to recover from those incidents and the expected harm caused, and the cost to prevent those incidents.
- Revise priorities as the contents of the threat model change.
- AWS Cloud Compliance
- AWS Latest Security Bulletins
- AWS Trusted Advisor
Evaluate the impact of tradeoffs between competing interests or alternative approaches, to help make informed decisions when determining where to focus efforts or choosing a course of action.
For example, accelerating speed to market for new features may be emphasized over cost optimization, or you may choose a relational database for non-relational data to simplify the effort to migrate a system, rather than migrating to a database optimized for your data type and updating your application.
- Evaluate the impact of tradeoffs between competing interests, to help make informed decisions when determining where to focus efforts
- AWS can help you educate your teams about AWS and its services to increase their understanding of how their choices can have an impact on your workload
- AWS Blog
- AWS Cloud Compliance
- AWS Discussion Forums
- AWS Documentation
- AWS Knowledge Center
- AWS Support
- AWS Support Center
- Amazon Builders Library
- AWS Podcast (official)
Manage benefits and risks to make informed decisions when determining where to focus efforts.
For example, it may be beneficial to deploy a workload with unresolved issues so that significant new features can be made available to customers.
It may be possible to mitigate associated risks, or it may become unacceptable to allow a risk to remain, in which case you will take action to address the risk.
You might find that you want to emphasize a small subset of your priorities at some point in time.
Use a balanced approach over the long term to ensure the development of needed capabilities and management of risk. Update your priorities as needs change.
Identifying the available benefits of your choices, and being aware of the risks to your organization, enables you to make informed decisions.
- Manage benefits and risks
- Identify benefits based on business goals, needs, and priorities. Examples include time-to-market, security, reliability, performance, and cost.
- Identify risks
- Assess benefits against risks and make informed decisions
- Evaluate the value of the benefit against the probability of the risk being realized and the cost of its impact.
Operating model
Your teams must understand their part in achieving business outcomes.
Teams need to understand their roles in the success of other teams, the role of other teams in their success, and have shared goals.
Understanding responsibility, ownership, how decisions are made, and who has authority to make decisions will help focus efforts and maximize the benefits from your teams.
The needs of a team will be shaped by their industry, their organization, the makeup of the team, and the characteristics of their workload.
It is unreasonable to expect a single operating model to be able to support all teams and their workloads.
Best Practices
Understand who has ownership of each application, workload, platform, and infrastructure component, what business value is provided by that component, and why that ownership exists.
Understanding the business value of these individual components and how they support business outcomes informs the processes and procedures applied against them.
Understanding ownership identifies who can approve improvements, implement those improvements, or both.
- Resources have identified owners
- Specify and record owners for resources
- Store resource ownership information with resources using metadata such as tags or resource groups.
- Define who owns an organization, account, collection of resources, or individual components
- Capture ownership in the metadata for the resources
- Use AWS Organizations to structure accounts
- AWS Organizations
- Tagging
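As a minimal sketch of recording ownership as metadata (the ARN, tag keys, and values below are illustrative assumptions, not part of the framework), tags can be applied in bulk with the Resource Groups Tagging API:

```python
# Sketch: recording ownership as tags with the Resource Groups Tagging API.
# The ARN and tag values are placeholders for illustration only.
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

tagging.tag_resources(
    ResourceARNList=[
        "arn:aws:s3:::example-workload-assets"  # hypothetical resource
    ],
    Tags={
        "Owner": "mobile-platform-team",        # team accountable for the resource
        "Contact": "mobile-platform@example.com",
        "BusinessUnit": "mobile-app",
    },
)
```

Storing ownership this way keeps the information with the resource itself, so it stays discoverable through tag-based searches and resource groups.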
Understand who has ownership of the definition of individual processes and procedures, why those specific processes and procedures are used, and why that ownership exists.
Understanding the reasons that specific processes and procedures are used enables identification of improvement opportunities.
Benefits of establishing this best practice: Understanding ownership identifies who can approve improvements, implement those improvements, or both.
- Processes and procedures have identified owners
- Identify process and procedures
- Define who owns the definition of a process or procedure
- Capture ownership in the metadata of the activity artifact: procedures automated in services like AWS Systems Manager (as documents) and AWS Lambda (as functions) support capturing metadata as tags.
- Use AWS Organizations to create tagging policies and ensure ownership and contact information are captured
- AWS Systems Manager (automate procedures)
- AWS Organizations
- Tagging
Understand who has responsibility to perform specific activities on defined workloads and why that responsibility exists.
Understanding who has responsibility to perform activities informs who will conduct the activity, validate the result, and provide feedback to the owner of the activity.
Benefits of establishing this best practice: Understanding who is responsible to perform an activity informs whom to notify when action is needed and who will perform the action, validate the result, and provide feedback to the owner of the activity.
- Capture the responsibility for performing processes and procedures used in your environment
- Identify and document the operations activities conducted in support of your workloads
- Define who is responsible to perform each activity: Identify the team responsible for an activity.
- Make this information discoverable
Understanding the responsibilities of your role and how you contribute to business outcomes informs the prioritization of your tasks and why your role is important.
This enables team members to recognize needs and respond appropriately.
Benefits of establishing this best practice: Understanding your responsibilities informs the decisions you make, the actions you take, and your hand off activities to their proper owners.
- Identify team members roles and responsibilities and ensure they understand the expectations of their role
- Make this information discoverable
Where no individual or team is identified, there are defined escalation paths to someone with the authority to assign ownership or plan for that need to be addressed.
Benefits of establishing this best practice: Understanding who has responsibility or ownership allows you to reach out to the proper team or team member to make a request or transition a task.
Having an identified person who has the authority to assign responsibility or ownership, or to plan to address needs, reduces the risk of inaction and needs not being addressed.
- Provide accessible mechanisms for members of your organization to discover and identify ownership and responsibility
- These mechanisms will enable them to identify who to contact, team or individual, for specific needs.
You are able to make requests to owners of processes, procedures, and resources.
Make informed decisions to approve requests where viable and determined to be appropriate after an evaluation of benefits and risks.
Benefits of establishing this best practice: It's critical that mechanisms exist to request additions, changes, and exceptions in support of teams' activities. Without this option, the current state becomes a constraint on innovation.
- Provide mechanisms for members of your organization to make requests to owners of processes, procedures, and resources in support of their business needs
Have defined or negotiated agreements between teams describing how they work with and support each other (for example, response times, service level objectives, or service level agreements).
Understanding the impact of the team's work on business outcomes, and the outcomes of other teams and organizations, informs the prioritization of their tasks and enables them to respond appropriately.
When responsibility and ownership are undefined or unknown, you are at risk of both not addressing necessary activities in a timely fashion and of redundant and potentially conflicting efforts emerging to address those needs.
Benefits of establishing this best practice: Establishing the responsibilities between teams, the objectives, and the methods for communicating needs eases the flow of requests and helps ensure the necessary information is provided.
This reduces the delay introduced by transition tasks between teams and helps support the achievement of business outcomes.
- Responsibilities between teams are predefined or negotiated
- Specifying the methods by which teams interact, and the information necessary for them to support each other, can help minimize the delay introduced as requests are iteratively reviewed and clarified
- Having specific agreements that define expectations (for example, response time, or fulfillment time) enables teams to make effective plans and resource appropriately.
Organizational culture
Provide support for your team members so that they can be more effective in taking action and supporting your business outcome.
Best practices
Senior leadership clearly sets expectations for the organization and evaluates success.
Senior leadership is the sponsor, advocate, and driver for the adoption of best practices and evolution of the organization
Benefits of establishing this best practice: Engaged leadership, clearly communicated expectations, and shared goals ensures that team members know what is expected of them.
Evaluating success enables identification of barriers to success so that they can be addressed through intervention by the sponsor, advocate, or their delegates.
- Have Executive Sponsorship
- Set expectations: Define and publish goals for your organization, including how they will be measured.
- Track achievement of goals
- Provide the resources necessary to achieve your goals
- Advocate for your teams. Act on behalf of your teams to help address obstacles and remove unnecessary burdens
- Drive and acknowledge best practices that provide quantifiable benefits and recognize the creators and adopters
- Create a culture of continual improvement. Encourage both personal and organizational growth and development
The workload owner has defined guidance and scope empowering team members to respond when outcomes are at risk.
Escalation mechanisms are used to get direction when events are outside of the defined scope.
By testing and validating changes early, you are able to address issues with minimized costs and limit the impact on your customers.
By testing prior to deployment you minimize the introduction of errors.
- Empower your team. Provide your team members the permissions, tools, and opportunity to practice the skills necessary to respond effectively
- Give your team members opportunity to practice the skills necessary to respond
- Perform game days
- Define and acknowledge team members' authority to take action
Team members have mechanisms and are encouraged to escalate concerns to decision makers and stakeholders if they believe outcomes are at risk.
Escalation should be performed early and often so that risks can be identified, and prevented from causing incidents.
- Encourage early and frequent escalation
- Have a mechanism for escalation
- Escalations should include the nature of the risk, the criticality of the workload, who is impacted, what the impact is, and the urgency (that is, when the impact is expected)
- Protect employees who escalate
Mechanisms exist and are used to provide timely notice to team members of known risks and planned events. Necessary context, details, and time (when possible) are provided to support determining if action is necessary, what action is required, and to take action in a timely manner.
For example, providing notice of software vulnerabilities so that patching can be expedited, or providing notice of planned sales promotions so that a change freeze can be implemented to avoid the risk of service disruption.
Planned events can be recorded in a change calendar or maintenance schedule so that team members can identify what activities are pending.
On AWS, AWS Systems Manager Change Calendar can be used to record these details. It supports programmatic checks of calendar status to determine if the calendar is open or closed to activity at a particular point of time.
Operations activities can be planned around specific approved windows of time that are reserved for potentially disruptive activities. AWS Systems Manager Maintenance Windows allows you to schedule activities against instances and other supported resources to automate the activities and make those activities discoverable.
- AWS Systems Manager Change Calendar
- AWS Systems Manager Maintenance Windows
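A short sketch of the programmatic calendar check described above, assuming a hypothetical change calendar named MobileAppChangeFreeze already exists in Systems Manager:

```python
# Sketch: gating a potentially disruptive activity on a Systems Manager Change Calendar.
# The calendar name is a placeholder; create your own calendar document first.
import boto3

ssm = boto3.client("ssm")

response = ssm.get_calendar_state(
    CalendarNames=["MobileAppChangeFreeze"]  # hypothetical calendar document name
)

if response["State"] == "OPEN":
    print("Calendar is open: safe to proceed with the planned maintenance activity.")
else:
    print("Calendar is closed: defer the activity until the change freeze ends.")
```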
Experimentation accelerates learning and keeps team members interested and engaged.
An undesired result is a successful experiment that has identified a path that will not lead to success.
Team members are not punished for successful experiments with undesired results. Experimentation is required for innovation to happen and turn ideas into outcomes.
- Encourage experimentation to support learning and innovation
- Encourage experimentation with technologies that may have applicability now or in the future to the achievement of your business outcomes
- Encourage experimentation with specific goals for team members to reach for, or with technologies that may have applicability in the near future
- Dedicate specific times when team members can be free of their normal responsibilities, so that they can focus on their experiments
- Acknowledge success. Understand that experiments with undesired outcomes are successful and have identified a path that will not lead to success.
Teams must grow their skill sets to adopt new technologies, and to support changes in demand and responsibilities in support of your workloads.
Growth of skills in new technologies is frequently a source of team member satisfaction and supports innovation.
Support your team members pursuit and maintenance of industry certifications that validate and acknowledge their growing skills.
Cross train to promote knowledge transfer and reduce the risk of significant impact when you lose skilled and experienced team members with institutional knowledge.
Provide dedicated structured time for learning.
- Team members are enabled and encouraged to maintain and grow their skill sets
- Provide resources for education
- Provide junior team members access to senior team members as mentors
- Plan for the continuing education needs of your team members
- Provide opportunities for team members to join other teams (temporarily or permanently)
- Support pursuit and maintenance of industry certifications
- AWS Getting Started Resource Center
- AWS Blogs
- AWS Cloud Compliance
- AWS Discussion Forums
- AWS Documentation
- AWS Online Tech Talks
- AWS Events and Webinars
- AWS Knowledge Center
- AWS Support
- AWS Well-Architected Framework
- AWS Podcast (official)
Maintain team member capacity, and provide tools and resources to support your workload needs.
Overtasking team members increases the risk of incidents resulting from human error.
Investments in tools and resources (for example, providing automation for frequently performed activities) can scale the effectiveness of your team, enabling them to support additional activities.
- Resource teams appropriately
- Understand team performance (Measure the achievement of operational outcomes)
- Track changes in output and error rate over time
- Understand impacts on team performance
- Act on behalf of your teams to help address obstacles and remove unnecessary burdens
- Provide the resources necessary for teams to be successful
Leverage cross-organizational diversity to seek multiple unique perspectives.
Use this perspective to increase innovation, challenge your assumptions, and reduce the risk of confirmation bias.
Grow inclusion, diversity, and accessibility within your teams to gain beneficial perspectives.
Organizational culture has a direct impact on team member job satisfaction and retention. Enable the engagement and capabilities of your team members to enable the success of your business.
- Seek diverse opinions and perspectives
- Give voice to underrepresented groups. Rotate roles and responsibilities in meetings
- Provide opportunity for team members to take on roles that they might not otherwise
- Provide a safe and welcoming environment
- Enable team members to participate fully
Prepare
To prepare for operational excellence, you have to understand your workloads and their expected behaviors. You will then be able to design them to provide insight to their status and build the procedures to support them.
Design telemetry
Design your workload so that it provides the information necessary for you to understand its internal state (for example, metrics, logs, events, and traces) across all components in support of observability and investigating issues
Iterate to develop the telemetry necessary to monitor the health of your workload, identify when outcomes are at risk, and enable effective responses. In AWS, you can emit and collect logs, metrics, and events from your applications and workload components to enable you to understand their internal state and health. You can integrate distributed tracing to track requests as they travel through your workload. Use this data to understand how your application and underlying components interact and to analyze issues and performance.
Best Practices
Application telemetry is the foundation for observability of your workload.
Your application should emit telemetry that provides insight into the state of the application and the achievement of business outcomes.
From troubleshooting to measuring the impact of a new feature, application telemetry informs the way you build, operate, and evolve your workload.
Collecting metrics over time can be used to develop baselines and detect anomalies.
- Implementing application telemetry consists of three steps:
- 1) Identifying a location to store telemetry
- 2) Identifying telemetry that describes the state of the application
- 3) Instrumenting the application to emit telemetry
- To identify what telemetry you need, start with the following questions: Is my application healthy? Is my application achieving business outcomes?
- AWS CloudWatch
- AWS SDK
- AWS Builders Library – Instrumenting Distributed Systems for Operational Visibility
- AWS Distro for OpenTelemetry
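One possible way to emit such telemetry from application code is a custom CloudWatch metric; the namespace, metric names, and dimensions below are assumptions for illustration only:

```python
# Sketch: emitting a business-outcome metric from application code with CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_completed_checkout(duration_seconds: float) -> None:
    """Publish one data point per completed checkout so baselines and anomalies can be derived."""
    cloudwatch.put_metric_data(
        Namespace="MobileApp/Business",  # hypothetical custom namespace
        MetricData=[
            {
                "MetricName": "CheckoutCompleted",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
            {
                "MetricName": "CheckoutLatency",
                "Value": duration_seconds,
                "Unit": "Seconds",
                "Dimensions": [{"Name": "Environment", "Value": "production"}],
            },
        ],
    )
```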
Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events.
Use this information to help determine when a response is required.
Benefits of establishing this best practice: Understanding what is going on inside your workload enables you to respond if necessary.
- Implement log and metric telemetry: Instrument your workload to emit information about its internal state, status, and the achievement of business outcomes.
- Use this information to determine when a response is required.
- Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events).
- AWS CloudTrail
- AWS CloudWatch
- VPC Flow Logs
Instrument your application code to emit information about user activity, for example, click streams, or started, abandoned, and completed transactions.
Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.
- Design your application code to emit information about user activity (for example, click streams, or started, abandoned, and completed transactions).
- Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required
Design and configure your workload to emit information about the status (for example, reachability or response time) of resources it depends on.
Examples of external dependencies include external databases, DNS, and network connectivity.
Use this information to determine when a response is required.
- Implement dependency telemetry: Design and configure your workload to emit information about the state and status of systems it depends on.
- Some examples include: external databases, DNS, network connectivity, and external credit card processing services.
- AWS CloudWatch Agent with AWS Systems Manager
Implement your application code and configure your workload components to emit information about the flow of transactions across the workload.
Use this information to determine when a response is required and to assist you in identifying the factors contributing to an issue.
- Implement transaction traceability: Design your application and workload to emit information about the flow of transactions across system components, such as transaction stage, active component, and time to complete activity
- Use this information to determine what is in progress, what is complete, and what the results of completed activities are. This helps you determine when a response is required.
- For example, longer than expected transaction response times within a component can indicate issues with that component.
- AWS X-Ray
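A minimal sketch of transaction tracing with the AWS X-Ray SDK for Python, assuming the aws-xray-sdk package is installed and a segment is already open (for example, inside Lambda or behind the X-Ray middleware); the table and function names are hypothetical:

```python
# Sketch: tracing a transaction step and its downstream AWS calls with X-Ray.
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # instrument supported libraries (boto3, requests, etc.) automatically

@xray_recorder.capture("process_order")  # records a timed subsegment for this step
def process_order(order_id: str) -> None:
    # Calls made here are traced as part of the same request, so slow components
    # show up in the service map and trace timeline.
    boto3.client("dynamodb").get_item(
        TableName="Orders",                    # hypothetical table
        Key={"OrderId": {"S": order_id}},
    )
```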
Design for operations
Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing
These accelerate beneficial changes entering production, limit issues deployed, and enable rapid identification and remediation of issues introduced through deployment activities.
In AWS, you can view your entire workload (applications, infrastructure, policy, governance, and operations) as code. It can all be defined in and updated using code. This means you can apply the same engineering discipline that you use for application code to every element of your stack.
Best Practices
Use version control to enable tracking of changes and releases.
- Use version control: Maintain assets in version-controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure).
- Integrate the version control capabilities of your configuration management systems into your procedures.
- AWS CodeCommit
Test and validate changes to help limit and detect errors. Automate testing to reduce errors caused by manual processes, and reduce the level of effort to test.
- Test and validate changes: Changes should be tested and the results validated at all lifecycle stages (for example, development, test, and production).
- Use testing results to confirm new features and mitigate the risk and impact of failed deployments.
- Automate testing and validation to ensure consistency of review, to reduce errors caused by manual processes, and reduce the level of effort.
- AWS CodeBuild
Use configuration management systems to make and track configuration changes.
These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes.
Static configuration management sets values when initializing a resource that are expected to remain consistent throughout the resource’s lifetime.
- Use configuration management systems: Use configuration management systems to track and implement changes, to reduce errors caused by manual processes, and reduce the level of effort.
- AWS AppConfig
- AWS Developer Tools
- AWS OpsWorks
- AWS Systems Manager Change Calendar
- AWS Systems Manager Maintenance Windows
- AWS CloudFormation
- AWS Config
- AWS Elastic Beanstalk
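As one illustration of dynamic configuration management, an application could poll AWS AppConfig at runtime; the application, environment, and profile identifiers below are placeholders:

```python
# Sketch: reading dynamic configuration (for example, feature flags) from AWS AppConfig.
import boto3

appconfig = boto3.client("appconfigdata")

session = appconfig.start_configuration_session(
    ApplicationIdentifier="mobile-app",           # hypothetical AppConfig application
    EnvironmentIdentifier="production",
    ConfigurationProfileIdentifier="feature-flags",
)

result = appconfig.get_latest_configuration(
    ConfigurationToken=session["InitialConfigurationToken"]
)
config_bytes = result["Configuration"].read()     # empty if unchanged since the last poll
print(config_bytes.decode("utf-8") or "configuration unchanged")
```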
Use build and deployment management systems. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- This reduces lead time, enables increased frequency of change, and reduces the level of effort.
- AWS CodeBuild
- AWS CodeDeploy
- AWS Developer Tools
Perform patch management to gain features, address issues, and remain compliant with governance.
Automate patch management to reduce errors caused by manual processes, and reduce the level of effort to patch.
Patch and vulnerability management are part of your benefit and risk management activities.
- Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant with governance policy and vendor support requirements.
- In immutable systems, deploy with the appropriate patch set to achieve the desired result.
- Automate the patch management mechanism to reduce the elapsed time to patch, to reduce errors caused by manual processes, and reduce the level of effort to patch.
- AWS Developer Tools
- AWS Systems Manager Patch Manager
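A hedged sketch of automating a patch run with Systems Manager Patch Manager, assuming instances are tagged into a hypothetical patch group named mobile-app-prod:

```python
# Sketch: triggering a patch scan (or install) run across a tagged fleet.
import boto3

ssm = boto3.client("ssm")

ssm.send_command(
    DocumentName="AWS-RunPatchBaseline",              # managed patching document
    Targets=[{"Key": "tag:PatchGroup", "Values": ["mobile-app-prod"]}],
    Parameters={"Operation": ["Scan"]},               # use "Install" to apply missing patches
    Comment="Scheduled compliance scan for the mobile app fleet",
)
```

In practice this kind of command is usually scheduled through a maintenance window rather than run ad hoc, so patching happens in approved windows.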
Share best practices across teams to increase awareness and maximize the benefits of development efforts.
On AWS, application, compute, infrastructure, and operations can be defined and managed using code methodologies. This allows for easy release, sharing, and adoption.
- Share existing best practices, design standards, checklists, operating procedures, and guidance and governance requirements across teams to reduce complexity and maximize the benefits from development efforts.
- Ensure that procedures exist to request changes, additions, and exceptions to design standards to support continual improvement and innovation.
- Ensure that teams are aware of published content so that they can take advantage of content, and limit rework and wasted effort.
- Share an AWS CodeCommit repository
- Easy authorization of AWS Lambda functions
- Sharing an AMI with specific AWS accounts
Implement practices to improve code quality and minimize defects. Some examples include test-driven development, code reviews, and standards adoption.
- Implement practices to improve code quality to minimize defects and the risk of their being deployed.
- For example, test-driven development, pair programming, code reviews, and standards adoption.
- AWS CodeGuru
Use multiple environments to experiment, develop, and test your workload.
Use increasing levels of controls as environments approach production to gain confidence your workload will operate as intended when deployed.
- Provide developers sandbox environments with minimized controls to enable experimentation.
- Provide individual development environments to enable work in parallel, increasing development agility.
- Implement more rigorous controls in environments approaching production to gain confidence that your workload will operate as intended when deployed.
- Use infrastructure as code and configuration management systems to deploy environments that are configured consistent with the controls present in production to ensure systems operate as expected when deployed.
- AWS CloudFormation
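One way to stand up a consistent pre-production environment is to deploy the same infrastructure-as-code template used for production; the stack name, template URL, and parameter below are assumptions:

```python
# Sketch: creating a staging environment from a shared CloudFormation template.
import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.create_stack(
    StackName="mobile-app-staging",
    TemplateURL="https://s3.amazonaws.com/example-bucket/mobile-app.yaml",  # hypothetical template
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "staging"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Tags=[{"Key": "Owner", "Value": "mobile-platform-team"}],
)

# Wait until the environment is ready before running tests against it.
cloudformation.get_waiter("stack_create_complete").wait(StackName="mobile-app-staging")
```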
Frequent, small, and reversible changes reduce the scope and impact of a change.
This eases troubleshooting, enables faster remediation, and provides the option to roll back a change.
Automate build, deployment, and testing of the workload.
This reduces errors caused by manual processes and reduces the effort to deploy changes.
Apply metadata using Resource Tags and AWS Resource Groups following a consistent tagging strategy to enable identification of your resources
Tag your resources for organization, cost accounting, access controls, and targeting the execution of automated operations activities.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- This reduces lead time, enables increased frequency of change, and reduces the level of effort.
- AWS CodeBuild
- AWS CodeDeploy
Mitigate deployment risks
Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes
Using these practices mitigates the impact of issues introduced through the deployment of changes.
Best Practices
Plan to revert to a known good state, or remediate in the production environment if a change does not have the desired outcome. This preparation reduces recovery time through faster responses.
- Plan for unsuccessful changes: Plan to revert to a known good state (that is, roll back the change), or remediate in the production environment (that is, roll forward the change) if a change does not have the desired outcome.
- When you identify changes that you cannot roll back if unsuccessful, apply due diligence prior to committing the change.
Test changes and validate the results at all lifecycle stages to confirm new features and minimize the risk and impact of failed deployments.
On AWS, you can create temporary parallel environments to lower the risk, effort, and cost of experimentation and testing. Automate the deployment of these environments using AWS CloudFormation to ensure consistent implementations of your temporary environments.
- Test and validate changes: Changes should be tested and the results validated at all lifecycle stages (for example, development, test, and production).
- AWS Cloud9
- AWS CodeDeploy
Use deployment management systems to track and implement change. This reduces errors caused by manual processes and reduces the effort to deploy changes.
Build Continuous Integration/Continuous Deployment (CI/CD) pipelines
- Use deployment management systems: Use deployment management systems to track and implement change. This will reduce errors caused by manual processes, and reduce the level of effort to deploy changes.
- Automate the integration and deployment pipeline from code check-in through testing, deployment, and validation. This reduces lead time, enables increased frequency of change, and further reduces the level of effort.
- AWS CodeDeploy
Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
- Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
- AWS CodeDeploy
- AWS Blue/Green Deployments
Implement changes onto parallel environments, and then transition over to the new environment. Maintain the prior environment until there is confirmation of successful deployment.
Doing so minimizes recovery time by enabling rollback to the previous environment.
- Deploy using parallel environments: Implement changes onto parallel environments, and transition or cut over to the new environment.
- Maintain the prior environment until there is confirmation of successful deployment.
- This minimizes recovery time by enabling rollback to the previous environment
- AWS CodeDeploy
- AWS Blue/Green Deployments
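A sketch of starting such a deployment with AWS CodeDeploy (the deployment group is assumed to be configured for blue/green; application, bucket, and revision names are placeholders):

```python
# Sketch: starting a CodeDeploy deployment with automatic rollback on failure.
import boto3

codedeploy = boto3.client("codedeploy")

response = codedeploy.create_deployment(
    applicationName="mobile-app-backend",             # hypothetical application
    deploymentGroupName="production",                 # group configured for blue/green
    revision={
        "revisionType": "S3",
        "s3Location": {
            "bucket": "example-artifacts",
            "key": "mobile-app-backend-1.4.2.zip",
            "bundleType": "zip",
        },
    },
    description="Release 1.4.2 via blue/green deployment",
    autoRollbackConfiguration={"enabled": True, "events": ["DEPLOYMENT_FAILURE"]},
)
print("Deployment started:", response["deploymentId"])
```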
Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change.
- Use frequent, small, and reversible changes to reduce the scope of a change.
- This results in easier troubleshooting and faster remediation with the option to roll back a change
Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes.
- Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort.
- Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
- AWS CodeBuild
- AWS CodeDeploy
Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
- Automate testing of deployed environments to confirm desired outcomes.
- Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
- AWS IAM
- AWS Organizations
Operational readiness and change management
Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload
Manage the flow of change into your environments. You should use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. This will also enable you to find any areas that you need to make plans to address. You will have runbooks that document your routine activities and playbooks that guide your processes for issue resolution. Use a mechanism to manage changes that supports the delivery of business value and helps mitigate risks associated with change.
Best Practices
Have a mechanism to validate that you have the appropriate number of trained personnel to provide support for operational needs.
Train personnel and adjust personnel capacity as necessary to maintain effective support.
You will need to have enough team members to cover all activities (including on-call). Ensure that your teams have the necessary skills to be successful with training on your workload, your operations tools, and AWS.
- Personnel capability: Validate that there are sufficient trained personnel to effectively support the workload.
- Team size: Ensure that you have enough team members to cover operational activities, including on-call duties.
- Review capabilities: Review team size and skills as operating conditions and workloads change, to ensure there is sufficient capability to maintain operational excellence.
- AWS Blogs
- AWS Events and Webinars
- AWS Training and Certification
Use Operational Readiness Reviews (ORRs) to validate that you can operate your workload.
ORR is a mechanism developed at Amazon to validate that teams can safely operate their workloads.
An ORR is a review and inspection process using a checklist of requirements.
- To learn more about ORRs, read the Operational Readiness Reviews (ORR) whitepaper.
A runbook is a documented process to achieve a specific outcome.
Runbooks consist of a series of steps that someone follows to get something done.
Runbooks are an essential part of operating your workload. From onboarding a new team member to deploying a major release, runbooks are the codified processes that provide consistent outcomes no matter who uses them.
- Runbooks can take several forms depending on the maturity level of your organization. At a minimum, they should consist of a step-by-step text document. The desired outcome should be clearly indicated.
- Clearly document necessary special permissions or tools.
- Provide detailed guidance on error handling and escalations in case something goes wrong.
- AWS Systems Manager Automation runbooks
Playbooks are step-by-step guides used to investigate an incident. When incidents happen, playbooks are used to investigate, scope impact, and identify a root cause.
Playbooks are used for a variety of scenarios, from failed deployments to security incidents.
In many cases, playbooks identify the root cause that a runbook is used to mitigate. Playbooks are an essential component of your organization's incident response plans.
- If you are new to the cloud, build playbooks in text form in a central document repository
- As your organization matures, playbooks can become semi-automated with scripting languages like Python.
- Start building your playbooks by listing common incidents that happen to your workload.
- Your text playbooks should be automated as your organization matures.
- AWS Systems Manager Automation runbooks
Evaluate the capabilities of the team to support the workload and the workload's compliance with governance.
- Evaluate the capabilities of the team to support the workload and the workload's compliance with governance
- Evaluate these against the benefits of deployment when determining whether to transition a system or change into production.
- Understand the benefits and risks, and make informed decisions.
Operate
Success is the achievement of business outcomes as measured by the metrics you define. By understanding the health of your workload and operations, you can identify when organizational and business outcomes may become at risk, or are at risk, and respond appropriately.
Understanding workload health
Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action
Your team should be able to understand the health of your workload easily. You will want to use metrics based on workload outcomes to gain useful insights. You should use these metrics to implement dashboards with business and technical viewpoints that will help team members make informed decisions.
Best Practices
- AWS CloudWatch metrics
- AWS Organizations
Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
You should aggregate log data from your application, workload components, services, and API calls to a service such as CloudWatch Logs.
- Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
- AWS Athena
- AWS CloudWatch metrics
- AWS DevOps Guru
- AWS Glue
- AWS Glue Data Catalog
- AWS Health Dashboard
- AWS QuickSight
Establish baselines for metrics to provide expected values as the basis for comparison and identification of under- and over-performing components. Identify thresholds for improvement, investigation, and intervention.
- Establish baselines for workload metrics: Establish baselines for workload metrics to provide expected values as the basis for comparison.
- AWS CloudWatch
Establish patterns of workload activity to identify anomalous behavior so that you can respond appropriately if required.
CloudWatch, through its Anomaly Detection feature, applies statistical and machine learning algorithms to generate a range of expected values that represent normal metric behavior.
- Learn expected patterns of activity for workload: Establish patterns of workload activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.
- AWS DevOps Guru
- AWS CloudWatch Anomaly Detection
Raise an alert when workload outcomes are at risk so that you can respond appropriately if necessary.
Ideally, you have previously identified a metric threshold that you are able to alarm upon or an event that you can use to trigger an automated response.
- Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if required.
- AWS CloudWatch Synthetics
- AWS CloudWatch Log Insights
- AWS CloudWatch Events
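For example, a threshold alarm on a business metric could notify an SNS topic when outcomes are at risk; the metric, threshold, and topic ARN below are illustrative assumptions:

```python
# Sketch: alarming when a workload outcome metric breaches a threshold.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mobile-app-checkout-errors-high",
    Namespace="MobileApp/Business",                    # hypothetical custom namespace
    MetricName="CheckoutErrors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
    AlarmDescription="Checkout error volume indicates business outcomes are at risk",
)
```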
Raise an alert when workload anomalies are detected so that you can respond appropriately if necessary.
Your analysis of your workload metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response.
- Raise an alert when workload anomalies are detected so that you can respond appropriately if required.
- AWS CloudWatch Alarms/Events
- AWS CloudWatch Anomaly Detection
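A sketch of wiring this up with CloudWatch Anomaly Detection, assuming an Application Load Balancer metric as the example; the load balancer dimension is a placeholder:

```python
# Sketch: training an anomaly detection model and alarming when values leave the expected band.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_anomaly_detector(
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/mobile-app/abc123"}],  # placeholder
    Stat="Sum",
)

cloudwatch.put_metric_alarm(
    AlarmName="mobile-app-request-count-anomaly",
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    EvaluationPeriods=3,
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "LoadBalancer", "Value": "app/mobile-app/abc123"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
    ThresholdMetricId="ad1",
    ActionsEnabled=False,  # enable and add AlarmActions once the band looks reasonable
)
```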
Create a business-level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
Validate the effectiveness of KPIs and metrics and revise them if necessary.
- Create a business level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Validate the effectiveness of KPIs and metrics and revise them if necessary.
- AWS CloudWatch dashboards
Understanding operational health
Define, capture, and analyze operations metrics to gain visibility to workload events so that you can take appropriate action
Your team should be able to understand the health of your operations easily. You will want to use metrics based on operations outcomes to gain useful insights. You should use these metrics to implement dashboards with business and technical viewpoints that will help team members make informed decisions.
Best Practices
Identify key performance indicators (KPIs) based on desired business outcomes (for example, new features delivered) and customer outcomes (for example, customer support cases).
- Evaluate KPIs to determine operations success.
Define operations metrics to measure the achievement of KPIs (for example, successful deployments, and failed deployments).
- Define operations metrics to measure the achievement of KPIs.
- Define operations metrics to measure the health of operations and its activities.
- Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of the operations.
- AWS CloudWatch Events
Perform regular, proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
You should aggregate log data from the execution of your operations activities and operations API calls into a service such as CloudWatch Logs.
- Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
- AWS Athena
- AWS CloudWatch
- AWS Glue
- AWS Glue Data Catalog
- AWS QuickSight
Establish baselines for metrics to provide expected values as the basis for comparison and identification of under- and over-performing operations activities.
- Establish baselines for operations metrics to provide expected values as the basis for comparison.
Establish patterns of operations activities to identify anomalous activity so that you can respond appropriately if necessary.
- Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.
Whenever operations outcomes are at risk, an alert must be raised and acted upon. Operations outcomes are any activity that supports a workload in production.
This includes everything from deploying new versions of applications to recovering from an outage
- Start by defining what operations activities are most important to your organization.
- Your organization must define key operations activities and how they are measured so that they can be monitored, improved, and alerted on.
- You need a central location where workload and operations telemetry is stored and analyzed. The same mechanism should be able to raise an alert when an operations outcome is at risk.
- AWS EventBridge
- AWS Systems Manager OpsCenter
Raise an alert when operations anomalies are detected so that you can respond appropriately if necessary.
Your analysis of your operations metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response.
- Raise an alert when operations anomalies are detected so that you can respond appropriately if required.
- AWS DevOps Guru
- AWS CloudWatch Anomaly Detection
Create a business-level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Create a business level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals.
- Validate the effectiveness of KPIs and metrics and revise them if necessary
- AWS CloudWatch dashboards
Responding to events
You should anticipate operational events, both planned (for example, sales promotions, deployments, and failure tests) and unplanned (for example, surges in utilization and component failures)
You should use your existing runbooks and playbooks to deliver consistent results when you respond to alerts. Defined alerts should be owned by a role or a team that is accountable for the response and escalations. You will also want to know the business impact of your system components and use this to target efforts when needed. You should perform a root cause analysis (RCA) after events, and then prevent recurrence of failures or document workarounds.
Best Practices
Your organization has processes to handle events, incidents, and problems.
Events are things that occur in your workload but may not need intervention.
Incidents are events that require intervention.
Problems are recurring events that require intervention or cannot be resolved.
You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately
- Track events that happen in your workload, even if no human intervention is required.
- Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching.
- You can use services like Amazon EventBridge or Amazon Simple Notification Service to generate custom events for tracking.
- AWS EventBridge
- AWS SNS
- AWS Health Dashboard
- AWS Systems Manager Incident Manager
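For example, a custom operations event could be published to EventBridge so it is tracked centrally; the event source and detail fields below are assumptions:

```python
# Sketch: publishing a custom operations event (for example, a completed deployment) to EventBridge.
import json
import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "mobile-app.operations",         # hypothetical event source
            "DetailType": "DeploymentCompleted",
            "Detail": json.dumps({
                "service": "mobile-app-backend",
                "version": "1.4.2",
                "status": "SUCCEEDED",
            }),
        }
    ]
)
```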
Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert.
- Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion.
- Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes
- AWS CloudWatch Events
Ensure that when multiple events require intervention, those that are most significant to the business are addressed first.
Impacts can include loss of life or injury, financial loss, or damage to reputation or trust.
- Ensure that when multiple events require intervention, those that are most significant to the business are addressed first
Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation.
Specifically identify owners for each action to ensure effective and prompt responses to operations events.
- Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation
- Escalate an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed
Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and again when the services return to normal operating conditions, to enable users to take appropriate action.
- Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and when the services return to normal operating conditions, to enable users to take appropriate action.
- AWS SES
- AWS SNS
Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest.
- Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team.
- AWS QuickSight
- AWS CloudWatch Dashboards
Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
- Create CloudWatch Events rules to trigger responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation).
- AWS CloudWatch Events
- AWS CloudTrail
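A minimal sketch of an automated response: an EventBridge rule that matches a hypothetical failed-deployment event and invokes a placeholder Lambda function:

```python
# Sketch: routing an operational event to an automated response via EventBridge.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="on-failed-deployment",
    EventPattern=json.dumps({
        "source": ["mobile-app.operations"],
        "detail-type": ["DeploymentCompleted"],
        "detail": {"status": ["FAILED"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="on-failed-deployment",
    Targets=[
        {
            "Id": "notify-and-rollback",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:deployment-responder",  # placeholder
        }
    ],
)
# The target Lambda function also needs a resource-based permission allowing
# events.amazonaws.com to invoke it from this rule (lambda add-permission).
```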
Evolve
Evolution is the continuous cycle of improvement over time. Implement frequent small incremental changes based on the lessons learned from your operations activities and evaluate their success at bringing about improvement.
Learn, share, and improve
It's essential that you regularly provide time for analysis of operations activities, analysis of failures, experimentation, and making improvements
When things fail, you will want to ensure that your team, as well as your larger engineering community, learns from those failures. You should analyze failures to identify lessons learned and plan improvements. You will want to regularly review your lessons learned with other teams to validate your insights.
Best Practices
Regularly evaluate and prioritize opportunities for improvement to focus efforts where they can provide the greatest benefits.
- Implement changes to improve and evaluate the outcomes to determine success
- If the outcomes do not satisfy the goals, and the improvement is still a priority, iterate using alternative courses of action
Review customer-impacting events, and identify the contributing factors and preventative actions.
Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses
- Have a process to identify and document the contributing factors of an incident so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses.
- Communicate root cause as appropriate, tailored to target audiences
Feedback loops provide actionable insights that drive decision making. Build feedback loops into your procedures and workloads.
Feedback loops help you identify issues and areas that need improvement, and they validate the investments made in improvements.
- You need a mechanism to receive feedback from customers and team members. Your operations activities can also be configured to deliver automated feedback.
- Your organization needs a process to review this feedback, determine what to improve, and schedule the improvement.
- Feedback must be added into your software development process
- AWS Systems Manager OpsCenter
Mechanisms exist for your team members to discover the information that they are looking for in a timely manner, access it, and identify that it’s current and complete.
Mechanisms are present to identify needed content, content in need of refresh, and content that should be archived so that it’s no longer referenced.
- Ensure mechanisms exist for your team members to discover the information that they are looking for in a timely manner, access it, and identify that it’s current and complete.
- Maintain mechanisms to identify needed content, content in need of refresh, and content that should be archived so that it’s no longer referenced.
Identify drivers for improvement to help you evaluate and prioritize opportunities.
On AWS, you can aggregate the logs of all your operations activities, workloads, and infrastructure to create a detailed activity history.
You can then use AWS tools to analyze your operations and workload health over time.
- Understand drivers for improvement: You should only make changes to a system when a desired outcome is supported.
- AWS Athena
- AWS QuickSight
- AWS Compliance
- AWS Glue
- AWS Trusted Advisor
Review your analysis results and responses with cross-functional teams and business owners. Use these reviews to establish common understanding, identify additional impacts, and determine courses of action. Adjust responses as appropriate.
- Engage with business owners and subject matter experts to ensure there is common understanding and agreement on the meaning of the data you have collected. Identify additional concerns and potential impacts, and determine courses of action.
Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business.
Use these reviews to identify opportunities for improvement, potential courses of action, and to share lessons learned.
- Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business.
- Engage stakeholders, including the business, development, and operations teams, to validate your findings from immediate feedback and retrospective analysis, and to share lessons learned.
- Use their insights to identify opportunities for improvement and potential courses of action.
- AWS CloudWatch
- AWS CloudWatch metrics
Document and share lessons learned from the operations activities so that you can use them internally and across teams.
You should share what your teams learn to increase the benefit across your organization.
- Have procedures to document the lessons learned from the execution of operations activities and retrospective analysis so that they can be used by other teams
- Have procedures to share lessons learned and associated artifacts across teams. For example, share updated procedures, guidance, governance, and best practices through an accessible wiki.
- Share scripts, code, and libraries through a common repository.
Dedicate time and resources within your processes to make continuous incremental improvements possible.
On AWS, you can create temporary duplicates of environments, lowering the risk, effort, and cost of experimentation and testing.
- Dedicate time and resources within your processes to make continuous incremental improvements possible.
- Implement changes to improve and evaluate the results to determine success.
Security
The security pillar describes how to take advantage of cloud technologies to protect data, systems, and assets in a way that can improve your security posture.
There are six focus areas for Security in the cloud
Security foundations
The security pillar describes how to take advantage of cloud technologies to protect data, systems, and assets in a way that can improve your security posture.
AWS account management and separation
We recommend that you organize workloads in separate accounts and group accounts based on function, compliance requirements, or a common set of controls rather than mirroring your organization’s reporting structure.
In AWS, accounts are a hard boundary. For example, account-level separation is strongly recommended for isolating production workloads from development and test workloads.
Manage accounts centrally: AWS Organizations automates AWS account creation and management, and control of those accounts after they are created.
Set controls centrally: Control what your AWS accounts can do by only allowing specific services, Regions, and service actions at the appropriate level.
Configure services and resources centrally: AWS Organizations helps you configure AWS services that apply to all of your accounts.
Best Practices
Start with security and infrastructure in mind to enable your organization to set common guardrails as your workloads grow. This approach provides boundaries and controls between workloads.
Account level separation is strongly recommended for isolating production environments from development and test environments, or providing a strong logical boundary between workloads that process data of different sensitivity levels, as defined by external compliance requirements (such as PCI-DSS or HIPAA), and workloads that don’t.
- Use AWS Organizations to centrally enforce policy-based management for multiple AWS accounts.
- Consider AWS Control Tower: AWS Control Tower provides an easy way to set up and govern a new, secure, multi-account AWS environment based on best practices.
- AWS Organizations
- AWS Control Tower
There are a number of aspects to securing your AWS accounts, including securing and not routinely using the root user, and keeping your contact information up to date.
You can use AWS Organizations to centrally manage and govern your accounts as you grow and scale your workloads in AWS.
AWS Organizations helps you manage accounts, set controls, and configure services across your accounts.
- Use AWS Organizations to centrally enforce policy-based management for multiple AWS accounts.
- Limit use of the AWS root user: Only use the root user to perform tasks that specifically require it.
- Enable multi-factor-authentication (MFA) for the root user: Enable MFA on the AWS account root user, if AWS Organizations is not managing root users for you.
- Periodically change the root user password.
- Enable notification when the AWS account root user is used.
- Restrict access to newly added Regions (see the example service control policy sketched after the list below).
- Consider AWS CloudFormation StackSets: CloudFormation StackSets can be used to deploy resources including IAM policies, roles, and groups into different AWS accounts and Regions from an approved template.
- AWS Organizations
- AWS Control Tower
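A sketch, using boto3, of creating and attaching a service control policy with AWS Organizations that denies activity outside approved Regions. The policy name, Region list, and target OU ID are placeholders, and a production policy would normally exempt global services such as IAM.

```python
import json
import boto3

org = boto3.client("organizations")

# Example SCP: deny all actions outside the approved Regions.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-west-2"]}},
    }],
}

policy = org.create_policy(
    Name="deny-unapproved-regions",               # placeholder name
    Description="Restrict activity to approved Regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the guardrail to an organizational unit (placeholder OU ID).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid111-exampleouid111",
)
```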
Operating your workloads securely
Operating workloads securely covers the whole lifecycle of a workload from design, to build, to run, and to ongoing improvement.
One of the ways to improve your ability to operate securely in the cloud is by taking an organizational approach to governance
Governance is the way that decisions are guided consistently without depending solely on the good judgment of the people involved.
Best Practices
Based on your compliance requirements and risks identified from your threat model, derive and validate the control objectives and controls that you need to apply to your workload.
Ongoing validation of control objectives and controls help you measure the effectiveness of risk mitigation.
- Identify compliance requirements: Discover the organizational, legal, and compliance requirements that your workload must comply with.
- Identify AWS compliance resources: Identify resources that AWS has available to assist you with compliance.
- AWS Compliance website
To help you define and implement appropriate controls, recognize attack vectors by staying up to date with the latest security threats.
Consume AWS Managed Services to make it easier to receive notification of unexpected or unusual behavior in your AWS accounts. Investigate using AWS Partner tools or third-party threat information feeds as part of your security information flow.
- Subscribe to threat intelligence sources. Regularly review threat intelligence information from multiple sources that are relevant to the technologies used in your workload.
- Consider AWS Shield Advanced service: It provides near real-time visibility into intelligence sources, if your workload is internet accessible.
- AWS Shield
Stay up-to-date with both AWS and industry security recommendations to evolve the security posture of your workload.
AWS Security Bulletins contain important information about security and privacy notifications.
- Follow AWS updates: Subscribe or regularly check for new recommendations, tips and tricks.
- Subscribe to industry news: Regularly review news feeds from multiple sources that are relevant to the technologies that are used in your workload.
- AWS security blog
Establish secure baselines and templates for security mechanisms that are tested and validated as part of your build, pipelines, and processes.
Use tools and automation to test and validate all security controls continuously.
- Automate configuration management: Enforce and validate secure configurations automatically by using a configuration management service or tool.
- AWS Systems Manager
- AWS CloudFormation/CloudFormation Guard
- AWS Config
- AWS CodePipeline (CodeCommit + CodeDeploy)
Use a threat model to identify and maintain an up-to-date register of potential threats. Prioritize your threats and adapt your security controls to prevent, detect, and respond.
Revisit and maintain this in the context of the evolving security landscape.
- Create a threat model: A threat model can help you identify and address potential security threats.
- AWS security bulletins website
Evaluate and implement security services and features from AWS and AWS Partners that allow you to evolve the security posture of your workload.
The AWS Security Blog highlights new AWS services and features, implementation guides, and general security guidance.
What's New with AWS? is a great way to stay up to date with all new AWS features, services, and announcements.
- Plan regular reviews: Create a calendar of review activities that includes compliance requirements, evaluation of new AWS security features and services, and staying up-to-date with industry news.
- Discover AWS services and features: Discover the security features that are available for the services that you are using, and review new features as they are released.
- Define processes for onboarding of new AWS services. Include how you evaluate new AWS services for functionality, and the compliance requirements for your workload.
- Test new services and features as they are released in a non-production environment that closely replicates your production one.
- Implement other defense mechanisms: Implement automated mechanisms to defend your workload, explore the options available.
- AWS Security blog
- AWS security bulletins website
Identity and access management
To use AWS services, you must grant your users and applications access to resources in your AWS accounts.
As you run more workloads on AWS, you need robust identity management and permissions in place to ensure that the right people have access to the right resources under the right conditions.
AWS offers a large selection of capabilities to help you manage your human and machine identities and their permissions.
Identity management
There are two types of identities you need to manage when operating secure AWS workloads: human identities and machine identities.
Human identities: The administrators, developers, operators, and consumers of your applications require an identity to access your AWS environments and applications.
Machine identities: Your workload applications, operational tools, and components require an identity to make requests to AWS services, for example, to read data. These identities include machines running in your AWS environment, such as Amazon EC2 instances or AWS Lambda functions.
Best Practices
Enforce minimum password length, and educate your users to avoid common or reused passwords.
Enforce multi-factor authentication (MFA) with software or hardware mechanisms to provide an additional layer of verification
- Create an AWS Identity and Access Management (IAM) policy to enforce MFA sign-in; a minimal example follows the service list below.
- Enable MFA in your identity provider.
- Configure a strong password policy.
- Rotate credentials regularly.
- AWS IAM Identity Center
- AWS Secrets Manager
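A minimal sketch of an identity-based policy that denies actions (other than MFA self-management) when the request was not MFA-authenticated, created with boto3. The policy name is a placeholder and the statement is simplified compared with AWS's published example.

```python
import json
import boto3

iam = boto3.client("iam")

# Deny everything except MFA setup when the caller did not sign in with MFA.
enforce_mfa = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAllExceptMFASetupWithoutMFA",
        "Effect": "Deny",
        "NotAction": [
            "iam:CreateVirtualMFADevice",
            "iam:EnableMFADevice",
            "iam:ListMFADevices",
            "sts:GetSessionToken",
        ],
        "Resource": "*",
        "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
    }],
}

iam.create_policy(
    PolicyName="require-mfa-for-console-and-api",   # placeholder name
    PolicyDocument=json.dumps(enforce_mfa),
)
```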
For human identities using the AWS Management Console, require users to acquire temporary credentials and federate into AWS. You can do this using the AWS IAM Identity Center user portal.
For users requiring CLI access, ensure that they use AWS CLI v2, which supports direct integration with IAM Identity Center.
For machine identities, you should rely on IAM roles to grant access to AWS.
- Audit and rotate credentials periodically: Periodic validation, preferably through an automated tool, is necessary to verify that the correct controls are enforced.
- Store and use secrets securely: For credentials that are not IAM-related and cannot take advantage of temporary credentials, such as database logins, use a service that is designed to handle management of secrets, such as Secrets Manager.
- Implement least privilege policies: Assign access policies with least privilege to IAM groups and roles to reflect the user's role or function that you have defined.
- Remove unnecessary permissions: Implement least privilege by removing permissions that are unnecessary.
- Consider permissions boundaries: A permissions boundary is an advanced feature for using a managed policy that sets the maximum permissions that an identity-based policy can grant to an IAM entity.
- Consider resource tags for permissions: You can use tags to control access to your AWS resources that support tagging. You can also tag IAM users and roles to control what they can access.
- AWS IAM Identity Center
- AWS Secrets Manager
- AWS Cognito
For workforce and machine identities that require secrets such as passwords to third-party applications, store them with automatic rotation.
Secrets Manager makes it easy to manage, rotate, and securely store encrypted secrets using supported services; a short example follows the service list below.
- Use AWS Secrets Manager: AWS Secrets Manager is an AWS service that makes it easier for you to manage secrets.
- Secrets can be database credentials, passwords, third-party API keys, and even arbitrary text.
- AWS Secrets Manager
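A minimal sketch, using boto3, of storing and later retrieving a database credential with Secrets Manager. The secret name and values are placeholders; automatic rotation would typically be configured separately with a rotation Lambda function.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Store a database credential instead of embedding it in application config.
secrets.create_secret(
    Name="prod/mobile-app/db",                               # placeholder secret name
    SecretString=json.dumps({"username": "app_user", "password": "example-only"}),
)

# At runtime the application fetches the secret; with rotation enabled,
# Secrets Manager keeps this value current without a redeploy.
value = json.loads(secrets.get_secret_value(SecretId="prod/mobile-app/db")["SecretString"])
```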
For workforce identities, rely on an identity provider that enables you to manage identities in a centralized place.
This makes it easier to manage access across multiple applications and services, because you are creating, managing, and revoking access from a single location.
- Centralize administrative access: Create an Identity and Access Management (IAM) identity provider entity to establish a trusted relationship between your AWS account and your identity provider (IdP).
- Centralize application access: Consider Amazon Cognito for centralizing application access. It lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily.
- Remove old IAM users and groups: After you start using an identity provider (IdP), remove IAM users and groups that are no longer required.
- AWS IAM Identity Center
When you cannot rely on temporary credentials and require long-term credentials, audit credentials to ensure that the defined controls (for example, multi-factor authentication (MFA)) are enforced, that credentials are rotated regularly, and that they have the appropriate access level.
- Regularly audit credentials: Use credential reports and AWS Identity and Access Management (IAM) Access Analyzer to audit IAM credentials and permissions.
- Use Access Levels to Review IAM Permissions: To improve the security of your AWS account, regularly review and monitor each of your IAM policies.
- Consider automating IAM resource creation and updates: AWS CloudFormation can be used to automate the deployment of IAM resources, including roles and policies, to reduce human error because the templates can be verified and version controlled.
- AWS IAM Access Analyzer
As the number of users you manage grows, you will need to determine ways to organize them so that you can manage them at scale.
Place users with common security requirements in groups defined by your identity provider, and put mechanisms in place to ensure that user attributes that may be used for access control (for example, department or location) are correct and updated.
Use these groups and attributes to control access, rather than individual users.
- If you are using AWS IAM Identity Center (successor to AWS Single Sign-On) (IAM Identity Center), configure groups: IAM Identity Center provides you with the ability to configure groups of users, and assign groups the desired level of permission.
- Learn about attribute-based access control (ABAC): ABAC is an authorization strategy that defines permissions based on attributes.
- AWS IAM Identity Center
- AWS Attribute-based access control (ABAC)
- AWS Secrets management
Permissions management
Manage permissions to control access to human and machine identities that require access to AWS and your workloads
Permissions control who can access what, and under what conditions. Set permissions to specific human and machine identities to grant access to specific service actions on specific resources.
There are a number of ways to grant access to different types of resources. One way is by using different policy types.
Identity-based policies in IAM attach to IAM identities, including users, groups, or roles. These policies let you specify what that identity can do (its permissions).
Best Practices
Each component or resource of your workload needs to be accessed by administrators, end users, or other components.
Have a clear definition of who or what should have access to each component, choose the appropriate identity type and method of authentication and authorization.
- Have a clear definition of who or what should have access to each component, choose the appropriate identity type and method of authentication and authorization.
- Regular access to AWS accounts within the organization should be provided using federated access or a centralized identity provider.
- When defining access requirements for non-human identities, determine which applications and components need access and how permissions are granted. Using IAM roles built with the least privilege access model is a recommended approach.
- AWS services, such as AWS Secrets Manager and AWS Systems Manager Parameter Store, can help decouple secrets from the application or workload securely in cases where it's not feasible to use IAM roles.
- AWS IAM Identity Center
- AWS Attribute-based access control (ABAC)
- AWS IAM Roles anywhere
- AWS IAM Policies
Grant only the access that identities require by allowing access to specific actions on specific AWS resources under specific conditions.
Rely on groups and identity attributes to dynamically set permissions at scale, rather than defining permissions for individual users.
- Establishing a principle of least privilege ensures that identities are only permitted to perform the most minimal set of functions necessary to fulfill a specific task, while balancing usability and efficiency.
- Use policies to explicitly grant permissions attached to IAM or resource entities, such as an IAM role used by federated identities or machines, or resources.
- There are several AWS capabilities to help you scale permission management and adhere to the principle of least privilege. Attribute-based access control (ABAC) lets you make authorization decisions based on the tags applied to a resource and to the calling IAM principal; a minimal policy sketch follows the service list below.
- IAM Access Analyzer
- IAM Policy Simulator
- AWS Control Tower (GuardRails)
- AWS Verified Access (zero trust)
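A minimal ABAC sketch: an identity-based policy that allows EC2 start/stop actions only when the instance's project tag matches the project tag on the calling principal. The tag key, actions, and policy name are illustrative assumptions.

```python
import json
import boto3

iam = boto3.client("iam")

# Allow start/stop of EC2 instances only when the instance's "project" tag
# matches the "project" tag on the calling principal (ABAC).
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/project": "${aws:PrincipalTag/project}"}
        },
    }],
}

iam.create_policy(
    PolicyName="abac-project-ec2",                 # placeholder name
    PolicyDocument=json.dumps(abac_policy),
)
```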
Establish a process that allows emergency access to your workload in the unlikely event of an issue with an automated process or pipeline.
This will help you rely on least privilege access, but ensure users can obtain the right level of access when they require it.
- Establishing emergency access can take several forms for which you should be prepared. The first is a failure of your primary identity provider. In this case, you should rely on a second method of access with the required permissions to recover. This method could be a backup identity provider or an IAM user.
- You should also be prepared for emergency access where temporary elevated administrative access is needed.
As teams and workloads determine what access they need, remove permissions they no longer use and establish review processes to achieve least privilege permissions.
Continuously monitor and reduce unused identities and permissions.
- Configure AWS Identity and Access Management (IAM) Access Analyzer: AWS IAM Access Analyzer helps you identify the resources in your organization and accounts, such as Amazon Simple Storage Service (Amazon S3) buckets or IAM roles, that are shared with an external entity.
- AWS IAM Access Analyzer
Establish common controls that restrict access to all identities in your organization
For example, you can restrict access to specific AWS Regions, or prevent your operators from deleting common resources, such as an IAM role used for your central security team.
- As you grow and manage additional workloads in AWS, you should separate these workloads using accounts and manage those accounts using AWS Organizations.
- We recommend that you establish common permission guardrails that restrict access to all identities in your organization.
- You can get started by implementing example service control policies, such as preventing users from disabling key services (a sketch follows the service list below).
- We recommend you avoid running workloads in your management account. The management account should be used to govern and deploy security guardrails that will affect member accounts.
- Using a multi-account strategy allows you to have greater flexibility in applying guardrails to your workloads.
- AWS Organizations
- AWS Service Control Policies
- AWS Control Tower
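A sketch, using boto3, of a permission guardrail implemented as a service control policy that prevents member-account identities from disabling key audit services. The policy name and the exact action list are assumptions to adapt to your organization.

```python
import json
import boto3

org = boto3.client("organizations")

# Example guardrail: deny actions that would disable audit and detection services.
guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": [
            "cloudtrail:StopLogging",
            "cloudtrail:DeleteTrail",
            "config:StopConfigurationRecorder",
            "guardduty:DeleteDetector",
        ],
        "Resource": "*",
    }],
}

org.create_policy(
    Name="protect-audit-services",                 # placeholder name
    Description="Prevent disabling CloudTrail, Config, and GuardDuty",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(guardrail),
)
```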
Integrate access controls with operator and application lifecycle and your centralized federation provider.
For example, remove a user’s access when they leave the organization or change roles.
- Implement a user access lifecycle policy for new users joining, job function changes, and users leaving so that only current users have access.
- AWS IAM Access Analyzer
- AWS Attribute-based access control (ABAC)
Continuously monitor findings that highlight public and cross-account access.
Reduce public access and cross-account access to only resources that require this type of access.
- Consider configuring IAM Access Analyzer with AWS Organizations to verify you have visibility through all your accounts.
- You can also use AWS Config to report and remediate resources for any accidental public access configuration, through AWS Config policy checks. Services like AWS Control Tower and AWS Security Hub simplify deploying checks and guardrails across an organization to identify and remediate publicly exposed resources.
- AWS IAM Access Analyzer
- AWS Control Tower (Guardrails)
- AWS Config (managed rules)
- AWS Trusted Advisor
Govern the consumption of shared resources across accounts or within your AWS Organizations.
Monitor shared resources and review shared resource access.
- Govern the consumption of shared resources across accounts or within your AWS Organizations. Monitor shared resources and review shared resource access.
- AWS Resource Access Manager
- VPC endpoints
Detection
Detection enables you to identify a potential security misconfiguration, threat, or unexpected behavior. It’s an essential part of the security lifecycle and can be used to support a quality process, a legal or compliance obligation, and for threat identification and response efforts.
Detection
Detection consists of two parts: detection of unexpected or unwanted configuration changes, and the detection of unexpected behavior
Detection enables you to identify a potential security misconfiguration, threat, or unexpected behavior.
It’s an essential part of the security lifecycle and can be used to support a quality process, a legal or compliance obligation, and for threat identification and response efforts.
Best Practices
Configure logging throughout the workload, including application logs, resource logs, and AWS service logs.
A foundational practice is to establish a set of detection mechanisms at the account level. This base set of mechanisms is aimed at recording and detecting a wide range of actions on all resources in your account.
- Enable logging of AWS services.
- Evaluate and enable logging of operating systems and application-specific logs to detect suspicious behavior.
- Apply appropriate controls to the logs: Logs can contain sensitive information and only authorized users should have access.
- Configure Amazon GuardDuty (an enablement sketch follows the service list below).
- Configure a customized trail in CloudTrail.
- Enable AWS Config.
- Enable AWS Security Hub.
- AWS CloudWatch
- AWS CloudTrail
- AWS EventBridge
- AWS Config
- AWS Security Hub
- AWS GuardDuty
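A minimal sketch, using boto3, of turning on GuardDuty and creating a multi-Region CloudTrail trail. The trail name and bucket name are placeholders, and the bucket must already exist with a bucket policy that allows CloudTrail to write to it.

```python
import boto3

guardduty = boto3.client("guardduty")
cloudtrail = boto3.client("cloudtrail")

# Turn on GuardDuty threat detection in this account and Region.
guardduty.create_detector(Enable=True)

# Record API activity across all Regions into an existing, correctly
# permissioned S3 bucket (names are placeholders).
cloudtrail.create_trail(
    Name="org-audit-trail",
    S3BucketName="example-cloudtrail-logs-111122223333",
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="org-audit-trail")
```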
Security operations teams rely on the collection of logs and the use of search tools to discover potential events of interest, which might indicate unauthorized activity or unintentional change.
A best practice for building a mature security operations team is to deeply integrate the flow of security events and findings into a notification and workflow system such as a ticketing system, a bug or issue system, or other security information and event management (SIEM) system.
- Evaluate log processing capabilities: Evaluate the options that are available for processing logs.
- As a start for analyzing CloudTrail logs, test Amazon Athena.
- Implement centralized logging in AWS: See the following AWS example solution to centralize logging from multiple sources.
- Implement centralized logging with a partner: APN Partners have solutions to help you analyze logs centrally.
- AWS Security Hub
- AWS CloudWatch
- AWS EventBridge
Using automation to investigate and remediate events reduces human effort and error, and enables you to scale investigation capabilities.
In AWS, investigating events of interest and information on potentially unexpected changes into an automated workflow can be achieved using Amazon EventBridge.
Detecting change and routing this information to the correct workflow can also be accomplished using AWS Config Rules and Conformance Packs.
- Implement automated alerting with GuardDuty.
- Develop automated processes that investigate an event and report information to an administrator to save time.
- AWS CloudWatch
- AWS EventBridge
- AWS Security Hub
Create alerts that are sent to and can be actioned by your team. Ensure that alerts include relevant information for the team to take action.
For each detective mechanism you have, you should also have a process, in the form of a runbook or playbook, to investigate
- Discover metrics available for AWS services: Discover the metrics that are available through Amazon CloudWatch for the services that you are using.
- Configure Amazon CloudWatch alarms (a short example follows the service list below).
- AWS CloudWatch
- AWS EventBridge
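A minimal sketch, using boto3, of a CloudWatch alarm on a service metric that notifies an SNS topic so the team can act on it. The metric, dimensions, thresholds, and topic ARN are illustrative placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the load balancer returns too many 5XX errors,
# and notify the on-call SNS topic (ARN is a placeholder).
cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],  # placeholder
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder
)
```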
Infrastructure protection
Infrastructure protection encompasses control methodologies, such as defense in depth, that are necessary to meet best practices and organizational or regulatory obligations. Use of these methodologies is critical for successful, ongoing operations in the cloud.
Protecting networks
Users, both in your workforce and your customers, can be located anywhere. You need to pivot from traditional models of trusting anyone and anything that has access to your network
When you follow the principle of applying security at all layers, you employ a Zero Trust approach.
Zero Trust security is a model where application components or microservices are considered discrete from each other and no component or microservice trusts any other.
Best Practices
Group components that share reachability requirements into layers.
For example, a database cluster in a virtual private cloud (VPC) with no need for internet access should be placed in subnets with no route to or from the internet.
For network connectivity that can include thousands of VPCs, AWS accounts, and on-premises networks, you should use AWS Transit Gateway. It acts as a hub that controls how traffic is routed among all the connected networks, which act like spokes.
- Create subnets in VPC: Create subnets for each layer (in groups that include multiple Availability Zones), and associate route tables to control routing.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
When architecting your network topology, you should examine the connectivity requirements of each component.
For example, determine whether a component requires internet accessibility (inbound and outbound), connectivity to VPCs, edge services, or external data centers.
- Control network traffic in a VPC: Implement VPC best practices to control traffic.
- Control traffic at the edge: Implement edge services, such as Amazon CloudFront, to provide an additional layer of protection and other features.
- Control private network traffic: Implement services that protect your private traffic for your workload.
- AWS Firewall Manager
- AWS Inspector
- AWS CloudFront
- AWS Global Accelerator
- AWS WAF
- AWS Route53
- AWS VPC Peering
- AWS Private Link
- AWS Transit Gateway
- AWS DirectConnect
- AWS Site-to-Site VPN
- AWS Client VPN
Automate protection mechanisms to provide a self-defending network based on threat intelligence and anomaly detection.
For example, intrusion detection and prevention tools that can adapt to current threats and reduce their impact.
A web application firewall is an example of where you can automate network protection, to automatically block requests originating from IP addresses associated with known threat actors
- Automate protection for web-based traffic: AWS offers a solution that uses AWS CloudFormation to automatically deploy a set of AWS WAF rules designed to filter common web-based attacks.
- Consider AWS Partner solutions: AWS Partners offer hundreds of industry-leading products that are equivalent, identical to, or integrate with existing controls in your on-premises environments.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
Inspect and filter your traffic at each layer. You can inspect your VPC configurations for potential unintended access using VPC Network Access Analyzer.
For components transacting over HTTP-based protocols, a web application firewall can help protect from common attacks.
- Configure Amazon GuardDuty: GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads.
- Configure virtual private cloud (VPC) Flow Logs: VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC (a short example follows the service list below).
- Consider VPC traffic mirroring, an Amazon VPC feature that you can use to copy network traffic from an elastic network interface of Amazon Elastic Compute Cloud (Amazon EC2) instances and then send it to out-of-band security and monitoring appliances for content inspection, threat monitoring, and troubleshooting.
- AWS Firewall Manager
- AWS Inspector
- AWS WAF
- AWS Transit Gateway
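A minimal sketch, using boto3, of enabling VPC Flow Logs delivered to CloudWatch Logs. The VPC ID, log group, and IAM role ARN are placeholders, and the role must already grant CloudWatch Logs delivery permissions.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture accepted and rejected traffic for the whole VPC into CloudWatch Logs.
ec2.create_flow_logs(
    ResourceIds=["vpc-0abc1234def567890"],                     # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs/prod",                        # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::111122223333:role/vpc-flow-logs",  # placeholder role
)
```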
Protecting compute
Compute resources include EC2 instances, containers, AWS Lambda functions, database services, IoT devices, and more
Each of these compute resource types requires a different approach to securing it.
However, they do share common strategies that you need to consider: defense in depth, vulnerability management, reduction in attack surface, automation of configuration and operation, and performing actions at a distance.
Best Practices
Frequently scan and patch for vulnerabilities in your code, dependencies, and in your infrastructure to help protect against new threats.
You are responsible for patch management for your AWS resources, including Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Machine Images (AMIs), and many other compute resources.
- Configure Amazon Inspector: Amazon Inspector tests the network accessibility of your Amazon Elastic Compute Cloud (Amazon EC2) instances and the security state of the applications that run on those instances.
- Scan source code: Scan libraries and dependencies for vulnerabilities.
- AWS CloudFormation
- AWS CloudFormation Guard
- AWS CodePipeline
- AWS CodeGuru
- AWS Systems Manager
- AWS WAF
Reduce your exposure to unintended access by hardening operating systems and minimizing the components, libraries, and externally consumable services in use.
Start by reducing unused components.
You can find many hardening and security configuration guides for common operating systems and server software. For example, you can start with the Center for Internet Security and iterate.
- Harden operating system: Configure operating systems to meet best practices.
- Harden containerized resources: Configure containerized resources to meet security best practices.
- Implement AWS Lambda best practices.
- AWS Systems Manager
Implement services that manage resources, such as Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Amazon Elastic Container Service (Amazon ECS), to reduce your security maintenance tasks as part of the shared responsibility model.
For example, Amazon RDS helps you set up, operate, and scale a relational database, and automates administration tasks such as hardware provisioning, database setup, patching, and backups.
- Explore available services: Explore, test, and implement services that manage resources, such as Amazon RDS, AWS Lambda, and Amazon ECS.
- AWS Systems Manager
Automate your protective compute mechanisms including vulnerability management, reduction in attack surface, and management of resources.
The automation will help you invest time in securing other aspects of your workload, and reduce the risk of human error.
- Automate configuration management.
- Automate patching of Amazon Elastic Compute Cloud (Amazon EC2) instances.
- Implement intrusion detection and prevention.
- Consider AWS Partner solutions.
- AWS CloudFormation
- AWS Systems Manager Automation
- AWS Systems Manager Patch Manager
Removing the ability for interactive access reduces the risk of human error, and the potential for manual configuration or management.
For example, use a change management workflow to deploy Amazon Elastic Compute Cloud (Amazon EC2) instances using infrastructure-as-code, then manage Amazon EC2 instances using tools such as AWS Systems Manager instead of allowing direct access or through a bastion host.
AWS Systems Manager can automate a variety of maintenance and deployment tasks, using features including automation workflows, documents (playbooks), and the run command.
- Replace console access: Replace console access (SSH or RDP) to instances with AWS Systems Manager Run Command to automate management tasks; a short example follows below.
- AWS Systems Manager Run Command
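A minimal sketch, using boto3, of running an ad-hoc maintenance command through Systems Manager Run Command instead of SSH or a bastion host. The instance ID, command, and comment are placeholders; the instance must be running the SSM agent with an instance profile that allows Systems Manager.

```python
import boto3

ssm = boto3.client("ssm")

# Run an ad-hoc maintenance command without interactive access to the instance.
result = ssm.send_command(
    InstanceIds=["i-0abcd1234efgh5678"],                       # placeholder instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["sudo systemctl restart nginx"]},
    Comment="Restart web server via change ticket CHG-1234",   # placeholder comment
)
print(result["Command"]["CommandId"])
```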
Implement mechanisms (for example, code signing) to validate that the software, code and libraries used in the workload are from trusted sources and have not been tampered with.
For example, you should verify the code signing certificate of binaries and scripts to confirm the author, and ensure it has not been tampered with since created by the author
- Investigate mechanisms: Code signing is one mechanism that can be used to validate software integrity.
- AWS Signer
Data protection
Data classification provides a way to categorize data based on levels of sensitivity, and encryption protects data by way of rendering it unintelligible to unauthorized access.
These methods are important because they support objectives such as preventing mishandling or complying with regulatory obligations.
In AWS, there are a number of different approaches you can use when addressing data protection. The following best practices describe how to use these approaches.
Data classification
Data classification provides a way to categorize organizational data based on criticality and sensitivity in order to help you determine appropriate protection and retention controls.
Best Practices
You need to understand the type and classification of data your workload is processing, the associated business processes, data owner, applicable legal and compliance requirements, where it’s stored, and the resulting controls that are needed to be enforced.
This may include classifications to indicate if the data is intended to be publicly available, if the data is internal use only such as customer personally identifiable information (PII), or if the data is for more restricted access such as intellectual property, legally privileged or marked sensitive, and more.
By carefully managing an appropriate data classification system, along with each workload’s level of protection requirements, you can map the controls and level of access or protection appropriate for the data.
- Consider discovering data using Amazon Macie: Macie recognizes sensitive data such as personally identifiable information (PII) or intellectual property.
- AWS Macie
Protect data according to its classification level. For example, secure data classified as public by using relevant recommendations while protecting sensitive data with additional controls.
By using resource tags, separate AWS accounts per sensitivity (and potentially also for each caveat, enclave, or community of interest), IAM policies, AWS Organizations SCPs, AWS Key Management Service (AWS KMS), and AWS CloudHSM, you can define and implement your policies for data classification and protection with encryption.
- Define your data identification and classification schema.
- Discover available AWS controls.
- Identify AWS compliance resources.
- AWS Macie
- AWS Compliance website
Automating the identification and classification of data can help you implement the correct controls.
Using automation for this instead of direct access from a person reduces the risk of human error and exposure.
You should evaluate using a tool, such as Amazon Macie, that uses machine learning to automatically discover, classify, and protect sensitive data in AWS.
- Use Amazon Simple Storage Service (Amazon S3) Inventory: Amazon S3 inventory is one of the tools you can use to audit and report on the replication and encryption status of your objects.
- Consider Amazon Macie: Amazon Macie uses machine learning to automatically discover and classify data stored in Amazon S3.
- AWS Macie
Your defined lifecycle strategy should be based on sensitivity level as well as legal and organization requirements.
Aspects including the duration for which you retain data, data destruction processes, data access management, data transformation, and data sharing should be considered.
When choosing a data classification methodology, balance usability versus access. You should also accommodate the multiple levels of access and nuances for implementing a secure, but still usable, approach for each level.
- Identify data types: Identify the types of data that you are storing or processing in your workload. That data could be text, images, binary databases, and so forth.
- AWS Macie
Protecting data at rest
Data at rest represents any data that you persist in non-volatile storage for any duration in your workload
This includes block storage, object storage, databases, archives, IoT devices, and any other storage medium on which data is persisted.
Protecting your data at rest reduces the risk of unauthorized access, when encryption and appropriate access controls are implemented.
Best Practices
By defining an encryption approach that includes the storage, rotation, and access control of keys, you can help provide protection for your content against unauthorized users and against unnecessary exposure to authorized users.
AWS Key Management Service (AWS KMS) helps you manage encryption keys and integrates with many AWS services. This service provides durable, secure, and redundant storage for your AWS KMS keys.
- Implement AWS KMS: AWS KMS makes it easy for you to create and manage keys and control the use of encryption across a wide range of AWS services and in your applications.
- Consider AWS Encryption SDK: Use the AWS Encryption SDK with AWS KMS integration when your application needs to encrypt data client-side.
- AWS KMS
- AWS S3 Encryption
You should ensure that the only way to store data is by using encryption.
AWS Key Management Service (AWS KMS) integrates seamlessly with many AWS services to make it easier for you to encrypt all your data at rest.
- Enforce encryption at rest for Amazon Simple Storage Service (Amazon S3): Implement Amazon S3 bucket default encryption (a short example follows the service list below).
- Use AWS Secrets Manager.
- Configure default encryption for new EBS volumes.
- Configure encrypted Amazon Machine Images (AMIs).
- Configure Amazon Relational Database Service (Amazon RDS) encryption.
- Configure encryption in additional AWS services. For the AWS services you use, determine the encryption capabilities.
- AWS KMS
- AWS Secrets Manager
- AWS Encryption SDK
- AWS RDS Encryption
- AWS EBS Encryption
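A minimal sketch, using boto3, of setting SSE-KMS as the default encryption for an S3 bucket. The bucket name and KMS key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Make SSE-KMS the default for every new object written to the bucket.
s3.put_bucket_encryption(
    Bucket="example-mobile-app-data",                          # placeholder bucket
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",  # placeholder key
            },
            "BucketKeyEnabled": True,
        }]
    },
)
```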
Use automated tools to validate and enforce data at rest controls continuously, for example, verify that there are only encrypted storage resources.
You can automate validation that all EBS volumes are encrypted by using AWS Config rules; a short example follows the service list below.
AWS Security Hub can also verify several different controls through automated checks against security standards. Additionally, your AWS Config Rules can automatically remediate noncompliant resources.
- Enforce encryption at rest: You should ensure that the only way to store data is by using encryption.
- AWS Config (Rules)
- AWS Encryption SDK
- AWS Security Hub
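A minimal sketch, using boto3, of deploying the AWS Config managed rule ENCRYPTED_VOLUMES, which flags unencrypted EBS volumes. It assumes a Config configuration recorder is already running in the account; the rule name is a placeholder.

```python
import boto3

config = boto3.client("config")

# Flag any EBS volume that is not encrypted (assumes the Config recorder is already on).
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "ebs-volumes-encrypted",             # placeholder rule name
        "Source": {"Owner": "AWS", "SourceIdentifier": "ENCRYPTED_VOLUMES"},
        "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Volume"]},
    }
)
```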
Enforce access control with least privileges and mechanisms, including backups, isolation, and versioning, to help protect your data at rest. Prevent operators from granting public access to your data.
Different controls including access (using least privilege), backups (see Reliability whitepaper), isolation, and versioning can all help protect your data at rest.
- Enforce access control with least privileges, including access to encryption keys.
- Separate data based on different classification levels.
- Review AWS KMS policies.
- Review Amazon S3 bucket and object permissions.
- Enable Amazon S3 versioning and object lock.
- Amazon S3 inventory is one of the tools you can use to audit and report on the replication and encryption status of your objects.
- Review Amazon EBS and AMI sharing permissions: Sharing permissions can allow images and volumes to be shared to AWS accounts external to your workload.
- AWS Organizations
- AWS KMS
- AWS Config Rules
Keep all users away from directly accessing sensitive data and systems under normal operational circumstances.
For example, use a change management workflow to manage Amazon Elastic Compute Cloud (Amazon EC2) instances using tools instead of allowing direct access or a bastion host. This can be achieved using AWS Systems Manager Automation.
- Implement mechanisms to keep people away from data: Mechanisms include using dashboards, such as Amazon QuickSight, to display data to users instead of directly querying.
- Automate configuration management: Perform actions at a distance, enforce and validate secure configurations automatically by using a configuration management service or tool.
- Avoid use of bastion hosts or directly accessing EC2 instances.
- AWS KMS
- AWS Systems Manager
- AWS QuickSight
- AWS CloudFormation
Protecting data in transit
Data in transit is any data that is sent from one system to another. This includes communication between resources within your workload as well as communication between other services and your end users
By providing the appropriate level of protection for your data in transit, you protect the confidentiality and integrity of your workload’s data.
Best Practices
Store encryption keys and certificates securely and rotate them at appropriate time intervals with strict access control.
The best way to accomplish this is to use a managed service, such as AWS Certificate Manager (ACM); a short example follows the service list below.
- Implement your defined secure key and certificate management solution.
- Use secure protocols that offer authentication and confidentiality, such as Transport Layer Security (TLS) or IPsec, to reduce the risk of data tampering or loss.
- AWS Certificate Manager (ACM)
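A minimal sketch, using boto3, of requesting a public TLS certificate with DNS validation through ACM. The domain names are placeholders; once the validation CNAME record is published (for example, in Route 53), ACM issues and renews the certificate automatically.

```python
import boto3

acm = boto3.client("acm")

# Request a public certificate with DNS validation (domains are placeholders).
response = acm.request_certificate(
    DomainName="api.example.com",
    SubjectAlternativeNames=["*.example.com"],
    ValidationMethod="DNS",
)
print(response["CertificateArn"])
```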
Enforce your defined encryption requirements based on appropriate standards and recommendations to help you meet your organizational, legal, and compliance requirements.
AWS services provide HTTPS endpoints using TLS for communication, thus providing encryption in transit when communicating with the AWS APIs.
- Enforce encryption in transit.
- Use a VPN / IPsec for external connectivity.
- For the AWS services you use, determine the encryption-in-transit capabilities.
- AWS CloudFront
- AWS LoadBalancer
Use tools such as Amazon GuardDuty to automatically detect suspicious activity or attempts to move data outside of defined boundaries.
- Use a tool or detection mechanism to automatically detect attempts to move data outside of defined boundaries, for example, to detect a database system that is copying data to an unrecognized host.
- AWS VPC Flow Logs
- AWS Macie
Verify the identity of communications by using protocols that support authentication, such as Transport Layer Security (TLS) or IPsec.
Using network protocols that support authentication allows trust to be established between the parties. This adds to the encryption used in the protocol to reduce the risk of communications being altered or intercepted.
- Implement secure protocols: Use secure protocols that offer authentication and confidentiality, such as TLS or IPsec, to reduce the risk of data tampering or loss.
- AWS VPN
Incident response
Even with mature preventive and detective controls, your organization should implement mechanisms to respond to and mitigate the potential impact of security incidents.
Putting in place the tools and access ahead of a security incident, then routinely practicing incident response through game days, helps ensure that you can recover while minimizing business disruption.
Prepare
During an incident, your incident response teams must have access to various tools and the workload resources involved in the incident
Make sure that your teams have appropriate pre-provisioned access to perform their duties before an event occurs. All tools, access, and plans should be documented and tested before an event occurs to make sure that they can provide a timely response.
Best Practices
Identify internal and external personnel, resources, and legal obligations that would help your organization respond to an incident.
When you define your approach to incident response in the cloud, in unison with other teams (such as your legal counsel, leadership, business stakeholders, AWS Support Services, and others), you must identify key personnel, stakeholders, and relevant contacts.
- Identify key personnel in your organization: Maintain a contact list of personnel within your organization that you would need to involve to respond to and recover from an incident.
- Identify external partners: Engage with external partners if necessary that can help you respond to and recover from an incident.
- AWS Security Incident Response Guide
Create plans to help you respond to, communicate during, and recover from an incident.
For example, you can start an incident response plan with the most likely scenarios for your workload and organization.
- Educate and train for incident response.
- Document the incident management plan.
- Categorize incidents.
- Standardize security controls.
- Use automation.
- Conduct root cause analysis and action lessons learned.
- AWS Security Incident Response Guide
It’s important for your incident responders to understand when and how the forensic investigation fits into your response plan.
Your organization should define what evidence is collected and what tools are used in the process. Identify and prepare forensic investigation capabilities that are suitable, including external specialists, tools, and automation.
- Identify forensic capabilities: Research your organization's forensic investigation capabilities, available tools, and external specialists.
- AWS Systems Manager
- AWS EventBridge
- AWS Lambda
Verify that incident responders have the correct access pre-provisioned in AWS to reduce the time needed for investigation through to recovery.
- AWS recommends reducing or eliminating reliance on long-lived credentials wherever possible, in favor of temporary credentials and just-in-time privilege escalation mechanisms.
- For most management tasks, as well as incident response tasks, we recommend you implement identity federation alongside temporary escalation for administrative access.
- We recommend the use of temporary privilege escalation in the majority of incident response scenarios.
- The correct way to do this is to use the AWS Security Token Service and session policies to scope access.
- AWS Systems Manager Incident Manager
- AWS IAM Access Analyzer
Ensure that security personnel have the right tools pre-deployed into AWS to reduce the time for investigation through to recovery.
To automate security engineering and operations functions, you can use a comprehensive set of APIs and tools from AWS. You can fully automate identity management, network security, data protection, and monitoring capabilities and deliver them using popular software development methods that you already have in place.
- Ensure that security personnel have the right tools pre-deployed in AWS so that an appropriate response can be made to an incident.
- Implement resource tagging.
- AWS Security Incident Response Guide
Simulate
Practice your incident management plans and procedures during a realistic scenario
The value derived from participating in a simulation activity increases an organization's effectiveness during stressful events.
Best Practices
Game days, also known as simulations or exercises, are internal events that provide a structured opportunity to practice your incident management plans and procedures during a realistic scenario
These events should exercise responders using the same tools and techniques that would be used in a real-world scenario - even mimicking real-world environments. Game days are fundamentally about being prepared and iteratively improving your response capabilities.
- Run game days: Run simulated incident response events (game days) for different threats that involve key staff and management.
- Capture lessons learned: Lessons learned from running game days should be part of a feedback loop to improve your processes.
- AWS Incident Response Guide
- AWS Elastic Disaster Recovery
Iterate
Automate containment and recovery of an incident to reduce response times and organizational impact.
Best Practices
Once you create and practice the processes and tools from your playbooks, you can deconstruct the logic into a code-based solution, which can be used as a tool by many responders to automate the response and remove variance or guess-work by your responders.
This can speed up the lifecycle of a response.
- Build automated containment capability.
- AWS Incident Response Guide
Reliability
The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to.
This includes the ability to operate and test the workload through its total lifecycle.
There are four best practice areas for Reliability in the cloud
Foundations
Foundational requirements are those whose scope extends beyond a single workload or project.
Before architecting any system, foundational requirements that influence reliability should be in place.
For example, you must have sufficient network bandwidth to your data center.
Manage service quotas and constraints
For cloud-based workload architectures, there are service quotas (which are also referred to as service limits)
These quotas exist to prevent accidentally provisioning more resources than you need and to limit request rates on API operations so as to protect services from abuse.
There are also resource constraints, for example, the rate that you can push bits down a fiber-optic cable, or the amount of storage on a physical disk.
Best Practices
You are aware of your default quotas and quota increase requests for your workload architecture.
You additionally know which resource constraints, such as disk or network, are potentially impactful.
Service Quotas is an AWS service that helps you manage your quotas for over 100 AWS services from one location.
- Review AWS service quotas in the published documentation and Service Quotas.
- Determine all the services your workload requires by looking at the deployment code.
- Use AWS Config to find all AWS resources used in your AWS accounts.
- You can also use your AWS CloudFormation templates to determine the AWS resources used.
- Determine the service quotas that apply. Use the programmatically accessible information via Trusted Advisor and Service Quotas.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
If you are using multiple AWS accounts or AWS Regions, ensure that you request the appropriate quotas in all environments in which your production workloads run.
Service quotas are tracked per account. Unless otherwise noted, each quota is AWS Region-specific.
In addition to the production environments, also manage quotas in all applicable non-production environments, so that testing and development are not hindered.
- Select relevant accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements.
- Identify service quotas across all relevant accounts, Regions, and Availability Zones. The limits are scoped to account and Region.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Be aware of unchangeable service quotas and physical resources, and architect to prevent these from impacting reliability.
Examples include network bandwidth, AWS Lambda payload size, throttle burst rate for API Gateway, and concurrent user connections to an Amazon Redshift cluster.
- Be aware of fixed service quotas and constraints, and architect around these.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Evaluate your potential usage and increase your quotas appropriately, allowing for planned growth in usage.
For supported services, you can manage your quotas by configuring CloudWatch alarms to monitor usage and alert you to approaching quotas.
These alarms can be triggered from Service Quotas or from Trusted Advisor.
You can also use metric filters on CloudWatch Logs to search and extract patterns in logs to determine if usage is approaching quota thresholds.
- Monitor and manage your quotas: Evaluate your potential usage on AWS, increase your Regional service quotas appropriately, and allow for planned growth in usage.
- Capture current resource consumption (for example, buckets, instances). Use service API operations, such as the Amazon EC2 DescribeInstances API, to collect current resource consumption.
- Capture your current quotas: Use AWS Service Quotas, AWS Trusted Advisor, and the AWS documentation.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Implement tools to alert you when thresholds are being approached.
You can automate quota increase requests by using AWS Service Quotas APIs.
If you integrate your Configuration Management Database (CMDB) or ticketing system with Service Quotas, you can automate the tracking of quota increase requests and current quotas.
- Set up automated monitoring: Implement tools using SDKs to alert you when thresholds are being approached.
- Use Service Quotas and augment the service with an automated quota monitoring solution, such as AWS Limit Monitor or an offering from AWS Marketplace.
- Set up triggered responses based on quota thresholds, using Amazon SNS and AWS Service Quotas APIs.
- Test automation
- APN Partner
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
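Where a quota is adjustable, the increase request itself can be automated through the Service Quotas API. The sketch below is illustrative only: the quota code shown is the EC2 On-Demand Standard vCPU quota, and the target value is an assumption to replace with the figure your monitoring indicates.

```python
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

# Request an increase for an adjustable quota when monitoring shows usage approaching the limit.
# Quota codes can be discovered with list_service_quotas; the code and value below are assumptions.
sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",   # Running On-Demand Standard instances (vCPU limit)
    DesiredValue=512.0,
)
```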
When a resource fails, it might still be counted against quotas until it’s successfully terminated.
Ensure that your quotas cover the overlap of all failed resources with replacements before the failed resources are terminated. You should consider an Availability Zone failure when calculating this gap.
- Ensure that there is enough gap between your service quota and your maximum usage to accommodate for a failover.
- Determine your service quotas, accounting for your deployment patterns, availability requirements, and consumption growth.
- Request quota increases if necessary. Plan for necessary time for quota increase requests to be fulfilled.
- AWS Marketplace: CMDB products
- AWS Service Quotas
- AWS Trusted Advisor
Plan your network topology
Workloads often exist in multiple environments. These include multiple cloud environments (both publicly accessible and private) and possibly your existing data center infrastructure
Plans must include network considerations, such as intrasystem and intersystem connectivity, public IP address management, private IP address management, and domain name resolution.
When architecting systems using IP address-based networks, you must plan network topology and addressing in anticipation of possible failures, and to accommodate future growth and integration with other systems and their networks.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a private, isolated section of the AWS Cloud where you can launch AWS resources in a virtual network.
Best Practices
These endpoints and the routing to them must be highly available.
To achieve this, use highly available DNS, content delivery networks (CDNs), API Gateway, load balancing, or reverse proxies.
Amazon Route 53, AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, and Elastic Load Balancing (ELB) all provide highly available public endpoints.
You might also choose to evaluate AWS Marketplace software appliances for load balancing and proxying.
- Ensure that you have highly available connectivity for users of the workload: Amazon Route 53, AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, and Elastic Load Balancing (ELB) all provide highly available public-facing endpoints.
- Ensure that you have a highly available connection to your users.
- Ensure that you are using a highly available DNS to manage the domain names of your application endpoints.
- Ensure that you are using a highly available reverse proxy or load balancer in front of your application.
- APN Partners
- AWS Direct Connect
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS Private Link
- AWS Global Accelerator
- AWS Multi Data Centre
- AWS VPN
Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks.
Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them.
- Ensure that you have highly available connectivity between AWS and on-premises environment.
- Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks.
- Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them.
- APN Partners
- AWS Direct Connect
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS Private Link
- AWS Global Accelerator
- AWS Multi Data Centre
- AWS VPN
Amazon VPC IP address ranges must be large enough to accommodate workload requirements, including factoring in future expansion and allocation of IP addresses to subnets across Availability Zones.
This includes load balancers, EC2 instances, and container-based applications.
- Plan your network to accommodate for growth, regulatory compliance, and integration with others.
- Select relevant AWS accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements.
- Identify your needs for regional VPC deployments.
- Identify the size of the VPCs.
- Determine if you need segregated networking for regulatory requirements.
- APN Partners
- AWS Marketplace for Network Infrastructure
- AWS VPC
If more than two network address spaces (for example, VPCs and on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model, like that provided by AWS Transit Gateway.
- Prefer hub-and-spoke topologies over many-to-many mesh.
- If more than two network address spaces (VPCs, on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model like that provided by AWS Transit Gateway.
- AWS Transit Gateway
- AWS VPC
- APN Partners
The IP address ranges of each of your VPCs must not overlap when peered or connected via VPN.
You must similarly avoid IP address conflicts between a VPC and on-premises environments or with other cloud providers that you use.
You must also have a way to allocate private IP address ranges when needed.
- Monitor and manage your CIDR use. Evaluate your potential usage on AWS, add CIDR ranges to existing VPCs, and create VPCs to allow planned growth in usage.
- Capture current CIDR consumption (for example, VPCs, subnets)
- Capture your current subnet usage.
- APN Partners
- AWS Marketplace for Network Infrastructure
- AWS VPC
- AWS VPC IPAM
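Before connecting VPCs and on-premises networks, planned CIDR ranges can be checked for overlaps with the Python standard library. A quick sketch; the ranges below are placeholders for your own allocations.

```python
from ipaddress import ip_network

# Hypothetical address plan; substitute the CIDR ranges you intend to allocate.
cidrs = {
    "vpc-prod": "10.0.0.0/16",
    "vpc-dev": "10.1.0.0/16",
    "on-premises": "192.168.0.0/16",
}

networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
names = list(networks)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if networks[a].overlaps(networks[b]):
            print(f"Overlapping ranges: {a} and {b}")
```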
Workload Architecture
A reliable workload starts with upfront design decisions for both software and infrastructure
Your architecture choices will impact your workload behavior across all six Well-Architected pillars.
For reliability, there are specific patterns you must follow
Design your workload service architecture
Build highly scalable and reliable workloads using a service-oriented architecture (SOA) or a microservices architecture
Service-oriented architecture (SOA) is the practice of making software components reusable via service interfaces.
Microservices architecture goes further to make components smaller and simpler.
Best Practices
Workload segmentation is important when determining the resilience requirements of your application. Monolithic architecture should be avoided whenever possible.
Instead, carefully consider which application components can be broken out into microservices. Depending on your application requirements, this may end up being a combination of a service-oriented architecture (SOA) with microservices where possible.
Workloads that can operate statelessly are better suited to being deployed as microservices.
- Choose your architecture type based on how you will segment your workload.
- Choose an SOA or microservices architecture (or in some rare cases, a monolithic architecture).
- AWS API Gateway
- AWS App Mesh
Service-oriented architecture (SOA) builds services with well-delineated functions defined by business needs.
Microservices use domain models and bounded context to limit this further so that each service does just one thing. Focusing on specific functionality enables you to differentiate the reliability requirements of different services, and target investments more specifically.
- Design your workload based on your business domains and their respective functionality. Focusing on specific functionality enables you to differentiate the reliability requirements of different services, and target investments more specifically.
- Decompose your services into the smallest possible components. With a microservices architecture you can separate your workload into components with the minimal functionality to enable organizational scaling and agility.
- AWS API Gateway
Service contracts are documented agreements between teams on service integration and include a machine-readable API definition, rate limits, and performance expectations.
A versioning strategy allows your clients to continue using the existing API and migrate their applications to the newer API when they are ready.
- Provide service contracts per API: Service contracts are documented agreements between teams on service integration and include a machine-readable API definition, rate limits, and performance expectations.
- AWS API Gateway
Design interactions in a distributed system to prevent failures
Distributed systems rely on communications networks to interconnect components, such as servers or services.
Your workload must operate reliably despite data loss or latency in these networks
Components of the distributed system must operate in a way that does not negatively impact other components or the workload.
These best practices prevent failures and improve mean time between failures (MTBF).
Best Practices
Hard real-time distributed systems require responses to be given synchronously and rapidly, while soft real-time systems have a more generous time window of minutes or more for response.
Offline systems handle responses through batch or asynchronous processing.
Hard real-time distributed systems have the most stringent reliability requirements.
- Identify which kind of distributed system is required. Challenges with distributed systems involve latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos.
- AWS EventBridge
- AWS SQS
- AWS Builders Library
Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled.
Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
- Implement loosely coupled dependencies. Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled.
- Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
- AWS EventBridge
- AWS SQS
- AWS Builders Library
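One common way to introduce loose coupling is to put a queue between producer and consumer, so a slow or failing consumer does not block the producer. A minimal sketch with Amazon SQS follows; the queue URL is a placeholder.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/orders"  # placeholder queue URL

# Producer: hand the work to the queue instead of calling the consumer directly.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"order_id": "1234"}))

# Consumer: poll, process, and delete the message only after successful processing.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    order = json.loads(message["Body"])   # replace with real processing
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```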
Systems can fail when there are large, rapid changes in load. For example, if your workload is doing a health check that monitors the health of thousands of servers, it should send the same size payload (a full snapshot of the current state) each time.
Whether no servers are failing, or all of them, the health check system is doing constant work with no large, rapid changes.
- Do constant work so that systems do not fail when there are large, rapid changes in load.
- AWS Builders Library
An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.
An idempotent service makes it easier for a client to implement retries without fear that a request will be erroneously processed multiple times.
- Make all responses idempotent.
- An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.
- AWS Builders Library
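A simple way to approximate idempotency is to record each request ID with a conditional write, so a retried request is detected rather than reprocessed. The sketch below assumes a hypothetical DynamoDB table keyed on request_id.

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb", region_name="us-east-1")
TABLE = "processed-requests"  # hypothetical table with partition key "request_id"

def handle_request(request_id: str) -> str:
    """Record the request ID with a conditional write so duplicates are detected, not reprocessed."""
    try:
        ddb.put_item(
            TableName=TABLE,
            Item={"request_id": {"S": request_id}},
            ConditionExpression="attribute_not_exists(request_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return "duplicate"   # already processed; return the original result instead of redoing work
        raise
    # ...perform the side effect exactly once here...
    return "processed"
```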
Design interactions in a distributed system to mitigate or withstand failures.
Distributed systems rely on communications networks to interconnect components (such as servers or services)
Your workload must operate reliably despite data loss or latency over these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload.
These best practices enable workloads to withstand stresses or failures, more quickly recover from them, and mitigate the impact of such impairments. The result is improved mean time to recovery (MTTR)
Best Practices
When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner.
For example, when a dependency call fails, failover to a predetermined static response.
- Implement graceful degradation to transform applicable hard dependencies into soft dependencies.
- When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. For example, when a dependency call fails, failover to a predetermined static response.
- AWS API Gateway (throttling)
- AWS Builders Library
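A minimal sketch of graceful degradation, assuming a hypothetical recommendations endpoint: when the dependency call fails or times out, the component returns a predetermined static response instead of failing the whole request.

```python
import urllib.request

FALLBACK_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]  # predetermined static response

def get_recommendations(user_id: str) -> list:
    """Call the recommendations dependency, degrading to a static list if it is unhealthy."""
    try:
        url = f"https://recs.example.com/users/{user_id}"   # hypothetical endpoint
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.read().decode().split(",")
    except OSError:            # covers URLError and timeouts
        return FALLBACK_RECOMMENDATIONS
```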
Throttling requests is a mitigation pattern to respond to an unexpected increase in demand.
Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled. The expectation on clients is that they will back off and abandon the request or try again at a slower rate.
- Throttle requests. This is a mitigation pattern to respond to an unexpected increase in demand.
- Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled.
- AWS API Gateway (throttling)
- AWS Builders Library
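Managed services such as API Gateway throttle for you, but the same idea can be applied inside a service. The sketch below is a simple token-bucket throttle: it permits a burst, then a steady rate, and callers that exceed it should be told to retry later (for example with an HTTP 429).

```python
import time

class TokenBucket:
    """Token-bucket throttle: allow a burst of requests, then a steady sustained rate."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reject the request and signal the client to back off

bucket = TokenBucket(rate_per_second=100, burst=200)
if not bucket.allow():
    print("throttled - ask the client to retry later")
```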
Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- Control and limit retry calls. Use exponential backoff to retry after progressively longer intervals.
- Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- AWS API Gateway (throttling)
- AWS Builders Library
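The AWS SDKs already retry with backoff, but calls to your own or third-party services need the same treatment. A sketch of capped exponential backoff with full jitter:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a callable with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # retries exhausted; surface the failure
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))        # full jitter spreads out retry storms
```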
If the workload is unable to respond successfully to a request, then fail fast.
This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources.
- Fail fast and limit queues. If the workload is unable to respond successfully to a request, then fail fast.
- This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources.
- Limit queues: In a queue-based system, when processing stops but messages keep arriving, the message debt can accumulate into a large backlog, driving up processing time.
- AWS Builders Library
Set timeouts appropriately, verify them systematically, and do not rely on default values as they are generally set too high.
This best practice applies to the client-side, or sender, of the request.
- Set both a connection timeout and a request timeout on any remote call, and generally on any call across processes.
- Many frameworks offer built-in timeout capabilities, but be careful as many have default values that are infinite or too high.
- AWS SDK
- AWS API Gateway
- AWS Builders Library
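With the AWS SDKs the client-side timeouts and retry budget can be set explicitly rather than left at defaults. A minimal sketch; the values are assumptions to tune against your own latency requirements.

```python
import boto3
from botocore.config import Config

config = Config(
    connect_timeout=2,                                  # seconds to establish the connection
    read_timeout=5,                                     # seconds to wait for the response
    retries={"max_attempts": 3, "mode": "standard"},    # bound the retry budget
)
dynamodb = boto3.client("dynamodb", config=config)
```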
Services should either not require state, or should offload state so that there is no dependence on data stored locally on disk or in memory between different client requests.
This enables servers to be replaced at will without causing an availability impact. Amazon ElastiCache or Amazon DynamoDB are good destinations for offloaded state.
- Make your applications stateless. Stateless applications enable horizontal scaling and are tolerant to the failure of an individual node.
- Remove state that could actually be stored in request parameters.
- After examining whether the state is required, move any state tracking to a resilient multi-zone cache or data store like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, or a third-party distributed data solution.
- AWS Builders Library
Emergency levers are rapid processes that can mitigate availability impact on your workload.
- Implement emergency levers. These are rapid processes that may mitigate availability impact on your workload.
- They can be operated in the absence of a root cause.
Change Management
Changes to your workload or its environment must be anticipated and accommodated to achieve reliable operation of the workload.
Changes include those imposed on your workload such as spikes in demand, as well as those from within such as feature deployments and security patches
Monitor workload resources
Logs and metrics are powerful tools to gain insight into the health of your workload
You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur
Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.
Best Practices
All components of your workload should be monitored, including the front-end, business logic, and storage tiers.
Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with AWS Health Dashboard.
- Enable logging where available.
- Review all default metrics and explore any data collection gaps.
- Evaluate all the metrics to decide which ones to alert on for each AWS service in your workload.
- Define alerts and the recovery process for your workload after the alert is triggered.
- Explore use of synthetic transactions to collect relevant data about the workload's state.
- AWS Health Dashboard
- AWS CloudWatch Metrics
- AWS X-Ray
- AWS DevOps Guru
Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps.
- Define and calculate metrics (Aggregation).
- Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps
- AWS CloudWatch
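A metric filter turns a log pattern into a metric that dashboards and alarms can use. A short sketch follows; the log group and metric names are placeholders.

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Count occurrences of a log pattern as a custom metric; names below are placeholders.
logs.put_metric_filter(
    logGroupName="/my-app/application",
    filterName="payment-errors",
    filterPattern='"ERROR" "payment"',
    metricTransformations=[{
        "metricName": "PaymentErrorCount",
        "metricNamespace": "MyApp",
        "metricValue": "1",
    }],
)
```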
Ensure that those who need to know receive notifications when significant events occur.
Alerts can be sent to Amazon Simple Notification Service (Amazon SNS) topics, and then pushed to any number of subscribers.
For example, Amazon SNS can forward alerts to an email alias so that technical staff can respond.
- Perform real-time processing and alarming.
- Ensure that those who need to know receive notifications when significant events occur.
- AWS CloudWatch
- AWS SNS
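Building on the metric defined in the sketch above, an alarm can publish to an SNS topic so the right people are notified; the topic ARN and thresholds are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="payment-errors-high",
    Namespace="MyApp",
    MetricName="PaymentErrorCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder topic ARN
)
```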
Use automation to take action when an event is detected, for example, to replace failed components.
- Perform real-time processing and alarming. Ensure that those who need to know receive notifications when significant events occur.
- Use AWS Systems Manager to perform automated actions. AWS Config continually monitors and records your AWS resource configurations.
- Create and execute a plan to automate responses.
- AWS Systems Manager
- AWS EventBridge
Collect log files and metrics histories and analyze these for broader trends and workload insights.
- Use Amazon CloudWatch Logs to send logs to Amazon S3, where you can use Amazon Athena to query the data.
- AWS Builders Library
- AWS CloudWatch
Frequently review how workload monitoring is implemented and update it based on significant events and changes.
Effective monitoring is driven by key business metrics.
Ensure these metrics are accommodated in your workload as business priorities change.
- Create multiple dashboards for the workload.
- You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified to be the most relevant to the projected health of the workload as usage varies.
- You should also have dashboards for various application tiers and dependencies that can be inspected.
- AWS CloudWatch Dashboards
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS Builders Library
Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems to understand how their applications and underlying services are performing.
- Monitor end-to-end tracing of requests through your system.
- AWS X-Ray is a service that collects data about requests that your application serves, and provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization.
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS Builders Library
Design your workload to adapt to changes in demand
A scalable workload provides elasticity to add or remove resources automatically so that they closely match the current demand at any given point in time.
Best Practices
When replacing impaired resources or scaling your workload, automate the process by using managed AWS services, such as Amazon S3 and AWS Auto Scaling.
You can also use third-party tools and AWS SDKs to automate scaling.
- Configure and use AWS Auto Scaling
- Use Elastic Load Balancing. Load balancers can distribute load by path or by network connectivity.
- Use a highly available DNS provider.
- Use the AWS global network to optimize the path from your users to your applications.
- Configure and use Amazon CloudFront or a trusted content delivery network (CDN).
- AWS Partner
- AWS Autoscaling
- AWS Marketplace
- AWS Elastic Load Balancer
- AWS Network Load Balancer
- AWS Application Load Balancer
- AWS CloudFront
- AWS Route 53
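A target tracking policy is often the simplest way to let an Auto Scaling group follow demand. The sketch below keeps average CPU near 50% for an existing group; the group name and target value are assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",          # placeholder: an existing Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,                      # keep average CPU near 50%
    },
)
```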
Scale resources reactively when necessary if availability is impacted, to restore workload availability.
You first must configure health checks and the criteria on these checks to indicate when availability is impacted by lack of resources.
Then either notify the appropriate personnel to manually scale the resource, or trigger automation to automatically scale it.
- Obtain resources upon detection of impairment to a workload. Scale resources reactively when necessary if availability is impacted, to restore workload availability.
- Use scaling plans, which are the core component of AWS Auto Scaling, to configure a set of instructions for scaling your resources.
- AWS Partner
- AWS Auto Scaling
- AWS Marketplace
Scale resources proactively to meet demand and avoid availability impact.
- Obtain resources upon detection that more resources are needed for a workload.
- Scale resources proactively to meet demand and avoid availability impact.
- AWS Auto Scaling
- AWS Marketplace
Adopt a load testing methodology to measure if scaling activity meets workload requirements.
It’s important to perform sustained load testing.
Load tests should discover the breaking point and test the performance of your workload.
- Perform load testing to identify which aspect of your workload indicates that you must add or remove capacity.
- Load testing should have representative traffic similar to what you receive in production. Increase the load while watching the metrics you have instrumented to determine which metric indicates when you must add or remove resources.
Implement change
Controlled changes are necessary to deploy new functionality and to ensure that the workloads and the operating environment are running known, properly patched software
If these changes are uncontrolled, then it makes it difficult to predict the effect of these changes, or to address issues that arise because of them.
Best Practices
Runbooks are the predefined procedures to achieve specific outcomes.
Use runbooks to perform standard activities, whether done manually or automatically.
Examples include deploying a workload, patching a workload, or making DNS modifications.
- Enable consistent and prompt responses to well understood events by documenting procedures in runbooks.
- Use the principle of infrastructure as code (CloudFormation) to define your infrastructure.
- AWS Partner
- AWS Marketplace
- AWS CloudFormation
- AWS CodeCommit
Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back.
These tests are run in a pre-production environment, which is staged prior to production in the pipeline.
- Integrate functional testing as part of your deployment.
- Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back.
- AWS CodePipeline
- AWS CodeBuild
Resiliency tests (using the principles of chaos engineering) are run as part of the automated deployment pipeline in a pre-production environment.
These tests are staged and run in the pipeline in a pre-production environment.
They should also be run in production as part of game days.
- Integrate resiliency testing as part of your deployment.
- Use Chaos Engineering, the discipline of experimenting on a workload to build confidence in the workload’s capability to withstand turbulent conditions in production.
- Resiliency tests inject faults or resource degradation to assess that your workload responds with its designed resilience.
Immutable infrastructure is a model that mandates that no updates, security patches, or configuration changes happen in-place on production workloads.
When a change is needed, the architecture is built onto new infrastructure and deployed into production.
- Deploy using immutable infrastructure. Immutable infrastructure is a model in which no updates, security patches, or configuration changes happen in-place on production systems.
- If any change is needed, a new version of the architecture is built and deployed into production.
- AWS CodeDeploy
- AWS CodePipeline
Deployments and patching are automated to eliminate negative impact.
- Automate your deployment pipeline.
- Deployment pipelines allow you to invoke automated testing and detection of anomalies, and either halt the pipeline at a certain step before production deployment, or automatically roll back a change.
- AWS CodeDeploy
- AWS CodePipeline
- AWS Systems Manager Patch Manager
- AWS Partner
- AWS Marketplace
- AWS SNS
- AWS SES
Failure Management
Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures
Regardless of your cloud provider, there is the potential for failures to impact your workload. Therefore, you must take steps to implement resiliency if you need your workload to be reliable.
Back up data
Back up data, applications, and configuration to meet requirements for recovery time objectives (RTO) and recovery point objectives (RPO).
Best Practices
All AWS data stores offer backup capabilities.
Services such as Amazon RDS and Amazon DynamoDB additionally support automated backup that enables point-in-time recovery (PITR), which allows you to restore to any point in time, typically up to five minutes before the current time.
Many AWS services offer the ability to copy backups to another AWS Region. AWS Backup is a tool that gives you the ability to centralize and automate data protection across AWS services.
- Identify all data sources for the workload.
- Classify data sources based on criticality.
- Use AWS or third-party services to create backups of the data.
- For data that is not backed up, establish a data reproduction mechanism.
- Establish a cadence for backing up data.
- AWS Backup
- AWS DataSync
- AWS Volume Gateway
- AWS EBS Snapshots
- AWS Cross Region Replication
Control and detect access to backups using authentication and authorization, such as AWS IAM.
Prevent and detect if data integrity of backups is compromised using encryption.
- Use encryption on each of your data stores. If your source data is encrypted, then the backup will also be encrypted.
- Implement least privilege permissions to access your backups. Follow best practices to limit the access to the backups, snapshots, and replicas in accordance with security best practices.
- AWS Encryption: EBS, S3, DynamoDB, RDS, EFS, ++
- AWS Backup Encryption
- AWS Marketplace
Configure backups to be taken automatically based on a periodic schedule informed by the Recovery Point Objective (RPO), or by changes in the dataset.
Critical datasets with low data loss requirements need to be backed up automatically on a frequent basis, whereas less critical data where some loss is acceptable can be backed up less frequently.
- Identify data sources that are currently being backed up manually.
- Determine the RPO for the workload.
- Use an automated backup solution or managed service.
- AWS Partner
- AWS Marketplace
- AWS Backup
- AWS Step Functions
- AWS EventBridge
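With AWS Backup, the schedule and retention implied by your RPO can be captured in a backup plan. A hedged sketch: the plan and vault names below are placeholders, and resources would still need to be assigned to the plan (for example by tag) with a backup selection.

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Daily backups retained for 35 days; derive the schedule and retention from your RPO.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-backups",              # placeholder name
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": "Default",
            "ScheduleExpression": "cron(0 3 * * ? *)",  # 03:00 UTC every day
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    }
)
```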
Validate that your backup process implementation meets your recovery time objectives (RTO) and recovery point objectives (RPO) by performing a recovery test.
- Testing backup and restore capability increases confidence in the ability to perform these actions during an outage.
- Periodically restore backups to a new location and run tests to verify the integrity of the data; perform common checks such as confirming that the restored data is complete and usable.
- AWS Partner
- AWS Marketplace
- AWS Backup
- AWS Step Functions
- AWS EventBridge
Use fault isolation to protect your workload
Fault isolated boundaries limit the effect of a failure within a workload to a limited number of components
Components outside of the boundary are unaffected by the failure. Using multiple fault isolated boundaries, you can limit the impact on your workload.
Best Practices
Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions.
These locations can be as diverse as required.
- Use multiple Availability Zones and AWS Regions. Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions.
- If your workload must be deployed to multiple Regions, choose a multi-Region strategy. Most reliability needs can be met within a single AWS Region using a multi-Availability Zone strategy.
- Evaluate AWS Outposts for your workload if it requires low latency to your on-premises data center or has local data processing requirements.
- Determine if AWS Local Zones helps you provide service to your users. If you have low-latency requirements, see if AWS Local Zones is located near your users.
- AWS Local Zones
- AWS Global Tables (DynamoDB)
- AWS Outposts
For high availability, always (when possible) deploy your workload components to multiple Availability Zones (AZs).
For workloads with extreme resilience requirements, carefully evaluate the options for a multi-Region architecture.
- Evaluate your workload and determine whether the resilience needs can be met by a multi-AZ approach (single AWS Region), or if they require a multi-Region approach.
- Implementing a multi-Region architecture to satisfy these requirements will introduce additional complexity, therefore carefully consider your use case and its requirements.
- Resilience requirements can almost always be met using a single AWS Region.
- AWS Local Zones
- AWS Global Tables (DynamoDB)
- AWS Outposts
If components of the workload can only run in a single Availability Zone or in an on-premises data center, you must implement the capability to do a complete rebuild of the workload within your defined recovery objectives.
- Implement self-healing. Deploy your instances or containers using automatic scaling when possible.
- If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
- AWS ECS Events
- AWS EC2 Auto Scaling
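For an instance that must run in a single location, EC2 automatic recovery can be driven by a CloudWatch alarm on the system status check. A sketch; the instance ID, Region, and evaluation settings below are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="recover-web-instance",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],  # built-in EC2 recover action
)
```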
Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests or clients so that the number of impaired requests is limited, and most can continue without error.
Bulkheads for data are often called partitions, while bulkheads for services are known as cells.
- Use bulkhead architectures.
- Evaluate cell-based architecture for your workload.
- AWS Builders Library
Design your workload to withstand component failures
Workloads with a requirement for high availability and low mean time to recovery (MTTR) must be architected for resiliency.
Best Practices
Continuously monitor the health of your workload so that you and your automated systems are aware of degradation or failure as soon as they occur.
- Determine the collection interval for your components based on your recovery goals.
- Configure detailed monitoring for components.
- Create custom metrics to measure business key performance indicators (KPIs).
- Monitor the user experience for failures using user canaries.
- Create custom metrics that track the user's experience.
- Set alarms to detect when any part of your workload is not working properly.
- Create dashboards to visualize your metrics.
- AWS CloudWatch Synthetics
- AWS CloudWatch Dashboards
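CloudWatch Synthetics provides managed canaries, but a user canary can also be as simple as timing a key request and publishing the result as custom metrics. A sketch; the endpoint and namespace below are placeholders.

```python
import time
import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Time one key user request and record success and latency as custom metrics.
start = time.monotonic()
try:
    with urllib.request.urlopen("https://app.example.com/health", timeout=5) as resp:  # placeholder URL
        succeeded = 1 if resp.status == 200 else 0
except Exception:
    succeeded = 0
latency_ms = (time.monotonic() - start) * 1000

cloudwatch.put_metric_data(
    Namespace="MyApp/Canary",
    MetricData=[
        {"MetricName": "CheckSucceeded", "Value": succeeded},
        {"MetricName": "CheckLatency", "Value": latency_ms, "Unit": "Milliseconds"},
    ],
)
```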
Ensure that if a resource failure occurs, healthy resources can continue to serve requests.
For location failures (such as Availability Zone or AWS Region) ensure that you have systems in place to fail over to healthy resources in unimpaired locations.
- Fail over to healthy resources. Ensure that if a resource failure occurs, healthy resources can continue to serve requests.
- For location failures (such as Availability Zone or AWS Region) ensure you have systems in place to fail over to healthy resources in unimpaired locations.
- APN Partner
- AWS MarketPlace
- AWS OpsWorks
- AWS R53
- AWS RDS Read Replicas
- AWS ECS task placement
- AWS Global Accelerator
Upon detection of a failure, use automated capabilities to perform actions to remediate.
- Use Auto Scaling groups to deploy tiers in a workload.
- Implement automatic recovery for EC2 instances that host applications which cannot be deployed in multiple locations and can tolerate a reboot upon failure.
- Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot use automatic scaling or automatic recovery, or when automatic recovery fails.
- APN Partner
- AWS MarketPlace
- AWS OpsWorks
- AWS CloudWatch
- AWS EventBridge
- AWS Systems Manager Automation
- AWS Step Functions
The control plane is used to configure resources, and the data plane delivers services.
Data planes typically have higher availability design goals than control planes and are usually less complex.
When implementing recovery or mitigation responses to potentially resiliency-impacting events, using control plane operations can lower the overall resiliency of your architecture.
- Rely on the data plane and not the control plane when using Amazon Route 53 for disaster recovery.
- Route 53 Application Recovery Controller helps you manage and coordinate failover using readiness checks and routing controls.
- These features continually monitor your application’s ability to recover from failures, and enable you to control your application recovery across multiple AWS Regions, Availability Zones, and on premises.
- APN Partner
- AWS MarketPlace
- AWS Builders Library
- AWS R53
Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails.
You should instead build workloads that are statically stable and operate in only one mode.
- Use static stability to prevent bimodal behavior.
- Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails.
- AWS Builders Library
Notifications are sent upon the detection of significant events, even if the issue caused by the event was automatically resolved.
- Alarm on business key performance indicators (KPIs) when they cross a low threshold.
- Having a low-threshold alarm on your business KPIs helps you know when your workload is unavailable or nonfunctional.
- AWS CloudWatch
- AWS EventBridge
- AWS SNS
Test reliability
After you have designed your workload to be resilient to the stresses of production, testing is the only way to ensure that it will operate as designed, and deliver the resiliency you expect.
Test to validate that your workload meets functional and non-functional requirements, because bugs or performance bottlenecks can impact the reliability of your workload. Test the resiliency of your workload to help you find latent bugs that only surface in production. Exercise these tests regularly.
Best Practices
Enable consistent and prompt responses to failure scenarios that are not well understood, by documenting the investigation process in playbooks.
Playbooks are the predefined steps performed to identify the factors contributing to a failure scenario.
The results from any process step are used to determine the next steps to take until the issue is identified or escalated.
- Use playbooks to identify issues. Playbooks are documented processes to investigate issues.
- Enable consistent and prompt responses to failure scenarios by documenting processes in playbooks.
- AWS Systems Manager Automation
- AWS Systems Manager Run Command
- AWS CloudWatch Alarms
- AWS EventBridge
Review customer-impacting events, and identify the contributing factors and preventative action items.
Use this information to develop mitigations to limit or prevent recurrence.
Develop procedures for prompt and effective responses.
- Establish a standard for your post-incident analysis.
- Good post-incident analysis provides opportunities to propose common solutions for problems with architecture patterns that are used in other places in your systems.
- Have a process to identify and document the contributing factors of an event so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses.
Use techniques such as unit tests and integration tests that validate required functionality.
- Test functional requirements. These include unit tests and integration tests that validate required functionality.
- AWS CodeBuild
- AWS CodePipeline
- AWS CloudWatch Synthetics
Use techniques such as load testing to validate that the workload meets scaling and performance requirements.
- Test scaling and performance requirements. Perform load testing to validate that the workload meets scaling and performance requirements.
Run chaos experiments regularly in environments that are in or as close to production as possible to understand how your system responds to adverse conditions.
- Chaos engineering provides your teams with capabilities to continually inject real world disruptions (simulations) in a controlled way at the service provider, infrastructure, workload, and component level, with minimal to no impact to your customers.
- AWS Fault Injection Simulator
- AWS Resilience Hub
- AWS Marketplace: Gremlin Chaos Engineering Platform
Use game days to regularly exercise your procedures for responding to events and failures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios.
Game days enforce measures to ensure that production events do not impact users.
- Schedule game days to regularly exercise your runbooks and playbooks.
- Game days should involve everyone who would be involved in a production event: business owner, development staff, operational staff, and incident response teams.
Plan for Disaster Recovery (DR)
Having backups and redundant workload components in place is the start of your DR strategy
RTO and RPO are your objectives for restoration of your workload
Set these based on business needs. Implement a strategy to meet these objectives, considering locations and function of workload resources and data
Both Availability and Disaster Recovery rely on the same best practices such as monitoring for failures, deploying to multiple locations, and automatic failover. However Availability focuses on components of the workload, while Disaster Recovery focuses on discrete copies of the entire workload. Disaster Recovery has different objectives from Availability, focusing on time to recovery after a disaster
Best Practices
The workload has a recovery time objective (RTO) and recovery point objective (RPO).
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service.
Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point.
- For the given workload, you must understand the impact of downtime and lost data on your business.
- The impact generally grows larger with greater downtime or data loss, but the shape of this growth can differ based on the workload type.
Define a disaster recovery (DR) strategy that meets your workload's recovery objectives.
Choose a strategy such as: backup and restore; standby (active/passive); or active/active.
- Determine a DR strategy that will satisfy recovery requirements for this workload.
Regularly test failover to your recovery site to ensure proper operation, and that RTO and RPO are met.
- Engineer your workloads for recovery. Regularly test your recovery paths. Recovery-oriented computing identifies the characteristics in systems that enhance recovery.
- These characteristics are: isolation and redundancy, system-wide ability to roll back changes, ability to monitor and determine health, ability to provide diagnostics, automated recovery, modular design, and ability to restart.
- AWS Fault Injection Simulator
Ensure that the infrastructure, data, and configuration are as needed at the DR site or Region.
For example, check that AMIs and service quotas are up to date.
- Ensure that your delivery pipelines deliver to both your primary and backup sites.
- Delivery pipelines for deploying applications into production must distribute to all the specified disaster recovery strategy locations, including dev and test environments.
- Use AWS Config rules to create systems that enforce your disaster recovery strategies and generate alerts when they detect drift.
- Use AWS CloudFormation to deploy your infrastructure. AWS CloudFormation can detect drift between what your CloudFormation templates specify and what is actually deployed.
- AWS Systems Manager Automation
- AWS Config Rules
- AWS CloudFormation
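Drift between a deployed DR stack and its template can be checked programmatically. The sketch below starts a drift detection run for a hypothetical stack and polls until it finishes.

```python
import time
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Start drift detection for the DR stack (name is a placeholder) and wait for the result.
detection_id = cloudformation.detect_stack_drift(StackName="dr-site-stack")["StackDriftDetectionId"]

while True:
    status = cloudformation.describe_stack_drift_detection_status(StackDriftDetectionId=detection_id)
    if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
        break
    time.sleep(5)

print(status.get("StackDriftStatus"))  # IN_SYNC means the deployed stack matches its template
```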
Use AWS or third-party tools to automate system recovery and route traffic to the DR site or Region.
- Use Elastic Disaster Recovery for automated Failover and Failback.
- Elastic Disaster Recovery continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region.
- In the case of a disaster, after choosing to recover using Elastic Disaster Recovery, Elastic Disaster Recovery automates the conversion of your replicated servers into fully provisioned workloads in your recovery Region on AWS.
- AWS Elastic Disaster Recovery
- AWS Systems Manager Automation
- AWS Marketplace
Performance Efficiency
The performance efficiency pillar focuses on the efficient use of computing resources to meet requirements, and how to maintain efficiency as demand changes and technologies evolve.
There are four focus areas for Performance Efficiency in the cloud
Selection
The optimal solution for a particular workload varies, and solutions often combine multiple approaches. Well-architected workloads use multiple solutions and enable different features to improve performance.
AWS resources are available in many types and configurations, which makes it easier to find an approach that closely matches your needs. Selection of the right services and resources is key to performance efficiency.
Performance architecture selection
Use a data-driven approach to select the patterns and implementation for your architecture and achieve a cost effective solution
AWS Solutions Architects, AWS Reference Architectures, and AWS Partners can help you select an architecture based on industry knowledge, but data obtained through benchmarking or load testing will be required to optimize your architecture.
Your architecture will likely combine a number of different architectural approaches (for example, event-driven, ETL, or pipeline). The implementation of your architecture will use the AWS services that are specific to the optimization of your architecture's performance. In the following sections we discuss the four main resource types to consider (compute, storage, database, and network).
Best Practices
Learn about and understand the wide range of services and resources available in the cloud.
Identify the relevant services and configuration options for your workload, and understand how to achieve optimal performance.
- Inventory your workload software and architecture for related services: Gather an inventory of your workload and decide which category of products to learn more about.
- Identify workload components that can be replaced with managed services to increase performance and reduce operational complexity.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Use internal experience and knowledge of the cloud, or external resources such as published use cases, relevant documentation, or whitepapers, to define a process to choose resources and services.
You should define a process that encourages experimentation and benchmarking with the services that could be used in your workload
- Select an architectural approach: Identify the kind of architecture that meets your performance requirements.
- Identify constraints, such as the media for delivery (desktop, web, mobile, IoT), legacy requirements, and integrations. Identify opportunities for reuse, including refactoring.
- Consult other teams, architecture diagrams, and resources such as AWS Solution Architects, AWS Reference Architectures, and AWS Partners to help you choose an architecture.
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Workloads often have cost requirements for operation. Use internal cost controls to select resource types and sizes based on predicted resource need.
Determine which workload components could be replaced with fully managed services, such as managed databases, in-memory caches, and ETL services. Reducing your operational workload allows you to focus resources on business outcomes.
- Optimize workload components to reduce cost: Right size workload components and enable elasticity to reduce cost and maximize component efficiency.
- Determine which workload components can be replaced with managed services when appropriate, such as managed databases, in-memory caches, and reverse proxies.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Compute Optimizer
- AWS Samples
- AWS SDK Examples
Maximize performance and efficiency by evaluating internal policies and existing reference architectures and using your analysis to select services and configurations for your workload.
- Deploy your workload using existing policies or reference architectures: Integrate the services into your cloud deployment, then use your performance tests to ensure that you can continue to meet your performance requirements.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Use cloud company resources, such as solutions architects, professional services, or an appropriate partner to guide your decisions. These resources can help review and improve your architecture for optimal performance.
Reach out to AWS for assistance when you need additional guidance or product information. AWS Solutions Architects and AWS Professional Services provide guidance for solution implementation. AWS Partners provide AWS expertise to help you unlock agility and innovation for your business.
- Reach out to AWS resources for assistance: AWS Solutions Architects and Professional Services provide guidance for solution implementation.
- APN Partners provide AWS expertise to help you unlock agility and innovation for your business.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS Samples
- AWS SDK Examples
Benchmark the performance of an existing workload to understand how it performs on the cloud. Use the data collected from benchmarks to drive architectural decisions.
Use benchmarking with synthetic tests and real-user monitoring to generate data about how your workload’s components perform. Benchmarking is generally quicker to set up than load testing and is used to evaluate the technology for a particular component. Benchmarking is often used at the start of a new project, when you lack a full solution to load test.
- Monitor performance during development: Implement processes that provide visibility into performance as your workload evolves.
- Integrate into your delivery pipeline: Automatically run load tests in your delivery pipeline.
- Test user journeys: Use synthetic or sanitized versions of production data (remove sensitive or identifying information) for load testing.
- Real-user monitoring: Use CloudWatch RUM to help you collect and view client-side data about your application performance.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
- AWS Samples
- AWS SDK Examples
Deploy your latest workload architecture on the cloud using different resource types and sizes. Monitor the deployment to capture performance metrics that identify bottlenecks or excess capacity.
Use this performance information to design or improve your architecture and resource selection.
- Validate your approach with load testing: Load test a proof-of-concept to find out if you meet your performance requirements.
- You can use AWS services to run production-scale environments to test your architecture.
- Monitor metrics: Amazon CloudWatch can collect metrics across the resources in your architecture.
- Test at scale: Load testing uses your actual workload so you can see how your solution performs in a production environment.
- You can use AWS services to run production-scale environments to test your architecture.
- AWS CloudFormation
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
Compute architecture selection
The optimal compute choice for a particular workload can vary based on application design, usage patterns, and configuration settings
Architectures may use different compute choices for various components and enable different features to improve performance.
Selecting the wrong compute choice for an architecture can lead to lower performance efficiency
Best Practices
Understand how your workload can benefit from the use of different compute options, such as instances, containers and functions.
- Understand the virtualization, containerization, and management solutions that can benefit your workload and meet your performance requirements.
- A workload can contain multiple types of compute solutions.
- Each compute solution has differing characteristics. Based on your workload scale and compute requirements, a compute solution can be selected and configured to meet your needs.
- AWS EC2
- AWS ECS
- AWS EKS
- AWS Lambda
Each compute solution has options and configurations available to you to support your workload characteristics.
Learn how various options complement your workload, and which configuration options are best for your application.
- If your workload has been using the same compute option for more than four weeks and you anticipate that the characteristics will remain the same in the future, you can use AWS Compute Optimizer to provide a recommendation to you based on your compute characteristics.
- If AWS Compute Optimizer is not an option due to lack of metrics, an unsupported instance type, or a foreseeable change in your characteristics, then you must predict your metrics based on load testing and experimentation.
- AWS Compute Optimizer
To understand how your compute resources are performing, you must record and track the utilization of various systems.
This data can be used to make more accurate determinations about resource requirements.
- Identify, collect, aggregate, and correlate compute-related metrics.
- Using a service such as Amazon CloudWatch can make the implementation quicker and easier to maintain.
- In addition to the default metrics recorded, identify and track additional system-level metrics within your workload.
- Record data such as CPU utilization, memory, disk I/O, and network inbound and outbound metrics to gain insight into utilization levels or bottlenecks.
- AWS CloudWatch
- AWS Systems Manager automation
- Amazon Managed Service for Prometheus
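As one data source for right-sizing, historical utilization can be pulled from CloudWatch. The sketch below fetches a week of hourly average CPU for a single instance; the instance ID is a placeholder.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    StartTime=end - datetime.timedelta(days=7),
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```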
Analyze the various performance characteristics of your workload and how these characteristics relate to memory, network, and CPU usage.
Use this data to choose resources that best match your workload's profile.
- Modify your workload configuration by right sizing: To optimize both performance and overall efficiency, determine which resources your workload needs.
- Choose memory-optimized instances for systems that require more memory than CPU, or compute-optimized instances for components that do data processing that is not memory-intensive.
- AWS CloudWatch
- AWS Compute Optimizer
The cloud provides the flexibility to expand or reduce your resources dynamically through a variety of mechanisms to meet changes in demand.
Combined with compute-related metrics, a workload can automatically respond to changes and use the optimal set of resources to achieve its goal.
- Take advantage of elasticity: Elasticity matches the supply of resources you have against the demand for those resources.
- Instances, containers, and functions provide mechanisms for elasticity either in combination with automatic scaling or as a feature of the service.
- Use elasticity in your architecture to ensure that you have sufficient capacity to meet performance requirements at all scales of use.
- AWS EC2 Auto Scaling
- AWS EFS
Use system-level metrics to identify the behavior and requirements of your workload over time.
Evaluate your workload's needs by comparing the available resources with these requirements and make changes to your compute environment to best match your workload's profile
- Use a data-driven approach to optimize resources: To achieve maximum performance and efficiency, use the data gathered over time from your workload to tune and optimize your resources.
- Look at the trends in your workload's usage of current resources and determine where you can make changes to better match your workload's needs.
- AWS Compute Optimizer
Storage architecture selection
The optimal storage solution for a particular system varies based on the kind of access method (block, file, or object), patterns of access (random or sequential), throughput required, frequency of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and durability constraints
Well-architected systems use multiple storage solutions and enable different features to improve performance.
In AWS, storage is virtualized and is available in a number of different types. This makes it easier to match your storage methods with your needs, and offers storage options that are not easily achievable with on-premises infrastructure.
Best Practices
Identify and document the workload storage needs and define the storage characteristics of each location.
Examples of storage characteristics include: shareable access, file size, growth rate, throughput, IOPS, latency, access patterns, and persistence of data. Use these characteristics to evaluate if block, file, object, or instance storage services are the most efficient solution for your storage needs.
- Identify your workload’s most important storage performance metrics and implement improvements as part of a data-driven approach, using benchmarking or load testing.
- Use this data to identify where your storage solution is constrained, and examine configuration options to improve the solution.
- Determine the expected growth rate for your workload and choose a storage solution that will meet those rates.
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS
- AWS FSx for Lustre
- AWS FSx for Windows File Server
- AWS FSx for NetApp ONTAP
- AWS FSx for OpenZFS
- AWS S3 Glacier
- AWS S3
- AWS Snow Family
Evaluate the various characteristics and configuration options and how they relate to storage.
Understand where and how to use provisioned IOPS, SSDs, magnetic storage, object storage, archival storage, or ephemeral storage to optimize storage space and performance for your workload.
- Determine storage characteristics: When you evaluate a storage solution, determine which storage characteristics you require, such as ability to share, file size, cache size, latency, throughput, and persistence of data.
- Then match your requirements to the AWS service that best fits your needs.
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS: Amazon EFS Performance
- AWS FSx for Lustre Performance
- AWS FSx for Windows File Server Performance
- AWS FSx for NetApp ONTAP performance
- AWS FSx for OpenZFS performance
- AWS S3 Glacier: Amazon S3 Glacier Documentation
- AWS S3: Request Rate and Performance Considerations
- AWS Snow Family
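For example, when the characteristics above point to block storage with predictable IOPS and throughput needs, a gp3 EBS volume lets you provision both independently of capacity. A minimal sketch follows; the Availability Zone, sizes, and tag values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# gp3 decouples capacity from performance: IOPS and throughput are set explicitly.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # placeholder AZ
    Size=200,                        # GiB
    VolumeType="gp3",
    Iops=6000,                       # provisioned IOPS
    Throughput=500,                  # MiB/s
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "workload", "Value": "example-db"}],
    }],
)
print(volume["VolumeId"])
```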
Choose storage systems based on your workload's access patterns and configure them by determining how the workload accesses data.
Increase storage efficiency by choosing object storage over block storage. Configure the storage options you choose to match your data access patterns.
- Optimize your storage usage and access patterns: Choose storage systems based on your workload's access patterns and the characteristics of the available storage options.
- Determine the best place to store data that will enable you to meet your requirements while reducing overhead.
- Use performance optimizations and access patterns when configuring and interacting with data based on the characteristics of your storage (for example, striping volumes or partitioning data).
- AWS EBS Volume Types
- AWS EC2 Storage
- AWS EFS
- AWS FSx for Lustre
- AWS FSx for Windows File Server
- AWS FSx for NetApp ONTAP
- AWS FSx for OpenZFS
- AWS S3 Glacier
- AWS S3
- AWS Snow Family
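One way to act on access patterns is an S3 lifecycle rule that moves infrequently read objects to colder storage classes. The sketch below is a minimal example; the bucket name, prefix, and transition days are hypothetical values you would replace with your own.

```python
import boto3

s3 = boto3.client("s3")

# Move log objects to Standard-IA after 30 days and to Glacier after 90 days,
# matching an access pattern that is hot for a month and archival afterwards.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-workload-logs",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```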
Database architecture selection
The optimal database solution for a system varies based on requirements for availability, consistency, partition tolerance, latency, durability, scalability, and query capability
Many systems use different database solutions for various sub-systems and enable different features to improve performance.
Selecting the wrong database solution and features for a system can lead to lower performance efficiency.
Best Practices
Choose your data management solutions to optimally match the characteristics, access patterns, and requirements of your workload datasets.
When selecting and implementing a data management solution, you must ensure that the querying, scaling, and storage characteristics support the workload data requirements.
Learn how various database options match your data models, and which configuration options are best for your use-case
- Define the data characteristics and access patterns of your workload.
- Review all available database solutions to identify which solution supports your data requirements.
- Within a given workload, multiple databases may be selected. Evaluate each service or group of services and assess them individually.
- How is the data structured? (for example, unstructured, key-value, semi-structured, relational)
- Is ACID (atomicity, consistency, isolation, durability) compliance required?
- What consistency model is required?
- What query and result formats must be supported? (for example, SQL, CSV, Parquet, Avro, JSON, etc.)
- What is the proportion of read queries in relation to write queries? Would caching be likely to improve performance?
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Understand the available database options and how they can optimize your performance before you select your data management solution. Use load testing to identify database metrics that matter for your workload.
- Understand your workload data characteristics so that you can configure your database options.
- Run load tests to identify your key performance metrics and bottlenecks. Use these characteristics and metrics to evaluate database options and experiment with different configurations.
- What configuration options are available for the selected databases?
- Is the workload read or write heavy?
- What solutions are available for scaling writes (partition key sharding, introducing a queue, etc.)?
- What are the current or expected peak transactions per second (TPS)? Test using this volume of traffic and this volume +X% to understand the scaling characteristics.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
To understand how your data management systems are performing, it is important to track relevant metrics. These metrics will help you to optimize your data management resources, to ensure that your workload requirements are met, and that you have a clear overview on how the workload performs.
Use tools, libraries, and systems that record performance measurements related to database performance.
- Identify, collect, aggregate, and correlate database-related metrics. Metrics should include both the underlying system that is supporting the database and the database metrics.
- The underlying system metrics might include CPU utilization, memory, available disk storage, disk I/O, and network inbound and outbound metrics while the database metrics might include transactions per second, top queries, average queries rates, response times, index usage, table locks, query timeouts, and number of connections open.
- AWS CloudWatch
- AWS X-Ray
- AWS DevOps Guru
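As a starting point for the metric collection described above, the sketch below pulls average connection counts for one RDS instance from CloudWatch over the past day. The instance identifier is a placeholder, and the same pattern applies to metrics such as ReadLatency, WriteLatency, or FreeableMemory.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average open connections for one RDS instance over the last 24 hours.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```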
Use the access patterns of the workload to decide which services and technologies to use. In addition to non-functional requirements such as performance and scale, access patterns heavily influence the choice of the database and storage solutions.
The first dimension is the need for transactions, ACID compliance, and consistent reads. Not every database supports these, and most NoSQL databases provide an eventual consistency model. The second important dimension is the distribution of writes and reads over time and space.
- Identify and evaluate your data access pattern to select the correct storage configuration.
- Each database solution has options to configure and optimize your storage solution.
- Use the collected metrics and logs and experiment with options to find the optimal configuration. Review the storage options available for each database service using the resources below.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Use performance characteristics and access patterns that optimize how data is stored or queried to achieve the best possible performance.
Measure how optimizations such as indexing, key distribution, data warehouse design, or caching strategies impact system performance or overall efficiency.
- Optimize data storage based on metrics and patterns: Use reported metrics to identify any underperforming areas in your workload and optimize your database components.
- Each database system has different performance related characteristics to evaluate, such as how data is indexed, cached, or distributed among multiple systems.
- Measure the impact of your optimizations.
- AWS DynamoDB
- AWS Aurora
- AWS Redshift
- AWS Athena
- AWS Redshift Spectrum
- AWS RDS
- AWS ElastiCache
- AWS Neptune GraphDB
Network architecture selection
The optimal network solution for a workload varies based on latency, throughput requirements, jitter, and bandwidth.
Physical constraints, such as user or on-premises resources, determine location options. These constraints can be offset with edge locations or resource placement.
On AWS, networking is virtualized and is available in a number of different types and configurations. This makes it easier to match your networking methods with your needs.
AWS offers product features (for example, Enhanced Networking, Amazon EC2 networking optimized instances, Amazon S3 transfer acceleration, and dynamic Amazon CloudFront) to optimize network traffic.
AWS also offers networking features (for example, Amazon Route 53 latency routing, Amazon VPC endpoints, AWS Direct Connect, and AWS Global Accelerator) to reduce network distance or jitter.
Best Practices
Analyze and understand how network-related decisions impact workload performance. The network is responsible for the connectivity between application components, cloud services, edge networks, and on-premises data, and therefore it can have a significant impact on workload performance.
In addition to workload performance, user experience is also impacted by network latency, bandwidth, protocols, location, network congestion, jitter, throughput, and routing rules.
- Identify important network performance metrics of your workload and capture its networking characteristics.
- Define and document requirements as part of a data-driven approach, using benchmarking or load testing.
- Use this data to identify where your network solution is constrained, and examine configuration options that could improve the workload.
- Application Load Balancer
- EC2 Enhanced Networking on Linux
- EC2 Enhanced Networking on Windows
- EC2 Placement Groups
- Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- Network Load Balancer
- Networking Products with AWS
- AWS Transit Gateway
- AWS Route 53
- VPC Endpoints
- VPC Flow Logs
Evaluate networking features in the cloud that may increase performance. Measure the impact of these features through testing, metrics, and analysis.
For example, take advantage of network-level features that are available to reduce latency, packet loss, or jitter.
- Review which network-related configuration options are available to you, and how they could impact your workload.
- Understanding how these options interact with your architecture and the impact that they will have on both measured performance and the performance perceived by users is critical for performance optimization.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 instance network bandwidth
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
When a common network is required to connect on-premises and cloud resources in AWS, ensure that you have adequate bandwidth to meet your performance requirements.
Estimate the bandwidth and latency requirements for your hybrid workload. These numbers will drive the sizing requirements for AWS Direct Connect or your VPN endpoints.
- Develop a hybrid networking architecture based on your bandwidth requirements: Estimate the bandwidth and latency requirements of your hybrid applications.
- Based on your bandwidth requirements, a single VPN or Direct Connect connection might not be enough, and you must architect a hybrid setup to enable traffic load balancing across multiple connections.
- AWS Network Load Balancer
- AWS Networking Products with AWS
- AWS Transit Gateway
- AWS Transitioning to latency-based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
- AWS Site-to-Site VPN
- AWS Direct Connect
- AWS Client VPN
Distribute traffic across multiple resources or services to allow your workload to take advantage of the elasticity that the cloud provides.
You can also use load balancing for offloading encryption termination to improve performance and to manage and route traffic effectively
- Use the appropriate load balancer for your workload: Select the appropriate load balancer for your workload.
- If you must load balance HTTP requests, we recommend Application Load Balancer. For network and transport protocols (layer 4 – TCP, UDP) load balancing, and for extreme performance and low latency applications, we recommend Network Load Balancer.
- Application Load Balancers support HTTPS and Network Load Balancers support TLS encryption offloading.
- AWS Network Load Balancer
- AWS Networking Products with AWS
- AWS Transit Gateway
- AWS Transitioning to latency-based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
- AWS Site-to-Site VPN
- AWS Direct Connect
- AWS Client VPN
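A minimal sketch of the Network Load Balancer recommendation above, including TLS offload at the listener. The subnet, VPC, and certificate ARN values are placeholders, and a real setup would also register targets and configure health checks.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Network Load Balancer for low-latency layer 4 traffic.
nlb = elbv2.create_load_balancer(
    Name="example-nlb",
    Type="network",
    Scheme="internet-facing",
    Subnets=["subnet-0123456789abcdef0"],        # placeholder subnet
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

# Backend targets speak plain TCP; TLS is terminated at the load balancer.
tg = elbv2.create_target_group(
    Name="example-tcp-targets",
    Protocol="TCP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",               # placeholder VPC
    TargetType="instance",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TLS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/example"}],  # placeholder
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)
```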
Make decisions about protocols for communication between systems and networks based on the impact to the workload’s performance.
There is a relationship between latency and bandwidth to achieve throughput. If your file transfer is using TCP, higher latencies will reduce overall throughput. There are approaches to fix this with TCP tuning and optimized transfer protocols; some approaches use UDP.
- Optimize network traffic: Select the appropriate protocol to optimize the performance of your workload.
- There are approaches to fix latency with TCP tuning and optimized transfer protocols, some of which use UDP.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
Use the cloud location options available to reduce network latency or improve throughput.
Use AWS Regions, Availability Zones, placement groups, and edge locations such as AWS Outposts, AWS Local Zones, and AWS Wavelength, to reduce network latency or improve throughput.
- Reduce latency by selecting the correct locations: Identify where your users and data are located.
- Take advantage of AWS Regions, Availability Zones, placement groups, and edge locations to reduce latency.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
Use collected and analyzed data to make informed decisions about optimizing your network configuration. Measure the impact of those changes and use the impact measurements to make future decisions.
Enable VPC Flow Logs for all VPC networks that are used by your workload. VPC Flow Logs are a feature that allows you to capture information about the IP traffic going to and from network interfaces in your VPC.
- Enable VPC Flow Logs: VPC Flow Logs enable you to capture information about the IP traffic going to and from network interfaces in your VPC.
- Enable appropriate metrics for network options: Ensure that you select the appropriate network metrics for your workload. You can enable metrics for VPC NAT gateway, transit gateways, and VPN tunnels.
- AWS EBS - Optimized Instances
- AWS Application Load Balancer
- AWS EC2 Enhanced Networking on Linux
- AWS EC2 Enhanced Networking on Windows
- AWS EC2 Placement Groups
- AWS Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances
- AWS Network Load Balancer
- AWS Transit Gateway
- AWS Latency-Based Routing in Amazon Route 53
- AWS VPC Endpoints
- AWS VPC Flow Logs
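A minimal sketch of enabling VPC Flow Logs to CloudWatch Logs, as recommended above. The VPC ID, log group name, and IAM role ARN are placeholders; the role must allow delivery to CloudWatch Logs.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture accepted and rejected traffic for the whole VPC into CloudWatch Logs.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],      # placeholder VPC
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/example-flow-logs",      # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",  # placeholder role
)
```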
Review
When architecting workloads, there is a finite set of options that you can choose from. However, over time, new technologies and approaches become available that could improve the performance of your workload.
In the cloud, it’s much easier to experiment with new features and services because your infrastructure is code. To adopt a data-driven approach to architecture, you should implement a performance review process.
Evolve your workload to take advantage of new releases
Take advantage of the continual innovation at AWS driven by customer need. We release new Regions, edge locations, services, and features regularly.
Any of these releases could positively improve the performance efficiency of your architecture.
Best Practices
Evaluate ways to improve performance as new services, design patterns, and product offerings become available.
Determine which of these could improve performance or increase the efficiency of the workload through evaluation, internal discussion, or external analysis.
- Document your workload solutions.
- Use a tagging strategy to document owners for each workload component and category.
- Identify news and update sources related to your workload components.
- Document your process for evaluating updates and new services.
- AWS Config
- AWS Tagging
- AWS GitHub
- AWS Skill Builder
- AWS Blog
- What's New with AWS website
Define a process to evaluate new services, design patterns, resource types, and configurations as they become available.
For example, run existing performance tests on new instance offerings to determine their potential to improve your workload.
- Identify the key performance constraints for your workload: Document your workload’s performance constraints so that you know what kinds of innovation might improve the performance of your workload.
- AWS Blog
- What's New with AWS website
- AWS GitHub
- AWS Skill Builder
As an organization, use the information gathered through the evaluation process to actively drive adoption of new services or resources when they become available.
Use the information you gather when evaluating new services or technologies to drive change. As your business or workload changes, performance needs also change.
- Evolve your workload over time: Use the information you gather when evaluating new services or technologies to drive change.
- As your business or workload changes, performance needs also change.
- Use data gathered from your workload metrics to evaluate areas where you can achieve the biggest gains in efficiency or performance, and proactively adopt new services and technologies to keep up with demand.
- AWS Blog
- What's New with AWS website
- AWS GitHub
- AWS Skill Builder
Monitoring
After you implement your architecture you must monitor its performance so that you can remediate any issues before they impact your customers. Monitoring metrics should be used to raise alarms when thresholds are breached.
Monitor your resources to ensure that they are performing as expected
System performance can degrade over time. Monitor system performance to identify degradation and remediate internal or external factors, such as the operating system or application load.
Best Practices
Use a monitoring and observability service to record performance-related metrics.
Examples of metrics include database transactions, slow queries, I/O latency, HTTP request throughput, service latency, and other key data.
- Identify the relevant performance metrics for your workload and record them. This data helps identify which components are impacting overall performance or efficiency of your workload.
- Identify performance metrics: Use the customer experience to identify the most important metrics. For each metric, identify the target, measurement approach, and priority.
- Use these data points to build alarms and notifications to proactively address performance-related issues.
- AWS CloudWatch (monitoring/logging)
- AWS X-Ray
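Alongside the built-in metrics, workload code can publish the performance data it cares about as custom CloudWatch metrics. A minimal sketch follows; the namespace, metric name, and dimension are illustrative rather than prescribed by the guidance above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request_latency(milliseconds: float, endpoint: str) -> None:
    """Publish a single latency observation as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace="ExampleWorkload",                      # illustrative namespace
        MetricData=[{
            "MetricName": "RequestLatency",
            "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
            "Unit": "Milliseconds",
            "Value": milliseconds,
        }],
    )

# Example usage from request-handling code:
record_request_latency(182.5, "/checkout")
```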
In response to (or during) an event or incident, use monitoring dashboards or reports to understand and diagnose the impact. These views provide insight into which portions of the workload are not performing as expected.
- Prioritize experience concerns for critical user stories: When you write critical user stories for your architecture, include performance requirements, such as specifying how quickly each critical story should run.
- For these critical stories, implement additional scripted user journeys to ensure that you know how the user stories perform against your requirements.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
Identify the KPIs that quantitatively and qualitatively measure workload performance. KPIs help to measure the health of a workload as it relates to a business goal.
KPIs allow business and engineering teams to align on the measurement of goals and strategies and how this combines to produce business outcomes.
KPIs should be revisited when business goals, strategies, or end-user requirements change.
- All departments and business teams impacted by the health of the workload should contribute to defining KPIs.
- A single person should drive the collaboration, timelines, documentation, and information related to an organization’s KPIs.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
- AWS QuickSight
Using the performance-related key performance indicators (KPIs) that you defined, configure a monitoring system that generates alarms automatically when these measurements are outside expected boundaries.
Amazon CloudWatch can collect metrics across the resources in your architecture. You can also collect and publish custom metrics to surface business or derived metrics.
- Monitor metrics: Amazon CloudWatch can collect metrics across the resources in your architecture.
- You can collect and publish custom metrics to surface business or derived metrics.
- Use CloudWatch or a third-party monitoring service to set alarms that indicate when thresholds are exceeded.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
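Building on the KPI-driven alarming above, the sketch below raises an alarm and notifies an SNS topic when the custom latency metric stays above a target for five minutes. The topic ARN, threshold, and dimension values are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="ExampleWorkload",
    MetricName="RequestLatency",
    Dimensions=[{"Name": "Endpoint", "Value": "/checkout"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500.0,                     # milliseconds; placeholder target
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:perf-alerts"],  # placeholder topic
)
```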
As routine maintenance, or in response to events or incidents, review which metrics are collected.
Use these reviews to identify which metrics were essential in addressing issues and which additional metrics, if they were being tracked, would help to identify, address, or prevent issues.
- Constantly improve metric collection and monitoring: As part of responding to incidents or events, evaluate which metrics were helpful in addressing the issue and which metrics could have helped that are not currently being tracked.
- Use this method to improve the quality of metrics you collect so that you can prevent or more quickly resolve future incidents.
- AWS CloudWatch
- AWS CloudWatch Synthetics
- AWS X-Ray
Use key performance indicators (KPIs), combined with monitoring and alerting systems, to proactively address performance-related issues.
Use alarms to trigger automated actions to remediate issues where possible. Escalate the alarm to those able to respond if automated response is not possible
- Monitor performance during operations: Implement processes that provide visibility into performance as your workload is running.
- Build monitoring dashboards and establish a baseline for performance expectations.
- AWS CloudWatch
- AWS X-Ray
Trade-offs
When you architect solutions, think about trade-offs to ensure an optimal approach. Depending on your situation, you could trade consistency, durability, and space for time or latency, to deliver higher performance.
Using trade-offs to improve performance
When architecting solutions, actively considering trade-offs enables you to select an optimal approach.
Often you can improve performance by trading consistency, durability, and space for time and latency. Trade-offs can increase the complexity of your architecture and require load testing to ensure that a measurable benefit is obtained.
Best Practices
Understand and identify areas where increasing the performance of your workload will have a positive impact on efficiency or customer experience.
For example, a website that has a large amount of customer interaction can benefit from using edge services to move content delivery closer to customers.
- Set up end-to-end tracing to identify traffic patterns, latency, and critical performance areas.
- Monitor your data access patterns for slow queries or poorly fragmented and partitioned data.
- Identify the constrained areas of the workload using load testing or monitoring.
- AWS Builders’ Library
- AWS X-Ray
- AWS CloudWatch RUM
- AWS DevOps Guru
- AWS CloudWatch Synthetics
Research and understand the various design patterns and services that help improve workload performance. As part of the analysis, identify what you could trade to achieve higher performance.
For example, using a cache service can help to reduce the load placed on database systems. However, caching can introduce eventual consistency and requires engineering effort to implement within business requirements and customer expectations.
- Evaluate and review design patterns that would improve your workload performance.
- Improve your workload to model the selected design patterns and use services and the service configuration options to improve your workload performance.
- AWS Architecture Center
- AWS Partner Network
- AWS Solutions Library
- AWS Knowledge Center
- Amazon Builders’ Library
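The caching trade-off described above is usually implemented as a cache-aside pattern: read from the cache first, fall back to the database on a miss, and accept that cached reads can be slightly stale until the TTL expires. The sketch below assumes a Redis-compatible ElastiCache endpoint and the redis-py package; the endpoint, TTL, and query_database helper are placeholders for your own data access layer.

```python
import json
import redis  # assumes the redis-py package and an ElastiCache (Redis) endpoint

cache = redis.Redis(host="example.cache.amazonaws.com", port=6379)  # placeholder endpoint
TTL_SECONDS = 60  # staleness window accepted as the trade-off

def query_database(product_id: str) -> dict:
    """Stand-in for the real database read."""
    raise NotImplementedError

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # may be up to TTL_SECONDS stale
    item = query_database(product_id)      # cache miss: hit the database
    cache.setex(key, TTL_SECONDS, json.dumps(item))
    return item
```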
When evaluating performance-related improvements, determine which choices will impact your customers and workload efficiency.
For example, if using a key-value data store increases system performance, it is important to evaluate how the eventually consistent nature of it will impact customers.
- Identify tradeoffs: Use metrics and monitoring to identify areas of poor performance in your system.
- Determine how to make improvements, and how tradeoffs will impact the system and the user experience.
- AWS Builders’ Library
- AWS QuickSight KPIs
- AWS CloudWatch RUM
- AWS X-Ray Documentation
As changes are made to improve performance, evaluate the collected metrics and data. Use this information to determine impact that the performance improvement had on the workload, the workload’s components, and your customers.
- A well-architected system uses a combination of performance-related strategies.
- Determine which strategy will have the largest positive impact on a given hotspot or bottleneck.
- For example, sharding data across multiple relational database systems could improve overall throughput while retaining support for transactions and, within each shard, caching can help to reduce the load.
- AWS Builders’ Library
- AWS CloudWatch RUM
- AWS CloudWatch Synthetics
Where applicable, use multiple strategies to improve performance.
For example, use caching to prevent excessive network or database calls, read replicas to improve database read rates, sharding or compression to reduce data volumes, and buffering and streaming of results as they become available to avoid blocking.
- Use a data-driven approach to evolve your architecture: As you make changes to the workload, collect and evaluate metrics to determine the impact of those changes.
- Measure the impacts to the system and to the end-user to understand how your tradeoffs impact your workload.
- Use a systematic approach, such as load testing, to explore whether the tradeoff improves performance.
- AWS CloudWatch Synthetics
- AWS Builders’ Library
- AWS ElastiCache
- AWS Database Caching
- AWS CloudWatch RUM
Cost Optimization
Architect workloads with the most effective use of services and resources, to achieve business outcomes at the lowest price point.
There are five focus areas for Cost Optimization in the cloud
Practice Cloud Financial Management
Cloud Financial Management (CFM) enables organizations to realize business value and financial success as they optimize their cost and usage and scale on AWS.
Best Practices
Create a team (Cloud Business Office or Cloud Center of Excellence) that is responsible for establishing and maintaining cost awareness across your organization. The team requires people from finance, technology, and business roles across the organization.
- Establish a Cloud Business Office (CBO) or Cloud Center of Excellence (CCOE) team that is responsible for establishing and maintaining a culture of cost awareness in cloud computing.
- Define key members: You need to ensure that all relevant parts of your organization contribute and have a stake in cost management.
- Define goals and metrics: The function needs to deliver value to the organization in different ways. These goals are defined and continually evolve as the organization evolves.
- Establish regular cadence: The group (finance, technology, and business teams) should come together regularly to review their goals and metrics.
- AWS CCOE Blog
- Creating Cloud Business Office
- Create a Cloud Center of Excellence
Involve finance and technology teams in cost and usage discussions at all stages of your cloud journey. Teams regularly meet and discuss topics such as organizational goals and targets, current state of cost and usage, and financial and accounting practices.
- Establish a partnership between key finance and technology stakeholders to create a shared understanding of organizational goals and develop mechanisms to succeed financially in the variable spend model of cloud computing.
- Define key members: Verify that all relevant members of your finance and technology teams participate in the partnership. Relevant finance members will be those having interaction with the cloud bill. This will typically be CFOs, financial controllers, financial planners, business analysts, procurement, and sourcing.
- Define topics for discussion: Define the topics that are common across the teams, or will need a shared understanding.
- Establish regular cadence: To create a finance and technology partnership, establish a regular communication cadence to create and maintain alignment. The group needs to come together regularly against their goals and metrics.
- AWS News Blog website
Adjust existing organizational budgeting and forecasting processes to be compatible with the highly variable nature of cloud costs and usage. Processes must be dynamic using trend-based or business driver-based algorithms, or a combination of both.
- Update existing budget and forecasting processes: Implement trend-based, business driver-based, or a combination of both in your budgeting and forecasting processes.
- Configure alerts and notifications: Use AWS Budgets Alerts and Cost Anomaly Detection.
- Perform regular reviews with key stakeholders: For example, stakeholders in IT, Finance, Platform, and other areas of the business, to align with changes in business direction and usage.
- AWS Cost Explorer
- AWS Budgets
- AWS Pricing Calculator
- AWS Cost Anomaly Detection
- AWS License Manager
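A minimal sketch of the budget-plus-alert configuration described above: a monthly cost budget with an email notification at 80% of actual spend. The account ID, budget amount, and email address are placeholders.

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                       # placeholder account
    Budget={
        "BudgetName": "monthly-workload-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
    }],
)
```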
Implement cost awareness, create transparency, and accountability of costs into new or existing processes that impact usage, and leverage existing processes for cost awareness. Implement cost awareness into employee training.
- Identify relevant organizational processes: Each organizational unit reviews their processes and identifies processes that impact cost and usage.
- Establish self-sustaining cost-aware culture: Make sure all the relevant stakeholders align with cause-of-change and impact as a cost so that they understand cloud cost.
- Update processes with cost awareness: Each process is modified to be made cost aware. The process may require additional pre-checks, such as assessing the impact of cost, or post-checks validating that the expected changes in cost and usage occurred.
- AWS Cloud Financial Management website
Configure AWS Budgets and AWS Cost Anomaly Detection to provide notifications on cost and usage against targets. Have regular meetings to analyze your workload's cost efficiency and to promote a cost-aware culture.
- Configure AWS Budgets on all accounts for your workload. Set a budget for the overall account spend, and a budget for the workload by using tags.
- Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of the workload.
- AWS Cost Explorer
- AWS Trusted Advisor
- AWS Budgets
- AWS Budgets Best Practices
- Amazon CloudWatch
- AWS CloudTrail
- Amazon S3 Analytics
- AWS Cost and Usage Report
Implement tooling and dashboards to monitor cost proactively for the workload. Regularly review the costs with configured tools or out of the box tools, do not just look at costs and categories when you receive notifications. Monitoring and analyzing costs proactively helps to identify positive trends and allows you to promote them throughout your organization.
- Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of the workload.
- Create and enable daily granularity AWS Budgets for cost and usage so that you can take timely action to prevent potential cost overruns.
- Create an AWS Cost Anomaly Detection monitor for costs.
- Use AWS Cost Explorer, or integrate your AWS Cost and Usage Report (CUR) data with Amazon QuickSight dashboards, to visualize your organization’s costs.
- AWS Budgets
- AWS Cost Explorer
- AWS Cost Anomaly Detection
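For the proactive review described above, the Cost Explorer API can feed a dashboard or a scheduled report. The sketch below sums daily unblended cost by service for the last week; it is a minimal example and assumes the account has Cost Explorer enabled.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=7)

response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"])
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 0:
            print(f"  {service}: ${amount:.2f}")
```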
Consult regularly with experts or AWS Partners to consider which services and features provide lower cost. Review AWS blogs and other information sources.
- Subscribe to blogs
- Subscribe to AWS News
- Follow AWS Price Reductions
- Meet with your account team
- AWS Cost Management website
- What’s New with AWS website
- AWS News Blog
Implement changes or programs across your organization to create a cost-aware culture. It is recommended to start small, then as your capabilities increase and your organization’s use of the cloud increases, implement large and wide ranging programs.
- Report cloud costs to technology teams
- Inform stakeholders or team members about planned changes
- Meet with your account team
- Share success stories
- Training
- AWS Cost Management website
- AWS News Blog
Quantifying business value from cost optimization allows you to understand the entire set of benefits to your organization. Because cost optimization is a necessary investment, quantifying business value allows you to explain the return on investment to stakeholders. Quantifying business value can help you gain more buy-in from stakeholders on future cost optimization investments, and provides a framework to measure the outcomes for your organization’s cost optimization activities.
- Execute cost optimization best practices
- Implement automation, for example Auto Scaling
- AWS Cost Management website
- AWS Cost Explorer
- AWS News Blog
Expenditure and usage awareness
Understanding your organization’s costs and drivers is critical for managing your cost and usage effectively, and identifying cost-reduction opportunities
Organizations typically operate multiple workloads run by multiple teams. These teams can be in different organization units, each with its own revenue stream. The capability to attribute resource costs to the workloads, individual organization, or product owners drives efficient usage behavior and helps reduce waste. Accurate cost and usage monitoring allows you to understand how profitable organization units and products are, and allows you to make more informed decisions about where to allocate resources within your organization. Awareness of usage at all levels in the organization is key to driving change, as change in usage drives changes in cost.
Governance
To manage your costs in the cloud, you must manage your usage through the following governance best practices.
Best Practices
Develop policies that define how resources are managed by your organization. Policies should cover cost aspects of resources and workloads, including creation, modification and decommission over the resource lifetime.
- Meet with team members
- Define locations for your workload
- Define and group services and resources
- Define and group the users by function
- Define the actions
- Define the review period
- Document the policies
- AWS Managed Policies for Job Functions website
- AWS Compliance latest news website
- AWS Compliance programs website
Implement both cost and usage goals for your workload. Goals provide direction to your organization on cost and usage, and targets provide measurable outcomes for your workloads.
- Define expected usage levels: Focus on usage levels to begin with. Engage with the application owners, marketing, and greater business teams to understand what the expected usage levels will be for the workload.
- Define workload resourcing and costs: With the usage levels defined, quantify the changes in workload resources required to meet these usage levels.
- Define business goals: Taking the output from the expected changes in usage and cost, combine this with expected changes in technology, or any programs that you are running, and develop goals for the workload
- Define targets: For each of the defined goals specify a measurable target.
- AWS managed policies for job functions
- AWS multi-account strategy for your AWS Control Tower landing zone
- Control access to AWS Regions using IAM policies
Implement a structure of accounts that maps to your organization. This assists in allocating and managing costs throughout your organization.
- Define separation requirements: Requirements for separation are a combination of multiple factors, including security, reliability, and financial constructs
- Define grouping requirements: Requirements for grouping do not override the separation requirements, but are used to assist management
- Define account structure: Using the separations and groupings, specify an account for each group and ensure that separation requirements are maintained.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
- AWS Control Tower
- AWS Organizations
- Consolidated billing
Implement groups and roles that align to your policies and control who can create, modify, or decommission instances and resources in each group. For example, implement development, test, and production groups. This applies to AWS services and third-party solutions.
- Implement groups: Using the groups of users defined in your organizational policies, implement the corresponding groups, if necessary.
- Implement roles and policies: Using the actions defined in your organizational policies, create the required roles and access policies.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
Implement controls based on organization policies and defined groups and roles. These certify that costs are only incurred as defined by organization requirements: for example, control access to regions or resource types with AWS Identity and Access Management (IAM) policies.
- Implement notifications on spend: Using your defined organization policies, create AWS Budgets to provide notifications when spending is outside of your policies.
- Implement controls on usage: Using your defined organization policies, implement IAM policies and roles to specify which actions users can perform and which actions they cannot perform.
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
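As one example of the usage controls above, the sketch below creates an IAM policy that denies actions requested outside approved Regions using the aws:RequestedRegion condition key. The Region list and policy name are illustrative, and a production policy would typically exempt global services.

```python
import json
import boto3

iam = boto3.client("iam")

# Deny any action requested outside the approved Regions.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
        },
    }],
}

iam.create_policy(
    PolicyName="restrict-to-approved-regions",      # illustrative name
    PolicyDocument=json.dumps(policy_document),
)
```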
Track, measure, and audit the lifecycle of projects, teams, and environments to avoid using and paying for unnecessary resources.
- Perform workload reviews: As defined by your organizational policies, audit your existing projects. The amount of effort spent in the audit should be proportional to the approximate risk, value, or cost to the organization.
- AWS Config
- AWS Systems Manager
- AWS managed policies for job functions
- AWS multiple account billing strategy
- Control access to AWS Regions using IAM policies
Monitor Cost and Usage
Enable teams to take action on their cost and usage through detailed visibility into the workload
Cost optimization begins with a granular understanding of the breakdown in cost and usage, the ability to model and forecast future spend, usage, and features, and the implementation of sufficient mechanisms to align cost and usage to your organization’s objectives.
Best Practices
Configure the AWS Cost and Usage Report, and Cost Explorer hourly granularity, to provide detailed cost and usage information. Configure your workload to have log entries for every delivered business outcome. Tag resources.
- Configure the cost and usage report: Using the billing console, configure at least one cost and usage report.
- Configure hourly granularity in Cost Explorer: Using the billing console, enable Hourly and Resource Level Data.
- Configure application logging: Verify that your application logs each business outcome that it delivers so it can be tracked and measured.
- AWS Cost and Usage Report (CUR)
- AWS Glue / Athena
- AWS resource tagging
- AWS Cost Explorer
- AWS Budgets
Identify organization categories that could be used to allocate cost within your organization.
- Define your organization categories: Meet with stakeholders to define categories that reflect your organization's structure and requirements
- Define your functional categories: Meet with stakeholders to define categories that reflect the functions that you have within your business.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
Establish the organization metrics that are required for this workload. Example metrics of a workload are customer reports produced, or web pages served to customers.
- Define workload outcomes: Meet with the stakeholders in the business and define the outcomes for the workload.
- Define workload component outcomes: Optionally, if you have a large and complex workload, or can easily break your workload into components (such as microservices) with well-defined inputs and outputs, define metrics for each component.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Configure AWS Cost Explorer and AWS Budgets in line with your organization policies.
- Create a Cost Optimization group: Configure your account and create a group that has access to the required Cost and Usage reports.
- Configure AWS Budgets: Configure AWS Budgets on all accounts for your workload. Set a budget for the overall account spend, and a budget for the workload by using tags.
- Configure AWS Cost Explorer: Configure AWS Cost Explorer for your workload and accounts. Create a dashboard for the workload that tracks overall spend, and key usage metrics for the workload.
- Configure advanced tooling: Optionally, you can create custom tooling for your organization that provides additional detail and granularity.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Define a tagging schema based on organization, and workload attributes, and cost allocation categories. Implement tagging across all resources. Use Cost Categories to group costs and usage according to organization attributes.
- Define a tagging schema: Gather all stakeholders from across your business to define a schema. This typically includes people in technical, financial, and management roles.
- Tag resources: Using your defined cost attribution categories, place tags on all resources in your workloads according to the categories.
- Implement Cost Categories: You can create Cost Categories without implementing tagging. Cost Categories use the existing cost and usage dimensions.
- Automate tagging: To verify that you maintain high levels of tagging across all resources, automate tagging so that resources are automatically tagged when they are created.
- Monitor and report on tagging: To verify that you maintain high levels of tagging across your organization, report and monitor the tags across your workloads.
- AWS CloudFormation Resource Tag
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
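A minimal sketch of applying a cost-allocation tag and then reporting on which resources carry it, to help monitor tagging coverage. The tag keys, values, and instance ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
tagging = boto3.client("resourcegroupstaggingapi")

# Apply cost-allocation tags to a resource at (or after) creation time.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],              # placeholder instance
    Tags=[
        {"Key": "cost-center", "Value": "mobile-app"},
        {"Key": "environment", "Value": "production"},
    ],
)

# Report: list resources carrying the cost-center tag across supported services.
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(TagFilters=[{"Key": "cost-center"}]):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])
```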
Allocate the workload's costs by metrics or business outcomes to measure workload cost efficiency. Implement a process to analyze the AWS Cost and Usage Report with Amazon Athena, which can provide insight and charge back capability.
- Allocate costs to workload metrics: Using the defined metrics and tagging configured, create a metric that combines the workload output and workload cost.
- Use the analytics services such as Amazon Athena and Amazon QuickSight to create an efficiency dashboard for the overall workload, and any components.
- Tagging AWS resources
- AWS Budgets
- AWS Cost Explorer
- AWS Cost and Usage Reports
Decommission resources
After you manage a list of projects, employees, and technology resources over time, you will be able to identify which resources are no longer being used and which projects no longer have an owner.
Best practices
Define and implement a method to track resources and their associations with systems over their lifetime. You can use tagging to identify the workload or function of the resource.
- Implement a tagging scheme: Implement a tagging scheme that identifies the workload the resource belongs to, verifying that all resources within the workload are tagged accordingly.
- Implement workload throughput or output monitoring: Implement workload throughput monitoring or alarming, triggering on either input requests or output completions.
- AWS Auto Scaling
- AWS Trusted Advisor
- Tagging AWS resources
Implement a process to identify and decommission orphaned resources.
- Create and implement a decommissioning process: Working with the workload developers and owners, build a decommissioning process for the workload and its resources.
- AWS Auto Scaling
- AWS Trusted Advisor
Decommission resources triggered by events such as periodic audits, or changes in usage. Decommissioning is typically performed periodically, and is manual or automated.
- Decommission resources: Using the decommissioning process, decommission each of the resources that have been identified as orphaned.
- AWS Auto Scaling
- AWS Trusted Advisor
Design your workload to gracefully handle resource termination as you identify and decommission non-critical resources, resources that are not required, or resources with low utilization.
- Implement AWS Auto Scaling: For resources that are supported, configure them with AWS Auto Scaling.
- Configure CloudWatch to terminate instances: Instances can be configured to terminate using CloudWatch alarms.
- Implement code within the workload: You can use the AWS SDK or AWS CLI to decommission workload resources.
- AWS Auto Scaling
- AWS Trusted Advisor
- Create Alarms to Stop, Terminate, Reboot, or Recover an Instance
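A minimal sketch of the CloudWatch-driven termination practice above: stop an instance automatically when its average CPU stays below 2% for a day. The instance ID, Region in the action ARN, and thresholds are placeholders, and the terminate action should be reserved for truly disposable resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="stop-idle-instance",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=24,                 # 24 consecutive hours below threshold
    Threshold=2.0,
    ComparisonOperator="LessThanThreshold",
    # Built-in EC2 action: stop the instance when the alarm fires.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:stop"],
)
```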
Cost-effective resources
Using the appropriate services, resources, and configurations for your workloads is key to cost savings
Consider the following when creating cost-effective resources: You can use AWS Solutions Architects, AWS Solutions, AWS Reference Architectures, and APN Partners to help you choose an architecture based on what you have learned.
Evaluate cost when selecting services
Evaluate service costs through the following best practices.
Best Practices
Work with team members to define the balance between cost optimization and other pillars, such as performance and reliability, for this workload.
- Identify organization requirements for cost: Meet with team members from your organization, including those in product management, application owners, development and operational teams, management, and financial roles.
- Prioritize the Well-Architected pillars for this workload and its components; the output is a list of the pillars in order.
- AWS Total Cost of Ownership (TCO) Calculator
Verify every workload component is analyzed, regardless of current size or current costs. The review effort should reflect the potential benefit, such as current and projected costs.
- List the workload components: Build the list of all the workload components. This is used as verification to check that each component was analyzed.
- Prioritize component list: Take the component list and prioritize it in order of effort. This is typically in order of the cost of the component from most expensive to least expensive, or the criticality as defined by your organization’s priorities.
- Perform the analysis: For each component on the list, review the options and services available and choose the option that aligns best with your organizational priorities.
- AWS Pricing Calculator
- AWS Cost Explorer
Look at overall cost to the organization of each component. Look at total cost of ownership by factoring in cost of operations and management, especially when using managed services. The review effort should reflect potential benefit, for example, time spent analyzing is proportional to component cost.
- Using the component list, work through each component from the highest priority to the lowest priority.
- For the higher priority and more costly components, perform additional analysis and assess all available options and their long term impact.
- AWS Total Cost of Ownership (TCO) Calculator
Open-source software eliminates software licensing costs, which can contribute significant costs to workloads. Where licensed software is required, avoid licenses bound to arbitrary attributes such as CPUs, look for licenses that are bound to output or outcomes. The cost of these licenses scales more closely to the benefit they provide.
- Analyze license options: Review the licensing terms of available software. Look for open-source versions that have the required functionality, and whether the benefits of licensed software outweigh the cost.
- Analyze the software provider: Review any historical pricing or licensing changes from the vendor. Look for any changes that do not align to outcomes, such as punitive terms for running on a specific vendor's hardware or platforms.
- AWS Total Cost of Ownership (TCO) Calculator
Factor in cost when selecting all components. This includes using application-level and managed services, such as Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon Simple Email Service (Amazon SES), to reduce overall organization cost. Use serverless and containers for compute, such as AWS Lambda, Amazon Simple Storage Service (Amazon S3) for static websites, and Amazon Elastic Container Service (Amazon ECS). Minimize license costs by using open-source software, or software that does not have license fees: for example, use Amazon Linux for compute workloads, or migrate databases to Amazon Aurora.
- Select each service to optimize cost: Using your prioritized list and analysis, select each option that provides the best match with your organizational priorities.
- AWS Total Cost of Ownership (TCO) Calculator
Workloads can change over time. Some services or features are more cost effective at different usage levels. By performing the analysis on each component over time and at projected usage, the workload remains cost-effective over its lifetime.
- Define predicted usage patterns: Working with your organization, such as marketing and product owners, document what the expected and predicted usage patterns will be for the workload.
- Perform cost analysis at predicted usage: Using the usage patterns defined, perform the analysis at each of these points.
- AWS Total Cost of Ownership (TCO) Calculator
Select the correct resource type, size, and number
By selecting the best resource type, size, and number of resources, you meet the technical requirements with the lowest cost resource
Right-sizing activities take into account all of the resources of a workload, all of the attributes of each individual resource, and the effort involved in the right-sizing operation. Right-sizing can be an iterative process, triggered by changes in usage patterns and external factors, such as AWS price drops or new AWS resource types. Right-sizing can also be a one-off activity if the cost of the right-sizing effort outweighs the potential savings over the life of the workload.
Best Practices
Identify organization requirements and perform cost modeling of the workload and each of its components. Perform benchmark activities for the workload under different predicted loads and compare the costs. The modeling effort should reflect the potential benefit. For example, time spent is proportional to component cost.
- Perform cost modeling: Deploy the workload or a proof-of-concept, into a separate account with the specific resource types and sizes to test. Run the workload with the test data and record the output results, along with the cost data for the time the test was run. Then redeploy the workload or change the resource types and sizes and re-run the test.
- AWS Auto Scaling
- AWS CloudWatch
Select resource size or type based on data about the workload and resource characteristics. For example, compute, memory, throughput, or write intensive. This selection is typically made using a previous (on- premises) version of the workload, using documentation, or using other sources of information about the workload.
- Select resources based on data: Using your cost modeling data, select the expected workload usage level, then select the specified resource type and size.
- AWS Auto Scaling
- AWS CloudWatch
Use metrics from the currently running workload to select the right size and type to optimize for cost. Appropriately provision throughput, sizing, and storage for services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon DynamoDB, Amazon Elastic Block Store (Amazon EBS) (PIOPS), Amazon Relational Database Service (Amazon RDS), Amazon EMR, and networking. This can be done with a feedback loop such as automatic scaling or by custom code in the workload.
- Configure workload metrics: Ensure you capture the key metrics for the workload.
- View rightsizing recommendations: Use the rightsizing recommendations in AWS Compute Optimizer to make adjustments to your workload.
- Select resource type and size automatically based on metrics: Using the workload metrics, manually or automatically select your workload resources.
- AWS Auto Scaling
- AWS CloudWatch
Select the best pricing model
Perform workload cost modeling: Consider the requirements of the workload components and understand the potential pricing models
Perform regular account level analysis: Performing regular cost modeling ensures that opportunities to optimize across multiple workloads can be implemented.
- On-Demand Instances
- Spot Instances
- Commitment discounts - Savings Plans
- Commitment discounts - Reserved Instances/Capacity
- Geographic selection
- Third-party agreements and pricing
Best practices
Analyze each component of the workload. Determine if the component and resources will be running for extended periods (for commitment discounts), or dynamic and short-running (for Spot or On-Demand Instances). Perform an analysis on the workload using the Recommendations feature in AWS Cost Explorer.
- Perform a commitment discount analysis: Using Cost Explorer in your account, review the Savings Plans and Reserved Instance recommendations.
- Analyze workload elasticity: Using the hourly granularity in Cost Explorer, or a custom dashboard, analyze the workload elasticity. Look for regular changes in the number of instances that are running. Short duration instances are candidates for Spot Instances or Spot Fleet.
Resource pricing can be different in each Region. Factoring in Region cost helps ensure that you pay the lowest overall price for this workload.
- Review Region pricing: Analyze the workload costs in the current Region. Starting with the highest costs by service and usage type, calculate the costs in other Regions that are available.
Cost efficient agreements and terms ensure the cost of these services scales with the benefits they provide. Select agreements and pricing that scale when they provide additional benefits to your organization.
- Analyze third-party agreements and terms: Review the pricing in third party agreements. Perform modeling for different levels of your usage, and factor in new costs such as new service usage, or increases in current services due to workload growth.
Permanently running resources should use reserved capacity such as Savings Plans or Reserved Instances. Short-term capacity is configured to use Spot Instances, or Spot Fleet. On-Demand Instances are only used for short-term workloads that cannot be interrupted and do not run long enough for reserved capacity, between 25% and 75% of the period, depending on the resource type.
- Implement pricing models: Using your analysis results, purchase Savings Plans (SPs), Reserved Instances (RIs) or implement Spot Instances.
- Workload review cycle: Implement a review cycle for the workload that specifically analyzes pricing model coverage.
Use Cost Explorer Savings Plans and Reserved Instance recommendations to perform regular analysis at the management account level for commitment discounts.
- Perform a commitment discount analysis: Using Cost Explorer at the management account level, review the Savings Plans and Reserved Instance recommendations.
Plan the data transfer
Efficient use of networking resources is required for cost optimization in the cloud.
Best practices
Gather organization requirements and perform data transfer modeling of the workload and each of its components. This identifies the lowest cost point for its current data transfer requirements.
- Calculate data transfer costs: Use the AWS pricing pages and calculate the data transfer costs for the workload.
- Link costs to outcomes: For each data transfer cost incurred, specify the outcome that it achieves for the workload.
- AWS caching solutions (doc)
- AWS Pricing (doc)
- Amazon EC2 Pricing (doc)
- Amazon VPC pricing (doc)
- Deliver content faster with Amazon CloudFront (doc)
All components are selected, and architecture is designed to reduce data transfer costs. This includes using components such as wide-area-network (WAN) optimization and Multi-Availability Zone (AZ) configurations
- Select components for data transfer: Using the data transfer modeling, focus on where the largest data transfer costs are or where they would be if the workload usage changes.
- AWS caching solutions (doc)
- Deliver content faster with Amazon CloudFront (doc)
Implement services to reduce data transfer. For example, using a content delivery network (CDN) such as Amazon CloudFront to deliver content to end users, caching layers using Amazon ElastiCache, or using AWS Direct Connect instead of VPN for connectivity to AWS.
- Implement services: Using the data transfer modeling, look at where the largest costs and highest volume flows are.
- AWS Direct Connect
- AWS CloudFront
Manage demand and supplying resources
When you move to the cloud, you pay only for what you need. You can supply resources to match the workload demand at the time they’re needed — eliminating the need for costly and wasteful overprovisioning
You can also modify the demand using a throttle, buffer, or queue to smooth the demand and serve it with less resources.
Best Practices
Analyze the demand of the workload over time. Verify that the analysis covers seasonal trends and accurately represents operating conditions over the full workload lifetime. Analysis effort should reflect the potential benefit, for example, time spent is proportional to the workload cost.
- Analyze existing workload data: Analyze data from the existing workload, previous versions of the workload, or predicted usage patterns. Use log files and monitoring data to gain insight on how customers use the workload.
- Forecast outside influence: Meet with team members from across the organization that can influence or change the demand in the workload.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS Cost Explorer
- AWS QuickSight
Buffering and throttling modify the demand on your workload, smoothing out any peaks. Implement throttling when your clients perform retries. Implement buffering to store the request and defer processing until a later time. Verify that your throttles and buffers are designed so clients receive a response in the required time.
- Analyze the client requirements: Analyze the client requests to determine if they are capable of performing retries. For clients that cannot perform retries, buffers will need to be implemented.
- Implement a buffer or throttle: Implement a buffer or throttle in the workload. A queue such as Amazon Simple Queue Service (Amazon SQS) can provide a buffer to your workload components.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS API Gateway
- AWS SQS
- AWS Kinesis
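To make the buffering pattern concrete, here is a minimal sketch of an SQS-backed buffer with boto3: producers enqueue requests instead of processing them synchronously, and a worker drains the queue at its own pace. The queue name and message contents are hypothetical placeholders.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="workload-request-buffer")["QueueUrl"]

# Producer side: buffer the request rather than processing it immediately.
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"order_id": 12345}))

# Worker side: long-poll the buffer and process at a controlled rate.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
).get("Messages", [])
for msg in messages:
    request = json.loads(msg["Body"])
    # ... process the request, then remove it from the buffer ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```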
Resources are provisioned in a planned manner. This can be demand-based, such as through automatic scaling, or time-based, where demand is predictable and resources are provided based on time. These methods result in the least amount of over or under-provisioning.
- Configure time-based scheduling: For predictable changes in demand, time-based scaling can provide the correct number of resources in a timely manner.
- Configure Auto Scaling: To configure scaling based on active workload metrics, use Amazon Auto Scaling.
- AWS Auto Scaling
- AWS Instance Scheduler
- AWS SQS
- AWS Kinesis
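The following minimal sketch shows both approaches on an existing EC2 Auto Scaling group with boto3: scheduled actions supply capacity ahead of predictable weekday demand, and a target tracking policy follows actual load. The group name, schedule, and target values are hypothetical placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Time-based: scale out for business hours, scale in overnight (times in UTC).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="mobile-api-asg",
    ScheduledActionName="business-hours-scale-out",
    Recurrence="0 8 * * 1-5",  # 08:00 UTC, Monday through Friday
    MinSize=4, MaxSize=12, DesiredCapacity=6,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="mobile-api-asg",
    ScheduledActionName="overnight-scale-in",
    Recurrence="0 20 * * *",  # 20:00 UTC, every day
    MinSize=1, MaxSize=4, DesiredCapacity=1,
)

# Demand-based: keep average CPU near 50% so capacity follows actual load.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="mobile-api-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```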
Optimize over time
In AWS, you optimize over time by reviewing new services and implementing them in your workload
As AWS releases new services and features, it is a best practice to review your existing architectural decisions to ensure that they remain cost effective. As your requirements change, be aggressive in decommissioning resources, components, and workloads that you no longer require. Consider the following best practices to help you optimize over time.
Best Practices
Develop a process that defines the criteria and process for workload review. The review effort should reflect potential benefit. For example, core workloads or workloads with a value of over 10% of the bill are reviewed quarterly, while workloads below 10% are reviewed annually.
- Define review frequency: Define how frequently the workload and its components should be reviewed.
- Define review thoroughness: Define how much effort is spent on the review of the workload or workload components.
- AWS News Blog
Existing workloads are regularly reviewed according to the defined processes.
- Regularly review the workload: Using your defined process, perform reviews with the frequency specified. Verify that you spend the correct amount of effort on each component.
- Implement new services: If the outcome of the analysis is to implement changes, first perform a baseline of the workload to know the current cost for each output.
- AWS News Blog
- What's New with AWS
Sustainability
Sustainability in the cloud is a continuous effort focused primarily on energy reduction and efficiency across all components of a workload by achieving the maximum benefit from the resources provisioned and minimizing the total resources required.
This effort ranges from the initial selection of an efficient programming language and the adoption of modern algorithms, to the use of efficient data storage techniques, deployment to correctly sized and efficient compute infrastructure, and minimizing requirements for high-powered end-user hardware.
There are six focus areas for Sustainability in the cloud
Region selection
Choose Regions where you will implement your workloads based on both your business requirements and sustainability goals.
Best Practices
User behavior patterns
The way users consume your workloads and other resources can help you identify improvements to meet sustainability goals. Scale infrastructure to continually match user load and ensure that only the minimum resources required to support users are deployed
Align service levels to customer needs.
Position resources to limit the network required for users to consume them.
Remove existing, unused assets.
Identify created assets that are unused and stop generating them. Provide your team members with devices that support their needs with minimized sustainability impact.
Best Practices
Identify periods of low or no utilization and scale down resources to eliminate excess capacity and improve efficiency.
- Use elasticity in your architecture to ensure that the workload can scale down quickly and easily during periods of low user load.
- Verify that the metrics for scaling up or down are validated against the type of workload being deployed. If you are deploying a video transcoding application, 100% CPU utilization is expected and should not be your primary metric. You can use a customized metric (such as memory utilization) for your scaling policy if required; a sketch of such a policy follows this list.
- AWS Auto Scaling
- AWS CloudWatch
- AWS X-Ray
- AWS VPC Flow logs
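The sketch below shows a scaling policy driven by a customized metric (memory utilization published by the CloudWatch agent under the CWAgent namespace) rather than CPU. The Auto Scaling group name, metric dimension, and target value are hypothetical, and the example assumes the CloudWatch agent is already publishing the metric.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="transcoding-asg",
    PolicyName="memory-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            # mem_used_percent is published by the CloudWatch agent (CWAgent).
            "Namespace": "CWAgent",
            "MetricName": "mem_used_percent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": "transcoding-asg"}
            ],
            "Statistic": "Average",
        },
        # Add or remove instances to keep average memory utilization near 60%.
        "TargetValue": 60.0,
    },
)
```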
Define and update Service Level Agreements (SLAs) such as availability or data retention periods to minimize the number of resources required to support your workload while continuing to meet business requirements.
- Define SLAs that support your sustainability goals while meeting your business requirements.
- Redefine SLAs to meet business requirements, not exceed them.
- Make trade-offs that significantly reduce sustainability impacts in exchange for acceptable decreases in service levels.
- Use design patterns that prioritize business-critical functions, and allow lower service levels (such as response time or recovery time objectives) for non-critical functions.
- AWS Service Level Agreements (SLAs) website
Analyze application assets (such as pre-compiled reports, datasets, and static images) and asset access patterns to identify redundancy, underutilization, and potential decommission targets.
Consolidate generated assets with redundant content (for example, monthly reports with overlapping or common datasets and outputs) to remove the resources consumed when duplicating outputs
Decommission unused assets (for example, images of products that are no longer sold) to free consumed resources and reduce the number of resources used to support the workload.
- Manage static assets and remove assets that are no longer required.
- Manage generated assets and stop generating and remove assets that are no longer required.
- Consolidate overlapping generated assets to remove redundant processing.
- Instruct third parties to stop producing and storing assets managed on your behalf that are no longer required.
Analyze network access patterns to identify where your customers are connecting from geographically. Select Regions and services that reduce the distance network traffic must travel to decrease the total network resources required to support your workload.
- Select the Regions for your workload deployment based on the following key elements:
- Your Sustainability goal
- Where your data is located
- Where your users are located
- Other constraints
- Use AWS Local Zones to run workloads like video rendering and graphics-intensive virtual desktop applications.
- Use local caching or AWS Caching Solutions for frequently used resources to improve performance, reduce data movement, and lower environmental impact.
- AWS CloudFront
- AWS ElastiCache
- AWS DynamoDB Accelerator
- AWS Lambda@Edge
Optimize resources provided to team members to minimize the sustainability impact while supporting their needs. For example, perform complex operations, such as rendering and compilation, on highly utilized shared cloud desktops instead of on underutilized high-powered single-user systems.
- Provision workstations and other devices to align with how they’re used.
- Use virtual desktops and application streaming to limit upgrade and device requirements.
- Move processor or memory-intensive tasks to the cloud.
- Evaluate the impact of processes and systems on your device lifecycle, and select solutions that minimize the requirement for device replacement while satisfying business requirements.
- Implement remote management for devices to reduce required business travel.
- AWS Workspaces
- AWS AppStream
- AWS Systems Manager Fleet Manager
Software and Architecture patterns
Implement patterns for performing load smoothing and maintaining consistent high utilization of deployed resources to minimize the resources consumed. Components might become idle from lack of use because of changes in user behavior over time
Revise patterns and architecture to consolidate under-utilized components to increase overall utilization.
Retire components that are no longer required.
Understand the performance of your workload components, and optimize the components that consume the most resources. Be aware of the devices your customers use to access your services, and implement patterns to minimize the need for device upgrades.
Best Practices
Use efficient software designs and architectures to minimize the average resources required per unit of work.
Implement mechanisms that result in even utilization of components to reduce resources that are idle between tasks and minimize the impact of load spikes.
- Queue requests that don’t require immediate processing.
- Increase serialization to flatten utilization across your pipeline.
- Modify the capacity of individual components to prevent idling resources waiting for input.
- Create buffers and establish rate limiting to smooth the consumption of external services.
- Use the most efficient available hardware for your software optimizations.
- Use queue-driven architectures, pipeline management, and On-Demand Instance workers to maximize utilization for batch processing.
- Schedule tasks to avoid load spikes and resource contention from simultaneous execution.
- Schedule jobs during times of day where carbon intensity for power is lowest.
- AWS SQS
- AWS Step Functions
- AWS Lambda
- AWS EventBridge
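As a small illustration of scheduling work away from load spikes, the sketch below creates an EventBridge rule that triggers a batch job during a fixed off-peak window. The rule name, cron expression, and Lambda target ARN are hypothetical placeholders, and the Lambda function would additionally need a resource-based permission allowing EventBridge to invoke it.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Run the batch job at 03:00 UTC every day, away from interactive peak load.
events.put_rule(
    Name="nightly-report-batch",
    ScheduleExpression="cron(0 3 * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="nightly-report-batch",
    Targets=[{
        "Id": "report-worker",
        # Hypothetical Lambda function that performs the batch processing.
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:report-worker",
    }],
)
```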
Monitor workload activity to identify changes in utilization of individual components over time. Remove components that are unused and no longer required, and refactor components with little utilization to limit wasted resources.
- Analyze load (using indicators such as transaction flow and API calls) on functional components to identify unused and underutilized components.
- Retire components that are no longer needed.
- Refactor underutilized components.
- AWS X-Ray
- AWS CloudWatch
Monitor workload activity to identify application components that consume the most resources. Optimize the code that runs within these components to minimize resource usage while maximizing performance.
- Monitor performance as a function of resource usage to identify components with high resource requirements per unit of work as targets for optimization.
- Use a code profiler to identify the areas of code that use the most time or resources as targets for optimization.
- Replace algorithms with more efficient versions that produce the same result.
- Use hardware acceleration to improve the efficiency of blocks of code with long execution times.
- Use the most efficient operating system and programming language for the workload.
- Remove unnecessary sorting and formatting.
- Use data transfer patterns that minimize the resources used based on how frequently the data changes and how it is consumed.
- AWS CloudWatch
- AWS CodeGuru
Understand the devices and equipment your customers use to consume your services, their expected lifecycle, and the financial and sustainability impact of replacing those components.
Implement software patterns and architectures to minimize the need for customers to replace devices and upgrade equipment.
- Inventory the devices your customers use.
- Test using managed device farms with representative sets of hardware to understand the impact of your changes, and iterate development to maximize the devices supported.
- Account for network bandwidth and latency when building payloads, and implement capabilities that help your applications work well on low-bandwidth, high-latency links.
- Pre-process data payloads to reduce local processing requirements and limit data transfer requirements.
- Perform computationally intense activities server-side (such as image rendering), or use application streaming to improve the user experience on older devices.
- Segment and paginate output, especially for interactive sessions, to manage payloads and limit local storage requirements.
- AWS Device Farm
- AWS AppStream
Understand how data is used within your workload, consumed by your users, transferred, and stored. Select technologies to minimize data processing and storage requirements.
- Analyze your data access and storage patterns.
- Store data files in efficient file formats such as Parquet to prevent unnecessary processing (for example, when running analytics) and to reduce the total storage provisioned.
- Use technologies that work natively with compressed data
- Use the database engine that best supports your dominant query pattern.
- Manage your database indexes to ensure index designs support efficient query execution.
- Select network protocols that reduce the amount of network capacity consumed.
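To illustrate the file-format point, the following sketch writes a small dataset as compressed Parquet and reads back a single column. It assumes pandas with the pyarrow engine is installed; the data and file name are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["login", "purchase", "logout"],
    "duration_ms": [120, 5400, 80],
})

# Columnar layout plus snappy compression: smaller files, and analytics engines
# read only the columns a query touches.
df.to_parquet("events.parquet", engine="pyarrow", compression="snappy")

# Reading back a single column avoids scanning the whole dataset.
durations = pd.read_parquet("events.parquet", columns=["duration_ms"])
print(durations)
```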
Data patterns
Implement data management practices to reduce the provisioned storage required to support your workload, and the resources required to use it. Understand your data, and use storage technologies and configurations that best support the business value of the data and how it’s used
Lifecycle data to more efficient, less performant storage when requirements decrease, and delete data that’s no longer required.
Best Practices
Classify data to understand its significance to business outcomes. Use this information to determine when you can move data to more energy-efficient storage or safely delete it.
- Determine requirements for the distribution, retention, and deletion of your data.
- Use tagging on volumes and objects to record the metadata that’s used to determine how it’s managed, including data classification.
- Periodically audit your environment for untagged and unclassified data, and classify and tag the data appropriately.
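A minimal tagging sketch with boto3 is shown below: it records a data classification and retention hint on an EBS volume and an S3 object so that audits and lifecycle automation can act on them. The volume ID, bucket, key, and tag values are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# Tag a volume with its classification and retention requirement.
ec2.create_tags(
    Resources=["vol-0123456789abcdef0"],
    Tags=[
        {"Key": "data-classification", "Value": "internal"},
        {"Key": "retention-days", "Value": "365"},
    ],
)

# Tag an object the same way; S3 object tags can also drive lifecycle rules.
s3.put_object_tagging(
    Bucket="example-reports-bucket",
    Key="reports/2024/05/monthly.parquet",
    Tagging={"TagSet": [{"Key": "data-classification", "Value": "internal"}]},
)
```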
Use storage that best supports how your data is accessed and stored to minimize the resources provisioned while supporting your workload.
For example, Solid State Devices (SSDs) are more energy intensive than magnetic drives and should be used only for active data use cases. Use energy-efficient, archival-class storage for infrequently accessed data.
- Monitor your data access patterns.
- Migrate data to the appropriate technology based on access pattern.
- Migrate archival data to storage designed for that purpose.
- AWS CloudWatch
Manage the lifecycle of all your data and automatically enforce deletion timelines to minimize the total storage requirements of your workload.
- Define lifecycle policies for all your data classification types.
- Set automated lifecycle policies to enforce lifecycle rules.
- Delete unused volumes and snapshots.
- Aggregate data where applicable based on lifecycle rules.
- AWS Config Rules
- AWS S3 Intelligent Tiering
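The sketch below applies an S3 lifecycle configuration that tiers objects to Intelligent-Tiering after 30 days, archives them to Glacier Deep Archive after 180 days, and deletes them after a year; it also cleans up incomplete multipart uploads. The bucket name, prefix, and timings are hypothetical and should reflect your own data classification.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-reports-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "generated-reports-lifecycle",
            "Filter": {"Prefix": "reports/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},
            # Incomplete multipart uploads also consume storage; abort them.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)
```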
To minimize total provisioned storage, create block storage with size allocations that are appropriate for the workload.
Use elastic volumes to expand storage as data grows without having to resize storage attached to compute resources. Regularly review elastic volumes and shrink over-provisioned volumes to fit the current data size.
- Monitor the utilization of your data volumes.
- Use elastic volumes and managed block data services to automate allocation of additional storage as your persistent data grows.
- Set target levels of utilization for your data volumes, and resize volumes outside of expected ranges.
- Size read-only volumes to fit the data.
- Migrate data to object stores to avoid provisioning the excess capacity from fixed volume sizes on block storage.
- AWS EBS Elastic Volumes
- AWS FSx
- AWS EFS
- AWS CloudWatch
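As a sketch of the elastic-volume approach, the snippet below grows an EBS volume when measured utilization crosses a threshold instead of provisioning a large volume up front. The volume ID and utilization value are placeholders (in practice the disk_used_percent metric would come from the CloudWatch agent), and note that EBS volumes can only be grown in place; shrinking requires migrating data to a smaller volume.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume_id = "vol-0123456789abcdef0"
disk_used_percent = 87.0  # stand-in for a CloudWatch agent measurement

if disk_used_percent > 80.0:
    current_size = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["Size"]
    # Grow by roughly 20%, rounded up to the next whole GiB.
    new_size = int(current_size * 1.2) + 1
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)
    # The file system on the instance still needs to be extended afterwards
    # (for example with growpart and resize2fs/xfs_growfs).
```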
Duplicate data only when necessary to minimize total storage consumed. Use backup technologies that deduplicate data at the file and block level.
Limit the use of Redundant Array of Independent Drives (RAID) configurations except where required to meet Service Level Agreements (SLAs).
- Use mechanisms that can deduplicate data at the block and object level.
- Use backup technology that can make incremental backups and deduplicate data at the block, file, and object level.
- Use RAID only when required to meet your SLAs
- Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune verbosity when needed.
- Pre-populate caches only where justified.
- Establish cache monitoring and automation to resize cache accordingly.
- Remove out-of-date deployments and assets from object stores and edge caches when pushing new versions of your workload.
- AWS EBS Snapshots
- AWS CloudWatch logs
- AWS FSx data dedup
- AWS CloudFront
Adopt shared storage and single sources of truth to avoid data duplication and reduce the total storage requirements of your workload.
Fetch data from shared storage only as needed. Detach unused volumes to make more resources available.
- Migrate data to shared storage when the data has multiple consumers.
- Fetch data from shared storage only as needed.
- Delete data as appropriate for your usage patterns, and implement time-to-live (TTL) functionality to manage cached data.
- Detach volumes from clients that are not actively using them
- AWS FSx
- AWS EFS
- AWS S3
Use shared storage and access data from regional data stores to minimize the total networking resources required to support data movement for your workload.
- Store data as close to the consumer as possible
- Partition regionally consumed services so that their Region-specific data is stored within the Region where it is consumed.
- Use block-level duplication instead of file or object-level duplication when copying changes across the network.
- Compress data before moving it over the network.
To minimize storage consumption, only back up data that has business value or is needed to satisfy compliance requirements.
Examine backup policies and exclude ephemeral storage that doesn’t provide value in a recovery scenario.
- Use your data classification to establish what data needs to be backed up.
- Exclude data that you can easily recreate.
- Exclude ephemeral data from your backups.
- Exclude local copies of data, unless the time required to restore that data from a common location exceeds your service level agreements (SLAs).
- AWS Backup
- AWS EBS Snapshots
- AWS RDS Backups
Hardware patterns
Look for opportunities to reduce workload sustainability impacts by making changes to your hardware management practices
Minimize the amount of hardware needed to provision and deploy, and select the most efficient hardware for your individual workload.
Best Practices
Using the capabilities of the cloud, you can make frequent changes to your workload implementations.
Update deployed components as your needs change.
- Enable horizontal scaling, and use automation to scale out as loads increase and to scale in as loads decrease
- Scale using small increments for variable workloads.
- Align scaling with cyclical utilization patterns (for example, a payroll system with intense bi-weekly processing activities) as load varies over days, weeks, months, or years.
- Negotiate service level agreements (SLAs) that allow for a temporary reduction in capacity while automation deploys replacement resources.
- AWS Compute Optimizer
- AWS Auto Scaling
Continually monitor the release of new instance types and take advantage of energy efficiency improvements, including those instance types designed to support specific workloads such as machine learning training, inference, and video transcoding.
- Learn and explore instance types which can lower your workload environmental impact.
- Plan and transition your workload to instance types with the least impact
- Operate and optimize your workload instance.
- AWS Compute Optimizer
- AWS CloudWatch
- AWS Graviton-based instances
- AWS Trainium
- AWS Inferentia
Managed services shift responsibility for maintaining high average utilization and for sustainability optimization of the deployed hardware to AWS.
Use managed services to distribute the sustainability impact of the service across all tenants of the service, reducing your individual contribution.
- Migrate from self-hosted services to managed services.
- Use managed Amazon Relational Database Service (Amazon RDS) instances instead of maintaining your own Amazon RDS instances on Amazon Elastic Compute Cloud (Amazon EC2).
- Use managed container services, such as AWS Fargate, instead of implementing your own container infrastructure.
- AWS Fargate
- AWS DocumentDB
- AWS Elastic Kubernetes Service (EKS)
- AWS Managed Streaming for Apache Kafka (Amazon MSK)
- AWS Redshift
- AWS RDS
Graphics Processing Units (GPUs) can be a source of high-power consumption, and many GPU workloads are highly variable, such as rendering, transcoding, and machine learning training and modeling.
Only run GPU instances for the time needed, and decommission them with automation when not required to minimize resources consumed.
- Use GPUs only for tasks where they’re more efficient than CPU-based alternatives
- Use automation to release GPU instances when not in use
- Use flexible graphics acceleration rather than dedicated GPU instances
- Take advantage of custom-purpose hardware that is specific to your workload
- AWS Inferentia
- AWS Trainium
- AWS Accelerated Computing for EC2 Instances
- AWS EC2 VT1 Instances
- AWS Elastic Graphics
Development and deployment process
Look for opportunities to reduce your sustainability impact by making changes to your development, test, and deployment practices.
Best Practices
Test and validate potential improvements before deploying them to production. Account for the cost of testing when calculating potential future benefit of an improvement.
Develop low-cost testing methods to enable delivery of small improvements.
- Add requirements for sustainability to your development process.
- Allow resources to work in parallel to develop, test, and deploy sustainability improvements
- Test and validate potential sustainability impact improvements before deploying into production.
- Test potential improvements using the minimum viable representative components.
- Deploy tested sustainability improvements to production as they become available.
Up-to-date operating systems, libraries, and applications can improve workload efficiency and enable easier adoption of more efficient technologies.
Up-to-date software might also include features to measure the sustainability impact of your workload more accurately, as vendors deliver features to meet their own sustainability goals.
- Take advantage of agility in the cloud to quickly test how new features can improve your workload to:
- Reduce sustainability impacts
- Gain performance efficiencies
- Remove barriers for a planned improvement
- Improve your ability to measure and manage sustainability impacts
- Inventory your workload software and architecture and identify components that need to be updated.
- You can use AWS Systems Manager Inventory to collect operating system (OS), application, and instance metadata from your Amazon EC2 instances. This helps you quickly understand which instances are running the software and configurations required by your software policy, and which instances need to be updated (see the sketch after this list).
- AWS Systems Manager Patch Manager
- What's New with AWS (website)
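A minimal sketch of querying that inventory with boto3 follows. It lists the applications recorded for one managed instance; the instance ID is a hypothetical placeholder, and the instance must be managed by Systems Manager with inventory collection enabled.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

entries = ssm.list_inventory_entries(
    InstanceId="i-0123456789abcdef0",
    TypeName="AWS:Application",
)
for app in entries["Entries"]:
    # Compare installed versions against your software policy.
    print(app.get("Name"), app.get("Version"))
```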
Use automation and infrastructure-as-code to bring pre-production environments up when needed and take them down when not used.
A common pattern is to schedule periods of availability that coincide with the working hours of your development team members.
Hibernation is a useful tool to preserve the state and rapidly bring instances online only when needed.
- Use automation to maximize utilization of your development and test environments.
- Use automation to manage the lifecycle of your development and test environments.
- Use minimum viable representative environments to develop and test potential improvements.
- Use On-Demand Instances to supplement your developer devices.
- Use automation to maximize the efficiency of your build resources.
- Use instance types with burst capacity, Spot Instances, and other technologies to align build capacity with use.
- AWS Systems Manager Session Manager
- AWS EC2 Burstable performance instances
- AWS CloudFormation
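The sketch below shows the stop/start half of that pattern with boto3: dev-tagged instances are hibernated at night and started again in the morning, with the two functions intended to be invoked on a schedule (for example by EventBridge-triggered Lambda functions or the AWS Instance Scheduler). The tag key and value are hypothetical, and hibernation only works for instances launched with hibernation enabled.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def _dev_instance_ids(state):
    # Find instances tagged environment=dev in the given state.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": [state]},
        ]
    )["Reservations"]
    return [i["InstanceId"] for r in reservations for i in r["Instances"]]

def scale_in_for_the_night():
    ids = _dev_instance_ids("running")
    if ids:
        # Hibernate=True preserves in-memory state for a fast morning start.
        ec2.stop_instances(InstanceIds=ids, Hibernate=True)

def scale_out_for_the_day():
    ids = _dev_instance_ids("stopped")
    if ids:
        ec2.start_instances(InstanceIds=ids)
```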
Managed device farms spread the sustainability impact of hardware manufacturing and resource usage across multiple tenants.
Managed device farms offer diverse device types so you can support older, less popular hardware, and avoid customer sustainability impact from unnecessary device upgrades.
- AWS Device Farm