Operational Acceptance Testing (OAT)

The purpose of OAT is to prove the aspects of the system that do not affect the functionality but can still have a profound effect on how it is managed and supported.
OAT concentrates on areas such as resiliency, recoverability, integrity, manageability and supportability, with the specific exclusions of Performance, Security and Disaster Recovery, which are areas of speciality in their own right.
The required level of OAT is determined using Change Driven Risk Management (CDRM), whose output recommends the risk mitigation strategy for all phases of the project. This enables the OAT phase to focus on mitigating the operational risks.
The following mitigation methods form the OAT phase:
  • Backup & Recovery
  • Change Implementation
  • Change Back-out
  • Component Failure
  • Shutdown & Resumption
  • Operational Support & Procedure
  • Alerts
All methods must be performed based on the CDRM technique and TS standards in a managed non-functional test environment that is an accurate reflection of production.

Categories of OAT

Backup & Recovery

To prove both the backup and recovery processes. The testing will prove the operation, operability and integrity of backup procedures to ensure that the operating systems and data can be restored successfully at the same site and also at another site if applicable. The recovery testing includes the build and configuration of a component. These tests will ensure build quality and guarantee subsequent builds of components are to the same standard.
The testing should prove that:
  • Service can be restored to an agreed recovery point utilising appropriate TS standard backup and restore methods.
  • Backups taken at one site can be recovered to the same site.
  • Backups taken at one site can be recovered to another site.
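
One way to check the integrity of a restore is to compare file checksums before backup and after recovery. The sketch below is illustrative only: it assumes a file-level backup, uses a zip archive as a hypothetical stand-in for the real backup tooling, and the names tree_digest, verify_restore and demo are invented for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing, extra, or different after a restore."""
    before, after = tree_digest(source), tree_digest(restored)
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))


def demo() -> tuple[list[str], list[str]]:
    """Back up and restore a tiny tree, then corrupt the restored copy."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "source"
        source.mkdir()
        (source / "data.txt").write_text("ledger v1")

        # Assumption: a zip archive stands in for the real backup and restore method.
        archive = shutil.make_archive(str(tmp_path / "backup"), "zip", root_dir=source)
        restored = tmp_path / "restored"
        shutil.unpack_archive(archive, restored)

        clean = verify_restore(source, restored)
        (restored / "data.txt").write_text("ledger v1 (corrupted)")
        corrupted = verify_restore(source, restored)
        return clean, corrupted
```

An empty list from verify_restore means every file came back byte for byte. For the cross-site case, the same comparison could be run against digests recorded at the originating site.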

Change Implementation

To prove that the implementation into the production environment will be successful and not adversely affect the existing production services.
The testing should prove that:
  • The implementation into the live production environment will not adversely affect the integrity of the current production services.
  • The implementation process can be replicated by using valid documentation that includes the time required for each step and the order of implementation.

Change Back-out

To prove the back-out of a failed change from the production environment will be successful and will not adversely affect existing production services.
The testing should prove that:
  • All the required steps to successfully back out a change are valid.
  • The time required for each step of the back-out is known and documented.
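
A rehearsal of the back-out could record how long each step takes and compare it with the documented figure. The following is a minimal sketch under that assumption: run_timed_steps and steps_over_budget are hypothetical helpers, and the step names and actions would come from the real back-out plan.

```python
import time
from typing import Callable


def run_timed_steps(steps: list[tuple[str, Callable[[], None]]]) -> list[tuple[str, float]]:
    """Run each step in the documented order and record its duration in seconds.

    An exception raised by a step stops the run, which marks that step as invalid.
    """
    timings = []
    for name, action in steps:
        started = time.perf_counter()
        action()
        timings.append((name, time.perf_counter() - started))
    return timings


def steps_over_budget(timings: list[tuple[str, float]],
                      documented_seconds: dict[str, float]) -> list[str]:
    """Return steps that are undocumented or took longer than documented."""
    return [name for name, seconds in timings
            if name not in documented_seconds or seconds > documented_seconds[name]]
```

The same pattern could be applied to the implementation steps, where the time per step and the order of implementation also need to be documented.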

Component Failure

To prove that the infrastructure has been designed to cope with unplanned outages. Following failure and repair, it should be possible to recover the failed components into the infrastructure in line with TS Recovery Management processes and timescales.
The testing should prove that:
  • The service can continue after the failure of individual components (outside its core operating environment), while issuing appropriate error messages. The system should be designed to offer transparent failover where possible and upon terminal error on the active platform (usually identified by a heartbeat failure), the failover infrastructure should be automatically activated. Ultimately, this covers the ability to continue operation at an alternative facility after the failure at the primary facility. This should be proven for new and amended components.
  • The system can automatically adjust itself to availability of system resources.
  • If fail-over is invoked, fail-back can be performed successfully and recovery to the original state is achievable. When component failures are resolved the service should fully recover itself with no customer impact. Any non-automated actions should be documented.
  • If several components have been affected by a failure, there should be a proven plan showing the recommended order of restart, time to complete, etc.
  • Failure to complete a unit of work does not result in data corruption or inconsistency; all services handle failures while preserving data integrity.
  • Any impact on the E2E service by the failure of individual components is understood and documented.
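
Where several components are affected, the recommended order of restart can be derived from a dependency map rather than maintained by hand. The sketch below assumes such a map exists; the component names are hypothetical, and restart_order is an illustrative helper built on the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def restart_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every component starts after its dependencies.

    depends_on maps a component to the set of components it needs running first.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical service: the web tier needs the app tier, which needs a database and a queue.
dependencies = {
    "web": {"app"},
    "app": {"database", "queue"},
}
order = restart_order(dependencies)
```

Components with no dependency between them (here, the database and the queue) may appear in either order, so the documented plan should say whether they can be restarted in parallel.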

Shutdown & Resumption

To prove that the system can be shut down and restarted cleanly without service disruption or within an agreed window of scheduled downtime.
The testing should prove that:
  • Each component can be shut down and resumed successfully within the agreed timescale.
  • The order of resumption of the components, if applicable, is valid and documented.

Operational Support & Procedure

To prove that all components of a service are capable of being supported to TS standards. 
The testing should prove that: 
  • Diagnostic information produced in failure situations is of sufficient quality to support any manual or, ideally, automatic corrective actions.
  • Any recovery documentation produced or amended, including Service Diagrams, is valid. This should be handed over to the relevant support areas.
  • Documentation for each element which covers restart / recovery, error conditions, alerts, etc. must be provided.
  • Full remote control capability to resolve error conditions should be proven for all new components and tools.
  • Maintenance of the components can be performed without disruption to the service or within an agreed outage as per the SLA. It should be possible to start, shut down and control the service to support maintenance.

Alerts

To prove that alerts are raised in the event of a component failure, error condition or if a threshold is breached.
The testing should prove that:
  • Event Monitoring - All critical alerts go to the TEC and reference the correct resolution document. Any system that fails at an infrastructure or application level alerts on failure or is addressed by Heartbeat functionality.
  • Threshold Monitoring - Alerts are in place and issued if agreed thresholds are exceeded, e.g. disk utilisation, CPU, memory, etc.
  • Heartbeat Monitoring (End to End) - This mimics customer experience on a regular basis. An alert will be issued if response times exceed a threshold predetermined by the business, or if the heartbeat fails an agreed number of times consecutively. The object of the heartbeat is to prove that key business functionality is available and performing to an acceptable standard. If end-to-end heartbeat is not appropriate, then component heartbeats should be applied.
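
The heartbeat alerting rule can be sketched as a small state machine. This is one possible reading of the rule, not a definitive implementation: it assumes a slow response alerts immediately while failed probes alert only after the agreed number of consecutive failures. HeartbeatMonitor and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    max_response_seconds: float   # threshold agreed with the business
    failures_before_alert: int    # agreed number of consecutive failures
    consecutive_failures: int = 0

    def record(self, response_seconds: Optional[float]) -> bool:
        """Record one probe result and return True if an alert should be raised.

        response_seconds is None when the probe failed outright.
        """
        if response_seconds is None:
            self.consecutive_failures += 1
            return self.consecutive_failures >= self.failures_before_alert
        # A successful probe resets the failure count; it still alerts if too slow.
        self.consecutive_failures = 0
        return response_seconds > self.max_response_seconds


monitor = HeartbeatMonitor(max_response_seconds=2.0, failures_before_alert=3)
probes = [0.5, None, None, 0.4, None, None, None, 3.5]
alerts = [monitor.record(p) for p in probes]
```

In this run only the third consecutive failure and the final slow response raise an alert; the two failures before the successful 0.4 second probe do not, because the success resets the count.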
