Introduction
In Skytap cloud environments, ensuring the resilience and performance of AIX-based applications is critical. For AIX Logical Partitions (LPARs) running in Skytap, setting the correct read/write timeout (rw_timeout) value is essential to maintain application stability and reliability. This technical note discusses the importance of configuring the rw_timeout value appropriately, considering Skytap's infrastructure which is designed for performance and redundancy.
Skytap Infrastructure Overview
Skytap's storage infrastructure is designed to provide high performance, scalability, and resilience. Key features include:
- High-Performance Storage: Utilizing JBOD SSDs ensures rapid data access and high throughput, essential for demanding enterprise applications.
- Data Protection and Redundancy: The ZFS filesystem with RAIDZ2 offers robust data protection.
- Distributed Storage Nodes: Each Skytap region comprises multiple storage nodes, with LPAR disks distributed across these nodes. This architecture enhances data availability and resilience by mitigating the impact of individual node failures.
- Redundant Connectivity: Dual bonded network interfaces connected to dual switches provide multiple paths to storage, ensuring continuous access even in the event of a network failure.
- iSCSI Connectivity via VIOS: The use of iSCSI through VIOS facilitates efficient and reliable storage connectivity, supporting high availability and performance.
Best Practices
The rw_timeout parameter in AIX specifies the maximum time (in seconds) that the system will wait for a read or write operation to complete before considering it failed.
- In a cloud environment like Skytap, the rw_timeout needs to be longer than in traditional on-premises AIX deployments to accommodate the additional layers of abstraction and potential network latencies inherent in cloud architectures.
- Skytap's VIOS is configured with an rw_timeout of 60 seconds. IBM recommends that the rw_timeout value for client LPARs be twice that of the VIOS setting, which translates to 120 seconds for client LPARs in this context.
- Setting this value to 120 seconds balances performance and reliability: it accommodates the potential latency introduced by RAIDZ2 in the event of an infrastructure disk failure and prevents premature timeout errors that could disrupt application performance.
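The doubling rule above is simple arithmetic, sketched here in shell so it can be reused if the VIOS value ever differs from 60. The function name client_rw_timeout is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Derive the client LPAR rw_timeout from the VIOS rw_timeout using the
# "twice the VIOS value" rule described above.
# client_rw_timeout is a hypothetical helper name, not an AIX command.
client_rw_timeout() {
    echo $(( $1 * 2 ))
}

client_rw_timeout 60   # prints 120
```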
Special Considerations for AIX Clusters Leveraging PowerHA (HACMP)
PowerHA (formerly known as HACMP) clusters require consistent and reliable disk I/O to ensure failover mechanisms work seamlessly. Keep the following in mind when using PowerHA clusters with Skytap’s Multi-Attached Storage feature:
- Number of disks in a disk set. If a disk set has a small number of disks, it can lead to a potential failure and unexpected downtime.
  - PowerHA clusters use quorum calculations to determine the operational status of the cluster.
  - A small number of disks can affect quorum calculations, especially if disk I/O timeouts occur frequently.
  - This can lead to split-brain scenarios where cluster nodes operate independently, risking data inconsistency.
- Disk timeout. If the rw_timeout is set too low for disks in your cluster, the cluster nodes may prematurely detect disk failures, leading to unnecessary failovers.
For these reasons, an rw_timeout of 120 seconds helps maintain PowerHA cluster stability by reducing the risk of false positives in disk failure detection, which could otherwise trigger unnecessary failovers and affect quorum calculations.
Note: Starting with AIX 7.2 TL05, rw_timeout is an attribute of the storage device (disk, …). Before AIX 7.2 TL05, it is an attribute of the vscsi adapter.
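To decide which of the two procedures applies to a given LPAR, the AIX level can be compared against 7.2 TL05. The sketch below assumes the level is supplied in the `oslevel -s` format (for example 7200-05-03-2148) and treats any release newer than 7.2 as falling under "TL05 and later". The helper name rw_timeout_scope is hypothetical.

```shell
#!/bin/sh
# Report whether rw_timeout is set on the disk or on the vscsi adapter,
# given a level string in `oslevel -s` format (base-TL-SP-build).
# rw_timeout_scope is a hypothetical helper name.
rw_timeout_scope() {
    level=$1
    base=${level%%-*}      # e.g. 7200
    rest=${level#*-}       # e.g. 05-03-2148
    tl=${rest%%-*}         # e.g. 05
    tl=${tl#0}             # drop one leading zero: 05 -> 5
    if [ "$base" -gt 7200 ] || { [ "$base" -eq 7200 ] && [ "${tl:-0}" -ge 5 ]; }; then
        echo disk
    else
        echo adapter
    fi
}

# On an AIX host: rw_timeout_scope "$(oslevel -s)"
rw_timeout_scope 7200-05-03-2148   # prints disk
rw_timeout_scope 7100-05-06-2015   # prints adapter
```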
How to Change the rw_timeout Setting in AIX versions 7.2 TL 05 and later
To change the rw_timeout setting in AIX, follow these steps:
- Identify the Disk Devices:
  - List the disk devices to identify the ones you need to configure:
    lsdev -Cc disk
- Check the Current rw_timeout Value:
  - Use the lsattr command to check the current rw_timeout value for a specific disk:
    lsattr -El hdiskX -a rw_timeout
  - Replace hdiskX with the appropriate disk identifier.
- Change the rw_timeout Value:
  - Use the chdev command to change the rw_timeout value to 120 seconds:
    chdev -l hdiskX -a rw_timeout=120 -P
  - Replace hdiskX with the appropriate disk identifier.
  - The -P flag records the change in the device configuration database only. It is applied to the device the next time the device is configured, typically at the next restart of the LPAR.
- Verify the Change:
  - Verify that the rw_timeout value has been updated:
    lsattr -El hdiskX -a rw_timeout
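When an LPAR has many disks, the change step can be prepared for all of them at once. The sketch below is a dry run: it reads a listing in `lsdev -Cc disk` format (device name in the first column) and prints one chdev command per hdisk rather than running it, so the list can be reviewed first. The helper name plan_disk_rw_timeout and the sample listing are illustrative assumptions.

```shell
#!/bin/sh
# Print (do not run) a chdev command for every hdisk in the listing on stdin.
# Input is expected in `lsdev -Cc disk` format: device name in column one.
# plan_disk_rw_timeout is a hypothetical helper name.
plan_disk_rw_timeout() {
    timeout=${1:-120}
    while read -r name _rest; do
        case $name in
            hdisk*) echo "chdev -l $name -a rw_timeout=$timeout -P" ;;
        esac
    done
}

# On an AIX host: lsdev -Cc disk | plan_disk_rw_timeout 120
# Illustrative sample input so the sketch can be tried anywhere:
printf '%s\n' \
    'hdisk0 Available  Virtual SCSI Disk Drive' \
    'hdisk1 Available  Virtual SCSI Disk Drive' | plan_disk_rw_timeout 120
```

Printing the commands instead of executing them keeps the sketch safe to try. Once the list looks right, the printed lines can be run by hand or piped to a shell.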
How to Change the rw_timeout Setting in AIX versions earlier than 7.2 TL 05
To change the rw_timeout setting in AIX, follow these steps:
- Identify the vscsi Adapters:
  - List the adapters to identify the vscsi adapters you need to configure:
    lsdev -Cc adapter
- Check the Current rw_timeout Value:
  - Use the lsattr command to check the current rw_timeout value for a specific vscsi adapter:
    lsattr -El vscsiXX
  - Replace vscsiXX with the appropriate vscsi adapter.
- Change the rw_timeout Value:
  - Use the chdev command to change the rw_timeout value to 120 seconds:
    chdev -l vscsiXX -a rw_timeout=120 -P
  - Replace vscsiXX with the appropriate vscsi adapter.
  - The -P flag records the change in the device configuration database only. It is applied to the adapter the next time the adapter is configured, typically at the next restart of the LPAR.
- Verify the Change:
  - Verify that the rw_timeout value has been updated:
    lsattr -El vscsiXX
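Because `lsattr -El vscsiXX` lists every attribute of the adapter, the verification step means picking the rw_timeout line out of that listing. A minimal sketch of that check follows, assuming the usual lsattr column layout (attribute name first, value second). The helper name check_rw_timeout is hypothetical, and the sample listing is illustrative rather than captured from a real system.

```shell
#!/bin/sh
# Read `lsattr -El <device>` output on stdin and compare the rw_timeout
# value (second column) with the expected one.
# check_rw_timeout is a hypothetical helper name.
check_rw_timeout() {
    want=${1:-120}
    have=$(awk '$1 == "rw_timeout" { print $2 }')
    if [ "$have" = "$want" ]; then
        echo "ok: rw_timeout=$have"
    else
        echo "mismatch: rw_timeout=${have:-unset}, expected $want"
        return 1
    fi
}

# On an AIX host: lsattr -El vscsi0 | check_rw_timeout 120
# Illustrative sample listing:
printf '%s\n' \
    'vscsi_err_recov delayed_fail N/A True' \
    'rw_timeout 120 Virtual SCSI Read/Write Command Timeout True' | check_rw_timeout 120
```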
Conclusion
Setting the rw_timeout value to 120 seconds for AIX within Skytap LPARs is essential for maintaining the performance and reliability of applications. Adhering to IBM's best practice of setting the client LPAR rw_timeout to twice that of Skytap’s VIOS (60 seconds) will help maintain a stable and efficient environment. It is also important to remember that each application is unique and may handle changes in disk performance differently, so any changes to the rw_timeout and corresponding application configuration changes should be tested prior to going live in production environments.