Wednesday 9 November 2016

hung_task_timeout_secs and blocked for more than 120 seconds problem

Linux Kernel panic issue: How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem


A panic may occur as a result of a hardware failure or a software bug in the operating system. In many cases, the operating system is capable of continued operation after an error has occurred. However, the system is then in an unstable state, and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and to facilitate diagnosis of the error and, usually, a restart. After recompiling a kernel binary image from source code, a kernel panic while booting the resulting kernel is a common problem if the kernel was not correctly configured, compiled, or installed. Add-on hardware or malfunctioning RAM could also be sources of fatal kernel errors during startup, due to incompatibility with the OS or a missing device driver. A kernel may also call panic() if it is unable to locate a root file system. During the final stages of kernel userspace initialization, a panic is typically triggered if the spawning of init fails, as the system would then be unusable.

Background

When I faced this issue, I checked the log in /var/log/messages and found the following entries:


INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
 Not tainted 2.6.32-431.11.2.el6.x86_64 #1
 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
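
To see how often the warning occurs and which tasks are affected, the log can be filtered for these entries. The following is a minimal sketch, not a definitive tool: the function names count_hung_tasks and hung_task_names are made up for illustration, and the default path /var/log/messages is assumed from the system described here (other distributions may log elsewhere).

```shell
# Hypothetical helper: count hung-task warnings in a log file.
# Usage: count_hung_tasks [logfile]   (defaults to /var/log/messages)
count_hung_tasks() {
    log="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 when none match)
    grep -c 'blocked for more than [0-9]* seconds' "$log"
}

# Hypothetical helper: list the distinct task names that were reported.
# Usage: hung_task_names [logfile]
hung_task_names() {
    log="${1:-/var/log/messages}"
    # Keep only the text between "INFO: task " and the ":PID" that follows it
    sed -n 's/.*INFO: task \(.*\):[0-9]* blocked for more than.*/\1/p' "$log" | sort -u
}
```

Running count_hung_tasks and hung_task_names without arguments reads /var/log/messages, which usually requires root.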

Step by step troubleshooting data and logs

The following command shows the server's memory usage:
sar -r
Next, check the CPU usage:
sar -u
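
Alongside sar, it can help to look at how much data is waiting to be written to disk. The sketch below reads the Dirty and Writeback fields from /proc/meminfo; the helper name meminfo_kb is hypothetical and only serves as an illustration.

```shell
# Hypothetical helper: print the value (in kB) of one /proc/meminfo field.
# Usage: meminfo_kb FIELD [file]   (file defaults to /proc/meminfo)
meminfo_kb() {
    # Match the field name exactly, so "Writeback" does not match "WritebackTmp"
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Show dirty data and data currently being written back, if /proc/meminfo is readable
if [ -r /proc/meminfo ]; then
    echo "Dirty:     $(meminfo_kb Dirty) kB"
    echo "Writeback: $(meminfo_kb Writeback) kB"
fi
```

A Dirty value that stays large while the warnings appear would be consistent with a slow I/O subsystem.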

Solution for hung_task_timeout_secs

Explanation
By default, Linux allows up to 40% of the available memory to fill with dirty file system cache data (the vm.dirty_ratio setting; the default varies by kernel version). Once this mark has been reached, the file system flushes all outstanding data to disk, and all subsequent I/O becomes synchronous. The kernel reports any task that stays blocked for longer than a time limit, which is 120 seconds by default (hung_task_timeout_secs). In this case, the I/O subsystem was not fast enough to flush the data within 120 seconds. As the I/O subsystem responds slowly while more requests keep arriving, system memory fills up, resulting in the above error, and the server stops serving HTTP requests.
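
To get a feel for the amounts involved, the limit can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: the helper name dirty_limit_kb is hypothetical, the 8 GB memory size is an example figure, and the kernel applies the ratio to available (free plus reclaimable) memory rather than total RAM, so real limits will be somewhat lower.

```shell
# Hypothetical helper: kB of dirty data allowed for a given memory size and ratio.
# Usage: dirty_limit_kb MEM_KB RATIO_PERCENT
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Worked example with an assumed 8 GB of memory (8388608 kB)
dirty_limit_kb 8388608 40   # prints 3355443 (about 3.2 GB at a ratio of 40)
dirty_limit_kb 8388608 10   # prints 838860  (about 0.8 GB at a ratio of 10)
```

With a lower ratio, flushing starts earlier and each flush has far less data to write, which makes it more likely to finish before the 120-second limit.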
Testing
I tested this theory by lowering vm.dirty_ratio and vm.dirty_background_ratio:


sysctl -w vm.dirty_ratio=10

sysctl -w vm.dirty_background_ratio=5

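These changes take effect immediately but are lost on reboot. No separate commit step is needed: sysctl -p does not commit them, it reloads the values from /etc/sysctl.conf, so it is only useful after that file has been edited.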


Make it permanent

After the server had been stable for a week with no kernel, swap, or memory panics,
I edited /etc/sysctl.conf to make the settings persist across reboots.

vi /etc/sysctl.conf

Add these two lines at the bottom:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Save and exit,

then reboot the server (or run sysctl -p to load the new values without rebooting).
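
After the reboot, it is worth confirming that the values in the file match what the kernel is actually using. The following is a minimal sketch: the helper name sysctl_conf_value is hypothetical, and it assumes simple "key = value" lines like the two added above.

```shell
# Hypothetical helper: print the value assigned to KEY in a sysctl.conf-style file.
# Usage: sysctl_conf_value KEY [file]   (file defaults to /etc/sysctl.conf)
sysctl_conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)                # strip whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)      # strip whitespace around the value
            if (k == key) val = v               # the last assignment wins
        }
        END { if (val != "") print val }
    ' "${2:-/etc/sysctl.conf}"
}

# Compare the configured values with the running ones exposed under /proc/sys
for key in vm.dirty_background_ratio vm.dirty_ratio; do
    path="/proc/sys/$(echo "$key" | tr '.' '/')"
    if [ -r /etc/sysctl.conf ] && [ -r "$path" ]; then
        echo "$key: file=$(sysctl_conf_value "$key") running=$(cat "$path")"
    fi
done
```

If the two numbers differ for either key, the file was probably not loaded, or another configuration file overrides it.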
