Linux Kernel panic issue: How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem

A panic may occur as a result of a hardware failure or a software bug in the operating system.


In many cases the operating system could keep running after an error has occurred. However, the system would then be in an unstable state, so rather than risk security breaches and data corruption, it stops to prevent further damage, to make the error easier to diagnose and, usually, to restart.

After a kernel binary image has been recompiled from source, a kernel panic while booting the new kernel is a common problem if the kernel was not correctly configured, compiled or installed. Add-on hardware or malfunctioning RAM can also cause fatal kernel errors during start-up, through incompatibility with the OS or a missing device driver. A kernel may also call panic() if it cannot locate a root file system. During the final stages of kernel userspace initialization, a panic is typically triggered if spawning init fails, because the system would then be unusable.

Background

When I ran into this issue, I checked the log in /var/log/messages and found the following entries:

INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
 Not tainted 2.6.32-431.11.2.el6.x86_64 #1
 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
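
To see how often the warning appears, the log can be searched for the message text. The following is only a minimal sketch: the function name find_hung_tasks is made up for illustration, and the default log path is an assumption that differs between distributions (some systems write kernel messages to another file or to the journal).

```shell
# Print every "blocked for more than N seconds" line from a log file.
# Usage: find_hung_tasks [logfile]   (defaults to /var/log/messages)
find_hung_tasks() {
    log="${1:-/var/log/messages}"
    grep -E 'task .+ blocked for more than [0-9]+ seconds' "$log"
}

# Show the current timeout if the kernel exposes it.
# A value of 0 means the hung task check is disabled.
if [ -r /proc/sys/kernel/hung_task_timeout_secs ]; then
    cat /proc/sys/kernel/hung_task_timeout_secs
fi
```

Calling find_hung_tasks without an argument searches /var/log/messages, which usually requires root. The task name in each match (jbd2/vda3-8 in the log above) shows which process was stuck.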


Step by step troubleshooting data and logs


The following command shows the server's memory usage:

sar -r

Its output shows how much memory the server is using.

Next, check the CPU usage with the following command:

sar -u


Solution for hung_task_timeout_secs

Explanation

By default Linux uses up to 40% of the available memory for file system
caching (the exact percentage is set by vm.dirty_ratio, and its default
differs between kernel versions). Once this mark has been reached, the
file system flushes all outstanding data to disk, and all following I/O
becomes synchronous. The 120 seconds in the log message is the default
value of kernel.hung_task_timeout_secs: a task that stays blocked for
longer than that is reported. In this case the I/O subsystem is not fast
enough to flush the data within 120 seconds. Because the I/O subsystem
responds slowly while more requests keep arriving, system memory fills
up, which results in the above error and in turn disrupts the serving of
HTTP requests.
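
One way to watch this happening is to look at the Dirty and Writeback counters in /proc/meminfo, which show how much cached data is waiting to be written and how much is being written right now. A minimal sketch, assuming a kernel that exposes these two fields; the function name dirty_stats is hypothetical, and the file argument exists only so the function can be pointed at a saved copy of meminfo:

```shell
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# Usage: dirty_stats [file]   (defaults to /proc/meminfo)
dirty_stats() {
    meminfo="${1:-/proc/meminfo}"
    # Match the field names exactly so that WritebackTmp is not included.
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "$meminfo"
}
```

Calling dirty_stats every few seconds while the server is under load shows whether Dirty keeps climbing, which would be consistent with the disk not keeping up with the writes.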

Testing

I tested this theory with the following:

Change vm.dirty_ratio and vm.dirty_background_ratio:

sysctl -w vm.dirty_ratio=10

sysctl -w vm.dirty_background_ratio=5

The values set with sysctl -w take effect immediately. To also reload whatever is stored in /etc/sysctl.conf:

sysctl -p
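
To confirm that the kernel is using the new values, they can be read back from /proc/sys/vm, where each vm.* setting appears as a file. This is a small sketch; read_vm_setting is a name made up here, and the optional second argument is there only so the function can be tried against a test directory:

```shell
# Read a vm.* setting straight from procfs.
# Usage: read_vm_setting <name> [base_dir]   (base_dir defaults to /proc/sys/vm)
read_vm_setting() {
    name="$1"
    base="${2:-/proc/sys/vm}"
    cat "$base/$name"
}
```

For example, read_vm_setting dirty_ratio and read_vm_setting dirty_background_ratio print the values currently in effect. The same information is available through sysctl -n vm.dirty_ratio.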

Make it permanent


Once the server had been stable for a week, with no kernel, swap or memory panics,
I edited the /etc/sysctl.conf file so that the settings persist across reboots.


vi /etc/sysctl.conf


Add these two lines at the bottom:


vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
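
If the file is edited by a script instead of by hand, it helps to replace an existing entry and not append a second one. The function below is a rough sketch of that idea, not a standard tool: the name set_sysctl_conf is hypothetical, it needs root to write /etc/sysctl.conf, and it treats the dots in the key as regular expression wildcards, which is good enough for these keys.

```shell
# Set "key = value" in a sysctl-style config file, replacing any existing
# uncommented line for that key so repeated runs do not create duplicates.
# Usage: set_sysctl_conf <key> <value> [file]   (defaults to /etc/sysctl.conf)
set_sysctl_conf() {
    key="$1"
    value="$2"
    file="${3:-/etc/sysctl.conf}"
    scratch=$(mktemp) || return 1
    # Keep every line except existing assignments of this key.
    grep -v -E "^[[:space:]]*${key}[[:space:]]*=" "$file" > "$scratch"
    # Append the new assignment at the bottom.
    printf '%s = %s\n' "$key" "$value" >> "$scratch"
    # Write back in place so the file keeps its owner and permissions.
    cat "$scratch" > "$file" && rm -f "$scratch"
}
```

With this in place, set_sysctl_conf vm.dirty_background_ratio 5 followed by set_sysctl_conf vm.dirty_ratio 10 leaves the same two lines at the bottom of the file, however many times it is run.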


Save and exit, then reboot the server (or run sysctl -p to load the file without rebooting).

Linux Works Documents

Wednesday, 9 November 2016
