IBM AIX bug IV82042 - SNCSSH: hung sshd process pair with defunct child process on AIX OS

Description

After discovery of AIX using SNCSSH, a pair of SSH server (sshd) processes is sometimes left running indefinitely on the target in an unresponsive state, and one of the sshd processes has a defunct child process. For probes that follow the probe that caused the hang, the discovery log shows the error "Error; job finished with status ERROR: SSH channel xx timeout in state ESTABLISHING" as those probes try, unsuccessfully, to open additional SSH channels on the existing TCP connection.
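To see which probes were affected, a log can be filtered for the error text quoted above. The following is only a minimal sketch: the function name find_channel_timeouts is hypothetical, the log location is not stated in this article (so the function reads whatever files or stream you pass it), and it assumes the channel identifier shown as xx is a single token without spaces.

```shell
# Hypothetical helper: print the lines (prefixed with their line number) that
# report the channel-open timeout. Reads the files named as arguments, or
# standard input when no file is given.
find_channel_timeouts() {
    grep -En 'SSH channel [^[:space:]]+ timeout in state ESTABLISHING' "$@"
}
```

The function exits with a non-zero status when no matching line is found, so it can also be used as a yes/no check in a script.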

The behaviour can also be reproduced without using any ServiceNow software.

The root cause is the IBM AIX bug IV82042 "sshd hangs due to a race condition involving ptys". IBM states that IV82042 is fixed, as described in its document "Various ssh problems after upgrading to OpenSSH 7.x". Once sshd hangs, it no longer responds to subsequent probes' requests to open new channels on the same SSH session (a session is a TCP connection). The MID agent log shows this when mid.ssh.debug = true.

Discovery uses SSH pseudo-terminal (pty) mode mainly for executing privileged shell commands (for example, sudo xxx) that prompt for a password. Any probe that specifies the probe parameter run_in_terminal=true also uses pty mode. SNCSSH uses only SSH "exec" mode, which, when combined with pty mode, triggers the bug in AIX. The deprecated J2SSH uses only SSH "shell" mode, almost always in pty mode, but J2SSH doesn't reproduce the bug.

Steps to Reproduce

Since this behaviour is timing-dependent, it is hard to reproduce and troubleshoot. You may need to run the sequence several times before the hang occurs.

To reproduce without using any ServiceNow software:

Run the following command line from a second host running any OS, or from the AIX host itself. This mimics what SNCSSH does.

ssh -vvv -t <user>@<AIX host IP> echo hello
where
-t means use pseudo-terminal (pty) mode.
Specifying the shell command (echo hello) after the IP address selects the SSH protocol's "exec" mode rather than the interactive "shell" mode.
-vvv (optional) enables debug logging to console.

When the bug is triggered, this command hangs the client, so kill the client console window. That forces the SSH TCP connection to the target to close. The correct behavior is for "hello" to be printed to the console.
Run the command ps -ef | grep sshd on the target.
Look at the start time field for entries that match the time you started the ssh command to find the hung sshd process pair.

If you run the same ps command again after a while, you will see that the processes are still hung.
Run the ssh command again but omit the -t.

This command will always succeed and print "hello".

To reproduce using SNCSSH:

Create two SSHCommand probes:

Probe #1 has probe parameter run_in_terminal=true, with command (ECC Queue Name field) = "echo hello".
Probe #2 command = "echo goodbye", with no probe parameters.
Run probe #1.
Run probe #2.

Probe #2 will time out after a couple of minutes as it attempts to open a new channel to AIX.
Stop the MID server.

This forces all SSH TCP connections to the target to close.
Run ps -ef | grep sshd on the target.
Look at the start time field for entries that match the time you started probe #1 to find the hung sshd process pair.

Those hung processes will never exit.
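Both procedures end by reading ps -ef output by eye. As a rough aid, the sketch below filters that output down to sshd processes that are the parent of a defunct child, which is the signature described above. The function name sshd_with_defunct_child is hypothetical, and the sketch assumes that PID and PPID are the second and third columns and that defunct entries contain the text <defunct>. It does not check start times, so compare the result against the time you started the command or probe.

```shell
# Hypothetical helper: read `ps -ef` output on standard input and print every
# sshd line whose PID is the parent of a defunct process.
# Only columns 2 (PID) and 3 (PPID) are used, so defunct rows that lack the
# STIME and TTY fields are still handled.
sshd_with_defunct_child() {
    awk '
        /<defunct>/ { zombie_parent[$3] = 1; next }     # note the parent PID of each defunct row
        /sshd/      { n++; pid[n] = $2; line[n] = $0 }  # keep sshd rows in input order
        END {
            for (i = 1; i <= n; i++)
                if (pid[i] in zombie_parent) print line[i]
        }
    '
}
```

Typical use on the target would be: ps -ef | sshd_with_defunct_child. The other half of the hung pair is the parent of the process it prints (the PPID column).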

Workaround

Update the OpenSSH server on AIX to version 7.5.102.1500 or later, per the IBM document "Various ssh problems after upgrading to OpenSSH 7.x". That version fixes IBM bug IV82042 (https://www.ibm.com/support/pages/apar/IV82042).
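To check whether a given host already meets that level, the installed version can be compared field by field with 7.5.102.1500. The sketch below shows only the comparison; the function name version_at_least is hypothetical, and obtaining the installed level is left to you (on AIX it is typically reported by lslpp for the OpenSSH server fileset, assumed here to be openssh.base.server).

```shell
# Hypothetical helper: succeed (exit status 0) when dotted version $1 is
# greater than or equal to dotted version $2. Fields are compared
# numerically from left to right; missing fields count as 0.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0   # installed level is newer
            if (a < b) exit 1   # installed level is older
        }
        exit 0                  # all fields equal
    }'
}
```

For example, with the installed level in a variable named installed: version_at_least "$installed" 7.5.102.1500 && echo "fix level present".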

Related Problem: PRB1250056