Description

After discovery of an AIX host using SNCSSH, a pair of SSH server (sshd) processes is sometimes left running indefinitely on the target in an unresponsive state, and one of the sshd processes has a defunct process. For probes that follow the probe that caused the hang, the discovery log shows the error "Error; job finished with status ERROR: SSH channel xx timeout in state ESTABLISHING" as the probes try, unsuccessfully, to open additional SSH channels on the existing TCP connection.

The behaviour can also be reproduced without using any ServiceNow software.

The root cause is the IBM AIX bug IV82042 "sshd hangs due to a race condition involving ptys". Once sshd hangs, it no longer responds to subsequent probes' requests to open new channels on the same SSH session (a session is a TCP connection). The MID agent log shows this when mid.ssh.debug = true.

Discovery uses SSH pseudo-terminal (pty) mode mainly for executing privileged shell commands (for example, sudo xxx) that prompt for a password. Any probe that specifies the probe parameter run_in_terminal=true also uses pty mode. SNCSSH uses only SSH "exec" mode, which, when combined with pty mode, triggers the bug in AIX. The deprecated J2SSH uses only SSH "shell" mode, almost always in pty mode, but J2SSH does not reproduce the bug.

 

Steps to Reproduce

 

Because this behaviour is timing-dependent, it is hard to reproduce and troubleshoot. You may need to run the sequences below several times.
 

To reproduce without using any ServiceNow software:
  1. Run the following command from a second host running any OS, or from the same AIX host. This mimics what SNCSSH does.

           ssh -vvv -t <user>@<AIX host IP> echo hello
    where
    -t means use pseudo-terminal (pty) mode. 
    Specifying the shell command (echo hello) after the IP address means to use SSH protocol's "exec" mode vs. interactive "shell" mode.
    -vvv (optional) enables debug logging to console.

    When the bug is triggered, this command hangs the client, so kill the client console window. That forces the SSH TCP connection to the target to close. The correct behavior is for "hello" to be printed to the console.

  2. Run the command ps -ef | grep sshd on the target.

  3. Look at the start time field for entries that match the time you ran the ssh command in step 1 to find the hung sshd process pair.

    If you run the same ps command again after a while, you will see they are still hung.

  4. Run the ssh command again but omit the -t.

    This command will always succeed and print "hello".
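
Because the client hangs when the bug is triggered, it can be convenient to put a time limit on the reproduction command instead of killing the console window by hand. The following is a minimal sketch, not part of the original procedure. It assumes the GNU coreutils timeout utility is available on the client host, and the names SSH_BIN and try_pty_exec are hypothetical. It uses -tt rather than -t so that a pty is requested even when the script's stdin is not a terminal.

```shell
#!/bin/sh
# Sketch: run the pty + exec reproduction with a client-side time limit.
# Assumption: GNU coreutils `timeout` is installed on the client host.
# SSH_BIN and try_pty_exec are illustrative names, not ServiceNow settings.
SSH_BIN="${SSH_BIN:-ssh}"

try_pty_exec() {
    target="$1"          # for example: user@aix-host
    limit="${2:-30}"     # seconds to wait before giving up
    # -tt forces pty allocation even if stdin is not a terminal.
    timeout "$limit" "$SSH_BIN" -tt "$target" echo hello
    status=$?
    # GNU timeout exits with status 124 when the time limit is reached.
    if [ "$status" -eq 124 ]; then
        echo "HUNG: no response within ${limit}s" >&2
    fi
    return "$status"
}
```

Calling try_pty_exec user@aix-host 30 in a loop is one way to repeat the attempt until the timing-dependent hang appears. An exit status of 124 only indicates that the client did not finish in time, so the sshd processes on the target still need to be checked with ps as described in steps 2 and 3.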


To reproduce using SNCSSH:

  1. Create two SSHCommand probes:

    Probe #1 has probe parameter run_in_terminal=true, with command (ECC Queue Name field) = "echo hello".
    Probe #2 command = "echo goodbye", with no probe parameters.

  2. Run probe #1.

  3. Run probe #2.

    It will time out after a couple of minutes as it attempts to open a new channel to AIX.

  4. Stop the MID server.

    This forces all SSH TCP connections to the target to close.

  5. Run ps -ef | grep sshd on the target.

  6. Look at the start time field for entries that match the time you started probe #1 to find the hung sshd process pair.

    Those hung processes will never exit.
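
To make the ps check in the steps above easier to read on a busy host, the output can be filtered down to sshd entries and defunct processes. This is a minimal sketch under the assumption that defunct processes appear as <defunct> in the ps -ef command column; the function name list_sshd_candidates is hypothetical. The start times still have to be compared by eye against the time probe #1 (or the ssh command) was run.

```shell
#!/bin/sh
# Sketch: read `ps -ef` output on stdin and keep only the header line,
# sshd entries and defunct entries. Lines for grep/awk themselves are
# skipped so the filter does not report its own processes.
list_sshd_candidates() {
    awk 'NR == 1 { print; next }
         /awk|grep/ { next }
         /sshd/ || /<defunct>/ { print }'
}

# Usage on the target:
#   ps -ef | list_sshd_candidates
```

Running the same pipeline again a few minutes later and seeing the same PIDs with the same start times suggests that the processes are hung rather than still finishing.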

 

Workaround

This problem has been identified as a future product enhancement. No code fix can be deployed in the current or upcoming releases.

As a workaround, configure the MID server that discovers AIX targets to use the deprecated J2SSH instead of SNCSSH. To select J2SSH for a MID server, add the Configuration Parameter mid.ssh.use_snc with a value of false.

As a workaround to a different SNCSSH-only PRB, some customers have added MID server Configuration Parameter mid.ssh_connections_per_host and set a huge value (for example, 1000). You should remove that workaround or lower the value when using J2SSH.


Background

One disadvantage of J2SSH compared with SNCSSH is that it uses three threads per SSH connection. This was the primary motivation for creating SNCSSH, which uses just a couple of threads to service all connections to all IPs.

SNCSSH uses multiple SSH channels over a single SSH session (TCP connection), whereas J2SSH opens one TCP connection per probe and uses only one SSH channel per TCP connection. Therefore, when probes run concurrently against a target, SNCSSH opens fewer concurrent TCP connections than J2SSH.

There are two limits on max channel count per IP:

  • On the target side, most SSH servers limit max channel count originating from a single client IP to a number in the 7 to 10 range.

  • On the MID server side, max channel count can be limited by setting mid.ssh_connections_per_host, which defaults to 3 for J2SSH and 7 for SNCSSH.

The lower of these two limits (N) is effectively the maximum number of concurrent SSH probes that can run against a target. If the MID server exceeds the target limit, the target notifies the MID server. SSH probes waiting to start will block until earlier probes have finished using their connections. J2SSH keeps a maximum of N connections per IP in the idle connection cache, where they can be cached for up to 2 minutes, so the thread count can climb well into the hundreds. If you don't mind having extra MID server SSH threads for each IP in a discovery run and extra TCP connections to the target, you can raise mid.ssh_connections_per_host above the default of 3 for J2SSH.
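
The arithmetic behind these limits can be sketched as follows. This is only a rough illustration: the function names are hypothetical, and the thread estimate simply multiplies the three threads per J2SSH connection by N cached connections per IP and by the number of IPs, which is an assumed upper bound rather than a measured figure.

```shell
#!/bin/sh
# Sketch: N is the lower of the target-side and MID-side channel limits.
effective_limit() {
    target_limit="$1"   # limit enforced by the target's SSH server
    mid_limit="$2"      # value of mid.ssh_connections_per_host
    if [ "$target_limit" -lt "$mid_limit" ]; then
        echo "$target_limit"
    else
        echo "$mid_limit"
    fi
}

# Assumed upper bound on J2SSH threads: 3 threads per connection,
# N connections per IP, for a given number of IPs.
j2ssh_thread_estimate() {
    n="$1"
    ip_count="$2"
    echo $(( 3 * n * ip_count ))
}
```

For example, effective_limit 10 3 yields 3, and j2ssh_thread_estimate 3 20 yields 180, which is consistent with thread counts reaching the hundreds once a few dozen IPs have cached connections.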

If you are concurrently discovering many different IPs, the SSH max connections across all IPs becomes indirectly limited by the MID server's max worker thread count setting because each probe uses one worker thread. SSH threads don't count against that worker thread limit.


Related Problem: PRB1250056

Seen In

Helsinki Patch 4

Article Information

Last Updated: 2018-12-12 11:14:36
Published: 2018-05-25