A "Max run time" can be configured on a discovery schedule. The discovery is cancelled if it takes longer to complete than the configured max run time. A discovery schedule that completed successfully in the past may start being cancelled more and more often.
Over time, the number of devices discovered by a schedule may increase considerably, and that growth may be the root cause of the schedule exceeding its configured max run time. If the number of CIs discovered by a schedule increases, then either the max run time or the resources allocated to the discovery schedule should be increased as well, such as the number of MID servers, MID server threads, and/or MID server memory.
However, in some cases the discovery may start to exceed the max run time because one or a few probes take too long to complete or get stuck at times. This article focuses on finding troublesome probes that are stuck on the MID server and thus cause the discovery schedule to be cancelled. A separate investigation is needed to determine why a probe is hanging, because there are many different probes and the investigation steps depend on the probe.
The following links should be helpful for increasing MID server resources:
When a discovery is cancelled, all ecc_queue records that have not yet been processed have their status changed to processed. However, the "processed" field, which holds the time at which processing of the probe/sensor completed, is left empty. Therefore, the probes that were still running when the discovery was cancelled can be identified by finding the records where processed is (empty).
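The check described above can be sketched offline against an export of the ECC Queue records for the cancelled discovery. This is only a minimal illustration: the dictionary keys (`name`, `source`, `state`, `processed`) and the sample rows are assumptions made for the example, not a confirmed export layout, so adjust them to match the columns in your own export.

```python
from typing import Dict, List, Optional

# Hypothetical shape of one exported ECC Queue row (keys are assumptions).
EccRecord = Dict[str, Optional[str]]


def unprocessed_at_cancel(records: List[EccRecord]) -> List[EccRecord]:
    """Return rows marked as processed whose 'processed' timestamp is empty."""
    return [
        r for r in records
        if r.get("state") == "processed" and not r.get("processed")
    ]


# Made-up sample rows, for illustration only.
sample = [
    {"name": "probe-a", "source": "10.0.0.5",
     "state": "processed", "processed": "2023-01-01 10:05:00"},
    {"name": "probe-b", "source": "10.0.0.9",
     "state": "processed", "processed": ""},
]

stuck = unprocessed_at_cancel(sample)
for row in stuck:
    # Each remaining row is a probe that was still running at cancellation.
    print(row["name"], row["source"])
```

Treating both an empty string and a missing value as "empty" is a deliberate choice here, since exports differ in how they represent blank date fields.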
In the following screenshot, the ECC Queue records for the discovery job are ordered from newest to oldest by the "updated" field. The "updated" field is not shown by default on the ECC Queue related list. Adding the "updated" column is also helpful because it shows how long the probe ran.
It can be seen from the image above that the highlighted ECC Queue record was created an hour before the discovery was cancelled and still had not been processed. This is a good indication that this probe is what caused the discovery to be cancelled.
The example above is a simple scenario in which only one probe was not processed, so the culprit is easy to identify. However, in some cases multiple probes may hang and together contribute to the discovery timing out. For such cases, it is important to analyze the ECC Queue for the cancelled discovery and look for patterns, such as many records with processed (empty) that share the same "Topic" or "Probe Name". In general, to find the ECC Queue records that caused the discovery to be cancelled, look for records with a large gap between the "created" and "updated" fields.
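When several records are left unprocessed, the pattern search described above can be sketched as a grouping step: group the unprocessed records by topic and probe name, then rank the groups by the largest gap between "created" and "updated". The keys (`topic`, `name`, `created`, `updated`), the timestamp format, and the sample rows below are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout of the export


def gap_seconds(record: Dict[str, str]) -> float:
    """Seconds between the record's 'created' and 'updated' timestamps."""
    created = datetime.strptime(record["created"], TIME_FORMAT)
    updated = datetime.strptime(record["updated"], TIME_FORMAT)
    return (updated - created).total_seconds()


def rank_suspects(
    records: List[Dict[str, str]],
) -> List[Tuple[Tuple[str, str], int, float]]:
    """Group by (topic, name); return (key, count, max gap), largest gap first."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["topic"], r["name"])].append(gap_seconds(r))
    ranked = [(key, len(gaps), max(gaps)) for key, gaps in groups.items()]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Made-up unprocessed records, for illustration only.
unprocessed = [
    {"topic": "TopicA", "name": "probe-x",
     "created": "2023-01-01 09:00:00", "updated": "2023-01-01 10:00:00"},
    {"topic": "TopicA", "name": "probe-x",
     "created": "2023-01-01 09:10:00", "updated": "2023-01-01 10:00:00"},
    {"topic": "TopicB", "name": "probe-y",
     "created": "2023-01-01 09:58:00", "updated": "2023-01-01 10:00:00"},
]

ranked = rank_suspects(unprocessed)
for (topic, name), count, max_gap in ranked:
    print(f"{topic} / {name}: {count} record(s), longest gap {max_gap:.0f}s")
```

A group that combines a high record count with a long gap is the most likely candidate for further investigation. A group with a short gap was probably just queued late when the cancellation happened.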
When analyzing the ECC Queue for a cancelled discovery, filter for "queue = output". Inputs can usually be filtered out because they have a default maximum processing time of twenty minutes and therefore will usually not cause a discovery to be cancelled.
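The filter conditions from this article can be combined into a single encoded query string for the ECC Queue list. The sketch below only assembles that string. It assumes the column names `queue` and `processed`, and it assumes that the `agent_correlator` column holds the sys_id of the discovery status record, so verify all three on your instance before relying on it. The sys_id shown is a made-up placeholder.

```python
def build_ecc_filter(status_sys_id: str) -> str:
    """Join the article's filter conditions with '^' (logical AND)."""
    conditions = [
        f"agent_correlator={status_sys_id}",  # assumed link to the discovery status
        "queue=output",                       # probes only, inputs filtered out
        "processedISEMPTY",                   # still running when cancelled
    ]
    return "^".join(conditions)


# Placeholder sys_id, for illustration only.
query = build_ecc_filter("0123456789abcdef0123456789abcdef")
print(query)
```

The resulting string is intended to be pasted into the list filter or appended to the list URL as a query parameter.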
Once it is determined which probe/IP is causing the discovery to be cancelled, a new investigation is needed to troubleshoot why the probe is not fully processed. In the meantime, the IP address can be added as an exception to the discovery schedule so that the discovery completes successfully until a solution is found for the probe.