Troubleshoot Discovery Performance, Cancellation, and Timeout

Issue

Each discovery schedule has a "Max run time" field. Once the "Max run time" is reached, the discovery status is set to cancelled.

There are many reasons a discovery schedule could be cancelled or perform slower than expected. The following are some of the reasons a discovery schedule may end up cancelled:

- Output probe "hangs" while being processed on the MID server
- Output probe remains in ready state
- Not enough MID server resources to process the ECC Queue records in the desired time
- Input remains in ready state

The time a discovery schedule takes to complete is expected to grow with the number of CIs discovered. If the number of CIs discovered in a discovery schedule increases, it may be necessary to increase the max run time or the resources allocated to the discovery schedule, such as the number of MID servers, MID server threads, and/or MID server memory.

MID Server configuration

The documentation pages below can provide guidelines for increasing MID server resources and configuring threads and memory usage:

Release

All

Resolution

Slow Discovery

Show Discovery Timeline

The "Show Discovery Timeline" button is helpful for reviewing the timeline of smaller discoveries. By default, the timeline is limited to 300 ecc_queue records. This limit is controlled by the property glide.discovery.timeline.max_entries. It is best to keep the limit at 300 and use the timeline only when troubleshooting smaller discoveries.

Discovery performance metrics

Starting in Madrid, performance metrics are collected for individual probes, per build version, discovery status, and target IP address.

Individual
- Shows the probe and sensor processing time for each probe and sensor
- Logged when an input is processed by the SensorProcessor and controlled by property "glide.discovery.perf.metrics.enable_collection" with values (true/false)
- To view the Discovery Performance Metrics, either:
  - Navigate to "Discovery > Discovery Performance Metrics > Probe/Sensor (Individual)"
  - Add the related list "Probe and Sensor Metrics (Individual)" to the "Discovery Status" form
- Individual metrics are logged only for probes whose input was returned and processed
- The metrics can be filtered by discovery status and then ordered by "Probe processing time" or "Sensor processing time" to find the longest running ones. As an example:

[Screenshot: Discovery Performance Metrics]
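
The same filter-and-sort can also be applied offline to exported metric rows. The following is only a minimal sketch in Python: the row keys (`status`, `probe`, `probe_ms`, `sensor_ms`), the status numbers, and the probe names are hypothetical placeholders, not the platform's actual column names or data.

```python
from operator import itemgetter

# Hypothetical export of individual probe/sensor metrics.
# Key names and values are illustrative assumptions only.
metric_rows = [
    {"status": "DIS001", "probe": "probe_a", "probe_ms": 1200, "sensor_ms": 300},
    {"status": "DIS001", "probe": "probe_b", "probe_ms": 45000, "sensor_ms": 150},
    {"status": "DIS002", "probe": "probe_a", "probe_ms": 90000, "sensor_ms": 100},
    {"status": "DIS001", "probe": "probe_c", "probe_ms": 800, "sensor_ms": 9000},
]

def slowest(rows, status, key, limit=10):
    """Return the rows of one discovery status, longest `key` first."""
    matching = [r for r in rows if r["status"] == status]
    return sorted(matching, key=itemgetter(key), reverse=True)[:limit]

# Longest running probes and sensors for the hypothetical status DIS001.
slowest_probes = slowest(metric_rows, "DIS001", "probe_ms")
slowest_sensors = slowest(metric_rows, "DIS001", "sensor_ms")
```

Sorting by probe time and by sensor time separately matters because a record can be fast on the MID server but slow in sensor processing, or the reverse.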

Build
- Shows probe metrics per build version
- Collected daily via scheduled job "Aggregate Discovery Probe And Sensor Metrics By Build" and controlled by property "glide.discovery.perf.metrics.rollup_by_build" with values (true/false)
Discovery Status
- Shows probe metrics per status
- Collected via a script action when a discovery completes and controlled by property "glide.discovery.perf.metrics.rollup_by_status" with values (true/false)
Target IP address
- Shows the aggregated time of a probe per target IP address
- Collected via a script action when a discovery completes and controlled by property "glide.discovery.perf.metrics.rollup_by_target" with values (true/false)
- Note: Metrics by target are not logged for devices which did not complete a discovery. The metrics are therefore useful for determining which devices and device types take the longest, but less useful for determining which device caused a discovery to be cancelled: the device that pushed the discovery past the max run time threshold did not complete, so no performance information was collected for it.

A combination of the above metrics can be used to determine which probes or IPs have the most impact on a discovery's performance. Please note that the averages are statistically significant only if the data set is large enough. Smaller data sets can contain outliers which greatly skew the average.
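
The effect of an outlier on a small data set can be illustrated with a few lines of Python. The timings below are made-up values, not real metrics; the point is only that a single slow target moves the average far more than the median.

```python
from statistics import mean, median

# Hypothetical probe processing times in seconds for six targets;
# one target took ten minutes, the others a few seconds.
times = [2, 3, 2, 4, 3, 600]

average = mean(times)    # pulled far upward by the single outlier
typical = median(times)  # stays close to what most targets took
```

Here the average is above 100 seconds although five of the six targets finished in 4 seconds or less, so comparing the average with the median is one quick way to spot a skewed data set.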

See Discovery performance metrics for more information.

Cancelled Discovery

Check the ECC Queue related list for records where Processed is (empty)

When a discovery is cancelled, all ecc_queue records which have not been processed have their state changed to processed. However, the "processed" field, which holds the time when the processing of the probe/sensor completed, is left empty. The probes that were still running when a discovery was cancelled can therefore be found by looking for the records where processed is (empty).
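
As a rough illustration of this check, the sketch below filters a set of ECC Queue rows that are assumed to have been exported from the related list into Python dictionaries. The keys and all sample values are assumptions made for the example.

```python
# Hypothetical ECC Queue rows exported for one cancelled discovery.
# Keys loosely mirror the columns discussed above; values are made up.
ecc_rows = [
    {"name": "probe_a", "queue": "output", "source": "192.0.2.5",
     "state": "processed", "processed": "2024-01-01 10:05:00"},
    {"name": "probe_b", "queue": "output", "source": "192.0.2.9",
     "state": "processed", "processed": ""},
    {"name": "probe_b", "queue": "input", "source": "192.0.2.7",
     "state": "processed", "processed": None},
]

def unfinished(rows):
    """Rows whose 'processed' timestamp is empty, whatever their state says."""
    return [r for r in rows if not r.get("processed")]

pending = unfinished(ecc_rows)
```

Note that the filter deliberately ignores the state, because a cancelled discovery marks every record as processed.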

In the following screenshot, the ECC Queue records for the discovery job have been ordered from newest to oldest by the "updated" field. The "updated" field is not shown by default on the ECC Queue related list. Adding the "updated" column is helpful because it shows how long the probe took to complete: Updated - Processed = the time an output probe spent on the MID server, which includes both the time spent waiting in the MID internal queue until a thread was available and the execution time of the probe.

[Screenshot: Discovery status ECC Queue]

Note: the highlighted ECC Queue record was created an hour before the example discovery above was cancelled, and it had still not been processed.

This is a good indication that this probe is what caused the discovery above to be cancelled.

The above example is a simple case in which only one probe was not processed, so the culprit is easy to identify. In other cases, multiple probes may hang and together contribute to the discovery timing out. For such cases it is important to analyze the ECC Queue of the cancelled discovery and look for patterns, such as many records with processed (empty) that share the same "Topic" or "Probe Name". In general, to find the ECC Queue records which caused the discovery to be cancelled, look for records with a large gap between the "created" and "updated" fields.
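
Both checks described above (repeated topic/probe name among unprocessed records, and a large gap between "created" and "updated") can be sketched in Python on exported rows. The dictionary keys, the timestamp format, and every sample value are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format of the export

# Hypothetical rows for one cancelled discovery; all values are made up.
ecc_rows = [
    {"topic": "topic_x", "name": "probe_a", "created": "2024-01-01 09:00:00",
     "updated": "2024-01-01 09:00:40", "processed": "2024-01-01 09:00:10"},
    {"topic": "topic_y", "name": "probe_b", "created": "2024-01-01 09:01:00",
     "updated": "2024-01-01 10:01:00", "processed": ""},
    {"topic": "topic_y", "name": "probe_b", "created": "2024-01-01 09:02:00",
     "updated": "2024-01-01 10:01:00", "processed": ""},
]

def gap_seconds(row):
    """Seconds between the 'created' and 'updated' timestamps of a row."""
    created = datetime.strptime(row["created"], FMT)
    updated = datetime.strptime(row["updated"], FMT)
    return (updated - created).total_seconds()

def hang_patterns(rows):
    """Count unprocessed rows per (topic, probe name) pair."""
    return Counter((r["topic"], r["name"]) for r in rows if not r["processed"])

patterns = hang_patterns(ecc_rows)
widest_first = sorted(ecc_rows, key=gap_seconds, reverse=True)
```

A pair that dominates `patterns` and also appears at the top of `widest_first` is a reasonable first candidate to investigate.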

When analyzing the ECC Queue for a cancelled discovery, split the queue between output and input.

- Outputs are processed by the MID Server. Hanging outputs point to a MID Server issue.
- Inputs are processed by the instance. Hanging inputs point to an instance-side issue.
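
A minimal sketch of this split, assuming the ECC Queue rows have been exported as Python dictionaries with a `queue` key holding "output" or "input" (key names and sample values are illustrative assumptions):

```python
from collections import Counter

# Hypothetical rows for one cancelled discovery; all values are made up.
ecc_rows = [
    {"queue": "output", "name": "probe_a", "processed": ""},
    {"queue": "output", "name": "probe_b", "processed": ""},
    {"queue": "input", "name": "probe_a", "processed": "2024-01-01 09:00:10"},
    {"queue": "input", "name": "probe_c", "processed": ""},
]

def unprocessed_by_queue(rows):
    """Count rows with an empty 'processed' timestamp per queue direction."""
    return Counter(r["queue"] for r in rows if not r["processed"])

counts = unprocessed_by_queue(ecc_rows)
```

If most of the unprocessed rows are outputs, the MID Server is the first place to look; if most are inputs, start on the instance side.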

Once it is determined which probe or IP address is causing the discovery to be cancelled, a separate investigation is needed to troubleshoot why the probe is not fully processed. In the meantime, the IP address can be added as an exception to the discovery schedule, so that the discovery completes successfully until a solution is found for the probe.

Check the Devices related list for devices where the "Scan status" is not "Completed"

The Devices related list is not updated when the discovery is cancelled. Therefore, any device whose probes did not complete will not have a "Scan status" of "Completed". The following image shows devices which were still being scanned when the discovery was cancelled. From this related list we can gather the Source, which is the IP address of the device, and then look under the "ECC Queue" related list for the probes of that source which did not complete.

[Screenshot: Active Scans]

Note: both "Source" and "CMDB CI" were cleared in the above screenshot.
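
The cross-check between the Devices and ECC Queue related lists can be sketched as follows. This is only an illustration on exported rows: the keys (`source`, `scan_status`, `processed`), the status value "Active", and the sample data are assumptions, not the platform's actual field names or values.

```python
# Hypothetical exports of the Devices and ECC Queue related lists.
devices = [
    {"source": "192.0.2.5", "scan_status": "Completed"},
    {"source": "192.0.2.9", "scan_status": "Active"},
]
ecc_rows = [
    {"source": "192.0.2.5", "name": "probe_a", "processed": "2024-01-01 09:00:10"},
    {"source": "192.0.2.9", "name": "probe_a", "processed": "2024-01-01 09:00:12"},
    {"source": "192.0.2.9", "name": "probe_b", "processed": ""},
]

def incomplete_sources(device_rows):
    """IP addresses of devices whose scan status is not 'Completed'."""
    return {d["source"] for d in device_rows if d["scan_status"] != "Completed"}

def unfinished_probes(rows, sources):
    """Unprocessed ECC Queue rows that belong to the given sources."""
    return [r for r in rows if r["source"] in sources and not r["processed"]]

sources = incomplete_sources(devices)
suspects = unfinished_probes(ecc_rows, sources)
```

Starting from the Devices list narrows the ECC Queue search to a handful of IP addresses, which helps when the queue for a large discovery holds many records.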