Discovery processing can be cancelled (using memory.watcher) or cause severe performance issues due to high memory consumption

Description

When Discovery sensor jobs (ASYNC: Discovery - Sensors) are processed, memory consumption can spike. If enough workers are processing sensors with large payloads, the following may occur:

Memory Watcher (memory.watcher) starts canceling the longest running jobs/transactions to attempt to recover memory. This can sometimes cancel the sensors and cause the job to send the following error:
"Sensor error: Transaction cancelled: Available memory is almost depleted"
Free memory on the application node becomes so low that the node starts experiencing severe performance issues and must be restarted.

This issue is not specific to any particular sensor. It occurs whenever multiple workers are processing large payloads at the same time.

Note that payloads are returned to the instance in the form of an attachment on an ecc_queue record. They are then processed when an asynchronous business rule on the ecc_queue record creates an ASYNC: Discovery - Sensors scheduled job. When this job is picked up by the scheduler, it loads the entire payload into memory and begins processing. This can consume up to 150-170MB of memory per worker.

The platform limits the size of attachments that can be loaded into memory to 5MB using the com.glide.attachment.max_get_size property. It is NOT recommended to increase this property, as it affects every location in the platform where attachments are loaded into memory and significantly increases the risk of running out of free memory (resulting in an outage).

Steps to Reproduce

Run a Discovery Schedule that will return a high volume of probes with large payloads.
This creates many ASYNC: Discovery - Sensors jobs as the ecc_queue records are created.
In stats.do, verify that the scheduler has picked up the jobs and they are processing.
Note that as they process, memory consumption spikes (free memory drops) until the jobs complete.

Take a heap dump to confirm the workers processing ASYNC: Discovery - Sensors jobs are the source of high memory consumption:

Suspect 1

The thread com.glide.schedule.GlideScheduleWorker @ 0x95f6ab68 glide.scheduler.worker.2 keeps local variables with total size 166,815,896 (10.88%) bytes.

Suspect 2

The thread com.glide.schedule.GlideScheduleWorker @ 0x95cdede0 glide.scheduler.worker.6 keeps local variables with total size 164,498,688 (10.73%) bytes.

Suspect 3

The thread com.glide.schedule.GlideScheduleWorker @ 0x95f69fc8 glide.scheduler.worker.3 keeps local variables with total size 164,058,312 (10.70%) bytes.

Suspect 4

The thread com.glide.schedule.GlideScheduleWorker @ 0x95f6bda8 glide.scheduler.worker.0 keeps local variables with total size 161,407,120 (10.52%) bytes.
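
To see how much of the heap the scheduler workers hold together, the suspect figures can be added up. The following is a minimal Python sketch that uses only the sizes and percentages reported in the heap dump above; the dictionary layout and the summarize function are illustrative names, not part of any platform tooling.

```python
# Retained sizes (bytes) and heap shares (%) copied from the four suspects above.
suspects = {
    "glide.scheduler.worker.2": (166_815_896, 10.88),
    "glide.scheduler.worker.6": (164_498_688, 10.73),
    "glide.scheduler.worker.3": (164_058_312, 10.70),
    "glide.scheduler.worker.0": (161_407_120, 10.52),
}


def summarize(entries):
    """Return (total bytes, total heap share in %, average bytes per worker)."""
    total_bytes = sum(size for size, _ in entries.values())
    total_share = sum(share for _, share in entries.values())
    average_bytes = total_bytes / len(entries)
    return total_bytes, total_share, average_bytes


total_bytes, total_share, average_bytes = summarize(suspects)
print(
    f"{len(suspects)} workers retain {total_bytes:,} bytes "
    f"({total_share:.2f}% of the heap), "
    f"about {average_bytes / 1e6:.1f} MB each"
)
```

In this dump, four workers together account for roughly 43% of the heap, at about 164 MB each, which is consistent with the 150-170MB per worker range described earlier.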

Workaround

For Immediate Relief:

Restart the affected node (where memory has been depleted and jobs are running)
Disable the associated Discovery schedule until one of the workarounds listed in the following section can be put in place to reduce the affected payloads

Opportunities For Tuning:

1. Stagger your Discovery schedules

The Run option on a Discovery schedule record enables you to specify when a Discovery schedule should run. Specify Run = After Discovery and populate the Run after field with the Discovery schedule you want to complete before starting the second. For example, "Discovery Schedule 1" runs 5 probes. You want to make sure that those 5 probes are processed before starting "Discovery Schedule 2." This spaces out the ASYNC: Discovery - Sensors jobs that process the probe payloads and helps with performance for users on nodes that are processing these jobs. In this example, set the following fields on the "Discovery Schedule 2" record: Run = After Discovery, Run after = Discovery Schedule 1.

2. Break up the payloads that are returned from Discovery probes and AWS actions:

AWS

The ability to paginate was introduced for DescribeInstances and DescribeVolumes. To implement pagination for AWS calls that return large payloads, set the maxResults field on the DescribeInstances and/or DescribeVolumes AWS Action record.
The DescribeImages API call does not support pagination. Filters can be used to limit the amount of data returned in these payloads. For more information, see KB0598789 and the Amazon Web Services doc site.
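
The effect of setting maxResults can be pictured as one large response being split into several smaller pages, each fetched with a continuation token. The following Python sketch simulates that pattern entirely in memory. The functions describe_page and iter_pages are hypothetical stand-ins written for illustration; they do not call AWS or the AWS Action record, and the token format is an assumption.

```python
def describe_page(items, max_results, next_token=None):
    """Hypothetical stand-in for one paginated call.

    Returns (page, next_token); next_token is None on the last page.
    """
    start = int(next_token) if next_token else 0
    end = start + max_results
    token = str(end) if end < len(items) else None
    return items[start:end], token


def iter_pages(items, max_results):
    """Yield pages until the stand-in call stops returning a token."""
    token = None
    while True:
        page, token = describe_page(items, max_results, token)
        yield page
        if token is None:
            break


# Assumed example: 2,500 instances returned with maxResults = 1000.
instances = list(range(2500))
pages = list(iter_pages(instances, 1000))
page_sizes = [len(page) for page in pages]
print(page_sizes)
```

Each page becomes its own, smaller payload, so no single sensor job has to load the full result set into memory at once.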

Shazaam

Shazaam provides batching to reduce payload sizes returned. This feature was added in Eureka and addressed in KB0565023.
If Shazaam probes are returning large payloads, reduce the batch size from 5,000 to 500.
For more information, see Create a Discovery Schedule in the ServiceNow product documentation.
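
As a rough illustration of what the smaller batch size does, the sketch below counts how many batches a schedule is split into. It assumes that payload size grows roughly in proportion to the number of IP addresses in a batch, which is a simplification; the 20,000-address schedule and the batch_count function are hypothetical.

```python
import math


def batch_count(ip_count, batch_size):
    """Number of batches needed to cover ip_count addresses."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(ip_count / batch_size)


# Assumed example: a schedule covering 20,000 IP addresses.
ip_count = 20_000
for size in (5_000, 500):
    print(f"batch size {size}: {batch_count(ip_count, size)} batches")
```

Under that assumption, dropping the batch size from 5,000 to 500 produces ten times as many batches, each carrying roughly one tenth of the data.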

Other Options

Reduce the number of IP addresses in problematic schedules (split into multiple schedules)
Do not exclude IPs from a large IP range - use multiple IP ranges instead
Do not include a large number of small IP ranges - simplify by aggregating consecutive ranges where possible
Reduce the number of MID servers being used by problematic schedules
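
The two IP range suggestions above can be worked out ahead of time with Python's standard ipaddress module. This is a minimal sketch for planning ranges outside the instance, not a platform feature; the function names and the example networks are assumptions chosen for illustration.

```python
import ipaddress


def aggregate(cidrs):
    """Merge adjacent or overlapping networks into the fewest CIDR blocks."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


def split_around(large, excluded):
    """Express 'large minus excluded' as a list of plain include ranges."""
    big = ipaddress.ip_network(large)
    hole = ipaddress.ip_network(excluded)
    return [str(net) for net in sorted(big.address_exclude(hole))]


# Three consecutive small ranges collapse into a single /24.
merged = aggregate(["10.0.0.0/26", "10.0.0.64/26", "10.0.0.128/25"])
print(merged)

# Instead of excluding 10.0.0.128/26 from a /24, list the remaining ranges.
remaining = split_around("10.0.0.0/24", "10.0.0.128/26")
print(remaining)
```

The first call shows consecutive ranges aggregated into one entry; the second shows an exclusion rewritten as two ordinary ranges.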

Related Problem: PRB714288
