Event Management events stuck in "Ready" state

Issue

Event Management events are not processed. As a result, no alerts are created from those events, and no incidents are created from those alerts.

Cause

01 The scheduled job "Event Management - process events" is stopped, stuck, or claimed by a passive node.

It is also possible that the jobs were not recreated properly after the number of jobs was updated or after an instance upgrade.

To confirm that this is the root cause of events remaining in the "Ready" state:

Go to "System Scheduler > Scheduled Jobs > Today's Scheduled Jobs".
Search for jobs named like "Event Management - process events".
Check the "Next action" and "Claimed by" columns.
Out of the box (OOB), this job runs every 5 seconds, so "Next action" should be a few seconds from now. If "Next action" is a time in the past, the job is likely stuck.
- Jobs whose "System ID" is "ACTIVE NODES" may show a "Next action" in the past; this is expected. These are the "parent" jobs and do not actually run.
If the job is claimed by a passive node, it is stuck as well.
Lastly, if "Enable multi node event processing" is set to true, confirm that there are (<number_of_jobs_configured> * (1 + <active_worker_nodes>)) jobs.
- Example: an instance with 6 active worker nodes, configured with 4 event processing jobs per node, would have (4 * (1 + 6)) = 28 jobs.

Note: The 1 added to the number of active worker nodes accounts for the job that is also created for the "ACTIVE NODES" system ID.
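The checks above can be sketched in a few lines of Python. This is only an illustration that runs outside the instance, for example against an exported list of scheduled jobs. The JobRow fields, the passive_nodes set, and the 30-second grace period are assumptions made for the example, not platform names or documented values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set

PARENT_SYSTEM_ID = "ACTIVE NODES"  # parent jobs; these do not actually run


@dataclass
class JobRow:
    """One row of the scheduled jobs list (field names are hypothetical)."""
    name: str
    system_id: str
    next_action: datetime
    claimed_by: Optional[str] = None  # node that claimed the job, if any


def expected_job_count(jobs_per_node: int, active_worker_nodes: int) -> int:
    """Jobs expected when multi node event processing is enabled."""
    # The extra 1 is the set of jobs created for "ACTIVE NODES".
    return jobs_per_node * (1 + active_worker_nodes)


def stuck_reason(
    job: JobRow,
    now: datetime,
    passive_nodes: Set[str],
    grace: timedelta = timedelta(seconds=30),  # assumed tolerance, not a platform value
) -> Optional[str]:
    """Return why a job looks stuck, or None if it looks healthy."""
    if job.system_id.upper() == PARENT_SYSTEM_ID:
        return None  # a past "Next action" is expected for parent jobs
    if job.claimed_by is not None and job.claimed_by in passive_nodes:
        return "claimed by a passive node"
    if job.next_action < now - grace:
        return "next action is in the past"
    return None


print(expected_job_count(4, 6))  # 28, matching the example above
```

A job count that differs from expected_job_count, or any job for which stuck_reason returns a value, points to the cause described in this section.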

When "Enable multi node event processing" is set to true, multiple event processing jobs are created. The events are divided into "buckets", and each bucket is processed by one of the jobs. If a job has issues, the events in its bucket are not processed. As a result, some events may be processed while others are not.
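To illustrate why only some events may remain unprocessed, here is a minimal Python sketch. How the platform actually assigns events to buckets is not described in this article, so the hash-based bucket_for function below is purely a hypothetical stand-in. The point is only that events whose bucket belongs to a stuck job wait, while the rest continue to be processed.

```python
import zlib
from typing import Iterable, List, Set, Tuple


def bucket_for(event_key: str, bucket_count: int) -> int:
    """Hypothetical assignment: stable hash of the key modulo the bucket count."""
    return zlib.crc32(event_key.encode("utf-8")) % bucket_count


def split_by_job_health(
    event_keys: Iterable[str], bucket_count: int, stuck_buckets: Set[int]
) -> Tuple[List[str], List[str]]:
    """Return (processed, waiting) event keys given the buckets whose job is stuck."""
    processed: List[str] = []
    waiting: List[str] = []
    for key in event_keys:
        if bucket_for(key, bucket_count) in stuck_buckets:
            waiting.append(key)  # this bucket's job is not running
        else:
            processed.append(key)
    return processed, waiting


keys = [f"event-{i}" for i in range(20)]
processed, waiting = split_by_job_health(keys, bucket_count=4, stuck_buckets={2})
print(len(processed), "processed;", len(waiting), "waiting in the stuck bucket")
```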

02 The node cache is not synced, and one or more nodes have incorrect information about the node count.

03 Custom Business Rules and Script Includes add overhead that delays event processing.

As a best practice, we do not recommend adding custom Business Rules on the em_event and em_alert tables. If custom Business Rules must be added, make sure they run fast. Under high event load, even a small overhead per event can add up to a major delay in event processing, even when multi node event processing and multiple threads are enabled.
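The arithmetic behind this can be sketched as follows. The figures are made up for illustration, and the model simply assumes the overhead is spread evenly across the parallel jobs, which real processing may not do.

```python
def added_delay_seconds(
    event_count: int, overhead_ms_per_event: float, parallel_jobs: int = 1
) -> float:
    """Extra processing time caused by per-event customization overhead.

    Simplifying assumption: the events are spread evenly across the jobs.
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    return event_count * overhead_ms_per_event / 1000.0 / parallel_jobs


# 100,000 events with 20 ms of extra work each:
print(added_delay_seconds(100_000, 20))     # 2000.0 seconds on a single job
print(added_delay_seconds(100_000, 20, 4))  # 500.0 seconds across 4 jobs
```

Even with four jobs working in parallel, 20 ms per event becomes more than eight minutes of additional processing time for this backlog.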

Resolution

01 The scheduled job "Event Management - process events" is stopped, stuck, or claimed by a passive node.

A simple solution is to recreate the jobs as follows:

Go to "Event Management > Settings > Properties"
Find the option "Number of scheduled jobs processing events", change it to a number different from the current value, and save.

This recreates the jobs so that they are claimed by an active node. The value can be reverted afterwards.

By default, Event Management processes only events created within the last two days (the current and previous shards). Therefore, even after the jobs are recreated successfully, older events may remain unprocessed. To have Event Management process events on all shards, including events older than two days, set the following property. We recommend reverting to the default behavior once event processing has caught up.

evt_mgmt.events_processing_all_shards = true

02 The node cache is not synced, and one or more nodes have incorrect information about the node count.

Flush the cache by typing cache.do in the navigation filter.

03 Custom Business Rules and Script Includes add overhead that delays event processing.

Add timing logic to the custom Business Rules and Script Includes to measure how much delay is added per event.
Factor in the event rate and the event processing cycle of the scheduled job, and optimize the customization as much as possible.
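Once the per-event delay has been measured, a rough capacity check can show whether the jobs can still keep up with the incoming event rate. The Python sketch below is a simplified model under stated assumptions: base_ms_per_event is a value you would have to measure yourself, and the model ignores the job's run interval and any per-run batch limits.

```python
def utilization(
    events_per_second: float,
    base_ms_per_event: float,
    overhead_ms_per_event: float,
    parallel_jobs: int,
) -> float:
    """Fraction of available processing time consumed (above 1.0 means a growing backlog)."""
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    busy_seconds_per_second = (
        events_per_second * (base_ms_per_event + overhead_ms_per_event) / 1000.0
    )
    return busy_seconds_per_second / parallel_jobs


def keeps_up(
    events_per_second: float,
    base_ms_per_event: float,
    overhead_ms_per_event: float,
    parallel_jobs: int,
) -> bool:
    """True if events are processed at least as fast as they arrive."""
    return utilization(
        events_per_second, base_ms_per_event, overhead_ms_per_event, parallel_jobs
    ) <= 1.0


# Hypothetical figures: 500 events/s, 5 ms base cost per event, 4 jobs.
print(keeps_up(500, 5, 0, 4))   # True: utilization is 0.625
print(keeps_up(500, 5, 10, 4))  # False: 10 ms of overhead pushes it to 1.875
```

If the check fails with the measured overhead, events will accumulate in the "Ready" state until the customization is optimized or the event rate drops.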

Additional Links

Incident creation delayed or incident not created from alert
