Event Management Events Stuck in 'Ready' state

Issue

This article provides troubleshooting guidelines for event entries in the [em_event] table stuck in 'Ready' state for an extended period of time.

Release

All releases.

Cause

Several causes can leave events stuck in the 'Ready' state. Events are processed by the "Event Management - process events" scheduled job.

Cause 1: Event processing is backed up.

By default, Multi Node Event Processing is not enabled, so only one scheduled job processes events.
If events are created faster than the system can process them, a backlog builds up and new events stay in the 'Ready' state for longer periods of time.

Cause 2: Custom Business Rules and Script Includes add extra overhead, delaying event processing

As a best practice, we do not recommend having custom Business Rules on the em_event and em_alert tables.
If a custom Business Rule has to be added, make sure it is performant.
Under high event load, even minimal added overhead can add up to a major delay, even if Multi Node Event Processing and multi-threading are enabled.

Cause 3: Events are not getting picked up due to an issue with the Schedule Manager

This usually happens when Multi Node Event Processing is enabled along with multiple scheduled jobs.
Each event coming into ServiceNow is assigned a specific bucket between 0 and 99.
When Multi Node Event Processing is enabled, the bucket range on each node is divided evenly among the scheduled jobs.
Ex: if the number of scheduled jobs processing events is 4, each job is responsible for processing the events in one specific range: [0 - 24], [25 - 49], [50 - 74], [75 - 99].
The stuck events likely belong to buckets in one of the above ranges. This indicates that the scheduled job assigned to the affected range is not operational.
This was suspected to be caused by the Schedule Manager going out of sync when nodes leave or join the cluster.
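
The bucket split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the platform's implementation: the function names are hypothetical, and the way a remainder is distributed when 100 does not divide evenly by the job count is an assumption. Grouping the buckets of the stuck events by owning job helps show whether they all fall into the range of a single job.

```python
from collections import Counter


def bucket_ranges(num_jobs: int, num_buckets: int = 100) -> list[tuple[int, int]]:
    """Split buckets 0..num_buckets-1 into num_jobs contiguous inclusive ranges."""
    if not 1 <= num_jobs <= num_buckets:
        raise ValueError("num_jobs must be between 1 and num_buckets")
    base, extra = divmod(num_buckets, num_jobs)
    ranges = []
    start = 0
    for job in range(num_jobs):
        # Assumption: any remainder is spread over the first `extra` jobs.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def owning_job(bucket: int, num_jobs: int, num_buckets: int = 100) -> int:
    """Return the index of the job whose range contains the given bucket."""
    for index, (low, high) in enumerate(bucket_ranges(num_jobs, num_buckets)):
        if low <= bucket <= high:
            return index
    raise ValueError(f"bucket {bucket} is outside 0..{num_buckets - 1}")


def stuck_events_per_job(stuck_buckets: list[int], num_jobs: int) -> Counter:
    """Count stuck events per owning job to spot a job that is not running."""
    return Counter(owning_job(bucket, num_jobs) for bucket in stuck_buckets)


if __name__ == "__main__":
    print(bucket_ranges(4))
    # Hypothetical bucket values taken from stuck events
    print(stuck_events_per_job([51, 60, 60, 74], num_jobs=4))
```

If every stuck event maps to the same job index, that job's range is the one to investigate.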

Cause 4: Events are not queried due to table sharding.

The em_event table is sharded.
By design, Event Management only queries events in the current shard and the one before it.
If, for some reason, events were not picked up for more than 2 days (Causes 1 - 3, jobs suspended, etc.), i.e. those events are now in shards N-2 to N-7, they will not be picked up again.
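
As a rough illustration of the shard window, the sketch below checks whether an event would still be queried. It assumes one shard per calendar day, which the "more than 2 days" wording suggests but this article does not state outright. The function and constant names are hypothetical.

```python
from datetime import date

# Per the description above: only the current shard (N) and the previous one (N-1)
# are queried.
QUERIED_SHARD_OFFSETS = (0, 1)


def shard_offset(event_day: date, today: date) -> int:
    """Number of shards between the event's shard and the current one.

    Assumption: shards rotate once per calendar day.
    """
    return (today - event_day).days


def will_be_queried(event_day: date, today: date) -> bool:
    """True if the event still sits in a shard that event processing queries."""
    return shard_offset(event_day, today) in QUERIED_SHARD_OFFSETS


if __name__ == "__main__":
    today = date(2024, 5, 10)
    for created in (date(2024, 5, 10), date(2024, 5, 9), date(2024, 5, 8)):
        print(created, will_be_queried(created, today))
```

Under this assumption, an event created two or more days ago falls outside the queried window even though it is still in the 'Ready' state.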

Resolution

Cause 1: Event processing is backed up

Enable Multi Node Event Processing and increase the number of scheduled jobs processing events.
A load test has to be carried out to compare the event consumption rate against the event creation rate.
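
The rate comparison can be turned into simple arithmetic once the load test has produced numbers. The sketch below is a back-of-the-envelope estimate only: it assumes throughput scales linearly with the number of scheduled jobs, which a real instance may not achieve, and the function names, the headroom factor, and the sample rates are all hypothetical.

```python
import math


def backlog_after(seconds: float, creation_rate: float, consumption_rate: float,
                  initial_backlog: float = 0.0) -> float:
    """Estimated number of events waiting in 'Ready' after the given time.

    Rates are in events per second; the backlog never drops below zero.
    """
    return max(0.0, initial_backlog + (creation_rate - consumption_rate) * seconds)


def jobs_needed(creation_rate: float, per_job_rate: float, headroom: float = 1.25) -> int:
    """Estimated number of scheduled jobs needed to keep up with event creation.

    Assumption: each job adds the same throughput (linear scaling).
    """
    if per_job_rate <= 0:
        raise ValueError("per_job_rate must be positive")
    return math.ceil(creation_rate * headroom / per_job_rate)


if __name__ == "__main__":
    # Hypothetical load-test figures: 50 events/s created, 40 events/s consumed
    print(backlog_after(3600, creation_rate=50, consumption_rate=40))
    print(jobs_needed(creation_rate=50, per_job_rate=20))
```

A backlog that keeps growing over the test window means the consumption rate is below the creation rate and more processing capacity is needed.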