Description

This article provides troubleshooting guidelines for event entries in the [em_event] table that remain stuck in the 'Ready' state for an extended period of time.


Release or Environment

All releases.

Cause

Several causes can leave events stuck in the 'Ready' state. Events are processed by the "Event Management - process events" scheduled job.


Cause 1: Event processing is backed up.

  1. By default, Multi Node Event Processing is not enabled, so only one scheduled job processes events.
  2. If events are created faster than the system can process them, a backlog builds up and new events stay in the 'Ready' state for longer periods.
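
The backlog effect described above can be illustrated with simple arithmetic. This is a minimal sketch, not platform code: the function names and the rates in the usage note are hypothetical, and it assumes constant creation and processing rates.

```python
def backlog_after(seconds: float, creation_rate: float,
                  processing_rate: float, initial_backlog: float = 0.0) -> float:
    """Events still waiting in 'Ready' after `seconds`.

    Rates are in events per second and assumed constant (an assumption
    made for illustration only).
    """
    net_growth = creation_rate - processing_rate
    # The backlog cannot go below zero when processing outpaces creation.
    return max(0.0, initial_backlog + net_growth * seconds)


def wait_for_new_event(backlog: float, processing_rate: float) -> float:
    """Approximate seconds a newly created event waits behind the backlog."""
    if processing_rate <= 0:
        raise ValueError("processing_rate must be positive")
    return backlog / processing_rate
```

For example, with a hypothetical 120 events/s created and 100 events/s processed, `backlog_after(3600, 120, 100)` gives 72000 events after one hour, and `wait_for_new_event(72000, 100)` gives a 720-second wait for a new event.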


Cause 2: Custom Business Rules and Script Includes add extra overhead, delaying event processing

  1. As a best practice, we do not recommend adding custom Business Rules to the em_event and em_alert tables.
  2. If a custom Business Rule has to be added, make sure it is performant.
  3. Under high event load, even minimal per-event overhead can add up to a major delay, even if Multi Node Event Processing and multi-threaded processing are enabled.
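
To see how a small per-event overhead accumulates, consider the following rough estimate. It is a sketch under stated assumptions: the function name is hypothetical, and it assumes the work is spread perfectly evenly across the scheduled jobs, which real systems rarely achieve.

```python
def added_delay_seconds(events: int, overhead_ms_per_event: float,
                        parallel_jobs: int = 1) -> float:
    """Total extra processing time caused by customization overhead.

    Assumes ideal, evenly balanced parallelism across `parallel_jobs`
    (an optimistic simplification for illustration).
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    return events * overhead_ms_per_event / 1000.0 / parallel_jobs
```

With a hypothetical 5 ms of overhead per event, `added_delay_seconds(100_000, 5)` gives 500 seconds of extra processing time for 100,000 events on a single job, and still 125 seconds when spread across 4 jobs.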


Cause 3: Events are not getting picked up due to an issue with the Schedule Manager

  1. This usually happens when Multi Node Event Processing is enabled along with multiple scheduled jobs.
  2. Each event coming into ServiceNow is assigned a bucket between 0 and 99.
  3. When Multi Node Event Processing is enabled, the bucket range is divided evenly among the scheduled jobs on each node.
  4. Example: if 4 scheduled jobs are processing events, each job is responsible for the events in one range: [0 - 24], [25 - 49], [50 - 74], [75 - 99].
  5. The stuck events likely belong to buckets in one of the above ranges, which indicates that the scheduled job assigned to that range is not operational.
  6. The suspected cause is the Schedule Manager going out of sync when nodes leave or join the cluster.
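
The bucket split described above can be sketched as follows, which may help map the buckets of stuck events back to the job that should have processed them. The function names are hypothetical, and the handling of job counts that do not divide 100 evenly is an assumption; the article only documents the even case.

```python
from typing import List, Tuple


def bucket_ranges(num_jobs: int, num_buckets: int = 100) -> List[Tuple[int, int]]:
    """Split buckets 0..num_buckets-1 into contiguous inclusive ranges, one per job."""
    if not 1 <= num_jobs <= num_buckets:
        raise ValueError("num_jobs must be between 1 and num_buckets")
    base, extra = divmod(num_buckets, num_jobs)
    ranges = []
    start = 0
    for job in range(num_jobs):
        # Assumption: any remainder is spread over the first jobs.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def owning_job(bucket: int, num_jobs: int) -> int:
    """Return the zero-based index of the job whose range contains `bucket`."""
    for index, (low, high) in enumerate(bucket_ranges(num_jobs)):
        if low <= bucket <= high:
            return index
    raise ValueError("bucket is outside the valid range")
```

With 4 jobs, `bucket_ranges(4)` reproduces the ranges from the example: (0, 24), (25, 49), (50, 74), (75, 99). If all stuck events carry buckets such as 55 or 60, `owning_job` points to the third job (index 2) as the one that is not operational.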


Cause 4: Events are not queried due to table sharding

  1. The em_event table is sharded.
  2. By design, Event Management only queries events in the current shard and the one before it.
  3. If events are not picked up for more than 2 days for any reason (Causes 1 - 3, suspended jobs, etc.), they end up in shards N-2 to N-7 and will not be picked up again.
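
The query window can be expressed as a simple check. This is an illustrative sketch only: it assumes shards are numbered sequentially with N as the current shard, and the function name is hypothetical rather than part of the platform.

```python
def is_queried(event_shard: int, current_shard: int, lookback: int = 1) -> bool:
    """True if an event's shard falls inside the window Event Management queries.

    `lookback=1` models "the current shard and the one before it".
    Sequential shard numbering is an assumption made for illustration.
    """
    age = current_shard - event_shard
    return 0 <= age <= lookback
```

With the current shard at N = 10, events in shards 10 and 9 are queried, while an event left in shard 8 (N-2) is no longer picked up.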


Resolution

Cause 1: Event processing is backed up

  • Enable Multi Node Event Processing and increase the number of scheduled jobs processing events.
  • Carry out a load test to compare the event consumption rate against the event creation rate.


Cause 2: Custom Business Rules and Script Includes add extra overhead, delaying event processing

  • Add timing logic to the custom Business Rules and Script Includes to measure how much delay is added for each event.
  • Factor in the event rate and the event processing cycle of the scheduled job, and optimize the customization as much as possible.
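
The measurement pattern can be sketched as below. Business Rules and Script Includes are written in server-side JavaScript, so this Python example only illustrates the idea of wrapping the custom logic with timing; `custom_rule` is a hypothetical stand-in for the customization being measured.

```python
import time
from functools import wraps


def timed(timings_ms):
    """Decorator that appends each call's duration in milliseconds to `timings_ms`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped logic raises.
                timings_ms.append((time.perf_counter() - start) * 1000.0)
        return wrapper
    return decorator


timings_ms = []


@timed(timings_ms)
def custom_rule(event):
    # Hypothetical stand-in for custom per-event logic.
    event["processed"] = True
    return event


for i in range(1000):
    custom_rule({"id": i})

average_ms = sum(timings_ms) / len(timings_ms)
print(f"calls={len(timings_ms)} average={average_ms:.4f} ms max={max(timings_ms):.4f} ms")
```

Multiplying the measured average by the expected number of events per processing cycle gives a rough estimate of the delay the customization adds to each cycle.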


Cause 3: Events are not getting picked up due to an issue with the Schedule Manager

  • The issue occurs very infrequently and is not reproducible in internal testing. The current workaround is to refresh the scheduled jobs as follows:
    1. Go to Event Management > Settings > Properties.
    2. Change 'Number of scheduled jobs processing events' to a different value (e.g. from 2 to 1) and save.
    3. Change 'Number of scheduled jobs processing events' back to the previous value and save.


Cause 4: Events are not queried due to table sharding

  • If the stuck events need to be processed, contact Customer Support through HI to have the events moved to the current shards.



Article Information

Last Updated: 2020-05-31 06:24:28
Published: 2020-05-31