

Symptoms


Events in the em_event table remain stuck in the 'Ready' state for an extended period of time.

Release


All releases. 

Cause


Several causes can lead to events being stuck in the 'Ready' state. Events are processed by the "Event Management - process events" scheduled job(s).

Cause 1: Event processing is backed up.

  1. By default, Multi Node Event Processing is not enabled, so there is only one scheduled job processing events.
  2. If events are created faster than the system can process them, processing backs up and new events stay in the 'Ready' state for longer periods of time.

 

Cause 2: Custom Business Rules and Script Includes add extra overhead, delaying event processing

  1. As a best practice, we do not recommend having Custom Business Rules on the em_event and em_alert tables.
  2. If Custom Business Rules have to be added, make sure they are performant.
  3. Under high event load, even minimal overhead per event can add up to a major delay, even if Multi Node Event Processing and multithreading are enabled.

 

Cause 3: Events are not getting picked up due to an issue with the Schedule Manager

  1. This usually happens when Multi Node Event Processing is enabled along with multiple scheduled jobs.
  2. Each event coming into ServiceNow is assigned a specific bucket between 0 and 99.
  3. When Multi Node Event Processing is enabled, the bucket range on each node is divided evenly among the scheduled jobs.
  4. Ex: if the number of scheduled jobs processing events is 4, then each job is responsible for processing the events in one specific range: [0 - 24], [25 - 49], [50 - 74], [75 - 99].
  5. The stuck events likely belong to buckets in one of the above ranges. This indicates that the scheduled job assigned to the affected range is not operational.
  6. This is suspected to be caused by the Schedule Manager going out of sync when nodes leave or join the cluster.
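
The bucket split described above can be sketched in plain JavaScript to work out which job owns a stuck event's bucket. This is only an illustration, not the platform's implementation: the function names bucketRanges and jobForBucket are hypothetical, and the handling of a job count that does not divide 100 evenly (the first jobs take one extra bucket) is an assumption.

```javascript
// Hypothetical helper: split buckets 0..99 among `jobCount` scheduled jobs.
// Assumption: when 100 is not evenly divisible, the first jobs get one extra bucket.
function bucketRanges(jobCount) {
    if (jobCount < 1) {
        throw new Error('jobCount must be at least 1');
    }
    var total = 100;
    var base = Math.floor(total / jobCount);
    var extra = total % jobCount;
    var ranges = [];
    var start = 0;
    for (var i = 0; i < jobCount; i++) {
        var size = base + (i < extra ? 1 : 0);
        ranges.push({ job: i, from: start, to: start + size - 1 });
        start += size;
    }
    return ranges;
}

// Hypothetical helper: return the index of the job whose range contains `bucket`,
// or -1 if the bucket is outside 0..99.
function jobForBucket(bucket, jobCount) {
    var ranges = bucketRanges(jobCount);
    for (var i = 0; i < ranges.length; i++) {
        if (bucket >= ranges[i].from && bucket <= ranges[i].to) {
            return ranges[i].job;
        }
    }
    return -1;
}
```

If every stuck event maps to the same job index, that supports the theory that a single scheduled job has stopped processing its range.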

 

Cause 4: Events are not queried due to table sharding.

  1. The em_event table is sharded.
  2. By design, Event Management only queries events in the current shard and the one before it.
  3. If, for some reason, events were not picked up for more than 2 days (Causes 1 - 3, suspended jobs, etc.), they now sit in shards N-2 to N-7 and will not be picked up again.

 

Resolution


Cause 1: Event processing is backed up

  • Enable Multi Node Event Processing and increase the number of scheduled jobs processing events.
  • A load test has to be carried out to compare the event consumption rate with the event creation rate.
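
The rate comparison can be expressed as simple arithmetic. The sketch below assumes steady average rates measured in events per second during a load test; the function names backlogAfter and secondsToDrain are hypothetical and not part of any product API.

```javascript
// Hypothetical helper: events waiting in 'Ready' after `seconds`, given
// average creation and processing rates in events per second.
function backlogAfter(creationRate, processingRate, seconds, initialBacklog) {
    var net = creationRate - processingRate;
    var backlog = (initialBacklog || 0) + net * seconds;
    return Math.max(0, backlog); // a backlog cannot go below zero
}

// Hypothetical helper: seconds needed to clear `backlog`.
// Returns Infinity when processing does not outpace creation.
function secondsToDrain(backlog, creationRate, processingRate) {
    var net = processingRate - creationRate;
    if (net <= 0) {
        return Infinity;
    }
    return backlog / net;
}
```

For example, 120 events created per second against 100 processed per second leaves 72,000 events waiting after one hour; raising processing capacity to 150 per second while creation falls back to 100 per second would clear that backlog in 1,440 seconds (24 minutes).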

 

Cause 2: Custom Business Rules and Script Includes add extra overhead, delaying event processing

  • Add timing logic to Custom Business Rules and Script Includes to see how much delay is added for each event.
  • Factor in the event rate and the event processing cycle of the scheduled job, and optimize the customization as much as possible.
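
One possible shape for such timing logic is a wrapper that measures each call to a custom function and reports the elapsed time. This is a generic JavaScript sketch under stated assumptions: timed and addedWorkPerSecond are hypothetical names, and the log callback stands in for whatever logging the instance uses.

```javascript
// Hypothetical wrapper: returns a function that behaves like `fn` but
// reports how long each call took through the supplied `log` callback.
function timed(label, fn, log) {
    return function () {
        var start = Date.now();
        try {
            return fn.apply(this, arguments);
        } finally {
            log(label + ' took ' + (Date.now() - start) + ' ms');
        }
    };
}

// Hypothetical helper: seconds of extra processing work generated per
// second of wall-clock time, for a given per-event overhead and event rate.
function addedWorkPerSecond(perEventMs, eventsPerSecond) {
    return (perEventMs * eventsPerSecond) / 1000;
}
```

As an illustration of how small overheads add up, 5 ms of custom logic per event at 400 events per second adds 2 seconds of work for every second of incoming events, which a single job cannot absorb.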

 

Cause 3: Events are not getting picked up due to an issue with the Schedule Manager

  • The issue occurs very infrequently and is not reproducible in internal testing. The current workaround is to refresh the scheduled jobs as follows:
    1. Go to Event Management > Settings > Properties.
    2. Change 'Number of scheduled jobs processing events' to a different value (ex: from 2 to 1) and save.
    3. Change 'Number of scheduled jobs processing events' back to the previous value and save.

 

Cause 4: Events are not queried due to table sharding.

  • If the stuck events need to be processed, contact Customer Support through HI to have the events moved to the current shards.

 

Additional Information


Installed with Incident Management Best Practice – Kingston

Article Information

Last Updated: 2019-01-14 02:16:54
Published: 2019-01-14