How the incoming events are assigned to the Event Processing Scheduled Jobs:
(1) When events are received, each event is assigned a bucket number, which is stored in the "bucket" field on the event record.
(2) The buckets are distributed among the available event processing scheduled jobs. The bucket range assigned to each job is shown in the scheduled job record in the sys_trigger table. The jobs are named Event Management - process events - #N.
Each bucket range is assigned to a particular job name. For example, Event Management - process events - #1 can be assigned the range 0-25. Every node in the cluster has a job with the same name.
- When this job runs on a particular node, it calculates the size of its sub-range (in NodeloadInfo) as Math.ceil(range_size / number of nodes). For a 4-node cluster, each Event Management - process events - #1 job would have a sub-range of up to 7 buckets: Node1 could be assigned 0-6, Node2 7-13, Node3 14-20, and Node4 21-24.
- This way, the Event Management - process events - #1 job on each node processes its own exclusive buckets. As long as the cluster does not change, no more than one node can process a particular bucket.
- This design is prone to an issue: if the child job is not recreated on a node for any reason, the bucket sub-range belonging to that node will not be processed.
(3) The number of these scheduled jobs is configurable through the system property evt_mgmt.event_processor_job_count. It can also be viewed under Event Management > Properties > Number of scheduled jobs processing events.
(4) The events are then claimed by the scheduled jobs according to their assigned buckets.
- Before the New York release, events were assigned to the expected active job without checking whether that job was actually running on the node. The status field of the event was then updated to queued.the_sys_id_of_the_scheduled_job. This resulted in stuck events, because they had been claimed by a scheduled job that was no longer running on a specific node.
- Starting with the New York release, the "Event Management - Coordinator Job" was introduced. It runs every 30 seconds and makes sure that all Event Management - process events - #N jobs are running according to the configured settings. If it finds any issues, it corrects the number of jobs. Once the number of jobs is corrected, all the jobs waiting for the coordinator job are able to run.
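To illustrate the sub-range calculation described in step (2), the following Python sketch splits one job's bucket range across the nodes of a cluster. It is not ServiceNow code: the function name node_sub_range and its parameters are hypothetical. It also assumes that the range 0-25 means buckets 0 through 24 (upper bound exclusive), since that reading reproduces the sub-ranges 0-6, 7-13, 14-20 and 21-24 given in the example.

```python
import math


def node_sub_range(start, end, node_count, node_index):
    """Return the inclusive (first, last) bucket pair for one node, or None.

    Assumption: the job's range is half-open, [start, end).
    node_index is 0-based (Node1 -> 0, Node2 -> 1, ...).
    """
    # Size of each sub-range: ceil(range_size / number of nodes)
    size = math.ceil((end - start) / node_count)
    first = start + node_index * size
    # The last node may get fewer buckets than the others
    last = min(first + size, end) - 1
    return (first, last) if first <= last else None


# 4-node cluster, job range 0-25 (25 buckets under the assumption above)
ranges = [node_sub_range(0, 25, 4, i) for i in range(4)]
print(ranges)  # [(0, 6), (7, 13), (14, 20), (21, 24)]
```

If one node has no running job, the buckets in its sub-range are not covered by any other node, which is the failure mode described in step (2).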
How to identify an issue with the event processing jobs:
(1) Open the record for the hanging event in the em_event table and take note of the bucket field value in the event's record.
(2) Go to the sys_trigger table (System Scheduler > Scheduled Jobs > Today's Scheduled Jobs) and filter on the jobs whose names start with Event Management - process events.
(3) For those jobs:
- Given the information above, check whether the expected number of scheduled jobs is present.
- If it is not, identify the node that is missing a job. This can be done by grouping by the "Claimed by" column to see which node lacks the job.
- Check the "Next action" and "Claimed by" columns to see when a job should be triggered next and which node is claiming it:
- The jobs should run every 5 seconds, so if "Next action" is not within 5 seconds from now, the job is stuck.
- If the job was claimed by a passive node, the job is also stuck.
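The checks in step (3) can be sketched in Python as follows. This is only an illustration over rows already exported from sys_trigger. It does not call any ServiceNow API, and the dictionary keys (name, next_action, claimed_by), the function names, and the one-second tolerance are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_stuck_jobs(jobs, active_nodes, now, interval_s=5, tolerance_s=1):
    """Return (name, claimed_by) for jobs that look stuck.

    A job is flagged when its next action is not within the expected
    interval from now, or when it is claimed by a node that is not active.
    """
    stuck = []
    for job in jobs:
        delta = (job["next_action"] - now).total_seconds()
        bad_schedule = not (-tolerance_s <= delta <= interval_s + tolerance_s)
        passive_node = job["claimed_by"] not in active_nodes
        if bad_schedule or passive_node:
            stuck.append((job["name"], job["claimed_by"]))
    return stuck


def nodes_missing_job(jobs, active_nodes, job_name):
    """Group by "Claimed by" and return active nodes with no copy of job_name."""
    claimed = {j["claimed_by"] for j in jobs if j["name"] == job_name}
    return sorted(set(active_nodes) - claimed)


# Hypothetical sample data
now = datetime(2024, 1, 1, 12, 0, 0)
active_nodes = ["node1", "node2", "node3"]
job1 = "Event Management - process events - #1"
job2 = "Event Management - process events - #2"
jobs = [
    {"name": job1, "next_action": now + timedelta(seconds=3), "claimed_by": "node1"},
    {"name": job1, "next_action": now - timedelta(seconds=120), "claimed_by": "node2"},
    {"name": job2, "next_action": now + timedelta(seconds=2), "claimed_by": "node9"},
]

print(find_stuck_jobs(jobs, active_nodes, now))
print(nodes_missing_job(jobs, active_nodes, job1))
```

In this sample, the job on node2 is flagged because its next action is two minutes overdue, the job claimed by node9 is flagged because node9 is not an active node, and node3 is reported as missing its copy of job #1.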
Out of the box, a self-health feature monitors some of the Event Management functionality. If this feature is enabled, alerts are created when it detects a potential issue. The two alerts relevant to this topic are "Event Processing Job" and "Delay in event processing".
You can check the em_alert table for alerts created with the following information. If any are found and you are experiencing delays in event processing, open a case with ServiceNow support:
Description starts with:
To check whether this feature is enabled in your environment:
Event Management > Settings > Properties > Enable Event Management self-health monitoring