Event assignment to Event Processing Jobs and how to identify a hang or delay in an event processing job

Summary

How the incoming events are assigned to the Event Processing Scheduled Jobs:

(1) When the events are received, each event gets assigned to a bucket number. This will be reflected in the field "bucket" on the event record.

(2) We distribute the buckets to the available event processing scheduled jobs. You can see the bucket range assigned to each job in the scheduled job record in the sys_trigger table. The jobs are called Event Management - Process event #:

Each bucket range is assigned to a particular job name. For example, Event Management - process events - #1 can be assigned the range 0-25. Each node in the cluster has a job with the same name.

When Event Management > Properties > Enable multi node event processing is set to yes:
When this job runs on a particular node, it calculates its sub-range (in NodeloadInfo), whose size is Math.ceil(range_size / number of nodes). For a 4-node cluster, each Event Management - process events - #1 job would have a sub-range of up to 7 buckets. Node1 could be assigned 0-6, Node2 7-13, Node3 14-20, and Node4 21-24.
This way, the Event Management - process events - #1 job on each node processes its own exclusive buckets. If there are no changes in the cluster, no more than one node can process a particular bucket.
This design is prone to an issue: if a child job is not recreated on a node for any reason, a certain bucket range will not get processed. (This might happen when nodes are shut down, started, or added.)
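The sub-range calculation described above can be sketched as follows. This is only an illustration of the Math.ceil arithmetic, not the platform's actual code: the function name is hypothetical, and it assumes the job's range is half-open (25 buckets, 0 through 24), which is the reading that reproduces the 4-node example.

```python
import math


def node_sub_ranges(range_start, range_end, node_count):
    """Split the half-open bucket range [range_start, range_end) across nodes.

    Hypothetical helper: returns one (first_bucket, last_bucket) tuple per
    node, inclusive on both ends, or None for a node that gets no buckets.
    """
    range_size = range_end - range_start
    # Same arithmetic as Math.ceil(range_size / number of nodes).
    sub_size = math.ceil(range_size / node_count)
    sub_ranges = []
    for node_index in range(node_count):
        first = range_start + node_index * sub_size
        last = min(first + sub_size, range_end) - 1
        # With more nodes than buckets, trailing nodes receive nothing.
        sub_ranges.append((first, last) if first <= last else None)
    return sub_ranges


print(node_sub_ranges(0, 25, 4))
# [(0, 6), (7, 13), (14, 20), (21, 24)]
```

Because every sub-range is derived from the node's position in the cluster, a node whose child job is missing leaves its whole sub-range unprocessed, which is the failure mode noted above.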

(3) The number of these scheduled jobs is configurable through the system property evt_mgmt.event_processor_job_count. It can also be viewed at Event Management > Properties > Number of scheduled jobs processing events.

When Event Management > Properties > Enable multi node event processing is set to yes, the number of “Event Management - process events” jobs should be: (Number of scheduled jobs × number of active nodes) + Number of scheduled jobs. Number of scheduled jobs is the value set in Event Management > Properties. Active nodes are nodes that have status online and scheduler “any” in sys_cluster_state. The second term accounts for the jobs with System ID = “ACTIVE NODES”.
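As a quick check of that formula, the expected job count can be computed like this. It is a minimal sketch: the function name and the dictionary keys ("status", "scheduler") are assumptions chosen for illustration and are not guaranteed to match the column names in sys_cluster_state.

```python
def expected_multi_node_job_count(configured_jobs, nodes):
    """Expected number of "Event Management - process events" jobs when
    multi node event processing is enabled.

    configured_jobs: value of "Number of scheduled jobs" in the properties.
    nodes: list of dicts with hypothetical keys "status" and "scheduler".
    """
    # Active nodes: status online and scheduler "any".
    active_nodes = [
        node for node in nodes
        if node["status"] == "online" and node["scheduler"] == "any"
    ]
    per_node_jobs = configured_jobs * len(active_nodes)
    # Jobs whose System ID is "ACTIVE NODES".
    active_nodes_jobs = configured_jobs
    return per_node_jobs + active_nodes_jobs


cluster = [
    {"status": "online", "scheduler": "any"},
    {"status": "online", "scheduler": "any"},
    {"status": "online", "scheduler": "any"},
    {"status": "offline", "scheduler": "any"},
]
print(expected_multi_node_job_count(2, cluster))
# 8
```

If the number of jobs listed in sys_trigger differs from this value, the job set is likely incomplete or contains leftovers.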

(4) The events are then claimed by the scheduled jobs according to their assigned buckets.

Before the New York release, when a job handled events, it first marked all events to be handled in the current run with the status queued.the_sys_id_of_the_scheduled_job. This resulted in stuck events when they had been claimed by a scheduled job that was no longer running on a specific node.
Starting from the New York release, there is no queued status anymore. In addition, the "Event Management - Coordinator Job" was introduced. It runs every 30 seconds and makes sure that all Event Management - Process event # jobs are running according to the configured settings. If it finds any issues, it fixes the number of jobs. Once the number of jobs is corrected, all the jobs waiting for the coordinator job are able to run.

(5) Note that em_event is a rolling table configured for 7 days (the table changes every 24 hours). Event processing jobs process only events that exist in the current or previous tables (2 days back). Events are kept in these tables for ~5.5 days.

How to identify an issue with the event processing jobs:

When you see events that have not been processed for a long time and are still in READY or QUEUED state:

(1) Open the record for the hanging event in the em_event table and take note of the bucket field value in the event's record.

(2) Go to the sys_trigger table: System Scheduler > Scheduled Jobs > Today's Scheduled Jobs, and filter on the jobs whose names start with Event Management - process events.

(3) For those jobs:

Given the information provided above, check that the number of scheduled jobs is correct.
- If it is not, identify the node that is missing a job. To do this, group the list by the "Claimed by" column and see which node has fewer jobs than expected.
Check the "Next action" and "Claimed by" columns to see when each job should be triggered next and which node has claimed it:
- The jobs should run every 5 seconds, so if "Next action" is not within about 5 seconds of the current time, the job is stuck.
- If the job was claimed by a passive node, the job is also stuck.
- Check that the job state is not error or queued.
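The checks in step (3) can be expressed as a small script run over exported job records. This is a hedged sketch, not a platform API: the function name, the dictionary keys ("next_action", "claimed_by", "state"), and the 5-second tolerance are assumptions made for illustration, and the tolerance may need to be widened to allow for list refresh delays.

```python
from datetime import datetime, timedelta

# States that the checklist above treats as a problem.
PROBLEM_STATES = {"error", "queued"}


def find_job_problems(job, now, passive_nodes, max_drift=timedelta(seconds=5)):
    """Return a list of reasons why a scheduled job looks stuck.

    job: dict with hypothetical keys "next_action" (datetime),
         "claimed_by" (node name or None) and "state" (str).
    passive_nodes: set of node names known to be passive.
    """
    problems = []
    # The jobs run every 5 seconds, so "Next action" should be close to now.
    if abs(job["next_action"] - now) > max_drift:
        problems.append("next action is not within the expected interval")
    if job["claimed_by"] in passive_nodes:
        problems.append("claimed by a passive node")
    if job["state"].lower() in PROBLEM_STATES:
        problems.append("state is " + job["state"].lower())
    return problems


now = datetime(2024, 1, 1, 12, 0, 0)
stuck_job = {
    "next_action": now - timedelta(minutes=10),
    "claimed_by": "node_b",
    "state": "Error",
}
for reason in find_job_problems(stuck_job, now, passive_nodes={"node_b"}):
    print(reason)
```

An empty result for every job, together with the correct job count, suggests that the delay is not caused by the scheduled jobs themselves.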

Workaround

If the number of running event processing jobs is not correct, change Event Management > Properties > Number of scheduled jobs and then change it back to the original number (for example, change it from 2 to 3 and then back to 2). This should erase all running jobs and create them again according to the settings.
Run cache.do.
Check that the problematic events start changing their status to “Processed”.
If you see queued events that are not being processed, recreate these events (using a script) so that the event processing jobs can process them.
If you see events in READY state that were created more than 2 days ago (which means that the jobs will not process them), set the property evt_mgmt.events_processing_all_shards to true, and the event processing jobs will process events from all tables. Set it back to false after all events have been processed.
