Description

Symptoms


Under some conditions, events from Event Management remain stuck in the em_event table in the "Ready" or "Queued" state for longer than expected.

Technically, if events in the READY state were created more than 2 days ago, the currently executing "Process Event" scheduled jobs may get stuck in the "Running" state, which means they cannot move to the "Processing" or "Ready" state until the previous jobs have either completed or been terminated.

Multiple articles explain the troubleshooting methods and workarounds to identify and fix these issues. At times, however, stuck jobs can lead to heavy consumption of instance resources. In that case the em_event table must be cleaned so that new jobs can execute and the instance can return to a stable state.

Refer: 

 Environment 


  • Any instance version with the Event Management plugin activated

Investigations


Verify the number of stuck events and the state of the jobs from the UI.

  • Log in to the instance
  • Navigate >> em_event_list.do 
  • Verify the number of events in the "Ready" or "Queued" state and the date and time each event was created.

  • Navigate >> System Scheduler > Scheduled Jobs > Today's Scheduled Jobs
  • Search for the jobs named "Event Management - process events"
  • Verify the "Next action" and "Claimed by" columns. 
  • Also verify: "sys_trigger_list.do" and "sysevent_list.do"

 

Verify the number of stuck events and the state of the jobs from the CLI (Analyzer).

  • Log in to the instance 
  • Navigate >> em_event_list.do 
  • Verify the number of events in the "Ready" or "Queued" state and the date and time each event was created.
  • Observe the "Description" field and note the pattern of the events running (Example: Sysevent000)

CLI: Analyzer 

  • SQL query to group the events by state
SQL> select "state",count(*) from sysevent000 group by "state"; 

state COUNT(*) 
-------------------------------------------------- 
processed 1365158 
ready 16232 
queued.0fb2c110871321004ebe19fa84e3ecf8 834837 
error 2063 

  • SQL query to identify the application nodes where the events are stuck

SQL> select distinct "claimed_by" from sysevent000 where "state" not in('processed','error'); 

claimed_by 
---------------------------------------------------

app129023.sjc105.service-now.com:morganstanley081 
app128153.sjc105.service-now.com:morganstanley073 
app129013.sjc105.service-now.com:morganstanley068 
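
Where Analyzer access is not available, a similar grouping can be approximated from a background script. The following is only a sketch under assumptions: the function name `countEventsByState` and the injected factory `makeAggregate` are hypothetical helpers introduced here, and the script queries the logical `sysevent` table rather than the `sysevent000` table used in the SQL above, so the counts may differ.

```javascript
// Sketch: count rows per state, similar to the SQL GROUP BY shown above.
// makeAggregate is a hypothetical factory; on an instance it would return
// a GlideAggregate for the given table.
function countEventsByState(makeAggregate, table) {
    var ga = makeAggregate(table);
    ga.addAggregate('COUNT');   // COUNT(*) per group
    ga.groupBy('state');        // GROUP BY state
    ga.query();
    var counts = {};
    while (ga.next()) {
        counts[ga.getValue('state')] = parseInt(ga.getAggregate('COUNT'), 10);
    }
    return counts;
}

// On an instance (System Definition > Background scripts), for example:
// var counts = countEventsByState(function (t) { return new GlideAggregate(t); }, 'sysevent');
// for (var s in counts) { gs.print(s + ': ' + counts[s]); }
```

Passing the factory in keeps the function free of instance-only globals, so the grouping logic can be checked outside an instance before it is run against real data.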

 

Verify and capture the stack trace of any stuck event.

  • Login to the instance
  • Navigate >> stats.do

         https://<instance>.service-now.com/nav_to.do?uri=%2Fstats.do 

  • Go to Background Scheduler -> Scheduler Workers 
  • Identify the worker on which the "process event" job is currently running 
  • Double click on the job and it will open in a new window with the stack trace details. 

Cause


There are multiple scenarios in which the jobs might hang or get stuck. Check the logs and stack traces to identify the cause. Some scenarios are listed below for reference.

  • An event can become "blocked", for example when it encounters an infinite loop. It can be unblocked by finding the event and setting its state to "error". Sometimes setting the state to "error" fails, or the events do complete but run very slowly. The event processor eventually gets them processed, but the speed is not acceptable.
  • If events in the READY state were created more than 2 days ago, the currently executing "Process Event" scheduled jobs may get stuck in the "Running" state, which means they cannot move to the "Processing" or "Ready" state until the previous jobs have either completed or been terminated.
  • Event Management has 2 options to configure how many processing jobs run:

a. In the “Event Management” properties, if “Enable multi node event processing” is false, the number of jobs should equal the value of the “Number of scheduled jobs processing events” property.
b. In the “Event Management” properties, if “Enable multi node event processing” is true, the number of jobs should equal the value of the “Number of scheduled jobs processing events” property multiplied by the number of active nodes (nodes in state “online” with scheduler “any”). For example, with 2 jobs defined and 2 active nodes, the number of jobs should be 2*2 = 4, + 2 (another job for every node, with “System ID” = “All Nodes”) = 6.
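
The arithmetic in the two cases above can be sketched as a small plain JavaScript function. This is an illustration only: the function name `expectedProcessingJobs` is hypothetical, and the "+ 2" term from the example is read here as one extra "All Nodes" job per active node, which is an interpretation of the text rather than a documented rule.

```javascript
// Sketch of the expected number of "Event Management - process events" jobs.
// jobsPerNode      : value of "Number of scheduled jobs processing events"
// activeNodes      : nodes in state "online" with scheduler "any"
// multiNodeEnabled : value of "Enable multi node event processing"
function expectedProcessingJobs(jobsPerNode, activeNodes, multiNodeEnabled) {
    if (!multiNodeEnabled) {
        // Case a: the property value is used as is.
        return jobsPerNode;
    }
    // Case b: the property value multiplied by the number of active nodes...
    var perNodeJobs = jobsPerNode * activeNodes;
    // ...plus the "All Nodes" jobs (assumption: one per active node).
    var allNodesJobs = activeNodes;
    return perNodeJobs + allNodesJobs;
}
```

With the values from the example, `expectedProcessingJobs(2, 2, true)` gives 6 and `expectedProcessingJobs(2, 2, false)` gives 2. Comparing this figure with the number of jobs actually listed in the scheduler shows whether jobs are missing.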

  • If you have events in the READY state, check whether the number of jobs is correct (according to the Event Management properties described above). If it is not, change the value of “Number of scheduled jobs processing events” (for example from 2 to 3) and then change it back to the old value. This should fix the problem, because it deletes the old jobs and creates new ones.
  • If you are in this situation (you created new processing jobs by changing the property) and you have events in the QUEUED state from the last 2 days, you can change these events to the READY state (by a script), after which they should be processed.
  • If the number of jobs is not correct (or the job is not running at all), it is most often a platform issue (the platform should delete and insert jobs when nodes go up or down, etc.)
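
The step of moving recently QUEUED events back to READY is described above without a script. A minimal sketch follows, under stated assumptions: the state choice values are taken to be 'Queued' and 'Ready' (matching the 'Ready' value used by the test-data script later in this article), "the last 2 days" is read from sys_created_on, and the function name `requeueRecentQueuedEvents` and the injected factory `newRecord` are hypothetical. Test it on a sub-production instance first.

```javascript
// Sketch: set em_event records that are QUEUED and were created within the
// last 2 days back to READY. newRecord is a hypothetical factory; on an
// instance it would return a GlideRecord for the given table.
function requeueRecentQueuedEvents(newRecord) {
    var gr = newRecord('em_event');
    // Assumed choice value 'Queued'; the date filter is evaluated by the platform.
    gr.addEncodedQuery('state=Queued^sys_created_on>=javascript:gs.daysAgoStart(2)');
    gr.query();
    var updated = 0;
    while (gr.next()) {
        gr.setValue('state', 'Ready'); // assumed choice value 'Ready'
        gr.update();
        updated++;
    }
    return updated; // number of events moved back to READY
}

// On an instance (System Definition > Background scripts), for example:
// var n = requeueRecentQueuedEvents(function (t) { return new GlideRecord(t); });
// gs.print('Events set back to Ready: ' + n);
```

The returned count can be compared with the number of QUEUED events seen in em_event_list.do before the run, to confirm that the expected set of events was updated.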

 

Quick Workaround


Procedure 1:  Restart "Event management process event"

  • Login to the instance
  • In the application navigator, Type "Event Management"
  • Choose "Properties"
  • Enable "Enable multi-node event processing" and save it


To restart the "Event Management - process events" scheduled job:

  • In the application navigator, type "transaction"
  • Choose "Active Transactions (All Nodes)" under System Diagnostics
  • Filter the list for "Event" in the URL field
  • Select the transaction
  • Expand the drop-down menu at the bottom of the list and choose the "Kill" option.
  • Make sure the transaction has been killed

 

Procedure 2: Clean all the Ready and Queued state events from the em_event table 

// addEncodedQuery() expects only the encoded query, not the full list URL.
// This query selects the events created today.
var queryString = 'sys_created_onONToday@javascript:gs.beginningOfToday()@javascript:gs.endOfToday()';
var gr = new GlideRecord('em_event');
gr.addEncodedQuery(queryString);
gr.query();
while (gr.next()) {
    gr.deleteRecord();
}

Note: The script above may take a long time to clean the em_event table, depending on the number of events. For example, cleaning 1 million records might take 2 to 4 days.

Additional Information


The TSE should reproduce the issue on an out-of-box (OOB) instance and observe how the cleanup script behaves on the em_event table. Steps to reproduce on OOB:

  • Log in to the instance 
  • Navigate >> System Definition > Background scripts and execute the script below (it pushes some events to the em_event table) 
// Insert 50 test events in the Ready state
for (var i = 0; i < 50; i++) {
    var gr = new GlideRecord('em_event');
    gr.initialize();
    gr.name = 'vol-6a6c506aINCIDENT_NUMBER';
    gr.type = 'VolumeReadOps';
    gr.event_class = 'AWS CloudWatch';
    gr.severity = '5';
    gr.resolution_state = 'New';
    gr.time_of_event = new GlideDateTime();
    gr.state = 'Ready';
    gr.description = 'INCIDENT.NUMBER_ EventProcessTestingThreshold Crossed: 1 datapoint (1431.0) was greater than or equal to the threshold (1000.0)';
    gr.insert();
}
  • Let the script above execute for 1 minute and then cancel the transaction with "cancel_mytransactions.do" 
  • Execute the script below to clean the em_event table 
// addEncodedQuery() expects only the encoded query, not the full list URL.
// This query selects the events created today.
var queryString = 'sys_created_onONToday@javascript:gs.beginningOfToday()@javascript:gs.endOfToday()';
var gr = new GlideRecord('em_event');
gr.addEncodedQuery(queryString);
gr.query();
while (gr.next()) {
    gr.deleteRecord();
}
  • Let the script above execute until the "em_event" table is clean, monitor the execution time, and update the customer accordingly

Article Information

Last Updated:2019-08-02 20:56:02
Published:2019-06-29