After a Discovery Status for a Discovery Schedule is cancelled, either manually or because the Max Runtime was exceeded, it can be a long time before the probes queued and running on the MID Server actually stop.
When a Discovery is cancelled, it is usually in the middle of running thousands of probes, and hundreds of those will be queued either in the ecc_queue or internally in the MID Server. The cancel job first has to work its way to the front of that queue, which can take a long time. The MID Server then has to start cancelling the jobs it has already taken from the ECC Queue, and end the threads for those already started.
When a Discovery Status is cancelled, all output records in the ECC Queue in the Ready state are set to Processed so that they will not be picked up by MID Servers. However, nothing is done about the jobs a MID Server has already picked up. A MID Server takes more jobs at a time than it has free threads, and queues them internally. It will continue to run the jobs it has already started, and start others it has already taken as threads become free. It only starts cancelling those once it has picked up the cancel job from the ECC Queue, and that job may be queued behind many other Ready outputs for other discovery schedules or non-discovery jobs.
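The delay can be illustrated with a minimal model (plain JavaScript, not ServiceNow code; the queue contents, function name, and priority values are hypothetical): a cancel job inserted at the same priority as the probe outputs sits behind every earlier output, while a higher priority lets it jump the queue.

```javascript
// Minimal model of a MID Server draining the ecc_queue: jobs are taken in
// priority order (lower value = picked up first), and jobs at equal priority
// keep their insertion order. The cancel job is the SystemCommand entry.
function jobsProcessedBeforeCancel(queue) {
    // queue: array of { topic, priority }
    var ordered = queue.slice().sort(function (a, b) {
        return a.priority - b.priority; // stable sort preserves FIFO within a priority
    });
    var count = 0;
    for (var i = 0; i < ordered.length; i++) {
        if (ordered[i].topic === 'SystemCommand') break; // reached the cancel job
        count++;
    }
    return count;
}
```

With 200 probe outputs already queued at Standard priority, a cancel job inserted at the same priority is only reached after all 200; at a higher (Interactive) priority it is reached first.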
In one case, a delay of 45 minutes was seen before all the MID Server cluster members picked up the cancel job, and a further 8 minutes before all running jobs were cancelled.
This will then impact any 'run after' discovery schedule, and any other jobs using the MID Server at the time, as either none or only some threads will be available for running new probes. This may in turn cause subsequent discovery schedules to cancel due to exceeding max runtime, or take a lot longer than usual.
Steps to Reproduce
- Set up a Discovery Schedule and run it. Ideally use an instance set up for Patterns rather than Probes, and include some devices, such as large Linux servers or network switches/routers, that are bound to result in some very long-running probes. Ideally include a few hundred of the probes and patterns that are known to run for very long times in the MID Server, such as:
- Pattern Launcher: IIS
- Pattern Launcher: Linux Server
- Pattern Launcher: MSSql DB On Windows Pattern
- Pattern Launcher: Tomcat
- UNIX - Classify
- Cancel the Discovery Status for that run.
- Check the ECC Queue table for the original SystemCommand outputs for cancel_discovery, and the queue.processing inputs whose payload contains the sys_ids for "Discovery cancelled". You may see that the MID Server actually cancels the 'processing' output jobs much later, and only a few sys_ids at a time.
- A thread dump, or a queue.stats input, will confirm that probes from the cancelled Discovery continue to run for a long time.
This problem is currently under review; however, a workaround for the main cause of the delay is available:
The SystemCommand output for cancel_discovery is created at the same priority as the Discovery Status as a whole, usually priority=2 (Standard). It is possible to create a business rule that increases the priority of the ECC Queue output for the cancel job. The MID Server should then pick up the job immediately, and run it in the separate Interactive thread group, which should have free threads even if all Standard threads are still busy running probes.
- Navigate to System Definition > Business Rules, and click New
- Fill in the field values like so, and Submit:
- Name: PRB1332197 cancel_discovery priority
- Table: Queue [ecc_queue]
- Active: Ticked
- When to run:
- When: Before (the default when the Advanced checkbox is left unticked, as here)
- Insert: Ticked
- Filter conditions:
- Queue IS Output
- AND Topic IS SystemCommand
- AND Source IS cancel_discovery
- AND Priority IS NOT Interactive
- Set field values:
- Priority TO Interactive
This Business Rule is included in the attached Update Set XML.
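For reference, the condition-builder rule above can also be expressed as a scripted Before-Insert rule. The sketch below is not ServiceNow code as-is: `current` is modelled as a plain object so the logic reads in isolation, and the Interactive choice value of '0' for ecc_queue.priority is an assumption that should be checked against the instance's choice list.

```javascript
// Sketch of the workaround logic. In ServiceNow this body would run in a
// Before-Insert business rule on Queue [ecc_queue], where `current` is the
// GlideRecord being inserted; here it is a plain object for readability.
function bumpCancelPriority(current) {
    var INTERACTIVE = '0'; // assumed choice value for the Interactive priority

    if (current.queue === 'output' &&
        current.topic === 'SystemCommand' &&
        current.source === 'cancel_discovery' &&
        current.priority !== INTERACTIVE) {
        // Raise the cancel job so the MID Server's interactive threads take it
        // even while all Standard threads are busy running probes.
        current.priority = INTERACTIVE;
    }
    return current;
}
```

Only cancel_discovery SystemCommand outputs are touched; every other ECC Queue record passes through unchanged, matching the filter conditions listed above.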
Related Problem: PRB1332197