The archiver is triggered hourly (by default) by a scheduled job and only has protections to prevent a parallel archiver thread on the same node. If the archiver job is picked up by a node that is not currently running the archiver, it will run, despite a running job on another node.
Evidence that this is likely occurring is failed inserts to the archive table (ar_) that look like this:
glide.db.archiver ... FAILED TRYING TO EXECUTE ON CONNECTION 6: INSERT INTO ar_..
Also, the Archive Runs module will show multiple jobs with State="Running", but there will still only be one record in archive status (being updating by all parallel threads). You can also check the threads.do page for the different primary nodes to see if the glide.db.archiver thread is running on more than one node.
Steps to Reproduce
- Log in to an instance with more than one node.
- Navigate to System Scheduler > Today's Scheduled Jobs and search for the Archive job.
- Click Execute Now a couple of times
Note that you must have a few hundred things that need to be archived so that the job runs long enough to see it running twice.
- Either check the various threads.do pages for the thread being run in parallel or check the Archive Status module.
Note that the second run executes in parellel. It should not start while the first run is still going.
The Archive scheduled job is set up as a one-hour repeat interval job. It calls a class that hands the thread off to a background thread called glide.db.archiver. If one archiver thread runs longer than one hour, then the scheduled job fires again and starts the thread on another node. This results in the same queries being executed from two nodes and therefore can cause unique key violations or database lock contention. The jobs then begin to conflict with each other. This causes telescoping latency and eventually archiving falls behind, affecting overall performance. In one case that was discovered, the archive job queries were accounting for 70% of total database runtime.
There are four approaches to working around this problem.
One main point of contention is a query on the database similar to the following example:
[0:00:03.944] Script completed: SQL
SELECT * FROM sys_update_log WHERE set_id = 'fdc464cb691fe940c39a7ac8cfd8de9f'
This query often causes replication to fall behind. To fix this query, add an index to the column set_id to improve the WHERE clause.
If immediate relief must be provided:
Pin the Archive job to a specific node to ensure that two archivers are never running concurrently.
Restart the nodes so that current duplicate executing job threads exit out safely.
Warning: This solution will break archiving if you are AHA transferred/failed over or if the node is removed from the pool.
This solution is a little trickier, but does not cause archiving to break during AHA transfer/fail over (as workaround #2 does). Consider implementing workaround #2 for immediate relief and then using workaround #3 after you have time to get it right.
There are essentially four parameters you have to tune:
- Number of batches (glide.db.archive.max_iterations)
- Number of records in a batch (glide.db.archive.batch_size)
- Sleep time between batches (glide.db.archive.sleep_time)
- Interval on Archive scheduled job (default = 1hr)
The strategy of this workaround is to avoid ever having two archive jobs running at the same time by tuning these parameters so that one full archive iteration (for example, all active archive rules) during the busiest period of the day for archiving finishes just under the amount of time on the archive scheduled job interval.
For example, if you have a one-hour job interval, you want the archive iteration to complete in approximately 50-55 minutes.
Suppose that during archive runs you see that the combined active archive rules finish between two and three hours. You either need to space out the scheduled job to running every three hours or reduce the batch size and max iterations such that the rules complete within an hour.
This final workaround is an alternative to Workaround #2 and should be used in conjunction with Workarounds #1 and #3
This workaround is achieved by altering the out-of-box "Archive" scheduled job so that it checks for a running archive process before it kicks off a new one. The following script should replace the Script field of the existing "Archive" scheduled job (sys_trigger table). The script is designed to be resilient to AHA transfers by clearing out the status if it does not change for 3 hours - an indication that archiving is not really happening but the status value is left over from an archive thread that was running prior to a migration:
Related Problem: PRB608048