Wednesday, February 14, 2024

Sitecore ADM: Resolving Stalled Tasks and Restoring Task Processing

My team is currently purging millions of historical anonymous xDB contact records and their associated data using the ADM module, for a client whose xDB shard databases have been approaching the maximum storage capacity of their Azure tier. Because xDB is crucial to the client site's operations, and because of complex custom external integrations with xDB, our options for reducing the DB size have been somewhat limited.

In our approach, we opted to use ADM to purge historical anonymous contacts in batches. For each batch, we prepare ~300k contact records per shard, which we retrieve manually via SQL query. Once we've created a temporary table in the shard DB, we prepare the data by generating a comma-delimited list of contacts and then kick off the purge process via ADM.
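For illustration, the batching and comma-delimited list preparation could be sketched in Python as follows. This is a minimal sketch, not the exact tooling we used: the function names `batched` and `to_comma_list` are hypothetical, and it assumes the contact IDs have already been exported from the shard DB as plain strings.

```python
from typing import Iterable, Iterator, List


def batched(contact_ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` contact IDs."""
    batch: List[str] = []
    for contact_id in contact_ids:
        batch.append(contact_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


def to_comma_list(contact_ids: Iterable[str]) -> str:
    """Join IDs into one comma-delimited string, skipping blank entries."""
    cleaned = (cid.strip() for cid in contact_ids)
    return ",".join(cid for cid in cleaned if cid)


if __name__ == "__main__":
    # Placeholder IDs; real input would be the exported contact IDs.
    ids = ["id-1", "id-2", "id-3", "id-4", "id-5"]
    for batch in batched(ids, size=2):
        print(to_comma_list(batch))
```

In practice the batch size would be on the order of the ~300k contacts per shard mentioned above rather than 2.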

Once ADM populates its Tasks table, it processes each queued record in turn and removes the record from the Tasks table when processing completes. ADM task execution is generally slow (roughly one contact every 2-3 seconds), so we closely monitor the progress with a SQL query:

With this approach (in addition to SHRINK and REINDEX operations between batches), we have seen the necessary disk size reduction of both xDB shard DBs after running a cadence of several batches.  
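To put the processing rate in perspective, here is a rough back-of-the-envelope estimate in Python. It assumes records are processed strictly one at a time at a steady 2-3 seconds each, which is a simplification; the function name `estimate_remaining` is just for illustration.

```python
from datetime import timedelta
from typing import Tuple


def estimate_remaining(
    remaining_tasks: int,
    secs_fast: float = 2.0,
    secs_slow: float = 3.0,
) -> Tuple[timedelta, timedelta]:
    """Return (best case, worst case) time to drain the remaining tasks."""
    best = timedelta(seconds=remaining_tasks * secs_fast)
    worst = timedelta(seconds=remaining_tasks * secs_slow)
    return best, worst


if __name__ == "__main__":
    best, worst = estimate_remaining(300_000)
    print(f"best case:  {best}")   # 6 days, 22:40:00
    print(f"worst case: {worst}")  # 10 days, 10:00:00
```

Under those assumptions, a single 300k-record batch takes roughly 7 to 10 days per shard, which is why a batch can easily still be mid-process over a weekend.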

However, we ran into a snag in a recent batch that brought ADM task processing to a complete halt. The issue appeared to correlate directly with general Azure Maintenance operations, which had occurred over the weekend while the batch was mid-process. Azure Maintenance updates typically happen without any advance notice or warning. Usually they have minimal adverse effects, but this round seemed to have caused much of the infrastructure to spiral. Once things had settled, we observed that the ADM tasks were no longer processing.

Attempts to re-start the job via ADM kept resulting in the same error:

"[ADM] Response from xConnect did not indicate success. Status code: BadRequest, Message: {\"Message\":\"The remove task can't be started while another one is running.\"}"

Upon initial analysis, we noted that the ADM tasks table was still populated with IDs that had yet to be processed when the operation was cut off. I began dissecting the ADM binary files for clues - specifically in search of the message "The remove task can't be started while another one is running".  

I learned that the StartContactsDataRemoving method queries an IsRunning method to determine whether any other tasks are in progress. If there are, it returns a BadRequest response with the message "The remove task can't be started while another one is running."
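The effect of a guard like this after an interrupted run can be pictured with a small Python model. This is only a conceptual sketch and not ADM's actual code: the class `RemovalGuardModel`, its methods, and the idea that "running" is inferred purely from persisted state are all assumptions made for illustration.

```python
class TaskAlreadyRunningError(Exception):
    """Stands in for the BadRequest response in this simplified model."""


class RemovalGuardModel:
    """Toy model: a task counts as 'running' while its state is persisted."""

    def __init__(self) -> None:
        # Stands in for durable storage that survives a process restart.
        self.persisted_state = None

    def is_running(self) -> bool:
        return self.persisted_state is not None

    def start(self, settings: dict) -> None:
        if self.is_running():
            raise TaskAlreadyRunningError(
                "The remove task can't be started while another one is running."
            )
        self.persisted_state = settings

    def finish(self) -> None:
        # A task that completes normally clears its own state.
        self.persisted_state = None


if __name__ == "__main__":
    guard = RemovalGuardModel()
    guard.start({"batch": 1})
    # Simulate an interruption: finish() is never called, so the state lingers.
    try:
        guard.start({"batch": 2})
    except TaskAlreadyRunningError as err:
        print(f"blocked: {err}")
```

In a model like this, a task that is cut off before it can clean up leaves the guard permanently tripped until the leftover state is cleared by other means.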

Digging deeper led me to the ClearRemoveDataSettings method, which is called in the StopRunningTasksAndClearStorage method. Deeper still, there are references to a PropertiesRepository class and an object named "RemoveDataSettings" that is used to store task information:

This, in turn, finally led me to a PropertyValueQuery method in a PropertiesRepositoryQueries class, which contained a SQL command used as part of the process:

We reviewed the current state of the ADM Properties table within the ADM DB and found three entries, including RemoveDataSettings:

The RemoveDataSettings record's value appeared to be a JSON representation of ADM's last removal task run. However, the JSON was cut off after a few hundred characters. With the value left in this state, ADM was convinced that the task had not completed.
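A quick way to confirm that a stored value like this is incomplete is to try parsing it. The Python sketch below uses only the standard `json` module; the sample payload and its field names are made up for illustration and do not reflect ADM's real settings format.

```python
import json


def is_complete_json(value: str) -> bool:
    """Return True if `value` parses as a complete JSON document."""
    try:
        json.loads(value)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical payload, not ADM's actual schema.
    complete = json.dumps({"status": "running", "ids": list(range(5))})
    truncated = complete[:20]  # simulate a value cut off mid-document

    print(is_complete_json(complete))
    print(is_complete_json(truncated))
```

A value that fails to parse this way is a strong hint that the record was only partially written or was truncated at some point.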

Following the approach used in the code (mimicking what should occur when an ADM removal task is completed), we ran the following command:

We also entirely cleared the remaining IDs and Tasks table and re-initialized the process.  With these steps, our ADM tasks were back to processing as expected.

I hope this one helps anyone in a similar situation!