Batch Retryable for Batch Jobs in Finance and Operations apps3 min read

It’s a pretty common issue when there is a connection loss happening to the database in a cloud application, failing batch jobs and forcing teams to redo the operation. But, with the 10.0.18 release, the Microsoft team has introduced a retryable flag feature to be set on every batch job to safely retry should any connection issues be experienced.

Background

Dynamics 365 for Finance and Operations workload is typically memory intensive in the SQL bound workloads making most of the transactions to fetch the data from the database and apply business process transformation before persisting it back in the database. Each unit of work is wrapped in transactions, which are called ttsbegin and ttscommit transactions.

Sometimes, it takes longer for a unit to finish an operation, which should not be a problem, unless you have got a connection loss. In case the connection is down in the middle of a transaction – the transaction will fail, rolling back and losing the work. To spot the issue, there are certain symptoms, i.e. errors, that will be seen in batch jobs execution:

  • [Microsoft][ODBC Driver 17 for SQL Server] – Unspecified error occurred on SQL Server, meaning connection may have been terminated by the server;
  • You are not able to select a record in [Name of a table], causing a temporary problem with the connection to the database.

It should be noted that Azure SQL Database offers a 99.99% availability SLA for zone redundant databases in its business-critical tier. Meaning, all customer production environments are zone redundant. Nevertheless, there is a 00.01% chance where SQL might not be available and that could be because of unplanned outages from the SQL side, which may be a failover, DNS issues, or a planned outage on the SQL side. Quite frequent and unnormal, but still can happen.

Other possible issues for SQL connection loss would be long runs on SQL which will be blocked and, eventually, maybe killed by WD/DAMS.

Batch Retryable Overview

Any batch job you run in Finance and Operations that has got a connection loss will throw an error state, messing up the process and putting you back to square one. That means Finance and Operations must be resilient to transient connection losses, automatically restarting batch jobs on connection loss, primarily to handle SQL Transient Exceptions. And that’s where the BatchRetryable feature comes into play.

BatchRetryable is introduced in 10.0.18 with a purpose to be used by the batch framework to determine if the individual batch job should be automatically retried on encountering SQL Transient Exception. Hence, BatchRetryable will continue to retry after every 5 seconds of database connection loss for a maximum of 5 minutes with Maximum retries of 5 times. 

Technically, should a batch task be marked as “Retryable”, the batch service will retry the task once it encounters an error, omitting manual process. Also, make sure to set such flag to “false” on batch tasks that should not be retried on failure.

Behavior

Again, the beauty of Retryable feature is that the batch task will not end as an “Error” and will continue to retry should any connection issues occur, auto restarting the process of the batch task from the beginning as soon as SQL connectivity is re-established.

Keep also in mind that if a custom batch process is designed to run in multi-threading, you must implement BatchRetryable interface to both the main controller and task controller, otherwise your tasks will fail during transient connectivity issues. OR, it will continue to retry unless the batch is finished. However, if a transient issue happens before tasks are created, the batch job will fail.

As the key takeaways, make sure to educate your team to uptake the feature and implement it for custom batch jobs properly: at both main controller and task controller. Otherwise, if you don’t implement BatchRetryable, by default, your custom batch jobs will not retry in case of SQL transient error.