It's true that you could probably deal with most of the downsides. With a limited rollback (maybe 10 minutes or so) and perhaps a delay before restarting, you would both reduce the odds of a repeat failure and reduce the wear on the servers from repeated restarts.
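For what it's worth, here is a rough sketch of how a limited rollback plus a restart delay might be expressed. Everything in it is hypothetical: the `RestartPolicy` name, the 5-second base delay, the 5-minute cap, and the choice to double the delay on each repeated failure are my assumptions. Only the roughly 10-minute rollback window comes from the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    """Hypothetical policy: bounded rollback plus a growing restart delay."""

    rollback_window_s: float = 600.0  # "maybe 10 minutes", taken from the post
    base_delay_s: float = 5.0         # assumed value
    max_delay_s: float = 300.0        # assumed value

    def rollback_target(self, failure_time: float) -> float:
        """Timestamp to restore state to: at most one window before the failure."""
        return max(0.0, failure_time - self.rollback_window_s)

    def restart_delay(self, consecutive_failures: int) -> float:
        """Delay before restarting; doubles per repeated failure, up to the cap."""
        n = max(consecutive_failures, 1)
        return min(self.base_delay_s * 2 ** (n - 1), self.max_delay_s)
```

The growing delay is what cuts down on rapid restart loops, and the fixed window keeps any one failure from throwing away more than about 10 minutes of progress.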
Notably, you could do a selective rollback on just the database entries that the failed server was working on, rather than bringing everything down for a global rollback and then coming back up. That said, a highly selective rollback might be challenging to implement, depending on how the backend works.