Description
A few issues have been found in the past where BRO is prevented by the Webhook from restoring a valid Backup because a given resource has changed in a way the Webhook doesn't allow (but should, in a Restore scenario).
All of these issues happened in the Rollback scenario, which works like this:
- Have Rancher vX installed
- Take a Backup
- Upgrade Rancher to vX+1
- Create a BRO Restore object to bring the previous versions of all Rancher-related objects back
- Roll back the Rancher deployment itself via Helm
In this scenario, we often have objects whose definitions changed from vX to vX+1, and when applying the Restore, the Webhook blocks them because they should be immutable (for instance, this issue) or are for any reason incompatible.
This is one example of an issue caused by the way the Webhook and BRO interact. Similar issues never arise between BRO and Rancher, because BRO scales down the Rancher deployment during restores. BRO needs to be a powerful Operator that can restore a cluster state to a previous point in time without interference. With this in mind, the idea is that, similarly to how we handle Rancher itself, the Webhook should not be active when BRO is running restores.
There is an established way of bypassing the Webhook with the rancher-webhook-sudo ServiceAccount, and the first idea was to have BRO impersonate it so that it is not blocked by the Webhook. However, that makes things messy in cases of Cluster Migrations: BRO is deployed without Rancher there, so the SA will not exist at Restore time.
From an ORBS POV, the ideal implementation would be to:
- Have BRO deploy 2 ServiceAccounts, the regular one and a "BRO-sudo" one.
- Extend the Webhook bypass function to also offer this bypass to the new BRO SA.
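
As a rough illustration of the second point, the bypass check could be extended along these lines. This is only a sketch, not the Webhook's actual code: the function name `canBypass`, the allowlist layout, the SA name `rancher-backup-sudo`, and both namespaces are assumptions made for the example. The only fixed part is the `system:serviceaccount:<namespace>:<name>` username format that Kubernetes uses for ServiceAccounts.

```go
package main

import "fmt"

// Usernames follow the Kubernetes ServiceAccount format
// system:serviceaccount:<namespace>:<name>.
// The namespaces and the BRO SA name below are assumptions for illustration.
const (
	webhookSudoUser = "system:serviceaccount:cattle-system:rancher-webhook-sudo"
	broSudoUser     = "system:serviceaccount:cattle-resources-system:rancher-backup-sudo"
)

// bypassUsers is a hypothetical allowlist of identities that skip validation.
var bypassUsers = map[string]struct{}{
	webhookSudoUser: {},
	broSudoUser:     {},
}

// canBypass reports whether the requesting user is on the allowlist.
// In a real admission handler the username would come from the
// AdmissionRequest's UserInfo.
func canBypass(username string) bool {
	_, ok := bypassUsers[username]
	return ok
}

func main() {
	users := []string{
		webhookSudoUser,
		broSudoUser,
		"system:serviceaccount:cattle-resources-system:rancher-backup",
	}
	for _, u := range users {
		fmt.Printf("%s bypass=%t\n", u, canBypass(u))
	}
}
```

With this shape, the regular BRO SA stays subject to the Webhook, and only Restore operations performed as the sudo SA would skip it. It also avoids the Cluster Migration problem, since BRO would ship the SA itself.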