Try building quality in by shifting left on collaboration. Instead of mobbing on the outage, mob during development. Diversity of thought, the needed skills, and system understanding build the quality in.
And guess what? You also reduce MTTR out of the box: more people know the details of the system and can react sooner, thanks to the situational awareness that collaboration has created.
The highest leverage for system resilience comes from the shared mental model, not from circuit breakers, bulkheads, or timeouts. The (lack of) resilience of the system reflects the (lack of) resilience of the team’s or org’s shared mental model.