Adopting complexity with contemplation
Of course, every blogger editorializing the extended October 2025 outage of AWS’s us-east-1 region will extract the lessons they want from it. Opportunities are ripe to suggest specific things that might have been done differently or that would have better embodied some distributed system design best practice.
But for me, reading that post-event summary, my first thought wasn’t to name some principle of fault tolerant distributed systems design that could have been better executed; Instead, it was to wonder whether quite that level of complexity needed to be present at all.
Two of the things we learned from that post-mortem were
- The DNS records for dynamodb, including major public-facing entrypoints to it such as
dynamodb.us-east-1.amazonaws.com, are constantly being updated using automation, and - This automation was peripheral enough to the proper operation of the system that it could be disabled worldwide for days, pending development of a primary bugfix and “additional protections to prevent the application of incorrect DNS plans.”
Complexity and reliability are not necessarily trade-offs in a zero sum game, but building a system that is complex will always be much easier than building a system that is complex and reliable. That the DNS updates system at the heart of the AWS outage could simply be disabled without major impacts to end user functionality makes me wonder if AWS could have started by embracing a simpler – less capable, but fully understood, tested, and reliable – DNS records automation system to relieve immediate pressures. Then, there probably would have been freedom to progressively add complexity to that system more slowly and deliberately than they did.
In physical world projects, the costs, safety regulations, and practical limits of construction all help to resist rushed deployments. Today’s Japanese bullet trains are the product of literally decades of careful engineering, contemplation, and iteration. But in the digital space, and even more so as we enter an era of limitless AI-generated code, none of these obstacles exist. A cloud service can just download more software into in their infrastructure, or perhaps ask an AI to pump out bespoke code, and very quickly add capability while assuming risk and sacrificing reliability in the form of ill-managed complexity.

It was a coincidence that I had just arrived in Kyoto, Japan when the AWS incident occurred. As it played out and I took the city in, I couldn’t help but feel that wisdom of Kyoto city’s planners could benefit the architects of our modern Cloud systems.
Kyoto, home to so many dozens of the country’s most historic Shinto shrines, Bhuddist temples, Zen and Japanese Gardens incorporated cohesively alongside 200mph train service and gargantuan shopping centers with merchandise as far as the eye can see high above and below ground.
This contrast exposes Kyoto’s understanding that modernity and progress have much to give, and that its comforts and conveniences should not be sacrificed at the altar of the traditional life. But at the same time, it shows that Kyotoans have not lost sight of the fact that there are still worthwhile comforts and conveniences to be had in the ancient and simple. Techniques like tending to one’s moss and training bonzai to compliment a garden’s sight lines. Walk through the whole of Kyoto and one is left reminded that there is value in the complex and wonderous just as much as there is in the tranquil and simple.
A city with many of its plots left as gardens is still a city, and until the modern complex wonders that might replace a garden are ready, there is no shame in keeping the garden.
A database with many of its maintenance tasks left unoptimized is still a database, and until the complexity that might improve that optimization has been contemplated carefully with an eye to reliability, there is no shame in keeping the more boring but more reliable.
Cloud engineers should be willing to accept added complexity into the systems they design, but should always contemplate how to add that complexity in a progressive way that encourages and enables high levels of reliability to be added along with it. Sometimes, that contemplation might lead to a feeling that the added comfort promised by the added complexity is not worth the effort. In those cases, keep the garden.