A chain halt (block production being stopped) is not a fun time for anyone involved, similar to a site outage on any Web2 product. There is a long history of chain halts across a variety of blockchains (Filecoin, Polkadot, Celo), just as there is with the products we use every day: news sites, social platforms, and even industrial platforms (electricity, networking providers, commuter trains, etc). It is a common, known risk of doing business.
At Unit 410, we have analyzed many chain halts in real time and have put together the following step-by-step recommendations for any network seeking a clear, or clearer, process for handling a chain halt if and when one occurs.
We have unfortunately seen multiple chain halt responses that skipped basic incident management practices, resulting in extended MTTR (mean time to recovery) that a few basic steps could have reduced. Halts that are handled effectively and efficiently can serve as an opportunity for sharing in-depth knowledge and can even result in a more cohesive and engaged community.
This is written from the perspective of an independent node operator on a proof-of-stake network.
We offer these ideas as a resource to any chain that would like to improve their incident management plan.
Incident Management: A Solved Problem?
A fair question to ask: Isn’t Incident Management a solved problem with known technical standards and processes that could be implemented? In short: yes and no. The challenges of decentralization make existing incident management techniques difficult or impossible to fully implement. However, adopting a subset of those processes should go a long way toward minimizing the impact of a chain halt.
Unlike the settings assumed by the most popular books on this topic, Web3 differs in several key ways: these are decentralized organizations, using decentralized tools, sometimes with unclear incentives and structures in place, and with no clear emergency communication or paging systems. Most networks also require 2/3 of their validators (by voting power) to cooperate, which further increases the complexity.
Why not fix each of these? You could, but it may not be practical in many cases, and some on-chain participants will simply rebel. Unlike a single organization with a singular goal, there are many participants with competing goals, some of whom are hard to identify. That raises another complexity: even participants who would like to assist often work to remain behind an alias in large forums, so coordinating with them is challenging. Any process should take this into account throughout.
Step 1: Assign Roles
One of the most basic initial items to accomplish, and the one that makes everything else fall into place, is to assign roles. Without leadership, everything is likely to fall apart.
- An Incident Commander.
- A Scribe.
That’s it. Start there.
An Incident Commander should be the primary point-of-contact and organizer for the incident. They are explicitly not the person looking into anything technical. They are making sure the correct people are online, trying to get people online, and coordinating various aspects of the incident response.
A Scribe should immediately create a public document shared with the necessary participants (validators, block producers, etc). This document should be their exclusive focus in capturing the various details of the incident: current status, timeline, who is doing what. This may not seem valuable at first, and time not spent directly addressing the problem at hand may feel like time wasted, but the amount of time saved by not answering the same questions or giving the same information twice will pay dividends.
In our experience, when these two roles are assigned, a chain halt response's odds of reducing its MTTR improve drastically.
Step 2: Video Communication
The Scribe should create a video call on their preferred platform (Google, Zoom, etc) and communicate the link to the necessary participants. Complex issues require high-bandwidth communication between the specific participants assigning work, debugging, and testing (not necessarily everyone on the call). A text box on Discord is generally not sufficient when the chain is halted.
The Incident Commander should generally lead the call, handling any questions or topics that come up, and tap someone to cover for them when they need to step away.
Step 3: Provide Regular Updates
If helpful, assign a Communications role at this step. Communicate the status at regular intervals to all necessary participants. Keep in mind that these participants may be spread across the globe and in different time zones.
The top of the public document should state the current status and summary, a clear exit criterion (the chain is producing blocks), and what is currently being done (TODO list and bugs filed).
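As a concrete sketch, the top of that document might look like the following (all names, heights, and times are illustrative placeholders, not a prescribed format):

```markdown
# INCIDENT: Chain halted (no blocks since height <height>)

Status: INVESTIGATING | MITIGATING | RESOLVED (as of <time> UTC)
Exit criterion: chain is producing blocks again
Incident Commander: <name / handle>
Scribe: <name / handle>
Video call: <link>

## Summary
One-paragraph description of what is known so far.

## Current work (TODO list and bugs filed)
- [ ] Reproduce the failure locally (owner: <name>)
- [ ] Patch release <version> (owner: <name>, bug: <link>)

## Timeline (UTC)
- <time> - Last block produced
- <time> - Incident Commander and Scribe assigned, call started
```

Keeping status, exit criterion, and owners at the very top means a participant joining late can orient themselves without interrupting anyone.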
Step 3a: Memes (optional)
Even the most well-handled and effective chain halt incident responses will likely have periods of time where the majority of those tuned in are simply waiting. Waiting for a new release, waiting for enough validators to come online, or simply waiting for their node to start without the --x-crisis-skip-assert-invariants flag.
Keeping those who are needed to recover from the incident engaged is crucial to actually restarting the chain. But this again highlights the importance of a shared doc for disseminating critical information. No one who gets a flood of alerts when they turn on their phone after a 12-hour overseas flight will have any idea what is going on if they have to scroll through endless messages in a Discord channel. Memes or otherwise.
Step 4: Live Handoffs
Blockchain incidents are rarely resolved quickly. Start planning ahead for who you will hand off to when you need to step away for food, sleep, etc. When the time comes, perform the handoff over the video call, and receive confirmation that they are now assuming the role you previously held.
Bonus Points: Address the Unspoken Concerns
In many cases, the number one concern of the validators taking part in restarting a halted chain is the risk of double-signing. Validators who are present and engaged with fixing the chain halt are likely not going to miss signing blocks once there are new blocks to sign. The last thing you want during a chain restart is for validators that are present and engaged to decide that they will wait for the next block before coming back online. More often than not, the actual risk of double-signing is quite low, but hearing that directly from the Incident Commander can be just the right amount of reassurance needed.
Once the incident is resolved, schedule a post mortem. A good post mortem template (one that will contain nearly everything you captured in the incident document) will assist you in that process.
Share the results of this post mortem publicly (beyond any smaller groups) once completed.
You can start small with just the four pre-planned steps above and massively improve your MTTR. From there, you can implement a more robust plan if desired. At the very least, you should have a way of getting in contact with enough validators, where "enough" means the minimum number, by combined voting power, required to restart the chain. Here is a great example from our friends at cLabs (Celo) of how you might accomplish this: Celo Validators Contact Information Refresher.
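Working out who "enough" is becomes a simple calculation once you know each validator's voting power. A minimal sketch in Python, assuming a Tendermint-style greater-than-2/3 restart threshold (the function name and the greedy largest-first selection are our own illustration, not any network's official tooling; check your chain's actual requirement):

```python
from typing import Dict, List


def minimal_restart_set(voting_power: Dict[str, int],
                        threshold: float = 2 / 3) -> List[str]:
    """Greedily pick the fewest validators (largest first) whose
    combined voting power strictly exceeds `threshold` of the total."""
    total = sum(voting_power.values())
    needed = total * threshold
    selected: List[str] = []
    accumulated = 0
    # Largest validators first: fewest parties to coordinate with.
    for name, power in sorted(voting_power.items(),
                              key=lambda kv: kv[1], reverse=True):
        if accumulated > needed:
            break
        selected.append(name)
        accumulated += power
    return selected


# Example: with powers 40/30/20/10, the two largest already exceed 2/3.
print(minimal_restart_set({"a": 40, "b": 30, "c": 20, "d": 10}))
```

Starting with the largest validators minimizes the number of parties you must reach, which matters when coordination happens over ad-hoc channels; your contact list should at least cover such a set.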
You can never really be 100% prepared for the unknowns of a chain halt, but having a plan to respond to such an incident is fairly easy. Once you feel reasonably prepared, you can perform a test drill to assess your level of preparedness.
Does your chain feel prepared? Do you have a clear process? Let us know if this is helpful or if you have your own resources that are regularly exercised should a halt occur on your network.
We are currently hiring for a Cryptocurrency Security Engineer in / around our Austin, TX location.