By Drew Rothstein, President
Running a validator on a cryptocurrency network is a complex operation involving skill sets ranging from deep software development and secure infrastructure expertise to honed site reliability processes. As with operating any software platform, there will be a variety of patches that need to be applied over time whether these be bug / feature patches, configuration patches or security patches.
As one of the longest-running teams that has operated large-scale validators on a variety of networks, we have developed a set of internal processes for patching and applying changes easily and safely. We would like to share what we have found most helpful across a variety of crypto projects, to make this easy and safe for the operators securing the crypto ecosystem.
Tags / Releases
Most crypto projects have a reasonably well established semantic versioning scheme. We rarely run into a project that does not. This is logical and helpful for all involved. Bonus points for those projects that use signed tags/commits for releases.
A regular release cadence solves several problems we have witnessed over time.
- It communicates and sets expectations with operators.
- It (usually) forces automation of the release process, reducing risk and reliance on single points of failure (SPOFs).
- If / when emergencies come up (security patches), it makes that entire process far less painful for everyone involved.
As an example, if at a minimum a new dev release is cut weekly and a prod release is cut monthly, expectations for operators are very clear. It also usually forces the automation necessary for CI / CD system(s), reducing reliance on any single developer for this process.
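As a rough sketch of how an operator might sanity-check these expectations, the snippet below validates tags against a hypothetical `vMAJOR.MINOR.PATCH[-rc.N]` scheme and checks that gaps between consecutive releases stay within a cadence. The tag format and the 35-day threshold are assumptions for illustration, not any specific network's policy.

```python
import re
from datetime import date

# Hypothetical tag scheme: vMAJOR.MINOR.PATCH with an optional -rc.N suffix.
SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?$")

def is_valid_tag(tag: str) -> bool:
    """True if the tag matches the assumed vMAJOR.MINOR.PATCH[-rc.N] scheme."""
    return SEMVER.match(tag) is not None

def cadence_ok(release_dates: list[date], max_gap_days: int = 35) -> bool:
    """True if no gap between consecutive releases exceeds max_gap_days."""
    ordered = sorted(release_dates)
    return all((b - a).days <= max_gap_days
               for a, b in zip(ordered, ordered[1:]))

print(is_valid_tag("v1.4.2"))       # True
print(is_valid_tag("release-1.4"))  # False
print(cadence_ok([date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 4)]))  # True
```

A check like this can run in an operator's own monitoring to flag when a network's release cadence drifts from the stated expectation.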
Great Examples
- Near: Regular rc releases for testnets; ~monthly releases for mainnet.
- Polkadot: Regular rc releases for testnets; ~monthly releases for mainnet.
Potential Improvement Examples
- <Some Projects>: No regular rc releases. Regular releases for mainnet.
- <Some Projects>: Regular rc releases. Large gaps in releases for mainnet.
Changelogs
We read changelogs. Changelogs have different audiences, and it is clear that different crypto projects communicate their changes differently. Some networks do this really well and some could use improvement!
At the highest level, we ask ourselves, “Are there any expectation changes with this release?” This comes down to a set of questions we iterate through after gaining confidence on a release:
- Are there any configuration changes now or upcoming?
- Are there any performance changes we should expect?
- Are there any API or contract-related changes that impact operations?
- Are there any new side effects?
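The questions above can be approximated as a first-pass triage filter over changelog entries. This is a hypothetical sketch; the keyword list is an assumption and is no substitute for actually reading the changes.

```python
# Hypothetical first-pass triage: flag changelog entries that likely touch
# operator-facing concerns (configuration, performance, APIs, security).
OPERATOR_KEYWORDS = ("config", "flag", "performance", "api", "rpc",
                     "migration", "breaking", "security")

def operator_relevant(entry: str) -> bool:
    """True if a changelog line mentions an operator-facing concern."""
    lower = entry.lower()
    return any(word in lower for word in OPERATOR_KEYWORDS)

changelog = [
    "Fix typo in developer docs",
    "Add new config option --max-peers (default 50)",
    "Improve block import performance by ~20%",
]
for entry in changelog:
    if operator_relevant(entry):
        print("REVIEW:", entry)
```

A filter like this only narrows the reading list; anything it flags still needs a human to answer the questions above.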
A great changelog will make it clear who the audience is for each section and clearly articulate only what is necessary without the fluff included. A giant list of 100 fixes that may be relevant to only a few developers and not to operators is not useful for most people.
Great Examples
- Celo: A link to their release process and recent audits is great. The key updates section is written in prose, in full, vs. having to read through various MRs and thousands of lines of changes.
- Nym: The one-line summary for each item with a clear MR link is fantastic. It gives you an overview without having to immediately read the code unless it is particularly relevant.
Potential Improvement Examples
- <Some Projects>: Looking at the changelog for a recent <X> SDK release, we come away with more questions than answers: Are these fixes relevant? Do they impact everyone? Are these upgrades critical? Are they functionality or security? Are the security patches critical?
- <Some Projects>: A lot of changes! A nice amount of summary detail for each item, but this will surely take a long time to review to determine what is relevant as an operator of this network.
Two projects with very different changelogs with different positives and improvement points. We typically spend a few hours reading through changelogs and the actual changes for every single release. Better changelogs can increase our confidence in understanding a particular set of changes and speed up our review process to get an update deployed.
This is hard. There is no one-size-fits-all approach but this is meant to highlight how today, all projects can be improved in this area!
Testing and Deployment
If your project does not have a functional, active, and realistic testnet, it makes patching very difficult. If there were more investment in chaos engineering as a whole (e.g. channeling your inner Charity Majors), we could forgo some of this. But at this time, across nearly every network we have seen, there is very little such investment, which makes it unclear whether this is truly a desire / strategy of crypto networks today.
We normally explore these topics:
- How did the network test these changes?
- How does the network expect we test these changes?
If the answers to these are not clear, which they typically are not (see Changelogs above!), then this adds complexity due to ambiguity for operators.
Sometimes, this is communicated via a social channel for a network but it is usually framed in the context of, “Please roll out vX.Y.Z by tomorrow at 1400.”
This is probably close to the most critical item for networks to communicate and yet it is the least formalized item we see across networks.
Crypto networks generally have a beautiful property of massive decentralization. This is motivating and drives a lot of the current development. Decentralization, though, usually means a large footprint with a variety of timezones.
There is absolutely no perfect time for a network to plan a deployment for a decentralized network. Instead, you can set the expectation (e.g. have a process) where you clearly communicate when you expect a deployment to be completed and you can make this consistent.
We will generally benefit from consistency vs. a preferred time. If we can expect that, for the <X> network, upgrades will take place at 0400 when we would like to be sleeping, we prefer to know and plan around that vs. not knowing and having to plan randomly, especially on short notice.
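A sketch of why a single, consistently communicated deadline helps: given one deadline in UTC (the date and the operator time zones below are made up for illustration), each operator can trivially convert it to local time and plan coverage around it.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical network announcement: "roll out vX.Y.Z by 1400 UTC".
deadline = datetime(2024, 6, 3, 14, 0, tzinfo=timezone.utc)

# Each operator converts the one shared deadline to their own local time.
for tz in ("America/New_York", "Europe/Berlin", "Asia/Singapore"):
    local = deadline.astimezone(ZoneInfo(tz))
    print(f"{tz}: {local:%Y-%m-%d %H:%M}")
```

The point is not the conversion itself but that a consistent, announced time lets every operator do this planning in advance instead of reacting ad hoc.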
Urgent / Private Patches
Urgent patches commonly need deployment on crypto networks. A lot of this software is newer and rapidly evolving, which is exciting! How urgent patches are deployed to protect a network is a unique and interesting topic where we commonly see room for improvement building on the previous items mentioned in this post.
If a network lacks clear tags/releases, a regular release cadence, great changelogs, and/or testing expectations, then any urgent patch application is likely going to be an all-hands-on-deck, chaotic sprint for hundreds of operators that puts the network and users at avoidable risk. All of these foundational items build upon each other and if done well make an urgent patch a complete non-issue.
You have an urgent patch upcoming or ready. You likely need to inform validators, privately, and ask them to roll it out before a public release. How in the world do you accomplish this?
This is a solvable problem and it might actually be easier than improving any of the aforementioned items!
Here are two steps for a crypto network to implement:
- Create / have a list of the Top X validators on your network with their paging and non-paging contact information.
- Update / check the list on a cadence (e.g. quarterly).
That’s it! Then, reach out to the non-paging alias for everyone on a cadence to make sure it is up to date.
| <Validator Name> | <Paging Alias> | <Non-Paging Alias> |
| --- | --- | --- |
| Validator Foo | security-paging [at] foo [dot] com | security [at] foo [dot] com |
Have this available for your developers internally in a shared location. While you could add a lot more metadata (time zones, expertise, website, etc) - none of this is relevant for the immediate purpose of communicating.
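A minimal sketch of such a list with the quarterly check from step 2 folded in. The record fields and the 90-day threshold are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical contact record: just enough to reach a validator urgently.
@dataclass
class ValidatorContact:
    name: str
    paging_alias: str
    non_paging_alias: str
    last_verified: date

def needs_reverification(c: ValidatorContact, today: date,
                         max_age_days: int = 90) -> bool:
    """True if the contact has not been verified within the last quarter."""
    return (today - c.last_verified) > timedelta(days=max_age_days)

contacts = [
    ValidatorContact("Validator Foo", "security-paging@foo.com",
                     "security@foo.com", date(2024, 1, 15)),
]
for c in contacts:
    if needs_reverification(c, today=date(2024, 6, 1)):
        print("Re-verify:", c.name)
```

Running a check like this on a schedule is one simple way to implement the "update / check the list on a cadence" step.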
Relying on a patchwork of random social platforms and private groups can be embarrassing for a lot of the larger crypto networks yet it is commonplace today. Let’s make communication the expectation / standard before agreeing to operate on a network and help folks implement it!
A lot of what has been discussed in this post is commonplace in the more standard software development and production operation context. As large scale distributed networks continue to evolve we’re eager to collaborate with protocols and other validators to streamline and harden the update process.
All of these are solvable with the appropriate priority and focus; we are not far from resolving them with a little time and energy. This will make the entire ecosystem more sustainable and stronger as more Production Engineers and Site Reliability Engineers join protocol teams and implement more robust standards.
We are currently hiring for multiple positions in various locations and remote.