Bittensor Protocol Monitoring
By Ryan Hendricks, Cryptocurrency Engineer
Background
Bittensor is a novel protocol that decentralizes the training and inference of machine learning models. To support early participation, we’ve been actively developing, scaling and securing a subset of infrastructure for the network. Networks evolve most quickly in their early stages and being a good participant requires flexible observability focused on network health. This post discusses how we’ve architected, deployed and evolved our monitoring and is intended to help operators think about improving the signal of their own monitoring.
The Bittensor network is divided into subnetworks (subnets), each supporting a specific machine learning model. Within a subnet there are two main participants, miners and validators, each with a unique UID. Together they form the metagraph for that subnet.
Under the hood, the network uses Substrate for the consensus layer, which allows the state of the network to be decentralized. It also allows the use of generic Substrate tooling such as polkadot-js for managing accounts and Substrate API Sidecar for querying data from the network. We’ve written previously about Substrate extrinsics, but let’s dive a bit deeper into how one can better understand the network from an observability perspective.
Initial Setup
A prerequisite to monitoring infrastructure is to cleanly define it. We prefer reproducible and containerized infrastructure, so this tutorial assumes you have docker and a recent version of golang installed.
Running Substrate API Sidecar
Parity provides a well-maintained sidecar that can be used to easily extract structured data from a local node, and we make extensive use of it. You may run your own sidecar instance by creating a docker-compose.yml similar to the example below and starting the container with docker-compose up.
version: "3.8"
services:
  sidecar-subtensor:
    container_name: sidecar-subtensor
    image: parity/substrate-api-sidecar:latest
    ports:
      - "8080:8080"
    environment:
      SAS_SUBSTRATE_URL: wss://entrypoint-finney.opentensor.ai:443
Once it is up and running, you may start making queries to your node on port 8080. In this post we assume you’re running locally. If querying a remote node, replace localhost with your node’s IP address. A first query then looks like:
curl localhost:8080/node/version
{
"clientVersion": "4.0.0-dev-c88a37247b9",
"clientImplName": "node-subtensor",
"chain": "Bittensor"
}
You may leave off the endpoint and just query localhost:8080, which will return the full list of available endpoints. The same list is also available as a Swagger file.
Key Network Telemetry
With your sidecar container running, we can now begin to codify an example monitor to query key telemetry. The following Go code extracts parameters that you should be aware of:
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	PrintNodeNetwork()
	PrintNodeVersion()
	PrintRuntimeSpec()
}

// check panics on any error; acceptable for a quick exploration script.
func check(err error) {
	if err != nil {
		panic(err)
	}
}

// get issues a GET request and returns the response body as a string.
func get(url string) string {
	res, err := http.Get(url)
	check(err)
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	check(err)
	return string(body)
}

func PrintNodeVersion() {
	fmt.Println(get("http://localhost:8080/node/version"))
}

func PrintRuntimeSpec() {
	fmt.Println(get("http://localhost:8080/runtime/spec"))
}

func PrintNodeNetwork() {
	fmt.Println(get("http://localhost:8080/node/network"))
}
The output should look something like the following:
Node Network
{
"nodeRoles": [
{
"full": null
}
],
"numPeers": "387",
"isSyncing": false,
"shouldHavePeers": true,
"localPeerId": "12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
"localListenAddresses": [
"/ip4/127.0.0.1/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
"/ip4/10.10.0.85/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
"/ip4/10.116.0.95/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
"/ip6/::1/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ"
],
"peersInfo": "Cannot query system_peers from node."
}
Node Version
{
"clientVersion": "4.0.0-dev-c88a37247b9",
"clientImplName": "node-subtensor",
"chain": "Bittensor"
}
Runtime Spec
{
"at": {
"height": "116115",
"hash": "0x312a434d3074d1693e4d60ff0d9325b2f17b55bf83105d110b53b150bc608647"
},
"authoringVersion": "1",
"transactionVersion": "1",
"implVersion": "1",
"specName": "node-subtensor",
"specVersion": "116",
"chainType": {
"live": null
},
"properties": {
"ss58Format": "42",
"tokenDecimals": [
"9"
],
"tokenSymbol": [
"TAO"
]
}
}
Each of these endpoints includes key parameters to monitor, notably:
- Node Network
  - numPeers → the number of peers that the node is currently connected to. This should remain above 150.
  - isSyncing → whether or not the node is up to date with other peers. This should remain false; otherwise you are making queries against outdated state.
- Node Version
  - clientVersion → useful for ensuring the node is on the correct version or, if using multiple nodes, that they are all on the same version.
- Runtime Spec
  - specVersion → indicates the current runtime version. A change to the runtime can include changes to the encoding and decoding of extrinsics (transactions), which would in turn require changes to your signers. Monitoring for specVersion changes is an important way to prevent your tooling from falling behind.
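The checks described in this list can be sketched as a small pure function. This is a minimal illustration, not the monitor we run: the NodeStatus struct, the CheckNode name, and the expectedSpec parameter are assumptions made for the example, and the peer threshold is passed in rather than fixed.

```go
package main

import (
	"fmt"
	"strconv"
)

// NodeStatus holds the subset of sidecar fields discussed above.
// The struct itself is illustrative and not part of the sidecar API.
type NodeStatus struct {
	NumPeers    string // decimal string, as returned by /node/network
	IsSyncing   bool   // as returned by /node/network
	SpecVersion string // as returned by /runtime/spec
}

// CheckNode returns one warning per violated expectation.
// minPeers and expectedSpec are operator-chosen values (assumptions).
func CheckNode(s NodeStatus, minPeers int, expectedSpec string) []string {
	var warnings []string
	peers, err := strconv.Atoi(s.NumPeers)
	if err != nil {
		warnings = append(warnings, fmt.Sprintf("numPeers %q is not a number", s.NumPeers))
	} else if peers <= minPeers {
		warnings = append(warnings, fmt.Sprintf("numPeers %d is not above %d", peers, minPeers))
	}
	if s.IsSyncing {
		warnings = append(warnings, "node is syncing; queries may return outdated state")
	}
	if s.SpecVersion != expectedSpec {
		warnings = append(warnings, fmt.Sprintf("specVersion is %s, expected %s", s.SpecVersion, expectedSpec))
	}
	return warnings
}

func main() {
	// Values taken from the sample output above: no warnings expected.
	healthy := NodeStatus{NumPeers: "387", IsSyncing: false, SpecVersion: "116"}
	fmt.Println(len(CheckNode(healthy, 150, "116")))

	// A made-up degraded node that trips all three checks.
	degraded := NodeStatus{NumPeers: "12", IsSyncing: true, SpecVersion: "117"}
	for _, w := range CheckNode(degraded, 150, "116") {
		fmt.Println(w)
	}
}
```

Keeping the checks free of network calls makes them easy to unit test; the caller decides how the warnings are delivered.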
Add Type Definitions
Next, we’ll use mholt.github.io/json-to-go to turn these responses into type definitions, which gives us:
type NodeNetworkResponse struct {
NodeRoles []struct {
Full interface{} `json:"full"`
} `json:"nodeRoles"`
NumPeers string `json:"numPeers"`
IsSyncing bool `json:"isSyncing"`
ShouldHavePeers bool `json:"shouldHavePeers"`
LocalPeerID string `json:"localPeerId"`
LocalListenAddresses []string `json:"localListenAddresses"`
PeersInfo string `json:"peersInfo"`
}
type NodeVersionResponse struct {
ClientVersion string `json:"clientVersion"`
ClientImplName string `json:"clientImplName"`
Chain string `json:"chain"`
}
type RuntimeSpecResponse struct {
At struct {
Height string `json:"height"`
Hash string `json:"hash"`
} `json:"at"`
AuthoringVersion string `json:"authoringVersion"`
TransactionVersion string `json:"transactionVersion"`
ImplVersion string `json:"implVersion"`
SpecName string `json:"specName"`
SpecVersion string `json:"specVersion"`
ChainType struct {
Live interface{} `json:"live"`
} `json:"chainType"`
Properties struct {
Ss58Format string `json:"ss58Format"`
TokenDecimals []string `json:"tokenDecimals"`
TokenSymbol []string `json:"tokenSymbol"`
} `json:"properties"`
}
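To show how these definitions are used, here is a minimal sketch that decodes the /node/version sample from earlier into NodeVersionResponse with encoding/json. The ParseNodeVersion helper is a name introduced for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeVersionResponse is copied from the type definitions above.
type NodeVersionResponse struct {
	ClientVersion  string `json:"clientVersion"`
	ClientImplName string `json:"clientImplName"`
	Chain          string `json:"chain"`
}

// ParseNodeVersion decodes a raw /node/version body into the typed struct.
// (Hypothetical helper, introduced for illustration.)
func ParseNodeVersion(body []byte) (NodeVersionResponse, error) {
	var v NodeVersionResponse
	err := json.Unmarshal(body, &v)
	return v, err
}

func main() {
	// The sample response shown earlier in this post.
	sample := []byte(`{"clientVersion":"4.0.0-dev-c88a37247b9","clientImplName":"node-subtensor","chain":"Bittensor"}`)
	v, err := ParseNodeVersion(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(v.ClientImplName, v.ClientVersion, v.Chain)
}
```

Note that the sidecar returns most numbers as JSON strings, which is why fields such as NumPeers and SpecVersion are typed as string and converted separately.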
Hotkey Balance & Owner Info
Now that we’ve explored key network parameters, we’ll also want to monitor the applicable addresses, their balances and any operations they’re signing. This will help us confirm signed operations match our expectations, loudly communicate to our team when keys are being accessed and provide the basis for downstream services like balance tracking to reconcile events.
We can start by querying the balance for an address with:
func GetBalanceInfoForAddress(address string) {
url := fmt.Sprintf("http://localhost:8080/accounts/%s/balance-info", address)
resp := new(types.BalanceInfoResponse)
err := GetRequest(url, &resp)
check(err)
logger.PrettyPrint(resp)
}
Which can be structured with this type definition:
type BalanceInfoResponse struct {
At struct {
Hash string `json:"hash"`
Height string `json:"height"`
} `json:"at"`
Nonce string `json:"nonce"`
TokenSymbol string `json:"tokenSymbol"`
Free string `json:"free"`
Reserved string `json:"reserved"`
MiscFrozen string `json:"miscFrozen"`
FeeFrozen string `json:"feeFrozen"`
Locks []interface{} `json:"locks"`
}
Putting this into practice, we can pick a random hotkey from TaoStats and display its balance with GetBalanceInfoForAddress("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"):
{
"at": {
"hash": "0xe65bb31d67465db17067ffb19f32be06e8b993a969ef4d529deeb5d5d19cd522",
"height": "116263"
},
"nonce": "444",
"tokenSymbol": "TAO",
"free": "860000",
"reserved": "0",
"miscFrozen": "0",
"feeFrozen": "0",
"locks": []
}
From our previous query we know that RuntimeSpec.properties.tokenDecimals reports 9 decimals for TAO, meaning that the actual balance for this address is:
860000 / 10^9 = 0.00086 TAO
Why not just hardcode the 9 and save ourselves the extra step? There is precedent for chains redenominating, so it is generally wise to avoid hardcoding here.
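For display, the conversion can also be done without floating point, so that large balances keep every digit. The sketch below is one way to do it with math/big and string slicing; FormatUnits is a name chosen for this example and is not part of the sidecar or the code in this post.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// FormatUnits converts a non-negative base-unit integer string into a decimal
// string with `decimals` fractional digits, trimming trailing zeros.
// (Hypothetical helper, introduced for illustration.)
func FormatUnits(raw string, decimals int) (string, error) {
	if decimals < 0 {
		return "", fmt.Errorf("negative decimals %d", decimals)
	}
	n, ok := new(big.Int).SetString(raw, 10)
	if !ok || n.Sign() < 0 {
		return "", fmt.Errorf("invalid balance %q", raw)
	}
	digits := n.String()
	// Left-pad so at least one digit remains before the decimal point.
	if len(digits) <= decimals {
		digits = strings.Repeat("0", decimals-len(digits)+1) + digits
	}
	whole := digits[:len(digits)-decimals]
	frac := strings.TrimRight(digits[len(digits)-decimals:], "0")
	if frac == "" {
		return whole, nil
	}
	return whole + "." + frac, nil
}

func main() {
	// "free" from the balance-info response above, with 9 decimals for TAO.
	s, err := FormatUnits("860000", 9)
	if err != nil {
		fmt.Println("format failed:", err)
		return
	}
	fmt.Println(s, "TAO") // prints "0.00086 TAO"
}
```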
Determine the Coldkey
Bittensor is designed with distinct “hot” and “cold” keys, with a one-to-many mapping from coldkeys to hotkeys. A coldkey is meant to protect funds and cannot be used to sign the “immediate” operations needed to participate in validation. Coldkeys are visible on TaoStats for each hotkey, and we can also query the corresponding coldkey from the subtensorModule pallet via our API sidecar instance. To determine the coldkey for a given hotkey, we query the ‘Owner’ pallet storage item. This is done by adding a client function, adding an associated type definition, and updating our main.go.
// client.go
package client
func GetHotKeyOwner(address string) {
url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/Owner?keys[]=%s", address)
resp := new(types.StorageResponse)
err := GetRequest(url, &resp)
check(err)
logger.PrettyPrint(resp)
}
// responses.go
type StorageResponse struct {
At struct {
Hash string `json:"hash"`
Height string `json:"height"`
} `json:"at"`
Pallet string `json:"pallet"`
PalletIndex string `json:"palletIndex"`
StorageItem string `json:"storageItem"`
Keys []string `json:"keys"`
Value string `json:"value"`
}
// main.go
func main() {
client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
}
Putting this all together, our query returns:
$ go run main.go
{
"at": {
"hash": "0x5d4d987ca3ad7df15ee24046eae146dbfd073e53efcf0e605de865d9b48ac020",
"height": "116329"
},
"pallet": "subtensorModule",
"palletIndex": "8",
"storageItem": "owner",
"keys": [
"5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"
],
"value": "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6"
}
Cleaning Up
We were able to query the coldkey for a given hotkey, and now we’d like to take that address from the storage response and look up its balance. To do so, we rework our client functions to return their responses to the caller, leaving us with the following client.go:
package client
import (
"encoding/json"
"fmt"
"io"
"net/http"
"bittensor-monitor/pkg/types"
)
// getRequest issues a GET and decodes the JSON body into resp (a pointer).
func getRequest(url string, resp interface{}) error {
	res, resErr := http.Get(url)
	if resErr != nil {
		return resErr
	}
	// Close the body on every path, including non-200 responses.
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", res.StatusCode, url)
	}
	body, readErr := io.ReadAll(res.Body)
	if readErr != nil {
		return readErr
	}
	return json.Unmarshal(body, resp)
}
func GetNodeVersion() (*types.NodeVersionResponse, error) {
url := "http://localhost:8080/node/version"
resp := new(types.NodeVersionResponse)
err := getRequest(url, &resp)
return resp, err
}
func GetRuntimeSpec() (*types.RuntimeSpecResponse, error) {
url := "http://localhost:8080/runtime/spec"
resp := new(types.RuntimeSpecResponse)
err := getRequest(url, &resp)
return resp, err
}
func GetNodeNetwork() (*types.NodeNetworkResponse, error) {
url := "http://localhost:8080/node/network"
resp := new(types.NodeNetworkResponse)
err := getRequest(url, &resp)
return resp, err
}
func GetBalanceInfoForAddress(address string) (*types.BalanceInfoResponse, error) {
url := fmt.Sprintf("http://localhost:8080/accounts/%s/balance-info", address)
resp := new(types.BalanceInfoResponse)
err := getRequest(url, &resp)
return resp, err
}
func GetHotKeyOwner(address string) (*types.StorageResponse, error) {
url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/Owner?keys[]=%s", address)
resp := new(types.StorageResponse)
err := getRequest(url, &resp)
return resp, err
}
Now we can use the response data from our queries in subsequent queries. For example, we can take the address returned from our GetHotKeyOwner() call and use it in GetBalanceInfoForAddress().
package main
import (
"bittensor-monitor/pkg/client"
"bittensor-monitor/pkg/logger"
)
func main() {
resp, err := client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
if err != nil {
panic(err) // TODO: don't panic here
}
coldkey := resp.Value
balanceResp, balErr := client.GetBalanceInfoForAddress(coldkey)
if balErr != nil {
panic(balErr)
}
logger.PrettyPrint(balanceResp)
}
Taking this even further, since we know how to query the number of decimals for a given token, we can add helper functions to convert the returned value into a human-readable balance. We add the following to our responses.go: the first returns the decimals for a given token symbol, and the second converts the string in the balance info response to a number.
func (r RuntimeSpecResponse) DecimalsForSymbol(token string) int64 {
decimalString := ""
for n := range r.Properties.TokenSymbol {
sym := r.Properties.TokenSymbol[n]
if sym == token {
decimalString = r.Properties.TokenDecimals[n]
}
}
if decimalString == "" {
panic("decimals not found for given token symbol; consider returning an error instead of a panic here.")
}
num, err := strconv.ParseInt(decimalString, 10, 0)
if err != nil {
panic(err) // TODO: don't panic here.
}
return num
}
func (b BalanceInfoResponse) BalanceFree() float64 {
num, err := strconv.ParseFloat(b.Free, 64)
if err != nil {
panic(err) // TODO: don't panic here.
}
return num
}
Using these helper functions, we can tweak our main.go like so:
package main
import (
"fmt"
"math"
"bittensor-monitor/pkg/client"
)
func main() {
resp, err := client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
if err != nil {
panic(err) // TODO: don't panic here
}
coldkey := resp.Value
balanceResp, balErr := client.GetBalanceInfoForAddress(coldkey)
if balErr != nil {
panic(balErr)
}
spec, specErr := client.GetRuntimeSpec()
if specErr != nil {
panic(specErr)
}
decimals := spec.DecimalsForSymbol(balanceResp.TokenSymbol)
humanReadableBalance := balanceResp.BalanceFree() / (math.Pow10(int(decimals)))
fmt.Printf("%v %v \n", humanReadableBalance, balanceResp.TokenSymbol)
}
This returns our balance:
$ go run main.go
0.500001 TAO
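The helper functions above panic on malformed input, as their TODO comments acknowledge. A long-running monitor usually wants an error it can log and retry on instead. Below is a minimal sketch of an error-returning lookup; the RuntimeProperties struct is a trimmed stand-in for RuntimeSpecResponse.Properties, introduced only so the example is self-contained.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuntimeProperties mirrors the "properties" object from /runtime/spec.
// (Trimmed stand-in for RuntimeSpecResponse.Properties, for illustration.)
type RuntimeProperties struct {
	TokenDecimals []string
	TokenSymbol   []string
}

// DecimalsForSymbol looks up the decimals for a token symbol and reports
// problems as errors rather than panicking.
func DecimalsForSymbol(p RuntimeProperties, token string) (int, error) {
	for i, sym := range p.TokenSymbol {
		if sym != token {
			continue
		}
		// Guard against the two slices having different lengths.
		if i >= len(p.TokenDecimals) {
			return 0, fmt.Errorf("no decimals entry for symbol %q", token)
		}
		n, err := strconv.Atoi(p.TokenDecimals[i])
		if err != nil {
			return 0, fmt.Errorf("parsing decimals for %q: %w", token, err)
		}
		return n, nil
	}
	return 0, fmt.Errorf("symbol %q not found", token)
}

func main() {
	// Values from the /runtime/spec sample earlier in this post.
	props := RuntimeProperties{TokenDecimals: []string{"9"}, TokenSymbol: []string{"TAO"}}

	d, err := DecimalsForSymbol(props, "TAO")
	fmt.Println(d, err)

	// An unknown symbol yields an error the caller can handle.
	_, err = DecimalsForSymbol(props, "DOT")
	fmt.Println(err)
}
```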
Querying Storage
Let’s now take a moment to better understand what is happening here. You might be wondering whether we could query information from storage directly from the node. The long answer is absolutely, and I encourage anyone interested in learning more about how this might be accomplished to check out the write-up on Shawn Tabrizi’s blog: Querying Substrate Storage via RPC. The short answer is that querying Substrate storage directly is a rather involved and time-consuming task.
Working through an early network launch requires speed and flexibility. Using the Substrate API Sidecar (built by Parity!) is one way to move faster and more safely while keeping up with a novel network.
But what else might we want to monitor from pallet storage? We can see the list of items within the pallet storage via:
http://localhost:8080/pallets/subtensorModule/storage?onlyIds=true
For full details on the storage items available in a given Substrate pallet, you can query:
http://localhost:8080/pallets/$PALLET_NAME/storage/
This will also tell you what parameters are required for querying each storage item.
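Since every storage query follows the same URL pattern, it can help to build the URL in one place. The sketch below assumes the keys[]= query form used in the client functions in this post; StorageURL is a helper name introduced for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// StorageURL builds a sidecar pallet-storage URL for the given item and keys.
// (Hypothetical helper; the keys[]= form matches the queries used in this post.)
func StorageURL(base, pallet, item string, keys ...string) string {
	u := fmt.Sprintf("%s/pallets/%s/storage/%s",
		strings.TrimRight(base, "/"), url.PathEscape(pallet), url.PathEscape(item))
	if len(keys) == 0 {
		return u
	}
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		// Escape each key value so unusual characters cannot break the query.
		parts = append(parts, "keys[]="+url.QueryEscape(k))
	}
	return u + "?" + strings.Join(parts, "&")
}

func main() {
	fmt.Println(StorageURL("http://localhost:8080", "subtensorModule", "Owner",
		"5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"))
}
```

Taking the base URL as a parameter also removes the hardcoded localhost:8080 when the sidecar runs on another host.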
UID Metagraph Data
Let’s combine everything we’ve explored thus far to recreate the Metagraph entry for a given hotkey. We can use the following type definition for this entry, but feel free to add additional items as you see fit. It represents all the datapoints we’d like to query from the chain and will provide the foundation for our monitor.
type NeuronInfo struct {
UID int
HotKey string
ColdKey string
Stake float64
Rank float64
VTrust float64
Trust float64
Consensus float64
Incentive float64
Dividends float64
Emission float64
Updated float64
VPermit bool
Active bool
Height float64
}
Preparation
Additional metagraph items can be queried like so:
func GetStorageItemSingleKeySingleValue(item string, key int) (*types.StorageResponseSingleValue, error) {
url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
resp := new(types.StorageResponseSingleValue)
err := getRequest(url, &resp)
return resp, err
}
func GetStorageItemSingleKeyMultiValue(item string, key int) (*types.StorageResponseMultiValue, error) {
url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
resp := new(types.StorageResponseMultiValue)
err := getRequest(url, &resp)
return resp, err
}
func GetStorageItemSingleKeyMultiBoolValue(item string, key int) (*types.StorageResponseMultiBoolValue, error) {
url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
resp := new(types.StorageResponseMultiBoolValue)
err := getRequest(url, &resp)
return resp, err
}
And parsed with the following type definitions:
type StorageResponseSingleValue struct {
At struct {
Hash string `json:"hash"`
Height string `json:"height"`
} `json:"at"`
Pallet string `json:"pallet"`
PalletIndex string `json:"palletIndex"`
StorageItem string `json:"storageItem"`
Value string `json:"value"`
}
type StorageResponseMultiValue struct {
At struct {
Hash string `json:"hash"`
Height string `json:"height"`
} `json:"at"`
Pallet string `json:"pallet"`
PalletIndex string `json:"palletIndex"`
StorageItem string `json:"storageItem"`
Value []string `json:"value"`
}
type StorageResponseMultiBoolValue struct {
At struct {
Hash string `json:"hash"`
Height string `json:"height"`
} `json:"at"`
Pallet string `json:"pallet"`
PalletIndex string `json:"palletIndex"`
StorageItem string `json:"storageItem"`
Value []bool `json:"value"`
}
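The three response types above differ only in the type of Value. As an alternative sketch, assuming Go 1.18 or newer, a single generic type can cover all of them. The BlockRef, StorageResponse[T], and DecodeStorage names are introduced for this example and are not part of the code in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlockRef identifies the block a response was read at.
type BlockRef struct {
	Hash   string `json:"hash"`
	Height string `json:"height"`
}

// StorageResponse is a generic variant of the structs above; only the type
// of Value changes between them. (Illustrative alternative, Go 1.18+.)
type StorageResponse[T any] struct {
	At          BlockRef `json:"at"`
	Pallet      string   `json:"pallet"`
	PalletIndex string   `json:"palletIndex"`
	StorageItem string   `json:"storageItem"`
	Value       T        `json:"value"`
}

// DecodeStorage unmarshals a storage response body whose value has type T.
func DecodeStorage[T any](body []byte) (StorageResponse[T], error) {
	var r StorageResponse[T]
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// The Owner response shown earlier in this post (single string value).
	owner := []byte(`{"at":{"hash":"0x5d4d987ca3ad7df15ee24046eae146dbfd073e53efcf0e605de865d9b48ac020","height":"116329"},"pallet":"subtensorModule","palletIndex":"8","storageItem":"owner","keys":["5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"],"value":"5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6"}`)
	o, err := DecodeStorage[string](owner)
	fmt.Println(o.Value, err)

	// A made-up body with a boolean list, shaped like the multi-bool case.
	flags := []byte(`{"at":{"hash":"0x00","height":"1"},"pallet":"subtensorModule","palletIndex":"8","storageItem":"exampleFlags","value":[true,false]}`)
	f, err := DecodeStorage[[]bool](flags)
	fmt.Println(f.Value, err)
}
```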
Finally, we add the following helper functions for parsing the data returned in the responses:
package util
import (
"math"
"strconv"
)
func MustParseFloat64(str string) float64 {
num, err := strconv.ParseFloat(str, 64)
if err != nil {
panic(err)
}
return num
}
func MustParseNormalizedFloat(str string) float64 {
f := MustParseFloat64(str)
return f / math.MaxUint16
}
func MustParseInt64(str string) int64 {
num, err := strconv.ParseInt(str, 10, 64)
if err != nil {
panic(err)
}
return num
}
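MustParseNormalizedFloat divides by math.MaxUint16 (65535), which implies the raw values are 16-bit unsigned integers scaled into the range 0 to 1. Under that assumption, a non-panicking variant can also reject out-of-range input. ParseNormalizedU16 is a name introduced for this sketch.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// ParseNormalizedU16 parses a decimal string holding a 16-bit unsigned value
// and scales it into [0, 1] by dividing by math.MaxUint16 (65535).
// Values that do not fit in 16 bits are returned as errors.
func ParseNormalizedU16(s string) (float64, error) {
	n, err := strconv.ParseUint(s, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("parsing %q as u16: %w", s, err)
	}
	return float64(n) / math.MaxUint16, nil
}

func main() {
	for _, raw := range []string{"0", "57930", "65535", "65536"} {
		v, err := ParseNormalizedU16(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", v)
	}
}
```

Here "65535" maps to 1 and "65536" is rejected as out of range, which surfaces a changed on-chain encoding instead of silently producing values above 1.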
The Code
Finally, this code can be found here. It combines everything discussed above into a single file and is provided as a proof of concept; you’ll need to test and tailor it before using it in your own environment.
Results
After updating our main.go to call this might-be-a-bit-too-long function with the hotkey address from earlier and running go run main.go, we get the following:
{
"UID": 1169,
"HotKey": "5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3",
"ColdKey": "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6",
"Stake": 785266.577834045,
"Rank": 0,
"VTrust": 0.8839551384756237,
"Trust": 0,
"Consensus": 0,
"Incentive": 0,
"Dividends": 0.22926680399786373,
"Emission": 11.463921703,
"Updated": 246, // consider having monitor alert if this rises above a certain threshold
"VPermit": true, // consider having monitor alert if this is ever false
"Active": true, // consider having monitor alert if this is ever false
"Height": 158794
}
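The three conditions flagged in the comments above can be evaluated in a few lines. This is a minimal sketch: NeuronStatus is a trimmed, illustrative subset of NeuronInfo, and the maxUpdated threshold is an operator-chosen assumption rather than a network constant.

```go
package main

import "fmt"

// NeuronStatus is a trimmed, illustrative subset of NeuronInfo.
type NeuronStatus struct {
	UID     int
	Updated float64
	VPermit bool
	Active  bool
}

// Alerts returns a message for each condition flagged in the output above.
// maxUpdated is an operator-chosen threshold (an assumption).
func Alerts(n NeuronStatus, maxUpdated float64) []string {
	var out []string
	if n.Updated > maxUpdated {
		out = append(out, fmt.Sprintf("uid %d: updated=%v exceeds %v", n.UID, n.Updated, maxUpdated))
	}
	if !n.VPermit {
		out = append(out, fmt.Sprintf("uid %d: validator permit is false", n.UID))
	}
	if !n.Active {
		out = append(out, fmt.Sprintf("uid %d: neuron is not active", n.UID))
	}
	return out
}

func main() {
	// Values from the result above; the threshold of 500 is made up.
	n := NeuronStatus{UID: 1169, Updated: 246, VPermit: true, Active: true}
	fmt.Println(len(Alerts(n, 500)))

	// A made-up unhealthy state that trips every condition.
	n.Updated, n.VPermit, n.Active = 900, false, false
	for _, a := range Alerts(n, 500) {
		fmt.Println(a)
	}
}
```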
Next Steps
Now that we have this data, what do we actually do with it? There are more than a few ways to use it to build a monitor; a few options that come to mind are:
- Expose the data as Prometheus metrics, create dashboards in Grafana, and write Alertmanager rules for any unexpected or unsafe state.
- Export the data through your preferred logging solution (Splunk, Datadog, or other) and alert through your existing pipelines.
- Write custom logic to trigger PagerDuty/Slack/Discord/Telegram/AIM alerts.
- Put the data in IPFS and sell it as NFTs
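As an illustration of the first option, the sketch below renders a few values in the Prometheus text exposition format using only the standard library. The metric names and the uid label are made up for this example; in practice the official Prometheus Go client library is the usual route.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Gauge is one sample to expose; the names used below are illustrative.
type Gauge struct {
	Name  string
	UID   int
	Value float64
}

// RenderMetrics writes samples in the Prometheus text exposition format,
// one `name{uid="N"} value` line per sample.
func RenderMetrics(gauges []Gauge) string {
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "%s{uid=\"%d\"} %s\n",
			g.Name, g.UID, strconv.FormatFloat(g.Value, 'g', -1, 64))
	}
	return b.String()
}

func main() {
	// Values from the result above, under hypothetical metric names.
	fmt.Print(RenderMetrics([]Gauge{
		{Name: "bittensor_neuron_updated_blocks", UID: 1169, Value: 246},
		{Name: "bittensor_neuron_active", UID: 1169, Value: 1},
	}))
}
```

Serving this string from an HTTP handler at a path such as /metrics would let a Prometheus server scrape it.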
Everyone’s infrastructure is different, and so too are the items you’ll want to monitor, graph and alert on. Exploring the available data, extracting what’s useful, and iterating tends to grow confidence over time that you will be alerted to important or anomalous conditions while still being able to ignore the noise. If you liked this post and want to spend more of your time building high-signal monitoring for high-quality, secure and reproducible infrastructure, please get in touch.
We are currently hiring for a variety of roles: unit410.com/#jobs.
Note: All code samples are provided for reference and illustrative purposes only, and are not intended to be used as-is in production.