Bittensor Protocol Monitoring

Background

Bittensor is a novel protocol that decentralizes the training and inference of machine learning models. To support early participation, we’ve been actively developing, scaling and securing a subset of infrastructure for the network. Networks evolve most quickly in their early stages and being a good participant requires flexible observability focused on network health. This post discusses how we’ve architected, deployed and evolved our monitoring and is intended to help operators think about improving the signal of their own monitoring.

Figure: The Bittensor Metagraph Simulation

The Bittensor network is divided into subnetworks (subnets), each supporting a specific machine learning model. Within a subnet there are two main participants, miners and validators, each with a unique UID. Together they form the metagraph for that subnet.

Under the hood, the network uses Substrate for the consensus layer. This allows the state of the network to be decentralized. It also allows the use of generic Substrate tooling such as polkadot-js for managing accounts and Substrate API Sidecar for querying data from the network. We’ve written previously about Substrate extrinsics [1], but let’s dive a bit deeper into how one can better understand the network from an observability perspective.

Initial Setup

A prerequisite to monitoring infrastructure is to cleanly define it. We prefer reproducible and containerized infrastructure, so this tutorial assumes you have docker and a recent version of golang installed.

Running Substrate API Sidecar

Parity provides a well-maintained sidecar that can be used to easily extract structured data from a local node. We make extensive use of this. You may run your own sidecar instance by creating a docker-compose.yml similar to the example below and starting the container with docker-compose up.

version: "3.8"
services:
  sidecar-subtensor:
    container_name: sidecar-subtensor
    image: parity/substrate-api-sidecar:latest
    ports:
      - "8080:8080"
    environment:
      SAS_SUBSTRATE_URL: wss://entrypoint-finney.opentensor.ai:443

Once up and running, you may start making queries to your node on port :8080. In this post we assume you’re running locally. If querying a remote node, replace localhost with your node’s IP address. A first query then looks like:

curl localhost:8080/node/version


{
  "clientVersion": "4.0.0-dev-c88a37247b9",
  "clientImplName": "node-subtensor",
  "chain": "Bittensor"
}

You may leave off the endpoint and just query localhost:8080, which will return the full list of available endpoints. The same list is also available as a Swagger file.

Key Network Telemetry

With your sidecar container running, we can now begin to codify an example monitor that queries key telemetry. The following Go program prints parameters that you should be aware of:

package main

import (
        "fmt"
        "io"
        "net/http"
)

func main() {
        PrintNodeNetwork()
        PrintNodeVersion()
        PrintRuntimeSpec()
}

// check panics on any error; acceptable for a throwaway script only.
func check(err error) {
        if err != nil {
                panic(err)
        }
}

// get performs a GET request and returns the response body as a string.
func get(url string) string {
        res, err := http.Get(url)
        check(err)
        // Always close the body so the underlying connection can be reused.
        defer res.Body.Close()

        body, err := io.ReadAll(res.Body)
        check(err)
        return string(body)
}

func PrintNodeVersion() {
        fmt.Println(get("http://localhost:8080/node/version"))
}

func PrintRuntimeSpec() {
        fmt.Println(get("http://localhost:8080/runtime/spec"))
}

func PrintNodeNetwork() {
        fmt.Println(get("http://localhost:8080/node/network"))
}

The output should give you something like the following:

Node Network

{
  "nodeRoles": [
    {
      "full": null
    }
  ],
  "numPeers": "387",
  "isSyncing": false,
  "shouldHavePeers": true,
  "localPeerId": "12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
  "localListenAddresses": [
    "/ip4/127.0.0.1/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
    "/ip4/10.10.0.85/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
    "/ip4/10.116.0.95/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ",
    "/ip6/::1/tcp/30333/ws/p2p/12D3KooWS1ZQAJ6zLxNHi6hXVNKbDLjFfCLGoyF9bvAS1fk7TVQQ"
  ],
  "peersInfo": "Cannot query system_peers from node."
}

Node Version

{
  "clientVersion": "4.0.0-dev-c88a37247b9",
  "clientImplName": "node-subtensor",
  "chain": "Bittensor"
}

Runtime Spec

{
  "at": {
    "height": "116115",
    "hash": "0x312a434d3074d1693e4d60ff0d9325b2f17b55bf83105d110b53b150bc608647"
  },
  "authoringVersion": "1",
  "transactionVersion": "1",
  "implVersion": "1",
  "specName": "node-subtensor",
  "specVersion": "116",
  "chainType": {
    "live": null
  },
  "properties": {
    "ss58Format": "42",
    "tokenDecimals": [
      "9"
    ],
    "tokenSymbol": [
      "TAO"
    ]
  }
}

Each of these endpoints includes key parameters to monitor, notably:

  • Node Network
    • numPeers → the number of peers that the node is currently connected to. This should remain above 150.
    • isSyncing → whether or not the node is up to date with other peers. This should remain false; otherwise you are making queries against outdated state.
  • Node Version
    • clientVersion → useful for ensuring the node is on the correct version or if using multiple nodes that they are all on the same version.
  • Runtime Spec
    • specVersion → indicates the current runtime version. A change to the runtime can include changes to the encoding and decoding of extrinsics (transactions) which would in turn require changes to your signers. Monitoring for specVersion changes is an important way to prevent your tooling from falling behind.
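The checks listed above can be sketched as a small pure function. This is a minimal illustration, not part of the sidecar: the NodeStatus struct, the CheckNode name and the expected spec version are assumptions, and the 150-peer floor is simply the figure quoted above.

```go
package main

import (
	"fmt"
	"strconv"
)

// NodeStatus holds the fields discussed above. The struct is illustrative;
// numeric values arrive from the sidecar as JSON strings.
type NodeStatus struct {
	NumPeers    string
	IsSyncing   bool
	SpecVersion string
}

// CheckNode returns one human-readable problem per violated condition.
// minPeers and expectedSpec are operator-chosen values (assumptions here).
func CheckNode(s NodeStatus, minPeers int, expectedSpec string) []string {
	var problems []string

	peers, err := strconv.Atoi(s.NumPeers)
	if err != nil {
		problems = append(problems, fmt.Sprintf("numPeers %q is not a number", s.NumPeers))
	} else if peers <= minPeers {
		problems = append(problems, fmt.Sprintf("numPeers %d is not above %d", peers, minPeers))
	}
	if s.IsSyncing {
		problems = append(problems, "node is syncing; queries may return outdated state")
	}
	if s.SpecVersion != expectedSpec {
		problems = append(problems, fmt.Sprintf("specVersion is %s, expected %s", s.SpecVersion, expectedSpec))
	}
	return problems
}

func main() {
	// Values taken from the sample responses above.
	healthy := NodeStatus{NumPeers: "387", IsSyncing: false, SpecVersion: "116"}
	fmt.Println("problems:", len(CheckNode(healthy, 150, "116")))

	// A made-up degraded node that violates all three conditions.
	degraded := NodeStatus{NumPeers: "12", IsSyncing: true, SpecVersion: "117"}
	for _, p := range CheckNode(degraded, 150, "116") {
		fmt.Println(p)
	}
}
```

Keeping the checks separate from the HTTP calls makes them easy to unit test and to reuse whichever alerting channel you choose later.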

Add Type Definitions

Next, we’ll use mholt.github.io/json-to-go to simplify turning these responses into type definitions which gives us:

type NodeNetworkResponse struct {
        NodeRoles []struct {
                Full interface{} `json:"full"`
        } `json:"nodeRoles"`
        NumPeers             string   `json:"numPeers"`
        IsSyncing            bool     `json:"isSyncing"`
        ShouldHavePeers      bool     `json:"shouldHavePeers"`
        LocalPeerID          string   `json:"localPeerId"`
        LocalListenAddresses []string `json:"localListenAddresses"`
        PeersInfo            string   `json:"peersInfo"`
}


type NodeVersionResponse struct {
        ClientVersion  string `json:"clientVersion"`
        ClientImplName string `json:"clientImplName"`
        Chain          string `json:"chain"`
}


type RuntimeSpecResponse struct {
        At struct {
                Height string `json:"height"`
                Hash   string `json:"hash"`
        } `json:"at"`
        AuthoringVersion   string `json:"authoringVersion"`
        TransactionVersion string `json:"transactionVersion"`
        ImplVersion        string `json:"implVersion"`
        SpecName           string `json:"specName"`
        SpecVersion        string `json:"specVersion"`
        ChainType          struct {
                Live interface{} `json:"live"`
        } `json:"chainType"`
        Properties struct {
                Ss58Format    string   `json:"ss58Format"`
                TokenDecimals []string `json:"tokenDecimals"`
                TokenSymbol   []string `json:"tokenSymbol"`
        } `json:"properties"`
}

Hotkey Balance & Owner Info

Now that we’ve explored key network parameters, we’ll also want to monitor the applicable addresses, their balances and any operations they’re signing. This will help us confirm signed operations match our expectations, loudly communicate to our team when keys are being accessed and provide the basis for downstream services like balance tracking to reconcile events.

We can start by querying the balance for an address with:

func GetBalanceInfoForAddress(address string) {
        url := fmt.Sprintf("http://localhost:8080/accounts/%s/balance-info", address)
        resp := new(types.BalanceInfoResponse)
        
        err := GetRequest(url, &resp)
        check(err)
       
        logger.PrettyPrint(resp)
}

Which can be structured with this type definition:

type BalanceInfoResponse struct {
        At struct {
                Hash   string `json:"hash"`
                Height string `json:"height"`
        } `json:"at"`
        Nonce       string        `json:"nonce"`
        TokenSymbol string        `json:"tokenSymbol"`
        Free        string        `json:"free"`
        Reserved    string        `json:"reserved"`
        MiscFrozen  string        `json:"miscFrozen"`
        FeeFrozen   string        `json:"feeFrozen"`
        Locks       []interface{} `json:"locks"`
}
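The nonce field in this response is useful for noticing when a key is being used: on Substrate-based chains an account's nonce increases with each extrinsic the account signs that is included in a block. The sketch below compares the nonce between polls. NonceTracker and its methods are hypothetical helpers, not part of the sidecar.

```go
package main

import (
	"fmt"
	"strconv"
)

// NonceTracker remembers the last nonce seen per address. It is a
// hypothetical helper for illustration only.
type NonceTracker struct {
	last map[string]uint64
}

func NewNonceTracker() *NonceTracker {
	return &NonceTracker{last: make(map[string]uint64)}
}

// Observe records the nonce string from a balance-info response and returns
// how far the nonce advanced since the previous observation. The first
// observation for an address returns 0 because there is no baseline yet.
func (t *NonceTracker) Observe(address, nonce string) (uint64, error) {
	n, err := strconv.ParseUint(nonce, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse nonce %q: %w", nonce, err)
	}
	prev, seen := t.last[address]
	t.last[address] = n
	if !seen || n < prev {
		// No baseline, or an unexpected decrease: report no new activity.
		return 0, nil
	}
	return n - prev, nil
}

func main() {
	tracker := NewNonceTracker()
	addr := "5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"

	// Simulated nonce values from two consecutive polls.
	for _, nonce := range []string{"444", "446"} {
		delta, err := tracker.Observe(addr, nonce)
		if err != nil {
			panic(err)
		}
		fmt.Printf("nonce=%s new extrinsics since last poll=%d\n", nonce, delta)
	}
}
```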

Putting this into practice, we can pick a random hotkey from TaoStats and display its balance with GetBalanceInfoForAddress("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"):

{
        "at": {
                "hash": "0xe65bb31d67465db17067ffb19f32be06e8b993a969ef4d529deeb5d5d19cd522",
                "height": "116263"
        },
        "nonce": "444",
        "tokenSymbol": "TAO",
        "free": "860000",
        "reserved": "0",
        "miscFrozen": "0",
        "feeFrozen": "0",
        "locks": []
}

We know from our previous query (RuntimeSpec.properties.tokenDecimals) that TAO has 9 decimals, so the actual balance for this address is:

\[\frac{860000}{10^9} = 0.00086 \textbf{ TAO}\]

Why not just hardcode the 9 and save ourselves the extra step? Chains have redenominated in the past, so it is generally wise to avoid hardcoding here.
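The same conversion can be done without floating point, which avoids rounding once balances grow large. The following is a minimal sketch using math/big; FormatUnits is a hypothetical helper name, and the decimals argument is assumed to come from the runtime spec query rather than a constant.

```go
package main

import (
	"fmt"
	"math/big"
)

// FormatUnits converts a raw integer balance (a decimal string, as returned
// by the sidecar) into a fixed-point string with the given decimals.
func FormatUnits(raw string, decimals int) (string, error) {
	if decimals < 0 {
		return "", fmt.Errorf("negative decimals %d", decimals)
	}
	amount, ok := new(big.Int).SetString(raw, 10)
	if !ok {
		return "", fmt.Errorf("invalid balance %q", raw)
	}
	// scale = 10^decimals, computed exactly.
	scale := new(big.Int).Exp(big.NewInt(10), big.NewInt(int64(decimals)), nil)
	return new(big.Rat).SetFrac(amount, scale).FloatString(decimals), nil
}

func main() {
	// Raw balance and decimals from the responses shown above.
	s, err := FormatUnits("860000", 9)
	if err != nil {
		panic(err)
	}
	fmt.Println(s, "TAO") // prints "0.000860000 TAO"
}
```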

Determine the Coldkey

Bittensor is designed with distinct “hot” and “cold” keys, with a one-to-many mapping from coldkeys to hotkeys. A coldkey is meant to protect funds and cannot be used to sign the “immediate” operations needed to participate in validation. For each hotkey, the coldkey is visible on TaoStats, and we can also query it from the subtensorModule pallet via our API sidecar instance. To determine the coldkey for a given hotkey, we query the pallet’s ‘Owner’ storage item. This is done by adding an additional client function, an associated type definition, and updating our main.go.

// client.go
package client

func GetHotKeyOwner(address string) {
        url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/Owner?keys[]=%s", address)
        resp := new(types.StorageResponse)
        
        err := GetRequest(url, &resp)
        check(err)
        
        logger.PrettyPrint(resp)
}


// responses.go
type StorageResponse struct {
        At struct {
                Hash   string `json:"hash"`
                Height string `json:"height"`
        } `json:"at"`
        Pallet      string   `json:"pallet"`
        PalletIndex string   `json:"palletIndex"`
        StorageItem string   `json:"storageItem"`
        Keys        []string `json:"keys"`
        Value       string   `json:"value"`
}

// main.go
func main() {
    client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
}

Putting this all together, our query returns:

$ go run main.go
{
        "at": {
                "hash": "0x5d4d987ca3ad7df15ee24046eae146dbfd073e53efcf0e605de865d9b48ac020",
                "height": "116329"
        },
        "pallet": "subtensorModule",
        "palletIndex": "8",
        "storageItem": "owner",
        "keys": [
                "5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"
        ],
        "value": "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6"
}
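Since one goal is to confirm that on-chain state matches our expectations, a monitor could compare the returned owner against the coldkey it expects. The sketch below does this for a response body like the one above; VerifyOwner and the trimmed ownerResponse struct are illustrative names, not part of the sidecar.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ownerResponse keeps only the fields needed for the comparison.
type ownerResponse struct {
	Keys  []string `json:"keys"`
	Value string   `json:"value"`
}

// VerifyOwner decodes an Owner storage response and reports whether the
// on-chain coldkey matches the one the operator expects.
func VerifyOwner(body []byte, expectedColdkey string) (bool, error) {
	var resp ownerResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode owner response: %w", err)
	}
	if resp.Value == "" {
		return false, fmt.Errorf("owner response has no value")
	}
	return resp.Value == expectedColdkey, nil
}

func main() {
	// Trimmed copy of the response shown above.
	body := []byte(`{
		"keys": ["5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"],
		"value": "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6"
	}`)

	ok, err := VerifyOwner(body, "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6")
	if err != nil {
		panic(err)
	}
	fmt.Println("coldkey matches expectation:", ok)
}
```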


Cleaning Up

We were able to query the coldkey for a given hotkey, and now we’d like to take the address from the storage response and look up its balance. To do that, we rework our client functions to return their responses to the caller, leaving us with the following client.go:

package client

import (
 "encoding/json"
 "fmt"
 "io"
 "net/http"
 "bittensor-monitor/pkg/types"
)

// getRequest fetches url and decodes the JSON response body into resp.
func getRequest(url string, resp interface{}) error {
        res, resErr := http.Get(url)
        if resErr != nil {
                return resErr
        }
        // Close the body on every return path, including non-200 responses.
        defer res.Body.Close()

        if res.StatusCode != http.StatusOK {
                return fmt.Errorf("unexpected status %d from %s", res.StatusCode, url)
        }

        body, readErr := io.ReadAll(res.Body)
        if readErr != nil {
                return readErr
        }

        return json.Unmarshal(body, resp)
}


func GetNodeVersion() (*types.NodeVersionResponse, error) {
        url := "http://localhost:8080/node/version"
        resp := new(types.NodeVersionResponse)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetRuntimeSpec() (*types.RuntimeSpecResponse, error) {
        url := "http://localhost:8080/runtime/spec"
        resp := new(types.RuntimeSpecResponse)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetNodeNetwork() (*types.NodeNetworkResponse, error) {
        url := "http://localhost:8080/node/network"
        resp := new(types.NodeNetworkResponse)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetBalanceInfoForAddress(address string) (*types.BalanceInfoResponse, error) {
        url := fmt.Sprintf("http://localhost:8080/accounts/%s/balance-info", address)
        resp := new(types.BalanceInfoResponse)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetHotKeyOwner(address string) (*types.StorageResponse, error) {
        url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/Owner?keys[]=%s", address)
        resp := new(types.StorageResponse)
        err := getRequest(url, &resp)
        
        return resp, err
}
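A client like this is easier to test and deploy if the base URL and HTTP client are configurable. The sketch below is one possible shape, not the code from this post: the Sidecar type, NewStubSidecar and stubTransport are hypothetical names. It sets a request timeout and swaps in a stub transport so the decoding logic can be exercised without a running node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// Sidecar makes the base URL and HTTP client configurable (hypothetical type).
type Sidecar struct {
	BaseURL string
	HTTP    *http.Client
}

// NodeVersion is a trimmed version of the /node/version response.
type NodeVersion struct {
	ClientVersion string `json:"clientVersion"`
	Chain         string `json:"chain"`
}

// get fetches path relative to BaseURL and decodes the JSON body into out.
func (s *Sidecar) get(path string, out interface{}) error {
	res, err := s.HTTP.Get(s.BaseURL + path)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %d", path, res.StatusCode)
	}
	return json.NewDecoder(res.Body).Decode(out)
}

func (s *Sidecar) NodeVersion() (*NodeVersion, error) {
	v := new(NodeVersion)
	err := s.get("/node/version", v)
	return v, err
}

// stubTransport answers every request with a canned response.
type stubTransport struct {
	status int
	body   string
}

func (t stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: t.status,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(t.body)),
		Request:    req,
	}, nil
}

// NewStubSidecar returns a Sidecar that never touches the network.
func NewStubSidecar(status int, body string) *Sidecar {
	return &Sidecar{
		BaseURL: "http://localhost:8080",
		HTTP:    &http.Client{Timeout: 5 * time.Second, Transport: stubTransport{status, body}},
	}
}

func main() {
	sc := NewStubSidecar(200, `{"clientVersion":"4.0.0-dev-c88a37247b9","chain":"Bittensor"}`)
	v, err := sc.NodeVersion()
	if err != nil {
		panic(err)
	}
	fmt.Println(v.ClientVersion, v.Chain)
}
```

Against a real node you would omit the Transport field so the default transport is used, and keep the timeout so a hung node cannot stall the monitor.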

Now we can use the response data from our queries in subsequent queries. For example, we can take the address returned from our GetHotKeyOwner() call and use it in GetBalanceInfoForAddress().

package main


import (
        "bittensor-monitor/pkg/client"
        "bittensor-monitor/pkg/logger"
)


func main() {
        resp, err := client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
        if err != nil {
                panic(err) // TODO: don't panic here
        }
        
        coldkey := resp.Value
        
        balanceResp, balErr := client.GetBalanceInfoForAddress(coldkey)
        if balErr != nil {
                panic(balErr)
        }
        
        logger.PrettyPrint(balanceResp)
}

Taking this even further, since we know how to query the number of decimals for a given denom, we can convert the returned value into a human-readable balance. We add the following helper functions to our responses.go: one returns the decimals for a given token symbol, and the other converts the string in the balance info response to a number.

// DecimalsForSymbol returns the number of decimals for the given token symbol.
func (r RuntimeSpecResponse) DecimalsForSymbol(token string) int64 {
        decimalString := ""

        for n := range r.Properties.TokenSymbol {
                sym := r.Properties.TokenSymbol[n]
                if sym == token {
                        decimalString = r.Properties.TokenDecimals[n]
                }
        }

        if decimalString == "" {
                panic("decimals not found for given token symbol; consider returning an error instead of a panic here.")
        }

        num, err := strconv.ParseInt(decimalString, 10, 0)
        if err != nil {
                panic(err) // TODO: don't panic here.
        }

        return num
}


func (b BalanceInfoResponse) BalanceFree() float64 {
        num, err := strconv.ParseFloat(b.Free, 64)
        if err != nil {
                panic(err) // TODO:  don't panic here.
        }


        return num
}
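The TODO comments above suggest returning errors instead of panicking. One way that might look is sketched below. It uses a standalone function over a trimmed Properties struct, both of which are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strconv"
)

// Properties mirrors the two parallel arrays in the "properties" object of
// the /runtime/spec response.
type Properties struct {
	TokenDecimals []string
	TokenSymbol   []string
}

// DecimalsForSymbol returns the decimals for token, or an error when the
// symbol is unknown, has no matching decimals entry, or fails to parse.
func DecimalsForSymbol(p Properties, token string) (int, error) {
	for i, sym := range p.TokenSymbol {
		if sym != token {
			continue
		}
		if i >= len(p.TokenDecimals) {
			return 0, fmt.Errorf("no decimals entry for symbol %q", token)
		}
		d, err := strconv.Atoi(p.TokenDecimals[i])
		if err != nil {
			return 0, fmt.Errorf("parse decimals for %q: %w", token, err)
		}
		return d, nil
	}
	return 0, fmt.Errorf("symbol %q not found in runtime spec", token)
}

func main() {
	// Values from the runtime spec response shown earlier.
	props := Properties{TokenDecimals: []string{"9"}, TokenSymbol: []string{"TAO"}}

	d, err := DecimalsForSymbol(props, "TAO")
	fmt.Println(d, err)

	// An unknown symbol now yields an error the caller can log or alert on.
	_, err = DecimalsForSymbol(props, "DOT")
	fmt.Println(err)
}
```

A long-running monitor can then log the error and retry on the next poll rather than crash.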

Using these helper functions, we can tweak our main.go like so:

package main


import (
 "fmt"
 "math"
 "bittensor-monitor/pkg/client"
)


func main() {
        resp, err := client.GetHotKeyOwner("5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3")
        if err != nil {
                panic(err) // TODO: don't panic here
        }
        
        coldkey := resp.Value
        
        balanceResp, balErr := client.GetBalanceInfoForAddress(coldkey)
        if balErr != nil {
                panic(balErr)
        }
        
        spec, specErr := client.GetRuntimeSpec()
        if specErr != nil {
                panic(specErr)
        }
        
        decimals := spec.DecimalsForSymbol(balanceResp.TokenSymbol)
        humanReadableBalance := balanceResp.BalanceFree() / (math.Pow10(int(decimals)))
        
        fmt.Printf("%v %v \n", humanReadableBalance, balanceResp.TokenSymbol)
}

This returns our balance:

$ go run main.go
0.500001 TAO

Querying Storage

Let’s now take a moment to better understand what is happening here. You might be wondering whether we could query information from storage directly from the node. The long answer is absolutely, and I encourage anyone interested in how this might be accomplished to check out the write-up on Shawn Tabrizi’s blog: Querying Substrate Storage via RPC. The short answer is that querying Substrate storage directly is a rather involved and time-consuming task.

Working through an early network launch requires speed and flexibility. Using the Substrate API Sidecar (built by Parity!) is one way to move faster and more safely while keeping up with a novel network.

But what else might we want to monitor from pallet storage? We can see the list of items within the pallet storage via:

  • http://localhost:8080/pallets/subtensorModule/storage?onlyIds=true

For the full list of available storage items you can query for a given substrate pallet, you can query:

  • http://localhost:8080/pallets/$PALLET_NAME/storage/

This will also tell you what parameters are required for querying each storage item.
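When many storage items are queried, it can help to build these URLs in one place. The helper below is a hypothetical sketch using net/url. Note that it percent-encodes the brackets in keys[], which standard query parsers decode, but that is an assumption worth verifying against your sidecar version.

```go
package main

import (
	"fmt"
	"net/url"
)

// StorageURL builds a sidecar pallet storage URL. With no keys it returns the
// bare storage item URL; each key is added as a repeated keys[] parameter.
func StorageURL(base, pallet, item string, keys ...string) string {
	u := fmt.Sprintf("%s/pallets/%s/storage/%s", base, url.PathEscape(pallet), url.PathEscape(item))
	if len(keys) == 0 {
		return u
	}
	q := url.Values{}
	for _, k := range keys {
		q.Add("keys[]", k)
	}
	// Encode escapes "[" and "]" as %5B and %5D.
	return u + "?" + q.Encode()
}

func main() {
	fmt.Println(StorageURL("http://localhost:8080", "subtensorModule", "Owner",
		"5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3"))
}
```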

UID Metagraph Data

We can now combine everything we’ve explored thus far to recreate the Metagraph entry for a given hotkey. We can use the following type definition for that entry, but feel free to add additional items as you see fit. It represents all the datapoints we’d like to query from the chain and will provide the foundation for our monitor.

type NeuronInfo struct {
        UID       int
        HotKey    string
        ColdKey   string
        Stake     float64
        Rank      float64
        VTrust    float64
        Trust     float64
        Consensus float64
        Incentive float64
        Dividends float64
        Emission  float64
        Updated   float64
        VPermit   bool
        Active    bool
        Height    float64
}

Preparation

Additional metagraph items can be queried like so:

func GetStorageItemSingleKeySingleValue(item string, key int) (*types.StorageResponseSingleValue, error) {
        url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
        resp := new(types.StorageResponseSingleValue)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetStorageItemSingleKeyMultiValue(item string, key int) (*types.StorageResponseMultiValue, error) {
        url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
        resp := new(types.StorageResponseMultiValue)
        err := getRequest(url, &resp)
        
        return resp, err
}


func GetStorageItemSingleKeyMultiBoolValue(item string, key int) (*types.StorageResponseMultiBoolValue, error) {
        url := fmt.Sprintf("http://localhost:8080/pallets/subtensorModule/storage/%v?keys[]=%v", item, key)
        resp := new(types.StorageResponseMultiBoolValue)
        err := getRequest(url, &resp)
        
        return resp, err
}

And parsed with the following type definitions:

type StorageResponseSingleValue struct {
        At struct {
                Hash   string `json:"hash"`
                Height string `json:"height"`
        } `json:"at"`
        Pallet      string `json:"pallet"`
        PalletIndex string `json:"palletIndex"`
        StorageItem string `json:"storageItem"`
        Value       string `json:"value"`
}


type StorageResponseMultiValue struct {
        At struct {
                Hash   string `json:"hash"`
                Height string `json:"height"`
        } `json:"at"`
        Pallet      string   `json:"pallet"`
        PalletIndex string   `json:"palletIndex"`
        StorageItem string   `json:"storageItem"`
        Value       []string `json:"value"`
}


type StorageResponseMultiBoolValue struct {
        At struct {
                Hash   string `json:"hash"`
                Height string `json:"height"`
        } `json:"at"`
        Pallet      string `json:"pallet"`
        PalletIndex string `json:"palletIndex"`
        StorageItem string `json:"storageItem"`
        Value       []bool `json:"value"`
}
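The three response types above differ only in the type of Value. With Go 1.18 or later, one alternative is a single generic type. This is a design sketch rather than the approach used in this post; DecodeStorage is a hypothetical helper and the "active" sample body is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// At identifies the block a response was read at.
type At struct {
	Hash   string `json:"hash"`
	Height string `json:"height"`
}

// StorageResponse is generic over the type of the returned value.
type StorageResponse[T any] struct {
	At          At     `json:"at"`
	Pallet      string `json:"pallet"`
	PalletIndex string `json:"palletIndex"`
	StorageItem string `json:"storageItem"`
	Value       T      `json:"value"`
}

// DecodeStorage unmarshals a storage response whose value has type T.
func DecodeStorage[T any](body []byte) (*StorageResponse[T], error) {
	resp := new(StorageResponse[T])
	if err := json.Unmarshal(body, resp); err != nil {
		return nil, fmt.Errorf("decode storage response: %w", err)
	}
	return resp, nil
}

func main() {
	// A single string value, as returned for the Owner item.
	owner, err := DecodeStorage[string]([]byte(
		`{"storageItem":"owner","value":"5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(owner.StorageItem, owner.Value)

	// A list of booleans; this sample body is illustrative.
	flags, err := DecodeStorage[[]bool]([]byte(`{"storageItem":"active","value":[true,false,true]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(flags.StorageItem, len(flags.Value))
}
```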

Finally, we add the following helper functions for parsing the data returned in the responses:

package util


import (
 "math"
 "strconv"
)


func MustParseFloat64(str string) float64 {
        num, err := strconv.ParseFloat(str, 64)
        if err != nil {
                panic(err)
        }
        
        return num
}


func MustParseNormalizedFloat(str string) float64 {
        f := MustParseFloat64(str)
        return f / math.MaxUint16
}


func MustParseInt64(str string) int64 {
        num, err := strconv.ParseInt(str, 10, 64)
        if err != nil {
                panic(err)
        }
        
        return num
}
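To make the normalization concrete: MustParseNormalizedFloat divides by the largest 16-bit unsigned value (65535), mapping the raw integers onto the range 0 to 1. The sketch below is an error-returning variant with a few worked inputs. ParseNormalizedU16 is a hypothetical name, and the raw value 57930 is illustrative, back-computed from the VTrust figure in the Results section rather than read from the chain.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// ParseNormalizedU16 parses a 16-bit unsigned integer string and scales it
// into [0, 1]. Inputs outside the u16 range are rejected with an error.
func ParseNormalizedU16(s string) (float64, error) {
	n, err := strconv.ParseUint(s, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("parse u16 %q: %w", s, err)
	}
	return float64(n) / math.MaxUint16, nil
}

func main() {
	for _, raw := range []string{"0", "57930", "65535"} {
		v, err := ParseNormalizedU16(raw)
		if err != nil {
			panic(err)
		}
		// 0 maps to 0, 65535 maps to 1, and 57930 lands near 0.884.
		fmt.Printf("%s -> %.6f\n", raw, v)
	}
}
```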


The Code

Finally, this code can be found here. This represents everything discussed above and has been combined into a single file. This is provided as a proof-of-concept, but you’ll need to test & tailor before using in your own environment.

Results

After updating our main.go to call this might-be-a-bit-too-long function with the hotkey address from earlier and running go run main.go we get the following:

{
        "UID": 1169,
        "HotKey": "5F4tQyWrhfGVcNhoqeiNsR6KjD4wMZ2kfhLj4oHYuyHbZAc3",
        "ColdKey": "5Ccmf1dJKzGtXX7h17eN72MVMRsFwvYjPVmkXPUaapczECf6",
        "Stake": 785266.577834045,
        "Rank": 0,
        "VTrust": 0.8839551384756237,
        "Trust": 0,
        "Consensus": 0,
        "Incentive": 0,
        "Dividends": 0.22926680399786373,
        "Emission": 11.463921703,
        "Updated": 246,  // consider having monitor alert if this rises above a certain threshold
        "VPermit": true,  // consider having monitor alert if this is ever false
        "Active": true,  // consider having monitor alert if this is ever false
        "Height": 158794
}
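The alert suggestions in the comments above can be sketched as a small rule function. The Alerts helper and the threshold of 1000 used in main are assumptions for illustration; choose a limit for Updated that suits your own validator.

```go
package main

import "fmt"

// NeuronInfo is trimmed to the fields the rules below inspect.
type NeuronInfo struct {
	UID     int
	Updated float64
	VPermit bool
	Active  bool
}

// Alerts returns one message per rule that the entry violates.
// maxUpdated is an operator-chosen threshold (an assumption here).
func Alerts(n NeuronInfo, maxUpdated float64) []string {
	var alerts []string
	if n.Updated > maxUpdated {
		alerts = append(alerts, fmt.Sprintf("uid %d: Updated is %.0f, above threshold %.0f", n.UID, n.Updated, maxUpdated))
	}
	if !n.VPermit {
		alerts = append(alerts, fmt.Sprintf("uid %d: VPermit is false", n.UID))
	}
	if !n.Active {
		alerts = append(alerts, fmt.Sprintf("uid %d: Active is false", n.UID))
	}
	return alerts
}

func main() {
	// The entry from the results above raises no alerts at this threshold.
	ok := NeuronInfo{UID: 1169, Updated: 246, VPermit: true, Active: true}
	fmt.Println("alerts:", len(Alerts(ok, 1000)))

	// A made-up unhealthy entry trips all three rules.
	bad := NeuronInfo{UID: 1169, Updated: 5000, VPermit: false, Active: false}
	for _, a := range Alerts(bad, 1000) {
		fmt.Println(a)
	}
}
```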

Next Steps

Now that we have this data what do we actually do with it? There are more than a few ways we could use this data to build a monitor and a few of the options that come to mind are:

  • Expose the data as Prometheus metrics, create dashboards in Grafana, and write Alertmanager rules for any unexpected or unsafe state.
  • Export the data through your preferred logging solution (Splunk, Datadog, or other) and alert through your existing pipelines.
  • Write custom logic to trigger PagerDuty/Slack/Discord/Telegram/AIM alerts.
  • Put the data in IPFS and sell them as NFTs
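As a rough illustration of the first option, the sketch below renders a few fields of a metagraph entry in the Prometheus text exposition format using only the standard library. The metric names are hypothetical, and in practice the official Prometheus Go client library would usually be a better fit than hand-rolled formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// NeuronInfo is trimmed to the fields exported in this sketch.
type NeuronInfo struct {
	UID     int
	Updated float64
	VPermit bool
	Active  bool
}

// MetricLine renders one sample line with a uid label.
func MetricLine(name string, uid int, value float64) string {
	return fmt.Sprintf("%s{uid=\"%d\"} %g", name, uid, value)
}

// boolToFloat maps true to 1 and false to 0, the usual gauge encoding.
func boolToFloat(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

// RenderMetrics returns gauges for a single metagraph entry. Metric names
// are hypothetical; pick names that match your own conventions.
func RenderMetrics(n NeuronInfo) string {
	var sb strings.Builder
	gauge := func(name string, value float64) {
		fmt.Fprintf(&sb, "# TYPE %s gauge\n%s\n", name, MetricLine(name, n.UID, value))
	}
	gauge("bittensor_neuron_updated", n.Updated)
	gauge("bittensor_neuron_validator_permit", boolToFloat(n.VPermit))
	gauge("bittensor_neuron_active", boolToFloat(n.Active))
	return sb.String()
}

func main() {
	// Values from the results shown earlier.
	fmt.Print(RenderMetrics(NeuronInfo{UID: 1169, Updated: 246, VPermit: true, Active: true}))
}
```

Serving this string from an HTTP handler at /metrics would let a Prometheus server scrape it.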

Everyone’s infrastructure is different, and so too are the items you’ll want to monitor, graph and alert on. Exploring the available data, extracting what’s useful, and iterating tends to grow confidence over time that you will be alerted to important or anomalous conditions while still being able to ignore the noise. If you liked this post and want to spend more of your time building high-signal monitoring for high-quality, secure and reproducible infrastructure, please get in touch.

We are currently hiring for a variety of roles: unit410.com/jobs.

Note: All code samples are provided for reference and illustrative purposes only, and are not intended to be used as-is in production.

  1. See here for a longer discussion on substrate extrinsics.