Efficient Data Reconciliation: Accuracy with Bisect Technique
This is what AI thinks accounting accuracy looks like
By Joel Nordell, Lead Accounting Engineer
Introduction
At Unit 410, we take the accuracy of accounting data seriously, even when dealing with blockchains whose native tokens carry many decimal places, such as 18 on Ethereum or 24 on NEAR. We preserve all precision to ensure that even the smallest discrepancies in our reported data are corrected. This blog post explores one technique we use to quickly and efficiently pinpoint data errors and maintain accuracy in our accounting processes.
Background
Our typical approach toward indexing a blockchain network for accounting is as follows. First, we deploy a dedicated RPC node as part of our network participation infrastructure. This node connects to the blockchain network, and provides protocol-specific endpoints for retrieving data from the blockchain. Usually this will include, at a minimum, the following three important types of calls:
- a way to query the current block number.
- a way to retrieve all data associated with a block including any events that occurred.
- a way to retrieve the balance of an address at a given block.
Using these endpoints, we check for new blocks (either by subscribing, or by polling) and process them one-by-one as they occur. Within each block, we parse the transaction data looking for anything that affects an address balance. We interpret this data into a series of entries which represent balance changes. The specifics of these can vary quite a bit depending on the type of network, but they generally have the same structure: block identifier (number or hash), the address, the amount of the change (either positive or negative), the token, etc.
The result is a database of every on-chain event that has ever changed the balance of each address. From this database, we generate all of our accounting reporting, as well as provide balance information as of any point in time.
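As a rough illustration of the entry structure described above, here is a minimal Python sketch. The `Entry` class, its field names, and the `balance_as_of` helper are hypothetical and are not taken from our indexer; the sketch assumes amounts are stored as integers in the token's smallest unit, so that no precision is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One balance change (hypothetical shape, for illustration only)."""
    block: int     # block number in which the change occurred
    address: str   # address whose balance changed
    token: str     # token identifier
    amount: int    # signed change, in the token's smallest unit


def balance_as_of(entries, address, token, block):
    """Sum every change for (address, token) up to and including `block`."""
    return sum(
        e.amount
        for e in entries
        if e.address == address and e.token == token and e.block <= block
    )


ONE_ETH = 10**18  # 18 decimal places on Ethereum

entries = [
    Entry(block=100, address="0xabc", token="ETH", amount=2 * ONE_ETH),
    Entry(block=105, address="0xabc", token="ETH", amount=-1),  # 1 wei out
    Entry(block=110, address="0xabc", token="ETH", amount=-(ONE_ETH // 2)),
]

print(balance_as_of(entries, "0xabc", "ETH", 105))  # 1999999999999999999
```

Python integers have arbitrary precision, so the single-wei change at block 105 survives exactly. A floating-point representation would round it away.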
Ensuring Accuracy
To ensure the accuracy and completeness of our indexer data, we employ a reconciliation technique. Reconciliation is a process that compares the balance of every address in our database to the balance reported by the blockchain node at the same block. If any discrepancies are found, the process returns an error. If no discrepancies are found, the process returns success. The diagram below shows the sequence that occurs every time we invoke a reconciliation.
We constantly run this process in the background, on all of our indexers, to ensure that our accounting data always matches the blockchain source of truth. Anytime a discrepancy occurs, our team is alerted and we take steps to remedy the situation. It is important to note, however, that this only tells us that a discrepancy has occurred; it does not tell us where the discrepancy is. To find it, we must perform a more detailed analysis.
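The reconciliation check can be sketched in a few lines of Python. This is an illustrative sketch only: the `reconcile` signature, the `indexed_balances` mapping, and the `node_balance` callable are assumptions standing in for a real database query and a real RPC balance call.

```python
from typing import Callable, Dict


def reconcile(
    indexed_balances: Dict[str, int],
    node_balance: Callable[[str, int], int],
    block: int,
) -> bool:
    """Return True if every indexed balance equals the node's balance at `block`.

    `indexed_balances` maps address -> balance computed from our database
    as of `block`; `node_balance(address, block)` stands in for the RPC call.
    """
    return all(
        node_balance(address, block) == balance
        for address, balance in indexed_balances.items()
    )


# Toy stand-in for the blockchain node: balances do not depend on the block.
chain_state = {"0xabc": 2 * 10**18, "0xdef": 5}


def fake_node_balance(address: str, block: int) -> int:
    return chain_state[address]


indexed = {"0xabc": 2 * 10**18, "0xdef": 5}
ok_before = reconcile(indexed, fake_node_balance, block=100)

chain_state["0xdef"] = 6  # simulate a change our indexer missed
ok_after = reconcile(indexed, fake_node_balance, block=101)
```

Note that the result is a single pass/fail signal for a given block, which is why a separate search is needed to locate the block where the data first diverged.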
Finding Errors
When a reconciliation failure occurs, it means the balance computed within our database does not match the balance reported by the blockchain node. This is usually due to some block data having been missed by the indexing process, often caused by temporary network errors or other infrastructure issues. Because we live in the real world, this happens from time to time.
Whenever this occurs, we must pinpoint exactly which data is missing so that we can re-index the specific blocks affected. How can we do that efficiently?
To solve this problem, we have developed a technique we call bisect, inspired by (and named after) the git-bisect command.
The initial step involves finding a block where the reconciliation process was successful. This is typically straightforward as we can choose any block prior to the failure. The selected block can be significantly earlier in the blockchain, for instance, subtracting 10,000 from the latest block number. By reconciling this block and ensuring its success, we establish a starting point. This, in conjunction with the block where the reconciliation process failed, provides us with a specific range to search within. The error must be located somewhere within this range.
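One way to establish that starting point is sketched below, under stated assumptions. The function name `find_good_block`, the `reconcile_at` callable (returns True when reconciliation succeeds at a block), and the fallback to block 0 are all hypothetical; the step of 10,000 simply mirrors the example above.

```python
def find_good_block(reconcile_at, failing_block, step=10_000):
    """Step backwards from `failing_block` until reconciliation succeeds.

    Assumption: if no earlier candidate passes, block 0 is used as the
    lower bound, on the premise that an empty chain trivially reconciles.
    """
    candidate = failing_block - step
    while candidate > 0:
        if reconcile_at(candidate):
            return candidate
        candidate -= step
    return 0


# Toy scenario: data is intact for every block below 25,000.
good = find_good_block(lambda block: block < 25_000, failing_block=50_000)
print(good)  # 20000
```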
We can efficiently find the first block in the range where reconciliation fails, in O(log n) time, using a binary search. This process, known as bisect, repeatedly invokes reconcile, halving the search range at each step until it finds the first block N where N-1 succeeds and N fails. The following illustration demonstrates step-by-step how this search is performed. In this example, the failing block is identified by checking only 4 blocks (22% of the range). Within a larger range, the efficiency is much greater.
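The search can be sketched in Python as follows. This is a minimal illustration rather than our production code: `reconcile_at` is a hypothetical callable that returns True when reconciliation succeeds at a block, and the sketch assumes that once reconciliation fails at some block, it also fails at every later block in the range.

```python
def bisect(reconcile_at, good, bad):
    """Return the first block N in (good, bad] where reconciliation fails.

    Preconditions (assumed): reconcile_at(good) is True, reconcile_at(bad)
    is False, and failures persist for all blocks after the first failure.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if reconcile_at(mid):
            good = mid   # data is intact up to mid; search the upper half
        else:
            bad = mid    # failure is at or before mid; search the lower half
    return bad           # now good == bad - 1: N-1 succeeds, N fails


# Toy scenario: the first missed data is in block 1,007.
checked = []


def fake_reconcile_at(block):
    checked.append(block)
    return block < 1_007


first_bad = bisect(fake_reconcile_at, good=1_000, bad=1_018)
print(first_bad)  # 1007
```

Because the range is halved on every iteration, the number of reconcile calls grows with the logarithm of the range: a range of 1,000,000 blocks needs at most 20 checks.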
Now, with the first failed block identified, we can re-index that block and confirm that reconciliation succeeds for it. If there are still more failures, we simply repeat the bisect process, with each corrected block as the new starting point. This technique allows us to quickly identify and correct every data error with minimal effort, thus ensuring that we always deliver 100% accurate and reliable accounting data to our clients.
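The repeat-until-clean loop might look like the sketch below. Everything here is illustrative: `repair`, `reconcile_at`, and `reindex` are hypothetical names, and the simulation models missed data as a set of block numbers, with reconciliation failing at any block at or after a missed one.

```python
def bisect(reconcile_at, good, bad):
    """Binary search for the first failing block in (good, bad]."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if reconcile_at(mid):
            good = mid
        else:
            bad = mid
    return bad


def repair(reconcile_at, reindex, good, bad):
    """Bisect and re-index repeatedly until `bad` reconciles.

    Returns the list of blocks that were re-indexed, in order.
    """
    fixed = []
    while not reconcile_at(bad):
        first_bad = bisect(reconcile_at, good, bad)
        reindex(first_bad)
        if not reconcile_at(first_bad):
            raise RuntimeError(f"re-indexing block {first_bad} did not fix it")
        fixed.append(first_bad)
        good = first_bad  # the corrected block is the new starting point
    return fixed


# Toy simulation: three blocks were missed by the indexer.
missing = {1_003, 1_011, 1_012}


def fake_reconcile_at(block):
    return not any(m <= block for m in missing)


fixed = repair(fake_reconcile_at, missing.discard, good=1_000, bad=1_018)
print(fixed)  # [1003, 1011, 1012]
```

The guard after `reindex` is a design choice in this sketch: if re-indexing does not resolve the failure, continuing would violate the search precondition, so it stops and reports the block instead.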
Conclusion
In conclusion, maintaining accuracy in accounting data is essential for Unit 410. With our thorough reconciliation processes and the use of the bisect technique, we can quickly identify and correct any data errors. This ensures that our reporting is always in line with the blockchain source of truth. We prioritize accuracy to provide reliable accounting data to our clients and maintain our commitment to excellence in financial management.