Aug 25, 2022 | joachimneu
A core responsibility of any layer-1 blockchain is to guarantee data availability. This guarantee is essential for clients to be able to interpret the layer-1 blockchain itself, and also as a foundation for higher-layer applications like rollups. For this purpose, an oft-discussed technique is random sampling for data availability verification, as popularized by a paper by Mustafa Al-Bassam, Alberto Sonnino, and Vitalik Buterin in 2018. This technology is at the core of the Celestia blockchain, and proposed for inclusion in proof-of-stake (PoS) Ethereum with “Danksharding.”
The purpose of this blog post is to explain the basics of data availability sampling (DAS), the model upon which it rests, and challenges and open problems when it comes to implementing the technique in practice. We hope that this post can “pill” researchers, attract them to the problem, and spur new ideas towards solving some of the outstanding challenges (cf. this recent Request for Proposals of the Ethereum Foundation).
Somebody (such as a layer-1 block proposer or layer-2 sequencer) has produced a block of data. They claim to have made it “available” to the “public.” Your goal is to check the availability claim, i.e., would you actually be able to obtain the data if you needed to?
Availability of data is crucial. Optimistic fraud-proof-based systems, like Optimism, require data availability for verification, and even validity-proof-based systems, like StarkNet or Aztec, require data availability to ensure liveness (e.g., to prove asset ownership for a rollup’s escape hatch or forced transaction inclusion mechanism).
For the problem formulation so far, there is an easy “naive” testing procedure, which earlier systems like Bitcoin implicitly adopt: just try to download the entire block of data. If you succeed, you know it is available; if you do not succeed, you consider it unavailable. Now, however, we want to test for data availability without downloading too much data ourselves, e.g., because the data is larger than we can handle, or because it seems wasteful to spend a lot of bandwidth on data we aren’t actually interested in, “only” to verify its availability. At this point we need a model to clarify what it “means” to download or withhold only “part of the data.”
A common approach in computer science is to first describe a new technique in a model that has rather generous facilities; and to subsequently explain how that model can be realized. We take a similar approach with DAS, except, as we will see, interesting open R&D problems pop up when we attempt to instantiate the model.
In our model, there is a bulletin board in a dark room (see comic below). First, the block producer enters the room and gets the opportunity to write some information on the bulletin board. As the block producer exits, it can give you, the validator, a tiny piece of information (of size that does not scale linearly with the original data). You enter the room with a flashlight that has a very narrow light beam and is low on battery, so you can only read the writing on very few distinct locations of the bulletin board. Your goal is to convince yourself that indeed the block producer has left enough information on the bulletin board so that if you were to turn on the light and read the complete bulletin board, you would be able to recover the file.
At first, this seems tricky: We can ask the block producer to write down the complete file on the bulletin board. Now consider two possibilities: Either the producer behaves honestly and writes down the complete file, or the producer misbehaves and leaves out some tiny chunk of the information, making the file as a whole unavailable. By inspecting the bulletin board at only a few locations, you cannot distinguish these two scenarios reliably—thus, you cannot check for data availability reliably. We need a new approach!
This is where erasure-correcting Reed-Solomon codes come into play. Let’s take a brief detour to recap those. On a high level, an erasure-correcting code works like this: A vector of k information chunks gets encoded into a (longer) vector of n coded chunks. The rate R = k/n of the code measures the redundancy introduced by the encoding. From certain subsets of the coded chunks, the original information chunks can later be decoded.
If the code is maximum distance separable (MDS), then the original k information chunks can be recovered from any k of the n coded chunks, an optimal robustness guarantee. Reed-Solomon codes are a popular family of MDS codes. They build on an insight from high-school math: two distinct points uniquely determine a line.
This is because a line can be described as a polynomial of degree 1 with two coefficients: f(X) = a_0 + a_1·X. Evaluating f at two distinct locations yields two points on the line, and conversely, from any two points on the line the two coefficients can be recovered. More generally, a polynomial of degree t is uniquely determined by any t+1 of its evaluations.
Reed-Solomon codes are built from this insight. For encoding, we start with the k information chunks, interpret them as the coefficients of a polynomial of degree k-1 (over a finite field), and evaluate that polynomial at n distinct locations; the n evaluations are the coded chunks. Since a polynomial of degree k-1 is determined by any k of its evaluations, any k coded chunks suffice to recover the original k information chunks, so Reed-Solomon codes are indeed MDS.
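To make the construction concrete, here is a minimal Python sketch of Reed-Solomon encoding and decoding over a small prime field, using plain polynomial evaluation and Lagrange interpolation. The field size and chunk granularity are toy choices for illustration; real deployments use large fields and fast (FFT-based) algorithms.

```python
# Toy Reed-Solomon code over the prime field GF(P); illustration only.
P = 2**13 - 1  # a small prime; each "chunk" is a single field element

def rs_encode(info_chunks, n):
    """Interpret the k information chunks as coefficients of a polynomial
    of degree k-1 and evaluate it at x = 0, 1, ..., n-1."""
    assert len(info_chunks) <= n <= P
    def eval_poly(x):
        acc = 0
        for coeff in reversed(info_chunks):  # Horner's rule
            acc = (acc * x + coeff) % P
        return acc
    return [eval_poly(x) for x in range(n)]

def _mul_by_linear(coeffs, c):
    """Multiply a polynomial (low-to-high coefficients) by (X - c), mod P."""
    out = [0] * (len(coeffs) + 1)
    for m, a in enumerate(coeffs):
        out[m] = (out[m] - c * a) % P
        out[m + 1] = (out[m + 1] + a) % P
    return out

def rs_decode(samples, k):
    """Recover the k coefficients from any k (position, value) pairs
    via Lagrange interpolation."""
    xs = [x for x, _ in samples[:k]]
    ys = [y for _, y in samples[:k]]
    coeffs = [0] * k
    for i in range(k):
        basis = [1]   # coefficients of prod_{j != i} (X - x_j)
        denom = 1     # prod_{j != i} (x_i - x_j)
        for j in range(k):
            if j != i:
                basis = _mul_by_linear(basis, xs[j])
                denom = (denom * (xs[i] - xs[j])) % P
        scale = ys[i] * pow(denom, P - 2, P) % P  # divide via Fermat inverse
        for m in range(k):
            coeffs[m] = (coeffs[m] + scale * basis[m]) % P
    return coeffs

# Any k = 4 of the n = 8 coded chunks recover the original data:
info = [7, 13, 42, 99]
coded = rs_encode(info, n=8)                       # rate R = 1/2
assert rs_decode(list(enumerate(coded))[3:7], k=4) == info
```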
Going back to our data availability problem: Instead of asking the block producer to write down the raw file on the bulletin board, we now ask it to cut the file into k chunks, encode them with a Reed-Solomon code of rate R = 1/2, and write the n = 2k coded chunks onto the bulletin board. Assume for now that the block producer at least performs the encoding correctly (we return to this assumption under Challenge B below). To render the file unavailable, a malicious producer must now withhold more than half of the coded chunks, because any k = n/2 of them suffice to recover the file in full. In other words, either the bulletin board is (essentially) fully written, or more than half of it is empty.
But now these two scenarios, a bulletin board that is fully written, and a bulletin board that is half empty, are easily distinguishable: You inspect the bulletin board at a small number r of randomly chosen locations, and consider the file available only if every sampled location carries a chunk. If the file is actually unavailable, then more than half of the board is empty, so each sample hits an empty location with probability at least 1/2, and the probability that all r samples miss the empty half is at most (1/2)^r, i.e., exponentially small in the number of samples.
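To get a feel for the numbers, here is a small back-of-the-envelope calculation (the parameters are purely illustrative) of the probability that a sampler is fooled when the file is in fact unavailable:

```python
def false_accept_prob(n, k, r):
    """Upper bound on the soundness error: if the file is unavailable,
    at most k-1 of the n board positions can be written, and all r samples
    (drawn without replacement) must land on written positions."""
    prob = 1.0
    for i in range(r):
        prob *= max(k - 1 - i, 0) / (n - i)
    return prob

# Rate-1/2 code with n = 512 coded chunks (k = 256 information chunks):
for r in (10, 20, 30):
    print(r, false_accept_prob(n=512, k=256, r=r))
# prints values around 2^-10, 2^-20, 2^-30: exponentially small in r
```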
This is beautifully simple—within the given “bulletin board in a dark room” model. Let’s think about the model: What do the components represent? Can we realize them in a real computer system, and how?
In fact, to help spot the gaps between theory and practice, we have explained the problem and solution using that “strange” “bulletin board in a dark room” model, with metaphors that bear little resemblance to the real computational system. This was to encourage you to reflect about how aspects of the real and model world correspond, and how they might (not) be realized. If you are left with pieces of the model that you haven’t been able to translate into a computer/network/protocol equivalent, you know there is something left to be done—which might be either gaps in your understanding, or open research problems! ;)
Here is a non-exhaustive collection of challenges, for some of which the community has found reasonable answers over the years, while others are still open research problems.
Challenge A: How to ensure that the chunks on the bulletin board were actually written by the proposer? Think about alterations of the sampled chunks as they transit, in whatever form, over the network to the sampling node. This is where the small piece of information comes in that the block producer can pass to the sampling nodes as the producer leaves and the sampling node enters the dark room. In practice, this is realized as a binding vector commitment (think Merkle tree) to the original content written to the bulletin board, and it is shared, for instance, as part of the block header. Given the commitment, the block producer can then leave a proof with each coded chunk on the bulletin board to show that the chunk was indeed written by the block producer. The chunks cannot be altered by a third party in transit, as the commitment scheme does not allow forging a valid proof for a modified chunk. Note that this by itself does not preclude the block producer from writing invalid/inconsistent chunks on the bulletin board, which we address next.
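As an illustration of such a binding vector commitment, here is a minimal Merkle-tree sketch in Python; the hash function, serialization, and power-of-two assumption are simplifying choices for exposition rather than a specification of any particular system.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_commit(chunks):
    """Return the Merkle root plus all tree levels (kept for proof generation).
    Assumes len(chunks) is a power of two, for simplicity."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return level[0], levels

def merkle_prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof

def merkle_verify(root, chunk, index, proof):
    """Check that `chunk` sits at position `index` under the commitment `root`."""
    acc = h(chunk)
    for sibling in proof:
        acc = h(acc + sibling) if index % 2 == 0 else h(sibling + acc)
        index //= 2
    return acc == root

# The producer commits to the coded chunks, puts the root into the block
# header, and attaches a proof to every chunk written to the bulletin board.
chunks = [f"coded chunk {i}".encode() for i in range(8)]
root, levels = merkle_commit(chunks)
assert merkle_verify(root, chunks[5], 5, merkle_prove(levels, 5))
```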
Challenge B: Enforce that the block producer erasure-encodes properly. We have assumed in the above scheme that the block producer encodes the information chunks correctly, so that the guarantees of the erasure code hold, and that consequently the information chunks can actually be recovered from enough coded chunks. In other words, all the block producer can do is withhold chunks, not confuse us with invalid chunks. In practice, there are three commonly discussed approaches to rule out invalid encoding:
- Fraud proofs of incorrect encoding: some nodes download enough of the data to check the encoding and broadcast a succinct proof of inconsistency if they find one; sampling nodes rely on receiving such a proof in time (a toy version of the underlying consistency check is sketched below).
- Two-dimensional Reed-Solomon encoding, as used by Celestia, which keeps fraud proofs of incorrect encoding small, at the cost of additional sampling overhead.
- Validity proofs: polynomial commitments such as KZG, as proposed for Danksharding, let the producer prove that all coded chunks are evaluations of one low-degree polynomial, ruling out inconsistent encoding cryptographically.
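To illustrate the first idea, here is a toy consistency check, reusing the rs_encode/rs_decode sketch from above. It is meant to convey the principle behind fraud proofs of incorrect encoding, not how any production system implements them:

```python
def find_encoding_fraud(coded_with_positions, k, n):
    """Decode from any k available (position, value) pairs, re-encode, and
    report every position whose published chunk disagrees. A non-empty
    result is the raw material for a proof of incorrect encoding."""
    recovered = rs_decode(coded_with_positions[:k], k)
    reencoded = rs_encode(recovered, n)
    return [x for x, y in coded_with_positions if reencoded[x] != y]
```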
Challenge C: “What” and “where” is the bulletin board? How does the proposer “write” to it? Before we get to “what” and “where” the bulletin board “is”, how the proposer “writes” to it, and how the verifier “reads”/“samples” from it, let’s review well-known shortcomings of two basic peer-to-peer networking primitives:
- Flooding/gossip protocols (such as GossipSub) can spread a piece of data to all participating nodes, but they rely on peers forwarding the data honestly, and they provide no way to retrieve a specific piece of content from the network at a later point in time.
- Distributed hash tables (DHTs, such as Kademlia) can store and locate specific pieces of content at specific nodes, but they are not designed to be Sybil resistant: an adversary controlling many identities can position itself on the lookup paths or take over the storage of targeted keys.
With this in mind, we can return to the central questions on how to implement the bulletin board and the read/write operations on it. Where are the coded chunks stored? How do they get there? Three prominent approaches under consideration are:
- GOSSIP: the coded chunks are disseminated through a gossip network, with each node storing (and later serving) a small subset of them;
- DHT: the coded chunks are uploaded into a distributed hash table, which determines for each chunk the nodes responsible for storing and serving it;
- REPLICATE: a set of nodes each store a full replica of all coded chunks, and sampling nodes query a nearby replica.
Challenges with these approaches are that GOSSIP and DHT inherit the shortcomings of the underlying primitives sketched above (reliance on honest peers to forward and store chunks, and lack of Sybil resistance of routing and storage), while REPLICATE presupposes enough honest and well-resourced nodes willing to store and serve full replicas.
Challenge D: “How” do we implement the random sampling? This question is two-fold: how are the desired chunks located and transmitted in the network (i.e., how to “read” from the bulletin board), and how is it ensured that the sampling “remains random” with respect to the adversary, i.e., that an adversarial block producer does not have (too much) opportunity to change its strategy adaptively depending on who queries which chunks.
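Setting aside how chunks are located and transported, the client-side sampling logic itself is simple. Below is a minimal sketch, reusing merkle_verify from the commitment sketch above; fetch_chunk is a hypothetical stand-in for whatever gossip/DHT/request-response machinery actually answers the query, which is exactly the open question here:

```python
import random

def sample_availability(root, n, r, fetch_chunk):
    """Sample r random positions of the coded bulletin board.
    `fetch_chunk(index)` is a hypothetical network primitive returning
    (chunk, merkle_proof), or None if the chunk cannot be obtained."""
    for index in random.sample(range(n), r):
        result = fetch_chunk(index)
        if result is None:
            return False   # a sample came back empty: treat the data as unavailable
        chunk, proof = result
        if not merkle_verify(root, chunk, index, proof):
            return False   # chunk is not bound to the commitment in the block header
    return True            # all samples answered and verified: accept as available
```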
Note that the approaches above solve only sampling (“reading” from the bulletin board now), but not “reading” from the bulletin board at any time in the future. Specifically, GOSSIP essentially implements an ephemeral bulletin board (which can be read/sampled only at the time when its content is being written/dispersed), while DHT implements a permanent bulletin board (which can be read/sampled also much later). Typically, a permanent bulletin board is desired (with a permanence requirement ranging from “days” to “forever”, depending on the exact design), for which purpose GOSSIP has to be complemented with a DHT to route chunks, which comes with the aforementioned challenges. REPLICATE implements a permanent bulletin board right away.
The following table illustrates which peer-to-peer protocols are commonly proposed to realize which functionality of the model. Specifically, the gossip-oriented approach comes in two variants, one that uses gossip to sample chunks and one that uses a DHT to sample chunks; in both variants, chunks are “written to the bulletin board” using gossip and “read from it” at later points in time using a DHT. In contrast, the DHT-oriented approach relies entirely on a DHT for all the operations in question. In the replication-oriented approach, each node reads/samples chunks from a full replica nearby, using a request/response protocol; gossip is effectively used for the initial dissemination of the chunks, albeit the gossiping between two peers might technically be realized through the request/response protocol.
| Approach | “Write” chunks | Sample chunks (while being written) | “Read” chunks (considerably later) |
| --- | --- | --- | --- |
| Gossip-oriented (variant 1) | Gossip | Gossip | DHT |
| Gossip-oriented (variant 2) | Gossip | DHT | DHT |
| DHT-oriented | DHT | DHT | DHT |
| Replication-oriented | Gossip | Request/response | Request/response |
Furthermore, in all of the above techniques, “who samples what” is leaked (at least partially) to the adversary, so the adversary can, through its own behavior, adaptively impair or promote the propagation of chunks sampled by certain nodes, and thus trick certain nodes into believing that the block is (un-)available. While earlier work shows that only a few nodes can be tricked this way, it is still undesirable. Alternatively, earlier works assume anonymous network communication, which in practice comes with a considerable performance penalty at the very least, if it is not outright impractical.
Challenge E: How to “repair” the board’s content? That is, if a coded chunk goes missing (for example, because nodes storing that chunk have gone offline; how is that even detected?), how is it recovered? Naive repair involves decoding and re-encoding, and thus poses a considerable communication and computational burden, in particular for the common Reed-Solomon erasure-correcting codes. Who takes on this burden? How are they compensated? How do we prevent a malicious block producer from griefing sampling nodes by withholding a few coded chunks, forcing nodes to spend resources on expensive repair? What about distributed repair schemes? And how are the chunks that are necessary for repair even retrieved, going back to the previous point about “reading” from the board in the future?
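To make the cost of naive repair concrete, here is what it looks like with the toy Reed-Solomon sketch from above: restoring even a single chunk requires gathering k surviving chunks and performing a full decode and re-encode.

```python
def repair_chunk(available, missing_index, k, n):
    """Naive repair of one coded chunk: gather any k surviving
    (position, value) pairs, decode the whole polynomial, re-encode,
    and read off the missing position. Communication cost: k chunks;
    computation cost: a full decode plus a full encode."""
    if len(available) < k:
        raise ValueError("fewer than k chunks survive; the data is lost")
    recovered = rs_decode(available[:k], k)
    return rs_encode(recovered, n)[missing_index]
```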
Challenge F: Incentives. If sampling is free, how to prevent a denial-of-service vector? If sampling requires payment (how to implement?), how can it simultaneously be fully anonymous? How are those compensated that store (parts of) the bulletin board, route information in the peer-to-peer network, or perform maintenance tasks such as chunk repair?
For completeness, we briefly mention a slightly different model for which DAS achieves a slightly different guarantee. Even without anonymous and reliable networking, an adversary can trick at most a certain number of honest nodes into believing that an unavailable file is available. Otherwise, it would have to release so many chunks that from the union of all chunks obtained by honest nodes the file can be recovered. The advantage of this model is that the properties required from the network are easier to achieve (in particular, when the peer-to-peer network has been subverted by the adversary). Disadvantages are that there is no concrete guarantee for an individual user (you could be among the few tricked!), and that it remains unclear how to gather all the samples obtained by honest nodes and recover the file (in particular, when the peer-to-peer network has been subverted by the adversary).
Based on the observations and arguments presented in this blog post, we think the open challenges outlined above, from realizing the bulletin board and its write/read/sampling operations in a robust peer-to-peer fashion, to repair, incentives, and sampling privacy, will be interesting directions for future research and development.
If you are working on (or interested in working on) any of these directions, I would love to hear from you! You can reach me on Twitter at @jneu_net.
Acknowledgments: Special thanks to Mustafa Al-Bassam, Danny Ryan, Dankrad Feist, Sreeram Kannan, Srivatsan Sridhar, Lei Yang, Dan Robinson, Georgios Konstantopoulos, and Dan Lee for fruitful discussions and feedback on earlier drafts of this post, and to Achal Srinivasan for the beautiful illustrations.
Disclaimer: This post is for general information purposes only. It does not constitute investment advice or a recommendation or solicitation to buy or sell any investment and should not be used in the evaluation of the merits of making any investment decision. It should not be relied upon for accounting, legal or tax advice or investment recommendations. This post reflects the current opinions of the authors and is not made on behalf of Paradigm or its affiliates and does not necessarily reflect the opinions of Paradigm, its affiliates or individuals associated with Paradigm. The opinions reflected herein are subject to change without being updated. Paradigm invests in certain of the protocols referenced herein. These references are not meant to endorse or recommend these protocols for investment or otherwise.