# A SNARKy Exponential Function

## Simulating real numbers using finite field arithmetic

### 2019-03-09

In Coda, we use the proof-of-stake protocol Ouroboros for consensus. Naturally since this is Coda, that means we have to check proofs-of-stake inside a SNARK.

A proof-of-stake for a person with some amount of stake $$a$$ in Ouroboros is a random number $$s$$ between 0 and 1 (which one provably has generated fairly) such that $$s$$ is less than some threshold depending on $$a$$. Concretely, that threshold is $$1 - (1/2)^{\frac{a}{T}}$$ where $$T$$ is the total amount of stake in the system.

It’s important to use a threshold of this form because it means that the density of blocks over time does not depend on the distribution of stake.

If you know anything about SNARKs, you know that inside of a SNARK all we can do is arithmetic (that is, addition and multiplication) in a finite field $$F_p$$. It’s not at all clear how we can compute a fractional number raised to a fractional power!

We’ll go through a very cool technique for computing such a thing. All the code for doing what we’ll talk about is implemented using snarky and can be found here.

The technique will go as follows:

1. We’ll use Taylor series to approximately reduce the problem to doing arithmetic with real numbers.
2. We’ll approximate the arithmetic of real numbers using the arithmetic of rational numbers.
3. We’ll then approximate the arithmetic of rational numbers using integer arithmetic.
4. Finally, we’ll simulate integer arithmetic using finite field arithmetic.

## Taylor series

First we need a way to reduce the problem of computing an exponentiation to multiplications and additions in a finite field. As a first step, calculus lets us reduce exponentiation to multiplication and addition over the real numbers (a field, but not a finite one) using a Taylor series.

Specifically, we know that

\begin{aligned}
1 - (1/2)^x
&= -\log(1/2) x - \frac{(\log(1/2) x)^2}{2!} - \frac{(\log(1/2) x)^3}{3!} - \dots \\
&= \log(2) x - \frac{(\log(2) x)^2}{2!} + \frac{(\log(2) x)^3}{3!} - \dots
\end{aligned}
We can truncate this Taylor series to get polynomials $$T_n$$
\begin{aligned}
T_n
&= \log(2) x - \frac{(\log(2) x)^2}{2!} + \dots + \frac{(\log(2) x)^n}{n!} \\
&= \log(2) x - \frac{\log(2)^2}{2!} x^2 + \dots + \frac{\log(2)^n}{n!} x^n
\end{aligned}

by taking the first $$n$$ terms. The Taylor polynomials nearly compute $$1 - (1/2)^x$$, but with some error that gets smaller as you take more and more terms. You can see this in the image of the actual function (in blue) along with the first few Taylor polynomials (in black).

It turns out there’s a handy formula which lets us figure out how many terms we need to take to make sure we get the first $$k$$ bits of the output correct, so we can just use that and truncate at the appropriate point for the amount of precision that we want.

## From reals to rationals

Multiplication and addition are continuous, which means if you change the inputs by only a little bit, the outputs change by only a little bit.

Explicitly, if instead of computing $$x_1 + x_2$$, we compute $$a_1 + a_2$$ where $$a_i$$ is $$x_i$$ plus some small error $$e_i$$, we have $$a_1 + a_2 = (x_1 + e_1) + (x_2 + e_2) = (x_1 + x_2) + (e_1 + e_2)$$ so the result is close to $$x_1 + x_2$$ (since $$e_1 + e_2$$ is small).

Similarly for multiplication we have $$a1 a2 = (x1 + e1)(x2 + e2) = x1 x2 + e1 x1 + e2 x2 + e1 e2$$. If $$e1, e2$$ are small enough compared to $$x_1$$ and $$x_2$$, then $$e1 x1 + e2 x2 + e1 e2$$ will be small as well, and so $$a1 a2$$ will be close to $$x1 x2$$.

What this means is that instead of computing the Taylor polynomial $\log(2) x - \frac{\log(2)^2}{2!} x^2 + \dots + \frac{\log(2)^n}{n!} x^n$

using real numbers like $$\log(2)$$ (which is irrational), we can approximate each coefficient $$\log(2)^k / k!$$ with a nearby rational number and compute using those instead! By continuity, we’re guaranteed that the result will be close to the actual value (and we can quantify exactly how close if we want to).

## From rationals to integers

There are a few ways to approximate rational arithmetic using integer arithmetic, some of which are more efficient than others.

The first is to use arbitrary rational numbers and simulate rational arithmetic exactly. We know that $\frac{a}{b} + \frac{c}{d} = \frac{a d + b c}{b d}$ so if we represent rationals using pairs of integers, we can simulate rational arithmetic perfectly. However, there is a bit of an issue with this approach, which is that the integers involved get really huge really quickly when you add numbers together. For example, $$1/2 + 1/3 + \dots + 1/n$$ has $$n!$$ as its denominator, which is a large number.

That’s a problem for us inside the SNARK, because we’re working with a finite field and want to make sure there is no overflow.

A better approach is to use rational numbers whose denominator is a power of two. These numbers are called dyadic rationals and are basically the same thing as floating point numbers.

Here, addition can be simulated using integers as follows. A rational number $$\frac{a}{2^k}$$ is represented as the pair $$(a, k)$$. Say $$k \leq m$$. For addition, we have $\frac{a}{2^k} + \frac{b}{2^m} = \frac{2^{m - k} a + b}{2^m}$ so the denominator of a sum is the max of the denominators of the inputs. That means that the denominators don’t get huge when you add a bunch of numbers (they will stay as big as the largest input to the sum).

Moreover, any rational number can be approximated by a number of this form (it’s just the binary expansion of that number).

## From integers to a finite field

To recap, we’ve done the following.

1. We approximated our exponential $$1 - (1/2)^x$$ by arithmetic over the reals using Taylor polynomials.
2. We approximated real arithmetic by rational arithmetic using continuity.
3. We approximated rational arithmetic by the arithmetic of dyadic-rationals/floats, again using continuity. Moreover, we saw that floating point arithmetic can easily be simulated exactly using integer arithmetic.

Now our final step is to simulate integer arithmetic using the arithmetic of our prime order field $$\mathbb{F}_p$$. But this step is actually the easiest! As long as we are careful about numbers not overflowing, it’s the same thing. That is, if we know ahead of time that $$a + b < p$$, then we can safely compute $$a + b \mod p$$, knowing that the result will be the same as over the integers. The same is true for multiplication. So as long as we don’t do too many multiplications, the integers representing the dyadic rationals we’re computing with won’t get too big, and so we will be okay. And if we don’t take too many terms of the Taylor series, this will be the case.

## Conclusion

Let’s survey the net result of all this approximation: We have a very efficient way of approximately computing an exponential function on fractional inputs inside of a SNARK, in such a way that we can have concrete bounds on the error of the approximation. Pretty cool! This enables us to use a threshold function for Ouroboros that guarantees a constant density of blocks regardless of the distribution of stake.

# Fast Accumulation on Streams

## Succinctly Verifying Coda’s Ledger

### 2018-12-20

1 If you’d rather consume this content in video form, watch this talk.

While developing Coda, we came across an interesting problem that uncovered a much more general and potentially widely applicable problem: Taking advantage of parallelism when combining a large amount of data streaming in over time. We were able to come up with a solution that scales up to any throughput optimally while simultaneously minimizing latency and space usage. We’re sharing our results with the hope that others dealing with manipulation of online data streams will find them interesting and applicable.1

## Background

2 Equivalent to security as a full node.

The Coda cryptocurrency protocol is unique in that it uses a succinct blockchain. In Coda the blockchain is replaced by a tiny constant-sized cryptographic proof. This means that in the Coda protocol a user can sync with full-security2 instantly—users don’t have to wait to download thousands and thousands of blocks to verify the state of the network.

What is this tiny cryptographic proof? It’s called a zk-SNARK, or zero knowledge Succinct Non-interactive ARgument of Knowledge. zk-SNARKs let a program create a proof of a computation, then share that proof with anyone. Anyone with the proof can verify the computation very quickly, in just milliseconds, independent of how long the computation itself takes. While validating proofs is fast, creating them is quite slow, so creating this SNARK proof is much more computationally expensive. We use a few different SNARK proofs throughout Coda’s protocol, but the important one for this post is what we call the “Ledger Proof”.

3 Note that we represent account states concretely as their hashes for performance reasons.

A ledger proof tells us that given some starting account state $\sigma_0$ there was a series of $k$ transactions that eventually put us into account state $\sigma_k$. Let’s refer to such a proof as $\sigma_0 \Longrightarrow \sigma_k$.3 So what does it mean for a single transaction to be valid? A transaction, $T_i^{i+1}$, is valid if it’s been signed by the sender, and the sender had sufficient balance in their account. As a result our account state $\sigma_i$ transitions to some new state $\sigma_{i+1}$. This state transition can be represented as $\sigma_i T_{i}^{i+1} \sigma_{i+1}$. We could recompute $\sigma_0 \Longrightarrow \sigma_k$ every time there is a new transaction, but that would be slow, with the cost of generating the proof growing with the number of transactions—instead we can reuse the previous proof recursively. These ledger proofs enable users of Coda to be sure that the ledger has been computed correctly and play a part in consensus state verification.

More precisely, the recursive bit of our ledger proof, $\sigma_0 \Longrightarrow \sigma_{i}$, or the account state, has transitioned from the starting state $\sigma_0$ to the current state $\sigma_i$ after $i$ correct transactions are applied, could naively be defined in the following way:

There exists a proof, $\sigma_0 \Longrightarrow \sigma_{i-1}$, and $\sigma_{i-1}T_{i-1}^{i}\sigma_i$ such that $\sigma_0 \Longrightarrow \sigma_{i-1}$ verifies and $\sigma_{i-1}T_{i-1}^{i}\sigma_i$ is valid.

Let’s examine what running this process over four steps would look like:

The functional programming enthusiast will notice that this operation is like a scan:

The ~init in OCaml refers to a named argument, and 'a and 'b are a type unification variables that work similarly to generics in Java.

(* scan [1;2;3] ~init:0
~f:(fun b a -> b + a)
=> [1,3,6] *)
val scan : 'a list
-> ~init:'b
-> ~f:('b -> 'a -> 'b)
-> 'b list
(* scan [1;2;3] ~init:0 ~f:(fun b a -> b + a)
=> [1,3,6] *)
val scan : 'a list -> ~init:'b
-> ~f:('b -> 'a -> 'b) -> 'b list

4

A scan combines elements of a collection together incrementally and returns all intermediate values. For example if our elements are numbers and our operation is plus, scan [1;2;3] ~init:0 ~f:(fun b a → b + a) has following evaluation trace:

:: means “cons” or prepend to the front of a linked list.

scan [1;2;3] ~init:0 ~f:add
1::3::6::[]
[1;3;6]

5

However, what we really have is a scan operation over some sort of stream of incoming information, not a list. A signature in OCaml may look like this:

val scan : 'a Stream.t
-> ~init:'b
-> ~f:('b -> 'a -> 'b)
-> 'b Stream.t

6 We write streams as lists in the evaluation.

As new information flows into the stream we combine it with the last piece of computed information and emit that result onto a new stream. Here’s a trace with transactions and proofs6:

\begin{aligned}
scan &[\sigma_0T_0^{1}\sigma^{1}; \; \sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_0 \\
\sim &f:combine \\
\\

combine&(\sigma_0 \Longrightarrow \sigma_0,\sigma_0T_0^{1}\sigma_1):: \\
(scan &[\sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:\\
&combine(\sigma_0 \Longrightarrow \sigma_0,\sigma_0T_0^{1}\sigma_1)  \\
\sim &f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_1&:: \\
(scan &[\sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_1 \\
\sim &f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_2&::\sigma_0 \Longrightarrow \sigma_1:: \\
(scan &[\sigma_2T_2^{3}\sigma_3] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_2 \\
\sim &f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_3&::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1:: \\
(scan &[] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_3 \\
\sim &f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_3&::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1::[] \\
\\

[\sigma_0 \Longrightarrow \sigma_3&; \; \sigma_0 \Longrightarrow \sigma_2; \; \sigma_0 \Longrightarrow \sigma_1]
\end{aligned}
\begin{aligned}
scan &[\sigma_0T_0^{1}\sigma^{1}; \; \sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_0 \quad\sim f:combine \\
\\

combine&(\sigma_0 \Longrightarrow \sigma_0,\sigma_0T_0^{1}\sigma_1):: \\
(scan &[\sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:combine(\sigma_0 \Longrightarrow \sigma_0,\sigma_0T_0^{1}\sigma_1)  \\
\sim &f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_1&:: \\
(scan &[\sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_1 \quad\sim f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_2&::\sigma_0 \Longrightarrow \sigma_1:: \\
(scan &[\sigma_2T_2^{3}\sigma_3] \\
\sim &init:\sigma_0 \Longrightarrow \sigma_2 \quad\sim f:combine) \\
\\

\sigma_0 \Longrightarrow \sigma_3&::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1:: \\
\\

\sigma_0 \Longrightarrow \sigma_3&::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1::[] \\
\\

[\sigma_0 \Longrightarrow \sigma_3&; \; \sigma_0 \Longrightarrow \sigma_2; \; \sigma_0 \Longrightarrow \sigma_1]
\end{aligned}
\begin{aligned}
&scan [\sigma_0T_0^{1}\sigma^{1}; \; \sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \quad\sim init:\sigma_0 \Longrightarrow \sigma_0 \quad\sim f:combine \\
\\

&combine(\sigma_0 \Longrightarrow \sigma_0,\sigma_0T_0^{1}\sigma_1):: \\
\\

&\sigma_0 \Longrightarrow \sigma_1:: (scan [\sigma_1T_1^{2}\sigma^{2}; \; \sigma_2T_2^{3}\sigma^{3}] \quad\sim init:\sigma_0 \Longrightarrow \sigma_1 \quad\sim f:combine) \\
\\

&\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1:: (scan [\sigma_2T_2^{3}\sigma_3] \quad\sim init:\sigma_0 \Longrightarrow \sigma_2 \quad\sim f:combine) \\
\\

&\sigma_0 \Longrightarrow \sigma_3::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1:: (scan [] \quad\sim init:\sigma_0 \Longrightarrow \sigma_3 \quad\sim f:combine) \\
\\

&\sigma_0 \Longrightarrow \sigma_3::\sigma_0 \Longrightarrow \sigma_2::\sigma_0 \Longrightarrow \sigma_1::[] \\
\\

&[\sigma_0 \Longrightarrow \sigma_3; \; \sigma_0 \Longrightarrow \sigma_2; \; \sigma_0 \Longrightarrow \sigma_1]
\end{aligned}

Unfortunately, we have a serial dependency of proof construction here: you must have $\sigma_0 \Longrightarrow \sigma_i$ before getting $\sigma_0 \Longrightarrow \sigma_{i+1}$. This is very slow. When using Libsnark it takes ~20 seconds to do one of these steps on an 8 core cloud instance, and that’s just for a single transaction. This translates to merely 3 transactions per minute globally on the network!

What we’ll do in this blog post is find a better scan. A scan that maximizes throughput, doesn’t incur too much latency, and doesn’t require too much intermediate state. A scan that takes advantage of properties of the zk-SNARK primitives we have. We’ll do this by iterating on our design until we get something that best meets our requirements. Finally, we’ll talk about a few other potential use cases for such a scan outside of cryptocurrency.

## Requirements

Now that we understand the root problem, let’s talk about requirements to help guide us toward the best solution for this problem. We want to optimize our scan for the following features:

1. Maximize transaction throughput

Transaction throughput here refers to the rate at which transactions can be processed and validated in the Coda protocol network. Coda strives to be able to support low transaction fees and more simultaneous users on the network, so this is our highest priority.

1. Minimize transaction latency

7 The more we sacrifice latency the longer proposer nodes have to keep around full copies of the state before just relying on the small SNARK itself.

It’s important to minimize transaction latency to enter our SNARK to keep the low RAM requirements on proposer nodes, nodes that propose new transitions during Proof of Stake.7 SNARKing a transaction is not the same as knowing a transaction has been processed, so this is certainly less important for us than throughput.

1. Minimize size of state

Again, to keep low RAM requirements on proposer nodes we want to minimize the amount of data we need to represent one state.

And moreover, this is the order of importance of these goals from most to least important: Maximize throughput, minimize latency, minimize size of state.

## Properties

8 This is possible because of a cryptographic notion known as “Signature of Knowledge” which lets us embed information about the creator and a fee into the proof in a way that is unforgeable. We will talk more about how we use this information in another blog post.

• All SNARK proofs take one unit of time to complete
• Transactions arrive into the system at a constant rate $R$ per unit time
• We effectively have any number of cores we need to process transactions because we can economically incentivize actors to perform SNARK proofs and use transaction fees to pay those actors.8
• Two proofs can be recursively merged:

This merge operation is associative:

So we can actually write transaction SNARKs that effectively prove the following statements:

Base ($\sigma_i \Longrightarrow \sigma_{i+1}$)

There exists $\sigma_iT_i^{i+1}\sigma_{i+1}$ such that the transaction is valid

Merge ($\sigma_i \Longrightarrow \sigma_k$)

There exists $\sigma_i \Longrightarrow \sigma_j$ and $\sigma_j \Longrightarrow \sigma_k$ such that both proofs verify

Before we go any further, though, let’s abstract away some details here.

## Abstractions

Data:

$D_i := \; \; \sigma_iT_i^{i+1}\sigma_{i+1}$

Base work:

$B_i := \; \; \sigma_i \Longrightarrow \sigma_{i+1}$

Merge work:

$M_{ij} := \; \; \sigma_i \Longrightarrow \sigma_j$

Accumulated value:

$A_k := \; \; \sigma_0 \Longrightarrow \sigma_k$

Let’s say that data effectively enqueues a “Base job” that can be completed to become “Base work”. Similarly, two “Base work”s (or two “Merge works”s) can be combined in a “Merge job” to create “Merge work”.

## Initial Analysis

### Upper Bound

Let’s set an upper bound efficiency target for any sort of scan. No matter what we do we can’t do better than the following:

• Throughput: $R$ per unit time

We said new data was entering the system at a rate of $R$ per unit time, so the best we can do is complete the work as soon as it’s added.

• Latency: $O(1)$

In the best case, we don’t have to wait to get the piece of data included as part of the scan result. Whatever time it takes to do one step is the time it takes before our data is included in the scan result.

• Space: $O(1)$

We don’t need to store any extra information besides the most recent result.

As a reminder, we decided that the naive approach is just a standard linear scan. This “dumb scan” can be a nice lower bound on throughput, we can also analyze the other attributes we care about here:

### Linear Scan

• Throughput: $1$ per unit time

Our linear scan operation emits a result at every step and so we need the prior result before we can perform the next step.

• Latency: $O(1)$

Every step emits a single result based on the data

• Space: $O(1)$

We only have to hold on to the most recently accumulated result to combine with the next value.

Since our primary goal is to maximize throughput, it’s clear a linear scan isn’t appropriate.

## Parallel Periodic Scan

Recall that the merge operation is associative. This means that we can choose to evaluate more than one merge at the same time, thus giving us parallelism! Even though data are coming in only $R$ at a time, we can choose to hold more back to unlock parallel merge work later. Because we effectively have infinite cores we can get a massive speedup by doing work in parallel.

This gives rise to the notion of a “periodic scan”:

(* periodicScan 1->2->3->4->5->6->7->8
~init:0
~lift:(fun a -> a)
~merge:(fun a b -> a + b) =>
10->36
*)
val periodicScan :
'a Stream.t ->
~init:'b ->
~lift:('a -> 'b) ->
~merge:('b -> 'b -> 'b) ->
'b Stream.t
(* periodicScan 1->2->3->4->5->6->7->8
~init:0 ~lift:(fun a -> a)
~merge:(fun a b -> a + b) => 10->36
*)
val periodicScan : 'a Stream.t -> ~init:'b ->
~lift:('a -> 'b) -> ~merge:('b -> 'b -> 'b) -> 'b Stream.t

A scan that periodically emits complete values, not every time an 'a datum appears on a stream, but maybe every few times. This therefore has slightly different semantics than a traditional scan operation.

Rather than returning a stream emitting 1→3→6→10→15→21→28→36, we buffer data elements 1 through 4 and compute with those in parallel, and only emit the resulting sum, 10, when we’re done. Likewise we buffer 5 through 8, and combine that with 10 and emit that 36 when we’re done. We periodically emit intermediate results instead of doing so every time.

## Naive Implementation of Periodic Scan

Let’s go over this tree construction step-by-step, considering what happens to our data over time as it’s coming through into the system. Let’s consider $R = 8$.

First we gather $R$ pieces of data and enqueue $R$ Base jobs for our network to complete. We use $R$ of our cores and can complete all jobs in one time step. We hold back the data on the pipe, and we are forced to buffer it because we haven’t finished handling the first $R$.

As we add Base work, we give way for a series of Merge jobs that can be completed in the next step:

Now we have $\frac{R}{2}$ pieces of merge work to complete and we use $\frac{R}{2}$ cores and complete them in one time step.

We repeat until we reach the top of the tree. The completed Merge work at the top can be consumed by the rest of the system.

## Analysis

• Throughput: $\frac{R}{log(R)}$

Every $log(R)$ steps, we have the opportunity to consume $R$ more pieces of data.

• Latency: $O(log(R))$

It takes $log(R)$ time steps before we emit our top-level merge work as we half the nodes in each layer of our tree at each step.

• Space: $O(R)$

We now have to keep parts of a tree around at each step. Since our trees have $R$ leaves, typical binary trees have $2R-1$ nodes when completed, and we have an extra layer, we actually use $3R-1$ nodes.

### Naive Periodic Scan

For the purposes of visualization, unit time is being replaced with 60 seconds. We assume the space of a single node in the tree is 2KB.

Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.0333 180 ~22KB
$R=16$ 0.0667 300 ~94KB
$R=1024$ 1.71 660 ~6MB
$R=16384$ 19.5 900 ~98MB
Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.0333 180 ~22KB
$R=16$ 0.0667 300 ~94KB
$R=1024$ 1.71 660 ~6MB
$R=16384$ 19.5 900 ~98MB

### Serial Scan

Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.05 20 ~2KB
$R=16$ 0.05 20 ~2KB
$R=1024$ 0.05 20 ~2KB
$R=16384$ 0.05 20 ~2KB
Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.05 20 ~2KB
$R=16$ 0.05 20 ~2KB
$R=1024$ 0.05 20 ~2KB
$R=16384$ 0.05 20 ~2KB

We have increased throughput at the cost of some latency and space when compared with the serial approach, so this is a little bit better!

However, this solution leaves something to be desired—why must we halve our parallelism as we walk up each layer of the tree? We have a stream feeding us $R$ data values every unit of time, so we should have enough work to do. Shouldn’t we use this somehow?

## Better Solution

Let’s take advantage of the fact that we get $R$ new data values each time we complete work—still preferring earlier queued data values to minimize latency once we’ve exhausted available parallelism.

With this in mind, let’s trace a run-through, this time always making sure we have $R$ pieces of work to do at every step—for illustration, let’s pick $R=2$:

We do as we did before, but this time we have $R$ jobs to complete and can dispatch to our $R$ cores every step. We have exactly $log(R)$ trees pending at a time. At every step, we complete the first tree (tree zero) and at tree $i$, we complete layer $i$.

## Analysis

• Throughput: $R$

Throughput of work completion matches our stream of data! It’s perfect, we’ve hit our upper-bound.

• Latency: $O(log(R))$

9 Here’s a short informal proof: Note that any sort of reduction operation on $N$ pieces of data can’t be done faster than $O(log(N))$ span. If we assume we could handle our $R$ units that we enqueue at a time in fewer than $O(log(N))$ steps then since we’re doing a reduction operation we would be doing it faster than $O(log(N))$ which is a contradiction.

Latency is still logarithmic, though now it’s $log(R)+1$ steps as our trees have $R$ leaves and we an extra layer on the bottom for base jobs. In fact, this is actually the lower bound.9

• Space: $O(R*log(R))$

We have multiple trees now. Interestingly, we have exactly $log(R)$ trees pending at a time. Again our longer trees take up an extra layer than traditional binary trees, so in this case $3R-1$ nodes since we have $R$ leaves, and we have $log(R)$ of these trees.10

Now that we have thoroughly optimized our throughput and latency, let’s optimize for space.

## Optimize size

10 In order to prevent latency and space from growing over time, we need to make sure we complete work as fast as we add it.

Do we really need to hold all $log(R)$ trees? We only ever care about the frontier of work. All the information we need to perform the next layer of jobs. We clearly don’t need to store anything above that or below it in the trees.

Notice that we only use some of each layer of trees even across the $log(R)$ trees. And so we can represent the frontier of the $log(R)$ trees with only a single tree representing the work pipeline moving from leaves to the root in the following manner:

## Analysis

• Throughput: $R$

Throughput is the same as before.

• Latency: $O(log(R))$

Latency is the same as above.

• Space: $O(R)$

We’ve reduced our space back down to a single tree with leaves $3R-1$.

## Space Optimization

Do we really need that extra layer? If we change how we think about the problem, we can use a perfect binary tree which we can manipulate to save even more space:

Now we’re down to $2R-1$ nodes—a standard binary tree with $R$ leaves.

How do we store the tree? Since we know the size a priori (a complete binary tree with $R$ leaves), we can use a succinct representation.

11 This is a very interesting area of computer science research, and I very much recommend the curious to read more: See Zhou, et. al 2013 and wavelet trees.

12 In our case, just the cursor.

A succinct data structure requires only $o(Z)$ extra space to manage the relationship between the elements if $Z$ is the optimal number of bits that we need to express the information in an unstructured manner. Note that this is little-$o$ not big-$O$—a much tighter bound.11

In fact our structure as described is actually an implicit one because of our scalar cursor. An implicit data structure is one that uses only $O(1)$ extra bits.12 In later refinements (in part 2), we’ll go back to a succinct representation because we need to relax one of the assumptions we made here. This is similar to the popular implicit heap that you may have learned about in a computer science class.

## Final Analysis

• Throughput: $R$

Throughput keeps up with production rate $R$, so we couldn’t do better.

• Latency: $O(log(R))$

Latency is proportional to $log(R)$ steps, as we described earlier, so we don’t get hurt too badly there.

• Space: $2R-1 + O(1)$

We have an implicit data structure representation for our complete binary tree with $2R$ leaves as described above.

### Fully Optimized Scan

Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.0667 180 ~22KB
$R=16$ 0.267 300 ~94KB
$R=1024$ 17.1 660 ~6MB
$R=16384$ 273 900 ~98MB
$R=65536$ 1092 1020 ~393MB
Throughput (in data per second) Latency (in seconds) Space (in bytes)
$R=4$ 0.0667 180 ~22KB
$R=16$ 0.267 300 ~94KB
$R=1024$ 17.1 660 ~6MB
$R=16384$ 273 900 ~98MB
$R=65536$ 1092 1020 ~393MB

We went from a sequential solution that at $R=16384$ only handled a throughput of 0.05 data per second to an initial parallel solution that handled 19.5 data per second to a fully optimized solution that handles 273 data per second. Our final solution even has optimal latency and space characteristics.

We did it! Coda can now be limited in its throughput by the speed at which information can flow across the network, and no longer by the time it takes to construct a SNARK. Moreover, we solved a more general problem: Efficiently computing an online periodic parallel scan over an infinite stream for some associative operation.

## Other Use Cases

Deep space telescopes produce an astronomical amount of data per second. For example, the Square Kilometre Array will process petabits of data per second. If data frames are coming in faster than we can process them which is certainly true for some types of workloads like non-parametric machine learning, we can use this data structure to handle these streams.

More generally, certain map-reduce type workloads that act in an online fashion (on an infinite stream of inputs instead of a finite collection) with expensive operators, could benefit from using our same data structure.

You can also go through literature and try to find prior art. We didn’t find much searching through map-reduce papers. The only thing that was a bit related is a paper from the GPU programming world, but doesn’t address the infinite streaming bit. Please leave a comment if you want to share any related work.

## Conclusion

We were able to take advantage of parallelism and other properties of our system to materialize this general “periodic scan” problem of combining data streaming in online fashion which as we described doesn’t limit throughput at all, has optimal latency characteristics, and is succinct. With this data structure, Coda is free to take advantage of succinctness to offer a high-throughput with no risk of centralization!

In a future blog post, we’ll talk about instantiating this parametric structure with concrete parameters and how we instantiate our infinite core machine model by farming work out to the network. We’ll also talk about the optimization problem we have for choosing how to fill out these trees with completed work.

If you like this sort of stuff, we’re looking for open source contributors and hiring.

## Future work

We’ll explore modifying this structure to optimize latency in the presence of variable throughput. You can imagine that if we detect input data throughput becomes sufficiently slow we can remove a layer from the next virtual tree, and if it’s too fast we can add one. We haven’t yet explored how this will affect the further refinements we made on top of the virtual trees.

Additionally, we will want to explore a more efficient mechanism to share account states that are part of the scan tree to nodes that don’t care about the in-progress proofs, so that bandwidth-constrained nodes can still look up their most recent account states without waiting for a ledger proof to pop out of the tree.

## Appendix

We can reify this model with the following signature in the Coda codebase:

'a is the type of the top value and there’s some notion of an associative merging operation on the 'a values. 'd is the type of the data at the leaves that comes in at rate $R$.

val start : parallelism_log_2:int
-> ('a, 'd) State.t
(** The initial state of the parallel
scan at some parallelism *)

val next_jobs : state:('a, 'd) State.t
-> ('a, 'd) Available_job.t list
(** Get all the available jobs *)

val enqueue_data : state:('a, 'd) State.t
-> data:'d list
-> unit Or_error.t
(** Add data to parallel scan state *)

val free_space : state:('a, 'd) State.t
-> int
(** Compute how much data ['d] elements we
are allowed to add to the state *)

val fill_in_completed_jobs :
state:('a, 'd) State.t
-> completed_jobs:
'a State.Completed_job.t list
-> 'a option Or_error.t
(** Complete jobs needed at this state --
optionally emits the ['a] at the top of
the tree *)
val start : parallelism_log_2:int -> ('a, 'd) State.t
(** The initial state of the parallel scan at some parallelism *)

val next_jobs : state:('a, 'd) State.t -> ('a, 'd) Available_job.t list
(** Get all the available jobs *)

val enqueue_data : state:('a, 'd) State.t -> data:'d list -> unit Or_error.t
(** Add data to parallel scan state *)

val free_space : state:('a, 'd) State.t -> int
(** Compute how much data ['d] elements we are allowed to add to the state *)

val fill_in_completed_jobs :
state:('a, 'd) State.t
-> completed_jobs:'a State.Completed_job.t list
-> 'a option Or_error.t
(** Complete jobs needed at this state -- optionally emits the ['a] at the top
of the tree *)

13

## Acknowledgements

Thanks to Evan Shapiro for working through these data structures with me when we were first figuring this stuff out. Thanks to Deepthi Kumar for collaborating with me on several optimizations. Thanks to Laxman Dhulipala for taking a video call early on and helping with an early core insight about the multiple trees. Finally, thanks to Omer Zach and Corey Richardson (and Evan and Deepthi) for their very thorough feedback on drafts of this post!

1. If you’d rather consume this content in video form, watch this talk.

2. Equivalent to security as a full node.

3. Note that we represent account states concretely as their hashes for performance reasons.

4. The ~init in OCaml refers to a named argument, and 'a and 'b are a type unification variables that work similarly to generics in Java.

5. :: means “cons” or prepend to the front of a linked list.

6. We write streams as lists in the evaluation.

7. The more we sacrifice latency the longer proposer nodes have to keep around full copies of the state before just relying on the small SNARK itself.

8. This is possible because of a cryptographic notion known as “Signature of Knowledge” which lets us embed information about the creator and a fee into the proof in a way that is unforgeable. We will talk more about how we use this information in another blog post.

9. Here’s a short informal proof: Note that any sort of reduction operation on $N$ pieces of data can’t be done faster than $O(log(N))$ span. If we assume we could handle our $R$ units that we enqueue at a time in fewer than $O(log(N))$ steps then since we’re doing a reduction operation we would be doing it faster than $O(log(N))$ which is a contradiction.

10. In order to prevent latency and space from growing over time, we need to make sure we complete work as fast as we add it.

11. This is a very interesting area of computer science research, and I very much recommend the curious to read more: See Zhou, et. al 2013 and wavelet trees.

12. In our case, just the cursor.

13. 'a is the type of the top value and there’s some notion of an associative merging operation on the 'a values. 'd is the type of the data at the leaves that comes in at rate $R$.

# Functional programming + crypto reading list

### 2018-04-01

Recently a candidate sent us a very thoughtful email asking what kind of knowledge we’re looking for in a candidate. It set off a discussion at the office, as we wondered what direction we could provide candidates who might not yet have all of the perfect qualifications.

Below is our response to the candidate, giving guidance on what kinds of knowledge we are looking for in candidates.

None of us came into this field knowing everything. In fact, we came from quite varied backgrounds, and we all have areas we’re more or less comfortable in. So even though I’ll be giving a more comprehensive list below, it’s much more important to be excited about learning new things than having any particular expertise, and we’d be interested in talking to any candidate that knows some things in the list below and is capable and energized to tackle more.

So, after some deliberation, here are some basic resources to get you started. For our company, knowing the basics in cryptography (and cryptocurrency), functional programming, and math is probably the most important.

## Functional Programming

We’re building our protocol in OCaml, but you can learn the functional style in any functional programming language: Haskell, Scala, SML, and Elm are all great. You should be familiar with common techniques such as - modeling with types - how to read type signatures and recognize monads - algebraic data types - programming in an immutable style - higher order functions - pattern matching - recursion and so on.

If you write a medium-sized project you’ll learn a lot about these concepts, and probably become a more well-rounded programmer.

Some books we like are

## Math

For our purposes, some basic knowledge of abstract algebra is useful. Good things to understand are the definitions of groups and fields. Basic (discrete) probability theory is useful as well.

Programming for SNARKs can also be quite mathematical, and we think the skills provided by a background in mathematics are somewhat transferable to SNARK programming.

## Cryptography

Oded Goldreich’s Foundations of Cryptography is a classic, and it’ll give you a good grounding in fundamental concepts, including - hash functions - signature schemes - zero knowledge proofs.

For cryptocurrencies in particular, another great book is Bitcoin and Cryptocurrency Technologies, which is one of the best resources we’ve found on cryptocurrencies.

## Special topics

We also are working on some projects that could really use special expertise in a few areas. We of course don’t expect anyone to be an expert in all of these areas, but experience in one or two would certainly set a candidate apart.

• Programming Languages: Type systems, compilers, etc.
• Advanced Cryptography: Elliptic curve cryptography, zk-SNARKs/STARKs, formal treatment of cryptocurrency
• Security
• Systems Programming
• Distributed Systems

Again, we stress that a candidate need not have background in any of the above specialized areas, but these are the fields that will be useful to us as we build out our protocol, so if this stuff seems interesting to you, and you’d be willing to start learning a chunk of it, we’d be interested in talking. Click here to view job listings and apply.

# Snarky: A high-level language for verifiable computation

### 2018-03-11

Living in the world today requires giving up a lot of control. We give up control of our financial lives to banks and unaccountable credit bureaus. We give up control of our most intimate data to use online services like Facebook, Amazon, and Google. We give up control of our elections to voting-system companies who run opaque and unauditable elections. We even give up some control over our understanding of the world through exposure to false or misleading news stories.

But it doesn’t have to be this way. Cryptography as a discipline provides us with some of the tools necessary to regain some of this control over our resources and data while reducing the need to trust unaccountable actors.

One such tool is verifiable computation. A verifiable computation is a computation that produces an output along with a proof certifying something about that output. Until very recently, verifiable computation has been mostly theoretical, but recent developments in zk-SNARK constructions have helped maked it practical.

Verifiable computation makes it possible for you to be confident about exactly what other people are doing with your data. For example, it enables

• Online services that are forced to be transparent about what data of yours they have and how they are using it.
• Auditable elections that protect the privacy of your vote.
• Publishing stories that provably come from a reliable source, without leaking what that source is.
• Sending money to others without yielding control of your account or your privacy.

Verifiable computation is powered by zk-SNARKs. Right now, however, progamming directly with SNARKs is comparable to writing machine code by hand, and trusting “SNARK machine code” is a lot like trusting a compiled binary without the source code.

To fulfill the promise of verifiable computing as a tool for returning control and agency to individuals, their operation has to be made as transparent as possible. We can help accomplish that goal by making the properties that verifiable computations prove specifiable in languages that are as close as possible to the informal, inuitive properties we have in our minds. That way, individuals can trust the easy-to-understand high-level specifications, rather than opaque “SNARK machine code”.

In this post, I’ll describe how we at O(1) Labs are helping to bridge this gap and solve the transparency problem with our language Snarky for specifying verifiable computation.

# Verifiable computation

As mentioned above, a verifiable computation is a computation that computes an output along with a proof certifying something about that output.

For example,

• A government could compute the winner of an election and prove that they counted all the votes correctly.
• An advertising service could compute an ad to serve to you and prove that the ad was generated without using personal data (i.e., your race, income, political views, etc.)
• A website could compute a news-feed to send to you and prove that it was generated without access to your personal data (and thus free of targeted ads, content, etc.)
• A journalist could compute a story containing a quote from a source and prove that the quote came from a reliable source (without revealing which one).

# Verifiable elections

For this post, let’s focus on the example of a verifiable election. One place where people are clamoring for accountability is of course in how pizza toppings are chosen in group settings. (Heads up: I kept this example simple for exposition, which means there are a few flaws. I take no accountability for your pizza election.)

So let’s imagine you and your friends are trying to decide on what pizza topping to get (either pepperoni or mushroom) and you’d like to vote on a topping while keeping your votes as secret as possible.

Here is a picture of pizza to keep you interested.

Let’s say everyone trusts Alice to keep votes secret. She’ll act as the “government” by collecting everyone’s votes. But everyone also knows Alice loves mushroom pizza, which means we don’t necessarily trust her to run a fair election. So we’ll develop a scheme that gives everyone assurance that the election was run correctly (i.e., that each person’s vote was included and that the majority vote was computed correctly).

Using zk-SNARKs, we can write a verifiable computation which Alice can run to compute the majority vote and prove that it was computed correctly. Moreover, using the “zk” or zero-knowledge part of zk-SNARKs, she can do so in such a way that everyone can trust the result without learning any information about individuals’ votes.

# zk-SNARKs, technically

Simplifying a bit, zk-SNARK constructions give us the following ability. Say we have a set of polynomials $$p_1, \dots, p_k$$ in variables $$x_1, \dots, x_n, y_1, \dots, y_m$$. For given $$\alpha_1, \dots, \alpha_n$$, if we know $$\beta_1, \dots, \beta_m$$ such that $p_i(\alpha_1, \dots, \alpha_n, \beta_1, \dots, \beta_m) = 0$ we can produce a piece of data $$\pi$$ which somehow certifies our knowledge of such $$\beta_i$$s which has the following two properties: 1. Zero-knowledge: $$\pi$$ does not expose any information about $$\beta_1, \dots, \beta_m$$ 2. Succinctness: $$\pi$$ is very small (concretely, a few hundred bytes) and can be checked quickly (concretely, in milliseconds). Such a set of $$\beta$$s is called a satisfying assignment.

It turns out that such constraint systems are universal in the following sense. Given any (non-deterministic) verifiable computation, we can construct a constraint system so that the existence of a satisfying assignment is equivalent to the correctness of the computation.

So, it seems that zk-SNARKs gives us exactly what we want. Namely, a means to prove correctness of computations while hiding private information and saving parties from having to redo the computation themselves.

# Back to elections

With these SNARKs in hand, let’s return to our election example. The voting scheme will be as follows:

1. Each voter $$i$$ chooses a vote $$v_i$$ and a nonce $$b_i$$. They publish a commitment $$h_i = H(b_i, v_i)$$, where $$H$$ is some collision resistant hash function. They send $$(b_i, v_i)$$ to the government.
2. The government computes the majority vote $$w$$ and publishes it along with a SNARK proving “For each $$i$$, I know $$(b_i, v_i)$$ such that $$H(b_i, v_i) = h_i$$ and $$w$$ is the majority vote of the $$v_i$$”.
3. Voters verify the SNARK on their own against the public set of commitments $$(h_1, \dots, h_n)$$.

The zero knowledge property of the SNARK ensures that no votes are revealed to anyone except the government. So, to realize this scheme in practice, all we need to do is to encode the above statement as a constraint system. Here it is:

Click for full set

Great, we’re done! Er – well, maybe not. The trouble is that it’s basically impossible for anyone to verify that this constraint system does actually enforce the above property. I could have just chosen it at random, or maliciously. In fact it doesn’t actually force the property: I deleted a bunch of constraints to make this page load faster.

The situation is similar to programming in general: one doesn’t want to have to trust a compiled binary because it is difficult to verify that it is doing what one expects one to do. Instead, we write programs in high-level languages that are easier for people to verify, and then compile them to assembly.

Here, in order for it to be convincing that a constraint system actually does what one expects it to do, one would like it to be the result of running a trusted compiler on a high-level program that is more easily seen to be equivalent to the claim one wants to prove.

# Toward a programming language for verifiable computation

We’ll now describe Snarky, our OCaml DSL for verifiable computation. It’s a high-level language for describing verifiable computations so that their correctness is more transparent. First we describe the programming model of Snarky and then explain in more depth how this model is realized.

## Requests

The basic programming model is as follows. A “verifiable computation” will be an otherwise pure computation augmented with the ability to do the following two things:

1. Pause execution to ask its environment to provide it with a value and then resume execution using that value.
2. Assert that a constraint holds among some values, terminating with an exception if the constraint does not hold.
A verifiable computation requesting a value from its environment

To get a feel for the model, let’s see our election computation rendered in a pseudocode version of Snarky.

winner (commitments):
List.map commitments (fun commitment ->
(nonce, vote) = request (Open_ballot commitment)
assert (H(nonce, vote) = commitment)
return vote)

pepperoni_count =
count votes (fun v -> v = Pepperoni)

pepperoni_wins = pepperoni_count > commitments.length / 2
return (if pepperoni_wins then Pepperoni else Mushroom)

This is intended to define a function winner that takes as input a list of commitments and returns the majority vote of a set of votes corresponding to those commitments (assuming it doesn’t terminate with an exception). It obtains the corresponding votes by mapping over the commitments and for each one

• requesting that the environment provide it with such a vote (and nonce)
• asserting that the provided vote and nonce do in fact correspond to the commitment.

If winner(commitments) is run in an environment in which it terminates without an assertion failure and outputs w, we know that there were votes corresponding to commitments such that the majority vote was w. Snarky gives us a way to prove statements like this about computations.

$$\newcommand{\tild}{\widetilde}$$ Namely, given a verifiable computation $$P$$ (i.e., a computation that makes some requests for values and assertions of constraints) Snarky lets us compile $$P$$ into a constraint system $$\tild{P}$$ such that the following two are equivalent:

1. Some environment can provide $$P$$ with values to answer each request such that $$P$$ executes without an assertion failure.
2. Some environment can produce a satisfying assignment to $$\tild{P}$$.

In our case, the requests are for openings to each of the vote commitments, and the assertions check the correctness of the openings. So, reiterating, if Alice can prove winner(cs) = w for some commitments cs and winner w, she will have proved “I know a set of votes votes corresponding to the commitments cs such that the majority vote of votes is w”.

## Snarky concretely

Let’s take a look at what the above example actually looks like in Snarky

let winner commitments =
Checked.List.mapi commitments ~f:(fun i commitment ->
let%bind nonce, vote =
request Ballot.Opened.typ (Open_ballot i)
in
let%map () =
>>= Ballot.Closed.assert_equal commitment
in
vote)
in
let%bind pepperoni_count =
count votes ~f:(fun v -> Vote.(v = var Pepperoni))
in
let half = constant (Field.of_int (List.length commitments / 2)) in
let%bind pepperoni_wins = pepperoni_count > half in
Vote.(if_ pepperoni_wins ~then_:(var Pepperoni) ~else_:(var Mushroom))

There’s a bit of noise caused by the harsh realities of OCaml’s monad syntax, but overall it is quite close to our pseudocode. We

1. Map over the commitments, requesting for our environment to open them.
2. Compute the number of votes for pepperoni.
3. If the number of pepperoni votes is greater than half the votes, return pepperoni as the winner, and otherwise return mushroom.

## Handling requests

We must provide a mechanism for handling requests made by verifiable computations to pass in requested values (similar to the way we write exception handlers). In Snarky, this looks like

handle
(winner commitments)
(fun (With {request; respond}) ->
match request with
| Open_ballot i -> respond ballots.(i)
| _ -> unhandled)

where ballots : Ballot.Opened.t array is the array of opened ballots that the government has access to.

The request/handler model has a few nice features. In particular,

1. It allows one to program in a direct style by pretending one has magical access to requested values.
2. It makes a clear distinction between values that are directly computed and non-deterministically chosen values that need to be constrained (with assertions) to ensure correctness.
3. It makes testing verifiable computations simple, as one can set up “malicious” handlers that provide values other than the intended ones. The purpose here is to check that the assertions made by the computation do in fact rule out all values besides the intended ones.

# Wrapping up

Snarky helps us bridge the gap between high-level properties we want to prove using verifiable computations, and the low-level constraint systems we need to provide to SNARK constructions. It brings the promise of accountability and control over personal data through verifiable computing one step closer to practicality. The code is available on github, and we at O(1) Labs are using it in the development of our new cryptocurrency protocol that aims to power the examples described above and more.

If you find what we’re doing interesting, we’re hiring. You can find more info here. You can also sign up for our mailing list here.