-
Notifications
You must be signed in to change notification settings - Fork 4
Workflows
This document will detail how to construct congregation workflows. It will begin by illustrating
the basic anatomy of a congregation workflow, and will follow with concrete example workflows over
both plaintext data and data that is inputted as secret shares.
The following objects are required to construct congregation workflows:
1. create_column() - A wrapper function that returns a `Column` object. \
`Column` objects store various metadata and trust relations for columns \
in a dataset.
2. lang - The query language module for congregation. All workflow queries \
are included in the lang module.
3. Assemble - A wrapper class used to orchestrate the various modules within \
congregation. At a high level, `Assemble` provides an interface with a single \
entrypoint (`generate_and_dispatch()`) for launching congregation workflows.
These objects can be imported in the usual way:
from congregation import create_column, lang, Assemble
Workflows themselves are composed of the following objects:
1. Some protocol, written as a python function. This function must return the root \
nodes of the protocol (i.e. - the objects returned by the `create()` functions \
in the protocol function).
2. A configuration object. The configuration object itself is simply a JSON file that \
contains information that congregation needs to configure it's own compute backends \
correctly.
3. An `Assemble()` object, whose function `generate_and_dispatch` takes the protocol and \
configuration objects as parameters.
The examples below will focus on the protocol
function, and omit the configuration and Assemble
steps, since they're mostly boilerplate. For reference, the reader can assume that each protocol
documented below uses the following configuration and Assemble
code to run it:
cfg = json.loads(open(sys.argv[1], "r").read())
# the Assemble class (at congregation/assemble/assemble.py) exposes
# all the main steps related to workflow generation and dispatching
a = Assemble()
a.generate_and_dispatch(protocol, cfg)
def protocol():
# 3 columns for party 1
a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")
# 3 columns for party 2
d = create_column("d", "INTEGER")
e = create_column("e", "INTEGER")
f = create_column("f", "INTEGER")
# 3 columns for party 3
g = create_column("g", "INTEGER")
h = create_column("h", "INTEGER")
i = create_column("i", "INTEGER")
# create all input relations
rel_one = lang.create("in1", [a, b, c], {1})
rel_two = lang.create("in2", [d, e, f], {2})
rel_three = lang.create("in3", [g, h, i], {3})
# concatenate input relations
cc = lang.concat([rel_one, rel_two, rel_three], "cc")
# simple sum aggregation over column "a", grouped by the values
# from column "b".
# note that since we didn't explicitly pass column names to concat(),
# the columns in its output relation were named according to the first
# relation in its input (here, relation rel_one).
agg = lang.aggregate(cc, "agg", ["b"], "a", "sum")
# reveal output to parties 1, 2, and 3
lang.collect(agg, {1, 2, 3})
# return the workflow's root nodes
return {rel_one, rel_two, rel_three}
The above workflow takes place between 3 parties on 3 different machines, each with their own
input files in1.csv
, in2.csv
, and in3.csv
, respectively. The input files themselves contain
3 columns each. Thus, the input file corresponding to rel_one
could look something like the following: \
a,b,c
1,2,3
4,5,6
7,8,9
Since no input_path
argument was supplied to each create()
function call, congregation will resolve the
path to each input dataset (one per machine) as < cfg["general"]["data_path"] > / < name >.csv
. For example,
rel_one
if the data_path
configuration for party 1 is /tmp/data/
, congregation will assume that the data
for the rel_one
dataset exists at /tmp/data/in1.csv
. Alternatively, one can override this behavior by
specifying an input_path
to create()
as follows:
rel_one = create("in1", [a, b, c], {1}, input_path="/home/me/my_input.csv")
Note that congregation only needs to resolve an input path for each party individually, so there is no
need for party 1 to specify input paths for rel_two
and rel_three
.
def protocol():
# 3 columns for input dataset (shared between parties {1, 2, 3}
a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")
# create shared input relation
rel_one = lang.create("inpt", [a, b, c], {1, 2, 3})
# simple sum aggregation over column "a", with no key columns
agg = lang.aggregate(rel_one, "agg", ["b"], "a", "sum")
# reveal output to parties 1, 2, and 3
lang.collect(agg, {1, 2, 3})
# return the workflow's root node
return {rel_one}
In this example there is a single input dataset, but it has been secret shared and distributed
across 3 distinct input parties. Since each party is effectively inputting which corresponds to
the same dataset, we only need one create()
call. To indicate that the input data for this call
is already secret shared, we pass {1, 2, 3}
to the create()
call, which tells congregation that
parties 1, 2, and 3 all hold shares of the data for this relation.