Workflows

This document details how to construct congregation workflows. It begins by illustrating
the basic anatomy of a congregation workflow, and follows with concrete example workflows over
both plaintext data and data that is input as secret shares.

Workflow anatomy

The following objects are required to construct congregation workflows:

1. `create_column()` - A wrapper function that returns a `Column` object.
   `Column` objects store various metadata and trust relations for columns
   in a dataset.
2. `lang` - The query language module for congregation. All workflow queries
   are included in the `lang` module.
3. `Assemble` - A wrapper class used to orchestrate the various modules within
   congregation. At a high level, `Assemble` provides an interface with a single
   entrypoint (`generate_and_dispatch()`) for launching congregation workflows.

These objects can be imported in the usual way:

from congregation import create_column, lang, Assemble

Workflows themselves are composed of the following objects:

1. Some protocol, written as a Python function. This function must return the root
   nodes of the protocol (i.e., the objects returned by the `create()` calls
   in the protocol function).
2. A configuration object. The configuration object itself is simply a JSON file that
   contains the information congregation needs to configure its own compute backends
   correctly (a minimal sketch appears after this list).
3. An `Assemble()` object, whose `generate_and_dispatch()` method takes the protocol and
   configuration objects as parameters.

The examples below will focus on the protocol function, and omit the configuration and Assemble
steps, since they're mostly boilerplate. For reference, the reader can assume that each protocol
documented below uses the following configuration and Assemble code to run it:

import json
import sys

cfg = json.loads(open(sys.argv[1], "r").read())

# the Assemble class (at congregation/assemble/assemble.py) exposes
# all the main steps related to workflow generation and dispatching
a = Assemble()
a.generate_and_dispatch(protocol, cfg)
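
Since the boilerplate above reads its configuration path from sys.argv[1], each party
would launch the workflow script with something like the following (my_workflow.py and
my_config.json are hypothetical filenames):

python my_workflow.py my_config.json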

Example workflow 1: Using plaintext data

def protocol():

    # 3 columns for party 1
    a = create_column("a", "INTEGER")
    b = create_column("b", "INTEGER")
    c = create_column("c", "INTEGER")

    # 3 columns for party 2
    d = create_column("d", "INTEGER")
    e = create_column("e", "INTEGER")
    f = create_column("f", "INTEGER")

    # 3 columns for party 3
    g = create_column("g", "INTEGER")
    h = create_column("h", "INTEGER")
    i = create_column("i", "INTEGER")

    # create all input relations
    rel_one = lang.create("in1", [a, b, c], {1})
    rel_two = lang.create("in2", [d, e, f], {2})
    rel_three = lang.create("in3", [g, h, i], {3})

    # concatenate input relations
    cc = lang.concat([rel_one, rel_two, rel_three], "cc")

    # simple sum aggregation over column "a", grouped by the values
    # from column "b".
    # note that since we didn't explicitly pass column names to concat(),
    # the columns in its output relation were named according to the first
    # relation in its input (here, relation rel_one).
    agg = lang.aggregate(cc, "agg", ["b"], "a", "sum")

    # reveal output to parties 1, 2, and 3
    lang.collect(agg, {1, 2, 3})

    # return the workflow's root nodes
    return {rel_one, rel_two, rel_three}

The above workflow takes place between 3 parties on 3 different machines, each with its own
input file (in1.csv, in2.csv, and in3.csv, respectively). Each input file contains 3 columns.
Thus, the input file corresponding to rel_one could look something like the following:

a,b,c
1,2,3
4,5,6
7,8,9
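
To make the aggregation concrete, suppose (purely for illustration) that all three parties
happened to hold exactly the rows shown above. The concatenated relation cc would then contain
each row three times, and summing column "a" grouped by column "b" would yield the following
values (the actual column order of the output relation may differ):

b,a
2,3
5,12
8,21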

Since no input_path argument was supplied to the create() calls, congregation will resolve the
path to each input dataset (one per machine) as <cfg["general"]["data_path"]>/<name>.csv. For
example, if the data_path configuration for party 1 is /tmp/data/, congregation will assume that
the data for the rel_one dataset exists at /tmp/data/in1.csv. Alternatively, one can override
this behavior by specifying an input_path to create() as follows:

rel_one = lang.create("in1", [a, b, c], {1}, input_path="/home/me/my_input.csv")

Note that congregation only needs to resolve an input path for each party individually, so there is no
need for party 1 to specify input paths for rel_two and rel_three.
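
Concretely, under the default resolution just described, a deployment in which every party's
data_path is /tmp/data/ would lay out its files as follows (one machine per party; paths are
illustrative):

party 1's machine: /tmp/data/in1.csv   (data for rel_one)
party 2's machine: /tmp/data/in2.csv   (data for rel_two)
party 3's machine: /tmp/data/in3.csv   (data for rel_three)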

Example workflow 2: Using secret shared data

def protocol():

    # 3 columns for input dataset (shared between parties {1, 2, 3})
    a = create_column("a", "INTEGER")
    b = create_column("b", "INTEGER")
    c = create_column("c", "INTEGER")

    # create shared input relation
    rel_one = lang.create("inpt", [a, b, c], {1, 2, 3})

    # simple sum aggregation over column "a", grouped by column "b"
    agg = lang.aggregate(rel_one, "agg", ["b"], "a", "sum")

    # reveal output to parties 1, 2, and 3
    lang.collect(agg, {1, 2, 3})

    # return the workflow's root node
    return {rel_one}

In this example there is a single input dataset, but it has been secret shared and distributed
across 3 distinct input parties. Since each party is effectively inputting shares that correspond
to the same dataset, we only need one create() call. To indicate that the input data for this call
is already secret shared, we pass {1, 2, 3} to the create() call, which tells congregation that
parties 1, 2, and 3 all hold shares of the data for this relation. (Contrast this with example 1,
where each create() call received a singleton party set such as {1}.)
