Examples

This section will provide examples of input and output data for each query across various input
parameters. For method signature documentation, refer to the Queries section. For convenience,
we'll refer to the following input relations as input_a, input_b, and input_c, respectively:

input_a

a,b,c
1,2,3
4,5,6
1,2,3
7,8,9
2,4,5

input_b

d,e,f
4,5,6
2,3,4
1,3,4
2,5,6
7,3,4

input_c

g,h,i
1,2,3
1,2,4
7,9,2
7,9,2
4,5,6

We will further assume that these relations are the output of the following create() relations:

a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")

d = create_column("d", "INTEGER")
e = create_column("e", "INTEGER")
f = create_column("f", "INTEGER")

input_a = create("input_a", [a, b, c], {1})
input_a = create("input_b", [d, e, f], {2})

Note that the create() relations above do not require that the variable be named identically
to the relation name (input_a does not need to be named "input_a"), and were only named that way
here for convenience.

aggregate

The aggregate() function can be used to compute either the sum, mean, standard deviation,
or variance over a relation. Further, this computation can be grouped according to the values
in a given column before applying the aggregation to a target column, or just applied to a
target column with no grouping done beforehand. The following examples will illustrate only the
"sum" computation, but the other aggregate types are computed analogously.

# grouped
agg_sum_grouped = aggregate(input_a, "agg_sum", ["a"], "b", "sum", "sum_col")

"""
outputs:

    a,sum_col
    1 4
    2 4
    4 5
    7 8
"""

# ungrouped
agg_sum_ungrouped = aggregate(input_a, "agg_sum", [], "b", "sum", "sum_col")

"""
outputs:

    sum_col
    21
"""

min_max_median

The min_max_median() function simply constructs 3 columns whose values correspond to the minimum,
maximum, and median values of the input relation, with respect to a target column. Note that while
the method signature for this function includes a group_col_names argument, the JIFF code generation
does not currently support grouped operations of this type. Though this will be added in the future, at
present the min_man_median() function only supports ungrouped computations at present.

mmm = min_max_median(input_a, "mmm", [], "c")

"""
outputs: 

    __MIN__,__MAX__,__MEDIAN__
    3,9,5
"""

deciles

The deciles() function produces 9 columns whose values correspond to the 1-Decile, 2-Decile, ..., 9-Decile
values of the input relation, with respect to a target column. Like the min_max_median() function above,
column grouping is forthcoming, despite being included in the method signature.

dec = deciles(input_a, "dec", [], "a")

"""
outputs:
1, 1, 2, 4, 7
    1-DECILE,2-DECILE,3-DECILE,4-DECILE,5-DECILE,6-DECILE,7-DECILE,8-DECILE,9-DECILE
    1,1,1,2,2,4,4,7,7
"""

aggregate_count

Like the above aggregate() function, aggregate_count() can compute a row count over a relation that
is either grouped by a key column or completely ungrouped. Note that an ungrouped aggregate_count()
computation is equivalent to the num_rows() function.

# grouped
agg_count_grouped = aggregate_count(input_a, "agg_count", ["a"], "a_count")

"""
outputs:
    
    a,a_count
    1,2
    2,1
    4,1
    7,1
"""

# ungrouped
agg_count_ungrouped = aggregate_count(input_a, "agg_count", [], "row_count")

"""
outputs:

    row_count
    5
"""

project

The project() function simply selects out a list of columns from a relation, and returns a
new relation with the unselected columns dropped.

proj = project(input_a, "proj", ["a", "b"])

"""
outputs:

    a,b
    1,2
    4,5
    1,2
    7,8
    2,4
"""

Columns can also be reordered:

proj = project(input_a, "proj", ["b", "a"])

"""
outputs:

    b,a
    2,1
    5,4
    2,1
    8,7
    4,2
"""

add

The add() function can add an arbitrarily long list of integers and columns to an existing column.

added = add(input_a, "added", "c", ["b", 7])

"""
outputs:

    a,b,c
    1,2,12
    4,5,18
    1,2,12
    7,8,24
    2,4,16
"""

# can also create a new column whose values are the sum of the operands
added_new_column = add(input_a, "added", "new_column", ["b", 7])

"""
outputs:

    a,b,c,new_column
    1,2,3,9
    4,5,6,12
    1,2,3,9
    7,8,9,15
    2,4,5,11
"""

Note, though, that any columns passed to the operands list must already exist in the input relation.
For example, the following would return an error:

# can't supply column from input_b relation
added_wrong = add(input_a, "added_wrong", "c", ["d", 7])

subtract, multiply, divide

For brevity, examples for subtract(), multiply(), and divide() are omitted, since their syntax
and behavior are identical to the behavior of add().

limit

The limit() function simply preserves a specified number of rows from an input relation, and drops
the rest.

lim = limit(input_a, "lim", 2)

"""
outputs:
    a,b,c
    1,2,3
    4,5,6
"""

distinct

The distinct() function removes duplicate rows from an input relation, with respect to some list of
columns.

dis = distinct(input_c, "dis", ["g", "h", "i"])

"""
outputs:

    g,h,i
    1,2,3
    1,2,4
    7,9,2
    4,5,6
"""

dis = distinct(input_c, "dis", ["g", "h"])

"""
outputs:

    g,h,i
    1,2,3
    7,9,2
    4,5,6
"""

filter_by

The filter_by() function is used to filter rows from a dataset, based on some column's relationship to
either another input column or a scalar. We can thus construct expressions like:

* "preserve all rows from relation input_a whose values in column a are less than 2"
* "preserve all rows from relation input_c whose values in column g are greater than the values in column i"

These expressions would be written as follows:    input_op_node: < Node >,
   name: < str >,
   filter_col_name: < str >,
   operator: < str >,
   filter_against: < str, int > \

fb = filter_by(input_a, "fb", "a", "<", 2)

"""
outputs:

    a,b,c
    1,2,3
    1,2,3
"""

fb = filter_by(input_c, "fb", "g", ">", "i")

"""
outputs:

    g,h,i
    7,9,2
    7,9,2
"""

sort_by

The sort_by() function sorts the rows of an input relation with respect to a certain key column.

s = sort_by(input_a, "sorted", "a")

"""
outputs: 

    a,b,c
    1,2,3
    1,2,3
    2,4,5
    4,5,6
    7,8,9
"""

# can also sort in decreasing order
s = sort_by(input_a, "sorted", "a", increasing=False)

"""
outputs: 

    a,b,c
    7,8,9
    4,5,6
    2,4,5
    1,2,3
    1,2,3
"""

num_rows

The num_rows() function just counts the number of rows in an input relation, and returns a single column
and row with that value.

nr = num_rows(input_a, "nr")

"""
outputs:

    num_rows
    5
"""

# can also name the count column:
nr = num_rows(input_a, "nr", count_col_name="number_of_rows")

"""
outputs:

    number_of_rows
    5
"""

join

The join() function performs a standard join over specified columns from two input relations:

j = join(input_a, input_c, "joined", ["b", "c"], ["h", "i"])

"""
outputs:

    b,c,a,g
    2,3,1,1
    5,6,4,4
    2,3,1,1
"""

concat

The concat() function produces a vertically concatenated relation from a list of input relations:

c = concat([input_a, input_b, input_c], "concatenated")

"""
outputs:
    a,b,c
    1,2,3
    4,5,6
    1,2,3
    7,8,9
    2,4,5
    4,5,6
    2,3,4
    1,3,4
    2,5,6
    7,3,4
    1,2,3
    1,2,4
    7,9,2
    7,9,2
    4,5,6
"""

# can also optionally rename columns in output relation
c = concat([input_a, input_b], "concatenated", ["j", "k", "l"])

"""
outputs:

    j,k,l
    1,2,3
    4,5,6
    1,2,3
    7,8,9
    2,4,5
    4,5,6
    2,3,4
    1,3,4
    2,5,6
    7,3,4
"""

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examples

Examples

input_a

input_b

input_c

aggregate

min_max_median

deciles

aggregate_count

project

add

subtract, multiply, divide

limit

distinct

filter_by

sort_by

num_rows

join

concat

Clone this wiki locally