-
Notifications
You must be signed in to change notification settings - Fork 4
Examples
This section will provide examples of input and output data for each query across various input
parameters. For method signature documentation, refer to the Queries section. For convenience,
we'll refer to the following input relations as input_a
, input_b
, and input_c
, respectively:
a,b,c
1,2,3
4,5,6
1,2,3
7,8,9
2,4,5
d,e,f
4,5,6
2,3,4
1,3,4
2,5,6
7,3,4
g,h,i
1,2,3
1,2,4
7,9,2
7,9,2
4,5,6
We will further assume that these relations are the output of the following create()
relations:
a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")
d = create_column("d", "INTEGER")
e = create_column("e", "INTEGER")
f = create_column("f", "INTEGER")
input_a = create("input_a", [a, b, c], {1})
input_a = create("input_b", [d, e, f], {2})
Note that the create()
relations above do not require that the variable be named identically
to the relation name (input_a does not need to be named "input_a"), and were only named that way
here for convenience.
The aggregate()
function can be used to compute either the sum, mean, standard deviation,
or variance over a relation. Further, this computation can be grouped according to the values
in a given column before applying the aggregation to a target column, or just applied to a
target column with no grouping done beforehand. The following examples will illustrate only the
"sum" computation, but the other aggregate types are computed analogously.
# grouped
agg_sum_grouped = aggregate(input_a, "agg_sum", ["a"], "b", "sum", "sum_col")
"""
outputs:
a,sum_col
1 4
2 4
4 5
7 8
"""
# ungrouped
agg_sum_ungrouped = aggregate(input_a, "agg_sum", [], "b", "sum", "sum_col")
"""
outputs:
sum_col
21
"""
The min_max_median()
function simply constructs 3 columns whose values correspond to the minimum,
maximum, and median values of the input relation, with respect to a target column. Note that while
the method signature for this function includes a group_col_names
argument, the JIFF code generation
does not currently support grouped operations of this type. Though this will be added in the future, at
present the min_man_median()
function only supports ungrouped computations at present.
mmm = min_max_median(input_a, "mmm", [], "c")
"""
outputs:
__MIN__,__MAX__,__MEDIAN__
3,9,5
"""
The deciles()
function produces 9 columns whose values correspond to the 1-Decile, 2-Decile, ..., 9-Decile
values of the input relation, with respect to a target column. Like the min_max_median()
function above,
column grouping is forthcoming, despite being included in the method signature.
dec = deciles(input_a, "dec", [], "a")
"""
outputs:
1, 1, 2, 4, 7
1-DECILE,2-DECILE,3-DECILE,4-DECILE,5-DECILE,6-DECILE,7-DECILE,8-DECILE,9-DECILE
1,1,1,2,2,4,4,7,7
"""
Like the above aggregate()
function, aggregate_count()
can compute a row count over a relation that
is either grouped by a key column or completely ungrouped. Note that an ungrouped aggregate_count()
computation is equivalent to the num_rows()
function.
# grouped
agg_count_grouped = aggregate_count(input_a, "agg_count", ["a"], "a_count")
"""
outputs:
a,a_count
1,2
2,1
4,1
7,1
"""
# ungrouped
agg_count_ungrouped = aggregate_count(input_a, "agg_count", [], "row_count")
"""
outputs:
row_count
5
"""
The all_stats()
function produces 17 output columns, each representing one statistical operation from
the following: sum, mean, variance, standard deviation, min/max/median, deciles (1-9), and row count.
# ungrouped
c = concat([input_a, input_b, input_c], "concatenated")
a = all_stats(c, "all_stats", [], "b")
"""
outputs:
__SUM__,__MEAN__,__VARIANCE__,__STD_DEV__,__MIN__,__MAX__,__MEDIAN__,__1_DECILE__,__2_DECILE__,__3_DECILE__,__4_DECILE__,__5_DECILE__,__6_DECILE__,__7_DECILE__,__8_DECILE__,__9_DECILE__,__COUNT__
67,4,9,3,2,9,4,2,2,3,3,4,5,5,8,9,15
"""
The project()
function simply selects out a list of columns from a relation, and returns a
new relation with the unselected columns dropped.
proj = project(input_a, "proj", ["a", "b"])
"""
outputs:
a,b
1,2
4,5
1,2
7,8
2,4
"""
Columns can also be reordered:
proj = project(input_a, "proj", ["b", "a"])
"""
outputs:
b,a
2,1
5,4
2,1
8,7
4,2
"""
The add()
function can add an arbitrarily long list of integers and columns to an existing column.
added = add(input_a, "added", "c", ["b", 7])
"""
outputs:
a,b,c
1,2,12
4,5,18
1,2,12
7,8,24
2,4,16
"""
# can also create a new column whose values are the sum of the operands
added_new_column = add(input_a, "added", "new_column", ["b", 7])
"""
outputs:
a,b,c,new_column
1,2,3,9
4,5,6,12
1,2,3,9
7,8,9,15
2,4,5,11
"""
Note, though, that any columns passed to the operands
list must already exist in the input relation.
For example, the following would return an error:
# can't supply column from input_b relation
added_wrong = add(input_a, "added_wrong", "c", ["d", 7])
For brevity, examples for subtract()
, multiply()
, and divide()
are omitted, since their syntax
and behavior are identical to the behavior of add()
.
The limit()
function simply preserves a specified number of rows from an input relation, and drops
the rest.
lim = limit(input_a, "lim", 2)
"""
outputs:
a,b,c
1,2,3
4,5,6
"""
The distinct()
function removes duplicate rows from an input relation, with respect to some list of
columns.
dis = distinct(input_c, "dis", ["g", "h", "i"])
"""
outputs:
g,h,i
1,2,3
1,2,4
7,9,2
4,5,6
"""
dis = distinct(input_c, "dis", ["g", "h"])
"""
outputs:
g,h,i
1,2,3
7,9,2
4,5,6
"""
The filter_by()
function is used to filter rows from a dataset, based on some column's relationship to
either another input column or a scalar. We can thus construct expressions like:
* "preserve all rows from relation input_a whose values in column a are less than 2"
* "preserve all rows from relation input_c whose values in column g are greater than the values in column i"
These expressions would be written as follows:
input_op_node: < Node >,
name: < str >,
filter_col_name: < str >,
operator: < str >,
filter_against: < str, int > \
fb = filter_by(input_a, "fb", "a", "<", 2)
"""
outputs:
a,b,c
1,2,3
1,2,3
"""
fb = filter_by(input_c, "fb", "g", ">", "i")
"""
outputs:
g,h,i
7,9,2
7,9,2
"""
The sort_by()
function sorts the rows of an input relation with respect to a certain key column.
s = sort_by(input_a, "sorted", "a")
"""
outputs:
a,b,c
1,2,3
1,2,3
2,4,5
4,5,6
7,8,9
"""
# can also sort in decreasing order
s = sort_by(input_a, "sorted", "a", increasing=False)
"""
outputs:
a,b,c
7,8,9
4,5,6
2,4,5
1,2,3
1,2,3
"""
The num_rows()
function just counts the number of rows in an input relation, and returns a single column
and row with that value.
nr = num_rows(input_a, "nr")
"""
outputs:
num_rows
5
"""
# can also name the count column:
nr = num_rows(input_a, "nr", count_col_name="number_of_rows")
"""
outputs:
number_of_rows
5
"""
The join()
function performs a standard join over specified columns from two input relations:
j = join(input_a, input_c, "joined", ["b", "c"], ["h", "i"])
"""
outputs:
b,c,a,g
2,3,1,1
5,6,4,4
2,3,1,1
"""
The concat()
function produces a vertically concatenated relation from a list of input relations:
c = concat([input_a, input_b, input_c], "concatenated")
"""
outputs:
a,b,c
1,2,3
4,5,6
1,2,3
7,8,9
2,4,5
4,5,6
2,3,4
1,3,4
2,5,6
7,3,4
1,2,3
1,2,4
7,9,2
7,9,2
4,5,6
"""
# can also optionally rename columns in output relation
c = concat([input_a, input_b], "concatenated", ["j", "k", "l"])
"""
outputs:
j,k,l
1,2,3
4,5,6
1,2,3
7,8,9
2,4,5
4,5,6
2,3,4
1,3,4
2,5,6
7,3,4
"""