Add batch support to generator framework #14

Closed

Conversation

nickmoorman
Contributor

@nickmoorman nickmoorman commented Nov 28, 2018

This pull request adds batching support to the generator framework (issue #13). To enable batching in a generator, users can simply add a block like this to their generator's configuration:

batching:
  batch_size: 5

If this configuration is present, batching is assumed to be enabled. If a user wants to keep the batching configuration in place but not actually use it (for example, during development), disabled: true can be added to fall back to the original non-batching generator flow.

item_name = {{ item_name_builder }}(index, item)
blocklist_match = any(re.match(i, item_name) for i in {{ blocklist }})
if blocklist_match:
    continue

{% set item_input = 'item' %}
{% set name_input = 'item_name' %}
{% endif %}
{{ node.target | sanitize_operator_name }}_builder(
Contributor Author

I broke this out of the conditional blocks and templated item_input and name_input to avoid repetition, but I'm actually not sure if that's the best idea here... Thoughts?

def {{ batch_name_builder }}(index, items):
    return 'batch_%d_%d' % (index, len(items))

{# TODO: Import this from some util module #}
Contributor Author

I'm not quite sure of the most appropriate way to do this...

Member

aw yeah that's not really an option for us right now.

Contributor Author

Noted. I'll update the comment to reference such a time in the future where this functionality exists.
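For reference, the {{ batch_name_builder }} template above renders to a plain Python function that names each batch by its index and size. A hypothetical rendering (the function name my_batch_name_builder is invented here for illustration):

```python
# Hypothetical rendering of the {{ batch_name_builder }} template above;
# the concrete function name is invented for this sketch.
def my_batch_name_builder(index, items):
    return 'batch_%d_%d' % (index, len(items))

print(my_batch_name_builder(0, ['a', 'b', 'c']))  # batch_0_3
```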

@@ -422,8 +422,9 @@ def _build_task_id(self, execution_context):
        return base_name

    suffix_mode = execution_context.referrer.item.get('auto_task_id_mode')
    if not suffix_mode or suffix_mode == 'item_name':
        return base_name + '-<<item_name>>'
    name_var = 'batch_name' if execution_context.referrer.item.get('batching', {'enabled': False})['enabled'] else 'item_name'
Contributor Author

I feel like there's probably a better way to do this... 🤔

if not suffix_mode or suffix_mode == 'item_name':
    return base_name + '-<<item_name>>'
name_var = 'batch_name' if execution_context.referrer.item.get('batching', {'enabled': False})['enabled'] else 'item_name'
if not suffix_mode or suffix_mode == name_var:
Contributor Author

Hmm, should probably support either batch_name or item_name when batching is enabled, but only item_name if batching is not enabled. Thoughts?

Member

Oh yup, if I had read this comment first I would have typed a lot less up there ^^ haha

Contributor Author

On second thought, I don't think it makes sense to use item_name in a batching scenario. There's not really a good way to construct the item name in that case, so I think we'd have to use batch_name. Let me know if I'm missing something here... Otherwise, I'll add some validation around this.

Member

👍

@@ -422,8 +422,10 @@ def _build_task_id(self, execution_context):
        return base_name

    suffix_mode = execution_context.referrer.item.get('auto_task_id_mode')
    if not suffix_mode or suffix_mode == 'item_name':
        return base_name + '-<<item_name>>'
    batching_config = execution_context.referrer.item.get('batching', {'enabled': False})
Contributor Author

I feel like there's probably a better way to do this... 🤔

def {{ batch_name_builder }}(index, items):
    return 'batch_%d_%d' % (index, len(items))

{# TODO: Import this from some util module #}
Member

aw yeah that's not really an option for us right now.

)):
)

{% if node.batching.enabled %}
Member

What if, instead of using a conditional block for this, you could instead:

  • use a default batch size of 1
  • create the {{item_name}} variable inside the builder function only if the batch size is 1
  • have the batch-name builder default to just choosing the item_name if the batch size is 1?

I think something along these lines might simplify a lot of the code, because many lines and branches have to go into dealing with differences between item_name and batch_name. I guess batch_name would become standard and item_name would only be a special case.

Contributor Author

I get what you're saying, but if things technically run in "batch" mode all the time, then all related functions should return a list (even if that list only contains one item). For backwards compatibility, a single-item list could be simplified to its singular element, but IMO, that inconsistency complicates the API. 🤷‍♂️

Member

Alright yeah you convinced me 😄

I think I still do prefer the interface described below in the comment about BatchingSchema, where if batch_size is missing or None then we implicitly interpret that to be equivalent to disabling batching, so that we don't need to fill in both the enabled field and the batch_size field. thoughts on that?

Contributor Author

I like it. Makes usage a little bit easier. 👍


{# TODO: Import this from some util module #}
{# https://stackoverflow.com/a/312464 #}
def grouped(l, n):
    """Yield successive n-sized chunks from l (per the SO answer above)."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
Member

Love the use of the generator for this. Could you rename this function to something more obscure? Right now, unfortunately, we don't have a way to ensure that variables/functions inserted here do not have name collisions with those inserted in other places. For example, if there were a task named grouped in some DAG, boundary-layer would create a local variable named grouped that would collide with this.

Contributor Author

Makes sense, will do. Thanks for calling this out!
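The collision risk described above can be reproduced in a few lines; this toy sketch stands in for two fragments being inserted into the same generated module:

```python
# Toy illustration of the name-collision risk: a template-inserted helper
# and a task-derived variable both claim the module-level name 'grouped'.
def grouped(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

assert [list(c) for c in grouped([1, 2, 3, 4, 5], 2)] == [[1, 2], [3, 4], [5]]

# A DAG task named 'grouped' would generate a variable of the same name,
# silently shadowing the helper defined above.
grouped = 'task_object_for_a_task_named_grouped'
assert not callable(grouped)
```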

@@ -35,9 +35,15 @@ class ReferenceSchema(OperatorSchema):
target = fields.String(required=True)


class BatchingSchema(StrictSchema):
Member

minor maybe, but you could do this with just a single optional batch_size argument in the generator schema, i think? and if that argument is missing then you implicitly assume that enabled == False. This would also simplify the part above where you say you think there would be a better way...

Contributor Author

See my comment in generator_operator.j2 for my logic here.

if not suffix_mode or suffix_mode == 'item_name':
    return base_name + '-<<item_name>>'
name_var = 'batch_name' if execution_context.referrer.item.get('batching', {'enabled': False})['enabled'] else 'item_name'
if not suffix_mode or suffix_mode == name_var:
Member

Oh yup, if I had read this comment first I would have typed a lot less up there ^^ haha

@@ -0,0 +1,57 @@
from boundary_layer.schemas.dag import BatchingSchema
Contributor Author

Let me know if there's a better place to put this.

Member

oh ya I wouldn't bet on us having perfect fidelity with this, but we have tried to align the locations of files in test/ with the directory structure that is used in the main package. So could you please put this file into a directory test/schemas ?

Contributor Author

@nickmoorman nickmoorman left a comment

Alright, finally finished tweaking this to my liking... I added lots of unit tests to test the new functionality, including tests to show that the templates generate the expected Python code. I also did a little manual testing to verify that way as well. Results can be seen here: https://gist.github.com/nickmoorman/4a55a69f07f5c68a09939014eaf4179a

I think this should be good to go now. Let me know your thoughts! =]

)):
)

{% if node.batching_enabled %}
Contributor Author

Ok, so I did what you suggested, but I ended up adding a batching_enabled helper property to make it easier to resolve the batching situation. Like you proposed, we can implicitly know that batching is enabled if batching settings exist (for now, that means there's a batch_size), and we can use the optional disabled property to turn batching off even when we have the batch size configured. However, the simple check if not node.batching.disabled returns false positives here when the batching config does not exist at all, so the check needed to be more like if node.batching|length > 0 and not node.batching.disabled. Since this is a bit cumbersome, I created the batching_enabled property to avoid having to duplicate this logic in multiple places (3 distinct places in this PR need to perform this check).

Member

ooh, I like this solution 👍
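In plain Python, the batching_enabled helper described above amounts to the following (hypothetical GeneratorNode class standing in for the real node object):

```python
class GeneratorNode:
    """Toy stand-in for the real node object, illustrating the
    batching_enabled helper property described in the thread above."""

    def __init__(self, batching=None):
        # Mirrors node.batching in the templates: an empty mapping
        # when no batching settings are configured.
        self.batching = batching or {}

    @property
    def batching_enabled(self):
        # Equivalent to the Jinja check:
        #   node.batching|length > 0 and not node.batching.disabled
        return len(self.batching) > 0 and not self.batching.get('disabled', False)

assert GeneratorNode({'batch_size': 5}).batching_enabled
assert not GeneratorNode({'batch_size': 5, 'disabled': True}).batching_enabled
assert not GeneratorNode().batching_enabled  # no config: the false-positive case
```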

@@ -25,3 +32,186 @@ def test_default_param_filler():
        'timeout_sec': 5,
        'headers': {}
    }


# Tests for batching functionality
Contributor Author

Probably went a little overboard here, but I like to be thorough... 😬

Member

omg this is awesome! <3 overboard tests

@nickmoorman nickmoorman changed the title First pass at adding batch support to generator framework Add batch support to generator framework Jan 1, 2019
Member

@mchalek mchalek left a comment

Hey @nickmoorman this looks great. I really like the solution you settled upon for determining whether batching is enabled (and sorry for leading you down a partial rabbit hole with it, I had not anticipated the false positive that you found).

I had a few minor comments: one regarding python3 compliance and one that I think will prevent config files from being parsed with the batch_name setting for auto_task_id_mode, which is pretty esoteric so I'm not at all surprised that no tests caught it.

Oh also I responded to your question about the location of the schemas test file ;)

If you wouldn't mind making these changes, we can get this merged asap.

Also, I am 😍 at all those tests!

Thanks again!

)):
)

{% if node.batching_enabled %}
Member

ooh, I like this solution 👍

    item_name = item_name_builder(index, item)
    return not any(re.match(i, item_name) for i in blocklist)

filtered = filter(lambda (index, item): not_in_blocklist(index, item), enumerate(items))
Member

unfortunately python 3 does not support the syntax in which function/lambda args are automatically expanded into tuples (though I will admit I have not carefully considered whether other parts of the code-generating templates generate python3 compliant code... we may have some work to do in that area).

Maybe it would be better to just map and filter using a comprehension like

return [ index for (index, item) in enumerate(items)
    if not_in_blocklist(index, item) ]

?
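The suggested comprehension runs on both Python 2 and 3, unlike the tuple-unpacking lambda. A self-contained sketch with a toy blocklist (the names and sample data here are invented for illustration):

```python
import re

blocklist = [r'^skip_']  # hypothetical blocklist pattern

def item_name_builder(index, item):
    # Toy name builder: the item is its own name in this sketch.
    return item

def not_in_blocklist(index, item):
    item_name = item_name_builder(index, item)
    return not any(re.match(i, item_name) for i in blocklist)

items = ['keep_a', 'skip_b', 'keep_c']

# Python 3 friendly: a comprehension instead of
# filter(lambda (index, item): ...), which is a SyntaxError in Python 3.
filtered = [(index, item) for (index, item) in enumerate(items)
            if not_in_blocklist(index, item)]

print(filtered)  # [(0, 'keep_a'), (2, 'keep_c')]
```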

class GeneratorSchema(ReferenceSchema):
    auto_task_id_mode = fields.String()
    regex_blocklist = fields.List(fields.String())
    batching = fields.Nested(BatchingSchema)

    @validates_schema
    def check_task_id_mode(self, data):
Member

Oh I did not notice this in the original PR, but we'll have to add logic in this check_task_id_mode() method to allow the auto_task_id_mode value to be set to batch_name, otherwise I think the config parser will reject this setting. Maybe the logic that you already have for checking these values in operator.py could be moved here?
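The validation being suggested could look roughly like this plain-Python sketch (not the actual marshmallow hook; it covers only the two modes discussed in this thread):

```python
def check_task_id_mode(data):
    """Sketch of the rule discussed above: 'batch_name' is only a valid
    auto_task_id_mode when batching settings are present, and 'item_name'
    only when they are not. Hypothetical helper, not the real schema hook."""
    mode = data.get('auto_task_id_mode')
    if mode is None:
        return  # unset mode falls back to the default behavior
    batching_present = bool(data.get('batching'))
    allowed = {'batch_name'} if batching_present else {'item_name'}
    if mode not in allowed:
        raise ValueError('auto_task_id_mode %r not allowed here' % mode)

# Accepted: batch_name alongside a batching block.
check_task_id_mode({'auto_task_id_mode': 'batch_name',
                    'batching': {'batch_size': 5}})
```

In the real schema this logic would live in the @validates_schema-decorated check_task_id_mode() method and raise a marshmallow ValidationError instead of ValueError.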

@@ -25,3 +32,186 @@ def test_default_param_filler():
        'timeout_sec': 5,
        'headers': {}
    }


# Tests for batching functionality
Member

omg this is awesome! <3 overboard tests

@@ -0,0 +1,57 @@
from boundary_layer.schemas.dag import BatchingSchema
Member

oh ya I wouldn't bet on us having perfect fidelity with this, but we have tried to align the locations of files in test/ with the directory structure that is used in the main package. So could you please put this file into a directory test/schemas ?

@mchalek
Member

mchalek commented Jun 25, 2020

Closing due to inactivity

@mchalek mchalek closed this Jun 25, 2020