PATH WALK I: The path-walk API #1818

gitgitgadget · 2024-11-01T13:13:49Z

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <stolee@gmail.com> > > In anticipation of a few planned applications, introduce the most basic form > of a path-walk API. It currently assumes that there are no UNINTERESTING > objects, and does not include any complicated filters. It calls a function > pointer on groups of tree and blob objects as grouped by path. This only > includes objects the first time they are discovered, so an object that > appears at multiple paths will not be included in two batches. > > These batches are collected in 'struct type_and_oid_list' objects, which > store an object type and an oid_array of objects. > > The data structures are documented in 'struct path_walk_context', but in > summary the most important are: > > * 'paths_to_lists' is a strmap that connects a path to a > type_and_oid_list for that path. To avoid conflicts in path names, > we make sure that tree paths end in "/" (except the root path with > is an empty string) and blob paths do not end in "/". > > * 'path_stack' is a string list that is added to in an append-only > way. This stores the stack of our depth-first search on the heap > instead of using recursion. > > * 'path_stack_pushed' is a strmap that stores path names that were > already added to 'path_stack', to avoid repeating paths in the > stack. Mostly, this saves us from quadratic lookups from doing > unsorted checks into the string_list. > > The coupling of 'path_stack' and 'path_stack_pushed' is protected by the > push_to_stack() method. Call this instead of inserting into these > structures directly. > > The walk_objects_by_path() method initializes these structures and > starts walking commits from the given rev_info struct. The commits are > used to find the list of root trees which populate the start of our > depth-first search. Isn't this more of breadth-first search? Reading through the code, the algorithm seems something like: - For each commit in list of commits (from rev_info) - Tackle each root tree, add root path to the stack. - For each path in stack left - Call the callback provided by client. - Find all its first level children, add each to the stack. So wouldn't this go through the tree in level by level basis? Making it a BFS? Apart from this, the patch itself looks solid. I ended up writing a small client to play with this API, and was very pleased how quickly I could get it running. [snip]

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 9:12 AM, karthik nayak wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Derrick Stolee <stolee@gmail.com> >> >> The walk_objects_by_path() method initializes these structures and >> starts walking commits from the given rev_info struct. The commits are >> used to find the list of root trees which populate the start of our >> depth-first search. > > Isn't this more of breadth-first search? Reading through the code, the > algorithm seems something like: > > - For each commit in list of commits (from rev_info) > - Tackle each root tree, add root path to the stack. > - For each path in stack left > - Call the callback provided by client. > - Find all its first level children, add each to the stack. > > So wouldn't this go through the tree in level by level basis? Making it > a BFS? While we are adding all children to the stack, we only pop off the top of the stack, making it a DFS. (We do visit the paths in reverse- lexicographic order, though.) To make it a BFS, we would need to visit the paths in the order they are added to the list. Instead, we visit them in Last-In First-Out order. I initially had built it as a BFS, but ran into memory issues when running it on very large repos. Thanks, -Stolee

gitgitgadget · 2024-11-01T13:43:28Z

On the Git mailing list, karthik nayak wrote (reply to this):

karthik nayak <karthik.188@gmail.com> writes: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Derrick Stolee <stolee@gmail.com> >> >> In anticipation of a few planned applications, introduce the most basic form >> of a path-walk API. It currently assumes that there are no UNINTERESTING >> objects, and does not include any complicated filters. It calls a function >> pointer on groups of tree and blob objects as grouped by path. This only >> includes objects the first time they are discovered, so an object that >> appears at multiple paths will not be included in two batches. >> >> These batches are collected in 'struct type_and_oid_list' objects, which >> store an object type and an oid_array of objects. >> >> The data structures are documented in 'struct path_walk_context', but in >> summary the most important are: >> >> * 'paths_to_lists' is a strmap that connects a path to a >> type_and_oid_list for that path. To avoid conflicts in path names, >> we make sure that tree paths end in "/" (except the root path with >> is an empty string) and blob paths do not end in "/". >> >> * 'path_stack' is a string list that is added to in an append-only >> way. This stores the stack of our depth-first search on the heap >> instead of using recursion. >> >> * 'path_stack_pushed' is a strmap that stores path names that were >> already added to 'path_stack', to avoid repeating paths in the >> stack. Mostly, this saves us from quadratic lookups from doing >> unsorted checks into the string_list. >> >> The coupling of 'path_stack' and 'path_stack_pushed' is protected by the >> push_to_stack() method. Call this instead of inserting into these >> structures directly. >> >> The walk_objects_by_path() method initializes these structures and >> starts walking commits from the given rev_info struct. The commits are >> used to find the list of root trees which populate the start of our >> depth-first search. > > Isn't this more of breadth-first search? Reading through the code, the > algorithm seems something like: > > - For each commit in list of commits (from rev_info) > - Tackle each root tree, add root path to the stack. > - For each path in stack left > - Call the callback provided by client. > - Find all its first level children, add each to the stack. > > So wouldn't this go through the tree in level by level basis? Making it > a BFS? My bad here, thinking more about it, it is DFS indeed. Although we add all the children of a level to the stack, we pop each of them from the stack and end up traversing down that level. > > Apart from this, the patch itself looks solid. I ended up writing a > small client to play with this API, and was very pleased how quickly I > could get it running. > > [snip]

gitgitgadget · 2024-12-13T12:04:14Z

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote: > --- /dev/null > +++ b/path-walk.c > @@ -0,0 +1,263 @@ > +/* > + * path-walk.c: implementation for path-based walks of the object graph. > + */ > +#include "git-compat-util.h" > +#include "path-walk.h" > +#include "blob.h" > +#include "commit.h" > +#include "dir.h" > +#include "hashmap.h" > +#include "hex.h" > +#include "object.h" > +#include "oid-array.h" > +#include "revision.h" > +#include "string-list.h" > +#include "strmap.h" > +#include "trace2.h" > +#include "tree.h" > +#include "tree-walk.h" > + > +struct type_and_oid_list > +{ > + enum object_type type; > + struct oid_array oids; > +}; Nit: formatting of this struct is off. > +static void push_to_stack(struct path_walk_context *ctx, > + const char *path) > +{ > + if (strset_contains(&ctx->path_stack_pushed, path)) > + return; > + > + strset_add(&ctx->path_stack_pushed, path); > + string_list_append(&ctx->path_stack, path); > +} > + > +static int add_children(struct path_walk_context *ctx, > + const char *base_path, > + struct object_id *oid) > +{ So this function assumes that `oid` always refers to a tree? I think it would make sense to clarify this by calling the function accordingly, like e.g. `add_tree_entries()`. > + struct tree_desc desc; > + struct name_entry entry; > + struct strbuf path = STRBUF_INIT; > + size_t base_len; > + struct tree *tree = lookup_tree(ctx->repo, oid); > + > + if (!tree) { > + error(_("failed to walk children of tree %s: not found"), > + oid_to_hex(oid)); > + return -1; > + } else if (parse_tree_gently(tree, 1)) { > + die("bad tree object %s", oid_to_hex(oid)); I wonder whether we maybe shouldn't die but instead return an error in the spirit of libification. > + } > + strbuf_addstr(&path, base_path); > + base_len = path.len; > + > + parse_tree(tree); > + init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size); > + while (tree_entry(&desc, &entry)) { > + struct type_and_oid_list *list; > + struct object *o; > + /* Not actually true, but we will ignore submodules later. */ > + enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB; > + > + /* Skip submodules. */ > + if (S_ISGITLINK(entry.mode)) > + continue; > + > + if (type == OBJ_TREE) { > + struct tree *child = lookup_tree(ctx->repo, &entry.oid); > + o = child ? &child->object : NULL; > + } else if (type == OBJ_BLOB) { > + struct blob *child = lookup_blob(ctx->repo, &entry.oid); > + o = child ? &child->object : NULL; > + } else { > + /* Wrong type? */ > + continue; This code is unreachable, so we could make this a `BUG()`. Might also use a switch instead, but that's more of a stylistic question. > + } > + > + if (!o) /* report error?*/ > + continue; So this can only happen in case `lookup_tree()` or `lookup_blob()` run into an error. I think this error should definitely be bubbled up so that we don't silently skip tree entries in case of repo corruption. > + strbuf_setlen(&path, base_len); > + strbuf_add(&path, entry.path, entry.pathlen); > + > + /* > + * Trees will end with "/" for concatenation and distinction > + * from blobs at the same path. > + */ > + if (type == OBJ_TREE) > + strbuf_addch(&path, '/'); > + > + if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) { > + CALLOC_ARRAY(list, 1); > + list->type = type; > + strmap_put(&ctx->paths_to_lists, path.buf, list); > + } > + push_to_stack(ctx, path.buf); > + > + /* Skip this object if already seen. */ > + if (o->flags & SEEN) > + continue; > + o->flags |= SEEN; This made me wonder: why do we only skip the object this late? Couldn't we already have done so immediately after we have looked up the object to avoid some work? If not, it might be useful to add a comment why it has to come this late. > + oid_array_append(&list->oids, &entry.oid); > + } > + > + free_tree_buffer(tree); > + strbuf_release(&path); > + return 0; > +} > + > +/* > + * For each path in paths_to_explore, walk the trees another level > + * and add any found blobs to the batch (but only if they exist and > + * haven't been added yet). > + */ > +static int walk_path(struct path_walk_context *ctx, > + const char *path) > +{ > + struct type_and_oid_list *list; > + int ret = 0; Nit: needless initialization. > + > + list = strmap_get(&ctx->paths_to_lists, path); > + > + if (!list->oids.nr) > + return 0; > + > + /* Evaluate function pointer on this data. */ > + ret = ctx->info->path_fn(path, &list->oids, list->type, > + ctx->info->path_fn_data); > + > + /* Expand data for children. */ > + if (list->type == OBJ_TREE) { > + for (size_t i = 0; i < list->oids.nr; i++) { > + ret |= add_children(ctx, > + path, > + &list->oids.oid[i]); > + } > + } > + > + oid_array_clear(&list->oids); > + strmap_remove(&ctx->paths_to_lists, path, 1); > + return ret; > +} > + > +static void clear_strmap(struct strmap *map) Nit: this isn't clearing a generic strmap, but rather `paths_to_lists`. Should we maybe rename it to `clear_paths_to_lists()`? > +{ > + struct hashmap_iter iter; > + struct strmap_entry *e; > + > + hashmap_for_each_entry(&map->map, &iter, e, ent) { > + struct type_and_oid_list *list = e->value; > + oid_array_clear(&list->oids); > + } > + strmap_clear(map, 1); > + strmap_init(map); > +} > + > +/** > + * Given the configuration of 'info', walk the commits based on 'info->revs' and > + * call 'info->path_fn' on each discovered path. > + * > + * Returns nonzero on an error. > + */ > +int walk_objects_by_path(struct path_walk_info *info) > +{ > + const char *root_path = ""; > + int ret = 0; > + size_t commits_nr = 0, paths_nr = 0; > + struct commit *c; > + struct type_and_oid_list *root_tree_list; > + struct path_walk_context ctx = { > + .repo = info->revs->repo, > + .revs = info->revs, > + .info = info, > + .path_stack = STRING_LIST_INIT_DUP, > + .path_stack_pushed = STRSET_INIT, > + .paths_to_lists = STRMAP_INIT > + }; > + > + trace2_region_enter("path-walk", "commit-walk", info->revs->repo); > + > + /* Insert a single list for the root tree into the paths. */ > + CALLOC_ARRAY(root_tree_list, 1); > + root_tree_list->type = OBJ_TREE; > + strmap_put(&ctx.paths_to_lists, root_path, root_tree_list); > + push_to_stack(&ctx, root_path); > + > + if (prepare_revision_walk(info->revs)) > + die(_("failed to setup revision walk")); > + > + while ((c = get_revision(info->revs))) { > + struct object_id *oid = get_commit_tree_oid(c); > + struct tree *t; > + commits_nr++; > + > + oid = get_commit_tree_oid(c); > + t = lookup_tree(info->revs->repo, oid); > + > + if (!t) { > + warning("could not find tree %s", oid_to_hex(oid)); > + continue; > + } Is this error something we should bubble up to the caller? As mentioned above, I'm being cautious about silently accepting potentially-corrupt data. Silent in the spirit of the caller not noticing it, not in the sense of the user not noticing it. Patrick

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/13/24 6:58 AM, Patrick Steinhardt wrote: > On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote: >> + } else if (parse_tree_gently(tree, 1)) { >> + die("bad tree object %s", oid_to_hex(oid)); > > I wonder whether we maybe shouldn't die but instead return an error in > the spirit of libification. This is in fact something that is being tested when 'git pack-objects' has the --path-walk feature. See "get an error for missing tree object" in t5317 as an example. It's not enough to fail, but we need to fail with this error message. Has there been enough progress in the libification effort to establish a pattern for returning an error message like "bad tree object %s" from an API like this to the caller? I will try using a "error(); return -1;" and consider that as the best option for right now. >> + } else { >> + /* Wrong type? */ >> + continue; > > This code is unreachable, so we could make this a `BUG()`. Might also > use a switch instead, but that's more of a stylistic question. I think a BUG() would be good here. >> + } >> + >> + if (!o) /* report error?*/ >> + continue; > > So this can only happen in case `lookup_tree()` or `lookup_blob()` > run into an error. I think this error should definitely be bubbled up so > that we don't silently skip tree entries in case of repo corruption. Looks like I agreed with you but didn't follow through. >> + /* Skip this object if already seen. */ >> + if (o->flags & SEEN) >> + continue; >> + o->flags |= SEEN; > > This made me wonder: why do we only skip the object this late? Couldn't > we already have done so immediately after we have looked up the object > to avoid some work? If not, it might be useful to add a comment why it > has to come this late. I went to look to see if there was a reason, and at this point there is not a good reason. This should be moved up to avoid some checks and path manipulation. I think that in a later patch, the use of the UNINTERESTING flag is important to pass the flag even when the object is already SEEN. This is probably cruft from an earlier version that passed the UNINTERESTING bits in this part of the code. >> + if (!t) { >> + warning("could not find tree %s", oid_to_hex(oid)); >> + continue; >> + } > > Is this error something we should bubble up to the caller? As mentioned > above, I'm being cautious about silently accepting potentially-corrupt > data. Silent in the spirit of the caller not noticing it, not in the > sense of the user not noticing it. Can do. Thanks for the careful review! -Stolee

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Dec 18, 2024 at 09:21:25AM -0500, Derrick Stolee wrote: > On 12/13/24 6:58 AM, Patrick Steinhardt wrote: > > On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote: > > > > + } else if (parse_tree_gently(tree, 1)) { > > > + die("bad tree object %s", oid_to_hex(oid)); > > > > I wonder whether we maybe shouldn't die but instead return an error in > > the spirit of libification. > > This is in fact something that is being tested when 'git pack-objects' has > the --path-walk feature. See "get an error for missing tree object" in > t5317 as an example. > > It's not enough to fail, but we need to fail with this error message. > > Has there been enough progress in the libification effort to establish a > pattern for returning an error message like "bad tree object %s" from an > API like this to the caller? > > I will try using a "error(); return -1;" and consider that as the best > option for right now. Yeah, I think this is best practice for now where we don't have a superior mechanism like structured errors. Patrick

gitgitgadget · 2024-12-27T14:21:05Z

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 20, 2024 at 04:21:09PM +0000, Derrick Stolee via GitGitGadget wrote: [snip] > +static int add_tree_entries(struct path_walk_context *ctx, > + const char *base_path, > + struct object_id *oid) > +{ > + struct tree_desc desc; > + struct name_entry entry; > + struct strbuf path = STRBUF_INIT; > + size_t base_len; > + struct tree *tree = lookup_tree(ctx->repo, oid); > + > + if (!tree) { > + error(_("failed to walk children of tree %s: not found"), > + oid_to_hex(oid)); > + return -1; > + } else if (parse_tree_gently(tree, 1)) { > + error("bad tree object %s", oid_to_hex(oid)); > + return -1; > + } You can `return error(_("..."));` directly as it already returns `-1`. Not sure whether this by itself warrants a reroll -- probably not. I'll leave it up to you. The rest of the patch series looks as expected, mostly based on the range diff. Patrick

gitgitgadget · 2024-11-01T14:28:40Z

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <stolee@gmail.com> > > The rev_info that is specified for a path-walk traversal may specify > visiting tag refs (both lightweight and annotated) and also may specify > indexed objects (blobs and trees). Update the path-walk API to walk > these objects as well. > > When walking tags, we need to peel the annotated objects until reaching > a non-tag object. If we reach a commit, then we can add it to the > pending objects to make sure we visit in the commit walk portion. If we Nit: s/in/it in/ [snip] > + case OBJ_BLOB: > + if (!info->blobs) > + continue; > + if (pending->path) { > + struct type_and_oid_list *list; > + char *path = pending->path; > + if (!(list = strmap_get(&ctx->paths_to_lists, path))) { > + CALLOC_ARRAY(list, 1); > + list->type = OBJ_BLOB; > + strmap_put(&ctx->paths_to_lists, path, list); > + } > + oid_array_append(&list->oids, &obj->oid); > + } else { > + /* assume a root tree, such as a lightweight tag. */ Shouldn't this comment be for tagged blobs? > + oid_array_append(&tagged_blobs->oids, &obj->oid); > + } > + break; [snip]

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 10:25 AM, karthik nayak wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Derrick Stolee <stolee@gmail.com> >> >> The rev_info that is specified for a path-walk traversal may specify >> visiting tag refs (both lightweight and annotated) and also may specify >> indexed objects (blobs and trees). Update the path-walk API to walk >> these objects as well. >> >> When walking tags, we need to peel the annotated objects until reaching >> a non-tag object. If we reach a commit, then we can add it to the >> pending objects to make sure we visit in the commit walk portion. If we > > Nit: s/in/it in/ thanks! >> + case OBJ_BLOB: >> + if (!info->blobs) >> + continue; >> + if (pending->path) { >> + struct type_and_oid_list *list; >> + char *path = pending->path; >> + if (!(list = strmap_get(&ctx->paths_to_lists, path))) { >> + CALLOC_ARRAY(list, 1); >> + list->type = OBJ_BLOB; >> + strmap_put(&ctx->paths_to_lists, path, list); >> + } >> + oid_array_append(&list->oids, &obj->oid); >> + } else { >> + /* assume a root tree, such as a lightweight tag. */ > > Shouldn't this comment be for tagged blobs? Yes. This is a copy-paste error. Thanks for the careful reading. -Stolee

gitgitgadget · 2024-11-21T22:47:49Z

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Nov 09, 2024 at 07:41:10PM +0000, Derrick Stolee via GitGitGadget wrote: > diff --git a/path-walk.c b/path-walk.c > index 24cf04c1e7d..2ca08402367 100644 > --- a/path-walk.c > +++ b/path-walk.c > @@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx, > if (S_ISGITLINK(entry.mode)) > continue; > > + /* If the caller doesn't want blobs, then don't bother. */ > + if (!ctx->info->blobs && type == OBJ_BLOB) > + continue; > + I was going to ask why we're not reusing the existing property of the rev_info struct to specify what object type(s) we do/don't want, but the answer is obvious: we don't want that to change the behavior of the commit-level walk which is used to seed the actual path walk based on its results. However, it would be kind of nice to have a single place to specify how you want to traverse objects, and using the rev_info struct seems like a good choice there since you can naively ask Git to parse command-line arguments like --blobs, --trees, --objects, etc. and set the appropriate bits. I wonder if it might make sense to do something like: unsigned int tmp_blobs = revs->blob_objects; unsigned int tmp_trees = revs->tree_objects; unsigned int tmp_tags = revs->tag_objects; if (prepare_revision_walk(revs)) die(_("failed to setup revision walk")); /* commit-level walk */ revs->blob_objects = tmp_blobs; revs->tree_objects = tmp_trees; revs->tag_objects = tmp_tags; /* path-level walk */ I don't have strong feelings about it, but it feels like this approach could be cleaner from a caller's perspective. But I can see an argument to the contrary that it does introduce some awkwardness with the dual-use of those fields. Thanks, Taylor

gitgitgadget · 2024-12-13T12:04:16Z

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 07:45:56PM +0000, Derrick Stolee via GitGitGadget wrote: > @@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map) > strmap_init(map); > } > > +static void setup_pending_objects(struct path_walk_info *info, > + struct path_walk_context *ctx) > +{ > + struct type_and_oid_list *tags = NULL; > + struct type_and_oid_list *tagged_blobs = NULL; > + struct type_and_oid_list *root_tree_list = NULL; > + > + if (info->tags) > + CALLOC_ARRAY(tags, 1); > + if (info->blobs) > + CALLOC_ARRAY(tagged_blobs, 1); > + if (info->trees) > + root_tree_list = strmap_get(&ctx->paths_to_lists, root_path); > + > + /* > + * Pending objects include: > + * * Commits at branch tips. > + * * Annotated tags at tag tips. > + * * Any kind of object at lightweight tag tips. > + * * Trees and blobs in the index (with an associated path). > + */ > + for (size_t i = 0; i < info->revs->pending.nr; i++) { > + struct object_array_entry *pending = info->revs->pending.objects + i; > + struct object *obj = pending->item; > + > + /* Commits will be picked up by revision walk. */ > + if (obj->type == OBJ_COMMIT) > + continue; > + > + /* Navigate annotated tag object chains. */ > + while (obj->type == OBJ_TAG) { > + struct tag *tag = lookup_tag(info->revs->repo, &obj->oid); > + if (!tag) > + break; Same here as previous comments, is this an error that we should rather report? [snip] > + if (tagged_blobs) { > + if (tagged_blobs->oids.nr) { > + const char *tagged_blob_path = "/tagged-blobs"; > + tagged_blobs->type = OBJ_BLOB; > + push_to_stack(ctx, tagged_blob_path); > + strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs); > + } else { > + oid_array_clear(&tagged_blobs->oids); > + free(tagged_blobs); > + } > + } > + if (tags) { > + if (tags->oids.nr) { > + const char *tag_path = "/tags"; > + tags->type = OBJ_TAG; > + push_to_stack(ctx, tag_path); > + strmap_put(&ctx->paths_to_lists, tag_path, tags); > + } else { > + oid_array_clear(&tags->oids); > + free(tags); > + } > + } > +} So this is kind of curious. Does that mean that a file named "tagged-blobs" would be thrown into the same bag as a tagged blob? Or are these special due to the leading "/"? Patrick

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/13/24 6:58 AM, Patrick Steinhardt wrote: > On Fri, Dec 06, 2024 at 07:45:56PM +0000, Derrick Stolee via GitGitGadget wrote: >> @@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map) >> + /* Navigate annotated tag object chains. */ >> + while (obj->type == OBJ_TAG) { >> + struct tag *tag = lookup_tag(info->revs->repo, &obj->oid); >> + if (!tag) >> + break; > > Same here as previous comments, is this an error that we should rather > report? Can do. > [snip] >> + if (tagged_blobs) { >> + if (tagged_blobs->oids.nr) { >> + const char *tagged_blob_path = "/tagged-blobs"; >> + tagged_blobs->type = OBJ_BLOB; >> + push_to_stack(ctx, tagged_blob_path); >> + strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs); >> + } else { >> + oid_array_clear(&tagged_blobs->oids); >> + free(tagged_blobs); >> + } >> + } >> + if (tags) { >> + if (tags->oids.nr) { >> + const char *tag_path = "/tags"; >> + tags->type = OBJ_TAG; >> + push_to_stack(ctx, tag_path); >> + strmap_put(&ctx->paths_to_lists, tag_path, tags); >> + } else { >> + oid_array_clear(&tags->oids); >> + free(tags); >> + } >> + } >> +} > > So this is kind of curious. Does that mean that a file named > "tagged-blobs" would be thrown into the same bag as a tagged blob? Or > are these special due to the leading "/"? Indeed, the leading "/" differentiates these categories from other paths that are stored in the process. Thanks, -Stolee

gitgitgadget · 2024-11-01T13:49:32Z

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: [snip] > diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh > new file mode 100755 > index 00000000000..1f277b88291 > --- /dev/null > +++ b/t/t6601-path-walk.sh > @@ -0,0 +1,118 @@ > +#!/bin/sh > + > +test_description='direct path-walk API tests' > + > +. ./test-lib.sh > + > +test_expect_success 'setup test repository' ' > + git checkout -b base && > + > + mkdir left && > + mkdir right && > + echo a >a && > + echo b >left/b && > + echo c >right/c && > + git add . && > + git commit -m "first" && > + > + echo d >right/d && > + git add right && > + git commit -m "second" && > + > + echo bb >left/b && > + git commit -a -m "third" && > + > + git checkout -b topic HEAD~1 && > + echo cc >right/c && > + git commit -a -m "topic" > +' > + Nit: Since the root level tree is already special cased out, we only check one level of path here, would be nice to add another level of tree to this. > +test_expect_success 'all' ' > + test-tool path-walk -- --all >out && > + > + cat >expect <<-EOF && > + TREE::$(git rev-parse topic^{tree}) > + TREE::$(git rev-parse base^{tree}) > + TREE::$(git rev-parse base~1^{tree}) > + TREE::$(git rev-parse base~2^{tree}) > + TREE:left/:$(git rev-parse base:left) > + TREE:left/:$(git rev-parse base~2:left) > + TREE:right/:$(git rev-parse topic:right) > + TREE:right/:$(git rev-parse base~1:right) > + TREE:right/:$(git rev-parse base~2:right) > + trees:9 > + BLOB:a:$(git rev-parse base~2:a) > + BLOB:left/b:$(git rev-parse base~2:left/b) > + BLOB:left/b:$(git rev-parse base:left/b) > + BLOB:right/c:$(git rev-parse base~2:right/c) > + BLOB:right/c:$(git rev-parse topic:right/c) > + BLOB:right/d:$(git rev-parse base~1:right/d) > + blobs:6 > + EOF > + > + test_cmp_sorted expect out > +' Isn't the order deterministic? Why do we need to sort it?

gitgitgadget · 2024-11-01T22:26:38Z

On the Git mailing list, Jonathan Tan wrote (reply to this):

I haven't looked thoroughly at the rest of the patches yet, but had a comment about this test. Rearranging: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > +test_expect_success 'all' ' > + test-tool path-walk -- --all >out && > + > + cat >expect <<-EOF && > + TREE::$(git rev-parse topic^{tree}) > + TREE::$(git rev-parse base^{tree}) > + TREE::$(git rev-parse base~1^{tree}) > + TREE::$(git rev-parse base~2^{tree}) > + TREE:left/:$(git rev-parse base:left) > + TREE:left/:$(git rev-parse base~2:left) > + TREE:right/:$(git rev-parse topic:right) > + TREE:right/:$(git rev-parse base~1:right) > + TREE:right/:$(git rev-parse base~2:right) > + trees:9 [snip rest of "expect"] The way you're testing this, wouldn't the tests pass even if the OIDs aren't emitted in path order? (E.g. if topic:right and base~1:right were somehow grouped into two different groups, even though they have the same path.) I would have expected the test output to be something like: TREE:right/ $(rp :right topic base~1 base~2) where rp is a function that takes in a suffix and one or more prefixes - I haven't figured out its contents yet, but echo $(git rev-parse HEAD^^ HEAD^ HEAD | sort) gives us a space-separated list, so it doesn't seem too difficult to define such a function. > +static int emit_block(const char *path, struct oid_array *oids, > + enum object_type type, void *data) > +{ > + struct path_walk_test_data *tdata = data; > + const char *typestr; > + > + switch (type) { > + case OBJ_TREE: > + typestr = "TREE"; > + tdata->tree_nr += oids->nr; > + break; > + > + case OBJ_BLOB: > + typestr = "BLOB"; > + tdata->blob_nr += oids->nr; > + break; > + > + default: > + BUG("we do not understand this type"); > + } > + > + for (size_t i = 0; i < oids->nr; i++) > + printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i])); Then here, you would print typestr and path before the "for" loop. In the "for" loop you would add oid_to_hex() results to a sorted string list, have another "for" loop that prints each element preceded by a space, then print a "\n" after both "for" loops.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 6:23 PM, Jonathan Tan wrote: > I haven't looked thoroughly at the rest of the patches yet, but had a > comment about this test. Rearranging: > > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >> +test_expect_success 'all' ' >> + test-tool path-walk -- --all >out && >> + >> + cat >expect <<-EOF && >> + TREE::$(git rev-parse topic^{tree}) >> + TREE::$(git rev-parse base^{tree}) >> + TREE::$(git rev-parse base~1^{tree}) >> + TREE::$(git rev-parse base~2^{tree}) >> + TREE:left/:$(git rev-parse base:left) >> + TREE:left/:$(git rev-parse base~2:left) >> + TREE:right/:$(git rev-parse topic:right) >> + TREE:right/:$(git rev-parse base~1:right) >> + TREE:right/:$(git rev-parse base~2:right) >> + trees:9 > > [snip rest of "expect"] > > The way you're testing this, wouldn't the tests pass even if the OIDs > aren't emitted in path order? (E.g. if topic:right and base~1:right > were somehow grouped into two different groups, even though they have > the same path.) You are correct that if the path-walk API emitted multiple batches with the same path name, then we would not detect that via the current testing strategy. The main reason to use the sort is to avoid adding a restriction on the order in which objects appear within the batch. Your recommendation to group a batch into a single line does not strike me as a suitable approach, because long lines become hard to read and difficult to parse diffs. (Also, the order within the batch becomes baked in as a requirement.) The biggest question I'd like to ask is this: do you see a risk of a path being repeated? There are cases where it will happen, such as indexed objects that are not reachable anywhere else. The way I would consider modifying these tests to reflect the batching would be to associate each batch with a number, causing the order of the paths to become hard-coded in the test. Something like 0:COMMIT::$(git rev-parse ...) 0:COMMIT::$(git rev-parse ...) 1:TREE::$(git rev-parse ...) 1:TREE::$(git rev-parse ...) 2:TREE:right/:$(git rev-parse ...) 3:BLOB:right/a:$(...) 4:TREE:left/:$(git rev-parse ...) 5:BLOB:left/b:$(...) This would imply some amount of order that maybe should become a requirement of the API. Thanks, -Stolee

On the Git mailing list, Jonathan Tan wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes: > You are correct that if the path-walk API emitted multiple batches > with the same path name, then we would not detect that via the current > testing strategy. > > The main reason to use the sort is to avoid adding a restriction on > the order in which objects appear within the batch. > > Your recommendation to group a batch into a single line does not > strike me as a suitable approach, because long lines become hard to > read and difficult to parse diffs. (Also, the order within the batch > becomes baked in as a requirement.) The hashes in a line can be abbreviated if line length is a concern. Also, note that I am suggesting sorting the OIDs within a line (that is, a batch), and also sorting the lines (batches) as a whole. > The biggest question I'd like to ask is this: do you see a risk of > a path being repeated? There are cases where it will happen, such as > indexed objects that are not reachable anywhere else. I was thinking that the whole point of this feature is that we group objects by path, so it seems desirable to test that paths are not repeated. (Or repeated as little as possible, if it is not possible to avoid repetition e.g. in the case you describe.) > The way I would consider modifying these tests to reflect the batching > would be to associate each batch with a number, causing the order of > the paths to become hard-coded in the test. Something like > > 0:COMMIT::$(git rev-parse ...) > 0:COMMIT::$(git rev-parse ...) > 1:TREE::$(git rev-parse ...) > 1:TREE::$(git rev-parse ...) > 2:TREE:right/:$(git rev-parse ...) > 3:BLOB:right/a:$(...) > 4:TREE:left/:$(git rev-parse ...) > 5:BLOB:left/b:$(...) > > This would imply some amount of order that maybe should become a > requirement of the API. > > Thanks, > -Stolee If we're willing to declare an order in which we will return paths to the user, that would work too. (I'm not sure that we need to declare an order, though.)

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/4/24 6:39 PM, Jonathan Tan wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> The biggest question I'd like to ask is this: do you see a risk of >> a path being repeated? There are cases where it will happen, such as >> indexed objects that are not reachable anywhere else. > > I was thinking that the whole point of this feature is that we group > objects by path, so it seems desirable to test that paths are not > repeated. (Or repeated as little as possible, if it is not possible > to avoid repetition e.g. in the case you describe.) In addition to determining the order of the batches, it can be helpful to demonstrate that we don't call the path_fn with an empty batch! I discovered this while making the appropriate changes today and putting the fixes in the right places. Thanks, -Stolee

gitgitgadget · 2024-11-06T14:09:20Z

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote: [snip] > +int cmd__path_walk(int argc, const char **argv) > +{ > + int res; > + struct rev_info revs = REV_INFO_INIT; > + struct path_walk_info info = PATH_WALK_INFO_INIT; > + struct path_walk_test_data data = { 0 }; > + struct option options[] = { > + OPT_END(), > + }; > + > + initialize_repository(the_repository); > + setup_git_directory(); > + revs.repo = the_repository; > + > + argc = parse_options(argc, argv, NULL, > + options, path_walk_usage, > + PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0); > + > + if (argc > 1) > + setup_revisions(argc, argv, &revs, NULL); > + else > + usage(path_walk_usage[0]); > + > + info.revs = &revs; > + info.path_fn = emit_block; > + info.path_fn_data = &data; > + > + res = walk_objects_by_path(&info); > + > + printf("trees:%" PRIuMAX "\n" > + "blobs:%" PRIuMAX "\n", > + data.tree_nr, data.blob_nr); > + > + return res; > +} This function is leaking memory. I'd propose to add below patch on top to plug them, which makes t6601 pass with the leak sanitizer enabled. Patrick diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c index 06b103d876..fa3bfe46b5 100644 --- a/t/helper/test-path-walk.c +++ b/t/helper/test-path-walk.c @@ -85,7 +85,6 @@ int cmd__path_walk(int argc, const char **argv) OPT_END(), }; - initialize_repository(the_repository); setup_git_directory(); revs.repo = the_repository; @@ -110,5 +109,6 @@ int cmd__path_walk(int argc, const char **argv) "tags:%" PRIuMAX "\n", data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr); + release_revisions(&revs); return res; }

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/6/24 9:04 AM, Patrick Steinhardt wrote: > On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote: > [snip] > This function is leaking memory. I'd propose to add below patch on top > to plug them, which makes t6601 pass with the leak sanitizer enabled. Thanks! Applied for the next version. -Stolee

gitgitgadget · 2024-11-21T22:41:45Z

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Nov 09, 2024 at 07:41:09PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <stolee@gmail.com> > > Add some tests based on the current behavior, doing interesting checks > for different sets of branches, ranges, and the --boundary option. This > sets a baseline for the behavior and we can extend it as new options are > introduced. > > Store and output a 'batch_nr' value so we can demonstrate that the paths are > grouped together in a batch and not following some other ordering. This > allows us to test the depth-first behavior of the path-walk API. However, we > purposefully do not test the order of the objects in the batch, so the > output is compared to the expected output through a sort. > > It is important to mention that the behavior of the API will change soon as > we start to handle UNINTERESTING objects differently, but these tests will > demonstrate the change in behavior. > > Signed-off-by: Derrick Stolee <stolee@gmail.com> > --- Nice. I like the approach of implementing the API in a single commit, and then demonstrating a trivial "caller" by way of a custom test helper. I think that the artifact of having a test helper here is useful on its own, but it also serves as a good example of how to use the API, and provides something to actually test the implementation with. I'm going to steal this pattern the next time I need to work on something that necessitates a complex new API ;-). > +static int emit_block(const char *path, struct oid_array *oids, > + enum object_type type, void *data) > +{ > + struct path_walk_test_data *tdata = data; > + const char *typestr; > + > + switch (type) { > + case OBJ_TREE: > + typestr = "TREE"; > + tdata->tree_nr += oids->nr; > + break; > + > + case OBJ_BLOB: > + typestr = "BLOB"; > + tdata->blob_nr += oids->nr; > + break; > + default: > + BUG("we do not understand this type"); > + } I think you could write this as: if (type == OBJ_TREE) tdata->tree_nr += oids->nr; else if (type == OBJ_BLOB) tdata->blob_nr += oids->nr; else BUG("we do not understand this type"); typestr = type_name(type); Which DRYs things up a bit and uses the type_name() helper. That will give you strings like "tree" and "blob" instead of "TREE" and "BLOB", but I'm not sure if the casing is important. Thanks, Taylor

-Original file line number
+Diff line change
@@ -0,0 +1,63 @@
+    Path-Walk API
+    =============
+    The path-walk API is used to walk reachable objects, but to visit objects
+    in batches based on a common path they appear in, or by type.
+    For example, all reachable commits are visited in a group. All tags are
+    visited in a group. Then, all root trees are visited. At some point, all
+    blobs reachable via a path `my/dir/to/A` are visited. When there are
+    multiple paths possible to reach the same object, then only one of those
+    paths is used to visit the object.
+    Basics
+    ------
+    To use the path-walk API, include `path-walk.h` and call
+    `walk_objects_by_path()` with a customized `path_walk_info` struct. The
+    struct is used to set all of the options for how the walk should proceed.
+    Let's dig into the different options and their use.
+    `path_fn` and `path_fn_data`::
+    	The most important option is the `path_fn` option, which is a
+    	function pointer to the callback that can execute logic on the
+    	object IDs for objects grouped by type and path. This function
+    	also receives a `data` value that corresponds to the
+    	`path_fn_data` member, for providing custom data structures to
+    	this callback function.
+    `revs`::
+    	To configure the exact details of the reachable set of objects,
+    	use the `revs` member and initialize it using the revision
+    	machinery in `revision.h`. Initialize `revs` using calls such as
+    	`setup_revisions()` or `parse_revision_opt()`. Do not call
+    	`prepare_revision_walk()`, as that will be called within
+    	`walk_objects_by_path()`.
+    +
+    It is also important that you do not specify the `--objects` flag for the
+    `revs` struct. The revision walk should only be used to walk commits, and
+    the objects will be walked in a separate way based on those starting
+    commits.
+    `commits`, `blobs`, `trees`, `tags`::
+    	By default, these members are enabled and signal that the path-walk
+    	API should call the `path_fn` on objects of these types. Specialized
+    	applications could disable some options to make it simpler to walk
+    	the objects or to have fewer calls to `path_fn`.
+    +
+    While it is possible to walk only commits in this way, consumers would be
+    better off using the revision walk API instead.
+    `prune_all_uninteresting`::
+    	By default, all reachable paths are emitted by the path-walk API.
+    	This option allows consumers to declare that they are not
+    	interested in paths where all included objects are marked with the
+    	`UNINTERESTING` flag. This requires using the `boundary` option in
+    	the revision walk so that the walk emits commits marked with the
+    	`UNINTERESTING` flag.
+    Examples
+    --------
+    See example usages in:
+    	`t/helper/test-path-walk.c`

-Original file line number
+Diff line change
@@ Expand Up / @@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o @@
     TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
     TEST_BUILTINS_OBJS += test-partial-clone.o
     TEST_BUILTINS_OBJS += test-path-utils.o
+    TEST_BUILTINS_OBJS += test-path-walk.o
     TEST_BUILTINS_OBJS += test-pcre2-config.o
     TEST_BUILTINS_OBJS += test-pkt-line.o
     TEST_BUILTINS_OBJS += test-proc-receive.o
@@ Expand Down Expand Up / @@ -1094,6 +1095,7 @@ LIB_OBJS += parse-options.o @@
     LIB_OBJS += patch-delta.o
     LIB_OBJS += patch-ids.o
     LIB_OBJS += path.o
+    LIB_OBJS += path-walk.o
     LIB_OBJS += pathspec.o
     LIB_OBJS += pkt-line.o
     LIB_OBJS += preload-index.o
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PATH WALK I: The path-walk API #1818

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Dec 13, 2024

gitgitgadget bot Dec 18, 2024

gitgitgadget bot Dec 27, 2024

gitgitgadget bot Dec 27, 2024

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Nov 4, 2024

gitgitgadget bot Nov 21, 2024

gitgitgadget bot Dec 13, 2024

gitgitgadget bot Dec 18, 2024

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Nov 1, 2024

gitgitgadget bot Nov 4, 2024

gitgitgadget bot Nov 4, 2024

gitgitgadget bot Nov 8, 2024

gitgitgadget bot Nov 6, 2024

gitgitgadget bot Nov 8, 2024

gitgitgadget bot Nov 21, 2024

PATH WALK I: The path-walk API #1818

Are you sure you want to change the base?

PATH WALK I: The path-walk API #1818

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 13, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 18, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 27, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 27, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 4, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 21, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 13, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 18, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 1, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 4, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 4, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 8, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 6, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 8, 2024

Choose a reason for hiding this comment

gitgitgadget bot Nov 21, 2024

Choose a reason for hiding this comment