Format for all the surveys within the pipeline #3

PhDyellow · 2021-07-21T03:49:16Z

Maintainer

I get the most out of targets if all surveys are massaged into the same format.

The important data (in long format) are:

 c("survey", "trophic", "depth", "depth_cat", spatial_vars, "taxon", "abund")

depth could probably be dropped though.
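To make the target format concrete, here is a minimal sketch of a long-format table with those columns. The contents of `spatial_vars` (`lon`, `lat`), the survey names, the depth categories and all values are made up for illustration; only the column names come from the list above.

```r
library(data.table)

# Assumption: the post does not say what spatial_vars contains.
spatial_vars <- c("lon", "lat")

# Column names as listed above.
survey_cols <- c("survey", "trophic", "depth", "depth_cat",
                 spatial_vars, "taxon", "abund")

# Made-up rows: one row per taxon per site per survey (long format).
surveys_long <- data.table(
  survey    = c("s1", "s1", "s1", "s2", "s2"),
  trophic   = c("zoo", "zoo", "zoo", "fish", "fish"),
  depth     = c(10, 10, 250, 30, 30),
  depth_cat = c("epi", "epi", "meso", "epi", "epi"),
  lon       = c(150.12, 150.12, 151.48, 150.12, 152.9),
  lat       = c(-35.02, -35.02, -36.51, -35.02, -34.2),
  taxon     = c("sp_a", "sp_b", "sp_a", "sp_c", "sp_c"),
  abund     = c(3, 0, 12, 1, 7)
)

# Check that a survey exposes exactly the agreed columns, in order.
stopifnot(identical(names(surveys_long), survey_cols))
```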

The way to use the surveys is to iterate over the surveys, fitting a GF model to each survey.

Targets has a few ways to iterate over objects, so the collection of surveys can be one of these forms:

1. One large data.table, iterate with group_by = c("survey", "trophic", "depth_cat")
2. List of surveys
3. List column data.table, one row per survey, with survey metadata as character vector columns and survey data as list columns.

One large data.table makes global operations easy, such as rounding lats and lons and merging with env data. However, operations that apply to separate groups are a bit more awkward, because I have to group every time. It is also conceptually less obvious: later steps will create objects that do not fit into the large data.table, so iterating over the raw species data would be done very differently to iterating over GF models. Normalising the data is a bit trickier too, mostly when I want to keep track of sites separately from the species counts at those sites. A site can exist in many surveys, so how do I distinguish a site that was sampled but empty in one survey from a site that was not sampled at all in a second survey? I would need a separate list of data.tables that stored the site locations for each survey.
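A rough sketch of that trade-off, using made-up data (the survey names, taxa and the `lon`/`lat` columns are assumptions): the global operation is a single call, while every per-survey operation has to restate the grouping.

```r
library(data.table)

# Made-up long-format data; column names follow the format described earlier.
surveys_long <- data.table(
  survey    = c("s1", "s1", "s1", "s2", "s2"),
  trophic   = c("zoo", "zoo", "zoo", "fish", "fish"),
  depth_cat = c("epi", "epi", "meso", "epi", "epi"),
  lon       = c(150.12, 150.12, 151.48, 150.12, 152.9),
  lat       = c(-35.02, -35.02, -36.51, -35.02, -34.2),
  taxon     = c("sp_a", "sp_b", "sp_a", "sp_c", "sp_c"),
  abund     = c(3, 0, 12, 1, 7)
)

# Global operation: one call rounds coordinates for every survey at once.
coord_cols <- c("lon", "lat")
surveys_long[, (coord_cols) := lapply(.SD, round, digits = 1), .SDcols = coord_cols]

# Per-group operation: the grouping must be repeated on every call.
group_cols <- c("survey", "trophic", "depth_cat")
per_group <- surveys_long[
  , .(n_taxa = uniqueN(taxon), total_abund = sum(abund)),
  by = group_cols
]
```

Anything that is not one row per observation, such as a fitted model per group, has no natural home in `surveys_long` and would need a second structure keyed by the same three columns.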

A list of surveys is easy to loop over, but harder to group. In particular, trophic and survey are hierarchical, and I will want to combine GF models within trophic levels, across surveys. In a list of lists, I need to crawl the list and return entries matching a trophic level.
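As an illustration of the crawling problem, here is a small sketch with a hypothetical nested list (the structure, names and values are assumptions, not the pipeline's actual objects). Looping is one line, but selecting by trophic level means inspecting every entry.

```r
# Hypothetical nested list: one entry per survey, carrying its own metadata.
surveys_list <- list(
  s1 = list(trophic = "zoo",
            data = data.frame(taxon = c("sp_a", "sp_b"), abund = c(3, 0))),
  s2 = list(trophic = "fish",
            data = data.frame(taxon = "sp_c", abund = 1)),
  s3 = list(trophic = "zoo",
            data = data.frame(taxon = "sp_a", abund = 12))
)

# Looping over surveys is direct.
n_rows <- vapply(surveys_list, function(s) nrow(s$data), integer(1))

# Grouping is not: each lookup by trophic level crawls the whole list.
zoo_surveys <- Filter(function(s) identical(s$trophic, "zoo"), surveys_list)
zoo_names <- names(zoo_surveys)
```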

A list column data.table combines the benefits of a list of surveys, where each survey is largely self-contained, with the grouping and iteration benefits of data.frames. I can iterate over rows to operate on each survey, and I can group surveys by the trophic column. I can include arbitrary data in the list cols, and even keep adding interesting data to the overall data.table by adding more list cols. The GF model is then meaningfully connected to the data that created it. The main risk is going overboard with adding cols and ending up with redundant data in the pipeline targets (survey data, survey data with GF model, survey data and GF model with CASTeR objects). The biggest advantage of the list column approach is that all data is handled in a similar way throughout the pipeline. Metadata is a vector col that can be filtered and grouped like any data.frame. Looping over surveys (or other units of analysis) is done over rows. Related data is kept together within a row, without needing to pass multiple lists and constantly look up elements in those lists by metadata names: access looks like current_loop_row$sites rather than sites[[loop_trophic]][[loop_survey]], and current_loop is not a global variable that has to be passed in to the function.
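A minimal sketch of the list column layout, assuming made-up data and a hypothetical `fit_placeholder()` that stands in for fitting a GF model (it only summarises the data; it is not a real model fit):

```r
library(data.table)

# Made-up long-format data; lon/lat stand in for spatial_vars.
surveys_long <- data.table(
  survey    = c("s1", "s1", "s1", "s2", "s2"),
  trophic   = c("zoo", "zoo", "zoo", "fish", "fish"),
  depth_cat = c("epi", "epi", "meso", "epi", "epi"),
  lon       = c(150.12, 150.12, 151.48, 150.12, 152.9),
  lat       = c(-35.02, -35.02, -36.51, -35.02, -34.2),
  taxon     = c("sp_a", "sp_b", "sp_a", "sp_c", "sp_c"),
  abund     = c(3, 0, 12, 1, 7)
)

# One row per survey unit: metadata stays in plain columns,
# the survey rows become a nested data.table in a list column.
surveys_nested <- surveys_long[
  , .(data = list(.SD)),
  by = .(survey, trophic, depth_cat)
]

# Related data sits alongside as another list column, here the unique sites.
surveys_nested[, sites := list(lapply(data, function(d) unique(d[, .(lon, lat)])))]

# Hypothetical stand-in for a GF model fit.
fit_placeholder <- function(d) {
  list(n_taxa = uniqueN(d$taxon), total_abund = sum(d$abund))
}

# Iterate over rows: each fit stays attached to the data that produced it.
surveys_nested[, model := list(lapply(data, fit_placeholder))]

# Metadata filters and groups like any data.frame column.
zoo_rows <- surveys_nested[trophic == "zoo"]
units_per_trophic <- surveys_nested[, .(n_units = .N), by = trophic]

# Row-wise access: everything for one survey unit comes from one row.
current_loop_row <- surveys_nested[1]
first_sites <- current_loop_row$sites[[1]]
```

Wrapping the `lapply()` result in `list()` keeps the assignment a list column even when the table has a single row.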

PhDyellow · 2021-07-21T03:49:30Z

Maintainer Author

I am using list cols in data.tables.
