I am using list cols in data.tables.
I get the most out of `targets` if all surveys are massaged into the same format. The important data (in long format) are:
`depth` could probably be dropped though.
The way to use the surveys is to iterate over them, fitting a GF model to each survey.
Targets has a few ways to iterate over objects, so the collection of surveys can be one of these forms:
One large data.table makes global operations easy, such as rounding lats and lons and merging with env data. However, it makes operations that apply to separate groups a bit more awkward, because I have to group every time. It is also less obvious conceptually: later steps will create objects that do not fit into the large data.table, so iterating over the raw species data would be done very differently to iterating over GF models. It makes normalising the data a bit trickier too, mostly when I want to keep track of sites separately from species counts at sites. A site can exist in many surveys, so how do I denote a site that was empty in one survey and not sampled at all in a second survey? I would need a separate list of data.tables that stored site locations for each survey.
A list of surveys is easy to loop over, but harder to group. In particular, trophic and survey are hierarchical, and I will want to combine GF models within trophic levels, across surveys. In a list of lists, I need to crawl the list and return entries matching a trophic level.
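To make the trade-off concrete, here is a minimal sketch of the plain-list form in base R. The survey names, trophic levels, and columns are made up for illustration and are not the real data.

```r
# Hypothetical surveys: names, trophic levels and columns are illustrative only.
surveys <- list(
  list(survey = "survey_a", trophic = "zooplankton",
       data = data.frame(species = c("sp1", "sp2"), abundance = c(3, 0))),
  list(survey = "survey_b", trophic = "fish",
       data = data.frame(species = "sp9", abundance = 1)),
  list(survey = "survey_c", trophic = "zooplankton",
       data = data.frame(species = "sp1", abundance = 7))
)

# Looping is simple: one list element per survey.
n_obs <- vapply(surveys, function(s) nrow(s$data), integer(1))

# Grouping is the awkward part: crawl the list and test each entry's metadata.
zoo <- Filter(function(s) s$trophic == "zooplankton", surveys)
zoo_names <- vapply(zoo, function(s) s$survey, character(1))
```

Every grouping operation needs its own crawl like the `Filter()` call, which is what gets tedious once trophic levels have to be combined across surveys.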
A list column data.table combines the benefits of lists of surveys, where each survey is largely self-contained, with the grouping and iteration benefits of data.frames. I can iterate over each row, to operate on each survey, and I can group surveys by the trophic column. I can include arbitrary data in the list cols, and even continue to add interesting data to the overall data.table by adding more list cols. Then the GF model is meaningfully connected to the data that created it.
The main risk is going overboard with adding cols, and having redundant data in the pipeline targets (survey data, survey data with GF model, survey data and GF model with CASTeR objects).
The biggest advantage of the list column approach is that all data is handled in a similar way throughout the pipeline: metadata is a vector col that can be filtered and grouped like any data.frame, looping over surveys (or other units of analysis) is done over rows, and related data is kept together within a row without needing to pass multiple lists and constantly look up elements in those lists by metadata names (access looks like `current_loop_row$sites` rather than `sites[[loop_trophic]][[loop_survey]]`, and current_loop is not a global variable passed in to the function).
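A rough sketch of the list column layout, assuming the `data.table` package is installed. The survey names, site coordinates, and the `fit_placeholder()` function are hypothetical; the placeholder only stands in for fitting a GF model and does not fit one.

```r
library(data.table)

# Hypothetical surveys: metadata lives in ordinary vector columns,
# per-survey site tables live in a list column.
surveys <- data.table(
  survey  = c("survey_a", "survey_b", "survey_c"),
  trophic = c("zooplankton", "fish", "zooplankton"),
  sites   = list(
    data.table(site = 1:2, lat = c(-20.5, -21.0), lon = c(150.0, 150.5)),
    data.table(site = 1L, lat = -30.0, lon = 153.0),
    data.table(site = 1:3, lat = c(-10, -11, -12), lon = c(140, 141, 142))
  )
)

# Iterate over rows: one element of the list column per survey.
surveys[, n_sites := vapply(sites, nrow, integer(1))]

# Stand-in for model fitting (NOT a real GF model); the result is stored
# in a new list column next to the data that produced it.
fit_placeholder <- function(sites_dt) list(mean_lat = mean(sites_dt$lat))
surveys[, model := lapply(sites, fit_placeholder)]

# Group surveys by the trophic column like any other data.table.
by_trophic <- surveys[, .(n_surveys = .N, total_sites = sum(n_sites)),
                      by = trophic]

# Related data is reached through the row, not through nested list lookups.
current_loop_row <- surveys[survey == "survey_a"]
current_sites <- current_loop_row$sites[[1]]
```

One thing to watch with this layout: `current_loop_row$sites` is still a list of length one, so `[[1]]` is needed to get at the site table itself.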