General question about pseudo-pooling #2047

cjfields · 2024-11-01T00:56:44Z

I'm using an approach like that described in the 'Big Data' workflow, but with the dada step in the loop farmed out to independent worker jobs on a cluster so these can be run in parallel. These are then merged afterwards, combined into a sequence table, and then chimeras are removed.

So far this works quite well, but we'd like to increase sensitivity. What I am wondering is whether we could essentially emulate what pseudo-pooling does by running a first pass like the above, generating a set of priors from the output, then running a second pass (again in parallel on the cluster) that includes the priors (generated similarly to

dada2/R/dada.R, line 400 in 278f5f3:

    pseudo_priors <- colnames(st)[colSums(st>0) >= opts$PSEUDO_PREVALENCE | colSums(st) >= opts$PSEUDO_ABUNDANCE]

). I'm not seeing anything in the function that immediately gives me pause, but would you know if there is anything we need to consider when implementing this (set.seed or any parameters that should be included in the following round)?

Thanks!


benjjneb · 2024-11-11T03:38:10Z

What you are suggesting looks exactly right to me. This is what pool="pseudo" does, but the built-in implementation can't farm out samples to different nodes as you are doing.

> if there is anything we need to consider when implementing this (set.seed or any parameters that should be included in the following round)?

I don't immediately see any issue.
I think what you are doing is consistent with our approach.
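
For anyone following along, the prior-selection step between the two passes might be sketched roughly as below. This is a minimal base-R illustration, not the dada2 implementation: `pseudo_priors_from_table` and its argument names are hypothetical, the toy table is made up, and the default thresholds (prevalence of 2, no abundance cutoff) are an assumption standing in for `opts$PSEUDO_PREVALENCE` and `opts$PSEUDO_ABUNDANCE` from the quoted line.

```r
# Hypothetical helper (not part of dada2): pick priors from a first-pass
# sequence table with samples as rows and sequences as column names.
# A sequence is kept if it is present in at least `min_prevalence` samples
# OR its total abundance is at least `min_abundance`, mirroring the
# expression quoted from dada.R. The default values are assumptions.
pseudo_priors_from_table <- function(st, min_prevalence = 2, min_abundance = Inf) {
  stopifnot(is.matrix(st), !is.null(colnames(st)))
  prevalent <- colSums(st > 0) >= min_prevalence  # number of samples containing each sequence
  abundant  <- colSums(st) >= min_abundance       # total reads per sequence across samples
  colnames(st)[prevalent | abundant]
}

# Toy first-pass table: 3 samples, 3 made-up sequences.
st <- matrix(
  c(10, 0, 3,
     5, 0, 0,
     0, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("ACGT", "TTGA", "GGCC"))
)

priors <- pseudo_priors_from_table(st)
print(priors)
```

In the second pass, each worker job would then receive the same `priors` vector and pass it through the `priors` argument of `dada()` for its own sample, after which the per-sample results are merged and chimeras removed as in the first pass.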
