
Recreating training model #24

Open
bnpapas opened this issue Mar 10, 2023 · 7 comments

Comments


bnpapas commented Mar 10, 2023

I have been attempting to use the fast5 data provided with the manuscript to train a model to call the same 4 barcodes as "resnet20-final.h5". I've used mapping information to assign barcodes, and if I use the given model with deeplexicon the agreement with my truth table is excellent.
I've tried taking 40k reads from each barcode as a training set, with 10k from each as test and validation sets. The training runs, seemingly without issue, however it shows some behavior I don't understand.

  1. Even when the reported accuracy on the training set crosses 0.9, the validation accuracy hovers around 0.5. I even ran a test in which the validation set was a subset of the training reads, and this still occurs.
  2. Using the final model output by the training yields terrible results, even when used on the training set.

Note: I have been using the docker image obtained by pulling lpryszcz/deeplexicon:1.2.0-gpu, running "deeplexicon_multi.py train" with default options. Do you have any suggestions for how I can improve the model training results?
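For anyone reproducing this setup, the per-barcode split described above might look roughly like the sketch below. It assumes 10k reads per barcode for each of the test and validation sets (the post can be read either way), and `split_reads`, the read IDs, and the barcode names are hypothetical, not part of deeplexicon.

```python
import random


def split_reads(reads_by_barcode, n_train, n_holdout, seed=0):
    """Draw disjoint train/test/val read sets, balanced per barcode.

    reads_by_barcode maps a barcode label to a list of read IDs
    (hypothetical structure, e.g. built from a mapping-based truth table).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "test": [], "val": []}
    needed = n_train + 2 * n_holdout
    for barcode, read_ids in sorted(reads_by_barcode.items()):
        if len(read_ids) < needed:
            raise ValueError(f"{barcode}: need {needed} reads, got {len(read_ids)}")
        # Sample without replacement so no read lands in two sets.
        chosen = rng.sample(sorted(read_ids), needed)
        train = chosen[:n_train]
        test = chosen[n_train:n_train + n_holdout]
        val = chosen[n_train + n_holdout:]
        splits["train"] += [(read_id, barcode) for read_id in train]
        splits["test"] += [(read_id, barcode) for read_id in test]
        splits["val"] += [(read_id, barcode) for read_id in val]
    return splits


# Toy usage with made-up read IDs; the post uses 40k / 10k / 10k per barcode.
toy_reads = {
    "bc_1": [f"read_a{i}" for i in range(6)],
    "bc_2": [f"read_b{i}" for i in range(6)],
}
toy_splits = split_reads(toy_reads, n_train=4, n_holdout=1)
```

Because the three sets are disjoint by construction, a validation accuracy near 0.5 on such a split cannot be explained by leakage or by an unbalanced barcode mix.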

Collaborator

enovoa commented Mar 10, 2023

Hi @bnpapas - Are you segmenting the fast5?

Author

bnpapas commented Mar 13, 2023

I am following the instructions posted here: https://psy-fer.github.io/deeplexicon/train/
I'm not sure which step would be segmentation?

Collaborator

noncodo commented Mar 13, 2023 via email

Author

bnpapas commented Mar 14, 2023

The goal here is to be able to train a new model with an eye towards possibly adding new barcodes - I won't be able to use dmux first in a real use case. The truth table files I've assembled are based on mapping information, as was done in the publication. The match between these truth tables and the dmux results from "resnet20-final.h5" is very good.
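The match between a mapping-based truth table and the dmux calls could be quantified along these lines. This is only a sketch: the `agreement` helper and the dictionary layout (read ID to barcode label) are assumptions for illustration, not the format deeplexicon writes.

```python
def agreement(truth, calls):
    """Fraction of reads present in both tables whose barcode labels match.

    truth and calls each map a read ID to a barcode label
    (hypothetical layout; adapt to the actual file formats).
    """
    shared = truth.keys() & calls.keys()
    if not shared:
        raise ValueError("no reads in common between the two tables")
    matches = sum(truth[read_id] == calls[read_id] for read_id in shared)
    return matches / len(shared)


# Toy usage with made-up read IDs and barcode labels.
truth_table = {"r1": "bc_1", "r2": "bc_2", "r3": "bc_3", "r4": "bc_4"}
dmux_calls = {"r1": "bc_1", "r2": "bc_2", "r3": "bc_3", "r4": "bc_1"}
score = agreement(truth_table, dmux_calls)
```

Running the same comparison on the output of a freshly trained model gives a number directly comparable to the one obtained with "resnet20-final.h5".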

Edit: To make sure it is clear, I am using the python version of the training code, which uses the "dRNA_segmenter" function to segment reads prior to image generation and subsequent training.

Author

bnpapas commented May 26, 2023

When dmux is assigning barcodes, it uses the "classify" function. This function does a transform of the data:

  x = image.astype('float32') + 1
  x = x / 2

The training subcommand, however, does not take this step and trains directly on the images. I've removed the transform from "classify" and now my freshly-trained models produce sensible results with dmux. I assume I can get similar behavior by adding the transform into the train subroutine.
Is there a reason to think having this transformation is better than not?
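One way to keep the two code paths from drifting apart is to route both through a single preprocessing helper. The sketch below assumes NumPy image arrays, as the `astype` call above suggests; `scale_image` and the sample array are hypothetical and are not taken from the deeplexicon source.

```python
import numpy as np


def scale_image(image):
    """Apply the same shift-and-halve transform that "classify" uses."""
    x = np.asarray(image).astype("float32") + 1
    return x / 2


# Made-up stand-in for one squiggle image; real images come from the pipeline.
raw = np.array([[0, 127, 255]], dtype="uint8")

train_input = scale_image(raw)     # what the training path would feed the network
classify_input = scale_image(raw)  # what the inference path would feed the network

# Both paths agree when they share the helper ...
assert np.array_equal(train_input, classify_input)
# ... whereas skipping the transform on one side gives the network different inputs.
assert not np.array_equal(raw.astype("float32"), classify_input)
```

Whether the transform is applied in both places or removed from both matters less than applying it consistently: a network trained on one input scale and evaluated on another would be expected to behave much like the symptoms reported at the top of this thread.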

Owner

Psy-Fer commented May 26, 2023

I think that was added (and was meant to be applied in both places) to avoid a divide-by-zero error by making the values 1-indexed. Sorry, it's been a while since I wrote that.


fulaibaowang commented Aug 9, 2023

You may need to segment the data a priori, e.g. by running python3 deeplexicon.py dmux. This will split the signal to separate the barcodes from the RNA. Then train on the segmented barcode output.

On Mar 13, 2023, at 10:15 AM, bnpapas @.***> wrote: I am following the instructions posted here: https://psy-fer.github.io/deeplexicon/train/ I'm not sure which step would be segmentation?

Would you mind sharing the code? I see deeplexicon_multi.py squig for getting the segmentation, but how would you "split the signal to separate the barcodes from the RNA"?
