Exercise(s) to Classify Genomic Sequences into Coding and Non-Coding

Data

To train, validate and test binary sequence (= time-series) classification methods, we use DNA sequences that are either completely coding or completely non-coding. These sequences are data from
Mertsch and Stanke, End-to-end Learning of Evolutionary Models to Find Coding Regions in Genome Alignments, bioRxiv 2021.

Models to Compare

Let $k\in \{0,1,2,\dots\}$ be a model order. Let $x \in \{a,c,g,t\}^\ell$ be an input DNA sequence of length $\ell$. Let $y\in\{0,1\}$ be a class, where $y=1$ means coding (= positive) and $y=0$ means non-coding (= negative).

1. Two $k$-th order Markov chains, one for coding, one for non-coding, trained individually to maximize the likelihood of the respective data.
2. Like 1., but the positive model is 3-periodic.
3. Two $k$-th order Markov chains, one for coding, one for non-coding, followed by logistic regression to predict a probability of coding. Trained (discriminatively) to minimize cross-entropy error (CEE).
4. Like 3., but the positive model is 3-periodic.
5. Like 4., but $M>2$ models are allowed and $M$ is optimized.
6. $M$ HMMs with a fixed number of states ($n=3$) are trained jointly with logistic regression.
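
As a rough illustration of the first two models, the following sketch fits Markov chains by counting and classifies with the sign of the log-likelihood ratio. It is not the repository's implementation. The class name `MarkovChain`, the function `classify`, the `period` and `pseudocount` parameters and the toy sequences are all hypothetical. It further assumes add-one smoothing instead of pure maximum likelihood, equal class priors, sequences over `acgt` only, and coding sequences that start at codon position 0.

```python
import math

ALPHABET = "acgt"


class MarkovChain:
    """k-th order Markov chain over {a,c,g,t}.

    period=1 gives a homogeneous chain; period=3 keeps one set of
    transition counts per codon position (i mod 3). Both `period` and
    `pseudocount` are illustrative assumptions, not taken from the source.
    """

    def __init__(self, k, period=1, pseudocount=1.0):
        self.k = k
        self.period = period
        self.pseudocount = pseudocount
        # counts[phase][context][next_char] -> number of occurrences
        self.counts = [{} for _ in range(period)]

    def fit(self, sequences):
        """Count transitions context -> next character in the training data."""
        for x in sequences:
            for i in range(self.k, len(x)):
                row = self.counts[i % self.period].setdefault(x[i - self.k:i], {})
                row[x[i]] = row.get(x[i], 0.0) + 1.0
        return self

    def log_likelihood(self, x):
        """Log-probability of x[k:] conditioned on the first k characters."""
        total = 0.0
        for i in range(self.k, len(x)):
            row = self.counts[i % self.period].get(x[i - self.k:i], {})
            numerator = row.get(x[i], 0.0) + self.pseudocount
            denominator = sum(row.values()) + self.pseudocount * len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def classify(x, coding_model, noncoding_model):
    """Return 1 (coding) if the log-odds are positive, else 0 (non-coding).

    Assumes equal prior probability for both classes.
    """
    log_odds = coding_model.log_likelihood(x) - noncoding_model.log_likelihood(x)
    return 1 if log_odds > 0 else 0


# Toy data, made up for illustration only.
coding = ["atgatgatgatg", "atgatgatg"]
noncoding = ["aaaaaaaaaaaa", "ttttttttt"]

pos = MarkovChain(k=1, period=3).fit(coding)   # 3-periodic positive model
neg = MarkovChain(k=1).fit(noncoding)          # homogeneous negative model

print(classify("atgatgatg", pos, neg))
print(classify("aaaaaaaaa", pos, neg))
```

Setting `period=1` for the positive model corresponds to model 1 and `period=3` to model 2. For models 3 and 4, the two log-likelihoods would not be compared directly but passed to a logistic regression that is trained together with the chain parameters.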

Compare the Accuracy of above Models on the Test Data
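
A minimal sketch of how accuracy could be computed for any of the models, assuming each model outputs either a class label or a probability of coding per test sequence. The function names `accuracy` and `predict_from_probability` and the decision threshold of 0.5 are hypothetical choices, not taken from the repository.

```python
def predict_from_probability(p_coding, threshold=0.5):
    """Turn predicted probabilities of coding into class labels (1 = coding).

    The threshold of 0.5 is an assumption made for illustration.
    """
    return [1 if p >= threshold else 0 for p in p_coding]


def accuracy(y_true, y_pred):
    """Fraction of test sequences whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels and probabilities for illustration only.
y_test = [1, 0, 1, 1]
p_test = [0.9, 0.2, 0.4, 0.7]
print(accuracy(y_test, predict_from_probability(p_test)))
```

Evaluating all models on the same held-out test sequences keeps the resulting numbers comparable.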
