To train, validate, and test binary sequence (i.e., time-series) classification methods, we use DNA sequences that are either
- completely coding or
- completely non-coding
These sequences are taken from
Mertsch and Stanke, "End-to-end Learning of Evolutionary Models to Find Coding Regions in Genome Alignments", bioRxiv 2021.
Let
1. Two $k$-th order Markov chains, one for coding and one for non-coding sequences, each trained individually to maximize the likelihood of its respective data.
2. Like 1, but the positive (coding) model is 3-periodic.
3. Two $k$-th order Markov chains, one for coding and one for non-coding, followed by logistic regression that predicts the probability of coding. The models are trained discriminatively to minimize the cross-entropy error (CEE).
4. Like 3, but the positive model is 3-periodic.
5. Like 4, but $M>2$ models are allowed and $M$ is optimized.
6. $M$ HMMs with a fixed number of states ($n=3$) are trained jointly with logistic regression.
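
The first two methods in the list above (generatively trained Markov chains, optionally 3-periodic for the coding class) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation used here. The function names, the pseudocount smoothing, the uniform fallback for unseen contexts, the zero decision threshold (equal class priors), and the toy sequences are all assumptions. It also assumes upper-case `ACGT` input and that coding sequences start in frame 0. The first $k$ positions of each sequence are not scored.

```python
import math

ALPHABET = "ACGT"  # assumption: upper-case sequences without ambiguity codes


def train_markov_chain(seqs, k, period=1, pseudocount=1.0):
    """Estimate a k-th order Markov chain by smoothed maximum likelihood.

    period=1 gives a homogeneous chain; period=3 keeps one transition table
    per codon position (assumes every sequence starts in frame 0).
    """
    counts = [{} for _ in range(period)]
    for s in seqs:
        for i in range(k, len(s)):
            # One row of symbol counts per length-k context, initialized
            # with the pseudocount.
            row = counts[i % period].setdefault(
                s[i - k:i], dict.fromkeys(ALPHABET, pseudocount))
            row[s[i]] += 1
    # Normalize each context row into log-probabilities.
    return [
        {ctx: {a: math.log(n / sum(row.values())) for a, n in row.items()}
         for ctx, row in table.items()}
        for table in counts
    ]


def log_likelihood(model, s, k):
    """Log-likelihood of s given its first k symbols."""
    period = len(model)
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    return sum(
        model[i % period].get(s[i - k:i], {}).get(s[i], uniform)
        for i in range(k, len(s))
    )


def log_odds(coding_model, noncoding_model, s, k):
    """Positive values favor 'coding' under equal class priors."""
    return (log_likelihood(coding_model, s, k)
            - log_likelihood(noncoding_model, s, k))


# Toy data, made up purely for illustration.
k = 1
coding_model = train_markov_chain(["ATGGCC" * 10], k, period=3)
noncoding_model = train_markov_chain(["ATATATATTTAA" * 5], k)
print(log_odds(coding_model, noncoding_model, "ATGGCCATGGCC", k))
```

In the discriminative variants (3 to 6), the per-model log-likelihoods would instead serve as inputs to a logistic regression, with all parameters trained to minimize the CEE; that step is not shown in this sketch.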