Question about minimap2 parameters in long_read_typing.py #28

karoliinas · 2024-03-26T13:20:53Z

Hi! I'm running long_read_typing.py for PacBio -reads for which I have extracted the HLA-regions (originally aligned data to chm13, then extracted hla with HLA*LA and ran bam2fastq). I was wondering about the alignment, as the same script is used also for typing Nanopore reads and there is only one set of minimap2 parameters:
minimap2 -t {parameter.threads} -p 0.1 -N 100000 -a $ref $fq

And wondered why those recommended for pacbio aren't used? E.g:
minimap2 -ax map-pb ref.fa pacbio-reads.fq
or
-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200

Should instead use the script SpecHLA.sh or am I correct in only running the long_read_typing.py at this point?

Many thanks!

The text was updated successfully, but these errors were encountered:

wshuai294 · 2024-03-26T16:12:24Z

Hi,

Thank you for the patient effort. For long-read data only, long_read_typing.py should be used instead of SpecHLA.sh. For long-read-only typing, we put effort into binning reads instead of the alignment. I believe your suggested parameter setting of Minimap2 is better. We will update it as suggested soon.

Thank you so much.

karoliinas · 2024-03-27T12:49:57Z

Thanks for your insights and the great the tool! I will modify the script for my hifi data. It's low coverage (~10X) and I appreciate it's not optimal for hla typing. I'm now running both HLA*LA and SpecHLA to have more confidence in the genotypes that match.

karoliinas · 2024-04-11T13:11:05Z

Hi,
I'm reopening the issue although I'm not sure whether my question falls under the same category, but likely it has something to do with alignment. I managed to genotype my data, and the results mostly agree with those from HLA*LA, which is nice. I'm plotting the depths to further evaluate the calls, and notice that HLA DPB1 has a ~500bp region in all the samples with spuriously high depth (1000-2000), whereas the data is generally ~10x. Do you have any idea why this might be?

Thanks again!

wshuai294 · 2024-04-12T03:56:31Z

Hi,
I guess two possible reasons:

Is the input data short reads or long reads? If they are short reads, it might be caused by PCR duplicates which can be removed by tools such as fastp.
This region might be homologous for many HLA genes. Maybe you can extract this ~500bp region, and map it to chr6 to see the hits. Or you can use online ncbi-blast to map it to the nr database. Also, you can observe the reads mapped to this region, and check if they only map to this region or they can map to other regions.

Hope it can help.

Best,
Shuai

karoliinas · 2024-04-16T11:14:16Z

Hi, thanks for the insights! I have been observing the alignment files outputted by specHLA (DPB1.bam), and the read quality and mapping seem fine to me.

I have long read data (pacbio hifi), and therefore only ran the long_read_typing.py. Does it also perform read binning, or only alignment? I was thinking could the problem arise in the binning step?

Best,
Karoliina

Thanks for the help!

wshuai294 · 2024-04-16T11:58:52Z

Hi,

SpecHLA indeed bins the long reads as well. Please look at the file ${sample}.db.sam in the output folder, which records what alleles each read can map to.

Best,
shuai

karoliinas · 2024-04-17T08:19:27Z

Right, first of all, it's great to have all the outputs for troubleshooting! Seems that there's a lot going on under the hood. So for most of the samples, the file ${sample}.db.sam has roughly 3M reads altogether, and ~900K reads are assigned for DPB1. Does that sound like a lot?

For your information, I did change the mapping parameters to suit the hifi -reads because the original got stuck in the mapping step for a long time:

under class Pacbio_Binning:
minimap2 -x map-hifi -t {parameter.threads} -p 0.1 -N 100000 -a $ref $fq > $outdir/$sample.db.sam and
under class Fasta:
minimap2 -x map-hifi -t %s -a $hla_ref $outdir/$hla.%s.fq.gz | samtools view -bS -F 0x800 -| samtools sort - >$outdir/$hla.bam
I didn't, however, touch the parameters about secondary mapping.

Thank you for your helpful comments and your patience!

wshuai294 · 2024-04-17T11:57:09Z

Hi, I found that the reads from the bacterial artificial chromosome (BAC) are wrongly assigned to DPB1. It would be a good idea to map the long reads to the BAC library to remove such reads before using SpecHLA.

Here is how I find this:
First, I check the reads assigned to DPB1 in the NA19238 PacBio hifi sample, where the read m64039_190830_040857/79364868/ccs is weird, cause only a small middle fraction can map to DPB1 allele (cigar shows). If a read is derived from DPB1, there will be three cases: the left end of the read map to the DPB1, the middle part can map to the whole DPB1 gene, or the right end of the read map to the DPB1. It is impossible that only a small middle fraction maps to DPB1.

To see what the read is, I map the reads to the NR database using ncbi-blast, and found this read can perfectly map to Homo sapiens BAC clone RP11-576F1 from 2, complete sequence. The alignment result is

Maybe you can extract some reads assigned to DPB1 and map them to the NR database using ncbi-blast too. If it's true, we can remove them by aligning the raw reads to the BAC library before using SpecHLA.

karoliinas · 2024-04-17T14:32:28Z

Wow, great, thank you for your trouble! Interesting, that there is an identical sequence in BAC and DPB1. I will test blast, and am currently downloading the databases as my data is sensitive and I can't use the web tool. Would this BAC presence be due to contamination? And where does it come from, I understand it is used as cloning vector, but I am dealing with whole genomes measured from blood. Sorry, seems that I have questions again, but they are not urgent. I'll let you know when I have blasted a few DPB1 -reads.

wshuai294 · 2024-04-18T03:44:53Z

I also guess the BAC sequences are contamination introduced in sequencing; they might be used for sequencing accuracy assessment or human DNA amplification. However, further surveys are needed to confirm this. Waiting for your alignment results.

karoliinas · 2024-04-24T10:55:50Z

Hi, after blasting a number of the dubious reads, they do hit some kind of BACs, though with slightly different id than in your screenshot. I tried running HiFiAdapterfilt but seems that the sequences aren't included in the list of PacBio adapters, so that was not helpful. Not sure which BAC library to align the reads to, and how to go about it. How big of an effect would you think these reads have in the hla-haplotyping? And if it's big, would it affect all the genes or only DPB1?

wshuai294 · 2024-04-24T12:07:51Z

Thanks for the reply. We have not assessed the influence of BAC yet. I think these BAC-derived reads only harm high-resolution typing results (e.g., 8-digit). I will construct a pipeline to remove such reads, and see whether it improves the performance of SpecHLA. If it does, I will add it to long_read_typing.py.

wshuai294 · 2024-04-24T13:00:42Z

By the way, what's the location of this HLA DPB1 region, i.e., ~500bp region in all the samples with spuriously high depth (1000-2000)? In my data, this region is DPB1:2248-2832bp based on the SpecHLA reference.

karoliinas · 2024-04-24T13:22:21Z

Ok, this is a relief, that the results should hold :) And like I said, the results are pretty identical to those from hla*la.

The region in my data is 2265-2832, slightly different start. Many thanks for your help and insights!

wshuai294 · 2024-04-25T04:35:32Z

It is great to hear the consistency between the results of HLA*LA and SpecHLA, indicating they are reliable. I am collecting the BAC sequences from NCBI.

karoliinas closed this as completed Mar 27, 2024

karoliinas reopened this Mar 27, 2024

karoliinas closed this as completed Mar 27, 2024

karoliinas reopened this Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about minimap2 parameters in long_read_typing.py #28

Question about minimap2 parameters in long_read_typing.py #28

karoliinas commented Mar 26, 2024 •

edited

Loading

wshuai294 commented Mar 26, 2024

karoliinas commented Mar 27, 2024

karoliinas commented Apr 11, 2024

wshuai294 commented Apr 12, 2024

karoliinas commented Apr 16, 2024

wshuai294 commented Apr 16, 2024

karoliinas commented Apr 17, 2024

wshuai294 commented Apr 17, 2024

karoliinas commented Apr 17, 2024

wshuai294 commented Apr 18, 2024

karoliinas commented Apr 24, 2024

wshuai294 commented Apr 24, 2024

wshuai294 commented Apr 24, 2024

karoliinas commented Apr 24, 2024

wshuai294 commented Apr 25, 2024

Question about minimap2 parameters in long_read_typing.py #28

Question about minimap2 parameters in long_read_typing.py #28

Comments

karoliinas commented Mar 26, 2024 • edited Loading

wshuai294 commented Mar 26, 2024

karoliinas commented Mar 27, 2024

karoliinas commented Apr 11, 2024

wshuai294 commented Apr 12, 2024

karoliinas commented Apr 16, 2024

wshuai294 commented Apr 16, 2024

karoliinas commented Apr 17, 2024

wshuai294 commented Apr 17, 2024

karoliinas commented Apr 17, 2024

wshuai294 commented Apr 18, 2024

karoliinas commented Apr 24, 2024

wshuai294 commented Apr 24, 2024

wshuai294 commented Apr 24, 2024

karoliinas commented Apr 24, 2024

wshuai294 commented Apr 25, 2024

karoliinas commented Mar 26, 2024 •

edited

Loading