UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte #4
Comments
What type of input data are you using? Can you share the first few rows of your input file?
Hello,

The input is a VCF file from whole-genome sequencing; I followed your instructions. The file size is around 1.0 GB. The process stopped at mutpred_merge.py. In intermediate/splits there were only a few indel, LOF, and missense variant files, containing far less than the original VCF input file.

Below are the first few rows of the input file:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="The phred-scaled genotype likelihoods rounded to the closest integer">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts, for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS mapping quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##SentieonCommandLine.Haplotyper=<ID=Haplotyper,Version="sentieon-genomics-201808",Date="2021-08-15T02:27:27Z",CommandLine="/opt/sentieon-genomics-201808/libexec/driver -r /var/data0/ucsc.hg19.fasta -t 32 -i /var/data1/TAAD0149-LCM8323_L4.realigned.sorted.dedup.bam -q /var/data1/TAAD0149-LCM8323_L4.recaldata.table --algo Haplotyper -d /var/data0/dbsnp_138.hg19.vcf --emit_conf=10 --call_conf=30 /var/data1/TAAD0149-LCM8323_L4.vcf.gz">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr8_gl000197_random,length=37175,assembly=hg19>
##contig=<ID=chr9_gl000198_random,length=90085,assembly=hg19>
##contig=<ID=chr9_gl000199_random,length=169874,assembly=hg19>
##contig=<ID=chr9_gl000200_random,length=187035,assembly=hg19>
##contig=<ID=chr9_gl000201_random,length=36148,assembly=hg19>
##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>
##contig=<ID=chr17_ctg5_hap1,length=1680828,assembly=hg19>
##contig=<ID=chr17_gl000203_random,length=37498,assembly=hg19>
##contig=<ID=chr17_gl000204_random,length=81310,assembly=hg19>
##contig=<ID=chr17_gl000205_random,length=174588,assembly=hg19>
##contig=<ID=chr17_gl000206_random,length=41001,assembly=hg19>
##contig=<ID=chr18_gl000207_random,length=4262,assembly=hg19>
##contig=<ID=chr19_gl000208_random,length=92689,assembly=hg19>
##contig=<ID=chr19_gl000209_random,length=159169,assembly=hg19>
##contig=<ID=chr21_gl000210_random,length=27682,assembly=hg19>
##contig=<ID=chrUn_gl000211,length=166566,assembly=hg19>
##contig=<ID=chrUn_gl000212,length=186858,assembly=hg19>
##contig=<ID=chrUn_gl000213,length=164239,assembly=hg19>
##contig=<ID=chrUn_gl000214,length=137718,assembly=hg19>
##contig=<ID=chrUn_gl000215,length=172545,assembly=hg19>
##contig=<ID=chrUn_gl000216,length=172294,assembly=hg19>
##contig=<ID=chrUn_gl000217,length=172149,assembly=hg19>
##contig=<ID=chrUn_gl000218,length=161147,assembly=hg19>
##contig=<ID=chrUn_gl000219,length=179198,assembly=hg19>
##contig=<ID=chrUn_gl000220,length=161802,assembly=hg19>
##contig=<ID=chrUn_gl000221,length=155397,assembly=hg19>
##contig=<ID=chrUn_gl000222,length=186861,assembly=hg19>
##contig=<ID=chrUn_gl000223,length=180455,assembly=hg19>
##contig=<ID=chrUn_gl000224,length=179693,assembly=hg19>
##contig=<ID=chrUn_gl000225,length=211173,assembly=hg19>
##contig=<ID=chrUn_gl000226,length=15008,assembly=hg19>
##contig=<ID=chrUn_gl000227,length=128374,assembly=hg19>
##contig=<ID=chrUn_gl000228,length=129120,assembly=hg19>
##contig=<ID=chrUn_gl000229,length=19913,assembly=hg19>
##contig=<ID=chrUn_gl000230,length=43691,assembly=hg19>
##contig=<ID=chrUn_gl000231,length=27386,assembly=hg19>
##contig=<ID=chrUn_gl000232,length=40652,assembly=hg19>
##contig=<ID=chrUn_gl000233,length=45941,assembly=hg19>
##contig=<ID=chrUn_gl000234,length=40531,assembly=hg19>
##contig=<ID=chrUn_gl000235,length=34474,assembly=hg19>
##contig=<ID=chrUn_gl000236,length=41934,assembly=hg19>
##contig=<ID=chrUn_gl000237,length=45867,assembly=hg19>
##contig=<ID=chrUn_gl000238,length=39939,assembly=hg19>
##contig=<ID=chrUn_gl000239,length=33824,assembly=hg19>
##contig=<ID=chrUn_gl000240,length=41933,assembly=hg19>
##contig=<ID=chrUn_gl000241,length=42152,assembly=hg19>
##contig=<ID=chrUn_gl000242,length=43523,assembly=hg19>
##contig=<ID=chrUn_gl000243,length=43341,assembly=hg19>
##contig=<ID=chrUn_gl000244,length=39929,assembly=hg19>
##contig=<ID=chrUn_gl000245,length=36651,assembly=hg19>
##contig=<ID=chrUn_gl000246,length=38154,assembly=hg19>
##contig=<ID=chrUn_gl000247,length=36422,assembly=hg19>
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file:///var/data0/ucsc.hg19.fasta
##bcftools_normVersion=1.13+htslib-1.13
##bcftools_normCommand=norm -m-both -o /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.step1.vcf /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.vcf.gz; Date=Mon May 30 13:41:37 2022
##bcftools_normCommand=norm -f /staging/reserve/paylong_ntu/AI_SHARE/reference/GATK_bundle/2.8/hg19/ucsc.hg19.fasta -o /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.vcf /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.step1.vcf; Date=Mon May 30 13:41:56 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TAAD0149-LCM8323_L4
chrM 150 . T C 51425.8 . AC=2;AF=1;AN=2;BaseQRankSum=-3.558;ClippingRankSum=0;DP=1765;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=30.74;ReadPosRankSum=0.024;SOR=0.78 GT:AD:DP:GQ:PL 1/1:56,1617:1673:99:51454,2990,0
chrM 195 . C T 7031.77 . AC=1;AF=0.5;AN=2;BaseQRankSum=4.873;ClippingRankSum=0;DP=1639;ExcessHet=3.0103;FS=3.049;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=4.54;ReadPosRankSum=-2.549;SOR=0.756 GT:AD:DP:GQ:PL 0/1:1203,346:1549:99:7060,0,36222
chrM 302 . AC A 17150.7 . AC=2;AF=1;AN=2;BaseQRankSum=0.83;ClippingRankSum=0;DP=868;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.99;MQRankSum=0;QD=23.69;ReadPosRankSum=0.902;SOR=2.958 GT:AD:DP:GQ:PL 1/1:47,677:724:99:17188,1183,0
chrM 410 . A T 30745.8 . AC=2;AF=1;AN=2;DP=908;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.99;QD=35.54;SOR=2.555 GT:AD:DP:GQ:PL 1/1:0,865:865:99:30774,2605,0
chrM 491 . T C 30203.8 . AC=2;AF=1;AN=2;BaseQRankSum=-3.069;ClippingRankSum=0;DP=1089;ExcessHet=3.0103;FS=1.264;MLEAC=2;MLEAF=1;MQ=59.97;MQRankSum=0;QD=30.39;ReadPosRankSum=-0.858;SOR=0.514 GT:AD:DP:GQ:PL 1/1:50,944:994:99:30232,1153,0
chrM 2354 . C T 61694.8 . AC=2;AF=1;AN=2;DP=1806;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.45;SOR=0.752 GT:AD:DP:GQ:PL 1/1:0,1791:1791:99:61723,5386,0
chrM 2485 . C T 60209.8 . AC=2;AF=1;AN=2;DP=1798;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.72;QD=34.09;SOR=0.859 GT:AD:DP:GQ:PL 1/1:0,1766:1766:99:60238,5305,0
chrM 3029 . T C 48639.8 . AC=2;AF=1;AN=2;BaseQRankSum=1.034;ClippingRankSum=0;DP=1685;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=28.95;ReadPosRankSum=-0.985;SOR=0.8 GT:AD:DP:GQ:PL 1/1:127,1553:1680:99:48668,844,0
chrM 3706 . G A 49509.8 . AC=2;AF=1;AN=2;BaseQRankSum=-4.266;ClippingRankSum=0;DP=1762;ExcessHet=3.0103;FS=2.493;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=28.47;ReadPosRankSum=0.484;SOR=0.322 GT:AD:DP:GQ:PL 1/1:117,1622:1739:99:49538,1282,0
chrM 4492 . G A 45206.8 . AC=2;AF=1;AN=2;BaseQRankSum=11.468;ClippingRankSum=0;DP=1643;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=40.5;MQRankSum=0.537;QD=27.57;ReadPosRankSum=0.897;SOR=0.607 GT:AD:DP:GQ:PL 1/1:135,1505:1640:99:45235,822,0
chrM 5581 . C T 47719.8 . AC=2;AF=1;AN=2;DP=1419;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=45.76;QD=33.68;SOR=0.811 GT:AD:DP:GQ:PL 1/1:0,1417:1417:99:47748,4261,0

With best wishes,
段德敏

From: Timothy Bergquist
Sent: July 6, 2022, 12:18 AM
To: NCBI-Hackathons/MutPredMerge
Cc: minn333; Author
Subject: Re: [NCBI-Hackathons/MutPredMerge] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte (Issue #4)
Dear author, I encountered a UnicodeDecodeError while running mutpred_merge.py. I tried changing the read_csv call to:
data = pd.read_csv("intermediates/scores/" + filename, names=cols, header=None, sep="|", encoding = 'unicode_escape')
but this did not fix the problem.
A new error appeared: UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 4598-4599: truncated \xXX escape
Do you have any suggestions?
Thanks a lot!
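As background on the two error messages quoted above, here is a minimal sketch using only Python's built-in codecs. The sample bytes are invented for illustration and are not taken from the real score files. It shows that 0x89 can never start a UTF-8 sequence, that `unicode_escape` interprets backslash escapes rather than arbitrary bytes (so a `\x` followed by fewer than two hex digits fails), and that `latin-1` or `errors="replace"` will decode any byte without telling you whether the file is really text.

```python
# Hypothetical sample bytes; the real score files are not available here.
sample = b"chr1|12345|A|G\x89more"

# 1) Strict UTF-8 rejects 0x89: it is a continuation byte, not a start byte.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    utf8_error = err
    print(f"utf-8 failed at byte offset {err.start}: {err.reason}")

# 2) unicode_escape parses backslash escapes; an incomplete \x escape raises.
try:
    b"path\\x4".decode("unicode_escape")
except UnicodeDecodeError as err:
    escape_error = err
    print(f"unicode_escape failed: {err.reason}")

# 3) latin-1 maps every byte to a code point, so it never raises,
#    but the result is only meaningful if the file really is text.
as_latin1 = sample.decode("latin-1")

# 4) errors="replace" keeps UTF-8 decoding and marks bad bytes with U+FFFD.
as_replaced = sample.decode("utf-8", errors="replace")
print(repr(as_latin1), repr(as_replaced))
```

pandas' `read_csv` accepts an `encoding` argument, and versions 1.3 and later also accept `encoding_errors`, so the same choices can be passed there. If the score file turns out to contain binary data, though, a lenient encoding would only mask the problem.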
Traceback (most recent call last):
  File "/$User/MutPredMerge-master/mutpred_merge.py", line 202, in <module>
    merged_variants = merge()
  File "/$User/MutPredMerge-master/mutpred_merge.py", line 110, in merge
    data = pd.read_csv("intermediates/scores/" + filename, names=cols, header=None, sep="|")
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 734, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte
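The traceback shows that the failure happens while pandas reads a file under `intermediates/scores/`, but not which file it was or what the offending bytes look like. One way to narrow this down, sketched under assumptions, is to scan that directory and report the first undecodable byte of each file together with a few surrounding bytes. The helper name `find_undecodable_files` is hypothetical and is not part of MutPredMerge.

```python
from pathlib import Path


def find_undecodable_files(directory, context=16):
    """Return (name, offset, surrounding bytes) for files that are not valid UTF-8."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()  # reads the whole file; fine for modest sizes
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # Keep a small window of bytes around the first invalid byte.
            lo = max(0, err.start - context)
            hi = err.start + context
            problems.append((path.name, err.start, raw[lo:hi]))
    return problems


if __name__ == "__main__":
    # Directory name taken from the traceback; adjust if the layout differs.
    scores_dir = Path("intermediates/scores")
    if scores_dir.is_dir():
        for name, offset, snippet in find_undecodable_files(scores_dir):
            print(f"{name}: first invalid byte at offset {offset}: {snippet!r}")
```

If the reported context looks like binary data rather than `|`-separated text, the step that produced the score file is a more likely culprit than the `encoding` argument passed to `read_csv`.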