UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte #4

minn333 · 2022-07-03T02:13:42Z

Dear author, I encountered a UnicodeDecodeError while running mutpred_merge.py. I tried to fix it by changing the read call to
data = pd.read_csv("intermediates/scores/" + filename, names=cols, header=None, sep="|", encoding = 'unicode_escape')

but that did not solve the problem.

A new error appeared instead: UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 4598-4599: truncated \xXX escape

Do you have any suggestions?

Thanks a lot!

Traceback (most recent call last):
  File "/$User/MutPredMerge-master/mutpred_merge.py", line 202, in <module>
    merged_variants = merge()
  File "/$User/MutPredMerge-master/mutpred_merge.py", line 110, in merge
    data = pd.read_csv("intermediates/scores/" + filename, names=cols, header=None, sep="|")
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "/$PATH/snakemake/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 734, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte
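
A note on the two error messages above, as a minimal sketch using only the Python standard library. The sample bytes are made up for illustration and are not taken from the actual score files. The sketch shows that `unicode_escape` is not a general-purpose fallback: it interprets backslash sequences in the data, so any stray `\x` that is not followed by two hex digits raises the "truncated \xXX escape" error. `latin-1`, by contrast, maps every byte to a character and therefore never raises.

```python
# Illustrative bytes only; NOT taken from the real MutPredMerge score files.
# The line contains a raw 0x89 byte and a literal backslash followed by "xyz".
raw = b"P12345|0.73|\x89|C:\\xyz\n"


def try_decode(data: bytes, encoding: str) -> str:
    """Return 'ok' if `data` decodes, else the codec name and failure reason."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError as exc:
        return f"{exc.encoding}: {exc.reason}"


# Try the default codec, the one used in the attempted fix, and latin-1.
results = {
    enc: try_decode(raw, enc)
    for enc in ("utf-8", "unicode_escape", "latin-1")
}

for enc, outcome in results.items():
    print(f"{enc:>15} -> {outcome}")
```

Passing `encoding="latin-1"` to `pd.read_csv` would probably make the exception go away, but it only hides the problem if the file is not really text (for example, if a binary or compressed file ended up in `intermediates/scores/`), so it may be worth checking the file contents first.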

trberg · 2022-07-05T16:18:07Z

What type of input data are you using? Can you share the first few rows of your input file?
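
To help answer this, here is a small diagnostic sketch (not part of MutPredMerge) that looks for the first byte in each score file that is not valid UTF-8 and returns the bytes around it. The helper names `first_bad_byte` and `scan_scores` are hypothetical, and the default directory is assumed from the `read_csv` call in the traceback. Note that the position pandas reports may be relative to an internal read buffer rather than the start of the file, so the sketch scans each file in full instead of jumping to offset 512.

```python
from pathlib import Path


def first_bad_byte(path, encoding="utf-8", context=16):
    """Return (offset, surrounding_bytes) for the first undecodable byte, or None.

    Reads the whole file into memory, which is fine for a quick check but
    may be slow for very large files.
    """
    data = Path(path).read_bytes()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.start + context]
    return None


def scan_scores(directory="intermediates/scores"):
    """Map each file name in `directory` to its first undecodable byte, if any."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            hit = first_bad_byte(path)
            if hit is not None:
                report[path.name] = hit
    return report
```

Running `scan_scores()` from the MutPredMerge working directory and printing the result should show which file fails and what the surrounding bytes look like; readable text with a few odd characters suggests an encoding mismatch, while mostly unreadable bytes suggest the file is not plain text at all.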

minn333 · 2022-07-07T02:23:57Z

Hello, the input is a VCF file from whole-genome sequencing; I followed your instructions. The file size is around 1.0 GB. The process stopped at mutpred_merge.py. In intermediate/splits there were only a few indel, LOF, and missense variant files, much smaller than the original VCF input file. Below are the first few rows of the input file:

    ##fileformat=VCFv4.2
    ##FILTER=<ID=PASS,Description="All filters passed">
    ##FILTER=<ID=LowQual,Description="Low quality">
    ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="The phred-scaled genotype likelihoods rounded to the closest integer">
    ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
    ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
    ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
    ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
    ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
    ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts, for each ALT allele, in the same order as listed">
    ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency, for each ALT allele, in the same order as listed">
    ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS mapping quality">
    ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
    ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
    ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
    ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
    ##SentieonCommandLine.Haplotyper=<ID=Haplotyper,Version="sentieon-genomics-201808",Date="2021-08-15T02:27:27Z",CommandLine="/opt/sentieon-genomics-201808/libexec/driver -r /var/data0/ucsc.hg19.fasta -t 32 -i /var/data1/TAAD0149-LCM8323_L4.realigned.sorted.dedup.bam -q /var/data1/TAAD0149-LCM8323_L4.recaldata.table --algo Haplotyper -d /var/data0/dbsnp_138.hg19.vcf --emit_conf=10 --call_conf=30 /var/data1/TAAD0149-LCM8323_L4.vcf.gz">
    ##contig=<ID=chrM,length=16571,assembly=hg19>
    ##contig=<ID=chr1,length=249250621,assembly=hg19>
    ##contig=<ID=chr2,length=243199373,assembly=hg19>
    ##contig=<ID=chr3,length=198022430,assembly=hg19>
    ##contig=<ID=chr4,length=191154276,assembly=hg19>
    ##contig=<ID=chr5,length=180915260,assembly=hg19>
    ##contig=<ID=chr6,length=171115067,assembly=hg19>
    ##contig=<ID=chr7,length=159138663,assembly=hg19>
    ##contig=<ID=chr8,length=146364022,assembly=hg19>
    ##contig=<ID=chr8_gl000197_random,length=37175,assembly=hg19>
    ##contig=<ID=chr9_gl000198_random,length=90085,assembly=hg19>
    ##contig=<ID=chr9_gl000199_random,length=169874,assembly=hg19>
    ##contig=<ID=chr9_gl000200_random,length=187035,assembly=hg19>
    ##contig=<ID=chr9_gl000201_random,length=36148,assembly=hg19>
    ##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>
    ##contig=<ID=chr17_ctg5_hap1,length=1680828,assembly=hg19>
    ##contig=<ID=chr17_gl000203_random,length=37498,assembly=hg19>
    ##contig=<ID=chr17_gl000204_random,length=81310,assembly=hg19>
    ##contig=<ID=chr17_gl000205_random,length=174588,assembly=hg19>
    ##contig=<ID=chr17_gl000206_random,length=41001,assembly=hg19>
    ##contig=<ID=chr18_gl000207_random,length=4262,assembly=hg19>
    ##contig=<ID=chr19_gl000208_random,length=92689,assembly=hg19>
    ##contig=<ID=chr19_gl000209_random,length=159169,assembly=hg19>
    ##contig=<ID=chr21_gl000210_random,length=27682,assembly=hg19>
    ##contig=<ID=chrUn_gl000211,length=166566,assembly=hg19>
    ##contig=<ID=chrUn_gl000212,length=186858,assembly=hg19>
    ##contig=<ID=chrUn_gl000213,length=164239,assembly=hg19>
    ##contig=<ID=chrUn_gl000214,length=137718,assembly=hg19>
    ##contig=<ID=chrUn_gl000215,length=172545,assembly=hg19>
    ##contig=<ID=chrUn_gl000216,length=172294,assembly=hg19>
    ##contig=<ID=chrUn_gl000217,length=172149,assembly=hg19>
    ##contig=<ID=chrUn_gl000218,length=161147,assembly=hg19>
    ##contig=<ID=chrUn_gl000219,length=179198,assembly=hg19>
    ##contig=<ID=chrUn_gl000220,length=161802,assembly=hg19>
    ##contig=<ID=chrUn_gl000221,length=155397,assembly=hg19>
    ##contig=<ID=chrUn_gl000222,length=186861,assembly=hg19>
    ##contig=<ID=chrUn_gl000223,length=180455,assembly=hg19>
    ##contig=<ID=chrUn_gl000224,length=179693,assembly=hg19>
    ##contig=<ID=chrUn_gl000225,length=211173,assembly=hg19>
    ##contig=<ID=chrUn_gl000226,length=15008,assembly=hg19>
    ##contig=<ID=chrUn_gl000227,length=128374,assembly=hg19>
    ##contig=<ID=chrUn_gl000228,length=129120,assembly=hg19>
    ##contig=<ID=chrUn_gl000229,length=19913,assembly=hg19>
    ##contig=<ID=chrUn_gl000230,length=43691,assembly=hg19>
    ##contig=<ID=chrUn_gl000231,length=27386,assembly=hg19>
    ##contig=<ID=chrUn_gl000232,length=40652,assembly=hg19>
    ##contig=<ID=chrUn_gl000233,length=45941,assembly=hg19>
    ##contig=<ID=chrUn_gl000234,length=40531,assembly=hg19>
    ##contig=<ID=chrUn_gl000235,length=34474,assembly=hg19>
    ##contig=<ID=chrUn_gl000236,length=41934,assembly=hg19>
    ##contig=<ID=chrUn_gl000237,length=45867,assembly=hg19>
    ##contig=<ID=chrUn_gl000238,length=39939,assembly=hg19>
    ##contig=<ID=chrUn_gl000239,length=33824,assembly=hg19>
    ##contig=<ID=chrUn_gl000240,length=41933,assembly=hg19>
    ##contig=<ID=chrUn_gl000241,length=42152,assembly=hg19>
    ##contig=<ID=chrUn_gl000242,length=43523,assembly=hg19>
    ##contig=<ID=chrUn_gl000243,length=43341,assembly=hg19>
    ##contig=<ID=chrUn_gl000244,length=39929,assembly=hg19>
    ##contig=<ID=chrUn_gl000245,length=36651,assembly=hg19>
    ##contig=<ID=chrUn_gl000246,length=38154,assembly=hg19>
    ##contig=<ID=chrUn_gl000247,length=36422,assembly=hg19>
    ##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
    ##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
    ##reference=file:///var/data0/ucsc.hg19.fasta
    ##bcftools_normVersion=1.13+htslib-1.13
    ##bcftools_normCommand=norm -m-both -o /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.step1.vcf /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.vcf.gz; Date=Mon May 30 13:41:37 2022
    ##bcftools_normCommand=norm -f /staging/reserve/paylong_ntu/AI_SHARE/reference/GATK_bundle/2.8/hg19/ucsc.hg19.fasta -o /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.vcf /staging/biology/genegogo2019/annovar_TAAD/TAAD0149-LCM8323_L4.step1.vcf; Date=Mon May 30 13:41:56 2022
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TAAD0149-LCM8323_L4
    chrM 150 . T C 51425.8 . AC=2;AF=1;AN=2;BaseQRankSum=-3.558;ClippingRankSum=0;DP=1765;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=30.74;ReadPosRankSum=0.024;SOR=0.78 GT:AD:DP:GQ:PL 1/1:56,1617:1673:99:51454,2990,0
    chrM 195 . C T 7031.77 . AC=1;AF=0.5;AN=2;BaseQRankSum=4.873;ClippingRankSum=0;DP=1639;ExcessHet=3.0103;FS=3.049;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=4.54;ReadPosRankSum=-2.549;SOR=0.756 GT:AD:DP:GQ:PL 0/1:1203,346:1549:99:7060,0,36222
    chrM 302 . AC A 17150.7 . AC=2;AF=1;AN=2;BaseQRankSum=0.83;ClippingRankSum=0;DP=868;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.99;MQRankSum=0;QD=23.69;ReadPosRankSum=0.902;SOR=2.958 GT:AD:DP:GQ:PL 1/1:47,677:724:99:17188,1183,0
    chrM 410 . A T 30745.8 . AC=2;AF=1;AN=2;DP=908;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.99;QD=35.54;SOR=2.555 GT:AD:DP:GQ:PL 1/1:0,865:865:99:30774,2605,0
    chrM 491 . T C 30203.8 . AC=2;AF=1;AN=2;BaseQRankSum=-3.069;ClippingRankSum=0;DP=1089;ExcessHet=3.0103;FS=1.264;MLEAC=2;MLEAF=1;MQ=59.97;MQRankSum=0;QD=30.39;ReadPosRankSum=-0.858;SOR=0.514 GT:AD:DP:GQ:PL 1/1:50,944:994:99:30232,1153,0
    chrM 2354 . C T 61694.8 . AC=2;AF=1;AN=2;DP=1806;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.45;SOR=0.752 GT:AD:DP:GQ:PL 1/1:0,1791:1791:99:61723,5386,0
    chrM 2485 . C T 60209.8 . AC=2;AF=1;AN=2;DP=1798;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.72;QD=34.09;SOR=0.859 GT:AD:DP:GQ:PL 1/1:0,1766:1766:99:60238,5305,0
    chrM 3029 . T C 48639.8 . AC=2;AF=1;AN=2;BaseQRankSum=1.034;ClippingRankSum=0;DP=1685;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=28.95;ReadPosRankSum=-0.985;SOR=0.8 GT:AD:DP:GQ:PL 1/1:127,1553:1680:99:48668,844,0
    chrM 3706 . G A 49509.8 . AC=2;AF=1;AN=2;BaseQRankSum=-4.266;ClippingRankSum=0;DP=1762;ExcessHet=3.0103;FS=2.493;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=28.47;ReadPosRankSum=0.484;SOR=0.322 GT:AD:DP:GQ:PL 1/1:117,1622:1739:99:49538,1282,0
    chrM 4492 . G A 45206.8 . AC=2;AF=1;AN=2;BaseQRankSum=11.468;ClippingRankSum=0;DP=1643;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=40.5;MQRankSum=0.537;QD=27.57;ReadPosRankSum=0.897;SOR=0.607 GT:AD:DP:GQ:PL 1/1:135,1505:1640:99:45235,822,0
    chrM 5581 . C T 47719.8 . AC=2;AF=1;AN=2;DP=1419;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=45.76;QD=33.68;SOR=0.811 GT:AD:DP:GQ:PL 1/1:0,1417:1417:99:47748,4261,0

With best wishes,
段德敏
