File Format

eddy-elisee edited this page Sep 12, 2024 · 10 revisions

Name convention

For structure file names, replace every |, : and . character with _.

For sequence IDs in a FASTA file, you can use any ID, but every |, : and . is automatically replaced by _. However, for UniProt headers like sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName, the UniqueIdentifier is extracted and used as the sequence ID.
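
The convention above can be illustrated with a minimal Python sketch. It is not ASMC's actual code: the function name normalise_sequence_id is hypothetical, and keeping only the first whitespace-delimited token of the header is an assumption.

```python
import re

# Characters that the naming convention replaces with "_".
_FORBIDDEN = re.compile(r"[|:.]")


def normalise_sequence_id(header: str) -> str:
    """Return the sequence ID for a FASTA header (hypothetical helper)."""
    # Assumption: only the first whitespace-delimited token is the ID.
    token = header.lstrip(">").split()[0]
    parts = token.split("|")
    # UniProt-style header: sp|UniqueIdentifier|EntryName or tr|...
    if len(parts) == 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    # Any other ID: replace |, : and . by _
    return _FORBIDDEN.sub("_", token)


print(normalise_sequence_id(">sp|P12345|AATM_RABIT"))  # P12345
print(normalise_sequence_id(">my.seq:1|v2"))           # my_seq_1_v2
```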

Files

Reference structure file

Input of:

  • ASMC/run_asmc.py with the option -r/--ref
  • ASMC/asmc/compute_perc_id.py with the option -r/--ref-str

This file is mandatory to run ASMC based on structures.

It contains one reference structure path per line, e.g.:

/home/User/data/RefA.pdb
/home/User/data/RefZ.pdb

Pocket csv file

Input of:

  • ASMC/run_asmc.py with the option -p/--pocket

Output of:

  • ASMC/run_asmc.py if the option -p/--pocket isn't provided

This file indicates the active site positions. The format is ID,Chain,pos..., e.g.:

RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198

If not provided, this file will be built by ASMC.
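
As an illustration of the ID,Chain,pos... layout, here is a small Python sketch that reads such a file into a dictionary. The helper name read_pocket_csv is hypothetical and this is not how ASMC itself necessarily parses the file.

```python
import csv
import io


def read_pocket_csv(handle):
    """Map each reference ID to (chain, [positions]) (hypothetical helper)."""
    pockets = {}
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        # First field is the ID, second the chain, the rest are positions.
        ref_id, chain, *positions = row
        pockets[ref_id] = (chain, [int(p) for p in positions])
    return pockets


example = io.StringIO(
    "RefA,A,55,57,59,77,101,102,129,130,131,145,148\n"
    "RefZ,A,89,91,93,118,142,143,170,171,172,197,198\n"
)
pockets = read_pocket_csv(example)
print(pockets["RefA"])
```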

Models file

Input of:

  • ASMC/run_asmc.py with the option -m/--models
  • ASMC/asmc/compute_perc_id.py with the option -r/--ref-str*

*only the first column is necessary.

Output of:

  • ASMC/run_asmc.py if the option -s/--seqs is provided

This file is built by ASMC if the modelling step is performed. Otherwise, you should build it yourself and provide it to ASMC/run_asmc.py. Each line gives the path to a model followed by the path to its reference structure, e.g.:

/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb
/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb
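
The two-column layout can be read with a short Python sketch like the one below. It assumes the columns are separated by whitespace and that paths contain no spaces; read_models_file is a hypothetical name. The reference column is treated as optional because only the first column is needed by compute_perc_id.py.

```python
def read_models_file(lines):
    """Return a list of (model_path, reference_path or None) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue
        model = fields[0]
        reference = fields[1] if len(fields) > 1 else None
        pairs.append((model, reference))
    return pairs


example = [
    "/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb",
    "/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb",
]
model_pairs = read_models_file(example)
for model, reference in model_pairs:
    print(model, "->", reference)
```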

Active sites alignment

Input of:

  • ASMC/run_asmc.py with the option -a/--active-sites

Output of:

  • ASMC/run_asmc.py if the option -a/--active-sites isn't provided
  • ASMC/run_asmc.py also returns a fasta file for each group which can be used with the option -a/--active-sites

This is simply a FASTA file containing all the active site sequences to be clustered, e.g.:

>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA
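
To inspect such a file before clustering, a minimal FASTA reader is enough. The sketch below is illustrative only; read_fasta is a hypothetical helper, and reporting the set of sequence lengths is a sanity check added here, not a documented ASMC requirement.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    sequences = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            sequences[current] = ""
        elif current is not None:
            # Sequences may span several lines; concatenate them.
            sequences[current] += line
    return sequences


example = """>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA
""".splitlines()

active_sites = read_fasta(example)
lengths = {len(seq) for seq in active_sites.values()}
print(len(active_sites), lengths)
```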

Input for the --msa option

Input of:

  • ASMC/run_asmc.py with the option -M/--msa

If only one reference is used, the file should contain two pieces of information:

  • The active site positions in the reference sequence
  • The path to the multiple sequence alignment

e.g.:

refA,55,57,59,77,101,102,129,130,131,145,148
/home/User/data/multiple_sequence_alignment.fasta

If there are multiple references, the file must contain 1) the pocket positions of each reference and 2) the path to a file similar to identity_targets_refs.tsv (see below), followed by the path to the multiple sequence alignment, e.g.:

refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
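
The two layouts can be told apart with a small Python sketch. It rests on several assumptions that the page does not state: lines containing a comma are pocket definitions, paths contain no commas, and the alignment path always comes last, as in the examples. read_msa_input is a hypothetical name.

```python
def read_msa_input(lines):
    """Return (pockets, identity_path or None, msa_path)."""
    pockets, paths = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "," in line:
            # Pocket definition: ID followed by the positions.
            ref_id, *positions = line.split(",")
            pockets[ref_id] = [int(p) for p in positions]
        else:
            paths.append(line)
    if not paths:
        raise ValueError("no alignment path found")
    # Assumption: the last path is the alignment; an extra path before it
    # is the identity file used with multiple references.
    identity_path = paths[0] if len(paths) == 2 else None
    return pockets, identity_path, paths[-1]


multi = [
    "refA,55,57,59,77,101,102,129,130,131,145,148",
    "RefZ,89,91,93,118,142,143,170,171,172,197,198",
    "/home/User/data/identity_targets_refs.tsv",
    "/home/User/data/multiple_sequence_alignment.fasta",
]
msa_pockets, msa_identity, msa_path = read_msa_input(multi)
print(sorted(msa_pockets), msa_identity, msa_path)
```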

Reference sequence file

Input of:

  • ASMC/asmc/compute_perc_id.py with the option -R/--ref-seq

This file contains one reference ID per line, e.g.:

RefA
RefZ

groups_x_min_y.tsv

Input of:

  • subcommand compare with the options -f1 and -f2
  • subcommand to_xlsx with the option -f/--file

Output of:

  • ASMC/run_asmc.py

The x corresponds to the -e/--eps value of ASMC/run_asmc.py. The default is auto, in which case the value is chosen automatically before the clustering, based on the distribution of the normalised distances.

The y corresponds to the --min-samples value of ASMC/run_asmc.py. The default is auto, in which case the value is 5 if the number of samples is ≤ 1500 and 25 otherwise.
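
The default rule for --min-samples described above amounts to a single threshold test. The sketch below only restates that rule; default_min_samples is a hypothetical name, not an ASMC function.

```python
def default_min_samples(n_samples: int) -> int:
    """Value used for y when --min-samples is left on auto."""
    # 5 for up to 1500 samples, 25 for larger data sets.
    return 5 if n_samples <= 1500 else 25


print(default_min_samples(1500))  # 5
print(default_min_samples(1501))  # 25
```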

Each line has three tab-separated columns, ID, Active_site_sequence and Group_id, e.g.:

ID1	ACQGINFIRVDYEIHIGMGGT	-1
ID2	SAEGINLMRNSFVQHVGHQGT	0
ID3	SAEGINFVRNSFVQHVGHQGT	0
ID4	SCEGVNFVRVDRLVHVGLIGT	1
ID5	SCEGVNFIRVDRLVHVGLIGT	1

Note: The group numbering starts at 0, and -1 is the ID for the outliers.
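
As a sketch of how this file can be consumed, the following Python snippet collects the members of each group and separates the outliers (group -1). It assumes tab-separated columns with no header line, as in the example; read_groups is a hypothetical helper.

```python
from collections import defaultdict


def read_groups(lines):
    """Return {group_id: [(sequence_id, active_site_sequence), ...]}."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:  # skip blank or malformed lines
            continue
        seq_id, sequence, group_id = fields
        groups[int(group_id)].append((seq_id, sequence))
    return dict(groups)


example = [
    "ID1\tACQGINFIRVDYEIHIGMGGT\t-1",
    "ID2\tSAEGINLMRNSFVQHVGHQGT\t0",
    "ID3\tSAEGINFVRNSFVQHVGHQGT\t0",
    "ID4\tSCEGVNFVRVDRLVHVGLIGT\t1",
    "ID5\tSCEGVNFIRVDRLVHVGLIGT\t1",
]
groups = read_groups(example)
outliers = [seq_id for seq_id, _ in groups.get(-1, [])]
print(sorted(groups), outliers)
```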

identity_targets_refs.tsv

Input of:

  • subcommand run as a path in a file to provide to the -M/--msa, see above
  • subcommand compare with the option -id

Output of:

  • subcommand run if the -s/--seqs is provided (homology modelling performed by ASMC with MODELLER)
  • subcommand identity

Output example for the subcommand identity:

id1	refA	62.50
id2	refA	68.75
id3	refZ	68.75
id4	refZ	50.00
id5	refZ	62.50

Note: the identity_targets_refs.tsv returned by the subcommand run has 4 columns; the last column contains the value of the --id option.
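
Because the file has either 3 or 4 columns depending on the subcommand that wrote it, a reader can simply use the first three. The Python sketch below assumes tab-separated columns and no header line; read_identity is a hypothetical name.

```python
from collections import defaultdict


def read_identity(lines):
    """Return {reference_id: [(target_id, percent_identity), ...]}."""
    by_reference = defaultdict(list)
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        # A fourth column (the --id value) may be present; it is ignored here.
        target, reference, perc_id = fields[0], fields[1], float(fields[2])
        by_reference[reference].append((target, perc_id))
    return dict(by_reference)


example = [
    "id1\trefA\t62.50",
    "id2\trefA\t68.75",
    "id3\trefZ\t68.75",
    "id4\trefZ\t50.00",
    "id5\trefZ\t62.50",
]
identity = read_identity(example)
print({ref: len(targets) for ref, targets in identity.items()})
```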

active_site_checking.tsv

Output of:

  • subcommand compare

This file is tab-separated and starts with a header line. The columns are ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ, e.g.:

ID	G1	SEQ1	G2	SEQ2	DIFF	REF_ID	PERC_ID	REF_SEQ
ID22	0	FGSNLGCYEVFMYP	0	FGSNLGCYEVFMYP	0	REFC	16.81	LPSQLDWYEVMEYP
ID45	0	ILSKVAWFEVFVPG	-1	ILS-VAWFEAVIYP	5	REFB	18.14	VLSAAAWYEIIVYP
ID48	0	VGSEVTWYESAMYP	0	VGSSVTWYESAMYP	1	REFD	26.85	LGSQVTWYEIIIYP
ID61	0	IASQMGWYEAIIYP	0	IASQMGWYEAIIYP	0	REFB	39.82	VLSAAAWYEIIVYP
ID67	0	ILSAAAWYEIIVYP	0	ILSAAAWYEIIVYP	0	REFB	51.77	VLSAAAWYEIIVYP

Note: The values in the G1 and G2 columns are just the group IDs from their respective runs. The same ID in both columns (e.g. 0 and 0) does not mean that the two groups have the same members. However, multiple runs with the same parameters on the same active sites alignment always return the same clusters.
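
Since the file has a header line, it can be loaded with Python's csv.DictReader. The sketch below lists the sequences whose DIFF value is greater than 0; it makes no assumption about how ASMC computes DIFF, and read_checking is a hypothetical helper.

```python
import csv
import io


def read_checking(handle):
    """Return the rows of active_site_checking.tsv as dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


example = io.StringIO(
    "ID\tG1\tSEQ1\tG2\tSEQ2\tDIFF\tREF_ID\tPERC_ID\tREF_SEQ\n"
    "ID22\t0\tFGSNLGCYEVFMYP\t0\tFGSNLGCYEVFMYP\t0\tREFC\t16.81\tLPSQLDWYEVMEYP\n"
    "ID45\t0\tILSKVAWFEVFVPG\t-1\tILS-VAWFEAVIYP\t5\tREFB\t18.14\tVLSAAAWYEIIVYP\n"
    "ID48\t0\tVGSEVTWYESAMYP\t0\tVGSSVTWYESAMYP\t1\tREFD\t26.85\tLGSQVTWYEIIIYP\n"
)
rows = read_checking(example)
# Keep only the sequences whose active site differs between the two runs.
changed = [row["ID"] for row in rows if int(row["DIFF"]) > 0]
print(changed)  # ['ID45', 'ID48']
```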