Fetching hits from Fuzzle

class database.data.Hit(id: int, query: str, q_scop_id: str, no: int, sbjct: str, s_scop_id: str, s_desc: str, prob: float, eval: float, pval: float, score: float, ss: float, cols: int, q_start: int, q_end: int, s_start: int, s_end: int, hmm: int, ident: float, q_sufam_id: str, s_sufam_id: str, q_fold_id: str, s_fold_id: str, rmsd_pair: float, ca_pair: int, rmsd_tm_pair: float, score_tm_pair: float, ca_tm_pair: int, rmsd_tm: float, score_tm: float, ca_tm: int, q_tm_start: int, q_tm_end: int, s_tm_start: int, s_tm_end: int, q_cluster: str, s_cluster: str)[source]

Some of the documentation of this class was taken from the HH-suite documentation (https://github.com/soedinglab/hh-suite/wiki), because the sequence information in the Fuzzle hits comes from HHsearch. The structural superimpositions were performed with TMalign: https://zhanglab.ccmb.med.umich.edu/TM-align/

property ca_pair

The number of alpha carbon pairs that were used for the rmsd_pair calculation.

property ca_tm

The number of alpha carbon pairs that were used for the rmsd_tm calculation

property ca_tm_pair

The number of alpha carbon pairs that were used for the rmsd_tm_pair calculation

property cols

The number of aligned Match columns in the HMM-HMM alignment.

property eval

E-value

property hmm

int

property id

The database id for this hit

property ident

Identity % for the sequence alignment

property no

The HHsearch hit number for this query

property prob

HHsearch probability

property pval

p-value

property q_cluster

The cluster to which the query fragment belongs.

property q_end

Residue where the alignment ends for the query domain (sequence position)

property q_fold_id

The fold the query belongs to

property q_scop_id

The SCOP family the query belongs to

property q_start

Residue where the alignment starts for the query domain (sequence position)

property q_sufam_id

The superfamily the query belongs to

property q_tm_end

Residue in the query structure where the rmsd_tm_pair alignment ends

property q_tm_start

Residue in the query structure where the rmsd_tm_pair alignment starts

property query

The 7-letter SCOP95 code for the query domain

property rmsd_pair

RMSD for the alignment between the two domains, strictly taking the alpha carbons from the structures that exactly appear in the HHsearch sequence alignment

property rmsd_tm

RMSD for the TMalign alignment between the two domains without seed

property rmsd_tm_pair

RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed

property s_cluster

The cluster to which the subject fragment belongs.

property s_desc

Description of the subject domain.

property s_end

Residue where the alignment ends for the subject domain (sequence position)

property s_fold_id

The fold the subject belongs to

property s_scop_id

The SCOP family the subject belongs to

property s_start

Residue where the alignment starts for the subject domain (sequence position)

property s_sufam_id

The superfamily the subject belongs to

property s_tm_end

Residue in the subject structure where the rmsd_tm_pair alignment ends

property s_tm_start

Residue in the subject structure where the rmsd_tm_pair alignment starts

property sbjct

The 7-letter SCOP95 code for the subject domain

property score

The raw score is computed by the Viterbi HMM-HMM alignment excluding the secondary structure score. It is the sum of similarities of the aligned profile columns minus the position-specific gap penalties in bits.

property score_tm

TM-score for the rmsd_tm superposition

property score_tm_pair

TM-score for the rmsd_tm_pair superposition

property ss

The secondary structure score. This score tells you how well the PSIPRED-predicted (3-state) or actual DSSP-determined (8-state) secondary structure sequences agree with each other.

class database.data.Result(ahits: List[database.data.Hit])[source]

Class handling the data obtained from Fuzzle

property avg_len

Returns the average length, in amino acids, of the returned hits

property ids

Returns the IDs of all the hits

property list_fams

Returns the list of unique families in the hits list

property list_folds

Returns the list of unique folds in the hits list

property list_sufams

Returns the list of unique superfamilies in the hits list

property std_len

Returns the standard deviation of the length, in amino acids, of the hits in the list

property unique_clusters

Returns a list of unique clusters

property unique_domains

Returns a list of unique domains
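
As a rough illustration of what these summary properties compute, the sketch below rebuilds a few of them over a hypothetical stand-in for Hit. It assumes that the fragment length is ca_tm_pair and that the standard deviation is the population form; this page confirms neither. MiniHit, summarize, and the fold values in the sample data are illustrative and not part of database.data.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass
class MiniHit:
    """Hypothetical stand-in carrying a few of the Hit fields."""
    id: int
    query: str
    sbjct: str
    q_fold_id: str
    s_fold_id: str
    ca_tm_pair: int


def summarize(hits: List[MiniHit]) -> Dict[str, object]:
    """Recompute Result-like summaries from a list of hits."""
    lengths = [h.ca_tm_pair for h in hits]  # assumption: length == ca_tm_pair
    folds = sorted({f for h in hits for f in (h.q_fold_id, h.s_fold_id)})
    domains = sorted({d for h in hits for d in (h.query, h.sbjct)})
    return {
        "ids": [h.id for h in hits],
        "avg_len": mean(lengths),
        "std_len": pstdev(lengths),  # assumption: population standard deviation
        "list_folds": folds,
        "unique_domains": domains,
    }


# Made-up sample data; the fold identifiers are placeholders.
hits = [
    MiniHit(1, "d4akea1", "d1reqa2", "c.37", "c.1", 40),
    MiniHit(2, "d4akea1", "d1cph.1", "c.37", "g.1", 60),
]
summary = summarize(hits)
```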

database.data.fetch_byPDB(pdb: str, prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, diff_folds: bool = True)[source]

Returns the entries in Fuzzle that contain the representative domains that correspond to that PDB

Parameters
  • pdb – The PDB code whose representative domains are searched

  • prob – Lower cutoff for the hit probability

  • rmsd – Upper cutoff for rmsd_tm_pair

  • ca_min – Lower cutoff for the number of amino acids (ca_tm_pair)

  • ca_max – Upper cutoff for the number of amino acids (ca_tm_pair)

  • score_tm_pair – Lower cutoff for the TM-score (score_tm_pair)

  • ratio – Upper cutoff for the ratio cols / ca_tm_pair

  • diff_folds – Whether to exclude hits from the same fold (True) or not (False)

Returns

A Result object
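
How fetch_byPDB maps a PDB code to its domains is not documented on this page. The sketch below only illustrates the relationship implied by the sid format described under validate_scopid, where characters 2 to 5 of a 7-character sid hold the PDB ID. pdb_code and group_by_pdb are hypothetical helpers, and the sample sids are used purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pdb_code(sid: str) -> str:
    """Return the 4-character PDB ID embedded in a 7-character SCOP sid."""
    return sid[1:5]


def group_by_pdb(sids: Iterable[str]) -> Dict[str, List[str]]:
    """Group domain sids by the PDB entry they come from."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sid in sids:
        groups[pdb_code(sid)].append(sid)
    return dict(groups)


groups = group_by_pdb(["d4akea1", "d1reqa2", "d1reqa1"])
```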

database.data.fetch_byPDBs(pdb1: str, pdb2: str, prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, diff_folds: bool = True)[source]

Fetches all hits between the domains that belong to a pair of PDBs

Parameters
  • pdb1 – The first PDB to check

  • pdb2 – The second PDB to check

  • prob – The minimum allowed HHsearch probability

  • rmsd – The maximum allowed RMSD (rmsd_tm_pair: “RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed”)

  • ca_min – The minimum allowed fragment length (for the TMalign alignment)

  • ca_max – The maximum allowed fragment length (for the TMalign alignment)

  • score_tm_pair – The minimum allowed TM-score (for the TMalign alignment)

  • ratio – The maximum ratio between the sequence and structural alignment lengths (cols / ca_tm_pair)

  • diff_folds – Whether to exclude hits from the same fold (True) or not (False)

Returns

A Result object containing the hits that fulfill these criteria

database.data.fetch_by_domain(domain: str, prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, diff_folds: bool = True)[source]

Fetch all the hits that contain a specific domain

Parameters
  • domain – The 7-letter code for the parent domain

  • prob – The minimum allowed HHsearch probability

  • rmsd – The maximum allowed RMSD (rmsd_tm_pair: “RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed”)

  • ca_min – The minimum allowed fragment length (for the TMalign alignment)

  • ca_max – The maximum allowed fragment length (for the TMalign alignment)

  • score_tm_pair – The minimum allowed TM-score (for the TMalign alignment)

  • ratio – The maximum ratio between the sequence and structural alignment lengths (cols / ca_tm_pair)

  • diff_folds – Whether to exclude hits from the same fold (True) or not (False)

Returns

A Result object containing the hits that fulfill these criteria

database.data.fetch_by_domains(domain1: str, domain2: str, prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, diff_folds: bool = True)[source]

Fetch all the hits between two parent domains

Parameters
  • domain1 – The 7-letter code for the first parent

  • domain2 – The 7-letter code for the second parent

  • prob – The minimum allowed HHsearch probability

  • rmsd – The maximum allowed RMSD (rmsd_tm_pair: “RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed”)

  • ca_min – The minimum allowed fragment length (for the TMalign alignment)

  • ca_max – The maximum allowed fragment length (for the TMalign alignment)

  • score_tm_pair – The minimum allowed TM-score (for the TMalign alignment)

  • ratio – The maximum ratio between the sequence and structural alignment lengths (cols / ca_tm_pair)

  • diff_folds – Whether to exclude hits from the same fold (True) or not (False)

Returns

A Result object containing the hits that fulfill these criteria

database.data.fetch_group(group1, group2=None, prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, diff_folds: bool = True) → database.data.Result[source]

Fetches all hits between two specific groups (folds, superfamilies or families) or inside one specific group (group1)

Parameters
  • group1 – The first group to search in, e.g. ‘c.2’

  • group2 (optional) – The second group to search in, e.g. ‘c.2’

  • prob – The minimum allowed HHsearch probability

  • rmsd – The maximum allowed RMSD (rmsd_tm_pair: “RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed”)

  • ca_min – The minimum allowed fragment length (for the TMalign alignment)

  • ca_max – The maximum allowed fragment length (for the TMalign alignment)

  • score_tm_pair – The minimum allowed TM-score (for the TMalign alignment)

  • ratio – The maximum ratio between the sequence and structural alignment lengths (cols / ca_tm_pair)

Returns

A Result object with the hits that fulfill these criteria
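
Group identifiers such as ‘c.2’ follow the dotted SCOP classification string, in which each extra component descends one level: class, fold, superfamily, family. The sketch below shows one way to tell the levels apart. scop_level is a hypothetical helper, and whether fetch_group infers the level this way is an assumption.

```python
def scop_level(group: str) -> str:
    """Name the SCOP level of a dotted identifier such as 'c.2'."""
    levels = {1: "class", 2: "fold", 3: "superfamily", 4: "family"}
    parts = group.split(".")
    # Reject empty components ('c..1') and identifiers that are too deep.
    if not all(parts) or len(parts) not in levels:
        raise ValueError(f"not a SCOP group identifier: {group!r}")
    return levels[len(parts)]


fold_level = scop_level("c.2")
superfamily_level = scop_level("c.2.1")
family_level = scop_level("c.2.1.1")
```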

database.data.fetch_id(fuzzle_id: int) → database.data.Hit[source]

Returns the hit in Fuzzle with that ID

Parameters

fuzzle_id – The Fuzzle hit ID to retrieve from hh207clusters

Returns

A Hit object

database.data.fetch_subspace(prob: int = 70, rmsd: float = 3.0, ca_min: int = 10, ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25, scop_q: Optional[str] = None, diff_folds: bool = True) → database.data.Result[source]

Returns the entries in Fuzzle that satisfy the conditions:

Parameters
  • prob – Lower cutoff for the hit probability

  • rmsd – Upper cutoff for rmsd_tm_pair

  • ca_min – Lower cutoff for the number of amino acids (ca_tm_pair)

  • ca_max – Upper cutoff for the number of amino acids (ca_tm_pair)

  • score_tm_pair – Lower cutoff for the TM-score (score_tm_pair)

  • ratio – Upper cutoff for the ratio cols / ca_tm_pair

  • scop_q – A SCOP class. It will retrieve hits that contain domains from this class

Returns

A Result object
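
The cutoffs shared by the fetch functions can be read as a single predicate over the Hit fields. The following is a minimal sketch of that reading and not the library's query: it assumes every bound is inclusive and that diff_folds compares q_fold_id with s_fold_id. CutoffHit and passes_cutoffs are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CutoffHit:
    """Hypothetical stand-in with only the fields the cutoffs refer to."""
    prob: float
    rmsd_tm_pair: float
    ca_tm_pair: int
    score_tm_pair: float
    cols: int
    q_fold_id: str
    s_fold_id: str


def passes_cutoffs(hit: CutoffHit, prob: int = 70, rmsd: float = 3.0,
                   ca_min: int = 10, ca_max: int = 200,
                   score_tm_pair: float = 0.3, ratio: float = 1.25,
                   diff_folds: bool = True) -> bool:
    """Apply the documented cutoffs (assumed inclusive) to one hit."""
    if diff_folds and hit.q_fold_id == hit.s_fold_id:
        return False
    return (
        hit.prob >= prob
        and hit.rmsd_tm_pair <= rmsd
        and ca_min <= hit.ca_tm_pair <= ca_max
        and hit.score_tm_pair >= score_tm_pair
        # cols / ca_tm_pair <= ratio, written without a division
        and hit.cols <= ratio * hit.ca_tm_pair
    )


# Made-up values chosen to sit inside the default cutoffs.
good = CutoffHit(95.0, 2.1, 60, 0.45, 66, "c.37", "c.1")
same_fold = CutoffHit(95.0, 2.1, 60, 0.45, 66, "c.37", "c.37")
```

With the defaults, good passes while same_fold is rejected unless diff_folds=False is given.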

database.data.filter_hits_domain(ahits, domain)[source]

Searches a Result object for all the hits in which a certain domain appears

Parameters
  • ahits – A Result object

  • domain – A SCOPe domain identifier

Returns

np.array. The starts and ends of the domain in all the hits where it appears.
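
A plain-Python sketch of this lookup is shown below. It assumes that the reported positions are q_start/q_end when the domain is the query and s_start/s_end when it is the subject, and it returns a list of tuples instead of an np.array to stay dependency-free. SpanHit and domain_spans are hypothetical names, and the positions are made up.

```python
from typing import List, NamedTuple, Tuple


class SpanHit(NamedTuple):
    """Hypothetical stand-in carrying only the fields used below."""
    query: str
    q_start: int
    q_end: int
    sbjct: str
    s_start: int
    s_end: int


def domain_spans(hits: List[SpanHit], domain: str) -> List[Tuple[int, int]]:
    """Collect (start, end) for every appearance of `domain` in the hits."""
    spans = []
    for hit in hits:
        if hit.query == domain:
            spans.append((hit.q_start, hit.q_end))
        if hit.sbjct == domain:
            spans.append((hit.s_start, hit.s_end))
    return spans


hits = [
    SpanHit("d4akea1", 10, 50, "d1reqa2", 3, 44),
    SpanHit("d1cph.1", 1, 20, "d4akea1", 5, 45),
]
spans = domain_spans(hits, "d4akea1")
```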

database.data.parse_hit(line: List[str]) → database.data.Hit[source]

Parameters

line

Returns

A Hit object
database.data.validate_scopid(query: str) → bool[source]

A SCOP domain is a 7-character sid that consists of “d” followed by the 4-character PDB ID of the file of origin, the PDB chain ID (‘_’ if none, ‘.’ if multiple, as is the case in genetic domains), and a single character (usually an integer) if needed to specify the domain uniquely (‘_’ if not). Sids are currently all lower case, even when the chain letter is upper case. Examples include d4akea1, d1reqa2, and d1cph.1.

Parameters

query – The seven-letter domain identifier for the query
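
The sid format described above can be approximated with a regular expression. This is a sketch of the stated rules and not the check that validate_scopid performs; it additionally assumes that PDB IDs start with a digit. SID_PATTERN and looks_like_sid are hypothetical names.

```python
import re

# "d" + 4-character PDB ID + chain ('_' none, '.' multiple) + domain character
SID_PATTERN = re.compile(r"d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]")


def looks_like_sid(query: str) -> bool:
    """Return True if `query` has the shape of a 7-character SCOP sid."""
    return SID_PATTERN.fullmatch(query) is not None


checks = {sid: looks_like_sid(sid) for sid in ("d4akea1", "d1reqa2", "d1cph.1")}
```

Upper-case input such as D4AKEA1 and truncated identifiers such as d4ake are rejected by this pattern.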