Fetching from the Fuzzle database

First import the ProtLego module.

[2]:
from protlego.all import *

The Fuzzle web server is a useful resource for finding fragments shared by different folds. It is built on the SCOP95 subset of the SCOP database. For each domain, a hidden Markov model (HMM) was generated, and all-against-all HMM profile comparisons were performed with HHsearch. Structural similarity was measured with TM-align. The resulting set of more than 10 million hits is stored in Fuzzle.

The web interface is available at https://fuzzle.uni-bayreuth.de/2.0

1. Fetching by ID

There are several ways to fetch from the Fuzzle database. Perhaps the easiest is to fetch by hit ID: each hit in Fuzzle has an ID that identifies it.

[5]:
myhit = fetch_id('4413706')
[6]:
type(myhit)
[6]:
protlego.database.data.Hit

You can get the documentation of any function or object with the built-in function help():

[8]:
help(myhit)
Help on Hit in module protlego.database.data object:

class Hit(builtins.tuple)
 |  Some of the documentation of this function was
 |  taken from the hhsuite python documentation:
 |  https://github.com/soedinglab/hh-suite/wiki
 |  as the sequence information from the Fuzzle hits
 |  come from HHsearch.
 |  The structural superimpositions were performed with
 |  TMalign:   https://zhanglab.ccmb.med.umich.edu/TM-align/
 |
 |  =======================================================
 |
 |  Method resolution order:
 |      Hit
 |      builtins.tuple
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __getnewargs__(self)
 |      Return self as a plain tuple.  Used by copy and pickle.
 |
 |  __repr__(self)
 |      Return repr(self).
 |
 |  _asdict(self)
 |      Return a new OrderedDict which maps field names to their values.
 |
 |  _replace(_self, **kwds)
 |      Return a new Hit object replacing specified fields with new values
 |
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |
 |  _make(iterable, new=<built-in method __new__ of type object at 0x55c615dfe240>, len=<built-in function len>) from builtins.type
 |      Make a new Hit object from a sequence or iterable
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(_cls, id:int, query:str, q_scop_id:str, no:int, sbjct:str, s_scop_id:str, s_desc:str, prob:float, eval:float, pval:float, score:float, ss:float, cols:int, q_start:int, q_end:int, s_start:int, s_end:int, hmm:int, ident:float, q_sufam_id:str, s_sufam_id:str, q_fold_id:str, s_fold_id:str, rmsd_pair:float, ca_pair:int, rmsd_tm_pair:float, score_tm_pair:float, ca_tm_pair:int, rmsd_tm:float, score_tm:float, ca_tm:int, q_tm_start:int, q_tm_end:int, s_tm_start:int, s_tm_end:int, q_cluster:str, s_cluster:str)
 |      Create new instance of Hit(id, query, q_scop_id, no, sbjct, s_scop_id, s_desc, prob, eval, pval, score, ss, cols, q_start, q_end, s_start, s_end, hmm, ident, q_sufam_id, s_sufam_id, q_fold_id, s_fold_id, rmsd_pair, ca_pair, rmsd_tm_pair, score_tm_pair, ca_tm_pair, rmsd_tm, score_tm, ca_tm, q_tm_start, q_tm_end, s_tm_start, s_tm_end, q_cluster, s_cluster)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  id
 |      Alias for field number 0
 |
 |  query
 |      Alias for field number 1
 |
 |  q_scop_id
 |      Alias for field number 2
 |
 |  no
 |      Alias for field number 3
 |
 |  sbjct
 |      Alias for field number 4
 |
 |  s_scop_id
 |      Alias for field number 5
 |
 |  s_desc
 |      Alias for field number 6
 |
 |  prob
 |      Alias for field number 7
 |
 |  eval
 |      Alias for field number 8
 |
 |  pval
 |      Alias for field number 9
 |
 |  score
 |      Alias for field number 10
 |
 |  ss
 |      Alias for field number 11
 |
 |  cols
 |      Alias for field number 12
 |
 |  q_start
 |      Alias for field number 13
 |
 |  q_end
 |      Alias for field number 14
 |
 |  s_start
 |      Alias for field number 15
 |
 |  s_end
 |      Alias for field number 16
 |
 |  hmm
 |      Alias for field number 17
 |
 |  ident
 |      Alias for field number 18
 |
 |  q_sufam_id
 |      Alias for field number 19
 |
 |  s_sufam_id
 |      Alias for field number 20
 |
 |  q_fold_id
 |      Alias for field number 21
 |
 |  s_fold_id
 |      Alias for field number 22
 |
 |  rmsd_pair
 |      Alias for field number 23
 |
 |  ca_pair
 |      Alias for field number 24
 |
 |  rmsd_tm_pair
 |      Alias for field number 25
 |
 |  score_tm_pair
 |      Alias for field number 26
 |
 |  ca_tm_pair
 |      Alias for field number 27
 |
 |  rmsd_tm
 |      Alias for field number 28
 |
 |  score_tm
 |      Alias for field number 29
 |
 |  ca_tm
 |      Alias for field number 30
 |
 |  q_tm_start
 |      Alias for field number 31
 |
 |  q_tm_end
 |      Alias for field number 32
 |
 |  s_tm_start
 |      Alias for field number 33
 |
 |  s_tm_end
 |      Alias for field number 34
 |
 |  q_cluster
 |      Alias for field number 35
 |
 |  s_cluster
 |      Alias for field number 36
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  __annotations__ = OrderedDict([('id', <class 'int'>), ('query', <c...'...
 |
 |  _field_defaults = {}
 |
 |  _field_types = OrderedDict([('id', <class 'int'>), ('query', <c...', <...
 |
 |  _fields = ('id', 'query', 'q_scop_id', 'no', 'sbjct', 's_scop_id', 's_...
 |
 |  _source = "from builtins import property as _property, tupl...temgette...
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from builtins.tuple:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return key in self.
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __iter__(self, /)
 |      Implement iter(self).
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __len__(self, /)
 |      Return len(self).
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __mul__(self, value, /)
 |      Return self*value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __rmul__(self, value, /)
 |      Return value*self.
 |
 |  count(...)
 |      T.count(value) -> integer -- return number of occurrences of value
 |
 |  index(...)
 |      T.index(value, [start, [stop]]) -> integer -- return first index of value.
 |      Raises ValueError if the value is not present.

As the help shows, a hit stores a lot of information. For example: the alignment length (cols), the HHsearch probability (prob), the domain names and SCOP IDs of the two parents, and the TM-align score.

[9]:
print(myhit.cols)
print(myhit.prob)
print(myhit.q_cluster)
print(myhit.s_cluster)
print(myhit.rmsd_tm_pair)
print(myhit.q_scop_id)
print(myhit.s_scop_id)
print(myhit.score_tm)
101
81.7
d2dfda1_1
d1wa5a__2
2.86
c.2.1.5
c.37.1.8
0.54938
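
The help output shows that Hit derives from tuple and exposes `_fields` and `_asdict()`, so it behaves like a named tuple: fields can be read by name, read by position, or dumped as a mapping. The sketch below illustrates this with `MiniHit`, a hypothetical stand-in that carries only four of the fields; it is not ProtLego's class, and the positions differ from the real Hit (where cols is field number 12).

```python
from typing import NamedTuple


class MiniHit(NamedTuple):
    """Hypothetical stand-in for a Fuzzle hit with only four of its fields."""
    id: int
    cols: int
    prob: float
    score_tm: float


# Values copied from the printed output of hit 4413706 above.
hit = MiniHit(id=4413706, cols=101, prob=81.7, score_tm=0.54938)

print(hit.cols)       # access by field name
print(hit[1])         # same value by position (cols is field 1 in this stand-in)
print(hit._fields)    # names of all fields, in order
print(hit._asdict())  # mapping of field name to value
```

Access by name is usually the safer choice, since it does not depend on the order of the fields.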

2. Search all the hits that contain a specific domain

[10]:
myhits=fetch_by_domain('d1wa5a_')
[7]:
myhits
[7]:
Query from Fuzzle with 9 hits belonging to 3 fold(s)

The variable myhits can contain one hit, several, or none, depending on the promiscuity of the domain. In this case the domain appears in 9 hits, which involve 3 different folds overall.

From the variable myhits, we can directly retrieve a few statistical values:

[11]:
print(myhits.ids)
print(myhits.avg_len)
print(myhits.std_len)
print(myhits.list_folds)
[2290353, 361595, 787549, 1181040, 1258170, 2081517, 4413706, 6818256, 9075069, 7536792, 7536798, 7536807, 7536816, 7536888, 7536911, 7536939, 7536971, 7536978]
66.5555555556
41.6616293251
['c.2' 'c.37' 'c.91']
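
To make these summary values concrete, here is a minimal sketch of how an average length, a standard deviation, and a fold list could be derived from a handful of hits. It is an illustration only: the records are made up, and it assumes that the length is the cols field and that the population standard deviation is wanted; ProtLego may compute avg_len and std_len differently.

```python
import statistics

# Hypothetical (cols, q_scop_id, s_scop_id) triples; not real Fuzzle data.
records = [
    (101, "c.2.1.5", "c.37.1.8"),
    (45, "c.37.1.8", "c.91.1.1"),
    (60, "c.37.1.8", "c.2.1.3"),
]


def fold_of(scop_id: str) -> str:
    """Return the class.fold prefix of a SCOP identifier, e.g. 'c.2.1.5' -> 'c.2'."""
    return ".".join(scop_id.split(".")[:2])


lengths = [cols for cols, _, _ in records]
avg_len = statistics.mean(lengths)
std_len = statistics.pstdev(lengths)  # population std; an assumption, see above

# Collect the folds of both parents of every hit, without duplicates.
folds = sorted({fold_of(scop_id) for _, q, s in records for scop_id in (q, s)})

print(avg_len, std_len, folds)
```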

3. Fetch by parents (finding a hit between two domains)

One can also fetch by the domain names of the two parents. This type of search can also return no hits, one hit, or several, because there can be both query-subject and subject-query combinations, along with different ways to superimpose the structures in each of them.

[12]:
myhits2 = fetch_by_domains('d1wa5a_','d2dfda1')
[13]:
myhits2
[13]:
Query from Fuzzle with 2 hits belonging to 2 fold(s)
[11]:
help(fetch_by_domains)
Help on function fetch_by_domains in module protlego.database.data:

fetch_by_domains(domain1:str, domain2:str, prob:int=70, rmsd:float=3.0, ca_min:int=10, ca_max:int=200, score_tm_pair:float=0.3, ratio:float=1.25, diff_folds:bool=True)
    Fetch all the hits between two parent domains

    :param domain1: The 7 letter code for one of the parents
    :param domain2: The 7 letter code for one of the parents
    :param prob: the minimum allowed HHsearch probability
    :param rmsd: The maximum allowed RMSD (rmsd_tm_pair: "RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed)
    :param ca_min: The minimum allowed fragment length (for the TMalign alignment)
    :param ca_max: The maximun allowed fragment length (for the TMalign alignment)
    :param score_tm_pair: The minimum allowed TM-score (for the TMalign alignment)
    :param ratio: the maximum ratio for the sequence and structural alignment lengths (cols / ca_tm_pair)
    :param diff_folds: Whether to exclude hits from the same fold (True) or not (False)
    :return: A result class obtaining the hits that fulfill these criteria
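
The cutoffs listed in this help text can be read as a single predicate over a hit. The sketch below spells that out in plain Python. `Candidate` and `passes` are hypothetical names, the example values are partly made up, and whether each boundary is inclusive is an assumption; the actual filtering happens inside ProtLego's database query and may differ in detail.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record holding only the fields the cutoffs refer to."""
    prob: float
    rmsd_tm_pair: float
    ca_tm_pair: int
    score_tm_pair: float
    cols: int
    q_fold_id: str
    s_fold_id: str


def passes(hit: Candidate, prob: float = 70, rmsd: float = 3.0, ca_min: int = 10,
           ca_max: int = 200, score_tm_pair: float = 0.3, ratio: float = 1.25,
           diff_folds: bool = True) -> bool:
    """Apply the documented cutoffs; boundary inclusiveness is assumed."""
    if hit.prob < prob or hit.rmsd_tm_pair > rmsd:
        return False
    if hit.ca_tm_pair <= 0 or not ca_min <= hit.ca_tm_pair <= ca_max:
        return False
    if hit.score_tm_pair < score_tm_pair:
        return False
    # Sequence alignment length relative to structural alignment length.
    if hit.cols / hit.ca_tm_pair > ratio:
        return False
    if diff_folds and hit.q_fold_id == hit.s_fold_id:
        return False
    return True


# prob, rmsd_tm_pair and cols follow hit 4413706 above; the rest is made up.
example = Candidate(81.7, 2.86, 90, 0.45, 101, "c.2", "c.37")
print(passes(example))                                            # True
print(passes(example._replace(s_fold_id="c.2")))                  # False
print(passes(example._replace(s_fold_id="c.2"), diff_folds=False))  # True
```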

[16]:
mynewhit=myhits2[0]
[17]:
print(mynewhit.cols)
print(mynewhit.prob)
print(mynewhit.q_cluster)
print(mynewhit.s_cluster)
print(mynewhit.q_scop_id)
print(mynewhit.s_scop_id)
print(mynewhit.score_tm)
101
76.2
d1wa5a__2
d2dfda1_1
c.37.1.8
c.2.1.5
0.48058

4. Fetching between two groups

It is also possible to fetch the hits between two SCOP groups, for example between two families. Other options are searching between two superfamilies or between two folds.
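
The level of a SCOP group can be read off its dotted identifier: two components name a fold, three a superfamily, four a family. The helper functions below are a hypothetical illustration of that convention and are not part of ProtLego. Note that group membership has to be tested on whole components, since a plain string prefix test would wrongly place 'c.23.1.1' inside 'c.2'.

```python
LEVELS = {1: "class", 2: "fold", 3: "superfamily", 4: "family"}


def scop_level(scop_id: str) -> str:
    """Name the SCOP level of a dotted identifier by counting its components."""
    return LEVELS[len(scop_id.split("."))]


def belongs_to(scop_id: str, group: str) -> bool:
    """True if scop_id lies inside group; compares whole components."""
    id_parts, group_parts = scop_id.split("."), group.split(".")
    return id_parts[:len(group_parts)] == group_parts


for group in ("c.23", "c.23.1", "c.23.1.1"):
    print(group, "->", scop_level(group))

print(belongs_to("c.23.1.1", "c.23"))  # True
print(belongs_to("c.23.1.1", "c.2"))   # False: c.2 and c.23 are different folds
```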

4.1 Fetching between two families

In this case we try a different combination: one family from the flavodoxin fold (c.23.1.1) and one from the PBP fold (c.93.1.0):

[20]:
myhits3=fetch_group('c.23.1.1','c.93.1.0')
print(myhits3)
Query from Fuzzle with 472 hits belonging to 2 fold(s)

As before, the variable myhits3 contains the hits together with additional information, such as the average length of the hits, its standard deviation, and the folds involved.

[21]:
print(myhits3.hits[0]) # print the first hit
print(myhits3.avg_len) # average length of the hits between these two families
print(myhits3.std_len)
print(myhits3.list_folds)
Hit between d2b4aa1 and d4nqra_ with probability 71.4 %

78.2330508475
14.8673772953
['c.23' 'c.93']

4.2 Fetch between two superfamilies

One can also fetch between two superfamilies. In the previous section we obtained 472 hits between two families that belong to these superfamilies, so we should now obtain many more:

[22]:
myhits4=fetch_group('c.23.1','c.93.1')
[23]:
myhits4
[23]:
Query from Fuzzle with 1859 hits belonging to 2 fold(s)

4.3 Fetch between two folds

We can also search for hits between two folds. As before, we can impose some criteria, such as a minimum probability, a maximum RMSD, or a minimum fragment length:

[24]:
myhits_1 = fetch_group('c.23','c.93',prob=70) # searching for hits with probability over 70%
[25]:
myhits_1
[25]:
Query from Fuzzle with 4946 hits belonging to 2 fold(s)
[26]:
myhits_2 = fetch_group('c.23','c.93',prob=80,rmsd=3) # fetching hits with prob. over 80 and rmsd <3
[27]:
myhits_2
[27]:
Query from Fuzzle with 2554 hits belonging to 2 fold(s)
[28]:
myhits_3 = fetch_group('c.23','c.93',prob=80, rmsd=3,ca_min=50) # as above, but also requiring fragments of at least 50 amino acids
[29]:
myhits_3
[29]:
Query from Fuzzle with 2478 hits belonging to 2 fold(s)
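
Each additional cutoff narrows the result set. As a small worked example, the snippet below takes the three hit counts reported above and expresses them relative to the first query; the labels simply repeat the arguments passed to fetch_group.

```python
# Hit counts copied from the three queries above (folds c.23 and c.93).
counts = {
    "prob=70": 4946,
    "prob=80, rmsd=3": 2554,
    "prob=80, rmsd=3, ca_min=50": 2478,
}

baseline = counts["prob=70"]
for label, n in counts.items():
    print(f"{label:28s} {n:5d} hits ({n / baseline:.1%} of the first query)")
```

Raising the probability cutoff from 70 to 80 removes about half of the hits, while the additional length requirement removes comparatively few.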

5. Fetching subspaces

Additionally, there is the possibility to fetch a group or a single query against the rest of the database.

5.1 All hits that contain a TIM-barrel

With the function fetch_subspace we can obtain sets of hits that fulfill given criteria, for example all hits involving the TIM-barrel fold. Note that these functions apply some default cutoffs:

[30]:
myhits5 = fetch_subspace(scop_q='c.1')
[31]:
myhits5.list_folds
[31]:
array(['a.1', 'a.100', 'a.101', 'a.102', 'a.108', 'a.114', 'a.118',
       'a.121', 'a.126', 'a.127', 'a.128', 'a.13', 'a.137', 'a.140',
       'a.144', 'a.149', 'a.15', 'a.150', 'a.152', 'a.153', 'a.156',
       'a.157', 'a.159', 'a.16', 'a.168', 'a.174', 'a.177', 'a.178',
       'a.179', 'a.18', 'a.182', 'a.185', 'a.186', 'a.193', 'a.199', 'a.2',
       'a.20', 'a.204', 'a.206', 'a.21', 'a.218', 'a.219', 'a.22', 'a.222',
       'a.229', 'a.23', 'a.237', 'a.24', 'a.244', 'a.247', 'a.248', 'a.25',
       'a.253', 'a.254', 'a.258', 'a.26', 'a.271', 'a.272', 'a.277',
       'a.28', 'a.284', 'a.287', 'a.29', 'a.291', 'a.293', 'a.294',
       'a.297', 'a.298', 'a.3', 'a.30', 'a.300', 'a.301', 'a.31', 'a.32',
       'a.34', 'a.35', 'a.36', 'a.39', 'a.4', 'a.40', 'a.41', 'a.42',
       'a.43', 'a.45', 'a.46', 'a.47', 'a.48', 'a.5', 'a.53', 'a.55',
       'a.56', 'a.58', 'a.59', 'a.6', 'a.60', 'a.61', 'a.64', 'a.65',
       'a.69', 'a.7', 'a.73', 'a.74', 'a.77', 'a.8', 'a.80', 'a.81',
       'a.86', 'a.88', 'a.89', 'a.9', 'a.92', 'a.93', 'a.95', 'a.96',
       'b.1', 'b.101', 'b.106', 'b.11', 'b.111', 'b.121', 'b.122', 'b.124',
       'b.129', 'b.136', 'b.137', 'b.14', 'b.143', 'b.144', 'b.159',
       'b.163', 'b.174', 'b.176', 'b.178', 'b.19', 'b.2', 'b.22', 'b.26',
       'b.29', 'b.31', 'b.34', 'b.35', 'b.36', 'b.38', 'b.40', 'b.43',
       'b.45', 'b.47', 'b.49', 'b.50', 'b.51', 'b.52', 'b.53', 'b.54',
       'b.55', 'b.59', 'b.6', 'b.60', 'b.61', 'b.62', 'b.7', 'b.71',
       'b.72', 'b.73', 'b.80', 'b.82', 'b.84', 'b.85', 'b.87', 'b.88',
       'b.92', 'c.1', 'c.100', 'c.101', 'c.102', 'c.103', 'c.105', 'c.107',
       'c.108', 'c.109', 'c.110', 'c.111', 'c.112', 'c.114', 'c.115',
       'c.116', 'c.119', 'c.12', 'c.120', 'c.121', 'c.122', 'c.123',
       'c.124', 'c.125', 'c.127', 'c.128', 'c.129', 'c.13', 'c.131',
       'c.133', 'c.135', 'c.136', 'c.138', 'c.14', 'c.141', 'c.144',
       'c.145', 'c.149', 'c.15', 'c.150', 'c.151', 'c.154', 'c.155',
       'c.16', 'c.17', 'c.18', 'c.19', 'c.2', 'c.20', 'c.23', 'c.24',
       'c.25', 'c.26', 'c.27', 'c.28', 'c.3', 'c.30', 'c.31', 'c.32',
       'c.33', 'c.34', 'c.36', 'c.37', 'c.38', 'c.4', 'c.41', 'c.42',
       'c.44', 'c.45', 'c.46', 'c.47', 'c.48', 'c.49', 'c.5', 'c.50',
       'c.51', 'c.52', 'c.53', 'c.54', 'c.55', 'c.56', 'c.57', 'c.58',
       'c.59', 'c.6', 'c.60', 'c.61', 'c.62', 'c.65', 'c.66', 'c.67',
       'c.68', 'c.69', 'c.7', 'c.70', 'c.71', 'c.72', 'c.73', 'c.74',
       'c.77', 'c.78', 'c.79', 'c.8', 'c.80', 'c.82', 'c.83', 'c.84',
       'c.85', 'c.86', 'c.87', 'c.88', 'c.89', 'c.9', 'c.90', 'c.92',
       'c.93', 'c.94', 'c.95', 'c.96', 'c.97', 'c.98', 'd.1', 'd.10',
       'd.100', 'd.101', 'd.104', 'd.106', 'd.108', 'd.11', 'd.110',
       'd.111', 'd.112', 'd.113', 'd.115', 'd.116', 'd.118', 'd.120',
       'd.122', 'd.124', 'd.125', 'd.126', 'd.128', 'd.129', 'd.13',
       'd.130', 'd.131', 'd.133', 'd.136', 'd.139', 'd.14', 'd.140',
       'd.141', 'd.142', 'd.144', 'd.145', 'd.146', 'd.147', 'd.15',
       'd.150', 'd.151', 'd.153', 'd.155', 'd.157', 'd.159', 'd.16',
       'd.160', 'd.161', 'd.162', 'd.163', 'd.164', 'd.165', 'd.166',
       'd.168', 'd.169', 'd.17', 'd.173', 'd.175', 'd.178', 'd.18',
       'd.184', 'd.185', 'd.186', 'd.194', 'd.197', 'd.198', 'd.2',
       'd.201', 'd.202', 'd.205', 'd.206', 'd.21', 'd.211', 'd.212',
       'd.213', 'd.217', 'd.218', 'd.22', 'd.224', 'd.225', 'd.227',
       'd.230', 'd.235', 'd.236', 'd.24', 'd.241', 'd.242', 'd.243',
       'd.247', 'd.248', 'd.25', 'd.250', 'd.254', 'd.256', 'd.259',
       'd.26', 'd.264', 'd.267', 'd.268', 'd.273', 'd.274', 'd.276',
       'd.277', 'd.282', 'd.283', 'd.288', 'd.290', 'd.293', 'd.3',
       'd.304', 'd.306', 'd.31', 'd.310', 'd.311', 'd.316', 'd.319',
       'd.32', 'd.321', 'd.326', 'd.327', 'd.328', 'd.332', 'd.335',
       'd.340', 'd.344', 'd.346', 'd.349', 'd.350', 'd.356', 'd.358',
       'd.361', 'd.364', 'd.365', 'd.367', 'd.368', 'd.37', 'd.370',
       'd.379', 'd.38', 'd.381', 'd.39', 'd.390', 'd.391', 'd.4', 'd.40',
       'd.41', 'd.42', 'd.44', 'd.45', 'd.49', 'd.50', 'd.51', 'd.52',
       'd.54', 'd.56', 'd.58', 'd.59', 'd.6', 'd.60', 'd.64', 'd.65',
       'd.66', 'd.67', 'd.68', 'd.7', 'd.70', 'd.73', 'd.74', 'd.75',
       'd.76', 'd.78', 'd.79', 'd.8', 'd.80', 'd.81', 'd.82', 'd.85',
       'd.86', 'd.87', 'd.88', 'd.9', 'd.90', 'd.91', 'd.92', 'd.93',
       'd.94', 'd.95', 'd.96', 'd.99', 'e.10', 'e.13', 'e.19', 'e.22',
       'e.23', 'e.24', 'e.26', 'e.3', 'e.32', 'e.37', 'e.39', 'e.51',
       'e.52', 'e.53', 'e.6', 'e.7', 'e.74', 'e.79', 'e.8', 'e.80', 'f.1',
       'f.17', 'f.21', 'f.23', 'f.3', 'f.31', 'f.48', 'g.18', 'g.19',
       'g.2', 'g.3', 'g.31', 'g.32', 'g.36', 'g.37', 'g.39', 'g.40',
       'g.41', 'g.44', 'g.46', 'g.66', 'g.72', 'g.75', 'g.81', 'g.9',
       'g.96', 'g.97', 'g.98'],
      dtype='<U5')
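
A fold list this long is easier to digest when summarized, for instance by counting folds per SCOP class (the letter before the dot). The sketch below does this with collections.Counter on a short hand-picked excerpt of the array above; the same loop should work on myhits5.list_folds directly, since iterating over the array yields strings.

```python
from collections import Counter

# A short excerpt of the fold list printed above, not the full array.
folds = ["a.1", "a.100", "b.1", "c.1", "c.2", "c.23", "d.58", "g.98"]

# The SCOP class is the part of the identifier before the first dot.
per_class = Counter(fold.split(".")[0] for fold in folds)

for scop_class, n in sorted(per_class.items()):
    print(scop_class, n)
```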

5.2 Fetch full universes

We can also fetch the whole universe subject to some cutoffs, such as probability and RMSD. For example, all the hits with a probability over 70% and an RMSD below 2.0:

[32]:
myhits6 = fetch_subspace(prob=70,rmsd=2.0)
[33]:
myhits6
[33]:
Query from Fuzzle with 95755 hits belonging to 451 fold(s)