

TemBERTure
1 Introduction
Protein thermostability matters across biotechnology, particularly in pharmaceuticals, food production, and biofuel generation, where thermostable proteins can accelerate chemical reactions and reduce production costs. However, traditional experimental methods for assessing thermostability are time-consuming, expensive, and difficult to scale, so thermostability data remain limited.
To enhance the quality of training data and the generalization capability of models, researchers developed TemBERTureDB [1], which is derived from the Meltome Atlas experiment [2], UniProtKB [3], BacDive [4], and NCBI [5]. This database contains over 48,000 protein sequences, covering thermostable and non-thermostable proteins from 13 species, totaling 600,000 data entries.

Figure 1. (A) Composition of TemBERTureDB. (B) Architecture of TemBERTure: TemBERTureCLS uses a classification prediction head, and TemBERTureTm uses a regression prediction head.
TemBERTureCLS and TemBERTureTm were built on the existing protBERT-BFD [6] language model and fine-tuned with an adapter-based [7,8] method to predict the thermostability category and the melting temperature of proteins, respectively. TemBERTureCLS performs well at predicting thermostability categories, achieving an accuracy of 0.89 and an F1 score of 0.9. Although the melting temperatures predicted by TemBERTureTm are biased toward a bimodal distribution, the model still maintains high accuracy in predicting protein thermostability categories.
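For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how accuracy and F1 are computed for a binary classifier. The function name and the toy labels are invented for illustration and are not TemBERTure results. The sketch also assumes that F1 is reported for the thermophilic class, which the text above does not state.

```python
def accuracy_and_f1(y_true, y_pred, positive="Thermophilic"):
    """Return (accuracy, F1) for binary labels, with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, invented for illustration only (assumption, not real data)
y_true = ["Thermophilic", "Thermophilic", "Non-thermophilic",
          "Non-thermophilic", "Thermophilic"]
y_pred = ["Thermophilic", "Non-thermophilic", "Non-thermophilic",
          "Thermophilic", "Thermophilic"]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Accuracy counts every correct call equally, while F1 ignores true negatives and therefore tracks how well the positive class is recovered.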

Figure 2. Performance of TemBERTureCLS

Figure 3. Performance of TemBERTureTm
2 Parameter
Input: a protein sequence, or a FASTA file containing the sequences to predict.
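Before submitting a FASTA file, it can help to check that it parses cleanly. The sketch below is a generic pre-check and not the tool's own parser. The helper names `parse_fasta` and `invalid_residues` are hypothetical, and limiting the alphabet to the 20 standard amino acids is an assumption about what the tool accepts.

```python
# Assumption: only the 20 standard amino acid letters are treated as valid here.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` outside the assumed alphabet."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Made-up example records, for illustration only
example = """>protein_a example
MKTAYIAKQR
QISFVKSHFS
>protein_b
mslltevetp*
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq), invalid_residues(seq))
```

Sequences that report unexpected characters (such as a trailing `*` stop symbol) can then be cleaned before prediction.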
3 Results Explanation
| Column Name | Description |
|---|---|
| Sequence | Input sequence |
| Tm | Predicted melting temperature value (Tm) |
| Thermostability Classification | Predicted protein thermostability category: Thermophilic or Non-thermophilic |
| Thermophilicity Prediction Score | Predicted probability score of thermophilicity, ranging from 0 to 1.0, where a higher value indicates a greater likelihood of thermophilicity |
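
As an illustration of how these columns might be used downstream, the sketch below loads a results table and keeps only confidently thermophilic predictions. It assumes the results are exported as CSV with exactly the column names listed above, which this document does not confirm. The sample rows, the helper names, and the `min_score=0.9` cutoff are all invented for illustration.

```python
import csv
import io

# Invented sample rows (assumption: CSV export using the column names above)
SAMPLE = """Sequence,Tm,Thermostability Classification,Thermophilicity Prediction Score
MKTAYIAKQR,71.3,Thermophilic,0.93
MSLLTEVETP,48.9,Non-thermophilic,0.12
MAHHLLKRES,62.0,Thermophilic,0.58
"""


def load_results(handle):
    """Read result rows and convert the numeric columns to floats."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "sequence": row["Sequence"],
            "tm": float(row["Tm"]),
            "label": row["Thermostability Classification"],
            "score": float(row["Thermophilicity Prediction Score"]),
        })
    return rows


def confident_thermophiles(rows, min_score=0.9):
    """Keep rows labeled Thermophilic whose score is at least `min_score`."""
    return [r for r in rows
            if r["label"] == "Thermophilic" and r["score"] >= min_score]


rows = load_results(io.StringIO(SAMPLE))
hits = confident_thermophiles(rows)
for r in hits:
    print(r["sequence"], r["tm"], r["score"])
```

A stricter or looser cutoff trades fewer false positives against fewer retained candidates. The value used here is arbitrary.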
4 References
[1] Rodella C, Lazaridi S, Lemmin T. TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms. Bioinform Adv 2024;4:vbae103. https://doi.org/10.1093/bioadv/vbae103
[2] Jarzab A, Kurzawa N, Hopf T et al. Meltome atlas—thermal proteome stability across the tree of life. Nat Methods 2020;17:495–503. https://doi.org/10.1038/s41592-020-0801-4
[3] The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–31. https://doi.org/10.1093/nar/gkac1052
[4] Reimer LC, Sardà Carbasse J, Koblitz J et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res 2022;50:D741–6. https://doi.org/10.1093/nar/gkab961
[5] Sayers EW, Bolton EE, Brister JR et al. Database resources of the national center for biotechnology information. Nucleic Acids Res 2022;50:D20–6. https://doi.org/10.1093/nar/gkab1112
[6] Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27. https://doi.org/10.1109/TPAMI.2021.3095381
[7] Houlsby N, Giurgiu A, Jastrzebski S et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning. Proc Mach Learn Res 2019;97:2790–9. https://proceedings.mlr.press/v97/houlsby19a.html
[8] Poth et al., Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning. EMNLP 2023. https://aclanthology.org/2023.emnlp-demo.13/

