---
license: cc-by-nc-nd-4.0
---

This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.

- `embeddings` folder contains processed Hugging Face datasets with PeptideCLM embeddings. The `.csv` is the pre-processed data.
- `metrics` folder contains the model performance on the validation data
- `models` hosts all trained model weights
- `training_data` hosts all **raw data** used to train the classifiers
- `functions` contains files to utilize the trained weights and classifiers
- `train` contains the scripts to train classifiers on the pre-processed embeddings, using either XGBoost or MLPs
- `scoring_functions.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications

# PeptiVerse 🧬🌌

A collection of machine learning predictors for the properties of canonical and non-canonical peptides, operating on SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

## Predictors 🧫

PeptiVerse includes the following property predictors:

| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
|-----------|-------------|----------------|----------------------|--------------|------------|
| **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
| **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
| **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
| **Permeability** | Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
| **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 to 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |

## Model Performance 🌟

#### Binary Classification Predictors

| Predictor | Val AUC | Val F1 |
|-----------|---------|--------|
| **Non-Hemolysis** | 0.7902 | 0.8260 |
| **Solubility** | 0.6016 | 0.5767 |
| **Non-Fouling** | 0.9327 | 0.8774 |

#### Regression Predictors

| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
|-----------|------------------------------|----------------------------|
| **Permeability** | 0.958 | 0.710 |
| **Binding Affinity** | 0.805 | 0.611 |

## Setup 🌟

1. Clone the repository:
```bash
git clone https://github.com/sophtang/PeptiVerse.git
cd PeptiVerse
```

2. Create and activate the conda environment:
```bash
conda env create -f environment.yml

conda activate peptiverse
```

3. Set the `base_path` in each file to the location of your local copy so that all model weights and tokenizers load correctly.

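
Before editing `base_path`, it can help to confirm that the directory you plan to point it at contains the folders listed at the top of this README. The sketch below is not part of PeptiVerse: `missing_folders` is a hypothetical helper, the path is a placeholder, and the folder names are simply those from the repository layout described above.

```python
from pathlib import Path

# Folder names taken from the repository layout described in this README.
EXPECTED_FOLDERS = ("embeddings", "metrics", "models", "training_data", "functions", "train")


def missing_folders(base_path, expected=EXPECTED_FOLDERS):
    """Return the expected sub-folders that are absent under base_path."""
    root = Path(base_path)
    return [name for name in expected if not (root / name).is_dir()]


# Placeholder path; replace it with the location of your local copy.
missing = missing_folders("/path/to/PeptiVerse")
if missing:
    print(f"Missing under base_path: {', '.join(missing)}")
else:
    print("All expected folders found.")
```

An empty list means every expected folder was found; anything else names the folders to download or relocate before running the predictors.
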
## Usage 🌟

#### 1. Hemolysis Prediction

Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis

# Initialize predictor
hemo = Hemolysis()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = hemo(peptides)
print(f"Non-hemolytic probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = likely non-hemolytic (safe)
- Score close to 0.0 = likely hemolytic (unsafe)

---

#### 2. Solubility Prediction

Predicts aqueous solubility. Higher scores indicate better solubility.

```python
from functions.solubility.solubility import Solubility

# Initialize predictor
sol = Solubility()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = sol(peptides)
print(f"Solubility probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble

---

#### 3. Non-Fouling Prediction

Predicts protein resistance/non-fouling properties.

```python
from functions.nonfouling.nonfouling import Nonfouling

# Initialize predictor
nf = Nonfouling()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = nf(peptides)
print(f"Nonfouling score: {scores[0]:.3f}")
```

**Output interpretation:**
- Higher scores = better non-fouling properties

---

#### 4. Permeability Prediction

Predicts membrane permeability on a log P scale.

```python
from functions.permeability.permeability import Permeability

# Initialize predictor
perm = Permeability()

# Input peptide in SMILES format
peptides = [
    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
]

# Get predictions
scores = perm(peptides)
print(f"Permeability (log P): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = more permeable
- Typical range: -10 to 0 (log scale)

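
The −6.0 cutoff given in the Predictors table can be applied directly to predicted values. This is a minimal sketch, assuming the predictor returns one float per peptide; `permeability_label` is a hypothetical helper rather than part of PeptiVerse, and the example scores are made up for illustration.

```python
PERMEABILITY_CUTOFF = -6.0  # cutoff quoted in the Predictors table


def permeability_label(score, cutoff=PERMEABILITY_CUTOFF):
    """Label a predicted permeability score as 'strong' or 'weak'."""
    return "strong" if score >= cutoff else "weak"


# Made-up scores for illustration only, not model output.
example_scores = [-4.8, -6.0, -7.3]
for score in example_scores:
    print(f"{score:.1f} -> {permeability_label(score)}")
```
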
---

#### 5. Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both the peptide (in SMILES format) and the target protein sequence.

```python
from functions.binding.binding import BindingAffinity

# Target protein sequence (amino acid format)
target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."

# Initialize predictor with target protein
binding = BindingAffinity(prot_seq=target_protein)

# Input peptide in SMILES format
peptides = [
    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
]

# Get predictions
scores = binding(peptides)
print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = stronger binding
- Scale: -log(Kd/Ki/IC50)
- ≥ 7.5 = tight binding (≤ ~30 nM)
- 6.0 to 7.5 = medium binding (~30 nM to 1 μM)
- < 6.0 = weak binding (> 1 μM)

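
To read a predicted score as a concentration, the scale can be inverted. The sketch below assumes the score is a base-10 negative logarithm of a molar concentration, which is consistent with the cutoffs above (6.0 corresponds to 1 μM). `affinity_to_nanomolar` and `binding_label` are hypothetical helpers, not part of PeptiVerse, and the example scores are made up.

```python
def affinity_to_nanomolar(score):
    """Convert a -log10(concentration in mol/L) score to nanomolar."""
    return 10.0 ** (-score) * 1e9


def binding_label(score):
    """Bucket a score using the cutoffs from the interpretation list."""
    if score >= 7.5:
        return "tight"
    if score >= 6.0:
        return "medium"
    return "weak"


# Made-up scores for illustration only, not model output.
for score in [5.2, 6.8, 8.1]:
    nanomolar = affinity_to_nanomolar(score)
    print(f"{score:.1f} -> {nanomolar:.1f} nM ({binding_label(score)})")
```
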
---

## Batch Processing 🌟

All predictors support batch processing for multiple peptides:

```python
from functions.hemolysis.hemolysis import Hemolysis

hemo = Hemolysis()

# Multiple peptides
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
]

# Get predictions for all
scores = hemo(peptides)

for i, score in enumerate(scores):
    print(f"Peptide {i+1}: {score:.3f}")
```

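
Batch scores are often used to rank candidates. This is a minimal sketch, assuming the predictor returns one score per input peptide in the same order as the input list; `rank_by_score` is a hypothetical helper, and the scores shown are placeholders rather than model output.

```python
def rank_by_score(peptides, scores, descending=True):
    """Pair each peptide with its score and sort the pairs by score."""
    if len(peptides) != len(scores):
        raise ValueError("peptides and scores must have the same length")
    pairs = zip(peptides, (float(s) for s in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)


# Placeholder scores standing in for the output of a predictor call.
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O",
]
placeholder_scores = [0.42, 0.87]

for smiles, score in rank_by_score(peptides, placeholder_scores):
    print(f"{score:.3f}  {smiles}")
```
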
---

## Unified Scoring with Multiple Predictors 🌟

For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.

### Basic Usage

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from scoring_functions import ScoringFunctions

# Initialize with desired scoring functions
# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
#            'solubility', 'hemolysis', 'nonfouling'
scoring = ScoringFunctions(
    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
    prot_seqs=[]  # Empty if not using binding affinity
)

# Input peptides in SMILES format
peptides = [
    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
]

# Get scores (returns numpy array of shape: num_peptides x num_functions)
scores = scoring(input_seqs=peptides)
print(scores)
```

### Adding Binding Affinity

```python
from scoring_functions import ScoringFunctions

# Target protein sequence (amino acid format)
tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."

# Initialize with binding affinity for one protein
scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
    prot_seqs=[tfr_protein]  # Provide target protein sequence
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
print("Scores for peptide 1:")
print(f"  Binding Affinity: {scores[0][0]:.3f}")
print(f"  Solubility: {scores[0][1]:.3f}")
print(f"  Hemolysis: {scores[0][2]:.3f}")
print(f"  Permeability: {scores[0][3]:.3f}")
```

### Multiple Binding Targets

```python
from scoring_functions import ScoringFunctions

# For dual binding affinity prediction
protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target

scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
    prot_seqs=[protein1, protein2]  # Provide both protein sequences
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
```

### Output Format

The `ScoringFunctions` class returns a NumPy array where:
- **Rows**: Each row corresponds to one input peptide
- **Columns**: Each column corresponds to one scoring function (in the order specified)

```python
# Example with 3 peptides and 4 scoring functions
scores = scoring(input_seqs=peptides)
# Shape: (3, 4)
# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
```

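
Because the columns carry no names, it can be convenient to attach them before further processing. This is a minimal sketch that uses nested lists with made-up numbers in place of the returned array; `label_scores` is a hypothetical helper, not part of `scoring_functions.py`, and it relies only on the row and column ordering described above.

```python
def label_scores(score_rows, score_func_names):
    """Turn each row of scores into a {function_name: score} dictionary."""
    labeled = []
    for row in score_rows:
        if len(row) != len(score_func_names):
            raise ValueError("each row needs one score per scoring function")
        labeled.append(
            {name: float(value) for name, value in zip(score_func_names, row)}
        )
    return labeled


# Made-up values standing in for the array returned by ScoringFunctions.
names = ["solubility", "hemolysis", "nonfouling", "permeability"]
fake_scores = [
    [0.71, 0.88, 0.35, -6.4],
    [0.52, 0.93, 0.61, -5.7],
]

for index, row in enumerate(label_scores(fake_scores, names), start=1):
    print(f"Peptide {index}: {row}")
```
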
---

## Complete Example 🌟

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis
from functions.solubility.solubility import Solubility
from functions.permeability.permeability import Permeability

# Initialize predictors
hemo = Hemolysis()
sol = Solubility()
perm = Permeability()

# Test peptide
peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]

# Get all predictions
hemo_score = hemo(peptide)[0]
sol_score = sol(peptide)[0]
perm_score = perm(peptide)[0]

print("Peptide Property Predictions:")
print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
print(f"  Solubility: {sol_score:.3f}")
print(f"  Permeability: {perm_score:.3f}")
```

---

## Model Architecture 🌟

The predictors share the following design:
- **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
- **Classifier**: XGBoost gradient boosting, except for binding affinity, which uses a cross-attention transformer over ESM2 and PeptideCLM embeddings (see the Predictors table)
- **Input**: SMILES representation of peptides
- **Training**: Models trained on curated datasets with cross-validation

---

## Citation

If you find this repository helpful for your publications, please consider citing our paper:

```
@article{tang2025peptune,
  title={Peptune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  journal={42nd International Conference on Machine Learning},
  year={2025}
}
```

To use this repository, you agree to abide by the MIT License.