---

license: cc-by-nc-nd-4.0
---


This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. 

- `embeddings` contains processed Hugging Face datasets with PeptideCLM embeddings; the `.csv` files hold the pre-processed data.
- `metrics` contains the model performance on the validation data.
- `models` hosts all trained model weights.
- `training_data` hosts all **raw data** used to train the classifiers.
- `functions` contains the files for using the trained weights and classifiers.
- `train` contains the script for training classifiers on the pre-processed embeddings, with either XGBoost or MLPs.
- `scoring_functions.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications.

# PeptiVerse 🧬🌌

A collection of machine learning predictors of canonical and non-canonical peptide properties from SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

## Predictors 🧫

PeptiVerse includes the following property predictors:

| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
|-----------|-------------|-----------------| --------------------|--------------|------------|
| **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
| **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
| **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
| **Permeability** | Cell membrane permeability (PAMPA lipophilicity score, log P scale, range −10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
| **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 to 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |

## Model Performance 🌟

#### Binary Classification Predictors

| Predictor | Val AUC | Val F1 |
|-----------|----------------|----------|
| **Non-Hemolysis** | 0.7902 | 0.8260 |
| **Solubility** | 0.6016 | 0.5767 |
| **Nonfouling** | 0.9327 | 0.8774 |
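
As a reminder of what the two columns measure, the sketch below computes F1 and ROC AUC in plain Python. The labels and probabilities are made up for illustration and are not PeptiVerse validation data, and the 0.5 decision threshold is an assumption, since the threshold behind the reported F1 values is not stated here.

```python
def f1_score(labels, probs, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, probs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive class) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]

print(f"F1:  {f1_score(labels, probs):.3f}")
print(f"AUC: {roc_auc(labels, probs):.3f}")
```

AUC is independent of the threshold, while F1 changes with it, so the two numbers can diverge for the same model.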

#### Regression Predictors

| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
|-----------|------------------------------|----------------------------|
| **Permeability** | 0.958 | 0.710 |
| **Binding Affinity** | 0.805 | 0.611 |
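
Spearman correlation compares the rank order of predicted and measured values rather than the values themselves. The following is a minimal plain-Python sketch of that calculation; the numbers are invented for illustration and are not taken from the PeptiVerse datasets.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented measured vs. predicted values for five peptides.
measured = [-7.2, -5.1, -6.3, -4.8, -8.0]
predicted = [-6.9, -5.9, -5.5, -5.0, -7.5]
rho = spearman(measured, predicted)
print(f"Spearman: {rho:.3f}")
```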

## Setup 🌟

1. Clone the repository:
```bash
git clone https://github.com/sophtang/PeptiVerse.git
cd PeptiVerse
```

2. Install environment:
```bash
conda env create -f environment.yml
conda activate peptiverse
```

3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.
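
Before editing `base_path`, it can help to confirm that the directory you intend to use has the folders the predictors rely on. The check below is a small sketch: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and it assumes the `models` and `functions` folders described at the top of this README sit directly under `base_path`.

```python
import tempfile
from pathlib import Path

# Assumed layout: these folders sit directly under base_path.
# Adjust the tuple if your copy is organised differently.
EXPECTED_DIRS = ("models", "functions")


def missing_dirs(base_path, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under base_path."""
    base = Path(base_path)
    return [name for name in expected if not (base / name).is_dir()]


# Demonstration on a throwaway directory that only has `models/`.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "models").mkdir()
    missing = missing_dirs(tmp)

print(missing)
```

An empty list means every expected folder was found.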

## Usage 🌟

#### 1. Hemolysis Prediction

Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis

# Initialize predictor
hemo = Hemolysis()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = hemo(peptides)
print(f"Non-hemolytic probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = likely non-hemolytic (safe)
- Score close to 0.0 = likely hemolytic (unsafe)

---

#### 2. Solubility Prediction

Predicts aqueous solubility. Higher scores indicate better solubility.

```python
from functions.solubility.solubility import Solubility

# Initialize predictor
sol = Solubility()

# Input peptide
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = sol(peptides)
print(f"Solubility probability: {scores[0]:.3f}")
```

**Output interpretation:**
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble

---

#### 3. Nonfouling Prediction

Predicts protein resistance/non-fouling properties.

```python
from functions.nonfouling.nonfouling import Nonfouling

# Initialize predictor
nf = Nonfouling()

# Input peptide
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = nf(peptides)
print(f"Nonfouling score: {scores[0]:.3f}")
```

**Output interpretation:**
- Higher scores = better non-fouling properties

---

#### 4. Permeability Prediction

Predicts membrane permeability on a log P scale.

```python
from functions.permeability.permeability import Permeability

# Initialize predictor
perm = Permeability()

# Input peptide
peptides = [
    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
]

# Get predictions
scores = perm(peptides)
print(f"Permeability (log P): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = more permeable
- Typical range: -10 to 0 (log scale)
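
To turn the continuous output into a label, one option is to apply the −6.0 cut-off given in the Predictors table. The helper below is a sketch with a hypothetical name (`classify_permeability`), and the example values are made up rather than real model output.

```python
# Cut-off taken from the Predictors table: values at or above it count as strong.
PERMEABILITY_THRESHOLD = -6.0


def classify_permeability(score, threshold=PERMEABILITY_THRESHOLD):
    """Label a predicted permeability value as 'strong' or 'weak'."""
    return "strong" if score >= threshold else "weak"


# Made-up example values, not real model output.
example_scores = [-4.7, -6.0, -7.3]
labels = [classify_permeability(s) for s in example_scores]
print(labels)
```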

---

#### 5. Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.

```python
from functions.binding.binding import BindingAffinity

# Target protein sequence (amino acid format)
target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."

# Initialize predictor with target protein
binding = BindingAffinity(prot_seq=target_protein)

# Input peptide in SMILES format
peptides = [
    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
]

# Get predictions
scores = binding(peptides)
print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
```

**Output interpretation:**
- Higher values = stronger binding
- Scale: -log(Kd/Ki/IC50)
  - 7.5+ = tight binding (≤ ~30nM)
  - 6.0-7.5 = medium binding (~30nM - 1μM)
  - <6.0 = weak binding (> 1μM)
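
Because the score is a negative base-10 logarithm of a molar concentration, it can be converted back to nanomolar units with `10 ** (-score) * 1e9`. The sketch below shows that conversion together with the three buckets listed above; the function names are hypothetical and the example scores are made up.

```python
def to_nanomolar(score):
    """Convert a -log10(molar) score to a concentration in nM."""
    return 10 ** (-score) * 1e9


def classify_binding(score):
    """Bucket a score using the cut-offs listed above (6.0 and 7.5)."""
    if score >= 7.5:
        return "high"
    if score >= 6.0:
        return "medium"
    return "weak"


for score in (5.2, 6.8, 7.5):  # made-up example values
    print(f"{score:.1f} -> {to_nanomolar(score):.1f} nM ({classify_binding(score)})")
```

A one-unit increase in the score corresponds to a tenfold lower concentration, which is why 6.0 maps to about 1 μM and 7.5 to roughly 30 nM.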

---

## Batch Processing 🌟

All predictors support batch processing for multiple peptides:

```python
from functions.hemolysis.hemolysis import Hemolysis

hemo = Hemolysis()

# Multiple peptides
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
]

# Get predictions for all
scores = hemo(peptides)

for i, score in enumerate(scores):
    print(f"Peptide {i+1}: {score:.3f}")
```
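
For very long peptide lists, it may be preferable to score the input in fixed-size chunks so that only one chunk is embedded at a time. The sketch below shows one way to do that; `predict_in_chunks` is a hypothetical helper, and `toy_predictor` is a stand-in so the example runs without the model weights. It assumes a predictor is a callable that takes a list of SMILES strings and returns one score per string, as in the examples above.

```python
def predict_in_chunks(predictor, peptides, chunk_size=64):
    """Call `predictor` on successive slices of `peptides` and join the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    scores = []
    for start in range(0, len(peptides), chunk_size):
        chunk = peptides[start:start + chunk_size]
        scores.extend(predictor(chunk))
    return scores


def toy_predictor(smiles_list):
    """Hypothetical stand-in for a predictor: one float per input string."""
    return [len(s) / 100 for s in smiles_list]


peptides = ["NCC(=O)O", "N[C@@H](C)C(=O)O", "N[C@@H](CO)C(=O)O"]
chunked = predict_in_chunks(toy_predictor, peptides, chunk_size=2)

# Chunked scoring should give the same result as a single call.
print(chunked == toy_predictor(peptides))
```

With a real predictor, replace `toy_predictor` with an instance such as `hemo` from the example above.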

---

## Unified Scoring with Multiple Predictors 🌟

For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.

### Basic Usage

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from scoring_functions import ScoringFunctions

# Initialize with desired scoring functions
# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
#            'solubility', 'hemolysis', 'nonfouling'
scoring = ScoringFunctions(
    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
    prot_seqs=[]  # Empty if not using binding affinity
)

# Input peptides in SMILES format
peptides = [
    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
]

# Get scores (returns numpy array of shape: num_peptides x num_functions)
scores = scoring(input_seqs=peptides)
print(scores)
```

### Adding Binding Affinity

```python
from scoring_functions import ScoringFunctions

# Target protein sequence (amino acid format)
tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."

# Initialize with binding affinity for one protein
scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
    prot_seqs=[tfr_protein]  # Provide target protein sequence
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
print("Scores for peptide 1:")
print(f"  Binding Affinity: {scores[0][0]:.3f}")
print(f"  Solubility: {scores[0][1]:.3f}")
print(f"  Hemolysis (non-hemolytic prob): {scores[0][2]:.3f}")
print(f"  Permeability: {scores[0][3]:.3f}")
```

### Multiple Binding Targets

```python
from scoring_functions import ScoringFunctions

# For dual binding affinity prediction
protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target

scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
    prot_seqs=[protein1, protein2]  # Provide both protein sequences
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
```
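
In the examples above, each `binding_affinityN` entry is paired with one sequence in `prot_seqs`. A small helper can build the name list from the targets and catch typos before the models are loaded. This is a sketch only: `build_score_func_names` is a hypothetical function, and the set of valid names is copied from the comment in the Basic Usage example.

```python
# Names listed as available in the Basic Usage example.
AVAILABLE = {
    "binding_affinity1", "binding_affinity2", "permeability",
    "solubility", "hemolysis", "nonfouling",
}


def build_score_func_names(prot_seqs, properties):
    """Prefix one binding_affinityN entry per target, then the other properties."""
    if len(prot_seqs) > 2:
        raise ValueError("only binding_affinity1 and binding_affinity2 are listed")
    names = [f"binding_affinity{i}" for i in range(1, len(prot_seqs) + 1)]
    names += list(properties)
    unknown = [n for n in names if n not in AVAILABLE]
    if unknown:
        raise ValueError(f"unknown scoring functions: {unknown}")
    return names


# Truncated placeholder sequences, as in the example above.
targets = ["MMDQARSAFSNLFGGEPLSYTR...", "MTKSNGEEPKMGGRMERFQQGV..."]
names = build_score_func_names(targets, ["solubility", "hemolysis"])
print(names)
```

The resulting list can be passed as `score_func_names`, with `targets` as `prot_seqs`.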

### Output Format

The `ScoringFunctions` class returns a numpy array where:
- **Rows**: Each row corresponds to one input peptide
- **Columns**: Each column corresponds to one scoring function (in the order specified)

```python
# Example with 3 peptides and 4 scoring functions
scores = scoring(input_seqs=peptides)
# Shape: (3, 4)
# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
```
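
Since columns follow the order of `score_func_names`, pairing each row with those names makes the output easier to read and filter. The sketch below uses nested lists with made-up numbers in place of the returned array; iterating over the rows of a numpy array behaves the same way. `label_scores` is a hypothetical helper.

```python
def label_scores(score_rows, score_func_names):
    """Turn each row of scores into a {function_name: score} dictionary."""
    return [dict(zip(score_func_names, row)) for row in score_rows]


score_func_names = ["solubility", "hemolysis", "nonfouling", "permeability"]
# Made-up numbers standing in for the array returned by ScoringFunctions.
example_scores = [
    [0.81, 0.92, 0.40, -5.3],
    [0.35, 0.77, 0.65, -7.1],
    [0.60, 0.88, 0.52, -6.2],
]

labelled = label_scores(example_scores, score_func_names)
# Index of the peptide with the highest predicted solubility.
best = max(range(len(labelled)), key=lambda i: labelled[i]["solubility"])
print(best, labelled[best])
```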

---

## Complete Example 🌟

```python
import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis
from functions.solubility.solubility import Solubility
from functions.permeability.permeability import Permeability

# Initialize predictors
hemo = Hemolysis()
sol = Solubility()
perm = Permeability()

# Test peptide
peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]

# Get all predictions
hemo_score = hemo(peptide)[0]
sol_score = sol(peptide)[0]
perm_score = perm(peptide)[0]

print("Peptide Property Predictions:")
print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
print(f"  Solubility: {sol_score:.3f}")
print(f"  Permeability: {perm_score:.3f}")
```

---

## Model Architecture 🌟

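Unless the Predictors table states otherwise, the predictors use:
- **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
- **Classifier**: XGBoost gradient boosting. Permeability additionally uses molecular descriptors, and binding affinity instead uses a cross-attention transformer over ESM2 and PeptideCLM embeddings.
- **Input**: SMILES representation of peptides (plus the target protein sequence for binding affinity)
- **Training**: Models trained on curated datasets with cross-validation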

---
## Citation

If you find this repository helpful for your publications, please consider citing our paper:

```
@article{tang2025peptune,
  title={Peptune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  journal={42nd International Conference on Machine Learning},
  year={2025}
}
```
To use this repository, you agree to abide by the MIT License.