Classifier_Weight / README.md

yinuozhang


license: cc-by-nc-nd-4.0

This repo contains important large files for PeptiVerse, an interactive app for peptide property prediction.

The embeddings folder contains processed Hugging Face datasets with PeptideCLM embeddings; the .csv file is the pre-processed data.
The metrics folder contains model performance on the validation data.
The models folder hosts all trained model weights.
The training_data folder hosts the raw data used to train the classifiers.
The functions folder contains the code that loads and applies the trained weights and classifiers.
The train folder contains the script that trains classifiers on the pre-processed embeddings, using either XGBoost or MLPs.
scoring_functions.py contains a class that aggregates all trained classifiers for diverse downstream sampling applications.

PeptiVerse 🧬🌌

A collection of machine learning predictors for the properties of canonical and non-canonical peptides represented as SMILES. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

Predictors 🧫

PeptiVerse includes the following property predictors:

Predictor	Measurement	Interpretation	Training Data Source	Dataset Size	Model Type
Non-Hemolysis	Probability of non-hemolytic behavior	0-1 scale, higher = less hemolytic	PeptideBERT, PepLand	6,077 peptides	XGBoost + PeptideCLM embeddings
Solubility	Probability of aqueous solubility	0-1 scale, higher = more soluble	PeptideBERT, PepLand	18,454 peptides	XGBoost + PeptideCLM embeddings
Non-Fouling	Probability of non-fouling properties	0-1 scale, higher = lower probability of binding to off-targets	PeptideBERT, PepLand	17,186 peptides	XGBoost + PeptideCLM embeddings
Permeability	Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0)	Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability	ChEMBL (22,040), CycPeptMPDB (7,451)	34,853 peptides	XGBoost + PeptideCLM embeddings + molecular descriptors
Binding Affinity	Peptide-protein binding strength (-log Kd/Ki/IC50 scale)	Weak binding (< 6.0), medium binding (6.0–7.5), and high binding (≥ 7.5)	PepLand	1,806 peptide-protein pairs	Cross-attention transformer (ESM2 + PeptideCLM)

Model Performance 🌟

Binary Classification Predictors

Predictor	Val AUC	Val F1
Non-Hemolysis	0.7902	0.8260
Solubility	0.6016	0.5767
Non-Fouling	0.9327	0.8774

Regression Predictors

Predictor	Train Correlation (Spearman)	Val Correlation (Spearman)
Permeability	0.958	0.710
Binding Affinity	0.805	0.611
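
The Spearman correlations reported for the regression predictors measure how well the predicted ranking of peptides matches the measured ranking. As a minimal, dependency-free sketch (not the evaluation code used in this repository, which is not shown here), Spearman correlation can be computed as the Pearson correlation of the ranks. The function names below are illustrative.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))
```

Because only the ordering matters, a predictor can reach a high Spearman value even when its absolute predictions are offset from the measured values.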

Setup 🌟

Clone the repository:

git clone https://github.com/sophtang/PeptiVerse.git
cd PeptiVerse

Install environment:

conda env create -f environment.yml

conda activate peptiverse

Change the base_path in each file to ensure that all model weights and tokenizers are loaded correctly.

Usage 🌟

1. Hemolysis Prediction

Predicts the probability that a peptide is not hemolytic. Higher scores indicate safer peptides.

import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis

# Initialize predictor
hemo = Hemolysis()

# Input peptide in SMILES format
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = hemo(peptides)
print(f"Non-hemolytic probability: {scores[0]:.3f}")

Output interpretation:

Score close to 1.0 = likely non-hemolytic (safe)
Score close to 0.0 = likely hemolytic (unsafe)

2. Solubility Prediction

Predicts aqueous solubility. Higher scores indicate better solubility.

from functions.solubility.solubility import Solubility

# Initialize predictor
sol = Solubility()

# Input peptide
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = sol(peptides)
print(f"Solubility probability: {scores[0]:.3f}")

Output interpretation:

Score close to 1.0 = highly soluble
Score close to 0.0 = poorly soluble

3. Nonfouling Prediction

Predicts protein resistance/non-fouling properties.

from functions.nonfouling.nonfouling import Nonfouling

# Initialize predictor
nf = Nonfouling()

# Input peptide
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
]

# Get predictions
scores = nf(peptides)
print(f"Nonfouling score: {scores[0]:.3f}")

Output interpretation:

Higher scores = better non-fouling properties

4. Permeability Prediction

Predicts membrane permeability on a log P scale.

from functions.permeability.permeability import Permeability

# Initialize predictor
perm = Permeability()

# Input peptide
peptides = [
    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
]

# Get predictions
scores = perm(peptides)
print(f"Permeability (log P): {scores[0]:.3f}")

Output interpretation:

Higher values = more permeable
Typical range: -10 to 0 (log scale)
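
To turn a raw permeability score into the strong/weak reading given in the Predictors table, a small helper like the following may be convenient. This is a sketch only: the function name is hypothetical, and the cutoff (−6.0) and typical range (−10 to 0) are taken from the text of this README rather than from the predictor's code.

```python
PERMEABILITY_CUTOFF = -6.0    # cutoff quoted in the Predictors table
TYPICAL_RANGE = (-10.0, 0.0)  # typical range quoted in this README


def describe_permeability(score: float) -> str:
    """Label a predicted permeability score as strong or weak.

    Scores outside the typical range are still labeled, but flagged so
    they can be inspected manually.
    """
    label = "strong" if score >= PERMEABILITY_CUTOFF else "weak"
    low, high = TYPICAL_RANGE
    if not (low <= score <= high):
        label += " (outside typical range)"
    return label


for value in (-5.2, -7.3, 0.4):
    print(f"{value:+.1f} -> {describe_permeability(value)}")
```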

5. Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.

from functions.binding.binding import BindingAffinity

# Target protein sequence (amino acid format)
target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."

# Initialize predictor with target protein
binding = BindingAffinity(prot_seq=target_protein)

# Input peptide in SMILES format
peptides = [
    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
]

# Get predictions
scores = binding(peptides)
print(f"Binding affinity (-log Kd): {scores[0]:.3f}")

Output interpretation:

Higher values = stronger binding
Scale: -log(Kd/Ki/IC50)
- 7.5+ = tight binding (≤ ~30nM)
- 6.0-7.5 = medium binding (~30nM - 1μM)
- <6.0 = weak binding (> 1μM)
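
The concentration equivalents above follow from inverting the negative logarithm. The sketch below assumes a base-10 logarithm and concentrations in mol/L, which is the usual convention for pKd-style values but is not stated explicitly in this README; the function names are illustrative.

```python
def affinity_to_nanomolar(neg_log_affinity: float) -> float:
    """Convert -log10(Kd/Ki/IC50 in mol/L) to a concentration in nM."""
    return 10 ** (-neg_log_affinity) * 1e9


def binding_class(neg_log_affinity: float) -> str:
    """Apply the tight / medium / weak cutoffs listed in this README."""
    if neg_log_affinity >= 7.5:
        return "tight"
    if neg_log_affinity >= 6.0:
        return "medium"
    return "weak"


for score in (8.0, 7.5, 6.0, 5.0):
    nm = affinity_to_nanomolar(score)
    print(f"score {score:.1f} -> about {nm:.1f} nM ({binding_class(score)})")
```

A score of 7.5 corresponds to roughly 31.6 nM and a score of 6.0 to 1 μM, which matches the approximate boundaries quoted in the list.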

Batch Processing 🌟

All predictors support batch processing for multiple peptides:

from functions.hemolysis.hemolysis import Hemolysis

hemo = Hemolysis()

# Multiple peptides
peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
]

# Get predictions for all
scores = hemo(peptides)

for i, score in enumerate(scores):
    print(f"Peptide {i+1}: {score:.3f}")
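
For very long peptide lists, it may help to score the input in fixed-size chunks so that memory use stays bounded. The sketch below assumes only what the examples in this README show: a predictor is a callable that takes a list of SMILES strings and returns one score per peptide. The helper names, the chunk size, and the stand-in predictor are hypothetical and exist only so the example runs without the trained weights.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(
    predictor: Callable[[List[str]], Sequence[float]],
    peptides: Sequence[str],
    chunk_size: int = 256,
) -> List[float]:
    """Score peptides chunk by chunk and return one flat list of scores."""
    scores: List[float] = []
    for chunk in chunked(peptides, chunk_size):
        scores.extend(float(s) for s in predictor(list(chunk)))
    return scores


def fake_predictor(smiles_list: List[str]) -> List[float]:
    """Stand-in for a real predictor such as Hemolysis(); NOT a real model."""
    return [min(len(s) / 100.0, 1.0) for s in smiles_list]


print(predict_in_chunks(fake_predictor, ["C" * 10, "C" * 20, "C" * 30], chunk_size=2))
```

In real use, `fake_predictor` would be replaced by an initialized predictor object such as `hemo`.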

Unified Scoring with Multiple Predictors 🌟

For convenience, you can use scoring_functions.py to evaluate multiple properties at once and get a score vector for each peptide.

Basic Usage

import sys
sys.path.append('/path/to/PeptiVerse')
from scoring_functions import ScoringFunctions

# Initialize with desired scoring functions
# Available: 'binding_affinity1', 'binding_affinity2', 'permeability', 
#            'solubility', 'hemolysis', 'nonfouling'
scoring = ScoringFunctions(
    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
    prot_seqs=[]  # Empty if not using binding affinity
)

# Input peptides in SMILES format
peptides = [
    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
]

# Get scores (returns numpy array of shape: num_peptides x num_functions)
scores = scoring(input_seqs=peptides)
print(scores)

Adding Binding Affinity

from scoring_functions import ScoringFunctions

# Target protein sequence (amino acid format)
tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."

# Initialize with binding affinity for one protein
scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
    prot_seqs=[tfr_protein]  # Provide target protein sequence
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
print("Scores for peptide 1:")
print(f"  Binding Affinity: {scores[0][0]:.3f}")
print(f"  Solubility: {scores[0][1]:.3f}")
print(f"  Hemolysis (non-hemolytic prob): {scores[0][2]:.3f}")
print(f"  Permeability: {scores[0][3]:.3f}")

Multiple Binding Targets

# For dual binding affinity prediction
protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target

scoring = ScoringFunctions(
    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
    prot_seqs=[protein1, protein2]  # Provide both protein sequences
)

peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
scores = scoring(input_seqs=peptides)

# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
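
In the examples above, each 'binding_affinityN' name is paired with one entry in prot_seqs. A small pre-flight check can catch a mismatch before any model is loaded. This is a sketch under that assumption: the helper is hypothetical and not part of scoring_functions.py, and the real class may validate its arguments differently.

```python
from typing import List


def check_binding_targets(score_func_names: List[str], prot_seqs: List[str]) -> int:
    """Return the number of binding-affinity functions requested.

    Raises ValueError when that number differs from the number of target
    protein sequences supplied (assumed pairing: one sequence per function).
    """
    n_binding = sum(name.startswith("binding_affinity") for name in score_func_names)
    if n_binding != len(prot_seqs):
        raise ValueError(
            f"{n_binding} binding-affinity function(s) requested, "
            f"but {len(prot_seqs)} protein sequence(s) provided"
        )
    return n_binding


# No binding affinity requested, so prot_seqs stays empty.
print(check_binding_targets(["solubility", "hemolysis"], []))
```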

Output Format

The ScoringFunctions class returns a numpy array where:

Rows: Each row corresponds to one input peptide
Columns: Each column corresponds to one scoring function (in the order specified)

# Example with 3 peptides and 4 scoring functions
scores = scoring(input_seqs=peptides)  
# Shape: (3, 4)
# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
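
Because the columns follow the order of score_func_names, it can be easier to read results after pairing each value with its function name. The helper below is a hypothetical convenience, not part of the repository; it iterates over rows, so it should work for a numpy array as well as for the plain nested list used here. The numbers in the usage lines are made up for illustration and are not model outputs.

```python
from typing import Dict, Iterable, List, Sequence


def label_scores(
    scores: Iterable[Sequence[float]],
    score_func_names: List[str],
) -> List[Dict[str, float]]:
    """Turn a (num_peptides x num_functions) score matrix into one dict per peptide."""
    labeled = []
    for row in scores:
        values = list(row)
        if len(values) != len(score_func_names):
            raise ValueError("each row needs one value per scoring function")
        labeled.append({name: float(v) for name, v in zip(score_func_names, values)})
    return labeled


# Illustrative values only; real scores come from scoring(input_seqs=peptides).
names = ["solubility", "hemolysis", "nonfouling", "permeability"]
example_scores = [[0.62, 0.81, 0.44, -6.8], [0.35, 0.90, 0.51, -5.9]]
for i, row in enumerate(label_scores(example_scores, names), start=1):
    print(f"Peptide {i}: {row}")
```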

Complete Example 🌟

import sys
sys.path.append('/path/to/PeptiVerse')
from functions.hemolysis.hemolysis import Hemolysis
from functions.solubility.solubility import Solubility
from functions.permeability.permeability import Permeability

# Initialize predictors
hemo = Hemolysis()
sol = Solubility()
perm = Permeability()

# Test peptide
peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]

# Get all predictions
hemo_score = hemo(peptide)[0]
sol_score = sol(peptide)[0]
perm_score = perm(peptide)[0]

print("Peptide Property Predictions:")
print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
print(f"  Solubility: {sol_score:.3f}")
print(f"  Permeability: {perm_score:.3f}")

Model Architecture 🌟

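The predictors share the following components:

Embeddings: PeptideCLM-23M (RoFormer-based peptide language model)
Classifier: XGBoost gradient boosting for hemolysis, solubility, non-fouling, and permeability (permeability also uses molecular descriptors); binding affinity instead uses a cross-attention transformer over ESM2 and PeptideCLM embeddings
Input: SMILES representation of peptides, plus the target protein's amino acid sequence for binding affinity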
Training: Models trained on curated datasets with cross-validation

Citation

If you find this repository helpful for your publications, please consider citing our paper:

@article{tang2025peptune,
  title={PepTune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  journal={42nd International Conference on Machine Learning},
  year={2025}
}

To use this repository, you agree to abide by the MIT License.