	---
	license: cc-by-nc-nd-4.0
	---

	This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.

	- `embeddings` contains processed Hugging Face datasets with PeptideCLM embeddings. The `.csv` is the pre-processed data.
	- `metrics` contains the model performance on the validation data
	- `models` hosts all trained model weights
	- `training_data` hosts all raw data used to train the classifiers
	- `functions` contains files for using the trained weights and classifiers
	- `train` contains the script to train classifiers on the pre-processed embeddings, using either XGBoost or MLPs
	- `scoring_functions.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications

	# PeptiVerse 🧬🌌

	A collection of machine learning predictors for the properties of canonical and non-canonical peptides represented as SMILES strings. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

	## Predictors 🧫

	PeptiVerse includes the following property predictors:

	| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
	|-----------|-------------|----------------|----------------------|--------------|------------|
	| Non-Hemolysis | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
	| Solubility | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
	| Non-Fouling | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
	| Permeability | Cell membrane permeability (PAMPA lipophilicity score, log P scale, range -10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
	| Binding Affinity | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 to 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |

	## Model Performance 🌟

	#### Binary Classification Predictors

	| Predictor | Val AUC | Val F1 |
	|-----------|---------|--------|
	| Non-Hemolysis | 0.7902 | 0.8260 |
	| Solubility | 0.6016 | 0.5767 |
	| Non-Fouling | 0.9327 | 0.8774 |

	#### Regression Predictors

	| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
	|-----------|------------------------------|----------------------------|
	| Permeability | 0.958 | 0.710 |
	| Binding Affinity | 0.805 | 0.611 |
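
For readers unfamiliar with the metric, the sketch below shows how a Spearman correlation can be computed: rank both lists (tied values share the mean of their ranks), then take the Pearson correlation of the ranks. It is a standard-library illustration only, not the evaluation script behind the table, and the helper names and numbers are made up for the example.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up measured vs. predicted values, for illustration only.
measured = [-7.1, -6.4, -5.9, -5.2, -4.8]
predicted = [-6.8, -6.6, -5.5, -5.6, -4.9]
print(f"Spearman: {spearman(measured, predicted):.3f}")
```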

	## Setup 🌟

	1. Clone the repository:
	```bash
	git clone https://github.com/sophtang/PeptiVerse.git
	cd PeptiVerse
	```

	2. Create and activate the conda environment:
	```bash
	conda env create -f environment.yml

	conda activate peptiverse
	```

	3. Set `base_path` in each file to the location of your copy of the repository so that all model weights and tokenizers load correctly.
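
Before editing `base_path`, it can help to confirm that the directory you intend to point it at actually holds the expected files. The sketch below is a hypothetical helper, not part of PeptiVerse: `missing_entries` and `EXPECTED_ENTRIES` are names introduced here, and the list of entries is assumed from the folder overview at the top of this README, so your checkout may differ.

```python
from pathlib import Path

# Folder names taken from the overview above; whether your checkout has
# exactly this layout is an assumption.
EXPECTED_ENTRIES = ["embeddings", "metrics", "models", "training_data", "functions", "train"]


def missing_entries(base_path, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under base_path."""
    root = Path(base_path)
    return [name for name in expected if not (root / name).exists()]


# '/path/to/PeptiVerse' is the same placeholder used in the usage examples.
missing = missing_entries("/path/to/PeptiVerse")
print("Missing entries:", missing if missing else "none")
```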

	## Usage 🌟

	#### 1. Hemolysis Prediction

	Predicts the probability that a peptide is not hemolytic. Higher scores indicate safer peptides.

	```python
	import sys
	sys.path.append('/path/to/PeptiVerse')
	from functions.hemolysis.hemolysis import Hemolysis

	# Initialize predictor
	hemo = Hemolysis()

	# Input peptide in SMILES format
	peptides = [
	    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
	]

	# Get predictions
	scores = hemo(peptides)
	print(f"Non-hemolytic probability: {scores[0]:.3f}")
	```

	Output interpretation:
	- Score close to 1.0 = likely non-hemolytic (safe)
	- Score close to 0.0 = likely hemolytic (unsafe)
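
If you need a yes/no call rather than a probability, one option is to threshold the score. The sketch below assumes a cutoff of 0.5, which is an illustrative choice and not a value recommended by PeptiVerse; `label_scores` is a hypothetical helper and the scores are made up.

```python
def label_scores(scores, threshold=0.5):
    """Return True for each score at or above the threshold, else False.

    The default threshold of 0.5 is an assumption; tune it for your application.
    """
    return [float(s) >= threshold for s in scores]


# Made-up non-hemolytic probabilities, for illustration only.
example_scores = [0.91, 0.12, 0.50]
print(label_scores(example_scores))
```

A stricter cutoff (for example `threshold=0.8`) trades recall for fewer false "safe" calls.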

	---

	#### 2. Solubility Prediction

	Predicts aqueous solubility. Higher scores indicate better solubility.

	```python
	from functions.solubility.solubility import Solubility

	# Initialize predictor
	sol = Solubility()

	# Input peptide
	peptides = [
	    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
	]

	# Get predictions
	scores = sol(peptides)
	print(f"Solubility probability: {scores[0]:.3f}")
	```

	Output interpretation:
	- Score close to 1.0 = highly soluble
	- Score close to 0.0 = poorly soluble

	---

	#### 3. Nonfouling Prediction

	Predicts the probability that a peptide is non-fouling (protein-resistant).

	```python
	from functions.nonfouling.nonfouling import Nonfouling

	# Initialize predictor
	nf = Nonfouling()

	# Input peptide
	peptides = [
	    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
	]

	# Get predictions
	scores = nf(peptides)
	print(f"Nonfouling score: {scores[0]:.3f}")
	```

	Output interpretation:
	- Higher scores = better non-fouling properties

	---

	#### 4. Permeability Prediction

	Predicts membrane permeability on a log P scale.

	```python
	from functions.permeability.permeability import Permeability

	# Initialize predictor
	perm = Permeability()

	# Input peptide
	peptides = [
	    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
	]

	# Get predictions
	scores = perm(peptides)
	print(f"Permeability (log P): {scores[0]:.3f}")
	```

	Output interpretation:
	- Higher values = more permeable
	- Typical range: -10 to 0 (log scale)
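
The Predictors table treats −6.0 as the boundary between strong and weak permeability. A minimal sketch of applying that cutoff to predicted values follows; `permeability_class` is a hypothetical helper and the example values are made up.

```python
PERMEABILITY_CUTOFF = -6.0  # boundary quoted in the Predictors table


def permeability_class(log_p):
    """Label a predicted value 'strong' (>= -6.0) or 'weak' (< -6.0)."""
    return "strong" if log_p >= PERMEABILITY_CUTOFF else "weak"


# Made-up predictions, for illustration only.
for value in (-5.2, -6.0, -7.3):
    print(f"{value:.1f}: {permeability_class(value)}")
```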

	---

	#### 5. Binding Affinity Prediction

	Predicts peptide-protein binding affinity. Requires both the peptide (as SMILES) and the target protein's amino acid sequence.

	```python
	from functions.binding.binding import BindingAffinity

	# Target protein sequence (amino acid format)
	target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."

	# Initialize predictor with target protein
	binding = BindingAffinity(prot_seq=target_protein)

	# Input peptide in SMILES format
	peptides = [
	    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
	]

	# Get predictions
	scores = binding(peptides)
	print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
	```

	Output interpretation:
	- Higher values = stronger binding
	- Scale: -log(Kd/Ki/IC50)
	- 7.5+ = tight binding (≤ ~30 nM)
	- 6.0-7.5 = medium binding (~30 nM - 1 μM)
	- <6.0 = weak binding (> 1 μM)
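
The concentrations in parentheses follow from inverting the log scale. The sketch below assumes the score is a base-10 negative log of a molar concentration, which is consistent with the nanomolar values quoted above; `kd_nanomolar` and `binding_class` are hypothetical helpers, not part of PeptiVerse.

```python
def kd_nanomolar(neg_log_kd):
    """Convert -log10(Kd in mol/L) to Kd in nanomolar (assumes molar units)."""
    return 10 ** (9 - neg_log_kd)


def binding_class(neg_log_kd):
    """Bin a score using the 6.0 and 7.5 thresholds listed above."""
    if neg_log_kd >= 7.5:
        return "tight"
    if neg_log_kd >= 6.0:
        return "medium"
    return "weak"


# Made-up scores, for illustration only.
for value in (8.0, 7.5, 6.0, 5.0):
    print(f"{value:.1f} -> {kd_nanomolar(value):.1f} nM ({binding_class(value)})")
```

A score of 7.5 corresponds to roughly 31.6 nM and a score of 6.0 to 1000 nM (1 μM), matching the approximate boundaries in the list.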

	---

	## Batch Processing 🌟

	All predictors support batch processing for multiple peptides:

	```python
	from functions.hemolysis.hemolysis import Hemolysis

	hemo = Hemolysis()

	# Multiple peptides
	peptides = [
	    "NCC(=O)N[C@H](CS)C(=O)O",
	    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
	    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
	]

	# Get predictions for all
	scores = hemo(peptides)

	for i, score in enumerate(scores):
	    print(f"Peptide {i+1}: {score:.3f}")
	```
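
A common follow-up to batch scoring is ranking the candidates. The sketch below assumes scores come back in the same order as the input peptides, as the loop above implies; `rank_peptides` is a hypothetical helper and the scores are made up rather than real model output.

```python
def rank_peptides(peptides, scores):
    """Return (peptide, score) pairs sorted from highest to lowest score."""
    pairs = [(p, float(s)) for p, s in zip(peptides, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O",
]
scores = [0.42, 0.87, 0.65]  # made-up values standing in for hemo(peptides)

for rank, (peptide, score) in enumerate(rank_peptides(peptides, scores), start=1):
    print(f"{rank}. {score:.3f}  {peptide}")
```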

	---

	## Unified Scoring with Multiple Predictors 🌟

	For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.

	### Basic Usage

	```python
	import sys
	sys.path.append('/path/to/PeptiVerse')
	from scoring_functions import ScoringFunctions

	# Initialize with desired scoring functions
	# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
	# 'solubility', 'hemolysis', 'nonfouling'
	scoring = ScoringFunctions(
	    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
	    prot_seqs=[]  # Empty if not using binding affinity
	)

	# Input peptides in SMILES format
	peptides = [
	    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
	    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
	]

	# Get scores (returns numpy array of shape: num_peptides x num_functions)
	scores = scoring(input_seqs=peptides)
	print(scores)
	```

	### Adding Binding Affinity

	```python
	from scoring_functions import ScoringFunctions

	# Target protein sequence (amino acid format)
	tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."

	# Initialize with binding affinity for one protein
	scoring = ScoringFunctions(
	    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
	    prot_seqs=[tfr_protein]  # Provide target protein sequence
	)

	peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
	scores = scoring(input_seqs=peptides)

	# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
	print("Scores for peptide 1:")
	print(f" Binding Affinity: {scores[0][0]:.3f}")
	print(f" Solubility: {scores[0][1]:.3f}")
	print(f" Hemolysis (non-hemolytic prob): {scores[0][2]:.3f}")
	print(f" Permeability: {scores[0][3]:.3f}")
	```

	### Multiple Binding Targets

	```python
	# For dual binding affinity prediction
	protein1 = "MMDQARSAFSNLFGGEPLSYTR..." # First target
	protein2 = "MTKSNGEEPKMGGRMERFQQGV..." # Second target

	scoring = ScoringFunctions(
	    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
	    prot_seqs=[protein1, protein2]  # Provide both protein sequences
	)

	peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
	scores = scoring(input_seqs=peptides)

	# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
	```

	### Output Format

	Calling a `ScoringFunctions` instance returns a NumPy array where:
	- Rows: Each row corresponds to one input peptide
	- Columns: Each column corresponds to one scoring function (in the order specified)

	```python
	# Example with 3 peptides and 4 scoring functions
	scores = scoring(input_seqs=peptides)
	# Shape: (3, 4)
	# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
	# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
	# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
	```
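
Because columns follow the order of `score_func_names`, it can be easier to work with named values than with positional indices. The sketch below pairs each row with the function names; `scores_as_dicts` is a hypothetical helper, and the nested list stands in for the returned array with made-up numbers.

```python
def scores_as_dicts(score_func_names, scores):
    """Pair each row of the score matrix with the scoring-function names."""
    return [
        dict(zip(score_func_names, (float(v) for v in row)))
        for row in scores
    ]


names = ['solubility', 'hemolysis', 'nonfouling', 'permeability']
# Made-up values standing in for scoring(input_seqs=peptides).
scores = [
    [0.62, 0.88, 0.41, -6.7],
    [0.35, 0.93, 0.77, -5.4],
]

for i, row in enumerate(scores_as_dicts(names, scores), start=1):
    print(f"Peptide {i}: {row}")
```

The helper only iterates over rows and values, so it should work the same way on a 2D NumPy array as on the nested list used here.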

	---

	## Complete Example 🌟

	```python
	import sys
	sys.path.append('/path/to/PeptiVerse')
	from functions.hemolysis.hemolysis import Hemolysis
	from functions.solubility.solubility import Solubility
	from functions.permeability.permeability import Permeability

	# Initialize predictors
	hemo = Hemolysis()
	sol = Solubility()
	perm = Permeability()

	# Test peptide
	peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]

	# Get all predictions
	hemo_score = hemo(peptide)[0]
	sol_score = sol(peptide)[0]
	perm_score = perm(peptide)[0]

	print("Peptide Property Predictions:")
	print(f" Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
	print(f" Solubility: {sol_score:.3f}")
	print(f" Permeability: {perm_score:.3f}")
	```

	---

	## Model Architecture 🌟

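	The predictors share the following components:
	- Embeddings: PeptideCLM-23M (RoFormer-based peptide language model)
	- Classifier: XGBoost gradient boosting for hemolysis, solubility, non-fouling, and permeability (permeability also uses molecular descriptors); binding affinity instead uses a cross-attention transformer over ESM2 and PeptideCLM embeddings
	- Input: SMILES representation of peptides, plus the target protein's amino acid sequence for binding affinity
	- Training: Models trained on curated datasets with cross-validation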

	---
	## Citation

	If you find this repository helpful for your publications, please consider citing our paper:

	```
	@article{tang2025peptune,
	title={{PepTune}: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
	author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
	journal={42nd International Conference on Machine Learning},
	year={2025}
	}
	```
	To use this repository, you agree to abide by the MIT License.