Repository Guidelines
This repository contains the LLM-based Cancer Risk Assessment Assistant.
Core Technologies
- FastAPI for the web framework
- LangChain for LLM orchestration
- uv for environment and dependency management
- Hydra for configuration management
Development Setup
Environment Setup
- Create the virtual environment (at '.venv') with: uv sync
- Because the repository uses uv, run all commands through uv, e.g., "uv run python ..." NOT "python ...".
Running Commands
- Streamlit Interface: uv run streamlit run apps/streamlit_ui/main.py
- CLI Demo: uv run python apps/cli/main.py
- Tests: uv run pytest
Coding Standards
Coding Philosophy
- Write simple, explicit, modular code
- Prioritize clarity over cleverness
- Prefer small pure functions over large ones
- Return early instead of nesting deeply
- Favor functions over classes unless essential
- Favor simple replication over heavy abstraction
- Keep comments short and only where code isn't self-explanatory
- Avoid premature optimization or over-engineering
Variable Naming
- Avoid single-letter variable names (x, y, i, j, e, t, f, m, c) in favor of descriptive names.
- Avoid abbreviations (fh, ct, w, h) in favor of full descriptive names.
- Use context-specific names for loop indices based on what you're iterating over:
  - item_index for general enumeration
  - line_index for text line iteration
  - column_index for table/array column iteration
  - row_index for table/array row iteration
- Use descriptive names for comprehensions and iterations:
  - item instead of i for general items
  - element instead of e for list elements
  - key instead of k for dictionary keys
  - value instead of v for dictionary values
- Use descriptive names for coordinates and positions:
  - x_position, y_position instead of x, y
  - width, height instead of w, h
- Use descriptive names for data structures:
  - file_path instead of f for file paths
  - model instead of m for model instances
  - user instead of u for user objects
Examples from recent refactoring:
- for i, ref in enumerate(references) → for ref_index, ref in enumerate(references)
- for e in examples → for example in examples
- for m in models → for model in models
- x = pdf.get_x() → x_position = pdf.get_x()
- fh = family_history → family_history = family_history (avoid abbreviations)
- ct for ct in cancer_types → cancer_type for cancer_type in cancer_types
- f in MODELS_DIR.glob → file_path in MODELS_DIR.glob
- t in field_type.__args__ → type_arg in field_type.__args__
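The naming rules above can be sketched in two small functions. Both functions are hypothetical illustrations written for this section; they do not come from the repository.

```python
def find_blank_lines(text: str) -> list[int]:
    """Return the zero-based indices of lines that contain only whitespace."""
    return [
        line_index
        for line_index, line in enumerate(text.splitlines())
        if not line.strip()
    ]


def sum_by_key(pairs: list[tuple[str, int]]) -> dict[str, int]:
    """Sum values per key, naming each tuple element by its role."""
    totals: dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals
```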
Path Handling
- Always use pathlib.Path for all file I/O, joining, and globbing
- Accept Path | str at function boundaries; normalize to Path internally
- Never use os.path for path operations
Example:
from pathlib import Path


def read_text(file: Path | str) -> str:
    # Normalize once at the boundary, then work with Path only.
    path = Path(file)
    return path.read_text(encoding="utf-8")
Type Hints and Modern Python
- Use modern type hints: list, dict, tuple, set (not List, Dict, etc.)
- Use PEP 604 unions: A | B (not Union[A, B] or Optional[A])
- Import from typing only when necessary (TypedDict, Literal, Annotated, etc.)
- Never use from __future__ import annotations
- Add type hints to all public functions and methods
- Prefer precise types (float, Path, etc.) over generic ones
- If Any is required, isolate and document why
Import Management
- Place all imports at the top of the file, never inside functions or classes
- Group imports in three sections with blank lines between:
  - Standard library imports
  - Third-party library imports
  - Local/project imports
- This keeps dependencies visible in one place and avoids re-executing import statements on every function call
Error Handling and Logging
- Use try/except only for I/O or external APIs
- Catch specific exceptions only (never broad except:)
- Raise clear, actionable error messages
- Use loguru for logging, never print() in production code
Example:
from pathlib import Path

from loguru import logger

# file_path is a Path | str supplied by the caller.
try:
    data = Path(file_path).read_text(encoding="utf-8")
except FileNotFoundError as error:
    logger.error(f"Configuration file not found: {file_path}")
    raise ValueError(f"Missing required config: {file_path}") from error
Docstring Standards
- Use Google-style docstrings for all public functions and classes
- Do NOT include type hints in docstrings (they're in the signature)
- Describe behavior, invariants, side effects, and edge cases
- Include examples for complex functions
- Avoid verbose docstrings for simple, self-explanatory functions
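As an illustration of these docstring rules, here is a hypothetical helper (not part of the repository) documented in Google style, with types left to the signature.

```python
def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """Compute body mass index from weight and height.

    The height is converted from centimetres to metres before squaring.
    The function is pure: it reads no global state and has no side effects.

    Args:
        weight_kg: Body weight in kilograms; must be positive.
        height_cm: Body height in centimetres; must be positive.

    Returns:
        The body mass index in kg/m^2, rounded to one decimal place.

    Raises:
        ValueError: If either argument is zero or negative.

    Example:
        >>> body_mass_index(70.0, 175.0)
        22.9
    """
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight_kg and height_cm must be positive")
    height_m = height_cm / 100
    return round(weight_kg / height_m**2, 1)
```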
Testing
Testing Philosophy
- Write meaningful tests that verify core functionality and prevent regressions
- Use pytest as the testing framework
- Tests go under tests/ mirroring the source layout
- Test both valid and invalid input scenarios
Test Types
- Unit tests: Small, deterministic, one concept per test
- Integration tests: Real workflows or reference comparisons with external systems
- Use pytest.mark to tag slow or manual tests
Test Coverage Requirements
- Ensure comprehensive test coverage for all risk models
- Ground Truth Validation: Test against known reference values
- Input Validation: Test that invalid inputs raise ValueError
- Edge Cases: Test boundary conditions
- Inapplicable Cases: Test when models should return "N/A"
Running Tests
uv run pytest # Run all tests
uv run pytest -q # Quiet mode
uv run pytest -v # Verbose mode
uv run pytest tests/test_risk_models/ # Specific directory
Pre-Submission Checklist
Before committing code, verify:
- ✅ Run uv run pytest -q (all tests pass)
- ✅ Run pre-commit run --all-files (all hooks pass)
- ✅ No print() statements in production code
- ✅ No broad except: blocks
- ✅ All type hints present on public functions
- ✅ File paths use pathlib.Path
- ✅ Logging uses loguru
Risk Models
Implemented Models
The assistant currently includes the following built-in risk calculators:
- Gail - Breast cancer risk
- Claus - Breast cancer risk based on family history
- Tyrer-Cuzick - Breast cancer risk (IBIS model)
- BOADICEA - Breast and ovarian cancer risk (via CanRisk API)
- PLCOm2012 - Lung cancer risk
- LLPi - Liverpool Lung Project improved model for lung cancer risk (8.7-year prediction)
- CRC-PRO - Colorectal cancer risk
- PCPT - Prostate cancer risk
- Extended PBCG - Prostate cancer risk (extended model)
- Prostate Mortality - Prostate cancer-specific mortality prediction
- MRAT - Melanoma risk (5-year prediction)
- aMAP - Hepatocellular carcinoma (liver cancer) risk
- QCancer - Multi-site cancer differential
Additional models should follow the interfaces under src/sentinel/risk_models.
Risk Model Implementation Guide
Base Architecture
All risk models must inherit from RiskModel in src/sentinel/risk_models/base.py:
from sentinel.risk_models.base import RiskModel


class YourRiskModel(RiskModel):
    def __init__(self) -> None:
        super().__init__("your_model_name")
Required Methods
Every risk model must implement these abstract methods:
def compute_score(self, user: UserInput) -> str:
    """Compute the risk score for a given user profile.

    Args:
        user: The user profile containing demographics, medical history, etc.

    Returns:
        Risk percentage as a string, or an N/A message if inapplicable.

    Raises:
        ValueError: If required inputs are missing or invalid.
    """

def cancer_type(self) -> str:
    """Return the cancer type this model assesses."""
    return "breast"  # or "lung", "prostate", etc.

def description(self) -> str:
    """Return a detailed description of the model."""

def interpretation(self) -> str:
    """Return guidance on how to interpret the results."""

def references(self) -> list[str]:
    """Return list of reference citations."""
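To show how the abstract-method contract behaves, here is a self-contained sketch using stand-ins. RiskModelStub, UserStub, and ToyBreastModel are hypothetical; the real RiskModel and UserInput live in the sentinel package and may differ, and only two of the five required methods are modeled.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class Sex(str, Enum):
    """Stand-in enum; member values are assumed."""

    FEMALE = "female"
    MALE = "male"


@dataclass
class UserStub:
    """Flat stand-in for the nested UserInput model."""

    age_years: int
    sex: Sex


class RiskModelStub(ABC):
    """Stand-in for the RiskModel base class (assumed shape)."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def compute_score(self, user: UserStub) -> str:
        """Return a risk percentage string or an N/A message."""

    @abstractmethod
    def cancer_type(self) -> str:
        """Return the cancer type this model assesses."""


class ToyBreastModel(RiskModelStub):
    def __init__(self) -> None:
        super().__init__("toy_breast")

    def compute_score(self, user: UserStub) -> str:
        if user.sex != Sex.FEMALE:
            return "N/A: Model is only applicable to female patients."
        # Placeholder arithmetic for illustration, not a clinical formula.
        return f"{user.age_years * 0.05:.2f}"

    def cancer_type(self) -> str:
        return "breast"
```

A subclass that omits any abstract method cannot be instantiated, which surfaces an incomplete model at construction time instead of at the first call.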
UserInput Structure
All risk models must use the centralized UserInput structure - this is the single source of truth for all data types and enums. The UserInput class follows a hierarchical structure:
UserInput
├── demographics: Demographics
│   ├── age_years: int
│   ├── sex: Sex (enum)
│   ├── ethnicity: Ethnicity | None
│   └── anthropometrics: Anthropometrics
│       ├── height_cm: float | None
│       └── weight_kg: float | None
├── lifestyle: Lifestyle
│   ├── smoking: SmokingHistory
│   └── alcohol: AlcoholConsumption
├── personal_medical_history: PersonalMedicalHistory
│   ├── chronic_conditions: list[ChronicCondition]
│   ├── previous_cancers: list[CancerType]
│   ├── genetic_mutations: list[GeneticMutation]
│   └── tyrer_cuzick_polygenic_risk_score: float | None
├── female_specific: FemaleSpecific | None
│   ├── menstrual: MenstrualHistory
│   ├── parity: ParityHistory
│   └── breast_health: BreastHealthHistory
├── symptoms: list[SymptomEntry]
└── family_history: list[FamilyMemberCancer]
REQUIRED_INPUTS Specification
Every risk model must define a REQUIRED_INPUTS class attribute whose types use typing.Annotated with Pydantic Field constraints where a range applies:
REQUIRED_INPUTS: dict[str, tuple[type, bool]] = {
    "demographics.age_years": (Annotated[int, Field(ge=18, le=100)], True),
    "demographics.sex": (Sex, True),
    "demographics.ethnicity": (Ethnicity | None, False),
    "family_history": (list, False),  # list[FamilyMemberCancer]
    "symptoms": (list, False),  # list[SymptomEntry]
}
Input Validation
Every compute_score method must start with input validation:
def compute_score(self, user: UserInput) -> str:
    """Compute the risk score for a given user profile."""
    # Validate inputs first
    is_valid, errors = self.validate_inputs(user)
    if not is_valid:
        raise ValueError(f"Invalid inputs for {self.name}: {'; '.join(errors)}")

    # Model-specific validation
    if user.demographics.sex != Sex.FEMALE:
        return "N/A: Model is only applicable to female patients."

    # Continue with model-specific logic...
Data Access Patterns
# Demographics
age = user.demographics.age_years
sex = user.demographics.sex
ethnicity = user.demographics.ethnicity

# Female-specific data
if user.female_specific is not None:
    menarche_age = user.female_specific.menstrual.age_at_menarche
    num_births = user.female_specific.parity.num_live_births

# Family history
for member in user.family_history:
    if member.cancer_type == CancerType.BREAST:
        relation = member.relation
        age_at_diagnosis = member.age_at_diagnosis
Enum Usage
Always use enums from sentinel.user_input, never string literals or custom enums:
# ✅ Correct - using UserInput enums
if user.demographics.sex == Sex.FEMALE: ...
if member.cancer_type == CancerType.BREAST: ...
if member.relation == FamilyRelation.MOTHER: ...

# ❌ Incorrect - string literals
if user.demographics.sex == "female": ...
if member.cancer_type == "breast": ...

# ❌ Incorrect - custom enums
if user.demographics.sex == MyCustomSex.FEMALE: ...
Important: All risk models must use the same centralized enums from UserInput. If a required enum doesn't exist in UserInput, you must:
- Extend UserInput by adding the new enum to src/sentinel/user_input.py
- Never create model-specific enums - this prevents divergence between models
- Update all models to use the new centralized enum
This ensures all risk models share the same data structure and prevents fragmentation.
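One way to keep string literals out of model code is to convert raw text into the enum once, at the input boundary. The sketch below uses a stand-in Sex enum with assumed member values and a hypothetical parse_sex helper; neither is taken from sentinel.user_input.

```python
from enum import Enum


class Sex(str, Enum):
    """Stand-in for the centralized Sex enum; member values are assumed."""

    FEMALE = "female"
    MALE = "male"


def parse_sex(raw_value: str) -> Sex:
    """Convert raw text into the enum, rejecting unknown values clearly."""
    members_by_value = {member.value: member for member in Sex}
    normalized_value = raw_value.strip().lower()
    if normalized_value not in members_by_value:
        allowed_values = ", ".join(members_by_value)
        raise ValueError(
            f"Unknown sex {raw_value!r}; expected one of: {allowed_values}"
        )
    return members_by_value[normalized_value]
```

After this conversion, every comparison downstream can use Sex.FEMALE or Sex.MALE, and a typo in the input fails immediately with an actionable message.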
Extending UserInput
When a risk model needs fields or enums that don't exist in UserInput:
- Add to UserInput: Extend src/sentinel/user_input.py with new fields/enums
- Update all models: Ensure all existing models can handle the new fields (use | None for optional fields)
- Never create model-specific structures: This prevents divergence and fragmentation
- Test thoroughly: Add tests for new fields in tests/test_user_input.py
Example of extending UserInput:
# In src/sentinel/user_input.py
class ChronicCondition(str, Enum):
    # ... existing values
    NEW_CONDITION = "new_condition"  # Add new enum value


class PersonalMedicalHistory(StrictBaseModel):
    # ... existing fields
    new_field: float | None = Field(None, description="New field description")
Testing Requirements
Create comprehensive test files with:
- Ground Truth Validation: Test against known reference values
- Input Validation: Test that invalid inputs raise ValueError
- Edge Cases: Test boundary conditions and edge cases
- Inapplicable Cases: Test cases where model should return "N/A"
Example test structure:
import pytest

from sentinel.risk_models import YourRiskModel
from sentinel.user_input import Demographics, Sex, UserInput

GROUND_TRUTH_CASES = [
    {
        "name": "test_case_name",
        "input": UserInput(
            demographics=Demographics(
                age_years=40,
                sex=Sex.FEMALE,
                # ... other fields
            ),
            # ... rest of input
        ),
        "expected": 1.5,  # Expected risk percentage
    },
    # ... more test cases
]


class TestYourRiskModel:
    def setup_method(self) -> None:
        """Create a fresh model instance before each test."""
        self.model = YourRiskModel()

    @pytest.mark.parametrize("case", GROUND_TRUTH_CASES, ids=lambda case: case["name"])
    def test_ground_truth_validation(self, case: dict) -> None:
        """Test against ground truth results."""
        user_input = case["input"]
        expected_risk = case["expected"]
        actual_risk_str = self.model.compute_score(user_input)
        actual_risk = float(actual_risk_str)
        assert actual_risk == pytest.approx(expected_risk, abs=0.01)
Migration Checklist
When adapting an existing risk model to the new structure:
- Update imports to use new user_input module
- Add REQUIRED_INPUTS with Pydantic validation
- Refactor compute_score to use new UserInput structure
- Replace string literals with enums
- Update parameter extraction logic
- Add input validation at start of compute_score
- Update all test cases to use new UserInput structure
- Run full test suite to ensure 100% pass rate
- Run pre-commit hooks to ensure code quality
LLM and Code Assistant Guidelines
When generating or modifying code, AI assistants MUST:
Mandatory Rules
- Follow ALL guidelines in this document without exception
- Never use forbidden constructs (os.path, Optional[], List[], print(), broad except:)
- Never add decorative comment banners or unnecessary formatting
- Always generate clean, modular, statically typed code
Code Generation Standards
- Prefer clarity and simplicity over cleverness
- Use modern Python type hints exclusively
- Include comprehensive docstrings for non-trivial functions
- Ensure all examples compile, type-check, and pass linting
Verification
All generated code must:
- Pass ruff format and ruff check
- Include proper type hints
- Use pathlib.Path for all file operations
- Use loguru for logging
- Follow the Variable Naming guidelines
Important Note for Developers
When making changes to the project, ensure that the following files are updated to reflect the changes:
- README.md
- AGENTS.md
- GEMINI.md
For additional implementation details, refer to the existing risk model implementations in src/sentinel/risk_models/.