| # Repository Guidelines | |
| This repository contains the LLM-based Cancer Risk Assessment Assistant. | |
| ## Core Technologies | |
| - **FastAPI** for the web framework | |
| - **LangChain** for LLM orchestration | |
| - **uv** for environment and dependency management | |
| - **Hydra** for configuration management | |
| ## Development Setup | |
| ### Environment Setup | |
| - Create the virtual environment (at '.venv') with `uv sync`. | |
| - Because the repository uses uv, run every command through it, e.g., `uv run python ...`, NOT `python ...`. | |
| ### Running Commands | |
| - **Streamlit Interface**: `uv run streamlit run apps/streamlit_ui/main.py` | |
| - **CLI Demo**: `uv run python apps/cli/main.py` | |
| - **Tests**: `uv run pytest` | |
| ## Coding Standards | |
| ### Coding Philosophy | |
| - Write simple, explicit, modular code | |
| - Prioritize clarity over cleverness | |
| - Prefer small pure functions over large ones | |
| - Return early instead of nesting deeply | |
| - Favor functions over classes unless essential | |
| - Favor simple replication over heavy abstraction | |
| - Keep comments short and only where code isn't self-explanatory | |
| - Avoid premature optimization or over-engineering | |
| ### Variable Naming | |
| - **Avoid single-letter variable names** (x, y, i, j, e, t, f, m, c) in favor of descriptive names. | |
| - **Avoid abbreviations** (fh, ct, w, h) in favor of full descriptive names. | |
| - Use context-specific names for loop indices based on what you're iterating over: | |
|   - `item_index` for general enumeration | |
|   - `line_index` for text line iteration | |
|   - `column_index` for table/array column iteration | |
|   - `row_index` for table/array row iteration | |
| - Use descriptive names for comprehensions and iterations: | |
|   - `item` instead of `i` for general items | |
|   - `element` instead of `e` for list elements | |
|   - `key` instead of `k` for dictionary keys | |
|   - `value` instead of `v` for dictionary values | |
| - Use descriptive names for coordinates and positions: | |
|   - `x_position`, `y_position` instead of `x`, `y` | |
|   - `width`, `height` instead of `w`, `h` | |
| - Use descriptive names for data structures: | |
|   - `file_path` instead of `f` for file paths | |
|   - `model` instead of `m` for model instances | |
|   - `user` instead of `u` for user objects | |
| **Examples from recent refactoring:** | |
| - `for i, ref in enumerate(references)` → `for ref_index, ref in enumerate(references)` | |
| - `for e in examples` → `for example in examples` | |
| - `for m in models` → `for model in models` | |
| - `x = pdf.get_x()` → `x_position = pdf.get_x()` | |
| - `fh = family_history` → use `family_history` directly instead of aliasing it (avoid abbreviations) | |
| - `ct for ct in cancer_types` → `cancer_type for cancer_type in cancer_types` | |
| - `f in MODELS_DIR.glob` → `file_path in MODELS_DIR.glob` | |
| - `t in field_type.__args__` → `type_arg in field_type.__args__` | |
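The naming rules above can be seen together in a short sketch. Both functions are hypothetical illustrations, not code from this repository.

```python
def count_words_per_line(lines: list[str]) -> dict[int, int]:
    """Map each line index to the number of whitespace-separated words on it."""
    word_counts: dict[int, int] = {}
    for line_index, line in enumerate(lines):
        word_counts[line_index] = len(line.split())
    return word_counts


def invert_mapping(mapping: dict[str, int]) -> dict[int, str]:
    """Swap keys and values; later keys win if values repeat."""
    return {value: key for key, value in mapping.items()}
```

`line_index`, `key`, and `value` say what is being iterated, so the loop and the comprehension read correctly without looking up how the variables were produced.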
| ### Path Handling | |
| - **Always use `pathlib.Path`** for all file I/O, joining, and globbing | |
| - Accept `Path | str` at function boundaries; normalize to `Path` internally | |
| - **Never use `os.path`** for path operations | |
| Example: | |
| ```python | |
| from pathlib import Path | |
| def read_text(file: Path | str) -> str: | |
|     path = Path(file) | |
|     return path.read_text(encoding="utf-8") | |
| ``` | |
| ### Type Hints and Modern Python | |
| - **Use modern type hints**: `list`, `dict`, `tuple`, `set` (not `List`, `Dict`, etc.) | |
| - **Use PEP 604 unions**: `A | B` (not `Union[A, B]` or `Optional[A]`) | |
| - Import from `typing` only when necessary (`TypedDict`, `Literal`, `Annotated`, etc.) | |
| - **Never use** `from __future__ import annotations` | |
| - Add type hints to all public functions and methods | |
| - Prefer precise types (`float`, `Path`, etc.) over generic ones | |
| - If `Any` is required, isolate and document why | |
| ### Import Management | |
| - **Place all imports at the top of the file**, never inside functions or classes | |
| - Group imports in three sections with blank lines between: | |
|   1. Standard library imports | |
|   2. Third-party library imports | |
|   3. Local/project imports | |
| - This improves performance (imports loaded once) and code readability | |
| ### Error Handling and Logging | |
| - **Use `try/except` only for I/O or external APIs** | |
| - Catch specific exceptions only (never broad `except:`) | |
| - Raise clear, actionable error messages | |
| - **Use `loguru`** for logging, never `print()` in production code | |
| Example: | |
| ```python | |
| from pathlib import Path | |
| from loguru import logger | |
| # `file_path` is assumed to be provided by the caller (Path | str). | |
| try: | |
|     data = Path(file_path).read_text(encoding="utf-8") | |
| except FileNotFoundError as error: | |
|     logger.error(f"Configuration file not found: {file_path}") | |
|     raise ValueError(f"Missing required config: {file_path}") from error | |
| ``` | |
| ### Docstring Standards | |
| - **Use Google-style docstrings** for all public functions and classes | |
| - Do NOT include type hints in docstrings (they're in the signature) | |
| - Describe behavior, invariants, side effects, and edge cases | |
| - Include examples for complex functions | |
| - Avoid verbose docstrings for simple, self-explanatory functions | |
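For reference, a Google-style docstring that follows the points above might look like the sketch below. `clamp_percentage` is a hypothetical helper written for this example.

```python
def clamp_percentage(
    value: float, lower_bound: float = 0.0, upper_bound: float = 100.0
) -> float:
    """Clamp a percentage to a closed interval.

    Values already inside the interval are returned unchanged. The function
    has no side effects.

    Args:
        value: The percentage to clamp.
        lower_bound: Smallest value that may be returned.
        upper_bound: Largest value that may be returned.

    Returns:
        The clamped value.

    Raises:
        ValueError: If lower_bound is greater than upper_bound.
    """
    if lower_bound > upper_bound:
        raise ValueError(
            f"lower_bound ({lower_bound}) must not exceed upper_bound ({upper_bound})"
        )
    return max(lower_bound, min(value, upper_bound))
```

The types appear only in the signature; the docstring covers behavior, the edge case, and the error condition.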
| ## Testing | |
| ### Testing Philosophy | |
| - Write meaningful tests that verify core functionality and prevent regressions | |
| - Use `pytest` as the testing framework | |
| - Tests go under `tests/` mirroring the source layout | |
| - Test both valid and invalid input scenarios | |
| ### Test Types | |
| - **Unit tests**: Small, deterministic, one concept per test | |
| - **Integration tests**: Real workflows or reference comparisons with external systems | |
| - Use `pytest.mark` to tag slow or manual tests | |
| ### Test Coverage Requirements | |
| - Ensure comprehensive test coverage for all risk models | |
| - **Ground Truth Validation**: Test against known reference values | |
| - **Input Validation**: Test that invalid inputs raise `ValueError` | |
| - **Edge Cases**: Test boundary conditions | |
| - **Inapplicable Cases**: Test when models should return "N/A" | |
| ### Running Tests | |
| ```bash | |
| uv run pytest # Run all tests | |
| uv run pytest -q # Quiet mode | |
| uv run pytest -v # Verbose mode | |
| uv run pytest tests/test_risk_models/ # Specific directory | |
| ``` | |
| ### Pre-Submission Checklist | |
| Before committing code, verify: | |
| 1. ✅ Run `uv run pytest -q` (all tests pass) | |
| 2. ✅ Run `pre-commit run --all-files` (all hooks pass) | |
| 3. ✅ No `print()` statements in production code | |
| 4. ✅ No broad `except:` blocks | |
| 5. ✅ All type hints present on public functions | |
| 6. ✅ File paths use `pathlib.Path` | |
| 7. ✅ Logging uses `loguru` | |
| ## Risk Models | |
| ### Implemented Models | |
| The assistant currently includes the following built-in risk calculators: | |
| - **Gail** - Breast cancer risk | |
| - **Claus** - Breast cancer risk based on family history | |
| - **Tyrer-Cuzick** - Breast cancer risk (IBIS model) | |
| - **BOADICEA** - Breast and ovarian cancer risk (via CanRisk API) | |
| - **PLCOm2012** - Lung cancer risk | |
| - **LLPi** - Liverpool Lung Project improved model for lung cancer risk (8.7-year prediction) | |
| - **CRC-PRO** - Colorectal cancer risk | |
| - **PCPT** - Prostate cancer risk | |
| - **Extended PBCG** - Prostate cancer risk (extended model) | |
| - **Prostate Mortality** - Prostate cancer-specific mortality prediction | |
| - **MRAT** - Melanoma risk (5-year prediction) | |
| - **aMAP** - Hepatocellular carcinoma (liver cancer) risk | |
| - **QCancer** - Multi-site cancer differential | |
| Additional models should follow the interfaces under `src/sentinel/risk_models`. | |
| ### Risk Model Implementation Guide | |
| #### Base Architecture | |
| All risk models must inherit from `RiskModel` in `src/sentinel/risk_models/base.py`: | |
| ```python | |
| from sentinel.risk_models.base import RiskModel | |
| class YourRiskModel(RiskModel): | |
|     def __init__(self): | |
|         super().__init__("your_model_name") | |
| ``` | |
| #### Required Methods | |
| Every risk model must implement these abstract methods: | |
| ```python | |
| def compute_score(self, user: UserInput) -> str: | |
|     """Compute the risk score for a given user profile. | |
|     Args: | |
|         user: The user profile containing demographics, medical history, etc. | |
|     Returns: | |
|         Risk percentage as a string, or an N/A message if inapplicable. | |
|     Raises: | |
|         ValueError: If required inputs are missing or invalid. | |
|     """ | |
| def cancer_type(self) -> str: | |
|     """Return the cancer type this model assesses.""" | |
|     return "breast"  # or "lung", "prostate", etc. | |
| def description(self) -> str: | |
|     """Return a detailed description of the model.""" | |
| def interpretation(self) -> str: | |
|     """Return guidance on how to interpret the results.""" | |
| def references(self) -> list[str]: | |
|     """Return list of reference citations.""" | |
| ``` | |
| #### UserInput Structure | |
| **All risk models must use the centralized `UserInput` structure** - this is the single source of truth for all data types and enums. The `UserInput` class follows a hierarchical structure: | |
| ``` | |
| UserInput | |
| ├── demographics: Demographics | |
| │   ├── age_years: int | |
| │   ├── sex: Sex (enum) | |
| │   ├── ethnicity: Ethnicity | None | |
| │   └── anthropometrics: Anthropometrics | |
| │       ├── height_cm: float | None | |
| │       └── weight_kg: float | None | |
| ├── lifestyle: Lifestyle | |
| │   ├── smoking: SmokingHistory | |
| │   └── alcohol: AlcoholConsumption | |
| ├── personal_medical_history: PersonalMedicalHistory | |
| │   ├── chronic_conditions: list[ChronicCondition] | |
| │   ├── previous_cancers: list[CancerType] | |
| │   ├── genetic_mutations: list[GeneticMutation] | |
| │   └── tyrer_cuzick_polygenic_risk_score: float | None | |
| ├── female_specific: FemaleSpecific | None | |
| │   ├── menstrual: MenstrualHistory | |
| │   ├── parity: ParityHistory | |
| │   └── breast_health: BreastHealthHistory | |
| ├── symptoms: list[SymptomEntry] | |
| └── family_history: list[FamilyMemberCancer] | |
| ``` | |
| #### REQUIRED_INPUTS Specification | |
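| Every risk model must define a `REQUIRED_INPUTS` class attribute. Each key is a dotted path into `UserInput`; each value is a `(type, is_required)` tuple, where the type may be a `typing.Annotated` type carrying Pydantic `Field` constraints: | |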
| ```python | |
| REQUIRED_INPUTS: dict[str, tuple[type, bool]] = { | |
|     "demographics.age_years": (Annotated[int, Field(ge=18, le=100)], True), | |
|     "demographics.sex": (Sex, True), | |
|     "demographics.ethnicity": (Ethnicity | None, False), | |
|     "family_history": (list, False),  # list[FamilyMemberCancer] | |
|     "symptoms": (list, False),  # list[SymptomEntry] | |
| } | |
| ``` | |
| #### Input Validation | |
| Every `compute_score` method must start with input validation: | |
| ```python | |
| def compute_score(self, user: UserInput) -> str: | |
|     """Compute the risk score for a given user profile.""" | |
|     # Validate inputs first | |
|     is_valid, errors = self.validate_inputs(user) | |
|     if not is_valid: | |
|         raise ValueError(f"Invalid inputs for {self.name}: {'; '.join(errors)}") | |
|     # Model-specific validation | |
|     if user.demographics.sex != Sex.FEMALE: | |
|         return "N/A: Model is only applicable to female patients." | |
|     # Continue with model-specific logic... | |
| ``` | |
| #### Data Access Patterns | |
| ```python | |
| # Demographics | |
| age = user.demographics.age_years | |
| sex = user.demographics.sex | |
| ethnicity = user.demographics.ethnicity | |
| # Female-specific data | |
| if user.female_specific is not None: | |
|     menarche_age = user.female_specific.menstrual.age_at_menarche | |
|     num_births = user.female_specific.parity.num_live_births | |
| # Family history | |
| for member in user.family_history: | |
|     if member.cancer_type == CancerType.BREAST: | |
|         relation = member.relation | |
|         age_at_diagnosis = member.age_at_diagnosis | |
| ``` | |
| #### Enum Usage | |
| **Always use enums from `sentinel.user_input`, never string literals or custom enums:** | |
| ```python | |
| # ✅ Correct - using UserInput enums | |
| if user.demographics.sex == Sex.FEMALE: | |
| if member.cancer_type == CancerType.BREAST: | |
| if member.relation == FamilyRelation.MOTHER: | |
| # ❌ Incorrect - string literals | |
| if user.demographics.sex == "female": | |
| if member.cancer_type == "breast": | |
| # ❌ Incorrect - custom enums | |
| if user.demographics.sex == MyCustomSex.FEMALE: | |
| ``` | |
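One practical reason to prefer enums over string literals is that a misspelled value fails loudly instead of comparing unequal in silence. A minimal sketch with a stand-in `Sex` enum; the member values and the `parse_sex` helper are assumptions, and the real enum lives in `sentinel.user_input`.

```python
from enum import Enum


class Sex(str, Enum):  # stand-in; member values are assumptions
    FEMALE = "female"
    MALE = "male"


def parse_sex(raw_value: str) -> Sex:
    """Convert raw text to the Sex enum, rejecting unknown values."""
    normalized_value = raw_value.strip().lower()
    allowed_values = [member.value for member in Sex]
    if normalized_value not in allowed_values:
        raise ValueError(
            f"Unknown sex {raw_value!r}; expected one of: {', '.join(allowed_values)}"
        )
    return Sex(normalized_value)
```

With string literals, `sex == "femal"` would simply evaluate to `False` and the typo could go unnoticed; here the bad value raises a `ValueError` naming the allowed options.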
| **Important**: All risk models must use the same centralized enums from `UserInput`. If a required enum doesn't exist in `UserInput`, you must: | |
| 1. **Extend UserInput** by adding the new enum to `src/sentinel/user_input.py` | |
| 2. **Never create model-specific enums** - this prevents divergence between models | |
| 3. **Update all models** to use the new centralized enum | |
| This ensures all risk models share the same data structure and prevents fragmentation. | |
| #### Extending UserInput | |
| When a risk model needs fields or enums that don't exist in `UserInput`: | |
| 1. **Add to UserInput**: Extend `src/sentinel/user_input.py` with new fields/enums | |
| 2. **Update all models**: Ensure all existing models can handle the new fields (use `| None` for optional fields) | |
| 3. **Never create model-specific structures**: This prevents divergence and fragmentation | |
| 4. **Test thoroughly**: Add tests for new fields in `tests/test_user_input.py` | |
| Example of extending UserInput: | |
| ```python | |
| # In src/sentinel/user_input.py | |
| class ChronicCondition(str, Enum): | |
|     # ... existing values | |
|     NEW_CONDITION = "new_condition"  # Add new enum value | |
| class PersonalMedicalHistory(StrictBaseModel): | |
|     # ... existing fields | |
|     new_field: float | None = Field(None, description="New field description") | |
| ``` | |
| #### Testing Requirements | |
| Create comprehensive test files with: | |
| - **Ground Truth Validation**: Test against known reference values | |
| - **Input Validation**: Test that invalid inputs raise `ValueError` | |
| - **Edge Cases**: Test boundary conditions and edge cases | |
| - **Inapplicable Cases**: Test cases where model should return "N/A" | |
| Example test structure: | |
| ```python | |
| import pytest | |
| from sentinel.user_input import UserInput, Demographics, Sex | |
| from sentinel.risk_models import YourRiskModel | |
| GROUND_TRUTH_CASES = [ | |
|     { | |
|         "name": "test_case_name", | |
|         "input": UserInput( | |
|             demographics=Demographics( | |
|                 age_years=40, | |
|                 sex=Sex.FEMALE, | |
|                 # ... other fields | |
|             ), | |
|             # ... rest of input | |
|         ), | |
|         "expected": 1.5,  # Expected risk percentage | |
|     }, | |
|     # ... more test cases | |
| ] | |
| class TestYourRiskModel: | |
|     def setup_method(self) -> None: | |
|         """Create a fresh model instance before each test.""" | |
|         self.model = YourRiskModel() | |
|     @pytest.mark.parametrize("case", GROUND_TRUTH_CASES, ids=lambda case: case["name"]) | |
|     def test_ground_truth_validation(self, case): | |
|         """Test against ground truth results.""" | |
|         user_input = case["input"] | |
|         expected_risk = case["expected"] | |
|         actual_risk_str = self.model.compute_score(user_input) | |
|         actual_risk = float(actual_risk_str) | |
|         assert actual_risk == pytest.approx(expected_risk, abs=0.01) | |
| ``` | |
| #### Migration Checklist | |
| When adapting an existing risk model to the new structure: | |
| - [ ] Update imports to use new `user_input` module | |
| - [ ] Add `REQUIRED_INPUTS` with Pydantic validation | |
| - [ ] Refactor `compute_score` to use new `UserInput` structure | |
| - [ ] Replace string literals with enums | |
| - [ ] Update parameter extraction logic | |
| - [ ] Add input validation at start of `compute_score` | |
| - [ ] Update all test cases to use new `UserInput` structure | |
| - [ ] Run full test suite to ensure 100% pass rate | |
| - [ ] Run pre-commit hooks to ensure code quality | |
| ## LLM and Code Assistant Guidelines | |
| When generating or modifying code, AI assistants MUST: | |
| ### Mandatory Rules | |
| - Follow ALL guidelines in this document without exception | |
| - Never use forbidden constructs (`os.path`, `Optional[]`, `List[]`, `print()`, broad `except:`) | |
| - Never add decorative comment banners or unnecessary formatting | |
| - Always generate clean, modular, statically typed code | |
| ### Code Generation Standards | |
| - Prefer clarity and simplicity over cleverness | |
| - Use modern Python type hints exclusively | |
| - Include comprehensive docstrings for non-trivial functions | |
| - Ensure all examples compile, type-check, and pass linting | |
| ### Verification | |
| All generated code must: | |
| - Pass `ruff format` and `ruff check` | |
| - Include proper type hints | |
| - Use `pathlib.Path` for all file operations | |
| - Use `loguru` for logging | |
| - Follow the Variable Naming guidelines | |
| ## Important Note for Developers | |
| When making changes to the project, ensure that the following files are updated to reflect the changes: | |
| - `README.md` | |
| - `AGENTS.md` | |
| - `GEMINI.md` | |
| For additional implementation details, refer to the existing risk model implementations in `src/sentinel/risk_models/`. | |