Financial-Advice-LLM: A Fine-Tuned Llama 3.1 Model for Structured Budgeting and Financial Advice
Introduction
This project fine-tunes Llama-3.1-8B-Instruct to generate structured, personalized financial advice based on a user's financial profile. While general-purpose LLMs often produce inconsistent or unstructured financial recommendations, this model addresses that gap by training on a synthetic dataset designed to enforce format, clarity, and completeness. The goal is to help users better interpret their financial situations and receive actionable, template-aligned guidance. After fine-tuning, the model demonstrated improved completeness and accuracy on budgeting tasks while retaining competitive general reasoning performance.
Data
Because no public dataset exists for personalized budgeting guidance, I created a synthetic dataset of over 300 examples using few-shot prompting. Each example pairs a structured instruction (income, debt, family size, goals, etc.) with a detailed budgeting response covering:
- Budget Overview
- Savings Recommendations
- Debt Strategy
- Additional Tips
The dataset follows a consistent Instruction / Response format, and a 90/10 train/validation split was used to prepare the data.
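As a rough illustration, the split can be produced with the Hugging Face `datasets` library. The file name and record fields below are assumptions, not the project's actual artifacts:

```python
# Minimal sketch, assuming the synthetic examples are stored as JSONL records
# with "instruction" and "response" fields (field names are an assumption).
from datasets import load_dataset

dataset = load_dataset("json", data_files="budgeting_examples.jsonl", split="train")

# 90/10 train/validation split, as described above.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(f"train: {len(train_ds)}, validation: {len(val_ds)}")
```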
For evaluation, I used:
- GSM8K-CoT for math and numerical reasoning
- MMLU High School Math for general reasoning
- TruthfulQA (MC1) for factual robustness
- A custom LLM-as-a-judge dataset to evaluate the structure, completeness, and accuracy of budgeting outputs
These benchmarks collectively measure both general reasoning and task-specific financial performance.
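For reference, the three public benchmarks can be run with EleutherAI's lm-evaluation-harness. The sketch below uses its Python API; the task names follow current harness conventions and may differ across versions, and the batch size is an assumption:

```python
# Hedged sketch using lm-evaluation-harness; not the project's actual eval script.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=db5kb/financial-advice-llm-Llama-3.1-8B-Instruct",
    tasks=["gsm8k_cot", "mmlu_high_school_mathematics", "truthfulqa_mc1"],
    batch_size=8,  # assumption; tune to available memory
)
print(results["results"])
```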
Methodology
Based on experimentation in earlier course assignments, I selected full fine-tuning: lightweight approaches were insufficient for enforcing consistent multi-section templates, while full fine-tuning allows deeper alignment to the structural patterns in the synthetic dataset.
Three hyperparameter configurations were tested:
| Config | Learning Rate | Epochs | Notes |
|---|---|---|---|
| 1 | 5e-6 | 1 | Underfit |
| 2 | 1e-5 | 2 | Best validation performance |
| 3 | 2e-5 | 2 | Slight overfitting |
The final model was selected from Configuration 2, which showed the lowest validation loss and most stable learning.
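The snippet below is a minimal sketch of what the winning run might look like with TRL's SFTTrainer. Only the learning rate and epoch count come from the table above; the batch size, precision, and other settings are assumptions, as is the use of TRL itself.

```python
# Minimal sketch of Config 2 (lr=1e-5, 2 epochs) using TRL's SFTTrainer.
# train_ds / val_ds are the splits from the Data section sketch.
from trl import SFTConfig, SFTTrainer

def to_text(example):
    # Assemble the Instruction / Response training format (field names assumed).
    return {"text": f"Instruction: {example['instruction']}\n\nResponse: {example['response']}"}

train_ds = train_ds.map(to_text)
val_ds = val_ds.map(to_text)

args = SFTConfig(
    output_dir="financial-advice-llm",
    learning_rate=1e-5,               # Config 2
    num_train_epochs=2,               # Config 2
    per_device_train_batch_size=1,    # assumption
    gradient_accumulation_steps=8,    # assumption
    bf16=True,                        # assumption
    eval_strategy="epoch",
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # base model named in the intro
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
```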
Evaluation
Benchmark Results
| Model | MMLU HS Math | GSM8K (strict) | GSM8K (flexible) | TruthfulQA MC1 | Judge: Structure | Judge: Completeness | Judge: Accuracy |
|---|---|---|---|---|---|---|---|
| Base Llama-3.1-8B | 42.2% | 76.4% | 77.6% | 36.96% | 4.5 | 3.9 | 4.5 |
| Financial-Advice-LLM (Our Fine-tuned Model) | 40.7% | 73.0% | 75.1% | 37.70% | 3.67 | 4.0 | 4.56 |
| Mistral-7B | ~37% | ~66% | ~68% | N/A | N/A | N/A | N/A |
| Gemma-7B | ~39% | ~70% | ~72% | N/A | N/A | N/A | N/A |
Benchmark Task Descriptions
I selected these benchmarks to test both general-purpose and domain-specific reasoning:
- GSM8K: Ensures the model still performs well on multi-step numerical reasoning, which is relevant to budgeting math
- MMLU High School Math: Checks general reasoning retention after specialization
- TruthfulQA (MC1): Assesses factual robustness and resistance to misleading questions
- LLM-as-a-Judge: Measures template adherence, accuracy, and completeness (a sketch of the judging setup follows this list)
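As a rough illustration of the judging setup, here is a hypothetical rubric prompt. The three criteria come from the evaluation description above, but the exact wording, scale, and judge model are not documented here, so treat this as an assumption-laden sketch:

```python
# Hypothetical LLM-as-a-judge rubric; the 1-5 scale and wording are assumptions.
JUDGE_PROMPT = """Rate the budgeting response below on a 1-5 scale for each criterion.

1. Structure: does it follow the four-section template
   (Budget Overview, Savings Recommendations, Debt Strategy, Additional Tips)?
2. Completeness: does it address the user's income, debts, and stated goals?
3. Accuracy: are the arithmetic and financial reasoning sound?

Instruction:
{instruction}

Response to grade:
{response}

Answer with JSON only: {{"structure": <1-5>, "completeness": <1-5>, "accuracy": <1-5>}}"""

def build_judge_prompt(instruction: str, response: str) -> str:
    # Fill the rubric template with one (instruction, response) pair to grade.
    return JUDGE_PROMPT.format(instruction=instruction, response=response)
```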
Summary of Performance
As expected, the model shows a slight decrease on general reasoning benchmarks due to its specialization in budgeting-style responses. TruthfulQA improved modestly (36.96% to 37.70%), showing that fine-tuning did not harm general factual robustness. The LLM-as-a-judge scores improved in completeness and accuracy, though the structure rating dipped relative to the base model; overall, the fine-tuned model is better aligned with its intended task.
Usage and Intended Uses
Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "db5kb/financial-advice-llm-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" and device_map="auto" (requires `accelerate`) are optional
# but recommended for GPU inference.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Use the Instruction / Response format the model was fine-tuned on.
prompt = (
    "Instruction: I earn $85k, have $10k in student loans, and want to save for a home.\n\n"
    "Response:"
)
print(pipe(prompt, max_new_tokens=350)[0]["generated_text"])
```
Intended Use Cases
- Personalized budgeting and financial advice summaries
- Beginner-friendly explanations of financial principles
- Consistent template-based financial guidance
This model is not intended for investment decisions, tax planning, or other high-stakes financial advice.
Prompt Format
The model expects inputs formatted as:

```
Instruction: <user financial profile>

Response:
```

Example:

```
Instruction: I'm a single professional earning $75k with $20k in student loans and want to save for a home.

Response:
```
Expected Output Format
The model outputs a structured, multi-section budgeting summary:
```
1. Budget Overview:
<text>
2. Savings Recommendations:
<text>
3. Debt Strategy:
<text>
4. Additional Tips:
<text>
```
Limitations
- The dataset is fully synthetic and may not reflect the full range and variability of real-world financial situations
- The model may oversimplify complex financial situations
- There are slight performance decreases on general reasoning benchmarks
- Evaluation through an LLM judge may introduce bias
- It should not be used for financial decision making