
Automated Prompt Engineering with DSPy | Prompt Optimization for Financial News Semantic Analysis

Venelin Valkov · 4 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

DSPy can improve sentiment classification accuracy by optimizing prompt instructions rather than retraining or altering model weights.

Briefing

Prompt optimization can materially improve sentiment extraction from financial news without retraining a model: DSPy’s prompt optimizer boosted classification accuracy by about 15 percentage points in a local, end-to-end workflow. The key move is to treat prompt writing as an optimization problem: a “teacher” model generates better instructions, while a smaller “student” model performs the sentiment labeling. In this setup, both models run locally via Ollama, and the only thing that changes is the prompt, not the model weights.
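
As a rough sketch of that setup (the model names and local Ollama endpoint below are assumptions, not details taken from the video), the two models can be wired up in DSPy like this:

```python
import dspy

# Hypothetical local Ollama models; substitute whatever tags you have pulled.
student_lm = dspy.LM(
    "ollama_chat/llama3.1:8b",          # smaller "student" that labels sentiment
    api_base="http://localhost:11434",  # default local Ollama endpoint
    api_key="",
)
teacher_lm = dspy.LM(
    "ollama_chat/qwen2.5:14b",          # larger "teacher" that proposes instructions
    api_base="http://localhost:11434",
    api_key="",
)

# The student is the default model for running DSPy modules;
# the teacher is handed to the optimizer later as the prompt model.
dspy.configure(lm=student_lm)
```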

The workflow starts with a baseline. A sentiment classification pipeline is built around a DSPy signature that asks for one of three labels (positive, negative, or neutral) plus an optional “confidence” field. The initial prompt is relatively straightforward: it frames the model as an expert financial analyst, instructs it to analyze a high-stakes financial news article, and asks it to output sentiment and confidence. Using a custom dataset split by time (July news for one subset and mid-August 2025 news for another, with no date overlap), the baseline prompt is evaluated on a test set of 1,000 examples. Accuracy is reported at about 61% (the transcript is garbled at this point, but the later comparison makes clear the baseline is roughly 61%).
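
A minimal version of that signature and baseline module might look like the following; the field names and docstring are illustrative rather than copied from the video:

```python
import dspy
from typing import Literal

class FinancialSentiment(dspy.Signature):
    """You are an expert financial analyst. Analyze a high-stakes financial
    news article and assess its sentiment."""

    article: str = dspy.InputField(desc="article title plus article text")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField(
        desc="a value the model generates, not a calibrated probability"
    )

# Baseline classifier: a single Predict module over the signature.
classify = dspy.Predict(FinancialSentiment)

pred = classify(article="Acme Corp beats Q2 earnings expectations on strong cloud revenue.")
print(pred.sentiment, pred.confidence)
```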

DSPy then runs prompt optimization using the MIPROv2 optimizer. The optimizer is configured with (1) the metric to maximize—exact sentiment match accuracy, (2) a teacher model used to bootstrap improved instructions, and (3) the student model that will actually be evaluated. The optimization uses only 100 training examples for prompt tuning (half the available training subset), while the test set remains untouched until final evaluation. The compile step runs multiple trials (10 are mentioned), and the optimizer may temporarily reduce accuracy on some trials before converging on a stronger instruction set.
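
Under those assumptions, the optimizer call could be sketched roughly as follows; the argument values are placeholders, since the exact settings used in the video are not shown:

```python
from dspy.teleprompt import MIPROv2

def sentiment_match(example, pred, trace=None):
    # Metric to maximize: exact match on the predicted sentiment label.
    return example.sentiment == pred.sentiment

# classify, teacher_lm, student_lm, and trainset come from the earlier sketches.
optimizer = MIPROv2(
    metric=sentiment_match,
    prompt_model=teacher_lm,   # bootstraps improved instructions
    task_model=student_lm,     # the model that is actually evaluated
    auto=None,                 # set candidates/trials manually instead of a preset
    num_candidates=10,
)

optimized_classify = optimizer.compile(
    classify,
    trainset=trainset[:100],   # only 100 examples are used for prompt tuning
    num_trials=10,             # some trials may score worse before a new best is found
    max_bootstrapped_demos=0,  # optimize instructions only, no few-shot demos
    max_labeled_demos=0,
)
```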

The best optimized prompt reaches 74% accuracy on the validation examples during optimization, up from an average of roughly 71% across earlier trials before the new high score. The resulting prompt is more elaborate than the baseline: it explicitly tells the model to assess sentiment and confidence in terms of how the news could influence investor behavior, market volatility, and macroeconomic factors, and it adds more detailed guidance on how to interpret the article’s language and expected impact.

Final evaluation on the held-out test set shows 76% accuracy with the optimized prompt, an improvement of roughly 15 percentage points over the unoptimized baseline. The transcript notes that the dataset’s sentiment labels come from another language model rather than human annotation, which is an important caveat when interpreting the absolute accuracy. Still, the experiment demonstrates a practical path for improving financial-news sentiment extraction with DSPy prompt optimization on custom data, and the optimized prompt is saved to a JSON artifact for later production use.
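
Persisting and reloading the optimized program is straightforward in DSPy; the file name below is an assumption:

```python
import dspy

# optimized_classify and FinancialSentiment come from the earlier sketches.
optimized_classify.save("optimized_sentiment_prompt.json")

# Later, e.g. in production: rebuild the module and load the saved state.
production_classify = dspy.Predict(FinancialSentiment)
production_classify.load("optimized_sentiment_prompt.json")
```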

Cornell Notes

DSPy can optimize prompts for sentiment classification in financial news without fine-tuning model weights. The approach uses a teacher model to bootstrap better instructions and a student model to perform labeling, with accuracy measured by exact sentiment match (positive/negative/neutral). Starting from a simple “expert financial analyst” prompt, baseline performance is about 61% accuracy on a 1,000-example test set. Running MIPROv2 prompt optimization with only 100 training examples produces a stronger, more detailed instruction prompt and reaches about 74% during optimization. Final evaluation on the untouched test set yields about 76% accuracy, an improvement of roughly 15 percentage points, though the labels were generated by another LM rather than humans.

How does DSPy improve sentiment accuracy without changing model weights?

It optimizes the prompt text itself. A teacher model (larger, used for bootstrapping) generates improved instructions, while a student model (smaller, used for the actual task) runs sentiment extraction. The compile step (MIPROv2) searches across prompt variants using a metric—here, exact match sentiment accuracy—then returns the best-performing prompt.

What does the sentiment task require the model to output?

The DSPy signature defines the sentiment output as one of three categories: positive, negative, or neutral. It also includes an optional “confidence” field, described as a value the model generates rather than a calibrated probability. The input is the article content, formed from the title plus the article text.

Why split the dataset by time, and what sizes were used?

The transcript uses time-based splits to avoid overlap between training and test news. One subset uses July financial news; another uses mid-August 2025 news from a previous collection, with no date overlap. The training subset contains 200 articles, of which the optimizer uses only 100 during prompt tuning and validation; final evaluation uses 1,000 held-out test examples.
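
In DSPy terms, each article becomes a dspy.Example with the article text marked as the input field. The field names below mirror the earlier signature sketch, and the split variables are hypothetical stand-ins for the time-based subsets:

```python
import dspy

def to_examples(rows):
    # rows: dicts with "title", "text", and an LM-generated "sentiment" label.
    return [
        dspy.Example(
            article=f'{row["title"]}\n\n{row["text"]}',
            sentiment=row["sentiment"],
        ).with_inputs("article")
        for row in rows
    ]

# Hypothetical time-based split: July articles for training, mid-August for testing.
trainset = to_examples(july_rows)[:200]        # 200 available; the optimizer uses 100
testset = to_examples(mid_august_rows)[:1000]  # untouched until final evaluation
```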

What was the baseline performance before prompt optimization?

Using the initial, relatively simple prompt (expert financial analyst; output sentiment and confidence), evaluation on the 1,000-example test set is reported at roughly 61% accuracy. This baseline sets the comparison point for the optimized prompt.
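
That baseline number could be reproduced with DSPy’s built-in evaluator, reusing the metric, modules, and test set from the earlier sketches:

```python
from dspy.evaluate import Evaluate

# classify, optimized_classify, sentiment_match, and testset come from earlier sketches.
evaluate = Evaluate(
    devset=testset,
    metric=sentiment_match,
    num_threads=8,          # parallel calls against the local Ollama server
    display_progress=True,
)

baseline_score = evaluate(classify)             # baseline prompt: ~61% in the video
optimized_score = evaluate(optimized_classify)  # optimized prompt: ~76% in the video
```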

What changed in the optimized prompt, and how much did accuracy improve?

The optimized prompt is more detailed and emphasizes high-stakes interpretation: it instructs the model to evaluate sentiment and confidence in terms of how the news could affect investor behavior, market volatility, and macroeconomic factors. During optimization, the best trial reaches about 74% accuracy; final test accuracy is about 76%, an improvement of roughly 15 percentage points over the baseline.

What caveat affects how to interpret the reported accuracy?

The transcript notes that the sentiment labels in the dataset were produced by another language model, not confirmed by humans. That means the measured accuracy reflects agreement with LM-generated labels, not necessarily ground-truth human sentiment.

Review Questions

  1. What role do the teacher and student models play in DSPy’s prompt optimization, and which one is used for the final sentiment labeling?
  2. How does the evaluation metric (exact sentiment match accuracy) influence what the optimizer searches for during MIPROv2 compilation?
  3. Why might time-based dataset splitting matter for financial news sentiment tasks, and how did the transcript implement it?

Key Points

  1. DSPy can improve sentiment classification accuracy by optimizing prompt instructions rather than retraining or altering model weights.
  2. A teacher model bootstraps better prompts, while a student model performs the sentiment extraction task using the optimized prompt.
  3. The sentiment pipeline uses a DSPy signature with three labels (positive/negative/neutral) and an optional model-generated confidence field.
  4. Time-based dataset splits (July vs mid-August 2025) help create a realistic train/test separation for news sentiment evaluation.
  5. MIPROv2 prompt optimization used only 100 examples for tuning/validation during compilation, running multiple trials to find a higher-performing instruction set.
  6. Final evaluation on a held-out 1,000-example test set showed about 76% accuracy, versus roughly 61% for the unoptimized baseline.
  7. Reported accuracy should be interpreted cautiously because dataset sentiment labels were generated by another language model rather than human annotation.

Highlights

Prompt optimization via DSPy improved financial-news sentiment accuracy by roughly 15 percentage points without fine-tuning model weights.
MIPROv2 searched across prompt variants using exact sentiment match accuracy, with a teacher model bootstrapping instructions for the student model.
The optimized prompt added high-stakes guidance—linking sentiment and confidence to investor behavior, market volatility, and macroeconomic factors.
Even with only 100 examples used during prompt tuning, the optimized prompt generalized to a 1,000-example test set.
Because labels were LM-generated rather than human-verified, the experiment measures agreement with those labels, not necessarily true ground truth.

Topics

  • DSPy Prompt Optimization
  • Financial News Sentiment Analysis
  • MIPROv2
  • Local LLM Inference
  • Semantic Classification
