Automated Prompt Engineering with DSPy | Prompt Optimization for Financial News Semantic Analysis
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
DSPy can improve sentiment classification accuracy by optimizing prompt instructions rather than retraining or altering model weights.
Briefing
Prompt optimization can materially improve sentiment extraction from financial news without retraining a model: DSPy's prompt optimizer boosted classification accuracy by roughly 15 percentage points in a local, end-to-end workflow. The key move is to treat prompt writing as an optimization problem: a "teacher" model generates better instructions, while a smaller "student" model performs the sentiment labeling. In this setup, both models run locally via Ollama, and the only thing that changes is the prompt, not the model weights.
The workflow starts with a baseline. A sentiment classification pipeline is built around a DSPy signature that asks for one of three labels (positive, negative, or neutral) plus an optional "confidence" field. The initial prompt is straightforward: it frames the model as an expert financial analyst, instructs it to analyze a high-stakes financial news article, and outputs sentiment and confidence. Using a custom dataset split by time (July news for one subset, mid-August 2025 news for the other, with no date overlap), the baseline prompt is evaluated on a test set of 1,000 examples. Accuracy comes in at about 61% (the transcript's wording is garbled at this point, but the later comparison makes clear the baseline is roughly 61%).
DSPy then runs prompt optimization with the MIPROv2 optimizer, configured with (1) the metric to maximize, exact sentiment-match accuracy; (2) a teacher model used to bootstrap improved instructions; and (3) the student model that will actually be evaluated. Only 100 training examples (half the available training subset) are used for prompt tuning, while the test set remains untouched until final evaluation. The compile step runs multiple trials (10 are mentioned), and the optimizer may temporarily lose accuracy on some trials before converging on a stronger instruction set.
The best optimized prompt reaches 74% accuracy on the validation examples during optimization, up from the ~71% average reported before the new high score. The resulting prompt is more elaborate than the baseline: it explicitly directs the model to assess sentiment and confidence in terms of potential influence on investor behavior, market volatility, and macroeconomic factors, and it gives more detailed guidance on interpreting the article's language and expected impact.
Final evaluation on the held-out test set shows 76% accuracy with the optimized prompt, an improvement of about 15 percentage points over the unoptimized baseline. The transcript notes that the dataset's sentiment labels come from another language model rather than human annotation, an important caveat when interpreting absolute accuracy. Still, the experiment demonstrates a practical path for improving financial-news sentiment extraction with DSPy prompt optimization on custom data, and the optimized prompt is saved to a JSON artifact for later production use.
Cornell Notes
DSPy can optimize prompts for sentiment classification in financial news without fine-tuning model weights. The approach uses a teacher model to bootstrap better instructions and a student model to perform the labeling, with accuracy measured by exact sentiment match (positive/negative/neutral). Starting from a simple "expert financial analyst" prompt, baseline performance is about 61% accuracy on a 1,000-example test set. Running MIPROv2 prompt optimization with only 100 training examples produces a stronger, more detailed instruction prompt that reaches about 74% during optimization. Final evaluation on the untouched test set yields about 76% accuracy, an improvement of about 15 percentage points, though the labels were generated by another LM rather than by humans.
How does DSPy improve sentiment accuracy without changing model weights?
What does the sentiment task require the model to output?
Why split the dataset by time, and what sizes were used?
What was the baseline performance before prompt optimization?
What changed in the optimized prompt, and how much did accuracy improve?
What caveat affects how to interpret the reported accuracy?
Review Questions
- What role do the teacher and student models play in DSPy’s prompt optimization, and which one is used for the final sentiment labeling?
- How does the evaluation metric (exact sentiment match accuracy) influence what the optimizer searches for during MIPROv2 compilation?
- Why might time-based dataset splitting matter for financial news sentiment tasks, and how did the transcript implement it?
Key Points
1. DSPy can improve sentiment classification accuracy by optimizing prompt instructions rather than retraining or altering model weights.
2. A teacher model bootstraps better prompts, while a student model performs the sentiment extraction task using the optimized prompt.
3. The sentiment pipeline uses a DSPy signature with three labels (positive/negative/neutral) and an optional model-generated confidence field.
4. Time-based dataset splits (July vs mid-August 2025) help create a realistic train/test separation for news sentiment evaluation.
5. MIPROv2 prompt optimization used only 100 examples for tuning/validation during compilation, running multiple trials to find a higher-performing instruction set.
6. Final evaluation on a held-out 1,000-example test set showed about 76% accuracy, versus roughly 61% for the unoptimized baseline.
7. Reported accuracy should be interpreted cautiously because the dataset's sentiment labels were generated by another language model rather than by human annotation.