Build Dataset For Fine-Tuning and Evaluation with LLM | Sentiment Analysis for Financial News
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
The core takeaway is a practical workflow for building a sentiment-labeled dataset from financial news with a fast large language model (LLM): take recent Yahoo Finance news articles, convert them into a clean text format, run Gemini 2.5 Flash-Lite to label each item as positive/negative/neutral, then store the results for later evaluation and fine-tuning.
The process starts with data selection and formatting. The dataset comes from the Yahoo Finance “stock news” field, which includes metadata such as related symbols, title, publisher, report date, type, story text, and a link. Each news item stores the article body as a list of paragraphs; the workflow turns this into a single text string by taking the title as-is and concatenating the first one or two paragraphs, joined with newlines. It also extracts the relevant company tickers from the “related symbols” field and keeps them as a list. To avoid contamination from model training data, the articles are filtered to a recent window (roughly after early August up to August 23), yielding about 1,000 examples.
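The formatting step described above can be sketched as follows. The record shape and the helper names (`format_news_item`, `MAX_PARAGRAPHS`) are illustrative, not taken from the source:

```python
MAX_PARAGRAPHS = 2  # keep the title plus the first one or two paragraphs

def format_news_item(item: dict) -> dict:
    """Turn a raw news record into a single text string plus a ticker list.

    Assumes a record shaped like the Yahoo Finance "stock news" fields
    described above: a title, a paragraph list, and related symbols.
    """
    paragraphs = item.get("story", [])[:MAX_PARAGRAPHS]
    text = "\n".join([item["title"], *paragraphs])
    tickers = list(item.get("related_symbols", []))
    return {"text": text, "tickers": tickers}

example = {
    "title": "Acme Corp beats earnings expectations",
    "story": [
        "Acme Corp reported record revenue.",
        "Shares rose 5% in early trading.",
        "Analysts remain cautious.",
    ],
    "related_symbols": ["ACME"],
}
row = format_news_item(example)  # title + first two paragraphs, tickers kept
```

Keeping only the title and the opening paragraphs keeps inputs short while preserving the part of a news story that usually carries the sentiment.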
Before labeling, the workflow checks text length distribution to ensure the inputs are manageable. Token/word counts show most examples fall within a relatively tight range (roughly 1,000–2,000 words for the bulk of rows), with a few outliers that could be removed if they were extreme (10,000–20,000 words). In this case, trimming isn’t necessary.
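A minimal version of that length check, assuming rows produced by the formatting step; the outlier threshold is illustrative, since the source only removes extreme cases:

```python
def word_counts(texts):
    """Approximate length of each text by whitespace-separated word count."""
    return [len(t.split()) for t in texts]

def flag_outliers(texts, max_words=10_000):
    """Return indices of texts long enough to be worth trimming or dropping."""
    return [i for i, n in enumerate(word_counts(texts)) if n > max_words]

# In the source run, no rows exceeded the extreme range, so nothing is trimmed.
too_long = flag_outliers(["a short example article"])  # -> []
```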
For annotation, Gemini 2.5 Flash-Lite is used via the Google API, with an environment variable holding the API key. A prompt instructs the model to classify sentiment into exactly three categories—positive, negative, or neutral—using only the provided title and article text. The prompt is designed to return a minimal output: the model is asked to output only the category name with no extra text. The workflow also sets a “thinking budget,” noting that pushing it too high can worsen results compared with using fewer thinking tokens.
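A hedged sketch of the labeling call: the prompt wording and helper names are illustrative, and the `google-genai` client usage follows that library's documented pattern rather than the source verbatim.

```python
CATEGORIES = ("positive", "negative", "neutral")

def build_prompt(title: str, text: str) -> str:
    """Ask for exactly one category name, nothing else."""
    return (
        "Classify the sentiment of this financial news item as exactly one of: "
        "positive, negative, or neutral.\n"
        "Respond with only the category name, no extra text.\n\n"
        f"Title: {title}\n\nArticle: {text}"
    )

def classify_sentiment(title: str, text: str) -> str:
    # Imported here so the prompt builder works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=build_prompt(title, text),
        # Keep the thinking budget low; the source notes a larger budget
        # can actually hurt results on this task.
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        ),
    )
    return response.text.strip().lower()
```

Constraining the output to a bare category name makes the response trivially parseable and easy to validate against `CATEGORIES`.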
Labeling is then performed row-by-row across the dataset, with a short pause between API calls to avoid rate limits; the run takes about 30–35 minutes for ~1,000 items. Early outputs show a distribution heavily skewed toward positive sentiment, with negative and neutral appearing at much lower (though still non-trivial) rates. That imbalance becomes a key checkpoint: even with a strong model, the labels should be reviewed for accuracy before they are trusted.
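The annotation loop can be sketched like this; `classify_fn` stands in for the Gemini call above, and the pause length is an assumption chosen to stay under rate limits:

```python
import time
from collections import Counter

def annotate(rows, classify_fn, pause_s=2.0):
    """Label each row with classify_fn, sleeping between calls."""
    labels = []
    for row in rows:
        labels.append(classify_fn(row["title"], row["text"]))
        time.sleep(pause_s)  # simple rate-limit protection
    return labels

def label_distribution(labels):
    """Count labels per category as a quick skew check."""
    return Counter(labels)
```

Inspecting `label_distribution` after the run is how the heavy positive skew becomes visible.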
Finally, the predicted sentiment categories are written back into the dataset and saved as a new Parquet file. The resulting labeled dataset is positioned for the next step—evaluation (and potentially fine-tuning) using other tools—while also acknowledging a practical constraint: training directly on outputs from this Google model may be restricted, so the labels are treated as a way to bootstrap evaluation and dataset understanding rather than an automatic end-to-end training pipeline.
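The final write-back step, assuming the dataset lives in a pandas DataFrame; the `sentiment` column name and output path are our own choices, not confirmed by the source:

```python
import pandas as pd

def save_labeled(df: pd.DataFrame, labels, path=None):
    """Attach predicted labels and optionally persist to Parquet."""
    out = df.copy()
    out["sentiment"] = labels
    if path is not None:
        out.to_parquet(path, index=False)  # requires pyarrow or fastparquet
    return out

# Usage sketch:
# labeled = save_labeled(df, labels, path="financial_news_labeled.parquet")
```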
Cornell Notes
The workflow builds a sentiment-analysis dataset for financial news by labeling ~1,000 Yahoo Finance “stock news” items with Gemini 2.5 Flash-Lite. It filters articles to a recent date range (after early August through Aug 23) to reduce overlap with model training data, then converts each item into a single input text by combining the title with the first one or two paragraphs of the article body. Before labeling, it checks token/word-length distribution to catch extreme outliers. The model is prompted to output only one of three labels—positive, negative, or neutral—based on the title and text, and the results are saved back into a Parquet file for later evaluation. The label distribution (often skewed toward positive) is treated as a quality signal that still requires verification.
How does the workflow turn Yahoo Finance news records into model-ready text inputs?
Why filter the dataset by date before labeling?
What prompt design choices help ensure clean sentiment labels?
What role does the “thinking budget” play, and what tradeoff is observed?
How is dataset quality checked after annotation?
What is the final artifact produced for later work?
Review Questions
- What specific transformations are applied to the title and paragraph list to create the final text input for sentiment classification?
- How does the workflow’s date filtering strategy relate to concerns about training-data overlap?
- Why does the prompt require outputting only the category name, and how does that simplify downstream processing?
Key Points
1. Filter financial news by a recent date window (after early August through Aug 23) to reduce overlap with model training data.
2. Convert each article into a single input by combining the title with the first one or two paragraphs joined by newlines.
3. Extract related company tickers from the “related symbols” field and keep them as a list for dataset context.
4. Use Gemini 2.5 Flash-Lite with an API key stored in the environment and a prompt that forces outputs to be only positive, negative, or neutral.
5. Tune the “thinking budget” carefully; excessive reasoning tokens can worsen sentiment accuracy.
6. Annotate row-by-row with rate-limit protection (e.g., short sleeps) and expect ~30–35 minutes for ~1,000 items.
7. Save the labeled results back into a Parquet file to support later evaluation and fine-tuning workflows (subject to training restrictions on model outputs).