
Build Dataset For Fine-Tuning and Evaluation with LLM | Sentiment Analysis for Financial News

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Filter financial news by a recent date window (after early August through Aug 23) to reduce overlap with model training data.

Briefing

The core takeaway is a practical workflow for building a sentiment-labeled dataset from financial news using a fast large language model (LLM): take recent Yahoo Finance news articles, convert them into a clean text format, run Gemini 2.5 Flash Lite to label each item as positive, negative, or neutral, then store the results for later evaluation and fine-tuning.

The process starts with data selection and formatting. The dataset used comes from the Yahoo Finance “stock news” field, which includes metadata such as related symbols, title, publisher, report date, type, story text, and a link. From each news item, the article body is stored as a list of paragraphs; the workflow turns this into a single text string by taking the title as-is and concatenating the first one or two paragraphs (joined with a newline). It also extracts the relevant company tickers from the “related symbols” field and keeps them as a list. To avoid contamination from model training data, the articles are filtered to a recent window—roughly after early August up to August 23—yielding about 1,000 examples.
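A minimal sketch of this formatting step in pandas might look like the following. The column names (title, paragraphs, related_symbols, report_date), the source filename, and the concrete dates (including the year) are assumptions; the video does not show them verbatim here.

```python
# Sketch of the formatting step, assuming a pandas DataFrame whose columns
# mirror the Yahoo Finance "stock news" fields described above.
import pandas as pd

df = pd.read_parquet("stock_news.parquet")  # assumed source file

def build_text(row) -> str:
    # Keep the title as-is and append the first one or two paragraphs,
    # joined with newlines, as the workflow describes.
    paragraphs = list(row["paragraphs"])[:2]
    return "\n".join([row["title"], *paragraphs])

df["text"] = df.apply(build_text, axis=1)
df["tickers"] = df["related_symbols"].apply(list)  # keep tickers as a list

# Filter to the recent window (early August through Aug 23) to reduce
# overlap with the model's training data; exact dates/year are assumed.
df["report_date"] = pd.to_datetime(df["report_date"])
recent = df[(df["report_date"] > "2025-08-01") & (df["report_date"] <= "2025-08-23")]
print(len(recent))  # roughly 1,000 examples in the described run
```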

Before labeling, the workflow checks text length distribution to ensure the inputs are manageable. Token/word counts show most examples fall within a relatively tight range (roughly 1,000–2,000 words for the bulk of rows), with a few outliers that could be removed if they were extreme (10,000–20,000 words). In this case, trimming isn’t necessary.
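A quick way to run that length check, continuing the pandas sketch above; the 10,000-word cutoff simply mirrors the outlier range mentioned and is not a hard rule.

```python
# Word-count distribution over the combined inputs, assuming the "text"
# column built in the previous sketch.
word_counts = recent["text"].str.split().str.len()
print(word_counts.describe())  # the bulk of rows should sit in a tight range

# Extreme outliers (e.g., 10,000+ words) would be candidates for removal;
# in the described run none required trimming.
print((word_counts > 10_000).sum())
```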

For annotation, Gemini 2.5 Flash Lite is used via the Google API, with an environment variable holding the API key. A prompt instructs the model to classify sentiment into exactly three categories (positive, negative, or neutral) using only the provided title and article text. The prompt is designed for minimal output: the model is asked to return only the category name, with no extra text. The workflow also sets a "thinking budget," noting that pushing it too high can worsen results compared with using fewer thinking tokens.
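A sketch of that annotation call, assuming the google-genai Python SDK (`pip install google-genai`); the model id string, the exact prompt wording, and the budget of 128 thinking tokens are assumptions, not values confirmed by the source.

```python
# Minimal labeling helper using the google-genai SDK; model id, prompt
# wording, and thinking budget are assumptions based on the description.
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

PROMPT_TEMPLATE = """Classify the sentiment of this financial news article
as exactly one of: positive, negative, neutral.
Respond with only the category name and no other text.

Title: {title}

Article:
{text}
"""

def label_sentiment(title: str, text: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",  # assumed model id
        contents=PROMPT_TEMPLATE.format(title=title, text=text),
        config=types.GenerateContentConfig(
            # Keep the thinking budget modest; the workflow notes that a
            # larger budget can actually hurt label quality here.
            thinking_config=types.ThinkingConfig(thinking_budget=128),
        ),
    )
    return response.text.strip().lower()
```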

Labeling is then performed row-by-row across the dataset, with a short pause between API calls to avoid rate limits. The run takes about 30–35 minutes for ~1,000 items. Early outputs show a distribution that is heavily skewed toward positive sentiment, with negative and neutral appearing at much lower—though still non-trivial—rates. That imbalance becomes a key checkpoint: even with a strong model, the labels should be reviewed for adequacy.
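The loop itself can be as simple as the sketch below; the 2-second pause is an assumption that happens to land near the reported 30–35 minute runtime for ~1,000 items.

```python
# Row-by-row annotation with a short pause between calls to stay under
# rate limits; the sleep duration is an assumption.
import time

labels = []
for row in recent.itertuples():
    labels.append(label_sentiment(row.title, row.text))
    time.sleep(2.0)  # ~1,000 items at this pace takes roughly half an hour
```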

Finally, the predicted sentiment categories are written back into the dataset and saved as a new Parquet file. The resulting labeled dataset is positioned for the next step—evaluation (and potentially fine-tuning) using other tools—while also acknowledging a practical constraint: training directly on outputs from this Google model may be restricted, so the labels are treated as a way to bootstrap evaluation and dataset understanding rather than an automatic end-to-end training pipeline.
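A minimal version of that final step, continuing the earlier sketches; the output filename is an assumption.

```python
# Write the predictions back as a new column and persist the enriched
# dataset as a new Parquet file.
recent = recent.assign(sentiment=labels)
recent.to_parquet("stock_news_labeled.parquet", index=False)  # assumed name
```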

Cornell Notes

The workflow builds a sentiment-analysis dataset for financial news by labeling ~1,000 Yahoo Finance “stock news” items with Gemini 2.5 Flash Lite. It filters articles to a recent date range (after early August through Aug 23) to reduce overlap with model training data, then converts each item into a single input text by combining the title with the first one or two paragraphs of the article body. Before labeling, it checks token/word-length distribution to catch extreme outliers. The model is prompted to output only one of three labels (positive, negative, or neutral) based on the title and text, and the results are saved back into a Parquet file for later evaluation. The label distribution (often skewed toward positive) is treated as a quality signal that still requires verification.

How does the workflow turn Yahoo Finance news records into model-ready text inputs?

Each news item’s title is kept as a string, and the article body is stored as a list of paragraphs. The workflow concatenates the first one or two paragraphs (typically the first, and the second if more exist) into a single text field, joining paragraphs with a newline. It also extracts the relevant company tickers from the “related symbols” field into a list, though the sentiment prompt primarily uses the title and the combined paragraph text.

Why filter the dataset by date before labeling?

The workflow intentionally selects very recent articles—roughly after the beginning of August up to August 23—so the examples are less likely to have been used in training for the sentiment model. This reduces the chance that the labeling task becomes less meaningful due to memorization or training-data overlap.

What prompt design choices help ensure clean sentiment labels?

The prompt restricts outputs to exactly three categories: positive, negative, or neutral. It also instructs the model to return only the category name with no extra text. This matters because it allows straightforward extraction of the label as a single token/word, rather than parsing verbose responses.
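To illustrate why the single-word constraint helps, a small parsing guard might look like the following; the helper name is hypothetical, and the idea is simply that unexpected output is surfaced rather than silently stored.

```python
# Hypothetical guard for the model's one-word answer; anything outside
# the three allowed categories raises instead of being kept.
VALID_LABELS = {"positive", "negative", "neutral"}

def parse_label(raw: str) -> str:
    label = raw.strip().rstrip(".").lower()
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected model output: {raw!r}")
    return label
```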

What role does the “thinking budget” play, and what tradeoff is observed?

The workflow sets a thinking budget for Gemini 2.5 Flash Lite and notes an empirical tradeoff: increasing it too far can degrade results compared with using fewer thinking tokens. The practical takeaway is to tune for enough reasoning without overextending computation.

How is dataset quality checked after annotation?

After generating labels for the first batch (e.g., the first 10 examples), the workflow inspects the predicted category distribution. In the observed run, sentiment is overwhelmingly positive, with negative and neutral appearing at lower counts. That skew triggers a manual review step: even strong models can mislabel, so the extracted labels should be evaluated for fit to the task before downstream use.
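Inspecting that distribution is a one-liner in the pandas sketches above, assuming the "sentiment" column from the annotation loop; a heavy positive skew is the cue to spot-check labels by hand.

```python
# Predicted label distribution; absolute counts plus proportions.
print(recent["sentiment"].value_counts())
print(recent["sentiment"].value_counts(normalize=True).round(2))
```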

What is the final artifact produced for later work?

The predicted sentiment category is added as a new column in the dataset, and the enriched dataset is saved to a Parquet file. This labeled Parquet file is then used in subsequent steps for evaluation (and potentially fine-tuning workflows).

Review Questions

  1. What specific transformations are applied to the title and paragraph list to create the final text input for sentiment classification?
  2. How does the workflow’s date filtering strategy relate to concerns about training-data overlap?
  3. Why does the prompt require outputting only the category name, and how does that simplify downstream processing?

Key Points

  1. Filter financial news by a recent date window (after early August through Aug 23) to reduce overlap with model training data.

  2. Convert each article into a single input by combining the title with the first one or two paragraphs joined by newlines.

  3. Extract related company tickers from the “related symbols” field and keep them as a list for dataset context.

  4. Use Gemini 2.5 Flash Lite with an API key stored in the environment and a prompt that forces outputs to be only positive, negative, or neutral.

  5. Tune the “thinking budget” carefully; excessive reasoning tokens can worsen sentiment accuracy.

  6. Annotate row-by-row with rate-limit protection (e.g., short sleeps) and expect ~30–35 minutes for ~1,000 items.

  7. Save the labeled results back into a Parquet file to support later evaluation and fine-tuning workflows (subject to training restrictions on model outputs).

Highlights

The workflow turns Yahoo Finance “stock news” paragraph lists into compact inputs by taking the title plus the first one or two paragraphs.
A strict prompt constraint—output only the sentiment category name—makes label extraction simple and reliable.
Date filtering (early August to Aug 23) is used as a design choice to avoid likely training-data overlap.
Even with a strong model, the initial label distribution can be heavily skewed (often mostly positive), so human verification remains necessary.

Topics

  • Sentiment Labeling
  • Financial News Dataset
  • LLM Annotation
  • Prompt Engineering
  • Dataset Preparation