My Best AI AGENTS So Far! | Gemini 2.0 Flash, o3-mini ++

All About AI · 4 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini 2.0 Flash can be used both to choose a news story and later to pick a thumbnail and upload the finished YouTube video.

Briefing

A fully autonomous pipeline can turn an AI news source into a published YouTube video—complete with script, voiceover, captions, edited clips, title/description, thumbnail, and upload—by chaining specialized models and tools. The workflow starts with TechCrunch headlines, selects a story using Gemini 2.0 Flash, generates a complete script with DeepSeek R1, converts that script into narration via 11 Labs, and uses OpenAI o3-mini to perform advanced video editing and assembly. Gemini 2.0 Flash is then used again to generate a thumbnail choice and to upload the finished video to YouTube, aiming for “hands off” production from raw news to a live post.
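
For illustration, here is a minimal orchestration sketch of that chain in Python. Every helper below is a hypothetical stub standing in for a call to the named service; none of these function names or signatures come from the video.

```python
# Hypothetical outline of the described pipeline. Each stub marks where the
# real service call (Gemini 2.0 Flash, DeepSeek R1, 11 Labs, Whisper,
# o3-mini, YouTube API) would go; the names and signatures are assumptions.

def pick_story(headlines: list[str]) -> str:       # Gemini 2.0 Flash picks a headline
    return headlines[0]

def write_script(story: str) -> str:                # DeepSeek R1 drafts the narration
    return f"Narration about: {story}"

def synthesize_voice(script: str) -> str:           # 11 Labs text-to-speech -> audio file
    return "narration.mp3"

def make_captions(audio_path: str) -> str:          # Whisper -> timed SRT captions
    return "narration.srt"

def assemble_video(srt_path: str) -> str:           # o3-mini aligns clips to caption timing
    return "final.mp4"

def publish(video_path: str, story: str) -> str:    # thumbnail choice + YouTube upload
    return f"uploaded {video_path} for '{story}'"

def run_pipeline(headlines: list[str]) -> str:
    story = pick_story(headlines)
    video = assemble_video(make_captions(synthesize_voice(write_script(story))))
    return publish(video, story)

print(run_pipeline(["Example TechCrunch headline"]))
```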

The most consequential detail is how the system keeps the visuals synchronized with the narration. After the script is produced, the pipeline generates captions in SRT format using OpenAI Whisper. Those timed caption segments become a control layer for later editing: the clip-selection and assembly step can match specific timestamps and text to the voiceover, so the resulting video lands on the right moments rather than relying on rough, manual cut points. The transcript notes that editing a roughly three-minute output (about 198 seconds) with this approach takes around a minute when run on an Nvidia GPU, suggesting the bottleneck is not the clip alignment itself but the overall orchestration.
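
As a hedged illustration of the captioning step, SRT-style segments can be produced with the open-source openai-whisper package as sketched below; the file names are placeholders, and the video does not show exactly how the creator invokes Whisper.

```python
# Sketch: narration audio -> SRT captions with the openai-whisper package
# (pip install openai-whisper). "narration.mp3" / "narration.srt" are
# placeholder file names, not paths from the video.
import whisper

def srt_time(t: float) -> str:
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("base")                 # small local Whisper model
result = model.transcribe("narration.mp3")         # returns timed text segments

with open("narration.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```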

In practice, the creator runs the pipeline multiple times to validate quality. One example picks a story about OpenAI’s ecosystem and personnel movement—specifically, OpenAI’s “o3-mini” framing and the recruitment of John Schulman—then produces a video that includes corresponding clips and narration. Another run targets Google’s AI watermarking efforts, generating a thumbnail that the creator describes as a parody-style image (a CEO holding their head) and producing a narration that references Google’s “SynthID” digital fingerprinting and the idea of invisible markers for AI-edited images.

The results are presented as proof-of-concept rather than a guarantee of consistent virality. Still, the creator emphasizes that the clip-picking behavior improves when the system is trained to choose clips that match the voiceover content at the right times, leading to a more engaging, coherent narrative. The pipeline also includes cleanup steps—like stripping “thinking tokens” and formatting artifacts (asterisks) from the script—so the narration reads naturally.

Finally, the setup is positioned as an evolving agent architecture. The creator hints at future refinements (including better model selection for specific tasks) and suggests releasing code via GitHub and a deeper tutorial through a membership channel once the system is “nailed down.” The broader takeaway is that modern LLMs plus speech-to-text plus timed caption alignment can automate much of the production loop for news-to-video workflows, turning a headline into a complete, publishable asset with minimal human intervention.

Cornell Notes

The workflow chains multiple AI services to convert a news source into a fully published YouTube video. Gemini 2.0 Flash selects a TechCrunch story and later helps pick a thumbnail and upload the final result; DeepSeek R1 writes the script; 11 Labs generates the voiceover; OpenAI Whisper produces SRT captions; and OpenAI o3-mini edits and assembles the video using timed caption data. The key mechanism is synchronization: SRT timestamps guide clip selection so visuals match the narration. The creator demonstrates the pipeline with at least two runs—one about OpenAI-related developments involving John Schulman and another about Google’s SynthID watermarking—then plans to refine the system and release code.

How does the pipeline go from a headline to a complete YouTube upload?

It starts by scraping TechCrunch headlines, then uses Gemini 2.0 Flash to pick a story. DeepSeek R1 generates a full script from that story. The script is sent to 11 Labs to create a voiceover, while OpenAI Whisper converts the narration text into SRT captions. OpenAI o3-mini then edits and assembles the video using the captions as alignment guidance. Gemini 2.0 Flash is used again to choose a thumbnail and to run the Google API upload, producing a finished YouTube entry with title, description, thumbnail, and video.
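
As one illustration of that final step, a resumable upload through the YouTube Data API v3 with google-api-python-client might look like the sketch below; the credentials handling, metadata, and file names are assumptions rather than details taken from the video.

```python
# Sketch of a YouTube Data API v3 upload via google-api-python-client.
# Credentials, titles, and file names here are placeholders; the video only
# states that the upload runs through the Google API.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_video(credentials, video_path: str, title: str, description: str,
                 thumbnail_path: str) -> str:
    youtube = build("youtube", "v3", credentials=credentials)
    response = youtube.videos().insert(
        part="snippet,status",
        body={
            "snippet": {"title": title, "description": description, "categoryId": "28"},
            "status": {"privacyStatus": "private"},
        },
        media_body=MediaFileUpload(video_path, resumable=True),
    ).execute()
    video_id = response["id"]
    # Attach the chosen thumbnail to the freshly uploaded video.
    youtube.thumbnails().set(videoId=video_id,
                             media_body=MediaFileUpload(thumbnail_path)).execute()
    return video_id
```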

Why are SRT captions central to making the video coherent?

SRT provides timed segments: each caption carries a start and end timestamp (e.g., a line spanning roughly the 2-second to 4-second mark) plus its text. Those timestamps let the editing agent match specific clips to specific parts of the voiceover. Instead of cutting blindly, the system can align visuals with what's being said at each moment, improving narrative flow and engagement.
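
A small sketch of how those timed segments might be parsed and queried during editing; the parsing and lookup logic is illustrative, not the creator's implementation.

```python
# Illustrative sketch: parse SRT blocks into (start, end, text) tuples and
# find which caption is active at a given second, so a clip can be matched
# to what the narration says at that moment.
import re

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def to_seconds(stamp: str) -> float:
    h, m, s, ms = map(int, SRT_TIME.match(stamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def parse_srt(text: str):
    segments = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        start, end = [to_seconds(t.strip()) for t in lines[1].split("-->")]
        segments.append((start, end, " ".join(lines[2:])))
    return segments

def caption_at(segments, t: float) -> str | None:
    for start, end, text in segments:
        if start <= t <= end:
            return text
    return None

example = """1
00:00:02,000 --> 00:00:04,000
OpenAI has announced a new reasoning model."""

print(caption_at(parse_srt(example), 3.0))  # caption active at the 3-second mark
```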

What cleanup step improves narration quality before 11 Labs?

After DeepSeek R1 writes the script, the workflow applies regex to remove “thinking tokens” and formatting artifacts like asterisks. The result is “pure text” that reads cleanly when converted into speech by 11 Labs.
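
A minimal sketch of that cleanup, assuming the reasoning appears inside <think>…</think> tags (DeepSeek R1's usual output format) and the artifacts are Markdown asterisks; the creator's exact regex is not shown in the summary.

```python
# Sketch of a regex cleanup pass. The <think>...</think> tags and asterisk
# stripping are assumptions about the script format, not confirmed patterns.
import re

def clean_script(raw: str) -> str:
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)  # drop reasoning blocks
    text = text.replace("*", "")                                    # drop bold/italic markers
    return re.sub(r"\n{3,}", "\n\n", text).strip()                  # tidy leftover blank lines

raw = "<think>outline the story first</think>\n**Google** unveils SynthID watermarking."
print(clean_script(raw))  # -> "Google unveils SynthID watermarking."
```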

What evidence is offered that clip selection is working?

The creator shows the AI picking clips that match the voiceover content at the right time. In the OpenAI-related example, the narration references reasoning and model updates, and the video switches to corresponding clips (including references to reasoning processes). In the Google watermarking example, the narration about SynthID is paired with relevant visuals, and the thumbnail is selected from the content.

How fast is the editing step, and what hardware is mentioned?

Editing and assembling a short output (around a 3-minute video, ~198 seconds) is described as taking roughly a minute when run on an Nvidia GPU. The workflow treats clip alignment and assembly as practical rather than prohibitively slow.

Review Questions

  1. What role does Gemini 2.0 Flash play at the beginning versus near the end of the workflow?
  2. How do SRT timestamps influence clip selection during video assembly?
  3. Which model is used for script writing, and which tool is used for voiceover generation?

Key Points

  1. Gemini 2.0 Flash can be used both to choose a news story and later to pick a thumbnail and upload the finished YouTube video.

  2. DeepSeek R1 generates the full narration script from the selected headline context.

  3. 11 Labs turns the cleaned script text into a voiceover track for the video.

  4. OpenAI Whisper produces SRT captions, and those timed segments become the alignment backbone for clip selection.

  5. OpenAI o3-mini performs advanced video editing and assembly, using caption timing to synchronize visuals with narration.

  6. Regex-based cleanup removes “thinking tokens” and formatting artifacts so the narration sounds natural.

  7. The workflow is designed for repeated runs and iterative improvement, with plans to release code and a deeper tutorial later.

Highlights

The pipeline aims to automate the entire loop: headline → script → voiceover → SRT captions → edited video → title/description → thumbnail → YouTube upload.
SRT timestamps act like a “control track” that lets the editing agent match clips to what the narration says at each moment.
Cleaning the script by stripping reasoning tokens and formatting improves how the voiceover reads when sent to 11 Labs.
Two demonstrated runs—OpenAI-related personnel/model updates and Google SynthID watermarking—show the system can adapt to different AI news topics.

Topics

Mentioned

  • SRT
  • GPU
  • API
  • Sundar