End To End Multimodal LLMOPS Project Azure Deployment With Observability And Orchestration Engine
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A production-grade multimodal compliance system is built to judge whether a YouTube brand advertisement follows disclosure and advertising rules—using Azure video understanding, retrieval over legal/brand guidelines, and an agentic workflow with end-to-end observability. The core workflow takes a YouTube URL, downloads the video, extracts both spoken transcript and on-screen text via Azure Video Indexer (OCR), retrieves the most relevant rule excerpts from two PDF “rule books” using Azure AI Search vector/keyword search, and then asks an Azure-hosted GPT-4 model to output a structured pass/fail verdict plus specific violations.
The compliance rules come from two documents: “Disclosures 101 for Social Media Influencers” (framed like FTC guidance to prevent deceptive ads and endorsement violations) and YouTube ad specification guidelines. Those PDFs are chunked, embedded, and indexed into Azure AI Search so the auditor model doesn’t need to read entire documents. Instead, the system performs hybrid retrieval—using both embeddings and keyword matching—to pull the most relevant pages when the video mentions key concepts (e.g., sunscreen). The extracted evidence from the ad (transcript + OCR text + selected video metadata such as duration and platform) becomes the input to a RAG-style audit step.
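Azure AI Search's hybrid mode merges the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF) before returning results. A minimal, dependency-free sketch of that fusion step (the function name and the chunk IDs are illustrative, not from the project):

```python
def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    """Merge two ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document's fused score is the sum over rankings of 1 / (k + rank),
    which is how Azure AI Search combines keyword and vector results in
    hybrid queries (k=60 is the commonly cited default constant).
    """
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Chunks that rank well in *both* modalities float to the top.
keyword_hits = ["ftc-disclosures-p3", "yt-specs-p1", "ftc-disclosures-p7"]
vector_hits = ["ftc-disclosures-p3", "ftc-disclosures-p7", "yt-specs-p4"]
top_chunks = rrf_fuse(keyword_hits, vector_hits)
```

This is why a chunk that is merely similar in embedding space but never keyword-matched can still be outranked by one that appears in both lists.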
The agentic orchestration is implemented with LangGraph as a stateful graph. A typed “video audit state” carries the session’s inputs and outputs through the pipeline, including the video URL, a derived video ID, extracted transcript/OCR, compliance results (a structured list of issues with category, severity, description, and optional timestamps), and system-level errors. The graph is intentionally linear: an “indexer node” runs first to download and process the video, then an “auditor node” runs to perform retrieval-augmented compliance evaluation, and finally the graph ends with a final report in Markdown plus a binary status.
Azure services form the infrastructure backbone. Azure Blob Storage holds the temporary downloaded MP4 so Azure Video Indexer can access it. Azure Video Indexer converts the raw video into transcript and OCR text, and also returns summarized video metadata. Azure OpenAI (hosted in Azure Foundry) provides the GPT-4 model for reasoning and the embedding model (“text-embedding-3-small”) for vectorization. Azure AI Search acts as the knowledge base for the chunked PDFs, supporting hybrid vector search. For monitoring, Azure Application Insights tracks request-level performance, errors, and latency, while LangSmith traces the LLM chain details (inputs, outputs, token counts, and timing) for debugging and portfolio-ready proof of engineering rigor.
Implementation details emphasize operational readiness. A project directory is created with a backend module, and environment variables are defined for every Azure credential: the storage connection string, search endpoint/key/index name, Foundry model endpoints/versions, Video Indexer identifiers, Application Insights connection string, and LangSmith tracing keys. The PDF indexing script validates the required env vars, chunks the text (chunk size 1000, overlap 200), and uploads the splits to the vector index. The video indexer service handles authentication via Azure identity and token exchange, downloads YouTube content with yt-dlp, uploads it to Video Indexer, polls until processing completes, and extracts only the transcript, OCR, and metadata fields needed for auditing.
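The indexing script's two preparation steps, env-var validation and fixed-size chunking with overlap, can be sketched as follows. The variable names are hypothetical, and the real project most likely uses a library text splitter rather than hand-rolled slicing; only the chunk size (1000) and overlap (200) come from the source:

```python
import os

REQUIRED_ENV_VARS = [
    "AZURE_STORAGE_CONNECTION_STRING",   # hypothetical names; the project
    "AZURE_SEARCH_ENDPOINT",             # defines one variable per credential
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_INDEX_NAME",
]

def validate_env() -> None:
    """Fail fast with a clear message instead of a mid-run auth error."""
    missing = [v for v in REQUIRED_ENV_VARS if not os.getenv(v)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap, produced before embedding.

    The 200-character overlap keeps sentences that straddle a chunk
    boundary retrievable from at least one chunk.
    """
    step = chunk_size - overlap
    return [text[i : i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```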
When executed, the pipeline produces a structured compliance audit. In the demonstrated run, the verdict is “fail,” with critical issues such as unsubstantiated scientific claims and missing endorsement disclosures (e.g., failing to disclose John Cena as a paid endorser/partner). The system also surfaces observability artifacts: LangSmith shows the traced request, retrieved evidence, prompts, and JSON outputs; Azure Application Insights provides an application map with timing across FastAPI, Azure Video Indexer, and other outbound calls. The result is a reusable end-to-end multimodal LLMOps template for compliance auditing with orchestration, retrieval, and production monitoring baked in.
Cornell Notes
The project builds an end-to-end multimodal compliance audit pipeline for YouTube brand ads. A LangGraph stateful workflow orchestrates two main nodes: an indexer node that downloads a YouTube video, uploads it to Azure Video Indexer, and extracts transcript (speech-to-text) plus OCR (on-screen text), and an auditor node that performs hybrid retrieval over two PDF rule books using Azure AI Search and then uses Azure-hosted GPT-4 to produce a structured pass/fail verdict. The output includes a Markdown final report and a list of compliance issues with category, severity, and description. Azure Blob Storage, Azure OpenAI (GPT-4 + embeddings), Azure AI Search, Azure Application Insights, and LangSmith provide the infrastructure, reasoning, knowledge base, and observability needed for production-grade operation.
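The response shape described above (pass/fail status, a list of issues with category/severity/description, and a Markdown report) maps naturally onto Pydantic models, which the project's FastAPI layer validates on the way in and out. A sketch of the schemas only, with all names hypothetical and no server code:

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class AuditRequest(BaseModel):
    """Incoming payload: just the YouTube URL to audit."""
    youtube_url: str

class ComplianceIssue(BaseModel):
    category: str
    severity: str                 # e.g. "critical" or "minor"
    description: str
    timestamp: Optional[str] = None

class AuditResponse(BaseModel):
    """Outgoing payload: verdict, issues, and the Markdown report."""
    status: str                   # "pass" or "fail"
    issues: List[ComplianceIssue] = []
    final_report: str = ""

# A FastAPI route would use these directly, e.g.:
#   @app.post("/audit", response_model=AuditResponse)
#   def audit(req: AuditRequest): ...
```

With `response_model` set, FastAPI rejects malformed requests and guarantees the client always receives this exact shape.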
How does the system turn a YouTube URL into auditable evidence for compliance checking?
What role do the PDFs and Azure AI Search play in the audit step?
How does LangGraph enforce correctness across the pipeline?
Why is the LLM output constrained to JSON, and how is that handled?
What observability stack is used, and what does each tool help debug?
What does the system return to the client after auditing?
Review Questions
- What exact evidence fields (transcript, OCR, metadata) are extracted by Azure Video Indexer, and how are they combined for retrieval?
- Describe the hybrid retrieval flow: how does Azure AI Search decide which PDF chunks to send to the LLM?
- Where do Azure Application Insights and LangSmith each fit in the debugging workflow, and what kinds of failures do they help isolate?
Key Points
1. Use Azure Video Indexer to extract both transcript and OCR from the uploaded MP4; the compliance audit depends on these two modalities.
2. Index the two rule PDFs into Azure AI Search using chunking plus embeddings, then retrieve only the most relevant excerpts during each audit.
3. Implement the audit as a LangGraph stateful workflow with a typed state schema so nodes pass structured data (not free-form text).
4. Constrain the LLM to output strict JSON and add a cleanup step to remove Markdown code fences before parsing.
5. Store temporary video files in Azure Blob Storage so Video Indexer can access them reliably.
6. Add observability at two levels: Azure Application Insights for system latency/errors and LangSmith for LLM chain tracing and token-level debugging.
7. Expose the workflow through FastAPI with Pydantic models so incoming requests and outgoing audit responses are validated.
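Key point 4 above calls for stripping Markdown code fences before parsing, since chat models often wrap their "strict JSON" in ```json fences. A minimal sketch of that cleanup step (the helper name is hypothetical):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Strip optional ```json ... ``` fences that models often wrap around
    structured output, then parse the remainder as JSON."""
    cleaned = raw.strip()
    # Remove a leading fence such as ``` or ```json, and a trailing ```
    cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)

# Works whether or not the model added fences.
verdict = parse_llm_json('```json\n{"status": "fail", "issues": []}\n```')
```

Without this step, `json.loads` raises on the very first backtick, so a single fenced response would fail the whole audit.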