
End To End Multimodal LLMOPS Project Azure Deployment With Observability And Orchestration Engine

Krish Naik · 6 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use Azure Video Indexer to extract both transcript and OCR from the uploaded MP4; the compliance audit depends on these two modalities.

Briefing

A production-grade multimodal compliance system is built to judge whether a YouTube brand advertisement follows disclosure and advertising rules—using Azure video understanding, retrieval over legal/brand guidelines, and an agentic workflow with end-to-end observability. The core workflow takes a YouTube URL, downloads the video, extracts both spoken transcript and on-screen text via Azure Video Indexer (OCR), retrieves the most relevant rule excerpts from two PDF “rule books” using Azure AI Search vector/keyword search, and then asks an Azure-hosted GPT-4 model to output a structured pass/fail verdict plus specific violations.

The compliance rules come from two documents: “Disclosures 101 for Social Media Influencers” (framed like FTC guidance to prevent deceptive ads and endorsement violations) and YouTube ad specification guidelines. Those PDFs are chunked, embedded, and indexed into Azure AI Search so the auditor model doesn’t need to read entire documents. Instead, the system performs hybrid retrieval—using both embeddings and keyword matching—to pull the most relevant pages when the video mentions key concepts (e.g., sunscreen). The extracted evidence from the ad (transcript + OCR text + selected video metadata such as duration and platform) becomes the input to a RAG-style audit step.

The agentic orchestration is implemented with LangGraph as a stateful graph. A typed “video audit state” carries the session’s inputs and outputs through the pipeline, including the video URL, a derived video ID, extracted transcript/OCR, compliance results (a structured list of issues with category, severity, description, and optional timestamps), and system-level errors. The graph is intentionally linear: an “indexer node” runs first to download and process the video, then an “auditor node” runs to perform retrieval-augmented compliance evaluation, and finally the graph ends with a final report in Markdown plus a binary status.

Azure services form the infrastructure backbone. Azure Blob Storage holds the temporary downloaded MP4 so Azure Video Indexer can access it. Azure Video Indexer converts the raw video into transcript and OCR text, and also returns summarized video metadata. Azure OpenAI (hosted in Azure AI Foundry) provides the GPT-4 model for reasoning and the embedding model (“text-embedding-3-small”) for vectorization. Azure AI Search acts as the knowledge base for the chunked PDFs, supporting hybrid vector search. For monitoring, Azure Application Insights tracks request-level performance, errors, and latency, while LangSmith traces the LLM chain details (inputs, outputs, token counts, and timing) for debugging and portfolio-ready proof of engineering rigor.

Implementation details emphasize operational readiness: a project directory is created with a backend module, environment variables are defined for every Azure credential (storage connection string, search endpoint/key/index name, Foundry model endpoints/versions, Video Indexer identifiers, Application Insights connection string, and LangSmith tracing keys), and the PDF indexing script validates required env vars, chunks text (chunk size 1000, overlap 200), and uploads splits to the vector index. The video indexer service handles authentication via Azure identity and token exchange, downloads YouTube content using yt-dlp, uploads it to Video Indexer, polls until processing completes, and extracts only the transcript/OCR/metadata fields needed for auditing.
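
As a minimal sketch of that fail-fast validation step, the indexing script might check credentials like this. The variable names below are assumptions standing in for the project's actual .env keys, not copied from the video:

```python
import os

# Assumed variable names; the project's actual .env keys may differ.
REQUIRED_ENV_VARS = [
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_INDEX_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_VIDEO_INDEXER_ACCOUNT_ID",
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
    "LANGSMITH_API_KEY",
]

def validate_env() -> None:
    """Fail fast at startup if any required credential is missing."""
    missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
    if missing:
        raise EnvironmentError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```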

When executed, the pipeline produces a structured compliance audit. In the demonstrated run, the verdict is “fail,” with critical issues such as unsubstantiated scientific claims and missing endorsement disclosures (e.g., failing to disclose John Cena as a paid endorser/partner). The system also surfaces observability artifacts: LangSmith shows the traced request, retrieved evidence, prompts, and JSON outputs; Azure Application Insights provides an application map with timing across FastAPI, Azure Video Indexer, and other outbound calls. The result is a reusable end-to-end multimodal LLMOps template for compliance auditing with orchestration, retrieval, and production monitoring baked in.

Cornell Notes

The project builds an end-to-end multimodal compliance audit pipeline for YouTube brand ads. A LangGraph stateful workflow orchestrates two main nodes: an indexer node that downloads a YouTube video, uploads it to Azure Video Indexer, and extracts transcript (speech-to-text) plus OCR (on-screen text), and an auditor node that performs hybrid retrieval over two PDF rule books using Azure AI Search and then uses Azure-hosted GPT-4 to produce a structured pass/fail verdict. The output includes a Markdown final report and a list of compliance issues with category, severity, and description. Azure Blob Storage, Azure OpenAI (GPT-4 + embeddings), Azure AI Search, Azure Application Insights, and LangSmith provide the infrastructure, reasoning, knowledge base, and observability needed for production-grade operation.

How does the system turn a YouTube URL into auditable evidence for compliance checking?

It starts with a YouTube URL passed into a FastAPI endpoint or CLI simulation. A video indexer service downloads the video locally using yt-dlp and then uploads the MP4 to Azure Video Indexer via its REST API. Azure Video Indexer extracts two key modalities: (1) transcript—what’s spoken in the ad—and (2) OCR—text appearing on screen (logos, claims, disclaimers). The service also extracts summarized video metadata (e.g., duration and platform) and returns a cleaned structure containing transcript lines, OCR text lines, and metadata.
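
A condensed sketch of the download step using yt-dlp's Python API is below; the function name and output directory are illustrative, and the real service adds retries, the Blob Storage upload, and temp-file cleanup:

```python
from yt_dlp import YoutubeDL

def download_video(url: str, out_dir: str = "downloads") -> str:
    """Download a YouTube ad as MP4 and return the local file path."""
    opts = {
        "format": "mp4",                       # prefer an MP4 container for Video Indexer
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "quiet": True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)  # download and collect metadata
        return ydl.prepare_filename(info)            # path of the saved MP4
```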

What role do the PDFs and Azure AI Search play in the audit step?

The compliance rules come from two PDFs: “Disclosures 101 for Social Media Influencers” (FTC-style endorsement/disclosure guidance) and YouTube ad specification guidelines. A separate indexing script loads each PDF, chunks text (chunk size 1000, overlap 200), embeds chunks using Azure OpenAI embeddings (“text-embedding-3-small”), and uploads them to Azure AI Search. During auditing, the auditor node forms a query by combining transcript and OCR text, then runs hybrid retrieval (keyword + vector embeddings) to fetch the most relevant rule excerpts instead of sending entire documents to the LLM.
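
Assuming the script uses LangChain's Azure AI Search integration (one plausible implementation; the video's exact code may differ), the indexing step could be sketched as:

```python
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch

def index_pdf(pdf_path: str) -> None:
    """Chunk one rule-book PDF and upload the splits to the vector index."""
    pages = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    ).split_documents(pages)

    # Endpoint/key for Azure OpenAI are read from environment variables.
    embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small")
    store = AzureSearch(
        azure_search_endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
        azure_search_key=os.environ["AZURE_SEARCH_KEY"],
        index_name=os.environ["AZURE_SEARCH_INDEX_NAME"],
        embedding_function=embeddings.embed_query,
    )
    store.add_documents(chunks)
```

At query time the same store supports hybrid retrieval (e.g., `similarity_search(query, search_type="hybrid")`), matching the keyword-plus-vector behavior described above.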

How does LangGraph enforce correctness across the pipeline?

LangGraph uses a typed, stateful “video audit state” (a TypedDict schema) that carries inputs and outputs across nodes. The indexer node updates fields like local file path, video metadata, transcript, OCR text, and system errors. The auditor node reads transcript/OCR and writes compliance results, final status (pass/fail), and the final Markdown report. Because nodes must accept and return data matching the schema, the workflow reduces the risk of unstructured or incompatible outputs reaching the API layer.
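
A minimal sketch of that schema and the linear two-node graph, with field names inferred from the description above rather than copied from the project:

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class VideoAuditState(TypedDict, total=False):
    video_url: str
    video_id: str
    transcript: str
    ocr_text: str
    compliance_results: List[dict]  # category, severity, description, timestamp
    final_status: str               # "pass" | "fail"
    final_report: str               # Markdown report
    system_errors: List[str]

def indexer_node(state: VideoAuditState) -> dict:
    # Download via yt-dlp, upload to Azure Video Indexer, poll until done,
    # then extract transcript and OCR (stubbed here).
    return {"transcript": "...", "ocr_text": "..."}

def auditor_node(state: VideoAuditState) -> dict:
    # Hybrid retrieval over the rule index, then a GPT-4 structured audit.
    return {"final_status": "fail", "final_report": "# Audit Report\n..."}

graph = StateGraph(VideoAuditState)
graph.add_node("indexer", indexer_node)
graph.add_node("auditor", auditor_node)
graph.add_edge(START, "indexer")
graph.add_edge("indexer", "auditor")
graph.add_edge("auditor", END)
app = graph.compile()  # app.invoke({"video_url": ...}) runs the whole pipeline
```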

Why is the LLM output constrained to JSON, and how is that handled?

The auditor node instructs the model to return strictly JSON in a predefined structure (compliance results list with category/severity/description, plus final status and final report). Since LLMs sometimes wrap JSON in Markdown code fences, the code uses a regular expression to strip the backticks and then parses the cleaned response as JSON. If parsing or generation fails, the system logs the raw response, records a system error, and marks the audit as failed.
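
One way to implement that cleanup step (a sketch; the exact regex in the project may differ):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Strip Markdown code fences the model may emit, then parse as JSON."""
    # Remove a leading fence of three backticks (optionally tagged "json")
    # and a trailing fence of three backticks.
    cleaned = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # In the pipeline this would log the raw response and record a
        # system error so the audit is marked failed instead of crashing.
        raise
```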

What observability stack is used, and what does each tool help debug?

Azure Application Insights provides monitoring for the running service: it tracks latency, errors, and request-level behavior and renders an application map showing outbound calls (FastAPI → Azure Video Indexer, etc.). LangSmith traces the LangGraph/LLM chain details: inputs sent to the model, retrieved evidence, prompts, outputs, token counts, and timing. Together, they help diagnose both system-level failures (timeouts, crashes) and model-level issues (wrong retrieval, prompt problems, malformed JSON).
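
One common way to wire both layers in Python, assuming the OpenTelemetry-based Application Insights distro and LangSmith's environment-variable configuration (the project name below is illustrative):

```python
import os
from azure.monitor.opentelemetry import configure_azure_monitor

# System-level telemetry: exports traces/metrics/logs to Application Insights
# and auto-instruments FastAPI and outbound HTTP calls.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

# Model-level tracing: LangChain/LangGraph pick these up automatically and
# send chain inputs, outputs, token counts, and timings to LangSmith.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "video-compliance-audit"  # illustrative name
# LANGCHAIN_API_KEY is expected to already be set in the environment.
```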

What does the system return to the client after auditing?

The API returns an audit response containing session ID, video ID, final status (pass/fail), a final Markdown report, and a list of compliance issues (optional but typically present on failure). Each compliance issue includes severity (e.g., critical/high), category (e.g., claim validation, endorsement disclosure), and a description explaining the violation. If the video can’t be processed or transcript/OCR is unavailable, the system marks the audit as failed and generates an audit-skipped or system-error report.
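
The response could be modeled with Pydantic roughly as follows; field names are inferred from this summary rather than copied from the project:

```python
from typing import List, Optional
from pydantic import BaseModel

class ComplianceIssue(BaseModel):
    category: str                    # e.g. "endorsement disclosure"
    severity: str                    # e.g. "critical", "high"
    description: str
    timestamp: Optional[str] = None  # where in the video the issue occurs

class AuditResponse(BaseModel):
    session_id: str
    video_id: str
    final_status: str                # "pass" | "fail"
    final_report: str                # Markdown report
    compliance_issues: List[ComplianceIssue] = []
```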

Review Questions

  1. What exact evidence fields (transcript, OCR, metadata) are extracted by Azure Video Indexer, and how are they combined for retrieval?
  2. Describe the hybrid retrieval flow: how does Azure AI Search decide which PDF chunks to send to the LLM?
  3. Where do Azure Application Insights and LangSmith each fit in the debugging workflow, and what kinds of failures do they help isolate?

Key Points

  1. Use Azure Video Indexer to extract both transcript and OCR from the uploaded MP4; the compliance audit depends on these two modalities.

  2. Index the two rule PDFs into Azure AI Search using chunking plus embeddings, then retrieve only the most relevant excerpts during each audit.

  3. Implement the audit as a LangGraph stateful workflow with a typed state schema so nodes pass structured data (not free-form text).

  4. Constrain the LLM to output strict JSON and add a cleanup step to remove Markdown code fences before parsing.

  5. Store temporary video files in Azure Blob Storage so Video Indexer can access them reliably.

  6. Add observability at two levels: Azure Application Insights for system latency/errors and LangSmith for LLM chain tracing and token-level debugging.

  7. Expose the workflow through FastAPI with Pydantic models so incoming requests and outgoing audit responses are validated.

Highlights

  • The pipeline audits compliance by combining transcript + OCR evidence with hybrid retrieval over FTC-style and YouTube ad-spec PDFs, then producing a structured pass/fail verdict.
  • LangGraph’s stateful TypedDict design standardizes compliance issues (category, severity, description) and prevents unstructured outputs from breaking the API/UI.
  • Azure Video Indexer is the multimodal bridge: it converts a brand ad video into the exact text evidence the LLM needs.
  • Observability is treated as first-class engineering: Azure Application Insights shows end-to-end latency and failures, while LangSmith traces prompts, retrieval, and JSON outputs.
  • The demonstrated run returns “fail” with critical violations like unsubstantiated scientific claims and missing endorsement disclosures.

Topics

  • Multimodal Compliance Auditing
  • Azure Video Indexer OCR Transcript
  • LangGraph Orchestration
  • RAG with Azure AI Search
  • LLMOps Observability

Mentioned

  • OCR
  • RAG
  • LLM
  • API
  • DAG
  • ARM
  • VI
  • YTDLP
  • FTC
  • JSON
  • UUID