
Generative AI With LLM Models Crash Course On AWS Cloud

Krish Naik · 6 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Define the GenAI use case first (RAG, summarization, chatbot) and scope the data pipeline requirements before choosing models.

Briefing

The core takeaway is a practical end-to-end blueprint for building generative AI applications on AWS: pick a use case, choose a model strategy (Bedrock foundation models vs custom LLMs), evaluate outputs, then deploy behind an API and persist results—while also showing how to run open-source Hugging Face models on SageMaker. The walkthrough matters because it turns “LLM experimentation” into an operational workflow that fits real cloud constraints like permissions, latency, and deployment plumbing.

The session starts by laying out a generic GenAI project life cycle in four to five steps. First comes use-case definition—whether the goal is RAG, text summarization, or a chatbot—followed by scoping the required data pipeline (for RAG, that means converting PDFs into embeddings and storing them in a vector database). The next step is model selection, split into two paths: using foundation models directly (examples include Llama and other providers' models available through AWS Bedrock) or building a custom LLM from scratch. Even when foundation models are used, the workflow allows for fine-tuning and alignment via techniques like LoRA and training with human feedback.

After model choice, the workflow emphasizes “adapt and align” through evaluation—tracking performance metrics and only moving forward when quality improves. Deployment then shifts from model quality to system performance: integrate the model into applications, optimize inference, and rely on LLM Ops practices to keep responses fast and reliable. The narrative repeatedly highlights inference speed as the gating factor for usefulness.

To make the life cycle concrete, the walkthrough implements a “blog generation” application on AWS. The architecture is straightforward but production-shaped: Postman calls an API Gateway endpoint, which triggers an AWS Lambda function. Lambda invokes an Amazon Bedrock foundation model (the example uses a Llama 2 chat model via Bedrock Runtime), receives the generated text, and writes it to Amazon S3 as a timestamped text file. The process includes key operational details: creating the Lambda function with Python 3.12, installing dependencies (notably boto3), and working around Lambda’s potentially outdated default boto3 with a custom Lambda layer. It also covers a common failure mode—Bedrock invocation failing due to missing IAM permissions—and fixes it by attaching appropriate policies to the Lambda execution role.

Once the blog generation flow works end-to-end, the session expands into deploying Hugging Face models on AWS SageMaker. It walks through creating a SageMaker Studio domain, launching a JupyterLab environment, selecting an instance type (with cost implications), and using SageMaker’s Python SDK to load and deploy a Hugging Face model for inference. The example includes deploying a question-answering model (named in the transcript as distilbert-base-uncased-distilled-squad) and testing it through SageMaker endpoints.

Finally, the transcript pivots to a RAG document Q&A application using LangChain and LlamaIndex concepts. The plan is: ingest PDFs from a data folder, split them into chunks, generate embeddings using Amazon Titan via Bedrock, store them in a vector index (FAISS), and answer user questions by retrieving relevant chunks and prompting Bedrock LLMs (Claude, heard as “Cloudy” in the transcript, and Llama 2 are mentioned). A Streamlit UI ties it together with buttons to update vectors and switch between model outputs. The session closes with tooling productivity guidance via Amazon Q Developer (Amazon CodeWhisperer), emphasizing AWS-aware code suggestions for faster development.

Cornell Notes

The transcript lays out a practical GenAI project life cycle for AWS: define the use case and scope (including data prep for RAG), choose a model strategy (Bedrock foundation models vs custom LLMs), evaluate quality, then deploy behind an API with inference optimization and LLM Ops practices. A full “blog generation” example shows how Postman → API Gateway → Lambda → Amazon Bedrock produces text and saves it to Amazon S3, including real-world fixes like updating boto3 via a Lambda layer and granting IAM permissions for Bedrock invocation. The walkthrough then demonstrates deploying a Hugging Face model on SageMaker using SageMaker Studio and endpoints. It ends with a LangChain-based RAG app design: ingest PDFs, chunk them, embed with Amazon Titan, store in FAISS, retrieve relevant context, and answer via Bedrock LLMs through a Streamlit interface.

What are the main stages in a GenAI project life cycle, and how do they map to AWS implementation work?

The life cycle is organized into: (1) defining the use case (RAG, summarization, chatbot) and scoping required data workflows; (2) choosing the model approach—either using foundation models from services like AWS Bedrock or building custom LLMs (optionally fine-tuning foundation models with techniques like LoRA); (3) adapting/aligning through evaluation using performance metrics; (4) deploying and integrating into applications, where inference speed and LLM Ops practices matter; and (5) building the LLM-powered application layer (APIs/UI) on top of the deployed model. In the AWS examples, this becomes API Gateway + Lambda for orchestration, Bedrock for generation, and S3 for persistence.

How does the blog generation architecture work end-to-end on AWS?

Postman sends a request to an Amazon API Gateway endpoint. API Gateway triggers an AWS Lambda function. Lambda calls Amazon Bedrock Runtime to invoke a foundation model (the transcript uses a Llama 2 chat model via Bedrock). Lambda extracts the generated text from the Bedrock response and writes it to an Amazon S3 bucket as a timestamped .txt file. The result is both an API response and a stored artifact in S3.
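
As a concrete illustration, here is a minimal Lambda handler sketch of that flow. The model ID, request payload shape, bucket name, and event schema are assumptions for illustration rather than the exact code from the video, and Bedrock request formats differ per model family:

```python
import json
from datetime import datetime

import boto3


def generate_blog(topic: str) -> str:
    """Invoke a Llama 2 chat model on Bedrock and return the generated text."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = {
        # Payload shape follows the Llama 2 text-generation format; other model
        # families on Bedrock expect different request bodies.
        "prompt": f"Write a 200-word blog post on the topic: {topic}",
        "max_gen_len": 512,
        "temperature": 0.5,
        "top_p": 0.9,
    }
    response = bedrock.invoke_model(
        modelId="meta.llama2-13b-chat-v1",  # assumed model ID; use one your account has access to
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    result = json.loads(response["body"].read())
    return result.get("generation", "")


def lambda_handler(event, context):
    # Assumed request schema coming through API Gateway proxy integration.
    topic = json.loads(event["body"])["blog_topic"]
    blog_text = generate_blog(topic)

    # Persist the output as a timestamped .txt object in S3.
    key = f"blog-output/{datetime.now().strftime('%Y%m%d-%H%M%S')}.txt"
    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-blog-output-bucket", Key=key, Body=blog_text.encode("utf-8"))

    return {"statusCode": 200, "body": json.dumps({"s3_key": key})}
```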

Why was a Lambda layer needed, and what problem does it solve?

The transcript notes that Lambda may include an older default boto3 version, which can be insufficient for invoking newer Bedrock foundation model features. To fix this, dependencies are packaged into a custom Lambda layer: create a local folder with a python/ directory, install boto3 into that folder, zip it, upload it as a layer, and attach the layer to the Lambda function. This ensures the Lambda runtime uses the updated boto3 when calling Bedrock.
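
One quick way to confirm the layer is actually being picked up is to log the SDK version from inside the handler. A tiny sketch (the minimum version you need depends on which Bedrock features you call):

```python
import boto3
import botocore


def lambda_handler(event, context):
    # If the custom layer is attached correctly, this prints the layer's boto3
    # version rather than the version bundled with the Lambda runtime.
    print("boto3:", boto3.__version__, "botocore:", botocore.__version__)
    return {"statusCode": 200}
```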

What caused the initial Bedrock invocation failure, and how was it resolved?

The CloudWatch logs show an authorization error: the Lambda execution role lacked permission to perform the bedrock:InvokeModel action on the required resource. The fix was to update the Lambda role by attaching an appropriate IAM policy (the transcript takes the shortcut of attaching broad administrator access for the demo). After updating permissions, the same API call succeeded and the blog text was generated and saved to S3.
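
For a tighter fix than administrator access, a small inline policy granting only the invoke action can be attached to the execution role. A hedged sketch using boto3 (the role name is a placeholder, and the Resource should be scoped to specific model ARNs in real use):

```python
import json

import boto3

iam = boto3.client("iam")

# Minimal policy allowing Bedrock model invocation; narrow the Resource to the
# specific model ARN(s) you call instead of "*" in production.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": "*",
        }
    ],
}

iam.put_role_policy(
    RoleName="my-blog-lambda-role",          # placeholder execution role name
    PolicyName="AllowBedrockInvokeModel",
    PolicyDocument=json.dumps(policy),
)
```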

How does the SageMaker deployment example differ from the Bedrock approach?

Bedrock is used as a managed foundation model API invoked from Lambda. SageMaker instead hosts a Hugging Face model behind a SageMaker endpoint. The transcript shows creating a SageMaker Studio domain, launching JupyterLab, installing/using the SageMaker SDK, deploying a Hugging Face question-answering model to an instance type (with cost considerations), and testing inference via the endpoint using a structured input payload.
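
A sketch of that deployment path with the SageMaker Python SDK. The model ID matches the one named in the transcript; the instance type and framework versions are assumptions and should be set to a combination the SDK currently supports:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # resolves the role when run inside SageMaker Studio

# Pull the model straight from the Hugging Face Hub when the endpoint starts.
hub_config = {
    "HF_MODEL_ID": "distilbert-base-uncased-distilled-squad",
    "HF_TASK": "question-answering",
}

model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.37",  # assumed; pick a version trio listed as supported by the SDK
    pytorch_version="2.1",
    py_version="py310",
)

# Larger instances answer faster but cost more per hour.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

print(predictor.predict({
    "inputs": {
        "question": "What is Amazon SageMaker?",
        "context": "Amazon SageMaker is a managed service for building, training, and deploying ML models.",
    }
}))

# Tear the endpoint down when finished to stop billing.
predictor.delete_endpoint()
```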

What is the RAG pipeline described for the Streamlit document Q&A app?

The pipeline is: load PDFs from a data folder using a PDF loader; split documents into chunks with a recursive character text splitter; generate embeddings using Amazon Titan through Bedrock (via LangChain’s Bedrock embeddings wrapper); store embeddings in a vector index (FAISS); and when a user asks a question, retrieve the most relevant chunks from the vector store and pass them into a prompt template for a Bedrock LLM (examples mentioned include Claude/“Cloudy” and Llama 2). Streamlit provides buttons to update vectors and to select which model to use for answering.
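
A condensed sketch of the ingestion-and-retrieval half of that pipeline using LangChain's Bedrock integrations. Module paths, the Titan model ID, and chunking parameters are assumptions that vary across LangChain versions:

```python
import boto3
from langchain_community.document_loaders import PyPDFDirectoryLoader  # requires pypdf
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock)

# 1. Ingest the PDFs from the local data folder and split them into chunks.
docs = PyPDFDirectoryLoader("data").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 2. Embed the chunks with Titan and persist the FAISS index locally.
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")

# 3. At question time, retrieve the most relevant chunks and stuff them into a
#    prompt for a Bedrock chat model (prompting is analogous to the blog-generation
#    handler sketched earlier; only retrieval is shown here).
question = "What does the document say about model evaluation?"
relevant_chunks = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
print(context[:500])
```

In the Streamlit app, the “update vectors” button would re-run the ingestion and indexing steps, while the model-selection buttons would route the retrieved context into prompts for different Bedrock LLMs.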

Review Questions

  1. If you were building a RAG app on AWS, which life-cycle stage is responsible for converting PDFs into embeddings and storing them in a vector database?
  2. In the blog generation flow, where do IAM permissions need to be granted for Bedrock invocation, and what symptom appears when they are missing?
  3. Compare the operational responsibilities of using Bedrock (Lambda invocation) versus deploying a Hugging Face model on SageMaker endpoints. What changes in deployment and testing?

Key Points

  1. Define the GenAI use case first (RAG, summarization, chatbot) and scope the data pipeline requirements before choosing models.

  2. Select a model strategy: use Bedrock foundation models directly for many cases, or fine-tune and align (e.g., LoRA, human feedback) when behavior must match business data.

  3. Evaluate model quality with measurable metrics and only proceed to deployment after performance improves.

  4. Deploy behind an API and treat inference speed as a hard requirement; integrate with application layers and use LLM Ops practices to keep responses reliable.

  5. For AWS Lambda + Bedrock, ensure the Lambda execution role has IAM permission for the bedrock:InvokeModel action; check CloudWatch logs when failures occur.

  6. When Lambda needs newer SDK behavior, package updated dependencies (like boto3) into a Lambda layer rather than relying on the default runtime.

  7. For open-source models, SageMaker Studio + endpoints provide a full train/deploy/infer workflow for Hugging Face models, but instance choice directly impacts cost.

Highlights

The blog-generation architecture is a clean production pattern: Postman → API Gateway → Lambda → Bedrock → S3 (timestamped .txt output).
A common real-world blocker is IAM: Bedrock invocation fails until the Lambda role is granted the bedrock:InvokeModel permission.
Lambda layers solve dependency drift—updating boto3 inside a layer avoids issues with older default packages.
SageMaker deployment shifts from “managed model APIs” to “hosted endpoints,” requiring instance selection and endpoint testing.
The RAG app design uses a standard loop: PDF ingestion → chunking → Titan embeddings → FAISS retrieval → Bedrock LLM answering via Streamlit UI.

Topics

  • GenAI Project Life Cycle
  • AWS Bedrock Blog Generation
  • AWS Lambda Bedrock Integration
  • SageMaker Hugging Face Deployment
  • RAG with LangChain and FAISS

Mentioned

  • Krish Naik
  • LLM
  • RAG
  • LoRA
  • IAM
  • API
  • S3
  • SDK
  • FAISS
  • UI
  • LLM Ops
  • CPU
  • GPU
  • ARN