
Open Source AI Inference API w/ Together

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Together’s inference API is presented as a fast, consistent way to access open-source models for text, chat/instruct, and code without building local inference infrastructure first.

Briefing

Together’s inference API is positioned as a fast, reliable way to run open-source text, chat, image, and code models without building and hosting your own stack—while keeping an escape hatch to run the same models locally later. The core pitch is practical: for any given model, pricing and speed can shift across providers, so having an API that can swap models quickly matters. Based on the creator’s testing, Together delivers consistently quick token generation and dependable responses, often outpacing other hosted options they tried.

A major reason to use an API at all is convenience and iteration speed. Even though the models behind Together are open source (and can be downloaded and run on your own hardware), standing up local inference—especially for larger models—takes time and compute. The workflow starts with creating an account, setting up billing, and retrieving an API key from settings. Developers then install the Together Python package (via pip) and can begin with simple calls—like listing available models and hitting a “playground” first to validate prompt behavior before writing production code.
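The video walks through this in the Playground and a notebook rather than as a single script, but a minimal sketch of the same setup with the Together Python SDK might look like the following (method names such as Together and models.list reflect the current SDK and may differ from the version shown in the video):

```python
# pip install together
import os

from together import Together

# Assumes the API key copied from the Together settings page is exported
# as TOGETHER_API_KEY in the environment.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# List the models currently available through the API.
for model in client.models.list():
    print(model.id)
```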

Model selection is framed as both broad and fast-moving. Together adds new models regularly, and the transcript highlights Mixtral, Mistral AI's 8x7B mixture-of-experts model. For text generation models, the key operational detail is prompt structure: each model may require a specific formatting style, so checking the Together Playground (or the referenced Hugging Face page) becomes part of the setup. The transcript also emphasizes streaming: instead of waiting for a full response, the API can stream tokens as they're generated, improving user experience for chat-like applications where output may be long.
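A rough sketch of that streaming behaviour, again using the current SDK conventions (the stream=True flag and the chunk structure are assumptions about today's client rather than the video's exact code, and the Mixtral identifier is just one of the hosted options):

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries only the newly generated piece of text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```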

Beyond plain completion, Together offers instruct and chat variants. The transcript stresses that these are still text-generation models under the hood; they’re trained to follow particular prompt conventions that encourage structured dialogue. It also notes that Together models can be queried through an OpenAI-compatible package, which can ease migration for codebases built around OpenAI-style endpoints.
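For the OpenAI-compatible route, a sketch along these lines should be close; the base URL is Together's documented OpenAI-compatible endpoint, but verify it and the model name against the current docs before relying on them:

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "One sentence on open-source inference, please."}],
)
print(response.choices[0].message.content)
```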

For instruct-style prompting, the transcript walks through a smaller RedPajama instruct model (tied to a GPT-J lineage) and discusses zero-shot vs one-shot examples using question/answer patterns. It also highlights the importance of stop sequences to prevent the model from drifting into extra text. When moving to code, the transcript spotlights Code Llama 34B Python, describing it as fast and accurate for Python generation and noting that larger coding models can be tens of gigabytes and slow to run locally.
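A hedged sketch of the one-shot pattern with a stop sequence (the RedPajama model identifier is a placeholder for the instruct model used in the video, and the completions call shape follows the current SDK rather than the exact code shown):

```python
from together import Together

client = Together()

# One-shot prompt: a single worked Q/A example, then the real question.
prompt = (
    "Q: how many days are in a week\n"
    "A: seven\n"
    "Q: how many eggs are in a dozen\n"
    "A:"
)

response = client.completions.create(
    model="togethercomputer/RedPajama-INCITE-7B-Instruct",  # placeholder instruct model
    prompt=prompt,
    max_tokens=16,
    stop=["\n"],  # treat the newline as the end of the answer
)
print(response.choices[0].text.strip())
```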

Finally, the transcript demonstrates building a “terminal-capable” assistant: using a system prompt plus example formatting to elicit bash commands, extracting those commands with regex, and executing them via os.system. It suggests a multi-model approach—using one model to generate, another to evaluate, and a third to moderate—while also proposing conversation-history management and summarization once context grows beyond practical limits. The overall takeaway is an engineering workflow: prototype in the Playground, integrate via the API with streaming and stop controls, and only then decide whether to host locally for privacy or compliance needs.
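The history-management idea could be sketched roughly as below; the 20,000-token budget comes from the transcript, while the counting heuristic, the summarizer callable, and the turn format are illustrative assumptions:

```python
def build_prompt(history, user_message, summarize, token_budget=20_000):
    """Assemble a prompt from prior turns, summarizing older ones when the
    rough token estimate approaches the budget (illustrative sketch only)."""
    # Crude token estimate: roughly four characters per token.
    estimated_tokens = sum(len(turn) for turn in history) // 4
    if estimated_tokens > token_budget:
        # Collapse older turns into a single summary turn to free up context.
        summary = summarize("Summarize this conversation:\n" + "\n".join(history))
        history = [f"Summary of earlier conversation: {summary}"]
    history.append(f"User: {user_message}")
    return "\n".join(history), history
```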

Cornell Notes

Together’s inference API provides a fast way to use open-source models for text generation, chat/instruct, and code without hosting them locally. The workflow centers on API keys, installing the Together Python package, and validating model-specific prompt formats in the Together Playground before writing code. Streaming token output improves responsiveness for chat-style apps, and stop sequences help keep instruct/chat models from generating unwanted extra text. The transcript also shows practical prompting patterns (zero-shot and one-shot) and a code-assistant approach that extracts bash commands from model output and executes them. Because the underlying models are open source, projects can later be moved to local hardware for privacy or compliance.

Why use an inference API if the models are open source and can be run locally?

The transcript frames local hosting as possible but slow and operationally heavy—especially for larger models. Standing up inference for new models can take significant time (described as roughly an hour per model for setup/testing), while Together enables near-instant experimentation via the Playground and API. The API also supports switching providers/models as speed and cost change, and it can be faster and more reliable than alternatives the author tested. If privacy or compliance becomes a requirement, the same open-source models can be downloaded and run on your own hardware later.

What makes prompt formatting a “first-class” concern when using Together models?

Even though all these systems are text-generation models, each model may require a specific prompt structure. The transcript emphasizes checking the Together Playground for the model’s “path” and any prompt-format guidance, or following the Hugging Face link when needed. For chat/instruct models, the training encourages structured behavior only when inputs match the expected format; otherwise outputs can drift or become inconsistent.
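As a concrete example of what "specific prompt structure" means: Mistral-family instruct models document an [INST] ... [/INST] wrapper around user turns, while the RedPajama instruct example in the video uses a plain Q:/A: pattern. A small, hypothetical helper can keep that per-model detail out of application code:

```python
# Hypothetical helper: wrap raw user text in the prompt format a model expects.
# The template strings below follow the models' public documentation and the
# Q/A pattern used in the video; check the Playground before relying on them.
PROMPT_FORMATS = {
    "mistralai/Mixtral-8x7B-Instruct-v0.1": "[INST] {message} [/INST]",
    "togethercomputer/RedPajama-INCITE-7B-Instruct": "Q: {message}\nA:",
}

def format_prompt(model: str, message: str) -> str:
    template = PROMPT_FORMATS.get(model, "{message}")
    return template.format(message=message)
```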

How does streaming change the user experience for chat or long outputs?

Instead of waiting for a full completion, the API can stream tokens as they’re generated. That means the first words appear immediately, which is especially helpful for interactive chat where responses can be long. The transcript also notes that token generation is fast enough that a human typically won’t read slower than the model outputs, making the experience feel seamless.

What’s the difference between zero-shot and one-shot prompting in the instruct example?

Zero-shot prompting asks the question without showing any prior examples (e.g., “Q: how many eggs are in a dozen” with no earlier Q/A). One-shot prompting includes a single example of the desired behavior (e.g., “Q: how many days are in a week” followed by “A: seven”), then asks the real question. The transcript uses these patterns to show how instruct models can follow the expected Q/A structure.

Why are stop sequences important when prompting instruct/chat models?

In the transcript’s instruct-model test, the model sometimes continues beyond the intended answer. Stop sequences let the developer define where generation should end—such as treating a newline as the end of the response—so the output stays bounded and easier to parse for downstream logic.

How does the “terminal-capable” assistant work with a code model?

The approach uses a system prompt plus an example to steer the model to output bash commands. After receiving the model’s text, the code extracts bash commands using regex and executes them with os.system. The transcript highlights that this requires careful prompting and output handling because the model may otherwise generate additional “future” turns; forcing the response to begin with bash-command structure helps keep it aligned.
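A heavily simplified sketch of that loop is below; the system prompt, the fenced-bash convention, the regex, and the Code Llama instruct identifier are all assumptions about how such an assistant might be wired up, not the video's exact code, and running model-generated commands through os.system is only safe with careful review:

```python
import os
import re

from together import Together

client = Together()

SYSTEM = (
    "You are a terminal assistant. Reply only with the bash commands needed, "
    "inside a fenced ```bash block."
)

def run_assistant(task: str) -> None:
    response = client.chat.completions.create(
        model="codellama/CodeLlama-34b-Instruct-hf",  # placeholder code model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": task},
        ],
        max_tokens=256,
    )
    text = response.choices[0].message.content
    # Pull every fenced bash block out of the reply and run it line by line.
    for block in re.findall(r"```bash\n(.*?)```", text, re.DOTALL):
        for command in block.strip().splitlines():
            print(f"$ {command}")
            os.system(command)  # executes model-generated commands: review before use
```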

Review Questions

  1. What steps in the Together workflow help ensure you’re using the correct prompt format for a given model?
  2. How do streaming and stop sequences each improve reliability and usability in chat-style applications?
  3. What engineering safeguards are needed when executing model-generated commands (e.g., parsing, stopping, and limiting context)?

Key Points

  1. Together’s inference API is presented as a fast, consistent way to access open-source models for text, chat/instruct, and code without building local inference infrastructure first.
  2. Model-specific prompt structure matters; the Playground and Hugging Face references are used to confirm the correct formatting before coding.
  3. Token streaming improves perceived responsiveness for chat and long outputs by delivering partial results immediately.
  4. Stop sequences help prevent instruct/chat models from generating extra, unintended text beyond the desired answer boundary.
  5. Instruct prompting benefits from zero-shot vs one-shot examples to teach the expected Q/A or instruction-following pattern.
  6. Large code models can be impractical to run locally due to size and setup time, making hosted inference attractive for rapid iteration.
  7. A practical code-assistant pattern is to prompt for bash commands, extract them from the model output, and execute them, optionally using additional models to evaluate or moderate results.

Highlights

Together supports streaming token output, enabling chat-like interfaces where the first tokens arrive immediately rather than after a full completion.
Instruct/chat behavior depends on matching the model’s expected prompt format, even though the underlying mechanism is still text generation.
The transcript demonstrates a command-execution workflow: prompt for bash, parse commands with regex, then run them via os.system.
Code Llama 34B Python is described as fast and accurate via Together, while similar-sized models can be tens of gigabytes and slow to run locally.
Conversation history can be managed by building prompt context from prior turns and summarizing when token limits approach (e.g., around 20,000 tokens).

Topics

  • Together Inference API
  • Prompt Formatting
  • Streaming Tokens
  • Instruct vs Chat
  • Code Generation
  • Command Execution
  • Model Switching
