
Voice Assistant in MIT App Inventor powered by ChatGPT | ChatGPT MIT App Inventor | #openAI #chatgpt

Obsidian Soft
5 min read

Based on Obsidian Soft's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Generate an OpenAI API key and paste it into MIT App Inventor requests using the “Bearer” authorization field; losing the key requires creating a new one.

Briefing

A practical way to build an Alexa/Siri-style voice chatbot in MIT App Inventor is to replace hard-coded if/else replies with live calls to OpenAI’s ChatGPT API: spoken questions are converted to text, sent to ChatGPT, and the returned response is converted back into speech while an animated “talking” avatar is swapped in. The payoff is a chatbot that can generate new answers on demand rather than being limited to a fixed script.

The setup starts with creating an OpenAI account and choosing a tier that fits the intended usage. After signing in, the workflow centers on generating an API key from the authentication section of the OpenAI documentation. That key must be copied carefully because it can’t be retrieved later; losing it means creating a new one. With the key in hand, the project uses MIT App Inventor blocks to make HTTP requests to the ChatGPT servers. A curl-to-blocks conversion step translates the documentation’s curl (“client URL”) example into the request structure, which is then pasted into the MIT App Inventor web request blocks. The tutorial also highlights JSON as the key-value data format used to interpret responses.
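To make the request structure concrete, here is a minimal Python sketch of what the App Inventor web request blocks send. The endpoint, model name, and field values are assumptions based on the OpenAI completions-style API the tutorial appears to use; check the current API documentation before relying on them.

```python
import json

API_KEY = "sk-..."  # placeholder: paste your own secret key here

def build_request(prompt):
    """Return the URL, headers, and JSON body for a completion call,
    mirroring what the App Inventor Web component is configured with."""
    url = "https://api.openai.com/v1/completions"  # assumed endpoint
    headers = {
        "Content-Type": "application/json",
        # The API key goes in the Authorization header, prefixed with "Bearer".
        "Authorization": "Bearer " + API_KEY,
    }
    body = json.dumps({
        "model": "text-davinci-003",  # assumed model from the tutorial era
        "prompt": prompt,             # the recognized speech text
        "temperature": 0,             # 0 = controlled, 0.9 = creative
        "max_tokens": 200,            # bounds response length, per the tutorial
    })
    return url, headers, body

url, headers, body = build_request("What is MIT App Inventor?")
```

In App Inventor, the same information is split across the Web component’s `Url`, `RequestHeaders`, and `PostText` properties.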

On the MIT App Inventor side, the app layout includes a WebViewer for displaying animated GIFs, a “Speak” button, and a text-to-speech component. For voice input, it imports a continuous speech recognition extension (downloaded as an AIX file) to capture speech without repeatedly showing Google’s speech dialog. The UI also uses image assets: a GIF is split into frames to extract a “girl still” image, while a separate “girl talking” GIF is used when the bot is speaking.

In the block logic, pressing the Speak button triggers the speech recognizer to produce recognized text. That recognized text becomes the prompt sent to the ChatGPT API via a dedicated procedure (named for sending to ChatGPT). When the API returns a response, the app parses the JSON-like dictionary structure to extract the actual answer content—using dictionary “get value for key” steps and a JSON text decode operation. The app then updates the avatar by switching the WebViewer to the talking GIF, speaks the extracted response using text-to-speech, and finally switches the avatar back to the still image after speaking completes.
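The parsing steps above can be sketched in Python. The response shape below is a hypothetical example modeled on the completions API; the real structure returned to the app may differ slightly.

```python
import json

# Hypothetical raw response text, shaped like a completions API reply.
raw = '{"choices": [{"text": "Hello! How can I help you today?"}]}'

response = json.loads(raw)             # App Inventor: "JSON text decode"
choices = response.get("choices", [])  # "get value for key ... or if not found"
# App Inventor lists are 1-indexed, so the blocks select index 1;
# the equivalent Python element is [0].
answer = choices[0].get("text", "") if choices else ""
print(answer)
```

The `get(..., default)` calls play the role of App Inventor’s “get value for key in dictionary or if not found” block, so a malformed response yields an empty string instead of a crash.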

Two configuration details matter for output quality: the “temperature” value controls randomness (the tutorial suggests 0 for more controlled replies and 0.9 for more creative ones), and the token limit is set using a numeric block (suggested around 200) to avoid errors and to bound response length. The tutorial also notes platform limits: the approach won’t work on iPhone, and Android testing is required. For newer Android versions, the APK may need manifest permission edits (record audio) using an APK editor, and the app must be granted microphone permission at runtime.

Overall, the build turns MIT App Inventor into a full voice loop—speech-to-text, ChatGPT text generation, and text-to-speech—while keeping the logic modular enough to swap assets and tune response behavior through temperature and token settings.

Cornell Notes

The core build replaces scripted chatbot replies with live ChatGPT API calls inside an MIT App Inventor voice app. Speech recognition converts what the user says into text, that text is sent as a prompt to ChatGPT using an API key, and the returned response is parsed from JSON/dictionaries to extract the answer. The app then uses text-to-speech to speak the answer and switches between “girl still” and “girl talking” images in a WebViewer to match speaking. Temperature and token limits control how creative and how long the responses are. The setup is Android-focused and may require APK manifest permission edits for microphone access on newer Android versions.

Why does the project generate and store an OpenAI API key, and what happens if it’s lost?

The API key is required to authenticate requests from MIT App Inventor to ChatGPT servers. The tutorial stresses that the key is secret and not fully displayed; if it’s forgotten, the same key can’t be retrieved, so a new secret key must be generated and used in the app’s “Bearer” authorization field.

How does the app turn spoken input into a ChatGPT prompt?

Pressing the Speak button triggers an extension-based speech recognizer. After getting text, the recognized speech result is passed into a procedure (named for sending to ChatGPT) as an input parameter called text. That input becomes the prompt sent to the ChatGPT API request blocks.

What’s the role of JSON/dictionaries in extracting the ChatGPT answer?

ChatGPT returns a structured response that MIT App Inventor interprets as a dictionary. The logic uses dictionary operations like “get value for key in dictionary or if not found,” then follows nested keys (including a “choices” key) to reach the generated content. A JSON text decode step converts the response content into a form the app can index; the tutorial uses index 1, since App Inventor lists are 1-indexed.

How does the app synchronize the avatar animation with speech output?

When the API response arrives, the WebViewer is switched to the “girl talking” asset. After text-to-speech finishes (using the speaking event), the WebViewer is switched back to the “girl still” asset. This creates a simple but effective lip-sync-like behavior tied to the speech lifecycle.
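This event lifecycle can be sketched as a tiny state machine. The asset names come from the tutorial; the event wiring below is illustrative Python standing in for App Inventor’s event blocks.

```python
STILL = "girl_still.png"    # frame extracted from the GIF
TALKING = "girl_talking.gif"

class Avatar:
    """Mirrors the App Inventor events that drive the WebViewer asset."""
    def __init__(self):
        self.current = STILL          # app starts with the still image

    def on_response(self, text):
        self.current = TALKING        # reply arrived: show the talking GIF
        self.speak(text)

    def speak(self, text):
        pass                          # stands in for TextToSpeech.Speak

    def after_speaking(self):
        self.current = STILL          # TextToSpeech "after speaking" event

avatar = Avatar()
avatar.on_response("Hi there!")       # avatar now shows the talking GIF
avatar.after_speaking()               # speech finished: back to still image
```

Because the switch back is tied to the text-to-speech completion event rather than a timer, the animation length always matches the spoken reply.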

Which settings control response style and why are they implemented as numeric blocks?

Temperature controls randomness: 0 yields more controlled responses, while 0.9 encourages more creative output. Token limit is set to a numeric value (the tutorial suggests 200) to bound response length and to avoid block-type errors—temperature and token values must be numbers rather than text blocks.
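A quick sketch shows why the block type matters: a text block serializes as a quoted JSON string, which the API will reject for numeric fields, while a number block serializes as a bare number.

```python
import json

# Number blocks serialize as JSON numbers (what the API expects).
good = json.dumps({"temperature": 0.9, "max_tokens": 200})
# Text blocks serialize as quoted strings (rejected for numeric fields).
bad = json.dumps({"temperature": "0.9", "max_tokens": "200"})

print(good)  # {"temperature": 0.9, "max_tokens": 200}
print(bad)   # {"temperature": "0.9", "max_tokens": "200"}
```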

What platform and permission constraints affect whether the app works?

The tutorial warns it won’t work on iPhone. On Android, microphone access is required: the app must request audio permission, and for newer Android versions the APK may need an Android manifest edit (record audio permission) using an APK editor. Without these steps, speech recognition won’t function.

Review Questions

  1. In what order do speech recognition, ChatGPT API calling, JSON/dictionary parsing, and text-to-speech occur in the MIT App Inventor blocks?
  2. How do temperature and token limits change the chatbot’s responses, and what block types must they be set to?
  3. What specific dictionary keys and indexing steps are used to extract the generated answer from the ChatGPT response structure?

Key Points

  1. Generate an OpenAI API key and paste it into MIT App Inventor requests using the “Bearer” authorization field; losing the key requires creating a new one.
  2. Use curl-to-blocks output to build the HTTP request structure, then send the recognized speech text as the ChatGPT prompt.
  3. Set temperature and token limits as numeric blocks (e.g., temperature 0 or 0.9; tokens around 200) to control creativity and avoid block errors.
  4. Parse the ChatGPT response via JSON/dictionary operations to extract the actual answer content from nested keys like “choices” and a specific index.
  5. Switch the WebViewer between “girl still” and “girl talking” assets based on the speech lifecycle (before and after text-to-speech).
  6. Rely on an imported speech recognition extension to capture speech without repeatedly showing Google’s speech dialog.
  7. Test on Android and ensure microphone permissions are granted; newer Android versions may require editing the APK manifest for record audio permission.

Highlights

The chatbot logic replaces multiple if/else branches with a single API call: recognized speech becomes a prompt, and ChatGPT generates the reply dynamically.
A JSON/dictionary decoding chain pulls the answer out of the structured ChatGPT response—then the app speaks it and swaps avatar states accordingly.
Temperature (0 vs 0.9) is used as a direct control knob for how predictable versus creative the bot’s responses feel.
Android-only support is emphasized, with potential APK manifest edits needed to make microphone recording work on newer devices.
