How to Build a Real-Time AI Voice Assistant: Node.js, React, Groq, and Deepgram vs. OpenAI’s New Real-Time API
In this post, we will build an AI-powered voice assistant that responds to user input in real time. The assistant uses WebSockets for low-latency, bi-directional communication between the client and the server. We’ll leverage Deepgram for live transcription, Groq for generating AI responses, and Play.ht for converting text responses into speech. Groq's Language Processing Units (LPUs), specialized hardware designed for large language models (LLMs), significantly reduce inference latency, ensuring fast processing and real-time response generation.
By using WebSockets, Deepgram, and Groq's LPUs, we create a highly optimized flow that minimizes delays in transcription, response generation, and speech synthesis.
We’ll break this down into the following steps:
Setting up the Server (Node.js, Express)
Configuring Deepgram for Live Transcription
Integrating Groq for AI Responses
Converting AI Responses to Speech with Play.ht
Streaming Audio Back to the Client
Setting up Client-Side Audio Recording and Streaming
Session and Stream Management
Scalability and Optimization for Production Use
OpenAI's Real-Time API vs. Custom Approach: Which Suits Your Needs?
How We Achieve Real-Time Responses
To achieve real-time interaction with minimal latency:
WebSocket Connections: WebSockets allow full-duplex communication between the client and server, reducing the overhead of traditional HTTP requests and making the interaction more fluid and instantaneous.
Groq's LPUs: Groq has developed Language Processing Units (LPUs), specialized hardware for running large language models (LLMs). LPUs let Groq serve AI responses faster than typical cloud-based inference, so the assistant can reply quickly to user input.
Deepgram: Deepgram transcribes the user’s audio in real time, as they speak. The transcribed text is passed immediately to the AI to generate a response without noticeable delay.
This combination of fast transcription, quick response generation, and efficient audio streaming ensures a responsive and smooth user experience.
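To make this concrete, here is a rough sketch of what travels over the WebSocket in the implementation below. The binary microphone chunks flow client-to-server; the JSON shapes flowing server-to-client mirror the server code in Step 1 (they are this tutorial's own convention, not a standard protocol):
// Client -> Server: binary audio chunks captured by MediaRecorder (roughly every 100 ms).

// Server -> Client: JSON messages describing the assistant's spoken reply.
const sessionMessage = {
  type: "audio_session", // a new response is about to start
  sid1: 1,               // counts user input turns
  sid2: 1,               // counts responses sent back
};

const audioMessage = {
  type: "audio",
  output: [/* raw audio bytes from Play.ht, sent as an array of numbers */],
  sid1: 1,
  sid2: 1,
};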
Step 1: Setting up the Server (Node.js, Express)
Start by creating a basic Express server that will handle WebSocket connections. This will act as a communication bridge between the client and the services we’ll be using.
npm init -y
npm install express ws dotenv @deepgram/sdk
Create a .env file for storing API keys:
DEEPGRAM_API_KEY=your_deepgram_api_key
GROQ_API_KEY=your_groq_api_key
PLAY_API_KEY=your_playht_api_key
PLAY_USERID=your_playht_user_id
Create the server in server.js:
const express = require("express");
const http = require("http");
const WebSocket = require("ws");
const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
const { play, initialize } = require("./playht"); // Play.ht text-to-speech helper (Step 4)
const { getGroqChat } = require("./groq");        // Groq chat helper (Step 3)
const dotenv = require("dotenv");

dotenv.config();

const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });
const deepgramClient = createClient(process.env.DEEPGRAM_API_KEY);

initialize(); // Initialize Play.ht

// Session counters used to match audio responses to the turn that produced them (see Step 7)
let sid1 = 0, sid2 = 0;

// Handle WebSocket connections from the browser
wss.on("connection", (ws) => {
  console.log("Client connected");

  // Open a live transcription session with Deepgram for this client
  const deepgram = deepgramClient.listen.live({
    language: "en",
    punctuate: true,
    smart_format: true,
    model: "nova-2-phonecall",
    endpointing: 400,
  });

  deepgram.addListener(LiveTranscriptionEvents.Open, () => {
    console.log("Deepgram connected");

    deepgram.addListener(LiveTranscriptionEvents.Transcript, async (data) => {
      // Only react to final transcripts that contain actual speech
      if (data.is_final && data.channel.alternatives[0].transcript !== "") {
        sid1++;
        const transcript = data.channel.alternatives[0].transcript;

        // Send the transcript to Groq for an AI response
        const responseText = await getGroqChat(transcript);

        // Convert the AI response to speech using Play.ht
        const stream = await play(responseText);
        sid2++;

        // Tell the client a new audio response is starting
        ws.send(JSON.stringify({
          type: "audio_session", sid1: sid1, sid2: sid2
        }));

        // Stream the synthesized audio back to the client in chunks
        stream.on("data", (chunk) => {
          ws.send(JSON.stringify({
            type: "audio",
            output: Array.from(new Uint8Array(chunk)), // raw audio bytes
            sid1: sid1,
            sid2: sid2,
          }));
        });
      }
    });
  });

  // Forward incoming microphone audio to Deepgram for transcription
  ws.on("message", (message) => {
    if (deepgram.getReadyState() === 1) {
      deepgram.send(message);
    }
  });

  ws.on("close", () => {
    deepgram.finish();
    console.log("Client disconnected");
  });
});

server.listen(3000, () => {
  console.log("Server is listening on port 3000");
});
Step 2: Configuring Deepgram for Live Transcription
Deepgram is a powerful speech-to-text API with live transcription capabilities. In the code above, we open a live transcription session with Deepgram and forward the incoming audio stream to it. As the audio is transcribed, Deepgram emits transcript events, and we act on the final transcript for each utterance.
Make sure your Deepgram API key is set in the .env file; the server reads it from process.env.DEEPGRAM_API_KEY.
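For reference, here is the same live-transcription configuration with each option annotated, plus error and close listeners that are worth adding in practice. The option names come from Deepgram's streaming API; the 400 ms endpointing value is a tunable choice, not a requirement:
const deepgram = deepgramClient.listen.live({
  language: "en",            // language of the incoming audio
  punctuate: true,           // add punctuation to transcripts
  smart_format: true,        // format numbers, dates, etc. for readability
  model: "nova-2-phonecall", // model tuned for conversational, phone-style audio
  endpointing: 400,          // treat ~400 ms of silence as the end of an utterance
});

// Listen for errors and closure so you can log and reconnect if needed.
deepgram.addListener(LiveTranscriptionEvents.Error, (err) => console.error("Deepgram error:", err));
deepgram.addListener(LiveTranscriptionEvents.Close, () => console.log("Deepgram connection closed"));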
Step 3: Integrating Groq for AI Responses
We will use the Groq API (or another AI service) to process the transcript and generate a conversational response. Groq exposes an OpenAI-compatible chat completions endpoint; the following helper (groq.js) sends the transcript and returns the reply text:
const axios = require('axios');

// Send the transcript to Groq's OpenAI-compatible chat completions endpoint
// and return the assistant's reply as plain text.
async function getGroqChat(transcript) {
  const response = await axios.post(
    'https://api.groq.com/openai/v1/chat/completions',
    {
      model: 'llama-3.1-8b-instant', // any chat model enabled on your Groq account
      messages: [{ role: 'user', content: transcript }],
    },
    { headers: { Authorization: `Bearer ${process.env.GROQ_API_KEY}` } }
  );
  return response.data.choices[0].message.content; // the AI response text
}

module.exports = { getGroqChat };
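Because the reply will be read aloud, it helps to steer the model toward short, conversational answers and to give it a little context from earlier turns. A minimal sketch, assuming you extend the messages array in the helper above (the prompt wording and history length are illustrative choices):
// Sketch: build a messages array with a system prompt and a short rolling history.
const history = [];

function buildMessages(transcript) {
  history.push({ role: "user", content: transcript });
  if (history.length > 8) history.splice(0, history.length - 8); // keep only recent turns
  return [
    { role: "system", content: "You are a voice assistant. Reply in one or two short sentences." },
    ...history,
  ];
}

// Use buildMessages(transcript) as the `messages` field in the request above,
// and push { role: "assistant", content: reply } onto history after each response.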
Step 4: Converting AI Responses to Speech with Play.ht
To convert text to speech, we use the Play.ht API. After receiving the response from Groq, we send it to Play.ht to generate audio.
const axios = require('axios');
const { PassThrough } = require('stream');

// Convert text to speech with Play.ht and return a readable audio stream.
// NOTE: the endpoint, header names, and voice below are illustrative; check the
// Play.ht docs for the exact streaming endpoint and voice IDs for your account.
async function play(text) {
  const response = await axios.post('https://api.play.ht/convert', {
    text,
    voice: "en_us_male",
    format: "mp3"
  }, {
    headers: {
      'Authorization': `Bearer ${process.env.PLAY_API_KEY}`,
      'UserId': process.env.PLAY_USERID
    },
    responseType: 'stream'
  });
  return response.data.pipe(new PassThrough()); // stream the audio back to the caller
}

// Placeholder for any one-time Play.ht setup (warming connections, validating keys, etc.)
async function initialize() {
  console.log('Play.ht initialized');
}

module.exports = { play, initialize };
Step 5: Streaming Audio Back to the Client
We stream the audio response from Play.ht back to the client over the same WebSocket connection. The stream returned by play() emits audio data in chunks, and the server forwards each chunk to the client as it arrives, allowing playback to begin before synthesis is complete.
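The server code in Step 1 only forwards data events from the Play.ht stream. In practice it also helps to tell the client when a response has finished and to handle stream errors. A small sketch (the audio_end message type is our own convention; the client would need a matching handler):
// Forward each audio chunk, then signal completion or failure.
stream.on("data", (chunk) => {
  ws.send(JSON.stringify({
    type: "audio",
    output: Array.from(new Uint8Array(chunk)),
    sid1: sid1,
    sid2: sid2,
  }));
});

stream.on("end", () => {
  ws.send(JSON.stringify({ type: "audio_end", sid1: sid1, sid2: sid2 }));
});

stream.on("error", (err) => {
  console.error("Play.ht stream error:", err);
});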
Step 6: Setting up Client-Side Audio Recording and Streaming
On the client side, you need to capture microphone audio and send it to the server in chunks over the WebSocket. Here's a simple example using MediaRecorder for capture and the Web Audio API for playback:
<script>
  const socket = new WebSocket("ws://localhost:3000");

  // Capture microphone audio and stream it to the server in small chunks
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    const mediaRecorder = new MediaRecorder(stream);
    mediaRecorder.start(100); // emit a chunk of audio every 100 ms

    mediaRecorder.ondataavailable = (event) => {
      socket.send(event.data); // send the audio blob to the server
    };

    socket.onmessage = (message) => {
      const data = JSON.parse(message.data);
      if (data.type === "audio") {
        const audioBuffer = new Uint8Array(data.output);
        playAudioBuffer(audioBuffer.buffer); // decodeAudioData expects an ArrayBuffer
      }
    };
  });

  // Decode and play a chunk of audio received from the server
  function playAudioBuffer(buffer) {
    const context = new AudioContext();
    context.decodeAudioData(buffer, (audioData) => {
      const source = context.createBufferSource();
      source.buffer = audioData;
      source.connect(context.destination);
      source.start();
    });
  }
</script>
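Since the title promises React, here is a minimal sketch of the same logic inside a React component, assuming the server runs on localhost:3000 and reusing the playAudioBuffer helper from the snippet above. Error handling and UI are omitted:
import { useEffect } from "react";

function VoiceAssistant() {
  useEffect(() => {
    const socket = new WebSocket("ws://localhost:3000");
    let recorder;

    socket.onopen = async () => {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      recorder = new MediaRecorder(stream);
      recorder.ondataavailable = (event) => socket.send(event.data); // stream mic chunks
      recorder.start(100); // emit a chunk every 100 ms
    };

    socket.onmessage = (message) => {
      const data = JSON.parse(message.data);
      if (data.type === "audio") {
        playAudioBuffer(new Uint8Array(data.output).buffer);
      }
    };

    // Stop recording and close the socket when the component unmounts
    return () => {
      if (recorder && recorder.state !== "inactive") recorder.stop();
      socket.close();
    };
  }, []);

  return <p>Listening…</p>;
}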
Step 7: Session and Stream Management
Throughout the process, we manage audio sessions with the sid1 and sid2 counters. These counters help ensure each audio response matches the correct user turn and prevent overlapping streams.
For example:
sid1 is incremented each time a new final transcript (new user input) arrives.
sid2 tracks the audio responses sent back to the client.
The client uses this session information to synchronize audio playback and discard audio from stale sessions.
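On the client, this can be as simple as remembering the latest session id and ignoring audio chunks that belong to an older one. A minimal sketch that would replace the onmessage handler from Step 6 (field names follow the server code above):
let currentSession = 0;

socket.onmessage = (message) => {
  const data = JSON.parse(message.data);
  if (data.type === "audio_session") {
    currentSession = data.sid1; // a newer response supersedes any older one
  } else if (data.type === "audio" && data.sid1 === currentSession) {
    playAudioBuffer(new Uint8Array(data.output).buffer);
  }
  // Chunks carrying an older sid1 are simply dropped.
};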
Scalability and Optimization for Production Use
Real-time voice assistants demand high reliability and scalability to ensure consistent performance under varying loads. Below are strategies for production readiness:
Load Balancing WebSocket Connections: Use NGINX or HAProxy for efficient distribution of WebSocket traffic.
Horizontal Scaling: Run multiple instances of the Node.js WebSocket server (for example, on Kubernetes) so connections can be spread across pods as traffic grows.
Caching AI Responses: Leverage Redis to reduce response times for repeated queries (see the sketch after this list).
Edge Processing: Serve static assets and terminate connections close to users via a CDN or edge network to cut round-trip latency.
Monitoring and Logging: Tools like Prometheus and Grafana can monitor latency and WebSocket connections.
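As a concrete example of the caching point above, the response cache can be a thin wrapper around getGroqChat using the redis client. The key scheme, REDIS_URL variable, and one-hour TTL are arbitrary choices for illustration:
const { createClient } = require("redis");
const { getGroqChat } = require("./groq");

const redis = createClient({ url: process.env.REDIS_URL });
redis.connect();

// Return a cached reply for repeated queries, otherwise ask Groq and cache the result.
async function getCachedGroqChat(transcript) {
  const key = `groq:${transcript.trim().toLowerCase()}`;
  const cached = await redis.get(key);
  if (cached) return cached;

  const reply = await getGroqChat(transcript);
  await redis.set(key, reply, { EX: 3600 }); // expire after one hour
  return reply;
}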
OpenAI's Real-Time API vs. Custom Approach: Which Suits Your Needs?
OpenAI recently launched its Real-Time API, offering an all-in-one solution for transcription, response generation, and text-to-speech capabilities. This API simplifies the development process by eliminating the need to manage multiple providers, enabling seamless integration for AI-powered applications.
Benefits of OpenAI’s Real-Time API
Ease of Use: Minimal setup with robust documentation.
Comprehensive Features: Combines multiple AI services in a single API.
Scalability: Built for high traffic with low-latency infrastructure.
Latest Advancements: Consistent updates ensure state-of-the-art performance.
However, this convenience comes at a cost. OpenAI’s API can become expensive with heavy usage, and reliance on a single provider introduces risks like vendor lock-in and limited customization.
Comparing Approaches: Flexibility vs. Simplicity
Custom Approach (Deepgram, Groq, Play.ht)
The custom approach allows developers to optimize for cost and performance by carefully choosing services like Deepgram for transcription, Groq LPUs for processing, and Play.ht for text-to-speech.
Advantages:
Cost Efficiency: Pay only for what you use.
Flexibility: Tailor services to specific needs and swap providers as necessary.
Performance: Leverage best-in-class tools for unique optimizations.
Vendor Independence: No reliance on a single provider.
Custom Workflow: Full control over architecture and integrations.
Challenges:
Complexity: Requires more setup and integration effort.
Latency: Inter-service communication can introduce delays (mitigated through optimization).
OpenAI’s API at a Glance
OpenAI’s API eliminates complexity by combining all services into one. It’s ideal for projects that prioritize simplicity and quick deployment over granular control.
Advantages:
Simplicity: Streamlined API with minimal overhead.
Integrated Solution: All services in one place, saving time and effort.
Scalability: Built-in global infrastructure for handling high loads.
Challenges:
Cost: Higher costs for extensive usage.
Customization: Limited flexibility for specific optimizations.
Making the Right Choice
Custom Approach: Best for cost-sensitive, scalable, or highly customizable solutions.
OpenAI’s API: Ideal for rapid development, ease of use, and scalability without complexity.
Each approach has its strengths. By understanding your project’s requirements—whether it’s flexibility, cost efficiency, or simplicity—you can confidently choose the right solution.
Final Thoughts
By combining Deepgram for real-time transcription, Groq for fast AI responses, and Play.ht for audio generation, we create a responsive voice assistant with minimal latency. This system, powered by WebSockets, efficiently handles bi-directional communication and ensures fast, real-time interactions.