How to Build a Real-Time AI Voice Assistant: Node.js, React, Groq, and Deepgram vs. OpenAI’s New API

In this post, we will build an AI-powered voice assistant that responds to user input in real time. The assistant uses WebSockets for low-latency, bi-directional communication between the client and the server. We'll leverage Deepgram for live transcription, Groq for generating AI responses, and Play.ht for converting text responses into speech. Groq's Language Processing Units (LPUs), specialized hardware designed for large language model (LLM) inference, significantly reduce latency, ensuring fast processing and real-time response generation.

By using WebSockets, Deepgram, and Groq's LPUs, we create a highly optimized flow that minimizes delays in transcription, response generation, and speech synthesis.

We’ll break this down into the following steps:

  1. Setting up the Server (Node.js, Express)

  2. Configuring Deepgram for Live Transcription

  3. Integrating Groq for AI Responses

  4. Converting AI Responses to Speech with Play.ht

  5. Streaming Audio Back to the Client

  6. Setting up Client-Side Audio Recording and Streaming

  7. Session and Stream Management

  8. Scalability and Optimization for Production Use

  9. OpenAI's Real-Time API vs. Custom Approach: Which Suits Your Needs?


How We Achieve Real-Time Responses

To achieve real-time interaction with minimal latency:

  • WebSocket Connections: WebSockets allow full-duplex communication between the client and server, reducing the overhead of traditional HTTP requests and making the interaction more fluid and instantaneous.

  • Groq's LPUs: Groq has developed Language Processing Units, specialized hardware for running large language models (LLMs). These LPUs let Groq deliver AI responses faster than typical GPU-based cloud inference, ensuring the AI can respond quickly to user inputs.

  • Deepgram: By using Deepgram for live transcription, the user’s audio input is converted into text as they speak, in real-time. This transcribed text is then immediately processed by the AI to generate responses without noticeable delays.

This combination of fast transcription, quick response generation, and efficient audio streaming ensures a responsive and smooth user experience.


Step 1: Setting up the Server (Node.js, Express)

Start by creating a basic Express server that will handle WebSocket connections. This will act as a communication bridge between the client and the services we’ll be using.

npm init -y
npm install express ws dotenv @deepgram/sdk

Create a .env file for storing API keys:

DEEPGRAM_API_KEY=your_deepgram_api_key
GROQ_API_KEY=your_groq_api_key
PLAY_API_KEY=your_playht_api_key
PLAY_USERID=your_playht_user_id

Create the server in server.js:

const express = require("express");
const http = require("http");
const WebSocket = require("ws");
const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
const { play, initialize } = require('./playht'); // Play.ht module
const { getGroqChat } = require('./groq'); // Groq AI model
const dotenv = require("dotenv");

dotenv.config();

const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });
const deepgramClient = createClient(process.env.DEEPGRAM_API_KEY);

initialize(); // Initialize Play.ht

// Function to handle WebSocket connections
wss.on("connection", (ws) => {
  console.log("Client connected");

  // Per-client session & stream counters (see Step 7):
  // sid1 = user turns (final transcripts), sid2 = responses sent back,
  // pl1 = TTS requests started, pl2 = TTS requests completed
  let sid1 = 0, sid2 = 0, pl1 = 0, pl2 = 0;

  // Setup Deepgram for live transcription
  let deepgram = deepgramClient.listen.live({
    language: "en",
    punctuate: true,
    smart_format: true,
    model: "nova-2-phonecall",
    endpointing: 400,
  });

  deepgram.addListener("open", () => {
    console.log("Deepgram connected");

    deepgram.addListener("transcript", async (data) => {
      if (data.is_final && data.channel.alternatives[0].transcript !== "") {
        sid1++;
        const transcript = data.channel.alternatives[0].transcript;

        // Send transcript to Groq for AI response
        const responseText = await getGroqChat(transcript);

        // Convert AI response to speech using Play.ht
        pl1++; // a new TTS request is starting
        const stream = await play(responseText);
        pl2++; // this TTS request has completed
        sid2++;

        ws.send(JSON.stringify({
          'type': 'audio_session', 'sid1': sid1, 'sid2': sid2
        }));

        // Stream only if no newer TTS request started while this one was
        // in flight (pl1 === pl2); otherwise drop the stale audio
        if (pl1 === pl2) {
          stream.on("data", (chunk) => {
            ws.send(JSON.stringify({
              'type': 'audio',
              'output': Array.from(chunk), // chunk is a Buffer; serialize as a byte array
              'sid1': sid1,
              'sid2': sid2
            }));
          });
        }
      }
    });
  });

  ws.on("message", (message) => {
    if (deepgram.getReadyState() === 1) {
      deepgram.send(message); // Forward audio data to Deepgram for transcription
    }
  });

  ws.on("close", () => {
    deepgram.finish();
    console.log("Client disconnected");
  });
});

server.listen(3000, () => {
  console.log("Server is listening on port 3000");
});

Step 2: Configuring Deepgram for Live Transcription

Deepgram is a powerful transcription API that provides live transcription capabilities. In the code above, we initialize a live transcription session with Deepgram and set it up to listen for incoming audio streams. When the audio is transcribed, it sends back the final transcript to the server.

Make sure DEEPGRAM_API_KEY is set in your .env file; the server reads it via process.env.DEEPGRAM_API_KEY.
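One practical detail: Deepgram closes a live connection after several seconds without audio. A minimal sketch for keeping it open across pauses in speech, assuming the v3 SDK's keepAlive() helper, added inside the connection handler in server.js:

  const keepAliveTimer = setInterval(() => {
    deepgram.keepAlive(); // lightweight ping that prevents the idle timeout during silence
  }, 5000);

  ws.on("close", () => clearInterval(keepAliveTimer));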

Step 3: Integrating Groq for AI Responses

We will use the Groq API (or another AI service) to process the transcript and generate a conversational response. The following is an example of a function that sends the transcript to Groq and gets a response:

const axios = require('axios');

// Groq's API is OpenAI-compatible: authenticate with a Bearer token in the
// header rather than sending the key in the request body
async function getGroqChat(transcript) {
  const response = await axios.post(
    'https://api.groq.com/openai/v1/chat/completions',
    {
      model: 'llama-3.1-8b-instant', // pick any chat model from Groq's model list
      messages: [{ role: 'user', content: transcript }],
    },
    { headers: { Authorization: `Bearer ${process.env.GROQ_API_KEY}` } }
  );
  return response.data.choices[0].message.content; // Return AI response text
}

module.exports = { getGroqChat };
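If you'd rather not hand-roll the HTTP call, Groq also ships an official Node SDK (npm install groq-sdk). Here is the same function written with it, as a sketch; the model name is an assumption, so pick one from Groq's model list:

const Groq = require('groq-sdk');

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function getGroqChat(transcript) {
  const completion = await groq.chat.completions.create({
    model: 'llama-3.1-8b-instant', // assumed model name
    messages: [{ role: 'user', content: transcript }],
  });
  return completion.choices[0].message.content;
}

module.exports = { getGroqChat };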

Step 4: Converting AI Responses to Speech with Play.ht

To convert text to speech, we use the Play.ht API. After receiving the response from Groq, we send it to Play.ht to generate audio.

const axios = require('axios');
const { PassThrough } = require('stream');

async function play(text) {
  // Endpoint, headers, and voice value follow Play.ht's v2 streaming API;
  // verify the exact values against the current Play.ht docs
  const response = await axios.post('https://api.play.ht/api/v2/tts/stream', {
    text,
    voice: "your-playht-voice-id", // replace with a voice ID from your Play.ht account
    output_format: "mp3"
  }, {
    headers: {
      'Authorization': process.env.PLAY_API_KEY,
      'X-USER-ID': process.env.PLAY_USERID,
      'Accept': 'audio/mpeg'
    },
    responseType: 'stream'
  });

  return response.data.pipe(new PassThrough()); // Stream audio back
}

async function initialize() {
  console.log('Play.ht initialized');
}

module.exports = { play, initialize };
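Before wiring this into the server, a quick way to smoke-test the module is to write one response to a local file (test.mp3 is just a scratch path):

const fs = require('fs');
const { play, initialize } = require('./playht');

(async () => {
  await initialize();
  const stream = await play('Hello from the voice assistant!');
  stream.pipe(fs.createWriteStream('test.mp3')); // play the file to verify voice and format
})();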

Step 5: Streaming Audio Back to the Client

We stream the audio response from Play.ht back to the client through the same WebSocket connection. The server's stream.on("data") handler (see Step 1) forwards audio from the play() stream to the client in chunks, allowing playback to begin before synthesis finishes.
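Note that JSON-encoding every chunk as an array of numbers roughly quadruples the payload. A leaner sketch, assuming you send the session metadata once and the audio itself as raw binary WebSocket frames:

// Server side: announce the session as JSON, then send raw chunks
ws.send(JSON.stringify({ 'type': 'audio_session', 'sid1': sid1, 'sid2': sid2 }));
stream.on("data", (chunk) => ws.send(chunk)); // binary frame, no JSON overhead

// Client side: binary frames are audio, strings are control messages
socket.binaryType = 'arraybuffer';
socket.onmessage = (message) => {
  if (message.data instanceof ArrayBuffer) {
    playAudioBuffer(message.data); // raw audio bytes
  } else {
    const data = JSON.parse(message.data); // e.g. audio_session metadata
  }
};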

Step 6: Setting up Client-Side Audio Recording and Streaming

On the client side, you need to record audio and send it to the server in chunks over the WebSocket. Here's a simple example using the MediaRecorder API for capture and the Web Audio API for playback:

<script>
  const socket = new WebSocket('ws://localhost:3000');
  const audioContext = new AudioContext(); // one shared context; some browsers suspend it until a user gesture

  // Start recording only once the socket is open, so no chunks are lost
  socket.onopen = () => {
    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      const mediaRecorder = new MediaRecorder(stream);

      mediaRecorder.ondataavailable = (event) => {
        if (socket.readyState === WebSocket.OPEN) {
          socket.send(event.data); // Send audio blob to server
        }
      };

      mediaRecorder.start(100); // Emit a chunk of audio every 100ms
    });
  };

  socket.onmessage = (message) => {
    const data = JSON.parse(message.data);

    if (data.type === 'audio') {
      const audioBytes = new Uint8Array(data.output);
      playAudioBuffer(audioBytes.buffer); // decodeAudioData expects an ArrayBuffer
    }
  };

  function playAudioBuffer(arrayBuffer) {
    audioContext.decodeAudioData(arrayBuffer, (audioData) => {
      const source = audioContext.createBufferSource();
      source.buffer = audioData;
      source.connect(audioContext.destination);
      source.start();
    });
  }
</script>
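One caveat: decodeAudioData generally needs a complete, decodable file, so decoding arbitrary 100ms MP3 chunks can fail mid-frame. A buffering sketch, assuming a hypothetical audio_end message the server would emit from stream.on('end', ...):

let chunks = [];

socket.onmessage = (message) => {
  const data = JSON.parse(message.data);

  if (data.type === 'audio') {
    chunks.push(new Uint8Array(data.output)); // accumulate until the response ends
  } else if (data.type === 'audio_end') { // hypothetical end-of-response signal
    const total = chunks.reduce((sum, c) => sum + c.length, 0);
    const merged = new Uint8Array(total);
    let offset = 0;
    for (const c of chunks) { merged.set(c, offset); offset += c.length; }
    chunks = [];
    playAudioBuffer(merged.buffer); // decode the complete response at once
  }
};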

Step 7: Session and Stream Management

Throughout the process, we manage audio sessions with the sid1, sid2, pl1, and pl2 counters. These help match audio responses to the correct user turn and prevent overlapping streams.

For example:

  • sid1 is incremented each time a final transcript arrives (a new user turn).

  • sid2 tracks the audio responses sent back to the client.

  • pl1 and pl2 count text-to-speech requests started and completed; if a newer request began while one was in flight (pl1 !== pl2), the stale audio is dropped.

  • The client uses the session information to synchronize playback and discard stale audio, as sketched below.
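A minimal client-side guard, assuming the audio_session and audio messages the server already sends: remember the latest turn ID and drop audio chunks tagged with an older one.

let latestSid = 0;

socket.onmessage = (message) => {
  const data = JSON.parse(message.data);

  if (data.type === 'audio_session') {
    latestSid = data.sid1; // a new turn has started; older audio is stale
  }

  if (data.type === 'audio' && data.sid1 === latestSid) {
    playAudioBuffer(new Uint8Array(data.output).buffer); // current turn only
  }
};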


Scalability and Optimization for Production Use

Real-time voice assistants demand high reliability and scalability to ensure consistent performance under varying loads. Below are strategies for production readiness:

  1. Load Balancing WebSocket Connections: Use NGINX or HAProxy for efficient distribution of WebSocket traffic.

  2. Horizontal Scaling: Run multiple instances of the Node.js WebSocket server (e.g., on Kubernetes); the Deepgram, Groq, and Play.ht APIs scale on the provider side.

  3. Caching AI Responses: Leverage Redis to reduce response times for repeated queries (see the sketch after this list).

  4. Edge Processing: Terminate connections close to users (e.g., via a CDN or regional deployments) to cut round-trip latency.

  5. Monitoring and Logging: Tools like Prometheus and Grafana can monitor latency and WebSocket connections.
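For the caching point above, a minimal sketch with the node-redis client (npm install redis); the key scheme and one-hour TTL are assumptions to tune for your traffic:

const { createClient } = require('redis');
const { getGroqChat } = require('./groq');

const redis = createClient({ url: process.env.REDIS_URL });
redis.connect();

async function getGroqChatCached(transcript) {
  const key = `groq:${transcript.trim().toLowerCase()}`; // assumed key scheme
  const cached = await redis.get(key);
  if (cached) return cached; // cache hit: skip the LLM call entirely

  const answer = await getGroqChat(transcript);
  await redis.set(key, answer, { EX: 3600 }); // expire after one hour
  return answer;
}

module.exports = { getGroqChatCached };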


OpenAI's Real-Time API vs. Custom Approach: Which Suits Your Needs?

OpenAI recently launched its Real-Time API, offering an all-in-one solution for transcription, response generation, and text-to-speech capabilities. This API simplifies the development process by eliminating the need to manage multiple providers, enabling seamless integration for AI-powered applications.
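To make the contrast concrete, the entire pipeline above collapses into a single WebSocket connection. A minimal sketch in Node.js; the model name, beta header, and event shapes reflect the API as launched and may have changed, so verify against OpenAI's docs:

const WebSocket = require('ws');

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', // model name may differ today
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1',
    },
  }
);

ws.on('open', () => {
  // Ask the model to respond; transcription, reasoning, and speech are all handled server-side
  ws.send(JSON.stringify({
    type: 'response.create',
    response: { modalities: ['audio', 'text'], instructions: 'Greet the user.' },
  }));
});

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());
  console.log(event.type); // e.g. response.audio.delta events carry base64 audio chunks
});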

Benefits of OpenAI’s Real-Time API

  • Ease of Use: Minimal setup with robust documentation.

  • Comprehensive Features: Combines multiple AI services in a single API.

  • Scalability: Built for high traffic with low-latency infrastructure.

  • Latest Advancements: Consistent updates ensure state-of-the-art performance.

However, this convenience comes at a cost. OpenAI’s API can become expensive with heavy usage, and reliance on a single provider introduces risks like vendor lock-in and limited customization.


Comparing Approaches: Flexibility vs. Simplicity

Custom Approach (Deepgram, Groq, Play.ht)

The custom approach allows developers to optimize for cost and performance by carefully choosing services like Deepgram for transcription, Groq LPUs for processing, and Play.ht for text-to-speech.

Advantages:

  • Cost Efficiency: Pay only for what you use.

  • Flexibility: Tailor services to specific needs and swap providers as necessary.

  • Performance: Leverage best-in-class tools for unique optimizations.

  • Vendor Independence: No reliance on a single provider.

  • Custom Workflow: Full control over architecture and integrations.

Challenges:

  • Complexity: Requires more setup and integration effort.

  • Latency: Inter-service communication can introduce delays (mitigated through optimization).

OpenAI’s API at a Glance

OpenAI’s API eliminates complexity by combining all services into one. It’s ideal for projects that prioritize simplicity and quick deployment over granular control.

Advantages:

  • Simplicity: Streamlined API with minimal overhead.

  • Integrated Solution: All services in one place, saving time and effort.

  • Scalability: Built-in global infrastructure for handling high loads.

Challenges:

  • Cost: Higher costs for extensive usage.

  • Customization: Limited flexibility for specific optimizations.


Making the Right Choice

  1. Custom Approach: Best for cost-sensitive, scalable, or highly customizable solutions.

  2. OpenAI’s API: Ideal for rapid development, ease of use, and scalability without complexity.

Each approach has its strengths. By understanding your project’s requirements—whether it’s flexibility, cost efficiency, or simplicity—you can confidently choose the right solution.


Final Thoughts

By combining Deepgram for real-time transcription, Groq for fast AI responses, and Play.ht for audio generation, we create a responsive voice assistant with minimal latency. This system, powered by WebSockets, efficiently handles bi-directional communication and ensures fast, real-time interactions.