
LiveSwitch VOSK Custom Audio Sink Transcriber

Jacob Steele Jul 27, 2023 9:53:06 PM

In this blog post, we will explore integrating VOSK (an open-source audio transcriber) with the custom audio sink we previously created. If you're new to custom audio sinks in LiveSwitch, we recommend checking out our guide on creating custom audio sinks first, as this article assumes a basic understanding of them.

 

Quick Reminder: Media Pipeline

Before we dive into the code, let's have a quick reminder of the media pipeline structure:

[Media pipeline: De-Packetizer → Decoder → Sound Converter → Audio Sink]

Code Implementation

Below is the code implementation for integrating VOSK into the custom audio sink. Take a look at the breakdown of each section:

module VoskSink

open System
open FM.LiveSwitch
open Vosk

type resultType = {
  conf : float
  ``end`` : float
  start : float
  word : string
}

type VoskResult = {
  result : resultType array
  text : string
}

type VoskSink =
  inherit AudioSink

  new (model: Model) = {
    inherit AudioSink(new Pcm.Format(16000, 1))
    voskRecognizer = new VoskRecognizer(model, 16000f)
    textEvent = new Event<string>()
  }

  val voskRecognizer : VoskRecognizer
  val mutable textEvent : Event<string>

  member this.OnTextEvent = this.textEvent.Publish
  member this.RaiseTextEvent e = this.textEvent.Trigger e
  member this.GetResultFromJson (json : string) : VoskResult = System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

  override this.Label : string = "Vosk Audio Transcriber"

  override this.DoDestroy () =
    let res = this.GetResultFromJson (this.voskRecognizer.Result())
    this.RaiseTextEvent res.text
    this.voskRecognizer.Dispose()
    ()

  override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
    let mutable result = false
    let dataBuf = buf.DataBuffer

    if dataBuf.Index = 0 then
      result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
    else
      let data = dataBuf.ToArray()
      result <- this.voskRecognizer.AcceptWaveform(data, data.Length)

    if result then
      let res = this.GetResultFromJson (this.voskRecognizer.Result())
      if not (String.IsNullOrWhiteSpace(res.text)) then
        this.RaiseTextEvent res.text

 

Code Breakdown

Let's break down the code and understand each section:

type VoskSink =
  inherit AudioSink

  new (model: Model) = {
    inherit AudioSink(new Pcm.Format(16000, 1))
    voskRecognizer = new VoskRecognizer(model, 16000f)
    textEvent = new Event<string>()
  }

Similar to our custom audio sink, the VoskSink expects to receive 16,000 Hz mono PCM audio, which aligns with VOSK's requirements. We define a constructor that initializes the VoskRecognizer when the sink is created and wires up the event that fires when transcribed text is returned.
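For context, constructing the sink might look like the following sketch. The model path here is a hypothetical local directory (any downloaded Vosk model works), and Model comes from the Vosk package:

  open Vosk

  // Hypothetical: load a Vosk model from a local directory (path is an example).
  let model = new Model("models/vosk-model-small-en-us-0.15")

  // Create the sink; it only accepts 16,000 Hz mono PCM audio.
  let sink = new VoskSink(model)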

  val voskRecognizer : VoskRecognizer
  val mutable textEvent : Event<string>

  member this.OnTextEvent = this.textEvent.Publish
  member this.RaiseTextEvent e = this.textEvent.Trigger e
  member this.GetResultFromJson (json : string) : VoskResult = System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

  override this.Label : string = "Vosk Audio Transcriber"

Here, we declare the fields and events the sink uses. The GetResultFromJson helper deserializes the JSON string returned by VOSK into our VoskResult record. As before, we override the Label property to give our audio sink a descriptive name.
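Consumers can listen for transcriptions through OnTextEvent. A minimal sketch, assuming a Vosk Model has already been loaded:

  // Subscribe to the sink's text event; OnTextEvent publishes an IEvent<string>.
  let sink = new VoskSink(model)   // 'model' is an already-loaded Vosk Model
  sink.OnTextEvent.Add(fun text ->
    printfn "Transcribed: %s" text)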

override this.DoDestroy () =
  let res = this.GetResultFromJson (this.voskRecognizer.Result())
  this.RaiseTextEvent res.text
  this.voskRecognizer.Dispose()
  ()

In the previous implementation, we didn't have any cleanup to perform when the audio sink was destroyed (e.g., when a user disconnects). In this case, however, we need to dispose of the VoskRecognizer and send the final transcribed text to other clients. This snippet handles that cleanup.

override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
  let mutable result = false
  let dataBuf = buf.DataBuffer

  if dataBuf.Index = 0 then
    result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
  else
    let data = dataBuf.ToArray()
    result <- this.voskRecognizer.AcceptWaveform(data, data.Length)

  if result then
    let res = this.GetResultFromJson (this.voskRecognizer.Result())
    if not (String.IsNullOrWhiteSpace(res.text)) then
      this.RaiseTextEvent res.text

The DoProcessFrame method is where each audio frame arrives after passing through the media pipeline (De-Packetizer, Decoder, Sound Converter, and finally our sink). We read the raw audio bytes from the frame's DataBuffer. The data buffer can either be a standalone byte[] or a slice of a DataBufferPool, in which case the offset (Index) and length matter. When the buffer starts at index 0, we can pass its underlying array directly; otherwise, calling .ToArray copies the correct slice into a standalone byte array for us. We then feed the bytes to VOSK and, when a full result is ready, raise the text event. It's as simple as that!
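To see what GetResultFromJson is parsing, here is an illustrative, hand-written example (not output from a real recognizer run) of the JSON shape VOSK returns, deserialized into the VoskResult record defined earlier. This assumes a recent .NET runtime where System.Text.Json can deserialize F# records:

  // Illustrative JSON in the shape VOSK returns from Result().
  let json = """{"result":[{"conf":1.0,"end":1.2,"start":0.8,"word":"hello"}],"text":"hello"}"""
  let res = System.Text.Json.JsonSerializer.Deserialize<VoskResult> json
  printfn "%s" res.text   // prints "hello"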

 

This logic can be applied to any audio filter or processing library, not just VOSK. I have used a similar approach with Microsoft's Speech-to-Text API by sending WAV headers ahead of the raw data stream.
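As an illustration of that approach, here is a sketch (not from the original project) that builds a standard 44-byte WAV header for 16 kHz mono 16-bit PCM, which could be written ahead of the raw samples:

  open System.IO

  // Build a 44-byte RIFF/WAVE header for raw 16 kHz mono 16-bit PCM data.
  let wavHeader (dataLength: int) =
    let sampleRate = 16000
    let channels = 1s
    let bitsPerSample = 16s
    let byteRate = sampleRate * int channels * int bitsPerSample / 8
    use ms = new MemoryStream()
    use w = new BinaryWriter(ms)
    w.Write("RIFF"B)
    w.Write(36 + dataLength)                 // total chunk size
    w.Write("WAVE"B)
    w.Write("fmt "B)
    w.Write(16)                              // fmt chunk size (PCM)
    w.Write(1s)                              // audio format: PCM
    w.Write(channels)
    w.Write(sampleRate)
    w.Write(byteRate)
    w.Write(channels * bitsPerSample / 8s)   // block align
    w.Write(bitsPerSample)
    w.Write("data"B)
    w.Write(dataLength)                      // size of the PCM payload
    w.Flush()
    ms.ToArray()

The header is prepended once, before streaming the raw PCM bytes, so the receiving API can interpret the sample rate and format correctly.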

 

I hope this guide helps you integrate other exciting projects directly into LiveSwitch. You can find the complete working project on GitHub.

 

Need assistance in architecting the perfect WebRTC application? Let our team help out! Get in touch with us today!