Monday, May 20, 2024

Sitecore PowerShell Extensions Text-to-Speech Audio Synthesis Module

Another year, another exciting Sitecore Hackathon!  This round, I flew solo under the moniker "Sitecorepunk 2077" (a play on the critically acclaimed 2020 action role-playing video game "Cyberpunk 2077").

If you're curious how the event unfolded, I documented my progress on X (formerly Twitter) every couple of hours:

Needless to say, I was utterly exhausted and slept for 12 hours straight, following the 32 hours I had been awake.  While I didn't snag a win (congrats, team Cloud Surfers and team 451 Unavailable For Legal Reasons ), I enjoyed the experience, am proud of what I was able to output, and look forward to the next one.

Module Concept and Inspiration

The 2024 Sitecore Hackathon category I chose to work against was "Best Module for XM/XP or XM Cloud" - although the result could also fit the bill for "Best use of AI".  Inspired by the ever-increasing need for accessible content, I decided to develop a module that converts text content into spoken audio files, which are then stored remotely and saved as an MP3 links within the item's context - all from within Sitecore. Ultimately, once I landed on the idea, the goal was to provide an easy-to-use tool for generating audio versions of Sitecore content, thereby enhancing accessibility and improving user engagement for individuals with visual impairments or preferences for audio content.


Here’s a breakdown of what makes the SPE Text-to-Speech Audio Synthesis Module stand out:

Lifelike Speech Synthesis from Microsoft Azure Cognitive AI Speech Services

One of the core features of this module is its ability to convert text content into lifelike speech. By transforming text into life-like speech, the module makes content more accessible to a broader audience, including those with visual impairments and individuals who prefer consuming content through audio.

The module utilizes Microsoft Azure Cognitive Services Speech Service to generate audio from selected text fields dynamically. This integration ensures high-quality, natural-sounding speech output. Whether it's a blog post, news article, or product description, every piece of content can be converted into audio, broadening its reach and enhancing user engagement.

Storage via Azure Blob Storage

To store the generated audio files, the module leverages Azure Blob Storage APIs. Once an audio file is generated and store locally in a temporary directory, it is then uploaded to a dedicated Azure Storage container. The API returns a URL to the audio file, which is then populated in the context page item’s Audio URL field. 

Interface and Custom Ribbon Button

A custom Ribbon Button on the Home tab streamlines the audio-generating process. This button triggers an interactive Sitecore PowerShell Extensions dialog where authors can configure various options, such as voice selection, field selection, and speech rate adjustment, and kick off the speech synthesis generation.

The customizable options ensure the audio output matches the intended tone and speech rate, providing a tailored listening experience.

Multi-Language Support

Recognizing the diverse needs of global users, the module supports multiple languages. For demonstration purposes and within the natural time constraints of the Hackathon, the following languages are supported in the initial implementation:

  • English (en)
  • Japanese (ja-JP)
  • German (de-DE)
  • Danish (da)

Each supported language selection has a series of Neural (lifelike, natural-sounding) voice options from Microsoft Azure Cognitive Services Speech Service (~449 neural voices to choose from). These hand-selected voices are configured to provide the best audio experience for each language. Of course, support can be expanded to include additional languages (there are 136 languages supported by Azure AI Speech Services).  

High-level Technical Breakdown

Initialization and Setup

The script sets up the necessary Azure services and local environment configurations.

User Interaction and Dialog Configuration

The script provides a dynamic interface through a custom Ribbon Button in the Sitecore Content Editor. This button, titled 'Generate Audio' or 'Regenerate Audio' based on the context item’s state, opens a dialog for configuring the audio output.  The fields and options available in the dialog are as follows:

- Field to Convert to Speech
  • Lists all Rich Text Editor (RTE) and multi-line text fields available on the item.
  • Special Case: If the 'Speech Content Override' field is populated, it appears as an additional option.

- Include Title?
  • A standalone radio button to include the item's title in the audio file.

- Voice
  • Dynamic option based on the item's language, the dialog offers preselected AI Neural voices.

- Speech Rate
  • Control the how fast the speech is spoken. 
    • Optional double value, defaulting to 1.0 if left empty.
    • Range: Between 0.5 (slow) and 2.0 (fast).

The dialog properties and user input handling are defined as follows:

Fetching and Sanitizing Text Content

The Invoke-AudioStreamFetch function handles the core functionality of fetching the text content from Sitecore, sanitizing it, and preparing it for conversion into speech.

The function checks if the title should be included and concatenates it with the main text content. It then sanitizes the text by removing HTML tags and special characters, ensuring clean input for the TTS service.

Sending Text to Azure AI for Speech Synthesis

As seen above, the sanitized text is then sent to the speech service endpoint for conversion into an audio file. The response, which contains the audio stream, is saved locally.

Uploading the Audio File to Azure Blob Storage

Once the audio file is generated, it is uploaded to Azure Blob Storage by calling the Upload-FileToAzureStorage function. This function handles the Azure Storage REST API authentication and the file upload process.

Updating Sitecore Item with Audio URL

After uploading the audio file to Azure, the script updates the Sitecore item with the URL of the audio file, ensuring that the content authors can easily access and manage the generated audio files.

Utilizing the Audio File on the Front-end

Once an item's Audio URL field has been populated, it can be used on the front-end within an HTML audio tag:

This is the simplest approach for playing the audio file, but further styling customizations are doable.

Video Demo

Part of the Hackathon Entry includes a video demo. You can check it out below:

Final Thoughts

Participating in the Sitecore Hackathon has always been an exhilarating experience for me, given the time crunch and competitiveness of the community. That night, the development of the SPE Text-to-Speech Audio Synthesis Module pushed my organizational and technical boundaries, and I'm proud of what I could accomplish in such a short timeframe. More importantly, I hope the resulting module helps highlight the importance of accessibility in content management and end-user experiences. 

If you're interested in or inspired to build your own Text-to-Speech synthesis module, the full PowerShell script and documentation are available on Github.