literAI: AI-generated open source visual podcasts

Demo: Source: Generator, UI

At my previous job I did some shader programming, generally tinkered around with GPU workloads, and even had the chance to attend Nvidia’s GPU Technology Conference a few times. I remember in 2018 or so being surprised that more and more of the conversation in this area was being dominated by these things called “deep neural networks”. During my CS studies I was focused on cryptography, but I was curious what all this was about and took an early version of Udacity’s Deep Learning Nanodegree (don’t laugh!).

The class was actually fairly insightful: it made you learn backpropagation and the like from scratch, and took you through the motions of the classic MNIST classification task and so forth. It ended with face generation using these fancy things called convolutional neural networks.

such fidelity, much wow

Neat, but still felt a bit gadget-y to me. Like every nerd I assumed that someday humanity would develop “artificial” intelligence, but at the time it didn’t seem like such a thing was imminent.

Of course, then came Stable Diffusion and ChatGPT.

When I want to learn something technical, I need to be able to tinker with it. Let me get into VS Code, get something working locally, something I can step into as deep as I want to. And then it’s just, you know, messing around with it.

this is not an exaggeration

Over the past six months I’ve been deep-diving into the latest AI advancements, tinkering as I go (I recommend the excellent Neural Networks from Scratch book to get a jump start). A few projects I wrote along the way were txt2imghd and transformers-openai-api.

One pain point I kept hitting is that the coolest stuff all seemed to be behind an API instead of being openly accessible. Don’t get me wrong: I probably spent more money on GPU time to run open models than if I’d just paid the damn API costs, and I don’t begrudge companies trying to, you know, actually make money. But whenever I wanted to tinker, the best stuff required carefully rate-limited API calls. I wanna do dumb shit in a tight for loop without the fear of a gazillion dollar bill!

One night while perusing the latest arXiv posts I came across SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization, which used research into knowledge graphs to generate prompts for text-davinci-003 (a close sibling of the model behind ChatGPT) to create a large dataset of synthetic dialogues along with the accompanying semantic information (e.g. the intent of one of the speakers). This dataset was then used to fine-tune the open source T5 language model from Google to create COSMO, a model that can generate realistic-sounding human dialogues.
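To make the idea of prompting a dialogue model with "semantic information" concrete, here is a minimal sketch of how a situation, a role instruction, and the conversation so far might be flattened into a single input string for a T5-style model such as COSMO. The function name and the example text are hypothetical, and the `<sep>` and `<turn>` separators are an assumption about the input convention; check the model card before relying on them.

```python
from typing import List


def build_cosmo_input(
    situation: str,
    instruction: str,
    history: List[str],
    sep: str = " <sep> ",    # assumed separator between context fields
    turn: str = " <turn> ",  # assumed separator between dialogue turns
) -> str:
    """Flatten narrative context and dialogue history into one input string."""
    text = turn.join(history)
    # Prepend the optional fields so the final order is:
    # situation, then instruction, then the dialogue turns.
    if instruction:
        text = f"{instruction}{sep}{text}"
    if situation:
        text = f"{situation}{sep}{text}"
    return text


example = build_cosmo_input(
    situation="Two hosts are recording a podcast about a classic novel.",
    instruction="You are the co-host and you are curious about the plot.",
    history=["So what happens in the first chapter?"],
)
print(example)
```

The model's reply would then be appended to `history` and the function called again for the next turn.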

I spend a fair amount of time listening to audiobooks and podcasts, and this got me thinking about potential applications. Could a podcast about a novel be generated by a model like COSMO? Furthermore, could it be done using consumer-grade hardware, i.e. not on an A100? (As part of my research I contributed some SODA data to Open Assistant, a project to create an open source ChatGPT.)

Lo and behold, yacine had similar inklings and, while I was working on my project, released scribepod, powered by the 900-pound gorilla that is text-davinci-003. This was partial vindication (yes, it could be done) but also somewhat deflating, since it suggested that a project like this would need to be tethered to an API.

Or must it be? COSMO can make the dialogue, but it needs some information on what to say. The critical task here is summarization: taking the raw novel text and distilling it into meaningful pieces that can be used as context when prompting the dialogue-generating LM. Peter Szemraj has been doing fantastic open source work in this space, and I decided to use his long-t5-tglobal-xl-16384-book-summary model (again a fine-tuning of T5; are we noticing a pattern here? Thanks Google!!!).
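As a rough illustration of the summarization step, the sketch below splits a long text into overlapping chunks and summarizes each one. It is not the literAI implementation: the function names, chunk sizes, and the word-based splitting (rather than token-based) are all assumptions, and the Hugging Face hub id in `load_summarizer` is assumed from the model name above. The summarizer is passed in as a plain callable so the chunking logic can be tried without downloading a model.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 2000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


def summarize_book(
    text: str, summarize: Callable[[str], str], **chunk_kwargs
) -> List[str]:
    """Return one summary per chunk, in reading order."""
    return [summarize(chunk) for chunk in chunk_words(text, **chunk_kwargs)]


def load_summarizer(
    model_name: str = "pszemraj/long-t5-tglobal-xl-16384-book-summary",  # assumed hub id
) -> Callable[[str], str]:
    """Wrap a Hugging Face summarization pipeline as a str -> str callable."""
    from transformers import pipeline  # imported lazily; only needed for real runs

    pipe = pipeline("summarization", model=model_name)
    return lambda chunk: pipe(chunk)[0]["summary_text"]
```

With a model available, `summarize_book(novel_text, load_summarizer())` would yield a list of chunk summaries that can serve as context for the dialogue model.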

Okay, so I had an open source way to summarize text and generate dialogue. How about a bit of flair? Given the incredible results that diffusion models have had in image generation, I wanted to leverage these to give the podcast some imagery. My idea was a player for the podcast that would scroll between images generated from descriptions of the scenes that the podcast participants were talking about. To do this, I needed to automatically generate prompts for Stable Diffusion models (Greg Rutkowski here we come).

The ChatGPT-solves-everything answer is to simply few-shot it with some examples of what you’d like using something like LangChain and let those 175 billion parameters work their magic. To maintain our open source purity I chose FLAN-T5 (paper; model), the instruction-tuned version of T5. FLAN-T5 produced very good, although admittedly inferior, results. Alas, such is the price we must pay (or not pay in this case).
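For readers unfamiliar with few-shot prompting, here is a minimal sketch of how such a prompt could be assembled as a plain string before being handed to an instruction-tuned model like FLAN-T5. The instruction wording, the `Passage:`/`Scene:` labels, and the function name are hypothetical and not taken from the literAI source.

```python
from typing import List, Tuple

# Hypothetical instruction; the wording is an assumption for illustration.
INSTRUCTION = "Describe the scene in one sentence for an illustrator."


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    passage: str,
    instruction: str = INSTRUCTION,
) -> str:
    """Build a prompt from (passage, scene description) examples plus a new passage."""
    parts = [instruction, ""]
    for source, description in examples:
        parts += [f"Passage: {source}", f"Scene: {description}", ""]
    # Leave the final "Scene:" open so the model completes it.
    parts += [f"Passage: {passage}", "Scene:"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    examples=[
        ("The storm broke as the ship left harbor.",
         "A sailing ship leaving a harbor under dark storm clouds"),
    ],
    passage="She lit a candle and climbed the tower stairs.",
)
print(prompt)
```

The resulting string can be passed to any text-to-text model; the model's completion of the last `Scene:` line is the image description.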

Once the image descriptions were created, it was simply a matter of generating a prompt and letting a Stable Diffusion model like Dreamlike Diffusion do the rest!
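A minimal sketch of what "generating a prompt" from a scene description might look like: tidy the description and append a few style tags. The function name and the default tags are hypothetical; the tags that actually work best depend on the model being used.

```python
from typing import Sequence


def build_image_prompt(
    description: str,
    style_tags: Sequence[str] = ("digital painting", "highly detailed"),  # assumed tags
) -> str:
    """Turn a scene description into a comma-separated text-to-image prompt."""
    # Collapse runs of whitespace and drop a trailing period.
    cleaned = " ".join(description.split()).rstrip(".")
    parts = [part for part in (cleaned, *style_tags) if part]
    return ", ".join(parts)


print(build_image_prompt("A  sailing ship leaving a harbor under dark storm clouds."))
```

The returned string is what would be passed as the prompt to the image model.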

The final piece was to make actual audio. I cribbed yacine’s use of TorToiSe, and at last the amalgamation was complete — literAI was born! You can try out the visual player here.

I’ll save my poetic waxing about AI for another time. Rather, I’d like to simply appreciate the work of the countless researchers who contributed to getting us to the current SOTA. It’s frankly bewildering. I’m looking forward to where we’re going — and being a builder of it along the way.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.