{"id":1617,"date":"2023-02-02T13:35:21","date_gmt":"2023-02-02T17:35:21","guid":{"rendered":"http:\/\/jeffq.com\/blog\/?p=1617"},"modified":"2023-02-02T13:35:21","modified_gmt":"2023-02-02T17:35:21","slug":"literai-ai-generated-open-source-visual-podcasts","status":"publish","type":"post","link":"http:\/\/jeffq.com\/blog\/literai-ai-generated-open-source-visual-podcasts\/","title":{"rendered":"literAI: AI-generated open source visual podcasts"},"content":{"rendered":"\n<p><em>Demo: <a rel=\"noreferrer noopener\" href=\"https:\/\/literai.hooloovoo.ai\/\" target=\"_blank\">https:\/\/literai.hooloovoo.ai\/<\/a> Source: <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/jquesnelle\/literAI\" target=\"_blank\">Generator<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/hooloovoo-ai\/literAI-website\" target=\"_blank\">UI<\/a><\/em><\/p>\n\n\n\n<p>At my previous job I did some shader programming, and generally tinkered around with GPU workloads, and even had the chance to attend Nvidia&#8217;s GPU Technology Conference a few times. I remember in 2018 or so being surprised that more and more of the conversation in this area was being dominated by these things called &#8220;deep neural networks&#8221;. During my CS studies I was focused on cryptography, but I was curious what all this was about and took an early version of Udacity&#8217;s Deep Learning Nanodegree (don&#8217;t laugh!).<\/p>\n\n\n\n<p>The class was actually fairly insightful &#8212; it made you learn about backpropagation, etc. from scratch and took you through the motions of the classic MNIST classification tasks and so forth. 
It ended with doing face generation using these fancy things called <em>convolutional<\/em> neural networks.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Some randomly generated faces created by a<br>deep convolutional generative adversarial network I made as part of my <a href=\"https:\/\/twitter.com\/hashtag\/udacity?src=hash&amp;ref_src=twsrc%5Etfw\">#udacity<\/a> course. Not super practical, but still eminently cool<br><br>P.S. Twitter asks &quot;Who&#39;s in these photos?&quot; when I upload them. The dreams of electric sheep, Twitter. <a href=\"https:\/\/t.co\/Tf6iAWHEl8\">pic.twitter.com\/Tf6iAWHEl8<\/a><\/p>&mdash; emozilla (@theemozilla) <a href=\"https:\/\/twitter.com\/theemozilla\/status\/1016099200067567617?ref_src=twsrc%5Etfw\">July 8, 2018<\/a><\/blockquote><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n<\/div><figcaption class=\"wp-element-caption\">such fidelity, much wow<\/figcaption><\/figure>\n\n\n\n<p>Neat, but still felt a bit gadget-y to me. Like every nerd I assumed that <em>someday<\/em> humanity would develop &#8220;artificial&#8221; intelligence, but at the time it didn&#8217;t seem like such a thing was imminent.<\/p>\n\n\n\n<p>Of course, then came Stable Diffusion and ChatGPT.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>When I want to learn something technical, I need to be able to tinker with it. Let me get into VS Code, get <em>something<\/em> working locally, something I can step into as deep as I want to. 
And then it&#8217;s just, you know, messing around with it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FXyz7ySVQAASPqt.jpeg\"><img decoding=\"async\" loading=\"lazy\" width=\"1500\" height=\"1968\" src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/FXyz7ySVQAASPqt.jpeg\" alt=\"\" class=\"wp-image-1619\"\/><\/a><figcaption class=\"wp-element-caption\">this is not an exaggeration<\/figcaption><\/figure>\n\n\n\n<p>Over the past six months I&#8217;ve been deep-diving the latest AI advancements, tinkering as I go (I recommend the excellent <a rel=\"noreferrer noopener\" href=\"https:\/\/nnfs.io\/\" data-type=\"URL\" data-id=\"https:\/\/nnfs.io\/\" target=\"_blank\">Neural Networks from Scratch<\/a> book to get jump-started). A few projects I wrote along the way were <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/jquesnelle\/txt2imghd\" target=\"_blank\">txt2imghd<\/a> and <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/jquesnelle\/transformers-openai-api\" target=\"_blank\">transformers-openai-api<\/a>. <\/p>\n\n\n\n<p>One pain point I kept hitting was that it seemed like the <em>coolest<\/em> stuff was all behind an API, instead of being openly accessible. Don&#8217;t get me wrong &#8212; I probably spent more money on GPU time to run open models than if I&#8217;d just paid the damn API costs, and I don&#8217;t begrudge companies trying to, you know, actually make money &#8212; but whenever I wanted to tinker, the best stuff required carefully rate-limited API calls. 
I wanna do dumb shit in a tight <code>for<\/code> loop without the fear of a gazillion-dollar bill!<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>One night while perusing the latest arXiv posts I came across <a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/pdf\/2212.10465.pdf\" data-type=\"URL\" data-id=\"https:\/\/arxiv.org\/pdf\/2212.10465.pdf\" target=\"_blank\">SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization<\/a>, which used <a rel=\"noreferrer noopener\" href=\"https:\/\/aclanthology.org\/2022.naacl-main.341\/\" target=\"_blank\">research into knowledge graphs<\/a> to generate prompts for text-davinci-003 (a sibling of the model behind ChatGPT) to create a large dataset of synthetic dialogues along with the accompanying semantic information (e.g. the intent of one of the speakers). This dataset was then used to fine-tune the <a rel=\"noreferrer noopener\" href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\" target=\"_blank\">open source T5 language model from Google<\/a> to create <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/allenai\/cosmo-xl\" target=\"_blank\">COSMO<\/a>, a model that can generate realistic-sounding human dialogues.<\/p>\n\n\n\n<p>I spend a fair amount of time listening to audiobooks and podcasts, and this got me thinking about potential applications. Could a podcast about a novel be generated by a model like COSMO? (As part of my research I <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/LAION-AI\/Open-Assistant\/pull\/571\" target=\"_blank\">contributed<\/a> some SODA data into <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/LAION-AI\/Open-Assistant\" target=\"_blank\">Open Assistant<\/a>, a project to create an open source ChatGPT.) Furthermore, could it be done using consumer-grade hardware, i.e. 
not on an A100?<\/p>\n\n\n\n<p>Lo and behold, <a rel=\"noreferrer noopener\" href=\"https:\/\/twitter.com\/yacineMTB\" target=\"_blank\">yacine<\/a> had similar inklings and, while I was working on my project, released <a rel=\"noreferrer noopener\" href=\"https:\/\/scribepod.substack.com\/\" target=\"_blank\">scribepod<\/a>, powered by the 900-pound gorilla that is text-davinci-003. This was partial vindication &#8212; yes, it could be done &#8212; but also somewhat deflating since it meant it would need to be tethered to an API.<\/p>\n\n\n\n<p>Or must it be? COSMO can make the dialogue &#8212; but it needs some information on what to say. The critical task here is summarization: taking the raw novel text and distilling it into meaningful pieces that can be used as context when prompting the dialogue-generating LM. <a rel=\"noreferrer noopener\" href=\"https:\/\/peterszemraj.ch\/\" target=\"_blank\">Peter Szemraj<\/a> has been doing fantastic open source work in this space, and I decided to use his <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/pszemraj\/long-t5-tglobal-xl-16384-book-summary\" target=\"_blank\">long-t5-tglobal-xl-16384-book-summary<\/a> model (again a fine-tuning of T5 &#8212; are we noticing a pattern here? Thanks Google!!!)<\/p>\n\n\n\n<p>Okay, so I had an open source way to summarize text and generate dialogue. How about a bit of flair? Given the incredible results that diffusion models have had in image generation, I wanted to leverage these to give the podcast some imagery. My idea was a player for the podcast that would scroll between images generated from descriptions of the scene that the podcast participants were talking about. 
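The overall pipeline, then, is: chunk the raw novel text to fit the summarizer's context window, summarize each chunk, and hand the summaries to the dialogue model as conversational context. Here is a minimal sketch of that orchestration; the chunk size, function names, and prompt layout are my illustrative assumptions (not literAI's actual code), and the summarizer is stubbed out rather than loading the real long-t5 model:

```python
# Sketch of the summarize-then-generate pipeline: split the novel into chunks,
# summarize each chunk, then build a context prompt for the dialogue model.
# In literAI the summarizer is pszemraj/long-t5-tglobal-xl-16384-book-summary
# and the dialogue model is allenai/cosmo-xl; here the summarizer is a stub so
# the sketch runs anywhere.

def chunk_text(text: str, max_words: int = 512) -> list[str]:
    """Split text into word-bounded chunks small enough for the summarizer."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize(chunk: str) -> str:
    """Stub for the long-t5 summarizer; truncates instead of actually summarizing."""
    return chunk[:120]

def build_dialogue_prompt(summaries: list[str], speakers=("Host", "Guest")) -> str:
    """Assemble a COSMO-style prompt: narrative context, then the conversation setup."""
    context = " ".join(summaries)
    return f"{context}\n{speakers[0]} and {speakers[1]} are discussing the novel."

chunks = chunk_text("Call me Ishmael. " * 200)  # 600 words -> two chunks
prompt = build_dialogue_prompt([summarize(c) for c in chunks])
```

In the real generator the stub would presumably be replaced by a Hugging Face transformers summarization call, and the assembled prompt fed to COSMO to produce the podcast dialogue turns.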
To do this, I needed to automatically generate prompts to Stable Diffusion models (<a rel=\"noreferrer noopener\" href=\"https:\/\/lwneal.com\/rutkowski.html\" target=\"_blank\">Greg Rutkowski<\/a> here we come).<\/p>\n\n\n\n<p>The ChatGPT-solves-everything answer is to simply few-shot it with some examples of what you&#8217;d like, using something like <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/hwchase17\/langchain\" target=\"_blank\">LangChain<\/a>, and let those 175 billion parameters work their magic. To maintain our open source purity, I chose FLAN-T5 (<a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/pdf\/2210.11416.pdf\" target=\"_blank\">paper<\/a>; <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/google\/flan-t5-xl\" target=\"_blank\">model<\/a>), the instruction-tuned version of T5. FLAN-T5 produced very good results, although admittedly inferior to text-davinci-003&#8217;s. Alas, such is the price we must pay (or not pay in this case).<\/p>\n\n\n\n<p>Once the image descriptions were created it was simply a matter of generating a prompt and letting a Stable Diffusion model like <a rel=\"noreferrer noopener\" href=\"https:\/\/huggingface.co\/dreamlike-art\/dreamlike-diffusion-1.0\" target=\"_blank\">Dreamlike Diffusion<\/a> do the rest!<\/p>\n\n\n\n<figure class=\"is-layout-flex wp-block-gallery-1 wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-19-3-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1620\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-19-3-0.png\" alt=\"\" class=\"wp-image-1620\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-6-6-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1622\"  
src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-6-6-0.png\" alt=\"\" class=\"wp-image-1622\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-20-6-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1623\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-20-6-0.png\" alt=\"\" class=\"wp-image-1623\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-title-1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1621\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-title-1.png\" alt=\"\" class=\"wp-image-1621\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-6-1-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1624\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-6-1-0.png\" alt=\"\" class=\"wp-image-1624\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-19-6-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1625\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-19-6-0.png\" alt=\"\" class=\"wp-image-1625\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-title-11.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1626\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part2-title-11.png\" alt=\"\" class=\"wp-image-1626\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a 
href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-3-6-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1627\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-3-6-0.png\" alt=\"\" class=\"wp-image-1627\"\/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-4-4-0.png\"><img decoding=\"async\" loading=\"lazy\" width=\"512\" height=\"768\" data-id=\"1628\"  src=\"http:\/\/jeffq.com\/blog\/wp-content\/uploads\/2023\/02\/part1-4-4-0.png\" alt=\"\" class=\"wp-image-1628\"\/><\/a><\/figure>\n<figcaption class=\"blocks-gallery-caption wp-element-caption\">Images generated for H. G. Wells&#8217; &#8220;The War of the Worlds&#8221;<\/figcaption><\/figure>\n\n\n\n<p>The final piece was to make actual audio. I cribbed yacine&#8217;s use of <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/neonbjb\/tortoise-tts\" target=\"_blank\">TorToiSe<\/a>, and at last the amalgamation was complete &#8212; literAI was born! You can try out the visual player <a rel=\"noreferrer noopener\" href=\"https:\/\/literai.hooloovoo.ai\/\" target=\"_blank\">here<\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>I&#8217;ll save my poetic waxing about AI for another time. Rather, I&#8217;d like to simply appreciate the work of the countless researchers who contributed to getting us to the current SOTA. It&#8217;s frankly bewildering. I&#8217;m looking forward to where we&#8217;re going &#8212; and being a builder of it along the way.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Demo: https:\/\/literai.hooloovoo.ai\/ Source: Generator, UI At my previous job I did some shader programming, and generally tinkered around with GPU workloads, and even had the chance to attend Nvidia&#8217;s GPU Technology Conference a few times. 
I remember in 2018 or so being surprised that more and more of the conversation in this area was being [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[44],"_links":{"self":[{"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1617"}],"collection":[{"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/comments?post=1617"}],"version-history":[{"count":10,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1617\/revisions"}],"predecessor-version":[{"id":1637,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/posts\/1617\/revisions\/1637"}],"wp:attachment":[{"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/media?parent=1617"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/categories?post=1617"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/jeffq.com\/blog\/wp-json\/wp\/v2\/tags?post=1617"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}