While most people in our field are deploying artificial intelligence (AI) within self-driving car systems, I’ve been looking at a very different niche – using AI to create technology-oriented marketing content. The results for the written word are impressive so far, but there are still enough holes that existing technology can’t yet replace human writers.
But what about AI image generation, where you feed the AI a descriptive sentence and it generates a corresponding image? This technology has recently exploded, and what’s particularly interesting is that you aren’t limited to subjects with a real-life analog. So, if you want a singing dinosaur, a cat wearing a tuxedo, or a half-rabbit / half-frog, the computer can synthesize those images for you.
Is any of this useful for marketing? I’m glad you asked. In presentations, e-books, and whitepapers, Nancy and I make frequent use of stock photography. However, it’s challenging to find the right image. We spend a lot of time searching through stock sites to find what we need – or something close that can be adapted. Given that AI images can be generated with a prompt, I thought it would be worth experimenting to see if an AI can create what we need in less time than a search.
For my test run, I used our Third Law Autotech Revolution e-Book to see how easy it would be to create approximations of the contained images rather than search for them. I used two different engines to test out different approaches: OpenAI’s DALL-E 2 and the open source Stable Diffusion. Note that both of these AI image generators deliver a similar result – four different images – from a text prompt you provide (the more descriptive, the better). Most people pick the best image available as their result, but I’ll show you all the options for each engine so you can get a sense of the output without any interference from my human filtering process.
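If you want to script this comparison yourself rather than use the web interfaces, here’s a minimal sketch of how four images per prompt might be requested from each engine, assuming OpenAI’s Python client for DALL-E 2 and Hugging Face’s diffusers library for Stable Diffusion; the model name, size, and API key handling are illustrative rather than a fixed recipe.

```python
# Rough sketch: four images per prompt from each engine.
# Assumes the OpenAI Python client (DALL-E 2 image API) and the Hugging Face
# diffusers library; model names and parameters are illustrative.
import openai
from diffusers import StableDiffusionPipeline

prompt = "artistic aerial photograph of a highway intersection"

# DALL-E 2 via OpenAI's hosted API - returns URLs to the generated images.
openai.api_key = "YOUR_API_KEY"  # placeholder
dalle = openai.Image.create(prompt=prompt, n=4, size="512x512")
dalle_urls = [item["url"] for item in dalle["data"]]

# Stable Diffusion run locally through diffusers.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")  # or "mps" on Apple Silicon, or "cpu" (very slow)
sd_images = pipe(prompt, num_images_per_prompt=4).images
for i, img in enumerate(sd_images):
    img.save(f"stable_diffusion_{i}.png")
```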
I’m not necessarily looking for an exact replacement in my AI-generated image – just something that could be usable in this context, even if it’s artistically or thematically different. All the “magic” is in the prompt, so to keep the comparison fair, I’m using the same prompt in both systems.
Target 1 – Front Cover highway intersection
The cover of our e-Book features a great composition from our designer, with some colour treatments over a photographic but stylized highway overpass.
Feeding the prompt “artistic aerial photograph of a highway intersection” into both engines gives us the following results.
DALL-E
Stable Diffusion
All in all, not bad. But the results do underscore our first lesson learned.
AIs get you some of the way there, but not all the way
One big challenge with AI generators is that they aren’t terribly precise. If you look at the generated highway images, you’ll see that the lanes and lane markings are not evenly spaced. The roads are sometimes impractical or go nowhere. And as we’ll see in the next image, similar problems happen when depicting people – or anything, really. Faces can look quite crooked, eyes can point in different directions, and hands can have anywhere from two to seven fingers (and very often those are mangled).
The AI gets the general idea of a concept, but it isn’t perfect on execution. To me, it’s super impressive that I can ask an AI to draw a highway and get a recognizable image at all. But the AI models don’t know what they’re drawing. They’re trying to minimize a complex mathematical function over many iterations – and as a result, they don’t “notice” when things aren’t quite right. This lack of visual accuracy is the same problem that the text-focused AI generators I looked at have when writing copy. They can’t keep the facts straight because facts don’t matter.
Target 2 – micromobility
The next image in the e-book is a man on a scooter. Although the image shows him riding next to a canal, let’s give the generators some flexibility. We won’t specify where this person is and will see what the AI comes up with using “man riding on an e-scooter”.
DALL-E
Stable Diffusion
Hmmm – don’t look too closely. The images that show a face are rather horrifying: it’s not even uncanny valley, but valley of the mutants. Faces can be tricky.
Making a pretty face
You may have seen AI-generated art that has great faces in it. And there’s also Nvidia’s “This person does not exist”, an AI tool that generates amazingly realistic human faces. Those systems output great faces, so why are these AIs doing so poorly?
In the case of “this person does not exist”, the network was trained specifically to generate faces and was scored on the quality of the faces generated. Some people-pleasing art is being created with MidJourney, a tool that I didn’t examine here, but which seems to have received better training in the creation of proper human faces. In other cases, an AI image has been post-processed through a technique called digital inpainting, where a human paints over the areas that require more work and the AI regenerates just those regions. People have also manually taken over in some cases, using software like GFPGAN or Adobe Photoshop to replace or retouch faces.
In other words, if you want picture-perfect results, you will probably need to do some work to get there. At least with today’s technology, it will require the combination of several tools and a lot of patience. As the story behind “an AI won an art contest” shows, the artist behind that piece generated over 900 iterations to get close to what he wanted, and then ended up fixing the remaining problems in the image with Photoshop.
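To give a sense of the inpainting workflow mentioned above, here’s a minimal sketch using the Stable Diffusion inpainting pipeline from the diffusers library; the model name, file names, and prompt are assumptions. The idea is that you paint a mask over the problem area (a mangled face, say) and the model regenerates only that region.

```python
# Sketch of digital inpainting: regenerate only the masked region of an image.
# Model name, file names, and prompt are illustrative assumptions.
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
).to("cuda")

init_image = Image.open("scooter_rider.png").convert("RGB")  # original AI output
mask_image = Image.open("face_mask.png").convert("RGB")      # white = area to redo

fixed = pipe(
    prompt="a man's face, natural features, photorealistic",
    image=init_image,
    mask_image=mask_image,
).images[0]
fixed.save("scooter_rider_fixed.png")
```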
Target 3 – Waymo vehicle
Our next image is going to be a bit more difficult, asking for a very specific thing: “Waymo autonomous vehicle on a city street”.
DALL-E
Stable Diffusion
Honestly, some of these images look almost as ridiculous as the real thing, but I can’t say that I’d use any of them.
- The bumper in the first DALL-E picture was left with rainbow-like RGB striping. Evidently the model was confused about how to render a reflection. DALL-E doesn’t allow additional iterations beyond what’s built in; since Stable Diffusion is open source, it does. If this sort of problem occurred with Stable Diffusion, you could conceivably give it more iterations and see if the result gets better.
- Contextually, the Stable Diffusion cars are in better road positions, since DALL-E has one car on the sidewalk and one apparently perpendicular to the direction of travel. However, the Stable Diffusion concept of the Waymo vehicle quite prominently exaggerates the sensor pack on top of the vehicle, to rather comedic effect.
Hand-picking the best
The differences between each generator’s output on the same prompt underscore that they have different strengths. To use AI artwork for more than just curiosity, you will probably need to get familiar with more than one tool.
Target 4 – network
Next is an abstract image that hopefully will be easier to get right.
We used the prompt “abstract matrix graphic in perspective view with blurry lines and colored points”.
DALL-E
Stable Diffusion
Wow! Here, DALL-E clearly knows what we’re after. I’m sure it’s consumed plenty of these types of images, since stock sites are full of them. Stable Diffusion is grasping at straws – the images are not bad, but they don’t really have a “technology” bent. This could no doubt be improved with more tweaking of the input prompt, but it further underscores the differences between AI image generators and why you’ll want access to the different styles each generator provides.
Target 5 – ADAS system
We’ll wrap up with something that shows an autonomous system without anything of a proprietary nature. This type of non-branded, generic approach to technology is something that we often need for marketing purposes, and it’s why we used a much more specific prompt in this case. As it happens, this is exactly where stock photography sites are woefully inadequate. Hopefully the AIs can make good use of the prompt and solve this problem! Our prompt: “Car interior with view out windshield of city street and pedestrians, with features boxed and labeled for ADAS”.
DALL-E
Stable Diffusion
Um, no. While Stable Diffusion might do a more realistic job of rendering a car interior, neither of these systems was able to capture the elements of an ADAS system. For something this specific, there is probably just not enough source material available to feed the models. It’s probably not useful to retouch any of these images either – it would be far easier to use Photoshop to remove branding and add the needed ADAS features to an existing car image than to correct all the flaws in these generated images.
Lessons learned
After all our experimenting with AI image generation, do we think it can replace stock photography? Overall, not a chance. But perhaps in very specific use cases.
AI images can give you something, but it’s often not what you want.
You don’t have a lot of control. That can be a good thing if it breaks you out of a rut – when you’re looking for one type of image and the AI gives you something completely different that might just work. Frankly, a lack of control is often the same problem we have when searching stock photography for something quite specific. If you are expecting quality and exactly what’s in your mind’s eye, you might not get it from either an AI or a stock photography site. You’re better off having a photographer stage it or asking an artist to render it for you.
AI images have terrible consistency
If you’re expecting everyone to have human-looking eyes, or you want your car to be parallel to the road traffic, or you need an infotainment system that wasn’t designed by a toddler – you might have trouble with AI generated images. In cases where the images may be very small (such as using them for a mosaic), it might not matter.
You’ll spend as much time perfecting the prompts as you will searching.
I tried not to focus on the “prompt crafting” aspect for these images, but getting the prompt right is a critical part of getting good AI images. By changing the text prompt, you can often narrow in on the right image, but it takes a lot of tinkering and experimenting. Each set of images takes time to generate, so this can be a long process – and sometimes an ultimate dead end.
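One trick that makes the tinkering a little more systematic – sketched below with the diffusers library and some invented prompt variants – is to hold the random seed fixed while you reword the prompt, so that differences between runs come from your wording rather than from chance.

```python
# Sketch: iterate on prompt wording with a fixed random seed, so changes in
# the output come from the prompt rather than from random variation.
# The prompt variants and model name are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

variants = [
    "abstract matrix graphic in perspective view with blurry lines and colored points",
    "abstract digital network graphic, perspective view, blurry lines, glowing colored points",
    "futuristic data network background, perspective grid, bokeh points of coloured light",
]

for i, prompt in enumerate(variants):
    generator = torch.Generator("cuda").manual_seed(42)  # same seed every run
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"variant_{i}.png")
```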
AI images might not have good artistic merit or usability
Stock pictures have usually been thought through for lighting, composition, positioning, and viewpoint so that the image is interesting, attractive, and communicates well. Right now, those things are better done by a human. If you can find something in a stock database, it’s almost certainly going to be better at fulfilling your need.
That’s also true for usability. Is the image easily used for its intended purpose? As an example, if it’s for a slide background, does it have enough “blank space” to allow text to be placed over top? Does the colour palette match the colours used in the rest of the piece? Does it have the same stylistic treatment as your corporate brand? Some of these attributes might be correctable with additional prompts, but the AIs aren’t great at proactively fixing them.
Stick to the abstract
If you need images where consistency doesn’t matter, you might be able to use AI-generated work. For things like abstract backgrounds – such as our attempt #4 – they might be good enough.
Resolution might be a problem
One issue we didn’t address is resolution. Most stock images that we buy are high resolution. For maximum flexibility, we downscale them as needed – for web, white paper, PowerPoint, or print. Because AI generators need to work quite a bit on crafting every pixel, the resulting images are often limited to relatively small sizes like 512 x 512. One solution to this problem is to use another AI tool like Gigapixel AI to upscale the resulting image.
Since Stable Diffusion is open source, you could also run it on your own machine at larger sizes. However, don’t plan on running these algorithms on your own hardware unless you’ve got a killer graphics card and a blazing fast machine. Image generation takes a very long time on standard hardware. Very long. As an example, my Apple MacBook Air with the speedy new M1 chip took 33 minutes to render a single image, while the https://stablediffusionweb.com/ web site, backed by a lot of cloud processing power, rendered four images in around 30 seconds.
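If you do run Stable Diffusion locally through the diffusers library, you can ask for output sizes other than the 512 x 512 default – a minimal sketch follows, with the model name and dimensions as assumptions. Larger sizes need more memory and time, and quality tends to degrade the further you move from the training resolution.

```python
# Sketch: a local Stable Diffusion run with a non-default output size.
# Model name and dimensions are illustrative; sizes must be multiples of 8.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("mps")            # Apple Silicon; use "cuda" for an Nvidia GPU
pipe.enable_attention_slicing()  # trade some speed for lower memory use

image = pipe(
    "artistic aerial photograph of a highway intersection",
    height=512,
    width=768,
).images[0]
image.save("highway_wide.png")
```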
Fine-tuning to the rescue
Some of the problems we’ve encountered could be fixed if you were able to get the model to better understand specific subjects – like a Waymo vehicle or an ADAS system, for example. The model within Stable Diffusion was trained on five billion text/image pairs. Although the model can be fine-tuned to make it better for specific images, this process requires somewhere between 10,000 and 100,000 extra images. This makes it impractical to change the underlying model for anyone except researchers.
This is where DreamBooth comes to the rescue. Researchers at Google have created new algorithms that extend Stable Diffusion to make it much smarter at fine-tuning. They can add knowledge of a specific subject to the model and generate that subject in new contexts with just four images. So, if you want to create images with a specific car, a specific bit of technology, or a specific human model, that’s now perfectly possible. With this extra fine-tuning, the results on those specific targets also become much clearer.
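Once a model has been fine-tuned with a DreamBooth-style method, using it looks much like using the base model – you simply reference the subject by the placeholder token it was bound to during training. The sketch below assumes a hypothetical local checkpoint and the commonly used “sks” token; both are illustrative, not a real recipe.

```python
# Sketch: generating with a hypothetical DreamBooth fine-tuned checkpoint.
# The local path and the "sks" placeholder token are assumptions.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("./dreambooth-custom-subject").to("cuda")

image = pipe("a photo of sks autonomous vehicle on a rainy city street at dusk").images[0]
image.save("custom_subject.png")
```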
Although DreamBooth code has been kept proprietary inside Google, the paper it was based on gave enough detail to allow developers to replicate the DreamBooth techniques in new open source software like Waifu diffusion.
What’s Next?
The challenges to using AI-generated images for technology marketing purposes aren’t insurmountable. They require someone who’s invested a bit of time into learning the tools and techniques, and with all the attention the field is getting, the models will continue to improve. Digital art has been completely transformed by AI generators within the last year or so, and stock photography will likely follow the same path. With a little more work, human-guided AI image generation will make stock photography obsolete. It won’t be long before automatically generating copyright-free photography, clipart, or illustrations for your slides is the next big feature in PowerPoint.