When faced with a bit of downtime, many of my friends will turn to the same party game. It’s based on the surrealist game Exquisite Corpse, and involves translating brief written descriptions into rapidly made drawings and back again. One group calls it Telephone Pictionary; another refers to it as Writey-Drawey. The internet tells me it is also called Eat Poop You Cat, a sequence of words surely inspired by one of the game’s results.
As recently as three years ago, it was rare to encounter text-to-image or image-to-text mistranslations in daily life, which made the outrageous outcomes of the game feel especially novel. But we have since entered a new era of image-making. With the aid of AI image generators like Dall-E 3, Stable Diffusion and Midjourney, and the generative features integrated into Adobe’s Creative Cloud programs, you can now transform a sentence or phrase into a highly detailed image in mere seconds. Images, likewise, can be nearly instantly translated into descriptive text. Today, you can play Eat Poop You Cat alone in your room, cavorting with the algorithms.
Back in the summer of 2023, I tried it, using a browser-based version of Stable Diffusion and an AI application called Clip Interrogator, which translates any image into a text prompt. It took about three minutes to play two rounds of the game. I kicked things off by typing “Eat Poop You Cat” (why not?) into a field that encouraged me to “Enter your prompt”. Then I clicked “Generate Image”.
Stable Diffusion generates four images in response to any prompt; I cheated slightly by just choosing my favourite to proceed. From the centre of the frame, a decently realistic tabby cat stared me down, green eyes glowing wide, mouth hanging open to display a salmon-pink tongue. The background was grungy grey without much detail; some bubbly white text in the image’s lower third read: EAT EAT POOOOP POOP YU NOU SOME YOU!
I dragged this image into Clip Interrogator, which spat back the prompt: “A closeup of a cat with green eyes, blue text that says 3kliksphilip, epic urban bakground, poop, white border and background, licking out, epic poster, office cubicle background, golden toilet, funny cartoonish, erin, classic gem, messy eater, exploitable image, leave, motivational, moving poetry, toilet.”
A nuanced syntax for image-generating prompts has emerged alongside the development of generative AI (genAI) tools, and Clip Interrogator’s “prompt” mimicked that accretionary layering of styles, details and descriptors – though this list felt excessive, like a psychedelic extrapolation of the image, which I was glad to know was already a “classic gem”.
After a few more back-and-forths I ended up with an image of a black-and-brown cat lounging on a commode that could have been designed by Frank Lloyd Wright. A bit of toilet paper, which had fallen on to the cat’s head from the roll above, approximated a hat. The image was flat and looked painted. The style felt familiar – expressionist? German expressionist? Faux-naïf? Influenced, certainly, by Modigliani, early Picasso, some of the later still lifes by the Polish cubist Henri Hayden.
Clip Interrogator described this tableau as “a painting of a cat sitting on a toilet, PlayStation 2 gameplay still, in style of pop-art, by Ignacy Witkiewicz, the fool tarot, inspired by Phil Foglio, punkdrone, molecular gastronomy, app, bong, persona 5, text: roborock, destroy lonely, dog, ASCII, 1 8 2 4, tarot card design.” Destroy Lonely is not a command, I learned, but a trap artist from Atlanta. Roborock is a Roomba-like automated vacuum cleaner. Phil Foglio is a cartoonist best known for unconventionally silly Magic: The Gathering illustrations. The inclusion of the late-19th-century writer and painter Stanisław Ignacy Witkiewicz affirmed my intuition that there was something vaguely Polish about this image.
Stable Diffusion makes images by mapping language to a vast set of visual variables, while Clip Interrogator performs the inverse function. The seemingly random strings of proper and phrasal nouns and adjectives are the result of neural networks “reading” the image and assessing sections of pixels for clues that are then correlated with terms, however opaquely. (While the configuration of pixels that translates to “cat sitting on a toilet” is clear enough, those signaling “punkdrone” or “the fool tarot” are less so.)
Because there are so many ways to picture even the simplest cat in the simplest scenario, text-to-image and image-to-text models are far from one-to-one processes of translation. If they were, the algorithms and I couldn’t play this game. But close reading even such an unserious set of prompts and images offers clues about the scaffolding behind these operations, as well as broader insights into the clumsy, grab-bag way humans tend to deploy language when attempting to describe an image.
How to create people who do not exist
Although there were plenty of precursors, it wasn’t until January 2021 that talk of AI artists became big news, as people began to learn of the image-generating platform Dall-E. Back then, descriptions of the “AI artist” still felt like something out of a children’s book: type in a sentence and the computer magically spits out an image!
The technology sounded too advanced to be real, but it had been coming down the pipeline for decades. The first neural network was proposed in 1943, and the technology’s development continued in fits and starts throughout the 20th century. As early as 1989, neural networks could decipher typed and handwritten characters, and computer-vision applications expanded rapidly as hardware capacity increased. Soon, optical character recognition allowed us to convert PDFs to editable text, and now we can copy text snippets in photos taken on our phones. Optical character recognition relies on natural language processing, the field concerned with enabling algorithms to output and receive messages in human language rather than a programming language. Natural language processing combines computational linguistics with statistical modelling and algorithms – now usually neural networks – to process and produce “natural” language through methods such as breaking down sentences, tagging parts of speech, assessing words’ most frequent positions in a sentence, and highlighting words that do the most prominent signifying (usually nouns and verbs).
By 2015, algorithmic processes were able to form simple sentences or phrases to describe an image. Patterns of pixels identified as, say, “cat” or “cup” were matched with linguistic tags, which were then translated into automated image captions in natural language. Quickly, researchers realised they could flip the order of these operations: what would it look like to input tags – or even natural language – and ask the neural networks to produce images in response? But reversing the image-to-text operation proved less than straightforward, as there’s a vast difference between the complexity of a basic phrase and even the simplest image. (While almost any image of a large, centred feline could be described as “a closeup of a cat”, there are infinite possible ways to depict the phrase.) One would also have to collect an enormous quantity of visual data to build up an understanding of the near-infinite visual signs that can be described in language.
Some early attempts at image generation dealt with the problems of complexity and dataset size by constraining both the style of an image and its subject matter. The authors of a pivotal 2016 paper – Generative Adversarial Text to Image Synthesis – began by training their models on limited libraries of images, specifically the Oxford-102 Flowers and Caltech-UCSD Birds datasets.
The bird dataset contains 11,788 photographic images of birds broken down into 200 mostly North American species, annotated with additional attributes such as “Bill Shape”, “Belly Pattern”, and “Underparts Color”. The dataset’s images were downloaded from Flickr and then categorised and annotated by human workers hired on Amazon’s Mechanical Turk, a crowdsourcing platform often referred to as “artificial artificial intelligence”. While one might assume today’s text-to-image tools have been automated all the way down, their architecture and maintenance rely on enormous quantities of human labour, whether the repetitive “clickwork” performed predominantly in the global south by workers paid pennies per “task”, or the voluntary, quotidian labour you’ve provided each time you’ve filled out a Captcha. To learn, neural networks need an initial set of labelled and categorised images, and a person needs to do that initial tagging and sorting – in this case, identifying the location of parts (“back”, “beak”, “belly”, “breast”) and attributes (“has_bill_length::about_the_same_as_head”) for the 59 photos that typify the “glaucous-winged gull”. (The Oxford-102 Flowers were, somewhat less informatively, “acquired by searching the web and taking pictures”.)
By training generative adversarial networks on these limited datasets of tagged images, the paper’s authors were able to generate unique, somewhat plausible bird images from phrases such as “this small bird has a short, pointy orange beak and white belly” and “this magnificent fellow is almost all black with a red crest, and white cheek patch”.
A few years later, in early 2019, the US chip manufacturer Nvidia released an open-source version of StyleGAN, a generative AI that produces a near-infinite supply of unique, synthesised images of faces, allowing a user to control features such as face shape and hairstyles. (This AI was also trained on thousands of images from Flickr, and Nvidia claims “only images under permissive licenses were collected”.) Soon after, Phillip Wang, a software engineer, created thispersondoesnotexist.com, a website that publishes a new, random, synthesised portrait upon each refresh. From there, a horde of copycats followed: This Horse Does Not Exist, This City Does Not Exist, This Chair Does Not Exist, and so on.
While fears of deepfakes had been gracing headlines and raising hackles for more than a year, the sudden onslaught of images of People Who Did Not Exist seemed to trip a wire in the broader collective consciousness. These fake faces were quickly cited as threats to democracy, and calls arose for algorithms that would catch and flag the generated images. Meanwhile, StyleGAN branched out and began to tackle anime portraits. While the image type changed, the subject matter remained constrained.
In contrast, ImageNet, a project initiated in 2006 by the computer scientist Fei-Fei Li, had the immodest aim of “map[ping] out the entire world of objects”. The dataset contains upward of 14m annotated images, organised into more than 100,000 “meaningful categories”. It also employed the labour of more than 25,000 workers via Mechanical Turk. While 100,000 is an astonishing number of categories, it’s extraordinarily small when you consider the visual complexity of the world.
Categorical reduction and oversimplification never bode well, especially when it comes to labelling humans. ImageNet drew upon a pre-existing lexical taxonomy that was developed in the 1980s and borrowed from several earlier lexical sets. As one dataset built upon another, each carried forward the logics and hierarchies of the previous set, if not all its terms. As researcher Kate Crawford and artist Trevor Paglen have highlighted, the original ImageNet dataset contained an image of a child labelled as a “loser”; included the categories “slut”, “whore”, and “negroid”; and curiously placed “hermaphrodite” as a subcategory of “bisexual”, which in turn was listed as a subcategory of “sensualist”, alongside “cocksucker” and “epicure”. In 2019, ImageNet removed more than 600,000 images tagged with “unsafe”, “offensive” or “sensitive” categories, patching the most visible cracks in a fundamentally flawed framework. Still, ImageNet’s categories look controlled and careful when compared with its successors.
GenAI goes mainstream
On 5 January 2021, when the San Francisco-based research laboratory OpenAI announced Dall-E, it also announced Clip, an image-classifying neural network, which was integrated into Dall-E’s processes. In a braggy blog post, OpenAI mocks the ImageNet dataset for its costliness in terms of time and labour, as well as its limited range of content. “In contrast,” the post’s authors declare, “Clip learns from text-image pairs that are already publicly available on the internet.” (Where on the internet, exactly, we still don’t know. But considering the staggering scale of the training dataset – more than 400m image-text pairs – the answer is likely pretty much everywhere.)
We know for certain that Clip includes thousands of works by artists, illustrators, photographers and graphic designers, because one of the things you could do with Dall-E – one of the things you were encouraged to do – is ask it to generate an image in the style of a particular artist. In summer 2022, nearly a year after a public version called Dall-E Mini was released, social media was flooded with images that followed an “A but B” formula, juxtaposing a subject with an unexpected style or context: “Kim Kardashian painted by Salvador Dalí” (naturally), “R2-D2 getting baptised”, and (a personal favourite) “a peanut butter sandwich Rubik’s Cube”.
These generated images are not simply Frankenstein’s monsters assembled from various bits of images hoovered from across the web. Instead, genAI models create generalised ideas of signs, signifiers, image types and styles that correlate with probable pixel patterns. Dall-E’s deep learning algorithms decode a digital image’s arrangement of pixels into hundreds of axes of variables, which it then uses to assess an image and its component parts, and consequently create similar but unique arrangements in the future. When you ask a genAI tool such as Dall-E or Stable Diffusion to style an image after a particular artist, it isn’t copying the artist’s work so much as it is interpreting and reproducing the artist’s patterns – their subject matter, compositional decisions and use of colour, line and form.
The quantity and range of images available on the internet, and how they are tagged, impact how well genAI tools can generate images of a certain subject matter. The more digital images of different works by a particular artist are available, the better the genAI will be at replicating their style; the more a visual idea appears, the more it will be reproduced. Given that there is, for instance, an over-representation of images and descriptions of white men as surgeons on the internet, genAI tools circa 2023 almost always produced a white man when you asked them to generate a surgeon.
Rather than fix the foundational issues in the datasets, these tools’ developers have attempted to obscure them through “debiasing”, or coding in safeguards to ensure diversity – which is how we get Gemini, Google’s recently rebranded genAI tool, producing images of Nazis of colour when prompted to “generate an image of a 1943 German soldier”.
Oh, the humanity!
As text-to-image genAI tools grew increasingly sophisticated, the surrounding discourse grew increasingly alarmed: “Generative AI Is Changing Everything”; “Did Picture-Generating AI Just Make Artists Obsolete?”; “Can AI End Your Design Career?”; “Art Is Dead and We Have Killed It”.
Many of these proclamations came from the camp of AI boosters, others from technophobes and visual artists themselves. In early May 2023, an open letter-cum-manifesto entitled Restrict AI Illustration from Publishing appeared on the website of the Center for Artistic Inquiry and Reporting, written by the institute’s director, Marisa Mazria Katz, and the prominent leftist illustrator Molly Crabapple. The letter outlines something of a fairytale relationship between journalism and illustration, which “speaks to something not just intimately connected to the news, but intrinsically human about story itself”. Generative tools, on the other hand, take mere seconds to “churn out polished, detailed simulacra of what previously would have been illustrations drawn by the human hand”, producing images that are either entirely free or cost “a few pennies”. The letter concludes with a call to “take a pledge for human values against the use of generative-AI images to replace human-made art”. More than 4,000 people – a range of well-known writers, journalists, artists and celebrities – have signed.
There are plenty of reasons to be cautious about the use of genAI for journalistic image production, the technology’s embedded biases and enormous energy footprint chief among them. As of late 2023, Stable Diffusion showed us that “Iraq” only ever looks like a military occupation and that “a person at social services” isn’t white, though “a productive person” usually is, and is always male, while “a person cleaning” is always a woman. Midjourney interpreted “an Indian person” with remarkable consistency as an old, bearded man in an orange pagri, and “a house in Nigeria” as a dilapidated structure with a tin or thatched roof. Meanwhile, a November 2023 study found that producing a single image with genAI can use about the same amount of energy as charging a smartphone halfway – much more than is required to generate text – and that as models have grown more powerful and complex, they have also grown more energy intensive.
The threats to “human values” and the “humanity” of art, however, strike me as overblown. Humans produce generative AI – not only the scripts and mechanisms behind the technology, but the infrastructure at every stage: the Mechanical Turk workers tagging Caltech-UCSD Birds; the anonymous people posting nonsense on X; the Kenyan content moderators paid $2 an hour to review endless horrors just so people can’t accidentally make Dall-E child sexual abuse images. Human choices, foibles and prejudices are the very bedrock of these tools. I’m more frightened by genAI’s humanity – all the assumptions and oddities inherited via their training images, every representational bias enshrined and automated in their tagging sets, each exhausted impulse of the underpaid labourers clicking and sorting as fast as they can – than most other aspects of genAI.
But what about artists’ livelihoods? It’s true that “no human illustrator can work quickly enough or cheaply enough to compete with these robot replacements”, as Mazria Katz and Crabapple write. But to say that “if this technology is left unchecked, it will radically reshape the field of journalism” is to paint a rather rosy picture of the field. The dystopian future Mazria Katz and Crabapple fear will come to pass if genAI is left unchecked – the one in which “only a tiny elite of artists can remain in business, their work selling as a kind of luxury status symbol” – is, unfortunately, already here. Many, perhaps even most, publications see paying fair market wages for the often extensive labour required to produce a custom image as an unjustifiable expense. Why pay for images when there’s a plethora of stock photos and illustrations you can buy super cheaply, memes you can right-click and copy, open-source images you can download from Wikimedia, clip art you can drag and drop in, and pre-existing work by illustrators that so many simply screenshot and steal? Of the publications and businesses that do still commission original work, many have long outsourced design and illustration through online gig-work platforms such as Fiverr, which were modelled after the general concept of Mechanical Turk.
The best path forward for labour protections might be to ensure that those already trained in crafting communicative, compelling images – illustrators, artists, photographers, photo editors – will be best at using these systems. (Wired, the first US publication to adopt an official AI policy, has already enshrined this idea in guidelines. “Some working artists are now incorporating generative AI into their creative process in much the same way that they use other digital tools,” the policy notes. Wired “will commission work from these artists as long as it involves significant creative input by the artist and does not blatantly imitate existing work or infringe copyright. In such cases we will disclose the fact that generative AI was used.” The magazine expressly says it will not use genAI images instead of stock photography, as “selling images to stock archives is how many working photographers make ends meet”. The Guardian’s statement on its approach to generative AI can be read here.)
Like laptops, cameras and paintbrushes, genAI models are tools, and their true efficacy depends upon the skill and knowledge with which they are used. They are also, of course, tools crafted and actively maintained by humans, who deserve to be visible in the chain of image-production labour and considered in discussions of livelihoods. Rather than “artificial intelligence”, then, I prefer to refer to these algorithmic, neural net-powered tools as estranged intelligence, or alienated intelligence. The intelligence – the humanity! – isn’t fake or forged; it is only concealed, outsourced and offshored, remixed and conglomerated, translated into algorithms that it then quietly labours to refine and train.
But I know what Mazria Katz and Crabapple mean. It’s insulting to have your hard-won style stolen by an algorithm. I want to believe that something clear and visible is lost in AI-generated images, that what we call “the hand” – all the subtle, holy imperfections and artefacts of existence left on a made thing – is palpably missing. But I have taken many online quizzes claiming to test one’s ability to distinguish between AI-generated images and photographs, paintings and drawings made by other means, and I must be honest: I do poorly on these tests. Certainly they were built to stump, pitting the best outputs of the generators against uncanny works made by other means, but given that I’ve worked as a graphic designer, a design educator, and an editor at an art publication, I’d like to think I have a somewhat discerning eye. What, then, is the tell of absent humanity?
In the early days of Dall-E, Stable Diffusion and Midjourney, the distinct tics of the generators’ weaknesses – mangled hands, habits of repetition, penchants for centred compositions, errors of physics – more readily betrayed their output as products of AI, while also making it fairly easy to tell apart images produced by each generator. But with each generation of generators, the tells have become less and less visible.
The era of ‘prompt engineering’
While text-to-image (and image-to-text) genAI tools are built on natural language processing, the language that tends to result in the best outcomes reads as far from “natural”. The syntax of prompting is unique enough that a market for so-called “prompt engineers” has emerged, while blogs and vlogs covering Prompt Writing 101 abound.
Most guides to prompt writing suggest a tripartite form: a subject, a description and a style/aesthetic of the image. A “description” usually means a present-participle phrase, eg “a cat drinking coffee” or “a bulldog swimming in the ocean”. When it comes to the “style/aesthetic” of the image, though, it’s less immediately clear what applies. “Epic poster” is a style, as is “funny cartoonish” and “exploitable image”, which refers to any kind of meme that someone can customise by adding their own text or supplementary image. But these aren’t the sorts of descriptors one would generally reach for when conjuring visual styles.
Terms that have become popular prompting shorthand include “retro”, “product photography”, “food photography”, “highly detailed”, “digital art masterpiece”, “C4d render”, “Octane Render”, and “trending on ArtStation”. Names of proprietary software and platforms – such as the 3D-modelling software Cinema4D, or C4D for short; Octane, an “unbiased” graphics-rendering software; and ArtStation, a platform showcasing work by game designers and animators – have transformed into adjectives overnight. Likewise, artists’ names are more often deployed to achieve a visual style than to directly ape an artist’s work. We already have the cultural habit of using proper nouns as eponyms for periods and styles (Louis XIV, Bauhaus, Studio 54), but prompt language has accelerated the trend. There are now websites that catalogue thousands of image styles indexed by artist names, largely those of digital artists and concept designers.
Prompt crafting relies on learning these terms and understanding the mass of visual phenomena to which they are yoked – the subject matter, visual attributes, media and composition styles. While prompt writing is quickly becoming a marketable skill, there’s still much about the innermost workings of the deep-learning algorithms that even the most advanced engineers don’t fully understand. Sam Bowman, who runs an AI research lab at NYU, has said that even specialists like him can’t discern what concepts or “rules of reasoning” are being used by most of these complex systems. “We built it, we trained it, but we don’t know what it’s doing,” Bowman confessed.
Found in translation
Circa October 2022, Dall-E 2 had a hard time with context clues and sequencing, particularly when dealing with how adjectives or descriptive phrases are applied to nouns or verbs. If you told Dall-E 2 to generate “a fish and a gold ingot” it usually gave you a fish that was also gold, frequently a goldfish, as if attempting a kind of wordplay.
Dall-E 2 also went nuts for heteronyms. One example – as elucidated by the academics Royi Rassin, Shauli Ravfogel and Yoav Goldberg – was the prompt “a bat is flying over a baseball stadium”, which produced a jaunty, cartoonish, vector-like illustration of a baseball stadium, over which a baseball and both a baseball bat and the animal we know as a bat fly. The problem is that the tag “bat” correlates to two different kinds of pixel patterns, and the genAI isn’t sure which to choose. Hedging its bets, it throws in both.
Rassin et al describe the confusion that lurks in these linguistic-to-visual translations as the “semantic leakage of properties between entities”. In the image, the two kinds of bats appear to be soaring in tandem; perhaps the bat (animal) is actually wielding the bat (baseball). A white teardrop shape seems an attempt at a smile, indicating our friend the bat (animal) is having a great time. To the bat’s left, a flat grey cloud and a lightning bolt interrupt the blue sky. The paper’s authors don’t provide a clear linguistic reason for how the lightning bolt snuck in there, but my untested image-associative guess is that bats (animal) frequently show up in imagery with witches, who are prone to doing spells and zapping things.
The lightning bolt is a good example of what Rassin et al refer to as “second-order stimuli”: the networked associations embedded in language and images we’re rarely conscious of. When you ask Dall-E 2 for an armadillo on a seashore, it will often throw in a few shells as well. Why? Well, think of the terms in the word cloud for “armadillo”, or what Fei-Fei Li calls its “social network of visual concepts”: “mammal”, “armour”, “ball” and … “shell”. (For comparison, a request for “dog on a seashore” generates a beach, but no shells.) This “leakage” of associative traits can add a deeper layer of absurdity to these images, which is often pointed to as proof of the generative tools’ lack of sophistication, their poor results.
It would be a mistake, though, to treat semantic leakage as proof of technology’s clumsiness rather than its acute sensitivity. “A tall, long-legged, long-necked bird and a construction site” spits out an image that includes both a crane (bird) and a crane (construction equipment). While this would initially read as an error, and software engineers are surely working to resolve the bug, it’s in fact a sophisticated linguistic affiliation, a return of the heteronym problem by proxy, as the word “crane” never appears in the prompt.
For all the biases and patterns they show, genAI tools also inherit and pictorialise language’s nuances and ambiguities – such as English’s excess of heteronyms and homonyms, and their possible confusion. New image-making technologies – whether the printing press, the camera or satellite imaging – change our perception of the world, which in turn changes our behaviours. The question at hand is: what are these algorithmic images teaching us to see, say and do?
As of January 2024, genAI text-to-image tools produced about 34m images a day. This number is still dwarfed by the daily count of digital photographs, but for how long? From here on out, it’s safest to assume that any image you encounter might be generated. What differentiates these images is not their lack of humanity, but their intense abundance of it: all the alienated intelligence, historical strata and linguistic tics embedded and reproduced within them. Each prompter sets off a huge chain of networked collaboration with artists and academics, clickworkers and random internet users, across time and space, engaging in one massive, multicentury, ongoing game of Eat Poop You Cat. Like it or not, all of us – whether pre-algorithmic image makers or self-described AI artists – will have to learn to play.
A longer version of this essay first appeared in n+1 magazine.
Source link
lol