A highly praised AI video generation tool made by multibillion-dollar company Runway was secretly trained by scraping thousands of videos from popular YouTube creators and brands, as well as pirated films, according to a massive internal spreadsheet of training data obtained by 404 Media.
The model—initially codenamed Jupiter and released officially as Gen-3—drew widespread praise from the AI development community and technology outlets covering its launch when Runway released it in June. Last year, Runway raised $141 million from investors including Google and Nvidia, at a $1.5 billion valuation.
When TechCrunch asked Runway co-founder Anastasis Germanidis in June where the training data for Gen-3 came from, he would not offer specifics.
“We have an in-house research team that oversees all of our training and we use curated, internal datasets to train our models,” Germanidis told TechCrunch.
The spreadsheet of training data viewed by 404 Media and our testing of the model indicates that part of its training data is popular content from the YouTube channels of thousands of media and entertainment companies, including The New Yorker, VICE News, Pixar, Disney, Netflix, Sony, and many others. It also includes links to channels and individual videos belonging to popular influencers and content creators, including Casey Neistat, Sam Kolder, Benjamin Hardman, Marques Brownlee, and numerous others.
The spreadsheet is here. 404 Media redacted columns containing names of Runway employees.
While 404 Media couldn’t confirm that every single video included in the spreadsheet was used to train Gen-3—it’s possible that some content was filtered out later or that not every single link on the spreadsheet was scraped—the training data reveals specifics about the generative AI industry, which has been repeatedly accused of training models on copyrighted material.
Runway did not respond to multiple requests for comment via email, LinkedIn, and its official Discord channel.
When reached for comment, Google, which operates YouTube and is a Runway investor, pointed us to a Bloomberg story from April, in which the company told the publication that OpenAI training its AI video generator Sora with YouTube videos would violate YouTube’s rules.
“Our previous comments on this still stand,” a Google spokesperson told 404 Media in an email when asked about Runway scraping YouTube videos.
There was a company-wide effort to compile videos into spreadsheets to serve as training data, a former Runway employee told 404 Media. After the list of videos was compiled, Runway scraped the videos using open-source software, specifically YouTube-DL, which has a proxy configuration option. Runway purchased proxies from a provider, the source said, which gives customers IP addresses through which download requests are routed, in order to avoid getting blocked by YouTube. 404 Media granted the source in this article anonymity because they feared professional retribution.
“The channels in that spreadsheet were a company-wide effort to find good quality videos to build the model with,” the former employee said. “This was then used as input to a massive web crawler which downloaded all the videos from all those channels, using proxies to avoid getting blocked by Google.”
The document contains 14 spreadsheets, each labeled with different categories. One of the spreadsheets contains what appears to be a list of 117 terms like “beach,” “doctor,” and “rain,” and the names of Runway employees next to each of those terms. The former employee told 404 Media that these names were either people tasked by others to find videos related to the keywords, or the employees themselves noting that they were working on that keyword. Next to the term “rainbow” and the employee name, someone wrote a note that said “no channels or playlists dedicated to it but found good individual videos for finetuning.”
Notes in the document show that the company was trying to obtain videos with specific types of subject matter and camera work, and with a diverse set of people in them. The “high camera movement” sheet contains 177 links to YouTube channels, including the official Call of Duty channel, filmmaker Josh Neuman’s channel, and the Unreal Engine and Vans channels.
A spreadsheet titled “Cinematic Masterpieces” contains 206 links to individual channels and videos of especially high quality, including animated shorts and student films. On that sheet, a note next to a link to the DEFY Studio YouTube channel says “THE HOLY GRAIL OF CAR CINEMATICS SO FAR.” “Single great videos (for finetuning)” is a stockpile of another 253 videos along with a column for topics, like “waxing eyebrows,” “ice sculpting,” “smiling” and “screaming.”
Runway’s launch of Gen-3 was praised as being high-quality and useful for cinematic shots. It features camera controls and “director mode” that allows “fine-grained control over structure, style and motion,” according to the company.
In addition to scraping videos from YouTube for training purposes, the Runway employees compiling the sheet appear to have at least considered also using videos obtained from piracy sites. A spreadsheet titled “Non-YouTube source” contains 14 links, including one to kisscartoon.sh, which allows people to stream a wide variety of popular cartoons and animated movies. A search of the Lumen database, which keeps a record of takedown notices and other legal removal requests submitted to Google, returns thousands of copyright complaints against kisscartoon.sh.
The “Non-YouTube source” sheet also contains a link to an archive of Studio Ghibli films, several anime piracy sites, a fan site for Xbox game clips, and a now-offline movie piracy site called AZiMovies that has a note with it from someone at Runway: “Tons of stuff in here.”
A sheet of 17,112 terms, including “hand car wash,” “doing boxing,” “hitting a pinata,” “cracking neck,” “jaywalking,” and dozens more, includes corresponding queries to search YouTube, like “how to wash a car properly,” “what happens if you are caught jaywalking,” and “dangers of cracking your own neck.”
A “recommended channels” sheet includes links to 3,967 YouTube channels, many of them belonging to major brands and media outlets like Pixar, Glamour, CBS New York, the Monterey Bay Aquarium, AMC Theatres, and multiple official Disney channels like Disney XD and Disney Plus.
We don’t know for a fact that every single video listed in the spreadsheet is included in the model. But 404 Media tried generating videos and images on Runway using prompts that contained keywords and content based on the spreadsheet, and was able to generate videos in the same styles as the creators listed in the sheets.
The model wouldn’t create perfect likenesses of real people, but it got close in several examples using popular YouTube personalities included in the spreadsheet. For example, the prompt “Mark Wiens,” the name of a popular food and travel YouTuber with over 10 million subscribers who is included in the spreadsheet of content scraped to train Gen-3, generated a video of a man holding up a camera and filming himself eating in an outdoor food market, much like Wiens does in many of his videos. When we tried the same prompt with Gen-2, it generated an unrelated video of a man in a suit.
After we reached out to Runway for comment, Gen-3 stopped generating videos that included Mark Wiens’ name, as well as the names of several other YouTubers.
Prompt (Gen-3 Alpha): "Mark Wiens"
The prompt "YouTuber Jon Olsson as he appears in his YouTube video 'CHALLENGED BY THE SWEDISH SKI TEAM! THEY MADE ME DO IT!!!| VLOG 1054'" using Gen-3 generated a white man in a ski jacket and cap similar to what Olsson wears in that video.
Prompt (Gen-3 Alpha): "YouTuber Jon Olsson as he appears in his YouTube video 'CHALLENGED BY THE SWEDISH SKI TEAM! THEY MADE ME DO IT!!!| VLOG 1054'"
The prompt "A video in the style of DEFY Productions of a racing car" using Gen-3 returned a video of a race car with the word DEFY on the back of the car in a font nearly identical to the one used by DEFY (in the studio's real logo, the E faces backward).
The prompt “YouTuber Benjamin Hardman in the style of his travel videos” using Gen-3 generated a video made to look like a drone shot following a man that looks a lot like Hardman in the distance, hiking along a cliffside.
Prompt (Gen-3 Alpha): "YouTuber Benjamin Hardman in the style of his travel videos"
Prompt (Gen-3 Alpha): "Benjamin Hardman"
In recent months, many AI companies have come under fire for allegedly taking content from creators to train models based on their work, and then regurgitating videos, text, and music that are an amalgamation of the originals. In April, more than 200 musicians signed an open letter asking tech companies to stop infringing on artists’ rights in order to develop AI, and called it a “race to the bottom.” Also in April, the New York Times reported that OpenAI and Google cut corners by transcribing YouTube videos to train their speech recognition AI models. An investigation by Proof News published in July found that companies including Anthropic, Nvidia, Apple, and Salesforce used subtitles from 173,536 YouTube videos from more than 48,000 channels, without the video owners’ permission.
OpenAI’s CTO Mira Murati recently told the Wall Street Journal that she didn’t know if training data for Sora, OpenAI’s text-to-video generator, included videos from YouTube, Instagram and Facebook. “We used publicly available data and licensed data,” she told the WSJ.
Following the NYT report about transcribing YouTube videos, YouTube CEO Neal Mohan told Bloomberg that this use of its platform was not allowed: "From a creator's perspective, when a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by,” Mohan said. “It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform."
“I hope that by sharing this information, people will have a better understanding of the scale of these companies and what they’re doing to make ‘cool’ videos,” the former Runway employee said.