Think of any topic you can imagine when it comes to raising children, and there’s probably a post about it on Mumsnet, the UK’s oldest, most popular and most controversial parenting forum for mothers. Over its 20-year history, Mumsnet has amassed an archive of more than six billion words written by an engaged user base on topics like dirty nappies and lazy husbands. (Not to mention Crazy nonsense about dolphins.)
This spring, Mumsnet announced that it had decided to enter into licensing agreements with major players in the field, including OpenAI, after discovering that AI companies were scraping its data. OpenAI had expressed a willingness to seek a settlement after Mumsnet first approached it. After negotiations with OpenAI broke down, Mumsnet announced in July take legal action.
According to Mumsnet, during an early conversation, OpenAI’s head of strategic partnerships told the company that its billion-word-plus dataset was of interest to the AI giant. Mumsnet’s leadership was excited. “We spent a lot of time going back and forth with them,” Justin Roberts, Mumsnet’s founder and CEO, tells WIRED. “We had to sign some NDAs, and they wanted a lot of information from us.”
But more than a month later, OpenAI told Mumsnet that it was no longer interested in a partnership, according to email exchanges reviewed by WIRED. When asked why, OpenAI staff explained that Mumsnet’s 6 billion-word dataset was too small to warrant a licensing deal. They also noted that OpenAI was primarily interested in large datasets that the public couldn’t access online, and that they wanted datasets that captured the breadth of human experience.
When reached for comment by WIRED, the company echoed those sentiments. “We seek partnerships on large-scale data sets that reflect human society, not just publicly available information,” says OpenAI spokesperson Kayla Wood. “We support publishers and creators’ choices, giving them a way to express their preferences for how sites and content work with AI in search results and train generative AI-based models.”
Roberts says she was “irritated” by this development. She recalls that OpenAI was initially particularly interested in Mumsnet because of the platform’s high content content written by women. “It’s a very high-quality conversational dataset,” she says. “It’s 90 percent female conversation, which is very unusual.”
OpenAI has signed a number of data licensing agreements with media outlets and platforms over the past year, including: Vox Media, that AtlanticAxel Springer, hourThat includes platforms like WIRED parent Condé Nast and Reddit, which are full of user-generated content. (Automattic, owner of WordPress.com and Tumblr, was also said to be in licensing talks earlier this year.) The details of those deals haven’t been made public, so it’s unclear how large each corpus is.
When WIRED asked OpenAI about the size of the dataset it would consider for commercial licensing, it declined to share that information, but spokeswoman Kayla Wood emphasized that partnerships with publishers are “focused on showcasing content in products and driving traffic.”