Tumblr and WordPress users might soon find that their data is being used to train artificial intelligence (AI) models, according to a report. Automattic, the parent company of both blogging platforms, has allegedly struck deals with OpenAI and Midjourney to sell user-generated content that will reportedly be used to help train AI. While the details of the deals and the data-sharing practices remain unclear at the moment, the report has raised questions about data privacy and the ethics of companies sharing their users' data with third parties.
Internal communications by Automattic employees, viewed by 404 Media, both confirmed the deals with the AI companies and revealed details of these practices. In its report, the publication said that Automattic's deals with OpenAI and Midjourney could be announced soon. Further, it appears that data compilation for the AI firms has already begun. An internal post by product manager Cyle Gage suggested that all of Tumblr's public post content between 2014 and 2023 was compiled.
The report also highlights a specific message suggesting that private and deleted user content was automatically compiled alongside public data. It was not clear whether that set of data had already been shared with the AI firms. Further, since such an accident puts the private information of the entire user base in jeopardy, it also raises questions about the company's ethical policy and data safety infrastructure.
Automattic on Tuesday issued a statement saying, “AI is rapidly transforming nearly every aspect of our world, including the way we create and consume content. At Automattic, we've always believed in a free and open web and individual choice. Like other tech companies, we're closely following these advancements, including how to work with AI companies in a way that respects our users' preferences.”
The post detailed several things the company is doing for its users, including blocking AI platform crawlers, offering a setting to discourage search engines from indexing a site on WordPress and Tumblr, and promising an opt-out setting for users who do not wish to share data with third parties. “Currently, no law exists that requires crawlers to follow these preferences,” the post stated.
The mechanism to opt out of data sharing is also somewhat unclear. While the company stated in the post that the AI firms will respect the opt-out settings and even remove past content from users who have newly opted out, the report claims the reality is more complicated.
The report cited an internal document from February 23 in which an employee asked whether the company had any assurance that the data partners would respect the opt-out decisions made by users. Andrew Spittle, Automattic's Head of AI, reportedly replied, “We will ask that content be deleted and removed from any future training runs. I believe partners will honor this based on our conversations with them to this point. I don't think they gain much overall by retaining it.”
According to the report, the response was vague and did not confirm whether Automattic had an agreement to that effect. Further, it appears that the entire line of reasoning rests on the assumption that AI firms will not gain much by retaining the user data. It should be noted that the practice of third-party data sharing is not new, and most social media platforms hold the rights to user-generated public content on the platform. However, making such deals without disclosing them to users could potentially expose private information to companies that are using the same data to train AI systems.
from Gadgets 360 https://ift.tt/I4TgL7r