micahrl (whispering) | UGC privatization could improve AI models

UGC privatization could improve AI models

Monday, July 3, 2023

An off-the-cuff idea:

AI training on public data is driving the privatization of user-generated content (e.g., the Reddit API). This could push users who have a real need for the information, and who today rely on public Google search to find it, to build private archives instead. For instance, you might keep a copy of the most insightful StackExchange answers that help you do your job.

If that happens, AI companies that figure out how to ingest private archives will have a training advantage, even if doing so is a legal gray area, as training on Sci-Hub was for modern models.

It might even mean that the quality of training data goes up while the amount of publicly available UGC goes down, since private archives are curated by humans, and curation is a quality signal.