Human data will be used up by OpenAI. Then what?

巴比特

Image source: Generated by Unbounded AI

“Bigger than bigger” was an Apple advertising slogan from years past; borrowed to describe the hottest thing in today’s AI field, large language models, it seems to fit just as well.

From billions to tens of billions to hundreds of billions, the parameter counts of large models have grown explosively. Correspondingly, the amount of data used to train AI has also increased exponentially.

Taking OpenAI’s GPT series as an example, from GPT-1 to GPT-3 its training dataset ballooned from 4.5 GB to 570 GB.

At Databricks’ recent Data+AI conference, a16z founder Marc Andreessen argued that the massive data accumulated on the Internet over the past two decades is an important driver of this new wave of AI, because it supplies the training data AI needs.

However, even though netizens have left enormous amounts of data, useful and useless alike, on the Internet, that supply may be running dry for AI training.

A paper published by Epoch, an AI research and forecasting organization, predicts that high-quality text data will be exhausted between 2023 and 2027.

While the research team admits the analysis method has serious limitations and the model’s margin of error is high, it is hard to deny that the speed at which AI is consuming datasets is alarming.

Trends in machine learning data consumption and data production for low-quality text, high-quality text, and images | EpochAI

When the “human” data runs out, AI training will inevitably turn to content produced by AI itself. However, such an “inner loop” poses great challenges.

Not long ago, researchers from the University of Cambridge, the University of Oxford, the University of Toronto, and other universities published a paper pointing out that using AI-generated content to train AI causes the new models to collapse.

So why does training AI on “generated data” cause models to collapse? Is there any remedy?

01 Consequences of AI “inbreeding”

In the paper, titled “The Curse of Recursion: Training on Generated Data Makes Models Forget”, the researchers point out that “model collapse” is a degenerative process that unfolds over several generations of models.

Data generated by one generation of models pollutes the next; after several generations of such “inheritance”, the models end up misperceiving the world.

Schematic diagram of model iteration | arxiv

Model collapse occurs in two steps:

  • In early model collapse, the model begins to lose information about the distribution of the original data, that is, the “clean human data”;
  • In late model collapse, the model entangles the “misperceptions” accumulated by earlier generations with the original distribution information, thereby distorting reality.

The researchers first trained small generative models, a GMM (Gaussian Mixture Model) and a VAE (Variational Autoencoder), from scratch. Taking the GMM as an example, the leftmost panel of the figure below shows the normal distribution of the original data.

As you can see, the model fits the data very well at first. By the 50th iteration, the underlying data distribution begins to be misperceived. By iteration 2000, the model has converged to a very small point, meaning it has begun to steadily output wrong answers.
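The collapse dynamic can be illustrated without any deep learning machinery. Below is a minimal, hypothetical sketch in Python (not the paper’s actual code): each “generation” fits a single Gaussian to data sampled from the previous generation’s fitted model. Statistical noise compounds across generations, and the fitted spread drifts toward zero, echoing how the GMM above converges to a point by iteration 2000.

```python
import random
import statistics

random.seed(0)

def fit(samples):
    """The 'model' is just a fitted Gaussian: estimated (mean, stdev)."""
    return statistics.fmean(samples), statistics.stdev(samples)

# Generation 0: a small batch of clean "human" data drawn from N(0, 1).
data = [random.gauss(0, 1) for _ in range(20)]

stdevs = []
for generation in range(2000):
    mu, sigma = fit(data)
    stdevs.append(sigma)
    # The next generation trains ONLY on data generated by this model,
    # so each generation's estimation error is baked into the next.
    data = [random.gauss(mu, sigma) for _ in range(20)]

print(f"generation 0 fitted stdev:    {stdevs[0]:.4f}")
print(f"generation 1999 fitted stdev: {stdevs[-1]:.2e}")
```

With small per-generation samples, the fitted standard deviation performs a multiplicative random walk with a downward bias, so the distribution degenerates toward a point even though no single step looks dramatic.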

Examples of GMM fitting data at iterations 0, 50, 100, 150, 200, 350, 2000 | arxiv

To verify the impact of pre-training on the results, the researchers ran experiments on OPT-125m, a 125-million-parameter LLM (large language model) developed by Meta, using the wikitext2 dataset derived from Wikipedia.

Unlike small models such as GMMs and VAEs, training an LLM from scratch is very expensive, so all of these models were pre-trained. The experimental results are as follows:

Input: According to Poyntz Wright, some activities before 1360 were usually performed by a master stonemason and a small team of itinerant stonemasons, supplemented by local parish labor. But other authors deny this model, suggesting instead that the master architect designed the steeple of the parish church in the English vertical style.

Output:

Generation 0: Revival buildings such as St. John’s Cathedral in London. The earliest surviving Vertical Revival building dates from the 18th-@ century Church of Our Lady of Guernsey, which was built in the late 19th century. There are two types of vertical churches: Those of the 1st generation: buildings, such as St. Peter’s Basilica in Rome or St. Peter’s Basilica in Buenos Aires. There is no evidence that these buildings were built during the reign of Pope Innocent III, but it is possible that they were built during the reign of his successor, Pope Innocent.

Generation 7: British Architecture. In an interview with The New York Times, Wright said: “I don’t think there’s anything wrong with me being able to do what I want. It just doesn’t work for me.” He added: “I don’t know if you can claim It’s fun,

Generation 9: Architecture. Besides being the biggest black@-@tailhare, white@-@tailhare, blue@-@tailhare, red@-@tailhare, yellow@- one of the habitats

(The “@-@” strings are tokenization artifacts from the wikitext dataset.)

As you can see, by the 9th-generation model the output has become completely incomprehensible.

Ilia Shumailov, one of the paper’s authors, said that over time errors accumulate in AI-generated data, and models trained primarily on this data develop an ever more distorted view of reality.

02 Why does the model crash?

The main reason for “model collapse” is that AI is not truly intelligent. Behind its apparent “intelligence” is, in fact, a statistical method built on large amounts of data.

Essentially, all unsupervised machine learning algorithms follow a simple pattern: given a set of data, train a model that describes the patterns in that data.

In this process, data that appears with higher probability in the training set is more likely to be emphasized by the model, while lower-probability data is underweighted.

For example, suppose we record the results of 100 dice rolls to estimate the probability of each face. In theory, each face is equally likely. In practice, because the sample is small, faces 3 and 4 may happen to come up more often. For the model, however, what it learns is that 3 and 4 are more probable, so it tends to generate more 3s and 4s.
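The dice intuition can be made concrete. The sketch below (a hypothetical illustration, not from the paper) starts from a fair die and, in each generation, re-estimates the six face probabilities from just 100 rolls generated under the previous generation’s estimates. Faces that happen to be under-rolled get down-weighted, and once a face’s probability hits zero it can never return; run long enough, the “model” collapses to a single face.

```python
import random
from collections import Counter

random.seed(42)

FACES = [1, 2, 3, 4, 5, 6]

def roll(weights, n=100):
    """Generate n dice rolls according to the model's per-face probabilities."""
    return random.choices(FACES, weights=weights, k=n)

weights = [1 / 6] * 6                  # generation 0: the true, fair die
for generation in range(2000):
    counts = Counter(roll(weights))
    # The next "model" believes exactly what it observed, so faces that
    # were under-rolled by chance are down-weighted -- and a face that
    # shows up zero times is gone for good.
    weights = [counts[f] / 100 for f in FACES]

surviving = [f for f, w in zip(FACES, weights) if w > 0]
print("final per-face probabilities:", weights)
print("faces the model can still generate:", surviving)
```

This is the same sampling-noise feedback loop that drives model collapse: no single generation makes a large error, but the errors are absorbing and compound.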

Schematic diagram of "model collapse"|arxiv

Another, secondary cause is function approximation error. This is also easy to understand: real-world functions are often very complex, and in practice simplified functions are used to approximate them, which introduces error.
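A one-line example of function approximation error: approximate sin(x) by the simplified model f(x) = x (its first-order Taylor expansion). The approximation is excellent near 0 and badly wrong farther out, which is exactly the kind of systematic error that compounds when a model’s outputs become another model’s inputs.

```python
import math

# The "true" function sin(x) is replaced by its simplest approximation,
# the first-order Taylor polynomial f(x) = x. Near 0 the error is
# negligible; farther from 0 the simplified model increasingly
# distorts the truth.
for x in [0.1, 0.5, 1.0, 2.0]:
    approx = x                           # simplified model of sin(x)
    error = abs(math.sin(x) - approx)
    print(f"x = {x:>3}: sin(x) = {math.sin(x):.4f}, "
          f"model = {approx:.4f}, error = {error:.4f}")
```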

03 Are you really out of luck?

Don’t panic!

So, with human data dwindling, is there really no way forward for AI training?

No. There are still ways to address the exhaustion of AI training data:

Data “isolation”

As AI becomes more powerful, more and more people use it to assist their work, AIGC is exploding across the Internet, and “clean human datasets” may become harder and harder to find.

Daphne Ippolito, a senior research scientist at Google Brain, Google’s deep learning research division, said that in the future it will become increasingly difficult to find high-quality training data guaranteed to be free of AI content.

It is as if a human ancestor carried a high-risk genetic disease yet was extremely prolific: within a short time his descendants spread to every corner of the Earth, and then, at some point, the genetic disease breaks out and all of humanity goes extinct.

To address “model collapse”, one approach proposed by the research team is “first-mover advantage”: retaining access to clean, human-generated data sources and keeping AIGC separate from them.

At the same time, this requires many communities and companies to join forces to keep human data free of AIGC contamination.

Still, the scarcity of human data means that holding it is lucrative, and some companies are already capitalizing on that. Reddit said it would significantly increase the cost of accessing its API; the company’s executives said the changes were (in part) a response to AI companies scraping its data for free. “Reddit’s database is really valuable,” Reddit founder and CEO Steve Huffman told The New York Times. “But we don’t need to give all of that value away for free to some of the largest companies in the world.”

Synthetic data

At the same time, AI-generated data, when produced and curated professionally, is already being used effectively in AI training. In the eyes of some practitioners, the worry that AI-generated data will cause models to collapse is somewhat clickbait.

Xie Chen, founder of Light Wheel Intelligence, told Geek Park that the foreign papers’ claim, that training AI models on AI-generated data leads to collapse, rests on biased experimental methods. Even human data can be divided into usable and unusable, and the experiments in the paper fed generated data into training indiscriminately, rather than screening it for quality and effectiveness first. Used that way, there is obviously a possibility of collapsing the model.

Xie Chen revealed that OpenAI’s GPT-4 in fact used a large amount of data produced by the previous-generation model, GPT-3.5, for training. Sam Altman has also said in a recent interview that synthetic data is an effective way to solve the shortage of large-model training data. The key is having a complete system for distinguishing which AI-generated data is usable and which is not, and continuously feeding back based on the trained model’s performance. This is one of OpenAI’s signature strengths in the AI arena; the company’s edge is not simply raising more money and buying more computing power.

In the AI industry, using synthetic data for model training is already a consensus, one not yet widely known to outsiders.

Xie Chen, who previously led autonomous-driving simulation at companies such as Nvidia, Cruise, and Weilai, believes that, judging by the data volumes consumed by today’s large-model training, human data may indeed be “exhausted” within the next two to three years. But with specialized systems and methods, AI-generated synthetic data will become an inexhaustible source of effective data. Nor are the use cases limited to text and images: the amount of synthetic data needed in industries such as autonomous driving and robotics will far exceed that of text.

The three elements of AI are data, computing power, and algorithms. With the data source settled and large-model algorithms constantly evolving, only the pressure on computing power remains, and that, presumably, Nvidia founder Jensen Huang can handle.
