Why training PaLM 2 with fewer parameters is better and makes sense


In recent years, the performance of large language models has been judged mainly by the number of parameters set during training. Under this reasoning, it was logical to assume that models got better at performing tasks and solving problems as more parameters were added.

But there are signs that we are witnessing a major paradigm shift in which the number of parameters is not as important as previously believed. Although much information is kept under lock and key because of increasingly fierce competition, a clear example is the path that major players such as Google and OpenAI appear to be following.

The importance of this apparent change in trend is worth spelling out. Giving language models a large number of parameters translates into heavy investments of time and money. If better models can be built while saving money in this area, we could see much faster and more significant advances in different fields of AI.

PaLM 2, fewer parameters, more data

A week ago, Google introduced its PaLM 2 language model, intended to take on OpenAI’s GPT-4. It is the successor to PaLM, which arrived the year before to compete with another product from Sam Altman’s company, the then-promising GPT-3. What has become apparent since? That the Mountain View company is changing the way it trains its models.

Google has not released the technical details of its latest model to the public, but internal documents seen by CNBC indicate that PaLM 2 was trained with far fewer parameters than its predecessor and still boasts superior performance. Specifically, the new model reportedly has 340 billion parameters, compared with 540 billion in the previous one.

In a blog post, the search company has acknowledged using a technique known as “compute-optimal scaling” to make the model more efficient overall, including the use of fewer parameters and, consequently, a lower training cost. Google’s gains with PaLM 2 have come from elsewhere: a larger dataset.

Remember that datasets are made up of a wide variety of information collected from web pages, scientific studies and other sources. According to the leaked information, Google’s new model was trained on nearly five times as much data as PaLM, presented in 2022. This figure is expressed in tokens, the units of text that make up the datasets.

PaLM 2 was reportedly trained on 3.6 trillion tokens, while PaLM was trained on only 780 billion. To put this in context, Meta’s LLaMA model was trained on 1.4 trillion tokens. The figure for GPT-4 is unknown, but the GPT-3 paper states that the model was trained on 300 billion tokens.
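To make the comparison concrete, here is a minimal sketch of the arithmetic, using the reported figures: 540 billion parameters and 780 billion tokens for PaLM, and 340 billion parameters and 3.6 trillion tokens for PaLM 2. The PaLM 2 numbers come from leaked documents and have not been confirmed by Google, and the names `MODELS` and `tokens_per_parameter` are purely illustrative.

```python
# Reported figures; the PaLM 2 numbers are leaked, not officially confirmed.
MODELS = {
    "PaLM": {"params": 540e9, "tokens": 780e9},
    "PaLM 2": {"params": 340e9, "tokens": 3.6e12},
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return tokens / params


for name, figures in MODELS.items():
    ratio = tokens_per_parameter(figures["params"], figures["tokens"])
    print(f"{name}: {ratio:.1f} tokens per parameter")

# How much larger the PaLM 2 training set is than PaLM's.
data_growth = MODELS["PaLM 2"]["tokens"] / MODELS["PaLM"]["tokens"]
print(f"Training data growth: {data_growth:.1f}x")

# Output:
# PaLM: 1.4 tokens per parameter
# PaLM 2: 10.6 tokens per parameter
# Training data growth: 4.6x
```

If these figures are accurate, PaLM 2 saw roughly seven times more tokens per parameter than its predecessor, which is the trade the article describes: fewer parameters, more data.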

This shift toward training models with fewer parameters is not unique to Google. OpenAI is also working in that direction. For months, Altman has pointed out that the race to increase the number of parameters reminds him of the late 1990s, when the hardware industry was obsessed with increasing processor clock speeds.

What are parameters?

Broadly speaking, parameters come into play during the training stage of AI models. They are the internal values a model adjusts as it learns from the data, and they determine the predictions behind its answers. For example, if we train a model to estimate house prices, it would learn parameters that capture how much weight to give to features such as dimensions, location or amenities.
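The house example can be sketched with a toy model. This is only an illustration under strong assumptions: the data is invented, the model uses a single feature (area), and it is fitted with ordinary least squares, which is not how large language models are trained. The two learned values, `weight` and `bias`, are the model's parameters; `fit_line` and `predict` are hypothetical helper names.

```python
# Invented training data: area in square metres and sale price.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150_000.0, 210_000.0, 250_000.0, 290_000.0]


def fit_line(xs, ys):
    """Fit price = weight * area + bias by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    weight = covariance / variance
    bias = mean_y - weight * mean_x
    return weight, bias


def predict(area, weight, bias):
    """Use the learned parameters to estimate a price for a new house."""
    return weight * area + bias


# "Training" here means finding the two parameters that best fit the data.
weight, bias = fit_line(areas, prices)
print(f"weight (price per square metre): {weight:.0f}")
print(f"bias (base price): {bias:.0f}")
print(f"estimated price for 90 square metres: {predict(90.0, weight, bias):.0f}")
```

This toy model has two parameters. A model like PaLM 2 follows the same basic idea of adjusting numeric values to fit data, but with hundreds of billions of them.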
