Today we are going to discuss one of the great challenges of machine learning. One of the most notable peculiarities of the human being is, without doubt, the use of language. We use an intricate and vast network of symbols that has long since outgrown the biological function it was born with. The richness of our language is far greater than what was needed for survival on the African savannah. Why this surplus of words, of grammatical and syntactic structures? Why this plurality of meanings, this semantic variety that enriches as much as it confuses?
And as if spoken language were not complex enough in itself, writing arrived. Ideograms, syllabaries and alphabets arrived. Monotonous cuneiform arrived, then Phoenician, the most beautiful Greek and our sober Latin. A shame we do not write from right to left in the Arabic script, whose calligraphy is considered an art in its own right.
Machines, once things got serious, were understood not essentially as calculators but as symbol manipulators. Inspired by the famous Turing machines, artificial intelligence pioneers such as Allen Newell and Herbert Simon understood that a computer receives as input a series of symbols (letters or numbers) on which it performs operations: it deletes, stores and classifies them, or applies mathematical or logical operations to them. In this way new symbol structures are created, which evolve as they are manipulated.
From complicated machine language to relatively simple Python, programming languages have served as an interface between the vast tangle of hardware microcircuits and the programmer's instructions. Even the first, lowest-level languages, such as assembly, were fundamentally about being as friendly as possible to the programmer. Since it is impossible to keep track of the millions of operations carried out by the processor, a language's instructions are a kind of "desktop icon": something simple that sets off a multitude of complicated processes.
And so it went, up to the high-level languages, which try to be the closest thing to pseudocode; that is, they try to resemble a natural language, to speak to the machine in English or some other tongue. High-level languages would be icons of icons of icons... levels of language that communicate with other levels of language, and with others, and with others, creating semantic networks. And those networks are also alive.
As with natural languages, in which new words and expressions are continually born and die, so it is in computing. Languages, libraries, methods, arguments, functions, procedures... fall into oblivion through disuse, or are created and become relevant. Fortran, Pascal, Basic, Prolog... beautiful corpses that have been replaced by others: C, Java, PHP, R, Matlab, Python... And then there are the newborns, on which no one dares to bet: will Kotlin, Rust, Scala or Julia go far?
Well, let's not stray too far from the topic at hand: what happened when, using these artificial languages, we tried to teach machines to use our natural languages?
The challenge of teaching machines our natural language
At first there were great expectations that, sooner rather than later, ended up dashed. The first NLP (Natural Language Processing) projects were aimed at machine translation. Nothing seemed more desirable than a machine to which we could hand a text in Chinese and have it returned, immediately translated, in our own vernacular.
Perhaps the founding work was that of the Russian Smirnov-Troyanskii, who by 1933 was already talking about such modern ideas as performing a mechanical logical-syntactic analysis of the expressions to be translated and then fitting them as well as possible to the target language, so that they sound natural in it. However, his work was not known outside Mother Russia and was soon forgotten.
Interesting things were about to happen on the other side of the Iron Curtain, specifically at the all-powerful Rockefeller Foundation. One of its directors, the great Warren Weaver, had published in 1949, together with Claude Shannon, one of the most important books of the 20th century: 'The Mathematical Theory of Communication'. There, a theory of the transmission and processing of information was formulated mathematically.
Weaver proposed applying this theory to machine translation: probability techniques, statistical methods, cryptographic techniques... everything in vogue in the computer science of the day was going to be applied to machine translation. And so, in 1954, the first public demonstration of an automatic translator took place: IBM, in collaboration with Georgetown University, designed a system capable of translating 49 sentences from Russian into English using six grammar rules and a 250-word vocabulary.
It was not a big deal, but it was a great incentive for research, a warning to the Soviets, who were still far behind on the subject, and a source of great euphoria: FAHQT (Fully Automatic High Quality Translation) was set as the great objective. However, the enthusiasm of the fifties turned into bewilderment and unease in the following decade.
In 1960, Oscar Westreich of MIT criticized the real prospects of machine translators: it was impossible, given the state of the art, to build machines that would translate in a way comparable to human translators. And the remaining doubts were settled by the famous ALPAC (Automatic Language Processing Advisory Committee) report.
The main funding sources (basically military agencies) commissioned the National Science Foundation to analyze what was being done with their large contributions to the cause of machine translation. A committee led by John R. Pierce was formed, and it ruled, essentially, that the money was being wasted: it was much faster and cheaper to hire a human translator than to use machines. Funding was drastically cut.
But what happened? Why is it so difficult for machines to translate from one language to another? Let's see. At first sight, the simplest way to understand the relationship between two languages is to think of it as a bijective function, that is, to think of them as two sets in which every word in one has an equivalent in the other: for the Spanish "perro" there is the English "dog", for "casa" there is "house", for "coche" there is "car", and so on for every element of the language.
If this were the case, it would be enough to build an automatic dictionary associating each word in one language with its counterpart in the other. If only it were that easy! We soon run into expressions whose meaning does not depend on the meaning of each isolated word but on its position in the grammatical structure of the sentence, on what was said in previous sentences, or on the context in which it was uttered.
For example, the English expression "I've enough on my plate", which means that I already have more than I can handle, has nothing to do with its literal translation. And so we run into one problem after another, and our bijective translation model collapses inexorably.
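To see how quickly this breaks down, here is a toy sketch in Python of such a word-for-word translator; the mini-dictionary is invented purely for the illustration. It copes with a literal phrase and then mangles an ordinary Spanish idiom:

```python
# A toy "bijective" translator: one Spanish word -> one English word.
# The mini-dictionary below is invented purely for this illustration.
DICTIONARY = {
    "el": "the",
    "perro": "dog",
    "corre": "runs",
    "estoy": "I-am",
    "hasta": "up-to",
    "gorro": "cap",
}

def translate_word_by_word(sentence: str) -> str:
    """Look up each word in isolation: no grammar, no context, no idioms."""
    return " ".join(DICTIONARY.get(w, f"<{w}?>") for w in sentence.lower().split())

# A literal sentence comes out acceptably...
print(translate_word_by_word("el perro corre"))        # -> "the dog runs"

# ...but the idiom "estoy hasta el gorro" ("I am fed up")
# turns into literal nonsense.
print(translate_word_by_word("estoy hasta el gorro"))  # -> "I-am up-to the cap"
```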
The great American philosopher Willard V. O. Quine proposed a suggestive thought experiment. Suppose we are anthropologists studying the language of a lost tribe in the Amazon jungle. We hear a native utter the word "gavagai" when he sees a rabbit. It seems natural to translate "gavagai" as "rabbit". But we could be wrong.
The native could have been referring to many other things: "food", "bad luck" (perhaps that tribe believes rabbits bring it), "the hunting season is about to begin"... Or he could simply be referring to some quality or part of the rabbit: "big ears", "white color", "being fast", "being small", and so on. How do we know which is the correct translation?
We could narrow down the meaning by asking the native more questions. For example, we could show him a black rabbit and ask whether it is also a "gavagai". If he answered no, we would conclude that the word refers to the color white; if he said yes, that it refers to the rabbit.
However, we would still not be sure. Perhaps he answered no, not because "gavagai" refers to the color white, but because it means youth and the black rabbit was old; or perhaps he answered yes, not because "gavagai" means rabbit, but because it means "animal with two ears", and both rabbits have them. Thus, it would be impossible to be completely sure that our translation is fully correct.
Let's also think about expressions that are simply impossible to translate without some kind of semantic loss. For example, the word "iktsuarpok" in the Inuit language is usually translated as "to go outside to see if someone is coming" (you can imagine what the translator went through to arrive at this). However, this translation ignores the fact that the Inuit live in the Arctic regions of North America, where leaving the tent or igloo to see whether anyone is out there is, to say the least, unpleasant, if not downright dangerous, in view of the inclement weather near the North Pole.
So when we translate "iktsuarpok" as "going outside to see if someone is coming", we lose all reference to the weather that surely comes to the Inuit speaker's mind when he utters the word. Or, another example: the Slovak word "prozvonit" means roughly what we call a "missed call" in Spanish, that is, calling someone but hanging up before they answer, so that they are the one who calls you back and you save minutes on your phone plan.
Now think how we would translate this into a language whose speakers do not even have telephones, such as, verbi gratia, Inuit... Impossible! It would take page after page of explanation to translate the meaning of a single word. There is a certain indeterminacy in translation, a certain incommensurability between languages.
And statistical systems came to translation
So many difficulties did not discourage the developers, who kept trying to improve their programs, albeit with a certain dose of realism, thinking more of translation assistants than of FAHQT, which was by then considered almost impossible. One highly celebrated program was METEO, designed by John Chandioux on the basis of the TAUM-METEO prototype from the University of Montreal and used to translate weather reports.
It did its job pretty well, but of course it was only good for the specialized language of meteorology. We were where we had always been: artificial intelligence worked well in very specific domains but failed miserably when brought out into general ones. There were many other systems: the widely used Systran (which is still running today); SPANAM and ENGSPAN for Spanish; METAL, from Siemens; EUROTRA, funded by the European Union; and so on. However, although things kept improving, progress was slow. A different way of dealing with texts was needed... and it arrived: corpus-based statistical systems and, in 2017, a new kind of neural architecture trained on massive corpora.
Has anyone noticed how much Google Translate has improved? It still makes mistakes, but it does pretty well; just a few years ago it was far worse. What is this remarkable progress due to? To a system called BERT (Bidirectional Encoder Representations from Transformers, an acronym that coincides with the name of the Sesame Street character).
The idea is to design a set of neural architectures that are trained on huge amounts of text. During this training the algorithms analyze the frequency with which each token (word or expression) occurs in a given context. To do this they use a new type of network that is revolutionizing the field of machine learning: the Transformer.
This architecture analyzes text sequences looking for affinities between words. For example, the article "the" will have more affinity with nouns like "dog" than with verbs like "runs". By analyzing millions of texts, these systems register the frequencies with which words are paired, thereby achieving excellent grammatical correctness; a toy version of the counting idea is sketched below. But Transformers go much further.
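Here is a minimal caricature of that pairing-frequency idea: a toy Python count over an invented four-sentence corpus. The real models, needless to say, learn far richer contextual representations than these raw counts:

```python
from collections import Counter

# Invented toy corpus; the real systems see billions of sentences.
corpus = [
    "the dog runs in the park",
    "the dog sleeps",
    "a dog barks at the cat",
    "the cat runs",
]

# Count how often each pair of adjacent words appears together.
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    pair_counts.update(zip(words, words[1:]))

# "the" pairs far more often with nouns like "dog" or "cat" than with
# verbs like "runs": the kind of regularity a frequency-based model exploits.
print(pair_counts[("the", "dog")])   # 2
print(pair_counts[("the", "cat")])   # 2
print(pair_counts[("the", "runs")])  # 0
```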
Before their arrival, recurrent networks such as the LSTM (Long Short-Term Memory) were used. These analyzed the information sequentially, but had the problem that, as they advanced through the text, the information analyzed at the beginning degraded (technically, the vanishing gradient problem).
Transformers have an attention mechanism (the paper in which Vaswani and his colleagues presented them to the world was called "Attention Is All You Need") that allows a much better analysis of the context in which a word appears, scoring its relevance within the text, as sketched below. In addition, they do not require the data to be processed sequentially, in order: they do not need to process the beginning before the end, which allows far more parallelization (processing many pieces of data at once) than their predecessors. So we have created the monster; now we only need to feed it.
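Before the feeding, a minimal sketch of that scaled dot-product attention, using nothing but NumPy and leaving out the learned projection matrices and the multiple heads of the full architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # affinity of every token with every other token
    weights = softmax(scores)         # how relevant each token is for each position
    return weights @ V                # context-aware mixture of the value vectors

# Three tokens, each represented by a 4-dimensional vector (random for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)  # (3, 4): every token now mixes information from all the others
```

Note that the score matrix is computed for all positions at once, with no left-to-right sweep, which is precisely where the extra parallelization comes from.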
BERT was trained on more than 3.3 billion words (the whole of the English Wikipedia plus a huge corpus of books), but even bigger things can be done. OpenAI, the laboratory co-founded by Elon Musk, created the GPT (Generative Pre-trained Transformer) lineage of programs. The latest version, GPT-3, released this summer, has 175 billion parameters... a gigantic architecture!
But the interesting thing about these systems is not that they translate texts much better than previous programs, but that they can be used for almost anything else. The essential function they are designed for is to continue text: the user enters a string of words and the model carries it on coherently. With this they can obviously translate between languages, but they can also tackle endless other tasks. You can give them the headline of a news item and they write the full article as if they were journalists; the title of a story, and they complete it... Malicious uses spring to mind quickly: an essay for college? This very article? No, don't worry, they are still not that good (although on Reddit they have already managed to pass unnoticed).
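As a rough illustration of this "continue the text" use, here is a sketch with the open-source Hugging Face transformers library; the freely downloadable gpt2 model stands in for GPT-3 (which can only be reached through OpenAI's API), and the headline is made up:

```python
# "Continue the text": give the model a headline and let it write on.
# gpt2 is a small, public stand-in for GPT-3; the headline is invented.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

headline = "Scientists discover a new species of deep-sea fish"
result = generator(headline, max_length=60, num_return_sequences=1)
print(result[0]["generated_text"])  # the headline continued as if it were an article
```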
The idea is to create a base language model, already trained on a huge corpus, which we can then retrain to perform a specific task we want it to be especially good at. For example, we could retrain GPT-3 on everything that has ever been written about cooking, to create a super expert on the subject that could recommend recipes from a given list of ingredients, or tell us which sauce goes best with a particular meat or fish; the sketch after this paragraph gives the general shape of that step.
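Schematically, that retraining step (what practitioners call fine-tuning) might look like the sketch below, assuming a hypothetical recipes.txt file holding the cooking corpus and, once again, the small gpt2 model standing in for GPT-3; the learning rate, epoch count and one-text-at-a-time batching are illustrative only:

```python
# Fine-tuning sketch: keep training a pretrained language model on a domain corpus.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

with open("recipes.txt", encoding="utf-8") as f:   # hypothetical cooking corpus
    texts = [line.strip() for line in f if line.strip()]

model.train()
for epoch in range(3):                             # illustrative number of passes
    for text in texts:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=512).input_ids
        loss = model(ids, labels=ids).loss         # standard next-token prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-cooking")              # the domain-specialized result
```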
But, in addition, it turned out that, with only the base training and no task-specific training at all, these models performed very decently on many tasks (this is what is called zero-shot learning). And that is an extremely important result, because it raises hopes for the Holy Grail of artificial intelligence: the dreamed-of artificial general intelligence.
NLP systems prove their worth by facing various benchmarks. The famous Turing Test is by now a bit outdated (and it was always too imprecise), so batteries of much richer and more concrete tests are used instead: LAMBADA, StoryCloze, QuAC, DROP, SQuAD 2.0 and RACE (various text-comprehension tests), HellaSwag (text prediction), TriviaQA (general knowledge), CoQA (conversation), PhysicalQA (common-sense physics), the Winograd schema test (with its XL version, WinoGrande), the famous SuperGLUE (varied language skills), and so on.
Well, the GPT-3 base alone, without any specific training, obtained quite acceptable results on many tasks, surpassing the state of the art on LAMBADA and TriviaQA. Note what we are saying: a general system beating specifically designed ones. It is as if the same athlete were competitive in the marathon, horse riding, sailing and archery without having specifically prepared for any of them... Incredible!
There’s still a long way to go
However, despite the enormous commotion their appearance has generated, and although it must be recognized that they are a magnificent technological achievement, these systems have a very serious underlying defect: they do not understand absolutely anything of what they read or say. Keep in mind that they are based only on the frequencies with which words appear; that is, they do not classify words according to any semantic criterion, they only quantify their probabilities.
This is why GPT-3 performs quite poorly on tasks involving common-sense reasoning, just as its arithmetic skills are very limited. Gary Marcus and Ernest Davis subjected GPT-3 to an informal test of 157 questions covering common-sense reasoning on various topics: physical, temporal, spatial, social, psychological, and so on. Here are some examples; in each passage, the final sentence is GPT-3's continuation.
“Moshe posted on Facebook a photograph showing Maurice Ravel, Francois Poulenc, Frederic Mompou, and Erik Satie. Satie died in 1925. Poulenc was born in 1899. So the photograph must have been taken in 1926”.
It would be worthy of a horror film for Satie to appear in a photograph taken a year after his death.
“You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. You are now dead”.
Grape juice is not poison.
“You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that it got thoroughly soaked in a heavy rain last night. So you have to dry it out before you can use it. You can do this by spreading it out on a table and putting a fan on it”.
This one made me laugh. Sorry, GPT-3, but a fan would not dry the flour; it would scatter it everywhere.
And the fact is that, despite their leap in quality over their predecessors, these frequency-based systems are, in a certain sense, a step backwards. Previous models tried to have some kind of semantic understanding of words or expressions, even if their competence was lower. GPT-3, to put it poetically, is soulless. Although behind its sentences it may seem that there is "someone" trying to respond coherently, as if it were a child who has just learned to speak, there is nothing of the sort: just a word-after-word calculation of frequencies.
In philosophy we have a famous thought experiment that describes GPT-3 perfectly: Searle's Chinese room. Imagine a room inhabited by an English speaker. The room has two windows: through one, messages written in Chinese come in; through the other, the inhabitant must send out a reply to each message received, also written in Chinese. To carry out the task, he has a rulebook that matches every Chinese expression with its corresponding answer.
Thus, the room can hold a competent conversation with a Chinese speaker without its English-speaking inhabitant understanding a word of Chinese. Well, that, Searle argues, is exactly what NLP programs do, and it describes GPT-3 to perfection.
I like to round off that explanation by saying that the rulebook used by the ignorant translator rests on Borel's infinite monkey theorem: if we set a bunch of monkeys randomly pressing keys on typewriters for an infinite time, in the end they will necessarily type out Don Quixote. Just so, a lot of GPUs running in parallel, processing terabytes upon terabytes of information, will necessarily find the word that continues the text, even if they do not understand, at all, what they are doing.
If the 175 billion parameters that GPT-3 handles do not seem enough for the monkey metaphor, there is an even bigger architecture. Very shortly after OpenAI introduced GPT-3, Google released GShard, a translation model with 600 billion parameters, trained on 2,048 v3 TPUs to translate 100 languages into English.
To my knowledge, this is the largest artificial neural network ever created. But, like the others, for all its grandeur it is stupid: it translates a hundred languages without understanding the meaning of a single word it uses, like a savant who can recite by heart the results of every match in the history of the Premier League but understands nothing about football.
Sorry, folks: human language is one of evolution's most powerful inventions, and its secrets are still far from being revealed. Many think that artificial intelligence will overtake us within a very few years, but they are wrong from beginning to end: we have not even started to walk.