cosmoquester’s Blog

Reflection on the Conceptual Essence of Language Models, the Path to AGI

2025-01-03T00:00:00+09:00

Recently, large language models (LLMs) like ChatGPT, Claude, and LLaMA have been receiving significant attention. There’s no denying that their capabilities are astonishing. Microsoft even released a paper suggesting that GPT-4 could be an early form of Artificial General Intelligence (AGI). While I am amazed by GPT’s current abilities, I had been somewhat skeptical about how much more potential these models might hold. So when I recently learned that OpenAI’s o1 and o3 models achieved the top grade on the math section of South Korea’s CSAT (College Scholastic Ability Test), I was stunned. It prompted me to reflect on the principles and inherent limitations of LLMs’ capabilities, and on what might be required to surpass these boundaries and reach AGI-level intelligence. Here’s a brief summary of my thoughts.

What makes LLMs so intelligent?

Answering this in one sentence is impossible, so let’s break it down. First, we can consider the unique nature of the task that serves as the training objective for LLMs—Language Modeling (LM). The first crucial point is that LM is a generative task. In the early days of deep learning, classification models were the mainstream because classification problems were comparatively easier. For instance, classifying image categories, determining sentiment in text, or tagging sequences with tasks like NER were common. However, these classification tasks have a significant limitation: no matter how well a model is trained, it is still restricted to classification. For example, even if a model could perfectly classify every insect in the world from a photograph, it’s hard to imagine what further steps it could achieve. In contrast, generative tasks like image or text generation have virtually unlimited output ranges. When a model performs well in these tasks, its potential becomes immense. Technically, text generation is still rooted in classification, but the auto-regressive nature of this process imbues it with high potential.

LLMs are not the only generative models out there. We also have image generation models and speech synthesis models. So why are text-based LLMs so much more intelligent? This boils down to the unique characteristics of the text modality. The primary goal of image generation models is to create pictures, photographs, or illustrations. Naturally, most of their datasets are designed for this purpose. Even if we were to build a highly advanced image generation model, it wouldn’t suddenly develop the ability to recognize text in a handwritten letter and generate a thoughtful reply to it. On the other hand, consider text. Language, as one of humanity’s most significant milestones distinguishing prehistoric and historic eras, inherently contains vast amounts of knowledge about the world. While some parts of this knowledge remain un-digitized, a significant portion—especially since the advent of the internet—exists in the form of structured data, research papers, news articles, and more. But it’s not just knowledge that makes text unique. Thanks to the representational nature of language, much of our higher-order reasoning depends on it. The process of “intellectual reasoning” itself is largely verbalized. This makes text the modality with the greatest potential for exhibiting intelligence. It’s no surprise, then, that LLMs, which operate within this modality, are leading the way in terms of apparent intelligence and reasoning capabilities.

Text-based generative models actually have a fairly long history. So, why didn’t they work earlier? While there could be more complex reasons, two stand out: first, no one could even imagine training on such vast amounts of text; and second, there simply wasn’t enough money to train a model with such an enormous number of parameters. The first time a generative model produced sensational results was with the emergence of GPT-3, which featured 175 billion parameters. What amazed me was OpenAI’s decision to invest the staggering amount of money required to develop such a massive model. Why is this remarkable? Because at the time, there was no precedent for building a model of that size, and no guarantee it would exhibit such advanced reasoning capabilities. It’s hard to fathom how they justified pouring $12 million into building something without knowing exactly how or where it could be applied. Whatever their reasoning may have been, OpenAI made a major discovery with their bold investment, and since then, the trend of large datasets and large parameter models has revolutionized the field of NLP.

How Does an LLM Learn?

The learning process of a Large Language Model (LLM) can be broadly explained in two parts: objective functions and model optimization. To understand this, it helps to think of a deep learning model, including an LLM, as a type of mathematical function. Like any function, it has a predefined formula for computation. However, this function also contains adjustable variables called weights or parameters. These variables are fine-tuned during training, which alters the behavior of the function.

Since it’s a function, the model always produces the same output for a given input. You might think, “But ChatGPT gives different answers every time!” That variability actually stems from post-processing mechanisms rather than changes in the function itself. The core output of an LLM is a probability distribution over all possible next words (or more precisely, tokens) that could follow the given input. Depending on how this distribution is sampled, the outputs can appear more random or deterministic—for example, by selecting the word with the highest probability or by sampling randomly based on the probabilities. Most LLMs incorporate sampling by default. The primary goal in training an LLM is to predict the next word (or token) that follows a given input text. The model adjusts its parameters to maximize the likelihood of accurately predicting this “next token” based on the dataset. In essence, training involves adjusting the model’s parameters so that it becomes exceptionally good at this task.
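The difference between a fixed output distribution and a varying final answer can be sketched in a few lines. This is only an illustration: the distribution `next_token_probs` and the helper names `greedy` and `sample` are made up for this example and do not come from any real model or library.

```python
import random

def greedy(probs):
    """Pick the single most probable token (deterministic)."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution for some fixed input text.
# The "function" always returns this same distribution for that input.
next_token_probs = {"blue": 0.6, "clear": 0.3, "falling": 0.1}

print(greedy(next_token_probs))                              # same token every time
rng = random.Random(0)
print([sample(next_token_probs, rng) for _ in range(5)])     # varies from draw to draw
```

The model itself is identical in both calls; only the rule for turning the distribution into one token differs.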

How can we update these weights? This ties into the basics of deep learning. In simple terms, the model is a function, and the difference between its output and the correct answer (error) can itself be expressed as a function. Model optimization, then, becomes a problem of finding the minimum value of this error function. One well-known algorithm for finding this minimum is gradient descent, which adjusts the weights in the direction that reduces error. Essentially, non-chat LLMs like GPT-3 excel because they have mastered predicting “what word comes next” by processing vast amounts of text from the internet. This training on an enormous, diverse corpus of text—including essays, papers, dialogues, and questions/answers on specific topics—allows LLMs to perform a wide range of tasks. Datasets like Reddit, which feature a wealth of Q&A interactions, have likely played a significant role in enhancing the practical usability of these models.
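As a toy illustration of the optimization described above, the sketch below runs plain gradient descent on the next-token error for a single context. It is a deliberately tiny stand-in, not how a real LLM is implemented: the three "weights" are just one logit per vocabulary entry, and the data `observed_next`, the learning rate, and the step count are all assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, targets):
    """Average negative log-probability of the observed next tokens."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in targets) / len(targets)

def train(targets, vocab_size, lr=0.5, steps=200):
    logits = [0.0] * vocab_size  # the adjustable "weights" of this toy model
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of the average cross-entropy with respect to each logit:
        # predicted probability minus observed frequency.
        grad = [probs[i] - targets.count(i) / len(targets)
                for i in range(vocab_size)]
        # Step in the direction that reduces the error.
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits

# Hypothetical data: ids of the tokens that followed one fixed context.
observed_next = [0, 0, 0, 1]  # token 0 followed three times, token 1 once
logits = train(observed_next, vocab_size=3)

print("error before:", cross_entropy([0.0] * 3, observed_next))
print("error after: ", cross_entropy(logits, observed_next))
print("learned distribution:", softmax(logits))
```

After training, the learned distribution moves toward the observed frequencies (about three quarters for token 0 and one quarter for token 1), which is all that "predicting the next token" means at this scale.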

From GPT-3 to ChatGPT: A Paradigm Shift

The journey from GPT-3 to ChatGPT took considerable time, likely because OpenAI’s initial philosophy for developing GPT models faced limitations. Up until GPT-3, OpenAI adhered to the belief that they could achieve general intelligence without fine-tuning—by simply scaling up the model and diversifying the training data to enable zero-shot and few-shot learning. However, this philosophy likely faltered after GPT-3, prompting a shift toward fine-tuning.

This shift is evident in InstructGPT, the model that underpins GPT-3.5 (and thus ChatGPT). Unlike GPT-3, InstructGPT was developed with extensive post-training. While GPT-3 had impressive reasoning abilities, its responses were more random and harder to control. This was a predictable outcome: the internet contains an enormous range of texts, including well-written, poorly written, hostile, and careless content. Because it was trained on all these sources indiscriminately, GPT-3 produced outputs that were highly variable and sensitive to changes in phrasing. The standout improvement in GPT-3.5 is controllability. OpenAI trained the model to provide polite, thorough, accurate, and safe responses by curating datasets with these qualities and amplifying this approach using reinforcement learning. Reinforcement learning is particularly effective in scenarios where there are clear distinctions in rewards for different outcomes. In this case, the reward was “producing responses that align with human preferences.”

Normally, using reinforcement learning would require humans to score every response generated by the model—a daunting and impractical task. OpenAI solved this by training another LLM to evaluate and score responses. This proxy scoring model made the reinforcement learning process scalable, enabling the development of conversational models like ChatGPT. Beyond ChatGPT, OpenAI has not disclosed detailed technical advancements. GPT-4o represents a step toward broader multimodal functionality while its core capabilities don’t appear drastically different. Models like the “o1” and “o3” series may differ in focus, and a brief interpretation of their capabilities will follow later. This evolution highlights how LLMs have transitioned from generalized text models to refined, highly adaptable conversational agents capable of assisting across diverse domains.
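The text above does not say how the scoring model is trained. One common formulation, used in the published InstructGPT work, is a pairwise preference loss: given two responses to the same prompt, the scorer is penalized when it rates the human-preferred one lower. The sketch below shows only that loss on plain numbers; the function name and the example scores are assumptions for illustration.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Small when the preferred response scores higher than the rejected one,
    large when the ordering is wrong.
    """
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical scores a reward model might assign to two responses.
print(preference_loss(2.0, 0.0))  # correct ordering, lower loss
print(preference_loss(0.0, 2.0))  # wrong ordering, higher loss
```

Once such a scorer agrees with human rankings often enough, it can stand in for the human and score as many generated responses as the reinforcement learning step needs.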

The Limitations of LLMs

Many people would naturally wonder: “How good can LLMs actually get?” or “Can they become AGI like the ones we see in movies?” My initial, simple thought was that an ideal LLM might just be a machine that’s extraordinarily good at predicting the next word, which is GPT’s primary learning objective. Could that be the ultimate form of an LLM? Imagine a database that collects every monologue, speech, written text, and conversation of every person in the world in real time. Now picture a machine that, given some input text, searches this database for an exact match and calculates the frequency of every possible continuation, outputting the resulting distribution. This machine would perfectly fulfill the learning objective of LLMs. Wouldn’t it represent the upper limit of LLMs? We quickly realize this isn’t the case. Suppose someone randomly strings together about 2,000 letters of the alphabet and provides the result as input. The number of possible combinations is 26^2000 (roughly 10^2830), a number so astronomical that such a text would almost certainly be unique in human history. If a user asked the machine to produce a continuation of this text, what would it output? The answer is: nothing. Such input has never existed on Earth, so no continuation of it exists in the record. Here we recognize an important point: although GPT’s learning objective is to predict the probability of the next word, the act of determining the next word does not in itself constitute intelligence.
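The thought experiment can be made concrete with a miniature version of the lookup machine. This is a sketch under simplifying assumptions: the "database" is two short sentences, contexts are matched word by word, and the class name `LookupMachine` is invented for this example.

```python
from collections import Counter, defaultdict

class LookupMachine:
    """Stores every (context, next word) pair it has seen and answers by exact match."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def observe(self, text):
        """Record, for every prefix of the text, which word came next."""
        words = text.split()
        for i in range(1, len(words)):
            context = " ".join(words[:i])
            self.table[context][words[i]] += 1

    def next_word_distribution(self, context):
        """Return observed next-word frequencies, or nothing for an unseen context."""
        counts = self.table.get(context)
        if not counts:
            return {}  # never seen before: the machine has nothing to say
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

machine = LookupMachine()
machine.observe("the cat sat")
machine.observe("the cat ran")

print(machine.next_word_distribution("the cat"))  # frequencies of "sat" and "ran"
print(machine.next_word_distribution("qzx jkw"))  # empty: unseen input
```

On seen contexts the machine matches the data perfectly, yet it returns an empty result for anything new, which is exactly the gap between memorizing next words and generalizing.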

Interestingly, if you ask GPT to generate a random string of 2,000 characters, it can do so with precision.

Then, can we conceptualize an upper limit to LLMs from a data perspective? LLMs, after all, learn the relationships between input and output from data. If no mapping data exists for a certain domain, isn’t it impossible for LLMs to learn those relationships? To some extent, this is true. Some argue that the advancement of LLMs is nearing its end because nearly all publicly available online data has already been used. I used to hold this view too, but it’s not entirely accurate. The world has already been heavily digitized, and new knowledge continues to be generated in digital formats. With the constant influx of news and research, it’s hard to imagine data running out anytime soon. Of course, whether this influx can provide anything fundamentally new beyond simple additions to knowledge is a separate question.

Still, consider this thought: “No matter how advanced LLMs become, they’re ultimately based on text data. Does that mean they can do no more than generate text? They can’t operate machines like Iron Man’s Jarvis, can they?” This is a reasonable assumption, but it is incorrect. Admittedly, no dataset yet contains examples of launching a transforming suit with a text command. When language models relied solely on “existing” data, the assumption would have held. However, OpenAI now generates revenue, meaning it can create data tailored specifically for LLMs.

For instance, GPT-4 already has a search function. If you ask it to look something up, it conducts a search and uses the results to formulate its answer. “How can a model that only generates text perform searches?” While it might sound surprising, the principle is straightforward. If the LLM generates a text like “[Search: Memoria],” an automated system interprets it, runs a web search for “Memoria,” and feeds the retrieved content back into the LLM as input. Naturally, an LLM trained only on web text wouldn’t spontaneously output “[Search: XX].” But OpenAI trained the model on specially structured data to enable this functionality. In conclusion, the idea that LLMs can only generate text is not accurate. With proper tools and training on specific data, LLMs could theoretically operate transformation suits or even launch nuclear missiles.
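The search mechanism described above amounts to a small loop around the model. The sketch below is a guess at the general shape, not OpenAI’s actual system: `run_with_search`, `fake_generate`, `fake_search`, and the “[Results: ...]” format are all hypothetical names made up for this example, and the model and search engine are replaced by stubs.

```python
import re

# Matches the bracketed command format used as an example in the text.
SEARCH_PATTERN = re.compile(r"\[Search:\s*(.+?)\]")

def run_with_search(generate, search, prompt, max_rounds=3):
    """Call the model; whenever it emits a search command, run it and feed results back.

    generate: function mapping input text to generated text (stands in for the LLM)
    search:   function mapping a query string to result text (stands in for a search engine)
    """
    context = prompt
    output = ""
    for _ in range(max_rounds):
        output = generate(context)
        match = SEARCH_PATTERN.search(output)
        if match is None:
            return output  # plain answer, no tool request
        results = search(match.group(1))
        # The retrieved text simply becomes part of the next input.
        context = f"{context}\n{output}\n[Results: {results}]"
    return output  # rounds exhausted: return whatever was generated last

def fake_generate(context):
    """Stub model: asks for a search first, answers once results are present."""
    if "[Results:" in context:
        return "Here is an answer based on the search results."
    return "[Search: Memoria]"

def fake_search(query):
    """Stub search engine."""
    return f"(stub results for {query!r})"

print(run_with_search(fake_generate, fake_search, "What is Memoria?"))
```

The model never does anything except produce text; the surrounding program decides that certain text patterns trigger actions, and the same pattern extends to any tool that can be driven by a string.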

Despite their boundless potential, LLMs like GPT are not without limitations. One major issue is symbolic logical reasoning. When tasked with solving complex logical problems or even simple arithmetic, LLMs often perform poorly.

(Screenshots: the upper one shows GPT-4o, the lower one o1.)

Counting would reveal that the answer to the first problem is 22. These simple addition problems, primarily involving the digit 1, are one of my personal tests whenever a new LLM is introduced. Even the highly intelligent o1 still struggles with such basic arithmetic. For the second problem, the answer is 91. Symbolic logical reasoning is known to be challenging for LLMs, which fundamentally learn probabilistic patterns. They perform better on calculations like 13 + 28, since such data is likely abundant. However, data resembling the odd problems I create is rare. Hence, for models that solve such tasks intuitively rather than by explicit symbolic rules, rare inputs like those above pose significant challenges.

Although these tasks are not well-handled at present, I do not believe they represent absolutely impossible problems for LLMs. After all, the rules governing inputs and outputs are clear. While deep learning often struggles with cases where the correct answer is rare in the answer space, or where there is a significant leap between input and output, structurally speaking, these challenges are not fundamentally incompatible with the capabilities of LLMs.

Another well-known limitation of LLMs is hallucination—arguably the biggest hurdle to their usability. Simply put, hallucination is when the model generates plausible but false statements. While opinions may differ, I believe that hallucination is an inherent issue that cannot be fundamentally resolved with current LLMs. Hallucination is not a single problem but involves multiple limitations. Let me illustrate one such limitation with a simple counterexample.

As mentioned earlier, LLM inference always produces a probability distribution over output tokens, and this cannot be changed. Consider the case where sampling is used. For hallucination to be entirely absent, the probability of generating any token sequence that forms a false statement must be zero. For instance, take the true sentence, “King Sejong’s birthday is May 15, 1397.” The false sentence “King Sejong’s birthday is not May 15, 1397” must then be assigned a probability of exactly zero. For this probability to be zero, at least one token within the sentence must have a generation probability of zero.
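The last step can be written out with the chain rule of probability. This is a sketch in standard notation; the symbols x_i (the i-th token) and n (the number of tokens in the sentence) are introduced here for illustration and do not appear in the original argument.

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})
\qquad\Longrightarrow\qquad
P(x_1, \dots, x_n) = 0 \iff \exists\, i : P(x_i \mid x_1, \dots, x_{i-1}) = 0
```

A product of non-negative factors is zero only when one of the factors is zero, so ruling out a false sentence under sampling means ruling out at least one specific token at one specific position.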

But consider this: up to “King Sejong’s birthday is,” the statement is identical to the truthful one. Thus, the shared portion should not only avoid a probability of zero but should also have a high probability. Does this mean we can simply assign a probability of zero to “not”? If so, even a sentence like “King Sejong’s birthday is not today” would have a probability of zero despite being true (on most days). To put it more simply, consider a statement like, “King Sejong’s birthday is not May 15, 1397. Such claims are false.” While this contains a false clause, the text as a whole is true. The key is that even false clauses cannot have a probability of zero. Since GPT generates text one token at a time, the meaning of the text is not complete until the entire text has been generated. As the meaning remains incomplete during intermediate stages, it is impossible to distinguish between true and false sentences at that point. Consequently, the model cannot preemptively assign a probability of zero to false statements.

What if sampling is not used? In this case, the model always selects the token with the highest probability, so the probability of a false statement need not be zero; it is sufficient for the true continuation’s probability to be consistently higher. Specifically, at every step where the next token settles whether the statement is true or false, the token leading to the false statement must have a lower probability than the token leading to the true one. For example, after “King Sejong’s birthday is,” “not” should have a lower probability than “May.” But even if a model could be trained so that its most probable tokens always guaranteed truthfulness, always decoding that way would sacrifice diversity and creativity, and the perceived performance in practical applications would likely plummet. I don’t yet have a rigorous proof of this argument, but I’ll consider it further.

Another prominent limitation involves solving creative problems beyond the statistical pattern recognition of existing phenomena. Examples include proving unproven problems, devising new algorithms, or proposing entirely new theories. One might wonder why models can generate unprecedented, imaginative artwork but not achieve these feats. The answer is that artwork has no definitive answers, whereas proofs, algorithms, and theories require validity to be meaningful. While current LLMs are clearly incapable of such tasks, reinforcement learning (RL) might offer a partial solution. As demonstrated by AlphaGo, RL excels in exploring spaces with defined boundaries and clear reward criteria. This is especially true in Markov Decision Processes (MDPs) where state uncertainty is minimal.

However, employing RL would necessitate several prerequisites. First, the fundamental propositions used for mathematical proofs would need to be pre-defined in the form of an action set. Second, the target problem to be proven must be automatically verifiable against the list of actions proposed by the LLM. For algorithms, this would be simpler since they can be executed and tested. If partial solutions, such as proving that a variable x is valid only within a negative range, could also be identified and rewarded differentially, the system would improve significantly. With these two conditions met, problems like mathematical proofs or algorithm design could be reframed as exploration problems and potentially resolved. Of course, such models wouldn’t necessarily need to be based on LLMs, but integrating them with LLMs could enable interactions where groundbreaking discoveries might emerge in real time.

In fact, I suspect that OpenAI might already be employing similar mechanisms with o1. Alongside breaking down reasoning processes with CoT (Chain of Thought), they might also validate reasoning steps in real-time through exploration-based mechanisms. The extended processing times, which are several times longer than those of earlier models, could stem from such iterative reasoning approaches.

Let’s delve into a more intuitive and fundamental limitation. When we look at Jarvis from Iron Man or Samantha from the movie Her, both AIs perceive reality in real-time, communicate, and act seamlessly. We find this natural, but in reality, it is an exceedingly complex problem that current large language models (LLMs) cannot achieve on their own.

Firstly, the AIs in these films operate continuously. LLMs, however, and indeed all deep learning models, do not function without input. They require something to be fed into them to work. So, what should we input? Isn’t it enough to provide real-time data, whether it’s from a camera or an audio feed? Let’s imagine a camera continuously feeding video data to a model for constant inference. Suppose the camera captures a tree in the garden. What should the model output in response? It’s unlikely that the current LLMs would know, as they haven’t been trained on such a scenario. This brings us back to the problem of datasets.

You might argue that OpenAI could create such a dataset. Sure, it could—but the key issue is that we need to explicitly define what the model should output in any given situation. For example, if we instruct it to say “The tree looks healthy,” when it sees a tree, does that mean we must create data for every imaginable visual input in the world and define specific outputs for each? Clearly, this is impossible. While this might seem like a practical limitation related to data, it’s fundamentally a question of “free will.”

Take GPT-4, for instance—it already has some capability for image recognition. OpenAI can create data to enable specific actions, but only when the user’s “intent” is clearly defined. However, when a model is given arbitrary sensory inputs without a predefined purpose or intent, determining what the model should do is an entirely different problem. It requires turning “sensory information” into “internal reasoning,” a task of an altogether different nature. If we could somehow create and train a dataset that encompasses all “sensory inputs” and their corresponding “internal reasoning,” we might then begin to believe such an AI has consciousness.

Another fundamental issue is that LLMs lack memory. Let’s assume real-time inference is somehow achieved. As explained earlier, LLMs can only retain context by embedding it into the input text. Wouldn’t it work to simply keep accumulating all sensory information into the input context? Doing so would almost certainly crash OpenAI’s servers within an hour. This might seem obvious, but humans possess an extraordinarily powerful and mysterious memory system. We process massive amounts of information in real time, retain only what’s necessary through selective attention, and either forget or permanently store that information based on its relevance or utility. LLMs, by contrast, cannot maintain continuous context like humans. Each inference is independent, akin to producing outputs based solely on individual, isolated frames. Without memory, developing a human-like relationship, as with Samantha or Jarvis, is inherently impossible.

But doesn’t OpenAI have systems that remember conversations? Yes, but the system works by summarizing parts of the dialogue and reintroducing the text automatically. This barely scratches the surface, serving only to reduce the burden of manual copy-pasting.

The absence of memory naturally leads to another limitation: the impossibility of continual learning. This term typically refers to the ability of a deep learning model to adapt to new datasets without forgetting previously learned skills or knowledge. As mentioned earlier, deep learning models optimize their parameters for an objective function. This means they will readily overwrite previously learned parameter values if those values don’t help solve the current problem. A straightforward way to address this is to keep adding new data to the old dataset and train on the combined set. Without continual learning, teaching Jarvis or Samantha specific knowledge or operational methods through dialogue would be impossible.

What about in-context learning (ICL), where a model adapts based on a few examples? This too has limitations. Personally, I don’t consider ICL true learning. Instead, I see it as activating latent capabilities that the model already possessed but rarely exhibited: the examples significantly increase the probability that those capabilities are expressed. Even if you view ICL as learning, it is extremely resource-intensive, making it impractical for broader applications.

In summary, these challenges—lack of real-time responsiveness, memory, and continual learning—highlight the fundamental limitations of current LLMs in achieving the kind of seamless, intuitive intelligence depicted in science fiction.

The Path to AGI

Can we never create an AGI like Jarvis or Samantha? I believe it is possible. While much of the prior discussion already involves speculation and assumptions, what follows is purely my hypothesis. I think memory is the most critical capability for AGI at this stage. Beyond the direct functional aspects I mentioned, memory holds the potential to solve far more complex problems.

Take hallucinations, for example. How do humans avoid hallucinating? Human learning is fundamentally based on memory. Implicit memory, like motor memory, exists, but the things we consciously recall and articulate belong to explicit memory. Explicit memory can be divided into episodic memory and semantic memory. In contrast to deep learning models, which develop probabilistic output distributions for specific inputs, humans preserve distinct concepts and knowledge. Crucially, humans have metacognition — we are aware of the fact that we know something. This self-awareness enables us to avoid hallucinations and to consciously produce genuine lies (statements we recognize as false).

How is this metacognition possible? One idea is that humans store knowledge and information alongside metadata. For example, if you learned the Pythagorean theorem in a class, the semantic memory retains the theorem itself, while the episodic memory records “when, where, and from whom I learned the theorem.” This dual-layered memory allows us to remember not just the theorem but also the fact that we know it, forming a foundation for metacognition.

Memory is closely tied to continual learning. In fact, human memory itself is a continual learning system. While this may not hold true for intermediate levels of memory, by the time we successfully implement human-level memory capabilities in artificial intelligence, the challenges associated with this issue will likely have been resolved. The implications of this are profound: the fundamental learning method of AI would shift.

In living organisms, memory and learning are inseparable concepts. I believe that the current learning methods in deep learning do not mimic the uniquely human way of learning, but rather the general learning mechanisms of animals. A significant portion of animal learning occurs through conditioning. Common examples are classical conditioning, such as Pavlov’s dogs learning associations, and operant conditioning, such as monkeys excelling at matching shapes with buttons in games. Learning to associate stimuli with corresponding behavioral patterns is already a widespread method among animals. Humans, of course, also possess these animal-like learning abilities. Anyone who has played rhythm games can attest to this. The process of matching stimuli and responses through physical practice feels entirely different from understanding mathematical concepts.

Learning through conditioning is implicit and automatic, rather than something we consciously decide to learn. This is why we find it difficult to articulate how we learn to ride a bike or play the piano, or to recall exactly what we know about these skills. Such learning is deeply tied to implicit memory. Another example is perception processes, a kind of pattern recognition, which are core cognitive functions heavily influenced by implicit memory. For instance, when humans recognize a dog as a dog, it is not because they consciously recall and match the specific features and criteria of a dog. Instead, this recognition is based on an appropriate inductive bias in the brain and occurs automatically as part of implicit learning and inference.

Why is deep learning closer to implicit learning? Deep learning bears many similarities to implicit memory. A model doesn’t “know” in advance exactly what it has learned (direct knowledge retrieval isn’t possible). Instead, it performs probabilistic reasoning (similar to human intuition) only when actual input is provided. But isn’t deep learning AI, in some respects, far smarter than animals? If the fundamental learning principles of deep learning are more akin to those of animals, how can it achieve such remarkable intelligence?

I believe the reasons are twofold: first, the availability of massive datasets, and second, the backpropagation-based gradient descent algorithm, which is a far more powerful optimization mechanism than what is biologically feasible in living organisms. Although gradient descent is an incredibly effective optimization method, it requires storing all intermediate computation results to calculate gradients. This process is considered biologically impossible for the human brain. However, with large data, computing resources, and strong optimization techniques, deep learning has maximized the utility of conditioning-like learning, enabling its exceptional intelligence.

Then, how does human-specific learning occur? As mentioned earlier, conditioning-based learning is undoubtedly a significant capability for humans. However, conditioning relies solely on direct experience and is difficult to explicitly articulate or transfer. What sets humans apart from other animals is the ability to engage in indirect learning through abstract concepts conveyed via language, which are stored in semantic memory. Thanks to these abstract concepts, we can also gain new insights through internal reasoning alone.

Daniel Dennett referred to this human-specific intelligence, distinct from that of animals, as “Gregorian.” In my opinion, this unique intelligence stems from the ability to consciously learn abstract concepts in and of themselves, and I hypothesize this process is only possible through explicit memory. Therefore, I believe that equipping AI with human-like memory (particularly explicit memory) could open up entirely new dimensions of capability. Since I fundamentally view logical reasoning as a problem of matching the identity of abstract concepts, I suspect this could also resolve the current challenges large language models face with logical symbolic reasoning. Naturally, if active recall becomes feasible, problems like those associated with explainable AI (xAI) would also be resolved.

Another facet of explicit memory is episodic memory. If AI were to possess episodic memory, it could store user-specific memories, allowing for highly personalized interactions. In addition, episodic memory could serve as the foundation for self-awareness, often referred to as consciousness. While episodic memory alone doesn’t constitute self-awareness, I believe it is a necessary condition for it.

By addressing additional challenges in real-time context tracking and tackling the so-called “hard problem” of consciousness, we could achieve something akin to Jarvis or Samantha. Wouldn’t that be an exciting vision of the future?

Reflections on the Conceptual Essence of GPT, and the Road to AGI

2025-01-03T00:00:00+09:00

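Recently, LLMs such as ChatGPT, GPT-4, Claude, and LLaMA have been receiving a great deal of attention. Their capabilities are certainly astonishing; Microsoft even published a paper claiming that GPT-4 is an early form of AGI. I am very impressed by what GPT can do today, but honestly I had been somewhat skeptical about how much more potential it holds. So I was enormously surprised when I recently saw that OpenAI’s o1 and o3 series had earned the top grade on the math section of the CSAT (South Korea’s college entrance exam). Here I want to informally organize my long-held thoughts on the principles behind LLM capabilities, their inherent limits, and what is needed to go beyond them and reach AGI-level intelligence.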

What makes LLMs so intelligent?

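How did LLMs come to have such remarkable reasoning ability? It is hard to answer in one sentence, so let’s unpack it step by step. First, consider what is special about Language Modeling (LM), the task that serves as the LLM training objective. The first important fact is that LM is a generative task. In early deep learning the mainstream was classification models, because classification is the easier problem. In vision it was image class classification, in text it was sentiment classification, and later it extended to sequence tagging such as NER. But classification has a major limitation: however well the model learns, all it can ever do is classify. A model that could classify every insect in the world from a photo would be sensational, but it is hard to imagine the step after that. When the range of outputs is unbounded, as in image or text generation, the potential of doing the task extremely well is enormous. Strictly speaking, text generation is also built on classification, but its auto-regressive nature is what gives it that potential.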

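But LLMs are not the only generative models. There are image generation models and speech synthesis models too. Why is it the text-based LLM in particular that is so smart? This comes down to what is special about text as a modality. The main purpose of image generation models is to produce paintings, photos, and illustrations, and most of their datasets are naturally built for that purpose. However well you train a model on such data, it will not suddenly gain the kind of intelligence needed to read the writing in a photo of a letter and generate a new letter that answers it. What about text? Language marks the dividing line between prehistory and history, and it holds a vast share of the world’s knowledge. Some of that knowledge is still not digitized, but a large part of what has been created, researched, or reported since the arrival of the internet exists as digital data. And it is not only knowledge. Thanks to the representational function of language, we rely on language for much of our higher-order thinking. In other words, “the process of intellectual reasoning” itself is put into words. So text must be the modality with the greatest potential from the standpoint of intellectual ability.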

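Text-based generative models actually have a fairly long history. So why did they not work before? There may be more complicated reasons, but to name just two: nobody imagined that so much text could all be used for training, and nobody had the money to train that many parameters. The first time a generative model showed sensational results was arguably the arrival of GPT-3 175B. What surprises me is that OpenAI decided to spend such an enormous sum to build a 175B model. There was no precedent for a model that large, and surely no certainty that it would acquire such advanced reasoning ability, so I wonder what use they expected to get out of it when they decided to pour in 12 million dollars. Whatever the reason, OpenAI made a big discovery with a big investment, and the trend toward large datasets and large parameter counts went on to turn the NLP field upside down.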

How LLMs learn

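How does an LLM learn? I think it can be explained in two parts: the objective function and model optimization. A deep learning model, LLMs included, can be thought of as a kind of function. Being a function, its formula is fixed. But the function has adjustable values called weights or parameters, and training tunes these values, which changes what the function actually computes. Because it is a function, the model always produces the same output for the same input. “Wait, doesn’t ChatGPT give a different answer every time?” Strictly speaking, that comes from post-processing that has nothing to do with the weights. The functional output of an LM is a probability distribution over every word that could follow the input. You can sample according to those probabilities to add randomness, or always pick the highest-probability word to remove it. You can assume that LLM services basically all use sampling. The training objective of the LM is to predict which word (in practice, which token) comes right after the given input text. So the most basic principle of LLM training is to adjust the parameters of the function so that it assigns probability 1 to the “next word” found in the dataset. How, then, can we tune the weights at will? A little deep learning background answers this. In short, the model is a function, and the error between the model’s output and the correct answer is also a function, so model optimization mathematically reduces to finding the minimum of a particular function. One well-known algorithm for finding that minimum is gradient descent: compute the gradient and adjust the weights in the direction that reduces the error. Put very simply, an LLM as typified by GPT-3 has read nearly all the text on the internet and become uncannily good at knowing what “the next word” is given the preceding context. And because that text ranges from essays and papers to conversations and Q&A on specific topics, LLMs can perform the wide variety of tasks they do today. Data like Reddit, full of people’s questions and answers, probably played a big part in how useful they feel in practice.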
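The two ideas in this section, a fixed next-token distribution that can be decoded greedily or by sampling, and a gradient step that pushes probability toward the observed next token, can be sketched in a few lines. This is only a toy illustration: the three-word vocabulary, the scores, and the helper names (`softmax`, `greedy`, `sample`, `sgd_step`) are all made up for the example, and a real LLM computes the scores with a large neural network rather than a hand-written list.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign to three candidate next tokens.
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

def greedy(probs):
    """Deterministic decoding: always pick the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Stochastic decoding: pick a token in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sgd_step(logits, target, lr=1.0):
    """One gradient-descent step on cross-entropy with respect to the scores.

    The gradient of -log p[target] with respect to the scores is (p - one_hot).
    """
    p = softmax(logits)
    return [z - lr * (p_i - (1.0 if i == target else 0.0))
            for i, (z, p_i) in enumerate(zip(logits, p))]

rng = random.Random(0)
greedy_pick = vocab[greedy(probs)]
sampled_picks = [vocab[sample(probs, rng)] for _ in range(10)]
print("greedy:", greedy_pick)
print("sampled:", sampled_picks)

# Suppose the dataset says the true next token is "dog" (index 1).
new_probs = softmax(sgd_step(logits, target=1))
print("p(dog) before:", round(probs[1], 3), "after:", round(new_probs[1], 3))
```

The same `probs` list feeds both decoders, which is the point made above: the randomness lives in the decoding step, not in the weights.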

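In fact, it took a very long time to get from GPT-3 to ChatGPT. My guess is that the philosophy OpenAI had held up to GPT-3 stopped working. From the first GPT through GPT-3, that philosophy could be summed up as “without finetuning, just scale the model and feed it diverse data, and we can build a general intelligence that does every task zero-shot or few-shot!” I suspect that after GPT-3 they tried to advance the model while keeping to that philosophy, failed, and turned to finetuning. I say this because InstructGPT, the basis of GPT-3.5 (the ChatGPT that made OpenAI a household name), is a model with a great deal of post-training. GPT-3 had reasoning ability that was astonishing for its time, but its answers were far more random and hard to control. That is only natural. Think how much varied text there is on the internet: some of it well written, some badly written, some careless, some malicious. All of it was presumably learned with equal weight, so performance was erratic and answers could change sharply with small differences in wording. To my mind the biggest advance of GPT-3.5 is controllability. OpenAI directly built chat-style data containing the kind, careful, accurate, and safe answers that people want, trained on it, and then brought in reinforcement learning to push this further. Reinforcement learning is a very powerful technique for exploration in a space where rewards are clearly graded. Ordinarily, to use “a human likes it” as the reward, a person would have to look at and grade every single generation, which is practically impossible. The trick OpenAI used rests on the fact that “scoring” is surely easier than “generating an answer”: they trained an LLM to learn how to score and used it as a proxy. Thanks to that, we entered the era of chat LLMs that follow instructions well and answer through conversation. What came after is hard to pin down, because OpenAI has not disclosed the technical details. GPT-4o seems to differ mainly in that it accepts images as well as text, without a very large difference in fundamental ability. Reasoning models such as o1 and o3 seem somewhat different, and I will briefly give my interpretation of the o-series models’ abilities later.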

Limitations of LLMs

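At this point many people will naturally wonder: “So how far can LLMs go?” “Could they become a human-like AGI as in the movies?” My first, simple thought was that a machine extremely good at GPT’s training objective, predicting the next word, would be the end point and ideal of LLMs. Is that really so? Suppose there is a database that collects, in real time, every monologue, utterance, piece of writing, and conversation of every person in the world. Now imagine a machine that, given some input text, finds the exactly matching text in the database, counts how often each possible continuation follows it, and answers with the most frequent one. This machine would solve the LLM training objective perfectly. Isn’t it the ideal LLMs are reaching for, and also their upper bound? It is easy to see that it is not. Suppose someone strings together about 2,000 random letters of the alphabet and gives that as input. The number of possibilities is 26 to the power of 2,000. That number is beyond imagination, so a text actually produced this way can be assumed to be the first of its kind in human history. If the user’s message asks the machine to repeat it verbatim, what will the machine output? The answer is that it cannot output anything, because this input never existed on Earth, and so no next word for it exists in history either. The important lesson is that although the objective we train GPT on is the probability of the next word, finding the next word is not in itself what intelligence is.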
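The thought experiment can be made concrete with a toy version of the lookup machine. The class name `LookupMachine`, the three recorded exchanges, and the prompt wording are all invented for illustration; the only claim is the one made in the text, namely that an exact-match table has nothing to say about an input it has never seen.

```python
import math
import random
import string
from collections import Counter, defaultdict

class LookupMachine:
    """Answers only by exact match against everything it has recorded."""

    def __init__(self, records):
        # Maps each recorded prompt to a count of the continuations seen after it.
        self.table = defaultdict(Counter)
        for prompt, continuation in records:
            self.table[prompt][continuation] += 1

    def answer(self, prompt):
        counts = self.table.get(prompt)
        if not counts:
            return None  # never seen: no continuation exists to count
        return counts.most_common(1)[0][0]

# A tiny stand-in for "every utterance ever recorded".
records = [
    ("How are you?", "Fine, thanks."),
    ("How are you?", "Fine, thanks."),
    ("How are you?", "Not bad."),
]
machine = LookupMachine(records)

rng = random.Random(0)
novel = "".join(rng.choice(string.ascii_lowercase) for _ in range(2000))
prompt = "Repeat this exactly: " + novel

seen_answer = machine.answer("How are you?")
novel_answer = machine.answer(prompt)

# Number of decimal digits in 26**2000, the count of possible 2,000-letter strings.
digits = int(2000 * math.log10(26)) + 1

print(seen_answer, novel_answer, digits)
```

Copying the string back is trivial for any system that has learned the rule "repeat what follows", which is exactly what the frequency table lacks.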

In practice, if you ask GPT to write out 2,000 random letters exactly as given, it does so accurately.

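Then can we think about the upper bound of LLMs in terms of data? After all, an LLM learns the relationship between input and output from data, so surely it can never learn relationships in areas where no such mapping data exists? That is partly true. Some people do argue that LLM progress is ending because almost no usable public online data remains. I once thought so too, but it does not necessarily follow: the world is already heavily digitized, and new knowledge keeps being produced in digital form in the first place. News and new research pour out every day, and if that alone can keep being used, it is hard to expect data to run out. Of course, it becomes a different story if we ask what, beyond simple new facts, can fundamentally be learned from that new data. In any case, how about this thought? “However far LLMs advance, they are based on text data, so far from being AGI, isn’t text generation all they can do? They can’t operate machines like Jarvis in Iron Man.” That is a fairly reasonable inference, but unfortunately it is not true. Of course there is no data on how to launch the transforming suit with a particular piece of text. Not yet, anyway. In the past, LM development used only data that “just happened to exist”, so the objection would have held. But OpenAI now makes money, which means it can create data in a specific format solely for the LLM. GPT-4o already has a search feature: ask it to look something up and it searches by itself and answers using the results. “How can a model that only generates text search on its own?” It may seem mysterious, but the principle is very simple. When the LLM generates the text “[search: Memoria]”, the system recognizes it, searches Google for Memoria, and feeds the contents of the resulting documents back into the LLM’s input. Of course, an LLM trained only on natural web text would not suddenly output “[search: XX]” on its own. OpenAI built data with that structure and trained the model on it. The conclusion is that the idea that LLMs can at best generate text is not true either. Hand an LLM the controls and train it on data describing how to operate them, and it could launch not only the transforming suit but a nuclear missile.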
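The search mechanism described above amounts to a small loop around the model: watch the generated text for a marker, run the tool, and append the result to the context. The sketch below writes the marker as `[search: ...]` and uses two hypothetical stand-ins, `fake_model` and `fake_search`, in place of a real LLM and a real search engine. It is not how any particular product is implemented, and the `[result: ...]` format is an assumption made for the example.

```python
import re

# Pattern for the marker the model is trained to emit (format assumed here).
SEARCH_PATTERN = re.compile(r"\[search:\s*(.+?)\]")

def fake_search(query):
    """Hypothetical stand-in for a search backend; returns placeholder text."""
    return f"(placeholder text of a document about {query})"

def fake_model(context):
    """Hypothetical stand-in for an LLM.

    It asks for a search first, and answers once a result is in its context.
    """
    if "[result:" in context:
        return "Based on the search result, here is a summary."
    return "[search: Memoria]"

def run(user_message, max_steps=3):
    """Generate, intercept search markers, and feed results back as input."""
    context = user_message
    output = ""
    for _ in range(max_steps):
        output = fake_model(context)
        match = SEARCH_PATTERN.search(output)
        if match is None:
            return output, context  # plain text: this is the final answer
        result = fake_search(match.group(1))
        context += f"\n{output}\n[result: {result}]"
    return output, context

answer, context = run("Look up Memoria and tell me about it.")
print(context)
print(answer)
```

The model itself still only produces text; everything that touches the outside world happens in `run`, which is why the same pattern extends to any device that can be driven by a parsed command.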

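So far I have thrown the door wide open to the limitless possibilities of LLMs, but of course GPT is not without limits. LLMs have several well-known problems. The first is logical reasoning. Give an LLM a complex logic problem, or even simple arithmetic, and it still often gets it wrong.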

The top is GPT-4o and the bottom is o1.

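If you count, you will see that the answer to the first problem is 22. Addition problems made up mostly of 1s like this are one of my personal tests whenever a new LLM comes out. Even o1, for all its reputed intelligence, still gets such a simple sum wrong. The answer to the second is 91. Symbolic logical reasoning is known to be hard for LLMs, which at bottom learn probabilistic patterns. They are actually better at calculations like 13 + 28, probably because there is plenty of such data, whereas data resembling my odd problems is scarce. For a model that answers by a kind of “intuition”, without symbolic rules, an input that is rare in representation space is therefore quite a hard problem. It does not work well right now, but I do not think this is something LLMs can never overcome. There is, after all, a clear rule linking input and output. Deep learning does tend to learn poorly when the correct answer is rare in the answer space and the leap from input to output is large, but structurally I do not see this as an ability that is inherently impossible for an LLM or at odds with what it is.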

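Another famous LLM problem that cannot be left out is hallucination, one of the biggest obstacles to using LLMs. Put simply, it means telling plausible lies. There is room for debate, but I think hallucination cannot be fundamentally solved with current LLMs. Hallucination is not a single problem, so its limits have several sides, but let me argue one of them with a simple counterexample. As noted earlier, an LLM’s inference is always a probability distribution over output tokens, and that never changes. Start with the case where sampling is used. For the LLM to be free of hallucination, the probability of generating a false token must then be 0 at every step. Take the sentence “King Sejong’s birthday is May 15, 1397.” Given the sentence “King Sejong’s birthday is not May 15, 1397.”, the model must assign it a probability of exactly 0. For the sentence to have probability 0, at least one of its tokens must have output probability 0. But up to “King Sejong’s birthday is” it is identical to the true sentence. Over that shared part there must be no zeros; the probabilities should in fact be high. So should we assign 0 to “not” right after it? Do that, and the question “King Sejong’s birthday is not May 15, 1397?” also gets probability 0. Here is an even easier example: “King Sejong’s birthday is not May 15, 1397. Such a claim is not true.” This text contains a false sentence, yet as a whole it is true. The point is that even a sentence that contradicts reality cannot be given probability 0. GPT can only generate one token at a time, so in the middle of generation the meaning is not yet complete, and what is not complete cannot be judged true or false. At the moment the early part is being inferred the future is unknown, so it is impossible to pick out only the false sentences in advance and set their probability to 0. What if sampling is not used? Then the model always picks the most probable token, and false sentences need not have probability 0. It is enough for the true sentence always to be more probable than the false one. More precisely, at every step, if the text were to end there, the continuation that makes it false must be less probable than the alternative. After “King Sejong’s birthday is”, “May” must beat “not”, and after “King Sejong’s birthday is not May 15, 1397”, “?” must beat “.”. But thinking roughly, even if we could somehow train the model to tell only the truth and output only the most probable token, it would have no diversity, certainly no creativity, and I suspect its perceived performance in actual use would mostly be very poor. I cannot yet think of a way to argue this as cleanly as the previous point, but I will keep thinking about it.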
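The counterexample above rests on the way an auto-regressive model assigns probability to a whole text. Written out, with $x_t$ standing for the $t$-th token (notation introduced here, not in the original post):

```latex
P(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
\qquad\Longrightarrow\qquad
P(x_1, \dots, x_T) = 0 \;\iff\; \exists\, t :\ p\left(x_t \mid x_1, \dots, x_{t-1}\right) = 0
```

If a true text and a false text share their first $k$ tokens, their first $k$ factors are identical, so the zero would have to sit at some position after $k$. But any factor set to zero there also zeroes out every other text that passes through the same prefix and token, including true ones, which is the bind described above.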

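Another typical limitation is creative problems that fall outside statistical pattern recognition over existing phenomena: proving a statement nobody has proved, inventing a new algorithm, or proposing an entirely new theory. You might ask why this fails when models happily draw bizarre pictures nobody has ever drawn. Pictures that nobody has drawn are possible because a picture has no right answer. A proof, an algorithm, or a theory either has an answer or at least has to have its validity demonstrated before it is a meaningful result. That said, while this is clearly impossible for LLMs as they are currently built, reinforcement learning seems to leave room to address part of it. As the case of AlphaGo shows, reinforcement learning explores a space extremely well when the space is bounded and the reward is clear, especially in an MDP with no uncertainty about the current state. An RL approach would need a few conditions, though. First, the base propositions available for a mathematical proof must be predefined as a kind of action set. Second, it must be possible to verify automatically whether the target problem is solved by the list of actions the LLM proposes. For algorithms this is simpler, since you can just run them. It would be even better if partial solutions could be recognized and given graded rewards, for example when the model’s argument proves the claim only for negative values of some variable x. If these two requirements are met, mathematical proof and algorithm development can also be reduced to exploration problems, so they should be solvable. Such a model would not have to be built on an LLM, but I think it could be integrated into one, and if it were, it might suddenly discover a new proof in the middle of a conversation. My personal guess, in fact, is that in developing o1 and its successors OpenAI already goes beyond splitting the reasoning into steps with CoT: it verifies each reasoning step with a validator in real time and reasons through an exploration-based mechanism. The fact that it takes several times longer than gpt-4o is probably due to this kind of iterative inference.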
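The two requirements, a predefined action set and an automatic verifier, can be illustrated with a deliberately tiny stand-in. The sketch below is not reinforcement learning: it replaces the learned policy with plain breadth-first search, and the "propositions" are three arithmetic moves invented for the example (`add3`, `double`, `sub1`). What it does show is how, once both requirements hold, finding a valid derivation becomes a search over action lists that a machine can check without human grading.

```python
from collections import deque

# Requirement 1: a predefined action set (hypothetical toy moves).
ACTIONS = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "sub1": lambda x: x - 1,
}

def verify(start, actions, target):
    """Requirement 2: replay the proposed actions and check the target automatically."""
    value = start
    for name in actions:
        value = ACTIONS[name](value)
    return value == target

def search(start, target, max_depth=6):
    """Breadth-first exploration; returns the shortest verified action list, or None."""
    queue = deque([[]])
    while queue:
        actions = queue.popleft()
        if verify(start, actions, target):
            return actions
        if len(actions) < max_depth:
            for name in ACTIONS:
                queue.append(actions + [name])
    return None

plan = search(start=1, target=11)
print(plan)

# A target that cannot be reached within the depth limit yields no plan.
no_plan = search(start=1, target=10**9)
print(no_plan)
```

An RL system would replace the exhaustive queue with a policy that learns which actions are promising, and a graded verifier would return partial credit instead of a yes or no.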

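Now let’s talk about a more intuitive and more fundamental limitation. Jarvis in Iron Man and Samantha in the movie Her both perceive reality, communicate, and act in real time. We watch this without any sense of strangeness, but it is in fact a very hard problem, and one that an LLM of the current structure can never achieve alone. To begin with, the AIs in the movies are always running. An LLM, or indeed any deep learning model, does not run without input. Something always has to be fed in. So what should we feed in? Couldn’t we just pour in real-time data, from a camera or a microphone or whatever? Fine. Imagine a camera whose video, recorded at regular intervals, is fed in for endless, continuous inference. The camera catches a tree in the garden. What should the model output? Naturally, today’s LLMs have never learned this. The dataset problem is back, and already a great many problems have appeared. OpenAI will make the data, as I said earlier? It could. The important point, though, is that we would have to decide what the model should output in that situation. Say we decide it should respond to the garden tree image with “The tree looks healthy.” Could we then create data, item by item, specifying what to output for everything in the world that can be seen? Of course not. This may have sounded like a merely practical data constraint, but it is a fundamental problem about “will”. After all, GPT-4o can already recognize images. OpenAI can make data, but only data that arranges for the model to act so as to fulfill a user’s “intent” in situations where that intent clearly exists. What the model should do on its own, given only arbitrary sensory input and no user intent or purpose, is a matter of an entirely different order: taking “sensory information” as input and producing “inner thought” as output. If we really could turn all “sensory information” and the corresponding “inner thought” into data and train on it, we could believe that the AI is conscious.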

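Another fundamental problem is that LLMs have no memory. Suppose real-time inference can be done somehow. As explained above, the only way an LLM can maintain any context by itself is to pile it up in the input text. Couldn’t we just keep accumulating all the sensory information collected in real time in the input context? Do that, and OpenAI’s servers would surely blow up in under an hour. We take it for granted, but humans have a truly mysterious and extremely powerful capacity for memory. We process an enormous volume of information in real time, measured by bitrate, keep only what is needed through selective attention, and then either forget that information gradually or preserve it permanently, depending on its later usefulness and importance. Feeding input in at every moment does not give an LLM the human ability to maintain continuous context. Each moment of inference is just a function that looks at an independent scene and emits something. Without memory, it is naturally impossible for it to develop any kind of human relationship with us as Samantha or Jarvis do. Hasn’t OpenAI added a system that remembers conversations? That simply summarizes parts of the conversation and inserts the text automatically, which is nowhere near enough. It only spares us a tiny bit of the hassle of copying and pasting.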

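One thing that is, in a sense, naturally impossible without memory is continual learning. To be specific, the term originally refers to methods that let a trained deep learning model be optimized on a new dataset without losing the abilities and knowledge it learned from the previous one. As I said, a deep learning model adjusts its parameters solely to optimize the objective function. This means that even parameters painstakingly tuned on earlier data will be willingly sacrificed, their capacity reused to solve the current problem slightly better, if they do not help with it. The easiest way to prevent this is to keep adding new data to the existing dataset as training proceeds. Without continual learning it is impossible to teach the model a particular piece of knowledge or a way of operating something just by telling it, as one can with Jarvis and Samantha. There is In-Context Learning, so can’t we just supply it few-shot? This is debatable, but I am on the side that does not count ICL as learning. I see it as making visible an ability the model already had, but to which it assigned so little probability that it rarely showed, by greatly raising that probability through the context. And even if we call it learning, ICL is so resource-hungry that we cannot put everything into it.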

The Path to AGI

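So can an AGI like Jarvis or Samantha never be built? I believe it can. I have relied on many guesses and assumptions so far, but from here on it is purely my own hypothesis. I think the most important ability for AGI at this point is memory. Beyond the direct functions just mentioned, memory has the potential to solve many more problems. Take hallucination. How do people manage not to hallucinate? Human learning happens through memory. There is implicit memory, such as motor memory, but what we consciously recall (retrieval) as we speak belongs to explicit memory, which divides into episodic memory and semantic memory. Training a deep learning model means shaping a probability distribution over outputs with a particular tendency for particular inputs, whereas humans preserve specific concepts and facts in a very discrete way. The important thing is that humans have metacognition. For the things we can retrieve, we are aware of the very fact that “I know this.” That is why we do not hallucinate, and why we can tell lies in the true sense (utterances made knowing they differ from reality) rather than nonsense. There are several ways to think about how this metacognition is possible, but one idea is that people always remember knowledge and information together with its metadata. If you learned the Pythagorean theorem in class, semantic memory keeps the content of the theorem, while episodic memory stores the fact that “I learned the Pythagorean theorem from someone, at some time, in some place, in some way.” Thanks to that, we also remember that we know the Pythagorean theorem, and this seems to be an important part of what forms metacognition.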

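Memory is also closely tied to continual learning. Human memory is in fact a continual learning system in itself. Perhaps not at some intermediate stage on the way to human-like memory, but by the time human-level memory is successfully implemented in AI, this problem will already have been solved. The implication is enormous: the fundamental way AI learns would change. In living things, memory cannot be separated from learning, and I think today’s deep learning is closer to the general learning mechanism of animals than to an imitation of the learning that is unique to humans. Animals learn mainly through conditioning. There is classical conditioning, which learns associations as with Pavlov’s dog, and you may have seen monkeys playing, extremely well, games where they press buttons to match shapes, which is operant conditioning. Learning a stimulus and the matching behavior pattern in this way is already common among animals. Humans have this animal-like learning ability too. Anyone who has played a rhythm game knows that learning to match stimulus and action with your body feels completely different from learning a mathematical concept. Learning through conditioning does not happen because we set out to learn; it is implicit and automatic, which is why, when we learn to ride a bicycle or play the piano, we cannot explain exactly what we learned in words or recall exactly what it is that we know. This kind of learning is deeply connected to implicit memory. As another example, cognitive processes centered on pattern recognition, such as perception, are known to involve implicit memory heavily. When a person sees a dog and recognizes it as a dog, they do not consciously recall the features and criteria of a dog, match them, and then decide. It is learning and inference that happen automatically in the head, based on suitable inductive biases.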

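Why, then, is deep learning closer to implicit learning? Deep learning resembles implicit memory in many more respects. The model does not know in advance exactly what it knows (knowledge cannot be retrieved), and it infers probabilistically (closer to intuition, in human terms) only when an actual input is given. Yet isn’t deep learning AI much smarter than animals in some respects? If its basic learning principle is closer to that of animals, how can it be this smart? I think the reasons are, first, enormous data, and second, backpropagation-based gradient descent, an optimization algorithm far more powerful than anything available to real organisms. Gradient descent is a very powerful optimization technique, but it requires storing every intermediate computation, which is considered biologically impossible in a real brain. I suspect that deep learning, with abundant data and compute and the powerful optimization built on them, was able to push the usefulness of conditioning-like learning to its extreme, and that is why it shows such remarkable intelligence.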

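Then how does learning that is unique to humans take place? As mentioned above, learning through conditioning is certainly an important ability for humans as well. But conditioning can only learn from direct experience, and it is hard for us to be explicitly aware of it or to pass it on. What sets humans apart from other animals is indirect learning, in which abstract ideas conveyed especially through language are stored in semantic memory. Thanks to these abstract ideas, we can also learn new things through inner thought alone. What made possible the distinctly human intelligence that Dennett called Gregorian is learning abstract concepts as such at the conscious level, and my hypothesis is that this learning of abstraction can only happen through explicit memory. So I think that if human-like memory, explicit memory in particular, can be built into AI, it could gain abilities of an entirely new order. I basically regard logical reasoning as a matter of matching the identity of abstract ideas, so I suspect this would also solve the logical symbolic reasoning problems that current LLMs struggle with. And naturally, if active recall is possible, problems such as xAI that we struggle with would resolve themselves too.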

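If episodic memory, the other pillar of explicit memory, were present, it would naturally be possible to give the AI memories specific to each user. Conversely, such episodic memory could become the foundation of the sense of self we call self-awareness. Episodic memory is of course not self-awareness itself, but I think it is clearly a necessary condition. Add to that a solution to the further problems of real-time context tracking, and reach the hardest problem of all, the problem of consciousness known as the hard problem, and wouldn’t the result look almost like Jarvis and Samantha?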

Research Vision

2023-12-31T00:00:00+09:00

I think it’s important to write down my future research directions at least once. Do I want to study intelligence, humans, the mind, or the origins of life? I don’t know yet. What is clear is that my interest is related to all of them. The closest thing to what I imagine is less science than science fiction. In the movie Her, the OS has emotions, comes to realize many things that are (supposedly) uniquely human, and seems to acquire humanity. It has consciousness, or what looks like it, from the beginning. At the same time, because it is an AI, it also has the characteristics of an AI: it accesses electronic media quite freely, and its information processing speed is far superior. Another example is Digimon. They have consciousness and act freely in a digital world, and their biological information consists of electronic data. Another example would be Jarvis or Friday from Iron Man. I don’t know whether they have real consciousness, but to us they appear to. In the end, the key is probably consciousness itself, together with a high capacity for cognitive information processing. Can these things be implemented artificially?

The answer I believe in is “yes”. It is not much of an argument, but humans can do it, and it is surely not unique to humans. In my evolution class this semester, I learned a lot about the origin and development of life. I’ll come back to the specifics when discussing the “how”. In any case, I believe that if humans “acquired” these abilities, it is not impossible for other animals, or even machines, to acquire them. By analogy, when we designed airplanes we took the movements and body shape of real birds into account and built features of their physical structure into our machines. But I ran into an apparent contradiction: of all the countless organisms on the planet, only humans have developed such a high level of intelligence. If it is so useful and so easy to reproduce, shouldn’t more animals exhibit it? To be honest, I think this is a sharp question, and I still don’t have a solid answer. What is clear is that developing human-like intelligence and consciousness is not easy, and that it is a very expensive capability. That said, I’ve thought about two possible answers.

The first is that humans are not the only species to have developed such intelligence. There were sister groups of modern humans among the primates, chiefly the Neanderthals and the Denisovans. Both eventually went extinct, leaving only Homo sapiens. So it is not that there were only ever modern humans; there were others, but they lost the competition and disappeared. Of course, one could object that Neanderthals, Denisovans, and Homo sapiens branched off from a common ancestor, so this cannot count as intelligence developing independently in different species: it is a synapomorphy, not a homoplasy. The second answer is that there are other species. In my view, there are highly intelligent individuals in species that diverged from the human lineage at a much earlier point. I think there is a bit of human bias at work here. I think most animals are conscious. Some might find this obvious, but I mention it because the definition of consciousness is not very precise. People seem to think animals are far less intelligent than humans because we cannot talk with them and because they have not built civilizations as we have. As I see it, the problem is only that direct linguistic communication with us is mostly impossible; animals are already highly intelligent and conscious. Looking at primates like chimpanzees, at dolphins, or at highly intelligent dogs and parrots, this feels clearly true, all the more so because parrots can learn and speak human language. (They are quite capable of real conversation, not just parroting. It makes sense when you think about it: the actions and interactions we have with our dogs are certainly a form of conversation, so a parrot of similar intelligence whose vocal anatomy lets it produce human-like sounds should be capable of verbal communication.) From that, it seems clear that this high intelligence is reproduced in other species in nature.

Of course, finding it in animals is one thing; implementing it in a machine, or in nothing more than electronic information, is quite another. Why do I believe it is possible? I am not 100% confident that I can create something like Jarvis that runs on the commercial computers we have today. What I’m doing is purely speculative, and my research is itself a kind of probabilistic sampling of candidates. My belief is derived from a few assumptions.

Assumption 1. There must be a principle underlying human or animal intelligence that makes it possible.

To be more specific, this means that the phenomenon can be described physically. An easier analogy is “flying”. We can now describe, based on the laws of science, the principles that allow animals to fly. We can also calculate the conditions under which they cannot fly (e.g., a bird of a given weight and wing size will crash if its wing-flapping speed falls below a certain level). Intelligence clearly exists in the real world, and if it is consistently reproduced in this way, the idea that it rests on natural scientific principles of the same kind shouldn’t be too much of a leap.

Assumption 2. The Principle of Intelligence must have a specific structure. In other words, it can be expressed in terms of components with specific properties and the relationships and interactions between them.

Here’s the kicker. The important point is that we can express a property as a relation between the components that have it. For example, gravity can be represented by two objects with mass, the distance between them, and the gravitational constant. To represent something more abstract, more components and relations must be considered. Friction, for example, is technically an electromagnetic force. The same goes for flight, which depends on the density, temperature, and composition of the atmosphere, the surface, shape, and speed of the airplane, and so on. But however complex it is, it can be broken down into principles. Intelligence is an incomparably more complex phenomenon than any of these, which makes it very difficult for us to analyze, but the important thing is that it too must be the sum of such principles.

Assumption 3. If we can structurally reproduce the principles of intelligence, we can reproduce the phenomenon of intelligence that we observe.

In a way, this is simply the conclusion of the inference. It is like saying that if we know the law of gravity, we can reproduce it by formulating it in a computer simulation: if we know the principle, we can model it through simulation. Of course, I’m not 100% sure that this is possible with today’s Turing machine-based computers. Based on these assumptions, one might then say that to realize intelligence artificially we must first discover the laws of intelligence. But is that possible in this lifetime? Actually, I don’t think so.

Obviously, if we knew 100% of the principles, the only question would be how to reproduce them. But if you don’t know the principle, is it impossible to reproduce the phenomenon? I don’t think so. Take fire as an example. Humans have been using fire for more than a million years. Did people back then fully understand how fire works? Probably not. Someone probably first came across a naturally occurring fire by accident, realized it was hot and could be transferred, and used it; someone accidentally struck a flint and discovered fire; or someone saw others make a fire and assumed that the action (e.g., holding a piece of wood and rubbing it hard) was an essential element of fire. Some of them may have realized that fire is about combustible material and high temperature, but even then they likely did not realize that oxygen played a role. In this way, understanding the underlying laws is not essential to reproducing the phenomenon itself. If you want to build an airplane, you can design it from a theory of flight, but you can also take an existing airplane apart, manufacture the same parts, and assemble them into a new one. In fact, reproducing a phenomenon often leads to discovering its principles in detail. This is what my research will eventually be about: trying to recreate the phenomenon of intelligence from uncertain and highly fragmented information.

In the end, the key is the brain, so creating an artificial organism that resembles the human brain, such as a brain organoid, could be one research direction. Is this goal achievable within the framework of deep learning, that is, “parameter optimization by gradient descent”? I’m not really sure. To be honest, I lean toward thinking it is difficult. In the first place, it is very hard to explain the neuroplasticity of living things in terms of gradient descent and backpropagation. The general consensus is that gradient descent and backpropagation, as used in deep learning, are impossible in living things. So what are the important factors in artificially reproducing intelligence? From here on, this is 100% my personal opinion, and my future research will focus on these areas. They sit at different levels of abstraction.

The first is memory. Memory, a dynamically changing unit that is to some extent separate from the information processing system, seems to me to be responsible for much of intelligence and consciousness. Dynamic memory is what gives an entity temporal continuity. To use a human example, suppose someone has general world knowledge (semantic memory) but no episodic memory, so that he answers questions about what he knows well but cannot remember anything he has been asked before, or any personal anecdote. We would not find him very human, and if we chatted with him we would not imagine we were talking to a person. On the other hand, if a machine or AI had such an episodic memory, and ChatGPT remembered and referred to what it had said before while also talking about new things, wouldn’t it feel quite human to us? Moreover, memory itself is inseparable from learning. Retaining new information, sensing the passage of time, and learning a specific procedure are all done through memory. I therefore feel that current deep learning mechanisms, where training and inference are completely separate processes, do not reflect these characteristics. That is why my first research topic was memory.

I am also interested in initialization of a kind. There seems to be a very big difference between the initial state of a human and that of an ANN. Humans are actually born with a lot. In the case of language, for example, there is said to be a critical period for learning a first language, and it is widely held that someone who misses it will have great difficulty ever fully acquiring one. This suggests that humans start out with that kind of plasticity, whether as structure or as abstract knowledge, that makes learning certain things possible at all. When we train a neural network, however, the only thing that changes as training progresses is the internal weights, and the only thing it has in the initial state is the architecture. The architectures in use today are certainly the result of a lot of research, but I don’t think they reflect a significant portion of the innate capabilities and potential that humans have before they are born. My reasoning is that without improvements in these areas, much of the learning that humans are capable of will be difficult to implement, and AI will not reach human-level intelligence.

On a completely different level, there’s evolution. Many people think that humans only learn after they are born, but in my opinion the learning of the species is much more important than the learning of the individual. An ANN’s parameter optimization corresponds to changes in human synaptic weights, so it deals with postnatal learning. Humans as a species, not as individuals, have had a much longer history of learning about the planet and the laws of nature. That learning usually goes by a different name: evolution. Three things are important for evolution: variation, inheritance, and differential reproductive success. First, a species needs to reproduce, and not by making identical copies; it must consistently produce variation. Second, offspring must inherit the traits of their parents rather than being something random and completely different. Finally, the variants should not all reproduce equally well; their success should differ according to their characteristics. From this point of view, evolution cannot be achieved with the current basic structure of deep learning. To begin with, AI models don’t reproduce. You could imitate reproduction heuristically by copying the models, but that gives no variation, so you would add it heuristically too, say by applying random noise to the weights. I suppose there is room for this, but I’m not really excited about it. I think a system should mimic the phenomena of life in a more sophisticated way. It should reproduce sexually so that it gets a lot of genetic variation, and there should be mechanisms such as crossing-over and mutations in the base sequence, all implemented with more care. If it were just a matter of somehow satisfying those three conditions, I think it would have been done long ago. The important thing, I think, is that the variation has to keep diverging, not converge, no matter how many generations pass. Somehow life has managed to do that. The weight-noise method I just sketched would surely converge after enough generations. And unlike DNA, where a single sequence change can cause a very large functional change, changing the magnitude of a parameter in a neural network ultimately produces only a quantitative difference in the strength of a particular response. In DNA, a sequence of bases can discretely switch the ability to make a particular protein on or off, so a single gene mutation can produce a far more powerful change. Clearly, each of these considerations needs to be thought through.
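The weight-noise heuristic mentioned above can be sketched in a few lines, which also makes it easy to see why it tends to converge rather than diverge. Everything here is an illustrative assumption: the fitness function (closeness to an all-ones vector), the population size, the noise scale, and the names `fitness`, `mutate`, and `evolve`. It implements only the three bare conditions of variation, inheritance, and differential reproductive success, with none of the richer mechanisms such as sexual reproduction or discrete gene effects.

```python
import random

def fitness(genome):
    """Toy fitness: closer to the all-ones vector is better (assumed for illustration)."""
    return -sum((g - 1.0) ** 2 for g in genome)

def mutate(genome, rng, sigma=0.1):
    """Variation with inheritance: copy the parent, then add Gaussian noise to each weight."""
    return [g + rng.gauss(0.0, sigma) for g in genome]

def spread(population):
    """Average per-weight variance across the population, a rough measure of diversity."""
    n = len(population)
    total = 0.0
    for i in range(len(population[0])):
        column = [genome[i] for genome in population]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total / len(population[0])

def evolve(generations=50, population_size=20, genome_length=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_history, spread_history = [], []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        spread_history.append(spread(population))
        # Differential reproductive success: only the top quarter leaves offspring.
        parents = population[: population_size // 4]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return best_history, spread_history

best_history, spread_history = evolve()
print("best fitness, first and last:", best_history[0], best_history[-1])
print("diversity, first and last:", spread_history[0], spread_history[-1])
```

Because the parents are kept unchanged each generation, the best fitness never decreases; printing the diversity measure alongside it shows how the population behaves under this scheme, which is the convergence concern raised in the text.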

In this respect, what I would most like to do is to study the minimal structure that is as simple as possible but still has the potential to evolve. What would I call a minimal evolutionary entity? An entity that has the minimum requirements for inheritable reproduction, but has the potential to evolve into an entity with human-level complexity. If we know enough about its structure and mechanism, can we implement it artificially? If we can model it computationally, just by changing the simulation environment… could we achieve a high degree of evolution?

The closest thing to AGI right now is probably ChatGPT. But even ChatGPT ultimately rests on gradient descent and backpropagation, and probably the Transformer architecture (unless OpenAI has quietly applied to GPT-4, whose details it has not disclosed, some amazing secret research worth about ten Nature papers). I don’t think it brings any improvement on the points I’m concerned about. Each and every one of them is a pretty hard problem. So my view is that OpenAI’s current approach will not produce a conscious AGI. Of course, we don’t know. In fact, these points are closer to my own daydreams: I have only raised questions and have not yet begun to survey prior research in this area, let alone the methodology. In the future, I hope to find answers to these questions through my research. Maybe I should dedicate my life to it.

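I feel I should organize my thoughts on my future research direction at least once. Keeping them only in fragments in my head makes them hard to look back on later. Is what I want to do research on intelligence, on humans, on the mind, or on the origin of life? I don't know yet. What is clear is that it is unrelated to none of these. What I imagine is actually closer to science-fiction films than to science. The OS in the movie Her has emotions, comes to realize many things that are (thought to be) uniquely human, and appears to acquire something like humanity. It had (what looks like) consciousness from the start. At the same time, because it is an AI, it has traits unique to AI, such as nearly free access to electronic media and vastly superior information-processing speed. Another example is Digimon. Digimon also have consciousness and act freely, inside a digital world. Their biological information is made up of electronic data. Yet another example would be JARVIS or FRIDAY from Iron Man. I don't know the detailed setting, so I can't say whether they are truly conscious, but to us they certainly appear to be. In the end, the core is probably consciousness itself, accompanied by a high capacity for cognitive information processing. Can such things be implemented artificially?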

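First, the answer I believe in is "yes." It is hardly evidence, but humans can do it, after all. And it is surely not the exclusive property of humans. Taking a course on evolution this semester, I learned a lot about the origin and development of living things. Let's come back to the specifics when we discuss the "how." In any case, I think that if humans "acquired" these abilities, then it is not impossible for other animals, or even machines, to acquire them too. As an analogy, when we designed airplanes we considered the movement and body shape of real birds and carried the features arising from that physical structure over into the machine. But I also struggled with a contradiction: among the countless groups of organisms on Earth, only humans developed such high intelligence. If this ability were so useful and so easy to reproduce, shouldn't it have appeared in many more animals? Honestly, I think this question is quite sharp, and I still cannot give a definite rebuttal. What is clear, though, is that developing human-like high intelligence and consciousness is certainly not easy, and that it is a very expensive ability. Even so, I have thought about two possible answers.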

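The first is that humans are not strictly the only species to have developed such intelligence. To begin with, there are the sister groups that appeared before modern humans; broadly, these seem to be the Neanderthals and the Denisovans. In the end those two went extinct and only Homo sapiens remained. So perhaps modern humans were not the only ones: there were others, but they lost the competition and disappeared. Of course, one can object that since Neanderthals, Denisovans, and Homo sapiens all branched from a common ancestor, this does not count as intelligence having developed in a different species. It is a synapomorphy, not a homoplasy. The second answer is that there are other species. As I see it, there are highly intelligent animals among species that split from the human lineage far earlier. I think people's prejudice plays a part here. I believe most animals are conscious. Some may find this obvious, but I mention it because the definition of consciousness is not actually that rigorous. Yet people seem to think animals are far less intelligent than humans simply because we cannot converse with them and because animals have not built civilizations as humans have. In my view, the problem is that direct linguistic communication with us is mostly impossible; animals already possess what I consider sufficiently high intelligence and consciousness. This really hits home when you look at primates such as chimpanzees, at dolphins, or at highly intelligent dogs and parrots. Parrots especially, because they can learn and speak human words. (They do not merely mimic; actual conversation is possible to a fair degree. This is natural when you think about it: the behavior and interaction we have while raising a dog is surely a kind of conversation too, so a parrot of similar intelligence whose vocal anatomy allows human-like sounds should be capable of verbal conversation.) Seeing this, high intelligence does appear to be reproduced in other species in nature.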

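Of course, being found in animals is an entirely different matter from whether it can be implemented in a machine, or even purely as electronic information without a machine. So why do I still believe it is possible? I am not 100% certain that we can build something like JARVIS that runs on the commodity computers we have today. What I am doing is pure speculation, and the research I want to do is itself a kind of sampling over probabilistic candidates. My belief that implementation is possible follows from a few assumptions.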

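Assumption 1. Underlying the intelligence of humans and animals, there must be a principle that makes it possible.

To be more concrete, this means the phenomenon can be described physically. An easier analogy is "flight." We can now describe how animals fly in physical terms, based on the laws of science. We can also calculate the conditions under which flight is impossible (for example, a bird of a given weight with a given wing size and plumage will fall if it flaps below a certain rate). Intelligence also clearly exists in the real world, and if it is reproduced this consistently, it is not much of a leap to think that it rests on such natural-scientific principles.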

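Assumption 2. The principle of intelligence has a specific structure. In other words, it can be expressed in terms of components with specific properties and the relations and interactions among those components.

This is where it gets important. In a sense it is not different from Assumption 1. The key point is that it can be expressed as relations among components with specific properties. Gravity, for example, can be expressed with two objects that have mass, the distance between them, and a constant. Of course, expressing more abstract things requires considering more components and relations. Friction, for instance, would strictly have to be expressed in terms of the electromagnetic force. The same goes for flight: the density, temperature, and composition of the air, the surface, shape, and speed of the aircraft, and so on. It becomes very complicated. But however complicated, it can be analyzed by breaking it into several principles. Intelligence is incomparably more complex than these, so analyzing it is very hard, but the important point is that it is presumably a phenomenon arising from the sum of such principles.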

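Assumption 3. If we can structurally reproduce the principle of intelligence, we can reproduce the phenomenon of intelligence that we observe.

In a way this is itself the conclusion of the inference. It is like how, knowing the law of gravity, we can turn it into a formula in a computer simulation and reproduce gravity. The inference is that if we grasp the principle, we can model it through simulation. Of course, I am not certain this is 100% possible with today's Turing-machine-based computers. One might then ask: to realize intelligence artificially on these assumptions, don't we first have to discover the laws of intelligence? And would that be possible in this lifetime? Actually, I don't think that is necessarily required.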

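Certainly, if we knew the principle 100%, we would only have to think about how to reproduce it. But is reproduction impossible without knowing the principle? I don't think so. Take fire as an example. Humans are said to have used fire for more than a million years. Did people then use it with a complete understanding of how fire works? Surely not. Someone must first have seen fire that arose naturally by chance, learned that it is hot and can be spread, and used it; someone else must have discovered fire by chance while striking flints. Or someone may have watched only the motions of another individual making fire and concluded that those motions (say, vigorously rubbing sticks together) were the essential ingredient of fire. Some individuals may even have realized the underlying principle that fire ultimately depends on fuel and high temperature. Even then, they most likely did not know that oxygen plays an important role. So reproducing a phenomenon does not necessarily require understanding its underlying laws. If you want to build an airplane, you can design it from the theory of flight, but you can also take an existing airplane apart, manufacture identical parts, and reassemble it without understanding anything. In fact, principles are often pinned down more precisely the other way around, by reproducing the phenomenon first. My research will ultimately be part of this: an attempt to reproduce the phenomenon of intelligence with only uncertain and quite fragmentary information.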

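Even with this approach, the possible directions vary enormously. Since the core is ultimately the brain, building artificial organic matter similar to the human brain, like brain organoids, could also be research in this direction. Is this goal achievable within the confines of deep learning, that is, "parameter optimization by gradient descent"? I am not sure. Frankly, my opinion leans toward it being difficult. To begin with, gradient descent and backpropagation are, biologically, very hard to reconcile with neural plasticity in living organisms. The mainstream view is that gradient descent and backpropagation as used in deep learning are not possible in living organisms. Then what are the important elements for reproducing intelligence artificially? From here on, this is 100% my personal view, and my future research will focus on these areas. Their levels of abstraction vary widely.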

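The first is memory. Memory, a dynamically changing device that is to some extent distinct from the information-processing system, seems to me to account for a large part of intelligence and consciousness. Dynamic memory makes an individual a being with temporal continuity. Imagine a person who has general knowledge about the world but cannot form individual memories: they answer each question well, but retain nothing of what was asked before and have no personal anecdotes. We would not feel that they are human-like, and if we chatted with them we would never imagine they were a person. Conversely, what if a machine or AI had such memory? If ChatGPT, as time passes, remembered and referred to what was said before while carrying the conversation forward, wouldn't it feel quite human? On top of that, memory is inseparable from learning: new information, the passage of time, and learning a particular procedure all happen through memory. For this reason I personally feel that the current deep learning mechanism, in which training and inference exist as entirely separate processes, fails to reflect this property. That is also why my first research topic was memory.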

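On another front, I am interested in a kind of initialization. There seems to be a huge difference between the initial state of a human and that of an ANN. Humans are in fact born with a great deal. For language, for example, there is a critical period for learning a first language, and it is said that if this period is missed, language cannot be learned. Considering this, humans can be seen as already having such plasticity by default, whether as structures that enable particular kinds of learning or as abstracted knowledge. But when we train a neural network, all that changes as training proceeds is the values of the weights, and all it has in its initial state is the architecture. The architectures widely used today are of course the product of much research, but I do not think they reflect any substantial part of the innate abilities and potential that humans have before learning from experience. And my inference is that without improvement here, much of the learning humans are capable of will be hard to implement, and human-level intelligence will be hard to reach.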

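At an entirely different level, there is evolution. Many people think humans learn only after birth, but as I see it, the learning of the species matters far more than the learning of the individual. Parameter optimization in ANNs corresponds to changes in human synaptic weights, so it seems to deal only with learning after birth. But humans, as a species rather than as individuals, have been learning about the Earth and the laws of nature for far longer. That learning usually goes by another name: evolution. Three things matter in evolution: variation, inheritance, and differential reproductive success. First, reproduction must be possible, and rather than producing exact copies, it must steadily produce variation. Yet offspring must not be generated at random, completely unrelated to their parents; they must inherit their parents' traits. Finally, these variants must not all reproduce equally well; reproductive success must differ according to their traits. From this perspective, evolution cannot be achieved with only the basic structure deep learning has today. For one, AI models have no reproduction. One could try it heuristically, I suppose: just copy the model? But that has no variation, so maybe add random noise to the weights... I don't know. There may be room to try this, but I don't expect much. I suspect we need to imitate the phenomena found in living organisms more elaborately. Sexual reproduction should be possible so that substantial genetic variation is secured, and beyond that, shouldn't mutations such as crossing-over and base shifts be implemented more precisely? If merely satisfying those three conditions somehow were enough, it would have worked long ago. I think the important thing is that, no matter how many generations pass, the variation has to diverge rather than converge. Somehow life manages to do that. But the weight-noise method I just sketched will surely converge after enough generations. And unlike DNA, where a single base can lead to a very large functional change, the magnitude of a parameter in a neural network ultimately produces only a quantitative difference in the continuous value of a particular response. In DNA, a sequence of bases discretely switches whether a particular protein can be synthesized or not, so even a simple single-gene mutation can produce a more powerful change. Each of these considerations clearly needs to be thought through in depth.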
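
To make the three ingredients concrete, here is a minimal sketch of the naive "copy the model and add weight noise" idea in plain Python. Everything in it is an assumption made for illustration: the three-number "model", the `fitness` function with its arbitrary target, and the population sizes. It is not a claim about how evolution should actually be modeled.

```python
import random

def fitness(weights):
    # Hypothetical fitness: closeness to an arbitrary target vector.
    target = [0.5, -1.0, 2.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def reproduce(parent, rng, noise_std=0.1):
    # Inheritance: the child starts as a copy of the parent.
    # Variation: every weight is perturbed with Gaussian noise.
    return [w + rng.gauss(0.0, noise_std) for w in parent]

def evolve(generations=50, population_size=20, survivors=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(3)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Differential reproductive success: only the fittest leave offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]
        best_per_generation.append(fitness(parents[0]))
        children = [reproduce(rng.choice(parents), rng)
                    for _ in range(population_size - survivors)]
        population = parents + children
    return best_per_generation

history = evolve()
print(history[0], history[-1])
```

Because the variation here is only small continuous noise around the parents, the population can only drift toward the single optimum of the fixed fitness function, which is the kind of convergence the paragraph above worries about.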

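In this respect, what I would most like to do is study the minimal structure that is as simple as possible but still has the potential to evolve. If I had to give it a name, perhaps a "minimal evolutionary entity": an entity that has only the minimum requirements for inheritable reproduction, yet has the potential to evolve into an entity of human-level complexity. If we understood its structure and mechanism well enough, could we implement it artificially? And if we could model it computationally, could we achieve a high degree of evolution just by changing the simulation environment?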

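The closest thing to AGI right now is probably ChatGPT. But even ChatGPT will, in the end, probably not depart from gradient descent and backpropagation, and likely the Transformer architecture (GPT-4 has not been disclosed in detail, so this holds unless they alone have applied secret research astonishing enough to be published in Nature ten times over). I don't think there have been improvements at the points I'm concerned about, because each of them is a very hard problem. So my view is that OpenAI's approach so far will not produce a conscious AGI. Of course, nobody knows. In truth, these points are closer to my own daydreams: I have only raised the questions and have not yet begun to survey prior work in these areas, let alone the methodology. Going forward, I hope to find answers to these questions through my research. It may well take a lifetime.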

Review of the 4th AI X Bookathon

2023-01-27T00:00:00+09:00

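Participation

AI Bookathon is a hackathon held at my university in which participants write essays with AI. I am interested in both writing and AI, so I had wanted to take part ever since I first saw the competition. There were problems in forming a team after I signed up, but after some twists and turns I teamed up with fellow participants from Kyung Hee University.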

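Modeling

Judging from write-ups of previous editions, teams seemed to fit their models on very small amounts of data. Some teams apparently collected data after the topic was announced and trained further on it, but I felt that was not a way to make full use of a PLM's language modeling ability, so I decided to take a different approach. Training that way feels less like building a model that really writes and more like implicitly storing a particular kind of text in the model and retrieving it.

My training goal was: "Build a model that is as general-purpose as possible and good at writing essays overall!"

Following this idea, I picked the best PLM I could and worked to collect as much diverse, high-quality data as possible.

One further concern was that the competition required a piece of around 20,000 characters, so I thought about how the model could keep writing while maintaining a long context. My approach was to split each text into several segments and extract a summary of each segment with a summarization model. The summaries of earlier segments were then fed in together as part of the prompt for the current segment, so that the model was trained to retain a record of what came before.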
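
The segment-plus-summary idea can be sketched roughly as follows. This is not the team's actual code (that lives in the repository linked under Outputs); the fixed character window, the `summarize` placeholder standing in for a real summarization model, and the prompt layout are all assumptions made for illustration.

```python
def split_into_segments(text, segment_chars=200):
    # Fixed-size character windows; a real pipeline would likely split on
    # paragraph or sentence boundaries instead (assumption).
    return [text[i:i + segment_chars] for i in range(0, len(text), segment_chars)]

def summarize(segment, max_chars=40):
    # Hypothetical stand-in for a summarization model:
    # it simply keeps the first few characters of the segment.
    return segment[:max_chars].strip()

def build_training_examples(text, segment_chars=200):
    examples = []
    previous_summaries = []
    for segment in split_into_segments(text, segment_chars):
        # The prompt carries summaries of all earlier segments,
        # so the model sees a compact record of what came before.
        prompt = "\n".join(previous_summaries)
        examples.append({"prompt": prompt, "target": segment})
        previous_summaries.append(summarize(segment))
    return examples

essay = "word " * 100  # 500 characters of toy text
examples = build_training_examples(essay)
print(len(examples))
```

Each example pairs a prompt made of earlier summaries with the segment to be written next, so the prompt grows by one short summary per segment instead of by the full text.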

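Results and Judging Criteria

In the end we did not win a prize. There could be many reasons, but the main one seems to be that the judging criteria did not fit our team's strategy. Until the day before the finals, the organizers had announced that human intervention would be allowed only in a very limited way: basically, apart from the initial prompt, the model generates nearly all the sentences, and after each generation a person may pick a cut point, discard what follows, and add a single word. But once the finals began, there was effectively no limit on human intervention. You could cherry-pick the model outputs you wanted, edit them by hand, or insert sentences wherever you liked and regenerate. Our team assumed that writing with little human intervention and heavy use of the model's ability would be what mattered, so humans wrote only four subheadings and the entire body text was generated by the model from those subheadings. That turned out to be a losing move.

Another important point: with 15 teams there were 15 pieces of about 20,000 characters each, and the judges simply could not read them all. And although we were told to write around 20,000 characters, there was apparently no real penalty for writing less. In fact, I believe the first-place team's piece was under 10,000 characters. If the text will not be read in full anyway, then cutting the length and raising the quality is obviously the better strategy, and it is a pity this was not communicated in advance.

One of the judges, a professor who works on NLP, said our team did the best technically, but technical merit did not seem to weigh much in the overall evaluation. And when length does not matter and heavy human intervention is allowed, investing in techniques that let the model itself maintain a long context is not worth much. So, unfortunately, none of our strategies matched the judging criteria. Of course, had we built a super AI essay model that writes far better than humans, we might have won regardless of such criteria.

If you take part in a future AI X Bookathon, it would be good to go in knowing that this is how the judging works.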

Outputs

The models we trained are published on Hugging Face at https://huggingface.co/khu-bot.

The code we wrote and our final piece are available at https://github.com/khu-bot/ai-essayist,

and a detailed write-up of our team's techniques by another team member is at https://laonmoon.tistory.com/199.

Summary of ∞-former: Infinite Memory Transformer

2022-07-22T00:00:00+09:00

∞-former: Infinite Memory Transformer

Abstract

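Proposes the ∞-former, which extends the vanilla Transformer with an unbounded long-term memory
The attention complexity of the ∞-former is independent of context length, with a trade-off between memory size and precision
Performance is validated on sorting, language modeling, and dialogue generation tasks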

Introduction

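The ∞-former extends the Transformer with an unbounded long-term memory (LTM) so that attention can be applied to contexts of arbitrary length.
The key idea of the LTM is attention over a continuous space: the model input is represented as a continuous signal, a linear combination of $N$ radial basis functions.
As a result, the attention complexity of the ∞-former is $\mathcal{O}(L^2 + L \times N)$, in contrast to $\mathcal{O}(L \times (L+L_{\text{LTM}}))$ for the vanilla Transformer
This makes it possible to reduce computation by choosing $N$ smaller than the number of tokens, and a context of arbitrary length can always be represented at a fixed size
Of course, the reduced computation comes with a loss of precision; to address this, sticky memories are introduced. More space is allocated to frequently used memories so that important information can be retained permanently
Contributions claimed by the authors
- Propose the ∞-former, making attention complexity independent of input length so that long contexts can be handled
- Present a way for the model to maintain an unbounded context in memory
- Introduce sticky memories, which force important information to persist longer
- Compare empirically on three kinds of tasks and show the benefits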

Background

$L$: input length

$e$: embedding size

$X= [x_1, \ldots, x_L ] \in \mathbb{R}^{L \times e}$ : input sequence

$Q, K, V$: attention query, key, and value

Continuous Attention

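The continuous attention mechanism, proposed in 2020, replaces the attention probability mass function over words with a probability density function over a signal
To use continuous attention, the text input $X \in \mathbb{R}^{L \times e}$ is first converted into a continuous signal
- This amounts to expressing the input as a combination of basis functions
Each $x_i\ (i\in \{1, \ldots, L \})$ is mapped to a position $t_i \in [0,1]$ in the interval, with $t_i = i / L$
A continuous-space representation $\bar{X}(t) \in \mathbb{R}^e$ is then obtained for every $t \in [0,1]$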

\[\bar{X}(t) = B^\intercal \psi(t)\]

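$\psi(t) \in \mathbb{R}^N$ is a vector of $N$ RBFs
$B$ can be obtained by multivariate ridge regression
After converting the input into the continuous signal $\bar{X}(t)$, the next step is to attend over this signal.
Unlike regular attention, which uses a discrete probability distribution over the input, a probability density $p$ is used.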

\[c = \mathbb{E}_p [\bar{X}(t)]\]
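
As a rough numerical illustration of $c = \mathbb{E}_p[\bar{X}(t)]$, the sketch below evaluates $\bar{X}(t) = B^\intercal \psi(t)$ with Gaussian RBFs and integrates it against a Gaussian density using a midpoint rule. The coefficient matrix `B` is simply given here rather than fitted by ridge regression, and the RBF width, the toy numbers, and the renormalisation of $p$ over $[0,1]$ are assumptions of this sketch, not details taken from the paper.

```python
import math

def rbf_vector(t, centers, width):
    # psi(t): N Gaussian radial basis functions evaluated at position t.
    return [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def signal(t, B, centers, width):
    # X_bar(t) = B^T psi(t), with B given as an N x e list of lists.
    psi = rbf_vector(t, centers, width)
    return [sum(B[n][d] * psi[n] for n in range(len(B))) for d in range(len(B[0]))]

def gaussian_pdf(t, mu, sigma):
    return math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def continuous_context(B, centers, width, mu, sigma, steps=1000):
    # Approximates c = E_p[X_bar(t)] with a midpoint rule on [0, 1].
    # The density is renormalised over [0, 1] (an assumption of this sketch).
    context = [0.0] * len(B[0])
    mass = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        p = gaussian_pdf(t, mu, sigma)
        x = signal(t, B, centers, width)
        mass += p
        context = [c + p * v for c, v in zip(context, x)]
    return [c / mass for c in context]

centers = [0.0, 0.5, 1.0]                   # N = 3 basis functions
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # N x e coefficients, e = 2
c = continuous_context(B, centers, width=0.3, mu=0.5, sigma=0.01)
print(c)
```

With a very narrow density the context vector is close to the signal value at the mean, and widening `sigma` blends in neighbouring positions, which is the continuous analogue of spreading attention over more tokens.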

Infinite Memory Transformer

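The authors extend the vanilla Transformer with a continuous LTM so that it can make use of long contexts
The LTM stores the input embeddings and hidden states of previous steps
They also consider using a short-term memory (STM) that extends the hidden states, as in Transformer-XL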

Long-term Memory

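In the ∞-former, the output $Z$ of each layer is the sum of $Z_T$, obtained from the input and the hidden-state cache as in Transformer-XL, and $Z_{LTM}$, obtained from the long-term memory
Starting from $\bar{X}(t)=B^\intercal \psi(t)$, $B$ is projected by separate linear layers to obtain $K$ and $V$; a mean and variance are computed from $K$ to define the probability density $p$; $V$ is multiplied by $\psi$ to recover a value signal; and $Z_{LTM}$ is the average of that value signal weighted by $p$.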

Unbounded Memory

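When a discrete sequence is used as memory, extending it requires storing new hidden states.
The ∞-former can record an unbounded context in a fixed size.
1. First, sample $M$ positions in $[0,1]$ from $\bar{X}(t)$
2. Then concatenate those points with the new points, using $\tau$ as the boundary between the old part and the new part.
3. Fit multivariate ridge regression again on these points to obtain the new $\bar{X}(t)$.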

Sticky Memories

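When sampling $M$ positions in $[0,1]$ from the LTM, one could sample at uniform intervals, but that ignores the fact that some positions may matter more than others.
The authors propose sampling the $M$ positions according to the relevance of each region
1. First, divide the signal domain evenly into $D$ regions
2. Then compute a probability for each region ($H$ is the number of heads, $L$ is the sequence length)
  \[p(d_j) \propto \sum_{h=1}^H \sum_{i=1}^L \int_{d_j} \mathcal{N}(t; \mu_{h,i}, \sigma_{h,i}^2)\ dt\]
3. Then sample a total of $M$ positions according to the probability of each region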
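
The three steps above can be sketched as follows. The Gaussian integral over each region is computed from the normal CDF via `math.erf`; the flat lists `mus` and `sigmas` stand for the $(h, i)$ pairs of the formula, and drawing a uniform location inside the chosen region is an assumption of this sketch rather than something stated in the notes.

```python
import math
import random

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def region_probabilities(mus, sigmas, num_regions):
    # p(d_j) is proportional to the total Gaussian mass that all
    # (head, query) attention densities place on region d_j.
    edges = [j / num_regions for j in range(num_regions + 1)]
    scores = [
        sum(gaussian_cdf(hi, m, s) - gaussian_cdf(lo, m, s) for m, s in zip(mus, sigmas))
        for lo, hi in zip(edges, edges[1:])
    ]
    total = sum(scores)
    return [score / total for score in scores]

def sample_positions(probs, num_samples, rng):
    # Pick a region by its probability, then a uniform point inside it (assumption).
    regions = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    return sorted((r + rng.random()) / len(probs) for r in regions)

mus = [0.15, 0.2, 0.8]        # toy attention means over (head, position) pairs
sigmas = [0.05, 0.05, 0.05]   # toy attention standard deviations
probs = region_probabilities(mus, sigmas, num_regions=5)
positions = sample_positions(probs, num_samples=8, rng=random.Random(0))
print(probs)
print(positions)
```

Regions that received more attention mass get more of the $M$ sample points, so they keep a finer resolution when the signal is re-fitted.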

Experiments

Sorting

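A task of generating numbers given in arbitrary order, sorted in descending order
Transformer-XL and the Compressive Transformer are used as baselines

Model size and memory size are set to be fair across models
When the sequence length is short, Transformer-XL performs best because it holds all the information in memory, but its performance drops sharply as the length grows
The ∞-former outperforms the Compressive Transformer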

Language Modeling & Document Grounded Dialogue

The $\infty$-former performed best.
The performance gap was larger on PG-19, which requires memory of longer contexts, than on Wikitext-103.
Using sticky memories was better.

Conclusions

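Proposes the $\infty$-former, a model that can use an unbounded long-term memory
Uses attention over a continuous space to obtain attention complexity independent of input length
Applies sticky memories, which take past memory usage into account so that more important information is remembered longer
Demonstrates the improvements experimentally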

Review of Memformer: A Memory-Augmented Transformer for Sequence Modeling

2022-03-11T00:00:00+09:00

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Abstract

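Transformers store every token-level representation as memory, which causes efficiency problems
The authors propose Memformer, a more efficient model that uses an external dynamic memory to encode and retain past information
Memformer has linear time complexity and constant space complexity with respect to the input
They present MRBP, an optimization method that reduces the memory usage of BPTT
Experiments show that at inference it uses 8.1x less memory and runs 3.2x faster than the baselines
Attention analysis shows that it can encode and retain important information across timesteps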

Introduction

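Humans perceive sensory information, encode it in compressed form, and store it in neurons
They then retrieve the stored information and apply it effectively to a variety of tasks
There have been continued attempts to bring memory systems into neural networks
- RNN, LSTM, GRU, NTM (Neural Turing Machine), DNC (Differentiable Neural Computer)
The Transformer appeared, abandoning recurrence and instead using self-attention, which requires $\mathcal{O}(N^2)$ computation
Models appeared to reduce the computational cost of the Transformer
- Reformer, Sparse Transformer, Longformer, Linformer
- These reduce the complexity of self-attention and can handle longer sequences, but still have linear space complexity
Transformer-XL reintroduced memory and recurrence, but simply keeping raw hidden states cannot compress information
The Compressive Transformer improves on this by compressing the memory into fewer vectors
- However, in both Transformer-XL and the Compressive Transformer, information older than a certain number of timesteps is structurally bound to be discarded
Memformer integrates a fixed-size external dynamic memory with a recent Transformer
Memformer reads and writes memory by interacting with the external dynamic memory
A forgetting mechanism makes it easier to memorize new information
BPTT, which is needed to train a recurrent model like Memformer, uses a large amount of memory, so the authors propose memory replay back-propagation (MRBP), which substantially reduces memory usage
Experiments applying Memformer to auto-regressive image generation and language modeling verify that it matches Transformer and Transformer-XL in performance while being far more efficient in computation speed and memory usage
Analysis shows that Memformer can retain information over extended periods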

Recurrence and Memory

There are mainly two attempts to apply recurrence and memory to Transformers: Transformer-XL, which uses relative positional encoding and a segment-level recurrence mechanism, and the Compressive Transformer, which additionally compresses previous hidden states
However, directly using past hidden states limits the maximum length of past context that can be seen

Dynamic Memorization

Dynamic memorization in theory has no limit on the maximum context length
Methods include the Neural Turing Machine and the Differentiable Neural Computer, which use an external memory to exploit long-range memory
However, complex memory mechanisms tend to make training slow and unstable
This paper proposes a more efficient dynamic memorization mechanism

Methods

Segment-level Sequence Modeling

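Given a sequence $x_1, x_2, \ldots, x_N$ of $N$ tokens, an LM typically learns the joint probability as the product of the conditional probabilities of the individual tokens.
\[P(x) = \prod_t P(x_t \vert x_{<t})\]
With a large external memory, the memory cannot interact with every token, so the sequence is split into $T$ segments of length $L$.
\[s_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,L}\}\]
Because a bidirectional encoder is better at extracting word representations, a Transformer encoder-decoder architecture is used.
- (Comment) In effect, it feels like a Transformer-based generative model in which only the encoder additionally uses the memory
The encoder encodes segment $s_t$ using the previous timestep's memory $M_{t-1}$ and stores the encoder output in memory $M_t$
The final encoder output goes into the decoder's cross-attention layers and is used to predict the tokens of the next timestep's segment $s_{t+1}$

\[\begin{align*} M_t && = \text{Encoder}(s_t, M_{t-1}) \\ P(s_t \vert s_{<t}) && = \prod_{n=1}^{L} P_{\text{Decoder}}(x_{t,n} \vert x_{t,<n}, M_{t-1}) \\ P(x) && = \prod_{t=1}^{T} P_{\text{Model}}(s_t \vert s_{<t}) \end{align*}\]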

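At each timestep, given a segment as input, the model generates the next text segment, and the generated segment is fed back in as the model's input

Because the memory stores all past information, every token can be generated auto-regressively

(Comment) But here $x_7$ is shown as generated by the decoder; shouldn't $x_7$ then go in as encoder input for the segment after next?
- This is probably because true and predicted values are not distinguished in the notation. Thinking of it as a generative model, at training time there are both the ground-truth values and the predictions produced this way, so it would be trained with teacher forcing. At inference time, treating it as an LM, the values generated for the first segment are fed back as the encoder input of the second segment to generate again, and so on. That should work.
- The question then is: at inference, what if the input context exceeds one segment?
  - I don't think that would work...?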

External Dynamic Memory Slots

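The external dynamic memory (EDM) is a data structure that stores past inputs as high-level representations
"Dynamic" means that the model interacts with the memory, reading and encoding data in a recurrent manner
In this paper's design, a fixed number $k$ of vectors is allocated as the external dynamic memory
At each timestep $t$ it holds $M_t = [m_t^0, m_t^1, \ldots, m_t^k]$
Each example in a batch has its own separate memory representation
Therefore, however long the input sequence grows, memory usage during inference stays constant, as with an RNN
Each memory vector is independent, and a single one is called a memory slot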

Memory Reading

Whenever an input segment sequence arrives, the model reads the memory using cross attention

\[\begin{align} Q_x, K_M, V_M && = xW_Q, M_tW_K, M_tW_V \\ A_{x,M} && = \text{MHAttn}(Q_x, K_M) \\ H_x && = \text{Softmax}(A_{x,M})V_M \end{align}\]

Memory slot vectors are projected to keys and values, and the input sequence $x$ is projected to queries
The queries of the input sequence attend to the keys and values of all memory slots to obtain the final hidden states
- (Comment) Put simply, it is the same structure as the original Transformer decoder, except that cross attention is over the memory rather than the encoder output

Memory Writing

An operation that updates the memory with slot attention or forgets unnecessary information
Unlike memory reading, writing happens only at the last layer of the encoder
- This ensures that only high-level contextual information is recorded

Update via Memory Slot Attention

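Each slot is independently projected to a query and a key
Segment tokens are projected to keys and values
Slot attention means that each memory slot can attend only to itself and to the token representations
- A memory slot cannot write its information into other slots, so slots cannot interfere with one another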

\[\begin{align} Q_{m^i}, K_{m^i} && = m^iW_Q, m^iW_K \\ K_x,V_x && = xW_K, xW_V \\ A^\prime_{m^i} && = \text{MHAttn}(Q_{m^i}, [K_{m^i};K_x]) \\ A_{m^i} && = \frac{\exp(A_i^\prime / \tau)}{\sum_j \exp(A_j^\prime / \tau)} \end{align}\]

In the final attention, a temperature $\tau\ (\tau<1)$ is used to sharpen the attention distribution

\[\begin{align} m_{t+1}^i\ ^\prime = \text{Softmax}(A_{x,M})[m_t^i;V_x] \end{align}\]

This attention mechanism helps each slot decide whether to keep its old information or update it with new information
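
A minimal single-slot sketch of this update, under simplifying assumptions: the attention logits are passed in directly instead of being computed with multi-head attention, and the slot itself is used as its own value. The function names and toy numbers are illustrative, not the paper's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    # A temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [v / total for v in exps]

def update_slot(slot, token_values, logits, temperature=0.25):
    # logits: one score for the slot itself, followed by one score per token.
    weights = softmax(logits, temperature)
    candidates = [slot] + token_values   # corresponds to [m_t^i; V_x]
    return [sum(w * c[d] for w, c in zip(weights, candidates))
            for d in range(len(slot))]

logits = [2.0, 1.0, 0.5]
print(softmax(logits, 1.0))
print(softmax(logits, 0.25))

slot = [1.0, 0.0]
token_values = [[0.0, 1.0], [0.0, 1.0]]
kept = update_slot(slot, token_values, logits=[10.0, 0.0, 0.0])
overwritten = update_slot(slot, token_values, logits=[0.0, 10.0, 10.0])
print(kept, overwritten)
```

When the self-score dominates, the slot keeps its old content; when the token scores dominate, the slot is overwritten with token information, and the low temperature pushes the result toward one of these two extremes.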

Forgetting Mechanism

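Forgetting filters out trivial and temporary information, so it is very important for learning
Forgetting is implemented with a method called Biased Memory Normalization (BMN)
The initial memory state is set to the (normalized) forgetting vector
At every step the memory slots are normalized so that the weights do not grow without bound and gradient stability is not lost over long timesteps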

\[\begin{align*} m_{t+1}^i && \leftarrow m_{t+1}^i + v_\text{bias}^i \\ m_{t+1}^i && \leftarrow \frac{m_{t+1}^i}{\Vert m_{t+1}^i \Vert} \\ m_0^i && \leftarrow \frac{v_\text{bias}^i}{\Vert v_\text{bias}^i \Vert} \end{align*}\]

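A learnable vector $v_\text{bias}$ is added in order to erase previous information

Because of the normalization, every memory slot is projected onto a sphere
$v_\text{bias}$ controls the speed and direction of forgetting; each time $v_\text{bias}$ is added (if no new information is added), the slot eventually reaches the terminal state $T$
The speed of forgetting is governed by the magnitude of $v_\text{bias}$ and the cosine distance between $m_{t+1}^\prime$ and $v_\text{bias}$
- For example, $m_b$, which lies almost on the opposite side, is harder to forget than $m_a$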
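
The forgetting dynamics can be checked with a small two-dimensional sketch of the add-then-normalize update. The vectors `v_bias`, `m_a`, and `m_b` are toy values chosen for illustration, and `steps_to_forget` with its 0.99 cosine threshold is a hypothetical helper, not something defined in the paper.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def forget_step(slot, v_bias):
    # One BMN step without new information: add the bias, then renormalize.
    return normalize([m + b for m, b in zip(slot, v_bias)])

def steps_to_forget(slot, v_bias, threshold=0.99, max_steps=1000):
    # Counts steps until the slot is nearly aligned with v_bias (the terminal state).
    steps = 0
    while cosine(slot, v_bias) < threshold and steps < max_steps:
        slot = forget_step(slot, v_bias)
        steps += 1
    return steps

v_bias = [0.0, 0.5]
m_a = normalize([1.0, 0.0])     # at a right angle to v_bias
m_b = normalize([0.1, -1.0])    # almost opposite to v_bias
steps_a = steps_to_forget(m_a, v_bias)
steps_b = steps_to_forget(m_b, v_bias)
print(steps_a, steps_b)
```

In this toy setting the slot that starts almost opposite to $v_\text{bias}$ needs more steps to reach the terminal direction, matching the remark that $m_b$ is harder to forget than $m_a$.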

Memory Replay Back-Propagation

Thanks to Memformer's fixed-size memory design, there is no additional memory cost at inference time
During training, however, BPTT must keep all outputs over the long-term range in order to train the memory writer network.
- This leads to impractical memory usage for Memformer
Gradient checkpointing can reduce memory usage, but it involves a lot of unnecessary recomputation
Memory Replay Back-Propagation (MRBP) is a more efficient variant of gradient checkpointing
Given $x_t, x_{t+1}, \ldots x_T$ and $M_t, M_{t+1}, \ldots, M_T$, the algorithm traverses only the critical path of the computation graph during the forward pass, and during the backward pass recomputes only that partial graph

MRBP saves a large amount of memory with very little slowdown

Experiments

Computation and Memory Cost

The vanilla Transformer requires $O(N^2)$ ($N$ is the sequence length)
Transformer-XL and the Compressive Transformer use memory to store past information, so the input sequence length is constant and they require $O(N)$
- (Comment) Why $N$? Thinking it through: if one segment has length $L$, there are $\frac{N}{L}$ timesteps and each step needs $L^2$ operations; multiplying gives $\frac{N}{L} \times L^2 = NL$, and treating $L$ as a constant this becomes $O(N)$
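
The arithmetic in the comment can be checked with a few lines of Python. The cost model counts only query-key pairs within each segment and ignores attention over the memory, which is an assumption of this sketch; the segment length of 512 is an arbitrary example value.

```python
def full_attention_cost(n_tokens):
    # Every token attends to every token: N^2 pairs.
    return n_tokens ** 2

def segment_attention_cost(n_tokens, segment_len):
    # N / L segments, each costing L^2 pairs, gives N * L in total.
    # Assumes n_tokens is an exact multiple of segment_len.
    num_segments = n_tokens // segment_len
    return num_segments * segment_len ** 2

for n in (1024, 4096, 16384):
    print(n, full_attention_cost(n), segment_attention_cost(n, segment_len=512))
```

Quadrupling $N$ multiplies the full-attention cost by 16 but the segment-wise cost only by 4, which is the linear growth the comment arrives at.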
As a trade-off, for both Transformer-XL and Memformer the memory size is an important factor affecting the capacity to store past information
With $L$ layers and memory size $K$, Transformer-XL incurs a storage cost of $O(KL)$, whereas Memformer uses only $O(K)$ memory regardless of the number of layers

The left figure shows that the vanilla Transformer has the steepest growth in computation
For GPU memory usage, as the memory size increases, Transformer-XL's memory usage grows much faster than Memformer's

Autoregressive Image Generation

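Recent work treats an image as a long sequence and generates it accordingly
Each of the 784 pixel values of an MNIST image is treated as a token and generated

Transformer-XL has 8 layers
128 hidden size, 4 attention heads, 32 head size, 256 feedforward size
Memformer defaults to a 4-layer encoder, an 8-layer decoder, and a memory size of 64
According to the table, at its best Memformer outperforms even Transformer-XL with a memory of 784 while using 10% of the FLOPs
Because Memformer has more layers, an ablation with a 4-layer encoder + 4-layer decoder was run; performance drops to about the level of Transformer-XL, but overall Memformer delivers better performance at much lower cost
Temperature, forgetting, multi-head attention, and other components were also found to contribute to performance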

Language Modeling

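Experiments were run on WikiText-103, a long-range LM benchmark
- It contains 28K documents with 3.6K tokens each on average
Due to resource constraints, PG-19 was not tested
Transformer-XL has 16 layers
All models use 512 hidden size, 2048 feedforward size, 64 head size

Overall, Transformer-XL improves as the memory size grows
- But FLOPs grow accordingly
Memformer achieves better performance with fewer FLOPs
When the number of layers was matched with a 4+12 configuration, performance was almost the same as Transformer-XL (slightly lower), but FLOPs were far fewer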

Memory Writer Analysis

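The authors analyze how the memory writer updates the memory slots
Memory slots are classified into three types
In the middle of processing a document, 60% to 80% of memory slots resemble $m^{300}$
Their attention is focused on themselves, meaning they are not updated at the current timestep
- This suggests that memory slots can hold information from the distant past
Slots like $m^{250}$ attend partly to themselves, with the rest spread over tokens
- These transform from the first type, gathering and storing information from other tokens
Slots like $m^{355}$ attend entirely to the input tokens; at very early timesteps almost all slots are in this state, but later only about 5-10% are
- The forgetting vector of $m^{355}$ was also found to be larger in magnitude than those of other slots ($3.2 > 1.15$)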

Conclusion

Proposes Memformer, an auto-regressive model that uses an external dynamic memory to process long sequences efficiently
In addition, devises MRBP, an optimization scheme that eases the training of recurrent models with large memory
Experimental results show that Memformer achieves comparable performance with very high efficiency and can retain information from the distant past
Memformer should be useful in dialog and interactive systems that need recurrence or auto-regressive modeling

Review

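The basic idea is to combine a Transformer with an external dynamic memory
To do so, it uses an encoder-decoder Transformer and, like other techniques, splits the input into segments, performing encoding (with memory read), memory write, and decoding for each segment.
- One thing I wonder about: with this technique decoding happens per segment, which seems a bit odd. Why was it done this way? It generates after seeing only the first part, then sees the next segment and generates again, and so on, which differs from the usual notion of reading everything and then decoding
- It is also somewhat odd that a sequence input longer than one input segment cannot be used
Still, the performance was decent, the computation clearly looks lower, and it is nice that the architecture can structurally store past information in a dynamic memory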

Review of Compressive Transformers for Long-Range Sequence Modelling

2022-03-07T00:00:00+09:00

Compressive Transformers for Long-Range Sequence Modelling

Abstract

Presents the Compressive Transformer, which compresses memories of long past sequences
Obtains SOTA results on benchmarks such as WikiText-103 and Enwik8
Verifies that it can also be used for speech and RL
Builds PG-19, a new LM benchmark dataset

Introduction

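When reading a book, a person can read thousands of words, even with long gaps between sittings, and still hold a compressed representation of the narrative so far
Rather than storing all information, people aggressively select, filter, and integrate incoming stimuli
Models that can represent the past, such as RNNs, LSTMs, and Transformers, have appeared
The Transformer represents the past as depth x memory size x dimension, an order of magnitude larger than the hidden state of an LSTM
Transformers have shown remarkable performance, but because they attend to every timestep, the compute and storage costs of a large memory are high
Some models, such as sparse attention mechanisms, reduce compute cost, but the storage cost remains the same
The authors propose the Compressive Transformer as an extension of the Transformer
- Past memories are represented in a smaller, compressed form
- The same attention mechanism is applied over memory and compressed memory
- Achieves SOTA in character-level LM
- Confirmed to be applicable to speech and RL as well as language
Builds the PG-19 dataset, an LM benchmark of books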

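There have been many attempts to extend the range of Transformer attention or reduce its cost
Transformer-XL stores past activations in memory and proposes a new relative positional embedding scheme; the authors use both ideas
The Sparse Transformer allows models to be built with reasonable memory and compute cost, but its small attention window limited performance. Techniques in which different attention heads learn shorter or longer attention spans are similar in spirit to using compressed memory, but their implementations are hard to run on accelerators such as TPUs, whereas the authors' approach can use them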

Model

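Built on Transformer-XL, which keeps past activations at each layer as memory in order to preserve a long history
In Transformer-XL, old memories beyond the memory size are discarded, whereas the Compressive Transformer compresses them and stores them in a separate compressed memory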

Description

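$n_m:$ number of (uncompressed) memory slots per layer
$n_{cm}:$ number of compressed memory slots per layer
$S = x_1, x_2, \cdots, x_{\vert s \vert}$: the input tokens
$n_s:$ number of tokens the model processes at once, the window size

At time $t$ the model receives $\mathbf{x} = x_t, \cdots, x_{t+n_s}$ as input. One such $\mathbf{x}$ is called a sequence.

When the model moves to the next sequence, its $n_s$ hidden activations are pushed into a fixed-size FIFO memory

The oldest $n_s$ memories are then evicted, but unlike in Transformer-XL they are compressed rather than discarded.

$f_c: \mathbf{R}^{n_s \times d} \rightarrow \mathbf{R}^{ \lfloor \frac{n_s}{c} \rfloor \times d }$ maps the $n_s$ oldest memories to $\lfloor \frac{n_s}{c} \rfloor$ compressed memories, which are stored in a secondary FIFO compressed memory.

$d$ is the hidden size of the activations and $c$ is the compression rate.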
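
The two-level FIFO can be sketched for a single layer as below. The class name `CompressiveMemory`, the use of mean pooling as $f_c$, and the toy sizes are assumptions made for illustration; the paper's model keeps such memories per layer and can use a learned compression function instead.

```python
from collections import deque

def mean_pool(vectors, rate):
    # f_c with mean pooling: every `rate` consecutive vectors become one,
    # so n_s inputs yield floor(n_s / rate) compressed vectors.
    pooled = []
    for start in range(0, len(vectors) - rate + 1, rate):
        group = vectors[start:start + rate]
        pooled.append([sum(v[d] for v in group) / rate for d in range(len(group[0]))])
    return pooled

class CompressiveMemory:
    def __init__(self, n_m, n_cm, rate):
        self.n_m = n_m
        self.rate = rate
        self.memory = deque()                 # regular FIFO memory
        self.compressed = deque(maxlen=n_cm)  # secondary FIFO; the oldest entries fall out

    def update(self, activations):
        # Push the new sequence, then compress whatever no longer fits.
        self.memory.extend(activations)
        overflow = len(self.memory) - self.n_m
        if overflow > 0:
            evicted = [self.memory.popleft() for _ in range(overflow)]
            self.compressed.extend(mean_pool(evicted, self.rate))

mem = CompressiveMemory(n_m=8, n_cm=4, rate=2)
for step in range(4):                         # four sequences with n_s = 4
    sequence = [[float(step * 4 + i)] for i in range(4)]
    mem.update(sequence)
print(list(mem.memory))
print(list(mem.compressed))
```

After four sequences the regular memory holds the latest eight activations and the compressed memory holds pooled versions of the eight older ones, so the same number of slots covers twice as much history at rate 2.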

Compression Functions and Losses

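The following functions are considered for the compression function $f_c$
1. max/mean pooling, with kernel and stride both set to $c$
2. 1D convolution, with kernel and stride both set to $c$
3. dilated convolutions
4. most-used, where memories are sorted by their average attention and the most used are kept
Pooling is a fast and simple baseline
The convolution methods have trainable parameters
The compression network could be trained with gradients from the loss, but for very old memories this would require training over long spans of timesteps via backpropagation-through-time (BPTT).
The authors consider two losses that provide a local compression objective
1. Auto-encoding loss
  - The auto-encoding loss makes the original memories reconstructible from the compressed memories
  - \[\mathcal{L}^{ae} = \Vert \text{old\_mem}^{(i)} - g(\text{new\_cm}^{(i)}) \Vert_2\]
  - $g: \mathbb{R}^{ \frac{n_s}{c} \times d} \rightarrow \mathbb{R}^{n_s \times d}$ is learned; this is a lossless compression objective that tries to keep all information in memory
2. Attention-reconstruction loss
  - Similarly, an objective that reconstructs the content-based attention over the memory from the content-based attention over the compressed memory
  - This is a lossy objective, in which information that the model no longer attends to may be discarded
The attention-reconstruction method was found to work better
During training, gradients between the Transformer and the compression network are stopped so that each is trained on its own objective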

PG-19 Benchmark

As models that handle long-range memory emerge, datasets are needed for training and evaluating on longer contexts
PG-19, a language modeling benchmark, is built from books extracted from Project Gutenberg

Experiments

Adam is used
LR scheduled with warm-up and cosine decay (1e-6 → 3e-4 during warm-up → 1e-6)
Gradient update frequency is adjusted (this seems to mean gradient accumulation, x4)
Gradient clipping (≤ 0.1) is reported to have been important for optimization
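
A schedule with these endpoints might look like the following sketch. Only the three learning-rate values come from the notes above; the linear shape of the warm-up and the step counts used in the example are assumptions.

```python
import math

def learning_rate(step, warmup_steps, total_steps, min_lr=1e-6, max_lr=3e-4):
    if step < warmup_steps:
        # Linear warm-up from min_lr to max_lr (the linear shape is an assumption).
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay from max_lr back down to min_lr.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical step counts, for illustration only.
for step in (0, 2000, 4000, 50000, 100000):
    print(step, learning_rate(step, warmup_steps=4000, total_steps=100000))
```

The rate peaks exactly at the end of warm-up and returns to the starting value at the final step.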

PG-19 & Enwik8

Strong results on PG-19 and Enwik8 language modeling

Compressibility of layers

One might expect representations to be harder to compress at higher layers
The compression loss was monitored per layer
The first layer was highly compressible, but the trend toward higher layers is unclear

Attention

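To check whether the compressed memory is used, the authors examine where the network attends on average
Most attention falls on the current sequence, and within the sequence attention is higher toward the beginning because of causal masking.
Attention weight was found to increase when going from the oldest memory to the compressed memory
This runs against the tendency for older memories to be used less, and is evidence that the model is learning to preserve important information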

Optimisation Schedule

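Because parameters are updated on the fly while handling long contexts, a distribution shift arises between train and test
Parameters could be updated only at document boundaries, but that would be too slow
During training, the authors adjust the parameter update frequency to once every 4 steps, and find that this gives both fast initial learning from fairly frequent updates and better generalization from not updating at every step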

Speech & Reinforcement Learning

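To check performance on other modalities, the Compressive Transformer is applied to waveforms
- It outperforms Transformer-XL and WaveNet
Video input has high mutual information between successive frames, so compression is effective
- It was not tested directly on video, but was applied to an RL agent that takes video input
- With a compression rate of 1 it did not train properly, but at the best setting of 4 it solved the task as well as a human
- It is expected to be effective for video input as well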

Conclusion

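This paper extends the temporal receptive field of Transformer-based models using the notion of compression
Confirms better performance than existing architectures in long-range sequence modeling
Builds and releases the new LM benchmark PG-19
Shows that compressed memory is useful not only for text but also for speech, vision, and RL
A limitation of this work is the additional complexity it introduces.
- If the task is not long-range, there is no benefit from using the Compressive Transformer
- Even so, it is simpler than dynamic or sparse attention variants, and its operations are more hardware-friendly
The authors believe there may be more powerful models that mix detailed recent memory with coarse, compressed past memory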

Review

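In Algorithm 1, the attention keys and values appear to include only the compressed memory and the memory, not the current sequence; but then the current sequence would receive no attention, which can't be right?
- It isn't official code, but the implementation at https://github.com/lucidrains/compressive-transformer-pytorch concatenates the sequence as well for keys and values, and that is probably what is actually done
The appendix says the state was not reset during training. Why? Maybe because in a batch some documents have ended and need a reset while others have not, and handling that was a hassle...?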

As the Compressive Transformer is trained without state resetting, it is actually slightly out of sample when provided with the (relatively) short contexts. This is because its memory and compressed memory may be still empty (whereas they are always full during training). However we see a trend of the samples usually improving towards the end.
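The compression function reduces $n_s \rightarrow \frac{n_s}{c}$, so having to fix the model's input sequence length (window size) could be inconvenient
Of the two compression objectives considered, auto-encoding and attention-reconstruction, the latter is said to be better, but it would have been nice to have a comparison against training without any local reconstruction loss. Why wasn't it included? I wonder whether training even converges without it
Setting compression aside, what is called memory here is really just token embedding information kept in a FIFO queue and consulted as keys and values during attention, and compression approximates those key-value pairs with less memory. In the early layers it will practically amount to backing up the words themselves.
Also, both the memory and the compressed memory are FIFO queues, so after a certain time even important information inevitably gets dropped, which is a limitation. It takes the form of an external memory, but in practice it feels like an extension of the context length. Couldn't the queue be replaced by merging with existing memories, as in NTM or memory networks?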

Review of Sequential Recommendation with User Memory Networks

2022-03-02T00:00:00+09:00

Sequential Recommendation with User Memory Networks

Abstract

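In real life, user preferences are dynamic, and a user's behavior records are not all equally important for predicting future interests
The authors aim to represent, store, and manipulate a user's historical records in a more explicit, dynamic, and effective way
The design integrates memory-augmented neural networks (MANNs) with collaborative filtering
Memory read/write operations are designed to suit the nature of personalized recommendation
Validated on four real-world datasets, the method shows strong performance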

Introduction

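A user's current interests are influenced by their past behavior
To address the recommendation problem, sequential recommendation methods, which recommend based on a user's sequential history, have emerged
There are methods that build a user embedding with Markov chains or RNNs and recommend from it, but they represent all of a user's history as a single fixed-size embedding
The inability to distinguish among different past records causes two problems
- It weakens the signal from strongly related items in the history
- Overlooking such signals makes the recommendations hard for people to understand and explain
To solve this, the authors propose modeling user history with an external memory
- With the ability to explicitly represent, store, and manipulate each record, external memory networks (EMNs) have performed well on many sequential prediction tasks
Building on EMNs, the authors propose the Recommender system with external User Memory networks (RUM)
- (b) and (c) illustrate the basic idea
- Each user has an external user memory matrix that maintains information about their history
The authors develop and evaluate RUM in two variants, item-level and feature-level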

Contribution

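First to apply memory-augmented neural networks (MANNs) to recommender systems
Explores two kinds of potential memory networks, item-level and feature-level, and compares their performance.
Verifies superiority over SOTA on a variety of real-world data
Provides empirical analysis, via the attention mechanism, that explains how and why an item is recommended

In essence, the authors combine sequential recommendation with memory-augmented neural networks, and they review both lines of work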

Sequential Recommendation

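Many models have been proposed that predict future behavior and make recommendations from a user's sequential history.
Factorized personalized Markov chains (FPMC) map the transition information between adjacent behaviors into item latent vectors for recommendation.
Methods such as FPMC and HRM mainly model local sequential patterns between adjacent records.
To model multi-step sequential behavior, models such as DREAM capture global sequential patterns and learn user interests dynamically with an RNN, but existing models usually encode a user's previous records into a single hidden state.
In contrast, the authors use a user memory network to store and manipulate each user's previous records, which makes the representation of the user history more expressive.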

Memory-Augmented Neural Network

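External memory networks (EMN) have shown the ability to process sequential data effectively.
An EMN uses a memory matrix to store records, and reads and updates that matrix as needed.
Recently, researchers have successfully applied EMN to areas such as question answering, natural language transduction, and knowledge tracking.
An EMN consists of two main components, a memory matrix and a controller. Many methods read the memory with an attention mechanism.
- For an input $q$, first compute the similarity $S(q, m_i)$ with each memory slot $m_i$ in the memory matrix, then obtain the attention weights $w_i = \text{Softmax}(S(q, m_i))$. These weights determine which memories are read.
- In the write step, the memory matrix is updated based on content and location.
The authors aim to apply the EMN idea to recommender systems so that user behavior history is reflected effectively.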
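The attention-based read described above can be sketched in a few lines of plain Python. This is only an illustration, not the paper's code: the similarity $S$ is assumed to be a dot product, and `softmax`, `attention_read`, and the toy memory are hypothetical names and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_read(query, memory):
    """Read from memory: w_i = Softmax(S(q, m_i)), output = sum_i w_i * m_i.

    S is taken to be the dot product here (an assumption for illustration).
    """
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    read = [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]
    return weights, read

# Three hypothetical memory slots of dimension 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, read = attention_read([2.0, 0.0], memory)
print(weights, read)
```

Slots 1 and 3 have the same score for this query, so they receive equal weight, and slot 2 receives less.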

RUM: Recommendation with user memory networks

The general structure of RUM is described first, followed by the item-level and feature-level variants.

General Framework

Assuming $N$ users and $M$ items, $p_u$ and $q_i$ denote the embeddings of user $u$ and item $i$.
The affinity between $u$ and $i$ is predicted as $\hat{y}_{ui}=p_u^Tq_i$, that is, the inner product of $p_u$ and $q_i$.

Memory enhanced user embedding

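The user embedding is built from two parts: a memory component that encodes the user's previous records, and a free vector that represents the user's intrinsic preference, independent of those records.
The user's records are encoded, stored, and updated in a more expressive personalized memory matrix $M^u$.
For user $u$, the memory embedding $p_u^m$ is obtained by reading $M^u$ according to the current item embedding $q_i$.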

\[p_u^m = READ(M^u, q_i)\]

Then $p_u^m$ is combined with the intrinsic embedding $p_u^*$ to obtain the final user embedding.

\[p_u = MERGE(p_u^*, p_u^m)\]

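$MERGE$ combines two vectors into one; the authors simply use weighted vector addition.
- $\alpha$ is the weighting parameter.
- Element-wise multiplication and concatenation were also tried as $MERGE$ but did not give good results.
- The authors vary $\alpha$ to study how much the memory contributes to recommendation.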

\[MERGE(x,y) = x + \alpha y = p_u^* + \alpha p_u^m\]

Prediction function

For prediction, the final user embedding $p_u$ and the item embedding $q_i$ are fed into a prediction function.

\[\hat{y}_{ui} = PREDICT(p_u, q_i)\]

$PREDICT$ can be any prediction function or prediction network; for training efficiency the authors use the sigmoid of the inner product, $\hat{y}_{ui} = \sigma(p_u^T \cdot q_i)$. Other choices may be used in other domains.
Binary cross-entropy is used as the loss for model optimization, written here as a regularized log-likelihood to be maximized.

\[\begin{aligned} l_{RUM} &= \log \prod_{(u,i)} (\hat{y}_{ui})^{y_{ui}}(1-\hat{y}_{ui})^{1-y_{ui}} - \lambda \Vert \Theta \Vert_F^2 \\ &= \sum_u \sum_{i\in I_u^+} \log \hat{y}_{ui} + \sum_u \sum_{i\in I \setminus I_u^+} \log (1- \hat{y}_{ui}) - \lambda \Vert \Theta \Vert _F^2 \end{aligned}\]

$\Theta$ denotes the model parameters, and $y_{ui}$ is the ground truth: 1 if $u$ purchased $i$, 0 otherwise.
$I$ is the set of all items, and $I_u^+$ is the set of items purchased by $u$, ordered by purchase time.
- $I_u^+ = \{ v_1^u, v_2^u, \cdots, v_{\mid I_u^+ \mid }^u \},$ where $v_j^u$ is the $j$-th item purchased by $u$.
Negative items are sampled at random from $I_u^- = I \setminus I_u^+$.
The first two terms of the objective maximize the likelihood, and the last term regularizes the model.
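To make the pieces of the general framework concrete, here is a minimal sketch of $MERGE$, $PREDICT$, and the log-likelihood part of the objective for one user. The function names and the 2-dimensional embeddings are made up for illustration, the memory embedding is given rather than read from $M^u$, and the regularization term is left out.

```python
import math

def merge(p_intrinsic, p_memory, alpha):
    """MERGE(x, y) = x + alpha * y (weighted vector addition)."""
    return [x + alpha * y for x, y in zip(p_intrinsic, p_memory)]

def predict(p_u, q_i):
    """PREDICT as the sigmoid of the inner product."""
    score = sum(x * y for x, y in zip(p_u, q_i))
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood(predictions, labels):
    """Sum of y*log(y_hat) + (1-y)*log(1-y_hat); the regularizer is omitted."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for p, y in zip(predictions, labels)
    )

# Hypothetical 2-dimensional embeddings.
p_u = merge([0.5, -0.5], [1.0, 1.0], alpha=0.2)   # approximately [0.7, -0.3]
y_pos = predict(p_u, [1.0, 0.0])                  # purchased item, label 1
y_neg = predict(p_u, [0.0, 1.0])                  # sampled negative, label 0
objective = log_likelihood([y_pos, y_neg], [1, 0])
print(p_u, y_pos, y_neg, objective)
```

Training would maximize `objective` (equivalently, minimize binary cross-entropy) over many user-item pairs.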

Memory updating

After each purchase, the user memory matrix $M^u$ is updated so that it stays dynamic.

\[M^u \leftarrow WRITE(M^u, q_i)\]

The following sections describe how $READ$ and $WRITE$ work at the item level and at the feature level.

Item-level RUM

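First, the authors treat each item as one memory unit.
For user $u$, the memory matrix $M^u$ is built from the items the user purchased most recently.
Given $I_u^+ = \{ v_1^u, v_2^u, \cdots, v_{\mid I_u^+ \mid }^u \}$, $v_i^u$ is the $i$-th item purchased by $u$.
$p_u \in R^D, q_{v_i^u} \in R^D$ are the embeddings of user $u$ and item $v_i^u$, respectively.
Assume the memory matrix has $K$ columns (memory slots).
$M^u \in R^{D \times K} = \{ m_1^u, m_2^u, \cdots, m_K^u \}$, where $m_k^u \in R^D$ is the $k$-th column vector of $M^u$.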

Reading Operation

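More influential items should be weighted more heavily in the final memory embedding.
When predicting for a user-item pair $(u, v_i^u)$, the weights are first computed in a way similar to FPMC.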

\[w_{ik} = (q_{v_i^u})^T \cdot m_k^u,\ z_{ik} = \frac{\exp(\beta w_{ik})}{\sum_j \exp (\beta w_{ij})},\ \forall k= 1,2,\cdots, K\]

$\beta$ is the strength parameter, and $z_{ik}$ is used as the attention weight for computing $u$'s memory embedding.

\[p_u^m = \sum_{k=1}^K z_{ik} \cdot m_k^u\]

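Unlike earlier models, the authors do not forcibly merge the item embeddings while reading; each one is stored separately in $M^u$ and read through attention, which allows finer-grained use of the user history.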

Writing operation

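Because a user's recent behavior matters more for predicting the present, the authors simply maintain the memory matrix $M^u$ with a first-in-first-out mechanism.
$M^u$ always holds the embeddings of the $K$ most recently purchased items.
If the memory slots are not yet full, the new item is appended without removing an existing one.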
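A small sketch of the item-level read and write may help. It assumes the memory is a list of item embeddings and uses `collections.deque(maxlen=K)` as the FIFO; the function names and values are hypothetical and this is not the authors' implementation.

```python
import math
from collections import deque

def item_level_read(q_item, memory, beta=1.0):
    """p_u^m = sum_k z_ik * m_k with z_ik = softmax_k(beta * q_item . m_k)."""
    dim = len(q_item)
    if not memory:                       # nothing purchased yet
        return [0.0] * dim
    scores = [beta * sum(a * b for a, b in zip(q_item, m)) for m in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    z = [e / total for e in exps]
    return [sum(z_k * m[d] for z_k, m in zip(z, memory)) for d in range(dim)]

def item_level_write(memory, q_item):
    """FIFO write: append the new embedding; a full deque drops its oldest slot."""
    memory.append(list(q_item))

K = 2                                    # number of memory slots
memory = deque(maxlen=K)
for emb in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    item_level_write(memory, emb)        # the first embedding is evicted on the third write

p_m = item_level_read([1.0, 0.0], memory, beta=2.0)
print(list(memory), p_m)
```

With `maxlen` set, the deque appends without eviction until it holds $K$ items and only then starts discarding the oldest one, which matches the behavior described above.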

Feature-level RUM

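This variant is inspired by latent factor models (LFM), in which purchase decisions are made over a set of item features.
A user's preference over these features should change dynamically with the purchase history.
First, the authors build a global latent feature table (GLFT) that stores the feature embeddings. The table is independent of users and items and is shared as part of the model.
The user memory matrix $M^u$ is built from the user's preferences over the features in the GLFT, and the memory embedding is obtained from it with attention.
Finally, the item embedding is used to update the user memory matrix $M^u$.
$p_u \in R^D,\ q_i \in R^D$ are the embeddings of user $u$ and item $i$.
With $K$ features in the system, the GLFT is $F = \{ f_1, f_2, \cdots, f_K \},\ f_k \in R^D$.
Given the user memory matrix $M^u = \{ m_1^u, m_2^u, \cdots, m_K^u \}$, $m_k^u \in R^D$ is the embedding of user $u$'s preference for feature $k$.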

Reading operation

First, compute the relevance between item $i$ and each feature.

\[w_{ik} = q_i^T \cdot f_k,\ z_{ik} = \frac{\exp(\beta w_{ik})}{\sum_j \exp(\beta w_{ij})},\ \forall k = 1,2,\cdots, K\]

$\beta$ is again the strength parameter, and the resulting $z_{ik}$ are used with the user's memory matrix to compute the user memory embedding.

\[p_u^m = \sum_{k=1}^K z_{ik} \cdot m_k^u\]

Writing operation

Inspired by the neural Turing machine (NTM), writing to the user memory matrix $M^u$ erases first and then adds information.
A $D$-dimensional erase vector $erase_i \in R^D$ is computed from $q_i$ as follows.

\[erase_i = \sigma (E^Tq_i + b_e)\]

$\sigma(\cdot)$ is the element-wise sigmoid function, and $E$ and $b_e$ are learned erase parameters.
Given the attention weights and the $erase$ vector, the feature preference memory is updated as follows.

\[m_k^u \leftarrow m_k^u \odot (1-z_{ik} \cdot erase_i)\]

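$\odot$ is the element-wise product.
A memory element is reset to 0 only when both the weight at that location and the erase value are 1; if either the weight or the erase value is 0, it is left unchanged.
After erasing, $add_i \in R^D$ is used to update the feature preference memory.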

\[add_i = \tanh(A^Tq_i + b_a),\ m_k^u \leftarrow m_k^u + z_{ik} \cdot add_i\]

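Here $A$ and $b_a$ are learned parameters.
This erase-add update strategy enables forgetting and strengthens the user's feature preference embeddings, so the model can learn which signals should be weakened and which strengthened.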
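The erase-add write can be traced with a toy example. The sketch below assumes $E$ and $A$ are $D \times D$ matrices stored as lists of rows; the helper `affine`, the function name, and all numbers are hypothetical. The attention weights `[1.0, 0.0]` are an idealized case chosen so the effect is easy to see.

```python
import math

def affine(W, q, b):
    """Compute W^T q + b, with W given as a list of rows."""
    return [sum(W[r][c] * q[r] for r in range(len(q))) + b[c] for c in range(len(b))]

def feature_level_write(memory, z, q_item, E, b_e, A, b_a):
    """Erase then add: m_k <- m_k * (1 - z_k * erase); m_k <- m_k + z_k * add."""
    erase = [1.0 / (1.0 + math.exp(-x)) for x in affine(E, q_item, b_e)]
    add = [math.tanh(x) for x in affine(A, q_item, b_a)]
    new_memory = []
    for z_k, m_k in zip(z, memory):
        erased = [m * (1.0 - z_k * e) for m, e in zip(m_k, erase)]
        new_memory.append([m + z_k * a for m, a in zip(erased, add)])
    return new_memory

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
memory = [[2.0, 4.0], [1.0, 1.0]]        # K = 2 feature slots, D = 2
z = [1.0, 0.0]                           # idealized attention: only slot 0 is touched
# With q_item = 0 and zero biases: erase = [0.5, 0.5], add = [0.0, 0.0].
new_memory = feature_level_write(memory, z, [0.0, 0.0], identity, zeros, identity, zeros)
print(new_memory)
```

Slot 0 is halved by the erase step and slot 1, which has zero attention weight, is left untouched.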

Discussions and Further Analysis

Item- v.s. Feature-level RUM

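Item-level RUM uses each item as a unit, stores item embeddings directly in memory, and is designed to find transition patterns between items.
Feature-level RUM stores embeddings of the user's preferences over features, and items are used indirectly to change these embeddings.
In practice there is an explainability-effectiveness tradeoff between the two:
- Item-level RUM can explicitly explain which past items matter more for the current decision.
- Feature-level RUM is modeled as a black box and is harder to explain, but performs better.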

Experiments

Overall Performance of Our Models

Performance of the item-level and feature-level models is measured first (with $\alpha=0.2$).
Both item-level and feature-level RUM outperform the baselines in most cases.

Influence of the Weighting Parameter $\alpha$

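The effect of the weighting parameter $\alpha$ is analyzed.
Without memory ($\alpha = 0$), performance is poor; once the memory network is integrated ($\alpha \simeq 0.2$), performance improves dramatically.
However, performance drops as the weight on the memory embedding is increased further.
This suggests that taking recently purchased items into account leads to better recommendations.
But focusing too much on recent information may neglect the user's intrinsic preference.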

Conclusion

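Proposes integrating external memory networks with collaborative filtering for sequential recommendation.
Provides the RUM framework at the item level and the feature level and demonstrates its effectiveness experimentally.
A first step toward recommendation based on explicit user memory modeling.
Using user reviews or product images could lead to more explainable recommender systems.
Because RUM is a flexible framework, it could be applied to other domains.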

Review

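In short, the basic system builds a pool of memory embeddings, uses the most recently purchased item as a query to read from the memory with attention and form the user embedding, and recommends the item whose embedding is closest. The new purchase is then written to memory: the item-level variant simply appends the item embedding, while the feature-level variant follows the neural Turing machine idea, using the attention weights to erase the old memory and add new values.

Some questions I still have after reading the paper:

How are the feature embeddings obtained in the feature-level variant, and with what model?
What about the user's intrinsic embedding and the item embeddings?
I'm also not sure how much of this is trained end-to-end.
In the feature-level variant, memory read/write has to be repeated for every purchase. Is the memory reset at every training step, then? If not, does each batch simply keep receiving the next part of the sequence?
Are there no recent papers that use EMN? If it works well, I'd expect it to be very effective on tasks like QA. Is there a paper that analyzes its limitations?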

Not all memories are created equal: Learning to forget by expiring Review

2022-02-24T00:00:00+09:00

Not all memories are created equal: Learning to forget by expiring

Abstract

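Attention has performed well in sequence modeling tasks that need long-term memory.
However, not all past information is equally important to remember.
The paper proposes Expire-Span, a method that retains important information and expires irrelevant information.
The method achieves SOTA on some NLP and RL tasks.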

Introduction

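The Transformer architecture performs well on a wide range of tasks.
Recent work focuses on performing attention efficiently over larger memory sizes.
However, an important part of human memory is the ability to forget unneeded information.
As the memory grows, deciding which information is relevant becomes harder.
The authors focus on learning what to forget efficiently, which reduces the model's computational cost and lets it search a large memory effectively.
By expiring unneeded memories, Expire-Span can extend the past context to tens of thousands of timesteps.
A simple predictor in self-attention outputs an expiration value for each hidden state, which determines how long that information should be kept. This happens independently in each layer.
Expire-Span can distinguish important from irrelevant information in NLP tasks and episodic RL tasks.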

Background

A Transformer decoder is a stack of layers, each made of a feedforward sublayer and multi-head attention.
At layer $l$, the hidden state $\mathbf{h}_t^l \in \mathbb{R}^d$ at each timestep is projected to a key $\mathbf{k}$, value $\mathbf{v}$, and query $\mathbf{q}$.

\[\mathbf{q}_t^l = W_q^l \mathbf{h}_t^l,\ \mathbf{k}_t^l = W_k^l \mathbf{h}_t^l,\ \mathbf{v}_t^l = W_v^l \mathbf{h}_t^l\]

(From here on $l$ is dropped and a single layer is described.) Information from previous timesteps is accessed through attention $a_{ti}$ to produce $\mathbf{o}_t$.

\[a_{ti} = \text{Softmax}_{i\in C_t}(\mathbf{q}_t^\top \mathbf{k}_i),\ \mathbf{o}_t = W_o \sum_{i \in C_t} a_{ti} \mathbf{v}_i\]

The set $C_t$ indicates which memories are accessed at time $t$.
The set size $\mid C_t\mid$ directly determines the time and space complexity of self-attention, and is called the memory size.

Method

For each memory $\mathbf{h}_i \in \mathbb{R}^d$, a scalar Expire-Span $e_i \in [0, L]$ is computed. ($\mathbf{w} \in \mathbb{R}^d,\ b \in \mathbb{R}$ are learned parameters, $\sigma$ is the sigmoid function, and $L$ is the maximum span.)

\[e_i = L \sigma(\mathbf{w}^\top\mathbf{h}_i + b)\]

$e_i$ determines how long $\mathbf{h}_i$ should stay in $C_t$.
At time $t$, the remaining span of $\mathbf{h}_i$ is $r_{ti}=e_i -(t -i)$; when $r_{ti}$ is negative, the memory $\mathbf{h}_i$ has expired and is removed from $C_t$.
This can be implemented by applying a binary mask $m_{ti} = 1_{r_{ti}>0}$ to the attention weights $a_{ti}$.

\[a_{ti}^\prime = \frac{m_{ti}a_{ti}}{\sum_j m_{tj}a_{tj}},\ \mathbf{o}_t= \sum_i a^\prime_{ti} \mathbf{v}_i\]

However, no gradient flows through such a discrete masking function, so the authors use a soft mask instead. $R$ is a hyperparameter that sets the length of the ramp over which the mask moves between 0 and 1.
\[m_{ti} = \max(0, \min(1, 1+r_{ti}/R))\]
Since the goal is to reduce the memory size, the average memory size is expressed through the spans and a penalty on the spans is added to the loss. $\alpha>0$ is a hyperparameter.

\[\begin{aligned} \frac{1}{T} \sum_t \mid C_t \mid &= R - 1 + \frac{1}{T} \sum_i \lfloor e_i \rfloor \\ L_{total} &= L_{task} + \alpha \sum_i e_i / T \end{aligned}\]
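The span, soft mask, and renormalized attention can be checked with a small numeric sketch. The function names and toy values are hypothetical, and the attention weights are assumed to be already computed; this only illustrates the formulas above.

```python
import math

def expire_span(h, w, b, L):
    """e_i = L * sigmoid(w . h_i + b)."""
    return L / (1.0 + math.exp(-(sum(x * y for x, y in zip(w, h)) + b)))

def soft_mask(e_i, t, i, R):
    """m_ti = max(0, min(1, 1 + r_ti / R)) with r_ti = e_i - (t - i)."""
    r = e_i - (t - i)
    return max(0.0, min(1.0, 1.0 + r / R))

def masked_attention(a, m):
    """a'_ti = m_ti * a_ti / sum_j m_tj * a_tj (assumes at least one unexpired memory)."""
    weighted = [m_i * a_i for m_i, a_i in zip(m, a)]
    total = sum(weighted)
    return [x / total for x in weighted]

spans = [1.0, 8.0, 0.5, 8.0]             # hypothetical e_i for positions i = 0..3
t, R = 4, 2.0
masks = [soft_mask(e, t, i, R) for i, e in enumerate(spans)]
attn = masked_attention([0.25, 0.25, 0.25, 0.25], masks)
print(masks, attn)
```

Position 0 has fully expired (mask 0), position 2 sits on the ramp (mask 0.25), and the remaining attention is renormalized over what is left.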

Experiments and Results

Compared with other Transformer variants such as Transformer-XL and Adaptive Span, Expire-Span performed well on RL and NLP tasks.

Conclusion

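Proposes Expire-Span, which can be added to any attention mechanism so that it learns what to forget.
Forgetting allows the memory to scale to tens of thousands of timesteps, with good performance on LM, RL, and other tasks.
Expire-Span has great potential in scalability and efficiency.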

Review

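This is basically a paper proposing a modified Transformer architecture. Many such works aim at accepting longer input sequences, whereas this one is more of a training scheme for better separating needed from unneeded information within a long sequence?

Hmm... structurally, couldn't a standard Transformer get the same effect as expiring whenever an attention weight comes out close to 0? Why was a separate predictor needed?
- It does seem true that, ideally trained, plain attention could behave similarly. But I think the key point is that the memory size is added to the loss. The model has to keep the memory as small as possible while still solving the task, so it learns to tell important information from unimportant information. So my guess is that a standard Transformer could get similar results if training included a loss that, for example, limits the sum of attention weights.
Why can it only be applied to the decoder?
- The paper treats the Transformer input as a time series and reduces how much later positions attend to earlier ones. I think that is why it briefly mentions being applied to decoder layers. A time series is assumed, but an encoder is bi-directional, so there is no notion of before and after and expiry cannot be defined. This restriction is a bit disappointing.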

Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems Summary

2022-02-23T00:00:00+09:00

Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems

Abstract

Incorporating a knowledge base (KB) into task-oriented dialog systems is a hard problem.
The paper proposes Mem2Seq, a model that addresses it.
Mem2Seq is a generative model that combines attention over memory with a pointer network.

Introduction

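Task-oriented chatbots help users accomplish specific goals.
The ability to query an external KB is essential in task-oriented dialog systems.
Recently (as of 2018), generating responses with RNN-based encoder-decoder models had shown good results.
- However, integrating external KB information into the RNN hidden state is still difficult.
- Also, processing long sequences is time-consuming when attention is used.
MemNN, a recurrent attention model over a large external memory, writes embeddings to the external memory and reads it repeatedly with a query vector.
This approach can memorize external KB information and encode long dialog contexts quickly.
However, MemNN is limited in that it selects its response from a predefined pool rather than generating it.
To address this limitation, the authors present a new architecture, Memory-to-Sequence (Mem2Seq).
Mem2Seq combines the multi-hop attention mechanism with the pointer network idea so that it can copy words directly from the dialog history or the KB.
Mem2Seq learns to dynamically generate the queries used to access the memory.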

Model Description

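Mem2Seq consists of two parts: a MemNN encoder and a memory decoder.
The MemNN encoder produces a vector representation of the dialog history.
The memory decoder reads and copies from the memory to generate the response.

$X = \{x_1, ..., x_n, \$ \}$ is the dialog history as a token sequence. $\$$ is a special token called the sentinel.

$B = \{b_1, ..., b_l\}$ is the set of KB tuples.

$U=[B;X]$ is the concatenation of $B$ and $X$.

$Y=\{y_1, ..., y_m\}$ is the sequence of words in the expected system response.

$PTR = \{ ptr_1, ..., ptr_m\}$ are the pointer indices, defined as follows.

\[ptr_i= \begin{cases} \max(z) & \text{if } \exists z \text{ s.t. } y_i=u_z\\ n+l+1 & \text{otherwise} \end{cases}\]

$u_z \in U$ is an element of the input sequence, and $n+l+1$ is the index of the sentinel.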
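The pointer targets defined above are easy to compute once $U$ is laid out as a flat list. The sketch below assumes each memory entry is reduced to a single token string (in the paper a KB entry is a triple), uses 1-based indices, and places the sentinel last; the function name and the example tokens are made up.

```python
def pointer_indices(Y, U):
    """Return ptr_i for each response word: the last 1-based position z with
    y_i == u_z, or the sentinel index (the last position of U) if none exists."""
    sentinel_index = len(U)
    last_position = {token: z for z, token in enumerate(U, start=1)}
    return [last_position.get(y, sentinel_index) for y in Y]

# Hypothetical memory: one KB token, four dialog tokens, then the sentinel "$".
U = ["5_miles", "the_westin", "where", "is", "the_westin", "$"]
Y = ["the_westin", "is", "5_miles", "away"]
ptr = pointer_indices(Y, U)
print(ptr)  # [5, 4, 1, 6]
```

"the_westin" appears at positions 2 and 5, so the larger index is taken, and "away" is not in the memory, so it points at the sentinel.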

Memory Encoder

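$U$ is represented at the word level and is the encoder input.
The MemNN memory is a set of trainable embedding matrices $C = \{C^1, ..., C^{K+1}\}$, where each $C^k$ maps tokens to vectors. The query vector $q^k$ is used for reading.
The model loops over $K$ hops and computes the attention weight at hop $k$ for each memory $i$.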

\[p_i^k = \text{Softmax}((q^k)^TC_i^k)\]

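$C_i^k = C^k(x_i)$ is the memory content at position $i$. $p^k$ acts as a memory selector that measures the relevance of each memory to the query.

The model reads out the memory $o^k$ as a weighted sum over $C^{k+1}$.

$o^k = \sum_{i} p_i^kC_i^{k+1}$

For the next hop, the query vector is updated as $q^{k+1} = q^k + o^k$.
The output of the encoding stage is the memory vector $o^K$, which is used as input to the decoding stage.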
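The hop loop can be sketched in plain Python as follows. The embedding tables are plain dicts from token id to vector, and the initial query is passed in as an argument because these notes do not say how it is initialized; the zero vector used below is an assumption, and all names and numbers are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memnn_encode(token_ids, embeddings, query):
    """Run K hops over the memory.

    embeddings holds the K+1 tables C^1..C^{K+1}; hop k attends with C^k and
    reads from C^{k+1}. Returns the last read-out o^K and the final query.
    """
    K = len(embeddings) - 1
    o = None
    for k in range(K):
        keys = [embeddings[k][tok] for tok in token_ids]
        values = [embeddings[k + 1][tok] for tok in token_ids]
        p = softmax([sum(q * c for q, c in zip(query, key)) for key in keys])
        o = [sum(p_i * v[d] for p_i, v in zip(p, values)) for d in range(len(query))]
        query = [q + x for q, x in zip(query, o)]      # q^{k+1} = q^k + o^k
    return o, query

C1 = {0: [1.0, 0.0], 1: [0.0, 1.0]}
C2 = {0: [1.0, 0.0], 1: [0.0, 1.0]}
C3 = {0: [2.0, 0.0], 1: [0.0, 2.0]}
o_K, q_final = memnn_encode([0, 1], [C1, C2, C3], [0.0, 0.0])
print(o_K, q_final)  # [1.0, 1.0] [1.5, 1.5]
```

In this toy case both hops attend uniformly, so each read-out is the mean of the value embeddings for that hop.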

Memory Decoder

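Both the dialog history and the KB information are used.
At each step $t$, a GRU takes the previously generated word and the previous query as input and produces a new query.
\[h_t = \text{GRU}(C^1(\hat{y}_{t-1}), h_{t-1})\]
The query $h_t$ is passed to the MemNN, which produces the token. $h_0$ is the encoder output $o^K$.
At each step, a distribution $P_{vocab}$ over vocabulary words and a distribution $P_{ptr}$ over the memory are computed.
$P_{vocab}$ is produced by concatenating the first-hop read-out with the current query vector, as below. $W_1$ is a learned parameter.
\[P_{vocab}(\hat{y}_t) = \text{Softmax}(W_1[h_t;o^1])\]
$P_{ptr}$ is formed from the attention weights at the last hop of the decoder MemNN.
\[P_{ptr} = p_t^K\]
The decoder generates tokens by pointing to input words in the memory (as in a pointer network).
The authors say this was deliberate: the first hop focuses on loosely retrieving information from the memory, and the last hop is used to point at a specific word.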

Sentinel

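When the memory does not contain the needed word, $P_{ptr}$ is trained to produce the sentinel $\$$. If the sentinel is chosen, the model generates the word from $P_{vocab}$; otherwise it takes the word from the memory using $P_{ptr}$.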

Memory Content

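The authors store the content $X$ in memory word by word.
Time and speaker information is added to each token of $X$. For example, “hello t1 \$u” means that “hello” was said by speaker u at timestep 1.
KB information $B$ is stored as (subject, relation, object) triples, for example (The Westin, Distance, 5 miles). The word embeddings in a triple are summed to form the KB memory representation. When a KB entry is selected by $P_{ptr}$ during generation, the object part is used, which is “5 miles” in this example.
Only the KB information relevant to the specific dialog is loaded into memory.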

Experimental Results

Achieves good results compared with other models.

Analysis and Discussion

Memory Attention

The figure shows the last-hop attention scores used to generate each token; the distribution is very sharp.

Multiple Hops

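Mem2Seq shows how multiple hops improve model performance.
The first hop is usually used to score all relevant memories and retrieve information.
The last hop usually focuses on a specific token, and mistakes occur when its attention is not sharp.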

Conclusion

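Proposes a Memory-to-Sequence model for task-oriented dialog systems that can be trained end-to-end.
Mem2Seq combines the multi-hop attention mechanism of end-to-end memory networks with pointer networks.
The model's capability is verified experimentally.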
