Uncovering the black box of language models: How do LLMs work?
Haebom
The concept of a language model has been around for a while, and the methods and principles behind how it works have been discussed many times. The attention mechanism and the Transformer architecture are already widely known. However, these are concepts understood mostly by people already interested in ML; for everyone else they were generally accepted as 'I guess something like that exists.' The Scaling Monosemanticity paper published by Anthropic explains the operating principles of artificial intelligence (represented here by language models) in a more approachable way.
What do you usually think of when you hear the words 'the Golden Gate Bridge'? Some people may immediately picture the Golden Gate Bridge in San Francisco, while others hearing it for the first time may wonder, 'Is it a bridge with a golden gate?' In our heads, we form thoughts through associations with keywords such as 'San Francisco', 'bridge', and 'gate'. The picture above provides a visual representation of how these related concepts are positioned within the AI model.
Main Component Description
Nearest neighbors to the Golden Gate Bridge feature:
This section shows other features related to the 'Golden Gate Bridge' feature.
The closer two features sit to each other, the more closely related they are within the model (a toy similarity computation is sketched after this list).
San Francisco region:
Concepts closely related to the 'Golden Gate Bridge' feature, consisting of various references primarily tied to San Francisco.
For example, 'San Francisco, California', 'San Francisco references', 'San Francisco area locations', etc.
Earthquake region:
Contains earthquake-related concepts that relate to the 'Golden Gate Bridge' feature.
There are earthquake-related references, such as '1906 SF earthquake', 'San Andreas fault system', and 'Northridge and Loma Prieta earthquake'.
Other related concepts:
A variety of geographic and cultural concepts are included, such as the San Francisco 49ers, UC Berkeley identifiers, New York City boroughs, the Eiffel Tower, and other tourist attractions and landmarks.
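One way to picture these 'nearest neighbor' relationships is as cosine similarity between the direction vectors associated with each learned feature. Below is a minimal sketch of that idea; the feature names are taken from the lists above, but the vectors and dimensions are random placeholders, not values from the paper.

```python
import numpy as np

# Toy stand-ins for learned feature direction vectors (rows) in activation space.
rng = np.random.default_rng(42)
feature_vectors = rng.standard_normal((6, 512))
feature_names = [
    "Golden Gate Bridge", "San Francisco, California", "1906 SF earthquake",
    "San Andreas fault system", "Eiffel Tower", "UC Berkeley identifiers",
]

def nearest_neighbors(query_index, vectors, names, k=3):
    """Rank other features by cosine similarity to the query feature's vector."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)
    return [(names[i], round(float(sims[i]), 3)) for i in order if i != query_index][:k]

# With trained feature vectors, the San Francisco-related features would rank highest.
print(nearest_neighbors(0, feature_vectors, feature_names))
```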
To get a clearer picture of how an AI model understands text, the researchers analyzed the activation patterns of neurons within the model. 'Features' are patterns in which multiple neurons fire together, and analyzing them reveals how the AI understands the same concept across different languages and formats. This research helps us better understand how AI models work, making them safer and more trustworthy.
Main concepts
1. Feature Extraction (Dictionary Learning):
Individual neurons in an AI model do not each stand for a single concept; one neuron can take part in representing many different concepts.
A 'feature' is a combined activation pattern of multiple neurons. Just as several letters come together to form a word, a combination of several neurons forms a specific feature.
Analyzing features provides a clearer picture of how an AI model understands text (a minimal sketch of this decomposition appears after this list).
2. Characteristics of features:
Features can appear the same in different languages and formats (text, images, etc.).
For example, a feature called 'Golden Gate Bridge' is activated equally across text in different languages, including English, Japanese, and Chinese.
3. Feature manipulation experiments:
The researchers experimented with how the model's responses change when specific features are artificially amplified or suppressed.
For example, amplifying the 'Golden Gate Bridge' feature causes the model to mention the Golden Gate Bridge in response to almost every question (see the steering sketch after this list).
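For the dictionary-learning step in item 1, the paper trains a sparse autoencoder on the model's internal activations, decomposing each activation vector into a sparse combination of many learned feature directions. Here is a minimal, untrained sketch of that setup; the dimensions and penalty weight are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes an activation vector into sparse feature coefficients and back."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature dictionary -> reconstruction

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))  # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
activation = torch.randn(8, 512)                 # a batch of toy internal activations
features, reconstruction = sae(activation)

# Training minimizes reconstruction error plus an L1 penalty that keeps only a few
# features active per input; each decoder column then acts as one feature's direction.
loss = ((reconstruction - activation) ** 2).mean() + 1e-3 * features.abs().mean()
```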
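The manipulation experiment in item 3 can be pictured as adding a scaled feature direction into a chosen layer's output during the forward pass. The sketch below shows that idea with a PyTorch forward hook; the layer, scale, and feature direction are hypothetical placeholders rather than the paper's actual setup, so the model-specific lines are left as comments.

```python
import torch

def make_steering_hook(feature_direction, scale=10.0):
    """Forward hook that adds `scale * feature_direction` to a module's output."""
    def hook(module, inputs, output):
        return output + scale * feature_direction.to(output.dtype)
    return hook

# Hypothetical usage (the layer index and direction below are stand-ins):
# golden_gate_direction = sae.decoder.weight[:, golden_gate_feature_index].detach()
golden_gate_direction = torch.randn(512)  # toy placeholder direction
# handle = model.transformer.h[20].register_forward_hook(
#     make_steering_hook(golden_gate_direction, scale=10.0))
# ...generate text: the amplified feature steers answers toward the Golden Gate Bridge...
# handle.remove()  # restores normal behavior
```

Suppressing a feature works the same way with a negative scale.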
Language density: Where are the words clustered?
1. Density:
The spike in the upper-left portion of the graph shows that most data points sit at an activation level of 0. This means that most inputs rarely activate this feature.
2. Conditional distribution:
Shows the distribution of a feature's activations broken down by activation level (from 0 to 1).
Colors represent the specificity score assigned by Claude:
Blue (0 points): Not relevant
Light Orange (1 point): Vaguely related
Dark Orange (2 points): Related to adjacent text
Red (3 points): Clearly identifies text
3. Example inputs sampled from intervals:
The example inputs below show text and images sampled from each activation interval.
The examples on the left represent low activation levels (0.1 to 0.3), and the examples on the right represent high activation levels (0.7 to 1.0).
For example, at low activation levels, loosely related text such as 'Presidio' and 'Union Square' appears; at high activation levels, text that explicitly mentions the 'Golden Gate Bridge' appears (a toy version of this bookkeeping is sketched just below).
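The density and conditional-distribution plots described above can be reproduced in miniature by histogramming one feature's activation values over many inputs and, within each nonzero interval, tallying the relevance scores of the sampled texts. The data and scores below are made up purely to show the bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up activations for one feature over 10,000 inputs; most inputs leave it at 0.
activations = np.where(rng.random(10_000) < 0.98, 0.0, rng.random(10_000))
# Made-up specificity scores (0 = not relevant ... 3 = clearly identifies the text).
scores = rng.integers(0, 4, size=activations.shape)

# Density: the spike at activation 0 corresponds to the upper-left of the plot.
counts, edges = np.histogram(activations, bins=np.linspace(0.0, 1.0, 11))
print("density per interval:", counts.tolist())

# Conditional distribution: score counts within each nonzero activation interval.
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (activations > lo) & (activations <= hi)
    if mask.any():
        tally = np.bincount(scores[mask], minlength=4).tolist()
        print(f"activation ({lo:.1f}, {hi:.1f}]: score counts {tally}")
```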
What makes this paper meaningful is that it clarifies how language models work. It also suggests that progress toward multimodal artificial intelligence will accelerate. Given the neuron-based association mechanism described above, it also opens up discussion about how language models can be used more effectively.