The advent of large language models, notably ChatGPT, has left people who have experimented with them, myself included, astonished by their remarkable linguistic prowess and their ability to accomplish a wide variety of tasks. However, many researchers, myself included, while marveling at these capabilities, also find themselves perplexed. Despite knowing the model's architecture and the exact values of its weights, we still struggle to understand why a specific sequence of inputs leads to a specific sequence of outputs.
In this blog post, I will attempt to demystify GPT2-small using mechanistic interpretability on a simple case: the prediction of repeated tokens.
Traditional mathematical tools for explaining machine learning models are not entirely suitable for language models.
Consider SHAP, a useful tool for explaining machine learning models. It is good at identifying which feature most influenced, say, the prediction that a wine is of high quality. However, it is important to remember that language models make predictions at the token level, while SHAP values are mostly computed at the feature level, making them potentially ill-suited for tokens.
Moreover, Large Language Models (LLMs) have an enormous number of parameters and inputs, creating a high-dimensional space. Computing SHAP values is expensive even in low-dimensional spaces, and even more so in the high-dimensional space of an LLM.
Even if we tolerate the high computational cost, the explanations provided by SHAP can be superficial. For instance, knowing that the token “potter” most influenced the output prediction because of the earlier mention of “Harry” doesn't provide much insight. It leaves us uncertain about which part of the model, or which specific mechanism, is responsible for such a prediction.
Mechanistic interpretability offers a different approach. It doesn't merely identify which features or inputs are important for a model's predictions. Instead, it sheds light on the underlying mechanisms or reasoning processes, helping us understand how a model makes its predictions or decisions.
We will be using GPT2-small for a simple task: predicting a sequence of repeated tokens. The library we will use is TransformerLens, which is designed for mechanistic interpretability of GPT-2-style language models.
from transformer_lens import HookedTransformer
gpt2_small: HookedTransformer = HookedTransformer.from_pretrained("gpt2-small")
We use the code above to load the GPT2-small model and predict tokens on a sequence generated by a specific function. This sequence consists of the bos_token followed by two identical random token sequences; an example would be bos_token + “ABCDABCD” when seq_len is 4. For clarity, we refer to the tokens from the start up to position seq_len as the first half, and the remaining tokens, excluding the bos_token, as the second half.
import torch as t
from torch import Tensor
from jaxtyping import Float, Int

device = "cuda" if t.cuda.is_available() else "cpu"  # device referenced by the code below

def generate_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch: int = 1
) -> Int[Tensor, "batch full_seq_len"]:
    '''
    Generates a sequence of repeated random tokens.

    Outputs are:
        rep_tokens: [batch, 1+2*seq_len]
    '''
    bos_token = (t.ones(batch, 1) * model.tokenizer.bos_token_id).long()  # bos token for each batch
    rep_tokens_half = t.randint(0, model.cfg.d_vocab, (batch, seq_len), dtype=t.int64)
    rep_tokens = t.cat([bos_token, rep_tokens_half, rep_tokens_half], dim=-1).to(device)
    return rep_tokens
When we let the model run on the generated tokens, we make an interesting observation: the model performs significantly better on the second half of the sequence than on the first half. This is measured by the log probabilities of the correct tokens. To be precise, the performance on the first half is -13.898, while the performance on the second half is -0.644.
We can also compute the prediction accuracy, defined as the ratio of correctly predicted tokens (those identical to the generated tokens) to the total number of tokens. The accuracy on the first half of the sequence is 0.0, which is unsurprising since we are working with random tokens that carry no real meaning. Meanwhile, the accuracy on the second half is 0.93, significantly outperforming the first half.
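As a rough illustration of how such numbers can be obtained, here is a minimal sketch; the helper per_token_log_probs and the exact split into halves are my own, not code from the original notebook:

seq_len = 50
rep_tokens = generate_repeated_tokens(gpt2_small, seq_len, batch=1)
logits = gpt2_small(rep_tokens, return_type="logits")

def per_token_log_probs(logits: Tensor, tokens: Tensor) -> Tensor:
    # log probability the model assigns to each correct next token
    log_probs = logits.log_softmax(dim=-1)
    return log_probs[:, :-1].gather(dim=-1, index=tokens[:, 1:].unsqueeze(-1)).squeeze(-1)

correct_log_probs = per_token_log_probs(logits, rep_tokens)[0]
print("first half: ", correct_log_probs[:seq_len].mean().item())
print("second half:", correct_log_probs[seq_len:].mean().item())

# accuracy: fraction of positions where the argmax prediction equals the actual next token
predictions = logits.argmax(dim=-1)[0, :-1]
targets = rep_tokens[0, 1:]
print("first half accuracy: ", (predictions[:seq_len] == targets[:seq_len]).float().mean().item())
print("second half accuracy:", (predictions[seq_len:] == targets[seq_len:]).float().mean().item())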
Finding the induction head
The observation above can likely be explained by the existence of an induction circuit. This is a circuit that scans the sequence for prior instances of the current token, identifies the token that followed it previously, and predicts that the same sequence will repeat. For instance, if it encounters an ‘A’, it scans for a previous ‘A’, or a token similar to ‘A’ in embedding space, identifies the token ‘B’ that followed it, and then predicts the next token after the current ‘A’ to be ‘B’, or a token similar to ‘B’ in embedding space.
This prediction process can be broken down into two steps:
- Identify the previous identical (or similar) token. Every token in the second half of the sequence should “attend” to the token seq_len positions before it. For instance, the ‘A’ at position 4 should attend to the ‘A’ at position 1 if seq_len is 3. We can call the attention head performing this operation the “induction head.”
- Identify the following token ‘B’. This is the process of copying information about the previous token (e.g., ‘A’) into the next token (e.g., ‘B’). This information is then used to “reproduce” ‘B’ when ‘A’ appears again. We can call the attention head performing this operation the “previous token head.”
Together, these two heads form a complete induction circuit. Note that sometimes the term “induction head” is also used to describe the entire “induction circuit.” For a more thorough introduction to induction circuits, I highly recommend the article In-context Learning and Induction Heads, which is a masterpiece!
Now, let's identify the induction head and the previous token head in GPT2-small.
The following code is used to find the induction head. First, we run the model on 30 batches. Then, we calculate the mean value of the diagonal with an offset of seq_len in the attention pattern matrix. This lets us measure how much attention the current token pays to the token that appeared seq_len positions earlier.
import einops
from transformer_lens.hook_points import HookPoint

def induction_score_hook(
    pattern: Float[Tensor, "batch head_index dest_pos source_pos"],
    hook: HookPoint,
):
    '''
    Calculates the induction score and stores it at the [layer, head] position of the `induction_score_store` tensor.
    '''
    # diagonal at offset 1-seq_len: attention from each token to the token seq_len-1 positions back,
    # i.e., one position to the right of the current token's previous occurrence
    induction_stripe = pattern.diagonal(dim1=-2, dim2=-1, offset=1-seq_len)
    induction_score = einops.reduce(induction_stripe, "batch head_index position -> head_index", "mean")
    induction_score_store[hook.layer(), :] = induction_score

seq_len = 50
batch = 30
rep_tokens_30 = generate_repeated_tokens(gpt2_small, seq_len, batch)
induction_score_store = t.zeros((gpt2_small.cfg.n_layers, gpt2_small.cfg.n_heads), device=gpt2_small.cfg.device)

# hook-name filter selecting the attention pattern activations ("blocks.{layer}.attn.hook_pattern")
pattern_hook_names_filter = lambda name: name.endswith("pattern")
gpt2_small.run_with_hooks(
    rep_tokens_30,
    return_type=None,
    fwd_hooks=[(
        pattern_hook_names_filter,
        induction_score_hook
    )]
)
Now, let's look at the induction scores. We find that some heads, such as the one at layer 5, head 5, have a high induction score of 0.91.
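One simple way to inspect these scores is to plot induction_score_store as a layer-by-head heatmap, for instance with plotly; this is my own plotting sketch, not the figure code from the original post:

import plotly.express as px

px.imshow(
    induction_score_store.cpu().numpy(),
    labels={"x": "Head", "y": "Layer"},
    title="Induction score by head",
    color_continuous_scale="Blues",
).show()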
We can also display the attention pattern of this head. You will notice a clear diagonal line offset by seq_len.
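The pattern can be rendered, for example, with circuitsvis. The sketch below (variable names are mine) caches a single repeated sequence and displays all heads of layer 5:

import circuitsvis as cv

_, vis_cache = gpt2_small.run_with_cache(rep_tokens_30[:1], return_type=None)
layer_5_patterns = vis_cache["pattern", 5][0]  # [head_index, dest_pos, source_pos] for the first sequence
cv.attention.attention_patterns(
    tokens=gpt2_small.to_str_tokens(rep_tokens_30[0]),
    attention=layer_5_patterns,
)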
Similarly, we can identify the previous token head. For instance, layer 4 head 11 shows a strong attention pattern toward the previous token.
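Previous token heads can be scored in the same way, by averaging the attention each position pays to the token immediately before it, i.e., the diagonal of the pattern at offset -1. The hook below is my own analogue of the induction score hook above, not code from the original post:

prev_token_score_store = t.zeros((gpt2_small.cfg.n_layers, gpt2_small.cfg.n_heads), device=gpt2_small.cfg.device)

def prev_token_score_hook(
    pattern: Float[Tensor, "batch head_index dest_pos source_pos"],
    hook: HookPoint,
):
    # average attention paid to the immediately preceding token (sub-diagonal of the pattern)
    prev_token_stripe = pattern.diagonal(dim1=-2, dim2=-1, offset=-1)
    prev_token_score = einops.reduce(prev_token_stripe, "batch head_index position -> head_index", "mean")
    prev_token_score_store[hook.layer(), :] = prev_token_score

gpt2_small.run_with_hooks(
    rep_tokens_30,
    return_type=None,
    fwd_hooks=[(pattern_hook_names_filter, prev_token_score_hook)],
)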
How do the MLP layers contribute?
Let's consider this question: do the MLP layers matter? We know that GPT2-small contains both attention and MLP layers. To investigate this, I propose using an ablation method.
Ablation, as the name implies, systematically removes certain model components and observes how performance changes as a result.
We will replace the output of the MLP layers in the second half of the sequence with the corresponding output from the first half, and observe how this affects the final loss. We compute the difference between the loss after replacing the MLP layer outputs and the original loss on the second half of the sequence, using the following code.
import functools
from tqdm import tqdm
from transformer_lens import utils

def patch_residual_component(
    residual_component,
    hook,
    pos,
    cache,
):
    # overwrite the activation at `pos` with the cached activation from the corresponding
    # position in the first half (the cache is assumed to have its batch dimension removed)
    residual_component[0, pos, :] = cache[hook.name][pos - seq_len, :]
    return residual_component

# `rep_tokens`, its cached activations `rep_cache` (batch dim removed), the upper index `max_len`
# and the `cross_entropy_loss` helper are assumed to be defined earlier in the notebook
ablation_scores = t.zeros((gpt2_small.cfg.n_layers, seq_len), device=gpt2_small.cfg.device)

gpt2_small.reset_hooks()
logits = gpt2_small(rep_tokens, return_type="logits")
loss_no_ablation = cross_entropy_loss(logits[:, seq_len:max_len], rep_tokens[:, seq_len:max_len])

for layer in tqdm(range(gpt2_small.cfg.n_layers)):
    for position in range(seq_len, max_len):
        hook_fn = functools.partial(patch_residual_component, pos=position, cache=rep_cache)
        ablated_logits = gpt2_small.run_with_hooks(rep_tokens, fwd_hooks=[
            (utils.get_act_name("mlp_out", layer), hook_fn)
        ])
        loss = cross_entropy_loss(ablated_logits[:, seq_len:max_len], rep_tokens[:, seq_len:max_len])
        ablation_scores[layer, position - seq_len] = loss - loss_no_ablation
We arrive at a surprising result: apart from the first token, the ablation does not produce a significant difference in loss. This suggests that the MLP layers do not contribute much in this case of repeated tokens.
Given that the MLP layers do not contribute significantly to the final prediction, we can manually construct an induction circuit using head 5 of layer 5 and head 11 of layer 4. Recall that these are the induction head and the previous token head. We do this with the following code:
from transformer_lens import FactoredMatrix

def K_comp_full_circuit(
    model: HookedTransformer,
    prev_token_layer_index: int,
    ind_layer_index: int,
    prev_token_head_index: int,
    ind_head_index: int
) -> FactoredMatrix:
    '''
    Returns a (vocab, vocab)-size FactoredMatrix,
    with the first dimension being the query side
    and the second dimension being the key side (going through the previous token head).
    '''
    W_E = model.W_E
    W_Q = model.W_Q[ind_layer_index, ind_head_index]
    W_K = model.W_K[ind_layer_index, ind_head_index]
    W_O = model.W_O[prev_token_layer_index, prev_token_head_index]
    W_V = model.W_V[prev_token_layer_index, prev_token_head_index]
    Q = W_E @ W_Q
    K = W_E @ W_V @ W_O @ W_K
    return FactoredMatrix(Q, K.T)
Computing the top-1 accuracy of this circuit yields a value of 0.2283. That is pretty good for a circuit built from only two heads!
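For reference, a top-1 accuracy of this kind (how often the highest-scoring key token for a given query token is that token itself) can be computed roughly as follows; the batched helper below is my own sketch, since the full (vocab, vocab) matrix is too large to materialize at once:

def top_1_acc(circuit: FactoredMatrix, batch_size: int = 1000) -> float:
    # fraction of tokens whose best-scoring key token is the token itself (the diagonal of the circuit)
    A, B = circuit.A, circuit.B              # (d_vocab, d_head) and (d_head, d_vocab) factors
    n_correct = 0
    for i in range(0, A.shape[0], batch_size):
        rows = A[i : i + batch_size] @ B     # materialize a (batch_size, d_vocab) slice of the circuit
        preds = rows.argmax(dim=-1)
        n_correct += (preds == t.arange(i, i + preds.shape[0], device=preds.device)).sum().item()
    return n_correct / A.shape[0]

induction_circuit = K_comp_full_circuit(gpt2_small, 4, 5, 11, 5)
print(top_1_acc(induction_circuit))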
For the detailed implementation, please check my notebook.