At a critical point the gradient $g$ is zero, so the first-order term $(\theta - \theta')^T g = 0$ vanishes and the second-order (Hessian) term tells us the properties of the critical point.
Set θ − θ′ = v:
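The second-order Taylor approximation of the loss around $\theta'$ then reads:

$$L(\theta) \approx L(\theta') + v^T g + \frac{1}{2} v^T H v$$

At a critical point the gradient term vanishes, so the sign of $v^T H v$ determines whether $\theta'$ is a local minimum ($H$ positive definite), a local maximum ($H$ negative definite), or a saddle point ($H$ indefinite).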
H may tell us parameter update direction!
If $u$ is an eigenvector of $H$ with eigenvalue $\lambda$, then $u^T H u = u^T (\lambda u) = \lambda \lVert u \rVert^2$.
Example: if $\lambda < 0$, then $u^T H u < 0$, so $L(\theta) < L(\theta')$ for $\theta = \theta' + u$; moving along $u$ decreases the loss.
We can escape the saddle point and decrease the loss. However, this method is seldom used in practice.
Batch & Momentum
Small Batch v.s. Large Batch
A larger batch size does not require longer time to compute the gradient, thanks to GPU parallel computing (unless the batch size is too large).
A smaller batch requires a longer time for one epoch (a longer time to see all data once).
Smaller batch size has better performance in optimization
“Noisy” update is better for training
Batch size is a hyperparameter you have to decide.
Gradient Descent + Momentum
Movement: movement of last step minus gradient at present
Movement is based not just on the gradient, but also on the previous movement.
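A minimal sketch of the momentum update (an illustrative scalar example; `lam` and `eta` are assumed values for the momentum coefficient and learning rate):

```python
lam, eta = 0.9, 0.01   # momentum coefficient and learning rate (assumed values)
theta, m = 5.0, 0.0    # a scalar parameter and its initial movement

def grad(theta):       # stand-in gradient, here for the loss L(theta) = theta ** 2
    return 2 * theta

for step in range(100):
    m = lam * m - eta * grad(theta)  # movement = last movement minus gradient step
    theta = theta + m                # update: move by the accumulated movement
```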
Summary
Critical points have zero gradients.
Critical points can be either saddle points or local minima.
Can be determined by the Hessian matrix.
Local minima may be rare.
It is possible to escape saddle points along the directions of eigenvectors of the Hessian matrix.
Smaller batch size and momentum help escape critical points.
Adaptive Learning Rate
People often believe training gets stuck because the parameters are around a critical point, but sometimes the learning rate is the real reason.
A single learning rate cannot be one-size-fits-all, so different parameters need different learning rates.
Consider updating one parameter ($t$ indexes the iteration, $i$ indexes the $i$-th parameter):

$$\theta_i^{t+1} \leftarrow \theta_i^t - \eta g_i^t$$
A larger gradient needs a smaller learning rate and a smaller gradient needs a larger learning rate, so the learning rate has to be parameter dependent:
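The parameter-dependent update, with $\sigma_i^t$ summarizing the past gradients of parameter $i$, is:

$$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$$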
Root Mean Square
RMSProp
The recent gradient has larger influence, and the past gradients have less influence.
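Concretely, RMSProp keeps an exponential moving average of squared gradients, with $0 < \alpha < 1$ controlling how quickly past gradients are forgotten:

$$\sigma_i^t = \sqrt{\alpha (\sigma_i^{t-1})^2 + (1 - \alpha)(g_i^t)^2}$$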
Learning Rate Scheduling
Learning Rate Decay
As training proceeds, we get closer to the destination, so we reduce the learning rate.
Warm Up
Increase and then decrease.
At the beginning, the estimate of $\sigma_i^t$ has large variance.
Summary of Optimization
Classification
Classification as Regression
Softmax:
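Concretely, softmax exponentiates each logit $y_i$ and normalizes:

$$y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$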
$1 > y_i' > 0$
$\sum_i y_i' = 1$
The core function of softmax is to normalize a model's raw prediction scores (logits) into a valid, interpretable probability distribution (all probabilities ≥ 0 and summing to 1), explicitly representing the model's prediction confidence across multiple mutually exclusive classes.
Loss of Classification
Mean Square Error (MSE): $e = \sum_i (\hat{y}_i - y_i')^2$
Cross-entropy: $e = -\sum_i \hat{y}_i \ln y_i'$ (more competitive)
Minimizing cross-entropy is equivalent to maximizing likelihood.
Changing the loss function can change the difficulty of optimization.
Case Study: Pokémon v.s. Digimon
We want to find a function to classify Pokémon/Digimon
Determine a function with unknown parameters (based on domain knowledge)
Observation
Function with Unknown Parameters
ℋ = {1, 2, ..., 10000}, |ℋ|: model “complexity”
Loss of a function (given data)
Given a dataset 𝒟
𝒟 = {(x1, ŷ1), (x2, ŷ2), ..., (xN, ŷN)}
Loss of a threshold h given data set 𝒟
Error rate:
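The error rate averages the per-example 0-1 loss over the dataset:

$$L(h, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} l(h, x^n, \hat{y}^n)$$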
$l(h, x^n, \hat{y}^n)$ means $I(f_h(x^n) \neq \hat{y}^n)$: if $f_h(x^n) \neq \hat{y}^n$, output 1; otherwise output 0.
Of course, we can choose cross-entropy instead.
Training Examples
If we could collect all Pokémon and Digimon in the universe, $\mathcal{D}_{all}$, we could find the best threshold $h^{all}$.
$\mathcal{D}_{train}$ is independently and identically distributed (i.i.d.).
We hope $L(h^{train}, \mathcal{D}_{all})$ and $L(h^{all}, \mathcal{D}_{all})$ are close.
We want $L(h^{train}, \mathcal{D}_{all}) - L(h^{all}, \mathcal{D}_{all}) \le \delta$.
So $\mathcal{D}_{train}$ has to fulfill: $\forall h \in \mathcal{H},\ |L(h, \mathcal{D}_{train}) - L(h, \mathcal{D}_{all})| \le \epsilon$, with $\epsilon = \delta/2$.
Probability of Failure
The following discussion is model-agnostic.
In the following discussion, we don’t have assumption about data distribution.
In the following discussion, we can use any loss function.
Each point is a training set.
Hoeffding’s Inequality:
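For a fixed $h$ with the 0-1 loss above, Hoeffding's inequality bounds the chance that the training loss strays from the true loss by more than $\epsilon$, and a union bound over $\mathcal{H}$ gives:

$$P(\mathcal{D}_{train}\text{ is bad}) \le \sum_{h \in \mathcal{H}} P(\mathcal{D}_{train}\text{ is bad due to } h) \le |\mathcal{H}| \cdot 2\exp(-2N\epsilon^2)$$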
Model Complexity
What if the parameters are continuous?
Answer 1 : Everything that happens in a computer is discrete.
Answer 2 : VC dimension
Why don’t we simply use a very small |ℋ| ?
A smaller $|\mathcal{H}|$ means fewer candidates $h \in \mathcal{H}$, so the best achievable loss $L(h^{all}, \mathcal{D}_{all})$ may become larger.
Tradeoff of Model Complexity
How to find best balance? DEEP LEARNING.
Homework 2: Framewise phoneme prediction from speech
Data Preprocessing: Extract MFCC features from raw waveform
Classification: Perform framewise phoneme classification using pre-extracted MFCC features
Report Questions
Implement 2 models with approximately the same number of parameters, (A) one narrower and deeper (e.g. hidden_layers=6, hidden_dim=1024) and (B) the other wider and shallower (e.g. hidden_layers=2, hidden_dim=1700). Report training/validation accuracies for both models.
hidden_layers=6, hidden_dim=1024: acc 0.47458
hidden_layers=2, hidden_dim=1700: acc 0.47491
Add dropout layers, and report training/validation accuracies with dropout rates equal to (A) 0.25/(B) 0.5/(C) 0.75 respectively.
```python
# Normally, we don't need augmentations in testing and validation.
# All we need here is to resize the PIL image and transform it into Tensor.
test_tfm = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])
```
```python
# However, it is also possible to use augmentation in the testing phase.
# You may use train_tfm to produce a variety of images and then test using ensemble methods.
train_tfm = transforms.Compose([
    # Resize the image into a fixed shape (height = width = 128)
    transforms.Resize((128, 128)),
    # You may add some transforms here.
    # With 95% probability apply random augmentation (TrivialAugmentWide);
    # with 5% probability keep the original image.
    transforms.RandomChoice(transforms=[
        # Apply the TrivialAugmentWide data augmentation method
        transforms.TrivialAugmentWide(),
        # Return the original image
        transforms.Lambda(lambda x: x),
    ], p=[0.95, 0.05]),
    # ToTensor() should be the last one of the transforms.
    transforms.ToTensor(),
])
```
Configuration
```python
# The number of training epochs and patience.
n_epochs = 20
patience = 10  # If no improvement in 'patience' epochs, early stop
```
```python
# Initialize optimizer; you may fine-tune some hyperparameters such as learning rate on your own.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.8, patience=patience // 2, threshold=0.05)
# Call scheduler.step(best_acc) after validation in each epoch.
```
```python
class Classifier(nn.Module):
    def __init__(self, d_model=512, n_spks=600, dropout=0.2):
        super().__init__()
        # Project the dimension of features from that of input into d_model.
        self.prenet = nn.Linear(40, d_model)
        # TODO:
        #   Change Transformer to Conformer.
        #   https://arxiv.org/abs/2005.08100
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, dim_feedforward=256, nhead=32
        )
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)

        # Project the dimension of features from d_model into the number of speakers.
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, 2 * d_model),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * d_model, n_spks),
        )
```
Strong Baseline
Construct a Conformer, which is a variant of the Transformer.
```python
import math

def get_rate(d_model, step_num, warmup_step):
    # TODO: Change lr from constant to the equation shown above
    lr = 1.0 / math.sqrt(d_model) * min(1.0 / math.sqrt(step_num),
                                        step_num / (warmup_step * math.sqrt(warmup_step)))
    return lr
```
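Written out, this code computes the Noam (Transformer) schedule, which warms the learning rate up and then decays it:

$$lr = d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup^{-1.5}\right)$$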
```python
# cpu threads when fetching & processing data.
num_workers=2,
# batch size in terms of tokens. gradient accumulation increases the effective batch size.
max_tokens=8192,
accum_steps=2,

# the lr is calculated from the Noam lr scheduler. you can tune the maximum lr by this factor.
lr_factor=2.,
lr_warmup=4000,

# maximum epochs for training (Medium)
max_epoch=30,
start_epoch=1,

# beam size for beam search
beam=5,
# generate sequences of maximum length ax + b, where x is the source length
max_len_a=1.2,
max_len_b=10,
# when decoding, post process sentence by removing sentencepiece symbols and jieba tokenization.
post_process="sentencepiece",

# checkpoints
keep_last_epochs=5,
resume=None,  # if resume from checkpoint name (under config.savedir)
```
```python
def build_model(args, task):
    """ build a model instance based on hyperparameters """
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # patches on default parameters for Transformer (those not set above)
    from fairseq.models.transformer import base_architecture
    base_architecture(arch_args)
    add_transformer_args(arch_args)
```
Boss Baseline
Apply back-translation
Train a backward model by switching languages
Translate monolingual data with backward model to obtain synthetic data
Complete TODOs in the sample code
All the TODOs can be completed by using commands from earlier cells
Train a stronger forward model with the new data
If done correctly, 30 epochs on new data should pass the baseline
Configuration for experiments
Set BACK_TRANSLATION to True in the experiment configuration and run: this trains the back-translation model and processes the corresponding corpus.
Set BACK_TRANSLATION to False in the experiment configuration and run: this trains with the combined ted2020 and mono (back-translation) corpus.
Back-translation
TODO: clean corpus
remove sentences that are too long or too short
unify punctuation
hint: you can use clean_s() defined above to do this
```python
def clean_mono_corpus(prefix, l, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.zh').exists():
        print(f'{prefix}.clean.zh exists. skipping clean.')
        return
    with open(f'{prefix}', 'r') as l_in_f:
        with open(f'{prefix}.clean.zh', 'w') as l_out_f:
            for s in l_in_f:
                s = s.strip()
                s = clean_s(s, l)
                s_len = len_s(s, l)
                if min_len > 0:  # remove short sentence
                    if s_len < min_len:
                        continue
                if max_len > 0:  # remove long sentence
                    if s_len > max_len:
                        continue
                print(s, file=l_out_f)
```
```python
for lang in [src_lang, tgt_lang]:
    out_path = mono_prefix / f'mono.tok.{lang}'
    if out_path.exists():
        print(f"{out_path} exists. skipping spm_encode.")
    else:
        with open(out_path, 'w') as out_f:
            with open(f'{in_path}.{lang}', 'r') as in_f:
                for line in in_f:
                    line = line.strip()
                    tok = spm_model.encode(line, out_type=str)
                    print(' '.join(tok), file=out_f)
```
TODO: Generate synthetic data with backward model
Add binarized monolingual data to the original data directory, and name it with “split_name”
```python
# Combine prediction_file (.en) and mono.zh (.zh) into a new dataset.
#
# hint: tokenize prediction_file with the spm model
!cp ./prediction.txt {mono_prefix}/'ted_zh_corpus.deduped.clean.en'
spm_encode(prefix, vocab_size, mono_prefix)
# spm_model.encode(line, out_type=str)
# output: ./DATA/rawdata/mono/mono.tok.en & mono.tok.zh
#
# hint: use fairseq to binarize these two files again
binpath = Path('./DATA/data-bin/synthetic')
src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
tgt_dict_file = src_dict_file
# or whatever path after applying subword tokenization, w/o the suffix (.zh/.en)
monopref = './DATA/rawdata/mono/mono.tok'
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess \
        --source-lang 'zh' \
        --target-lang 'en' \
        --trainpref {monopref} \
        --destdir {binpath} \
        --srcdict {src_dict_file} \
        --tgtdict {tgt_dict_file} \
        --workers 2
```
Generative Adversarial Network (GAN)
Adding a sampled distribution to the input lets the same input produce different outputs, which is especially useful for tasks that need "creativity".
Unconditional Generation
Take Anime Face Generation as an example.
Discriminator
The discriminator is a neural network. In this example, the input is an image and the output is a scalar: a larger scalar means the image looks real, a smaller value means it looks fake.
Basic Idea of GAN
This is where the term “adversarial” comes from.
Algorithm
Initialize Generator and Discriminator
In each training iteration:
Fix generator G, and update discriminator D.
Discriminator learns to assign high scores to real objects and low scores to generated objects.
Fix discriminator D, and update generator G.
Generator learns to "fool" the discriminator (a minimal training-loop sketch follows).
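A minimal PyTorch sketch of this alternating procedure, assuming `G` and `D` are defined `nn.Module`s with optimizers `opt_G`/`opt_D`, `D` outputs a probability in (0, 1), `z_dim` is the latent size, and `dataloader` yields batches of real images (all names illustrative, not the course's sample code):

```python
import torch
import torch.nn.functional as F

for real in dataloader:                        # real: a batch of real images
    bs = real.size(0)
    ones, zeros = torch.ones(bs, 1), torch.zeros(bs, 1)

    # Step 1: fix G, update D — high scores for real, low scores for generated.
    fake = G(torch.randn(bs, z_dim)).detach()  # detach: no gradients into G
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: fix D, update G — make D believe the generated images are real.
    loss_G = F.binary_cross_entropy(D(G(torch.randn(bs, z_dim))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```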
Theory behind GAN
Our Objective
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
$\mathrm{Div}(P_G, P_{data})$ is the divergence between the distributions $P_G$ and $P_{data}$.
How to compute the divergence?
Training: $D^* = \arg\max_D V(D, G)$; the value of $\max_D V(D, G)$ is related to the JS divergence.
Objective function for D: $V(G, D) = E_{y \sim P_{data}}[\log D(y)] + E_{y \sim P_G}[\log(1 - D(y))]$
$V(D, G)$ is a negative cross-entropy, so $D^* = \arg\max_D V(D, G)$ is equivalent to minimizing the cross-entropy when training a binary classifier.
Other divergences can also be used, not just the JS divergence.
Tips for GAN
The JS divergence is not suitable, because in most cases $P_G$ and $P_{data}$ do not overlap.
Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy.
Its accuracy (or loss) therefore means nothing during GAN training.
Wasserstein distance
Consider one distribution P as a pile of earth and another distribution Q as the target.
The Wasserstein distance is the average distance the earth mover has to move the earth.
There are many possible "moving plans"; the "moving plan" with the smallest average distance defines the Wasserstein distance.
WGAN
Evaluate the Wasserstein distance between $P_{data}$ and $P_G$:
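The WGAN objective from the original paper replaces the GAN objective with:

$$W(P_{data}, P_G) = \max_{D \in \text{1-Lipschitz}} \left\{ E_{y \sim P_{data}}[D(y)] - E_{y \sim P_G}[D(y)] \right\}$$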
$D \in \text{1-Lipschitz}$ means D has to be smooth enough. Without this constraint, maximizing the objective would push $D(x)$ toward $+\infty$ on real data and $-\infty$ on generated data, and the training of D would not converge; keeping D smooth prevents this.
Evaluation of Generation
Human evaluation is expensive and sometimes unfair/unstable. How to evaluate the quality of the generated images automatically?
Use another model to extract features for the real and the generated images.
Calculate the Fréchet distance between the distributions of the two sets of features.
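Concretely, a Gaussian is fitted to each set of features ($\mu_r, \Sigma_r$ for real, $\mu_g, \Sigma_g$ for generated); when the feature extractor is an Inception network, this is the Fréchet Inception Distance (FID):

$$\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

The lower, the better.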
AFD (Anime face detection) rate
Detects how many anime faces appear in your submission.
The higher, the better
Dataset
Crypko
1. Dataset link is in the Colab
2. Dataset format
3. There are 71,314 pictures in the folder
4. You can use additional data to increase the performance
Baselines
Useful information
DCGAN
Sample code implementation
Use several conv layers to generate image
WGAN & WGAN-GP
WGAN: Modify from DCGAN
Remove the last sigmoid layer from the discriminator.
Do not take the logarithm when calculating the loss.
Clip the weights of the discriminator into a fixed range (−1 to 1).
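A minimal sketch of the clipping step (assuming `D` is the discriminator module; the clipping constant `c` is a hyperparameter — the original WGAN paper uses 0.01, while 1.0 would match the range above):

```python
c = 0.01  # clipping constant (hyperparameter)
for p in D.parameters():
    p.data.clamp_(-c, c)  # clip every weight into [-c, c] after each D update
```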
In self-supervised learning, the system learns to predict part of its input from other parts of its input. In other words, a portion of the input is used as a supervisory signal for a predictor fed with the remaining portion of the input.
MASS is based on the sequence to sequence learning framework: its encoder takes a sentence with a masked fragment (several consecutive tokens) as input, and its decoder predicts this masked fragment conditioned on the encoder representations. Unlike BERT or a language model that pre-trains only the encoder or decoder, MASS is carefully designed to pre-train the encoder and decoder jointly in two steps: 1) By predicting the fragment of the sentence that is masked on the encoder side, MASS can force the encoder to understand the meaning of the unmasked tokens, in order to predict the masked tokens on the decoder side; 2) By masking the input tokens of the decoder that are unmasked on the source side, MASS can force the decoder to rely more on the source representation rather than the previous tokens on the target side for next-token prediction, better facilitating the joint training between encoder and decoder.
Inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations. Here, a document has been corrupted by replacing spans of text with mask symbols. The corrupted document (left) is encoded with a bidirectional model, and then the likelihood of the original document (right) is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and we use representations from the final hidden state of the decoder.
BART is trained by corrupting documents and then optimizing a reconstruction loss—the cross-entropy between the decoder’s output and the original document. Unlike existing denoising autoencoders, which are tailored to specific noising schemes, BART allows us to apply any type of document corruption. In the extreme case, where all information about the source is lost, BART is equivalent to a language model.
Token Masking: random tokens are sampled and replaced with [MASK] elements.
Token Deletion: Random tokens are deleted from the input. In contrast to token masking, the model must decide which positions are missing inputs.
Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ=3). Each span is replaced with a single [MASK] token. 0-length spans correspond to the insertion of [MASK] tokens. Text infilling teaches the model to predict how many tokens are missing from a span.
Sentence Permutation: A document is divided into sentences based on full stops, and these sentences are shuffled in a random order.
Document Rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document.
Why does BERT work?
Contextualized word embedding: through the Transformer's self-attention mechanism, each word's embedding vector is computed jointly from all the words in the sentence, achieving genuine context awareness. The same word gets different vectors in different sentences (e.g., the vectors for "bank" differ markedly between financial and geographic contexts), which addresses the polysemy problem.
We believe that after pre-training, the PLM learns some knowledge, encoded in its hidden representations, that can transfer to downstream tasks.
(Standard) fine-tuning: Using the pre-trained weights of the PLM to initialize a model for a downstream task.
The Problems of PLMs
Data scarcity in downstream tasks: A large amount of labeled data is not easy to obtain for each downstream task.
The PLM is too big, and they are still getting bigger.
Need a copy for each downstream task
Inference takes too long
Consume too much space
The Solutions of Those Problems
Labeled Data Scarcity → Data-Efficient Fine-tuning
Prompt Tuning: by converting the data points in the dataset into natural language prompts, it may be easier for the model to know what it should do.
Format the downstream task as a language modeling task by converting the data into natural language prompts with predefined templates.
What you need in prompt tuning:
A prompt template: convert data points into a natural language prompt.
A PLM: perform language modeling.
A verbalizer: a mapping between the labels and the vocabulary
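For instance (an illustrative sentiment-classification setup, not from the original slides): with the template "&lt;review&gt; It was [MASK]." and a verbalizer mapping positive → "great" and negative → "terrible", the PLM's relative probability of "great" vs. "terrible" at the [MASK] position gives the predicted label.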
Prompt tuning has better performance under data scarcity because it incorporates human knowledge and introduces no new parameters.
Few-shot Learning: We have some labeled training data.
Semi-Supervised learning: We have some labeled training data and a large amount of unlabeled data.
Pattern-Exploiting Training (PET):
(1) Use different prompts and verbalizers to prompt-tune different PLMs on the labeled dataset.
(2) Predict on the unlabeled dataset and combine the predictions from the different models.
(3) Train a PLM with a classifier head on the soft-labeled dataset.
Zero-shot inference: inference on the downstream task without any training data.
PLMs Are Gigantic → Reducing the Number of Parameters
Parameter-efficient fine-tuning: reduce the number of task-specific parameters for each downstream task.
Fine-tuning = modifying the hidden representation based on a PLM
Adapter: Use special submodules to modify hidden representations.
During fine-tuning, only the adapters and the classifier head are updated. All downstream tasks share the PLM; the adapters in each layer and the classifier heads are the task-specific modules.
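A minimal bottleneck-adapter sketch (an assumption following the Houlsby et al. 2019 design; dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a Transformer sub-layer."""
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, d_model)    # up-projection

    def forward(self, x):
        # The residual connection keeps the PLM's hidden representation and
        # lets the adapter learn only a small task-specific modification.
        return x + self.up(torch.relu(self.down(x)))
```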
LoRA: Low-Rank Adaptation of Large Language Models.
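A minimal LoRA sketch (illustrative, not the official implementation or the `peft` API): the frozen pre-trained weight $W$ is augmented with a trainable low-rank update $BA$, so the layer computes $Wx + \frac{\alpha}{r} BAx$ and only $A$ and $B$ are trained:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)  # stands in for the pre-trained weight
        for p in self.base.parameters():
            p.requires_grad_(False)         # frozen during fine-tuning
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update B A x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```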
```python
##### Plot the 2D representation of each token's embedding #####
for i, token_id in enumerate(inputs['input_ids'][0]):
    x, y = reduced_embeddings[i]       # get the 2D coordinates of this token
    word = Tokenizer.decode(token_id)  # convert the token id back to a word (possibly a subword)

    # Color the point by the token's category (answer / question / context)
    if word in answers[QUESTION - 1].split():
        # token belongs to the answer: blue
        plt.scatter(x, y, color='blue', marker='d')
    elif question_start <= i <= question_end:
        # token belongs to the question: red
        plt.scatter(x, y, color='red')
    elif context_start <= i <= context_end:
        # token belongs to the context: green
        plt.scatter(x, y, color='green')
    else:
        continue  # skip special tokens such as CLS and SEP

    # Annotate the point with the token's text
    plt.text(x + 0.1, y + 0.2, word, fontsize=12)
```
In the past, we focused only on attacks in computer vision or audio. The input space for images or audio is a vector space ℝⁿ, but the input space in NLP consists of words/tokens. To feed those tokens into a model, we need to map each token to a continuous vector. The discrete nature of text makes attacks in NLP very different from those in CV or speech processing.
Evasion Attacks and Defenses
Introduction
Evasion Attacks in Computer Vision
Adding imperceptible noise on an image can change the prediction of a model.
Evasion Attacks in NLP
For a given task, modify the input so that the model's prediction is corrupted, while a human would still give the modified input the same label as the original.
Adversarial data augmentation: use a trained (non-robust) text classifier to pre-generate adversarial samples, then add them to the training dataset to train a new text classifier.
Adversarial and Mixup Data Augmentation
Adversarial data augmentation
Mixup the samples in the training set (including benign and adversarial)
Detecting Adversaries during Inference
Discriminate Perturbations (DISP): detect adversarial samples and convert them to benign ones.
It uses the structural information of the perturbation itself to locate and undo adversarial modifications:
Perturbation discriminator: a classifier that judges whether a given token has been perturbed;
Embedding estimator: estimates the embedding of a perturbed token by regression;
Token recovery: uses the estimated embedding to look up the embedding vocabulary and recover the perturbed token.
Frequency-Guided Word Substitutions (FGWS): swap low-frequency words for higher-frequency counterparts with a three-step pipeline.
1. Identify the words in the input whose frequency in the training data is below a preset threshold δ;
2. Replace every low-frequency word identified in step 1 with its most frequent synonym;
3. If the model's probability for the originally predicted class differs by more than a preset threshold γ before and after the substitution, flag the input as adversarial.
Imitation Attacks and Defenses
Imitation Attacks
An imitation attack aims to steal a trained model by querying it.
Training a model requires significant resources, both time and money.
Training data may be proprietary.
Factors that may affect how well a model can be stolen
Architecture mismatch
Data mismatch
Imitation Attacks in Machine Translation
By querying a black-box translation API (such as Google Translate, DeepL, or ChatGPT) for input-output pairs, an attacker can train a model that behaves similarly and thus steal its translation capability.
Pipeline:
Data construction:
The attacker prepares a large number of source-language sentences (e.g., English sentences), which can be collected from public corpora.
Black-box querying:
Send the source-language sentences to the target translation system (e.g., Google Translate).
Collect the translations returned by the target model (target-language sentences, e.g., German).
Training the imitation model:
Use these source-translation pairs to train a neural machine translation model
that mimics the target model's translation behavior and style.
Evaluation and attack:
Evaluate the BLEU similarity between the imitation model and the target model,
or craft adversarial inputs on the imitation model and transfer them to attack the original model.
Results: the imitation model can closely match the performance of the victim model.
Adversarial transferability: for an input sample $x$, an adversarial sample $x_{adv}$ generated by the attacker on a source model $f_s$ can also make $f_t(x_{adv}) \neq f_t(x)$, without any access to the target model $f_t$.
In other words, an adversarial sample generated on one model can, without modification, also fool another model.
After we train the imitation model, we can attack it in a white-box fashion to obtain adversarial samples, and then use those samples to attack the victim model.
Adversarial transferability in machine translation (MT)
Adversarial examples can successfully transfer to production MT system
Adversarial transferability in text classification
Virtual Adversarial Domain Adaptation (VADA) model: a basic combination of domain adversarial training and semi-supervised training objectives.
The cluster assumption states that the input distribution X contains clusters and that points in the same cluster come from the same class. If the cluster assumption holds, the optimal decision boundaries should occur far away from data-dense regions in the space of 𝒳. We achieve this behavior via minimization of the conditional entropy with respect to the target distribution.
Intuitively, minimizing the conditional entropy forces the classifier to be confident on the unlabeled target data, thus driving the classifier's decision boundaries away from the target data. However, this approximation breaks down if the classifier h is not locally-Lipschitz. Without the locally-Lipschitz constraint, the classifier is allowed to abruptly change its prediction in the vicinity of the training data points, which 1) results in an unreliable empirical estimate of conditional entropy and 2) allows placement of the classifier decision boundaries close to the training samples even when the empirical conditional entropy is minimized. To prevent this, we propose to explicitly incorporate the locally-Lipschitz constraint via virtual adversarial training.
Decision-boundary Iterative Refinement Training
Initialize with the VADA model and then further minimize the cluster assumption violation in the target domain. In particular, we first use VADA to learn an initial classifier hθ0. Next, we incrementally push the classifier’s decision boundaries away from data-dense regions by minimizing the target-side cluster assumption violation loss ℒt. We denote this procedure Decision-boundary Iterative Refinement Training (DIRT).
Given real images (with labels) and drawing images (without labels), please use domain adaptation technique to make your network predict the drawing images correctly.
Dataset
Label: 10 classes (numbered from 0 to 9).
Training : 5000 (32, 32) RGB real images (with label).
Set a proper λ in the DaNN algorithm and train for more epochs.
Strong baseline (0.71874): the test data is label-balanced; can you make use of this additional information?
Boss baseline (0.77956):
All the techniques you've learned in CNN: change the optimizer or learning rate, set an lr_scheduler, etc.; ensemble the models or outputs you tried.
Implement other advanced adversarial training, for example MCD, MSDA, or DIRT-T.
Semi-supervised learning may help.
What about unsupervised learning (like Universal Domain Adaptation)?
```python
if done:
    # Compute the cumulative discounted reward for this episode
    discounted_rewards = []
    for t in range(len(episode_rewards)):
        cumulative = sum(0.99 ** (k - t) * episode_rewards[k]
                         for k in range(t, len(episode_rewards)))
        discounted_rewards.append(cumulative)
```
```python
def sample(self, batch_size):
    """Randomly sample a batch of experiences from memory."""
    return random.sample(self.memory, batch_size)

def __len__(self):
    """Return the current size of internal memory."""
    return len(self.memory)
```

```python
class DQNAgent():
    """Interacts with and learns from the environment."""
    def __init__(self, num_states, num_actions):
        """Initialize an Agent object."""
        # Store the dimensions of the state and action spaces
        self.num_states = num_states
        self.num_actions = num_actions
```

```python
def get_action(self, state, episode, test=False):
    """Returns actions for given state as per current policy."""
    if test:  # test mode
        # Set the network to evaluation mode
        self.main_q_network.eval()
        # No gradient computation needed
        with torch.no_grad():
            # max(1) returns the per-row maximum; [1] takes the index of the
            # maximum (i.e., the greedy action); view(1, 1) reshapes the tensor
            action = self.main_q_network(
                torch.from_numpy(state).unsqueeze(0)).max(1)[1].view(1, 1)
        # Return the action as a plain number
        return action.item()
```
training a large, over-parameterized model is often not necessary to obtain an efficient final model;
the learned "important" weights of the large model are typically not useful for the small pruned model;
the pruned architecture itself, rather than a set of inherited "important" weights, is what matters most for the efficiency of the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm.
We also compare with the “Lottery Ticket Hypothesis”, and find that with optimal learning rate, the “winning ticket” initialization does not bring improvement over random initialization.
New random initialization, not original random initialization in “Lottery Ticket Hypothesis”
Limitations of the "Lottery Ticket Hypothesis" (small lr, unstructured pruning)
This paper's claim: the network architecture obtained by pruning matters far more than the original weights that are kept.
The paper proposes an alternative strategy:
Train a small network from scratch with the architecture inherited from pruning.
Network Compression: Use a small model to simulate the prediction/accuracy of the large model.
In this task, you need to train a very small model to complete HW3, that is, do the classification on the food-11 dataset.
Intro
Knowledge Distillation
When training a small model, add some information from the large model (such as the probability distribution of the prediction) to help the small model learn better.
We have provided a well-trained network to help you do knowledge distillation (Acc ~= 0.855).
Please note that you may only use the pre-trained model we provide when doing the homework.
Design Architecture
Depthwise & Pointwise Convolution Layer (Proposed in MobileNet)
You can view the original convolution as a Dense/Linear layer in which each weight is a filter and each multiplication becomes a convolution operation (input × weight → input ∗ filter).
Depthwise: let each channel pass through its own filter first. Pointwise: then let every pixel pass through a shared-weight Dense/Linear layer (a 1×1 convolution).
It is strongly recommended that you use similar techniques to design your model (parameter count: N·M·k·k for a standard convolution vs. N·k·k + N·M for the depthwise-separable version; see the sketch below).
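A minimal PyTorch sketch of a depthwise-separable convolution block (names are illustrative; `groups=in_ch` makes the first convolution depthwise, and the 1×1 convolution is the pointwise part):

```python
import torch.nn as nn

def dwpw_conv(in_ch, out_ch, k=3):
    return nn.Sequential(
        # Depthwise: each input channel is filtered by its own k x k kernel.
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),
        # Pointwise: a 1 x 1 convolution mixes channels, acting as a
        # shared-weight Linear layer applied at every pixel.
        nn.Conv2d(in_ch, out_ch, 1),
    )
```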
Baseline Guides
Simple Baseline (2 pts, acc ≥ 0.59856, 2 hours)
Just run the code and submit answer.
Medium Baseline (2 pts, acc ≥ 0.65412, 2 hours)
Complete the loss in knowledge distillation and control alpha & T (see the sketch below).
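A minimal sketch of the standard distillation loss (Hinton et al.; the function name and the exact mixing scheme of alpha and T are illustrative assumptions, not necessarily the homework's formulation):

```python
import torch.nn.functional as F

def loss_fn_kd(student_logits, labels, teacher_logits, alpha=0.5, T=4.0):
    # Soft-target term: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    # Hard-target term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```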
Strong Baseline (1.5 pts, acc ≥ 0.72819, 4 hours)
Modify model architecture with depth- and point-wise convolution layer.
Or, you can take great ideas from MobileNet, ShuffleNet, DenseNet, SqueezeNet, GhostNet, etc.
Any techniques and methods you learned in HW3 - CNN. For example, make data augmentation stronger, modify semi-supervised learning, etc.
Boss Baseline (0.5 pts, acc ≥ 0.81003)
Make your teacher net stronger.
If your teacher net is too strong, you can consider TAKD techniques.
```python
class baseline(object):
    """
    baseline technique: do nothing in the regularization term
    (the importance matrix is initialized to all zeros)
    """
    def __init__(self, model, dataloaders, device):
        self.model = model
        self.dataloaders = dataloaders
        self.device = device

        # extract all trainable parameters of the model
        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        self.p_old = {}  # store current parameters
        self._precision_matrices = self._calculate_importance()  # generate weight matrix

        for n, p in self.params.items():
            self.p_old[n] = p.clone().detach()  # keep the old parameter in self.p_old

    def _calculate_importance(self):
        precision_matrices = {}
        for n, p in self.params.items():
            # initialize the weight matrix (filled with zeros)
            precision_matrices[n] = p.clone().detach().fill_(0)
        return precision_matrices

    def penalty(self, model: nn.Module):
        loss = 0
        for n, p in model.named_parameters():
            _loss = self._precision_matrices[n] * (p - self.p_old[n]) ** 2
            loss += _loss.sum()
        return loss
```
```python
class ewc(object):
    """
    @article{kirkpatrick2017overcoming,
        title={Overcoming catastrophic forgetting in neural networks},
        author={Kirkpatrick, James and Pascanu, Razvan and Rabinowitz, Neil and Veness, Joel and Desjardins, Guillaume and Rusu, Andrei A and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and others},
        journal={Proceedings of the national academy of sciences},
        year={2017},
        url={https://arxiv.org/abs/1612.00796}
    }
    """
    def __init__(self, model, dataloaders, device):
        self.model = model
        self.dataloaders = dataloaders
        self.device = device

        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        self.p_old = {}
        self._precision_matrices = self._calculate_importance()

        for n, p in self.params.items():
            self.p_old[n] = p.clone().detach()

    def _calculate_importance(self):
        precision_matrices = {}
        # create a Fisher matrix for each parameter, initialized to zero
        for n, p in self.params.items():
            precision_matrices[n] = p.clone().detach().fill_(0)

        self.model.eval()
        if self.dataloaders[0] is not None:
            dataloader_num = len(self.dataloaders)
            number_data = sum([len(loader) for loader in self.dataloaders])
            for dataloader in self.dataloaders:
                for data in dataloader:
                    self.model.zero_grad()
                    input = data[0].to(self.device)
                    output = self.model(input)
                    label = data[1].to(self.device)

                    ##### generate the Fisher (F) matrix for EWC #####
                    # negative log-likelihood loss: F.log_softmax gives
                    # log-probabilities, then compute the NLL
                    loss = F.nll_loss(F.log_softmax(output, dim=1), label)
                    # backward pass: compute the gradient of every parameter
                    loss.backward()

                    for n, p in self.model.named_parameters():
                        # accumulate the squared gradients into the Fisher matrix,
                        # averaged over the total number of samples;
                        # p.grad.data ** 2 is the core of the Fisher information:
                        # the expectation of the squared gradient
                        precision_matrices[n].data += p.grad.data ** 2 / number_data

        precision_matrices = {n: p for n, p in precision_matrices.items()}
        return precision_matrices

    def penalty(self, model: nn.Module):
        loss = 0
        for n, p in model.named_parameters():
            # EWC regularization term:
            # Fisher matrix (parameter importance) × (current - old parameter)²
            _loss = self._precision_matrices[n] * (p - self.p_old[n]) ** 2
            loss += _loss.sum()
        return loss
```
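A sketch of how these `penalty()` terms plug into training on a new task (`lifelong_lambda`, `new_task_loader`, and `old_task_loader` are assumed names; the same pattern applies to the mas/si/rwalk classes below):

```python
regularizer = ewc(model, [old_task_loader], device)  # importance from the old task

for x, y in new_task_loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    # new-task loss plus the weighted penalty on parameter drift
    loss = F.cross_entropy(model(x), y) + lifelong_lambda * regularizer.penalty(model)
    loss.backward()
    optimizer.step()
```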
```python
class mas(object):
    """
    @article{aljundi2017memory,
        title={Memory Aware Synapses: Learning what (not) to forget},
        author={Aljundi, Rahaf and Babiloni, Francesca and Elhoseiny, Mohamed and Rohrbach, Marcus and Tuytelaars, Tinne},
        booktitle={ECCV},
        year={2018},
        url={https://eccv2018.org/openaccess/content_ECCV_2018/papers/Rahaf_Aljundi_Memory_Aware_Synapses_ECCV_2018_paper.pdf}
    }
    """
    def __init__(self, model: nn.Module, dataloaders: list, device):
        self.model = model
        self.dataloaders = dataloaders
        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        self.p_old = {}
        self.device = device
        self._precision_matrices = self.calculate_importance()

        for n, p in self.params.items():
            self.p_old[n] = p.clone().detach()

    def calculate_importance(self):
        precision_matrices = {}
        # create an Omega matrix for each parameter, initialized to zero
        for n, p in self.params.items():
            precision_matrices[n] = p.clone().detach().fill_(0)

        self.model.eval()
        if self.dataloaders[0] is not None:
            dataloader_num = len(self.dataloaders)
            num_data = sum([len(loader) for loader in self.dataloaders])
            for dataloader in self.dataloaders:
                for data in dataloader:
                    self.model.zero_grad()
                    output = self.model(data[0].to(self.device))

                    ##### generate the Omega matrix for MAS #####
                    # (hint: square of the l2 norm of the output vector,
                    #  then backward and take its gradients)
                    # square every element of the output vector (in place):
                    # the first step of computing the squared L2 norm
                    output.pow_(2)
                    # sum along the feature (class) dimension to get
                    # each sample's squared L2 norm
                    loss = torch.sum(output, dim=1)
                    # average over the batch to obtain a scalar loss
                    loss = loss.mean()
                    # backward pass: computes d(||output||^2)/d(theta)
                    loss.backward()

                    for n, p in self.model.named_parameters():
                        # accumulate the absolute gradients into the Omega matrix,
                        # averaged over the total number of samples
                        # (note: p.grad.abs() rather than the squared gradients used by EWC)
                        precision_matrices[n].data += p.grad.abs() / num_data

        precision_matrices = {n: p for n, p in precision_matrices.items()}
        return precision_matrices

    def penalty(self, model: nn.Module):
        loss = 0
        for n, p in model.named_parameters():
            # MAS regularization term:
            # Omega importance matrix × (current - old parameter)²
            _loss = self._precision_matrices[n] * (p - self.p_old[n]) ** 2
            loss += _loss.sum()
        return loss
```
```python
class si(object):
    """
    @article{zenke2017continual,
        title={Continual Learning Through Synaptic Intelligence},
        author={Zenke, Friedemann and Poole, Ben and Ganguli, Surya},
        booktitle={ICML},
        year={2017},
        url={https://arxiv.org/abs/1703.04200}
    }
    """
    def __init__(self, model, dataloaders, epsilon, device):
        self.model = model
        self.dataloaders = dataloaders
        self.device = device
        # epsilon is for numerical stability (prevents division by zero)
        self.epsilon = epsilon
        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        # previous-task parameter values and the omega importance matrix
        self._n_p_prev, self._n_omega = self._calculate_importance()
        # W accumulates gradient information; p_old stores the old parameter values
        self.W, self.p_old = self._init_()

    def _init_(self):
        # initialize the W matrices (used to accumulate gradient information)
        W = {}
        p_old = {}
        for n, p in self.model.named_parameters():
            # replace dots in parameter names with double underscores
            # (avoids clashes when registering buffers)
            n = n.replace('.', '__')
            if p.requires_grad:
                W[n] = p.data.clone().zero_()
                p_old[n] = p.data.clone()
        return W, p_old

    # (the beginning of _calculate_importance is omitted in the notes;
    #  the fragment below stores the updated values and handles the first task)
            # store these new values in the model as buffers;
            # register_buffer saves/loads them with the model
            # without involving them in gradient computation
            self.model.register_buffer('{}_SI_prev_task'.format(n), p_current)
            self.model.register_buffer('{}_SI_omega'.format(n), omega_new)
        else:
            # first task: initialize everything
            for n, p in self.model.named_parameters():
                n = n.replace('.', '__')
                if p.requires_grad:
                    # use the current parameters as the previous-task parameters
                    n_p_prev[n] = p.detach().clone()
                    # initialize omega to zero
                    n_omega[n] = p.detach().clone().zero_()
                    self.model.register_buffer('{}_SI_prev_task'.format(n), p.detach().clone())

        # return the previous-task parameters and the omega importance weights
        return n_p_prev, n_omega

    def penalty(self, model: nn.Module):
        loss = 0.0
        for n, p in model.named_parameters():
            n = n.replace('.', '__')
            if p.requires_grad:
                prev_values = self._n_p_prev[n]
                omega = self._n_omega[n]
                # SI regularization term: omega × (current - previous parameter)²;
                # the larger omega, the more important the parameter and the
                # larger the penalty for changing it
                _loss = omega * (p - prev_values) ** 2
                loss += _loss.sum()
        return loss

    def update(self, model):
        # update the W matrices during training (accumulate gradient information)
        for n, p in model.named_parameters():
            n = n.replace('.', '__')
            if p.requires_grad:
                if p.grad is not None:
                    # W = W - grad × (current parameter - old parameter);
                    # accumulates the product of gradients and parameter changes,
                    # used later to compute importance
                    self.W[n].add_(-p.grad * (p.detach() - self.p_old[n]))
                    self.model.register_buffer('{}_W'.format(n), self.W[n])
                # update the stored old parameters to the current values
                self.p_old[n] = p.detach().clone()
        return
```
RWalk - Riemannian Walk
The RWalk (Riemannian Walk) algorithm improves on SI; its main innovations are:
Incorporating Fisher information: EWC's Fisher matrix is added into SI's omega computation.
A modified importance computation:

```
SI:    ω = W / (Δθ² + ε)
RWalk: ω = W / (0.5 × F × Δθ² + ε)
```
```python
class rwalk(object):
    def __init__(self, model, dataloaders, epsilon, device):
        self.model = model
        self.dataloaders = dataloaders
        self.device = device
        self.epsilon = epsilon
        self.update_ewc_parameter = 0.4
        # extract model parameters and store in a dictionary
        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        self._means = {}  # initialize the guidance matrix
        self._precision_matrices = self._calculate_importance_ewc()  # generate Fisher (F) information matrix
        self._n_p_prev, self._n_omega = self._calculate_importance()
        self.W, self.p_old = self._init_()

    def _init_(self):
        W = {}
        p_old = {}
        for n, p in self.model.named_parameters():
            n = n.replace('.', '__')
            if p.requires_grad:
                W[n] = p.data.clone().zero_()
                p_old[n] = p.data.clone()
        return W, p_old

    # (the beginning of _calculate_importance is omitted in the notes;
    #  the fragment below stores the updated values and handles the first task)
            # Store these new values in the model
            self.model.register_buffer('{}_SI_prev_task'.format(n), p_current)
            self.model.register_buffer('{}_SI_omega'.format(n), omega_new)
        else:
            for n, p in self.model.named_parameters():
                n = n.replace('.', '__')
                if p.requires_grad:
                    n_p_prev[n] = p.detach().clone()
                    n_omega[n] = p.detach().clone().zero_()
                    self.model.register_buffer('{}_SI_prev_task'.format(n), p.detach().clone())

        return n_p_prev, n_omega

    def _calculate_importance_ewc(self):
        precision_matrices = {}
        for n, p in self.params.items():
            n = n.replace('.', '__')
            # initialize the Fisher (F) matrix (filled with zeros)
            precision_matrices[n] = p.clone().detach().fill_(0)

        self.model.eval()
        if self.dataloaders[0] is not None:
            dataloader_num = len(self.dataloaders)
            number_data = sum([len(loader) for loader in self.dataloaders])
            for dataloader in self.dataloaders:
                for n, p in self.model.named_parameters():
                    n = n.replace('.', '__')
                    # exponentially decay the previously accumulated Fisher values
                    precision_matrices[n].data *= (1 - self.update_ewc_parameter)
                for data in dataloader:
                    self.model.zero_grad()
                    input = data[0].to(self.device)
                    output = self.model(input)
                    label = data[1].to(self.device)
                    # (assumption: as in the ewc class above, an NLL loss and a
                    #  backward pass are computed here to obtain the gradients)
                    loss = F.nll_loss(F.log_softmax(output, dim=1), label)
                    loss.backward()

                    for n, p in self.model.named_parameters():
                        n = n.replace('.', '__')
                        precision_matrices[n].data += self.update_ewc_parameter * p.grad.data ** 2 / number_data

        precision_matrices = {n: p for n, p in precision_matrices.items()}
        return precision_matrices

    def penalty(self, model: nn.Module):
        loss = 0.0
        for n, p in model.named_parameters():
            n = n.replace('.', '__')
            if p.requires_grad:
                prev_values = self._n_p_prev[n]
                omega = self._n_omega[n]
                #### generate the regularization term from omega and the Fisher matrix ####
                _loss = (omega + self._precision_matrices[n]) * (p - prev_values) ** 2
                loss += _loss.sum()
        return loss

    def update(self, model):
        for n, p in model.named_parameters():
            n = n.replace('.', '__')
            if p.requires_grad:
                if p.grad is not None:
                    self.W[n].add_(-p.grad * (p.detach() - self.p_old[n]))
                    self.model.register_buffer('{}_W'.format(n), self.W[n])
                self.p_old[n] = p.detach().clone()
        return
```
```python
class scp(object):
    """
    SCP (Sliced Cramer Preservation) algorithm class,
    for the catastrophic-forgetting problem in lifelong learning.
    Reference paper: https://openreview.net/forum?id=BJge3TNKwH
    """
    def __init__(self, model: nn.Module, dataloaders: list, L: int, device):
        self.model = model
        self.dataloaders = dataloaders
        self.params = {n: p for n, p in self.model.named_parameters() if p.requires_grad}
        self._state_parameters = {}
        # L: the number of randomly sampled spherical vectors
        self.L = L
        self.device = device
        # compute the importance matrix (SCP's Gamma matrix)
        self._precision_matrices = self.calculate_importance()

        for n, p in self.params.items():
            self._state_parameters[n] = p.clone().detach()

    def calculate_importance(self):
        # initialize the importance matrix dictionary (SCP's Gamma matrix)
        precision_matrices = {}
        for n, p in self.params.items():
            precision_matrices[n] = p.clone().detach().fill_(0)

        self.model.eval()
        if self.dataloaders[0] is not None:
            dataloader_num = len(self.dataloaders)
            num_data = sum([len(loader) for loader in self.dataloaders])
            for dataloader in self.dataloaders:
                for data in dataloader:
                    self.model.zero_grad()
                    output = self.model(data[0].to(self.device))
                    # (the rest of the importance computation is omitted in the notes)
```