
Concepts

Empirical Risk Minimization

We are interested in finding a function $f$ that minimizes the expected risk:

$$R_{TRUE}(f) = E[\ell(f(x), y)] = \int \ell(f(x), y) \, dp(x,y)$$

with the optimal function:

$$f^{*} = \argmin_{f} R_{TRUE}(f)$$

  • $R_{TRUE}(f)$ is the true risk, assuming access to an infinite set of all possible data and labels
  • In practice, the joint probability distribution $P(x,y) = P(y|x)P(x)$ is unknown and the only available information is contained in the training set

Thus, the true risk is replaced by the empirical risk, which is the average of sample losses over the training set $D$:

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

and we attempt to find a function in $\mathcal{F}$ which minimizes the empirical risk:

$$f_{n} = \argmin_{f} R_{n}(f)$$

  • $\mathcal{F}$ is a family of candidate functions
  • In the case of CNNs, this involves choosing the relevant hyperparameters, model architecture, etc.

Thus finding a function that is as close as possible to $f^{*}$ can be broken down into:

  1. Choosing a class of models that is more likely to contain the optimal function
  2. Having a large and broad range of training examples in DD to better approximate an infinite set of all possible data and labels
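
The definitions above can be sketched in a few lines: compute $R_n(f)$ as an average loss over $D$, then pick $f_n$ as the minimizer over a small candidate family $\mathcal{F}$. The function names, the squared-error loss, and the toy data are illustrative assumptions, not from the notes.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_n(f) = (1/n) * sum of per-sample losses over the training set D."""
    losses = (f(X) - y) ** 2       # l(f(x_i), y_i) with a squared-error loss (assumption)
    return losses.mean()           # average over the n samples

# Toy training set D drawn from y = 2x + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)

# A tiny candidate family F = {x -> a*x}; f_n minimizes the empirical risk over F.
candidates = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0, 3.0)]
f_n = min(candidates, key=lambda f: empirical_risk(f, X, y))
```

With enough data, $f_n$ recovers the slope closest to the true generating function, illustrating point 2 above.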

Deep transfer learning


  • Given a source domain $\mathcal{D}_S$ with learning task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ with learning task $\mathcal{T}_T$
  • Deep transfer learning aims to improve the performance of the target model $M$ on the target task $\mathcal{T}_T$ by initializing it with weights $W$
  • The weights $W$ are trained on the source task $\mathcal{T}_S$ using the source dataset $\mathcal{D}_S$ (pretraining)
  • Here $\mathcal{D}_S \neq \mathcal{D}_T$, or $\mathcal{T}_S \neq \mathcal{T}_T$
  • With deep neural networks, once the weights have been pretrained to respond to particular features in a large source dataset, they tend not to move far from their pretrained values during fine-tuning
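
A minimal sketch of the weight-transfer idea, using plain gradient descent on a linear model instead of a deep network: pretrain $W$ on the source task, initialize the target model with $W$, then fine-tune briefly. All names (`fit_linear`, the toy tasks) are illustrative assumptions.

```python
import numpy as np

def fit_linear(X, y, w_init, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error, starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X_src = rng.normal(size=(500, 3))
y_src = X_src @ np.array([1.0, -2.0, 0.5])   # source task T_S (large dataset D_S)
X_tgt = rng.normal(size=(20, 3))
y_tgt = X_tgt @ np.array([1.1, -1.9, 0.6])   # related target task T_T (small D_T)

W = fit_linear(X_src, y_src, np.zeros(3))            # pretraining on D_S
w_ft = fit_linear(X_tgt, y_tgt, W, steps=20)         # fine-tuning on D_T from W
```

Because the tasks are related, a few fine-tuning steps suffice and `w_ft` stays close to the pretrained `W`, mirroring the last bullet above.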

Datasets commonly used in transfer learning for image classification

  1. ImageNet-1K, 5K, 9K, 21K: different subsets of classes, e.g. 21K has 21,000 classes
  2. JFT: internal Google Dataset

Negative Transfer

  • If the source dataset is not well related to the target dataset, the target model can be negatively impacted by pretraining
  • Negative transfer occurs when the negative transfer gap (NTG) is positive
  • The divergence between the source and target domains, as well as the size and quality of the source and target datasets, affects the severity of negative transfer

Negative Transfer Gap (NTG)

$$NTG = \epsilon_{\tau}(\theta(S, \tau)) - \epsilon_{\tau}(\theta(\emptyset, \tau))$$

  • $\epsilon_{\tau}$ is the test error on the target domain
  • $\theta$ is the specific transfer learning algorithm
  • $\emptyset$ is the case where the source domain data/information are not used by the target domain learner
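
The definition reduces to a difference of two measured test errors; a trivial sketch (function name and error values are illustrative):

```python
def negative_transfer_gap(err_with_source: float, err_without: float) -> float:
    """NTG = eps_tau(theta(S, tau)) - eps_tau(theta(empty, tau)).

    err_with_source: target-domain test error after pretraining on source S.
    err_without:     test error of the same learner trained from scratch.
    """
    return err_with_source - err_without

# NTG > 0: pretraining on the source hurt the target task (negative transfer).
ntg = negative_transfer_gap(0.31, 0.25)
```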

Class Activation Map (CAM)

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization


  • If a certain pixel is important, the CNN will have large activations at that pixel
  • If a certain convolutional channel is important with respect to the required class, the gradients at that channel will be large
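
The two bullets combine into the Grad-CAM weighting step: pool the gradients per channel to get importances, take a weighted sum of activation maps, and keep the positive part. This numpy sketch assumes the activations `A` and gradients `dY_dA` of the last conv layer are already available (obtaining them from a real network is framework-specific and omitted).

```python
import numpy as np

def grad_cam(A, dY_dA):
    """A, dY_dA: shape (K, H, W) - K channel activation maps and their gradients."""
    # Channel importance: global-average-pool the gradients over the spatial dims.
    alpha = dY_dA.mean(axis=(1, 2))                       # shape (K,)
    # Weighted sum of activation maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(alpha, A, axes=1), 0)   # shape (H, W)
    return cam

rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))        # 8 channels of 7x7 feature maps (toy values)
g = rng.normal(size=(8, 7, 7))   # gradients of a class score w.r.t. A (toy values)
heatmap = grad_cam(A, g)         # large values mark pixels important for the class
```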

Caveats

  • Imbalanced data
    • Confusion matrix
    • The loss function (binary or categorical cross-entropy) ensures that the loss is high when misclassification is high, but under heavy imbalance the rare class contributes little to the total loss
    • Assign higher class weights to rare-class images
    • Over-sample rare-class images
    • Data augmentation
    • Transfer learning
  • The size of the object (small) within an image
    • Object detection: divide input image into smaller grid cells, then identify whether a grid cell contains the object of interest
    • Train and run inference on high-resolution images
  • Data drift
  • The number of nodes in the flatten layer
    • Typically around 500-5000 nodes
  • Image size
    • Images of objects might not lose much information when resized
    • Images of text might lose considerable information
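
For the class-weight remedy above, a common heuristic is to weight each class inversely to its frequency, normalized so the average weight over all samples is 1 (the same formula as scikit-learn's "balanced" mode). The function name and toy labels are illustrative.

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency class weights: rare classes get larger loss weights."""
    classes, counts = np.unique(labels, return_counts=True)
    # n_samples / (n_classes * count_c), so sum(count_c * w_c) == n_samples.
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

labels = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance
w = class_weights(labels)                # the rare class 1 gets the larger weight
```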