Tegg Sung Deep learning researcher

Log Derivative Trick

  • The trick alternates between normalized probabilities and log-probabilities.
  • The log-derivative trick is widely used to solve stochastic optimization problems.
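
For reference, the identity behind the trick can be written as follows. The notation $p(z;\theta)$ for a density over $z$ with parameters $\theta$ is an assumption made here; the notes do not fix their own symbols.

```latex
\nabla_\theta \log p(z;\theta) = \frac{\nabla_\theta p(z;\theta)}{p(z;\theta)}
\quad\Longleftrightarrow\quad
\nabla_\theta p(z;\theta) = p(z;\theta)\,\nabla_\theta \log p(z;\theta)
```

The right-hand form is what lets a gradient of a density be rewritten as an expectation under that same density.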

Score Functions

  • The score function, $\nabla_\theta \log p(z;\theta)$, is the central computation in maximum likelihood estimation (MLE), which is used in generalized linear regression, deep learning, kernel machines, dimensionality reduction, and tensor decompositions.
  • The expected value of the score is zero (used in the proof of the REINFORCE algorithm).
  • The variance of the score is the Fisher information, which determines the Cramér-Rao lower bound.
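
The two properties of the score can be checked numerically. The sketch below assumes a Gaussian $\mathcal{N}(\mu, \sigma^2)$ and takes the score with respect to the mean $\mu$; the distribution, the parameter values, and the helper name `score_gaussian_mean` are illustrative choices, not part of the original notes.

```python
import numpy as np

def score_gaussian_mean(z, mu, sigma=1.0):
    """Score of N(mu, sigma^2) with respect to mu: d/dmu log p(z; mu)."""
    return (z - mu) / sigma**2

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0                      # assumed example parameters
z = rng.normal(mu, sigma, size=200_000)   # samples from the model itself
scores = score_gaussian_mean(z, mu, sigma)

score_mean = scores.mean()     # sample estimate of E[score], expected near 0
score_var = scores.var()       # sample estimate of Var[score]
fisher_info = 1.0 / sigma**2   # closed-form Fisher information for the mean

print(score_mean, score_var, fisher_info)
```

With enough samples, `score_mean` should sit close to zero and `score_var` close to `fisher_info`.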

Score Function Estimators

  • Computing the gradient of an expectation is a recurring task in ML:
    • Posterior computation in variational inference (VI)
    • Value function and policy learning in reinforcement learning (RL)
    • Derivative pricing in computational finance
    • Inventory control in operations research
  • The gradient of the expectation of a function $f$, $\nabla_\theta \mathbb{E}_{p(z;\theta)}[f(z)]$, is difficult to compute, because the integral is typically unknown and the parameters $\theta$, with respect to which we are computing the gradient, are those of the distribution $p(z;\theta)$.
  • Moreover, we may want to compute this gradient when the function $f$ is not differentiable.
  • The score function estimator, $\mathbb{E}_{p(z;\theta)}[f(z)\,\nabla_\theta \log p(z;\theta)]$, is an unbiased estimator of the gradient.
    • The function need not be differentiable. Instead, we only need to be able to evaluate it or observe its value for a given $z$.
  • The same estimator appears under several names:
  1. Score function estimators
  2. Likelihood ratio methods
  3. Automated variational inference
  4. REINFORCE and policy gradients
  • Any gradients of the policy that correspond to high rewards are weighted higher (reinforced) by the estimator.
  • This is why the estimator was called REINFORCE; its generalization now forms the policy gradient theorem.
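
The estimator described above can be sketched for a function that has no useful derivative. The setup is assumed purely for illustration: $z \sim \mathcal{N}(\theta, 1)$ and $f$ is a step function, so that $\mathbb{E}[f(z)] = \Phi(\theta)$ and the exact gradient is the standard normal density at $\theta$. The helper name `score_function_gradient` is hypothetical.

```python
import math
import numpy as np

def f(z):
    """Non-differentiable test function: a unit step at zero."""
    return (z > 0).astype(float)

def score_function_gradient(theta, n_samples, rng):
    """Monte Carlo estimate of d/dtheta E_{z ~ N(theta, 1)}[f(z)].

    Uses only evaluations of f, never its derivative.
    """
    z = rng.normal(theta, 1.0, size=n_samples)
    score = z - theta              # d/dtheta log N(z; theta, 1)
    return np.mean(f(z) * score)

rng = np.random.default_rng(0)
theta = 0.3                        # assumed example parameter
estimate = score_function_gradient(theta, 500_000, rng)

# E[f(z)] = Phi(theta), so the exact gradient is the standard normal pdf at theta.
exact = math.exp(-0.5 * theta**2) / math.sqrt(2.0 * math.pi)

print(estimate, exact)
```

The estimate should agree with `exact` up to Monte Carlo noise, even though `f` itself is never differentiated.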

Control Variates

  • For a Monte Carlo (MC) estimator to be effective, its variance must be as low as possible. (The gradient will not be useful otherwise.)
  • Control variates are used for variance reduction in MC estimators (the baseline technique).
  • The choice of control variate is the principal challenge in the use of score function estimators.
  • Examples: constant baselines, clever sampling schemes (antithetic or stratified), delta methods, or adaptive baselines
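
A constant baseline is the simplest of these control variates. The sketch below assumes $z \sim \mathcal{N}(\theta, 1)$ and a hypothetical cost `f` with a large constant offset. Subtracting a baseline leaves the estimator unbiased, because the expected score is zero, but it can shrink the variance considerably. The baseline is estimated from separate samples here; that is one possible choice, not the only one.

```python
import numpy as np

def f(z):
    """Hypothetical cost; the +10 offset inflates the estimator's variance."""
    return (z - 1.0) ** 2 + 10.0

def per_sample_gradients(theta, n_samples, rng, baseline=0.0):
    """Per-sample score function terms (f(z) - baseline) * score."""
    z = rng.normal(theta, 1.0, size=n_samples)
    score = z - theta              # d/dtheta log N(z; theta, 1)
    return (f(z) - baseline) * score

rng = np.random.default_rng(0)
theta = 0.0                        # assumed example parameter

# Constant baseline: an estimate of E[f(z)] from independent samples.
baseline = f(rng.normal(theta, 1.0, size=10_000)).mean()

plain = per_sample_gradients(theta, 200_000, rng)
with_baseline = per_sample_gradients(theta, 200_000, rng, baseline=baseline)

exact = 2.0 * (theta - 1.0)        # d/dtheta [(theta - 1)^2 + 1 + 10]

print(plain.mean(), with_baseline.mean(), exact)
print(plain.var(), with_baseline.var())
```

Both sample means should land near `exact`, while the per-sample variance with the baseline should be much smaller.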

Families of Stochastic Estimators

  • Approaches
    • Differentiate the function $f$, using pathwise derivatives (PD), if it is differentiable
    • Differentiate the density $p(z;\theta)$, using the score function (SF)
  • Using a stochastic computation graph, the PD and SF estimators can be combined to obtain the lowest variance.
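
The two approaches can be compared on a toy problem where both apply. The sketch assumes $z \sim \mathcal{N}(\theta, 1)$, reparameterized as $z = \theta + \epsilon$, and $f(z) = z^2$, so the exact gradient is $2\theta$. These choices are illustrative only. The example does not build a stochastic computation graph; it just shows the two estimators side by side.

```python
import numpy as np

def f(z):
    return z ** 2

def df(z):
    """Derivative of f, needed only by the pathwise estimator."""
    return 2.0 * z

rng = np.random.default_rng(0)
theta, n = 1.0, 200_000            # assumed example values
eps = rng.normal(size=n)
z = theta + eps                    # reparameterization: z = theta + eps

pathwise = df(z) * 1.0             # chain rule with dz/dtheta = 1
score_fn = f(z) * (z - theta)      # f(z) * d/dtheta log N(z; theta, 1)
exact = 2.0 * theta

print(pathwise.mean(), score_fn.mean(), exact)
print(pathwise.var(), score_fn.var())
```

In this particular setup both estimators are unbiased and the pathwise one has the lower per-sample variance. That ordering is not guaranteed in general.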