Common measures of distance (e.g., between images in nearest neighbors) include the $L_p$ distances.
From real analysis, we know a metric is a bivariate operator $d: X \times X \to \mathbb{R}$ that is nonnegative, satisfies the identity of indiscernibles, is symmetric, and obeys the triangle inequality. That is,
$d(a,b) \geq 0$
$d(a,b) = 0 \iff a = b$
$d(a,b) = d(b,a)$
$d(a,b) \leq d(a,c) + d(c,b)$
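The four axioms above can be sanity-checked numerically. This is an illustrative sketch, not a proof: it checks the axioms for the $L_1$ distance on a handful of random vectors.

```python
# Numerical sanity check of the four metric axioms for the L1 distance
# on random vectors (illustrative only -- a proof requires algebra).
import numpy as np

rng = np.random.default_rng(0)

def d(a, b):
    """L1 distance: sum of absolute componentwise differences."""
    return np.sum(np.abs(a - b))

for _ in range(100):
    a, b, c = rng.normal(size=(3, 5))
    assert d(a, b) >= 0                           # nonnegativity
    assert d(a, a) == 0                           # identity
    assert np.isclose(d(a, b), d(b, a))           # symmetry
    assert d(a, b) <= d(a, c) + d(c, b) + 1e-12   # triangle inequality
print("all four axioms hold on the sampled vectors")
```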
We call $d$ a pseudometric if it is nonnegative, symmetric, and obeys the triangle inequality (it may have $d(a,b) = 0$ for some $a \neq b$).
We call $d$ a quasimetric if it obeys all conditions except symmetry.
$L_p$ Distances
The distances we often use in vector math are the $L_p$ distances.
You can prove that $L_p$ distances are valid metrics for $p \geq 1$, like this:
Notice that for any real $p \geq 1$ that you substitute into the formula for $L_p$, nonnegativity, identity, and symmetry follow directly from the absolute values, and the triangle inequality follows from the Minkowski inequality (it fails for $p < 1$).
So, $L_p := d(a,b)_p = \lVert a - b \rVert_p = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$
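The general formula can be sketched directly in NumPy. The function name `lp_distance` and the sample vectors are made up for illustration; the formula assumes $p \geq 1$.

```python
# Sketch of the general L_p distance ||a - b||_p, assuming p >= 1.
import numpy as np

def lp_distance(a, b, p):
    """(sum_i |a_i - b_i|^p)^(1/p), the L_p (Minkowski) distance."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(lp_distance(a, b, 1))  # L1: 3 + 4 + 0 = 7.0
print(lp_distance(a, b, 2))  # L2: sqrt(9 + 16) = 5.0
```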
Distance Example 01
If we have two images and we want to perform a nearest-neighbors classification on them, we can use the $L_1$ distance, treating the images as matrices with pixels as entries. Then
$L_1 = d(a,b)_{1}$, so
$$L_1 = \lVert a - b \rVert_1 = \sum_{i=1}^{n}{|a_i - b_i|}$$
where $a_i$ and $b_i$ are the pixel values of image 1 and image 2, respectively.
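As a concrete sketch, here are two tiny made-up grayscale "images" as arrays and their entrywise $L_1$ distance:

```python
# L1 distance between two toy 2x2 grayscale "images"
# (pixel values are made up for illustration).
import numpy as np

img1 = np.array([[10, 20], [30, 40]], dtype=np.int64)
img2 = np.array([[12, 18], [33, 40]], dtype=np.int64)

# |10-12| + |20-18| + |30-33| + |40-40| = 2 + 2 + 3 + 0
l1 = np.sum(np.abs(img1 - img2))
print(l1)  # 7
```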
Distance Example 02
The Euclidean distance is $L_2$, the vector norm from linear algebra: $L_2 = \lVert a - b \rVert_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$.
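NumPy exposes this norm directly as `np.linalg.norm`, so the $L_2$ distance between two vectors is just the norm of their difference (sample vectors are made up):

```python
# Euclidean (L2) distance via the vector 2-norm of the difference.
import numpy as np

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])
print(np.linalg.norm(a - b))  # sqrt(4^2 + 3^2) = 5.0
```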
Hyperparameter Tuning
When training something like a $k$-nearest neighbors classifier, we have a choice for what $k$ is.
Tuning this $k$, as well as tuning the choice of distance metric $L_p$, is called tuning a hyperparameter.
Usually, selecting the best hyperparameter is a matter of trial and error: you try training the classifier on several hyperparameter combinations before deciding which configuration is best.
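The trial-and-error loop above can be sketched as a small grid search over $k$ and $p$. Everything here is made up for illustration: the toy dataset, the train/validation split, and the minimal `knn_predict` helper.

```python
# Grid-search sketch for k-NN hyperparameters (k and p) on a toy
# dataset with a held-out validation split. All names and data are
# illustrative, not a production implementation.
import numpy as np

def knn_predict(X_train, y_train, x, k, p):
    """Predict the majority label among the k nearest neighbors of x
    under the L_p distance."""
    dists = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy labels
X_tr, y_tr = X[:40], y[:40]                      # training split
X_val, y_val = X[40:], y[40:]                    # validation split

best = None
for k in (1, 3, 5):
    for p in (1, 2):
        preds = np.array([knn_predict(X_tr, y_tr, x, k, p) for x in X_val])
        acc = np.mean(preds == y_val)
        if best is None or acc > best[0]:
            best = (acc, k, p)
print("best (accuracy, k, p):", best)
```

In practice you would use cross-validation rather than a single split, but the shape of the loop is the same.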
Function Gradients
Recall from multivariable calculus that the gradient of a multivariate function $f$ at a point $a$ is the vector $\nabla f(a)$ whose components are the partial derivatives of $f$ with respect to each variable:
$$\nabla f(a) = \left( \frac{\partial f}{\partial x_1}(a), \dots, \frac{\partial f}{\partial x_n}(a) \right)$$
We can also define the gradient in an alternative way (this one was popular in problem sets for my real analysis class) that more clearly shows the gradient as a derivative.
Say $f$ maps a column vector $a = (a_1, \dots, a_n)^{\intercal}$ from $\mathbb{R}^n$ to $\mathbb{R}$.
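The componentwise definition can be checked numerically. As an illustrative sketch (the function $f(a) = a \cdot a$, whose gradient is $2a$, is chosen here as an example), a central-difference estimate of each partial derivative should match the analytic gradient:

```python
# Compare the analytic gradient of f(a) = a . a (which is 2a) with a
# central-difference numerical estimate, component by component.
import numpy as np

def f(a):
    return np.dot(a, a)

def numerical_grad(f, a, h=1e-6):
    """Estimate each partial derivative of f at a via (f(a + h e_i) -
    f(a - h e_i)) / (2h)."""
    g = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        g[i] = (f(a + e) - f(a - e)) / (2 * h)
    return g

a = np.array([1.0, -2.0, 0.5])
print(numerical_grad(f, a))  # approximately 2a = [2, -4, 1]
```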