Lecture 0

Basics of Artificial Neural Networks (NNs)

  • primitive models – started in the 1950s
    • as if a drunk neuroscientist and a programmer chatted in a bar about their fields
  • NNs are data hungry
    • often another solution is way better (SVMs, decision trees, etc.)

Biological NN

  • Humans have approx \(10^{11}\) neurons, \(10^4\) connections each
  • Biological neurons are quiet until a threshold on their electric potential is exceeded, then they “fire”
  • Learning is believed to change the synaptic potential

Formal neuron

  • \(x_1, ..., x_n \in \mathbb R\) are inputs
  • \(w_1, ..., w_n \in \mathbb R\) are weights, \(w_0\) is the bias; inner potential \(\xi = w_0 + \sum_{i=1}^n w_i x_i\), output \(y = \sigma(\xi)\) for an activation function \(\sigma\)

  • A single neuron divides the input space in “half” by a hyperplane
  • The hyperplane is perpendicular to the weight vector
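
A minimal NumPy sketch of the formal neuron above (function and variable names are mine):

```python
import numpy as np

def step(xi):
    """Unit step activation: 1 iff the inner potential is non-negative."""
    return (xi >= 0).astype(float)

def sigmoid(xi, lam=1.0):
    """Logistic sigmoid with steepness lambda."""
    return 1.0 / (1.0 + np.exp(-lam * xi))

def formal_neuron(x, w, w0, activation=step):
    """Inner potential xi = w . x + w0, output y = activation(xi)."""
    xi = np.dot(w, x) + w0
    return activation(xi)

# Example: a neuron computing logical AND of two binary inputs.
print(formal_neuron(np.array([1.0, 1.0]), w=np.array([1.0, 1.0]), w0=-1.5))  # 1.0
print(formal_neuron(np.array([1.0, 0.0]), w=np.array([1.0, 1.0]), w0=-1.5))  # 0.0
```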

[Figure 1: separation of the plane by a single neuron]

  • Architecture – how the neurons are connected
  • Activity – how the network transforms inputs to outputs
  • Learning – how the weights change

  • Architecture example
    • Multilayer Perceptron (MLP)
      • Neurons in layers
      • Layers fully connected
  • Activity
    • State – output values of a layer
    • State-space – set of all states
    • A computation is finite if it takes a finite number of steps
    • NN is a function: \(\mathbb R ^ n \rightarrow \mathbb R ^ m\)

    • Activation functions
      • Unit step function: \(\sigma (\xi) = 1\) if \(\xi \geq 0\), else \(0\)
      • (Logistic) sigmoid: \(\sigma (\xi) = \frac 1 {1 + e ^ {-\lambda \cdot \xi}}\), where \(\lambda \in \mathbb R\) is the steepness
      • Hyperbolic tangent (tanh)
    • XOR is computable with 2 layers (see the sketch at the end of this lecture)

[Figure 2: NN geometry; Figure 3: NN geometry equations]

  • Learning
    • Configuration – vector of all values of weights
    • Weight-space – set of all (possible) configurations
    • Initial configuration

    • Supervised learning
    • Unsupervised learning
      • The goal is to determine the distribution of the inputs, e.g. by learning to reproduce them

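For the “XOR with 2 layers” bullet above, a hand-wired NumPy sketch (threshold units, weights chosen by hand rather than learned):

```python
import numpy as np

def step(xi):
    return (xi >= 0).astype(float)

def xor_net(x):
    """2-layer network of threshold units computing XOR."""
    # hidden layer: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # output: h1 AND NOT h2, i.e. "OR but not AND"
    return step(np.array([1.0, -1.0]) @ h - 0.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0
```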

Lecture 1

How strong are NNs?

Boolean functions

  • Any boolean function can be replicated by a 2-layer NN (sketch below)
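
One possible construction as a sketch (not necessarily the lecture’s exact one): one hidden threshold unit per input vector on which the function is 1, OR-ed together by the output unit.

```python
import numpy as np
from itertools import product

def step(xi):
    return (xi >= 0).astype(float)

def boolean_to_2layer(f, n):
    """Weights of a 2-layer threshold net replicating f: {0,1}^n -> {0,1}."""
    ones = [x for x in product([0, 1], repeat=n) if f(*x)]
    # hidden unit r fires exactly on the input vector ones[r]
    W1 = np.array([[1.0 if xi else -1.0 for xi in x] for x in ones])
    b1 = np.array([0.5 - sum(x) for x in ones])
    return W1, b1

def evaluate(W1, b1, x):
    h = step(W1 @ x + b1)           # hidden layer
    return step(np.sum(h) - 0.5)    # output unit = OR of the hidden units

# Example: 3-input parity.
W1, b1 = boolean_to_2layer(lambda a, b, c: (a + b + c) % 2, 3)
for x in product([0, 1], repeat=3):
    print(x, evaluate(W1, b1, np.array(x, dtype=float)))
```

The number of hidden units can be exponential in \(n\), which is the usual caveat of this construction.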

Any other function

  1. Split the region into convex parts
  2. Describe each convex part as an AND of half-planes (its “tangent” hyperplanes), then “OR” the parts

or even stronger

  1. Split the input space into a grid of (really small) squares
  2. Any square can be approximated with 4 neurons (worked out below)
  3. “OR” all squares that belong to the target region
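
A worked version of step 2 (my notation, with \(\sigma\) the unit step): the indicator of a small square \([a, b] \times [c, d]\) is an AND of four half-planes,

\[
h_1 = \sigma(x_1 - a),\quad h_2 = \sigma(b - x_1),\quad h_3 = \sigma(x_2 - c),\quad h_4 = \sigma(d - x_2),\qquad \text{square}(x_1, x_2) = \sigma(h_1 + h_2 + h_3 + h_4 - 3.5).
\]

The output fires exactly when all four conditions hold; step 3 is one more threshold unit, \(\sigma\left(\sum_s \text{square}_s(x_1, x_2) - 0.5\right)\).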

Theorem – Cybenko (1989)

  • They are strong: a feed-forward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact subset of \(\mathbb R^n\) with arbitrary precision

NN and computability

  • Recurrent NNs with rational weights are at least as strong as Turing machines
  • Recurrent NNs with irrational weights are super-Turing (they can decide problems that Turing machines cannot)

|              | Neural networks              | Classical computers    |
|--------------|------------------------------|------------------------|
| Data         | implicitly in weights        | explicit               |
| Computations | naturally parallel           | sequential, localized* |
| Precision    | imprecise (typically)        | precise                |
| Robustness   | robust – 90+% can be deleted | fragile                |

TODO

How to train NNs (software)

  • TensorFlow (Google) – highly optimized
    • Keras – high-level API on top of it
  • Theano – academic, clean, dead
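
A minimal Keras sketch (assumes TensorFlow 2.x; the layer sizes, optimizer, and epoch count are arbitrary illustrative choices):

```python
import tensorflow as tf

# Load MNIST and scale the pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small MLP: 784 -> 128 -> 10.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
model.evaluate(x_test, y_test)
```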

Lecture 2

  • \(E(w_0, \ldots, w_n) = \frac 1 2 \sum_k \left( y(\vec x_k, \vec w) - d_k \right)^2\); minimize by gradient descent: \(w_i \leftarrow w_i - \varepsilon \frac {\partial E} {\partial w_i}\) TODO
  • Why the sigmoid (as the output function)?
    1. let \(y\) be the probability of class A
    2. function of the odds: \(\frac {y} {1 - y}\)
    3. log it and model it linearly: \(\log \frac {y} {1 - y} = \vec w \cdot \vec x\)
    4. multiply by -1: \(\log \frac {1 - y} {y} = - \vec w \cdot \vec x\)
    5. exponentiate: \(\frac {1 - y} {y} = e ^ {- \vec w \cdot \vec x}\)
    6. solve for \(y\): \(y = \frac 1 {1 + e ^ {- \vec w \cdot \vec x}}\)

Lecture 3

MLP error function gradient – chain rule

  • \[y_1 = \sigma_1 (\xi _1 ),\quad \xi _1 = w _{1, 2} \cdot y _2\]
  • \[y_2 = \sigma_2 (\xi _2 ),\quad \xi _2 = w _{2, 3} \cdot y _3\]
  • Error for k: \(E_k = \frac 1 2 (y _ 1 - d _ {k _ 1}) ^ 2\)
  • Derivatives:
    • \[\frac {\partial E_k} {\partial y _ 1} = y_1 - d_ {k _ 1}\]
    • \[\frac {\partial E_k} {\partial w _ {1, 2}} = \frac {\partial E_k} {\partial y _ 1} \cdot \frac {\partial y _ 1} { \partial \xi _ 1} \cdot \frac {\partial \xi _ 1} {\partial w _ {1, 2}} = \frac {\partial E_k} {\partial y _ 1} \cdot \sigma ' _1 (\xi _ 1) \cdot y _ 2\]
  • Generalized:
    • \[\frac {\partial E _ k} {\partial w _ {j i}} = \frac {\partial E _ k} {\partial y _ j} \cdot \sigma ' _j (\xi _ j) \cdot y_i\]
    • trivial for j in last layer: \(\frac {\partial E_k} {\partial y _ 1} = y_1 - d_ {k _ 1}\)
    • otherwise: \(\frac {\partial E_k} {\partial y _ j} = \sum_{r \in j^\rightarrow} \frac {\partial E _ k} {\partial y _r} \cdot \sigma ' _r (\xi _ r) \cdot w _{rj}\)
  • Derivatives of the activation functions
    • logistic sigmoid: \(\sigma _j (\xi) = \frac 1 {1 + e ^ {-\lambda_j \xi}}\)
      • \[\sigma ' _j (\xi _ j) = \lambda _j y_j (1- y_j)\]
    • tanh: \(\sigma ' _j (\xi _ j) = 1 - y_j^2\)
  • Computation: (exactly in slides)
    1. Forward pass
    2. Backward pass
    3. Compute the derivatives for all weights
      • the gradient descent update of the weights uses exactly these derivatives
    • The steps 1–3 are linear with respect to the size of the NN
  • Intuition: turn the network upside down and replace each weight \(w_{xy}\) with \(\sigma_x ' (\xi_x) \cdot w_{xy}\) (see the NumPy sketch below)
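
A NumPy sketch of the forward and backward pass for a single training example, using logistic-sigmoid units (\(\lambda = 1\)) and the squared error above; the variable names and the tiny 2-3-1 architecture are mine:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def forward(x, Ws, bs):
    """Forward pass; returns the outputs y of every layer (input included)."""
    ys = [x]
    for W, b in zip(Ws, bs):
        ys.append(sigmoid(W @ ys[-1] + b))
    return ys

def backward(ys, d, Ws):
    """Backward pass: dE/dW and dE/db for E = 1/2 * ||y - d||^2."""
    dE_dy = ys[-1] - d                               # trivial for the last layer
    grads_W, grads_b = [], []
    for layer in reversed(range(len(Ws))):
        y, y_prev = ys[layer + 1], ys[layer]
        delta = dE_dy * y * (1.0 - y)                # dE/dy * sigma'(xi)
        grads_W.insert(0, np.outer(delta, y_prev))   # dE/dw_{ji} = delta_j * y_i
        grads_b.insert(0, delta)
        dE_dy = Ws[layer].T @ delta                  # propagate dE/dy one layer back
    return grads_W, grads_b

# One gradient-descent step on a tiny 2-3-1 network (random data, just to show the flow).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
x, d = np.array([0.2, -0.7]), np.array([1.0])
ys = forward(x, Ws, bs)
gW, gb = backward(ys, d, Ws)
lr = 0.1
Ws = [W - lr * g for W, g in zip(Ws, gW)]
bs = [b - lr * g for b, g in zip(bs, gb)]
```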

Lecture 4

Practical issues of SGD

  • Training variables
    • Size of minibatch
    • Learning rate (and its decay)
    • Pre-processing of the inputs
    • Weight initialization
    • Output values
  • Result quality check
    • When to stop
    • Regularization
    • Metrics

Not to forget when solving our own MNIST task

  • TODO?

How to process BIG data

  • Use minibatches – taking them in the stored order wouldn’t be random, so (loop sketched after this list):
    1. Split into subsets with equal distribution
    2. Train on each subset
    3. Shuffle all data
    4. Repeat 1.
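
A simplified sketch of this loop in NumPy (plain shuffling instead of the stratified split in step 1; `train_step` is a hypothetical callback standing in for whatever update you use per minibatch):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the data, then yield it in (roughly) equal-sized minibatches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

def train(X, y, train_step, epochs=10, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(epochs):                     # repeat: reshuffle and go again
        for X_b, y_b in minibatches(X, y, batch_size, rng):
            train_step(X_b, y_b)                # one SGD update per minibatch
```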

Random claims

  • With higher dimension, different initialization will get …
  • It’s easy to get stuck on a flat plateau, especially when using the sigmoid function
  • It’s typical to choose a power of 2 for the minibatch size:
    • The bigger, the better – theoretically
    • The smaller the better (32 to 256) – empirically

Issues

Vanishing and exploding gradients

  • \[\frac {\partial E_k} {\partial y _j} = \sum _ {r \in j^ \rightarrow} \frac {\partial E_k} {\partial y _r} \cdot \sigma '_r (\xi _r) \cdot w_{rj}\]
    • Every layer in backpropagation multiplies the gradient by a number typically below 1
    • The “signal” going back gets smaller and smaller =>
    • In a very big network we may effectively train only the last few layers

Momentum

  • Gradient descent (following \(- \nabla E(\vec w)\)) can change direction a lot from step to step
  • Solution: in every step of gradient descent, also add a fraction of the direction you were going in before (update rule below)
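
The usual update rule for the bullet above (my formulation), with learning rate \(\varepsilon\) and momentum parameter \(\alpha \in [0, 1)\):

\[
\Delta \vec w^{(t)} = -\varepsilon \cdot \nabla E\left(\vec w^{(t-1)}\right) + \alpha \cdot \Delta \vec w^{(t-1)}, \qquad \vec w^{(t)} = \vec w^{(t-1)} + \Delta \vec w^{(t)}
\]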

Learning rate

  • “The single most influential hyperparameter.”

TODO: Img from the presentation

  1. Use scheduler
  2. Lower the rate
  3. Use different optimizer

AdaGrad

  • parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training.
  • The more updates a parameter receives, the smaller the updates.

  • Problem: the accumulated squared gradient \(r\) only gets bigger and bigger, so the effective learning rate keeps shrinking

RMSprop

  • Like AdaGrad, but forgets the past

Adam

  • Like RMSprop, but with built-in momentum (all three update rules are sketched below)
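
A side-by-side sketch of the three update rules in NumPy (one step per call; the variable names and hyperparameter defaults are common conventions, not values from the lecture):

```python
import numpy as np

EPS = 1e-8  # small constant for numerical stability

def adagrad_step(w, g, r, lr=0.01):
    """AdaGrad: r accumulates squared gradients and only grows."""
    r = r + g * g
    return w - lr * g / (np.sqrt(r) + EPS), r

def rmsprop_step(w, g, r, lr=0.001, rho=0.9):
    """RMSprop: exponential moving average of squared gradients (forgets the past)."""
    r = rho * r + (1 - rho) * g * g
    return w - lr * g / (np.sqrt(r) + EPS), r

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    """Adam: RMSprop-style second moment plus a momentum-like first moment; t starts at 1."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + EPS), m, v
```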

Choice of (hidden) activations

  1. differentiability
    • for gradient descent
  2. non-linearity
    • linear ∘ linear = linear (stacking linear layers adds nothing)
  3. monotonicity
    • so we do not add more local extrema
  4. “linearity”
    • linear-like functions are easier (derivatives, learning speed, ...)

Before training

Input range not around 0

  • Typical standardization:
    1. Average = 0 (subtract the mean)
    2. Variance = 1 (divide by the standard deviation)
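
The same standardization in NumPy (statistics computed on the training set only and reused for the test set):

```python
import numpy as np

def standardize(X_train, X_test):
    """Subtract the training mean and divide by the training standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8   # avoid division by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std
```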

Initial weights

  • We want the inner potentials to have roughly unit standard deviation
  • With too small \(w\) we get an (almost) linear model
  • With too large \(w\) we get into the saturated regions (the neuron acts almost like a binary threshold)

  • For standardized inputs and weights uniform on \((-w, w)\) with \(d\) inputs, the standard deviation of the inner potential is \[\sigma_{\xi_j} = \sqrt \frac d 3 \cdot w\]
  • Therefore: draw \(w\) from \((- \frac {\sqrt 3} {\sqrt d}, \frac {\sqrt 3} {\sqrt d})\)
    • Where \(d\) is the number of input weights
  • This makes the forward pass behave nicely, but backpropagation can still explode

  • How does this interfere with Dropout?

Glorot & Bengio initialization (Xavier / Glorot)

  • \[(- \sqrt \frac 6 {m + n}, \sqrt \frac 6 {m + n})\]
  • Takes into account both sides of the neuron (\(m\) = # of inputs, \(n\) = # of outputs)
  • Should work better even for back propagation
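
Both initializations as a NumPy sketch for a weight matrix with \(m\) inputs and \(n\) outputs (function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_init(m, n):
    """Uniform on (-sqrt(3)/sqrt(d), sqrt(3)/sqrt(d)) with d = m inputs (previous section)."""
    bound = np.sqrt(3.0 / m)
    return rng.uniform(-bound, bound, size=(n, m))

def glorot_init(m, n):
    """Glorot/Xavier: uniform on (-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    bound = np.sqrt(6.0 / (m + n))
    return rng.uniform(-bound, bound, size=(n, m))
```

In Keras, `tf.keras.initializers.GlorotUniform` is the default kernel initializer for `Dense` layers.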

Output neurons

  • Regression: linear or sigmoid
  • Binary classification: tanh or sigmoid
  • Classification into one of \(N\) classes: softmax
  • Multi-label (multiple classes may apply): sigmoid outputs, one per class, decided individually

Lecture 5

Generalization

  • Data has, in general, some noise in it
  • We don’t want to learn the noise

Overfitting

“With four parameters I can fit an elephant, with five I can make him wiggle his trunk.” – John Von Neumann

  • What can we overfit with gigabytes of parameters?

Early stopping

  • Split into
    • Training set (train the network)
    • Validation set (use to stop the training)
    • Test set (evaluation)
    • Never ever seen data (do not even look at them)

Sigmoidal & cross-entropy

  • It’s (generally) a bad idea to use a sigmoidal output + mean squared error
    • The sigmoid’s derivative is nearly zero in both saturated regions, so the gradient is tiny whether you are completely right or completely wrong
  • Cross-entropy solves this (see the check below)
    • The gradient is big when the prediction is completely wrong and small only when you are almost right
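
A one-line check of the claim above: with a logistic-sigmoid output \(y = \sigma(\xi)\) and cross-entropy \(E = -d \log y - (1 - d) \log (1 - y)\),

\[
\frac{\partial E}{\partial \xi} = \frac{\partial E}{\partial y} \cdot \sigma'(\xi) = \frac{y - d}{y(1 - y)} \cdot y(1 - y) = y - d,
\]

so the sigmoid’s vanishing derivative cancels and the gradient stays large whenever the prediction is far from the target (with MSE it would be \((y - d)\, y (1 - y)\), which goes to 0 in the saturated regions).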

Ensemble methods

  • The idea: train several different models separately, then have all of the models vote on the output at test time

Dropout

  • Emulates the ensemble idea within a single network
  • Original implementation:
    • In every step of SGD, every neuron is included with probability \(\frac 1 2\)
  • Modern: just randomly drop a given percentage of the layer’s neurons (sketch below)
  • Usually used without L2 norm
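
An inverted-dropout sketch in NumPy for one layer at training time (`p_keep` is the keep probability; the rescaling keeps the expected activation unchanged, so nothing extra is needed at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, training=True):
    """Randomly zero units of the layer output h and rescale the survivors."""
    if not training:
        return h                            # use the full layer at test time
    mask = rng.random(h.shape) < p_keep     # keep each unit with probability p_keep
    return h * mask / p_keep
```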

Weight decay and L2 regularization

  • Push on the weights, make them prove themselves
  • In every step, shrink the weights a little

  • L1 norm – (probably) good for sparsity
  • L2 norm – penalizes large weights; theoretically the same effect as weight decay
    • Usually used without Dropout

Convolutional layers

  • Extracts features, read elsewhere

Pooling

  • Max-pooling: the max of the inputs
  • L2 pooling: the L2 norm (square root of the sum of squares) of the inputs

Dataset expansion

  • Move by \(x\) pixels in all directions
  • Add noise
  • Distort images (especially for Health data)
  • => able to create 100 000 examples from only 1000
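
A sketch of the shift/noise expansion for images stored as 2-D arrays in \([0, 1]\) (NumPy; the shift offsets and noise scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dx, dy):
    """Shift by (dx, dy) pixels; np.roll wraps around, which is harmless for images with empty borders."""
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def add_noise(img, scale=0.05):
    return np.clip(img + rng.normal(0.0, scale, img.shape), 0.0, 1.0)

def expand(images):
    out = []
    for img in images:
        for dx, dy in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
            out.append(add_noise(shift(img, dx, dy)))
    return np.array(out)   # 5x more (noisy, shifted) examples
```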
