Lecture 0
Basics of Artificial neural networks (NNs)
 primitive models – started in the 1950s
 as if a drunk neuroscientist and a programmer had chatted in a bar about their fields
 NNs are data hungry
 often other solution is way better (SVM, Dec. trees etc.)
Biological NN
 Humans have approx \(10^{11}\) neurons, \(10^4\) connections each
 Biological neurons are quiet until their electric potential is exceeded, then they “fire”
 Learning is believed to change the synaptic potential
Formal neuron
 \(x_1, ..., x_n \in \mathbb R\) are inputs

TODO:
 A single neuron divides the input space in “half” by a hyperplane
 The hyperplane is perpendicular to the weight vector
 Architecture – how the neurons are connected
 Activity – how the network transforms inputs to outputs
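A minimal numpy sketch of the formal neuron above (the weight vector and bias are illustrative, not from the slides):

```python
import numpy as np

def neuron(x, w, b):
    """Formal neuron: unit step over the inner potential w.x + b."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# The weight vector is normal to the separating hyperplane w.x + b = 0,
# so the neuron labels the two half-spaces 1 and 0.
w = np.array([1.0, 1.0])   # illustrative weight vector
b = -1.5                   # illustrative bias
print(neuron(np.array([1.0, 1.0]), w, b))  # 1 – positive side of the hyperplane
print(neuron(np.array([0.0, 0.0]), w, b))  # 0 – negative side
```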

Learning – how the weights change
 Architecture example
 Multilayer Perceptron (MLP)
 Neurons in layers
 Layers fully connected
 Activity
 State – output values of a layer
 Statespace – set of all states
 A computation is finite if there is a finite number of steps

NN is a function: \(\mathbb R ^ n \rightarrow \mathbb R ^ m\)
 Activation functions
 Unit step function \(\sigma (\xi) = \begin{cases} 1 & \xi \geq 0 \\ 0 & \xi < 0 \end{cases}\)
 (Logistic) sigmoid \(\sigma (\xi) = \frac 1 {1 + e ^ {-\lambda \xi}} \text{ where } \lambda \in \mathbb R \text{ is the steepness}\)
 Hyperbolic tangent (tanh)
 XOR with a 2-layer network
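XOR with a 2-layer network of unit-step neurons can be sketched with one possible hand-set choice of weights (OR and AND hidden units, output = OR and not AND):

```python
def step(xi):
    """Unit step activation."""
    return 1 if xi >= 0 else 0

def xor_net(x1, x2):
    """2-layer network computing XOR; weights are one possible hand-set choice."""
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1 fires for OR
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2 fires for AND
    return step(h_or - h_and - 0.5)  # output: OR and not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```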
 Learning
 Configuration – vector of all values of weights
 Weightspace – set of all (possible) configurations

Initial configuration
 Supervised learning
 Unsupervised learning
 The goal is to determine the distribution of the inputs by learning to reproduce them
Lecture 1
How strong are NNs?
Boolean functions
 Any boolean function can be replicated by a 2-layer NN
Any other function
 Split the area into convex areas
 Create tangent half-planes, “AND” them within each convex area, then “OR” the areas
or even stronger
 Split into grid of (really small) squares
 Any square is approximable with 4 neurons
 “OR” all squares
Theorem – Cybenko (1989)
 They are strong: a network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact set to arbitrary precision (universal approximation)
NN and computability
 Recurrent NN with rational weights are at least as strong as Turing machines
 Recurrent NNs with irrational weights are super-Turing (they can decide the halting problem of Turing machines)
              | Neural networks              | Classical computers
Data          | implicitly in weights        | explicit
Computations  | naturally parallel           | sequential, localized*
Precision     | imprecise                    | (typically) precise
Robustness    | robust – 90+% can be deleted | fragile
TODO          | ✔                            | ✔
How to learn NN (software)
 Tensorflow (Google) – highly optimized
 Keras – high-level API
 Theano – academic, clean, dead
Lecture 2
 \(\nabla E (w_0, ..., w_n) = \left( \frac {\partial E} {\partial w_0}, ..., \frac {\partial E} {\partial w_n} \right)\) TODO
 Why the sigmoid as the output function?
 let \(y\) be the probability of class A
 the odds: \(\frac {y} {1 - y}\)
 log it: \(\log \frac {y} {1 - y} = \vec w \cdot \vec x\)
 multiply by \(-1\): \(\log \frac {1 - y} {y} = - \vec w \cdot \vec x\)
 exponentiate: \(\frac {1 - y} {y} = e ^ {- \vec w \cdot \vec x}\)
 solve for \(y\): \(y = \frac 1 {1 + e ^ {- \vec w \cdot \vec x}}\)
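A quick numeric check that inverting the log-odds gives the sigmoid back (weights and inputs are arbitrary illustrative values):

```python
import numpy as np

w = np.array([0.7, -1.2])   # arbitrary illustrative weights
x = np.array([2.0, 0.5])
z = w @ x                   # modelled log-odds: log(y / (1 - y)) = w.x

y = 1.0 / (1.0 + np.exp(-z))       # the sigmoid from the derivation
log_odds = np.log(y / (1.0 - y))   # invert it back

print(abs(log_odds - z) < 1e-12)   # the two agree up to float error
```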
Lecture 3
MLP error function gradient – chain rule
 \[y_1 = o_1 (\xi _1 ); \xi _1 = w _{1, 2} \cdot y _2\]
 \[y_2 = o_2 (\xi _2 ); \xi _2 = w _{2, 3} \cdot y _3\]

…
 Error for k: \(E_k = \frac 1 2 (y _ 1 - d _ {k _ 1}) ^ 2\)
 Derivatives:
 \[\frac {\partial E_k} {\partial y _ 1} = y_1 - d_ {k _ 1}\]
 \[\frac {\partial E_k} {\partial w _ {1, 2}} = \frac {\partial E_k} {\partial y _ 1} \cdot \frac {\partial y _ 1} { \partial \xi _ 1} \cdot \frac {\partial \xi _ 1} {\partial w _ {1, 2}} = \frac {\partial E_k} {\partial y _ 1} \cdot \sigma ' _1 (\xi _ 1) \cdot y _ 2\]
 Generalized:
 \[\frac {\partial E _ k} {\partial w _ {j i}} = \frac {\partial E _ k} {\partial y _ j} \cdot \sigma ' _j (\xi _ j) \cdot y_i\]
 trivial for j in the last layer: \(\frac {\partial E_k} {\partial y _ 1} = y_1 - d_ {k _ 1}\)
 otherwise: \(\frac {\partial E_k} {\partial y _ j} = \sum_{r \in j ^ \rightarrow} \frac {\partial E _ k} {\partial y _r} \cdot \sigma ' _r (\xi _ r) \cdot w _{rj}\)
 Derivatives for the different functions
 logistic sigmoid: \(\sigma _j (\xi) = \frac 1 {1 + e ^ {-\lambda_j \xi}}\)
 \[\sigma ' _j (\xi _ j) = \lambda _j y_j (1 - y_j)\]
 tanh: \(\sigma _j (\xi) = \tanh (\xi)\)
 \[\sigma ' _j (\xi _ j) = 1 - y_j ^ 2\]
 Computation: (exactly in slides)
 Forward pass
 Backward pass
 Compute the derivatives for all weights
 the only place where we train the network using Gradient descent!
 The steps 1–3 are linear with respect to the size of the NN
 Basically turn the network upside down and exchange weights with \(\sigma_x ' (\xi_x) \cdot w_{xy}\)
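The forward/backward passes above can be sketched on a tiny MLP, with a numerical check of one derivative (network size, data, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda xi: 1.0 / (1.0 + np.exp(-xi))  # logistic sigmoid, lambda = 1

# Tiny MLP: 2 inputs -> 2 hidden (sigmoid) -> 1 output (sigmoid), no biases.
W1 = rng.normal(size=(2, 2))
W2 = rng.normal(size=(1, 2))
x, d = np.array([0.5, -0.3]), np.array([1.0])

def forward(W1, W2):
    h = sigma(W1 @ x)
    y = sigma(W2 @ h)
    return h, y, 0.5 * np.sum((y - d) ** 2)

# Backward pass: dE/dw_ji = dE/dy_j * sigma'_j(xi_j) * y_i, with sigma' = y(1 - y).
h, y, E = forward(W1, W2)
delta2 = (y - d) * y * (1 - y)            # dE/dxi at the output layer
gW2 = np.outer(delta2, h)
delta1 = (W2.T @ delta2) * h * (1 - h)    # propagate the "signal" back through W2
gW1 = np.outer(delta1, x)

# Numerical check of a single weight derivative (forward difference).
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p, W2)[2] - E) / eps
print(abs(num - gW1[0, 0]) < 1e-5)   # backprop agrees with the numeric estimate
```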
Lecture 4
Practical issues of SGD
 Training variables
 Size of minibatch
 Learning rate (and its decrease)
 Preprocessing of the inputs
 Weights initialization
 Output values
 Result quality check
 When to stop
 Regularization
 Metrics
Not to forget when solving our own MNIST task
 TODO?
How to process BIG data
 Use minibatches – taken in stored order they wouldn't be random
 1. Split into subsets with equal distribution
 2. Train on each subset
 3. Shuffle all data
 4. Repeat from 1.
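The epoch scheme above can be sketched as shuffle-then-split (data and batch size are illustrative; shuffling first approximates equally distributed subsets):

```python
import numpy as np

def minibatches(data, batch_size, rng):
    """One epoch: shuffle all data, then yield mini-batches in order."""
    idx = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[idx[start:start + batch_size]]

rng = np.random.default_rng(42)
data = np.arange(1000)
batches = list(minibatches(data, 128, rng))
print(len(batches))  # 8 batches (the last one is smaller)
```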
Random claims
 With higher dimension, different initializations will get…
 It’s easy to get stuck in a flat place, especially if using the Sigmoid function
 It’s typical to choose power of 2:
 The bigger, the better – theoretically
 The smaller, the better (32 to 256) – empirically
Issues
Vanishing and exploding gradients

\[\frac {\partial E_k} {\partial y _j} = \sum _ {r \in j^ \rightarrow} \frac {\partial E_k} {\partial y _r} \cdot \sigma '_r (\xi _r) \cdot w_{rj}\]
 Every layer in the back propagation multiplies by a number under 1
 The “signal” going back is getting lower and lower =>
 We may truly train just the last few layers in a very big network
Momentum
 Gradient descent ( \(\nabla E(w)\) TODO…) changes direction a lot with every step
 Solution: in every step of gradient descent, add a fraction of the previous step's direction
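The momentum update can be sketched as follows (the coefficients, the quadratic objective, and step counts are illustrative):

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.01, mu=0.9, steps=500):
    """Each step adds a fraction mu of the previous step's direction."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)  # keep part of the old direction
        w = w + v
    return w

# Minimize E(w) = w1^2 + 10*w2^2, an elongated bowl where plain SGD zig-zags.
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
w = sgd_momentum(grad, [1.0, 1.0])
print(w)  # close to the minimum at (0, 0)
```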
Learning rate
 “The single most influential hyperparameter.”
TODO: Img from the presentation
 Use scheduler
 Lower the rate
 Use different optimizer
AdaGrad
 parameter-specific learning rates, adapted to how frequently a parameter gets updated during training.

The more updates a parameter receives, the smaller the updates.
 Problem: the accumulated term \(r\) only gets bigger and bigger, so the updates shrink toward zero
RMSprop
 Like AdaGrad, but forgets the past
Adam
 Like RMSprop, but with built-in momentum
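Minimal sketches of the three update rules (hyperparameter values are the common defaults, assumed here; the toy objective is illustrative):

```python
import numpy as np

def adagrad(g, r, lr=0.1, eps=1e-8):
    r = r + g * g                      # accumulated squared gradients only grow
    return -lr * g / (np.sqrt(r) + eps), r

def rmsprop(g, r, lr=0.01, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * g * g    # moving average "forgets" the past
    return -lr * g / (np.sqrt(r) + eps), r

def adam_step(g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # built-in momentum (first moment)
    v = b2 * v + (1 - b2) * g * g      # RMSprop-like second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)  # bias correction
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize E(w) = w^2 with Adam.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    step, m, v = adam_step(2 * w, m, v, t)
    w += step
print(w)  # close to the minimum at 0
```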
Choice of (hidden) activations
 differentiability
 for gradient descent
 non-linearity
 linear ∘ linear = linear
 monotonicity
 do not add more local extrema
 “linearity”
 linear-like functions are easier (derivatives, learning speed…)
Before training
Input range not centered around 0
 Typical standardization:
 Average = 0 (subtract the mean)
 Variance = 1 (divide by the standard deviation)
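The typical standardization as code (the synthetic data is illustrative):

```python
import numpy as np

def standardize(X):
    """Shift each feature to mean 0 and scale it to variance 1."""
    mu = X.mean(axis=0)   # subtract the mean
    sd = X.std(axis=0)    # divide by the standard deviation
    return (X - mu) / sd

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=25.0, size=(1000, 3))  # inputs far from 0
Z = standardize(X)
print(np.round(Z.mean(axis=0), 6))  # [0. 0. 0.]
print(np.round(Z.std(axis=0), 6))   # [1. 1. 1.]
```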
Initial weights
 We want the inner potentials to have standard deviation around 1
 Too small \(w\): we get an (almost) linear model
 Too large \(w\): we get into the saturated regions (binary behaviour)
 \[\sigma_{\xi_j} = \sqrt \frac d 3 \cdot w\]
 Therefore: \(w \in \left( -\frac {\sqrt 3} {\sqrt d}, \frac {\sqrt 3} {\sqrt d} \right)\)
 Where \(d\) is the # of input weights

The forward pass is nice, but back propagation can explode
 How does this interfere with Dropout?
Glorot & Bengio initialization (GLOROT)
 \[\left( -\sqrt \frac 6 {m + n}, \sqrt \frac 6 {m + n} \right)\]
 Takes into account both sides of the neuron (m = # of inputs, n = # of outputs)
 Should work better even for back propagation
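Both initializations can be sketched with an empirical check that the inner potential gets standard deviation around 1 (dimensions, seed, and sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def uniform_init(d):
    """Uniform in (-sqrt(3)/sqrt(d), sqrt(3)/sqrt(d)): std of xi ends up ~1."""
    a = np.sqrt(3.0 / d)
    return rng.uniform(-a, a, size=d)

def glorot_init(m, n):
    """Glorot & Bengio: uniform in (-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    a = np.sqrt(6.0 / (m + n))
    return rng.uniform(-a, a, size=(n, m))

# Empirical check: with standardized inputs, xi = sum_i w_i x_i has std near 1.
d = 500
xi = np.array([uniform_init(d) @ rng.normal(size=d) for _ in range(2000)])
print(xi.std())  # ~1.0
```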
Output neurons
 Regression: linear or sigmoid
 Binary classification: tanh or sigmoid
 N classification: softmax
 Multiple labels per example: individual sigmoid outputs, one per class
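The two classification outputs can be sketched as follows (the logits are illustrative; subtracting the max in softmax is a standard numerical-stability trick assumed here):

```python
import numpy as np

def softmax(xi):
    """N-class output: positive values summing to 1."""
    e = np.exp(xi - xi.max())  # subtract max for numerical stability
    return e / e.sum()

sigmoid = lambda xi: 1.0 / (1.0 + np.exp(-xi))

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(np.round(p, 3))              # a distribution over 3 classes
print(p.sum())                     # 1.0

# Multi-label case: an independent sigmoid per class instead of softmax.
print(np.round(sigmoid(logits), 3))
```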
Lecture 5
Generalization
 Data has, in general, some noise in it
 We don’t want to learn the noise
Overfitting
“With four parameters I can fit an elephant, with five I can make him wiggle his trunk.” – John von Neumann
 What can we overfit with GBs of parameters?
Early stopping
 Split into
 Training set (train the network)
 Validation set (use to stop the training)
 Test set (evaluation)
 Never ever seen data (do not even look at them)
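The stopping rule based on the validation split can be sketched as follows (the patience parameter and the toy validation curve are assumptions for illustration):

```python
def early_stopping(train_step, val_error, max_epochs=100, patience=5):
    """Stop when validation error hasn't improved for `patience` epochs."""
    best, best_epoch, since = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)        # one pass over the training set
        e = val_error(epoch)     # error on the held-out validation set
        if e < best:
            best, best_epoch, since = e, epoch, 0
        else:
            since += 1
            if since >= patience:
                break            # validation error stopped improving
    return best_epoch, best

# Toy validation curve: improves until epoch 10, then the model overfits.
curve = lambda ep: (ep - 10) ** 2 + 1.0
print(early_stopping(lambda ep: None, curve))  # stops with best epoch 10
```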
Sigmoid & cross-entropy
 It's (generally) a bad idea to use sigmoid + mean squared error
 The derivative \(\sigma'\) is symmetric and small in both saturated regions, so if you are completely wrong (or completely right), the gradient is small
 Cross-entropy solves this
 The gradient is big when the prediction is completely wrong and small only when you are almost right
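A quick numeric illustration of the saturated-gradient problem (the value of \(\xi\) is illustrative):

```python
import numpy as np

sigmoid = lambda xi: 1.0 / (1.0 + np.exp(-xi))

xi, d = -8.0, 1.0      # confidently wrong: y ~ 0 but the target is 1
y = sigmoid(xi)

# Gradient of the error with respect to the inner potential xi:
mse_grad = (y - d) * y * (1 - y)   # MSE keeps the factor sigma'(xi) -> vanishes
ce_grad = y - d                    # cross-entropy cancels sigma'(xi) -> stays large

print(mse_grad)  # tiny, even though the prediction is completely wrong
print(ce_grad)   # close to -1, a strong learning signal
```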
Ensemble methods
 The idea: train several different models separately, then have all of the models vote on the output at test time
Dropout
 Include the Ensemble method into one network
 Original implementation:
 In every step of SGD, every neuron is included with probability \(\frac 1 2\)
 Modern: just randomly choose X % of the neurons of the layer
 Usually used without L2 norm
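The modern variant can be sketched as inverted dropout (the rescaling by \(1/p\) is an implementation detail assumed here; it keeps the expected activation unchanged so nothing changes at test time):

```python
import numpy as np

def dropout(h, p, rng, train=True):
    """Inverted dropout: keep each neuron with probability p, rescale by 1/p."""
    if not train:
        return h                      # at test time the full network is used
    mask = rng.random(h.shape) < p    # random subset of neurons per SGD step
    return h * mask / p

rng = np.random.default_rng(0)
h = np.ones(100000)
out = dropout(h, 0.5, rng)  # original choice: p = 1/2 per SGD step
print(out.mean())           # ~1.0, the expectation is preserved
```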
Weight decay and L2 regularization
 Push on the weights, make them prove themselves

In every step, decrease weights
 L1 norm – (probably) good for sparsity
 L2 norm – penalizes large weights, theoretically the same
 Usually used without Dropout
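The two equivalent forms can be sketched as single SGD steps (constants are illustrative; for plain SGD the decay factor plays the role of \(\varepsilon \cdot \lambda\)):

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.1, lam=0.01):
    """L2-regularized step: the lam*w penalty term shrinks every weight."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr=0.1, decay=0.001):
    """Equivalent "weight decay" form: scale weights down, then step as usual."""
    return w * (1 - decay) - lr * grad

w = np.array([1.0, -2.0])
print(sgd_step_l2(w, np.zeros(2)))     # weights shrink even with zero gradient
print(sgd_step_decay(w, np.zeros(2)))  # same shrinkage since decay = lr * lam
```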
Convolutional layers
 Extracts features, read elsewhere
Pooling
 Max-pooling: max of inputs
 L2 … # TODO
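A max-pooling sketch in numpy (2×2 windows with stride 2; the input is a toy example):

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max-pooling with stride 2: keep the max of each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [2, 1, 7, 8]])
print(max_pool2x2(x))
# [[4 1]
#  [2 8]]
```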
Dataset expansion
 Move by \(x\) pixels in all directions
 Add noise
 Distort images (especially for Health data)
 => able to create 100 000 examples from only 1000
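A toy expansion sketch for one image (shift sizes and noise level are illustrative; `np.roll` wraps around at the border, which real augmentation would handle differently):

```python
import numpy as np

def expand(img, shifts=(-1, 1), noise_std=0.05, rng=None):
    """Create shifted and noisy copies of one image."""
    rng = rng or np.random.default_rng(0)
    out = [img]
    for s in shifts:
        out.append(np.roll(img, s, axis=0))  # shift vertically
        out.append(np.roll(img, s, axis=1))  # shift horizontally
    out.append(img + rng.normal(0, noise_std, img.shape))  # additive noise
    return out

img = np.zeros((8, 8)); img[3:5, 3:5] = 1.0  # toy "image"
print(len(expand(img)))  # 6 variants from a single example
```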
