# Lecture 0

## Basics of artificial neural networks (NNs)

• primitive models – started in the 1950s
• as if a drunk neuroscientist and a programmer chatted in a bar about their fields
• NNs are data hungry
• often another solution is far better (SVMs, decision trees, etc.)

### Biological NN

• Humans have approx $10^{11}$ neurons, $10^4$ connections each
• Biological neurons are quiet until their electric potential exceeds a threshold, then they “fire”
• Learning is believed to change the synaptic potential

### Formal neuron

• $x_1, ..., x_n \in \mathbb R$ are inputs
• $w_1, ..., w_n \in \mathbb R$ are weights, $w_0 \in \mathbb R$ is the bias
• inner potential: $\xi = w_0 + \sum_{i=1}^n w_i x_i$
• output: $y = \sigma(\xi)$, where $\sigma$ is an activation function

• A single neuron divides the input space in “half” by a hyperplane
• The hyperplane is perpendicular to the weight vector
• Architecture – how the neurons are connected
• Activity – how the network transforms inputs to outputs
• Learning – how the weights transform
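The formal neuron above can be sketched in NumPy; the AND-gate weights are an illustrative choice, not from the lecture:

```python
import numpy as np

def formal_neuron(x, w, w0):
    """Formal neuron: inner potential xi = w0 + w.x, unit-step output."""
    xi = w0 + np.dot(w, x)          # inner potential
    return 1 if xi >= 0 else 0      # "fire" iff the potential is exceeded

# The neuron w = (1, 1), w0 = -1.5 splits the plane by the line
# x1 + x2 = 1.5 (perpendicular to the weight vector): logical AND.
w, w0 = np.array([1.0, 1.0]), -1.5
outputs = [formal_neuron(np.array(x), w, w0) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# → [0, 0, 0, 1]
```

Moving the hyperplane (changing `w0`) or rotating it (changing `w`) is all a single neuron can learn.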

• Architecture example
• Multilayer Perceptron (MLP)
• Neurons in layers
• Layers fully connected
• Activity
• State – output values of a layer
• State-space – set of all states
• A computation is finite if there is a finite number of steps
• NN is a function: $\mathbb R ^ n \rightarrow \mathbb R ^ m$

• Activation functions
• Unit step function $s(\xi) = \begin{cases} 1 & \xi \ge 0 \\ 0 & \xi < 0 \end{cases}$
• (Logistic) sigmoid $\sigma (\xi) = \frac 1 {1 + e ^ {-\lambda \cdot \xi}} \text{ here } \lambda \in \mathbb R \text{ is the steepness}$
• Hyperbolic tangent $\tanh(\xi) = \frac {e^{\xi} - e^{-\xi}} {e^{\xi} + e^{-\xi}}$
• XOR with 2 layers

• Learning
• Configuration – vector of all values of weights
• Weight-space – set of all (possible) configurations
• Initial configuration

• Supervised learning
• Unsupervised learning
• The goal is to determine the distribution of the inputs, e.g. by learning to reproduce them


# Lecture 1

## How strong are NNs?

### Boolean functions

• Any boolean function can be computed by a 2-layer NN
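A sketch of the claim on XOR (a boolean function no single neuron can compute), built from three formal neurons in two layers; the specific weights (OR, NAND, AND) are one possible choice, not from the lecture:

```python
import numpy as np

def step(xi):
    return (xi >= 0).astype(int)

def two_layer_nn(x):
    """XOR as a 2-layer network of formal neurons:
    hidden layer computes (x1 OR x2) and NOT(x1 AND x2), output ANDs them."""
    W1 = np.array([[ 1.0,  1.0],    # OR neuron
                   [-1.0, -1.0]])   # NAND neuron
    b1 = np.array([-0.5, 1.5])
    h = step(W1 @ x + b1)
    w2, b2 = np.array([1.0, 1.0]), -1.5   # AND neuron
    return int(step(np.array([w2 @ h + b2]))[0])

xor_outputs = [two_layer_nn(np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# → [0, 1, 1, 0]
```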

### Any other function

1. Split the area into convex regions
2. Each region is an “AND” of half-planes (one neuron per boundary); then “OR” the regions

or even stronger

1. Split the input into a grid of (really small) squares
2. Any square is approximable with 4 neurons (one per edge, “AND”ed)
3. “OR” all squares

#### Theorem – Cybenko (1989)

• They are strong: a network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact set to arbitrary precision

#### NN and computability

• Recurrent NNs with rational weights are at least as strong as Turing machines
• Recurrent NNs with irrational weights are super-Turing (can decide problems undecidable by Turing machines)
|             | Neural networks              | Classical computers    |
|-------------|------------------------------|------------------------|
| Data        | implicitly in weights        | explicit               |
| Computation | naturally parallel           | sequential, localized* |
| Precision   | imprecise (typically)        | precise                |
| Robustness  | robust – 90+% can be deleted | fragile                |
TODO

## How to train NNs (software)

• TensorFlow (Google) – highly optimized
• Keras – high-level API (on top of TensorFlow)

# Lecture 2

• Gradient of the error: $\nabla E(w_0, ..., w_n) = \left( \frac {\partial E} {\partial w_0}, ..., \frac {\partial E} {\partial w_n} \right)$ – gradient descent steps against it
• Why the sigmoid as the output function? (the logistic-regression view)
1. let $y$ be the probability of class A
2. function of the odds: $\frac {y} {1 - y}$
3. model the log-odds linearly: $\log \frac {y} {1 - y} = \vec w \cdot \vec x$
4. multiply by $-1$: $\log \frac {1 - y} {y} = - \vec w \cdot \vec x$
5. remove the log: $\frac {1 - y} {y} = e ^ {- \vec w \cdot \vec x}$
6. solve for $y$: $y = \frac 1 {1 + e ^ {- \vec w \cdot \vec x}}$
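A quick numerical check of the derivation, using a scalar `z` in place of $\vec w \cdot \vec x$: applying the log-odds to the sigmoid’s output recovers `z` exactly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.73                       # stands for the linear part  w . x
y = sigmoid(z)                 # probability of class A
log_odds = math.log(y / (1 - y))
# log_odds ≈ 0.73, i.e. the log-odds are linear in w . x
```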

# Lecture 3

## Gradient of the MLP error function (chain rule)

• Error for training example $k$: $E_k = \frac 1 2 \sum_{j \in Y} (y_j - d_{kj})^2$ (sum over the output neurons $Y$)
• Derivatives:
• Generalized:
• trivial for $j$ in the output layer: $\frac {\partial E_k} {\partial y_j} = y_j - d_{kj}$
• otherwise: $\frac {\partial E_k} {\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac {\partial E_k} {\partial y_r} \cdot \sigma'_r (\xi_r) \cdot w_{rj}$ (sum over the neurons $r$ in the layer above $j$)
• Derivatives of different activation functions
• logistic sigmoid: $\sigma_j (\xi) = \frac 1 {1 + e^{-\lambda_j \xi}}$, so $\sigma'_j (\xi) = \lambda_j \, y_j (1 - y_j)$
• tanh: $\sigma'_j (\xi) = 1 - y_j^2$
• Computation: (exactly in slides)
1. Forward pass
2. Backward pass
3. Compute the derivatives for all weights
• this gradient is the only thing gradient descent needs to train the network!
• The steps 1–3 are linear with respect to the size of the NN
• Basically turn the network upside down and exchange the weights $w_{xy}$ for $\sigma'_x (\xi_x) \cdot w_{xy}$
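Steps 1–3 sketched for a tiny 2–2–1 MLP with logistic sigmoids ($\lambda = 1$, no biases); the shapes, names, and the numerical gradient check are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigma(xi):
    return 1.0 / (1.0 + np.exp(-xi))

# Tiny MLP: 2 inputs -> 2 hidden (sigmoid) -> 1 output (sigmoid)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
x, d = np.array([0.5, -0.2]), np.array([1.0])

def error_and_grad(W1, W2):
    # 1. forward pass
    xi1 = W1 @ x;  y1 = sigma(xi1)
    xi2 = W2 @ y1; y2 = sigma(xi2)
    E = 0.5 * np.sum((y2 - d) ** 2)
    # 2. backward pass: dE/dy propagated through sigma'(xi) = y (1 - y)
    dE_dy2 = y2 - d
    dE_dxi2 = dE_dy2 * y2 * (1 - y2)
    dE_dy1 = W2.T @ dE_dxi2          # sum over the layer above, weighted by w_rj
    dE_dxi1 = dE_dy1 * y1 * (1 - y1)
    # 3. derivatives for all weights
    return E, np.outer(dE_dxi1, x), np.outer(dE_dxi2, y1)

E, g1, g2 = error_and_grad(W1, W2)

# sanity check against a numerical derivative of one weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (error_and_grad(W1p, W2)[0] - E) / eps
```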

# Lecture 4

## Practical issues of SGD

• Training variables
• Size of minibatch
• Learning rate (and its decay)
• Pre-process of the inputs
• Weights initialization
• Output values
• Result quality check
• When to stop
• Regularization
• Metrics

• TODO?

## How to process BIG data

• Use minibatches – but a naive fixed split wouldn’t be random, so:
1. Split into subsets with equal distribution
2. Train on each subset
3. Shuffle all data
4. Repeat from 1.
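The loop above as a plain-Python sketch (the batch size and data are made up):

```python
import random

def epochs(data, batch_size, n_epochs):
    """Each epoch: shuffle all data, split into minibatches, train on each."""
    data = list(data)
    for _ in range(n_epochs):
        random.shuffle(data)                       # 3. shuffle all data
        for i in range(0, len(data), batch_size):  # 1.-2. split & yield each subset
            yield data[i:i + batch_size]

batches = list(epochs(range(10), batch_size=4, n_epochs=2))
# 2 epochs of 10 examples in batches of 4 -> 6 minibatches total
```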

### Random claims

• With higher dimensions, different initializations will end up in different local minima
• It’s easy to get stuck in a flat place, especially if using the Sigmoid function
• It’s typical to choose a power of 2 for the minibatch size:
• The bigger, the better – theoretically
• The smaller, the better (32 to 256) – empirically

## Issues

• Every layer in backpropagation multiplies the signal by a number below 1
• The “signal” going back gets smaller and smaller => vanishing gradient
• We may truly train just the last few layers of a very big network

### Momentum

• Gradient descent changes the direction a lot with every step ($\nabla E(\vec w)$ points across the valley, not along it)
• Solution: in every step of gradient descent, add a fraction of the previous step’s direction
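A minimal sketch of the momentum update on a toy quadratic valley; the coefficients (`lr`, `alpha`) are assumed values, not from the slides:

```python
# Gradient descent with momentum on f(w) = w1^2 + 10 * w2^2 (a narrow valley):
# each step keeps a fraction `alpha` of the previous direction, damping the zig-zag.
def grad(w):
    return [2 * w[0], 20 * w[1]]

w, velocity = [1.0, 1.0], [0.0, 0.0]
lr, alpha = 0.05, 0.9          # learning rate, momentum coefficient (assumed)
for _ in range(200):
    g = grad(w)
    velocity = [alpha * v - lr * gi for v, gi in zip(velocity, g)]
    w = [wi + vi for wi, vi in zip(w, velocity)]
# w is now close to the minimum (0, 0)
```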

### Learning rate

• “The single most influential hyperparameter.”

TODO: Img from the presentation

1. Use a scheduler
2. Lower the rate
3. Use a different optimizer

• AdaGrad: parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training

• Problem: the accumulated squared gradient $r$ only gets bigger and bigger, so the effective learning rate shrinks towards zero

#### RMSprop

• Like AdaGrad, but $r$ is an exponential moving average of the squared gradients, so it does not grow without bound

#### Adam

• Like RMSprop, but with built-in momentum
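The difference between the accumulation schemes can be sketched as follows; the update rules follow the standard AdaGrad/RMSprop formulas, with illustrative hyperparameters:

```python
# Per-parameter adaptive step: AdaGrad accumulates squared gradients into r
# (r only grows, so steps shrink forever); RMSprop decays r instead.
def adagrad_step(w, g, r, lr=0.1, eps=1e-8):
    r = r + g * g
    return w - lr * g / (r ** 0.5 + eps), r

def rmsprop_step(w, g, r, lr=0.1, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * g * g
    return w - lr * g / (r ** 0.5 + eps), r

# with a constant gradient, AdaGrad's r grows linearly, RMSprop's r saturates
ra, rr = 0.0, 0.0
for _ in range(100):
    _, ra = adagrad_step(0.0, 1.0, ra)
    _, rr = rmsprop_step(0.0, 1.0, rr)
# ra == 100.0, while rr stays just below 1
```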

## Choice of (hidden) activations

1. differentiability
2. non-linearity
• linear ∘ linear = linear
3. monotonicity
• do not add more local extrema
4. “linearity”
• linear-like functions are easier (derivatives, learning speed, ...)

## Before training

### Input range not centered around 0

• Typical standardization:
1. Average = 0 (subtract the mean)
2. Variance = 1 (divide by the standard deviation)
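The two standardization steps in NumPy (the data matrix is made up):

```python
import numpy as np

# Standardize inputs feature-wise: mean 0 (subtract the mean),
# variance 1 (divide by the standard deviation).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# each column of X_std now has mean 0 and standard deviation 1
```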

### Initial weights

• We want to keep the standard deviation of the inner potentials around 1
• With too small $w$ we get an (almost) linear model
• With too large $w$ we get into the saturated regions (almost binary outputs)

• Therefore: sample $w$ uniformly from $\left( - \frac {\sqrt 3} {\sqrt d}, \frac {\sqrt 3} {\sqrt d} \right)$, which gives variance $\frac 1 d$
• Where $d$ is the number of input weights
• The forward pass is nice, but backpropagation can explode

• How does this interfere with Dropout?

#### Glorot & Bengio initialization (Glorot)

• Takes into account both sides of the neuron ($m$ = # of inputs, $n$ = # of outputs)
• Should work better for backpropagation too
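Both initializations sketched in NumPy; the Glorot limit $\sqrt{6/(m+n)}$ is the standard uniform variant, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_init(d):
    """U(-sqrt(3)/sqrt(d), sqrt(3)/sqrt(d)) has variance 1/d (d = # of input weights)."""
    limit = np.sqrt(3.0 / d)
    return rng.uniform(-limit, limit, size=d)

def glorot_init(m, n):
    """Glorot & Bengio: variance 2/(m+n), taking both sides of the neuron into account."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

w = uniform_init(100_000)      # sample variance is close to 1/d
W = glorot_init(300, 100)      # weight matrix for a 300 -> 100 layer
```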

### Output neurons

• Regression: linear or sigmoid
• Binary classification: tanh or sigmoid
• N-class classification: softmax
• Multi-label (multiple classes at once): sigmoid outputs, each class decided individually

# Lecture 5

## Generalization

• Data has, in general, some noise in it
• We don’t want to learn the noise

### Overfitting

“With four parameters I can fit an elephant, with five I can make him wiggle his trunk.” – John von Neumann

• What can we overfit with gigabytes of parameters?

### Early stopping

• Split the data into
• Training set (train the network)
• Validation set (use to stop the training)
• Test set (evaluation)
• Never-before-seen data (do not even look at it)

### Sigmoidal & cross-entropy

• It’s (generally) a bad idea to use the sigmoid + mean squared error
• The sigmoid’s derivative is symmetric and small in both saturated regions, so if you are completely wrong (or right), the error signal is small
• Cross-entropy solves this
• The error is big when the prediction is completely wrong and small only when you are almost right
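A sketch of why the gradients differ, assuming a single sigmoidal output neuron with inner potential $\xi$; the chosen numbers are illustrative:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

# Target d = 1, but the network is confidently wrong: xi very negative, y ~ 0.
d, xi = 1.0, -10.0
y = sigmoid(xi)

# gradient of the error w.r.t. the inner potential xi
mse_grad = (y - d) * y * (1 - y)   # MSE + sigmoid: vanishes in saturation
ce_grad = y - d                    # cross-entropy + sigmoid: stays large
# |mse_grad| is tiny even though the prediction is completely wrong,
# while |ce_grad| is close to 1
```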

### Ensemble methods

• The idea: train several different models separately, then have all of the models vote on the output for test

### Dropout

• Include the ensemble method inside one network
• Original implementation: during training, drop each neuron independently with probability $\frac 1 2$ (its output is set to 0)
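A sketch of dropout, assuming drop probability 1/2 during training and test-time scaling by the keep probability (so expected activations match); names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(y, p_drop=0.5):
    """Training: drop each neuron's output independently with probability p_drop."""
    mask = rng.random(y.shape) >= p_drop
    return y * mask

def dropout_test(y, p_drop=0.5):
    """Test time: keep everything, but scale by the keep probability
    so the expected activation matches training."""
    return y * (1.0 - p_drop)

y = np.ones(10_000)
dropped = dropout_train(y)     # roughly half the outputs are zeroed
```

Each minibatch thus trains a different "thinned" sub-network, and test time averages them implicitly.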