Introduction to Neural Networks

# Introduction to Neural Networks
## PBHL-B385 Contemporary Regression
### Kyle Harris @KoderKow
### <a href="https://koderkow.rbind.io" class="uri">https://koderkow.rbind.io</a> Follow Along: <a href="http://bit.ly/intro-to-neural-networks" class="uri">http://bit.ly/intro-to-neural-networks</a>

---

# Overview

- [What Are Neural Networks?](#what_nn)
- [Structure of a Neuron](#neuron)
- [Visual Representation of Neural Networks](#visual_nn)
- [Layers of a NN](#layers)
- [Feedforward and Feedback Neural Networks](#feed)
- [Neural Network Application - The Faraway Way](#application)
  - [Examine the Estimated Weights](#examine_weight)
  - [Drawbacks of Our NN Model](#drawbacks)
  - [Weight Interpretation](#weight_interpretation)
  - [Improving the Fit](#improve_fit)
  - [Demonstration Wrap-Up](#wrap_up)
  - [Final Model Fit](#final_model)
- [Conclusion](#conclusion)
  
Follow Along: **http://bit.ly/intro-to-neural-networks**

---

# What Are Neural Networks?

- A Neural network, or Artificial Neural Network, is a set of algorithms, modeled loosely after the human brain to help recognize patterns

- The brain has about `$1.5 \times 10^{10}$` neurons each with 10 to 104 connections called synapses. The speed of messages between neurons is about 100 m/sec which is much slower than CPU speed

- The human brain's fastest reaction time is around 100 ms. A neuron computation time is 1–10 ms. Computation (to no surprise) is 10 times faster! That is just for one simple task!

- The original idea behind neural networks was to use a computer-based model of the human brain. We can recognize people in fractions of a second, but computers find this task diffucult. So why not make software more like the human brain?

- The brain model of connected neurons, first suggested by McCulloch and Pitts (1943)1, is too simplistic given more recent research

.footnote[
[1] [McCulloch and Pitts (1943)](http://wwwold.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node12.html)
]

---

# What Are Neural Networks?

- As with artificial intelligence and the sentient takeover, the promise of neural networks is not matched by the reality of their performance. Atleast for now..

<center>
 <figure>
 <img src="https://img.buzzfeed.com/buzzfeed-static/static/2015-04/1/17/enhanced/webdr07/anigif_enhanced-29933-1427925503-3.gif" width=20% />
 <figcaption><a href="https://www.buzzfeed.com/norbertobriceno/01101010101001001">Image Source</a></figcaption>
 </figure>
</center>

Neural networks have various purposes:
- Biological models
- Hardware implementation for adaptive contorl
- Many more!

We are inerested in the data analysis application of neural networks:
- Classification
- Clustering Methods
- Regression Methods
      
---

# Structure of a Neuron

.right-column[
<center>
 <figure>
 <img src="https://3c1703fe8d.site.internapcdn.net/newman/csz/news/800/2018/2-whyareneuron.jpg" width=80% />
 <figcaption><a href="https://medicalxpress.com/news/2018-07-neuron-axons-spindly-theyre-optimizing.html">Image Source</a></figcaption>
 </figure>
</center>
]

- *Dendrites* receive signals

- *Cell body* sums up the inputs of the signals to generate output

- *Axon terminals* is the final output2

]

.footnote[
[2] [DataCamp: Neural Network Models](https://www.datacamp.com/community/tutorials/neural-network-models-r)
]

---

# Visual Representation of Neural Networks

<center>
 <figure>
 <img src="https://cdn-images-1.medium.com/max/1600/1*UA30b0mJUPYoPvN8yJr2iQ.jpeg" width=40% />
 <figcaption><a href="https://cdn-images-1.medium.com/max/1600/1*UA30b0mJUPYoPvN8yJr2iQ.jpeg">Image Source</a></figcaption>
 </figure>
</center>

- Here we can see how a neural network resembles a neuron

- Neural networks are collections of thousands  of these simple processing units that together perform useful computations

**Inputs `$x_1, x_2, \dots, x_n$`**: independent variables

**Weights `$w_1, w_2, \dots, w_n$`:** learns the weights from the data

**Bias `$b$`:** the intercept

]

.pull-right[
**Activation Function `$f$`:** defines the output of the nueron
- *Identity Function:* linear fit
- *Sigmoid Function:* logistic fit, where `$y$` is binary
- *ReLu (rectified linear fit):* linear fit, outputs 0 for negative values
]

---

# Layers of a NN

**Input Layer:** the raw data, think of each "node" as a variable in our data

**Hidden Layer:** this is where the "black magic" happens

- Each layer is focused on learning about the data
- We can think about each layer is learning about an aspect of the data
- Larger and more complex data may require multiple hidden layers
  
--

**Output Layer:** the final output. This is generally a single output of the input(s)

.footnote[
[3] [Stack Overflow](https://stackoverflow.com/questions/35345191/what-is-a-layer-in-a-neural-network)
]

---

# Feedforward and Feedback Neural Networks

.pull-left[
<h3>Feedforward</h3>
- Signal goes from input to output
- No loops
 
.center[
<img src="https://thumbs.gfycat.com/EnviousNiftyCorydorascatfish-size_restricted.gif" />
]
]

.pull-right[
<h3>Feedback</h3>
- The neural network is recursive
- These networks go in cycles and the data goes in both directions4
.center[
<img src="https://thumbs.gfycat.com/MiniatureDependentCob-size_restricted.gif"/>
]
]

---

# Neural Network Application - The Faraway Way

```r
library(nnet)
data(ozone, package = "faraway")
```

- We will start with three variables for demonstrative purposes
- We fit a feed-forward NN with one hidden layer containing two units with a linear output unit:

```r
set.seed(2019)
nnet_model <- nnet(O3 ~ temp + ibh + ibt, ozone,
 size = 2, linout = TRUE, trace = FALSE)
```

- nnet() fits a single-hidden-layer neural network
- `formula = O3 ~ temp + ibh + ibt`: formula interface
- `data = ozone`: data where the formula variables reside
- `size = 2`: number of neurons in the hidden layer (this can be optimized)
- `linout = TRUE`: tells the model that it will have lienar output units
- `trace = FALSE`: hides the printed out optimization information

---

# NN Application

- If you repeat this, your result may differ slightly because of the random starting point of the algorithm, but you will likely get a similar result

```
## RSS Value: 21099.4
```

- The RSS of 21099.4 is almost equal to `$\sum_i(y_i - \hat{y})^2$`, so the fit is not any better than the null model

```r
sum((ozone$O3 - mean(ozone$O3))^2)
```

```
## [1] 21115.41
```

- The problem lies with the initial selection of weights. It is hard to do this well when the variables have very different scales

```r
scale_ozone <- scale(ozone)
```

- Due to the random starting point the algorithm uses it may not actually converge. We will fit the model 100 times and pick the one that has the lowest RSS

---

# NN Application

```r
set.seed(2019)
## fit 100 nn models
results <- 1:100 %>%
 map(~nnet(O3 ~ temp + ibh + ibt, scale_ozone, size = 2, linout = TRUE, trace = FALSE))
## get the index of the model with the lowest RSS
best_model_index <- results %>%
 map_dbl(~.x$value) %>%
 which.min()
## select best model
best_nn <- results[[best_model_index]]
```

<center>
<div id="htmlwidget-700e2388af74c141ff3f" style="width:504px;height:288px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-700e2388af74c141ff3f">{"x":{"data":[{"orientation":"v","width":[0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033246,0.316498347033217,0.316498347033217],"base":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"x":[88.9360355163356,89.2525338633688,89.5690322104021,89.8855305574353,90.2020289044685,90.5185272515017,90.8350255985349,91.1515239455682,91.4680222926014,91.7845206396346,92.1010189866678,92.4175173337011,92.7340156807343,93.0505140277675,93.3670123748007,93.683510721834,94.0000090688672,94.3165074159004,94.6330057629336,94.9495041099669,95.2660024570001,95.5825008040333,95.8989991510665,96.2154974980997,96.531995845133,96.8484941921662,97.1649925391994,97.4814908862326,97.7979892332659,98.1144875802991],"y":[1,0,0,0,0,0,0,0,0,3,0,0,0,0,0,7,2,0,0,6,43,5,14,4,11,0,2,0,1,1],"text":["count: 1 value: 88.93604","count: 0 value: 89.25253","count: 0 value: 89.56903","count: 0 value: 89.88553","count: 0 value: 90.20203","count: 0 value: 90.51853","count: 0 value: 90.83503","count: 0 value: 91.15152","count: 0 value: 91.46802","count: 3 value: 91.78452","count: 0 value: 92.10102","count: 0 value: 92.41752","count: 0 value: 92.73402","count: 0 value: 93.05051","count: 0 value: 93.36701","count: 7 value: 93.68351","count: 2 value: 94.00001","count: 0 value: 94.31651","count: 0 value: 94.63301","count: 6 value: 94.94950","count: 43 value: 95.26600","count: 5 value: 95.58250","count: 14 value: 95.89900","count: 4 value: 96.21550","count: 11 value: 96.53200","count: 0 value: 96.84849","count: 2 value: 97.16499","count: 0 value: 97.48149","count: 1 value: 97.79799","count: 1 value: 98.11449"],"type":"bar","marker":{"autocolorscale":false,"color":"rgba(255,0,0,0.7)","line":{"width":1.88976377952756,"color":"rgba(0,0,0,1)"}},"showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":56.5147364051474,"r":7.97011207970112,"b":49.813200498132,"l":43.8356164383562},"plot_bgcolor":"transparent","paper_bgcolor":"rgba(255,255,255,1)","font":{"color":"rgba(117,117,117,1)","family":"sans","size":15.9402241594022},"title":{"text":"Distribution of RSS from 100 Model Fits","font":{"color":"rgba(117,117,117,1)","family":"sans","size":26.5670402656704},"x":0,"xref":"paper"},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[88.3030388222692,98.7474842743655],"tickmode":"array","ticktext":["90","92","94","96","98"],"tickvals":[90,92,94,96,98],"categoryorder":"array","categoryarray":["90","92","94","96","98"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.98505603985056,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(117,117,117,1)","family":"sans","size":15.9402241594022},"tickangle":-0,"showline":true,"linecolor":"rgba(0,0,0,1)","linewidth":0.724555643609193,"showgrid":true,"gridcolor":"rgba(204,204,204,1)","gridwidth":0.724555643609193,"zeroline":false,"anchor":"y","title":{"text":"RSS","font":{"color":"rgba(102,102,102,1)","family":"sans","size":15.9402241594022}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-2.15,45.15],"tickmode":"array","ticktext":["0","10","20","30","40"],"tickvals":[0,10,20,30,40],"categoryorder":"array","categoryarray":["0","10","20","30","40"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.98505603985056,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(117,117,117,1)","family":"sans","size":15.9402241594022},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(204,204,204,1)","gridwidth":0.724555643609193,"zeroline":false,"anchor":"x","title":{"text":"Count","font":{"color":"rgba(102,102,102,1)","family":"sans","size":15.9402241594022}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":"transparent","line":{"color":"transparent","width":0.724555643609193,"linetype":"solid"},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":"rgba(255,255,255,1)","bordercolor":"transparent","borderwidth":2.06156048675734,"font":{"color":"rgba(117,117,117,1)","family":"sans","size":15.9402241594022}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false,"collaborate":false,"displaylogo":false,"modeBarButtonsToRemove":["sendDataToCloud","zoom2d","pan2d","select2d","lasso2d","zoomIn2d","zoomOut2d","autoScale2d","resetScale2d","hoverClosestCartesian","hoverCompareCartesian","toggleSpikelines",""]},"source":"A","attrs":{"8c407049fe0d":{"x":{},"type":"bar"}},"cur_data":"8c407049fe0d","visdat":{"8c407049fe0d":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
</center>

---

# NN Application

```r
best_nn
```

```
## a 3-2-1 network with 11 weights
## inputs: temp ibh ibt 
## output(s): O3 
## options were - linear output units
```

```
## Best RSS Value: 89.0786
```

- Our `best_nn` model has 11 parameters or weights (The parameters are shown on the next slide)
  
--

- For each of the parameteres there is optimization that occurs

- The surface optimization problem has multiple peaks and valleys

- The model can converge on one of these minimums

- This is why we run our model 100 times to test out multiple random starting points for our model, to hopefully find the global minimum!

---

# Examine the Estimated Weights

```r
summary(best_nn)
```

```
## a 3-2-1 network with 11 weights
## options were - linear output units 
##  b->h1 i1->h1 i2->h1 i3->h1 
##  -1.14   0.95  -0.83  -0.28 
##  b->h2 i1->h2 i2->h2 i3->h2 
##  35.90 -18.32  63.10  34.91 
##   b->o  h1->o  h2->o 
##  -1.83   4.51   0.69
```

.pull-left[
- `$b$`: bias
- `$i_x$`: input where x is the index of the variable
- `$h_y$`: hidden neuron where y is the index of the hidden neuron
- `$o$`: output
- `$i_1 \rightarrow h_1$`: refers to the link between input 1 and the first hidden neuron
- `$b \rightarrow o$`: is a one skip-layer connection, from the bias to the output
]

.pull-right[
<center>
 <figure>
 <img src="https://memegenerator.net/img/instances/68631115/neural-networks-so-hot-right-now.jpg" width = 60%/>
 <figcaption><a href="https://memegenerator.net/">Image Source</a></figcaption>
 </figure>
</center>
]

---

# Drawbacks of a NN Model

- Parameters are uninterpretable

- Not based on a probability models that expresses the structure and variation
  - No standard errors
  - Some inference is possible with bootstrapping
  
--

- We can get an `$R^2$` estimate:

```r
1 - best_nn$value / sum((scale_ozone[, 1] - mean(scale_ozone[, 1]))^2)
```

```
## [1] 0.7292443
```

- This is a similar to the additive model fit for these predictors that Faraway fits in previous chapters

---

# Weight Interpretation

- Although the NN weights may be difficult to interpret, we can get some sense of the effect of the predictors by observing the marginal effect of changes in one or more predictor as other predictors are held fixed

- Here, we vary each predictor individually while keeping the other predictors fixed at their mean values

- Because the data has been centered and scaled for the NN fitting, we need to restore the original scales

---

# Weight Interpretation

- As seen in the plots there are large discontinuities in the lines plots. This does not follow the linear trend we are expecting

- Looking back at the weights of `summary(best_nn)` we can see that some weights have extremely large values despite the scaling of the data
  - `$i_2 \rightarrow h_2 = 63.10$`
  - This means there is a lot of variability in this neuron
  - This is analogous to the collinearity problem in linear regression
  
--

- The NN is choosing these large values to optimize the fit

---

# Improving the Fit

- We can use a penalty function, as with smoothing splines, to obtain a more stable fit.

- Instead of minimizing MSE, we minimize: `$MSE + \lambda \sum\limits_{i} w_i^2$`

- We can introduce a *weight decay* to our nueral network, this is a similar approach we take with ridge regression

- Lets set `$\lambda = 0.001$` and create 100 models again

```r
set.seed(2019)
## fit 100 nn models
results_decay <- 1:100 %>%
 map(~nnet(O3 ~ temp + ibh + ibt, scale_ozone, size = 2, linout = TRUE, trace = FALSE, `decay = 0.001`))
## get the index of the model with the lowest RSS
best_decay_index <- results_decay %>%
 map_dbl(~.x$value) %>%
 which.min()
## select best model
best_decay <- results[[best_decay_index]]
```

---

# Improving the Fit

```r
best_decay$value
```

```
## [1] 91.8121
```

- Our previous value was 89.0786, our new RSS is a little bit higher. This is expected because we are sacrificing some of the fit for a more stable result

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<center>
 <figure>
 <img src="https://media1.tenor.com/images/154e8427624e163c030970a795b6f169/tenor.gif?itemid=5143620" />
 <figcaption><a href="https://tenor.com/">Image Source</a></figcaption>
 </figure>
</center>
]

---

# Demonstration Wrap-Up

```r
summary(best_decay)
```

```
## a 3-2-1 network with 11 weights
## options were - linear output units 
##  b->h1 i1->h1 i2->h1 i3->h1 
##   1.16  -0.63   0.42  -0.44 
##  b->h2 i1->h2 i2->h2 i3->h2 
##  13.09   2.46   8.71  -3.30 
##  b->o h1->o h2->o 
##  1.55 -3.93  1.28
```

- The weights of the second row are not as extreme now

- There is not a way to assess significance of any of the variables

- NN's do have interactions built in and these can be observed by the method we used before by varying two variables in our model at a time

---

# Final Model Fit

```r
set.seed(2019)
## fit 100 nn models
results <- 1:100 %>%
 map(~nnet(O3 ~ ., scale_ozone, size = 4, linout = TRUE, trace = FALSE))
## get the index of the model with the lowest RSS
best_model_index <- results_decay %>%
 map_dbl(~.x$value) %>%
 which.min()
## select best model
best_model <- results[[best_model_index]]
best_model
```

```
## a 9-4-1 network with 45 weights
## inputs: vh wind humidity temp ibh dpg ibt vis doy 
## output(s): O3 
## options were - linear output units
```

---

# Final Model Fit

```r
summary(best_model)
```

```
## a 9-4-1 network with 45 weights
## options were - linear output units 
##  b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 i9->h1 
##  -1.47   0.29  -0.36   0.40   0.35  -0.29   0.49   0.29   0.15  -0.02 
##  b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 i9->h2 
##  40.19   6.03  15.36 -17.61 -13.86  17.07 -22.75   7.20  -6.12 -22.51 
##  b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 i8->h3 i9->h3 
## -11.16  -8.76  11.96   3.34   8.17  -1.90 -11.21  17.59 -25.30  -5.20 
##  b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4 i8->h4 i9->h4 
##  38.72   6.96  13.35 -22.89 -34.75   9.49   5.97   7.99  -4.86   1.02 
##   b->o  h1->o  h2->o  h3->o  h4->o 
##  -1.47   3.60   1.03   0.53  -0.62
```

`$R^2$` estimate:

```r
1 - best_model$value / sum((scale_ozone[,1] - mean(scale_ozone))^2)
```

```
## [1] 0.8314248
```

---

# Conclusion

- NN's cannot be used for inference

- Flexible, Easy to fit to large complex data

- Can be easily over fit

- Truly a "black box", plots only give a rough idea of what is happening with our data

- Lacks the diagnostics, model selection, and theory

- Initially developed to address real life issues, not statistical issues

- "NNs can outperform their statistical competitors for some problems provided they are carefully used. However, one should not be fooled by the evocative name, as NNs are just another tool in the box."5

.footnote[
[5] [Faraway: Extending the Linear Model with R](https://www.amazon.com/Extending-Linear-Model-Generalized-Nonparametric/dp/149872096X)
]

---

# Thanks!

.center[
<figure>
 <img src="https://media.giphy.com/media/6tHy8UAbv3zgs/giphy.gif" />
 <figcaption><a href="https://giphy.com/explore/media">Image Source</a></figcaption>
</figure>
]
.pull-left[
### Source Code for Slides
- https://github.com/KoderKow/intro-to-neural-networks
]

.pull-right[
### Slides Made with Xaringan
- [R Package: Xaringan](https://github.com/yihui/xaringan)
]