feedforward neural network
[figure: network graph (attachment:network-graph.png, missing)]
coupling multiple units together into a network creates a complex function that is a composition of the algebraic expressions represented by the individual units. for example, the network above represents a function built by composing the units' activation functions, sketched below.
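the exact expression depends on the wiring in the (missing) figure, but as a sketch: for a network with two inputs, two hidden units (3 and 4) and one output unit (5), the composed function would look like

\hat{y} = a_5 = g\left(w_{3,5}\, g(w_{1,3} x_1 + w_{2,3} x_2) + w_{4,5}\, g(w_{1,4} x_1 + w_{2,4} x_2)\right)

where g is the activation function and w_{i,j} is the weight on the connection from unit i to unit j (bias weights omitted).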
we generally use a nonlinear activation function g at each unit; here the sigmoid (logistic) function g(x) = 1/(1 + e^{-x}) is used, whose derivative is g'(x) = g(x)(1 - g(x)).
first we apply the loss function; for simplicity the squared loss is used here. we will calculate the gradient for the network above with respect to a single training example.
an important characteristic of these gradient expressions is that they have the local derivatives as factors: the derivative of the activation at each unit and the activation of the unit feeding the weight, so the gradient for a weight in an earlier layer reuses the factors already computed for the layers after it.
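a minimal sketch of those expressions for the squared loss, matching the quantities that show up in the code below (the notation is an assumption, since the original equations are missing):

Loss = (\hat{y} - y)^2, \qquad \frac{\partial Loss}{\partial \hat{y}} = 2(\hat{y} - y)

\frac{\partial Loss}{\partial w_{j,k}} = 2(\hat{y}_k - y_k)\, g'(in_k)\, a_j \quad \text{(weight into output unit } k\text{)}

\frac{\partial Loss}{\partial w_{i,j}} = 2(\hat{y}_k - y_k)\, g'(in_k)\, w_{j,k}\, g'(in_j)\, a_i \quad \text{(one layer further back, per output unit } k\text{)}

here a_j is the activation of unit j, in_j is its unsquashed input, and g' is the derivative of the activation function; each step back through the network multiplies in one more local derivative g'(in) and one more weight, which is exactly what the recursive backpropagation code below does.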
initial common lisp implementation
this code is common lisp (it also relies on a few helper functions from the linked page, e.g. list->vector, sigmoid and vector-sum; see that page for their definitions).
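those helpers aren't defined on this page; a minimal sketch of plausible definitions, in case you want to run the code standalone (the real ones on the linked page may differ, e.g. list->vector might recurse into nested lists):

;; plausible stand-ins for the helpers used below
(defun sigmoid (x)
  "logistic function 1/(1+e^-x)"
  (/ 1.0 (+ 1.0 (exp (- x)))))

(defun sigmoid-derivative (x)
  "derivative of the sigmoid at x"
  (let ((s (sigmoid x)))
    (* s (- 1.0 s))))

(defun list->vector (list)
  "shallow conversion of a list into a simple vector"
  (coerce list 'vector))

(defun vector-sum (v)
  "sum of the elements of a sequence"
  (reduce #'+ v))

(defun vector-sub (a b)
  "element-wise difference of two sequences"
  (map 'vector #'- a b))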
;; neural network class
(defclass network ()
((weights :initarg :weights :initform nil :accessor network-weights)
(input-layer-size :initarg :input-layer-size :accessor network-input-layer-size)
(output-layer-size :initarg :output-layer-size :accessor network-output-layer-size)
(hidden-layer-sizes :initarg :hidden-layer-sizes :accessor network-hidden-layer-sizes)
(learning-rate :initform 0.0005 :initarg :learning-rate :accessor network-learning-rate)))
(defmethod network-feedforward ((nw network) x)
"pass the vector x into the network, x is treated as the input layer, the activations of the entire network are returned"
(let ((previous-layer-activations x)
(network-activations (list x))
(network-unsquashed-activations (list x)))
(loop for weights-index from 0 below (length (network-weights nw))
do (let* ((weights (elt (network-weights nw) weights-index))
(layer-activations (list->vector (make-list (length (elt weights 0)) :initial-element 0))))
(loop for i from 0 below (length previous-layer-activations)
do (let ((previous-unit-activation (elt previous-layer-activations i))
(previous-unit-weights (elt weights i)))
(setf
layer-activations
(map
'list
(lambda (weight activation)
(+ activation (* previous-unit-activation weight)))
previous-unit-weights layer-activations))))
(setf previous-layer-activations (map 'list #'sigmoid layer-activations))
(setf network-activations
(append network-activations
(list previous-layer-activations)))
(setf network-unsquashed-activations
(append network-unsquashed-activations
(list layer-activations)))))
(values (list->vector network-activations) (list->vector network-unsquashed-activations))))
(defmethod network-generate-weights ((nw network))
"generate random weights based on the sizes of the layers"
(setf
(network-weights nw)
(loop for i from 0 below (1+ (length (network-hidden-layer-sizes nw)))
collect (let* ((layer-sizes (append
(list (network-input-layer-size nw))
(network-hidden-layer-sizes nw)
(list (network-output-layer-size nw))))
(layer-size (elt layer-sizes i))
(next-layer-size (elt layer-sizes (1+ i))))
(loop for j from 0 below layer-size
collect (loop for k from 0 below next-layer-size
collect (generate-random-weight))))))
;; convert weights from nested lists to nested vector for fast access
(setf (network-weights nw) (list->vector (network-weights nw))))
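to make the indexing easier to follow, here is roughly what the weights structure ends up looking like (illustrative values; whether the inner levels stay lists or become vectors depends on how list->vector is defined):

;; e.g. with input-layer-size 3, hidden-layer-sizes (2) and output-layer-size 1,
;; (network-weights nw) is shaped like:
;;   #(((w11 w12) (w21 w22) (w31 w32))   ; 3 input units, each with a weight into the 2 hidden units
;;     ((v1) (v2)))                      ; 2 hidden units, each with a weight into the single output unit
;; so (elt (elt (elt (network-weights nw) l) i) k) is the weight from unit i in layer l to unit k in layer l+1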
(defmethod print-object ((nw network) stream)
  (print-unreadable-object (nw stream :type t)
    ;; total number of weights = sum over consecutive layer pairs of the product of their sizes
    (let ((layer-sizes (append (list (network-input-layer-size nw))
                               (network-hidden-layer-sizes nw)
                               (list (network-output-layer-size nw)))))
      (format stream "total weights: ~a"
              (loop for (layer-size next-layer-size) on layer-sizes
                    while next-layer-size
                    sum (* layer-size next-layer-size))))))
(defun generate-random-weight ()
  ;; random rational in (-1, 1)
  (/ (- (random 2000) 1000) 1000))
(defun make-network (&key input-layer-size hidden-layer-sizes output-layer-size
                     (learning-rate 0.0005)) ;; default here so a missing :learning-rate doesn't override the class initform with nil
(let ((nw (make-instance 'network
:input-layer-size input-layer-size
:hidden-layer-sizes hidden-layer-sizes
:output-layer-size output-layer-size
:learning-rate learning-rate)))
(network-generate-weights nw)
nw))
while implementing backpropagation i faced a challenge: can i simply iterate through the layers one by one, like i did in network-feedforward, and handle each in turn? at first i thought i couldn't, because i need to go through every possible path from the output layer to the input layer and update each weight along the way, and each weight's gradient depends on the gradient of the weight preceding it in the path. after a lot of thinking i realized i could just store the gradients for each matrix of weights between two layers and update that gradient matrix as i go, which would probably work, but i was more curious about how to find all the paths and operate on them in the way i described. we also have a variable number of layers to deal with.
say we did use recursion, and on each recurrence we hopped backwards into the previous layer of the network: how are we supposed to return the different units of a layer to a function call? well, we don't need to; we can just call the function once per unit in that previous layer and pass it an index saying which unit we're referring to.
(defmethod network-train ((nw network) xs ys)
"train on the given data, xs is a list of vectors, each vector is treated as an input layer to the network, and ys is a list of vectors, each vector representing the output layer corresponding to the vector in xs thats at the same index"
(loop for i from 0 below (length xs)
do (let*
((x (elt xs i))
(y (elt ys i))
(layer-index (1+ (length (network-hidden-layer-sizes nw)))))
(multiple-value-bind (activations unsquashed-activations)
(network-feedforward nw x)
(loop
for unit-index
from 0
below (network-output-layer-size nw)
do (network-back-propagate
nw
layer-index
unit-index
(* 2 (- (elt (elt activations layer-index) unit-index)
(elt y unit-index)))
activations
unsquashed-activations))))))
(defmethod network-back-propagate ((nw network) layer-index unit-index gradient activations unsquashed-activations)
"backpropagate the error through the network, each layers gradients depend on those of the layer succeeding it in the network"
(let*
((predecessor-layer-weights (elt (network-weights nw) (1- layer-index)))
(weights-into-unit
(map
'vector
(lambda (predecessor-unit-weights)
(elt predecessor-unit-weights unit-index))
predecessor-layer-weights))
(unit-activation (elt (elt activations layer-index) unit-index))
(unit-input (elt (elt unsquashed-activations layer-index) unit-index)))
(loop for predecessor-unit-index from 0 below (length predecessor-layer-weights)
do (let*
((weight (elt weights-into-unit predecessor-unit-index))
(predecessor-unit-activation
(elt (elt activations (1- layer-index))
predecessor-unit-index))
(gradient-to-back-propagate
(* gradient
(sigmoid-derivative unit-input)
weight))
(actual-gradient (* gradient
(sigmoid-derivative unit-input)
predecessor-unit-activation))
(weight-change (* (network-learning-rate nw)
actual-gradient)))
(if (> layer-index 1)
(network-back-propagate
nw
(1- layer-index)
predecessor-unit-index
gradient-to-back-propagate
activations
unsquashed-activations))
(setf
(elt (elt (elt (network-weights nw) (1- layer-index))
predecessor-unit-index)
unit-index)
(- weight weight-change))))))
let's try training on some simple data:

(defparameter nw (make-network
:input-layer-size 3
:hidden-layer-sizes '(2)
:output-layer-size 1
:learning-rate 0.01))
(loop for i from 0 below 10000
do (network-train nw '((0 0 1) (1 1 1) (1 0 1) (0 1 1)) '((0) (1) (1) (0))))
(multiple-value-bind (activations unsquashed-activations)
(network-feedforward nw '(1 1 0))
(print activations))
(multiple-value-bind (activations unsquashed-activations)
(network-feedforward nw '(0 1 0))
(print activations))
#(#(0 1 0) #(0.59563345 0.5919403) #(0.2278899))
#(#(1 1 0) #(0.84486073 0.013950213) #(0.9093194))
it correctly predicts that the output equals the first input number.
using this code, i was getting 125k weight updates in 0.01 seconds (only 1 thread, no gpu), which i think is about as far as i can get in terms of efficiency without more advanced matrix algorithms. for comparison, i tried python and got 1m simple addition operations in 0.04 seconds:
~ λ time python -c '
i=1
for i in range(999999):
i = i + 1
print(i)'
999999
python -c ' i=1; for i in range(999999): i = i + 1; print(i)' 0.04s user 0.00s system 98% cpu 0.045 total
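for a rough comparison on the lisp side, the analogous loop can be timed with time (a sketch; numbers will obviously vary by machine and implementation):

(time
 (let ((i 0))
   ;; roughly the same work as the python snippet: ~1m additions
   (loop repeat 999999
         do (incf i))
   (print i)))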
we need a way to keep track of the loss, and visual feedback would be nice as well; see common lisp graphics, which i'll also be using here.

(defmethod network-test ((nw network) xs ys)
"test the given data (collection of vectors for input/output layers), return the total loss"
(let ((loss 0))
(loop for i from 0 below (length xs)
do (let*
((x (elt xs i))
(y (elt ys i)))
(multiple-value-bind (activations unsquashed-activations)
(network-feedforward nw x)
(let* ((output-layer (elt activations (1- (length activations))))
(example-loss
  ;; squared loss: sum of squared differences (squaring the summed difference only coincides with this for a single output unit)
  (vector-sum (map 'vector (lambda (d) (* d d))
                   (vector-sub y output-layer)))))
(incf loss example-loss)))))
loss))
we can use this function to plot the loss after each epoch:

(defun feedforward-network-first-test ()
(defparameter *train-in* '((0 0 1) (1 1 1) (1 0 1) (0 1 1)))
(defparameter *train-out* '((0) (1) (1) (0)))
(defparameter *win* (make-instance 'window :width 800 :height 800))
(defparameter *nw* (make-network
:input-layer-size 3
:hidden-layer-sizes '(2)
:output-layer-size 1
:learning-rate 0.01))
(let* ((loss-history '())
(epochs 10000)
(axis (make-axis
:min-x (/ (- epochs) 10)
:max-x epochs
:min-y -0.1
:pos (make-point-2d :x 0 :y 0)
:width (window-width *win*)
:height (window-height *win*))))
(loop for i from 0 below epochs
do (network-train *nw* *train-in* *train-out*)
(let ((loss (network-test *nw* *train-in* *train-out*)))
(setf loss-history (append loss-history (list loss)))
(if (> loss (axis-max-y axis))
(setf (axis-max-y axis) (* loss 1.1)))))
;; (setf (axis-min-y axis) (* loss 0.1)))))
(let* ((loss-plot (make-discrete-plot
:points (map
'vector
(lambda (loss epoch) (make-point-2d :x epoch :y (elt loss-history epoch)))
loss-history
(loop for i from 0 below epochs by 100 collect i))))) ;; by 100 to reduce number of points to plot
(axis-add-renderable axis loss-plot)
(window-add-renderable *win* axis)
(window-run *win*))))
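to run the whole experiment (assuming the window, axis and plot helpers from the graphics page are loaded):

(feedforward-network-first-test)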
although notice that here we're testing the network on the same data we trained it with, which generally isn't a good measure of the performance of the network; this is just an example of how we can keep track of a property.
making use of gpu
before implementing this i started experimenting with cuda in common lisp. one unfortunate thing is that cl-cuda doesn't allow using arbitrary data types like you would in C with structs, so we're going to have to live with arrays of floats on the gpu.
at first i was going to implement the most straightforward approach: run every backpropagation/feedforward operation on a single gpu thread, so that with 256 gpu threads we'd be doing 256 backpropagation operations in parallel, whereas on the cpu (with a single thread) we'd be doing only 1. i thought this would definitely work, until i actually thought about it and realized: how can this possibly be a thread-safe process? each thread modifies every weight in the network while many threads are running at once; it would be a nightmare to use such an approach, and a miracle if it even worked as intended. so i had to think of something else, something thread-safe.