Introduction to Neural Networks

Introduction

Networks are everywhere. They're an intrinsic part of our daily lives, whether we realise it or not. From the social networks we use to stay connected with friends and family, to the transportation networks that deliver our packages, to the neural networks in our brains that allow us to think and feel - networks truly shape the world around us. But what exactly is a network, and how does it function? Moreover, how does the concept of networks relate to artificial intelligence? In this post, we'll delve into the fascinating world of networks, with a particular focus on artificial neural networks and their role in modern artificial intelligence.

The Nature of Networks

At its most basic, a network is a collection of items, often called nodes or vertices, connected by lines called edges or links. The internet (Interconnected Network), for example, is a network of computers connected by physical cables and wireless connections. Social networks, on the other hand, are made up of people (or nodes) that are connected by various relationships (edges), such as friendship, family ties, or professional relationships.

However, the concept of a network goes beyond physical or social structures. Networks can represent abstract concepts and relationships too. Consider the English language: we can think of each word as a node, and create a link between two words whenever they appear next to each other in a sentence. This network of words can help us understand how language is structured, and can even be used to generate new, meaningful sentences.

Representing Networks

Networks can be represented in many different ways. For the sake of simplicity, the code in this series is intended to be easy to follow and purely illustrative. If you find yourself struggling, I'd suggest brushing up on Python a bit first. There'll be some ways to extend what I put here that you can try, to solidify your understanding (and work on your skills). Nevertheless, the Python implementations aren't essential for understanding the concepts.

Anyway, our goal is to represent the following network:

Example Neural Network

As mentioned, at their core, networks are made up of nodes that have some connection between them. Here, node A is connected to B and C. A really simple implementation in Python could just make use of adjacency lists. An adjacency list is a list of all the nodes that a certain node is adjacent (connected) to. First, however, we simply define the nodes themselves.

# Define nodes
node_A = 'A'
node_B = 'B'
node_C = 'C'

Here, we set the variable node_A to be equal to the character 'A', and we do the same for the other nodes. Now that we have our nodes, the next step is to define the network. Here we create a dictionary representing the network, with the keys being nodes and the values being the adjacency lists for those keys (nodes).

# Create the network (graph)
network = {node_A: [node_B, node_C],
           node_B: [node_A],
           node_C: [node_A]}

As we can see, A is connected to B and C, B is connected to A, and C is connected to A. This network is an example of what's known as an undirected network, meaning that the connections go both ways (if A is connected to B, then B is also connected to A).

We can print our network out using the following code:

# Print the network
for node, connections in network.items():
	print(f"Node {node} is connected to: {connections}")

This will give us the following output:

Node A is connected to: ['B', 'C']
Node B is connected to: ['A']
Node C is connected to: ['A']

Which is exactly what we wanted. Now, this method clearly isn't the best if we want to expand our networks or represent more complex ones. For example, representing the following network would be cumbersome using this method:

Example Neural Network

This is a directed network, meaning that not every connection goes both ways unless specified. A more general approach to networks would be to use object-orientation and classes. Defining a Node class and a Network class should suffice.

class Node:
    def __init__(self, name):
        self.name = name
        self.connections = set()

    def connect(self, node):
        self.connections.add(node)

    def __str__(self):
        return self.name


class Network:
    def __init__(self):
        self.nodes = set()

    def add(self, node):
        self.nodes.add(node)

    def __str__(self):
        # Build one line per node, listing the names of the nodes it connects to
        lines = []
        for node in self.nodes:
            connected = ", ".join(connected_node.name for connected_node in node.connections)
            lines.append(f"Node {node} is connected to: {connected}")
        return "\n".join(lines)

Here, we define a Node class that has name and connections variables. name is defined upon instantiation, and connections defaults to an empty set. A set is a built-in Python type that's similar to a list, but it automatically handles duplicates, and its add method can be used for adding both new and existing items. A key difference is that a set is unordered, however.
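If you haven't used sets before, this tiny (purely illustrative) snippet shows the duplicate-handling in action:

# Adding the same item twice leaves the set unchanged
connections = set()
connections.add("B")
connections.add("C")
connections.add("B")  # "B" is already present, so nothing happens

print(connections)       # {'B', 'C'} (the order isn't guaranteed)
print(len(connections))  # 2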

These classes allow for a representation that is much easier to read and follow. An example of how they're used is as follows:

# Create nodes
A = Node("A")
B = Node("B")
C = Node("C")
D = Node("D")
# Connect nodes
A.connect(B)
A.connect(C)
C.connect(A)
D.connect(C)

# Create network and add nodes to it
network = Network()
network.add(A)
network.add(B)
network.add(C)
network.add(D)

# Print the network
print(network)

Here, the nodes are created and assigned to variables, and then the network's connections are defined using the connect method. Then, we define the network itself and add each node to it in turn. (Is there an easier way to do this that doesn't require a separate add call for each node?)

The result of the print statement is:

Node A is connected to: B, C
Node B is connected to:
Node C is connected to: A
Node D is connected to: C

Which is an accurate representation of the network.

Extra tasks

  1. Network Navigation: Modify the class-based network so that you can find the shortest path between two nodes. You may find the Breadth-First-Search (BFS) or Depth-First-Search (DFS) algorithms useful to accomplish this.
  2. Network Analysis: Create your own network, which could be your own social network or a transportation system for some imaginary company. Then, write functions that compute some basic network statistics, such as the degree of each node (number of connections) and the node with the highest degree (most connections). Then, write a function to determine if the graph is connected (there is a path from any node to any other node in the graph).

Neural Networks

Drawing inspiration from our own brains, scientists have created artificial neural networks (ANNs) to solve complex problems. An ANN is a computational model based on the structure of a biological brain. It consists of interconnected artificial 'neurons' or nodes that work together to make decisions.

A neuron in an ANN receives input from other neurons, similar to how biological neurons receive signals from other neurons in the brain. Each input is given a certain weight that dictates its importance. The artificial neuron then performs some calculations on these weighted inputs and passes the result through a special function (the activation function) to produce the final output.
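To make that concrete, here's a minimal sketch of a single artificial neuron in Python with numpy. The inputs, weights, bias, and the choice of a sigmoid activation are all made up purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up signals from three upstream neurons, each with its own weight
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.6])
bias = 0.2

# Weighted sum of the inputs plus the bias...
weighted_sum = np.dot(inputs, weights) + bias
# ...passed through the activation function to give the neuron's output
output = sigmoid(weighted_sum)
print(output)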

Much like the human brain, the power of neural networks lies in their ability to learn and adapt. This makes them a key component of machine learning and AI, where the goal is to learn from experience and improve over time.

Architecture of a Neural Network

The architecture of a neural network is composed of three main types of layers: the input layer, hidden layers, and the output layer. The input layer receives the raw data that feeds into the network. This could be anything: from the pixel values of an image to the frequency of words in a text document. The hidden layers, which may number anywhere from one to many, perform the bulk of the computation. Here, the data is processed and transformed, with each successive layer extracting and building upon the patterns and features identified in the previous layer. Lastly, the output layer provides the final output or prediction of the network.

Example Neural Network
Diagram of a neural network.

The data flows from the input layer, through the hidden layers, to the output layer in a process called forward propagation. The network's prediction is based on the final output, and the accuracy of this prediction hinges largely on the weights and biases of the perceptrons, which initially start out as random values.

Training Neural Networks

Training a neural network is essentially a process of fine-tuning these weights and biases to improve the network's predictions. This begins with the network making an initial prediction based on the input data and the initial random weights and biases. The accuracy of this prediction is measured using a loss function, which calculates the difference or 'loss' between the network's prediction and the actual output.

Armed with this measure of its performance, the network then adjusts its weights and biases to minimise this loss in a process called back-propagation. During back-propagation, the error is passed back through the network from the output layer to the input layer. As this error is propagated, each perceptron's weights and biases are updated in a manner that minimises the overall error of the network's output.

This cycle of making predictions, computing the loss, and back-propagating the error is repeated many times over multiple epochs, which are complete passes through the dataset. Each cycle incrementally improves the network's performance, ultimately yielding a trained neural network capable of making accurate predictions.

Seems easy?

While the aforementioned explanation provides a broad-brush picture of how neural networks learn, it does gloss over some of the complex, underlying mechanics that make this learning possible. This simplified picture is certainly useful for gaining an initial understanding, and it might even suffice for some practical purposes. However, if we truly want to master the art and science of training neural networks, it's beneficial to delve deeper and look closely at the intricate dance of numbers that occurs beneath the surface.

The process of training a neural network is fundamentally an optimisation problem grounded in calculus and linear algebra. At the heart of this problem are several key mathematical concepts: matrix multiplication, activation functions, loss functions, gradients, backpropagation, gradient descent, and regularisation. Each of these concepts plays a crucial role in how a neural network learns from data.

In the upcoming subsections, each of these concepts will be explained in detail in order to fully comprehend the process of training a neural network. I'll warn you: if your understanding of calculus and linear algebra (or maths in general) is shaky, some aspects might be challenging to follow. However, for the sake of thoroughness, and to truly understand the underpinnings of these overarching explanations, it's essential to touch upon the fundamental, intricate structures.

Forward Propagation

The process begins with forward propagation, where the network makes a prediction based on the current weights and biases. We start with the input layer, where we feed in our data. The data then travels through the hidden layers (if any), with each layer applying weights, adding biases, and passing the result through an activation function. This continues until the output layer is reached, which gives the final prediction of the network.

Forward propagation makes use of matrix multiplication. In each layer, the output of each neuron is computed from a matrix multiplication of the inputs to that layer (which could be the raw input data for the input layer or the outputs from the previous layer) and the weights of the connections to that neuron. If we denote the inputs as a matrix $X$, the weights as a matrix $W$, and the outputs as a matrix $Y$, the computation can be summarised as $Y = X \cdot W$.
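As a rough numpy sketch (the shapes here are arbitrary, chosen only to illustrate the idea):

import numpy as np

# 4 samples with 3 features each, feeding a layer of 2 neurons
X = np.random.rand(4, 3)   # inputs to the layer
W = np.random.rand(3, 2)   # one weight per (input, neuron) pair
b = np.zeros(2)            # one bias per neuron in the layer

Y = X @ W + b               # Y = X · W, with the bias added for each neuron
print(Y.shape)              # (4, 2): one output per sample, per neuron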

Activation Functions

Once we have the result of the matrix multiplication and bias addition, we pass it through an activation function. Activation functions introduce non-linearity into the network, allowing it to learn more complex relationships.

The most commonly used activation functions are the sigmoid function, which squashes its input to a value between $0$ and $1$; the hyperbolic tangent function, which squashes its input to a value between $-1$ and $1$; and the ReLU (Rectified Linear Unit) function, which sets all negative inputs to $0$ and leaves positive inputs unchanged.
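All three are simple to write with numpy; here's a quick sketch (not any library's reference implementation):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes input into the range (0, 1)

def tanh(z):
    return np.tanh(z)             # squashes input into the range (-1, 1)

def relu(z):
    return np.maximum(0, z)       # negative inputs become 0, positives pass through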

The choice of activation function can have a significant impact on the performance of the network and is usually determined by trial and error.

Calculating Loss

Once we have the network's prediction, we need to quantify how good (or bad) it is. To do this, we use a loss function (also called a cost function or objective function). The loss function measures the difference between the network's prediction and the actual value. For example, a common loss function for regression tasks (predicting a continuous value) is the Mean Squared Error (MSE), which calculates the average of the squares of the differences between the predicted and actual values. For MSE, the loss $L$ can be calculated by the function:

$$L(y, \hat{y}) = \frac{1}{N} \sum^N_{i=1}(y_i-\hat{y}_i)^2$$

Where $y$ is the actual value, $\hat{y}$ is the value that the neural network predicted, and $N$ is the number of data points.
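For instance, computing the MSE for a handful of made-up predictions:

import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])       # actual values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])    # the network's predictions

mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.375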

Backward Propagation

Now, we need to adjust the weights and biases to decrease the loss. To figure out how to adjust them, we need to know how much each weight and bias contributes to the loss. This is done through a process called backward propagation, or backpropagation.

Backpropagation involves calculating the gradient of the loss function $L$ with respect to each weight $w$ and bias $b$ in the network. The gradient is a vector that points in the direction of the steepest increase of the function, so the negative gradient points in the direction of the steepest decrease. Thus, by subtracting the gradient from the weights and biases, we can adjust them in the direction that decreases the loss. Or, for each weight $w_{ij}$ in the network, the gradient of the loss function with respect to $w_{ij}$ is calculated as:

$$\frac{\partial L}{\partial w_{ij}}=\sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial w_{ij}}$$

Calculating the gradient involves using the chain rule from calculus, as the loss is a function of the weights and biases through multiple layers of the network. The calculations start at the output layer and work backward through the hidden layers, hence the name 'backpropagation'.
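To make the chain rule a little more tangible, here's a sketch of the gradient calculation for a toy single-layer network with a sigmoid activation and MSE loss. The data and shapes are invented for illustration; a real network repeats this calculation layer by layer, working backwards:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy single-layer model: y_hat = sigmoid(X · W), compared to targets y with MSE
X = np.random.rand(4, 3)   # 4 samples, 3 features
W = np.random.rand(3, 1)   # weights for a single output neuron
y = np.random.rand(4, 1)   # made-up targets

z = X @ W
y_hat = sigmoid(z)
loss = np.mean((y - y_hat) ** 2)

# Chain rule: dL/dW = dL/dy_hat * dy_hat/dz * dz/dW
dL_dy_hat = 2 * (y_hat - y) / len(y)     # derivative of the MSE term
dy_hat_dz = y_hat * (1 - y_hat)          # derivative of the sigmoid
dL_dW = X.T @ (dL_dy_hat * dy_hat_dz)    # dz/dW brings the inputs back in

print(dL_dW.shape)  # (3, 1): one gradient entry per weight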

Updating Weights and Biases

Once we have the gradients, we can update the weights and biases. This is done using an optimisation algorithm, the simplest and most commonly used of which is gradient descent.

In gradient descent, each weight and bias is updated by subtracting a fraction of the corresponding gradient. The size of this fraction is determined by the learning rate, a hyperparameter that controls how big a step we take. If the learning rate is too large, we may overshoot the minimum of the loss function; if it's too small, training may be slow or get stuck in a suboptimal solution. Formalising this, gradient descent can be written as:

$$\begin{aligned} w_{ij} &= w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}} \\ b_i &= b_i - \alpha \frac{\partial L}{\partial b_i} \end{aligned}$$
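In code, a single gradient descent step is just a subtraction; a minimal sketch (the shapes, gradients, and learning rate are all made up here):

import numpy as np

learning_rate = 0.1            # alpha, chosen arbitrarily

W = np.random.rand(3, 1)       # current weights (toy shape)
b = np.zeros(1)                # current bias
dL_dW = np.random.rand(3, 1)   # gradients from backpropagation (invented here)
dL_db = np.random.rand(1)

W = W - learning_rate * dL_dW  # w := w - α · ∂L/∂w
b = b - learning_rate * dL_db  # b := b - α · ∂L/∂b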

Epochs and Batches

This process of forward propagation, calculating loss, backpropagation, and updating weights and biases constitutes one iteration of training. However, usually we need to go through this process many times for the network to learn effectively. An epoch is one complete pass through the entire training dataset, and training typically consists of multiple epochs.

Often, instead of calculating the loss over the entire training dataset at once, we calculate it over smaller batches of data (hence the term 'batch size'). This makes the training process less computationally intensive and can also help the network escape from suboptimal solutions.
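Structurally, the training loop often ends up looking something like this sketch, where the data is made up and the actual training step is left as a placeholder:

import numpy as np

X_train = np.random.rand(200, 2)   # invented training inputs
y_train = np.random.rand(200, 1)   # invented training targets

num_epochs = 10
batch_size = 32

for epoch in range(num_epochs):
    # Shuffle once per epoch so each pass sees the batches in a different order
    indices = np.random.permutation(len(X_train))
    for start in range(0, len(X_train), batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_train[batch], y_train[batch]
        # A real network would do a forward pass, compute the loss,
        # backpropagate, and update its weights using just this batch
        pass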

Regularisation

Finally, it's worth mentioning that neural networks have a tendency to overfit the training data if they have too many parameters (i.e., they are too complex). Overfitting is when the network learns the training data so well that it performs poorly on unseen data.

To prevent overfitting, we can use techniques such as regularisation, which involves adding a term to the loss function that penalizes complex models, and dropout, which randomly 'drops out' (sets to zero) a fraction of the neurons in the hidden layers during training. The two most common forms of regularisation are L1 and L2 regularisation.

L1 regularisation: $L_{reg} = L + \lambda \sum_{i,j} \vert w_{ij} \vert$

L2 regularisation: $L_{reg} = L + \frac{\lambda}{2} \sum_{i,j} w^2_{ij}$

Here, $L$ is the original loss function, $w_{ij}$ are the weights in the network, and $\lambda$ is the regularisation parameter, which determines the strength of the regularisation.
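In code, adding either penalty to an already-computed loss might look like this sketch (the weights, loss value, and lambda are all made up):

import numpy as np

W = np.random.rand(3, 2)   # the network's weights (toy shape)
loss = 0.42                # some already-computed loss value
lam = 0.01                 # regularisation strength, chosen arbitrarily

l1_loss = loss + lam * np.sum(np.abs(W))       # L1: penalises absolute weight values
l2_loss = loss + (lam / 2) * np.sum(W ** 2)    # L2: penalises squared weight values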

Putting it all together

A simplified diagram can help piece these different features together into the cohesive structure that a neural network follows.

Example layer of a neural network
Diagram of an example layer of a basic neural network.

In this diagram, the inputs are labelled $x_i$, the weights are labelled $w_i$, $\Sigma$ is a function that computes the weighted sum (the matrix multiplication we discussed under forward propagation), $b$ is the bias for the layer, $\varphi$ is the activation function, and $y$ is the output.

Types of Neural Networks

In practice, neural networks have evolved into several different types that specialise in different things. It's important to have a brief overview of these; however, each one is worthy of its own series, so I'll just give a quick explanation.

Feedforward Neural Networks (FNNs): These are the simplest type of artificial neural network. Information in an FNN only travels forward, from the input layer, through any hidden layers, and finally to the output layer. There are no loops in the network - information is always fed forward, never back.

Convolutional Neural Networks (CNNs): CNNs are especially good at processing grid-like data, such as images. They utilize 'convolutional' layers that slide small windows, or 'filters', over the input data to detect features, such as edges and shapes in an image.

Recurrent Neural Networks (RNNs): RNNs excel at processing sequential data, like time series or text. They use loops to allow information to be passed from one step in the sequence to the next, effectively giving the network a form of memory.

Long Short-Term Memory Networks (LSTMs): LSTMs are a specific type of RNN that are much better at remembering long-term dependencies, making them useful for tasks like language modeling, where each word in a sentence depends not just on the previous word, but potentially on words much earlier in the sentence.

Generative Adversarial Networks (GANs): The network we're discussing in depth in this series, GANs are a more advanced type of network that we will discuss in more detail in future posts. Briefly, they consist of two networks, a generator and a discriminator, that compete against each other. The generator tries to create fake data to fool the discriminator, while the discriminator tries to get better at distinguishing real data from fake.

Representing Neural Networks

In this section, we'll implement a simple neural network in Python and train it on a synthetic dataset to do binary classification. We'll also visualize the decision boundary created by our neural network.

Firstly, we're going to need some imports:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

Here, we import numpy for numerical computations, matplotlib.pyplot for data visualization, and make_moons from sklearn.datasets to create a synthetic dataset for our binary classification problem.

Next, we define the sigmoid and sigmoid_derivative functions:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

The sigmoid function is used as the activation function in our neural network, and sigmoid_derivative is its derivative, written in terms of the sigmoid's output (so it expects the already-activated value rather than the raw input). We'll use it during the backpropagation process to update the network's weights.

The bulk of our code is contained within the NeuralNetwork class:

class NeuralNetwork:
    def __init__(self, input_nodes, hidden_nodes, output_nodes):
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights with random values
        self.weights_input_to_hidden = np.random.rand(self.input_nodes, self.hidden_nodes)
        self.weights_hidden_to_output = np.random.rand(self.hidden_nodes, self.output_nodes)

    def forward_pass(self, X):
        # Compute hidden layer outputs
        self.hidden_layer_input = np.dot(X, self.weights_input_to_hidden)
        self.hidden_layer_output = sigmoid(self.hidden_layer_input)

        # Compute output layer outputs
        self.output_layer_input = np.dot(self.hidden_layer_output, self.weights_hidden_to_output)
        self.output_layer_output = sigmoid(self.output_layer_input)

        return self.output_layer_output

    def backward_pass(self, X, y, output):
        # Compute output error
        self.output_layer_error = y - output
        self.output_layer_delta = self.output_layer_error * sigmoid_derivative(output)

        # Compute hidden layer error
        self.hidden_layer_error = self.output_layer_delta.dot(self.weights_hidden_to_output.T)
        self.hidden_layer_delta = self.hidden_layer_error * sigmoid_derivative(self.hidden_layer_output)

        # Update weights
        self.weights_input_to_hidden += X.T.dot(self.hidden_layer_delta)
        self.weights_hidden_to_output += self.hidden_layer_output.T.dot(self.output_layer_delta)

    def train(self, X, y):
        output = self.forward_pass(X)
        self.backward_pass(X, y, output)

This class is responsible for creating the neural network and includes methods for performing a forward pass (making predictions), a backward pass (learning from errors), and a train method that runs one forward and backward pass (we'll call it repeatedly to train over many iterations). I'd suggest inspecting each function; from what you've learnt already, you should be able to tell what each one does and why.

So, now we have our Neural Network, we need some data to train it on. Let's create our fake dataset:

X, y = make_moons(200, noise=0.2, random_state=1)
y = y[:, np.newaxis]  # Reshape y to make it a 2D array

We use the make_moons function to create a synthetic dataset of 200 samples. The noise parameter adds some randomness to the data to make it more challenging, and the random_state parameter ensures that the generated dataset is the same each time the code is run. We also reshape y into a 2D column vector so that its shape matches the network's output.

Now, we can visualize our data:

plt.scatter(X[:,0], X[:,1], c=y[:,0], cmap=plt.cm.Spectral)
plt.title("Synthetic Dataset")
plt.show()

When we run this code, we get this plot:

Our datapoints
Our dataset, with different classes represented by different colours.

Next, we create an instance of the NeuralNetwork class:

nn = NeuralNetwork(input_nodes=2, hidden_nodes=3, output_nodes=1)

Our neural network has two input nodes (because our data has two features), three hidden nodes, and one output node (since this is a binary classification problem).

We can then train the network:

for _ in range(10000):
    nn.train(X, y)

Finally, we visualize the decision boundaries learned by the neural network:

x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
h = 0.01
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = nn.forward_pass(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y[:,0], cmap=plt.cm.Spectral)
plt.title("Decision Boundary")
plt.show()

For me, this plot looked like this:

Our datapoints
This plot shows the decision boundary learned by our neural network set to a single boundary line.
Our datapoints
In this case, the decision boundary is displayed as a gradient of colours.

As we can see, this is not a perfect classification by any means. However, if our simple network can successfully complete this binary classification task, it isn't hard to see how powerful this idea is, and how more complex networks can model increasingly difficult tasks.

You can find the complete code here

Summary

In our journey through networks, we started from the fundamentals, understanding the basic concepts and representations, and gradually delved deeper into the intricate world of neural networks. We studied the architecture, training process, and various types of neural networks, digging into the mathematics underpinning their operation. From forward propagation to backpropagation, from understanding loss to updating weights and biases, and from epochs and batches to regularization techniques, we've built a robust understanding of how neural networks work and their role in mimicking human cognition.

Applying this knowledge, we created and visualized a simple neural network using Python, illustrating these complex concepts in a practical, hands-on manner. Yet, our exploration of artificial intelligence has only just begun. We've established a firm foundation that we'll build upon in the next post, where we turn our focus to the broader landscape of Machine Learning, and understanding its core concepts.