Neural networks are the foundation of modern machine learning and AI, and understanding them is essential to understanding what AI is and how it works. In this article, you’ll learn the basics of neural networks, and then we’ll delve into some of the most common variants, such as feedforward and recurrent networks, which drive everything from large language models like ChatGPT and Bard to image generation with Stable Diffusion.
The perceptron
All neural networks share one basic characteristic: they are interconnected groups of nodes. More technically, they are graphs. The attributes of the nodes and the ways the edges connect them vary widely. The very simplest structure is, of course, a single node.
The perceptron is the earliest vision of a mathematical model inspired by the human brain cell—though it is important to note that the association between the two is very loose. The human brain is radically more sophisticated, subtle, and epistemologically dicey than a neural network. In the case of a human brain, the thing to be understood is also part of the apparatus of understanding. That’s not the case with software neurons.
Generally, the “neuron” idea means a node that accepts one or more inputs, makes a decision as to what output to produce, and then sends that output onward toward the next node (or to a final output).
Figure 1 shows a simple perceptron with two inputs and a single output.
Each input is multiplied by a weight, which tunes that input’s influence. The weighted inputs are then summed, and a bias is added to the total. The bias allows for tuning the node’s overall impact. (For a more mathematical diagram, see the single-layer perceptron model here.)
The resulting value is then passed to the activation function. This function can take many forms, but in a perceptron it is a threshold function, often the Heaviside step function, which outputs 1 if the value is high enough and 0 otherwise. In short, this function is a gate. The simple on/off output is a defining characteristic of perceptrons.
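To make this concrete, here is a minimal sketch in Python of the perceptron just described: two inputs, two weights, a bias, and a Heaviside step activation. The specific input, weight, and bias values are made up for illustration.

```python
def heaviside(value: float) -> int:
    """Threshold (Heaviside step) activation: 1 if the value clears zero, else 0."""
    return 1 if value >= 0 else 0

def perceptron(inputs: list[float], weights: list[float], bias: float) -> int:
    # Multiply each input by its weight, sum the results, and add the bias.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The activation function gates the result to a simple on/off output.
    return heaviside(weighted_sum)

# Two inputs with hypothetical weights and bias.
print(perceptron([0.7, 0.3], weights=[0.6, -0.4], bias=-0.1))  # prints 1 for these values
```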
At a node-internals level, this basic layout is fairly universal to neural nets. The number of inputs and outputs can vary. The information fed into a neuron is often called its features.
To help avoid confusion, note that the term perceptron sometimes denotes a single-node neural network, since perceptrons are very often used in isolation. Several perceptrons may also be combined in a single layer. If more layers are used, the result is considered a feedforward network, which I will discuss further below.
Loss functions and machine learning
In general, perceptrons and neural networks need a way to tune their weights and biases to improve performance. Performance is measured by a loss function, which tells the network how far its output was from the desired result. That information is then used to tune the node(s).
The modification of weights and biases in neurons is the essence of neural network machine learning.
Note that I am intentionally avoiding the details of loss functions and how weights and biases are adjusted. In general, gradient descent is the common algorithm used for this purpose. Gradient descent treats the network as a calculus function and adjusts the weights and biases to minimize the loss.
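Although the details are out of scope here, a tiny example can show the flavor of the idea. The sketch below (plain Python; the data, learning rate, and linear model are all made up for illustration) nudges a single weight and bias downhill against a squared-error loss.

```python
def predict(x: float, weight: float, bias: float) -> float:
    return weight * x + bias

def train(data, weight=0.0, bias=0.0, learning_rate=0.1, epochs=50):
    for _ in range(epochs):
        for x, target in data:
            prediction = predict(x, weight, bias)
            error = prediction - target  # how far off we were (the loss signal)
            # Step each parameter downhill along the gradient of the squared error
            # (the constant factor from differentiation is folded into the learning rate).
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias

# Learn a simple relationship (y = 2x + 1) from a few sample points.
weight, bias = train([(0, 1), (1, 3), (2, 5), (3, 7)])
print(round(weight, 2), round(bias, 2))  # should approach 2.0 and 1.0
```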
Next, we will look at a variety of neural network styles that learn from and also move beyond the perceptron model.
Feedforward networks
Feedforward networks are perhaps the most archetypal neural net. They offer a much higher degree of flexibility than perceptrons but are still fairly simple. The biggest differences are that a feedforward network usually incorporates more than one layer and uses more sophisticated activation functions. The activation function in a feedforward network is not just 0/1, or on/off: the nodes output a continuous value.
The form of gradient descent used in feedforward networks is also more involved; most typically, it is backpropagation, which treats the network as one big multivariate calculus equation and uses partial differentiation for tuning.
In Figure 3, we have a prototypical feedforward network. There is an input layer (sometimes considered as layer 0 or layer 1) and then two neuron layers. There can be great variety in how the nodes and layers are connected. In this case, we have “fully connected” or “dense” layers, because each node’s output is sent to every node in the next layer. The interior layers in a neural net are also called hidden layers.
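As a rough sketch of what such a network computes, here is a forward pass through a small dense network in Python with NumPy. The layer sizes, random weights, and sigmoid activation are illustrative choices, not anything prescribed by the figure.

```python
import numpy as np

def sigmoid(x):
    # A smooth activation: outputs a continuous value between 0 and 1,
    # rather than the perceptron's hard on/off.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# 3 input features -> 4 hidden nodes -> 2 output nodes (sizes chosen arbitrarily)
w_hidden = rng.normal(size=(3, 4))
b_hidden = np.zeros(4)
w_output = rng.normal(size=(4, 2))
b_output = np.zeros(2)

def forward(features):
    hidden = sigmoid(features @ w_hidden + b_hidden)  # hidden (interior) layer
    return sigmoid(hidden @ w_output + b_output)      # output layer

print(forward(np.array([0.5, -1.2, 3.0])))
```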
The key in feedforward networks is that they always push data forward, from input toward output, never backward, as occurs in a recurrent neural network, discussed next.
Recurrent neural network (RNN)
Recurrent neural networks, or RNNs, are a style of neural network that involves data moving backward among layers; in graph terms, an RNN is a cyclic graph. The backward movement opens up a variety of more sophisticated learning techniques, and also makes RNNs more complex than some other neural nets. We can say that RNNs incorporate some form of feedback. Figure 4 shows the cyclical pattern of data movement in an RNN.
Another trick employed by RNNs is hidden state, meaning that nodes can hold some data internally during the run, essentially a form of machine memory. Since layers can run repeatedly in an RNN (the output of a downstream layer becoming the input for an upstream one), hidden state enables the network to learn about long-term effects in the data.
RNN variants are among the most prominent networks in use today, and there is great variety in how they are implemented. The most common is the long short-term memory (LSTM) network. LSTMs use fairly complex nodes with a series of gates and internal state: a “forget gate” determines what stored information to keep or discard, and “input and output gates” determine what to take in and emit.
RNNs are most suited to, and most often applied to, sequential data such as time series, where the ability to remember past influence across a sequence is key.
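A bare-bones recurrent step might look like the following sketch (plain NumPy; the sizes and random weights are arbitrary). The point is that the hidden state computed at one step is fed back in at the next, which is the “memory” that makes RNNs useful for sequences; a real LSTM layers its gates on top of this idea.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 5  # illustrative sizes

w_input = rng.normal(size=(input_size, hidden_size))
w_hidden = rng.normal(size=(hidden_size, hidden_size))
bias = np.zeros(hidden_size)

def rnn_step(x, hidden_state):
    # Combine the new input with the previous hidden state.
    return np.tanh(x @ w_input + hidden_state @ w_hidden + bias)

hidden_state = np.zeros(hidden_size)
sequence = [rng.normal(size=input_size) for _ in range(4)]  # a toy time series
for x in sequence:
    hidden_state = rnn_step(x, hidden_state)  # state carries across steps
print(hidden_state)
```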
Convolutional neural network (CNN)
Convolutional neural networks, or CNNs, are designed for processing grids of data, images in particular. They are used as a component in the learning and loss phase of generative AI models like Stable Diffusion, and for many image classification tasks.
CNNs use matrix filters that act like a window moving across the two-dimensional source data, extracting the information in their view and relating it to the rest. This is what makes them so well suited to image handling. As the window moves across the data, it builds up a detailed, interconnected picture of it. In that way, a CNN works well on a two-dimensional spatial plane, just as an RNN works well on time-sequenced data in a series.
Most CNNs operate in a two-phase process: the filtering is followed by a flattening step, whose output is fed into a feedforward network. The filtering phase operates on a grid of data rather than a neural net-style node graph, so even though it uses a gradient descent algorithm to learn based on a loss function, the overall process is dissimilar to a typical neural net.
Another important operation in a CNN is pooling, which takes the data produced by the filtering phase and compresses it for efficiency. Pooling is designed to keep the relevant aspects of the output while reducing the dimensionality of the data.
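Here is a rough sketch of the two operations just described: a small filter sliding across a 2D grid (convolution), followed by max pooling. The toy image and filter values in this NumPy example are arbitrary; in a real CNN the filter values are learned during training.

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window under the filter element-wise and sum the result.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Keep only the strongest response in each size-by-size block.
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)    # a simple 3x3 filter
feature_map = convolve2d(image, edge_filter)      # 4x4 feature map
print(max_pool(feature_map))                      # 2x2 pooled output
```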
Figure 5 has a generalized view of a typical CNN flow.
The term convolution refers to the well-known mathematical procedure. For a great animated visualization of the convolution process, see this guide to convolutional neural networks.
Transformers and attention
Transformers are a hot topic these days because they are the architecture of LLMs like ChatGPT and Bard. They use an encoder-decoder structure and allow for an attention mechanism. (You can find the paper that introduced transformer networks here.)
Attention is a breakthrough in language processing because it lets the model focus on the parts of the input that are most significant. Transformers also use positional encoding of word tokens to preserve word order. This video is a good breakdown of the architecture in plain English.
Transformers are very powerful, and also very complex. They use a dense feedforward network as a sub-neural net inside the encoder and decoder components. They also demand considerable computing power.
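To give a feel for the attention mechanism itself, here is a minimal sketch of scaled dot-product attention, the core operation described in the transformer paper, written in NumPy. The query, key, and value matrices here are random stand-ins; in a real transformer they are learned projections of the token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    d_k = queries.shape[-1]
    # Score every token against every other token, scaled by sqrt(d_k).
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax turns the scores into weights that say where to "focus."
    weights = softmax(scores, axis=-1)
    # Blend the value vectors according to those weights.
    return weights @ values

rng = np.random.default_rng(2)
tokens, d_model = 4, 8                 # a toy sequence of 4 tokens
q = rng.normal(size=(tokens, d_model))
k = rng.normal(size=(tokens, d_model))
v = rng.normal(size=(tokens, d_model))
print(attention(q, k, v).shape)        # (4, 8): one blended vector per token
```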
Adversarial networks
One of the most interesting newer ideas is the adversarial network, which pits two models against each other. One model (the generator) attempts to produce convincing output, and the other (the discriminator) attempts to spot the fakes. At a high level, this acts as a sophisticated loss mechanism: the adversary serves as the loss function.
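As an illustration of that adversarial loop, here is a heavily simplified sketch using PyTorch (a framework choice of mine, not anything the article prescribes): a tiny generator learns to mimic a simple one-dimensional “real” distribution, and the discriminator’s judgment supplies the generator’s loss signal.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 1) * 0.5 + 3.0   # toy "real" data centered at 3.0
    fake = generator(torch.randn(32, 1))    # the generator's attempt

    # Train the discriminator to tell real from fake.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator: its loss comes from the adversary.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()

print(generator(torch.randn(100, 1)).mean().item())  # should drift toward 3.0
```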
Conclusion
Neural networks are a powerful way of thinking about problems and applying machine learning algorithms based on loss reduction. There are some involved and complex variants, and an enormous amount of money and thought is being invested in this space.
Understanding the basics of neural networks also helps us address deeper questions about the nature of consciousness and what artificial intelligence (in its current incarnation) means for it.
Fortunately, the fundamental ideas of neural networks are not hard to grasp. Understanding the variations of neural networks and how they work is useful and increasingly essential knowledge for software developers. This is an area of innovation that will continue to impact the larger industry and world for decades to come.