Or The Architecture of Decision-Making and Learning
The Neural Model closely mirrors the structure of biological neural systems. Just as biological neurons connect through axons and synapses to form complex networks, neurons in machine learning models connect through weighted paths to process information. This parallel extends beyond structural similarity to functional resemblance in information processing and decision-making.
Having staked out a position on the definition of machines that learn, we now turn to a comparative study of biological neurons and how they are modeled in machines that learn.
What are Neurons?
Before addressing the neural model, we need a basic understanding of biological neurons. The generic neuron has, on one end (the input end), several fine processes called dendrites (because they resemble a tree; dendro- is a Greek root meaning “tree,” hence dendrite, dendrochronology, etc.). On the other end is the axon. The cell body is often called the soma (as in somatic, related to the body).

The axon is a long, thin process that leaves the cell body and may run for meters. The axon is the transmission line of the neuron. Axons can give rise to collateral branches, along with the main branch, but the actual connectivity of a neuron can be pretty complicated. Neurons are among the largest cells in the human body and are certainly the longest. For example, single spinal motor neurons in the small of the back can have axons running to the toes and can be well over a meter long. Many even longer ones are known in other animals. When axons reach their final destination, they branch again in a terminal arborization (arbor is Latin for “tree,” hence Arbor Day, arboretum, arboreal, etc.). Complex, highly specialized structures called synapses are at the ends of the axonal branches. In the standard picture of the neuron, dendrites receive inputs from other cells, the soma and dendrites process and integrate the inputs, and information is transmitted along the axon to the synapses, whose outputs provide input to other neurons or effector organs.1
To recap, a neuron is the fundamental cell unit of the nervous system, which is thought of as a processing unit. It has several key parts:
- Cell body (soma) – contains the nucleus and maintains essential cell functions
- Dendrites – branch-like structures that receive signals from other neurons
- Axon – a long fiber that carries signals away from the cell body
- Axon terminals – the ends of the axon that connect to other neurons
- Myelin Sheath – Some axons are surrounded by a fatty insulating material called the myelin sheath and have regular gaps, called the nodes of Ranvier, that allow the action potential to jump from one node to the next. Think of them as insulation material, like plastic over electrical wire.

Action Potentials: The Original Activation Function
Neurons are cells and, as such, have a membrane—a lipid bilayer2 that serves as nature’s implementation of a decision boundary3. This membrane maintains an electrical potential difference through an intricate balance of ion concentrations, primarily sodium (Na+) and potassium (K+), creating a resting potential of approximately -70 mV4.
Outside the membrane:
- High Na+ (sodium) concentration (460 mM)
- Low K+ (potassium) concentration (10 mM)
- Electrical potential of 0 millivolts
Inside the membrane:
- Low Na+ concentration (50 mM)
- High K+ concentration (400 mM)
- Electrical potential of -70 millivolts
The Na+/K+ pump actively maintains this gradient, expending cellular energy to push sodium (Na+) out and potassium (K+) in—a biological equivalent to the computational cost of maintaining neural network states.
Each ion type has its specific channels in the membrane. These channels have selective filters that only allow specific ions to pass through. For example, sodium channels only allow Na+ ions, while potassium channels only allow K+ ions. This molecular selectivity means the movement of Na+ cannot directly interfere with K+ movement through their respective channels. Ion channels achieve their selectivity through specific protein structures, particularly the selectivity filter.
The sodium-potassium pump (Na+/K+ ATPase):
Pushes:
- Na+ OUT (from inside to outside)
- K+ IN (from outside to inside)
The ratio is:
- 3 Na+ pushed OUT
- 2 K+ pushed IN
for each ATP molecule used.
This pushes against both ions’ concentration gradients – it uses energy (ATP) to maintain these “unnatural” concentration differences, which is why it’s called an active transport mechanism. Without this constant pumping:
- Na+ would flow IN (following its concentration gradient)
- K+ would flow OUT (following its concentration gradient)
- The resting potential would be lost
Where does ATP (adenosine triphosphate) come from? You guessed it: aerobic or anaerobic respiration.

When the neuron gets enough stimulation5, sodium channels in the membrane open, allowing positive sodium ions (Na+) to rush into the cell. This sudden inflow of positive charge makes the inside of the cell positive compared to the outside – an action potential. The action potential, a neuron’s fundamental unit of signal propagation, is a sudden change in the electrical charge across a neuron’s membrane. Once an action potential starts, it moves along the axon. As the action potential travels, sodium channels open sequentially down the axon, like a row of falling dominoes. After each section of the membrane becomes positive, potassium channels open to let positive potassium ions (K+) flow out, returning that section to its resting state.
Consequently, when the signal reaches the end of the axon, it triggers the release of chemicals called neurotransmitters. These chemicals are released into a tiny synaptic cleft gap between neurons. When the neurotransmitters reach the next neuron, they bind to special receptors that can start the process all over again in the next cell. This system allows signals to travel long distances, from your spinal cord to your toes, while maintaining their strength and precision.
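The threshold-and-reset dynamics described above can be caricatured in a few lines of code. This is a minimal sketch of a leaky integrate-and-fire neuron, not a physiological model; the function name `simulate_lif` and all constants (threshold, leak rate, units) are arbitrary choices for illustration.

```python
# Illustrative sketch (not a physiological model): a leaky integrate-and-fire
# neuron, the simplest caricature of the threshold behavior described above.

def simulate_lif(inputs, threshold=1.0, leak=0.9):
    """Accumulate input with passive decay; emit a spike (1) when the
    membrane variable crosses the threshold, then reset to rest."""
    v = 0.0          # membrane potential (rest = 0 in these arbitrary units)
    spikes = []
    for current in inputs:
        v = leak * v + current   # integrate input, with leak
        if v >= threshold:       # all-or-nothing: the "action potential"
            spikes.append(1)
            v = 0.0              # reset after firing
        else:
            spikes.append(0)
    return spikes

# Weak input never fires; stronger input crosses threshold and spikes.
print(simulate_lif([0.05] * 5))       # → [0, 0, 0, 0, 0]
print(simulate_lif([0.6, 0.6, 0.6]))  # → [0, 1, 0]
```

The key point the sketch captures is the all-or-nothing response: sub-threshold input produces silence, and only accumulated stimulation above the threshold produces a spike.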
The Neural Model
The fundamental idea behind the need for neural models is to find a functional representation of the relationship between a particular set of given inputs and outputs. This opposes the traditional computing paradigm of analytical reasoning, in which a known model is applied to inputs to produce outputs.


Traditional Modeling Approach:
- Input defined: Image pixels of a handwritten digit
- Function defined: Rules like:
- If there’s a complete circle → probably 0, 6, or 8
- If there’s a vertical line → probably 1
- If there’s a curved top and a straight bottom → probably 2
- …
- Output: The predicted digit
Neural Machine Learning Approach:
- Input defined: Same image pixels
- Function unknown: the network learns it from millions of labeled handwritten digits
- Output defined: We know the correct digit for each training example
At the core, in their simplest forms, neural networks are the unknown function and are modeled as follows:

Historically, networks have been depicted in this format to convey the idea of information flow between nodes. By nodes, the machine learning scientist means a point where data is transferred (conceptually or literally).
A neuron model takes multiple inputs (x₁, x₂, x₃, …, xₙ) and processes them through a simple mathematical operation. Each input is multiplied by its weight (w₁, w₂, w₃, …, wₙ), and these weighted inputs are summed together (W·X). A bias term (b) is then added to this sum. The result (z) passes through an activation function σ(z), which determines the neuron’s final output (a). This activation function introduces non-linearity into the system, allowing the neuron to make more complex decisions than simple linear combinations would allow.
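The model just described fits in a few lines of code. This is a minimal sketch, not a reference implementation; the sigmoid is used here as one common choice of σ, and the input values, weights, and bias below are arbitrary.

```python
import math

# Minimal sketch of the neuron model: weighted sum, bias, then activation.
def neuron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = W·X + b
    a = 1.0 / (1.0 + math.exp(-z))                # a = σ(z), sigmoid activation
    return a

# Example with three arbitrary inputs, weights, and a bias.
x = [0.5, -1.0, 2.0]
w = [0.8, 0.2, -0.5]
b = 0.1
print(round(neuron(x, w, b), 4))  # → 0.3318
```

Here z = 0.4 − 0.2 − 1.0 + 0.1 = −0.7, and σ(−0.7) ≈ 0.33: the neuron fires weakly because its weighted input falls below the sigmoid's midpoint.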
The whole game of machines that learn is to find (that is, to train the machine for) weights that, when multiplied by any set of inputs, faithfully classify those inputs.
The parallel between this artificial model and biological neurons is striking. The input values represent signals from other neurons, similar to dendrites receiving neurotransmitters in biological neurons. The weights mirror synaptic strength – how strongly one neuron influences another. The summation (W·X) parallels how a biological neuron integrates all incoming signals at its cell body. Most fascinating is how the activation function σ(z) mimics the biological action potential – just as a biological neuron only fires when its total input exceeds a threshold, the activation function determines whether and how strongly the artificial neuron “fires.” The bias term (b) is analogous to the neuron’s base excitability – how easily it fires regardless of input. Even the final output (a) corresponds to neurotransmitter release at synapses, influencing the next layer of neurons.

We should harbor no illusions in drawing direct analogies; the point here is the essentiality of nonlinearity, which can only be achieved through a gated response to stimuli. Nothing happens until a certain threshold is reached; afterward, a gate opens and closes to let the information through.

John D. Enderle PhD, in Introduction to Biomedical Engineering (Third Edition), 2012
Essential Nonlinearity
Looking back at what we have discussed, we have encountered a fundamental property that makes neural networks function the way they do. Action potentials, capacity, the synaptic cleft, and neurotransmitter vesicles all seem to fulfill one essential property: nonlinearity. We can explore the problem of linear machines through two approaches.
1- The Linear Boundary Problem
Consider a neural network layer, similar to our simple model, defined by the transformation
z(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b
Can a function as such ever learn a nonlinear behavior? Well, in its simplest form, obviously not. Linear in, linear out. What if we combine multiple layers in between? Indeed, one might think adding more layers in between and multiplying them would help since it adds some complexity.

But upon further investigation, we realize that our naive assumption of adding layers in between has achieved nothing. The network can be simplified, and the middle layer (a₁, a₂) can be factored out. This is a fundamental property of superposition: a linear combination of linear operations is itself linear.
In detail, if we start with
a₁ = w₁₁¹x₁ + w₁₂¹x₂ + w₁₃¹x₃
a₂ = w₂₁¹x₁ + w₂₂¹x₂ + w₂₃¹x₃
z = w₁²a₁ + w₂²a₂ + b
and substitute the values of (a1 , a2)
z = w₁²(w₁₁¹x₁ + w₁₂¹x₂ + w₁₃¹x₃) + w₂²(w₂₁¹x₁ + w₂₂¹x₂ + w₂₃¹x₃) + b
z = (w₁²w₁₁¹)x₁ + (w₁²w₁₂¹)x₂ + (w₁²w₁₃¹)x₃ + (w₂²w₂₁¹)x₁ + (w₂²w₂₂¹)x₂ + (w₂²w₂₃¹)x₃ + b
subsequently collecting the terms
z = (w₁²w₁₁¹ + w₂²w₂₁¹)x₁ + (w₁²w₁₂¹ + w₂²w₂₂¹)x₂ + (w₁²w₁₃¹ + w₂²w₂₃¹)x₃ + b
and finally defining new equivalent weights
w₁ = w₁²w₁₁¹ + w₂²w₂₁¹
w₂ = w₁²w₁₂¹ + w₂²w₂₂¹
w₃ = w₁²w₁₃¹ + w₂²w₂₃¹
z = w₁x₁ + w₂x₂ + w₃x₃ + b
This final form is equivalent to a single-layer network with direct connections from inputs to output. The key observations are:
- The hidden layer creates linear combinations of inputs
- The output layer creates a linear combination of these combinations
- When expanded, this results in just another linear combination of the original inputs
- No nonlinear activation functions are present to justify the hidden layer
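The collapse derived above can be verified numerically. The sketch below uses arbitrary random weight matrices (the shapes and seed are assumptions for illustration) and shows that a two-layer linear network equals a single-layer network with the product of the weight matrices.

```python
import numpy as np

# Two linear layers with no activation collapse into one.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # "hidden" layer: 3 inputs -> 2 units
w2 = rng.normal(size=(1, 2))   # output layer: 2 units -> 1 output
b = 0.5

x = rng.normal(size=(3,))
two_layer = w2 @ (W1 @ x) + b  # network with a hidden linear layer
w_eff = w2 @ W1                # the equivalent single-layer weights
one_layer = w_eff @ x + b

print(np.allclose(two_layer, one_layer))  # → True: the hidden layer bought nothing
```

The equality holds for any input and any weight matrices: matrix multiplication simply absorbs the hidden layer, exactly as the term-collection above showed.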

The Solution to the Linear Boundary Problem
Introducing activation functions, particularly ReLU (Rectified Linear Unit) (Fig.4, right), fundamentally alters this linear reduction. But before we delve into the mathematics, let us contemplate what these activation functions represent in the broader scope of machine learning. They are not just mathematical conveniences but rather embody a kind of selective force, a threshold of becoming that transforms the continuous flow of information into discrete activation patterns.
Consider the ReLU function in its simplest form (Fig. 4):
f(x) = max(0, x)
This seemingly simple operation introduces a profound break in the linearity of our system. It is a decision point where the network must “choose between” (recalling the etymology of intelligentia) allowing information to flow forward or blocking it entirely. When we apply ReLU after each layer’s transformation, our previous reduction breaks down:
a₁ = max(w₁₁¹x₁ + w₁₂¹x₂ + w₁₃¹x₃, 0)
a₂ = max(w₂₁¹x₁ + w₂₂¹x₂ + w₂₃¹x₃, 0)
z = w₁²a₁ + w₂²a₂ + b
Now, when we attempt to substitute and collect terms, we encounter a fundamental impossibility. The ReLU function creates regions of absolute silence (where its output is zero) and areas of linear transmission (where its output equals its input). This piecewise nature shatters our ability to reduce the network to a single linear transformation.
The network now behaves like a collection of different linear functions, each active in different regions of the input space, switched on and off by the activation functions. This is not just a quantitative change in complexity but a qualitative transformation in the network’s learning capacity. It can approximate nonlinear boundaries and carve the input space into regions of different behaviors.
The activation function is a productive constraint, a limitation that paradoxically enables greater expressive power. This mirrors a fundamental insight that constraints and resistance are necessary to develop strength and capability6. The network’s power emerges not from unlimited linear combinations but from the strategic placement of these nonlinear barriers that force it to create more sophisticated response patterns. This transformation from linear to nonlinear behavior is not just a mathematical necessity; it reflects a more profound truth about learning systems: meaningful adaptation often requires discontinuity, moments of decisive change that cannot be reduced to smooth, continuous transformations. Discontinuity is fundamental to how knowledge systems7 transform and develop. The ReLU function, in its brutal simplicity, embodies this principle of productive discontinuity.
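The breakdown of the reduction can be checked directly. The sketch below uses small hand-picked weight matrices (chosen for illustration) and shows that a ReLU network violates superposition, something no single linear map can do.

```python
import numpy as np

# Hand-picked weights, chosen so the two inputs below land in
# different linear regions of the ReLU network.
W1 = np.array([[1.0, -1.0, 0.0],
               [0.0,  1.0, 1.0]])
w2 = np.array([[1.0, 1.0]])

def net(x):
    a = np.maximum(W1 @ x, 0.0)   # ReLU zeroes out negative pre-activations
    return (w2 @ a)[0]

x = np.array([1.0, 0.0, 0.0])
y = np.array([-1.0, 0.0, 0.0])

# A linear map f satisfies f(x + y) = f(x) + f(y); this network does not.
print(net(x) + net(y))   # → 1.0: each input passes through its own region
print(net(x + y))        # → 0.0: superposition fails, so the network is nonlinear
```

Because ReLU silences `x` and `y` in different regions, their contributions no longer add; this is exactly the "regions of absolute silence" described above.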
2- The XOR Problem
The XOR (exclusive or) operation provides a pristine example of this limitation. Consider this table: a straightforward machine that decides true or false based on two inputs.
| Input A | Input B | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 1       | 1       | 0      |
| 1       | 0       | 1      |
| 0       | 1       | 1      |
As visualized in our first diagram, these points form a pattern that no linear decision boundary can separate. This is not a limitation of our algorithmic approach but a fundamental property of the problem space. No linear transformation can correctly classify these points, as they are not linearly separable in their input space.
The necessity of nonlinearity emerges here not as a theoretical preference but as an ontological requirement. To solve the XOR problem, we must transform the input space to make the patterns linearly separable in some higher-dimensional space. This is precisely what nonlinear activation functions enable. Without any activation, a linear model takes the form:
y = β₀ + β₁x₁ + β₂x₂
For an XOR Problem, our target function is:
f(x₁,x₂) = x₁ ⊕ x₂ = x₁(1-x₂) + x₂(1-x₁)
For a linear regression to fit these points, we would need coefficients β₀, β₁, β₂ that satisfy:
β₀ + 0β₁ + 0β₂ = 0
β₀ + 0β₁ + 1β₂ = 1
β₀ + 1β₁ + 0β₂ = 1
β₀ + 1β₁ + 1β₂ = 0
This system of equations is inconsistent. Subtracting the first equation from the fourth gives:
β₁ + β₂ = 0
while the second and third equations (since the first gives β₀ = 0) give
β₁ = β₂ = 1
This contradiction proves that no linear combination of the original features can solve the XOR problem.
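The inconsistency can also be seen numerically: fitting the linear model to the four XOR points by least squares (a sketch using NumPy's `lstsq`) shows the best a linear model can ever do.

```python
import numpy as np

# Design matrix for y = β0 + β1·x1 + β2·x2; columns: bias, x1, x2.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
pred = X @ beta
print(np.round(pred, 2))  # → [0.5 0.5 0.5 0.5]: every point is wrong by 0.5
```

The best linear fit is the constant 0.5: the model cannot even lean toward the right answer on any of the four points, which is the numerical face of the contradiction above.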

The Power of Feature Transformation
However, linear regression can learn XOR if we transform the input space appropriately. The key insight is introducing a nonlinear feature transformation φ(x) before applying linear regression.
Consider the transformation:
φ(x₁,x₂) = [x₁, x₂, x₁x₂]
The transformation is chosen somewhat arbitrarily; the only important factor is that the third dimension (x₁x₂) should create a nonlinear term. For example, (x₁+x₂) would not have worked.
Making our model:
y = β₀ + β₁x₁ + β₂x₂ + β₃(x₁x₂)
This transformed model can solve XOR. One solution is:
y = x₁ + x₂ - 2(x₁x₂)
With β₀ = 0, β₁ = 1, β₂ = 1, β₃ = -2, this perfectly captures the XOR function:
f(0,0) = 0 + 0 - 0 = 0
f(0,1) = 0 + 1 - 0 = 1
f(1,0) = 1 + 0 - 0 = 1
f(1,1) = 1 + 1 - 2 = 0
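The verification above can be run as a quick check, evaluating y = x₁ + x₂ - 2(x₁x₂) at all four corners:

```python
# Verify that the transformed linear model reproduces XOR exactly.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
for x1, x2 in points:
    y = x1 + x2 - 2 * (x1 * x2)  # β0=0, β1=1, β2=1, β3=-2
    print((x1, x2), "->", y)      # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
```

A model that is linear in the transformed features [x₁, x₂, x₁x₂] fits XOR with zero error, even though no model linear in [x₁, x₂] alone can.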
Let’s recap: the XOR relationship could not be represented as a linear function of x₁ and x₂, but when we added a third axis (x₁x₂), we transformed the problem into a higher dimension, where a linear decision boundary could be established.
This connects to a profound ontological point about representation learning. The success of deep neural networks comes not just from their nonlinear activations but from their ability to learn hierarchical feature transformations. Each layer can potentially create new features that make the classification problem linearly separable.

Road Ahead
Let’s review: we started with the biological aspects of neurons, where we discovered a nonlinearity (the action potential). We then introduced and investigated the neural model to answer the question of its ontological necessity, and began to develop a sense of that nonlinearity. Much like our feature transformation, which provides higher dimensions for learning, action potentials at synapses seem to create higher dimensions from sensory inputs.
Further, two roads opened up: action potentials (activation functions) and features (transformations). Either way, we are introducing nonlinearity into the system. As such, learning and decision-making presuppose nonlinearity: to make clear decisions, we need to think nonlinearly.
The necessity of nonlinearity in machines that learn goes beyond technical efficiency. It is a recognition that the phenomena we seek to understand – whether in image recognition, language processing, or any other domain – are fundamentally nonlinear. They emerge from complex interactions that cannot be reduced to simple, linear combinations. As such, discontinuities and moments of silence are essential to the learning process.
This understanding challenges the positivist assumption that complex systems can be fully understood through linear decomposition. Instead, it suggests that machines that learn must embrace nonlinearity not as a mathematical convenience but as an ontological necessity for engaging with the world’s complexity.
In conclusion, nonlinearity in machines that learn is not a technical solution to a mathematical problem. It is a philosophical stance that acknowledges the inherent complexity and irreducibility of the phenomena we seek to understand. As we continue developing these learning systems, we must remember that choosing nonlinear activation functions is not just an architectural decision but a statement about the nature of transformation and understanding.
In the following articles, we will delve even deeper into the technical aspects of both approaches with practical examples.
A simple but clear animation representing the ideas of this article
- James A. Anderson, An Introduction to Neural Networks. ISBN: 9780262510813 ↩︎
- The lipid bilayer is the barrier that keeps ions, proteins, and other molecules where needed and prevents them from diffusing into areas where they should not be. Lipid bilayers are ideally suited to this role, even though they are only a few nanometers in width, because they are impermeable to most water-soluble (hydrophilic) molecules. Bilayers are remarkably impermeable to ions, which allows cells to regulate salt concentrations and pH by transporting ions across their membranes using proteins called ion pumps; these are the primary drivers in our discussion of information across neurons. ↩︎
- Decision Boundary proves to be a fundamental property of neural models. A last layer in which consciousness appears. Something that classifies the complex underlying behavior of learning. ↩︎
- Think of electrical potential like height differences in water: just as water flows from high to low elevation due to gravity, positive ions flow from high to low electrical potential. Much like water, ions can be actively pumped “uphill” against their electrical gradient by molecular pumps like the Na+/K+ ATPase—an essential process that maintains the neuron’s resting potential. It is worth mentioning that the literature sometimes cites a resting potential of -60 mV instead of -70 mV. ↩︎
- Stimulations are usually neurotransmitters from other neurons. When neurotransmitters bind to receptors, they cause small ion channels to open. Neurons connect at specific points called synapses. The sending neuron has a specialized ending (presynaptic terminal), and the receiving neuron has a specialized receiving area (postsynaptic terminal). There’s a tiny gap between them (synaptic cleft). Neurotransmitters are stored in the presynaptic terminal’s tiny sacs (synaptic vesicles). ↩︎
- “On the Genealogy of Morals,” Second Essay, sections 16-18, where Nietzsche discusses how the resistance against primitive human instincts and the “internalization of man” led to the development of higher capabilities. ↩︎
- As Foucault argues in “Archaeology of Knowledge” (1969) ↩︎