Understanding the balance between model complexity and built-in assumptions is fundamental to successful machine learning. This trade-off affects everything from training efficiency to generalization performance, and mastering it is key to choosing the right approach for your specific problem.

Imagine I have a simple dataset:

  • x = {1, 2, 4, 6, 8, 10}
  • y = {2, 4, 8, 12, 16, 20}

It's immediately obvious that a linear function y = ax + b can fit this data with minimal optimization effort to find the coefficients. But what if we wanted to fit a quadratic function y = ax² + bx + c to the same dataset?
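Before turning to the quadratic, here is a quick sketch with NumPy showing how little work the linear fit takes (np.polyfit is used here just for illustration); it recovers y = 2x exactly:

  import numpy as np

  x = np.array([1, 2, 4, 6, 8, 10], dtype=float)
  y = np.array([2, 4, 8, 12, 16, 20], dtype=float)

  # Fit y = a*x + b; the data is exactly linear, so we expect a = 2 and b = 0.
  a, b = np.polyfit(x, y, deg=1)
  print(a, b)  # ~2.0, ~0.0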

The first question is: can we? We definitely can, because the quadratic model has higher expressivity. In other words, the quadratic function represents a larger hypothesis space: it contains every linear relationship (just set a = 0) as well as a family of non-linear ones. Any pattern a linear model can capture, a quadratic model can also represent.
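As a sketch of that nesting, fitting a quadratic to the same data simply drives the extra coefficient toward zero:

  import numpy as np

  x = np.array([1, 2, 4, 6, 8, 10], dtype=float)
  y = np.array([2, 4, 8, 12, 16, 20], dtype=float)

  # Fit y = a*x**2 + b*x + c; the best quadratic here is just the linear solution.
  a, b, c = np.polyfit(x, y, deg=2)
  print(a, b, c)  # a ~ 0.0, b ~ 2.0, c ~ 0.0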

But what's the caveat? The caveat is that the quadratic model has three parameters to optimize instead of the linear model's two, and depending on the optimization method, it may need more data to be fitted reliably and to avoid overfitting.

Thus, the quadratic model can definitely explain our data, but at the cost of a more complex optimization. This is generally how machine learning works in practice: ML models usually have far more expressivity than the underlying data patterns require. In our analogy, even if the data follows a simple linear relationship (degree 1), we typically fit a high-degree polynomial y = aₙxⁿ + aₙ₋₁xⁿ⁻¹ + … + a₁x + a₀, regardless of the true degree of the data. This over-parameterization is common because we rarely know the true complexity of our data patterns beforehand.
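A small sketch of what over-parameterization can do with so few points: a degree-5 polynomial has enough freedom to pass through all six (slightly noisy) observations, but it can wander between them even though the true pattern is still just y = 2x.

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.array([1, 2, 4, 6, 8, 10], dtype=float)
  y = 2 * x + rng.normal(scale=0.5, size=x.size)  # linear data with a little noise

  # Six coefficients for six points: the fit interpolates the noise exactly.
  coeffs = np.polyfit(x, y, deg=5)
  x_new = np.array([3.0, 5.0, 9.0])               # inputs between the training points
  print(np.polyval(coeffs, x_new))                # can drift from the true values 6, 10, 18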

These expressive models can generally explain our data, but success depends on how we configure the experimental setup, how we tune the hyperparameters, and, critically, how much data we have.

The more data we have, the more expressive a model we can afford to use. Note that the computational cost scales with the size of the model and is largely independent of how simple or linear the underlying data pattern is; that's why the cost of training is driven more by model complexity than by data simplicity.
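Continuing the sketch above, the same degree-5 model behaves much better once it sees more data: with 100 noisy samples instead of 6, its predictions land back near y = 2x.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.uniform(1, 10, size=100)                # many samples from the same linear pattern
  y = 2 * x + rng.normal(scale=0.5, size=x.size)

  coeffs = np.polyfit(x, y, deg=5)                # same expressive model, more data
  x_new = np.array([3.0, 5.0, 9.0])
  print(np.polyval(coeffs, x_new))                # stays close to the true values 6, 10, 18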

The Role of Inductive Bias

Now, what if we have limited data and we know the relationship is linear? In that case, choosing the linear function makes sense: the linear model is cheaper to optimize, but it carries a strong built-in assumption, namely that the data follows a linear relationship.

Models that have such built-in assumptions are said to have inductive bias. Inductive bias makes training faster and helps models converge sooner, but it comes with the risk that if our assumptions are wrong, the model may be too constrained to capture the true underlying patterns.
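A small sketch of this double-edged effect, using the same polynomial toy setup: when the truth really is linear, the strongly biased degree-1 fit generalizes better from six noisy points than a degree-5 fit, but the same linear assumption leaves the model unable to capture a quadratic pattern.

  import numpy as np

  rng = np.random.default_rng(1)
  x_train = np.array([1, 2, 4, 6, 8, 10], dtype=float)
  x_test = np.array([3.0, 5.0, 9.0])

  def test_error(true_fn, deg):
      """Fit a polynomial of the given degree to noisy samples of true_fn; return test MSE."""
      y_train = true_fn(x_train) + rng.normal(scale=0.5, size=x_train.size)
      coeffs = np.polyfit(x_train, y_train, deg)
      return np.mean((np.polyval(coeffs, x_test) - true_fn(x_test)) ** 2)

  linear_truth = lambda x: 2 * x      # the linear assumption matches reality
  quadratic_truth = lambda x: x ** 2  # the linear assumption is wrong

  print(test_error(linear_truth, deg=1), test_error(linear_truth, deg=5))        # constrained fit usually wins
  print(test_error(quadratic_truth, deg=1), test_error(quadratic_truth, deg=5))  # constrained fit underfits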

Real-World Example: CNNs vs. Vision Transformers

This trade-off is beautifully illustrated in the evolution from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) in computer vision:

CNNs have strong inductive biases:

  • Translation invariance (assuming objects look the same regardless of position)
  • Local connectivity (assuming nearby pixels are more related)
  • Hierarchical processing (learning simple to complex features)

ViTs have weaker inductive biases:

  • Minimal assumptions about spatial relationships
  • Can learn any pattern given sufficient data
  • More flexible but require larger datasets

When data is limited, CNNs often outperform ViTs because their inductive biases act as helpful constraints. However, with massive datasets (like ImageNet-21K), ViTs can learn the "right" spatial relationships from data and often surpass CNNs because they're not constrained by potentially incorrect assumptions.
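As a rough sketch of this contrast (assuming PyTorch is available; this illustrates the two processing styles, not either architecture in full), a convolution hard-codes local connectivity and weight sharing across positions, while self-attention over image patches makes almost no spatial assumptions and lets every patch interact with every other patch:

  import torch
  import torch.nn as nn

  image = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image

  # CNN-style inductive bias: a 3x3 kernel sees only a local neighborhood,
  # and the same weights are reused at every position.
  conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
  conv_out = conv(image)             # shape: (1, 16, 32, 32)

  # ViT-style processing: split the image into 4x4 patches and let attention
  # relate every patch to every other patch; locality is not assumed.
  patches = image.unfold(2, 4, 4).unfold(3, 4, 4)                 # (1, 3, 8, 8, 4, 4)
  patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 64, 48)  # 64 patches of 48 values
  embed = nn.Linear(48, 16)
  attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
  tokens = embed(patches)
  attn_out, _ = attn(tokens, tokens, tokens)                      # shape: (1, 64, 16)

  print(conv_out.shape, attn_out.shape)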

Beyond the Trade-off: Combining Both Advantages

It's important to note that expressivity and inductive bias are not necessarily mutually exclusive. Modern machine learning increasingly focuses on architectures that combine high expressivity with carefully designed inductive biases, giving us the best of both worlds.

Consider these examples:

Hybrid CNN-Transformer Architectures: Models like ConViT and CvT combine convolutional layers (with strong spatial inductive bias) with transformer blocks (high expressivity). They start with CNN-like processing to leverage spatial assumptions, then use transformers for more flexible global reasoning.

Graph Neural Networks (GNNs): These models have strong inductive bias for graph-structured data (assuming relationships matter) while maintaining high expressivity through multiple layers and attention mechanisms. They can model complex relationships while respecting the underlying graph structure.
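A minimal message-passing sketch (a generic GCN-style layer written by hand, not any specific library's API): each node's new features are computed only from its neighbors, which is exactly the graph inductive bias.

  import torch
  import torch.nn as nn

  # A 4-node chain graph, written as an adjacency matrix with self-loops.
  adj = torch.tensor([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=torch.float)
  deg_inv = torch.diag(1.0 / adj.sum(dim=1))   # normalize by node degree
  features = torch.randn(4, 8)                 # an 8-dimensional feature vector per node
  linear = nn.Linear(8, 16)

  # One layer: average each node's neighborhood, then apply a learned transform.
  updated = torch.relu(linear(deg_inv @ adj @ features))
  print(updated.shape)                         # (4, 16)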

Physics-Informed Neural Networks (PINNs): These combine the expressivity of deep networks with the inductive bias of physical laws. The model can learn complex patterns while being constrained to respect known physics, leading to better generalization with less data.
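A toy PINN-style sketch (an illustration of the idea, not a specific published model): a small network fits u(t) from just two observations, while a physics term penalizes violations of du/dt = -u at points where no data exists.

  import torch
  import torch.nn as nn

  net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
  optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

  t_data = torch.tensor([[0.0], [1.0]])   # two observed time points
  u_data = torch.exp(-t_data)             # observations of the true solution u = exp(-t)
  t_phys = torch.linspace(0, 2, 50).reshape(-1, 1).requires_grad_(True)

  for step in range(2000):
      optimizer.zero_grad()
      data_loss = ((net(t_data) - u_data) ** 2).mean()          # match the few observations
      u = net(t_phys)
      du_dt = torch.autograd.grad(u, t_phys, grad_outputs=torch.ones_like(u),
                                  create_graph=True)[0]
      physics_loss = ((du_dt + u) ** 2).mean()                  # enforce du/dt = -u
      (data_loss + physics_loss).backward()
      optimizer.step()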

Regularization Techniques: Methods like dropout, batch normalization, and data augmentation add inductive biases to highly expressive models without reducing their theoretical capacity. They guide the model toward better solutions while maintaining flexibility.
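A brief sketch of what that looks like in practice (assuming PyTorch and torchvision): dropout and batch normalization nudge an expressive network toward robust features, and the augmentation pipeline encodes the assumption that flips and small shifts do not change an image's label.

  import torch.nn as nn
  from torchvision import transforms

  # An intentionally over-sized classifier for 32x32 RGB images,
  # regularized with batch normalization and dropout.
  model = nn.Sequential(
      nn.Flatten(),
      nn.Linear(32 * 32 * 3, 512),
      nn.BatchNorm1d(512),
      nn.ReLU(),
      nn.Dropout(p=0.5),
      nn.Linear(512, 10),
  )

  # Data augmentation as an inductive bias: the label is assumed to be
  # unchanged by horizontal flips and small random crops.
  augment = transforms.Compose([
      transforms.RandomHorizontalFlip(),
      transforms.RandomCrop(32, padding=4),
      transforms.ToTensor(),
  ])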

The key insight is that smart architectural choices can embed useful inductive biases into highly expressive models, allowing us to benefit from both fast convergence and strong generalization without sacrificing the ability to capture complex patterns when needed.