The Unreasonable Effectiveness of Deep Learning in Artificial Intelligence
Terrence J. Sejnowski

Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037; and Division of Biological Sciences, University of California San Diego, La Jolla, CA 92093
Edited by David L. Donoho, Stanford University, Stanford, CA, and approved November 22, 2019 (received for review September 17, 2019)
Deep learning networks have been trained to recognize speech, caption photographs, and translate text between languages at high levels of performance. Although applications of deep learning networks to real-world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and nonconvex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated, and insights are being found in the geometry of high-dimensional spaces. A mathematical theory of deep learning would illuminate how they function, allow us to assess the strengths and weaknesses of different network architectures, and lead to major improvements. Deep learning has provided natural ways for humans to communicate with digital devices and is foundational for building artificial general intelligence. Deep learning was inspired by the architecture of the cerebral cortex, and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival, but major breakthroughs will be needed to achieve these goals.

deep learning | artificial intelligence | neural networks

NeurIPS conferences, I oversaw the remarkable evolution of a community that created modern machine learning. This conference has grown steadily and in 2019 attracted over 14,000 participants. Many intractable problems eventually became tractable, and today machine learning serves as a foundation for contemporary artificial intelligence (AI).

The early goals of machine learning were more modest than those of AI. Rather than aiming directly at general intelligence, machine learning started by attacking practical problems in perception, language, motor control, prediction, and inference, using learning from data as the primary tool. In contrast, early attempts in AI were characterized by handcrafted, low-dimensional algorithms. However, this approach worked only in well-controlled environments. For example, in blocks world all objects were rectangular solids, identically painted and in an environment with fixed lighting. These algorithms did not scale up to vision in the real world, where objects have complex shapes, a wide range of reflectances, and lighting conditions are uncontrolled. The real world is high-dimensional and there may not be any low-dimensional model that can be fit to it (2). Similar problems were encountered with early models of natural languages based on symbols and syntax, which ignored the complexities of semantics
belief it was possible to train multilayer networks (8). The Boltzmann machine learning algorithm is local and only depends on correlations

conference was held at the Denver Tech Center in 1987 and has been held annually since then. The first few meetings were sponsored by the IEEE Information Theory Society.
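The locality of the Boltzmann machine rule (8) — each weight moves only in response to the difference between data and model correlations of the two units it connects — can be illustrated with a toy fully visible network. This sketch is illustrative, not the stochastic sampling procedure of ref. 8: the 3-unit model is small enough to enumerate exactly, and the target distribution and constants are made up.

```python
import itertools
import numpy as np

# Fully visible Boltzmann machine over 3 binary (+/-1) units, solved exactly
# by enumeration. The learning rule is purely local: each weight changes by
# the difference between data and model correlations of its two units.
states = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

# Made-up target distribution favoring the two all-agree states.
p_data = np.full(8, 0.2 / 6)
p_data[0] = p_data[7] = 0.4

def model_probs(W):
    # Boltzmann distribution p(s) ∝ exp(-E(s)) with energy E(s) = -1/2 s^T W s.
    energy = -0.5 * np.einsum('ni,ij,nj->n', states, W, states)
    p = np.exp(-energy)
    return p / p.sum()

def correlations(p):
    # Pairwise correlations <s_i s_j> under distribution p.
    return np.einsum('n,ni,nj->ij', p, states, states)

W = np.zeros((3, 3))
for _ in range(300):
    grad = correlations(p_data) - correlations(model_probs(W))  # local rule
    np.fill_diagonal(grad, 0.0)
    W += 0.1 * grad  # gradient ascent on the log-likelihood

kl = float(np.sum(p_data * np.log(p_data / model_probs(W))))
print(f"KL(data || model) after training: {kl:.4f}")
```

Because the rule only compares correlations of connected units in two phases, no global error signal needs to be propagated, which is what made the algorithm biologically appealing.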
2 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1907373117 Sejnowski
Fig. 3. Early perceptrons were large-scale analog systems (3). (Left) An analog perceptron computer receiving a visual input. The racks contained motor-driven potentiometers whose resistance was set by the perceptron learning algorithm. (Right) Article in the New York Times, July 8, 1958, from a UPI wire report. The perceptron machine was expected to cost $100,000 on completion in 1959, or around $1 million in today's dollars; the IBM 704 computer, which cost $2 million in 1958, or $20 million in today's dollars, could perform 12,000 multiplies per second, which was blazingly fast at the time. The much less expensive Samsung Galaxy S6 phone, which can perform 34 billion operations per second, is more than a million times faster. Reprinted from ref. 5.
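The perceptron learning algorithm that drove those motor-controlled potentiometers fits in a few lines. This sketch runs Rosenblatt's rule on made-up, linearly separable 2D data; all values and constants are illustrative.

```python
import numpy as np

# Rosenblatt's perceptron rule: whenever an example is misclassified, move
# the weight vector toward (or away from) that example. The data here are
# made up: labels come from the line x0 + x1 = 0, with a margin enforced.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.3]       # keep points clear of the boundary
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)
b = 0.0
for _ in range(1000):                        # epochs (converges for separable data)
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:           # mistake (or on the boundary)
            w += yi * xi                     # the perceptron update
            b += yi
            mistakes += 1
    if mistakes == 0:                        # perfect separation reached
        break

errors = int(np.sum(np.sign(X @ w + b) != y))
print(f"training errors after learning: {errors}")
```

The convergence theorem guarantees a finite number of updates whenever the data are linearly separable, which is exactly the limitation Minsky and Papert later analyzed.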
even simple methods for regularization, such as weight decay, led to models with surprisingly good generalization.

Even more surprising, stochastic gradient descent of nonconvex loss functions was rarely trapped in local minima. There were long plateaus on the way down when the error hardly changed, followed by sharp drops. Something about these network models and the geometry of their high-dimensional parameter spaces allowed them to navigate efficiently to solutions and achieve good generalization, contrary to the failures predicted by conventional intuition.

Network models are high-dimensional dynamical systems that learn how to map input spaces into output spaces. These functions have special mathematical properties that we are just beginning to understand. Local minima during learning are rare because in the high-dimensional parameter space most critical points are saddle points (11). Another reason why good solutions can be found so easily by stochastic gradient descent is that, unlike low-dimensional models where a unique solution is sought, different networks with good performance converge from random starting points in parameter space. Because of overparameterization (12), the degeneracy of solutions changes the nature of the problem from finding a needle in a haystack to finding a haystack of needles.

Many questions are left unanswered. Why is it possible to generalize from so few examples and so many parameters? Why is stochastic gradient descent so effective at finding useful functions compared to other optimization methods? How large is the set of all good solutions to a problem? Are good solutions related to each other in some way? What are the relationships between architectural features and inductive bias that can improve generalization? The answers to these questions will help us design better network architectures and more efficient learning algorithms.

What no one knew back in the 1980s was how well neural network learning algorithms would scale with the number of units and weights in the network. Unlike many AI algorithms that scale combinatorially, as deep learning networks expanded in size, training scaled linearly with the number of parameters and performance continued to improve as more layers were added (13). Furthermore, the massively parallel architectures of deep learning networks can be efficiently implemented by multicore chips. The complexity of learning and inference with fully parallel hardware is O(1), meaning that the time it takes to process an input is independent of the size of the network. This is a rare conjunction of favorable computational properties.

When a new class of functions is introduced, it takes generations to fully explore it. For example, when Joseph Fourier introduced Fourier series in 1807, he could not prove convergence and their status as functions was questioned. This did not stop engineers from using Fourier series to solve the heat equation and applying them to other practical problems. The study of this class of functions eventually led to deep insights into functional analysis, a jewel in the crown of mathematics.

The Nature of Deep Learning

The third wave of exploration into neural network architectures, unfolding today, has greatly expanded beyond its academic origins, following the first 2 waves spurred by perceptrons in the 1950s and multilayer neural networks in the 1980s. The press has rebranded deep learning as AI. What deep learning has done for AI is to ground it in the real world. The real world is analog, noisy, uncertain, and high-dimensional, which never jibed with the black-and-white world of symbols and rules in traditional AI. Deep learning provides an interface between these 2 worlds. For example, natural language processing has traditionally been cast as a problem in symbol processing. However, end-to-end learning of language translation in recurrent neural networks extracts both syntactic and semantic information from sentences. Natural language applications often start not with symbols but with word embeddings in deep learning networks trained to predict the next word in a sentence (14), which are semantically deep and represent relationships between words as well as associations. Once regarded as "just statistics," deep recurrent networks are high-dimensional dynamical systems through which information flows much as electrical activity flows through brains.
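A minimal version of next-word prediction with word embeddings (14) can be sketched as follows. The toy corpus, embedding dimension, and learning rate are made up for illustration; real systems train on billions of words.

```python
import numpy as np

# Toy next-word predictor: one embedding per word, a softmax over the
# vocabulary, trained by SGD on a bigram objective. All values illustrative.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8               # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.1, (V, D))   # input word embeddings
W = rng.normal(0.0, 0.1, (D, V))   # output projection to vocabulary logits

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

losses = []
for _ in range(300):               # epochs
    total = 0.0
    for t in range(len(corpus) - 1):
        i, j = idx[corpus[t]], idx[corpus[t + 1]]
        h = E[i]                   # embedding of the current word
        p = softmax(h @ W)         # predicted next-word distribution
        total += -np.log(p[j])     # cross-entropy loss
        dz = p.copy()
        dz[j] -= 1.0               # gradient of the loss w.r.t. the logits
        dh = W @ dz
        W -= 0.1 * np.outer(h, dz)
        E[i] -= 0.1 * dh
    losses.append(total / (len(corpus) - 1))

print(f"mean loss per word: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

After training, words that occur in interchangeable contexts (here "cat" and "dog") tend to acquire similar embedding vectors, which is the sense in which embeddings represent relationships between words rather than mere symbol identities.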
Fig. 4. Nature has optimized birds for energy efficiency. (A) The curved feathers at the wingtips of an eagle boost energy efficiency during gliding. (B) Winglets on commercial jets save fuel by reducing drag from vortices.

of times during the night and are associated with the consolidation of memories. Spindles are triggered by the replay of recent
states to guide behavior, representing negative rewards, surprise, confidence, and temporal discounting (28).
Motor systems are another area of AI where biologically inspired solutions may be helpful. Compare the fluid flow of animal movements to the rigid motions of most robots. The key difference is the exceptional flexibility exhibited in the control of high-dimensional musculature in all animals. Coordinated behavior in high-dimensional motor planning spaces is an active area of investigation in deep learning networks (29). There is also a need for a theory of distributed control to explain how the multiple layers of control in the spinal cord, brainstem, and forebrain are coordinated. Both brains and control systems have to deal with time delays in feedback loops, which can become unstable. The forward model of the body in the cerebellum provides a way to predict the sensory outcome of a motor command, and the sensory prediction errors are used to optimize open-loop control. For example, the vestibulo-ocular reflex (VOR) stabilizes the image on the retina despite head movements by rapidly using head acceleration signals in an open loop; the gain of the VOR is adapted by slip signals from the retina, which the cerebellum uses to reduce the slip (30). Brains have additional constraints due to the limited bandwidth of sensory and motor nerves, but these can be overcome in layered control systems with components having a diversity of speed–accuracy trade-offs (31). A similar diversity is also present in engineered systems, allowing fast and accurate control despite having imperfect components (32).
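The VOR adaptation loop described above can be caricatured in a few lines. The gain, learning rate, and error signal below are illustrative stand-ins, not a model from ref. 30: open-loop compensation runs fast, while a slow, slip-driven update recalibrates it.

```python
import numpy as np

# Toy vestibulo-ocular reflex (VOR): eye velocity = -gain * head velocity.
# Retinal slip is the residual image motion; a cerebellum-like update nudges
# the open-loop gain to cancel it. All constants are made up for illustration.
rng = np.random.default_rng(2)
gain = 0.5                      # initial, miscalibrated VOR gain
lr = 0.05                       # adaptation rate driven by retinal slip

for _ in range(500):
    head = rng.normal()         # head velocity sample
    eye = -gain * head          # fast open-loop compensation (no feedback delay)
    slip = head + eye           # residual image motion on the retina
    gain += lr * slip * head    # slow, slip-correlated gain adjustment

print(f"adapted gain: {gain:.3f}")  # should approach 1.0 (perfect cancellation)
```

The point of the sketch is the division of labor: the fast pathway never waits for visual feedback, so it is immune to delay-induced instability, while the slow error-driven pathway keeps it calibrated.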
Fig. 5. Levels of investigation of brains. Energy efficiency is achieved by signaling with small numbers of molecules at synapses. Interconnects between neurons in the brain are 3D. Connectivity is high locally but relatively sparse between distant cortical areas. The organizing principle in the cortex is based on multiple maps of sensory and motor surfaces in a hierarchy. The cortex coordinates with many subcortical areas to form the central nervous system (CNS) that generates behavior.

niches, but general abstract reasoning emerged more recently in the human lineage. However, we are not very good at it and need long training to achieve the ability to reason logically. This is because we are using brain systems to simulate logical steps that have not been optimized for logic. Students in grade school work for years to master simple arithmetic, effectively emulating a digital computer with a 1-s clock. Nonetheless, reasoning in humans is proof of principle that it should be possible to evolve large-scale systems of deep learning networks for rational planning and decision making. However, a hybrid solution might also be possible, similar to the neural Turing machines developed by DeepMind for learning how to copy, sort, and navigate (33). According to Orgel's Second Rule, nature is cleverer than we are, but improvements may still be possible.

episodes experienced during the day and are parsimoniously integrated into long-term cortical semantic memory (21, 22).
The Future of Deep Learning

Although the focus today on deep learning was inspired by the cerebral cortex, a much wider range of architectures is needed to control movements and vital functions. Subcortical parts of mammalian brains essential for survival can be found in all vertebrates, including the basal ganglia, which are responsible for reinforcement learning, and the cerebellum, which provides the brain with forward models of motor commands. Humans are hypersocial, with extensive cortical and subcortical neural circuits to support complex social interactions (23). These brain areas will provide inspiration to those who aim to build autonomous AI systems.

For example, the dopamine neurons in the brainstem compute reward prediction error, which is a key computation in the temporal difference learning algorithm in reinforcement learning and, in conjunction with deep learning, powered AlphaGo to beat Ke Jie, the world champion Go player, in 2017 (24, 25). Dopamine neurons in the midbrain, which project diffusely throughout the cortex and basal ganglia, modulate synaptic plasticity and provide motivation for obtaining long-term rewards (26). Subsequent confirmation of the role of dopamine neurons in humans has led to a new field, neuroeconomics, whose goal is to

Recent successes with supervised learning in deep networks have led to a proliferation of applications where large datasets are available. Language translation was greatly improved by training on large corpora of translated texts. However, there are many applications for which large sets of labeled data are not available. Humans commonly make subconscious predictions about outcomes in the physical world and are surprised by the unexpected. Self-supervised learning, in which the goal of learning is to predict the future output from other data streams, is a promising direction (34). Imitation learning is also a powerful way to learn important behaviors and gain knowledge about the world (35). Humans have many ways to learn and require a long period of development to achieve adult levels of performance.

Brains intelligently and spontaneously generate ideas and solutions to problems. When a subject is asked to lie quietly at rest in a brain scanner, activity switches from sensorimotor areas to a default mode network of areas that support inner thoughts, including unconscious activity. Generative neural network models can learn without supervision, with the goal of learning joint probability distributions from raw sensory data, which is abundant. The Boltzmann machine is an example of a generative model (8).
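The reward prediction error that dopamine neurons are thought to compute is the core signal of temporal difference learning. This tabular TD(0) sketch on a made-up 5-state chain is illustrative only, not the AlphaGo machinery: the agent walks right and receives a reward of 1 at the terminal state, and each state's value is nudged toward its bootstrapped target.

```python
import numpy as np

# Tabular TD(0) on a 5-state chain. `delta` is the reward prediction error:
# the difference between the predicted value and the observed reward plus
# the discounted value of the next state. States and rewards are made up.
n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states + 1)               # V[5] is the terminal state (value 0)

for _ in range(500):                     # episodes
    for s in range(n_states):            # deterministic walk 0 -> 1 -> ... -> 5
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0
        delta = r + gamma * V[s_next] - V[s]   # reward prediction error
        V[s] += alpha * delta                  # nudge the value estimate

print(np.round(V[:n_states], 3))         # approaches [gamma^4, ..., gamma, 1.0]
```

Once the values are accurate, `delta` goes to zero except when something unexpected happens, mirroring the observation that dopamine neurons fire to unpredicted rewards and fall silent once rewards are fully predicted.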
1. E. A. Abbott, Flatland: A Romance in Many Dimensions (Seeley & Co., London, 1884).
2. L. Breiman, Statistical modeling: The two cultures. Stat. Sci. 16, 199–231 (2001).
3. N. Chomsky, Knowledge of Language: Its Nature, Origins, and Use (Convergence, Praeger, Westport, CT, 1986).
4. T. J. Sejnowski, The Deep Learning Revolution: Artificial Intelligence Meets Human Intelligence (MIT Press, Cambridge, MA, 2018).
5. F. Rosenblatt, Perceptrons and the Theory of Brain Mechanics (Cornell Aeronautical Lab Inc., Buffalo, NY, 1961), vol. VG-1196-G, p. 621.
6. W. S. McCulloch, W. H. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
7. M. Minsky, S. Papert, Perceptrons (MIT Press, Cambridge, MA, 1969).
8. D. H. Ackley, G. E. Hinton, T. J. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
9. T. J. Sejnowski, The book of Hebb. Neuron 24, 773–776 (1999).
10. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
11. R. Pascanu, Y. N. Dauphin, S. Ganguli, Y. Bengio, On the saddle point problem for non-convex optimization. arXiv:1405.4604 (19 May 2014).
12. P. L. Bartlett, P. M. Long, G. Lugosi, A. Tsigler, Benign overfitting in linear regression. arXiv:1906.11300 (26 June 2019).
13. T. Poggio, A. Banburski, Q. Liao, Theoretical issues in deep networks: Approximation, optimization and generalization. arXiv:1908.09375 (25 August 2019).
14. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, "Distributed representations of words and phrases and their compositionality" in Proceedings of the 26th International Conference on Neural Information Processing Systems (Curran Associates, 2013), vol. 2, pp. 3111–3119.
15. D. McCullough, The Wright Brothers (Simon & Schuster, New York, 2015).
16. S. Navlakha, Z. Bar-Joseph, Algorithms in nature: The convergence of systems biology and computational thinking. Mol. Syst. Biol. 7, 546 (2011).
17. S. B. Laughlin, T. J. Sejnowski, Communication in neuronal networks. Science 301, 1870–1874 (2003).
18. K. Zhang, T. J. Sejnowski, A universal scaling law between gray matter and white matter of cerebral cortex. Proc. Natl. Acad. Sci. U.S.A. 97, 5621–5626 (2000).
19. S. Srinivasan, C. F. Stevens, Scaling principles of distributed circuits. Curr. Biol. 29, 2533–2540.e7 (2019).
20. G. Anthes, Lifelong learning in artificial neural networks. Commun. ACM 62, 13–15 (2019).
21. L. Muller et al., Rotating waves during human sleep spindles organize global patterns of activity during the night. eLife 5, 17267 (2016).
22. R. Todorova, M. Zugaro, Isolated cortical computations during delta waves support memory consolidation. Science 366, 377–381 (2019).
23. P. S. Churchland, Conscience: The Origins of Moral Intuition (W. W. Norton, New York, 2019).
24. Wikipedia, AlphaGo versus Ke Jie. https://en.wikipedia.org/wiki/AlphaGo_versus_Ke_Jie. Accessed 8 January 2020.
25. D. Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
26. P. R. Montague, P. Dayan, T. J. Sejnowski, A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
27. P. W. Glimcher, C. Camerer, R. A. Poldrack, E. Fehr, Neuroeconomics: Decision Making and the Brain (Academic Press, New York, 2008).
28. E. Marder, Neuromodulation of neuronal circuits: Back to the future. Neuron 76, 1–11 (2012).
29. I. Akkaya et al., Solving Rubik's Cube with a robot hand. arXiv:1910.07113 (16 October 2019).
30. S. du Lac, J. L. Raymond, T. J. Sejnowski, S. G. Lisberger, Learning and memory in the vestibulo-ocular reflex. Annu. Rev. Neurosci. 18, 409–441 (1995).
31. Y. Nakahira, Q. Liu, T. J. Sejnowski, J. C. Doyle, Fitts' law for speed-accuracy trade-off describes a diversity-enabled sweet spot in sensorimotor control. arXiv:1906.00905 (18 September 2019).
32. Y. Nakahira, Q. Liu, T. J. Sejnowski, J. C. Doyle, Diversity-enabled sweet spots in layered architectures and speed-accuracy trade-offs in sensorimotor control. arXiv:1909.08601 (18 September 2019).
33. A. Graves, G. Wayne, I. Danihelka, Neural Turing machines. arXiv:1410.5401 (20 October 2014).
34. A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, A. Torralba, Self-supervised audio-visual co-segmentation. arXiv:1904.09013 (18 April 2019).
35. S. Schaal, Is imitation learning the route to humanoid robots? Trends Cogn. Sci. 3, 233–242 (1999).
36. G. E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006).
37. I. J. Goodfellow et al., Generative adversarial nets. arXiv:1406.2661 (10 June 2014).
38. E. P. Wigner, The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant Lecture in Mathematical Sciences delivered at New York University, May 11, 1959. Commun. Pure Appl. Math. 13, 1–14 (1960).