
COLLOQUIUM PAPER

The unreasonable effectiveness of deep learning in artificial intelligence

Terrence J. Sejnowski a,b,1

a Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037; and b Division of Biological Sciences, University of California San Diego, La Jolla, CA 92093

Edited by David L. Donoho, Stanford University, Stanford, CA, and approved November 22, 2019 (received for review September 17, 2019)

Deep learning networks have been trained to recognize speech, caption photographs, and translate text between languages at high levels of performance. Although applications of deep learning networks to real-world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and nonconvex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated and insights are being found in the geometry of high-dimensional spaces. A mathematical theory of deep learning would illuminate how they function, allow us to assess the strengths and weaknesses of different network architectures, and lead to major improvements. Deep learning has provided natural ways for humans to communicate with digital devices and is foundational for building artificial general intelligence. Deep learning was inspired by the architecture of the cerebral cortex and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival, but major breakthroughs will be needed to achieve these goals.

deep learning | artificial intelligence | neural networks

In 1884, Edwin Abbott wrote Flatland: A Romance of Many Dimensions (1) (Fig. 1). This book was written as a satire on Victorian society, but it has endured because of its exploration of how dimensionality can change our intuitions about space. Flatland was a 2-dimensional (2D) world inhabited by geometrical creatures. The mathematics of 2 dimensions was fully understood by these creatures, with circles being more perfect than triangles. In it a gentleman square has a dream about a sphere and wakes up to the possibility that his universe might be much larger than he or anyone in Flatland could imagine. He was not able to convince anyone that this was possible and in the end he was imprisoned. We can easily imagine adding another spatial dimension when going from a 1-dimensional to a 2D world and from a 2D to a 3-dimensional (3D) world. Lines can intersect themselves in 2 dimensions and sheets can fold back onto themselves in 3 dimensions, but imagining how a 3D object can fold back on itself in a 4-dimensional space is a stretch that was achieved by Charles Howard Hinton in the 19th century (https://en.wikipedia.org/wiki/Charles_Howard_Hinton). What are the properties of spaces having even higher dimensions? What is it like to live in a space with 100 dimensions, or a million dimensions, or a space like our brain that has a million billion dimensions (the number of synapses between neurons)?

Fig. 1. Cover of the 1884 edition of Flatland: A Romance in Many Dimensions by Edwin A. Abbott (1). Inhabitants were 2D shapes, with their rank in society determined by the number of sides.

The first Neural Information Processing Systems (NeurIPS) Conference and Workshop took place at the Denver Tech Center in 1987 (Fig. 2). The 600 attendees were from a wide range of disciplines, including physics, neuroscience, psychology, statistics, electrical engineering, computer science, computer vision, speech recognition, and robotics, but they all had something in common: They all worked on intractably difficult problems that were not easily solved with traditional methods and they tended to be outliers in their home disciplines. In retrospect, 33 y later, these misfits were pushing the frontiers of their fields into high-dimensional spaces populated by big datasets, the world we are living in today. As the president of the foundation that organizes the annual NeurIPS conferences, I oversaw the remarkable evolution of a community that created modern machine learning. This conference has grown steadily and in 2019 attracted over 14,000 participants. Many intractable problems eventually became tractable, and today machine learning serves as a foundation for contemporary artificial intelligence (AI).

Fig. 2. The Neural Information Processing Systems conference brought together researchers from many fields of science and engineering. The first conference was held at the Denver Tech Center in 1987 and has been held annually since then. The first few meetings were sponsored by the IEEE Information Theory Society.

The early goals of machine learning were more modest than those of AI. Rather than aiming directly at general intelligence, machine learning started by attacking practical problems in perception, language, motor control, prediction, and inference using learning from data as the primary tool. In contrast, early attempts in AI were characterized by low-dimensional algorithms that were handcrafted. However, this approach only worked for well-controlled environments. For example, in blocks world all objects were rectangular solids, identically painted and in an environment with fixed lighting. These algorithms did not scale up to vision in the real world, where objects have complex shapes, a wide range of reflectances, and lighting conditions are uncontrolled. The real world is high-dimensional and there may not be any low-dimensional model that can be fit to it (2). Similar problems were encountered with early models of natural languages based on symbols and syntax, which ignored the complexities of semantics (3). Practical natural language applications became possible once the complexity of deep learning language models approached the complexity of the real world. Models of natural language with millions of parameters and trained with millions of labeled examples are now used routinely. Even larger deep learning language networks are in production today, providing services to millions of users online, less than a decade since they were introduced.


Origins of Deep Learning

I have written a book, The Deep Learning Revolution: Artificial Intelligence Meets Human Intelligence (4), which tells the story of how deep learning came about. Deep learning was inspired by the massively parallel architecture found in brains and its origins can be traced to Frank Rosenblatt's perceptron (5) in the 1950s, which was based on a simplified model of a single neuron introduced by McCulloch and Pitts (6). The perceptron performed pattern recognition and learned to classify labeled examples (Fig. 3). Rosenblatt proved a theorem that if there was a set of parameters that could classify new inputs correctly, and there were enough examples, his learning algorithm was guaranteed to find it. The learning algorithm used labeled data to make small changes to parameters, which were the weights on the inputs to a binary threshold unit, implementing gradient descent. This simple paradigm is at the core of much larger and more sophisticated neural network architectures today, but the jump from perceptrons to deep learning was not a smooth one. There are lessons to be learned from how this happened.
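
To make this learning rule concrete, here is a minimal sketch of a perceptron in Python; the toy dataset, learning rate, and number of epochs are illustrative assumptions, not Rosenblatt's original settings.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Train a binary threshold unit on labeled examples.

    X has shape (n_examples, n_features); labels y are in {-1, +1}.
    Each mistake nudges the weights toward the correct side of the
    decision boundary, the small parameter changes described above.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            prediction = 1 if np.dot(w, x) + b > 0 else -1  # binary threshold unit
            if prediction != target:                        # update only on errors
                w += lr * target * x
                b += lr * target
    return w, b

# Toy linearly separable problem (logical OR with -1/+1 labels).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w, b = train_perceptron(X, y)
print(w, b)
```

If the classes are linearly separable, repeated error-driven updates of this kind will find a separating set of weights, which is the content of Rosenblatt's convergence theorem.
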
The perceptron learning algorithm required computing with real numbers, which digital computers performed inefficiently in the 1950s. Rosenblatt received a grant for the equivalent today of $1 million from the Office of Naval Research to build a large analog computer that could perform the weight updates in parallel using banks of motor-driven potentiometers representing variable weights (Fig. 3). The great expectations in the press (Fig. 3) were dashed by Minsky and Papert (7), who showed in their book Perceptrons that a perceptron can only represent categories that are linearly separable in weight space. Although at the end of their book Minsky and Papert considered the prospect of generalizing single- to multiple-layer perceptrons, one layer feeding into the next, they doubted there would ever be a way to train these more powerful multilayer perceptrons. Unfortunately, many took this doubt to be definitive, and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s.

Fig. 3. Early perceptrons were large-scale analog systems (3). (Left) An analog perceptron computer receiving a visual input. The racks contained potentiometers driven by motors whose resistance was controlled by the perceptron learning algorithm. (Right) Article in the New York Times, July 8, 1958, from a UPI wire report. The perceptron machine was expected to cost $100,000 on completion in 1959, or around $1 million in today's dollars; the IBM 704 computer that cost $2 million in 1958, or $20 million in today's dollars, could perform 12,000 multiplies per second, which was blazingly fast at the time. The much less expensive Samsung Galaxy S6 phone, which can perform 34 billion operations per second, is more than a million times faster. Reprinted from ref. 5.

The computational power available for research in the 1960s was puny compared to what we have today; this favored programming rather than learning, and early progress with writing programs to solve toy problems looked encouraging. By the 1970s, learning had fallen out of favor, but by the 1980s digital computers had increased in speed, making it possible to simulate modestly sized neural networks. During the ensuing neural network revival in the 1980s, Geoffrey Hinton and I introduced a learning algorithm for Boltzmann machines, proving that contrary to general belief it was possible to train multilayer networks (8). The Boltzmann machine learning algorithm is local and only depends on correlations between the inputs and outputs of single neurons, a form of Hebbian plasticity that is found in the cortex (9). Intriguingly, the correlations computed during training must be normalized by correlations that occur without inputs, which we called the sleep state, to prevent self-referential learning. It is also possible to learn the joint probability distributions of inputs without labels in an unsupervised learning mode. However, another learning algorithm introduced at around the same time based on the backpropagation of errors was much more efficient, though at the expense of locality (10). Both of these learning algorithms use stochastic gradient descent, an optimization technique that incrementally changes the parameter values to minimize a loss function. Typically this is done after averaging the gradients for a small batch of training examples.
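
As a concrete illustration of stochastic gradient descent with gradients averaged over a small batch, here is a hedged sketch that fits a linear model to synthetic data; the model, data, batch size, and learning rate are illustrative choices, not the multilayer networks discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ w_true + noise (illustrative only).
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(10)                 # parameters to be learned
lr, batch_size = 0.05, 32

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a small batch
    Xb, yb = X[idx], y[idx]
    error = Xb @ w - yb
    grad = Xb.T @ error / batch_size                # gradient of the squared loss,
                                                    # averaged over the batch
    w -= lr * grad                                  # incremental parameter change

print(np.max(np.abs(w - w_true)))                   # small residual error
```
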
Lost in Parameter Space

The network models in the 1980s rarely had more than one layer of hidden units between the inputs and outputs, but they were already highly overparameterized by the standards of statistical learning. Empirical studies uncovered a number of paradoxes that could not be explained at the time. Even though the networks were tiny by today's standards, they had orders of magnitude more parameters than traditional statistical models. According to bounds from theorems in statistics, generalization should not be possible with the relatively small training sets that were available. However, even simple methods for regularization, such as weight decay, led to models with surprisingly good generalization.
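
Weight decay is simple to state in code: add a small multiple of each weight to its gradient, which shrinks the parameters toward zero at every update (equivalently, an L2 penalty on the loss). A generic one-step sketch, with an arbitrary decay coefficient:

```python
def sgd_step_with_weight_decay(w, grad, lr=0.05, weight_decay=1e-4):
    """One SGD update with weight decay: w <- w - lr * (grad + weight_decay * w).

    The decay term pulls weights toward zero, a simple regularizer of the
    kind credited above with surprisingly good generalization.
    """
    return w - lr * (grad + weight_decay * w)
```
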
Even more surprising, stochastic gradient descent of nonconvex loss functions was rarely trapped in local minima. There were long plateaus on the way down when the error hardly changed, followed by sharp drops. Something about these network models and the geometry of their high-dimensional parameter spaces allowed them to navigate efficiently to solutions and achieve good generalization, contrary to the failures predicted by conventional intuition.

Network models are high-dimensional dynamical systems that learn how to map input spaces into output spaces. These functions have special mathematical properties that we are just beginning to understand. Local minima during learning are rare because in the high-dimensional parameter space most critical points are saddle points (11). Another reason why good solutions can be found so easily by stochastic gradient descent is that, unlike low-dimensional models where a unique solution is sought, different networks with good performance converge from random starting points in parameter space. Because of overparameterization (12), the degeneracy of solutions changes the nature of the problem from finding a needle in a haystack to a haystack of needles.

Many questions are left unanswered. Why is it possible to generalize from so few examples and so many parameters? Why is stochastic gradient descent so effective at finding useful functions compared to other optimization methods? How large is the set of all good solutions to a problem? Are good solutions related to each other in some way? What are the relationships between architectural features and inductive bias that can improve generalization? The answers to these questions will help us design better network architectures and more efficient learning algorithms.

What no one knew back in the 1980s was how well neural network learning algorithms would scale with the number of units and weights in the network. Unlike many AI algorithms that scale combinatorially, as deep learning networks expanded in size, training scaled linearly with the number of parameters and performance continued to improve as more layers were added (13). Furthermore, the massively parallel architectures of deep learning networks can be efficiently implemented by multicore chips. The complexity of learning and inference with fully parallel hardware is O(1). This means that the time it takes to process an input is independent of the size of the network. This is a rare conjunction of favorable computational properties.

When a new class of functions is introduced, it takes generations to fully explore them. For example, when Joseph Fourier introduced Fourier series in 1807, he could not prove convergence and their status as functions was questioned. This did not stop engineers from using Fourier series to solve the heat equation and apply them to other practical problems. The study of this class of functions eventually led to deep insights into functional analysis, a jewel in the crown of mathematics.

The Nature of Deep Learning

The third wave of exploration into neural network architectures, unfolding today, has greatly expanded beyond its academic origins, following the first 2 waves spurred by perceptrons in the 1950s and multilayer neural networks in the 1980s. The press has rebranded deep learning as AI. What deep learning has done for AI is to ground it in the real world. The real world is analog, noisy, uncertain, and high-dimensional, which never jibed with the black-and-white world of symbols and rules in traditional AI. Deep learning provides an interface between these 2 worlds. For example, natural language processing has traditionally been cast as a problem in symbol processing. However, end-to-end learning of language translation in recurrent neural networks extracts both syntactic and semantic information from sentences. Natural language applications often start not with symbols but with word embeddings in deep learning networks trained to predict the next word in a sentence (14), which are semantically deep and represent relationships between words as well as associations. Once regarded as "just statistics," deep recurrent networks are high-dimensional dynamical systems through which information flows much as electrical activity flows through brains.
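
The idea of learning word embeddings by training a network to predict the next word can be sketched with a toy bigram predictor and a softmax output layer; the tiny corpus, embedding dimension, and training schedule are illustrative stand-ins for the much larger models cited above (14).

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
ix = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                    # vocabulary size, embedding dimension

E = 0.1 * rng.normal(size=(V, D))       # input word embeddings
U = 0.1 * rng.normal(size=(D, V))       # output projection onto the vocabulary
lr = 0.1

for epoch in range(200):
    for prev, nxt in zip(corpus[:-1], corpus[1:]):
        h = E[ix[prev]]                 # embedding of the current word
        logits = h @ U
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # softmax prediction of the next word
        grad = p.copy()
        grad[ix[nxt]] -= 1.0            # gradient of cross-entropy wrt logits
        E[ix[prev]] -= lr * (U @ grad)  # backpropagate into the embedding
        U -= lr * np.outer(h, grad)     # and into the output weights
```

After training, words that occur in similar contexts end up with similar embedding vectors, which is the sense in which such representations capture relationships between words.
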



One of the early tensions in AI research in the 1960s was its relationship to human intelligence. The engineering goal of AI was to reproduce the functional capabilities of human intelligence by writing programs based on intuition. I once asked Allen Newell, a computer scientist from Carnegie Mellon University and one of the pioneers of AI who attended the seminal Dartmouth summer conference in 1956, why AI pioneers had ignored brains, the substrate of human intelligence. The performance of brains was the only existence proof that any of the hard problems in AI could be solved. He told me that he personally had been open to insights from brain research but there simply had not been enough known about brains at the time to be of much help. Over time, the attitude in AI had changed from "not enough is known" to "brains are not relevant." This view was commonly justified by an analogy with aviation: "If you want to build a flying machine, you would be wasting your time studying birds that flap their wings or the properties of their feathers." Quite to the contrary, the Wright Brothers were keen observers of gliding birds, which are highly efficient flyers (15). What they learned from birds was ideas for designing practical airfoils and basic principles of aerodynamics. Modern jets have even sprouted winglets at the tips of wings, which save 5% on fuel and look suspiciously like wingtips on eagles (Fig. 4). Much more is now known about how brains process sensory information, accumulate evidence, make decisions, and plan future actions. Deep learning was similarly inspired by nature. There is a burgeoning new field in computer science, called algorithmic biology, which seeks to describe the wide range of problem-solving strategies used by biological systems (16). The lesson here is that we can learn from nature general principles and specific solutions to complex problems, honed by evolution and passed down the chain of life to humans.

Fig. 4. Nature has optimized birds for energy efficiency. (A) The curved feathers at the wingtips of an eagle boost energy efficiency during gliding. (B) Winglets on commercial jets save fuel by reducing drag from vortices.

There is a stark contrast between the complexity of real neurons and the simplicity of the model neurons in neural network models. Neurons are themselves complex dynamical systems with a wide range of internal time scales. Much of the complexity of real neurons is inherited from cell biology—the need for each cell to generate its own energy and maintain homeostasis under a wide range of challenging conditions. However, other features of neurons are likely to be important for their computational function, some of which have not yet been exploited in model networks. These features include a diversity of cell types, optimized for specific functions; short-term synaptic plasticity, which can be either facilitating or depressing on time scales of seconds; a cascade of biochemical reactions underlying plasticity inside synapses controlled by the history of inputs that extends from seconds to hours; sleep states during which a brain goes offline to restructure itself; and communication networks that control traffic between brain areas (17). Synergies between brains and AI may now be possible that could benefit both biology and engineering.

The neocortex appeared in mammals 200 million y ago. It is a folded sheet of neurons on the outer surface of the brain, called the gray matter, which in humans is about 30 cm in diameter and 5 mm thick when flattened. There are about 30 billion cortical neurons forming 6 layers that are highly interconnected with each other in a local stereotyped pattern. The cortex greatly expanded in size relative to the central core of the brain during evolution, especially in humans, where it constitutes 80% of the brain volume. This expansion suggests that the cortical architecture is scalable—more is better—unlike most brain areas, which have not expanded relative to body size. Interestingly, there are many fewer long-range connections than local connections, which form the white matter of the cortex, but its volume scales as the 5/4 power of the gray matter volume and becomes larger than the volume of the gray matter in large brains (18). Scaling laws for brain structures can provide insights into important computational principles (19). Cortical architecture, including cell types and their connectivity, is similar throughout the cortex, with specialized regions for different cognitive systems. For example, the visual cortex has evolved specialized circuits for vision, which have been exploited in convolutional neural networks, the most successful deep learning architecture. Having evolved a general-purpose learning architecture, the neocortex greatly enhances the performance of many special-purpose subcortical structures.
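
A short calculation shows why a 5/4-power scaling law implies that white matter eventually exceeds gray matter: the ratio of white to gray matter volume grows as the 1/4 power of gray matter volume. The constant and volumes below are arbitrary illustrative numbers, not anatomical measurements.

```python
# White matter volume scaling as the 5/4 power of gray matter volume (18):
# W = c * G**1.25, so W / G = c * G**0.25 increases with brain size.
c = 0.1                               # illustrative proportionality constant
for G in [1.0, 10.0, 100.0, 1000.0]:  # gray matter volume, arbitrary units
    W = c * G ** 1.25
    print(f"G = {G:6.0f}   W = {W:8.1f}   W/G = {W / G:5.2f}")
```
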
Brains have 11 orders of magnitude of spatially structured computing components (Fig. 5). At the level of synapses, each cubic millimeter of the cerebral cortex, about the size of a rice grain, contains a billion synapses. The largest deep learning networks today are reaching a billion weights. The cortex has the equivalent power of hundreds of thousands of deep learning networks, each specialized for solving specific problems. How are all these expert networks organized? The levels of investigation above the network level organize the flow of information between different cortical areas, a system-level communications problem. There is much to be learned about how to organize thousands of specialized networks by studying how the global flow of information in the cortex is managed. Long-range connections within the cortex are sparse because they are expensive, both because of the energy demand needed to send information over a long distance and also because they occupy a large volume of space. A switching network routes information between sensory and motor areas that can be rapidly reconfigured to meet ongoing cognitive demands (17).

Fig. 5. Levels of investigation of brains. Energy efficiency is achieved by signaling with small numbers of molecules at synapses. Interconnects between neurons in the brain are 3D. Connectivity is high locally but relatively sparse between distant cortical areas. The organizing principle in the cortex is based on multiple maps of sensory and motor surfaces in a hierarchy. The cortex coordinates with many subcortical areas to form the central nervous system (CNS) that generates behavior.

Another major challenge for building the next generation of AI systems will be memory management for highly heterogeneous systems of deep learning specialist networks. There is a need to flexibly update these networks without degrading already learned memories; this is the problem of maintaining stable, lifelong learning (20). There are ways to minimize memory loss and interference between subsystems. One way is to be selective about where to store new experiences. This occurs during sleep, when the cortex enters globally coherent patterns of electrical activity. Brief oscillatory events, known as sleep spindles, recur thousands of times during the night and are associated with the consolidation of memories. Spindles are triggered by the replay of recent episodes experienced during the day and are parsimoniously integrated into long-term cortical semantic memory (21, 22).

The Future of Deep Learning

Although the focus today on deep learning was inspired by the cerebral cortex, a much wider range of architectures is needed to control movements and vital functions. Subcortical parts of mammalian brains essential for survival can be found in all vertebrates, including the basal ganglia that are responsible for reinforcement learning and the cerebellum, which provides the brain with forward models of motor commands. Humans are hypersocial, with extensive cortical and subcortical neural circuits to support complex social interactions (23). These brain areas will provide inspiration to those who aim to build autonomous AI systems.

For example, the dopamine neurons in the brainstem compute reward prediction error, which is a key computation in the temporal difference learning algorithm in reinforcement learning and, in conjunction with deep learning, powered AlphaGo to beat Ke Jie, the world champion Go player, in 2017 (24, 25). Recordings from dopamine neurons in the midbrain, which project diffusely throughout the cortex and basal ganglia, modulate synaptic plasticity and provide motivation for obtaining long-term rewards (26). Subsequent confirmation of the role of dopamine neurons in humans has led to a new field, neuroeconomics, whose goal is to better understand how humans make economic decisions (27). Several other neuromodulatory systems also control global brain states to guide behavior, representing negative rewards, surprise, confidence, and temporal discounting (28).
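
The reward prediction error described above is the delta term of temporal difference learning. A minimal tabular TD(0) sketch on a toy chain of states follows; the environment, discount factor, and learning rate are illustrative assumptions.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)          # learned value estimates, one per state

for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1          # toy chain: always step right
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the end
        # Reward prediction error: received reward plus discounted future value
        # minus the value that was expected; dopamine firing is thought to
        # track a signal of this form.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
        s = s_next

print(V)  # approaches [gamma**3, gamma**2, gamma, 1.0, 0.0]
```
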
Motor systems are another area of AI where biologically inspired solutions may be helpful. Compare the fluid flow of animal movements to the rigid motions of most robots. The key difference is the exceptional flexibility exhibited in the control of high-dimensional musculature in all animals. Coordinated behavior in high-dimensional motor planning spaces is an active area of investigation in deep learning networks (29). There is also a need for a theory of distributed control to explain how the multiple layers of control in the spinal cord, brainstem, and forebrain are coordinated. Both brains and control systems have to deal with time delays in feedback loops, which can become unstable. The forward model of the body in the cerebellum provides a way to predict the sensory outcome of a motor command, and the sensory prediction errors are used to optimize open-loop control. For example, the vestibulo-ocular reflex (VOR) stabilizes the image on the retina despite head movements by rapidly using head acceleration signals in an open loop; the gain of the VOR is adapted by slip signals from the retina, which the cerebellum uses to reduce the slip (30). Brains have additional constraints due to the limited bandwidth of sensory and motor nerves, but these can be overcome in layered control systems with components having a diversity of speed–accuracy trade-offs (31). A similar diversity is also present in engineered systems, allowing fast and accurate control despite having imperfect components (32).
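
The VOR description maps onto a small simulation in which eye velocity is an open-loop gain times head velocity and retinal slip slowly adjusts that gain; the learning rate and random head movements are illustrative assumptions, not a model of cerebellar circuitry.

```python
import numpy as np

rng = np.random.default_rng(0)
gain = 0.5              # initial VOR gain (perfect compensation is 1.0)
lr = 0.02               # slow adaptation driven by retinal slip

for t in range(500):
    head_velocity = rng.normal()                 # head movement
    eye_velocity = -gain * head_velocity         # fast open-loop compensation
    retinal_slip = head_velocity + eye_velocity  # residual image motion
    # Cerebellum-like update: reduce slip that is correlated with head motion.
    gain += lr * retinal_slip * head_velocity

print(round(gain, 3))   # approaches 1.0, canceling image motion on the retina
```
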

Toward Artificial General Intelligence

Is there a path from the current state of the art in deep learning to artificial general intelligence? From the perspective of evolution, most animals can solve problems needed to survive in their niches, but general abstract reasoning emerged more recently in the human lineage. However, we are not very good at it and need long training to achieve the ability to reason logically. This is because we are using brain systems to simulate logical steps that have not been optimized for logic. Students in grade school work for years to master simple arithmetic, effectively emulating a digital computer with a 1-s clock. Nonetheless, reasoning in humans is proof of principle that it should be possible to evolve large-scale systems of deep learning networks for rational planning and decision making. However, a hybrid solution might also be possible, similar to neural Turing machines developed by DeepMind for learning how to copy, sort, and navigate (33). According to Orgel's Second Rule, nature is cleverer than we are, but improvements may still be possible.

Recent successes with supervised learning in deep networks have led to a proliferation of applications where large datasets are available. Language translation was greatly improved by training on large corpora of translated texts. However, there are many applications for which large sets of labeled data are not available. Humans commonly make subconscious predictions about outcomes in the physical world and are surprised by the unexpected. Self-supervised learning, in which the goal of learning is to predict the future output from other data streams, is a promising direction (34). Imitation learning is also a powerful way to learn important behaviors and gain knowledge about the world (35). Humans have many ways to learn and require a long period of development to achieve adult levels of performance.

Brains intelligently and spontaneously generate ideas and solutions to problems. When a subject is asked to lie quietly at rest in a brain scanner, activity switches from sensorimotor areas to a default mode network of areas that support inner thoughts, including unconscious activity. Generative neural network models can learn without supervision, with the goal of learning joint probability distributions from raw sensory data, which is abundant. The Boltzmann machine is an example of a generative model (8). After a Boltzmann machine has been trained to classify inputs, clamping an output unit on generates a sequence of examples from that category on the input layer (36). Generative adversarial networks can also generate new samples from a probability distribution learned by self-supervised learning (37). Brains also generate vivid visual images during dream sleep that are often bizarre.
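
The clamping procedure can be sketched as Gibbs sampling in a Boltzmann machine with some units held fixed. For brevity the weights below are random rather than trained, so the sketch shows only the sampling mechanism; the split into input and output units is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 12                                   # total units: "inputs", hidden, "outputs"
W = rng.normal(scale=0.5, size=(n, n))
W = (W + W.T) / 2                        # Boltzmann machine weights are symmetric
np.fill_diagonal(W, 0.0)                 # no self-connections

s = rng.integers(0, 2, size=n).astype(float)   # binary unit states
clamped = {11: 1.0}                      # clamp one "output" unit on
for unit, value in clamped.items():
    s[unit] = value

samples = []
for sweep in range(1000):
    for i in range(n):
        if i in clamped:
            continue                     # clamped units are held fixed
        # Each free unit is resampled stochastically given all the others.
        s[i] = float(rng.random() < sigmoid(W[i] @ s))
    samples.append(s[:8].copy())         # record the "input layer" units

print(np.mean(samples, axis=0))          # input patterns generated under the clamp
```

In a trained network, the same procedure run with an output unit clamped on produces input-layer samples from that category, as described above (36).
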
Looking Ahead

We are at the beginning of a new era that could be called the age of information. Data are gushing from sensors, the sources for pipelines that turn data into information, information into knowledge, knowledge into understanding, and, if we are fortunate, knowledge into wisdom. We have taken our first steps toward dealing with complex high-dimensional problems in the real world; like a baby's, they are more stumble than stride, but what is important is that we are heading in the right direction. Deep learning networks are bridges between digital computers and the real world; this allows us to communicate with computers on our own terms. We already talk to smart speakers, which will become much smarter. Keyboards will become obsolete, taking their place in museums alongside typewriters. This makes the benefits of deep learning available to everyone.

In his essay "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," Eugene Wigner marveled that the mathematical structure of a physical theory often reveals deep insights into that theory that lead to empirical predictions (38). Also remarkable is that there are so few parameters in the equations, called physical constants. The title of this article mirrors Wigner's. However, unlike the laws of physics, there is an abundance of parameters in deep learning networks and they are variable. We are just beginning to explore representation and optimization in very-high-dimensional spaces. Perhaps someday an analysis of the structure of deep learning networks will lead to theoretical predictions and reveal deep insights into the nature of intelligence. We can benefit from the blessings of dimensionality.

Having found one class of functions to describe the complexity of signals in the world, perhaps there are others. Perhaps there is a universe of massively parallel algorithms in high-dimensional spaces that we have not yet explored, which go beyond intuitions from the 3D world we inhabit and the 1-dimensional sequences of instructions in digital computers. Like the gentleman square in Flatland (Fig. 1) and the explorer in the Flammarion engraving (Fig. 6), we have glimpsed a new world stretching far beyond old horizons.

Fig. 6. The caption that accompanies the engraving in Flammarion's book reads: "A missionary of the Middle Ages tells that he had found the point where the sky and the Earth touch . . ." Image courtesy of Wikimedia Commons/Camille Flammarion.

Data Availability

There are no data associated with this paper.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "The Science of Deep Learning," held March 13-14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler's husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/science-of-deep-learning.

Author contributions: T.J.S. wrote the paper.

The author declares no competing interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

1 Email: terry@salk.edu.

1. E. A. Abbott, Flatland: A Romance in Many Dimensions (Seeley & Co., London, 1884).
2. L. Breiman, Statistical modeling: The two cultures. Stat. Sci. 16, 199–231 (2001).
3. N. Chomsky, Knowledge of Language: Its Nature, Origins, and Use (Convergence, Praeger, Westport, CT, 1986).
4. T. J. Sejnowski, The Deep Learning Revolution: Artificial Intelligence Meets Human Intelligence (MIT Press, Cambridge, MA, 2018).
5. F. Rosenblatt, Perceptrons and the Theory of Brain Mechanics (Cornell Aeronautical Lab Inc., Buffalo, NY, 1961), vol. VG-1196-G, p. 621.
6. W. S. McCulloch, W. H. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
7. M. Minsky, S. Papert, Perceptrons (MIT Press, Cambridge, MA, 1969).
8. D. H. Ackley, G. E. Hinton, T. J. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
9. T. J. Sejnowski, The book of Hebb. Neuron 24, 773–776 (1999).
10. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
11. R. Pascanu, Y. N. Dauphin, S. Ganguli, Y. Bengio, On the saddle point problem for non-convex optimization. arXiv:1405.4604 (19 May 2014).
12. P. L. Bartlett, P. M. Long, G. Lugosi, A. Tsigler, Benign overfitting in linear regression. arXiv:1906.11300 (26 June 2019).
13. T. Poggio, A. Banburski, Q. Liao, Theoretical issues in deep networks: Approximation, optimization and generalization. arXiv:1908.09375 (25 August 2019).
14. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, "Distributed representations of words and phrases and their compositionality" in Proceedings of the 26th International Conference on Neural Information Processing Systems (Curran Associates, 2013), vol. 2, pp. 3111–3119.
15. D. McCullough, The Wright Brothers (Simon & Schuster, New York, 2015).
16. S. Navlakha, Z. Bar-Joseph, Algorithms in nature: The convergence of systems biology and computational thinking. Mol. Syst. Biol. 7, 546 (2011).
17. S. B. Laughlin, T. J. Sejnowski, Communication in neuronal networks. Science 301, 1870–1874 (2003).
18. K. Zhang, T. J. Sejnowski, A universal scaling law between gray matter and white matter of cerebral cortex. Proc. Natl. Acad. Sci. U.S.A. 97, 5621–5626 (2000).
19. S. Srinivasan, C. F. Stevens, Scaling principles of distributed circuits. Curr. Biol. 29, 2533–2540.e7 (2019).
20. G. Anthes, Lifelong learning in artificial neural networks. Commun. ACM 62, 13–15 (2019).
21. L. Muller et al., Rotating waves during human sleep spindles organize global patterns of activity during the night. eLife 5, 17267 (2016).
22. R. Todorova, M. Zugaro, Isolated cortical computations during delta waves support memory consolidation. Science 366, 377–381 (2019).
23. P. S. Churchland, Conscience: The Origins of Moral Intuition (W. W. Norton, New York, 2019).
24. Wikipedia, AlphaGo versus Ke Jie. https://en.wikipedia.org/wiki/AlphaGo_versus_Ke_Jie. Accessed 8 January 2020.
25. D. Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
26. P. R. Montague, P. Dayan, T. J. Sejnowski, A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
27. P. W. Glimcher, C. Camerer, R. A. Poldrack, E. Fehr, Neuroeconomics: Decision Making and the Brain (Academic Press, New York, 2008).
28. E. Marder, Neuromodulation of neuronal circuits: Back to the future. Neuron 76, 1–11 (2012).
29. I. Akkaya et al., Solving Rubik's cube with a robot hand. arXiv:1910.07113 (16 October 2019).
30. S. du Lac, J. L. Raymond, T. J. Sejnowski, S. G. Lisberger, Learning and memory in the vestibulo-ocular reflex. Annu. Rev. Neurosci. 18, 409–441 (1995).
31. Y. Nakahira, Q. Liu, T. J. Sejnowski, J. C. Doyle, Fitts' Law for speed-accuracy trade-off describes a diversity-enabled sweet spot in sensorimotor control. arXiv:1906.00905 (18 September 2019).
32. Y. Nakahira, Q. Liu, T. J. Sejnowski, J. C. Doyle, Diversity-enabled sweet spots in layered architectures and speed-accuracy trade-offs in sensorimotor control. arXiv:1909.08601 (18 September 2019).
33. A. Graves, G. Wayne, I. Danihelka, Neural Turing machines. arXiv:1410.5401 (20 October 2014).
34. A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, A. Torralba, Self-supervised audio-visual co-segmentation. arXiv:1904.09013 (18 April 2019).
35. S. Schaal, Is imitation learning the route to humanoid robots? Trends Cogn. Sci. 3, 233–242 (1999).
36. G. E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006).
37. I. J. Goodfellow et al., Generative adversarial nets. arXiv:1406.2661 (10 June 2014).
38. E. P. Wigner, The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant lecture in mathematical sciences delivered at New York University, May 11, 1959. Commun. Pure Appl. Math. 13, 1–14 (1960).
