Compiling Algorithms for Heterogeneous Systems

Synthesis Lectures on Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
2018

Architectural and Operating System Support for Virtual Memory


Abhishek Bhattacharjee and Daniel Lustig
2017

Deep Learning for Computer Architects


Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks
2017

On-Chip Networks, Second Edition


Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh
2017

Space-Time Computing with Temporal Neural Networks


James E. Smith
2017

Hardware and Software Support for Virtualization


Edouard Bugnion, Jason Nieh, and Dan Tsafrir
2017
Datacenter Design and Management: A Computer Architect’s Perspective
Benjamin C. Lee
2016

A Primer on Compression in the Memory Hierarchy


Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
2015

Research Infrastructures for Hardware Accelerators


Yakun Sophia Shao and David Brooks
2015

Analyzing Analytics
Rajesh Bordawekar, Bob Blainey, and Ruchir Puri
2015

Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015

Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015

Single-Instruction Multiple-Data Execution


Christopher J. Hughes
2015

Power-Efficient Computer Architectures: Recent Advances


Magnus Själander, Margaret Martonosi, and Stefanos Kaxiras
2014

FPGA-Accelerated Simulation of Computer Systems


Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
2014

A Primer on Hardware Prefetching


Babak Falsafi and Thomas F. Wenisch
2014

On-Chip Photonic Interconnects: A Computer Architect’s Perspective


Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella
2013
Optimization and Mathematical Modeling in Computer Architecture
Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and
David Wood
2013

Security Basics for Computer Architects


Ruby B. Lee
2013

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013

Shared-Memory Synchronization
Michael L. Scott
2013

Resilient Architecture Design for Voltage Variation


Vijay Janapa Reddi and Meeta Sharma Gupta
2013

Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013

Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012

Automatic Parallelization: An Overview of Fundamental Compiler Techniques


Samuel P. Midkiff
2012

Phase Change Memory: From Devices to Systems


Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran
2011

Multi-Core Cache Hierarchies


Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar
2011

A Primer on Memory Consistency and Cache Coherence


Daniel J. Sorin, Mark D. Hill, and David A. Wood
2011
Dynamic Binary Modification: Tools, Techniques, and Applications
Kim Hazelwood
2011

Quantum Computing for Computer Architects, Second Edition


Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong
2011

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities


Dennis Abts and John Kim
2011

Processor Microarchitecture: An Implementation Perspective


Antonio González, Fernando Latorre, and Grigorios Magklis
2010

Transactional Memory, 2nd edition


Tim Harris, James Larus, and Ravi Rajwar
2010

Computer Architecture Performance Evaluation Methods


Lieven Eeckhout
2010

Introduction to Reconfigurable Supercomputing


Marco Lanzagorta, Stephen Bique, and Robert Rosenberg
2009

On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009

The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009

Fault Tolerant Computer Architecture


Daniel J. Sorin
2009

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency


Kunle Olukotun, Lance Hammond, and James Laudon
2007

Transactional Memory
James R. Larus and Ravi Rajwar
2006

Quantum Computing for Computer Architects


Tzvetan S. Metodi and Frederic T. Chong
2006
Copyright © 2018 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
www.morganclaypool.com

ISBN: 9781627059619 paperback


ISBN: 9781627057301 ebook
ISBN: 9781681732633 hardcover

DOI 10.2200/S00816ED1V01Y201711CAC043

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE

Lecture #43
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
Compiling Algorithms
for Heterogeneous Systems

Steven Bell
Stanford University

Jing Pu
Google

James Hegarty
Oculus

Mark Horowitz
Stanford University

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #43

Morgan & Claypool Publishers
ABSTRACT
Most emerging applications in imaging and machine learning must perform immense amounts
of computation while holding to strict limits on energy and power. To meet these goals, archi-
tects are building increasingly specialized compute engines tailored for these specific tasks. The
resulting computer systems are heterogeneous, containing multiple processing cores with wildly
different execution models. Unfortunately, the cost of producing this specialized hardware—and
the software to control it—is astronomical. Moreover, the task of porting algorithms to these
heterogeneous machines typically requires that the algorithm be partitioned across the machine
and rewritten for each specific architecture, which is time consuming and prone to error.
Over the last several years, the authors have approached this problem using domain-
specific languages (DSLs): high-level programming languages customized for specific domains,
such as database manipulation, machine learning, or image processing. By giving up general-
ity, these languages are able to provide high-level abstractions to the developer while producing
high-performance output. The purpose of this book is to spur the adoption and the creation of
domain-specific languages, especially for the task of creating hardware designs.
In the first chapter, a short historical journey explains the forces driving computer archi-
tecture today. Chapter 2 describes the various methods for producing designs for accelerators,
outlining the push for more abstraction and the tools that enable designers to work at a higher
conceptual level. From there, Chapter 3 provides a brief introduction to image processing al-
gorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and
compare Darkroom and Halide, two domain-specific languages created for image processing
that produce high-performance designs for both FPGAs and CPUs from the same source code,
enabling rapid design cycles and quick porting of algorithms. The final section describes how
the DSL approach also simplifies the problem of interfacing between application code and the
accelerator by generating the driver stack in addition to the accelerator configuration.
This book should serve as a useful introduction to domain-specialized computing for com-
puter architecture students and as a primer on domain-specific languages and image processing
hardware for those with more experience in the field.

KEYWORDS
domain-specific languages, high-level synthesis, compilers, image processing accel-
erators, stencil computation

Contents
Preface

Acknowledgments

1 Introduction
1.1 CMOS Scaling and the Rise of Specialization
1.2 What Will We Build Now?
1.2.1 Performance, Power, and Area
1.2.2 Flexibility
1.3 The Cost of Specialization
1.4 Good Applications for Acceleration

2 Computations and Compilers
2.1 Direct Specification
2.2 Compilers
2.3 High-level Synthesis
2.4 Domain-specific Languages

3 Image Processing with Stencil Pipelines
3.1 Image Signal Processors
3.2 Example Applications

4 Darkroom: A Stencil Language for Image Processing
4.1 Language Description
4.2 A Simple Pipeline in Darkroom
4.3 Optimal Synthesis of Line-buffered Pipelines
4.3.1 Generating Line-buffered Pipelines
4.3.2 Shift Operator
4.3.3 Finding Optimal Shifts
4.4 Implementation
4.4.1 ASIC and FPGA Synthesis
4.4.2 CPU Compilation
4.5 Evaluation
4.5.1 Scheduling for Hardware Synthesis
4.5.2 Scheduling for General-purpose Processors
4.6 Summary

5 Programming CPU/FPGA Systems from Halide
5.1 The Halide Language
5.2 Mapping Halide to Hardware
5.3 Compiler Implementation
5.3.1 Architecture Parameter Extraction
5.3.2 IR Transformation
5.3.3 Loop Perfection Optimization
5.3.4 Code Generation
5.4 Implementation and Evaluation
5.4.1 Programmability and Efficiency
5.4.2 Quality of Hardware Generation
5.5 Conclusion

6 Interfacing with Specialized Hardware
6.1 Common Interfaces
6.2 The Challenge of Interfaces
6.3 Solutions to the Interface Problem
6.3.1 Compiler Support
6.3.2 Library Interface
6.3.3 API plus DSL
6.4 Drivers for Darkroom and Halide on FPGA
6.4.1 Memory and Coherency
6.4.2 Running the Hardware
6.4.3 Generating Systems and Drivers
6.4.4 Generating the Whole Stack with Halide
6.4.5 Heterogeneous System Performance

7 Conclusions and Future Directions

Bibliography

Authors' Biographies

Preface
Cameras are ubiquitous, and computers are increasingly being used to process image data to
produce better images, recognize objects, build representations of the physical world, and extract
salient bits from massive streams of video, among countless other things. But while the data
deluge continues to increase, and while the number of transistors that can be cost-effectively
placed on a silicon die is still going up (for now), limitations on power and energy mean that
traditional CPUs alone are insufficient to meet the demand. As a result, architects are building
more and more specialized compute engines tailored to provide energy and performance gains
on these specific tasks.
Unfortunately, the cost of producing this specialized hardware—and the software to con-
trol it—is astronomical. Moreover, the resulting computer systems are heterogeneous, contain-
ing multiple processing cores with wildly different execution models. The task of porting al-
gorithms to these heterogeneous machines typically requires that the algorithm be partitioned
across the machine and rewritten for each specific architecture, which is time consuming and
prone to error.
Over the last several years, we have approached this problem using domain-specific lan-
guages (DSLs)—high-level programming languages customized for specific domains, such as
database manipulation, machine learning, or image processing. By giving up generality, these
languages are able to provide high-level abstractions to the developer while producing high-
performance output. Our purpose in writing this book is to spur the adoption and the creation
of domain-specific languages, especially for the task of creating hardware designs.
This book is not an exhaustive description of image processing accelerators, nor of domain-
specific languages. Instead, we aim to show why DSLs make sense in light of the current state
of computer architecture and development tools, and to illustrate with some specific examples
what advantages DSLs provide, and what tradeoffs must be made when designing them. Our
examples will come from image processing, and our primary targets are mixed CPU/FPGA
systems, but the underlying techniques and principles apply to other domains and platforms as
well. We assume only passing familiarity with image processing, and focus our discussion on the
architecture and compiler sides of the problem.
In the first chapter, we take a short historical journey to explain the forces driving com-
puter architecture today. Chapter 2 describes the various methods for producing designs for
accelerators, outlining the push for more abstraction and the tools that enable designers to work
at a higher conceptual level. In Chapter 3, we provide a brief introduction to image processing
algorithms and hardware design patterns for implementing them, which we use through the
rest of the book. Chapters 4 and 5 describe Darkroom and Halide, two domain-specific lan-
guages created for image processing. Both are able to produce high-performance designs for
both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick
porting of algorithms. We present both of these examples because comparing and contrasting
them illustrates some of the tradeoffs and design decisions encountered when creating a DSL.
The final portion of the book discusses the task of controlling specialized hardware within a het-
erogeneous system running a multiuser operating system. We give a brief overview of how this
works on Linux and show how DSLs enable us to automatically generate the necessary driver
and interface code, greatly simplifying the creation of that interface.
This book assumes at least some background in computer architecture, such as an advanced
undergraduate or early graduate course in CPU architecture. We also build on ideas from com-
pilers, programming languages, FPGA synthesis, and operating systems, but the book should
be accessible to those without extensive study on these topics.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

Acknowledgments
Any work of this size is necessarily the result of many collaborations. We are grateful to John
Brunhaver, Zachary DeVito, Pat Hanrahan, Jonathan Ragan-Kelley, Steve Richardson, Jeff Set-
ter, Artem Vasilyev, and Xuan Yang, who influenced our thinking on these topics and helped
develop portions of the systems described in this book. We’re also thankful to Mike Morgan,
Margaret Martonosi, and the team at Morgan & Claypool for shepherding us through the
writing and production process, and to the reviewers whose feedback made this a much bet-
ter manuscript than it would have been otherwise.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

CHAPTER 1

Introduction
When the International Technology Roadmap for Semiconductors organization announced its
final roadmap in 2016, it was widely heralded as the official end of Moore’s law [ITRS, 2016].
As we write this, 7 nm technology is still projected to provide cheaper transistors than current
technology, so it isn’t over just yet. But after decades of transistor scaling, the ITRS report
revealed at least modest agreement across the industry that cost-effective scaling to 5 nm and
below was hardly a guarantee.
While the death of Moore’s law remains a topic of debate, there isn’t any debate that the
nature and benefit of scaling has decreased dramatically. Since the early 2000s, scaling has not
brought the power reductions it used to provide. As a result, computing devices are limited by
the electrical power they can dissipate, and this limitation has forced designers to find more
energy-efficient computing structures. In the 2000s this power limitation led to the rise of mul-
ticore processing, and is the reason that practically all current computing devices (outside of
embedded systems) contain multiple CPUs on each die. But multiprocessing was not enough to
continue to scale performance, and specialized processors were also added to systems to make
them more energy efficient. GPUs were added for graphics and data-parallel floating point op-
erations, specialized image and video processors were added to handle video, and digital signal
processors were added to handle the processing required for wireless communication.
On one hand, this shift in structure has made computation more energy efficient; on the
other, it has made programming the resulting systems much more complex. The vast major-
ity of algorithms and programming languages were created for an abstract computing machine
running a single thread of control, with access to the entire memory of the machine. Changing
these algorithms and languages to leverage multiple threads is difficult, and mapping them to
use the specialized processors is near impossible. As a result, accelerators only get used when
performance is essential to the application; otherwise, the code is written for CPU and declared
“good enough.” Unless we develop new languages and tools that dramatically simplify the task
of mapping algorithms onto these modern heterogeneous machines, computing performance
will stagnate.
This book describes one approach to address this issue. By restricting the application do-
main, it is possible to create programming languages and compilers that can ease the burden of
creating and mapping applications to specialized computing resources, allowing us to run com-
plete applications on heterogeneous platforms. We will illustrate this with examples from image
processing and computer vision, but the underlying principles extend to other domains.
The rest of this chapter explains the constraints that any solution to this problem must
work within. The next section briefly reviews how computers were initially able to take advantage
of Moore’s law scaling without changing the programming model, why that is no longer the case,
and why energy efficiency is now key to performance scaling. Section 1.2 then shows how to
compare different power-constrained designs to determine which is best. Since performance
and power are tightly coupled, they both need to be considered to make the best decision. Using
these metrics, and some information about the energy and area cost of different operations, this
section also points out the types of algorithms that benefit the most from specialized compute
engines. While these metrics show the potential of specialization, Section 1.3 describes the costs
of this approach, which historically required large teams to design the customized hardware and
develop the software that ran on it. The remaining chapters in this book describe one approach
that addresses these cost issues.

1.1 CMOS SCALING AND THE RISE OF SPECIALIZATION


From the earliest days of electronic computers, improvements in physical technology have con-
tinually driven computer performance. The first few technology changes were discrete jumps,
first from vacuum tubes to bipolar transistors in the 1950s, and then from discrete transistors to
bipolar integrated circuits (ICs) in the 1960s. Once computers were built with ICs, they were
able to take advantage of Moore’s law, the prediction-turned-industry-roadmap which stated
that the number of components that could be economically packed onto an integrated circuit
would double every two years [Moore, 1965].
As MOS transistor technology matured, gates built with MOS transistors used less power
and area than gates built with bipolar transistors, and it became clear in the late 1970s that MOS
technology would dominate. During this time Robert Dennard at IBM Research published his
paper on MOS scaling rules, which showed different approaches that could be taken to scale
MOS transistors [Dennard et al., 1974]. In particular, he observed that if a transistor’s operating
voltage and doping concentration were scaled along with its physical dimensions, then a number
of other properties scaled nicely as well, and the resized transistor would behave predictably.
If a MOS transistor is shrunk by a factor of 1/κ in each linear dimension, and the operating
voltage is lowered by the same 1/κ, then several things follow:
1. Transistors get smaller, allowing κ² more logic gates in the same silicon area.
2. Voltages and currents inside the transistor scale by a factor of 1/κ.
3. The effective resistance of the transistor, V/I, remains constant, due to 2 above.
4. The gate capacitance C shrinks by a factor of 1/κ (1/κ² due to decreased area, multiplied
by κ due to reduced electrode spacing).
The switching time for a logic gate is proportional to the resistance of the driving transistor
multiplied by the capacitance of the driven transistor. If the effective resistance remains constant
while the capacitance decreases by 1/κ, then the overall delay also decreases by 1/κ, and the chip
can be run faster by a factor of κ.
Taken together, these scaling factors mean that κ² more logic gates are switched κ× faster,
for a total increase of κ³ more gate evaluations per second. At the same time, the energy required
to switch a logic gate is proportional to CV². With both capacitance and voltage decreasing by
a factor of 1/κ, the energy per gate evaluation decreased by a factor of 1/κ³.
During this period, roughly every other year, a new technology process yielded transistors
which were about 1/√2 as large in each dimension. Following Dennard scaling, this would give
a chip with twice as many gates and a faster clock by a factor of 1.4, making it 2.8× more
powerful than the previous one. Simultaneously, however, the energy dissipated by each gate
evaluation dropped by 2.8×, meaning that total power required was the same as the previous
chip. This remarkable result allowed each new generation to achieve nearly a 3× improvement
for the same die area and power.
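In symbols (our own condensed restatement of the argument above, with κ as the Dennard scaling factor):

    \[
    \text{gates per die} \propto \kappa^{2}, \qquad
    t_{\text{gate}} \propto R\,C \propto \frac{1}{\kappa} \;\Rightarrow\; f \propto \kappa, \qquad
    E_{\text{gate}} \propto C V^{2} \propto \frac{1}{\kappa}\cdot\frac{1}{\kappa^{2}} = \frac{1}{\kappa^{3}},
    \]
    \[
    P \propto \text{gates} \cdot f \cdot E_{\text{gate}} \propto \kappa^{2} \cdot \kappa \cdot \frac{1}{\kappa^{3}} = \text{constant}.
    \]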
This scaling is great in theory, but what happened in practice is somewhat more circuitous.
First, until the mid-1980s, most complex ICs were made with nMOS rather than CMOS gates,
which dissipate power even when they aren’t switching (known as static power). Second, during
this period power supply voltages remained at 5 V, a standard set in the bipolar IC days. As
a result of both of these, the power per gate did not change much even as transistors scaled
down. As nMOS chips grew more complex, the power dissipation of these chips became a
serious problem. This eventually forced the entire industry to transition from nMOS to CMOS
technology, despite the additional manufacturing complexity and lower intrinsic gate speed of
CMOS.
After transitioning to CMOS ICs in the mid-1980s, power supply voltages began to scale
down, but not exactly in sync with technology. While transistor density and clock speed contin-
ued to scale, the energy per logic gate dropped more slowly. With the number of gate evaluations
per second increasing faster than the energy of gate evaluation was scaling down, the overall chip
power grew exponentially.
This power scaling is exactly what we see when we look at historical data from CMOS
microprocessors, shown in Figure 1.1. From 1980 to 2000, the number of transistors on a chip
increased by about 500× (Figure 1.1a), which corresponds to scaling transistor feature size by
roughly 20×. During this same period of time, processor clock frequency increased by 100×,
which is 5× faster than one would expect from simple gate speed (Figure 1.1b). Most of this ad-
ditional clock speed gain came from microarchitectural changes to create more deeply pipelined
“short tick” machines with fewer gates per cycle, which were enabled by better circuit designs
of key functional units. While these fast clocks were good for performance, they were bad from
a power perspective.
By 2000, computers were executing 50,000× more gate evaluations per second than they
had in the 1980s. During this time the average capacitance had scaled down, providing a 20×
energy savings, but power supply voltages had only scaled by 4–5× (Figure 1.1c), giving roughly
a 25× savings. Taken together, the capacitance and supply scaling only reduce the gate energy
by around 500×, which means that the power dissipation of the processors should increase by
two orders of magnitude during this period. Figure 1.1d shows that is exactly what happened.

Figure 1.1: From the 1960s until the early 2000s, transistor density and operating frequency
scaled up exponentially, providing exponential performance improvements. Power dissipation
increased but was kept in check by lowering the operating voltage. Panels: (a) transistors per chip,
(b) CPU clock frequency, (c) operating voltage, and (d) power dissipation (TDP). Data from
CPUDB [Danowitz et al., 2012].

Up to this point, all of these additional transistors were used for a host of architectural im-
provements that increased performance even further, including pipelined datapaths, superscalar
instruction issue, and out-of-order execution. However, the instruction set architectures (ISAs)
for various processors generally remained the same through multiple hardware revisions, mean-
ing that existing software could run on the newer machine without modification—and reap a
performance improvement.
But around 2004, Dennard scaling broke down. Lowering the gate threshold voltage fur-
ther caused the leakage power to rise unacceptably high, so supply voltages began to level out just below 1 V.
Without the possibility to manage the power density by scaling voltage, manufacturers hit
the “power wall” (the red line in Figure 1.1d). Chips such as the Intel Pentium 4 were dissipating
a little over 100 W at peak performance, which is roughly the limit of a traditional package
with a heatsink-and-fan cooling system. Running a CPU at significantly higher power than this
requires an increasingly complex cooling system, both at a system level and within the chip itself.
Pushed up against the power wall, the only choice was to stop increasing the clock fre-
quency and find other ways to increase performance. Although Intel had predicted processor
clock rates over 10 GHz, actual numbers peaked around 4 GHz and settled back between 2 and
4 GHz (Figure 1.1b).
Even though Dennard scaling had stopped, taking down frequency scaling with it,
Moore’s law continued its steady march forward. This left architects with an abundance of tran-
sistors, but the traditional microarchitectural approaches to improving performance had been
mostly mined out. As a result, computer architecture has turned in several new directions to
improve performance without increasing power consumption.
The first major tack was symmetric multicore, which stamped down two (and then four,
and then eight) copies of the CPU on each chip. This has the obvious benefit of delivering more
computational power for the same clock rate. Doubling the core count still doubles the total
power, but if the clock frequency is dialed back, the chip runs at a lower voltage, keeping the
energy constant while maintaining some of the performance advantage of having multiple cores.
This is especially true if the parallel cores are simplified and designed for energy efficiency rather
than single-thread performance. Nonetheless, even simple CPU cores incur significant overhead
to compute their results, and there is a limit to how much efficiency can be achieved simply by
making more copies.
The next theme was to build processors to exploit regularity in certain applications, lead-
ing to the rise of single-instruction-multiple-data (SIMD) instruction sets and general-purpose
GPU computing (GPGPU). These go further than symmetric multicore in that they amortize
the instruction fetch and decode steps across many hardware units, taking advantage of data
parallelism. Neither SIMD nor GPUs were new; SIMD had existed for decades as a staple
of supercomputer architectures and made its way into desktop processors for multimedia ap-
plications along with GPUs in the late 1990s. But in the mid-2000s, they started to became
prominent as a way to accelerate traditional compute-intensive applications.
A third major tack in architecture was the proliferation of specialized accelerators, which
go even further in stripping out control flow and optimizing data movement for particular appli-
cations. This trend was hastened by the widespread migration to mobile devices and “the cloud,”
where power is paramount and typical use is dominated by a handful of tasks. A modern smart-
phone System-on-chip (SoC) contains more than a dozen custom compute engines, created
specifically to perform intensive tasks that would be impossible to run in real time on the main
CPU. For example, communicating over WiFi and cellular networks requires complex coding
and modulation/demodulation, which is performed on a small collection of hardware units spe-
cialized for these signal processing tasks. Likewise, decoding or encoding video—whether for
watching Netflix, video chatting, or camera filming—is handled by hardware blocks that only
perform this specific task. And the process of capturing raw pixels and turning them into a
pleasing (or at least presentable) image is performed by a long pipeline of hardware units that
demosaic, color balance, denoise, sharpen, and gamma-correct the image.
Even low-intensity tasks are getting accelerators. For example, playing music from an
MP3 file requires relatively little computational work, but the CPU must wake up a few dozen
times per second to fill a buffer with sound samples. For power efficiency, it may be better to
have a dedicated chip (or accelerator within the SoC, decoupled from the CPU) that just handles
audio.
While there remain some performance gains still to be squeezed out of thread and data
parallelism by incrementally advancing CPU and GPU architectures, they cannot close the gap
to a fully customized ASIC. The reason, as we’ve already hinted, comes down to power.
Cell phones are power-limited both by their battery capacity (roughly 8–12 Wh) and the
amount of heat it is acceptable to dissipate in the user’s hand (around 2 W). The datacenter is
the same story at a different scale. A warehouse-sized datacenter consumes tens of megawatts,
requiring a dedicated substation and a cheap source of electrical power. And like phones, data
center performance is constrained partly by the limits of our ability to get heat out, as evidenced
by recent experiments and plans to build datacenters in caves or in frigid parts of the ocean.
Thus, in today’s power-constrained computing environment, the formula for improvement is
simple: performance per watt is performance.
Only specialized architectures can optimize the data storage and movement to achieve the
energy reduction we want. As we will discuss in Section 1.4, specialized accelerators are able to
eliminate the overhead of instructions by “baking” them into the computation hardware itself.
They also eliminate waste for data movement by designing the storage to match the algorithm.
Of course, general-purpose processors are still necessary for most code, and so modern
systems are increasingly heterogeneous. As mentioned earlier, SoCs for mobile devices contain
dozens of processors and specialized hardware units, and datacenters are increasingly adding
GPUs, FPGAs, and ASIC accelerators [AWS, 2017, Norman P. Jouppi et al., 2017].
In the remainder of this chapter, we’ll describe the metrics that characterize a “good”
accelerator and explain how these factors will determine the kind of systems we will build in the
future. Then we lay out the challenges to specialization and describe the kinds of applications
for which we can expect accelerators to be most effective.
1.2 WHAT WILL WE BUILD NOW?
Given that specialized accelerators are—and will continue to be—an important part of computer
architecture for the foreseeable future, the question arises: What makes a good accelerator? Or
said another way, if I have a potential set of designs, how do I choose what to add to my SoC
or datacenter, if anything?

1.2.1 PERFORMANCE, POWER, AND AREA


On the surface, the good things we want are obvious. We want high performance, low power,
and low cost.
Raw performance—the speed at which a device is able to perform a computation—is
the most obvious measure of “good-ness.” Consumers will throw down cash for faster devices,
whether that performance means quicker web page loads or richer graphics. Unfortunately, this
isn’t easy to quantify with the most commonly advertised metrics.
Clock speed matters, but we also need to account for how much work is done on each
clock cycle. Multiplying clock speed by the number of instructions issued per cycle is better, but
still ignores the fact that some instructions might do much more work than others. And on top
of this, we have the fact that utilization is rarely 100% and depends heavily on the architecture
and application.
We can quantify performance in a device-independent way by counting the number of
essential operations performed per unit time. For the purposes of this metric, we define “essen-
tial operations” to include only the operations that form the actual result of the computation.
Most devices require a great deal of non-essential computation, such as decoding instructions or
loading and storing intermediate data. These are “non-essential” not because they are pointless
or unnecessary but because they are not intrinsically required to perform the computation. They
are simply overhead incurred by the specific architecture.
With this definition, adding two pieces of data to produce an intermediate result is an
essential operation, but incrementing a loop counter is not since the latter is required by the
implementation and not the computation itself.
To make things concrete, a 3×3 convolution on a single-channel image requires nine mul-
tiplications (multiplying 3×3 pixels by their corresponding weights) and eight 2-input additions
per output pixel. For a 640×480 image (307,200 pixels), this is a little more than 5.2 million
total operations.
A CPU implementation requires many more instructions than this to compute the result
since the instruction stream includes conditional branches, loop index computations, and so
forth. On the flip side, some implementations might require fewer instructions than operations,
if they process multiple pieces of data on each instruction or have complex instructions that
fuse multiple operations. But implementations across this whole spectrum can be compared if
we calculate everything in terms of device-independent operations, rather than device-specific
instructions.
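As a rough illustration (this sketch is ours, not from the book; the kernel weights are arbitrary), the following C++ loop nest computes the 3×3 convolution described above and tallies its essential operations. The multiplies and adds in the inner loops are the essential work; the loop counters, bounds handling, and address arithmetic are the kind of implementation overhead a CPU executes on top of it:

    #include <cstdio>
    #include <vector>

    int main() {
        const int W = 640, H = 480;                       // single-channel image
        std::vector<float> in(W * H, 1.0f), out(W * H, 0.0f);
        const float k[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};  // arbitrary weights

        long long essential_ops = 0;
        for (int y = 1; y < H - 1; ++y) {                 // skip the 1-pixel border
            for (int x = 1; x < W - 1; ++x) {
                float acc = 0.0f;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        acc += k[dy + 1][dx + 1] * in[(y + dy) * W + (x + dx)];
                out[y * W + x] = acc;
                essential_ops += 9 + 8;   // 9 multiplies + 8 two-input adds per output pixel
            }
        }
        // Prints roughly 5.2 million essential operations for a 640x480 image.
        std::printf("essential ops: %lld\n", essential_ops);
        return 0;
    }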
The second metric is power consumption, measured in Watts. In a datacenter context,
the power consumption is directly related to the operating cost, and thus to the total cost of
ownership (TCO). In a mobile device, power consumption determines how long the battery will
last (or how large a battery is necessary for the device to survive all day). Power consumption also
determines the maximum computational load that can be sustained without causing the device
to overheat and throttle back.
The third metric is cost. We’ll discuss development costs further in the following section,
but for now it is sufficient to observe that the production cost of the final product is closely related
to the silicon area of the chip, typically measured in square millimeters (mm²). More chips of a
smaller design will fit on a fixed-size wafer, and smaller chips are likely to have somewhat higher
yield percentages, both of which reduce the manufacturing cost.
However, as important as performance, power, and silicon area are as metrics, they can’t
be used directly to compare designs, because it is relatively straightforward to trade one for the
other.
Running a chip at a higher operating voltage causes its transistors to switch more rapidly,
allowing us to increase the clock frequency and get increased performance, at the cost of in-
creased power consumption. Conversely, lowering the operating voltage along with the clock
frequency saves energy, at the cost of lower performance.¹
It isn’t fair to compare the raw performance of a desktop Intel Core i7 to an ARM phone
SoC, if for no other reason than that the desktop processor has a 20–50× power advantage.
Instead, it is more appropriate to divide the power ( Joules per second) by the performance (op-
erations per second) to get the average energy used per computation ( Joules per operation).
Throughout the rest of this book, we’ll refer to this as “energy per operation” or pJ/op. We
could equivalently think about maximizing the inverse: operations/Joule.
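For instance (an illustrative calculation with numbers of our own choosing, not a measured device): a mobile SoC sustaining 100 billion operations per second within a 2 W power budget is operating at

    \[
    \frac{2\ \text{J/s}}{100 \times 10^{9}\ \text{ops/s}} = 20\ \text{pJ/op},
    \]

or equivalently 50 billion operations per Joule.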
For a battery-powered device, energy per operation relates directly to the amount of com-
putation that can be performed with a single battery charge; for anything plugged into the wall,
this relates the amount of useful computation that was done with the money you paid to the
electric company.
A similar difficulty is related to the area metric. For applications with sufficient parallelism,
we can double performance simply by stamping down two copies of the same processor on a chip.
This benefit requires no increase in clock speed or operating voltage—only more silicon. This
was, of course, the basic impetus behind going to multicore computation.
Even further, it is possible to lower the voltage and clock frequency of the two cores,
trading performance for energy efficiency as described earlier. As a result, it is possible to improve
either power or performance by increasing silicon area as long as there is enough parallelism.
Thus, when comparing between architectures for highly parallel applications, it is helpful to
normalize performance by the silicon area used. This gives us operations/Joule divided by area,
or ops/(mm²·J).
These two compound metrics, pJ/operation and ops/(mm²·J), give us meaningful ways to com-
pare and evaluate vastly different architectures. However, it isn’t sufficient to simply minimize
these in the abstract; we must consider the overall system and application workload.

¹ Of course, modern CPUs do this scaling on the fly to match their performance to the ever-changing CPU load, known as
“Dynamic Voltage and Frequency Scaling” (DVFS).

1.2.2 FLEXIBILITY
Engineers building a system are concerned with a particular application, or perhaps a collection
of applications, and the metrics discussed are only helpful insofar as they represent performance
on the applications of interest. If a specialized hardware module cannot run our problem, its
energy and area efficiency are irrelevant. Likewise, if a module can only accelerate parts of the
application, or only some applications out of a larger suite, then its benefit is capped by Am-
dahl’s law. As a result, we have a flexibility tradeoff: more flexible devices allow us to accelerate
computation that would otherwise remain on the CPU, but increased flexibility often means
reduced efficiency.
Suppose a hypothetical fixed-function device can accelerate 50% of a computation by a
factor of 100, reducing the total computation time from 1 second to 0.505 seconds. If adding
some flexibility to the device drops the speedup to only 10× but allows us to accelerate 70%
of the computation, we will now complete the computation in 0.37 seconds—a clear win.
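This is just Amdahl’s law in disguise. Writing f for the fraction of the work the accelerator can handle and s for its speedup, the runtime of a task that originally took T₀ = 1 s becomes

    \[
    T = (1 - f)\,T_{0} + \frac{f\,T_{0}}{s}, \qquad
    T\big|_{f=0.5,\,s=100} = 0.5 + 0.005 = 0.505\ \text{s}, \qquad
    T\big|_{f=0.7,\,s=10} = 0.3 + 0.07 = 0.37\ \text{s},
    \]

matching the numbers above.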
Moreover, many applications demand flexibility, whether the product is a networking de-
vice that needs to support new protocols or an augmented-reality headset that must incorporate
the latest advances in computer vision. As more and more devices are connected to the internet,
consumers increasingly expect that features can be upgraded and bugs can be fixed via over-the-
air updates. In this market, a fixed-function device that cannot support rapid iteration during
prototyping and cannot be reconfigured once deployed is a major liability.
The tradeoff is that flexibility isn’t free, as we have already alluded to. It almost always hurts
efficiency (performance per watt or ops/(mm²·J)) since overhead is spent processing the configuration.
Figure 1.2 illustrates this by comparing the performance and efficiency for a range of designs
proposed at ISSCC a number of years ago. While newer semiconductor processes have reduced
energy across the board, the same trend holds: the most flexible devices (CPUs) are the least
efficient, and increasing specialization also increases performance, by as much as three orders of
magnitude.
In certain domains, this tension has created something of a paradox: applications that were
traditionally performed completely in hardware are moving toward software implementations,
even while competing forces push related applications away from software toward hardware. For
example, the fundamental premise of software defined radio (SDR) is that moving much (or all)
of the signal processing for a radio from hardware to software makes it possible to build a system
that is simpler, cheaper, and more flexible. With only a minimal analog front-end, an SDR
system can easily run numerous different coding and demodulation schemes, and be upgraded
over the air.

Figure 1.2: Comparison of efficiency for a number of designs from ISSCC, showing the clear
tradeoff between flexibility and efficiency. Designs are sorted by energy efficiency (MOPS/mW)
and grouped by overall design paradigm: microprocessors, general-purpose DSPs, and dedicated
hardware. Figure from Marković and Brodersen [2012].

But because real-time signal processing requires extremely high computation rates,
many SDR platforms use an FPGA, and carefully optimized libraries have been written to fully
exploit the SIMD and digital signal processing (DSP) hardware in common SoCs. Likewise,
software-defined networking aims to provide software-based reconfigurability to networks, but
at the same time more and more effort is being poured into custom networking chips.

1.3 THE COST OF SPECIALIZATION


To fit these metrics together, we must consider one more factor: cost. After all, given the enor-
mous benefits of specialization, the only thing preventing us from making a specialized acceler-
ator for everything is the expense.
Figure 1.3 compares the non-recurring engineering (NRE) cost of building a new high-
end SoC on the past few silicon process nodes. The price tags for the most recent technologies
are now well out of reach for all but the largest companies. Most ASICs are less expensive than
this, by virtue of being less complex, using purchased or existing IP, having lower performance
targets, and being produced on older and mature processes [Khazraee et al., 2017]. Yet these
costs still run into the millions of dollars and remain risky undertakings for many businesses.
Several components contribute to this cost. The most obvious is the price of the lithogra-
phy masks and tooling setup, which has been driven up by the increasingly high precision of each
process node. Likewise, these processes have ever-more-stringent design rules, which require
more engineering effort during the place and route process and in verification. The exponen-
tial increase in number of transistors has enabled a corresponding growth in design complexity,
which comes with increased development expense. Some of these additional transistors are used
in ways that do not appreciably increase the design complexity, such as additional copies of pro-
cessor cores or larger caches. But while the exact slope of the correlation is debatable, the trend
is clear: More transistors means more complexity, and therefore higher design costs. Moreover,
with increased complexity comes increased costs for testing and verification.

Figure 1.3: Estimated cost breakdown (in millions of USD) to build a large SoC, split into IP,
architecture, verification, physical design, software, prototype, and validation, for process nodes
from 65 nm (2006) through 5 nm. The overall cost is increasing exponentially, and software
comprises nearly half of the total cost. (Data from International Business Strategies [IBS, 2017].)
Last, but particularly relevant to this book, is the cost of developing software to run the
chip, which in the IBS estimates accounts for roughly 40% of the total cost. The accelerator
must be configured, whether with microcode, a set of registers, or something else, and it must
be interfaced with the software running on the rest of the system. Even the most rigid of “fixed”
devices usually have some degree of configurability, such as the ability to set an operating mode
or to control specific parameters or coefficients.
This by itself is unremarkable, except that all of these “configurations” are tied to a pro-
gramming model very different than the idealized CPU that most developers are used to. Timing
details become crucial, instructions execute out of order or in a massively parallel fashion, and
concurrency and synchronization are handled with device-specific primitives. Accelerators are,
almost by definition, difficult to program.
To state the obvious, the more configurable a device is, the more effort must go into con-
figuring it. In highly configurable accelerators such as GPUs or FPGAs, it is quite easy—even
typical—to produce configurations that do not perform well. Entire job descriptions revolve
around being able to work the magic to create high-performance configurations for accelera-
tors. These people, informally known as “the FPGA wizards” or “GPU gurus,” have an intimate
knowledge of the device hardware and carry a large toolbox of techniques for optimizing appli-
cations. They also have excellent job security.
This difficulty is exacerbated by a lack of tools. Specialized accelerators need specialized
tools, often including a compiler toolchain, debugger, and perhaps even an operating system.
This is not a problem in the CPU space: there are only a handful of competitive CPU archi-
tectures, and many groups are developing tools, both commercial and open source. Intel is but
one of many groups with an x86 C++ compiler, and the same is true for ARM. But specialized
accelerators are not as widespread, and making tools for them is less profitable. Unsurprisingly,
NVIDIA remains the primary source of compilers, debuggers, and development tools for their
GPUs. This software design effort cannot easily be pushed onto third-party companies or the
open-source community, and becomes part of the chip development cost.
As we stand today, bringing a new piece of silicon to market is as much about writing
software as it is designing logic. It isn’t sufficient to just “write a driver” for the hardware; what
is needed is an effective bridge to application-level code.
Ultimately, companies will only create and use accelerators if the improvement justifies
the expense. That is, an accelerator is only worthwhile if the engineering cost can be recouped
by savings in the operating cost, or if the accelerator enables an application that was previously
impossible. The operating cost is closely tied to the efficiency of the computing system, both in
terms of the number of units necessary (buying a dozen CPUs vs. a single customized accelerator)
and in terms of time and electricity. Because it is almost always easier to implement an algorithm
on a more flexible device, this cost optimization results in a tug-of-war between performance
and flexibility, illustrated in Figure 1.4.
This is particularly true for low-volume products, where the NRE cost dominates the
overall expense. In such cases, the cheapest solution—rather than the most efficient—might be
the best. Often, the most cost-effective solution to speed up an application is to buy a more
powerful computer (or a whole rack of computers!) and run the same horribly inefficient code
on it. This is why an enormous amount of code, even deployed production code, is written in
languages like Python and Matlab, which have poor runtime performance but terrific developer
productivity.
Our goal is to reduce the cost of developing accelerators and of mapping emerging applica-
tions onto heterogeneous systems, pushing down the NRE of the high-cost/high-performance
areas of this tradeoff space. Unless we do so, it will remain more cost effective to use
general-purpose systems, and computer performance in many areas will suffer.

[Figure 1.4: a plot with engineering cost on the horizontal axis and operating cost on the
vertical axis; the CPU, optimized CPU, GPU, FPGA, and ASIC lie along the tradeoff curve.]

Figure 1.4: Tradeoff of operating cost (which is inversely related to runtime performance) vs.
non-recurring engineering cost (which is inversely related to flexibility). More flexible devices
(CPUs and GPUs) require less development effort but achieve worse performance compared to
FPGAs and ASICs. We aim to reduce the engineering development cost (red arrows), making
it more feasible to adopt specialized computing.

1.4 GOOD APPLICATIONS FOR ACCELERATION
Before we launch into systems for programming accelerators, we’ll examine which applications
can be accelerated most effectively. Can all applications be accelerated with specialized proces-
sors, or just some of them?
The short answer is that only a few types of applications are worth accelerating. To see
why, we have to go back to the fundamentals of power and energy. Given that, for a modern chip,
performance per watt is equivalent to performance, we want to minimize the energy consumed
per unit of computation. That is, if the way to maximize operations per second is to maximize
operations per second per watt, we can cancel “seconds,” and simply maximize operations per
Joule.
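In symbols, since a watt is one joule per second, the cancellation is:

\[
\frac{\mathrm{operations/second}}{\mathrm{watts}}
  = \frac{\mathrm{operations/second}}{\mathrm{joules/second}}
  = \frac{\mathrm{operations}}{\mathrm{joule}}
\]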
Table 1.1 shows the energy required for a handful of fundamental operations in a 45 nm
process. The numbers are smaller for more recent process nodes, but the relative scale remains
essentially the same.
The crucial observation here is that a DRAM fetch requires 500× more energy than a
32-bit multiplication, and 50,000× more than an 8-bit addition. The cost of fetching data from
memory completely dwarfs the cost of computing with it. The cache hierarchy helps, of course,