Presented at MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science.
JVM & garbage collection tuning for low-latency applications - Quentin Ambard
G1, CMS, Shenandoah, or Zing? Heap size at 8GB or 31GB? Compressed pointers? Region size? What is the maximum pause time? Throughput or latency... what do you gain? MaxGCPauseMillis, G1HeapRegionSize, MaxTenuringThreshold, UnlockExperimentalVMOptions, ParallelGCThreads, InitiatingHeapOccupancyPercent, G1RSetUpdatingPauseTimePercent: which parameters have the most impact?
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
Exported pdf slides from our talk at PyData London 2016. The online version is available on http://pydata2016.cfapps.io/.
MongoDB is a non-relational database that stores data in JSON-like documents with dynamic schemas. It features flexibility with JSON documents that map to programming languages, power through indexing and queries, and horizontal scaling. The document explains that MongoDB uses JSON and BSON formats to store data, has no fixed schema so fields can evolve freely, and demonstrates working with the mongo shell and RoboMongo GUI.
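As a minimal illustration of the dynamic-schema idea described above, here is a small pymongo sketch; the database, collection, and fields are made up for illustration, and a local mongod is assumed to be running:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local mongod
db = client["demo"]

# Documents in the same collection can carry different fields (no fixed schema).
db.people.insert_one({"name": "Ada", "languages": ["python", "scala"]})
db.people.insert_one({"name": "Grace", "born": 1906})

# Indexing and querying, mirroring the "power through indexing and queries" point.
db.people.create_index("name")
print(db.people.find_one({"name": "Ada"}))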
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
This document summarizes a tutorial on column-oriented database systems given at VLDB 2009. It discusses the evolution of column-oriented databases from early systems like DSM to current commercial systems. Key features of column-oriented databases include storing data by column rather than by row for improved read efficiency, better compression, and indexing capabilities. The tutorial outlines optimizations made possible by the columnar format like late materialization and vectorized processing. It also compares the performance of column and row storage using a telco data warehousing example.
This document discusses MongoDB performance tuning. It emphasizes that performance tuning is an obsession that requires planning schema design, statement tuning, and instance tuning in that order. It provides examples of using the MongoDB profiler and explain functions to analyze statements and identify tuning opportunities like non-covered indexes, unnecessary document scans, and low data locality. Instance tuning focuses on optimizing writes through fast update operations and secondary index usage, and optimizing reads by ensuring statements are tuned and data is sharded appropriately. Overall performance depends on properly tuning both reads and writes.
Koalas: Making an Easy Transition from Pandas to Apache Spark - Databricks
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between the big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in python, and it is typically the first step to explore and manipulate a data set by data scientists. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine could handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand: – how to effectively leverage both pandas and Spark inside the same code base – how to leverage powerful pandas concepts such as lightweight indexing with Spark – technical considerations for unifying the different behaviors of Spark and pandas
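As a taste of what the talk demonstrates, a minimal sketch of moving a pandas workflow onto Spark via Koalas; it assumes the databricks-koalas package (in newer Spark releases the same API ships as pyspark.pandas):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Same pandas-style API, but backed by Spark DataFrames under the hood.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("group")["value"].sum())

# Convert back to pandas once the result is small enough for a single machine.
result = kdf.groupby("group")["value"].sum().to_pandas()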
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin (DataStax Academy)
The document provides an overview and examples of data modeling techniques for Cassandra. It discusses four use cases - shopping cart data, user activity tracking, log collection/aggregation, and user form versioning. For each use case, it describes the business needs, issues with a relational database approach, and provides the Cassandra data model solution with examples in CQL. The models showcase techniques like de-normalizing data, partitioning, clustering, counters, maps and setting TTL for expiration. The presentation aims to help attendees properly model their data for Cassandra use cases.
This document provides a cheat sheet for MySQL with summaries of basic operations, table operations, storage engines, transaction operations, administration operations, and more. It includes commands for connecting to MySQL, starting and stopping the MySQL daemon, checking server status, creating and modifying databases and tables, importing and exporting data, and server administration tasks like backups, restores, and log maintenance. A link is provided to the original Japanese sheet as well as the MySQL official documentation.
Hadoop World 2011: Advanced HBase Schema Design - Cloudera, Inc.
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Patroni - ScaleGrid.io
Compare top PostgreSQL high availability frameworks - PostgreSQL Automatic Failover (PAF), Replication Manager (repmgr) and Patroni to improve your app uptime. ScaleGrid blog - https://scalegrid.io/blog/whats-the-best-postgresql-high-availability-framework-paf-vs-repmgr-vs-patroni-infographic/
The document discusses performance tuning for JBoss EAP 6. It covers tuning the JVM, EAP 6 configuration, JDBC pools, EJB pools, web pools, and logging. It also discusses monitoring tools like JMX, VisualVM, JBoss Operations Network, profilers, thread dumps, and GC logging. The overall goal is to understand an application's requirements, instrument it, identify bottlenecks, and tune the various components and settings to optimize performance.
유용하 (indy.jones) / kakao corp. (Talk Messaging team)
---
Kotlin, a JVM-based language, is fully interoperable with the Java ecosystem while offering concise, safe syntax. Thanks to these strengths it has been steadily replacing Java, especially on Android, but Java's position still looks solid on the server side, where teams are conservative about stability and performance. Since Kotlin is also fully compatible with existing Java-based frameworks, there is no reason not to adopt it on the server as well, and the trend will accelerate as successful adoption stories accumulate. Some of KakaoTalk's servers are already written in Kotlin and reliably serve a massive volume of requests, and that footprint keeps growing. This session shares KakaoTalk's experience applying Kotlin on the server, which should be useful for Kotlin adoption in a variety of server-side domains.
Resilient Distributed DataSets - Apache SPARK - Taposh Roy
RDDs (Resilient Distributed Datasets) provide a fault-tolerant abstraction for data reuse across jobs in distributed applications. They allow data to be persisted in memory and manipulated using transformations like map and filter. This enables efficient processing of iterative algorithms. RDDs achieve fault tolerance by logging the transformations used to build a dataset rather than the actual data, enabling recovery of lost partitions through recomputation.
Inside MongoDB: the Internals of an Open-Source Database - Mike Dirolf
The document discusses MongoDB, including how it stores and indexes data, handles queries and replication, and supports sharding and geospatial indexing. Key points covered include how MongoDB stores data in BSON format across data files that grow in size, uses memory-mapped files for data access, supports indexing with B-trees, and replicates operations through an oplog.
CQL is a structured query language for Cassandra that replaces the need for low-level client APIs. It features a familiar SQL-like syntax for querying Cassandra in a user-friendly way. Key CQL commands include CREATE KEYSPACE, CREATE COLUMNFAMILY, SELECT, UPDATE, DELETE, BATCH, and DROP. Consistency levels can be specified for commands using USING, and valid options range from CONSISTENCY ZERO to CONSISTENCY DCQUORUMSYNC.
Get an overview of HashiCorp's Vault concepts.
Learn how to start a Vault server.
Learn how to use Vault's PostgreSQL backend.
See an overview of Vault's SSH backend integration.
This presentation was held on the DigitalOcean Meetup in Berlin. Find more details here: https://www.meetup.com/DigitalOceanBerlin/events/237123195/
This document provides an overview of in-memory databases, summarizing different types including row stores, column stores, compressed column stores, and how specific databases like SQLite, Excel, Tableau, Qlik, MonetDB, SQL Server, Oracle, SAP Hana, MemSQL, and others approach in-memory storage. It also discusses hardware considerations like GPUs, FPGAs, and new memory technologies that could enhance in-memory database performance.
Databricks Spark Chief Architect Reynold Xin's keynote at Spark Summit East 2016, discussing streaming, continuous applications, and DataFrames in Spark.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around - Reynold Xin
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of a Hadoop distribution and the wider Hadoop ecosystem, and covers key concepts like HDFS and the MapReduce programming model. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation - tugrulh
This document provides an overview of MapReduce theory and implementation. It discusses how MapReduce borrows concepts from functional programming like map and fold/reduce to provide automatic parallelization and fault tolerance for large-scale data processing problems across hundreds or thousands of CPUs. Users implement map and reduce functions, and MapReduce handles parallel and distributed execution.
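The borrowed "map and fold" idea is easy to see in miniature; here is a toy single-machine word count in Python, showing just the functional concept that MapReduce parallelizes, not MapReduce itself:

from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# map: emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# fold/reduce: combine the pairs into per-word totals
def combine(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

counts = reduce(combine, pairs, {})
print(counts)   # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}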
The document discusses top trends for leading organizations and includes sections on:
- Moving to cloud computing to increase efficiencies and level the playing field for organizations.
- Using social media like Facebook and Twitter to motivate and engage with a wide audience.
- Setting organizational data free through collaboration to improve services and operations.
- Re-inventing processes through inspiration to innovate and adapt to changing needs.
The document discusses the shift from previous scientific paradigms of theory and experimentation to new paradigms of computation, simulation, and data mining. It argues that nonprofits must embrace the new "data paradigm" by using data to understand their own work, contextualize their work within other data sources, tell stories to stakeholders, and share data. It raises questions about how to raise the importance of data across the nonprofit sector through funder reporting, data sharing principles, and structuring unstructured data.
This document discusses Jim Gray's vision of a fourth paradigm of scientific discovery based on data-intensive science. It outlines three activities of data-intensive science - data capture, curation, and analysis - and argues that funding is needed to develop tools to support these activities across different scales and types of data. It also discusses the need for digital libraries to archive both data and documents, similar to traditional libraries, to support scientific communication and the construction of a permanent scientific record.
The document discusses the evolution of safety paradigms over time from a technical paradigm focused on solving problems, to an organizational paradigm, and behavioral paradigm. It introduces a potential fourth paradigm of viewing safety through a spiritual lens by standing on the shoulders of past approaches and focusing on business spirituality. This fourth paradigm may help transform mindsets to better achieve safety.
Real-Time-Analytics mit Spark und Cassandra - Thomas Mann
Real-time analytics with Spark and Cassandra.
A conceptual overview of the integration, the background, and the benefits of combining Spark and Cassandra.
Talk given at the 6th OSBI Workshop on 05.03.2015 in Offenburg.
BibBase Linked Data Triplification Challenge 2010 Presentation - Reynold Xin
The document summarizes BibBase Triplified, a system that publishes bibliographic data from BibTeX files as structured data on the semantic web. It takes BibTeX files maintained by scientists, detects and resolves duplicates, and publishes the data as HTML pages and RDF triples. It also links entries to external datasets like DBLP and DBpedia. As of September 2010, the system had over 4,500 publications and 100 active users. Future work includes improving duplicate detection, linking to more external sources, and broadening the user base.
This slideshow gives feedback about using Linux in industrial projects. It is part of a conference talk given by our company, CIO Informatique Industrielle, at ERTS 2008, the European Embedded Real Time Software Congress in Toulouse.
This paper describes lessons learned from using Linux technologies for industrial software development. It gives feedback about embedded and real-time uses of Linux.
This slideshow comes from CIO Informatique Industrielle's contribution to the conference/debate "How to work with open-source software", held in April 2008 at the RTS Embedded Systems trade show.
Yocto une solution robuste pour construire des applications à fort contenu ap... - Christian Charreyre
This document is the presentation given by CIO Informatique Industrielle at the "Yocto et Linux, un couple d'avenir" session of the RTS 2013 trade show.
This document discusses Spark, an open-source cluster computing framework. It provides a brief history of Spark, describing how it generalized MapReduce to support more types of applications. Spark allows for batch, interactive, and real-time processing within a single framework using Resilient Distributed Datasets (RDDs) and a logical plan represented as a directed acyclic graph (DAG). The document also discusses how Spark can be used for applications like machine learning via MLlib, graph processing with GraphX, and streaming data with Spark Streaming.
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis - Trieu Nguyen
This document provides an introduction to Apache Hadoop and Spark for data analysis. It discusses the growth of big data from sources like the internet, science, and IoT. Hadoop is introduced as providing scalability on commodity hardware to handle large, diverse data types with fault tolerance. Key Hadoop components are HDFS for storage, MapReduce for processing, and HBase for non-relational databases. Spark is presented as improving on MapReduce by using in-memory computing for iterative jobs like machine learning. Real-world use cases of Spark at companies like Uber, Pinterest, and Netflix are briefly described.
Apache Spark and Hadoop are frameworks for distributed data processing. Spark can be used for batch processing, streaming, and machine learning. It improves on MapReduce by keeping data in memory between jobs. The document provides an overview of Spark and its components, use cases like streaming data analysis and machine learning, and how it compares to Hadoop MapReduce. Real-world examples of Spark usage at companies like Uber and Pinterest are also discussed.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two word-count RDDs on the word key:
val joined = readmeCounts.join(changesCounts)
This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.
How Apache Spark fits into the Big Data landscape - Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
A lecture on Apace Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) install the environment through Docker, b) introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL \ Dataframes, and MLlib.
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014 - cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas of Cloudera. It discusses Spark's advantages over MapReduce like leveraging distributed memory for better performance and supporting iterative algorithms. Spark concepts like RDDs, transformations and actions are explained. Examples shown include word count, logistic regression, and Spark Streaming. The presentation concludes with a discussion of SQL on Spark and a demo.
Introduction to Spark - Phoenix Meetup 08-19-2014 - cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas. It discusses how Spark improves on MapReduce by offering better performance through leveraging distributed memory and supporting iterative algorithms. Spark retains MapReduce's advantages of scalability, fault-tolerance, and data locality while offering a more powerful and easier to use programming model. Examples demonstrate how tasks like word counting, logistic regression, and streaming data processing can be implemented on Spark. The document concludes by discussing Spark's integration with other Hadoop components and inviting attendees to try Spark.
How Apache Spark fits into the Big Data landscape - Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
This document provides an overview and introduction to Spark, including:
- Spark is a general purpose computational framework that provides more flexibility than MapReduce while retaining properties like scalability and fault tolerance.
- Spark concepts include resilient distributed datasets (RDDs), transformations that create new RDDs lazily, and actions that run computations and return values to materialize RDDs.
- Spark can run on standalone clusters or as part of Cloudera's Enterprise Data Hub, and examples of its use include machine learning, streaming, and SQL queries.
OCF.tw's talk about "Introduction to spark" - Giivee The
Sharing an introduction to Spark at the invitation of OCF and OSSF.
If you are interested in the Open Culture Foundation (OCF) or OpenFoundry (OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
feel free to get in touch with CLBC at http://clbc.tw/
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Spark is an in-memory cluster computing framework that provides high performance for large-scale data processing. It excels over Hadoop by keeping data in memory as RDDs (Resilient Distributed Datasets) for faster processing. The document provides an overview of Spark architecture including its core-based execution model compared to Hadoop's JVM-based model. It also demonstrates Spark's programming model using RDD transformations and actions through an example of log mining, showing how jobs are lazily evaluated and distributed across the cluster.
Spark is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop in memory, and 10x faster on disk. Spark supports Scala, Java, Python and can run on standalone, YARN, or Mesos clusters. It provides high-level APIs for SQL, streaming, machine learning, and graph processing.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms - DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
6. Traditional Network Programming
Message-passing between nodes (MPI, RPC, etc)
Really hard to do at scale:
• How to split problem across nodes?
– Important to consider network and data locality
• How to deal with failures?
– If a typical server fails every 3 years, a 10,000-node cluster sees 10 faults/day!
• Even without failures: stragglers (a node is slow)
Almost nobody does this!
6
7. Data-Parallel Models
Restrict the programming interface so that the system
can do more automatically
“Here’s an operation, run it on all of the data”
• I don’t care where it runs (you schedule that)
• In fact, feel free to run it twice on different nodes
7
8. MapReduce Programming Model
Data type: key-value records
Map function:
(K_in, V_in) -> list(K_inter, V_inter)
Reduce function:
(K_inter, list(V_inter)) -> list(K_out, V_out)
8
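To make the types above concrete, here is a sketch of the two functions for the classic word count, written in plain Python; a real Hadoop job would wrap these in framework-specific mapper/reducer classes, and the tiny driver loop below only simulates the shuffle:

def map_fn(key, value):
    # key: line offset (ignored), value: one line of text
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: all the counts emitted for that word
    return [(key, sum(values))]

# Simulated framework driver on a toy input:
inputs = [(0, "to be or not to be")]
intermediate = {}
for k, v in inputs:
    for ik, iv in map_fn(k, v):
        intermediate.setdefault(ik, []).append(iv)
outputs = [pair for ik, ivs in intermediate.items() for pair in reduce_fn(ik, ivs)]
print(outputs)   # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]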
9. MapReduce Programmability
Most real applications require multiple MR steps
• Google indexing pipeline: 21 steps
• Analytics queries (e.g. count clicks & top K): 2 – 5 steps
• Iterative algorithms (e.g. PageRank): 10’s of steps
Multi-step jobs create spaghetti code
• 21 MR steps -> 21 mapper and reducer classes
• Lots of boilerplate code per step
9
11. Problems with MapReduce
MapReduce use cases showed two major limitations:
1. Difficulty of programming directly in MR
2. Performance bottlenecks
In short, MR doesn't compose well for large applications.
Therefore, people built high-level frameworks and specialized systems.
11
12. Higher Level Frameworks
SELECT count(*) FROM users
A = load 'foo';
B = group A all;
C = foreach B generate COUNT(A);
In reality, 90+% of MR jobs are generated by Hive SQL
12
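For comparison with the Hive and Pig snippets above, here is the same count expressed against Spark's own SQL/DataFrame API, which this deck gets to later; this is a sketch using the modern SparkSession entry point (which postdates this deck) and a hypothetical users.parquet input:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-users").getOrCreate()
users = spark.read.parquet("users.parquet")        # hypothetical input file
users.createOrReplaceTempView("users")

print(spark.sql("SELECT count(*) FROM users").collect())
print(users.count())                               # DataFrame equivalent of the same query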
13. Specialized Systems
MapReduce: general batch processing
Specialized systems: iterative, interactive, streaming, graph, etc.
Examples: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
13
14. Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
14
15. Spark: A Brief History
15
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
16. Spark Summary
Unlike the various specialized systems, Spark’s goal was
to generalize MapReduce to support new apps
Two small additions are enough:
• fast data sharing
• general DAGs
More efficient engine, and simpler for the end users.
16
20. Performance
Time to sort 100TB
2013 Record: Hadoop - 2100 machines, 72 minutes
2014 Record: Spark - 207 machines, 23 minutes
Also sorted 1PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
20
21. RDD: Core Abstraction
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of distributed datasets and operations on them
22. RDD
Resilient Distributed Datasets are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
Two types:
• parallelized collections – take an existing single-node collection and parallelize it
• Hadoop datasets – files on HDFS or other Hadoop-compatible storage
22
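The two construction paths in PySpark (a sketch; the HDFS URI is a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-construction")

# 1) Parallelized collection: distribute an existing single-node collection
nums = sc.parallelize([1, 2, 3, 4, 5], 2)           # 2 partitions

# 2) Hadoop dataset: a file on HDFS or other compatible storage (lazy, nothing read yet)
lines = sc.textFile("hdfs://namenode:8020/data/input.txt")   # placeholder URI

print(nums.sum())   # 15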
23. Operations on RDDs
Transformations: f(RDD) => RDD
• Lazy (not computed immediately)
• E.g. "map"
Actions:
• Trigger computation
• E.g. "count", "saveAsTextFile"
23
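A minimal illustration of the lazy/eager split (a sketch; the output path is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="lazy-vs-eager")
rdd = sc.parallelize(["INFO ok", "ERROR disk full", "ERROR timeout"])

errors = rdd.filter(lambda s: s.startswith("ERROR"))   # transformation: nothing runs yet
upper = errors.map(lambda s: s.upper())                # still nothing runs

print(upper.count())                                   # action: triggers the computation -> 2
upper.saveAsTextFile("/tmp/errors-upper")              # another action: writes the RDD as text files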
26. Working With RDDs
(Diagram: a chain of RDDs linked by Transformations, ending in an Action that returns a Value)
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # 74
linesWithSpark.first()   # "# Apache Spark"
27. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
28. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
Worker
Worker
Worker
Driver
29. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
Worker
Worker
Worker
Driver
lines = spark.textFile(“hdfs://...”)
30. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
Worker
Worker
Worker
Driver
lines = spark.textFile(“hdfs://...”)
Base
RDD
31. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
Worker
Worker
Worker
Driver
32. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
Worker
Worker
Worker
Driver
Transformed
RDD
33. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count()
34. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count()
Ac5on
35. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count()
Block
1
Block
2
Block
3
36. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Worker
Worker
Worker
messages.filter(lambda s: “mysql” in s).count()
Block
1
Block
2
Block
3
Driver
tasks
tasks
tasks
37. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Worker
Worker
Worker
messages.filter(lambda s: “mysql” in s).count()
Block
1
Block
2
Block
3
Driver
Read
HDFS
Block
Read
HDFS
Block
Read
HDFS
Block
38. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
[Diagram: each Worker processes its block and caches the data (Cache 1, 2, 3)]
39. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
[Diagram: the Workers return their results to the Driver]
40. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
[Diagram: the cached data stays on the Workers for the second query]
41. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
[Diagram: the Driver sends tasks for the second query to the Workers]
42. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
[Diagram: the Workers process the second query directly from their caches]
43. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
[Diagram: the Workers return results for the second query to the Driver]
44. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
[Diagram: the Workers serve both queries from cached data]
Cache your data → Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from memory vs. 20 sec from disk
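Putting the slides above together, here is a minimal self-contained PySpark sketch of the same log-mining flow (the local master setting and the print calls are illustrative assumptions; the HDFS path stays elided, as in the slides):

from pyspark import SparkContext

sc = SparkContext("local[*]", "LogMining")   # assumed local setup, for illustration only

lines = sc.textFile("hdfs://...")            # path elided, as in the slides
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()                             # keep the parsed messages in memory

print(messages.filter(lambda s: "mysql" in s).count())  # first action: reads HDFS, fills the cache
print(messages.filter(lambda s: "php" in s).count())    # second action: served from the cache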
45. Language Support
Standalone Programs
Python, Scala, & Java
Interactive Shells
Python & Scala
Performance
Java & Scala are faster due to
static typing
…but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
48. Fault Recovery
RDDs track lineage information that can be used to
efficiently reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))
[Diagram: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
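To inspect the lineage that makes this recovery possible, you can print an RDD's debug string from PySpark (a small sketch; the path stays elided):

messages = (sc.textFile("hdfs://...")
              .filter(lambda s: s.startswith("ERROR"))
              .map(lambda s: s.split("\t")[2]))

# toDebugString() lists the chain of parent RDDs Spark would replay
# to rebuild any lost partition of `messages`.
print(messages.toDebugString())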
50. Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: two classes of points (+ and –), a random initial line, and the target separating line]
51. Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = data.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final w: " + w)
(w is automatically shipped to the cluster)
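A rough PySpark + NumPy equivalent of the Scala loop above, for readers following along in Python (the file name, point format, and dimension D are assumptions):

import numpy as np

D = 2            # number of features (assumption)
ITERATIONS = 10

def read_point(line):
    # assumed format: label followed by D feature values, whitespace-separated
    nums = [float(x) for x in line.split()]
    return np.array(nums[1:]), nums[0]           # (x, y)

data = sc.textFile("points.txt").map(read_point).cache()
w = np.random.rand(D)                            # random initial line

for i in range(ITERATIONS):
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0]))) - 1.0) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient                                # the current w is shipped to the cluster with the closure

print("Final w:", w)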
55. Generality of RDDs
[Diagram: Spark core — RDDs, transformations, and actions — underpins
Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs),
GraphX (RDD-based graphs), and MLlib (RDD-based matrices for machine learning)]
56. Many important apps must process large data streams at
second-scale latencies
• Site statistics, intrusion detection, online ML
To build and scale these apps users want:
• Integration: with offline analytical stack
• Fault-tolerance: both for crashes and stragglers
• Efficiency: low cost beyond base processing
Spark Streaming: Motivation
57. Discretized Stream Processing
[Diagram: at t = 1 and t = 2, input pulled from streams 1 and 2 forms an immutable
dataset (stored reliably); a batch operation turns it into another immutable dataset
(output or state), stored in memory as an RDD]
58. Programming Interface
Simple functional API
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)
Interoperates with RDDs
// Join stream with static RDD
counts.join(historicCounts).map(...)

// Ad-hoc queries on stream state
counts.slice("21:00", "21:05").topK(10)
[Diagram: at t = 1, t = 2, ..., views → map → ones → reduce → counts;
each outer box is an RDD, each inner box a partition]
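The snippet above is written in the simplified API of the slides; a rough sketch of the same idea with the DStream API that shipped in Spark Streaming (host, port, batch interval, and checkpoint path are assumptions):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)        # 1-second batches
ssc.checkpoint("/tmp/checkpoints")                 # needed for stateful operations (path assumed)

views = ssc.socketTextStream("localhost", 9999)    # one viewed URL per line (assumption)
ones = views.map(lambda url: (url, 1))
counts = ones.updateStateByKey(lambda new, total: sum(new) + (total or 0))  # running count per URL

counts.pprint()
ssc.start()
ssc.awaitTermination()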
59. Inherited “for free” from Spark
RDD data model and API
Data partitioning and shuffles
Task scheduling
Monitoring/instrumentation
Scheduling and resource allocation
65. Benefits for Users
High performance data sharing
• Data sharing is the bottleneck in many environments
• RDDs provide in-place sharing through memory
Applications can compose models
• Run a SQL query and then PageRank the results
• ETL your data and then run graph/ML on it
Benefit from investment in shared functionality
• E.g. re-usable components (shell) and performance
optimizations
66. Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
70. DataFrames in Spark
Distributed collection of data grouped into named
columns (i.e. RDD with schema)
DSL designed for common tasks
• Metadata
• Sampling
• Project, filter, aggregation, join, …
• UDFs
Available in Python, Scala, Java, and R (via SparkR)
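A minimal PySpark sketch of that DSL (the SQLContext name, file path, and column names are assumptions):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.read.json("people.json")        # assumed JSON source

df.printSchema()                                 # metadata
sample = df.sample(False, 0.1)                   # sampling

name_len = udf(lambda s: len(s), IntegerType())  # a simple UDF
(df.select("name", "age", name_len(df.name))     # projection + UDF
   .filter(df.age > 21)                          # filter
   .groupBy("age").count()                       # aggregation
   .show())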
71. Not Just Less Code: Faster
Implementations
[Chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python,
DataFrame Scala, DataFrame Python, and DataFrame SQL]
76. More Than Naïve Scans
Data Sources API can automatically prune columns and
push filters to the source
• Parquet: skip irrelevant columns and blocks of data; turn
string comparison into integer comparisons for dictionary
encoded data
• JDBC: Rewrite queries to push predicates down
77.
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")
[Diagram: logical plan — filter on top of join(scan(users), scan(events));
optimized plan — the filter is pushed below the join, onto scan(events);
optimized plan with intelligent data sources — the filter is pushed into the events scan itself]
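You can watch this rewriting happen with explain() (the table and column names follow the slide; the Parquet paths are assumptions):

users = sqlContext.read.parquet("/data/users")    # assumed Parquet sources
events = sqlContext.read.parquet("/data/events")

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

# Prints the logical and physical plans; with a Parquet source the date predicate
# typically appears as a pushed-down filter on the events scan rather than above the join.
filtered.explain(True)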
78. Our Experience So Far
SQL is wildly popular and important
• 100% of Databricks customers use some SQL
Schema is very useful
• Most data pipelines, even the ones that start with unstructured
data, end up having some implicit structure
• Key-value too limited
• That said, semi-/un-structured support is paramount
Separation of logical vs physical plan
• Important for performance optimizations (e.g. join selection)
79. Machine Learning Pipelines
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr.model → df3;
fitting the Pipeline turns the lr stage into lr.model inside the resulting Model]
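Once fit, the resulting model is itself a transformer and can be applied to new data (test_df is an assumed DataFrame with a "text" column; the training DataFrame is assumed to also carry a "label" column):

predictions = model.transform(test_df)
predictions.select("text", "probability", "prediction").show()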
80. R Interface (SparkR)
Spark 1.4 (June)
Exposes DataFrames,
and ML library in R
df = jsonFile("tweets.json")
summarize(
  group_by(df[df$user == "matei", ], "date"),
  sum("retweets"))
81. Data Science at Scale
Higher level interfaces in Scala, Java, Python, R
Drastically easier to program Big Data
• With APIs similar to single-node tools
84. Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
85. Spark Application
sc = new SparkContext
f = sc.textFile("…")

f.filter(…)
 .count()

...
[Diagram: your program (JVM / Python) runs the Spark driver (app master), which holds the
RDD graph, scheduler, block tracker, and shuffle tracker; it works with a cluster manager
to launch Spark executors (multiple of them), each with task threads and a block manager,
on top of HDFS, HBase, …]
A single application often contains multiple actions
86. RDD is an interface
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition
(as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s)
for each partition
(items 1–3 capture the RDD's "lineage"; items 4–5 enable optimized execution)
90. Example: JoinedRDD
partitions = one per reduce task
dependencies = "shuffle" on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know
this data is hashed!
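From PySpark you can check the partitioner that records this (a small illustrative sketch):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=4)

# After the shuffle the RDD carries a partitioner, so a later join or
# reduceByKey with the same partitioning can avoid another shuffle.
print(counts.partitioner)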
92. Execution Process
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)
[Diagram: RDD Objects — build the operator DAG → DAG Scheduler — split the graph into
stages of tasks and submit each stage as ready → Task Scheduler — receives a TaskSet,
launches tasks via the cluster manager, and retries failed or straggling tasks →
Worker — executes tasks and stores/serves blocks (task threads + block manager)]
93. DAG Scheduler
Input: RDD and partitions to compute
Output: output from actions on those partitions
Roles:
• Build stages of tasks
• Submit them to lower level scheduler (e.g. YARN, Mesos,
Standalone) as ready
• The lower-level scheduler schedules tasks based on data locality
• Resubmit failed stages if outputs are lost
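A tiny illustration of how stages form (names and sizes are made up): narrow transformations such as map are pipelined into a single stage, while a join introduces a shuffle boundary, which shows up as a new indentation level in the lineage debug string:

a = sc.parallelize(range(100)).map(lambda x: (x % 10, x))       # narrow: pipelined in one stage
b = sc.parallelize(range(100)).map(lambda x: (x % 10, x * x))
joined = a.join(b)                                               # wide: shuffle -> new stage

# Indented groups in the debug string roughly correspond to stages
# separated by shuffle dependencies.
print(joined.toDebugString())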
94. Job Scheduler
Captures RDD
dependency graph
Pipelines functions
into “stages”
Cache-aware for
data reuse & locality
Partitioning-aware
to avoid shuffles
[Diagram: an RDD dependency graph (RDDs A–G) built from map, union, groupBy, and join,
split into Stages 1–3 at shuffle boundaries; cached partitions are marked in the legend]