Making Structured Streaming Ready for Production (Databricks)
In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers write stream processing applications by removing the need to reason about the mechanics of streaming. It lets users express their streaming computations the same way they would express a batch computation on static data. The Spark SQL engine takes care of running the computation incrementally and continuously, updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as a streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover the major features we have added in more detail, the recipes for using them in production, and the exciting new features we have planned for future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
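As a taste of the Kafka source and watermark support listed above, here is a minimal PySpark sketch; the broker address, topic, and JSON schema are illustrative assumptions, not details from the talk, and the Kafka connector package must be on the classpath.

# Minimal PySpark sketch of the Kafka source with event-time watermarking.
# Broker address, topic name, and the JSON schema are illustrative assumptions.
# Requires the spark-sql-kafka-0-10 package at runtime.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-watermark-demo").getOrCreate()

schema = StructType([
    StructField("device", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
    .option("subscribe", "events")                          # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# Tolerate 10 minutes of out-of-order data, then count events per device per 5-minute window.
counts = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("device"))
    .count())

query = (counts.writeStream
    .outputMode("update")   # one of the output modes discussed in the talk
    .format("console")
    .start())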
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen (Confluent)
Flink and Kafka are popular components to build an open source stream processing infrastructure. We present how Flink integrates with Kafka to provide a platform with a unique feature set that matches the challenging requirements of advanced stream processing applications. In particular, we will dive into the following points:
Flink’s support for event-time processing, how it handles out-of-order streams, and how it can perform analytics on historical and real-time streams served from Kafka’s persistent log using the same code. We present Flink’s windowing mechanism, which supports time-, count- and session-based windows and allows intermixing event-time and processing-time semantics in one program.
How Flink’s checkpointing mechanism integrates with Kafka for fault-tolerance, for consistent stateful applications with exactly-once semantics.
We will discuss “Savepoints”, which allow users to save the state of the streaming program at any point in time. Together with a durable event log like Kafka, savepoints allow users to pause/resume streaming programs, go back to prior states, or switch to different versions of the program, while preserving exactly-once semantics.
We explain the techniques behind the combination of low-latency and high-throughput streaming, and how the latency/throughput trade-off can be configured.
We will give an outlook on current developments for streaming analytics, such as streaming SQL and complex event processing.
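For orientation, here is a rough PyFlink sketch of event-time windowing over a Kafka topic; it uses today's Flink SQL connector DDL rather than the DataStream code from the talk, and the topic, broker, and field names are assumptions.

# A rough PyFlink/Flink SQL sketch of event-time windowing over a Kafka topic.
# Connector properties, topic, and field names are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- tolerate 5s of out-of-order data
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Count clicks per user in 1-minute event-time tumbling windows.
t_env.execute_sql("""
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()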
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tathagata Das (Databricks)
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for developers to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
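As a hint of what such implicit state looks like in practice, here is a minimal PySpark sketch of one stateful operation, streaming deduplication; the rate source, column names, and watermark delay are illustrative assumptions rather than material from the talk.

# A small PySpark sketch of a stateful operation: streaming deduplication.
# The state needed to remember seen IDs lives in Structured Streaming's State Stores.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stateful-dedup-demo").getOrCreate()

# A synthetic stream: the rate source emits (timestamp, value) rows.
events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .select(col("timestamp").alias("event_time"),
            (col("value") % 100).alias("event_id")))   # duplicate IDs on purpose

# Deduplication keeps seen (event_id, event_time) pairs in the State Stores;
# state older than the watermark is dropped automatically, bounding store size.
deduped = (events
    .withWatermark("event_time", "30 minutes")
    .dropDuplicates(["event_id", "event_time"]))

query = deduped.writeStream.outputMode("append").format("console").start()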
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way (Databricks)
At Nielsen Identity, we use Apache Spark to process tens of TBs of data, running on AWS EMR. We started at a point where Spark was not even supported out-of-the-box by EMR, and today we’re spinning up clusters with thousands of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate the option of using Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we heavily rely on Spot Instances for our clusters). To achieve those goals, we combined the open-source GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs, with minimal changes, from AWS EMR to K8s.
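For context, here is a rough sketch of what such a natively integrated DAG can look like with the SparkKubernetesOperator from the Airflow cncf.kubernetes provider; the namespace, manifest file, and connection ID are placeholders, not the exact setup described in the talk.

# A rough Airflow DAG sketch using SparkKubernetesOperator.
# Assumes the apache-airflow-providers-cncf-kubernetes package is installed and that
# spark-pi.yaml is a SparkApplication manifest for the GCP spark-on-k8s-operator.
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkKubernetesOperator(
        task_id="submit_spark_pi",
        namespace="spark-jobs",             # placeholder namespace
        application_file="spark-pi.yaml",   # placeholder SparkApplication manifest
        kubernetes_conn_id="kubernetes_default",
    )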
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka PySpark tutorial will provide you with a detailed and comprehensive knowledge of PySpark, how it works, and why Python works so well with Apache Spark. You will also learn about RDDs, DataFrames and MLlib.
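To make that concrete, here is a short PySpark sketch touching the three pieces the tutorial mentions (RDDs, DataFrames, MLlib); the toy data and column names are made up for illustration.

# A short PySpark sketch of RDDs, DataFrames, and MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# RDD API: low-level, functional transformations.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())          # [1, 4, 9, 16]

# DataFrame API: named columns plus the Catalyst optimizer.
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
df.show()

# MLlib: assemble features and fit a simple linear regression.
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients)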
The document summarizes Spark SQL, which is a Spark module for structured data processing. It introduces key concepts like RDDs, DataFrames, and interacting with data sources. The architecture of Spark SQL is explained, including how it works with different languages and data sources through its schema RDD abstraction. Features of Spark SQL are covered such as its integration with Spark programs, unified data access, compatibility with Hive, and standard connectivity.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Optimizing Delta/Parquet Data Lakes for Apache Spark (Databricks)
Matthew Powers gave a presentation on optimizing Delta and Parquet data lakes. He discussed the benefits of using Delta lakes such as built-in time travel, compacting, and vacuuming capabilities. Delta lakes provide these features for free on top of Parquet files and a transaction log. Powers demonstrated how to create, compact, vacuum, partition, filter, and update Delta lakes in Spark. He showed that partitioning data significantly improves query performance by enabling data skipping and filtering at the partition level.
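A hedged sketch of the housekeeping steps mentioned above (compaction, vacuum, time travel); it assumes an active Delta-enabled SparkSession named spark, a placeholder table path, and uses the rewrite-with-dataChange=false pattern for compaction rather than a built-in OPTIMIZE command.

# Delta Lake maintenance sketch; assumes `spark` is a Delta-configured SparkSession.
from delta.tables import DeltaTable

path = "/tmp/delta/events"   # placeholder path

# Compact small files by rewriting the table with fewer, larger files.
(spark.read.format("delta").load(path)
    .repartition(16)
    .write.format("delta")
    .mode("overwrite")
    .option("dataChange", "false")   # tells readers the data itself did not change
    .save(path))

# Remove files no longer referenced by the transaction log (retention in hours).
DeltaTable.forPath(spark, path).vacuum(168)

# Time travel: read the table as of an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load(path)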
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
Cost-based Query Optimization in Apache Phoenix using Apache Calcite (Julian Hyde)
This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix+Calcite integration.
Tuning Apache Spark for Large-Scale Workloads, Gaoxiang Liu and Sital Kedia (Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Introduction to Apache Spark, with an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Downscaling: The Achilles Heel of Autoscaling Apache Spark Clusters (Databricks)
Adding nodes at runtime (upscaling) to already running Spark-on-YARN clusters is fairly easy. But taking away these nodes (downscaling) when the workload is low at some later point in time is a difficult problem. To remove a node from a running cluster, we need to make sure that it is not being used for compute or storage.
But on production workloads, we see that many of the nodes can’t be taken away because:
Nodes are running some containers even though they are not fully utilized, i.e., containers are fragmented across different nodes. For example, each node is running 1-2 containers/executors although it has resources to run 4.
Nodes hold some shuffle data on local disk which will be consumed by Spark applications running on this cluster later. In this case, the Resource Manager will never decide to reclaim these nodes, because losing shuffle data could lead to costly recomputation of stages.
In this talk, we will discuss how downscaling can be improved in Spark-on-YARN clusters under such constraints. We will cover changes in the scheduling strategy for container allocation in YARN and in the Spark task scheduler, which together help us achieve better packing of containers. This ensures that containers are defragmented onto a smaller set of nodes, leaving some nodes with no compute at all. In addition, we will cover enhancements to the Spark driver and the External Shuffle Service (ESS) which let us proactively delete shuffle data that we already know has been consumed. This ensures that nodes are not holding any unnecessary shuffle data, freeing them from storage and making them available for reclamation and faster downscaling.
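For reference, here is an illustrative PySpark configuration for the dynamic allocation plus External Shuffle Service setup whose downscaling behaviour the talk improves; the values are placeholders, not recommendations from the talk.

# Illustrative dynamic allocation + ESS configuration; values are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("autoscaling-job")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    # Executors can be removed because the external shuffle service keeps serving their shuffle files.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())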
Optimising Geospatial Queries with Dynamic File Pruning (Databricks)
One of the most significant benefits provided by Databricks Delta is the ability to use z-ordering and dynamic file pruning to significantly reduce the amount of data that is retrieved from blob storage and therefore drastically improve query times, sometimes by an order of magnitude.
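A minimal sketch of the Z-ordering step that enables this kind of pruning, expressed as Databricks Delta SQL; the table and column names are assumptions, and an active SparkSession named spark is assumed.

# Z-order a Delta table on the columns used in geospatial filters (Databricks Delta SQL).
spark.sql("""
    OPTIMIZE geo_events
    ZORDER BY (region, geohash)
""")

# Queries filtering on the Z-ordered columns can then skip most files.
spark.sql("SELECT count(*) FROM geo_events WHERE region = 'EU'").show()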
Introduction to DataFusion: An Embeddable Query Engine Written in Rust (Andrew Lamb)
Introduces the internal architecture of the Apache Arrow DataFusion query engine.
See https://arrow.apache.org/datafusion/ for more information
We want to present multiple anti-patterns that utilize Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
- Why? Custom queries on top of a table; we load the data once and query N times
- Why not Structured Streaming
- Working solution using Redis
Niche 2: Distributed Counters
- Problems with Spark Accumulators
- Utilizing Redis hashes as distributed counters
- Precautions for retries and speculative execution
- Pipelining to improve performance
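To illustrate the distributed-counters niche, here is a small PySpark plus redis-py sketch; the Redis host, key names, and columns are assumptions, not code from the talk.

# Executors increment Redis hash fields instead of Spark accumulators,
# using a pipeline to cut round trips.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-counters").getOrCreate()
df = spark.createDataFrame([("ok",), ("ok",), ("error",)], ["status"])

def count_partition(rows):
    # One connection per partition. Note: retried or speculative tasks can double count
    # unless increments are also keyed by task attempt.
    r = redis.Redis(host="localhost", port=6379)
    pipe = r.pipeline()
    for row in rows:
        pipe.hincrby("job:status_counts", row["status"], 1)
    pipe.execute()

df.foreachPartition(count_partition)
print(redis.Redis(host="localhost", port=6379).hgetall("job:status_counts"))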
Data Distribution and Ordering for Efficient Data Source V2 (Databricks)
This presentation discusses data distribution and ordering in Apache Iceberg's Data Source V2. It explains that proper distribution and ordering of data is important for performance when writing and reading large datasets. The new version introduces an API for connectors to specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling performance in Iceberg.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
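A minimal PySpark sketch of the unified load/save functions described above; the paths and JDBC options are placeholders.

# Unified load/save functions built on the Data Source API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-api").getOrCreate()

# Load: format name plus source-specific options.
people = spark.read.format("json").load("/tmp/people.json")

# Save: same pattern in reverse.
people.write.format("parquet").mode("overwrite").save("/tmp/people_parquet")

# Third-party sources plug into the same API, e.g. a JDBC table.
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost/db")   # placeholder connection
    .option("dbtable", "public.people")
    .option("user", "spark")
    .option("password", "secret")
    .load())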
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
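As a companion, here is a short PySpark sketch of the partition knobs those questions point at; the numbers are placeholders chosen to show the calls, not tuning advice.

# Partition and parallelism knobs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()

# Shuffle (exchange) partitions: controls parallelism after wide operations like joins/groupBy.
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.range(0, 100_000_000)

# Increase parallelism for a heavy stage...
wide = df.repartition(400)

# ...and reduce the number of output files without a full shuffle before writing.
wide.coalesce(16).write.mode("overwrite").parquet("/tmp/output")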
Vladimir Rodionov (Hortonworks)
Time-series applications (sensor data, application/system logging events, user interactions etc) present a new set of data storage challenges: very high velocity and very high volume of data. This talk will present the recent development in Apache HBase that make it a good fit for time-series applications.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
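For reference, an illustrative PySpark sketch of the unified memory-management knobs the talk covers; the values shown are Spark's documented defaults, set explicitly here only to make them visible.

# Unified memory management knobs (defaults shown explicitly).
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("memory-demo")
    # Fraction of (heap - 300MB) shared by execution and storage.
    .config("spark.memory.fraction", "0.6")
    # Portion of that region protected for cached (storage) data before execution can evict it.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate())

df = spark.range(0, 10_000_000)
df.cache().count()                           # exercises storage memory
df.groupBy(df.id % 10).count().collect()     # exercises execution memory (shuffle/aggregation buffers)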
Grafana 7.0 introduces new features including a tracing data viewer that allows users to view and correlate metrics, logs, and traces across data sources. It also includes new data transformations that let users reshape query results before they are visualized. Additionally, Grafana 7.0 features a new plugin architecture that splits core functionality into packages and supports official backend plugins running as a separate process.
ETL with SPARK - First Spark London meetup (Rafal Kwasny)
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
ELK: Elasticsearch, Logstash and Kibana Stack for Log Management (El Mahdi Benzekri)
An introduction to the powerful Elasticsearch, Logstash and Kibana stack. It has many use cases; the most popular is server and application log management.
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of JVM, but Python is one of the officially supported languages. But how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
Introduction to Java GC Tuning and Java Mission Control (Leon Chen)
This document provides an introduction and overview of Java garbage collection (GC) tuning and the Java Mission Control tool. It begins with information about the speaker, Leon Chen, including his background and patents. It then outlines the Java and JVM roadmap and upcoming features. The bulk of the document discusses GC tuning concepts like heap sizing, generation sizing, footprint vs throughput vs latency. It provides examples and recommendations for GC logging, analysis tools like GCViewer and JWorks GC Web. The document is intended to outline Oracle's product direction and future plans for Java GC tuning and tools.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
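To ground that, here is a small PySpark sketch of writing and reading Parquet with compression and partitioning; paths and column names are assumptions.

# Columnar Parquet files with snappy compression, partitioned on disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("2017-01-01", "nyc", 3.2), ("2017-01-01", "atl", 18.5)],
    ["date", "city", "temp_c"])

(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .partitionBy("date")
   .parquet("/tmp/weather_parquet"))

# Column pruning + partition pruning: only 'temp_c' for the matching date directory is read.
spark.read.parquet("/tmp/weather_parquet").where("date = '2017-01-01'").select("temp_c").show()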
- Apache Thrift is a cross-language services framework that allows for the easy definition of data types and remote procedure calls (RPCs).
- It uses an interface definition language (IDL) to define data types and services, and generates code in various languages to implement clients and servers.
- Apache Thrift supports a wide range of languages and transports, making it useful for building high-performance, scalable distributed applications and microservices.
Apache Thrift is a framework for defining and implementing service interfaces and generating code to facilitate remote procedure calls across multiple languages. It handles networking, serialization, and other low-level details, allowing developers to focus on implementing service logic. Services are defined using an interface definition language (IDL) that specifies data types, service methods, and exceptions. The Thrift compiler then generates code to implement clients and servers for the defined services in various languages. On the server side, developers implement handlers that define the logic for each service method. The generated code provides a simple way to deploy the service by connecting the various networking and serialization layers. Many large companies use Thrift for building scalable distributed systems across multiple languages and platforms.
Apache Thrift - RPC Service Across Languages (Jimmy Lai)
These slides illustrate how to use Apache Thrift for building an RPC service and provide demo example code in Python. The example scenario: we have a prepared machine learning model, and we'd like to load the model in advance in a server that provides a prediction service.
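A hedged sketch of what the client side of such a service can look like in Python; the predictor module and its PredictionService are hypothetical stand-ins for Thrift-generated code, and the host, port, and predict method are placeholders.

# Thrift Python client sketch; 'predictor' is hypothetical generated code.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from predictor import PredictionService            # hypothetical generated module

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = PredictionService.Client(protocol)

transport.open()
try:
    # Call the remote model that the server loaded in advance.
    print(client.predict([0.5, 1.2, 3.4]))          # hypothetical service method
finally:
    transport.close()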
Apache Thrift: One Stop Solution for Cross Language Communication (Piyush Goel)
Apache Thrift is a framework for cross-language communication that supports RPC. It was developed by Facebook and entered Apache incubation in 2008. Thrift supports languages like C++, Java, Python, PHP, Ruby, and others. It provides a type system, transport layer, protocol layer, processors, and servers to enable cross-language communication and RPC. Companies like Capillary use Thrift for tasks like processing business rules with PHP-Java communication and sending promotional SMS from a Java system.
Thrift is an interface definition language and binary communication protocol used at Facebook as a remote procedure call framework. It combines a software stack with a code generation engine to build services that work efficiently across multiple languages like C#, C++, Java, PHP and Python. Thrift allows developers to define data structures and service interfaces in an IDL file, then generates code for clients and servers. It provides features like common data transport, versioning, and supports efficient protocols to handle large service call volumes across Facebook applications.
RESTLess Design with Apache Thrift: Experiences from Apache Airavata (smarru)
Apache Airavata is software for providing services to manage scientific applications on a wide range of remote computing resources. Airavata can be used by both individual scientists to run scientific workflows as well as communities of scientists through Web browser interfaces. It is a challenge to bring all of Airavata’s capabilities together in the single API layer that is our prerequisite for a 1.0 release. To support our diverse use cases, we have developed a rich data model and messaging format that we need to expose to client developers using many programming languages. We do not believe this is a good match for REST style services. In this presentation, we present our use and evaluation of Apache Thrift as an interface and data model definition tool, its use internally in Airavata, and its use to deliver and distribute client development kits.
Thrift is a software framework that allows for efficient cross-language communication. It provides features such as RPC, code generation, and serialization to make it easy to define and develop services that can be used across multiple languages. Supported languages include C++, Java, Python, PHP and more. Thrift handles low-level details like data serialization while providing an interface definition language to define services and data structures.
As a follow up to my illustration of the Axolotl Protocol as implemented by the TextSecure Protocol V2 https://github.com/WhisperSystems/TextSecure/wiki/ProtocolV2 here is TextSecure's Protocol Buffer usage https://code.google.com/p/protobuf/ illustrated
Acceleration for Big Data, Hadoop and Memcached - it168文库 (Accenture)
This document summarizes a presentation on accelerating big data, Hadoop and Memcached using high-performance networks. It provides an overview of the architectures of Memcached, Hadoop Distributed File System (HDFS) and HBase. It discusses the challenges in designing communication and I/O libraries for enterprise systems and how protocols like Remote Direct Memory Access (RDMA) can help overcome these challenges. It presents case studies on redesigning Memcached to use verbs interfaces and achieves significant latency improvements over 1GbE and 10GbE networks for both small and large messages.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
Hive User Meeting March 2010 - Hive Team (Zheng Shao)
The document summarizes new features and API updates in the Hive data warehouse software. It discusses enhancements to JDBC/ODBC connectivity, the introduction of CREATE TABLE AS SELECT (CTAS) functionality, improvements to join strategies including map joins and bucketed map joins, and work on views, HBase integration, user-defined functions, serialization/deserialization (SerDe), and object inspectors. It also provides guidance on developing new SerDes for custom data formats and serialization needs.
This document discusses the efficient object model in Java used by Apache Hive. It covers object inspectors, on-disk and in-memory data formats, delegation patterns for multiple in-memory formats, the role of SerDes, and optimizations for the LazySimpleSerDe. Future work areas include fully delegating object inspector methods and supporting union data types.
Facebook uses a LAMP stack with additional services and customizations for its architecture. The LAMP stack includes PHP, MySQL, and Memcache. PHP is used for its web programming capabilities but has scaling issues. MySQL is used primarily as a key-value store and is optimized for recency of data. Memcache provides high-performance caching. Additional services are created using Thrift for tasks better suited to other languages. Services include News Feed, Search, and logging via Scribe. The architecture emphasizes scalability, simplicity, and optimizing for common use cases.
OVERVIEW OF FACEBOOK SCALABLE ARCHITECTURE (Rishikese MR)
The document provides an overview of Facebook's scalable architecture presented by Sharath Basil Kurian. It discusses how Facebook uses a variety of technologies like LAMP stack, PHP, Memcached, HipHop, Haystack, Scribe, Thrift, Hadoop and Hive to handle large amounts of user data and scale to support its massive user base. The architecture includes front-end components like PHP and BigPipe to dynamically render pages and back-end databases and caches like MySQL, Memcached and Haystack to efficiently store and retrieve user data.
Apache is an open source web server that is very popular, secure, fast, and reliable. It implements many features including CGI, SSL, virtual domains, and plug-in modules for extensibility. Apache uses simple text configuration files like httpd.conf to configure settings and is run from the command line using scripts like apachectl to start, stop, and restart the server.
Facebook uses a distributed systems architecture with services like Memcache, Scribe, Thrift, and Hip Hop to handle large data volumes and high concurrency. Key components include the Haystack photo storage system, BigPipe for faster page loading, and a PHP front-end optimized using Hip Hop. Data is partitioned horizontally and services communicate using lightweight protocols like Thrift.
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs. Parquet block size, how best to run the HDFS Balancer to redistribute file blocks, etc.), you will get all the scoop in this information-packed presentation.
Automated Reports with Rstudio Server
Automated KPI reporting with Shiny Server
Process Validation Documentation with Jupyter Notebook
Automated Machine Learning with Dataiku
Infrastructure & System Monitoring using PrometheusMarco Pas
The document introduces infrastructure and system monitoring using Prometheus. It discusses the importance of monitoring, common things to monitor like services, applications, and OS metrics. It provides an overview of Prometheus including its main components and data format. The document demonstrates setting up Prometheus, adding host metrics using Node Exporter, configuring Grafana, monitoring Docker containers using cAdvisor, configuring alerting in Prometheus and Alertmanager, instrumenting application code, and integrating Consul for service discovery. Live code demos are provided for key concepts.
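As a taste of the application-instrumentation step, here is a minimal Python sketch using the official prometheus_client library; the metric names and port are placeholders.

# Instrumenting application code for Prometheus.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():            # observes elapsed time into the histogram
        time.sleep(random.random() / 10)
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(8000)         # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()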
This document provides an introduction to Node.js, a framework for building scalable server-side applications with asynchronous JavaScript. It discusses what Node.js is, how it uses non-blocking I/O and events to avoid wasting CPU cycles, and how external Node modules help create a full JavaScript stack. Examples are given of using Node modules like Express for building RESTful APIs and Socket.IO for implementing real-time features like chat. Best practices, limitations, debugging techniques and references are also covered.
This document summarizes how to test Java web applications on mobile devices using Arquillian and Selenium. It describes setting up Android emulators, configuring the Arquillian extension for AndroidDriver, and writing sample unit and functional tests for a mobile web application using Page Object Model patterns and the WebDriver API. Tips are provided for debugging tests, capturing screenshots on failure, and integrating tests with Jenkins.
The document provides an overview of asynchronous programming in Python. It discusses how asynchronous programming can improve performance over traditional synchronous and threaded models by keeping resources utilized continuously. It introduces key concepts like callbacks, coroutines, tasks and the event loop. It also covers popular asynchronous frameworks and modules in Python like Twisted, Tornado, gevent and asyncio. Examples are provided to demonstrate asynchronous HTTP requests and concurrent factorial tasks using the asyncio module. Overall, the document serves as an introduction to asynchronous programming in Python.
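To make the model concrete, here is a small asyncio sketch of coroutines and tasks sharing one event loop; the URLs are placeholders and the I/O is simulated with sleep so the example needs no third-party packages.

# Coroutines running concurrently on a single-threaded event loop.
import asyncio

async def fetch(url):
    await asyncio.sleep(1)          # stand-in for a non-blocking HTTP request
    return f"fetched {url}"

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    # All three "requests" run concurrently; total time is ~1s, not ~3s.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(results)

asyncio.run(main())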
Listen up, developers. You are not special. Your infrastructure is not a beautiful and unique snowflake. You have the same tech debt as everyone else. This is a talk about a better way to build and manage infrastructure: Terraform Modules. It goes over how to build infrastructure as code, package that code into reusable modules, design clean and flexible APIs for those modules, write automated tests for the modules, and combine multiple modules into an end-to-end tech stack in minutes.
You can find the video here: https://www.youtube.com/watch?v=LVgP63BkhKQ
This document provides instructions for contributing to Apache CloudStack by translating documentation or building packages. It outlines the translation process using Transifex, how to build and package CloudStack using Maven, and future release schedules including a documentation sprint for Japanese translations in June/July.
Like it or not, many open source developers are moving to the Microsoft .NET platform, and we're bringing our favorite tools with us!
In this session, we look inside ASF projects that are creating software for .NET and Mono, like Logging and Lucene.net -- to show you how to create leading-edge ASP.NET applications with ASF open source libraries, and how you can integrate with other applications using Thrift, Chemistry/DotCMIS, QPid or ActiveMQ.
We'll also look at integrating other .NET open source projects, like Spring.NET, NVelocity, and JayRock, into your C# application to create a complete open source .NET stack.
This document provides an overview of the Tornado web server and summarizes its internals. It begins with an introduction to Tornado, describing it as a scalable, non-blocking web server and framework written in Python. It then outlines the main Tornado modules and discusses sockets, I/O monitoring using select, poll and epoll, and how Tornado sets up its server loop and handles requests.
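For orientation, here is a minimal Tornado sketch showing a handler, an application, and the IOLoop that the server-loop discussion refers to; the port is a placeholder.

# Minimal Tornado application and event loop.
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello from Tornado")

def make_app():
    return tornado.web.Application([(r"/", MainHandler)])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)                         # registers the listening socket with the IOLoop
    tornado.ioloop.IOLoop.current().start()  # the single-threaded loop that multiplexes connections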
Everything you wanted to know about writing async, concurrent HTTP apps in Java (Baruch Sadogursky)
As presented at CodeMotion Tel Aviv:
Facing tens of millions of clients continuously downloading binaries from its repositories, JFrog decided to offer an OSS client that natively supports these downloads. This session shares the main challenges of developing a highly concurrent, resumable, async download library on top of an Apache HTTP client. It also covers other libraries JFrog tested and why it decided to reinvent the wheel. Consider yourself forewarned: lots of HTTP internals, NIO, and concurrency ahead!
Talk given at the DomCode meetup in Utrecht (August 2014) on different frameworks for doing asynchronous I/O in Python, with a strong focus on asyncio (PEP-3156).
Try to imagine the amount of time and effort it would take you to write a bug-free script or application that will accept a URL, port scan it, and for each HTTP service that it finds, create a new thread and perform black box penetration testing while impersonating a BlackBerry 9900 smartphone. While you’re thinking, here’s how you would have done it in Hackersh:
"http://localhost" \
-> url \
-> nmap \
-> browse(ua="Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+") \
-> w3af
Meet Hackersh (“Hacker Shell”) – A new, free and open source cross-platform shell (command interpreter) with built-in security commands and Pythonect-like syntax.
Aside from being interactive, Hackersh is also scriptable with Pythonect. Pythonect is a new, free, and open source general-purpose dataflow programming language based on Python, written in Python. Hackersh is inspired by Unix pipeline, but takes it a step forward by including built-in features like remote invocation and threads. This 120 minute lab session will introduce Hackersh, the automation gap it fills, and its features. Lots of demonstrations and scripts are included to showcase concepts and ideas.
TouK Nussknacker is a GUI for Apache Flink that was created to address client needs for a more user-friendly interface. It allows defining data pipelines and business rules through expressions that are accessible for semi-technical users. The architecture uses a model-driven approach where the data, sources, sinks, processors and rules are defined through JSON configuration. It has been used successfully in production for over a year at one of the largest Polish telcos for real-time marketing and fraud detection applications.
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019 (Thomas Weise)
Apache Beam is a unified programming model for batch and streaming data processing that provides portability across distributed processing backends. It aims to support multiple languages like Java, Python and Go. The Beam Python SDK allows writing pipelines in Python that can run on distributed backends like Apache Flink. Lyft developed a Python SDK runner for Flink that translates Python pipelines to native Flink APIs using the Beam Fn API for communication between the SDK and runner. Future work includes improving performance of Python pipelines on JVM runners and supporting multiple languages in a single pipeline.
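A hedged sketch of a Beam Python pipeline aimed at the Flink runner as described above; the runner options are placeholders and depend on how the Flink cluster and SDK harness are set up.

# Beam Python pipeline targeting the Flink runner via the portability framework.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",                 # hand the pipeline to Flink
    "--flink_master=localhost:8081",        # placeholder Flink REST endpoint
    "--environment_type=LOOPBACK",          # run the Python SDK harness locally for testing
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["spark", "flink", "beam", "flink"])
     | "Pair" >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))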
Faster PHP apps using Queues and Workers (Richard Baker)
PHP apps typically perform tasks in a synchronous manner: resizing an image or sending a push notification, for example. For most applications this works well, but as apps grow or experience increased traffic, each task adds extra milliseconds to a request, leaving users waiting.
A common solution is to defer these tasks to the background using a cron task. However, there is a better way. Job queues not only help to decouple your application and improve resilience but will also cut request times.
In this talk we’ll explore some common queue systems: the features and trade-offs of each solution, what to queue, refactoring existing code into jobs, and running workers. By the end you’ll be ready to build your next app one job at a time.
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Spark
DevNexus 2022 Atlanta
https://devnexus.com/presentations/7150/
This talk is a quick overview of the How, What and WHY of Apache Pulsar, Apache Flink and Apache NiFi. I will show you how to design event-driven applications that scale the cloud native way.
This talk was done live in person at DevNexus across from the booth in room 311
Tim Spann
Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
This presentation gives an overview of the Apache Airavata project. It explains Apache Airavata in terms of its architecture, data models and user interface.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, its architecture and dependencies, and also gives an SQL example.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache MXNet AI project. It explains Apache MXNet in terms of its architecture, ecosystem, languages and the generic problems that the architecture attempts to solve.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, data sources/sinks and its work unit processing.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Singa AI project. It explains Apache Singa in terms of its architecture, distributed training and functionality.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Ranger project. It explains Apache Ranger in terms of its architecture, security, audit and plugin features.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the OrientDB database project. It explains OrientDB in terms of its functionality, its indexing and architecture. It examines the ETL functionality as well as the UI available.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Prometheus project. It explains Prometheus in terms of its visualisation, time series processing capabilities and architecture. It also examines its query language, PromQL.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Tephra project. It explains Tephra in terms of Phoenix, HBase and HDFS. It examines the project architecture and configuration.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
Kudu is an open source column-oriented data store that integrates with the Hadoop ecosystem to provide fast processing of online analytical processing (OLAP) workloads. It scales to large datasets and clusters, with a master-tablet server architecture providing fault tolerance and high availability. Kudu uses a columnar storage format and supports various column types, configurations, and partitioning strategies to optimize performance and distribution of data and loads.
Apache Bahir provides streaming connectors and SQL data sources for Apache Spark and Apache Flink in a centralized location. It contains connectors for ActiveMQ, Akka, Flume, InfluxDB, Kudu, Netty, Redis, CouchDB, Cloudant, MQTT, and Twitter. Bahir is an important project because it enables reuse of extensions and saves time and money compared to recreating connectors. Though small, it covers multiple Spark and Flink extensions with the potential for future extensions. The project is currently active with regular updates to the GitHub repository and comprehensive documentation for its connectors.
This presentation gives an overview of the Apache Arrow project. It explains the Arrow project in terms of its in-memory structure, its purpose, language interfaces and supporting projects.
This presentation gives an overview of the JanusGraph DB project. It explains the JanusGraph database in terms of its architecture, storage backends, capabilities and community.
This presentation gives an overview of the Apache Ignite project. It explains Ignite in relation to its architecture, scalability, caching, data grid and machine learning capabilities.
This presentation gives an overview of the Apache Samza project. It explains Samza's stream processing capabilities as well as its architecture, users, use cases etc.
This presentation gives an overview of the Apache Flink project. It explains Flink in terms of its architecture, use cases and the manner in which it works.
Apache Edgent is an open source programming model and runtime for analyzing data and events at edge devices. It allows processing data at the edge to save money by only sending essential data from devices. Edgent provides connectors for various data sources and sinks and can be used for IoT, embedded in application servers, and for monitoring machines. The edge refers to devices, gateways, and sensors at the network boundary that provide potential data. Edgent applications follow a common structure of getting a provider, creating a topology, composing processing graphs, and submitting it for execution.
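To make that common structure concrete, the sketch below shows a minimal Edgent-style application in Java. It is an illustration only and is not taken from the presentation: the class and method names (TempSensorApp, readSensor), the simulated sensor values and the one-second polling interval are assumptions, and it presumes the standard DirectProvider API from the Apache Edgent libraries.
// Minimal Edgent sketch ( illustrative only ): get a provider, create a topology,
// compose a processing graph, then submit it for execution.
package example;
import java.util.concurrent.TimeUnit;
import org.apache.edgent.providers.direct.DirectProvider;
import org.apache.edgent.topology.TStream;
import org.apache.edgent.topology.Topology;
public class TempSensorApp {
  public static void main(String[] args) {
    DirectProvider dp = new DirectProvider();        // 1. get a provider
    Topology topology = dp.newTopology("sensor");    // 2. create a topology
    // 3. compose the graph: poll a simulated sensor once a second,
    //    keep only readings above 80 and print them
    TStream<Double> readings = topology.poll(TempSensorApp::readSensor, 1, TimeUnit.SECONDS);
    readings.filter(r -> r > 80.0).print();
    dp.submit(topology);                             // 4. submit for execution
  }
  static Double readSensor() {
    return Math.random() * 100.0;                    // stand-in for a real edge device reading
  }
}
Only readings that pass the filter leave the device, which is the cost-saving pattern of sending just the essential data described above.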
CouchDB is an open-source document-oriented NoSQL database that stores data in JSON format. It provides ACID support through multi-version concurrency control and a crash-only design that ensures data integrity even if the database or servers crash. CouchDB supports single node or clustered deployments and uses bidirectional replication to synchronize data across nodes. It prioritizes availability and partition tolerance according to the CAP theorem.
Apache Mesos is a cluster manager that provides resource sharing and isolation. It allows multiple distributed systems like Hadoop, Spark, and Storm to run on the same pool of nodes. Mesos introduces resource sharing to improve cluster utilization and application performance. It uses a master/slave architecture with fault tolerance and has APIs for developers in C++, Java, and Python.
Pentaho is an open-source business intelligence system that offers analytics, visual data integration, OLAP, reports, dashboards, data mining, and ETL capabilities. It includes both a server and client components, which are available for Windows, Linux, and Mac OSX. The server provides analytics, dashboarding, reporting, and data access services, while the client offers data integration, big data support, report design, data mining, metadata management, and other tools. Pentaho also has an extensive library of plugins and supports visual drag-and-drop development of ETL jobs and integration with Hadoop for big data analytics.
New Ways to Reduce Database Costs with ScyllaDBScyllaDB
How ScyllaDB’s latest capabilities can reduce your infrastructure costs
ScyllaDB has been obsessed with price-performance from day 1. Our core database is architected with low-level engineering optimizations that squeeze every ounce of power from the underlying infrastructure. And we just completed a multi-year effort to introduce a set of new capabilities for additional savings.
Join this webinar to learn about these new capabilities: the underlying challenges we wanted to address, the workloads that will benefit most from each, and how to get started. We’ll cover ways to:
- Avoid overprovisioning with “just-in-time” scaling
- Safely operate at up to ~90% storage utilization
- Cut network costs with new compression strategies and file-based streaming
We’ll also highlight a “hidden gem” capability that lets you safely balance multiple workloads in a single cluster. To conclude, we will share the efficiency-focused capabilities on our short-term and long-term roadmaps.
UiPath Community Zurich: Release Management and Build PipelinesUiPathCommunity
Ensuring robust, reliable, and repeatable delivery processes is more critical than ever - it's a success factor for your automations and for automation programmes as a whole. In this session, we’ll dive into modern best practices for release management and explore how tools like the UiPathCLI can streamline your CI/CD pipelines. Whether you’re just starting with automation or scaling enterprise-grade deployments, our event promises to deliver helpful insights to you. This topic is relevant for both on-premise and cloud users - as well as for automation developers and software testers alike.
📕 Agenda:
- Best Practices for Release Management
- What it is and why it matters
- UiPath Build Pipelines Deep Dive
- Exploring CI/CD workflows, the UiPathCLI and showcasing scenarios for both on-premise and cloud
- Discussion, Q&A
👨🏫 Speakers
Roman Tobler, CEO@ Routinuum
Johans Brink, CTO@ MvR Digital Workforce
We look forward to bringing best practices and showcasing build pipelines to you - and to having interesting discussions on this important topic!
If you have any questions or inputs prior to the event, don't hesitate to reach out to us.
This event streamed live on May 27 at 16:00 CET.
Check out all our upcoming UiPath Community sessions at:
👉 https://community.uipath.com/events/
Join UiPath Community Zurich chapter:
👉 https://community.uipath.com/zurich/
Maxx nft market place new generation nft marketing placeusersalmanrazdelhi
PREFACE OF MAXXNFT
MaxxNFT: Powering the Future of Digital Ownership
MaxxNFT is a cutting-edge Web3 platform designed to revolutionize how digital assets are owned, traded, and valued. Positioned at the forefront of the NFT movement, MaxxNFT views NFTs not just as collectibles, but as the next generation of internet equity: unique, verifiable digital assets that unlock new possibilities for creators, investors, and everyday users alike.
Through strategic integrations with OKT Chain and OKX Web3, MaxxNFT enables seamless cross-chain NFT trading, improved liquidity, and enhanced user accessibility. These collaborations make it easier than ever to participate in the NFT ecosystem while expanding the platform's global reach.
With a focus on innovation, user rewards, and inclusive financial growth, MaxxNFT offers multiple income streams, from referral bonuses to liquidity incentives, creating a vibrant community-driven economy. Whether you're minting your first NFT or building a digital asset portfolio, MaxxNFT empowers you to participate in the future of decentralized value exchange.
https://maxxnft.xyz/
Data Virtualization: Bringing the Power of FME to Any ApplicationSafe Software
Imagine building web applications or dashboards on top of all your systems. With FME’s new Data Virtualization feature, you can deliver full CRUD (create, read, update, and delete) capabilities on top of all your data, exploiting the full power of FME’s all data, any AI capabilities. Data Virtualization enables you to build OpenAPI-compliant API endpoints using FME Form’s no-code development platform.
In this webinar, you’ll see how easy it is to turn complex data into real-time, usable REST API based services. We’ll walk through a real example of building a map-based app using FME’s Data Virtualization, and show you how to get started in your own environment – no dev team required.
What you’ll take away:
-How to build live applications and dashboards with federated data
-Ways to control what’s exposed: filter, transform, and secure responses
-How to scale access with caching, asynchronous web call support, and API endpoint-level security
-Where this fits in your stack: from web apps, to AI, to automation
Whether you’re building internal tools, public portals, or powering automation – this webinar is your starting point to real-time data delivery.
Jeremy Millul - A Talented Software DeveloperJeremy Millul
Jeremy Millul is a talented software developer based in NYC, known for leading impactful projects such as a Community Engagement Platform and a Hiking Trail Finder. Using React, MongoDB, and geolocation tools, Jeremy delivers intuitive applications that foster engagement and usability. A graduate of NYU’s Computer Science program, he brings creativity and technical expertise to every project, ensuring seamless user experiences and meaningful results in software development.
Evaluation Challenges in Using Generative AI for Science & Technical ContentPaul Groth
Evaluation Challenges in Using Generative AI for Science & Technical Content.
Foundation Models show impressive results in a wide range of tasks on scientific and legal content, from information extraction to question answering and even literature synthesis. However, standard evaluation approaches (e.g. comparing to ground truth) often don't seem to work: qualitatively the results look great, but quantitative scores do not align with these observations. In this talk, I discuss the challenges we've faced with evaluation in our lab. I then outline potential routes forward.
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Peter Bittner
How do you onboard new colleagues in 2025? How long does it take? Would you love a standardized setup under version control that everyone can customize for themselves? A stable desktop setup, reinstalled in just minutes. It can be done.
This talk was given in Italian, 29 May 2025, at PyCon 25, Bologna, Italy. All slides are provided in English.
Original slides at https://slides.com/bittner/pycon25-nixos-for-python-developers
6th Power Grid Model Meetup
Join the Power Grid Model community for an exciting day of sharing experiences, learning from each other, planning, and collaborating.
This hybrid in-person/online event will include a full day agenda, with the opportunity to socialize afterwards for in-person attendees.
If you have a hackathon proposal, tell us when you register!
About Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
AI Emotional Actors: “When Machines Learn to Feel and Perform”AkashKumar809858
Welcome to the era of AI Emotional Actors.
The entertainment landscape is undergoing a seismic transformation. What started as motion capture and CGI enhancements has evolved into a full-blown revolution: synthetic beings not only perform but express, emote, and adapt in real time.
For reading further follow this link -
https://akash97.gumroad.com/l/meioex
European Accessibility Act & Integrated Accessibility TestingJulia Undeutsch
Emma Dawson will guide you through two important topics in this session.
Firstly, she will prepare you for the European Accessibility Act (EAA), which comes into effect on 28 June 2025, and show you how development teams can prepare for it.
In the second part of the webinar, Emma Dawson will explore with you various integrated testing methods and tools that will help you improve accessibility during the development cycle, such as Linters, Storybook, Playwright, just to name a few.
Focus: European Accessibility Act, Integrated Testing tools and methods (e.g. Linters, Storybook, Playwright)
Target audience: Everyone, Developers, Testers
Co-Constructing Explanations for AI Systems using ProvenancePaul Groth
Explanation is not a one-off: it's a process where people and systems work together to gain understanding. This idea of co-constructing explanations, or explanation by exploration, is a powerful way to frame the problem of explanation. In this talk, I discuss our first experiments with this approach for explaining complex AI systems by using provenance. Importantly, I discuss the difficulty of evaluation and some of our first approaches to evaluating these systems at scale. Finally, I touch on the importance of explanation to the comprehensive evaluation of AI systems.
ELNL2025 - Unlocking the Power of Sensitivity Labels - A Comprehensive Guide....Jasper Oosterveld
Sensitivity labels, powered by Microsoft Purview Information Protection, serve as the foundation for classifying and protecting your sensitive data within Microsoft 365. Their importance extends beyond classification: they also play a crucial role in enforcing governance policies across your Microsoft 365 environment. Join me, a Data Security Consultant and Microsoft MVP, as I share practical tips and tricks to get the full potential out of sensitivity labels. I discuss sensitive information types, automatic labeling, and seamless integration with Data Loss Prevention, Teams Premium, and Microsoft 365 Copilot.
Droidal: AI Agents Revolutionizing HealthcareDroidal LLC
Droidal’s AI Agents are transforming healthcare by bringing intelligence, speed, and efficiency to key areas such as Revenue Cycle Management (RCM), clinical operations, and patient engagement. Built specifically for the needs of U.S. hospitals and clinics, Droidal's solutions are designed to improve outcomes and reduce administrative burden.
Through simple visuals and clear examples, the presentation explains how AI Agents can support medical coding, streamline claims processing, manage denials, ensure compliance, and enhance communication between providers and patients. By integrating seamlessly with existing systems, these agents act as digital coworkers that deliver faster reimbursements, reduce errors, and enable teams to focus more on patient care.
Droidal's AI technology is more than just automation — it's a shift toward intelligent healthcare operations that are scalable, secure, and cost-effective. The presentation also offers insights into future developments in AI-driven healthcare, including how continuous learning and agent autonomy will redefine daily workflows.
Whether you're a healthcare administrator, a tech leader, or a provider looking for smarter solutions, this presentation offers a compelling overview of how Droidal’s AI Agents can help your organization achieve operational excellence and better patient outcomes.
A free demo trial is available for those interested in experiencing Droidal’s AI Agents firsthand. Our team will walk you through a live demo tailored to your specific workflows, helping you understand the immediate value and long-term impact of adopting AI in your healthcare environment.
To request a free trial or learn more:
https://droidal.com/
Contributing to WordPress With & Without Code.pptxPatrick Lumumba
Contributing to WordPress: Making an Impact on the Test Team—With or Without Coding Skills
WordPress survives on collaboration, and the Test Team plays a very important role in ensuring the CMS is stable, user-friendly, and accessible to everyone.
This talk aims to deconstruct the myth that one has to be a developer to contribute to WordPress. In this session, I will share with the audience how to get involved with the WordPress Team, whether a coder or not.
We’ll explore practical ways to contribute, from testing new features, and patches, to reporting bugs. By the end of this talk, the audience will have the tools and confidence to make a meaningful impact on WordPress—no matter the skill set.
Supercharge Your AI Development with Local LLMsFrancesco Corti
In today's AI development landscape, developers face significant challenges when building applications that leverage powerful large language models (LLMs) through SaaS platforms like ChatGPT, Gemini, and others. While these services offer impressive capabilities, they come with substantial costs that can quickly escalate, especially during the development lifecycle. Additionally, the inherent latency of web-based APIs creates frustrating bottlenecks during the critical testing and iteration phases of development, slowing down innovation and frustrating developers.
This talk will introduce the transformative approach of integrating local LLMs directly into their development environments. By bringing these models closer to where the code lives, developers can dramatically accelerate development lifecycles while maintaining complete control over model selection and configuration. This methodology effectively reduces costs to zero by eliminating dependency on pay-per-use SaaS services, while opening new possibilities for comprehensive integration testing, rapid prototyping, and specialized use cases.
2. Apache Thrift – What is it ?
● A multi-language software framework
● For client / service development
● It is scalable
● Uses a compiler to generate source code
● Released under the Apache 2 license
● Uses an IDL ( interface description language )
● Originally developed at Facebook
3. Apache Thrift – Languages
Languages supported
– C++
– Cocoa
– Java
– JavaScript
– Python
– Node.js
– PHP
– Smalltalk
– Ruby
– OCaml
– Erlang
– Delphi
– Perl
– Haskell
– C#
– ..... and others
4. Apache Thrift – Process
So how do we start to create a Thrift based app ?
– We define a Thrift file
  – Which includes our interface definition
    ● Defines our Thrift types
    ● Defines our services
  – Our clients will be able to call our services
– We now use the Thrift compiler
  – thrift --gen <language> <Thrift file>
  – To generate our source code
– Now we add our extra client / server functionality
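To make the process concrete, here is a minimal sketch of the kind of Thrift file the later example slides imply. It is not shown in the original deck: the service name TimeServer, the Java namespace tserver.gen and the i64 return type of time() are inferred from the generated classes used on slides 9 to 11.
// TimeServer.thrift – assumed interface definition ( inferred, not from the original deck )
namespace java tserver.gen

service TimeServer {
  // returns the current server time in milliseconds since the epoch
  i64 time()
}
Running thrift --gen java TimeServer.thrift against a file like this would produce the tserver.gen.TimeServer class, including the TimeServer.Iface, TimeServer.Processor and TimeServer.Client types that the following slides rely on.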
6. Apache Thrift – Where used ?
So who is using Thrift ?
– Cassandra
– Scribe
– Hadoop – HBase
– ThriftDB
– Others .........
9. Apache Thrift – Example
Implement the generated TimeServer.Iface interface in TimeServerImpl.java
package server;
import org.apache.thrift.TException;
import tserver.gen.*;
class TimeServerImpl implements TimeServer.Iface {
  @Override
  public long time() throws TException {
    // return the current server time in milliseconds
    long time = System.currentTimeMillis();
    System.out.println("time() called: " + time);
    return time;
  }
}
10. Apache Thrift – Example – Java Server
package server;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TBinaryProtocol.Factory;
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TThreadPoolServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TTransportException;
import tserver.gen.TimeServer;
public class Server {
  private void start() {
    try {
      // listen for client connections on port 7911
      TServerSocket serverTransport = new TServerSocket(7911);
      // wire the generated processor to our TimeServerImpl handler
      TimeServer.Processor processor = new TimeServer.Processor(new TimeServerImpl());
      Factory protFactory = new TBinaryProtocol.Factory(true, true);
      TServer server = new TThreadPoolServer(processor, serverTransport, protFactory);
      System.out.println("Starting server on port 7911 ...");
      server.serve();
    } catch (TTransportException e) {
      e.printStackTrace();
    }
  }
  public static void main(String[] args) {
    Server srv = new Server();
    srv.start();
  }
}
11. Apache Thrift – Example – Java Client
package client;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import tserver.gen.TimeServer.Client;
public class TimeClient {
  private void start() {
    TTransport transport;
    try {
      // connect to the server on localhost port 7911
      transport = new TSocket("localhost", 7911);
      TProtocol protocol = new TBinaryProtocol(transport);
      Client client = new Client(protocol);
      transport.open();
      // call the remote time() service and print the result
      long time = client.time();
      System.out.println("Time from server:" + time);
      transport.close();
    }
    catch (TTransportException e) { e.printStackTrace(); }
    catch (TException e) { e.printStackTrace(); }
  }
  public static void main(String[] args) {
    TimeClient c = new TimeClient();
    c.start();
  }
}
12. Contact Us
● Feel free to contact us at
  – www.semtech-solutions.co.nz
  – [email protected]
● We offer IT project consultancy
● We are happy to hear about your problems
● You can just pay for those hours that you need to solve your problems