Showing 20 open source projects for "spark"

  • 1
    Spark NLP

    State of the Art Natural Language Processing

    Experience the power of large language models like never before, unleashing the full potential of Natural Language Processing (NLP) with Spark NLP, the open source library that delivers scalable LLMs. The full code base is open under the Apache 2.0 license, including pre-trained models and pipelines. The only NLP library built natively on Apache Spark. The most widely used NLP library in the enterprise. Spark ML provides a set of machine learning applications that can be built using two main...
    Downloads: 1 This Week
  • 2
    Apache Spark

    A unified analytics engine for large-scale data processing

    Apache Spark is a unified engine for large-scale data processing, offering APIs for batch jobs, streaming, machine learning, and graph computation. It builds on resilient distributed datasets (RDDs) and the newer DataFrame/Dataset abstractions to provide fault-tolerant, in-memory computation across clusters. Spark’s execution engine handles scheduling, shuffles, caching, and data locality so users can focus on transformations rather than infrastructure plumbing. With Spark Streaming...
    Downloads: 1 This Week
  • 3
    SageMaker Spark

    A Spark library for Amazon SageMaker

    SageMaker Spark is an open-source Spark library for Amazon SageMaker. With SageMaker Spark you construct Spark ML Pipelines using Amazon SageMaker stages. These pipelines interleave native Spark ML stages and stages that interact with SageMaker training and model hosting. With SageMaker Spark, you can train on Amazon SageMaker from Spark DataFrames using Amazon-provided ML algorithms like K-Means clustering or XGBoost, and make predictions on DataFrames against SageMaker endpoints hosting your...
    Downloads: 1 This Week
  • 4
    Cassandra Spark Connector

    Apache Spark to Apache Cassandra connector

    The Apache Cassandra Spark Connector allows Spark jobs (RDDs or DataFrames/Datasets) to read from and write to Cassandra tables. Compatible with Apache Cassandra (v2.1+), Spark 1.0–3.5, and Scala 2.11–2.13, it supports mapping Cassandra rows to Scala case classes, saving results back to Cassandra, and executing arbitrary CQL within Spark applications.
    Downloads: 0 This Week
  • 5
    Deequ

    Deequ is a library built on top of Apache Spark

    Deequ is a library built atop Apache Spark that enables defining “unit tests for data” — that is, formal constraints or checks on datasets to ensure data quality along dimensions such as completeness, uniqueness, value ranges, correlations, etc. It can scale to large datasets (billions of rows) by translating those data checks into Spark jobs. Deequ supports advanced features like a metrics repository for storing computed statistics over time, anomaly detection of data quality metrics...
    Downloads: 6 This Week
  • 6
    Apache Kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway

    Apache Kyuubi™ is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark at the client side. At the server-side, Kyuubi server and engines' multi-tenant architecture provides the administrators...
    Downloads: 3 This Week
  • 7
    Synapse Machine Learning

    Simple and distributed Machine Learning

    SynapseML (previously MMLSpark) is an open source library to simplify the creation of scalable machine learning pipelines. SynapseML builds on Apache Spark and SparkML to enable new kinds of machine learning, analytics, and model deployment workflows. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Open Neural Network Exchange (ONNX), LightGBM, The Cognitive Services, Vowpal Wabbit...
    Downloads: 2 This Week
  • 8
    Scio

    A Scala API for Apache Beam and Google Cloud Dataflow

    Scio is a Scala API developed by Spotify that builds on Apache Beam to enable expressive batch and streaming data pipelines, optimized for running on Google Cloud Dataflow. Inspired by Spark and Scalding, it provides scalable, type‑safe, and production-grade data processing, with built-in support for BigQuery, Pub/Sub, Cassandra, Elasticsearch, Redis, TensorFlow IO, and more.
    Downloads: 2 This Week
  • 9
    almond

    A Scala kernel for Jupyter

    ..., and vice versa. Almond exposes APIs to interact with Jupyter front-ends. Call them from notebooks… or from your own libraries. Several plotting libraries are already available to plot things from notebooks, such as plotly-scala or Vegas. Load the Spark version of your choice, create a Spark session, and start using it from your notebooks.
    Downloads: 0 This Week
  • 10
    Feathr

    A scalable, unified data and AI engineering platform for enterprise

    Feathr is a data and AI engineering platform that has been widely used in production at LinkedIn for many years and was open sourced in 2022. It is currently a project under the LF AI & Data Foundation. Define data and feature transformations based on raw data sources (batch and streaming) using Pythonic APIs. Register transformations by name and get transformed data (features) for various use cases including AI modeling, compliance, go-to-market and more. Share transformations and data (features)...
    Downloads: 0 This Week
  • 11
    SnappyData

    Memory optimized analytics database, based on Apache Spark

    SnappyData (aka TIBCO ComputeDB) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for a unified analytics workload. By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources and stream processing all in one unified cluster. One common use case for SnappyData is to provide analytics at interactive speeds over...
    Downloads: 0 This Week
  • 12
    CoolplaySpark

    Spark Cool Play: Spark source code analysis, Spark class library, etc.

    CoolplaySpark is a learning and practice repository designed to help users understand and work with Apache Spark. It serves as a companion resource for the book 深入理解Spark核心思想与源码分析 (In-Depth Understanding of Spark’s Core Concepts and Source Code Analysis). The project contains annotated examples, explanations, and exercises that guide learners through Spark’s architecture, execution model, and source code internals. It is particularly valuable for developers who want to strengthen...
    Downloads: 2 This Week
  • 13
    Spark JobServer

    REST job server for Apache Spark

    Spark Job Server offers a RESTful interface for submitting, managing, and running jobs or contexts on Apache Spark. Rather than requiring every application to embed Spark or manage Spark contexts manually, this server abstracts a long-lived service where clients can upload JARs, start and stop contexts, submit jobs synchronously or asynchronously, and manage named objects (RDDs / DataFrames) across job executions. It supports multiple modes (transient jobs, persistent contexts for reuse...
    Downloads: 0 This Week
  • 14
    osm4scala

    Reading OpenStreetMap Pbf files.

    Scala and polyglot Spark library (Scala, PySpark, SparkSQL, ... ) focused on reading OpenStreetMap Pbf files.
    Downloads: 0 This Week
  • 15
    SZT-bigdata

    SZT‑bigdata is an open source project

    SZT‑bigdata is an open-source project analyzing real Shenzhen metro (subway) card usage data using big‑data frameworks like Spark, Hadoop, Hive, Kafka, Flink, ClickHouse, HBase, and Elasticsearch. It aims to explore transit passenger flow patterns and system optimization using a variety of Scala-based technologies.
    Downloads: 0 This Week
  • 16
    TransmogrifAI

    TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library

    TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.
    Downloads: 4 This Week
  • 17
    apache spark data pipeline osDQ

    An osDQ sub-project dedicated to creating Apache Spark based data pipelines using JSON

    This is an offshoot of the open source data quality (osDQ) project: https://sourceforge.net/projects/dataquality/ The sub-project creates Apache Spark based data pipelines in which a JSON metadata file drives data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode. Get a JSON example at https://github.com/arrahtech/osdq-spark How to run: Unzip...
    Downloads: 0 This Week
  • 18
    Cosmos DB Spark

    Apache Spark Connector for Azure Cosmos DB

    Azure Cosmos DB Spark is the official connector for Azure Cosmos DB and Apache Spark. The connector allows you to easily read from and write to Azure Cosmos DB via Apache Spark DataFrames in Python and Scala. It also allows you to easily create a lambda architecture for batch processing, stream processing, and a serving layer while being globally replicated and minimizing the latency involved in working with big data.
    Downloads: 0 This Week
  • 19
    PredictionIO

    Machine learning server for building predictive applications

    Apache PredictionIO is an open-source machine learning server designed to simplify the process of building and deploying predictive engines. It offers a scalable infrastructure with support for multiple ML algorithms, event data collection, and deployment workflows. Developers can use templates or build custom engines, making it a flexible solution for integrating machine learning into applications.
    Downloads: 0 This Week
  • 20
    Apache PredictionIO

    Machine learning server for developers and ML engineers

    ... for comprehensive predictive analytics; speed up machine learning modeling with systematic processes and pre-built evaluation measures; support machine learning and data processing libraries such as Spark MLLib and OpenNLP; implement your own machine learning models and seamlessly incorporate them into your engine; simplify data infrastructure management.
    Downloads: 0 This Week