I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Distributed GLM with H2O - Atlanta Meetup (Sri Ambati)
The document outlines a presentation about H2O's distributed generalized linear model (GLM) algorithm. The presentation includes sections on H2O.ai the company, an overview of the H2O software, a 30-minute section explaining H2O's distributed GLM in detail, a 15-minute GLM demo, and a question-and-answer period. The document provides background on H2O.ai and H2O, and outlines the topics covered in the distributed GLM section, including the algorithm, input parameters, outputs, runtime costs, and best practices.
A survey paper on sequence pattern mining with incremental (Alexander Decker)
This document summarizes four algorithms for sequential pattern mining: GSP, ISM, FreeSpan, and PrefixSpan. GSP is an Apriori-based algorithm that takes into account time constraints and taxonomies. ISM extends SPADE to incrementally update the frequent pattern set when new data is added. FreeSpan uses frequent items to recursively project databases and grow subsequences. PrefixSpan also uses projection but claims to not require candidate generation. It recursively projects databases based on short prefix patterns. The document concludes that most previous studies used GSP or PrefixSpan and that future work could focus on improving time efficiency of sequential pattern mining.
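To make the prefix-projection idea concrete, here is a minimal sketch of PrefixSpan-style mining, assuming sequences of single items rather than the itemset elements handled by the full algorithm; the toy database and support threshold are invented for illustration.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, prefix=None):
    """Minimal PrefixSpan-style sketch: grow frequent prefixes by projecting
    the database on each frequent item (single-item elements only)."""
    prefix = prefix or []
    results = []
    # Count each item once per sequence it appears in.
    counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):
            counts[item] += 1
    for item, count in counts.items():
        if count < min_support:
            continue
        new_prefix = prefix + [item]
        results.append((new_prefix, count))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = []
        for seq in sequences:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        results.extend(prefixspan(projected, min_support, new_prefix))
    return results

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c", "b", "c"], ["b", "c"], ["a", "b", "c", "c"]]
    for pattern, support in prefixspan(db, min_support=3):
        print(pattern, support)
```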
Probabilistic Data Structures and Approximate Solutions (Oleksandr Pryymak)
Probabilistic and approximate data structures can provide scalable solutions when exact answers are not required. They trade accuracy for speed and efficiency. Approaches like sampling, hashing, cardinality estimation, and probabilistic databases allow analyzing large datasets while controlling error rates. Example techniques discussed include Bloom filters, locality-sensitive hashing, count-min sketches, HyperLogLog, and feature hashing for machine learning. The talk provided code examples and comparisons of these probabilistic methods.
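As a concrete illustration of one of these structures, below is a minimal Bloom filter sketch in Python; the bit-array size, number of hash functions, and hashing scheme are arbitrary choices for demonstration, not the parameters used in the talk.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions over an m-bit array.
    May return false positives, never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
print("bob" in bf, "mallory" in bf)   # True, almost certainly False
```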
Every year the financial industry loses billions to fraud, while fraudsters keep devising ever more sophisticated schemes.
Financial institutions have to balance fraud protection against a degraded customer experience. Fraudsters bury their patterns in large volumes of data, but traditional technologies are not designed to detect fraud in real time or to see patterns that extend beyond an individual account.
Analyzing relations with graph databases helps uncover these larger complex patterns and speeds up suspicious behavior identification.
Furthermore, graph databases enable fast and effective real-time link queries and passing context to machine learning models.
The earlier a fraud pattern or network is identified, the faster the activity can be blocked, and the smaller the resulting losses and fines.
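To illustrate the kind of link query involved, here is a toy sketch using networkx rather than a production graph database; the account and device identifiers are invented, and a real deployment would run such traversals inside the database itself in real time.

```python
import networkx as nx

# Toy identity/transaction graph: nodes are accounts and shared attributes
# (devices, addresses); an edge means "linked to".
G = nx.Graph()
G.add_edges_from([
    ("acct:A", "device:123"), ("acct:B", "device:123"),
    ("acct:B", "addr:9 Elm St"), ("acct:C", "addr:9 Elm St"),
    ("acct:D", "device:999"),
])

flagged = "acct:A"  # an account already confirmed as fraudulent

# Link query: which other accounts sit within a few hops of the flagged
# account, i.e. share a device or address directly or via another account?
nearby = dict(nx.single_source_shortest_path_length(G, flagged, cutoff=4))
suspects = [n for n, dist in nearby.items()
            if n.startswith("acct:") and n != flagged]
print(suspects)   # expect acct:B and acct:C as candidates for review
```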
Virtual Knowledge Graphs for Federated Log Analysis (Kabul Kurniawan)
This document presents a method for executing federated graph pattern queries on dispersed and heterogeneous raw log data by dynamically constructing virtual knowledge graphs (VKGs). The approach extracts only relevant log messages on demand, integrates log events into a common graph, federates queries across endpoints, and links results to background knowledge. The architecture includes modules for log parsing and query processing, and a prototype implementation demonstrates the approach for security analytics use cases. An evaluation analyzes query execution time against factors such as the number of extracted log lines and the number of queried hosts.
In this paper, we propose the problem of implementing an efficient query processing system for incomplete temporal and geospatial information in RDFi as a challenge to the SSTD community.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
Spark DataFrames provide a more optimized way to work with structured data compared to RDDs. DataFrames allow skipping unnecessary data partitions when querying, such as only reading data partitions that match certain criteria like date ranges. DataFrames also integrate better with storage formats like Parquet, which stores data in a columnar format and allows skipping unrelated columns during queries to improve performance. The code examples demonstrate loading a CSV file into a DataFrame, finding and removing duplicate records, and counting duplicate records by key to identify potential duplicates.
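A rough PySpark sketch of the workflow described above; the file path and column names are hypothetical, not taken from the original slides.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

# Load a CSV into a DataFrame (path and schema are hypothetical).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count duplicate records by key to see how widespread the problem is.
dupe_counts = (df.groupBy("user_id", "event_time")
                 .count()
                 .filter(F.col("count") > 1))
dupe_counts.show()

# Drop duplicates, keeping one row per key.
deduped = df.dropDuplicates(["user_id", "event_time"])

# Partition pruning example: a date filter lets Spark skip partitions
# (and, with Parquet, unrelated columns) instead of scanning everything.
deduped.filter(F.col("event_date") >= "2024-01-01") \
       .write.mode("overwrite").parquet("events_clean.parquet")
```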
Interactive Knowledge Discovery over Web of Data (Mehwish Alam)
This document describes research on classifying and exploring data from the Web of Data. It discusses building a classification structure over RDF data by classifying triples based on RDF Schema and creating views through SPARQL queries. This structure can then be used for data completion and interactive knowledge discovery through data analysis and visualization. Formal concept analysis and pattern structures are introduced as techniques for dealing with complex data types from the Web of Data like graphs and linked data. Range minimum queries are also proposed as a way to compute the lowest common ancestor for structured attribute sets in the pattern structures.
A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the Posix file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago/swift.
The document discusses using probabilistic data structures like HyperLogLog and Bloom filters to estimate the number of unique users or elements in massive streaming data in real time. HyperLogLog works by tracking the maximum number of leading zeros observed in hashed values to estimate the number of unique elements. While it provides only an approximate count, it is very space- and memory-efficient. The document provides an example pipeline for processing ad-viewing data and counting unique users in subgroups using both HyperLogLog and Bloom filters.
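A minimal HyperLogLog-style estimator, for intuition only: it omits the small- and large-range corrections of the full algorithm, and the register count is an arbitrary choice.

```python
import hashlib

def hll_estimate(items, p=12):
    """Minimal HyperLogLog sketch (no small/large-range corrections).
    The first p hash bits choose one of m = 2**p registers; each register
    keeps the maximum number of leading zeros (+1) seen in the rest."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - p)                      # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # standard bias constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Uses a tiny fraction of the memory of an exact set, and typically lands
# within a few percent of the true count.
print(hll_estimate(f"user-{i % 50000}" for i in range(200000)))  # ≈ 50000
```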
Benchmark MinHash+LSH algorithm on Spark (Xiaoqian Liu)
This document summarizes benchmarking the MinHash and Locality Sensitive Hashing (LSH) algorithm for calculating pairwise similarity on Reddit post data in Spark. The MinHash algorithm was used to reduce the dimensionality of the data before applying LSH to further reduce dimensionality and find similar items. Benchmarking showed that MinHash+LSH was significantly faster than a brute force approach, calculating similarities in 7.68 seconds for 100k entries compared to 9.99 billion seconds for brute force. Precision was lower for MinHash+LSH at 0.009 compared to 1 for brute force, but recall was higher at 0.036 compared to vanishingly small for brute force. The techniques were also applied to a real-time streaming
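For intuition, here is a plain-Python sketch of MinHash signatures plus LSH banding; it is not the benchmarked Spark implementation, and the posts, hash counts, and band sizes are illustrative. Spark ML also ships a MinHashLSH estimator with an approxSimilarityJoin method for doing the same at cluster scale.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=64):
    """MinHash: for each seeded hash function, keep the minimum hash over
    the token set; matching signature positions estimate Jaccard similarity."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_candidate_pairs(signatures, bands=16, rows=4):
    """LSH banding: items whose signatures agree on all rows of at least
    one band become candidate pairs, avoiding the all-pairs comparison."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

posts = {
    "p1": "the cat sat on the mat".split(),
    "p2": "the cat sat on a mat".split(),
    "p3": "completely different reddit post about spark".split(),
}
sigs = {pid: minhash_signature(set(toks)) for pid, toks in posts.items()}
print(lsh_candidate_pairs(sigs))   # expect ('p1', 'p2') as a candidate pair
```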
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z... (CIARD Movement)
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
This document discusses GraphQL and Dgraph with Go. It begins by introducing GraphQL and some popular GraphQL implementations in Go, such as graphql-go. It then discusses Dgraph, describing it as a distributed, high-performance graph database written in Go. It provides examples of using the Dgraph Go client to perform CRUD operations, query for single and multiple objects, commit transactions, and more.
Implementing a VO archive for datacubes of galaxies (Jose Enrique Ruiz)
The document describes implementing a VO archive for galaxy datacubes. It details collections of FITS files containing 2D spatial and spectral data on galaxies from two telescopes. A MySQL database stores metadata on the datasets extracted from FITS headers using IPython notebooks. The web interface allows discovering, viewing metadata, and accessing the data through use cases like moment maps and channel maps. The archive aims to provide characterization of emission lines and provenance to better understand the radio interferometric data.
Probabilistic algorithms for fun and pseudorandom profit (Tyler Treat)
There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? This talk explores some of the fundamental primitives used in stream processing and, specifically, how probabilistic methods can help solve the problem.
This document discusses machine learning techniques for recommendations and clustering. It introduces recommendation algorithms that analyze user-item interaction data to find items that tend to be interacted with by the same users (co-occurrence). It also discusses techniques for fast, scalable clustering of large datasets, including using a surrogate method to cluster data quickly before applying a higher-quality algorithm to the resulting centroids. The document emphasizes that simple techniques like logging, counting, and session analysis often work best at large scale, and it provides examples of using recommendations for queries, videos, and music.
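A toy sketch of the logging-and-counting style of recommendation described above; the interaction log is invented, and a real system would do the counting in a distributed job rather than in memory.

```python
from collections import defaultdict
from itertools import combinations

# Toy interaction log: (user, item) pairs, the output of simple logging.
events = [("u1", "videoA"), ("u1", "videoB"), ("u2", "videoA"),
          ("u2", "videoB"), ("u2", "videoC"), ("u3", "videoB"),
          ("u3", "videoC")]

items_by_user = defaultdict(set)
for user, item in events:
    items_by_user[user].add(item)

# Count how often two items are touched by the same user.
cooccur = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, k=2):
    """Recommend the items most frequently co-interacted with `item`."""
    scored = [(other, n) for (a, other), n in cooccur.items() if a == item]
    return sorted(scored, key=lambda x: -x[1])[:k]

print(recommend("videoA"))   # videoB co-occurs twice, videoC once
```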
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServer (Vitor Hirota Makiyama)
This document outlines a master's dissertation proposal to apply text mining techniques to previously submitted SQL queries on the Sloan Digital Sky Survey (SDSS) SkyServer database in order to improve the user experience. The proposal discusses using text mining and information retrieval methods like clustering and locality sensitive hashing to group similar past queries and recommend them to users based on new queries submitted, with the goal of helping users better explore and understand the complex database. An overview of relevant text mining, information retrieval, and machine learning concepts is also provided.
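A small sketch of the general idea, grouping query text with TF-IDF and k-means via scikit-learn; the queries are made up in a SkyServer style, and the dissertation's actual feature extraction and algorithms may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A handful of invented SkyServer-style queries; the real corpus is the
# SkyServer query log studied in the proposal.
queries = [
    "SELECT ra, dec FROM PhotoObj WHERE clean = 1",
    "SELECT ra, dec, u, g, r FROM PhotoObj WHERE r < 19",
    "SELECT z FROM SpecObj WHERE class = 'QSO'",
    "SELECT z, zErr FROM SpecObj WHERE zWarning = 0",
]

# Treat each SQL string as a document; the token pattern keeps identifiers.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+", lowercase=True)
X = vectorizer.fit_transform(queries)

# Group similar queries; cluster labels could drive "queries like yours".
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for q, label in zip(queries, km.labels_):
    print(label, q)
```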
Astronomical Data Processing on the LSST Scale with Apache Spark (Databricks)
The next decade promises to be exciting for both astronomy and computer science, with a number of large-scale astronomical surveys in preparation. One of the most important is the Large Synoptic Survey Telescope, or LSST. LSST will produce the first ‘video’ of the deep sky in history by continually scanning the visible sky and taking one 3.2-gigapixel image every 20 seconds. In this talk we will describe LSST’s unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy Extensions for Spark), a system based on Apache Spark. We will explain its design and what is behind its great cross-matching performance.
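To give a flavor of catalog cross-matching (this is not AXS itself), here is a simplified PySpark sketch that buckets objects into declination zones before joining; it ignores zone boundaries, RA wrap-around, and the cos(dec) correction a real matcher must handle, and the file names and columns are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("crossmatch-sketch").getOrCreate()

# Two hypothetical catalogs with (id, ra, dec) columns in degrees.
cat1 = spark.read.parquet("catalog_a.parquet")
cat2 = spark.read.parquet("catalog_b.parquet")

# Zone the sky into declination strips so the join only compares objects
# in the same strip, not all pairs.
zone_height = 1.0 / 60.0   # one-arcminute zones
z1 = cat1.withColumn("zone", F.floor((F.col("dec") + 90.0) / zone_height))
z2 = cat2.withColumn("zone", F.floor((F.col("dec") + 90.0) / zone_height))

radius = 1.0 / 3600.0      # 1-arcsecond match radius, in degrees
matches = (z1.alias("a")
             .join(z2.alias("b"), F.col("a.zone") == F.col("b.zone"))
             .where((F.abs(F.col("a.dec") - F.col("b.dec")) < radius) &
                    (F.abs(F.col("a.ra") - F.col("b.ra")) < radius)))
matches.select("a.id", "b.id").show()
```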
This document discusses using R for statistical analysis with MongoDB as the database. It introduces MongoDB as a NoSQL database for storing large, complex datasets. It describes the rmongodb package for connecting R to MongoDB, allowing users to query, aggregate, and analyze MongoDB data directly in R without importing entire datasets into memory. Examples show performing queries, aggregations, and accessing results as native R objects. The document promotes R and MongoDB as a solution for big data analytics.
Carl Kesselman and I (along with our colleagues Stephan Erberich, Jonathan Silverstein, and Steve Tuecke) participated in an interesting workshop at the Institute of Medicine on July 14, 2009. Along with Patrick Soon-Shiong, we presented our views on how grid technologies can help address the challenges inherent in healthcare data integration.
Rethinking how we provide science IT in an era of massive data but modest bud... (Ian Foster)
A talk given in January 2012 at a wonderful conference organized in Zakopane, Poland, by colleagues from the erstwhile GridLab project. I talked about how increasing data volumes demand radically new approaches to delivering research computing. Lively discussion ensued.
Computing Outside The Box September 2009 (Ian Foster)
Keynote talk at Parco 2009 in Lyon, France. An updated version of http://www.slideshare.net/ianfoster/computing-outside-the-box-june-2009.
I gave this talk at a conference for young scientists in New Zealand, "Running Hot": www.runninghot.org.nz. It was a great meeting. My slides are mostly images, so they may not make much sense on their own.
Abstract follows: Impressed with the telephone, Arthur Mee predicted in 1898 that if videoconferencing could be developed, ‘earth will be in truth a paradise.’ Since his time, rapid technological change, in particular in telecommunications, has transformed the scientific playing field in ways that while not entirely paradisical, certainly have profound implications for New Zealand scientists. The Internet has abolished distance, as Mee also predicted–a New Zealand scientist can participate as fully in online discussions as anyone else, and their blog can be every bit as influential. Exponential improvements in networks, computing, sensors, and data storage are also profoundly transforming the practice of science in many disciplines. But those seeking to leverage these advances become painfully familiar with the ‘dirty underbelly’ of exponentials: if you don’t constantly innovate, you can fall behind exponentially fast. Such considerations pose big challenges for the individual scientist and for institutions, for researchers and educators, and for research funders. Some of the old ways of researching and educating need to be preserved, others need to be replaced to take advantage of new methods. But what should we preserve? What should we seek to change?
The document provides tips for job interviews, including researching the company beforehand, practicing interview skills, greeting the interviewer with a handshake and making eye contact, listening carefully and asking questions if confused, answering questions directly and positively, bringing references, and preparing for common questions about work history, strengths, weaknesses, and why the employer should hire you. It also describes what to expect at assessment centers, including role plays, group exercises, interviews, and testing. Finally, it discusses the pros and cons of internal versus external recruitment.
Recruiting in a Networked World - Workshop Series (hholmes75)
This document summarizes a presentation on employer branding and recruiting in a networked world. The presentation outlines how to be an employer of choice by turning jobs into opportunities and showing your company's human side on its career page. It discusses how employer branding attracts great employees and helps retain top talent. The presentation provides tips on improving job descriptions, developing an inspiring career page, and using social media to build your employer brand.
This document discusses using cloud services to facilitate materials data sharing and analysis. It proposes a "Discovery Cloud" that would allow researchers to easily store, curate, discover, and analyze materials data without needing local software or hardware. This cloud platform could accelerate discovery by automating workflows and reducing costs through on-demand scalability. It would also make long-term data preservation simpler. The document highlights Globus research data management services as an example of cloud tools that could help address the dual challenges of treating data as both a rare treasure to preserve and a "deluge" to efficiently manage.
The document provides tips for job interviews, including researching the company beforehand, practicing interview skills, greeting the interviewer with a handshake and making eye contact, listening carefully and asking questions if confused. It also discusses common interview questions and positive ways to answer them. The document then describes what happens at assessment centers, including role plays, group exercises, interviews and tests. Finally, it discusses the pros and cons of internal versus external recruitment for filling jobs.
The document discusses how emerging technologies are enabling new approaches to modeling complex systems using large numbers of autonomous agents. It describes efforts to develop agent-based modeling frameworks that can leverage exascale supercomputers to simulate phenomena like microbial ecosystems, cybersecurity, and energy systems at an unprecedented scale. These models incorporate hybrid discrete-continuous methods and very high-resolution data to better understand dynamic social and natural processes.
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
Globus Online provides services to enable easy and reliable data transfer between campus resources and national cyberinfrastructure. It uses Globus Transfer for simple file transfers and Globus Connect to easily integrate campus resources. Globus Connect Multi-User allows administrators to easily deploy GridFTP servers and authentication for multiple users, facilitating campus bridging. Several universities have found success using these Globus services to enable terabyte-scale data sharing across their campuses and with national resources.
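As an illustration of what such a managed transfer looks like programmatically, here is a sketch using the Globus Python SDK; the client ID, endpoint IDs, and paths are placeholders invented for the example.

```python
import globus_sdk

# Hypothetical IDs: a registered native-app client and two endpoints
# (e.g., a campus DTN and a national facility) found in the Globus web app.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC = "SOURCE-ENDPOINT-UUID"
DST = "DESTINATION-ENDPOINT-UUID"

# Log in and obtain a transfer token via the native-app flow.
client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()
print("Visit:", client.oauth2_get_authorize_url())
tokens = client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"]))

# Describe and submit a managed transfer; the service handles retries and
# integrity checking, which is the "easy and reliable" part of the pitch.
tdata = globus_sdk.TransferData(tc, SRC, DST, label="campus-bridging demo")
tdata.add_item("/data/run42/", "/archive/run42/", recursive=True)
result = tc.submit_transfer(tdata)
print("Task ID:", result["task_id"])
```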
Screenshots prepared by Ben Blaiszik and Kyle Chard, used in our Globus publication demo at GlobusWorld 2014. See https://www.globus.org/data-publication for more information and the notes on the slides for details.
1) Quantitative medicine uses large amounts of medical data and advanced analytics to determine the most effective treatment for individual patients based on their specific clinical profile and biomarkers. This approach can help reduce healthcare costs and improve outcomes compared to the traditional one-size-fits-all model.
2) However, realizing the promise of quantitative personalized medicine is challenging due to the huge quantities of diverse medical data located in dispersed systems, lack of computing capabilities, and barriers to data sharing.
3) Grid and service-oriented computing approaches are helping to address these challenges by enabling federated querying, analysis, and sharing of medical data and services across organizations through virtual integration rather than true consolidation.
The "Recruiting in a Networked World" workshops will help you understand and capitalize on this sophisticated new environment. Focusing on hot-button topics such as Employer Branding and Social Media including "Flitterin" (Facebook, Twitter, LinkedIn), our workshops dispel myths, offer insight, and explain why HR needs to talk like PR and think like marketing.
The document discusses the rapidly growing volumes of data being generated across many scientific domains such as biology, astronomy, climate science, and others. It notes that while "big science" projects have been able to develop robust cyberinfrastructure to manage and analyze large datasets, most individual researchers and smaller research groups lack adequate computing resources and software tools to effectively handle the data. The author argues that providing research cyberinfrastructure as a cloud-based service could help address this problem by reducing costs and barriers to entry for researchers. Specific services like Globus Online for data transfer and potential future services for storage, collaboration, and integration with other tools are presented as examples of this approach.
Scientific Applications and Heterogeneous Architectures (inside-BigData.com)
This document discusses extending high-performance computing (HPC) to integrate data analytics and connect to edge computing. It presents two use cases: 1) augmenting molecular dynamics workflows with in situ and in transit analytics to capture protein structural information, and 2) connecting HPC to sensors at the edge for precision farming applications involving soil moisture data prediction. The document outlines approaches for building closed-loop workflows that integrate simulation, data generation, analytics, and data feedback between HPC and edge resources to enable real-time decision making.
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science (University of Washington)
The document summarizes a system called SQLShare that aims to make SQL-based data analysis more accessible to scientists by lowering initial setup costs and providing automated tools. It has been used by 50 unique users at 4 UW campus labs on 16GB of uploaded data from various science domains like environmental science and metagenomics. The system provides data uploading, query sharing, automatic English-to-SQL translation, and personalized query recommendations to lower barriers to working with relational databases for analysis.
Opportunities for X-Ray science in future computing architectures (Ian Foster)
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
HEPData is a repository for data from high energy physics (HEP) experiments dating back to the 1950s. It provides a standardized way for scientists to submit the underlying data from their published papers and analysis results. This includes tables, plots, scripts and files used in the analysis to enable reproducibility. HEPData offers features like simplified submission processes, versioning, DOIs, and tools to access and search data in various environments and formats to help both data providers and consumers.
This document discusses using cloud computing and virtualization for scientific research. Some key points:
- Scientists can access remote sensors, share data and workflows, and store personal data in the cloud. Beginners can click to code, while experts can build complex workflows.
- Services allow publishing, finding, and binding to distributed resources through registries. Data can be queried through standards like Simple Image Access Protocol.
- Distributed registries from various organizations harvest metadata to enable semantic search across sky regions, identifiers, tags, vocabularies, schemas, and service descriptions.
- Tools provide code/presentation environments and access to distributed data in the cloud. Services include astronomical cross-matching and event notification through Sky
This document describes Emanuele Panigati's doctoral dissertation on the SuNDroPS system for managing semantic and dynamic data in pervasive systems. It provides an overview of SuNDroPS and its components for processing streaming and historical data, including Context-ADDICT for querying heterogeneous data sources and PerLa and Tesla for information flow processing. It also describes how SuNDroPS was tested in the motivating Green Move vehicle sharing scenario.
Cities are composed of complex systems with physical, cyber, and social components. Current works on extracting and understanding city events mainly rely on technology enabled infrastructure to observe and record events. In this work, we propose an approach to leverage citizen observations of various city systems and services such as traffic, public transport, water supply, weather, sewage, and public safety as a source of city events. We investigate the feasibility of using such textual streams for extracting city events from annotated text. We formalize the problem of annotating social streams such as microblogs as a sequence labeling problem. We present a novel training data creation process for training sequence labeling models. Our automatic training data creation process utilizes instance level domain knowledge (e.g., locations in a city, possible event terms). We compare this automated annotation process to a state-of-the-art tool that needs manually created training data and show that it has comparable performance in annotation tasks. An aggregation algorithm is then presented for event extraction from annotated text. We carry out a comprehensive evaluation of the event annotation and event extraction on a real-world dataset consisting of event reports and tweets collected over four months from San Francisco Bay Area. The evaluation results are promising and provide insights into the utility of social stream for extracting city events.
Provenance for Data Munging Environments (Paul Groth)
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Propagation of Policies in Rich Data Flows (Enrico Daga)
Enrico Daga† Mathieu d’Aquin† Aldo Gangemi‡ Enrico Motta†
† Knowledge Media Institute, The Open University (UK)
‡ Université Paris 13 (France) and ISTC-CNR (Italy)
The 8th International Conference on Knowledge Capture (K-CAP 2015)
October 10th, 2015 - Palisades, NY (USA)
http://www.k-cap2015.org/
The document discusses several US grid projects including campus and regional grids like Purdue and UCLA that provide tens of thousands of CPUs and petabytes of storage. It describes national grids like TeraGrid and Open Science Grid that provide over a petaflop of computing power through resource sharing agreements. It outlines specific communities and projects using these grids for sciences like high energy physics, astronomy, biosciences, and earthquake modeling through the Southern California Earthquake Center. Software providers and toolkits that enable these grids are also mentioned like Globus, Virtual Data Toolkit, and services like Introduce.
Potter's Wheel is an interactive tool for data transformation, cleaning and analysis. It integrates data auditing, transformation and analysis. The user can specify transformations by example through a spreadsheet interface. It detects discrepancies and flags them for the user. Transformations can be stored as programs to apply to data. It allows interactive exploration of data without waiting through partitioning and aggregation.
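The same audit-while-you-transform idea can be sketched in pandas; this is only an analogy, not Potter's Wheel itself, and the data and rules are invented for illustration.

```python
import pandas as pd

# Hypothetical messy input, analogous to what Potter's Wheel would audit.
df = pd.DataFrame({"name": ["Alice Smith", "Bob", "Carol Jones"],
                   "date": ["2024-01-03", "03/01/2024", "2023-02-29"]})

# A "transform by example": split name into first/last where possible.
parts = df["name"].str.split(" ", n=1, expand=True)
df["first"], df["last"] = parts[0], parts[1]

# Discrepancy detection: flag rows whose date fails to parse under the
# dominant format, instead of silently dropping or coercing them.
parsed = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
df["date_ok"] = parsed.notna()
print(df)
```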
Cyberinfrastructure and Applications Overview: Howard University June22 (marpierc)
1) Cyberinfrastructure refers to the combination of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people that enable knowledge discovery through integrated multi-scale simulations and analyses.
2) Cloud computing, multicore processors, and Web 2.0 tools are changing the landscape of cyberinfrastructure by providing new approaches to distributed computing and data sharing that emphasize usability, collaboration, and accessibility.
3) Scientific applications are increasingly data-intensive, requiring high-performance computing resources to analyze large datasets from sources like gene sequencers, telescopes, sensors, and web crawlers.
The ultimate goal of a recommender system is to suggest interesting and not obvious items (e.g., products to buy, people to connect with, movies to watch, etc.) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms will be discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.
The data streaming processing paradigm and its use in modern fog architectures (Vincenzo Gulisano)
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis and the research of the DCS (Distributed Computing and Systems) groups at Chalmers University of Technology.
This document describes Jean-Paul Calbimonte's doctoral research on enabling semantic integration of streaming data sources. The research aims to provide semantic query interfaces for streaming data, expose streaming data for the semantic web, and integrate streaming sources through ontology mappings. The approach involves ontology-based data access to streams, a semantic streaming query language, and semantic integration of distributed streams. Work done so far includes defining a language (SPARQLSTR) for querying RDF streams and enabling an engine to support streaming data sources through ontology mappings. Future work involves query optimization and quantitative evaluation.
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open... (Thomas Gottron)
The intensive growth of the Linked Open Data (LOD) Cloud has spawned a web of data in which a multitude of data sources provides huge amounts of valuable information across different domains. Nowadays, when accessing and using Linked Data, the challenging question is more and more often not whether relevant data is available, but rather where it can be found, how it is structured, and how to make best use of it.
In this lecture I will start by giving a brief introduction to the concepts underlying LOD. Then I will focus on three aspects of current research:
(1) Managing Linked Data. Index structures play an important role for making use of the information in LOD cloud. I will give an overview of indexing approaches, present algorithms and discuss the ideas behind the index structures.
(2) Analysing Linked Data. I will present methods for analysing various aspects of LOD, ranging from an information-theoretic analysis for measuring structural redundancy, through formal concept analysis for identifying alternative declarative descriptions, to a dynamics analysis for capturing the evolution of Linked Data sources.
(3) Making Use of Linked Data. Finally I will give a brief overview and outlook on where the presented techniques and approaches are of practical relevance in applications.
(Talk at the IRSS summer school 2014 in Athens)
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa... (Data Con LA)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experience with Heron at scale, the challenges of running it, and the approaches taken to solve those challenges.
This document discusses stream reasoning, which involves making sense of gigantic, noisy data streams in real-time to support decision making. It provides background on data streams and stream processing, introduces the concept of stream reasoning, and summarizes achievements in defining continuous query languages and efficient reasoning on streams. Open challenges remain in fully combining streams with background knowledge and distributed, parallel processing.
HEPData is a repository for data from high energy physics (HEP) experiments dating back to the 1950s. It provides physicists with access to the underlying data and tables from published papers. The new HEPData system offers simplified submission processes, standard data formats, versioning, and assigning DOIs to help data providers share their work. It also improves access and search capabilities for data consumers through features like publication-driven and data-driven searching, semantic publishing, data conversion tools, and access through analysis environments like ROOT and Mathematica.
Global Services for Global Science March 2023.pptx (Ian Foster)
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
The Earth System Grid Federation (ESGF) is a distributed network of climate data servers that archives and shares model output data used by scientists worldwide. ESGF has led data archiving for the Coupled Model Intercomparison Project (CMIP) since its inception. The ESGF Holdings have grown significantly from CMIP5 to CMIP6 and are expected to continue growing rapidly. A new ESGF2 project funded by the US Department of Energy aims to modernize ESGF to handle exabyte scale data volumes through a new architecture based on centralized Globus services, improved data discovery tools, and data proximate computing capabilities.
Better Information Faster: Programming the Continuum (Ian Foster)
This document discusses the computing continuum and efforts to enable better information faster through computation. It provides examples of how techniques like executing tasks closer to data sources or on specialized hardware can significantly accelerate applications. Programming models and managed services are explored for specifying and executing workloads across diverse infrastructure. There are still open questions around optimizing networks, algorithms, and applications for the computing continuum.
ESnet6 provides an ultra-fast and reliable network that enables new smart instruments for 21st century science. The network capacity has increased dramatically over time, with 2022 bandwidth being 500,000 times greater than in 1993. This network allows rapid data transfer between facilities, such as replicating 7 petabytes of climate data between three labs. It also enables fast assembly and use of new instruments like high energy diffraction microscopy, which can perform an analysis in 31 seconds. The integrated research infrastructure provided by Globus further supports use of remote resources and smart instruments that will drive scientific discovery.
Linking Scientific Instruments and Computation (Ian Foster)
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
A Global Research Data Platform: How Globus Services Enable Scientific Discovery (Ian Foster)
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven DiscoveryIan Foster
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Argonne National Lab and the University of Chicago Globus project are provided. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Team Argon proposes a commons platform using reusable components to promote continuous FAIRness of data. These components include Globus Connect Server for standardized data access and transfer across storage systems, Globus Auth for authentication and authorization, and BDBags for exchange of query results and cohorts using a common manifest format. Together these aim to provide uniform, secure, and reliable access, transfer, and sharing of data while supporting identification, search, and virtualization of derived data products.
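As a rough illustration of the BDBag component mentioned above: BDBags build on the BagIt packaging format, so the sketch below uses the generic bagit Python library (not the BDBag tooling itself) to package a hypothetical query-result directory with checksums and minimal metadata; the directory name and metadata fields are invented for the example.

import bagit

# Package a (hypothetical) directory of query results as a BagIt bag;
# BDBags add remote-file manifests and identifiers on top of this format.
bag = bagit.make_bag(
    "cohort_query_results",
    {"Contact-Name": "Example Analyst", "External-Description": "Cohort query results"},
    checksums=["sha256"],
)
print("Bag created; valid:", bag.is_valid())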
This document discusses lessons learned for achieving interoperability. It recommends having a clear purpose, starting with basic conventions like identifiers, monitoring commitments to build trust, and focusing on outward-facing interoperability through simple APIs and platforms rather than full software stacks. Observance of industry practices like authentication methods and cloud-based platforms is also advised to promote rapid development and distribution of applications.
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Going Smart and Deep on Materials at ALCFIan Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
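As a toy illustration of the modeling range described above (simple linear models through neural networks), the sketch below fits both to synthetic data with scikit-learn; it does not use the MDF/TDDFT stopping-power data, and the feature and target definitions are invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 10.0, size=(2000, 3))   # invented descriptors (e.g., velocity plus material features)
y = np.log(X[:, 0] + 1.0) / X[:, 0] + 0.05 * X[:, 1] + rng.normal(0.0, 0.01, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in (
    LinearRegression(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test R^2 =", round(model.score(X_test, y_test), 3))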
Adtran’s SDG 9000 Series brings high-performance, cloud-managed Wi-Fi 7 to homes, businesses and public spaces. Built on a unified SmartOS platform, the portfolio includes outdoor access points, ceiling-mount APs and a 10G PoE router. Intellifi and Mosaic One simplify deployment, deliver AI-driven insights and unlock powerful new revenue streams for service providers.
Evaluation Challenges in Using Generative AI for Science & Technical ContentPaul Groth
Foundation Models show impressive results in a wide range of tasks on scientific and legal content, from information extraction to question answering and even literature synthesis. However, standard evaluation approaches (e.g., comparing to ground truth) often don't seem to work: qualitatively the results look great, but quantitative scores do not align with these observations. In this talk, I discuss the challenges we've faced in our lab in evaluation. I then outline potential routes forward.
Maxx nft market place new generation nft marketing placeusersalmanrazdelhi
PREFACE OF MAXXNFT
MaxxNFT: Powering the Future of Digital Ownership
MaxxNFT is a cutting-edge Web3 platform designed to revolutionize how digital assets are owned, traded, and valued. Positioned at the forefront of the NFT movement, MaxxNFT views NFTs not just as collectibles, but as the next generation of internet equity: unique, verifiable digital assets that unlock new possibilities for creators, investors, and everyday users alike.
Through strategic integrations with OKT Chain and OKX Web3, MaxxNFT enables seamless cross-chain NFT trading, improved liquidity, and enhanced user accessibility. These collaborations make it easier than ever to participate in the NFT ecosystem while expanding the platform's global reach.
With a focus on innovation, user rewards, and inclusive financial growth, MaxxNFT offers multiple income streams, from referral bonuses to liquidity incentives, creating a vibrant community-driven economy. Whether you're minting your first NFT or building a digital asset portfolio, MaxxNFT empowers you to participate in the future of decentralized value exchange.
https://maxxnft.xyz/
Droidal: AI Agents Revolutionizing HealthcareDroidal LLC
Droidal’s AI Agents are transforming healthcare by bringing intelligence, speed, and efficiency to key areas such as Revenue Cycle Management (RCM), clinical operations, and patient engagement. Built specifically for the needs of U.S. hospitals and clinics, Droidal's solutions are designed to improve outcomes and reduce administrative burden.
Through simple visuals and clear examples, the presentation explains how AI Agents can support medical coding, streamline claims processing, manage denials, ensure compliance, and enhance communication between providers and patients. By integrating seamlessly with existing systems, these agents act as digital coworkers that deliver faster reimbursements, reduce errors, and enable teams to focus more on patient care.
Droidal's AI technology is more than just automation — it's a shift toward intelligent healthcare operations that are scalable, secure, and cost-effective. The presentation also offers insights into future developments in AI-driven healthcare, including how continuous learning and agent autonomy will redefine daily workflows.
Whether you're a healthcare administrator, a tech leader, or a provider looking for smarter solutions, this presentation offers a compelling overview of how Droidal’s AI Agents can help your organization achieve operational excellence and better patient outcomes.
A free demo trial is available for those interested in experiencing Droidal’s AI Agents firsthand. Our team will walk you through a live demo tailored to your specific workflows, helping you understand the immediate value and long-term impact of adopting AI in your healthcare environment.
To request a free trial or learn more:
https://droidal.com/
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...James Anderson
The Quantum Apocalypse: A Looming Threat & The Need for Post-Quantum Encryption
We explore the imminent risks posed by quantum computing to modern encryption standards and the urgent need for post-quantum cryptography (PQC).
Bio: With 30 years in cybersecurity, including as a CISO, Tommy is a strategic leader driving security transformation, risk management, and program maturity. He has led high-performing teams, shaped industry policies, and advised organizations on complex cyber, compliance, and data protection challenges.
Introducing the OSA 3200 SP and OSA 3250 ePRCAdtran
Adtran's latest Oscilloquartz solutions make optical pumping cesium timing more accessible than ever. Discover how the new OSA 3200 SP and OSA 3250 ePRC deliver superior stability, simplified deployment and lower total cost of ownership. Built on a shared platform and engineered for scalable, future-ready networks, these models are ideal for telecom, defense, metrology and more.
Co-Constructing Explanations for AI Systems using ProvenancePaul Groth
Explanation is not a one-off - it's a process where people and systems work together to gain understanding. This idea of co-constructing explanations, or explanation by exploration, is a powerful way to frame the problem of explanation. In this talk, I discuss our first experiments with this approach for explaining complex AI systems by using provenance. Importantly, I discuss the difficulty of evaluation and describe some of our first approaches to evaluating these systems at scale. Finally, I touch on the importance of explanation to the comprehensive evaluation of AI systems.
Securiport is a border security systems provider with a progressive team approach to its task. The company acknowledges the importance of specialized skills in creating the latest in innovative security tech. The company has offices throughout the world to serve clients, and its employees speak more than twenty languages at the Washington D.C. headquarters alone.
New Ways to Reduce Database Costs with ScyllaDBScyllaDB
How ScyllaDB’s latest capabilities can reduce your infrastructure costs
ScyllaDB has been obsessed with price-performance from day 1. Our core database is architected with low-level engineering optimizations that squeeze every ounce of power from the underlying infrastructure. And we just completed a multi-year effort to introduce a set of new capabilities for additional savings.
Join this webinar to learn about these new capabilities: the underlying challenges we wanted to address, the workloads that will benefit most from each, and how to get started. We’ll cover ways to:
- Avoid overprovisioning with “just-in-time” scaling
- Safely operate at up to ~90% storage utilization
- Cut network costs with new compression strategies and file-based streaming
We’ll also highlight a “hidden gem” capability that lets you safely balance multiple workloads in a single cluster. To conclude, we will share the efficiency-focused capabilities on our short-term and long-term roadmaps.
Exploring the advantages of on-premises Dell PowerEdge servers with AMD EPYC processors vs. the cloud for small to medium businesses’ AI workloads
AI initiatives can bring tremendous value to your business, but you need to support your new AI workloads effectively. That means choosing the best possible infrastructure for your needs—and many companies are finding that the cloud isn’t right for them. According to a recent Rackspace survey of IT executives, 69 percent of companies have moved some of their applications on-premises from the cloud, with half of those citing security and compliance as the reason and 44 percent citing cost.
On-premises solutions provide a number of advantages. With full control over your security infrastructure, you can be certain that all compliance requirements remain firmly in the hands of your IT team. Opting for on-premises also gives you the ability to design your infrastructure to the precise needs of that team and your new AI workloads. Depending on the workload, you may also see performance benefits, along with more predictable costs. As you start to build your next AI initiative, consider an on-premises solution utilizing AMD EPYC processor-powered Dell PowerEdge servers.
Jira Administration Training – Day 1 : IntroductionRavi Teja
This presentation covers the basics of Jira for beginners. Learn how Jira works, its key features, project types, issue types, and user roles. Perfect for anyone new to Jira or preparing for Jira Admin roles.
Measuring Microsoft 365 Copilot and Gen AI SuccessNikki Chapple
Session | Measuring Microsoft 365 Copilot and Gen AI Success with Viva Insights and Purview
Presenter | Nikki Chapple 2 x MVP and Principal Cloud Architect at CloudWay
Event | European Collaboration Conference 2025
Format | In person Germany
Date | 28 May 2025
📊 Measuring Copilot and Gen AI Success with Viva Insights and Purview
Presented by Nikki Chapple – Microsoft 365 MVP & Principal Cloud Architect, CloudWay
How do you measure the success—and manage the risks—of Microsoft 365 Copilot and Generative AI (Gen AI)? In this ECS 2025 session, Microsoft MVP and Principal Cloud Architect Nikki Chapple explores how to go beyond basic usage metrics to gain full-spectrum visibility into AI adoption, business impact, user sentiment, and data security.
🎯 Key Topics Covered:
Microsoft 365 Copilot usage and adoption metrics
Viva Insights Copilot Analytics and Dashboard
Microsoft Purview Data Security Posture Management (DSPM) for AI
Measuring AI readiness, impact, and sentiment
Identifying and mitigating risks from third-party Gen AI tools
Shadow IT, oversharing, and compliance risks
Microsoft 365 Admin Center reports and Copilot Readiness
Power BI-based Copilot Business Impact Report (Preview)
📊 Why AI Measurement Matters: Without meaningful measurement, organizations risk operating in the dark—unable to prove ROI, identify friction points, or detect compliance violations. Nikki presents a unified framework combining quantitative metrics, qualitative insights, and risk monitoring to help organizations:
Prove ROI on AI investments
Drive responsible adoption
Protect sensitive data
Ensure compliance and governance
🔍 Tools and Reports Highlighted:
Microsoft 365 Admin Center: Copilot Overview, Usage, Readiness, Agents, Chat, and Adoption Score
Viva Insights Copilot Dashboard: Readiness, Adoption, Impact, Sentiment
Copilot Business Impact Report: Power BI integration for business outcome mapping
Microsoft Purview DSPM for AI: Discover and govern Copilot and third-party Gen AI usage
🔐 Security and Compliance Insights: Learn how to detect unsanctioned Gen AI tools like ChatGPT, Gemini, and Claude, track oversharing, and apply eDLP and Insider Risk Management (IRM) policies. Understand how to use Microsoft Purview—even without E5 Compliance—to monitor Copilot usage and protect sensitive data.
📈 Who Should Watch: This session is ideal for IT leaders, security professionals, compliance officers, and Microsoft 365 admins looking to:
Maximize the value of Microsoft Copilot
Build a secure, measurable AI strategy
Align AI usage with business goals and compliance requirements
🔗 Read the blog https://nikkichapple.com/measuring-copilot-gen-ai/
AI Emotional Actors: “When Machines Learn to Feel and Perform”AkashKumar809858
Welcome to the era of AI Emotional Actors.
The entertainment landscape is undergoing a seismic transformation. What started as motion capture and CGI enhancements has evolved into a full-blown revolution: synthetic beings that not only perform but also express, emote, and adapt in real time.
For reading further follow this link -
https://akash97.gumroad.com/l/meioex
Agentic AI Explained: The Next Frontier of Autonomous Intelligence & Generati...Aaryan Kansari
Agentic AI Explained: The Next Frontier of Autonomous Intelligence & Generative AI
Discover Agentic AI, the revolutionary step beyond reactive generative AI. Learn how these autonomous systems can reason, plan, execute, and adapt to achieve human-defined goals, acting as digital co-workers. Explore its promise, key frameworks like LangChain and AutoGen, and the challenges in designing reliable and safe AI agents for future workflows.
Sticky Note Bullets:
Definition: Next stage beyond ChatGPT-like systems, offering true autonomy.
Core Function: Can "reason, plan, execute and adapt" independently.
Distinction: Proactive (sets own actions for goals) vs. Reactive (responds to prompts).
Promise: Acts as "digital co-workers," handling grunt work like research, drafting, bug fixing.
Industry Outlook: Seen as a game-changer; Deloitte predicts 50% of companies using GenAI will have agentic AI pilots by 2027.
Key Frameworks: LangChain, Microsoft's AutoGen, LangGraph, CrewAI.
Development Focus: Learning to think in workflows and goals, not just model outputs.
Challenges: Ensuring reliability, safety; agents can still hallucinate or go astray.
Best Practices: Start small, iterate, add memory, keep humans in the loop for final decisions.
Use Cases: Limited only by imagination (e.g., drafting business plans, complex simulations).
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Peter Bittner
How do you onboard new colleagues in 2025? How long does it take? Would you love a standardized setup under version control that everyone can customize for themselves? A stable desktop setup, reinstalled in just minutes. It can be done.
This talk was given in Italian, 29 May 2025, at PyCon 25, Bologna, Italy. All slides are provided in English.
Original slides at https://slides.com/bittner/pycon25-nixos-for-python-developers
European Accessibility Act & Integrated Accessibility TestingJulia Undeutsch
Emma Dawson will guide you through two important topics in this session.
Firstly, she will prepare you for the European Accessibility Act (EAA), which comes into effect on 28 June 2025, and show you how development teams can prepare for it.
In the second part of the webinar, Emma Dawson will explore with you various integrated testing methods and tools that will help you improve accessibility during the development cycle, such as Linters, Storybook, Playwright, just to name a few.
Focus: European Accessibility Act, Integrated Testing tools and methods (e.g. Linters, Storybook, Playwright)
Target audience: Everyone, Developers, Testers
1. Towards an Open Analytics Environment. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago.
2. The Computation Institute: a joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods. Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three). www.ci.uchicago.edu. Faculty, fellows, staff, students, computers, projects.
3. The Good Old Days: Astronomy ~1600 (timeline showing stages of 30 years, ? years, 10 years, 6 years, and 2 years).
4. Astronomy, from 1600 to 2000 (1600 value → 2000 value): Automation: 10^-1 → 10^8 Hz data capture; Community: 10^0 → 10^4 astronomers (10^6 amateur); Computation: 10^-1 → 10^15 Hz peak; Data: 10^6 → 10^15 B aggregate; Literature: 10^1 → 10^5 pages/year.
6. Biomedical Research ~2000 (diagram, from John Wooley): DNA (...atcgaattccaggcgtcacattctcaattcca..., sequences, alignments); proteins (MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT..., sequence, 2º/3º/4º structure); protein-protein interactions (metabolism, pathways, receptor-ligand); polymorphism and variants (genetic variants, individual patients, epidemiology); expression patterns and large-scale screens (>10^6 ESTs); genetics and maps (linkage, cytogenetic, clone-based); plus physiology, cellular biology, biochemistry, neurobiology, endocrinology, etc. Item counts shown range from >10^5 to >10^9.
7. Growth of Sequences and Annotations since 1982. Folker Meyer, Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.
8. The Analyst in Denial: “I just need a bigger disk (and workstation).”
9. An Open Analytics Environment: data, programs, and rules go in; results come out. “No limits” on storage, computing, format, or program; allowing for versioning, provenance, collaboration, and annotation.
10. o·pen [oh-puhn] adjective: having the interior immediately accessible; relatively free of obstructions to sight, movement, or internal arrangement; generous, liberal, or bounteous; in operation, live; readily admitting new members; not constipated.
12. What Goes In (2): rules and workflows, expressed in Dryad, MapReduce, parallel programs, SQL, BPEL, Swift, SCFL, R, MatLab, Octave.
13. How it Cooks: Virtualization (run any program, store any data); Indexing (automated maintenance); Provisioning (policy-driven allocation of resources to competing demands).
17. Towards an Open Analysis Environment: (1) Applications: astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics.
18. Towards an Open Analysis Environment: (2) Hardware: SiCortex (6K cores, 6 Top/s); IBM BG/P (160K cores, 500 Top/s); PADS (10-40 Gbit/s).
19. PADS: Petascale Active Data Store. 500 TB reliable storage (data & metadata); 180 TB at 180 GB/s; 17 Top/s analysis; 1000 TB tape backup. Supports data ingest, dynamic provisioning, parallel analysis, remote access, and offload to remote data centers, serving diverse users and diverse data sources.
20. Towards an Open Analysis Environment: (3) Methods: HPC systems software (MPICH, PVFS, etc.); collaborative data tagging (GLOSS); data integration (XDTM); HPC data analytics and visualization; loosely coupled parallelism (Swift, Hadoop); dynamic provisioning (Falkon); service authoring (Introduce, caGrid, gRAVI); provenance recording and query (Swift); service composition and workflow (Taverna); virtualization management; distributed data management (GridFTP, etc.).
21. Tagging & Social Networking. GLOSS: Generalized Labels Over Scientific data Sources.
22. XDTM: XML Data Typing & Mapping (mapping a logical structure onto a physical directory layout). Example physical layout:
./group23:
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC
./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa
./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL
./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img
./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img
23. fMRI Type Definitions:
type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
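Purely as an illustration (not part of the original slides), the sketch below mirrors these SwiftScript types as Python dataclasses and shows how the physical layout from the previous slide could be mapped onto the logical Subject/Run/Volume structure; the pairing logic is a simplification, not the XDTM implementation.

from dataclasses import dataclass, field
from pathlib import Path
from typing import List

@dataclass
class Volume:
    img: Path   # .img data file
    hdr: Path   # .hdr header file

@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)

@dataclass
class Subject:
    anat: Volume
    run: List[Run] = field(default_factory=list)

def load_subject(subject_dir: Path) -> Subject:
    # Pair each *.hdr with its *.img to build Volumes, roughly as an XDTM mapper would.
    def volumes(d: Path) -> List[Volume]:
        return [Volume(img=h.with_suffix(".img"), hdr=h) for h in sorted(d.glob("*.hdr"))]
    anat = volumes(subject_dir / "ANATOMY")[0]
    functional = Run(v=volumes(subject_dir / "FUNCTIONAL"))
    return Subject(anat=anat, run=[functional])

# e.g. load_subject(Path("group23/AA/04nov06aa"))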
27. Multi-level Scheduling (architecture diagram): a SwiftScript specification is compiled into an abstract computation recorded in a virtual data catalog; the execution engine (Karajan with the Swift runtime) schedules application tasks (e.g., App F1 and App F2 operating on file1, file2, file3), records provenance via a provenance collector, and reports status through Swift runtime callouts; the Falkon resource provisioner acquires virtual nodes and worker nodes (e.g., on Amazon EC2), on which launchers run the applications.
28. DOCK on SiCortex: CPU cores: 5760; power: 15,000 W; tasks: 92,160; elapsed time: 12,821 sec; compute time: 1.94 CPU years (does not include ~800 sec to stage input data). Ioan Raicu, Zhao Zhang.
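A quick arithmetic check of these figures (using only the numbers quoted on the slide) gives roughly 2.3 CPU-years of available capacity, about 83% utilization, and a mean task time of roughly 660 seconds:

cores, tasks, elapsed_s = 5760, 92160, 12821
cpu_year_s = 365.25 * 24 * 3600                 # seconds in one CPU-year
capacity = cores * elapsed_s / cpu_year_s       # ~2.34 CPU-years available during the run
compute = 1.94                                  # CPU-years of compute, as quoted on the slide
print(f"capacity ~{capacity:.2f} CPU-years, utilization ~{compute / capacity:.0%}")
print(f"mean time per task ~{compute * cpu_year_s / tasks:.0f} s")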
29. LIGO Gravitational Wave Observatory (sites shown include Birmingham, Cardiff, AEI/Golm): >1 terabyte/day to 8 sites; 770 TB replicated to date (>120 million replicas); MTBF = 1 month. Ann Chervenak et al., ISI; Scott Koranda et al., LIGO.
30. Lag Plot for Data Transfers to Caltech Credit: Kevin Flasch, LIGO
32. Social Informatics Data Grid (SIDgrid): collaborative, multi-modal analysis of cognitive science data. Diverse experimental data & metadata flow into SIDgrid, backed by TeraGrid and PADS; users can browse data, search, preview content, transcode, download, and analyze.
35. A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH): dynamics, foresight, uncertainty, resolution, …; agriculture, transport, taxation, …; data (global, local, …); (super)computers; the CIM-EARTH framework; community process; open code and data.
36. Alleviating Poverty in Thailand: Modeling Entrepreneurship. Compare a model that considers only wealth and access to capital with one that also considers distance to 6 major cities (maps show model match from high to low). Rob Townsend, Victor Zhorin, et al.
41. An Open Analytics Environment: data, programs, and rules go in; results come out. “No limits” on storage, computing, format, or program; allowing for versioning, provenance, collaboration, and annotation.