Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France
genoveva.vargas@imag.fr
Moving forward data centric sciences
weaving AI, Big Data & HPC
AICSSA, Jordan, October-November, 2018
http://www.vargas-solar.com
DATA REVOLUTION
“Data is everything and everything is data”, Pythian
Turning reality phenomena into data thanks to the Big Data trend
Rendering into data aspects of the world that have never been quantified
Any individual can analyse huge amounts of data in short periods of time
- Analytical knowledge: most of the crucial algorithms are accessible
- Use rich data to make evidence-based decisions open to virtually any person or company
DATIFICATION
DIGITAL HUMANITIES
… UNFINISHED FUGUE
Fuga a 3 Soggetti (Contrapunctus XIV):
- 4-voice triple fugue
- the third subject of which is based on the B-A-C-H motif
« At the point where the composer introduces the name BACH in the
countersubject to this fugue, the composer died. »
What makes Bach sound like Bach?
http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
The Art of Fugue is based on a single subject employed in some variation in each canon and fugue
• Identify the notes performed at specific times in a recording
• Classify the instruments that perform in a recording
• Classify the composer of a recording
• Identify precise onset times of the notes in a recording
• Predict the next note in a recording, conditioned on history
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
DATA SCIENCE
The representation of complex environments by rich data opens up the possibility of applying all the scientific
knowledge regarding how to infer knowledge from data
Definition:
- Methodology by which actionable insights can be inferred from data
- Complex, multifaceted field that can be approached from several points of view: ethics, methodology,
business models, how to deal with big data, data engineering, data governance, etc.
Objective:
- Production of beliefs informed by data and to be used as the basis of decision making
- N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
Computational Science
Digital humanities
Social Data Science
Network Science
DATA CENTRIC SCIENCES
Data collections as the backbone for conducting experiments, driving hypotheses and leading to “valid” conclusions, models, simulations and understanding
Develop methodologies weaving data management, greedy algorithms and programming models that must be tuned to be deployed on different target computer architectures
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Experimental Sciences
Computation
(Algorithm: mathematical model)
Experiment setting
(Architecture: computing environment)
DATA
Consumed data:
• different sizes
• quality, uncertainty, ambiguity degree
• evolution in structure, completeness, production conditions, conditions
in which data is retrieved
• content, explicit cultural, contextual, background properties
• access policies modification
Conditions of consumption:
• reproducibility, transparency degree (avoid “software artefacts”)
NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH
RAW DATA
• Heterogeneous (variety)
• Huge (volume)
• Incomplete, imprecise, missing, contradictory (veracity)
• Continuous releases produced at different rates (velocity)
• Proprietary, critical, private (value)
DIGITAL DATA COLLECTIONS
EXPLORING DATA COLLECTIONS
https://web.facebook.com/data/
✓ Helping to select the right tool for preprocessing or analysis
✓ Making use of humans’ abilities to recognize patterns
Not always sure what we are looking for (until we find it)
Query expression [guidance | automatic generation] [3, 2]
• Multi-scale query processing for gradual exploration
• Query morphing to adjust for proximity results
• Queries as answers: query alternatives to cope with lack of providence
Results filtering, analysis, visualization [2]
• Result-set post-processing for conveying meaningful data
Data exploration systems & environments [1]
• Data system kernels tailored for data exploration: no preparation, easy to use, fast; database cracking
• Auto-tuning database kernels: incremental, adaptive, partial indexing
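Database cracking, named in the list above, can be illustrated with a minimal sketch. It assumes a single in-memory column of integers; the class `CrackedColumn` and its methods are hypothetical names, and this simplified version does not reproduce the algorithm of any particular system. Each range query partitions ("cracks") only the piece of the column it touches, so the column becomes incrementally more ordered as a side effect of exploration, with no up-front index build.

```python
import bisect


class CrackedColumn:
    """Toy adaptive index: every range query reorganises the column a little."""

    def __init__(self, values):
        self.values = list(values)  # private copy, reordered in place by queries
        self.pivots = []            # sorted pivot values already cracked on
        self.bounds = []            # bounds[i]: first position holding a value >= pivots[i]

    def _crack(self, pivot):
        """Return the first position p such that every value in values[p:] is >= pivot."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.bounds[i]   # this pivot was cracked by an earlier query
        # Only the piece between the two neighbouring pivots needs partitioning.
        lo = self.bounds[i - 1] if i > 0 else 0
        hi = self.bounds[i] if i < len(self.bounds) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.bounds.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in no particular order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]
```

For instance, after `CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]).range_query(5, 12)` the column is split into three pieces (below 5, from 5 up to 12, and 12 or more), and a later query inside one piece only touches that piece.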
1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
3. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
KEY MOTIVATIONS
EXPLORING DATA COLLECTIONS
QUANTITATIVE ANALYSIS OF DATA
Concepts:
- Population: collection of objects, items (“units”)
- Sample: a part of the observed population
Descriptive statistics: simplify data by presenting quantitative descriptions
- Measures and concepts to describe the quantitative features
- Provide summaries about the samples as an approximation of the population
- Frequency of the notes performed at specific intervals in a recording
- Identify precise onset times of the notes in a recording
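As a rough illustration of the two music examples above, the sketch below summarises a handful of note labels using only Python's standard library. The sample records and their `(onset_seconds, pitch)` layout are invented for illustration and are not taken from any dataset; the function names are likewise hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample: (onset time in seconds, pitch name) for a short excerpt.
notes = [
    (0.00, "D"), (0.50, "A"), (1.00, "F"), (1.50, "D"),
    (2.00, "C#"), (2.50, "D"), (3.00, "E"), (3.50, "F"),
]


def pitch_frequencies(notes, start, end):
    """Count how often each pitch starts within the interval [start, end)."""
    return Counter(pitch for onset, pitch in notes if start <= onset < end)


def inter_onset_intervals(notes):
    """Time gaps between consecutive note onsets, in seconds."""
    onsets = sorted(onset for onset, _ in notes)
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


freq = pitch_frequencies(notes, 0.0, 2.0)  # the first two seconds only
gaps = inter_onset_intervals(notes)
summary = {"mean_gap": mean(gaps), "median_gap": median(gaps)}
```

The counts and gap statistics describe only this sample; treating them as properties of the whole recording or repertoire is the step that inferential statistics has to justify.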
LOOKING BEYOND DATA
Inferential statistics: infer characteristics of the population
- draws conclusions beyond the analysed data
- reaches conclusions about the hypotheses made
- Classify the instruments that perform in a recording
- Predict the next note in a recording, conditioned on history
- Inferring a musical score from a recording
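"Predict the next note, conditioned on history" can be sketched with a deliberately simple stand-in: a first-order Markov chain that conditions only on the last note. This is an assumption made for brevity, far simpler than the learned models such a task would use in practice; the melody and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count, for every note, which notes follow it in the training sequence."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions


def predict_next(transitions, history):
    """Return the most frequent successor of the last note in `history`, or None."""
    if not history or history[-1] not in transitions:
        return None  # nothing was observed after this note
    return transitions[history[-1]].most_common(1)[0][0]


# Hypothetical training melody (pitch names only).
melody = ["D", "A", "F", "D", "C#", "D", "E", "F", "G", "F", "E", "D"]
model = fit_transitions(melody)
prediction = predict_next(model, ["E", "F", "G"])  # "G" was only ever followed by "F"
```

A longer history (second-order or higher) would use tuples of notes as keys in the same table, at the cost of needing much more data per context.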
DATA CURATION
Preserving · Describing · Extracting meta-data · Exploring · Harvesting
ETL
Parallel Data Processing Platforms
Spark (RDD – Tables/Graphs)
Hadoop ecosystem tools (e.g., Pig)
Parallel Data Processing Platforms
NoSQL & NewSQL (Parallel)
Parallel Data Querying & Analytics
Structured Data provision
Parallel data collection (Flink, Stream, Flume)
Spark (descriptive statistics functions)
Hadoop ecosystem tools (e.g., Hive)
Parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS)
Parallel analytics (Matlab, R)
Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas-Solar, Parisa Ghodous. CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data (submitted)
ARTIFICIAL INTELLIGENCE
BEYOND KNOWLEDGE
https://ai100.stanford.edu/2016-report
LOOKING BEYOND KNOWLEDGE
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
SETTING UP DATA CENTRIC EXPERIMENTS
https://web.facebook.com/data/
https://azure.microsoft.com/
• Data collections with characteristics that make them difficult to process on a single machine or with traditional databases
• A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
DATA COLLECTIONS ALIAS BIG DATA
DATA SCIENCE ECOSYSTEM &
INTEGRATED DEVELOPMENT ENVIRONMENT
The integrated development environment (IDE) is an essential tool designed to
maximize programmer productivity.
- An IDE has three basic pieces: the editor, the compiler (or interpreter) and the debugger.
- Examples: PyCharm, WingIDE, Spyder (Scientific Python Development EnviRonment)
Programming language:
- Python is one of the most flexible programming languages because it is a multiparadigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
DATA BEYOND THE COMFORT ZONE
Curated
Increased versatility & complexity
Increased scalability & speed
Data collections rawness degree
Key-Value stores · Document stores · NewSQL · Relational databases · Graph databases · Extensible record stores
Querying · Look up (R/W) · Analytics · Aggregation · Processing · Navigation
ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE
Descriptive Statistics · Inferential Statistics · Supervised Learning · Unsupervised Learning
Sharded & colocated
Input data
Distributed File System
Classification
Data transformation
Tagged opus execution
Multimedia multiform data
Indexing classes
INDEXING & STORING
• the precise time of each note in every recording
• the instrument that plays each note
• the note's position in the metrical structure of the composition
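One way to picture indexing these three kinds of label is the sketch below, which keeps labels sorted by onset time for time lookups and grouped by instrument. The `NoteLabel` fields and the `RecordingIndex` class are hypothetical and do not reflect the storage format of any dataset or system mentioned in this talk.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteLabel:
    onset: float      # seconds from the start of the recording
    offset: float     # seconds; the note sounds during [onset, offset)
    instrument: str
    pitch: int        # MIDI note number
    beat: float       # position of the note within its measure


class RecordingIndex:
    """Index the labels of one recording by onset time and by instrument."""

    def __init__(self, labels):
        self.labels = sorted(labels, key=lambda label: label.onset)
        self.onsets = [label.onset for label in self.labels]
        self.by_instrument = defaultdict(list)
        for label in self.labels:
            self.by_instrument[label.instrument].append(label)

    def sounding_at(self, t):
        """Return the notes sounding at time t, in onset order."""
        upto = bisect.bisect_right(self.onsets, t)  # labels whose onset is <= t
        return [label for label in self.labels[:upto] if label.offset > t]
```

The binary search narrows the candidates to notes that have already started; a production index would also bound how far back it has to scan, for example with an interval tree.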
SHARDING DATA ACROSS DIFFERENT STORES
MusicNet: 330 classical music recordings, 1 million annotated labels indicating note times, instruments and metrical positions
http://homes.cs.washington.edu/~thickstn/musicnet.html
Automatic and elastic tools for sharding data collections, to parametrize data access & exploitation by parallel programs that must scale up on different target architectures
SHARDING ACROSS DIFFERENT STORES
Factors: RAM, disk, CPU, network
Sharded data architecture
Balanced and smooth fragmentation (size, location, availability)
Optimum distribution across shards providing storage spaces (chunks)
Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk
Fragmentation/Sharding & replication:
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & Fault tolerance
- Replication & distribution
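The three placement strategies and the replication step listed above can be sketched in a few lines. All function names are hypothetical, and the reading of "tagged" as "an explicit label on the record decides placement" is an assumption, since the slide does not define it.

```python
import hashlib
from bisect import bisect_right


def hash_shard(key, n_shards):
    """Hash strategy: a stable digest of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def range_shard(key, upper_bounds):
    """Range strategy: shard i holds keys below upper_bounds[i]; the last shard takes the rest."""
    return bisect_right(upper_bounds, key)


def tagged_shard(record, tag_to_shard, default=0):
    """Tagged strategy (assumed meaning): the record's tag is mapped to a shard."""
    return tag_to_shard.get(record.get("tag"), default)


def replicas(primary, n_shards, copies):
    """Replication: keep `copies` copies, on the primary and the shards that follow it."""
    return [(primary + i) % n_shards for i in range(copies)]
```

Hashing spreads keys evenly but scatters neighbouring keys; ranges keep neighbours together, which suits scans over time or composition year but can create hot shards. For example, `range_shard(1685, [1700, 1800])` places a record keyed by the year 1685 on shard 0.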
Memory/Cache
SHARDING DATA ACROSS DIFFERENT STORES
Raw data collections
1. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
DATA DELIVERY FOR GREEDY PROCESSING
“Multi-view computational problem”
Iterative data processing and visualization tasks need to share CPU cycles
Data is a bottleneck
APPLICATION
DRAM
DISK/DATABASE
CPU: multiple cores
GPU: thousands of cores
1-5 GBps · 1-10 GBps
Provide data storage, fetching and delivery strategies
- Architecture: distributed file system across nodes
- Data sharding and replication: on storage and memory
- Fetch to fulfil multi-facet application requirements
- Prefetching
- Memory indexing
- Reduce impedance mismatch
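Prefetching, one of the delivery strategies listed above, can be sketched as a background thread that loads the next chunks while the consumer is still computing on the current one. The names `prefetch` and `load_chunks` are hypothetical, and `load_chunks` merely stands in for slow reads from disk or a database; error handling is left out, so an exception in the loader would leave the consumer waiting.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(chunks, depth=2):
    """Yield items from `chunks` while a background thread loads up to `depth` ahead."""
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for chunk in chunks:
            buffer.put(chunk)  # blocks while the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        chunk = buffer.get()   # blocks until the next chunk has been loaded
        if chunk is _DONE:
            return
        yield chunk


def load_chunks(n_chunks, chunk_size):
    """Stand-in for a slow data source: yields consecutive blocks of integers."""
    for i in range(n_chunks):
        yield list(range(i * chunk_size, (i + 1) * chunk_size))


# The consumer iterates as usual; loading overlaps with its computation.
total = sum(sum(chunk) for chunk in prefetch(load_chunks(4, 3)))
```

The `depth` parameter bounds how much memory the buffer may hold, which is the trade-off between hiding latency and competing with the application for DRAM.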
• Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
  - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
  - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
• Build underlying value-added data managers that can:
  - exploit available resources, making a compromise between QoS properties and SLA requirements that considers all the levels of the stack
  - deliver request results at a reasonable price and in a reliable, efficient manner, whatever the devices, the availability of resources and the data properties
OPPORTUNITIES
FINAL COMMENTS
Move from design based on intuition & experience to a more formal & systematic way to design systems
Addressing data centric sciences problems is a matter of designing complex systems according to a multidisciplinary vision
Let’s weave a golden trilogy
Big Data, AI & HPC