Talend Data Fabric Webinar Insights

The webinar discusses Talend Data Fabric, a data integration and management platform. It highlights how Talend can help organizations address common data challenges related to integration, governance, and enabling data-driven business decisions. Talend provides a unified environment for data ingestion, integration, quality, and sharing across cloud, on-premises, and hybrid environments. It delivers trusted data at speed through its open-source DNA and capabilities for data loading, integration, quality, cataloging, and stewardship.


TECH WEBINAR

TALEND DATA FABRIC: DELIVERING TRUSTED DATA AT SPEED

Dias Pambudi Satria – Presales Talend

[email protected]
WHAT ARE SOME OF THE COMMON
THEMES IN YOUR CLIENT PROJECTS?

THE DATA MANAGEMENT LANDSCAPE
•  Integration challenge: data and organizational silos (cloud | on-prem | data lakes)
•  Technology challenge: increasing engineer gaps (DataOps)
•  Governance challenge: compliance / regulations (privacy | industry | compliance)
•  User challenge: data-driven business decisions (data democratization)

Increasing diversity of data types, databases, use cases, and deployment platforms.
HOW MANY TOOLS/PRODUCTS DO YOU NEED
TO MEET THE FOLLOWING REQUIREMENTS?
1. Data Integration
2. Big Data Pushdown/Integration
3. Data Catalog/Metadata Management
4. Masking
5. Data Quality/Customer 360
6. Service-Based Framework
7. Data Hub
8. Automated Data Ingestion
9. Self-Service ETL
10. Exception Management/Stewardship

All delivered via a CI/CD pipeline.

AVG: 3-4 TOOLS/FRAMEWORKS

TALEND DELIVERS BOTH SPEED AND TRUST

SPEED
•  Unified environment: one user experience
•  Native code generation: best performance on any platform

TRUST
•  Pervasive data quality: end-to-end automated data quality
•  Self-service data access: collaborative governance

With a DNA that's "open source"
TALEND DATA FABRIC DELIVERS
TRUSTED DATA AT SPEED

RAW DATA → COLLECT → GOVERN → TRANSFORM → SHARE → TRUSTED DATA → INSIGHTS

Personas: data engineers, integration specialists, data stewards, data scientists, citizen integrators, data analysts, and new users.

Experiences: multi-cloud, hybrid, on-premises.

"Through 2020, integration will take 50% of the time and cost of building a digital platform."
TALEND DATA FABRIC
DELIVERS TRUSTED DATA AT SPEED – DEPLOYED ON-PREM OR IN THE CLOUD

•  Stitch Data Loader: rapid data loading; get data into cloud data warehouses and data lakes in minutes.
•  Data Integration (Studio DI/BDI and Pipeline Designer): connect, access, and transform data in any format; batch, streaming, and big data; across the cloud or on-premises; ideal for self-service (Cloud Pipeline Designer) and experts.
•  Data Catalog: data inventory, lineage, and visual data discovery; single point of control and reference; end-to-end visibility across dataflows.
•  Data Quality: profile, cleanse, and mask any data; ML-enabled de-duplication, validation, and standardization; enrich data with external sources.
•  Data Preparation: identify errors quickly with ML-based smart guides; apply rules to massive datasets; reuse and share in one click.
•  Data Stewardship: control and improve data integrity across any data; clean, certify, and reconcile data; team-based workflow approach.
•  API & Application Integration: quickly implement APIs and event-driven architectures; highly scalable integration platform-as-a-service (iPaaS).

Unified Environment | Native Performance | Pervasive Data Quality | Self-Service
INGESTION PATTERNS WITH TALEND

START ANYWHERE WITH TALEND

Personas: data engineers, data scientists, integration specialists, data stewards, data analysts, citizen integrators.

•  Studio: data-engineer ready; advanced data integration, data quality, 900+ connectors (Talend Community).
•  Pipeline Designer: light-weight data integration; web-based client, batch & streaming, cloud native.
•  Stitch: ingest quickly; replicate, cloud-native connectors.

Integration complexity ranges from simple (Stitch) to advanced (Studio).
STITCH
STITCH IS BUILT TO GET YOUR DATA TO
YOUR WAREHOUSE

True SaaS / web-based · Cloud native · Open-source DNA · HIPAA, GDPR, and SOC 2 compliance

WHAT DO WE DO?
CONNECTS IT → MOVES IT → LANDS IT

•  Sources: sales & marketing SaaS apps, legacy on-prem DBMSs
•  Movement: one way, no data changes, no administration
PIPELINE DESIGNER
PIPELINE DESIGNER: A MODERN CLOUD DATA
INTEGRATION DESIGN ENVIRONMENT

✓ Build faster, easier, and smarter
✓ Integrate all your data at any speed
✓ Innovate and scale effortlessly

PIPELINE DESIGNER IS BUILT FOR MODERN
DATA ENGINEERING

•  Live preview
•  Schema on read
•  Batch & streaming
•  Extensible with Python
•  Portability & multi-cloud
•  Scale with Talend Cloud

STUDIO
SINGLE DESIGN ENVIRONMENT
•  Single, unified platform based on Eclipse
•  One design studio for:
   •  ETL/ELT
   •  Data quality (profiling, job design)
   •  Big data (MapReduce/Spark)
   •  ESB (Camel routes)
•  Connect to anything
•  View the code to debug faster
•  Spark batch, streaming, & machine learning
•  Run locally (you can design pipelines in airplane mode)

LEARN ONCE, ADAPT TO NEW TECHNOLOGIES AT SCALE

CONNECTIVITY UPDATES
Additions
•  Snowflake: tSnowflakeCommit, tSnowflakeRollback
•  Azure: tAzureAdlsGen2Input, tAzureAdlsGen2Output
•  Workday: tWorkdayInput

Enhancements
•  New AWS regions in S3, Redshift, and Snowflake connectors
•  Redshift bulk components: reuse S3 connection
•  Support JSON for DynamoDB
•  Support dynamic schema for BigQuery
•  Snowflake bulk: temp table removed
•  Snowflake added in tSQLTemplateMerge DB type
•  tServiceNowInput: combine filter conditions with and/or
•  tRunJob: two options for setting the JVM args (use the child job's JVM args, or overwrite them)
•  tPostgreSQLOutput: handle casing when creating a table
•  "Use alternate schema" feature in many DB components
•  tJDBCSCDELT: set end date in type 2 SCD
WHAT SHOULD I USE?
|                  | Stitch                                                       | Pipeline Designer                                                                                                   | Studio                                                                                                                 |
|------------------|--------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| Connectors       | Apps                                                         | Apps + DBs                                                                                                            | 900+                                                                                                                     |
| Transformation   | N/A                                                          | Yes, customizable                                                                                                     | Yes, heavily customizable                                                                                                |
| Streaming source | Batch only                                                   | Yes                                                                                                                   | Yes                                                                                                                      |
| Spark processing | No                                                           | Yes                                                                                                                   | Yes                                                                                                                      |
| Typical users    | Business (zero tech)                                         | Data scientists, citizen integrators                                                                                  | Technical users                                                                                                          |
| Use cases        | Simple ingestion from marketing apps or DBs; cloud-platform integration | Simple use cases with transformation; modern ingestion with Python; cloud-based execution systems (EMR/Databricks)   | More complex integration; Spark; ESB; complete flexibility; dynamic schemas; supports almost every execution framework  |
| Data sovereignty | Moves to our cloud                                           | Can execute within your VPC/firewall                                                                                  | Can execute within your VPC/firewall                                                                                     |
| Installation     | Zero                                                         | Zero for design                                                                                                       | Have to install all the components                                                                                       |
DELIVER TRUSTED DATA
TALEND IN THE DATA JOURNEY

From RAW DATA to TRUSTED DATA:
•  Ingest data: Stitch Data Loader
•  Combine data from multiple sources: Data Integration
•  Discover and prepare data: Data Preparation, Data Catalog (Talend Data Inventory)
•  Cleanse and ensure data quality / proper usage: Data Quality, Data Stewardship
•  Analyze and share data (internally & externally): API & Application Integration
DATA THAT YOU CAN TRUST AS A TEAM SPORT

•  Talend Data Quality
•  Talend Data Catalog
•  Talend Data Preparation
•  Talend Data Stewardship

Designed by business, operationalized by IT, managed by Talend.
DATA QUALITY PROJECT STEPS
An iterative and never-ending process…

•  Discover: search, find, and profile data to understand its structure and typology through various analyses and indicators
•  Standardize: convert, format, validate, enrich, mask
•  Consolidate: match, deduplicate… and get the golden record through survivorship activities
•  Operationalize: collaborate, leverage business knowledge, and have IT industrialize it
•  Monitor: measure, analyze, control, and improve

Continuous improvement loops back to Discover.
DISCOVER
Search and understand your enterprise data the easy way

Make the initial investigations on your data:
•  Quickly identify data quality issues
•  Discover (hidden) patterns
•  Spot anomalies
with the help of summary statistics and graphical representations.
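To make the discovery step concrete, the following is a minimal sketch in plain Python (illustrative data and function names, not Talend's implementation) of the kind of indicators a profiler computes: row count, completeness, distinct count, and Talend-style character-pattern frequencies where letters map to 'a'/'A' and digits map to '9'.

```python
from collections import Counter

def char_pattern(value: str) -> str:
    """Abstract a value into a character pattern: lowercase letters
    become 'a', uppercase letters 'A', digits '9'; punctuation is kept."""
    return "".join(
        "a" if c.islower() else "A" if c.isupper() else "9" if c.isdigit() else c
        for c in value
    )

def profile_column(column: list) -> dict:
    """Compute basic profiling indicators for a single column."""
    present = [v for v in column if v not in (None, "")]
    return {
        "row_count": len(column),
        "completeness": len(present) / len(column),
        "distinct_count": len(set(present)),
        "pattern_frequencies": Counter(char_pattern(v) for v in present),
    }

# Hypothetical phone-number column with a hidden format inconsistency.
phones = ["(541) 754-3010", "(623) 346-8239", "541-754-3010", None]
print(profile_column(phones))
# Two distinct patterns appear for one semantic type: a standardization candidate.
```

Two patterns for one semantic type is exactly the kind of anomaly a pattern-frequency analysis surfaces graphically.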
USING THE TALEND STUDIO
Template analyses and statistical indicators are available to discover the data quality level.

USING DATA PREPARATION

•  Semantic awareness with semantic types
•  Quality bar to spot anomalies at a glance

USING DATA CATALOG
Search and browse your enterprise data the easy way

•  Faceted Search
•  Auto-completion

USING DATA CATALOG
Automatic Data Profiling
•  At dataset and attribute level
•  Pattern Detection
•  Completeness
•  Profiling statistics

STANDARDIZE
Transform data taken from different sources and various formats into a consistent format:
•  Convert your data to a standardized format (e.g. dates, phone numbers)
•  Clean your data to eliminate "noise"
•  Enrich and complete your data using reference sources
•  Mask, shuffle, hash, encrypt… to enforce data privacy policies
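As an illustration of the convert/validate bullets above, here is a small sketch in plain Python (standard library only; the formats and function names are my own, not a Talend component) that normalizes dates and US phone numbers to a single canonical form:

```python
import re
from datetime import datetime

def standardize_date(value: str) -> str:
    """Try a few common input formats and emit ISO 8601."""
    for fmt in ("%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def standardize_us_phone(value: str) -> str:
    """Strip non-digits and reformat 10-digit US numbers as (NNN) NNN-NNNN."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit number: {value!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(standardize_date("31/01/2020"))        # 2020-01-31
print(standardize_us_phone("541.754.3010"))  # (541) 754-3010
```

Values that raise an error are the ones routed to cleansing or stewardship rather than silently passed through.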
NUMEROUS STUDIO COMPONENTS
DQ at scale – available in DI and Big Data

•  The same use cases are covered in DI and Big Data, with native support of Spark 2.x

NUMEROUS FUNCTIONS IN DATA PREP
•  For strings, for dates, for data privacy, for conversions, etc., etc., etc.

DATA PRIVACY
Different approaches for different needs
Masking · Shuffling · Encryption · Hashing
DATA PRIVACY

Masking
•  Hide data while preserving the overall format and semantics
•  Supports random, repeatable, and bijective masking
•  Allows unmasking data by supporting FPE (format-preserving encryption) algorithms

Plain input → masked output:
SSN: 376-76-6765 → 675-45-9287
Phone: (541) 754-3010 → (623) 346-8239

Shuffling
•  Replace sensitive data with other values of the same attribute, taken from a different record
•  Guarantees that any metric computed on the dataset is still valid
•  Supports random, group, and partition shuffling

Plain input → shuffled output:
1: 376-76-6765 → 673-92-7309
2: 673-92-7309 → 482-32-3287
3: 482-32-3287 → 376-76-6765

Encryption
•  Protects data by transforming it into unreadable cipher text
•  Can be decrypted to retrieve the original plain data

Plain input → encrypted output:
SSN: 376-76-6765 → NxpVN55Q/8MqArUnRQ==
Phone: (541) 754-3010 → +qMr+N7i0L+mpwEL2VCzik1rhUBgUnqDg==

Hashing
•  Transforms data of arbitrary size into a unique piece of data of a fixed size
•  Non-reversible function (i.e., cannot unhash)

Plain input → hashed output:
SSN: 376-76-6765 → BF3EA452BAC756201A0B724E0174E006
Phone: (541) 754-3010 → A1449D83CBB396D80648DACE1A68B594
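The following toy sketch contrasts three of the four approaches using only the Python standard library. It is not Talend's masking components or algorithms: real masking would use FPE as noted above, and the hash shown is SHA-256 rather than the MD5-length digests in the slide example.

```python
import hashlib
import random

records = ["376-76-6765", "673-92-7309", "482-32-3287"]

# Masking: replace each digit with another digit, seeded per value so the
# masking is repeatable; the format (dashes, length) is preserved.
def mask(value: str, secret: str = "demo-seed") -> str:
    rng = random.Random(secret + value)
    return "".join(str(rng.randrange(10)) if c.isdigit() else c for c in value)

# Shuffling: reassign each value to a different record, so column-level
# metrics stay valid while row-level linkage is broken.
def shuffle_column(values: list, seed: int = 42) -> list:
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hashing: fixed-size, non-reversible digest (prefer SHA-256 in practice).
def hash_value(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

print([mask(v) for v in records])
print(shuffle_column(records))
print([hash_value(v) for v in records])
# Encryption (reversible) would use a real cipher, e.g. AES via the
# `cryptography` package; omitted here to keep the sketch stdlib-only.
```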
CONSOLIDATE
Create a single view of your data:
•  Reconciles and creates groups of similar data records in any source via a matching process
•  Automates deduplication, via survivorship rules, when records are guaranteed to be duplicates
•  Manually creates the golden records, by leveraging the knowledge of data stewards, when records are suspected to be duplicates
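To show what "automated deduplication via survivorship rules" can look like, here is a minimal sketch (illustrative records and field names; Talend's tRuleSurvivorship supports far richer rule sets) where, for each field, the most recently updated non-empty value survives into the golden record:

```python
from datetime import date

# A group of records already identified as duplicates of one customer.
# Each record carries a last-updated date for recency-based survivorship.
duplicates = [
    {"name": "J. Smith",   "email": "",                "updated": date(2019, 3, 1)},
    {"name": "John Smith", "email": "js@example.com",  "updated": date(2020, 6, 15)},
    {"name": "John Smith", "email": "old@example.com", "updated": date(2018, 1, 9)},
]

def golden_record(group: list) -> dict:
    """Hard-coded survivorship: for each field, keep the value from the
    most recently updated record that actually has a non-empty value."""
    by_recency = sorted(group, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in ("name", "email"):
        golden[field] = next(
            (r[field] for r in by_recency if r[field]), ""  # empty if none
        )
    return golden

print(golden_record(duplicates))
# {'name': 'John Smith', 'email': 'js@example.com'}
```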
HIGH-LEVEL SCENARIO
A standard use case as an example. The different sources are first reconciled via matching; then:

•  Some records are guaranteed to be unique. They go directly into the target DW.
•  Some records are guaranteed to be duplicates. We can automate the resolution and the creation of the golden records via hard-coded survivorship rules.
•  Some records are suspected to be duplicates. Manual intervention is needed: using Data Stewardship, a data steward creates the golden records.

Along the way, a business user solves DQ issues using Data Prep; this preparation is then industrialized by IT.
THREE POSSIBLE OUTPUTS
A matching score is assigned to each record to indicate its degree of similarity.

| Output               | Matching score | Relative to thresholds                  | Treatment                                         | Destination      |
|----------------------|----------------|-----------------------------------------|---------------------------------------------------|------------------|
| Unique records       | 0–75%          | Below the match threshold               | No treatment                                      | Target           |
| Duplicate records    | > 95%          | Above the confidence threshold          | Automatic consolidation (via tRuleSurvivorship)   | Target           |
| Suspected duplicates | 75–95%         | Between match and confidence thresholds | Manual consolidation (via tStewardshipTaskOutput) | Data Stewardship |
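A minimal sketch of the routing logic implied by the table above, using the slide's thresholds (0.75 match, 0.95 confidence); the function name and sample scores are illustrative:

```python
MATCH_THRESHOLD = 0.75       # below this: treated as unique
CONFIDENCE_THRESHOLD = 0.95  # above this: safe to auto-consolidate

def route(score: float) -> str:
    """Route a record by its matching score, as in the table above."""
    if score > CONFIDENCE_THRESHOLD:
        return "automatic consolidation"   # tRuleSurvivorship
    if score >= MATCH_THRESHOLD:
        return "manual consolidation"      # tStewardshipTaskOutput
    return "no treatment"                  # unique, straight to target

for score in (0.30, 0.82, 0.97):
    print(f"{score:.2f} -> {route(score)}")
```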
MONITOR
•  Continuously monitor data quality throughout the data lifecycle and compare the results over time
•  Trace the history of data quality, preserving historical records that measure data improvement or degradation
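As a sketch of what tracing history and measuring improvement or degradation means mechanically (the values are illustrative; Talend persists such indicators for the reports described next), a monitoring job can append a timestamped snapshot of each indicator after every run and compare the latest measurements:

```python
from datetime import date

# Timestamped snapshots of one indicator (e.g. completeness of a column),
# appended after each monitoring run. Values are illustrative.
history = [
    (date(2020, 1, 1), 0.91),
    (date(2020, 2, 1), 0.94),
    (date(2020, 3, 1), 0.89),
]

def trend(snapshots: list) -> str:
    """Compare the two latest measurements to flag improvement or degradation."""
    (_, prev), (_, last) = sorted(snapshots)[-2:]
    if last > prev:
        return "improving"
    if last < prev:
        return "degrading"
    return "stable"

print(trend(history))  # degrading (0.94 -> 0.89)
```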
BASIC REPORTS
•  Create basic reports that provide the statistics collected by the analyses listed in a given report

EVOLUTION REPORTS
•  Create evolution reports that show how the indicators used in the analyses of a given report evolve over time

OPERATIONALIZE
•  Leverage business users' knowledge by operationalizing and industrializing their tasks as part of an IT job/pipeline
•  Govern your data in a way that ensures regulatory compliance and data security, without overly restricting data access and usage
SHARE YOUR PREPARATION
•  Business users can collaborate among themselves
•  Business-user tasks can be shared with IT, to leverage their knowledge as part of an IT job/pipeline, in a governed way

DESIGNED BY BUSINESS… OPERATIONALIZED BY IT…

THAT'S IT!
An iterative and never-ending process of continuous improvement:
Discover → Standardize → Consolidate → Operationalize → Monitor → (back to Discover)
How do the tools fit in?

|          | DISCOVER                       | REMEDIATE               | RESOLVE                   |
|----------|--------------------------------|-------------------------|---------------------------|
| Business | Find: Data Catalog + Data Prep | Clean: Data Stewardship | Control: Data Stewardship |
| IT       | Profiling: Talend Studio       | Filter: Talend Studio   | Check: Talend Studio      |
