ETL Tools: L. Libkin

ETL tools are used to extract large volumes of data from different sources, transform the data (e.g. cleaning, profiling, and conversions between formats), and load the transformed data into a data warehouse. Major database vendors like IBM, Microsoft, and Oracle, as well as independent companies like Informatica, provide ETL tools that are good at bulk loading and real-time data integration but are less capable of complex structural transformations and query answering. These tools continue to improve in handling real-time data and different data formats but have more progress to make in areas like virtual integration and metadata management.
Copyright
© Attribution Non-Commercial (BY-NC)

ETL Tools

• ETL = Extract – Transform – Load
• Typically: data integration software for building a data warehouse
• Pull large volumes of data from different sources, in different formats,
restructure them and load into a warehouse
• A variety of tools:
◦ major database vendors (IBM, Microsoft, Oracle)
◦ independent companies (Informatica – currently among market leaders)
◦ open source (e.g. Clover ETL)
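
The extract, transform, load cycle described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how any of the listed tools works internally; the CSV content, the `employee` table and all function names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with inconsistent formatting.
SOURCE_CSV = """name,salary
 alice ,1000
BOB,2500
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and normalise names, convert salaries to integers."""
    return [(r["name"].strip().title(), int(r["salary"])) for r in rows]

def load(rows, conn):
    """Load: bulk-insert the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS employee (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)
    conn.commit()

def run_etl(text=SOURCE_CSV):
    """Run the three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute("SELECT name, salary FROM employee ORDER BY name").fetchall()

if __name__ == "__main__":
    print(run_etl())
```

Keeping the three stages as separate functions mirrors the way ETL tools let each stage be configured independently.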

L. Libkin 1 Data Integration and Exchange


ETL tools cont’d

Emphasis on:

• data quality (in particular cleaning and profiling tools)
• transformations between specific formats
• latency requirements (towards real-time)

Currently much less emphasis on:

• nontrivial transformations
• proper query answering
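
To make the data quality emphasis concrete, here is a minimal sketch of what profiling and cleaning a single column might look like. The statistics chosen, the cleaning rule (trim and lower-case) and the function names are assumptions for illustration; commercial tools offer far richer rule sets.

```python
from collections import Counter

def profile(rows, column):
    """Profile one column: row count, missing values, distinct values, most common value."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

def clean(rows, column):
    """Clean one column: trim whitespace and unify case; drop rows where it is missing."""
    cleaned = []
    for r in rows:
        v = r.get(column)
        if v is None or not v.strip():
            continue
        cleaned.append({**r, column: v.strip().lower()})
    return cleaned

# Hypothetical input with inconsistent spelling and one missing value.
rows = [{"city": "Edinburgh"}, {"city": " edinburgh "}, {"city": ""}, {"city": "Glasgow"}]

if __name__ == "__main__":
    print(profile(rows, "city"))
    print(profile(clean(rows, "city"), "city"))
```

Profiling before and after cleaning shows the effect of the rule: the number of distinct values drops once spelling variants are unified.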

IBM

• Product name: InfoSphere DataStage
• Main claims:
◦ variety of data sources (almost any database, text, XML, web services)
◦ capable of handling data arriving in real-time
◦ scalability
• Unix (Linux) and Windows platforms

InfoSphere DataStage cont’d

• InfoSphere – product line that includes software from WebSphere and
Information Server lines.
• Includes lots of other things
◦ application integration and transformation
◦ online marketing tools
◦ mobile, speech middleware
◦ business process management
◦ change data capture
◦ information analyzer
◦ data quality tools

InfoSphere Federation Server

• Federated (virtual) integration: “Access and integrate diverse data and
content sources as if they were a single resource - regardless of where
the information resides.”
• Integration across different relational products (DB2, Oracle, SQL Server)
• Integrity and accuracy guarantees
• Distributed query optimizer
• XML support
• Security strategies
• These are expensive products (>US$60K license)
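
The idea of federated (virtual) integration, querying several sources as if they were one, can be illustrated at toy scale with SQLite's `ATTACH DATABASE`. This is only an analogy and says nothing about how Federation Server is implemented; the two databases, the tables and the data are made up for the example.

```python
import sqlite3

def federated_query():
    """Join tables living in two separate databases with a single query."""
    # Two independent in-memory databases stand in for two source systems.
    conn = sqlite3.connect(":memory:")                 # source 1: schema "main"
    conn.execute("ATTACH DATABASE ':memory:' AS crm")  # source 2: schema "crm"

    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)]
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ann"), (2, "Raj")]
    )

    # One query spans both sources; nothing is copied into a warehouse first.
    return conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM crm.customers AS c
        JOIN main.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()

if __name__ == "__main__":
    print(federated_query())
```

In contrast to ETL, the data stays where it is and the integration happens at query time.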

IBM’s view of data integration

• Key tasks, with associated products
• Tasks:
◦ Connect to information (products: information server; data publisher)
◦ Understand information (data architect, models for ... (banking,
insurance, retail, telecom))
◦ Cleanse information (QualityStage: matching engine, cleaning rules
etc.)
◦ Transform information (DataStage)
◦ Deliver information (Federation Server, DataStage)

Microsoft

• Integration Services – part of SQL Server (SSIS)
• Supports multiple formats; converts everything into tabular format
• Transformations:
◦ join, union
◦ sort
◦ aggregate
◦ lookup
◦ convert
• Has a data quality tool
• Goes beyond traditional ETL: e.g., data and text mining tools
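
The transformation types listed above (convert, lookup, aggregate, sort) operate on tabular rows. A rough sketch of such a chain in plain Python follows; it does not use or imitate the SSIS API, and the sample rows, the lookup table and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical tabular input: every field arrives as text.
orders = [
    {"customer_id": "2", "amount": "7.50"},
    {"customer_id": "1", "amount": "10.00"},
    {"customer_id": "1", "amount": "5.25"},
]
customers = {1: "Ann", 2: "Raj"}  # reference table keyed by customer id

def convert(rows):
    """Convert: cast text fields to typed values."""
    return [{"customer_id": int(r["customer_id"]), "amount": float(r["amount"])} for r in rows]

def lookup(rows, table):
    """Lookup: enrich each row with a value found in a reference table."""
    return [{**r, "name": table.get(r["customer_id"], "unknown")} for r in rows]

def aggregate(rows):
    """Aggregate: total amount per customer name."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["name"]] += r["amount"]
    return dict(totals)

def pipeline(rows, table):
    """Chain the steps, then sort the result by customer name."""
    totals = aggregate(lookup(convert(rows), table))
    return sorted(totals.items())

if __name__ == "__main__":
    print(pipeline(orders, customers))
```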

Oracle

• Oracle Warehouse Builder (OWB)
• Data integration and metadata management tasks:
◦ Extraction, transformation, and loading (ETL) for data warehouses
◦ Migrating data from legacy systems
◦ Designing and managing corporate metadata
◦ Data profiling
◦ Data cleaning
• Included in the Oracle database product.

Oracle: transformations

• Scalar value transformations (plenty of predefined ones):
◦ Characters
◦ Conversions
◦ Dates
◦ Numbers
◦ Spatial objects
◦ XML transformations (from very simple – select nodes by XPath
expressions – to very complex, such as applying XSLT style sheet)
• Also user-defined (functions, procedures, packages)
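
As an illustration of the simple end of the XML transformations mentioned above, selecting nodes by an XPath expression, here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. It is not Oracle's API; the document and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
DOC = """
<staff>
  <employee dept="sales"><name>Ann</name></employee>
  <employee dept="it"><name>Raj</name></employee>
  <employee dept="sales"><name>Li</name></employee>
</staff>
"""

def names_in_dept(xml_text, dept):
    """Select the <name> of every <employee> whose dept attribute matches."""
    root = ET.fromstring(xml_text)
    # Attribute predicates such as [@dept='sales'] are part of the supported subset.
    path = f".//employee[@dept='{dept}']/name"
    return [node.text for node in root.findall(path)]

if __name__ == "__main__":
    print(names_in_dept(DOC, "sales"))
```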

Informatica

• Market leader – Informatica PowerCenter
• Provides support for
◦ migration
◦ synchronization
◦ warehousing
◦ cross-enterprise integration
• Works with multiple data formats
• Provides support for metadata management
• Real-time capabilities

Informatica: Transformation language

• Main orientation: scalar value transformations
• Functions: change data in a mapping
• Operators: create transformation expressions
• Syntax is SQL-based
• Part of it is essentially a programming language in a Java-like syntax
for manipulating values.
• Roughly: looks at a portion of the source data, modifies it, and changes
the target data accordingly.

Informatica: Transformation language cont’d

• DD_DELETE and DD_INSERT specify what to do with data items.
• E.g., IIF(job='CEO', DD_DELETE, DD_INSERT) says: items with
job being CEO are marked for deletion, others for insertion.
• Operators:
◦ Arithmetic
◦ String
◦ Comparisons
◦ Logical
◦ (almost) everything you can imagine
• Many functions for dealing with dates in different formats.
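
The effect of the expression IIF(job='CEO', DD_DELETE, DD_INSERT) can be mimicked in Python to show the idea of flagging rows and then applying the flags to a target. This is a sketch of the semantics described on the slide, not Informatica code; the `Flag` enum, the helper names and the dict-based target table are assumptions.

```python
from enum import Enum

class Flag(Enum):
    # Names mirror the constants on the slide; the enum itself is illustrative only.
    DD_INSERT = "insert"
    DD_DELETE = "delete"

def iif(condition, if_true, if_false):
    """Conditional expression in the style of IIF(condition, a, b)."""
    return if_true if condition else if_false

def flag_rows(rows):
    """Pair each row with the flag chosen by IIF(job='CEO', DD_DELETE, DD_INSERT)."""
    return [(r, iif(r["job"] == "CEO", Flag.DD_DELETE, Flag.DD_INSERT)) for r in rows]

def apply_flags(target, flagged, key="name"):
    """Apply the flags to a target table held as a dict keyed by `key`."""
    for row, flag in flagged:
        if flag is Flag.DD_DELETE:
            target.pop(row[key], None)   # marked for deletion
        else:
            target[row[key]] = row       # marked for insertion
    return target

if __name__ == "__main__":
    source = [{"name": "Ann", "job": "CEO"}, {"name": "Raj", "job": "Engineer"}]
    target = {"Ann": {"name": "Ann", "job": "CEO"}}
    print(apply_flags(target, flag_rows(source)))
```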

Informatica: Transformation language cont’d

• Large number of functions
• Aggregates: AVG, COUNT, MIN, MAX, MEDIAN, PERCENTILE, STDDEV,
SUM, etc.
• Character functions: CONCAT, LENGTH, TRIM, etc.
• Conversion functions (e.g., TO_CHAR for Date, TO_DECIMAL, TO_FLOAT,
TO_DATE)
• Date functions: ADD_TO_DATE, DATE_DIFF, DATE_COMPARE, etc.
• Numerical: the usual suspects.
• Scientific: SIN, COS, TAN, etc.
• Search for a value in the source: LOOKUP
• This was quick; full manual – almost 250 pages.
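
For readers unfamiliar with such scalar functions, the following sketch shows loose Python analogues of a few of the conversion and date functions named above. The signatures, the default format strings (Python `strftime` codes, not Informatica format strings) and the day-only arithmetic are assumptions; the real functions take different arguments.

```python
from datetime import datetime, timedelta

def to_date(text, fmt="%Y-%m-%d"):
    """Parse a string into a datetime (loose analogue of TO_DATE)."""
    return datetime.strptime(text, fmt)

def add_days(date, days):
    """Shift a date by a number of days (loose analogue of ADD_TO_DATE)."""
    return date + timedelta(days=days)

def diff_days(later, earlier):
    """Whole days between two dates (loose analogue of DATE_DIFF)."""
    return (later - earlier).days

def to_char(date, fmt="%d/%m/%Y"):
    """Format a date as a string (loose analogue of TO_CHAR for dates)."""
    return date.strftime(fmt)

if __name__ == "__main__":
    start = to_date("2024-02-28")
    end = add_days(start, 2)
    print(to_char(end), diff_days(end, start))
```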

Summary

• Complex tools; very good at transforming data values, and at working
with specific formats (MS Word, Excel, PDF, UN/EDIFACT, RosettaNet,
etc.) and for specific industries (finance, insurance, health)
• Much better these days at getting real-time data; very good at bulk
loading, supporting multiple formats
• Not so good:
◦ virtual integration
◦ complex structural transformation
◦ query answering
◦ metadata management
• A lot of effort will be put there over the coming years
