Embulk at Treasure Data
Satoshi Akama
Dec. 15, 2015
Embulk meetup #2
About me…
Satoshi Akama
Embulk plugins
 ・embulk-output-bigquery
 ・embulk-input-gcs
 ・embulk-input-azure_blob_storage
 ・embulk-output-azure_blob_storage
Treasure Data Inc.
Software Engineer (Java/Scala/Ruby)
github.com/sakama/
@oreradio
We provide hosted Embulk
Data Connector (import) + Result Output (export)
“Data loading” should not be the customer’s work
unless they are developing ETL tools.
Streaming Import
Data Connector (import) sources: MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, Salesforce, Marketo, …etc
Result Output (export) destinations: MySQL, PostgreSQL, Redshift, BigQuery, …etc
Treasure Data as a Data Hub
Treasure Data (schemaless)
Some other data store (schemaful)
You can create a data pipeline easily
Data in various formats
 ・Logs
 ・Sensor data (IoT)
 ・Visualization
 ・Digital marketing
Data Connector (Import) - CLI
guess/preview/import
$ td connector:guess seed.yml -o load.yml
$ td connector:preview load.yml
$ td connector:issue load.yml --database td_sample_db \
    --table td_sample_table
Scheduled execution
$ td connector:create \
    daily_import \
    "10 5 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at
A GUI will come in the near future
Result Output (Export) - GUI/CLI
We are using… unchanged OSS Embulk and Embulk plugins
When we change something, we use it in our service after sending a pull request to OSS Embulk
"I feel that the so-called open source model, in which the basic features are released for free and left to the community while software with added features is sold commercially, is in practice not working all that well."
- "The way Fluentd sets the business in motion feels really good." | Think IT
https://thinkit.co.jp/story/2015/07/17/6232
"There are various development styles even within open source software, but in the case of fluentd, Treasure Data, where I work, backs it fully. For now, I think this development style, 'backed by a company, but developed in the open', suits it best."
- "I want to master databases, not OSes or languages: fluentd's new features and the ambitions of Treasure Data's Furuhashi, as heard by a GREE engineer (2/3)" - @IT
http://www.atmarkit.co.jp/ait/articles/1310/07/news010_2.html
Process to use Embulk plugins at TD
(1) Add features / Fix for Local Executor
    → Send pull request to OSS Embulk or Embulk plugins
(2) Fix for MapReduce Executor
(3) Write unit tests
(4) Write integration tests
(5) Release as “Data Connector” or “Result Output”
(Sorry, part of this is closed source code)
Process to use Embulk plugins at TD (1)
・Add some features
 e.g. add various authentication methods
・Add some fixes
 e.g. add retry logic, fix error handling
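As an illustration of the "add retry logic" fix, here is a minimal sketch of retrying with exponential backoff. It is written in Python for brevity and is not taken from any Embulk plugin; `TransientError`, `with_retry`, and all parameters are hypothetical names chosen for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a timeout)."""


def with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); retry on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Wait base_delay * 2^(attempt - 1), plus up to 10% random jitter.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Errors that are not `TransientError` propagate immediately, which keeps the "fix error handling" side separate: only failures known to be temporary are retried.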
Process to use Embulk plugins at TD (2)
Handling of file paths
 The MR executor cannot read local file paths (e.g. a private key file)
Fix authorization logic if needed
 The transaction() and open() methods run on different instances
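One way to read the two points above: anything the worker needs must travel inside the serialized task, not stay on the local disk or in an object created earlier. The following is a conceptual sketch in Python, not Embulk's actual plugin API (real plugins are written in Java or Ruby); `private_key_path`, `bucket`, and `open_on_worker` are hypothetical names.

```python
import json


def transaction(config):
    """Runs once where the job is configured; local files are readable here."""
    with open(config["private_key_path"], "r", encoding="utf-8") as f:
        key_content = f.read()
    # Ship the key *content*, not the path, in the task sent to workers.
    task = {"bucket": config["bucket"], "private_key_content": key_content}
    return json.dumps(task)


def open_on_worker(serialized_task):
    """Runs on another instance; it can rely only on what the task carries."""
    task = json.loads(serialized_task)
    # Build credentials here from the task. An object created inside
    # transaction() does not exist on this instance.
    return {"bucket": task["bucket"], "credentials": task["private_key_content"]}
```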
Process to use Embulk plugins at TD (3)
Need 80% coverage
 By internal rule, we can’t deploy without 80% unit test coverage.
Writing unit tests for an Embulk plugin is difficult
 e.g. code that connects to a cloud service…
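A common way to unit test code that talks to a cloud service is to inject the client and replace it with a fake in tests. The sketch below shows the idea in Python with `unittest.mock`; it is not how TD's tests are actually written, and `BlobLister` and `list_objects` are hypothetical names.

```python
from unittest import mock


class BlobLister:
    """Hypothetical plugin logic: list object names, skipping empty objects."""

    def __init__(self, client):
        self.client = client  # injected, so a test can pass a fake client

    def list_files(self, prefix):
        objects = self.client.list_objects(prefix)
        return [o["name"] for o in objects if o["size"] > 0]


def test_list_files_skips_empty_objects():
    # The fake client returns canned data; no network access is needed.
    fake_client = mock.Mock()
    fake_client.list_objects.return_value = [
        {"name": "logs/a.csv", "size": 10},
        {"name": "logs/", "size": 0},
    ]
    lister = BlobLister(fake_client)
    assert lister.list_files("logs/") == ["logs/a.csv"]
    fake_client.list_objects.assert_called_once_with("logs/")
```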
Process to use Embulk plugins at TD (4)
Write integration tests for the Treasure Data service
(1) Import data into TD
(2) Send queries to Presto and Hive
(3) Compare the results with a local file
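The three steps can be sketched as follows. This is only an outline in Python under stated assumptions: `import_data` and `run_query` are hypothetical callables standing in for the real import and the Presto/Hive query, and the SQL text and column names are made up for illustration.

```python
import csv
import io


def load_expected(csv_text):
    """Parse the expected rows (normally read from a local file)."""
    return [tuple(row) for row in csv.reader(io.StringIO(csv_text))]


def check_result(query_rows, expected_rows):
    """Compare rows ignoring order; values are compared as strings."""
    actual = sorted(tuple(str(v) for v in row) for row in query_rows)
    return actual == sorted(expected_rows)


def run_integration_test(import_data, run_query, expected_csv_text):
    import_data()  # (1) import data into TD
    rows = run_query("SELECT id, name FROM td_sample_table")  # (2) query
    return check_result(rows, load_expected(expected_csv_text))  # (3) compare
```

Comparing sorted rows avoids false failures when the query engine returns rows in a different order from the local file.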
Process to use Embulk plugins at TD (5)
Release as “Data Connector” or “Result Output”
We hope for a win-win relationship
Embulk community
 ・Core development
 ・Plugin development
Use at TD
Use in your own environment
Contribute
Embulk Execution Platform at Treasure Data
[Architecture diagram. Components: Web Console and td commands (td connector:issue, td guess config.yml…), Load Balancer, TD API (API servers), Bulkload API (API servers), Perfect Queue, TD worker (worker process). Edge labels: Request/Response, guess/preview, enqueue, dequeue, Submit Job (retry if needed), Execute with MR / Local Executor.]
TD API / Bulkload API
guess/preview is processed on different API servers.
 ・guess/preview: HTTP request/response (guess/preview needs a quick response)
 ・data import: queuing (enqueue to Perfect Queue)
[Diagram components: Load Balancer, TD API (API servers), Bulkload API (API servers), Perfect Queue]
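The split between the two paths can be sketched as below: guess/preview is answered inside the request, while a data import is only enqueued and executed later by a worker. This is an illustrative Python sketch, not TD's implementation; `handle_request`, `worker_step`, and the in-process `queue.Queue` (standing in for the job queue) are assumptions.

```python
import queue

job_queue = queue.Queue()  # in-process stand-in for the job queue


def handle_request(kind, config, run_now):
    """Answer guess/preview synchronously; enqueue data imports."""
    if kind in ("guess", "preview"):
        # Needs a quick response, so run it within the HTTP request.
        return {"status": "done", "result": run_now(kind, config)}
    if kind == "import":
        job_queue.put(config)  # picked up later by a worker process
        return {"status": "queued"}
    raise ValueError(f"unknown request kind: {kind}")


def worker_step(execute):
    """Dequeue one job and execute it (a real worker would also retry)."""
    config = job_queue.get()
    try:
        return execute(config)
    finally:
        job_queue.task_done()
```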
Problems
Stability of integration tests
 ・Many plugins × many test cases × frequent execution sometimes causes failures.
Execution time of integration tests
 ・Many plugins × many test cases causes long execution times :)

