© 2018 Arm Limited
• Kentaro Yoshida
Improve data engineering work
with Digdag and Presto UDF
• 2018/10/17
at Plazma TD TechTalk 2018 Fall
About me
• @yoshi_ken
• Leading the DATA Team
• Supporting data-driven work at TD
• Published books on DWH platforms
Familiar Products
What is the DATA Team?
• Manages the internal data ETL & analysis platform on Treasure Data
• For historical reasons, it uses Luigi, Airflow (with Embulk), and Digdag
• Manages data visualization and reporting workflows for the business
• Not only for engineers, but also for sales, marketing, and operations
• Distills simple, actionable insights from a complex ocean of data
• A kind of data science (analysis) solution
• A rare team that uses Treasure Data internally on a daily basis
• We can feed back new improvements from a user's point of view
Technical Challenges of the DATA Team
• Build a scalable & robust data pipeline
• ex) one query generates numerous metrics logs from each component
• Improve fact data to support data-driven business/engineering
• ex) make data easier to use by enriching/pre-processing it beforehand
• Find performance-tuning insights for Presto/Hive on the platform side
• ex) root causes of table fragmentation
• Move from daily jobs to semi-realtime data processing
• ex) fresh, quick stats give engineers and support staff good insight
Introducing nice improvements
to Presto UDFs and Digdag
Recently introduced improvements in Digdag and Presto
• New features of Digdag
1. Added ${td.last_job.num_records}
• Holds the number of records in the job results
2. Added "_else_do" after the if> operator, since Digdag v0.9.31
3. Added param_set> and param_get>
• For sharing parameters between workflows (not available in TD workflow)
• New features of Presto
1. Added the TD_TIME_STRING() UDF
• Makes it easier to format a date string in the SELECT clause
2. Added the TD_INTERVAL() UDF
• Makes it easier to specify a time-range extraction in the WHERE clause
New Features of Digdag
Situation: a zero-result error in a workflow
• For some reason, the final result unexpectedly came back empty.
• We had to investigate the number of result rows at each step, one by one.
• I wished Digdag could check the number of result rows at each step…
• I wished Digdag had a way to output results together with the job_id…
Oops!
Situation: a zero-result error in a workflow
• The newly introduced ${td.last_job.num_records} holds the number of records in the job results
$ cat num_records.dig
+query:
  td>:
  query: SELECT DISTINCT symbol FROM nasdaq
  database: sample_datasets
+fail_if_zero:
  if>: ${td.last_job.num_records < 1}
  _do:
    fail>: job_id:${td.last_job.id} returned ${td.last_job.num_records} rows.
Situation: a zero-result error in a workflow
• Introduced "_else_do" after the if> operator, since Digdag v0.9.31
$ cat num_records.dig
+query:
  td>:
  query: SELECT DISTINCT symbol FROM nasdaq
  database: sample_datasets
+fail_if_zero:
  if>: ${td.last_job.num_records < 1}
  _do:
    fail>: job_id:${td.last_job.id} returned ${td.last_job.num_records} rows.
  _else_do:
    sh>: td export:result ${td.last_job_id} ${result_path} # enqueue an export job
    _export:
      result_path: td://@/workflow_logs/jobid_${td.last_job_id}
New Feature of Presto
TD_TIME_STRING() UDF
Efficient way to format a date string in SELECT
• Writing the date-format conversion by hand was a burden.
• This type of query is generally used with a GROUP BY statement.
• So I used to register a custom "td" snippet dictionary in my IME.
Efficient way to format a date string in SELECT
• TD_TIME_STRING() is an awesome UDF
• An easier way to truncate a timestamp
format string  format                 example
y              yyyy-MM-dd HH:mm:ssZ   2018-01-01 00:00:00+0700
q              yyyy-MM-dd HH:mm:ssZ   2018-04-01 00:00:00+0700
M              yyyy-MM-dd HH:mm:ssZ   2018-09-01 00:00:00+0700
w              yyyy-MM-dd HH:mm:ssZ   2018-09-09 00:00:00+0700
d              yyyy-MM-dd HH:mm:ssZ   2018-09-13 00:00:00+0700
h              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:00:00+0700
m              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:45:00+0700
s              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:45:34+0700
y!             yyyy                   2018
q!             yyyy-MM                2018-04
M!             yyyy-MM                2018-09
w!             yyyy-MM-dd             2018-09-09
d!             yyyy-MM-dd             2018-09-13
h!             yyyy-MM-dd HH          2018-09-13 16
m!             yyyy-MM-dd HH:mm       2018-09-13 16:45
s!             yyyy-MM-dd HH:mm:ss    2018-09-13 16:45:34
-- Before
TD_TIME_FORMAT(
  TD_DATE_TRUNC('day', time),
  'yyyy-MM-dd')
-- After
TD_TIME_STRING(time, 'd!') day,
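The truncated ('!') formats in the table map cleanly onto ordinary date formatting. As an illustration of the semantics only (not TD's implementation), here is a Python sketch of a subset of the table; the helper name `time_string` is hypothetical, and the quarter ('q!') and week ('w!') formats are omitted because they need extra truncation logic:

```python
from datetime import datetime

# Hypothetical helper mirroring a subset of TD_TIME_STRING's '!' formats.
TRUNCATED_FORMATS = {
    'y!': '%Y',
    'M!': '%Y-%m',
    'd!': '%Y-%m-%d',
    'h!': '%Y-%m-%d %H',
    'm!': '%Y-%m-%d %H:%M',
    's!': '%Y-%m-%d %H:%M:%S',
}

def time_string(ts: datetime, fmt: str) -> str:
    """Render ts the way TD_TIME_STRING(time, fmt) does for '!' formats."""
    return ts.strftime(TRUNCATED_FORMATS[fmt])

ts = datetime(2018, 9, 13, 16, 45, 34)
print(time_string(ts, 'd!'))  # 2018-09-13
print(time_string(ts, 'm!'))  # 2018-09-13 16:45
```

The outputs match the table rows for 'd!' and 'm!' above.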
New Feature of Presto
TD_INTERVAL() UDF
Efficient way to specify a date range in WHERE
• There are many complicated techniques for selecting a specific range
-- cover the 6 months of data up to today. 156 = 31*5 + 1
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
-- cover from the beginning of the day until now
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
Efficient way to specify a date range in WHERE
• The TD_INTERVAL() UDF makes this easier
-- BEFORE
-- cover the 6 months of data up to today. 156 = 31*5 + 1
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
-- AFTER
-- the same range can be specified with one short UDF call
TD_INTERVAL(time, '-6M/0d')
Efficient way to specify a date range in WHERE
• The TD_INTERVAL() UDF makes this easier
-- BEFORE
-- cover from the beginning of the day until now
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
-- AFTER
-- the same range can be specified with one short UDF call
TD_INTERVAL(time, '-1d')
Efficient way to specify a date range in WHERE
-- Here is an example where the query start time is 2018-08-14 01:23:45 (Tue, UTC)
# The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h')
# From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45)
SELECT ... WHERE TD_INTERVAL(time, '-1h/now')
# The last hour before the beginning of today [2018-08-13 23:00:00, 2018-08-14 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h/0d')
• After the slash, you can specify the boundary of the range.
Efficient way to specify a date range in WHERE
-- Here is an example where the query start time is 2018-08-14 01:23:45 (Tue, UTC)
# The last 7 days before 2015-12-25 [2015-12-18 00:00:00, 2015-12-25 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25')
# The last 10 days before the beginning of the last month [2018-06-21 00:00:00, 2018-07-01 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M')
• After the slash, you can specify the boundary of the range.
• It also works nicely with ${session_date} when using Digdag.
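Note that every TD_INTERVAL range above is written half-open: the start timestamp is included and the end timestamp is excluded. A minimal Python sketch of that membership rule, using the documented '-1h' range above (the `td_interval_contains` helper is hypothetical, for illustration only):

```python
from datetime import datetime

def td_interval_contains(t: datetime, start: datetime, end: datetime) -> bool:
    """Half-open membership test: start <= t < end, as TD_INTERVAL selects."""
    return start <= t < end

# Documented range for TD_INTERVAL(time, '-1h') at query time 2018-08-14 01:23:45 UTC
start = datetime(2018, 8, 14, 0, 0, 0)
end = datetime(2018, 8, 14, 1, 0, 0)

print(td_interval_contains(start, start, end))  # True  (lower bound included)
print(td_interval_contains(end, start, end))    # False (upper bound excluded)
```

The half-open convention means adjacent intervals (e.g. consecutive hourly runs) never double-count a row on the boundary.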
Tips on handling time ranges
-- It is recommended to test with a time_series table like this one
CREATE TABLE time_series AS
SELECT
  time,
  TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date
FROM (
  SELECT times
  FROM (
    VALUES
      SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60)
  ) AS x (times)
) t1
CROSS JOIN UNNEST(times) AS t (time)
ORDER BY time
https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
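The same hourly series can be sketched in Python to sanity-check the boundaries, assuming the 60*60-second step and the inclusive SEQUENCE endpoints shown above (the `hourly_series` helper is hypothetical):

```python
from datetime import datetime, timedelta

def hourly_series(start: datetime, end: datetime) -> list:
    """Hourly timestamps from start to end inclusive, like SEQUENCE(..., 60*60)."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += timedelta(hours=1)
    return times

series = hourly_series(datetime(2018, 1, 1), datetime(2018, 12, 31))
print(len(series))  # 8737 = 364 days * 24 hours + 1 inclusive endpoint
print(series[0], series[-1])
```

Having one row per hour for the whole year makes it easy to eyeball exactly which rows a given TD_INTERVAL expression selects.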
Let's enjoy data engineering work with Digdag!
And please feel free to talk to me.
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
‫תודה‬