Balancing the Complexity of Data Pipeline
Engineering: A Technological Landscape
Where Human Expertise Meets Large
Language Models
Dr. Rim Moussa, Eng. School of Carthage, University of Carthage
Pr. Tarek Bejaoui, Faculty of Sciences Bizerta, University of Carthage
The 11th International Symposium on Networks, Computers and Communications (ISNCC'24) @ Washington
D.C., USA
22 - 25 October 2024
Outline
1 Context and Motivations
2 Objectives and Solution
3 Data Pipeline Engineering: Use Case: Inferring Aircraft Flights in Crowdsourced Networks
4 Review of 5 AI Assistants
5 Related Work
6 Conclusion and Future Work
Data to Insights pipelines?
● “Data pipelines are sets of processes that move and transform data from various sources
to a destination (data warehouse, data lake, data lakehouse), where new value can be
derived.” James Densmore, 2021
● Data pipelines consist of several tasks or actions that need to be executed to achieve a
desired result
● A data pipeline is represented as a directed acyclic graph (DAG)
● A pipeline does not merely EXTRACT data and LOAD it INTO a data store: raw data is refined, i.e. cleaned,
structured, normalized, combined, aggregated, and sometimes anonymized.
● Companies are becoming more data driven
● Paradigm shift in implementing data pipelines
○ code-to-data with big data frameworks
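As a minimal sketch of the DAG view described above, the following Python snippet models a pipeline as a mapping from each task to the tasks it depends on, and derives one valid execution order with the standard library's `graphlib`. The task names are hypothetical and are not taken from the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks that must finish first.
PIPELINE = {
    "extract_logs": set(),
    "extract_airports": set(),
    "clean_logs": {"extract_logs"},
    "join_logs_airports": {"clean_logs", "extract_airports"},
    "aggregate_flights": {"join_logs_airports"},
    "load_flights": {"aggregate_flights"},
}


def execution_order(dag):
    """Return one valid task order; graphlib raises CycleError if the graph is not acyclic."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
```

Any order returned this way runs every task after its dependencies, which is the property a scheduler relies on when it executes the pipeline.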
Data Warehouses (1980s), Data Lakes (2000), Lakehouses (2020)
“Catch-all” repositories
Armbrust et al., CIDR’2021
Get valuable insights from big data
Data Prep is time-consuming!
Source: CrowdFlower, 2015
Data to insights pipeline. Data science pipelines are often
complex with several stages, each with many participants.
One team prepares the data, sourced from heterogeneous
data sources in data lakes. Another team builds models on
the data. Finally, end users access the data and models
through interactive dashboards. The database community
needs to develop simple and efficient tools that support
building and maintaining data pipelines. Data scientists
repeatedly say that data cleaning, integration, and
transformation together consume 80%-90% of their
time. These are problems the database community has
experienced in the context of enterprise data for decades.
However, much of our past efforts focused on solving
algorithmic challenges for important “point problems,” such
as schema mapping and entity resolution. Moving forward,
we must adapt our community’s expertise in data cleaning,
integration, and transformation to aid the iterative
end-to-end development of the data-to-insights pipeline.
Source: The Seattle Database Report, 2022 [1]
Objectives
● showcase a complex data pipeline
■ optimized with human expertise
■ implemented using big data frameworks
● Review of 5 Conversational AI assistants
USE CASE: INFERRING AIRCRAFT TRIPS IN CROWDSOURCED NETWORKS
● Multiple groups promote the transformation of aviation into a cleaner, safer, more efficient, and more predictable system, such as
○ The OpenSky Network
○ The European Commission's High Level Group on Aviation Research: European Aviation Vision 2050
○ The Next Generation Air Transportation System (NextGen) in the USA
● The OpenSky Network
○ over 6,000 sensors
○ open aircraft logs
Data Sources: flight logs
● Positional data: latitude, longitude, geoaltitude, baroaltitude
● Speed data: velocity, vertical-rate
● Dynamics data: heading
● Operational data: alert, spi-rate
● Time data: osn_ts, last-contact, …
● Aircraft data: icao24
Data Sources: airport data
● Positional data: lat_decimal, lon_decimal, altitude, …, city, country
● Airport data: id, name, iata_code, icao_code
Data pipeline
● Data sources (big data)
○ multiple sources (e.g., OpenSky Network, airports dataset)
○ data at rest (airports) and data in motion (flight logs)
● Data cleansing
○ keep only entries with valid positional data, speed data, ...
● Correlate datasets using complex operations
○ build spatial indexes on batches
○ combine with spatial join
○ combine with outer join
○ prune with filtering
● Inferred flight data either
○ needs no further processing,
○ or requires a merge with flight data inferred from previous batches
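The cleansing, indexing, and join stages listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation (which the slides say uses big data frameworks): the grid cell size, the distance and altitude thresholds, and the function names are all hypothetical, altitudes are assumed to share one unit, and the grid does not handle the antimeridian or polar regions. Column names follow the slides ('lat', 'lon', 'geoaltitude', 'lat_decimal', 'lon_decimal', 'altitude', 'id').

```python
import math
from collections import defaultdict

CELL_DEG = 0.5          # assumed grid cell size, in degrees
MAX_DIST_KM = 5.0       # assumed radius for "near an airport"
MAX_ALT_DIFF = 300.0    # assumed altitude tolerance (same unit in both datasets)
UNKNOWN_AIRPORT = -1    # convention used in the prompt for an unknown airport


def is_valid(entry):
    """Data cleansing: keep entries whose positional data is present and in range."""
    lat, lon, alt = entry.get("lat"), entry.get("lon"), entry.get("geoaltitude")
    if lat is None or lon is None or alt is None:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def cell(lat, lon):
    """Map a coordinate to its grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(airports):
    """Spatial index: bucket airports by grid cell."""
    index = defaultdict(list)
    for airport in airports:
        index[cell(airport["lat_decimal"], airport["lon_decimal"])].append(airport)
    return index


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_airport(entry, index):
    """Spatial join for one log entry; returns UNKNOWN_AIRPORT when nothing matches."""
    ci, cj = cell(entry["lat"], entry["lon"])
    best_id, best_dist = UNKNOWN_AIRPORT, MAX_DIST_KM
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for airport in index.get((ci + di, cj + dj), []):
                # Prune with a cheap altitude predicate before computing distances.
                if abs(entry["geoaltitude"] - airport["altitude"]) > MAX_ALT_DIFF:
                    continue
                dist = haversine_km(entry["lat"], entry["lon"],
                                    airport["lat_decimal"], airport["lon_decimal"])
                if dist <= best_dist:
                    best_id, best_dist = airport["id"], dist
    return best_id


def join_logs(entries, airports):
    """Outer-join flavour: every valid entry is kept, matched or not."""
    index = build_index(airports)
    return [(e, nearest_airport(e, index)) for e in entries if is_valid(e)]
```

Only the airports in the nine grid cells around a log entry are examined, so the cost per entry no longer grows with the total number of airports.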
Flight
● aircraft identifier
● departure airport
● destination airport
● departure timestamp
● arrival timestamp
● trajectory data
● speed data
● operational data
● dynamics data
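One way to hold such an inferred flight in code is a small record type. The sketch below is an assumption about representation only: the field names and the tuple layouts are hypothetical, and -1 is used for an unknown airport, following the convention stated in Prompt #1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

UNKNOWN_AIRPORT = -1  # unknown departure or arrival airport


@dataclass
class Flight:
    icao24: str                                   # aircraft identifier
    departure_airport: int = UNKNOWN_AIRPORT
    destination_airport: int = UNKNOWN_AIRPORT
    departure_ts: int = 0                         # assumed Unix timestamp, seconds
    arrival_ts: int = 0                           # assumed Unix timestamp, seconds
    # Assumed layouts: (lon, lat, altitude) and (velocity, vertical rate).
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)
    speed: List[Tuple[float, float]] = field(default_factory=list)
    operational: List[Dict[str, bool]] = field(default_factory=list)
    headings: List[float] = field(default_factory=list)  # dynamics data

    @property
    def duration_s(self) -> int:
        """Duration as arrival timestamp minus departure timestamp."""
        return self.arrival_ts - self.departure_ts
```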
Conversational AI assistants
● ChatGPT (OpenAI)
● Llama-3 (Meta)
● QWEN2 (Alibaba)
● Gemma2 (Google)
● Mistral-Nemo (Mistral AI with Nvidia)
Prompt #1
Attached are two CSV files, in Google Drive link [...], in Dropbox link [...]
The first file "airports.csv" is a dataset of airports. Each airport is identified by an id, is located in a country, and is positioned in a 3D reference system given its decimal longitude ('lon_decimal' column), decimal latitude ('lat_decimal' column), and its altitude ('altitude' column).
The second file "logs.csv" is an extract of logs captured by the OpenSky Network during one day. Each entry denotes the position of an aircraft, identified by the column 'icao24', in a 3D reference system given latitude ('lat' column), longitude ('lon' column) and the 'geoaltitude' column.
We want to infer the details of the flight(s) performed by each aircraft: determine the departure airport (takeoff event), the arrival airport (landing event), the first recorded timestamp, the last recorded timestamp, and the duration, calculated as the last recorded timestamp minus the first recorded timestamp.
Notice that for some flights, the departure airport and/or the arrival airport are unknown; consequently, we could only extract a part of the trajectory. There are four types of inferred flights:
_type 0: a flight such that the departure airport is unknown, and the arrival airport is known
_type 1: a flight such that both departure and arrival airports are known
_type 2: a flight such that the departure airport is known, and the arrival airport is unknown
_type 3: a flight such that both departure and arrival airports are unknown.
-1 denotes an unknown airport, either for departure or arrival.
Could you propose a solution using [....], incorporating the inferred flights derived from the shared datasets?
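The flight typology and the duration rule in this prompt can be sketched as a small classifier. This assumes that type 3 is meant to denote a flight whose departure and arrival airports are both unknown, and that timestamps are numeric; the function and key names are hypothetical.

```python
UNKNOWN = -1  # the prompt's marker for an unknown airport


def flight_type(departure, arrival):
    """Return the flight type (0 to 3) from the two airport ids."""
    dep_known = departure != UNKNOWN
    arr_known = arrival != UNKNOWN
    if dep_known and arr_known:
        return 1
    if arr_known:
        return 0      # departure unknown, arrival known
    if dep_known:
        return 2      # departure known, arrival unknown
    return 3          # assumed: both unknown


def summarize(icao24, departure, arrival, timestamps):
    """Build the per-flight details requested in the prompt."""
    first_ts, last_ts = min(timestamps), max(timestamps)
    return {
        "icao24": icao24,
        "departure": departure,
        "arrival": arrival,
        "first_ts": first_ts,
        "last_ts": last_ts,
        "duration": last_ts - first_ts,   # last minus first recorded timestamp
        "type": flight_type(departure, arrival),
    }
```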
Prompt #2
Attached is a PSV file containing the flight trajectories of an aircraft. Each trajectory is represented in WKT format.
Could you visualize these trajectories in a 3D reference system and on a map using Folium, and then share the
resulting plots online?
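Before any plotting, the trajectories have to be read from the pipe-separated file and parsed from WKT. The sketch below covers only that step with the standard library. It assumes each trajectory is a `LINESTRING` or `LINESTRING Z` and that the file has a header row; the column names `flight_id` and `trajectory` are hypothetical, since the slides do not give the file layout.

```python
import csv
import io
import re

# Matches "LINESTRING (...)" and "LINESTRING Z (...)"; other WKT types are rejected.
_LINESTRING = re.compile(r"^\s*LINESTRING\s*(?:Z\s*)?\((.*)\)\s*$",
                         re.IGNORECASE | re.DOTALL)


def parse_linestring(wkt):
    """Return the points of a WKT linestring as tuples of floats (x, y[, z])."""
    match = _LINESTRING.match(wkt)
    if match is None:
        raise ValueError("not a WKT LINESTRING: %r" % wkt[:40])
    return [tuple(float(v) for v in chunk.split())
            for chunk in match.group(1).split(",")]


def read_trajectories(psv_text, wkt_column="trajectory"):
    """Parse every trajectory of a pipe-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(psv_text), delimiter="|")
    return [parse_linestring(row[wkt_column]) for row in reader]


def to_lat_lon(points):
    """WKT stores x y (lon lat); map libraries usually expect (lat, lon)."""
    return [(p[1], p[0]) for p in points]
```

The (lat, lon) pairs returned by `to_lat_lon` are in the order that Folium's `PolyLine` expects for its locations, while the third coordinate can feed a 3D plot.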
Review of our interactions
1 Communication and Data Access
● support access to public cloud storage services such as Google Drive, Dropbox, and
GitHub for uploading files,
● accept voice and text prompts.
2 Clarifications
● ask for clarifications before providing a response,
● or automatically generate a response based on their own assumptions.
3 Results
● code snippets may be presented in stages or as a single script, with or without explanations,
● some assistants can generate and run the code on their own cloud resources, providing output plots
or other results,
● if no results are delivered, the assistant may explain the need for further refinement,
● previous prompts and answers are stored
○ e.g. ChatGPT: today, yesterday, previous 7 days, previous 30 days, September, August,
…January, 2023, ..
4 Feedback
● propose multiple solutions, and ask the user to test them and select the
most appropriate one,
● ask the user to rate a given solution.
5 Recommendations
● refine the resulting code for more accurate and robust flight inference, and optimize
performance,
● consider using a more advanced LLM release,
● or caution the user against using the code as-is.
Optimizations
● The assistants' responses generally neither implement nor recommend optimizations such as:
○ indexing geospatial data before joining datasets;
○ filtering log entries with predicates (e.g., aircraft altitude is close to
airport altitude) rather than computing a cross product with all airports;
○ handling multiple flights performed by the same aircraft on the same day.
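The last point can be illustrated with a simple segmentation rule: split one aircraft's log whenever two consecutive entries are separated by a long silence. This heuristic and its 30-minute threshold are assumptions made for illustration, not the method used in the presented pipeline; `osn_ts` is assumed to be a Unix timestamp in seconds.

```python
GAP_SECONDS = 30 * 60  # assumed threshold: a 30-minute silence starts a new flight


def split_into_flights(entries, gap_seconds=GAP_SECONDS):
    """Group one aircraft's log entries into flights, ordered by 'osn_ts'."""
    flights = []
    for entry in sorted(entries, key=lambda e: e["osn_ts"]):
        if flights and entry["osn_ts"] - flights[-1][-1]["osn_ts"] <= gap_seconds:
            flights[-1].append(entry)    # the current flight continues
        else:
            flights.append([entry])      # first entry, or the gap is too long
    return flights
```

Each resulting segment can then be matched against airports separately, so that an aircraft flying several legs in one day yields several inferred flights instead of one.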
Related Work
● Description languages [3]
● Data quality [4]–[5]
● Frameworks: Apache Airflow, Dagster, ...
● Implementation technologies
○ Apache Hadoop (Pig Latin), Apache Spark, Nvidia RAPIDS NVTabular, …
● AutoETL [10]: generates pre-processing pipelines.
● Auto-Pipeline [11]: synthesizes pipelines using deep reinforcement learning.
● Pipemizer [13]: improves the performance of queries or jobs in pipelines at
Microsoft.
● LLMs: [17] and [18] aim, respectively, to combine human expertise
with LLM-driven automation and to achieve a favorable cost-optimization balance in
data pipeline engineering.
● Benchmarking
○ keep the pipeline cost-effective, and manage resources such as storage,
compute power, and network bandwidth.
Conclusion and Future Work
● Design and implementation of a complex data pipeline related
to air traffic
● Review of 5 conversational AI assistants
● Future work
○ How to train an LLM to address complex data pipelines, considering
broad domain applications and computation and storage optimizations
○ Use the inferred data for analytical purposes and for benchmarking
OLAP/ML models
■ Analysis of aircraft trajectories
■ Fuel savings and reduction of CO2 emissions
References
[1] D. Abadi, A. Ailamaki, D. G. Andersen, P. Bailis, M. Balazinska, P. A. Bernstein, P. A. Boncz, S. Chaudhuri, A. Cheung, A. Doan, L. Dong, M. J. Franklin, J. Freire, A. Y. Halevy, J. M. Hellerstein, S. Idreos, D. Kossmann, T. Kraska, S. Krishnamurthy, V. Markl, S. Melnik, T. Milo, C. Mohan, T. Neumann, B. C. Ooi, F. Ozcan, J. M. Patel, A. Pavlo, R. A. Popa, R. Ramakrishnan, C. Ré, M. Stonebraker, and D. Suciu, “The Seattle report on database research,” Commun. ACM, vol. 65, no. 8, pp. 72–79, 2022.
[2] M. Schäfer, V. Lenders, and I. Martinovic, “OpenSky Network: Open Air Traffic Data for Research,” https://opensky-network.org/, online; accessed 10 August 2024.
[3] C. Nielsen, Z. Su, and G. Indiveri, “Yak: An asynchronous bundled data pipeline description language,” in 28th IEEE International Symposium on Asynchronous Circuits and Systems, ASYNC 2023, Beijing, China, July 16-19, 2023. IEEE, 2023, pp. 34–41.
[4] H. Foidl, V. Golendukhina, R. Ramler, and M. Felderer, “Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers,” J. Syst. Softw., vol. 207, p. 111855, 2024.
[5] F. J. de Haro-Olmo, Á. Valencia-Parra, Á. J. Varela-Vaca, J. A. Álvarez-Bermejo, and M. T. Gómez-López, “ELI: an IoT-aware big data pipeline with data curation and data quality,” PeerJ Comput. Sci., vol. 9, p. e1605, 2023. [Online]. Available: https://doi.org/10.7717/peerj-cs.1605
[6] P. Maymounkov, “Koji: Automating pipelines with mixed-semantics data sources,” CoRR, vol. abs/1901.01908, 2019. [Online]. Available: http://arxiv.org/abs/1901.01908
[7] S. Redyuk, Z. Kaoudi, S. Schelter, and V. Markl, “DORIAN in action: Assisted design of data science pipelines,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3714–3717, 2022.
[8] G. Vargas-Solar, K. Belhajjame, J. Espinosa-Oviedo, S. Negrete-Yankelevich, and J. Zechinelli-Martini, “MATILDA: inclusive data science pipelines design through computational creativity,” in Proceedings of the Workshops of the EDBT/ICDT Joint Conference, vol. 3651, 2024. [Online]. Available: https://ceur-ws.org/Vol-3651/DARLI-AP-11.pdf
[9] Z. Liu, T. Hoang, J. Zhang, M. Zhu, T. Lan, S. Kokane, J. Tan, W. Yao, Z. Liu, Y. Feng, R. Murthy, L. Yang, S. Savarese, J. C. Niebles, H. Wang, S. Heinecke, and C. Xiong, “Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets,” CoRR, vol. abs/2406.18518, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.18518
[10] J. Giovanelli, B. Bilalli, and A. Abelló, “Data preprocessing pipeline generation for AutoETL,” Inf. Syst., vol. 108, p. 101957, 2022.
[11] J. Yang, Y. He, and S. Chaudhuri, “Auto-Pipeline: Synthesize data pipelines by-target using reinforcement learning and search,” CoRR, vol. abs/2106.13861, 2021. [Online]. Available: https://arxiv.org/abs/2106.13861
[12] Z. Miao, “Simplifying human-in-the-loop data science pipeline: Explanations, debugging, and data preparation,” Ph.D. dissertation, Duke University, Durham, NC, USA, 2022. [Online]. Available: https://hdl.handle.net/10161/26796
[13] S. Gakhar, J. Cahoon, W. Le, X. Li, K. Ravichandran, H. Patel, M. T. Friedman, B. Haynes, S. Qiao, A. Jindal, and J. Leeka, “Pipemizer: An optimizer for analytics data pipelines,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3710–3713, 2022.
[14] M. Dareck, C. Edelstenne, T. Enders, E. Fernandez, J.-P. Herteman, M. Kerkloh, I. King, P. Ky, M. Mathieu, G. Orsi, G. Schotman, C. Smith, and J.-D. Worner, “FlightPath 2050: Europe’s Vision for Aviation - Maintaining Global Leadership and Serving Society’s Needs,” http://www.sesarju.eu/, 2010, online; accessed 10 August 2024.
[15] European Union, EuroControl, and SESAR, “The DART Project: Data-Driven Aircraft Trajectory Prediction Research,” http://dart-research.eu/, online; accessed 10 August 2024.
[16] US NextGen, “Modernization of United States Airspace,” https://www.faa.gov/nextgen/, 2019, online; accessed 10 August 2024.
[17] A. Remadi, K. E. Hage, Y. Hobeika, and F. Bugiotti, “To prompt or not to prompt: Navigating the use of large language models for integrating and modeling heterogeneous data,” Data Knowl. Eng., vol. 152, p. 102313, 2024.
[18] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré, “Language models enable simple systems for generating structured views of heterogeneous data lakes,” Proc. VLDB Endow., vol. 17, no. 2, pp. 92–105, 2023.
Thank you for your Attention
Q&A
