data pipelines complexity human expertise and LLM era

Balancing the Complexity of Data Pipeline Engineering: A Technological Landscape Where Human Expertise Meets Large Language Models
Dr. Rim Moussa, Eng. School of Carthage, University of Carthage
Pr. Tarek Bejaoui, Faculty of Sciences Bizerta, University of Carthage
The 11th International Symposium on Networks, Computers and Communications (ISNCC'24) @ Washington D.C., USA
22 - 25 October 2024

Outline
1 Context and Motivations
2 Objectives and Solution
3 Data Pipeline Engineering: Use Case Infer Aircraft Flights in Crowdsourced Networks
4 Review of 5 AI assistants
5 Related Work
6 Conclusion and Future Work

Data to Insights pipelines?
● “Data pipelines are sets of processes that move and transform data from various sources
to a destination (data warehouse, data lake, data lakehouse), where new value can be
derived.” James Densmore, 2021
● Data pipelines consist of several tasks or actions that need to be executed to achieve a
desired result
● A data pipeline is represented as a Directed Acyclic Graph (DAG)
● A pipeline does NOT merely EXTRACT data and LOAD them INTO a data store: raw data is refined, i.e., cleaned, structured, normalized, combined, aggregated, and sometimes anonymized.
● Companies are becoming more data driven
● Paradigm shift in implementing data pipelines
○ code-to-data with big data frameworks
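
The DAG view of a pipeline can be illustrated with a minimal Python sketch. The task names below are hypothetical and only loosely follow the use case presented later; they are not taken from the authors' implementation. The standard-library graphlib module derives an execution order in which every task runs after its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "extract_logs": set(),
    "extract_airports": set(),
    "clean_logs": {"extract_logs"},
    "join_logs_airports": {"clean_logs", "extract_airports"},
    "infer_flights": {"join_logs_airports"},
}


def execution_order(dag):
    """Return one valid execution order of the tasks.

    graphlib raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Several orders may be valid; the only guarantee is that a task never precedes one of its dependencies.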

Data Warehouses (80’s), Data lakes (2000), Lakehouses(2020)
“Catch-all” repositories
Armbrust et al., CIDR’2021

Get valuable insights from big data

Data Prep is time-consuming!
Source: CrowdFlower, 2015
Data to insights pipeline. Data science pipelines are often
complex with several stages, each with many participants.
One team prepares the data, sourced from heterogeneous
data sources in data lakes. Another team builds models on
the data. Finally, end users access the data and models
through interactive dashboards. The database community
needs to develop simple and efficient tools that support
building and maintaining data pipelines. Data scientists
repeatedly say that data cleaning, integration, and
transformation together consume 80%-90% of their
time. These are problems the database community has
experienced in the context of enterprise data for decades.
However, much of our past efforts focused on solving
algorithmic challenges for important “point problems,” such
as schema mapping and entity resolution. Moving forward,
we must adapt our community’s expertise in data cleaning,
integration, and transformation to aid the iterative
end-to-end development of the data-to-insights pipeline.
Source: The Seattle Database Report, 2022 [1]

Objectives
● showcase a complex data pipeline
■ optimized with human expertise
■ implemented using big data frameworks
● Review of 5 Conversational AI assistants

USE CASE: INFERRING AIRCRAFT TRIPS IN CROWDSOURCED NETWORKS
● Multiple groups promote the transformation of aviation into a cleaner, safer, more efficient, and more predictable system, such as
○ The OpenSky Network
○ The High Level Group on Aviation Research, European Commission: European Aviation Vision 2050
○ The Next Generation Air Transportation System (NextGen) in the USA
● The OpenSky Network
○ over 6,000 sensors
○ open aircraft logs

Data Sources: flights’ logs
● Positional data: latitude, longitude, geoaltitude, baroaltitude
● Speed data: velocity, vertical-rate
● Dynamics’ data: heading
● Operational data: alert, spi-rate
● Time data: osn_ts, last-contact, …
● Aircraft data: icao24

Data Sources: airports’ data
● Positional data: lat_decimal, lon_decimal, altitude, …, city, country
● Airport data: id, name, iata_code, icao_code

Data pipeline
● Data sources (big data)
○ multiple sources (e.g., OpenSky Network, Airports dataset)
○ data at rest (airports) and data in motion (flight logs)
● Data cleansing
○ valid positional data, speed data, ...
● Correlate datasets using complex operations
○ build spatial indexes on batches
○ combine with spatial join
○ combine with outer join
○ prune with filtering
● Inferred flight data
○ no need for further processing
○ require merge with previously inferred flight data from previous batches
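
The correlation step (spatial index, spatial join, filtering) can be sketched in plain Python. This is a minimal illustration, not the authors' big data implementation: the uniform grid index, the cell size, the distance and altitude thresholds, the sample coordinates, and the assumption that altitudes are in metres are all hypothetical. Column names follow the data sources listed earlier.

```python
import math
from collections import defaultdict

CELL_DEG = 0.5  # grid cell size in degrees (illustrative value)


def cell(lat, lon):
    """Map a coordinate to the integer grid cell that contains it."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(airports):
    """Spatial index: group airports by grid cell."""
    index = defaultdict(list)
    for a in airports:
        index[cell(a["lat_decimal"], a["lon_decimal"])].append(a)
    return index


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_airport(entry, index, max_km=10.0, max_alt_diff_m=300.0):
    """Return the id of the closest plausible airport, or -1 if there is none.

    Only the 3x3 block of cells around the entry is probed, which is adequate
    away from the poles and the antimeridian for the thresholds used here.
    """
    ci, cj = cell(entry["lat"], entry["lon"])
    best_id, best_d = -1, max_km
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for a in index.get((ci + di, cj + dj), []):
                # Filtering predicate: aircraft altitude close to airport altitude.
                if abs(entry["geoaltitude"] - a["altitude"]) > max_alt_diff_m:
                    continue
                d = haversine_km(entry["lat"], entry["lon"],
                                 a["lat_decimal"], a["lon_decimal"])
                if d <= best_d:
                    best_id, best_d = a["id"], d
    return best_id


# Hypothetical sample data.
airports = [
    {"id": 1, "lat_decimal": 36.851, "lon_decimal": 10.227, "altitude": 7.0},
    {"id": 2, "lat_decimal": 38.944, "lon_decimal": -77.456, "altitude": 95.0},
]
index = build_index(airports)
low = {"icao24": "abc123", "lat": 36.86, "lon": 10.23, "geoaltitude": 50.0}
high = {"icao24": "abc123", "lat": 36.851, "lon": 10.227, "geoaltitude": 9000.0}
print(nearest_airport(low, index), nearest_airport(high, index))
```

Each log entry is compared only with airports in neighbouring cells, instead of with every airport as a cross product would do.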

Flight
● aircraft identifier
● departure airport
● destination airport
● departure timestamp
● arrival timestamp
● trajectory data
● speed data
● operational data
● dynamics data
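
A flight record with these attributes, and the merge of partial flights across batches, could be modelled as follows. This is a sketch under assumptions: the field names, the use of epoch seconds for timestamps, and the merge rule are hypothetical, and speed, operational, and dynamics data are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

UNKNOWN_AIRPORT = -1  # -1 denotes an unknown airport


@dataclass
class Flight:
    icao24: str                 # aircraft identifier
    departure_airport: int      # airport id, or UNKNOWN_AIRPORT
    destination_airport: int    # airport id, or UNKNOWN_AIRPORT
    departure_ts: int           # first recorded timestamp (assumed epoch seconds)
    arrival_ts: int             # last recorded timestamp (assumed epoch seconds)
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)

    @property
    def duration_s(self) -> int:
        """Duration as last recorded timestamp minus first recorded timestamp."""
        return self.arrival_ts - self.departure_ts


def merge_partial(earlier: Flight, later: Flight) -> Flight:
    """Merge a partial flight from a previous batch with its continuation."""
    if earlier.icao24 != later.icao24:
        raise ValueError("cannot merge flights of different aircraft")
    return Flight(
        icao24=earlier.icao24,
        departure_airport=earlier.departure_airport,
        destination_airport=later.destination_airport,
        departure_ts=earlier.departure_ts,
        arrival_ts=later.arrival_ts,
        trajectory=earlier.trajectory + later.trajectory,
    )


# Hypothetical partial flights inferred from two consecutive batches.
first_part = Flight("abc123", 1, UNKNOWN_AIRPORT, 1000, 1600)
second_part = Flight("abc123", UNKNOWN_AIRPORT, 2, 1700, 4000)
merged = merge_partial(first_part, second_part)
print(merged.departure_airport, merged.destination_airport, merged.duration_s)
```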

Conversational AI assistants
● ChatGPT (OpenAI)
● Llama-3 (Meta)
● QWEN2 (Alibaba)
● Gemma2 (Google)
● Mistral-Nemo (Nvidia)

Prompt #1
Attached two csv files in google drive link [...], in dropbox link [...]
The first file "airports.csv" is a dataset of airports. Each airport is identified by an id, is located in a country, and each airport is located in a 3d reference system given its decimal longitude ('lon_decimal' column), decimal latitude ('lat_decimal' column), and its altitude ('altitude' column).
The second file "logs.csv" is an extract of logs captured by the open sky network during one day. Each entry denotes the position of an aircraft, identified by the column 'icao24', in a 3d reference system given latitude ('lat' column), longitude ('lon' column) and 'geoaltitude' column.
We want to infer the flight(s) details performed by each aircraft, determine the departure airport (takeoff event), the arrival airport (landing event), the first recorded timestamp, the last recorded timestamp, the duration calculated as last recorded timestamp minus first recorded timestamp.
Notice that for some flights, the departure airport and/or the arrival airport are unknown, consequently we could only extract a part of the trajectory. There are four types of inferred flights:
_type 0: a flight such that the departure airport is unknown, and the arrival airport is known
_type 1: a flight such that both departure and arrival airports are known
_type 2: a flight such that the departure airport is known, and the arrival airport is unknown
_type 3: a flight such that both departure and arrival airports are unknown.
-1 denotes an unknown airport either for departure or arrival.
Could you propose a solution using [....], incorporating the inferred flights derived from the shared datasets?
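
The four flight types of the prompt can be expressed as a small classification function. This sketch assumes that type 3 covers the remaining case in which both airports are unknown; the function name is hypothetical.

```python
UNKNOWN = -1  # -1 denotes an unknown airport, as stated in the prompt


def flight_type(departure: int, arrival: int) -> int:
    """Classify an inferred flight by which of its airports are known."""
    dep_known = departure != UNKNOWN
    arr_known = arrival != UNKNOWN
    if dep_known and arr_known:
        return 1  # both airports known
    if dep_known:
        return 2  # departure known, arrival unknown
    if arr_known:
        return 0  # departure unknown, arrival known
    return 3      # assumed: both airports unknown


print([flight_type(d, a) for d, a in [(-1, 7), (3, 7), (3, -1), (-1, -1)]])
```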

Prompt #2
Attached is a PSV file containing the flight trajectories of an aircraft. Each trajectory is represented in WKT format. Could you visualize these trajectories in a 3D reference system and on a map using Folium, and then share the resulting plots online?
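
A minimal sketch of what such a solution might look like is given below. It assumes each trajectory is a WKT "LINESTRING Z (lon lat alt, ...)" geometry, which the prompt does not specify; the function names and the output file name are hypothetical. Folium is a third-party package and is imported only inside the plotting function.

```python
import re


def parse_linestring_z(wkt: str):
    """Parse 'LINESTRING Z (x y z, ...)' into a list of (lon, lat, alt) tuples."""
    match = re.fullmatch(r"\s*LINESTRING\s*Z?\s*\((.*)\)\s*", wkt,
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("not a WKT LINESTRING")
    points = []
    for token in match.group(1).split(","):
        x, y, z = (float(v) for v in token.split())
        points.append((x, y, z))
    return points


def to_folium_map(points, out_html="trajectory.html"):
    """Draw the trajectory on a Folium map and save it as an HTML file."""
    import folium  # third-party; requires `pip install folium`

    # WKT stores x y (lon lat); Folium expects (lat, lon).
    latlon = [(lat, lon) for lon, lat, _ in points]
    fmap = folium.Map(location=list(latlon[0]), zoom_start=7)
    folium.PolyLine(latlon).add_to(fmap)
    fmap.save(out_html)
    return out_html


# Hypothetical two-point trajectory.
pts = parse_linestring_z("LINESTRING Z (10.2 36.8 7, 10.3 36.9 500)")
print(pts)
```

Calling to_folium_map(pts) writes a standalone HTML page; swapping the coordinate order is the step that is easiest to get wrong.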

Review of our interactions
1 Communication and Data Access
● support access to public cloud storage services like Google Drive, Dropbox, and
GitHub to upload files,
● accept voice prompts, text prompts
2 Clarifications
● ask for clarifications before providing a response,
● or automatically generate a response based on their own assumptions.

3 Results
● the code snippets may be presented in stages or as a single script, with or without explanations
● some assistants can generate and run the code on their own cloud resources, providing output plots or other results.
● If no results are delivered, the engine may explain the need for further refinement :(
● store previous prompts and answers
○ e.g. ChatGPT: today, yesterday, previous 7 days, previous 30 days, September, August,
…January, 2023, ..

4 Feedback
● propose multiple solutions, and ask the user to test the solutions, and select the
most appropriate one,
● ask to rate a given solution.
5 Recommendations
● refine the resulting code for more accurate and robust flight inference, and optimize performance,
● consider using a more advanced LLM release,
● or caution the user against using the code as-is.
6 Optimizations
● The assistants' responses generally do not implement or recommend optimizations such as:
○ Indexing geospatial data before joining datasets;
○ Filtering log entries based on predicates (e.g., aircraft altitude is close to airport altitude) rather than computing a cross product with all airports;
○ Handling cases of multiple flights performed by the same aircraft on the same day.
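
Handling several flights by the same aircraft on one day requires splitting its log into segments. One possible heuristic, shown here as an assumption rather than as the authors' method, is to start a new segment whenever consecutive log entries are separated by a long time gap; the 30-minute threshold and the function name are hypothetical.

```python
def split_flights(timestamps, max_gap_s=1800):
    """Split sorted timestamps (seconds) into segments at gaps above max_gap_s."""
    segments = []
    current = []
    for ts in timestamps:
        # A long silence between two entries is taken as a flight boundary.
        if current and ts - current[-1] > max_gap_s:
            segments.append(current)
            current = []
        current.append(ts)
    if current:
        segments.append(current)
    return segments


# Hypothetical log of one aircraft: two bursts of entries separated by a long gap.
day_log = [0, 10, 20, 5000, 5010]
print(split_flights(day_log))
```

A gap can also come from missing sensor coverage, so in practice each segment would still need the takeoff and landing checks against nearby airports.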

Related Work
● Description Languages [3]
● Data Quality [4]–[5]
● Frameworks: Apache Airflow, Dagster, ...
● Implementation technologies
○ Apache Hadoop (Pig Latin), Apache Spark, Nvidia RAPIDS NVTabular, …
● AutoETL [10]: generates pre-processing pipelines.
● Auto-Pipeline [11]: synthesizes pipelines using deep reinforcement learning.
● Pipemizer [13]: improves the performance of queries or jobs in pipelines at Microsoft.
● LLMs: [17] and [18] respectively aim to combine human expertise with LLM-driven automation and to achieve a favorable cost-optimization balance in data pipeline engineering.
● Benchmarking
○ keep the pipeline cost-effective, and manage resources such as storage, compute power, and network bandwidth.

Conclusion and Future Work
● Design and implementation of a complex data pipeline related to air traffic
● Review of 5 Conversational AI assistants
● Work perspectives
○ How to train an LLM to address complex data pipelines, considering broad domain applications and computation and storage optimizations
○ Use the inferred data for analytical purposes and for benchmarking OLAP/ML models
■ Analysis of aircraft trajectories
■ Fuel savings and reduction of CO2 emissions

References
[1] D. Abadi, A. Ailamaki, D. G. Andersen, P. Bailis, M. Balazinska, P. A. Bernstein, P. A. Boncz, S. Chaudhuri, A. Cheung, A. Doan, L. Dong, M. J. Franklin, J. Freire, A. Y. Halevy, J. M. Hellerstein, S. Idreos, D. Kossmann, T. Kraska, S. Krishnamurthy, V. Markl, S. Melnik, T. Milo, C. Mohan, T. Neumann, B. C. Ooi, F. Ozcan, J. M. Patel, A. Pavlo, R. A. Popa, R. Ramakrishnan, C. Ré, M. Stonebraker, and D. Suciu, “The Seattle report on database research,” Commun. ACM, vol. 65, no. 8, pp. 72–79, 2022.
[2] M. Schäfer, V. Lenders, and I. Martinovic, “OpenSky Network: Open Air Traffic Data for Research,” https://opensky-network.org/, online; accessed 10 August 2024.
[3] C. Nielsen, Z. Su, and G. Indiveri, “Yak: An asynchronous bundled data pipeline description language,” in 28th IEEE International Symposium on Asynchronous Circuits and Systems, ASYNC 2023, Beijing, China, July 16-19, 2023. IEEE, 2023, pp. 34–41.
[4] H. Foidl, V. Golendukhina, R. Ramler, and M. Felderer, “Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers,” J. Syst. Softw., vol. 207, p. 111855, 2024.
[5] F. J. de Haro-Olmo, Á. Valencia-Parra, Á. J. Varela-Vaca, J. A. Álvarez-Bermejo, and M. T. Gómez-López, “ELI: an IoT-aware big data pipeline with data curation and data quality,” PeerJ Comput. Sci., vol. 9, p. e1605, 2023. [Online]. Available: https://doi.org/10.7717/peerj-cs.1605
[6] P. Maymounkov, “Koji: Automating pipelines with mixed-semantics data sources,” CoRR, vol. abs/1901.01908, 2019. [Online]. Available: http://arxiv.org/abs/1901.01908
[7] S. Redyuk, Z. Kaoudi, S. Schelter, and V. Markl, “DORIAN in action: Assisted design of data science pipelines,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3714–3717, 2022.
[8] G. Vargas-Solar, K. Belhajjame, J. Espinosa-Oviedo, S. Negrete-Yankelevich, and J. Zechinelli-Martini, “MATILDA: inclusive data science pipelines design through computational creativity,” in Proceedings of the Workshops of the EDBT/ICDT Joint Conference, vol. 3651, 2024. [Online]. Available: https://ceur-ws.org/Vol-3651/DARLI-AP-11.pdf
[9] Z. Liu, T. Hoang, J. Zhang, M. Zhu, T. Lan, S. Kokane, J. Tan, W. Yao, Z. Liu, Y. Feng, R. Murthy, L. Yang, S. Savarese, J. C. Niebles, H. Wang, S. Heinecke, and C. Xiong, “Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets,” CoRR, vol. abs/2406.18518, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.18518
[10] J. Giovanelli, B. Bilalli, and A. Abelló, “Data preprocessing pipeline generation for autoetl,” Inf. Syst., vol. 108, p. 101957, 2022.
[11] J. Yang, Y. He, and S. Chaudhuri, “Autopipeline: Synthesize data pipelines by-target using reinforcement learning and search,” CoRR, vol. abs/2106.13861, 2021. [Online]. Available: https://arxiv.org/abs/2106.13861
[12] Z. Miao, “Simplifying human-in-the-loop data science pipeline: Explanations, debugging, and data preparation,” Ph.D. dissertation, Duke University, Durham, NC, USA, 2022. [Online]. Available: https://hdl.handle.net/10161/26796
[13] S. Gakhar, J. Cahoon, W. Le, X. Li, K. Ravichandran, H. Patel, M. T. Friedman, B. Haynes, S. Qiao, A. Jindal, and J. Leeka, “Pipemizer: An optimizer for analytics data pipelines,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3710–3713, 2022.
[14] M. Dareck, C. Edelstenne, T. Enders, E. Fernandez, J.-P. Herteman, M. Kerkloh, I. King, P. Ky, M. Mathieu, G. Orsi, G. Schotman, C. Smith, and J.-D. Worner, “FlightPath 2050: Europe’s Vision for Aviation - Maintaining Global Leadership and Serving Society’s Needs,” http://www.sesarju.eu/, 2010, online; accessed 10 August 2024.
[15] European Union and EuroControl and SESAR, “The DART Project: Data-Driven Aircraft Trajectory Prediction Research,” http://dart-research.eu/, online; accessed 10 August 2024.
[16] US NextGen, “Modernization of United States Airspace,” https://www.faa.gov/nextgen/, 2019, online; accessed 10 August 2024.
[17] A. Remadi, K. E. Hage, Y. Hobeika, and F. Bugiotti, “To prompt or not to prompt: Navigating the use of large language models for integrating and modeling heterogeneous data,” Data Knowl. Eng., vol. 152, p. 102313, 2024.
[18] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré, “Language models enable simple systems for generating structured views of heterogeneous data lakes,” Proc. VLDB Endow., vol. 17, no. 2, pp. 92–105, 2023.

Thank you for your Attention
Q&A
