The DataMap Project, Scaling Up with R and Apache Arrow, ML Tutorials

Rami Krispin

Senior Manager - Data Science and Engineering at Apple | Docker Captain | LinkedIn Learning Instructor

Published May 17, 2025

This week's agenda:

Open Source of the Week - The DataMap project
New learning resources - New tutorials for RAG, linear regression with Python, ML foundation, stats for data scientists, hyperparameter tuning
Book of the week - Scaling Up with R and Apache Arrow by Nic Crane, Jonathan Keane, and Neal Richardson

I share daily updates on Substack, Facebook, Telegram, WhatsApp, and Viber.

Are you interested in learning how to set up automation using GitHub Actions? If so, please check out my course on LinkedIn Learning:

Open Source of the Week

This week's focus is on the DataMap project - a new R library for exploratory data analysis. This project by Prof. Steven Ge from South Dakota State University provides a web UI for visualizing and exploring data matrices. And yes, as you guessed, the UI is based on a serverless Shiny interface.

The library enables visualization of high-dimensional data using the following methods:

PCA
Heatmaps
t-SNE

Figure: High-dimensional plots (image credit: project documentation)
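To give a feel for what a PCA plot is built on, here is a minimal sketch in plain Python. It is not DataMap's code (DataMap is an R package); the function name `first_principal_component` and the toy data are my own assumptions for illustration. It centers the data, builds the covariance matrix, and uses power iteration to find the direction of maximum variance. It assumes the data has some variance; constant data would cause a division by zero.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit vector of maximum variance for a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: every point lies on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction = first_principal_component(points)
```

For this toy data, the returned direction is proportional to (1, 2), the line the points sit on. Projecting each centered row onto that vector gives the first PCA coordinate.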

The following short video provides an introduction to the project's functionality:

Once the R library is installed, you can launch the UI in your browser. The project also has a demo app that runs as a Shinylive application:

License: MIT

New Learning Resources

Here are some new learning resources that I came across this week.

Wikipedia RAG System in Python

The following tutorial by NeuralNine provides an introduction to RAG systems with Python. This includes using tools such as LlamaIndex and Streamlit to build a RAG system based on Wikipedia.
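If RAG is new to you, the core idea is to retrieve the text most relevant to a question and place it in the prompt ahead of the question. The sketch below illustrates only that retrieval step with the Python standard library. It does not use LlamaIndex or Streamlit and is not the tutorial's code; the function names, the word-count similarity, and the sample passages are all assumptions for illustration. Real systems typically use embedding models rather than word counts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages, k=1):
    """Place the retrieved passages ahead of the question, as a RAG prompt would."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical passages standing in for Wikipedia article chunks.
passages = [
    "DuckDB is an in-process analytical database.",
    "Apache Arrow defines a columnar in-memory format.",
    "Shiny is a web application framework for R.",
]
question = "What is a columnar memory format?"
top = retrieve(question, passages)
prompt = build_prompt(question, passages)
```

Here `top` holds the Apache Arrow passage, since it shares the most words with the question. The resulting `prompt` string is what you would send to the language model.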

Faster Data Pipelines development with MCP and DuckDB

This looks super awesome - the following video by Mehdi Ouazza provides an introduction to developing data pipelines faster using DuckDB and an MCP server.

Linear Regression Model with Python

The following one-hour workshop by Anna Strahl focuses on how to develop a linear regression model in Python. It is a beginner-level workshop.
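As a quick refresher on what such a model does, here is a minimal sketch of simple linear regression (one feature) using the closed-form least-squares solution. It is not the workshop's code; the function name `fit_line` and the toy data are assumptions for illustration. It also assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data generated from y = 2x + 1 with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Predict for a new observation.
prediction = slope * 5.0 + intercept
```

With noise-free data, the fit recovers the generating line, so the slope is 2 and the intercept is 1. With real data, the same formulas give the line that minimizes the sum of squared residuals.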

ML Foundations for AI Engineers

The following short tutorial by Shaw Talebi focuses on core ML concepts such as training methods, deep learning, reinforcement learning, and working with data.

Statistics for Data Science

The third and last article from the Data Hustle newsletter 🗞️ focuses on core statistical concepts for data science. This article covers foundational topics such as the statistical relationship between variables, correlation, margin of error, statistical power, time series, etc.

Thanks to Venkata Naga Sai Kumar Bysani and Vaishali Macwan for the great summary!

Hyperparameter Tuning in Python

The following tutorial by Code with Josh focuses on hyperparameter tuning for machine learning models with Python. This includes random search and grid search tuning approaches using scikit-learn.
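To make the difference between the two approaches concrete, here is a minimal sketch using only the Python standard library. It is not scikit-learn's API and not the tutorial's code. The `validation_score` function is a hypothetical stand-in for a cross-validated model score, and the parameter names `learning_rate` and `depth` are assumptions for illustration.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Hypothetical stand-in for a cross-validated model score (higher is better)."""
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*grid.values())
    ]
    return max(candidates, key=lambda params: validation_score(**params))

def random_search(bounds, n_iter, seed=0):
    """Sample n_iter random configurations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "depth": rng.randint(*bounds["depth"]),
        }
        for _ in range(n_iter)
    ]
    return max(candidates, key=lambda params: validation_score(**params))

# Grid search: 3 x 3 = 9 evaluations over fixed values.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_grid = grid_search(grid)

# Random search: a fixed budget of 20 evaluations over continuous ranges.
bounds = {"learning_rate": (0.01, 1.0), "depth": (2, 8)}
best_random = random_search(bounds, n_iter=20)
```

Grid search cost grows multiplicatively with every parameter you add, while random search keeps a fixed evaluation budget and can try values that are not on the grid. In practice, the scoring function would train and validate a model for each configuration.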

Book of the Week

This week's focus is on a new R book - Scaling Up with R and Apache Arrow by Nic Crane, Jonathan Keane, and Neal Richardson. The book, as the name implies, focuses on working with data at a large scale in R using Apache Arrow. The Apache Arrow project provides a multi-language, high-performance format for handling tabular data in memory. The book demonstrates how to leverage Apache Arrow's capabilities to streamline data workflows without leaving the familiar tidyverse environment.

Topics Covered:

Introduction to Apache Arrow and its integration with R
Efficient data manipulation using Arrow with dplyr syntax
Handling various file formats, including Parquet, for optimized storage and retrieval
Working with real-world datasets like the U.S. Census PUMS
Advanced topics such as user-defined functions and interoperability across programming languages
Strategies for cloud-based data processing and sharing

The book is ideal for R users and data professionals seeking to scale their data analysis workflows, and it provides practical insights into managing big data. Thanks to the authors, it is available online for free:

A printed version is available to purchase on the publisher's website and on Amazon:

Have any questions? Please comment below!

See you next Saturday!

Thanks,

Rami


