The DataMap Project, Scaling Up with R and Apache Arrow, ML Tutorials
This week's agenda:
Are you interested in learning how to set up automation using GitHub Actions? If so, please check out my course on LinkedIn Learning:
Open Source of the Week
This week's focus is on the DataMap project - a new R library for exploratory data analysis. This project by Prof. Steven Ge from South Dakota State University provides a web UI for visualizing and exploring data matrices. And yes, as you guessed, the UI is based on a serverless Shiny interface.
The library enables visualization of high-dimensional data using the following methods:
The following short video provides an introduction to the project functionalities:
Once the R library is installed, you can launch the UI in your browser. The project also has a demo app that runs as a Shinylive application:
License: MIT
New Learning Resources
Here are some new learning resources that I came across this week.
Wikipedia RAG System in Python
The following tutorial by NeuralNine provides an introduction to RAG systems with Python. This includes using tools such as LlamaIndex and Streamlit to build a RAG system based on Wikipedia.
Faster Data Pipelines development with MCP and DuckDB
This looks super awesome - the following video by Mehdi Ouazza provides an introduction to developing data pipelines faster using DuckDB and an MCP server.
Linear Regression Model with Python
The following one-hour workshop by Anna Strahl focuses on how to develop a linear regression model in Python. It is a beginner-level workshop.
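For a quick feel of the topic before watching, here is a minimal sketch of a simple linear regression fitted with ordinary least squares in plain Python. It is not taken from the workshop; the function name `fit_line` and the toy data are made up for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data (hypothetical): y is exactly 1 + 2x, so the fit recovers those values
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(x, y)
prediction = intercept + slope * 5.0  # predict y for a new x value
print(intercept, slope, prediction)
```

In practice you would likely use a library such as scikit-learn or statsmodels, which also report diagnostics like residuals and R-squared.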
ML Foundations for AI Engineers
A short tutorial by Shaw Talebi focusing on core ML concepts such as training methods, deep learning, reinforcement learning, and working with data.
Statistics for Data Science
The third and last article from the Data Hustle newsletter 🗞️ focuses on core statistical concepts for data science. This article covers foundational topics such as the statistical relationship between variables, correlation, margin of error, statistical power, time series, etc.
Thanks to Venkata Naga Sai Kumar Bysani and Vaishali Macwan for the great summary!
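To make two of these concepts concrete, here is a small sketch (my own, not from the article) that computes a Pearson correlation and an approximate margin of error for a sample mean. The helper names and the toy data are hypothetical, and the margin of error assumes a normal approximation with z = 1.96; for a sample this small, a t-quantile would be the more careful choice.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for the sample mean (normal approximation)."""
    return z * stdev(sample) / math.sqrt(len(sample))


# Toy data (hypothetical): study hours vs. test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

r = pearson_r(hours, scores)
moe = margin_of_error(scores)
print(f"correlation: {r:.3f}")
print(f"mean score: {mean(scores):.1f} +/- {moe:.2f}")
```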
Hyperparameter Tuning in Python
The following tutorial by Code with Josh focuses on hyperparameter tuning for machine learning models with Python. This includes random search and grid search tuning approaches using scikit-learn.
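The difference between the two approaches can be sketched in plain Python. This is an illustration of the idea only, not the tutorial's code and not the scikit-learn API (where the corresponding tools are `GridSearchCV` and `RandomizedSearchCV`). The `validation_score` function is a made-up stand-in for a cross-validated model score, and the parameter names are hypothetical.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Stand-in for a cross-validated score (hypothetical); best at lr=0.1, depth=4."""
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    )
    return max(candidates, key=lambda p: validation_score(**p))


def random_search(sample_params, n_iter, seed=0):
    """Evaluate n_iter randomly sampled parameter sets and return the best one."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_iter)]
    return max(candidates, key=lambda p: validation_score(**p))


def sample_params(rng):
    # Learning rate sampled on a log scale, depth sampled uniformly from 1..10
    return {"learning_rate": 10 ** rng.uniform(-3, 0), "depth": rng.randint(1, 10)}


# Grid search: exhaustive over 3 x 3 = 9 fixed combinations
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_grid = grid_search(grid)

# Random search: a fixed budget of 20 draws from the sampling distributions
best_random = random_search(sample_params, n_iter=20)

print("grid search:  ", best_grid)
print("random search:", best_random)
```

Grid search cost grows multiplicatively with each added parameter, while random search keeps a fixed evaluation budget, which is why it is often preferred for larger search spaces.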
Book of the Week
This week's focus is on a new R book - Scaling Up with R and Apache Arrow by Nic Crane, Jonathan Keane, and Neal Richardson. The book, as the name implies, focuses on working with data at a large scale in R using Apache Arrow. The Apache Arrow project defines a language-independent, columnar in-memory format for handling tabular data with high performance. The book demonstrates how to leverage Apache Arrow's capabilities to streamline data workflows without leaving the familiar tidyverse environment.
Topics Covered:
The book is ideal for R users and data professionals seeking to scale their data analysis workflows, and it provides practical insights into managing big data. Thanks to the authors, it is available online for free:
A printed version is available to purchase on the publisher's website and on Amazon:
Have any questions? Please comment below!
See you next Saturday!
Thanks,
Rami