The DataMap Project, Scaling Up with R and Apache Arrow, ML Tutorials

This week's agenda:

  • Open Source of the Week - The DataMap project
  • New learning resources - New tutorials for RAG, data pipelines with MCP and DuckDB, linear regression with Python, ML foundations, statistics for data science, and hyperparameter tuning
  • Book of the week - Scaling Up with R and Apache Arrow by Nic Crane, Jonathan Keane, and Neal Richardson

I share daily updates on Substack, Facebook, Telegram, WhatsApp, and Viber.


Are you interested in learning how to set up automation using GitHub Actions? If so, please check out my course on LinkedIn Learning:


Open Source of the Week

This week's focus is on the DataMap project - a new R library for exploratory data analysis. This project, by Prof. Steven Ge from South Dakota State University, provides a web UI for visualizing and exploring data matrices. And yes, as you guessed, the UI is based on a serverless Shiny interface.

The library enables visualization of high-dimensional data using the following methods:

  • PCA
  • Heatmaps
  • t-SNE

High-dimensional plots; Image credit: project documentation

The following short video provides an introduction to the project functionalities:

Once the R library is installed, you can launch the UI in your browser. The project also has a demo app that runs as a Shinylive application:

License: MIT


New Learning Resources

Here are some new learning resources that I came across this week.

Wikipedia RAG System in Python

The following tutorial by NeuralNine provides an introduction to RAG systems with Python. It uses tools such as LlamaIndex and Streamlit to build a RAG system based on Wikipedia.

Faster Data Pipelines development with MCP and DuckDB

This looks super awesome - the following video by Mehdi Ouazza provides an introduction to developing data pipelines faster using DuckDB and an MCP server.

Linear Regression Model with Python

The following one-hour workshop by Anna Strahl focuses on how to develop a linear regression model in Python. It is a beginner-level workshop.
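For readers who want a feel for the topic before watching, here is a minimal sketch of simple linear regression using ordinary least squares in plain Python. It is not taken from the workshop; the function name `fit_line` and the toy data are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of the x-y co-deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Toy data generated from y = 2x + 1 (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In practice, you would likely reach for a library such as scikit-learn or statsmodels, which also handle multiple predictors and diagnostics.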

ML Foundations for AI Engineers

This short tutorial by Shaw Talebi focuses on core ML concepts such as training methods, deep learning, reinforcement learning, and working with data.

Statistics for Data Science

The third and last article from the Data Hustle newsletter 🗞️ focuses on core statistical concepts for data science. This article covers foundational topics such as the statistical relationship between variables, correlation, margin of error, statistical power, time series, etc.

Thanks to Venkata Naga Sai Kumar Bysani and Vaishali Macwan for the great summary!

Hyperparameter Tuning in Python

The following tutorial by Code with Josh focuses on tuning the hyperparameters of machine learning models with Python. It covers the random search and grid search tuning approaches using scikit-learn.
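To illustrate the difference between the two approaches, here is a minimal sketch in plain Python. It is not the tutorial's code and does not use scikit-learn: `validation_score` is a hypothetical stand-in for a cross-validated model score, and the parameter names and ranges are assumptions chosen for the example.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for a cross-validated score; peaks at (0.1, 4)."""
    return -100 * (learning_rate - 0.1) ** 2 - (depth - 4) ** 2


def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    )
    return max(candidates, key=lambda params: validation_score(**params))


def random_search(n_iter, seed=0):
    """Sample n_iter random combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {"learning_rate": rng.uniform(0.01, 0.3), "depth": rng.randint(1, 8)}
        for _ in range(n_iter)
    ]
    return max(candidates, key=lambda params: validation_score(**params))


grid = {"learning_rate": [0.01, 0.1, 0.3], "depth": [2, 4, 6]}
best_grid = grid_search(grid)
best_random = random_search(n_iter=20)

print("grid search:  ", best_grid)
print("random search:", best_random)
```

Grid search tries all 9 combinations, while random search samples a fixed budget of points from continuous ranges, which often scales better as the number of hyperparameters grows.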


Book of the Week

This week's focus is on a new R book - Scaling Up with R and Apache Arrow by Nic Crane, Jonathan Keane, and Neal Richardson. The book, as the name implies, focuses on working with data at a large scale in R using Apache Arrow. The Apache Arrow project defines a language-independent columnar format for handling tabular data in memory with high performance. The book demonstrates how to leverage Apache Arrow's capabilities to streamline data workflows without leaving the familiar tidyverse environment.

Topics Covered:

  • Introduction to Apache Arrow and its integration with R
  • Efficient data manipulation using Arrow with dplyr syntax
  • Handling various file formats, including Parquet, for optimized storage and retrieval
  • Working with real-world datasets like the U.S. Census PUMS
  • Advanced topics such as user-defined functions and interoperability across programming languages
  • Strategies for cloud-based data processing and sharing

Scaling Up with R and Apache Arrow; Image credit: Publisher

The book is ideal for R users and data professionals seeking to scale their data analysis workflows, and it provides practical insights into managing big data. The book, thanks to the authors, is available online for free:

A printed version is available to purchase on the publisher's website and on Amazon:


Have any questions? Please comment below!

See you next Saturday!

Thanks,

Rami
