--- title: "connector" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{connector} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction The `connector` package provides a set of functions to connect to different data sources (such as databases and file systems) and read and write data from them using a consistent interface. It is designed to be a generic and extensible package, so that new data sources can be added easily. This vignette demonstrates how to use the `connector` package to connect to either a **file system** or a **database** to access different types of data. ## Connector configuration The main function in this package is `connect()`. This function, based on a configuration file or a list, creates a `connectors` object with a `connector` for each of the specified data sources. The configuration file can be in list format, JSON, or YAML format. The input list (or configuration file) must have the following structure: * Only `metadata`, `env`, and `datasources` fields are allowed. * All elements must be named. * **`datasources`** is mandatory. * **`metadata`** and **`env`** must each be a list of named character vectors of length 1. * **`datasources`** must be a list of unnamed lists. * Each datasource must have the named character element **`name`** and the named list element **`backend`**. * For each backend, **`type`** must be provided. ## Working example ```{r, include = FALSE} # Use a temporary directory as working directory for the example below tmp <- withr::local_tempdir() knitr::opts_knit$set(root.dir = tmp) ``` Here is an example anyone can run to see how the `connector` package works. We will use the configuration file provided below, which uses the file system as the connection type for ADaM and TFL data. ```{r, include = FALSE} 'metadata: adam_path: !expr file.path(getwd(), "adam") tfl_path: !expr file.path(getwd(), "tfl") datasources: - name: "adam" backend: type: "connector::connector_fs" path: "{metadata.adam_path}" - name: "tfl" backend: type: "connector::connector_fs" path: "{metadata.tfl_path}" ' |> writeLines("_connector.yml") ``` `_connector.yml:` ```yaml metadata: adam_path: !expr file.path(getwd(), "adam") tfl_path: !expr file.path(getwd(), "tfl") datasources: - name: "adam" backend: type: "connector::connector_fs" path: "{metadata.adam_path}" - name: "tfl" backend: type: "connector::connector_fs" path: "{metadata.tfl_path}" ``` As you can see, the configuration file contains metadata about the paths to the directories where the data will be stored, and two data sources: `adam` and `tfl`, both using the `connector_fs` backend to connect to file system folders. Note that the paths to the directories are defined using metadata variables (e.g., `{metadata.adam_path}`), which allows you to easily change the paths in one place. Now, let's run the example: ```{r, include = FALSE} library(connector) library(dplyr) library(ggplot2) # Let's create ADaM and TFL directories dir.create("adam") dir.create("tfl") ``` The first step is to create the connections to the data sources. ```{r} # Load data connections db <- connect() ``` Next, we manipulate the iris dataset and store it in the `adam` connector. This means we will create a subset of the iris dataset and save it as an RDS file in the `adam` directory. 
```{r}
## Iris data
setosa <- iris |>
  filter(Species == "setosa")

## Store data
db$adam |>
  write_cnt(setosa, "setosa.rds")
```

We can also create more complex summaries and store them in the same connector.

```{r}
mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

db$adam |>
  write_cnt(mean_for_all_iris, "mean_iris.rds")

## List and load data
db$adam |>
  list_content_cnt()
```

We can also read back the data we just created and filter it further using the `read_cnt()` function.

```{r}
# Read and filter data
setosa_filtered <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)
```

Finally, we can create a plot with the `ggplot2` package and store it in the `tfl` connector.

```{r}
# Create a plot
plot_setosa <- ggplot(setosa_filtered) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store data and plot objects
db$tfl |>
  write_cnt(plot_setosa$data, "setosa_data.csv")

db$tfl |>
  write_cnt(plot_setosa, "setosa_plot.rds")

## Store plot image
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, plot_setosa)

db$tfl |>
  upload_cnt(tmp_file, "setosa_plot.png")

# List all files in the TFL directory
db$tfl |>
  list_content_cnt()
```

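As a final check, we can read the stored table back with `read_cnt()`, the same generic used above for the ADaM data. This small sketch reuses only functions already shown; the bare name `"setosa_data"` is expected to resolve to `setosa_data.csv`, mirroring how `"setosa"` resolved to `setosa.rds` earlier.

```{r}
# Round trip: read the CSV written above back from the TFL connector
setosa_data_check <- db$tfl |>
  read_cnt("setosa_data")

head(setosa_data_check)
```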