---
title: "Differential Network Analysis with multiDEGGs"
author: "Elisabetta Sciacca, Myles Lewis"
output:
html_document:
toc: true
toc_float:
collapsed: false
toc_depth: 2
number_sections: false
vignette: >
%\VignetteIndexEntry{multiDEGGs vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Introduction
The multiDEGGs package performs multi-omic differential network analysis by
identifying differential interactions between molecular entities (genes,
proteins, miRNAs, or other biomolecules) across the omic datasets provided.
For each omic dataset, a differential network is constructed, where links
represent statistically significant differential interactions between entities.
These networks are then integrated into a comprehensive visualization using
distinct colors to distinguish interactions from different omic layers. This
unified visualization allows interactive exploration of cross-omic patterns
(e.g., differential interactions present at both transcript and protein level).
For each link, users can access differential statistical significance metrics
(p-values or adjusted p-values, calculated via robust or traditional linear
regression with interaction term), and differential regression plots.
Beyond network visualization and exploration, multiDEGGs extends its utility
into predictive modeling applications. The identified differential interactions
can be leveraged as engineered features in machine learning pipelines, providing
biologically meaningful predictors that capture relational information
between molecular entities. The package includes specialized functions for
nested cross-validation that ensure proper feature selection and engineering
without data leakage, enabling the construction of robust and interpretable
predictive models.
## Installation
Install from CRAN:
`install.packages("multiDEGGs")`
Install from Github:
`devtools::install_github("elisabettasciacca/multiDEGGs")`
## Quick start - Generate Differential Networks
Let's start by loading the package and sample data:
```{r load_data}
library(multiDEGGs)
data("synthetic_metadata")
data("synthetic_rnaseqData")
data("synthetic_proteomicData")
data("synthetic_OlinkData")
```
Generate Differential Networks:
```{r}
assayData_list <- list("RNAseq" = synthetic_rnaseqData,
"Proteomics" = synthetic_proteomicData,
"Olink" = synthetic_OlinkData)
deggs_object <- get_diffNetworks(assayData = assayData_list,
metadata = synthetic_metadata,
category_variable = "response",
regression_method = "lm",
padj_method = "bonferroni",
verbose = FALSE,
show_progressBar = FALSE,
cores = 2)
```
### Key Parameters of `get_diffNetworks`
It's worth explaining some of the important parameters of `get_diffNetworks`:
* `assayData`: accepts either a single normalized matrix/data frame (for single
omic differential analysis), or a list of matrices/data frames (for multi-omic
scenarios). For multi-omic analysis, it's highly recommended to use a named
list of data. If unnamed, sequential names (assayData1, assayData2, etc.) will
be assigned to identify each matrix or data frame.
* `metadata`: can also be a named factor vector, with names matching the patient
IDs in column names of the assay data matrices/data frames. In that case, the
category_variable can remain unset (NULL by default).
* `category_subset`: this parameter can restrict the analysis to a certain
subset of categories available in the metadata/category vector.
* `regression_method`: set to `"lm"` by default because it is faster and highly
recommended in machine learning scenarios, where the function might be
repeatedly called many times. For basic differential analyses, `"rlm"` can
also be used and may perform better in some cases.
* `percentile_vector`: by default, molecular targets (genes, proteins, etc.)
whose expression level is below the 35th percentile of the entire data matrix
are excluded from the analysis. This threshold can be modified by specifying
the percentile vector that is internally used for the percolation analysis.
For example, to remove only targets below the 25th percentile, set
`percentile_vector = seq(0.25, 0.98, by = 0.05)`.
* `padj_method`: the default method is Bonferroni. Storey's q values often give
more generous results but the `qvalue` package needs to be installed first.
**NOTE**: Not all patient IDs need to be present across datasets. Different
numbers of samples per omic are acceptable. Only IDs whose data is available in
the colnames of the assayData will be included in the analysis. Missing IDs will
be listed in a message similar to:
`The following samples IDs are missing in Proteomics: PT001, PT005, PT0030`
## Visualization
The `deggs_object` now contains the differential networks for each omic data
in `assayData_list`. These networks can be integrated into a comprehensive
visualization where different colors distinguish links from different omic
layers.
```{r, eval=FALSE}
View_diffNetworks(deggs_object)
```
This visualization interface allows you to:
1. Navigate the networks associated with each patient category
2. Filter by link significance
3. Search for specific genes inside the network
{width=75%}
Thicker links correspond to higher significant p-values.
The direction of the
arrows shows the relationship direction reported in literature, not derived from
the data.
The user can visualize differential regression plots by clicking on a link:
{width=75%}
Single node differential expressions can also be visualized by clicking on the
nodes:
{width=80%}
**NOTE**: For multi-omic scenarios, the data from the first matrix in the list
passed to `assayData` will be used for this boxplot.
## List All Differential Interactions
Outside of the interactive environment, the `get_multiOmics_diffNetworks()`
function can be used to get a table of all differential interactions, ordered by
p-value or adjusted p-value:
```{r, warning=FALSE}
get_multiOmics_diffNetworks(deggs_object, sig_threshold = 0.05)
```
For single omic scenarios, use the `get_sig_deggs()` function:
```{r}
deggs_object_oneOmic <- get_diffNetworks(assayData = synthetic_rnaseqData,
metadata = synthetic_metadata,
category_variable = "response",
regression_method = "lm",
padj_method = "bonferroni",
verbose = FALSE,
show_progressBar = FALSE,
cores = 2)
get_sig_deggs(deggs_object_oneOmic, sig_threshold = 0.05)
```
## Differential Regression Plots
To plot the differential regression fits outside of the interactive environment,
use `plot_regressions()` specifying the omic data to be used and the two targets:
```{r, fig.width = 4.5, fig.height = 4, eval=FALSE}
plot_regressions(deggs_object,
assayDataName = "RNAseq",
gene_A = "MTOR",
gene_B = "AKT2",
legend_position = "bottomright")
```
{width=50%}
In single omic analyses, the `assayDataName` parameter can remain unset.
## Differential Network Analysis with More Than Two Groups
It's possible to compare differential interactions among more than two
categorical groups. All steps described above stay the same;
the dropdown
menu of the interactive environment will show all available categories:
While regressions and boxplots will show all categories:
The statistical significance of the interaction term is calculated via one-way
ANOVA in this case.
We highly recommend to have at least 4 or 5 observations per group.
## Feature Selection and Engineering with multiDEGGs in Nested Cross-Validation
In computational biology applications involving high-throughput data,
researchers commonly encounter situations where the number of potential
predictors far exceeds the available sample size. This dimensional challenge
requires careful feature selection strategies for both mathematical and clinical
reasons.
Standard feature selection methods typically evaluate predictors individually,
identifying those variables that show the strongest univariate associations with
the outcome variable (such as through t-tests or Wilcoxon tests).
While effective, this approach overlooks the interconnected nature of biological
systems, where \bold{informative patterns may emerge from relationships between
variables rather than from individual measurements alone.}
Feature engineering represents a complementary strategy that creates new
predictors by combining or transforming existing variables. In biology,
such approach can be used to capture higher-order information that reflects the
interconnected nature of molecular processes. For instance, the ratio between
two genes may provide more discriminative power than either gene expression
level independently, particularly when their relative balance is disrupted
in disease states.
The informative content encoded in differential interactions, combined with
multiDEGGs' ability to identify only literature-validated differential
relationships, makes it particularly well-suited for both individual feature
selection and guided creation of engineered predictors in machine learning.
Such approach has potential to overcome the limitations of conventional
algorithms which may select individual predictors without clear biological
significance, compromising both the interpretability and clinical credibility
of the resulting models.
### Why Nested Cross-Validation for Feature Engineering?
It is crucial that feature selection and modification is conducted exclusively
on training data within cross-validation loops to prevent information leakage
from the test set. The `nestedcv` package enables the nested modification of
predictors within each outer fold, ensuring that the attributes learned from
the training part are applied to the test data without prior knowledge of the
test data itself.
The selected and combined features, and corresponding model, can then be
evaluated on the hold-out test data without introducing bias.
Both \link[nestedcv](nestcv.glmnet) and \link[nestedcv](nestcv.train) from
`nestedcv` accept any user-defined function
that filters or transforms the feature matrix by passing the function name to
the `modifyX` parameter.
**The multiDEGGs package provides two specialized functions for this purpose.**
### multiDEGGs_filter(): Pure Differential Network-Based Selection
The `multiDEGGs_filter()` function performs feature selection based entirely on
differential network analysis. It identifies significant differential molecular
interactions and can return either the interaction pairs alone or both pairs
and individual variables involved in those interactions.
#### Key Parameters
When using `multiDEGGs_filter()`, you can control the following parameters
through `modifyX_options`:
- **`keep_single_genes`** (logical, default `FALSE`): Controls whether to
include individual genes from significant pairs in addition to the pairs
themselves
- **`nfilter`** (integer, default `NULL`): Maximum number of predictors to
return. When `NULL`, all significant interactions found are included
#### Usage Examples
##### Basic Usage: Pairs Only
```{r}
library(nestedcv)
data("synthetic_metadata")
data("synthetic_rnaseqData")
# Regularized linear model with interaction pairs only
fit.glmnet <- nestcv.glmnet(
y = as.numeric(synthetic_metadata$response),
x = t(synthetic_rnaseqData),
modifyX = "multiDEGGs_filter",
modifyX_options = list(
keep_single_genes = FALSE,
nfilter = 20
),
modifyX_useY = TRUE,
n_outer_folds = 5,
n_inner_folds = 6,
verbose = FALSE
)
summary(fit.glmnet)
```
##### Including Individual Genes (keep_single_genes = TRUE)
```{r, fig.width = 3, fig.height = 3}
# Random forest model including both pairs and individual genes
fit.rf <- nestcv.train(
y = synthetic_metadata$response,
x = t(synthetic_rnaseqData),
method = "rf",
modifyX = "multiDEGGs_filter",
modifyX_options = list(
keep_single_genes = TRUE,
nfilter = 30
),
modifyX_useY = TRUE,
n_outer_folds = 5,
n_inner_folds = 6,
verbose = FALSE
)
fit.rf$summary
# Plot ROC on outer folds
plot(fit.rf$roc)
```
#### How nfilter works with keep_single_genes
- When **`keep_single_genes = FALSE`**: `nfilter` limits only the number of
interaction pairs returned
- When **`keep_single_genes = TRUE`**: `nfilter` limits the combined count
of unique individual genes plus interaction pairs. The function prioritizes
pairs by significance and adds individual genes as needed until the limit is
reached
### multiDEGGs_combined_filter(): Hybrid Statistical and Network-Based Selection
The `multiDEGGs_combined_filter()` function combines traditional statistical
feature selection with differential network analysis. This hybrid approach
allows you to benefit from both conventional univariate selection methods and
the biological insights from interaction analysis.
#### Key Parameters
- **`filter_method`** (character): Statistical method for single feature
selection.
Options: `"ttest"`, `"wilcoxon"`, `"ranger"`, `"glmnet"`, `"pls"`
- **`nfilter`** (integer): Maximum number of features to select
- **`dynamic_nfilter`** (logical): Controls how `nfilter` is applied
(see detailed explanation below)
- **`keep_single_genes`** (logical): When `dynamic_nfilter = TRUE`,
determines whether to include individual genes from multiDEGGs pairs
#### Dynamic vs. Balanced Selection Modes
##### Dynamic Selection (`dynamic_nfilter = TRUE`)
In dynamic mode, the function:
1. Selects `nfilter` single genes using the chosen statistical method
2. Adds ALL significant interaction pairs found by multiDEGGs
3. Total predictors = `nfilter` single genes + number of significant pairs
This mode allows the feature space to expand based on the biological complexity
discovered in each fold.
```{r}
# Dynamic selection with t-test for single genes
fit.dynamic <- nestcv.glmnet(
y = as.numeric(synthetic_metadata$response),
x = t(synthetic_rnaseqData),
modifyX = "multiDEGGs_combined_filter",
modifyX_options = list(
filter_method = "ttest",
nfilter = 20,
dynamic_nfilter = TRUE,
keep_single_genes = FALSE
),
modifyX_useY = TRUE,
n_outer_folds = 5,
n_inner_folds = 6,
verbose = FALSE
)
```
##### Balanced Selection (`dynamic_nfilter = FALSE`)
In balanced mode, the function:
1. Allocates approximately half of `nfilter` to interaction pairs
2. Fills remaining slots with single genes from the statistical filter
3. Maintains consistent total number of predictors across all folds
This mode ensures a fixed feature space size while balancing single genes and
interactions.
```{r}
# Balanced selection with Wilcoxon-test importance
fit.balanced <- nestcv.train(
y = synthetic_metadata$response,
x = t(synthetic_rnaseqData),
method = "rf",
modifyX = "multiDEGGs_combined_filter",
modifyX_options = list(
filter_method = "wilcoxon",
nfilter = 40,
dynamic_nfilter = FALSE
),
modifyX_useY = TRUE,
n_outer_folds = 5,
n_inner_folds = 6,
verbose = FALSE
)
```
#### Available Statistical Methods
- **`"ttest"`**: Two-sample t-test for differential expression
- **`"wilcoxon"`**: Wilcoxon rank-sum test (non-parametric alternative to t-test)
- **`"ranger"`**: Random Forest variable importance scoring (the `ranger` package
must be installed first)
- **`"glmnet"`**: Elastic net regularization coefficients
- **`"pls"`**: Partial Least Squares variable importance
### Practical considerations
Before implementing multiDEGGs in your machine learning pipeline, it's highly
recommended to first run a preliminary analysis on your complete dataset to
assess the number of differential interactions detected. This exploratory step
can guide your choice of approach and parameter settings.
If multiDEGGs identifies only a small number of differential interactions
(e.g., fewer than 10-20 pairs), these features alone may lack sufficient
predictive power. In such cases, consider:
- Using `multiDEGGs_combined_filter()` to integrate network-based features with
traditional statistical selection methods
- Setting `keep_single_genes = TRUE` in `multiDEGGs_filter()` to include
individual genes involved in the differential pairs
- Adjusting the `percentile_vector` or significance thresholds in the initial
multiDEGGs analysis to potentially capture more interactions
Conversely, if a large number of differential interactions are detected,
`multiDEGGs_filter()` alone may provide sufficient feature diversity for
effective model training.
### Feature Engineering Details
Both functions create ratio-based features from significant gene pairs
(Gene A / Gene B), which capture the relative expression relationships that
drive differential network connectivity. The `predict` methods automatically
handle the feature transformation for both training and test data within each
cross-validation fold, ensuring no information leakage.
**Note:** If no significant differential interactions are found in a particular
fold, both functions automatically fall back to t-test-based selection to ensure
robust performance across all scenarios. This fallback is indicated by a
printed "0" during execution.
## Session Info
```{r}
sessionInfo()
```
## Citation
```{r}
citation("multiDEGGs")
```