Introduction to SJD (Structured Joint Decomposition)

We are using a list of neurogenesis datasets of both human and mouse model from different studies to demonstrate the use of SJD package. Here we have a list of datasets NeuroGenesis4, and one dataframe NeuroGenesis.info containing meta informationtot helps visualize the results. Users are suggested to make the dataframe including information prior to visualization.

Install and load `SJD` package

To install this package in R, run the following commands:

library(devtools)
install_github("CHuanSite/SJD")

library(SJD)

Jointly decompose 4 NeuroGenesis genomics data with `SJD` package

Load data

The whole structured NeuroGenesis4 and its meta information NeuroGenesis.info are deposited to googledrive and the downsized subsets are built in the SJD package. Please follow the below commands to download the real data and info For users who wants to save time by using the subset data, please skip downloading from google drive and follow the commands below.

We download the full data sets from the google drive, and load it into R

install.packages("googledrive", repos = "http://cran.us.r-project.org")
#> Installing package into '/private/var/folders/hn/15_w90p96zn9px58h76x8sfr0000gn/T/Rtmpnf4wal/temp_libpath14ecf600e2706'
#> (as 'lib' is unspecified)
#> 
#> The downloaded binary packages are in
#>  /var/folders/hn/15_w90p96zn9px58h76x8sfr0000gn/T//RtmpbEWihf/downloaded_packages

library(googledrive)

## Download NeuroGenesis4
drive_download(file = "https://drive.google.com/u/0/uc?export=download&id=1jf4f_dxps47yT89m8Tkpviyci_XoOtGk", path = "./NeuroGenesis4.RData", overwrite = TRUE)
#> ! Using an auto-discovered, cached token.
#>   To suppress this message, modify your code or options to clearly consent to
#>   the use of a cached token.
#>   See gargle's "Non-interactive auth" vignette for more details:
#>   <https://gargle.r-lib.org/articles/non-interactive-auth.html>
#> ℹ The googledrive package is using a cached token for hzchenhuan@gmail.com.
#> Auto-refreshing stale OAuth token.
#> File downloaded:
#> • NeuroGenesis4_full_data.RData <id: 1jf4f_dxps47yT89m8Tkpviyci_XoOtGk>
#> Saved locally as:
#> • ./NeuroGenesis4.RData

## Download NeruGenesis4.info
drive_download(file = "https://drive.google.com/file/d/1aGYAxZDYJKOLLO1LmFk9vPYUmtmS1TjZ/view?usp=sharing", path = "./NeuroGenesis4.info.RData", overwrite = TRUE)
#> File downloaded:
#> • NeuroGenesis4_full_info.RData <id: 1aGYAxZDYJKOLLO1LmFk9vPYUmtmS1TjZ>
#> Saved locally as:
#> • ./NeuroGenesis4.info.RData

load(file = "./NeuroGenesis4.RData")
load(file = "./NeuroGenesis4.info.RData")

This block of codes containing the small sampled data, which is faster to compute than the full scale data,

# data("NeuroGenesis4", package = "SJD")
# data("NeuroGenesis4.info", package = "SJD")

# str(NeuroGenesis4)

Preprocess data

Here we input the list of datasets (expression matrices) and wrap them to be ready for SJD analysis. sjdWrap identifies the shared genes among all datasets regardless of species and output a new list of datasets with uniformed gene names

dataset_list: expression matrices to be analyzed
species_vetor: the species information for each input dataset
geneType_vetor: the gene type information for each input dataset
geneType_out: the desired gene type for all datasets
species_out: the desired species to refer when uniform the genes for all datasets

SJDdataIN = sjdWrap(
  data.list = NeuroGenesis4,
  species.vector = c("human", "human", "human", "mouse"),
  geneType.vector = c("symbol","ensembl","symbol","symbol"),
  geneType.out = "symbol",
  species.out = "human"
  )
#> Using biomaRt to connect gene IDs across 4 datasets:
#> Getting biomaRt IDs for dataset 1 
#> You have input  23368  genes
#> We found  22458  matches
#> 2624  of those are duplicates and only keeping the 1st of each
#> Getting biomaRt IDs for dataset 2 
#> You have input  63677  genes
#> We found  54544  matches
#> 17816  of those are duplicates and only keeping the 1st of each
#> Getting biomaRt IDs for dataset 3 
#> You have input  35543  genes
#> We found  25753  matches
#> 2297  of those are duplicates and only keeping the 1st of each
#> Getting biomaRt IDs for dataset 4 
#> You have input  16441  genes
#> We found  12475  matches
#> 637  of those are duplicates and only keeping the 1st of each
#> constructed 4 tables of cross-species matching genes
#> we found 10694 shared genes in 4 datasets
#> new data list of 4 datasets constructed

Apply the decomposition method

After the transformation and alignment among the data sets, the next step is to apply the decomposition method, we give examples on different methods to illustrate the process

grp = list(
  Shared.All.4 = c(1 : 4), 
  Shared.bulk.2 = c(1,2),
  Shared.sc.2 = c(3, 4),
  Hs.Meisnr.1 = c(1),
  Hs.AZ.1 = c(2),
  Gesch.1 = c(3),
  Telley.1 = c(4))

dims = c(2, 2, 2, 2, 2, 2, 2) # must have same length as "grp"

# sep
# sepPCA.out = sepPCA(SJDdataIN, dims)
# sepICA.out = sepICA(SJDdataIN, dims)
# sepNMF.out = sepNMF(SJDdataIN, dims)

# concat
# concatPCA.out = concatPCA(SJDdataIN, grp, dims)
# concatICA.out = concatICA(SJDdataIN, grp, dims)
# concatNMF.out = concatNMF(SJDdataIN, grp, dims)

# joint
# jointPCA.out = jointPCA(SJDdataIN, grp, dims)
# jointICA.out = jointICA(SJDdataIN, grp, dims)
# jointNMF.out = jointNMF(SJDdataIN, grp, dims)

# sequential
twoStageLCA.out = twoStageLCA(dataset = SJDdataIN, group = grp, comp_num = dims)
# twoStageiLCA.out = twoStageiLCA(dataset = SJDdataIN, group = grp, comp_num = dims)

Plot

library(ggplot2)
library(gridExtra)
#> 
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:Biobase':
#> 
#>     combine
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     combine

Plot each dim in each dataset individually

Here we need to call the SJDScorePlotter function to generate the SJDScorePlotter.obj, a list of ggplot objects that can be used

SampleMetaNamesTable = data.frame(
    row.names = names(NeuroGenesis4),
    Type = c('Yaxis','Yaxis','2Dscatter','2Dscatter'),
    XaxisColumn = c("X","DAYx","tSNE_1","tsne1:ch1"),
    YaxisColumn = c("PJDscores","PJDscores","tSNE_2","tsne2:ch1"),
    COLaxisColumn = c("color","colorBYlabelsX","PJDscores","PJDscores"),
    PCHColumn = c("","","",""),
    cexx=c(1,1,2,2)
)

SJDScorePlotter.obj = SJDScorePlotter(
     SJDalg = "twoStageLCA",
     scores = twoStageLCA.out$score_list,
     lbb = "NeuroGenesis4.p2",
     info = NeuroGenesis4.info,
     SampleMetaNamesTable = SampleMetaNamesTable
)
#> [1] "**********************************************************"
#> [1] "plotting scores for twoStageLCA anaysis."
#> [1] "*******************************"
#> [1] "dataset: Meissner.inVitro.bulk.Hs"
#> [1] "grouping: Shared.All.4"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.bulk.2"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.sc.2"
#> [1] "No SJD scores for grouping : Shared.sc.2, moving to the next SJD output for Meissner.inVitro.bulk.Hs"
#> [1] "grouping: Hs.Meisnr.1"
#> [1] 1
#> [1] 2
#> [1] "grouping: Hs.AZ.1"
#> [1] "No SJD scores for grouping : Hs.AZ.1, moving to the next SJD output for Meissner.inVitro.bulk.Hs"
#> [1] "grouping: Gesch.1"
#> [1] "No SJD scores for grouping : Gesch.1, moving to the next SJD output for Meissner.inVitro.bulk.Hs"
#> [1] "grouping: Telley.1"
#> [1] "No SJD scores for grouping : Telley.1, moving to the next SJD output for Meissner.inVitro.bulk.Hs"
#> [1] "*******************************"
#> [1] "dataset: LIBD.AZ.inVitro.bulk.Hs"
#> [1] "grouping: Shared.All.4"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.bulk.2"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.sc.2"
#> [1] "No SJD scores for grouping : Shared.sc.2, moving to the next SJD output for LIBD.AZ.inVitro.bulk.Hs"
#> [1] "grouping: Hs.Meisnr.1"
#> [1] "No SJD scores for grouping : Hs.Meisnr.1, moving to the next SJD output for LIBD.AZ.inVitro.bulk.Hs"
#> [1] "grouping: Hs.AZ.1"
#> [1] 1
#> [1] 2
#> [1] "grouping: Gesch.1"
#> [1] "No SJD scores for grouping : Gesch.1, moving to the next SJD output for LIBD.AZ.inVitro.bulk.Hs"
#> [1] "grouping: Telley.1"
#> [1] "No SJD scores for grouping : Telley.1, moving to the next SJD output for LIBD.AZ.inVitro.bulk.Hs"
#> [1] "*******************************"
#> [1] "dataset: Geschwind.inVivo.sc.Hs"
#> [1] "grouping: Shared.All.4"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.bulk.2"
#> [1] "No SJD scores for grouping : Shared.bulk.2, moving to the next SJD output for Geschwind.inVivo.sc.Hs"
#> [1] "grouping: Shared.sc.2"
#> [1] 1
#> [1] 2
#> [1] "grouping: Hs.Meisnr.1"
#> [1] "No SJD scores for grouping : Hs.Meisnr.1, moving to the next SJD output for Geschwind.inVivo.sc.Hs"
#> [1] "grouping: Hs.AZ.1"
#> [1] "No SJD scores for grouping : Hs.AZ.1, moving to the next SJD output for Geschwind.inVivo.sc.Hs"
#> [1] "grouping: Gesch.1"
#> [1] 1
#> [1] 2
#> [1] "grouping: Telley.1"
#> [1] "No SJD scores for grouping : Telley.1, moving to the next SJD output for Geschwind.inVivo.sc.Hs"
#> [1] "*******************************"
#> [1] "dataset: Jabaudon.inVivo.sc.Mm"
#> [1] "grouping: Shared.All.4"
#> [1] 1
#> [1] 2
#> [1] "grouping: Shared.bulk.2"
#> [1] "No SJD scores for grouping : Shared.bulk.2, moving to the next SJD output for Jabaudon.inVivo.sc.Mm"
#> [1] "grouping: Shared.sc.2"
#> [1] 1
#> [1] 2
#> [1] "grouping: Hs.Meisnr.1"
#> [1] "No SJD scores for grouping : Hs.Meisnr.1, moving to the next SJD output for Jabaudon.inVivo.sc.Mm"
#> [1] "grouping: Hs.AZ.1"
#> [1] "No SJD scores for grouping : Hs.AZ.1, moving to the next SJD output for Jabaudon.inVivo.sc.Mm"
#> [1] "grouping: Gesch.1"
#> [1] "No SJD scores for grouping : Gesch.1, moving to the next SJD output for Jabaudon.inVivo.sc.Mm"
#> [1] "grouping: Telley.1"
#> [1] 1
#> [1] 2

print(names(SJDScorePlotter.obj))
#>  [1] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp1of2" 
#>  [2] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp2of2" 
#>  [3] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_Meissner.inVitro.bulk.Hs.comp1of2"
#>  [4] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_Meissner.inVitro.bulk.Hs.comp2of2"
#>  [5] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.Meisnr.1.data_Meissner.inVitro.bulk.Hs.comp1of2"  
#>  [6] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.Meisnr.1.data_Meissner.inVitro.bulk.Hs.comp2of2"  
#>  [7] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_LIBD.AZ.inVitro.bulk.Hs.comp1of2"  
#>  [8] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_LIBD.AZ.inVitro.bulk.Hs.comp2of2"  
#>  [9] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_LIBD.AZ.inVitro.bulk.Hs.comp1of2" 
#> [10] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_LIBD.AZ.inVitro.bulk.Hs.comp2of2" 
#> [11] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.AZ.1.data_LIBD.AZ.inVitro.bulk.Hs.comp1of2"       
#> [12] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.AZ.1.data_LIBD.AZ.inVitro.bulk.Hs.comp2of2"       
#> [13] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Geschwind.inVivo.sc.Hs.comp1of2"   
#> [14] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Geschwind.inVivo.sc.Hs.comp2of2"   
#> [15] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.sc.2.data_Geschwind.inVivo.sc.Hs.comp1of2"    
#> [16] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.sc.2.data_Geschwind.inVivo.sc.Hs.comp2of2"    
#> [17] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Gesch.1.data_Geschwind.inVivo.sc.Hs.comp1of2"        
#> [18] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Gesch.1.data_Geschwind.inVivo.sc.Hs.comp2of2"        
#> [19] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Jabaudon.inVivo.sc.Mm.comp1of2"    
#> [20] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Jabaudon.inVivo.sc.Mm.comp2of2"    
#> [21] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.sc.2.data_Jabaudon.inVivo.sc.Mm.comp1of2"     
#> [22] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.sc.2.data_Jabaudon.inVivo.sc.Mm.comp2of2"     
#> [23] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Telley.1.data_Jabaudon.inVivo.sc.Mm.comp1of2"        
#> [24] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Telley.1.data_Jabaudon.inVivo.sc.Mm.comp2of2"

Save the whole list to a .RData file

# save(SJDScorePlotter.obj, file = "./SJDScorePlotter.obj.RData")

Assemble the images based on dataset

assemble.byDataset.obj = assemble.byDataset(SJDScorePlotter.obj = SJDScorePlotter.obj, dataset_name = "Meissner.inVitro.bulk.Hs", SJD_algorithm = "twoStageLCA", group = NA)

print(names(assemble.byDataset.obj))
#> [1] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp1of2" 
#> [2] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp2of2" 
#> [3] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_Meissner.inVitro.bulk.Hs.comp1of2"
#> [4] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.bulk.2.data_Meissner.inVitro.bulk.Hs.comp2of2"
#> [5] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.Meisnr.1.data_Meissner.inVitro.bulk.Hs.comp1of2"  
#> [6] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Hs.Meisnr.1.data_Meissner.inVitro.bulk.Hs.comp2of2"

Plot all images corresponding from the assemble.byDataset.obj. We can construct them into a whole big image through the grid.arrange from the grid.Extra package.

assemble.byDataset.obj[[1]]

assemble.byDataset.obj[[2]]

g = grid.arrange(
  assemble.byDataset.obj[[3]],
  assemble.byDataset.obj[[4]],
  nrow = 1, ncol = 2
)

Assemble the images based on component

assemble.byComponent.obj = assemble.byComponent(SJDScorePlotter.obj = SJDScorePlotter.obj, component = c(1, 2), SJD_algorithm = "twoStageLCA", group = 'Shared.All.4')

print(names(assemble.byComponent.obj))
#> [1] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp1of2"
#> [2] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_LIBD.AZ.inVitro.bulk.Hs.comp1of2" 
#> [3] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Geschwind.inVivo.sc.Hs.comp1of2"  
#> [4] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Jabaudon.inVivo.sc.Mm.comp1of2"   
#> [5] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Meissner.inVitro.bulk.Hs.comp2of2"
#> [6] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_LIBD.AZ.inVitro.bulk.Hs.comp2of2" 
#> [7] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Geschwind.inVivo.sc.Hs.comp2of2"  
#> [8] "SJDout_NeuroGenesis4.p2.SJDalg_twoStageLCA.grp_Shared.All.4.data_Jabaudon.inVivo.sc.Mm.comp2of2"

assemble.byComponent.obj[[1]]

assemble.byComponent.obj[[2]]

assemble.byComponent.obj[[3]]

assemble.byComponent.obj[[4]]

Huan Chen

Jinrui Liu

Shreyash Sonthalia

Guangyan Li

Carlo Colantuoni

02 October, 2022

Install and load `SJD` package

Jointly decompose 4 NeuroGenesis genomics data with `SJD` package

Load data

Preprocess data

Apply the decomposition method

Plot

Plot each dim in each dataset individually

Assemble the images based on dataset

Assemble the images based on component

Introduction to SJD (Structured Joint Decomposition)

Huan Chen

Jinrui Liu

Shreyash Sonthalia

Guangyan Li

Carlo Colantuoni

02 October, 2022

Install and load SJD package

Jointly decompose 4 NeuroGenesis genomics data with SJD package

Load data

Preprocess data

Apply the decomposition method

Plot

Plot each dim in each dataset individually

Assemble the images based on dataset

Assemble the images based on component

Install and load `SJD` package

Jointly decompose 4 NeuroGenesis genomics data with `SJD` package