Analyze single-cell data with scGEAToolbox in Matlab.

Analysis of Single-Cell RNA-Sequencing Data using scGEAToolbox in Matlab

Part 1: Work in Matlab using scGEAToolbox for data analysis

(Work in Matlab as instructed. Put the source code in a .m file and the figures in a separate file.)

1.Use an edited version of the scGEAToolbox (Cai (2020) Bioinformatics) to analyze single-cell RNA-sequencing data. To apply the functions included in the scGEAToolbox, refer to the documentation https://www.mathworks.com/matlabcentral/fileexchange/72917-scgeatoolboxsingle-cell-gene-expression-analysis-toolbox (do not use the published scGEAToolbox online; use the edited version provided). Use the scatter() and gscatter() functions to plot and annotate UMAP embeddings. Also, use “deseq” to specify the type of data, instead of the function's default, “libsize”.

2.Add the two folders named “scGEAToolbox” and “dataset” to the Matlab path (both are provided in the attached files).

3.Load the “dataset” files into the Matlab workspace using the function sc_readmtxfile(). Use the readtable() function to load the metadata file.
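
The difference between the “deseq” and “libsize” options mentioned in step 1 can be illustrated without the toolbox. The sketch below is a minimal base-Matlab version of the two ideas (library-size scaling and DESeq-style median-of-ratios size factors) on a small made-up matrix. It is not the toolbox's implementation; the variable names and the 1e4 scale factor are assumptions chosen for illustration, and it relies on implicit expansion (R2016b or later).

```matlab
% Toy genes-by-cells count matrix (made-up values, 4 genes x 3 cells).
X = [10 20 40;
      5 10 20;
      2  4  8;
      0  3  1];

% Library-size scaling: divide each cell (column) by its total count,
% then rescale to a common total (1e4 is an arbitrary choice here).
libsize = sum(X, 1);
Xlib = X ./ libsize * 1e4;

% DESeq-style median-of-ratios size factors.
% Only genes detected in every cell have a finite log geometric mean.
usable = all(X > 0, 2);
logX = log(X(usable, :));
logGeoMean = mean(logX, 2);                      % per-gene log geometric mean
sizeFactor = exp(median(logX - logGeoMean, 1));  % one factor per cell
Xdeseq = X ./ sizeFactor;
```

With the provided toolbox on the path, the equivalent step is presumably a call along the lines of sc_norm(X, 'type', 'deseq'); check the help text of the edited version you were given, since its argument names may differ.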

Part 2: Answer each of the following questions in 100-200 words.

1.Dataset question
1)Examine the dimensions of your expression matrix. What do the rows and columns of the matrix represent? What does each individual count represent?

2)Normalize your expression matrix using the function sc_norm().

3)Plot the relationship between the log10 of the total number of transcripts and the cell barcode (rank order the barcodes from the largest to the smallest number of associated total transcripts). Log10 scale the cell barcode axis after rank ordering. Which area of this graph is more likely to contain low-quality cells? Which area of this graph is more likely to contain doublets or multiplets (i.e., multiple cells per cell barcode)? Explain your reasoning.

Hint: each value in the expression matrix is an inferred mRNA count, so the total number of transcripts per cell can be calculated by summing the counts. You should already have a list variable containing the cell barcodes, but you only need the total number of transcripts to create the knee plot; there is no need to use the actual barcode strings when plotting. Simply plot the total counts for each cell in descending order. This only requires a [1 x number of cells] data input. You can log scale the x-axis using set(gca, 'XScale', 'log'). For an example of what the plot should look like, search Google Images for “quality control knee plot”.
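
The hint above can be sketched as follows. This is a minimal example on synthetic counts, not the assignment dataset: the matrix X, its size, and the per-cell depth factors are made up, and 1 is added before taking log10 only to guard against barcodes with zero counts (an assumption you may not need for the real data).

```matlab
% Synthetic stand-in for the genes-by-cells count matrix
% (assumption: the real matrix would come from sc_readmtxfile()).
rng(1);
nGenes = 200;
nCells = 500;
depth = round(logspace(1, 3.5, nCells));          % made-up per-cell depth factors
X = floor(rand(nGenes, nCells) .* depth / 20);

% Total transcripts per barcode, ranked from largest to smallest.
totalCounts = sum(X, 1);
ranked = sort(totalCounts, 'descend');

% Knee plot: barcode rank (log-scaled x-axis) against log10 total counts.
figure;
plot(1:nCells, log10(ranked + 1), '.-');
set(gca, 'XScale', 'log');
xlabel('Barcode rank');
ylabel('log_{10}(total transcripts + 1)');
title('Knee plot (synthetic data)');
```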

4)Determine the number of unique samples in the dataset. “Sample”, found in the metadata table, refers to the patient from which each cell was sourced. Generate a set of violin plots that summarizes the total number of transcripts per cell across samples and another set of violin plots that summarizes the number of genes expressed per cell across samples. For every sample, provide the median number of transcripts and genes per cell.
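
The numeric part of question 4 (counting samples and computing per-sample medians) can be sketched with base-Matlab grouping functions. The example below uses a made-up count matrix and a made-up metadata table whose only column is “Sample”; the sample labels P1 to P3 are hypothetical. It treats a gene as expressed in a cell when its count is greater than zero, which is an assumption. The violin plots themselves are not shown, since the plotting function available depends on the provided toolbox and your Matlab release.

```matlab
% Made-up counts and metadata (stand-ins for the real dataset).
rng(2);
nGenes = 100;
nCells = 60;
X = floor(rand(nGenes, nCells) * 3);
meta = table(repmat({'P1'; 'P2'; 'P3'}, nCells / 3, 1), ...
    'VariableNames', {'Sample'});

% Per-cell summaries, as column vectors aligned with the metadata rows.
totalCounts = sum(X, 1)';          % total transcripts per cell
genesDetected = sum(X > 0, 1)';    % genes with at least one count per cell

% Group cells by sample and take the median within each group.
[g, sampleNames] = findgroups(meta.Sample);
numSamples = numel(sampleNames);
medCounts = splitapply(@median, totalCounts, g);
medGenes = splitapply(@median, genesDetected, g);

summary = table(sampleNames, medCounts, medGenes, ...
    'VariableNames', {'Sample', 'MedianTranscripts', 'MedianGenes'});
disp(summary);
```

The same grouping vector g can be passed to whichever violin-plot function you use, so that the plots and the table are guaranteed to share one definition of “sample”.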

2.Dimensionality reduction and visualization using Uniform Manifold Approximation and Projection (UMAP)

Hint for the following questions: the data matrix used for normalization, dimension reduction and plotting should be a 23686 x 7930 matrix of doubles. sc_readmtxfile() should also output a gene list of length 23686 and a barcode/cell list of length 7930. The metadata file should be loaded using readtable(); its output provides information about each single cell.

All in all, you should have:

23686 x 7930 expression matrix (used for normalization and plotting)
23686 x 1 gene list
7930 x 1 barcode list (one barcode per single cell)
7930 x 19 metadata table

1)Plot the relationship between mean expression level and dispersion for the genes in the dataset. Dispersion can be measured using the squared coefficient of variation. Log scale both the x- and y-axes. Where would over-dispersed genes lie in the graph? Why are over-dispersed genes frequently chosen as features when modeling single-cell transcriptomics data?

Hint: use the normalized expression matrix here. When determining the mean expression of each gene, it does not matter if the expression values are small, and nothing further needs to be done to the normalized expression matrix before taking the mean.
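
The mean-versus-dispersion plot can be sketched as below, with the squared coefficient of variation computed as variance divided by squared mean for each gene. The matrix Xnorm here is synthetic (a stand-in for the normalized expression matrix), and the first 30 genes are deliberately made sparse so that some genes are visibly over-dispersed; all sizes and scales are made up.

```matlab
% Synthetic stand-in for a normalized genes-by-cells matrix.
rng(3);
nGenes = 300;
nCells = 200;
geneScale = logspace(-1, 2, nGenes)';          % made-up per-gene expression levels
Xnorm = rand(nGenes, nCells) .* geneScale;

% Make the first 30 genes "bursty": expressed in roughly 10% of cells.
mask = rand(30, nCells) < 0.1;
Xnorm(1:30, :) = Xnorm(1:30, :) .* mask * 10;

% Per-gene mean and squared coefficient of variation (CV^2 = var / mean^2).
mu = mean(Xnorm, 2);
sigma2 = var(Xnorm, 0, 2);
cv2 = sigma2 ./ (mu .^ 2);

% Genes with zero mean have an undefined CV^2, so leave them out.
keep = mu > 0;
figure;
loglog(mu(keep), cv2(keep), '.');
xlabel('Mean expression');
ylabel('CV^2');
title('Mean vs. dispersion (synthetic data)');
```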

2)Using the sc_umap() function, generate a UMAP embedding of the single-cell transcriptomes in the dataset using the top 2000 over-dispersed genes as features. Over-dispersed genes can be determined using the sc_hvg() function. What is UMAP attempting to summarize? Organize your answer by referring to the local and global differences in the distribution of individual samples in UMAP space.

3)Regenerate the UMAP embedding by varying the min_dist (e.g., 0.1-0.8) and n_neighbors (e.g., 3-45) parameters in sc_umap(). In addition to the ndim, donorm and dolog1p parameters found in the documentation, we have added min_dist and n_neighbors as the fourth and fifth parameters, respectively. Let ndim=3, donorm=true and dolog1p=true for each variation of min_dist and n_neighbors in this question. How do these parameters alter the embedding? Do smaller values for min_dist optimize for summarizing local or large-scale relationships?
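
One way to organize the parameter sweep in question 3 is to build the grid of (min_dist, n_neighbors) pairs first and loop over it. The sketch below does that in base Matlab. The specific grid values are arbitrary picks within the suggested ranges, Xhvg is a hypothetical variable holding the expression matrix restricted to the top over-dispersed genes, and the positional argument order passed to sc_umap() is an assumption based on the description above, so check it against the edited toolbox before relying on it. The call is skipped when the toolbox or the data is not available.

```matlab
% Grid of parameter values (arbitrary choices within the suggested ranges).
minDistValues = [0.1 0.3 0.5 0.8];
nNeighborValues = [3 15 30 45];
[MD, NN] = ndgrid(minDistValues, nNeighborValues);
paramGrid = [MD(:), NN(:)];     % one row per (min_dist, n_neighbors) pair

% Fixed settings required by the question.
ndim = 3;
donorm = true;
dolog1p = true;

embeddings = cell(size(paramGrid, 1), 1);

% Only attempt the embedding if the toolbox function and the data exist.
% Xhvg is a hypothetical name for the matrix of top over-dispersed genes.
haveData = exist('sc_umap', 'file') == 2 && exist('Xhvg', 'var') == 1;

for k = 1:size(paramGrid, 1)
    if haveData
        % ASSUMPTION: argument order is
        % (X, ndim, donorm, dolog1p, min_dist, n_neighbors).
        embeddings{k} = sc_umap(Xhvg, ndim, donorm, dolog1p, ...
            paramGrid(k, 1), paramGrid(k, 2));
    end
end
```

Keeping every embedding in one cell array makes it straightforward to plot them side by side afterwards and compare how cluster tightness and overall layout change across the grid.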