Slingshot: Trajectory Inference for Single-Cell Data

Slingshot是一种用于单细胞数据分析的工具,它主要用于重建细胞谱系结构并推断细胞的伪时间顺序。该方法首先通过基因过滤、归一化和降维处理数据,然后进行聚类。使用Slingshot时,可以构建基于聚类的最小生成树来识别全局谱系结构,并拟合平滑曲线来表示每个谱系。此外,文章还介绍了如何在大型数据集上运行Slingshot以及如何将新细胞投影到已存在的轨迹上。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Slingshot: Trajectory Inference for Single-Cell Data

2023-04-25

Contents

1Introduction

This vignette will demonstrate a full single-cell lineage analysis workflow, with particular emphasis on the processes of lineage reconstruction and pseudotime inference. We will make use of the slingshot package proposed in (Street et al. 2017) and show how it may be applied in a broad range of settings.

The goal of slingshot is to use clusters of cells to uncover global structure and convert this structure into smooth lineages represented by one-dimensional variables, called “pseudotime.” We provide tools for learning cluster relationships in an unsupervised or semi-supervised manner and constructing smooth curves representing each lineage, along with visualization methods for each step.

1.1Overview

The minimal input to slingshot is a matrix representing the cells in a reduced-dimensional space and a vector of cluster labels. With these two inputs, we then:

  • Identify the global lineage structure by constructing an minimum spanning tree (MST) on the clusters, with the getLineages function.
  • Construct smooth lineages and infer pseudotime variables by fitting simultaneous principal curves with the getCurves function.
  • Assess the output of each step with built-in visualization tools.

1.2Datasets

We will work with two simulated datasets in this vignette. The first (referred to as the “single-trajectory” dataset) is generated below and designed to represent a single lineage in which one third of the genes are associated with the transition. This dataset will be contained in a SingleCellExperiment object (Lun and Risso 2017) and will be used to demonstrate a full “start-to-finish” workflow.

# generate synthetic count data representing a single lineage
means <- rbind(
    # non-DE genes
    matrix(rep(rep(c(0.1,0.5,1,2,3), each = 300),100),
        ncol = 300, byrow = TRUE),
    # early deactivation
    matrix(rep(exp(atan( ((300:1)-200)/50 )),50), ncol = 300, byrow = TRUE),
    # late deactivation
    matrix(rep(exp(atan( ((300:1)-100)/50 )),50), ncol = 300, byrow = TRUE),
    # early activation
    matrix(rep(exp(atan( ((1:300)-100)/50 )),50), ncol = 300, byrow = TRUE),
    # late activation
    matrix(rep(exp(atan( ((1:300)-200)/50 )),50), ncol = 300, byrow = TRUE),
    # transient
    matrix(rep(exp(atan( c((1:100)/33, rep(3,100), (100:1)/33) )),50), 
        ncol = 300, byrow = TRUE)
)
counts <- apply(means,2,function(cell_means){
    total <- rnbinom(1, mu = 7500, size = 4)
    rmultinom(1, total, cell_means)
})
rownames(counts) <- paste0('G',1:750)
colnames(counts) <- paste0('c',1:300)
sce <- SingleCellExperiment(assays = List(counts = counts))

The second dataset (the “bifurcating” dataset) consists of a matrix of coordinates (as if obtained by PCA, ICA, diffusion maps, etc.) along with cluster labels generated by k�-means clustering. This dataset represents a bifurcating trajectory and it will allow us to demonstrate some of the additional functionality offered by slingshot.

library(slingshot, quietly = FALSE)
data("slingshotExample")
rd <- slingshotExample$rd
cl <- slingshotExample$cl

dim(rd) # data representing cells in a reduced dimensional space
## [1] 140   2
length(cl) # vector of cluster labels
## [1] 140

2Upstream Analysis

2.1Gene Filtering

To begin our analysis of the single lineage dataset, we need to reduce the dimensionality of our data and filtering out uninformative genes is a typical first step. This will greatly improve the speed of downstream analyses, while keeping the loss of information to a minimum.

For the gene filtering step, we retained any genes robustly expressed in at least enough cells to constitute a cluster, making them potentially interesting cell-type marker genes. We set this minimum cluster size to 10 cells and define a gene as being “robustly expressed” if it has a simulated count of at least 3 reads.

# filter genes down to potential cell-type markers
# at least M (15) reads in at least N (15) cells
geneFilter <- apply(assays(sce)$counts,1,function(x){
    sum(x >= 3) >= 10
})
sce <- sce[geneFilter, ]

2.2Normalization

Another important early step in most RNA-Seq analysis pipelines is the choice of normalization method. This allows us to remove unwanted technical or biological artifacts from the data, such as batch, sequencing depth, cell cycle effects, etc. In practice, it is valuable to compare a variety of normalization techniques and compare them along different evaluation criteria, for which we recommend the scone package (Cole and Risso 2018). We also note that the order of these steps may change depending on the choice of method. ZINB-WaVE (Risso et al. 2018) performs dimensionality reduction while accounting for technical variables and MNN (Haghverdi et al. 2018) corrects for batch effects after dimensionality reduction.

Since we are working with simulated data, we do not need to worry about batch effects or other potential confounders. Hence, we will proceed with full quantile normalization, a well-established method which forces each cell to have the same distribution of expression values.

FQnorm <- function(counts){
    rk <- apply(counts,2,rank,ties.method='min')
    counts.sort <- apply(counts,2,sort)
    refdist <- apply(counts.sort,1,median)
    norm <- apply(rk,2,function(r){ refdist[r] })
    rownames(norm) <- rownames(counts)
    return(norm)
}
assays(sce)$norm <- FQnorm(assays(sce)$counts)

2.3Dimensionality Reduction

The fundamental assumption of slingshot is that cells which are transcriptionally similar will be close to each other in some reduced-dimensional space. Since we use Euclidean distances in constructing lineages and measuring pseudotime, it is important to have a low-dimensional representation of the data.

There are many methods available for this task and we will intentionally avoid the issue of determining which is the “best” method, as this likely depends on the type of data, method of collection, upstream computational choices, and many other factors. We will demonstrate two methods of dimensionality reduction: principal components analysis (PCA) and uniform manifold approximation and projection (UMAP, via the uwot package).

When performing PCA, we do not scale the genes by their variance because we do not believe that all genes are equally informative. We want to find signal in the robustly expressed, highly variable genes, not dampen this signal by forcing equal variance across genes. When plotting, we make sure to set the aspect ratio, so as not to distort the perceived distances.

pca <- prcomp(t(log1p(assays(sce)$norm)), scale. = FALSE)
rd1 <- pca$x[,1:2]

plot(rd1, col = rgb(0,0,0,.5), pch=16, asp = 1)

library(uwot)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
rd2 <- uwot::umap(t(log1p(assays(sce)$norm)))
colnames(rd2) <- c('UMAP1', 'UMAP2')

plot(rd2, col = rgb(0,0,0,.5), pch=16, asp = 1)

We will add both dimensionality reductions to the SingleCellExperiment object, but continue our analysis focusing on the PCA results.

reducedDims(sce) <- SimpleList(PCA = rd1, UMAP = rd2)

2.4Clustering Cells

The final input to slingshot is a vector of cluster labels for the cells. If this is not provided, slingshot will treat the data as a single cluster and fit a standard principal curve. However, we recommend clustering the cells even in datasets where only a single lineage is expected, as it allows for the potential discovery of novel branching events.

The clusters identified in this step will be used to determine the global structure of the underlying lineages (that is, their number, when they branch off from one another, and the approximate locations of those branching events). This is different than the typical goal of clustering single-cell data, which is to identify all biologically relevant cell types present in the dataset. For example, when determining global lineage structure, there is no need to distinguish between immature and mature neurons since both cell types will, presumably, fall along the same segment of a lineage.

For our analysis, we implement two clustering methods which similarly assume that Euclidean distance in a low-dimensional space reflect biological differences between cells: Gaussian mixture modeling and k�-means. The former is implemented in the mclust package (Scrucca et al. 2016) and features an automated method for determining the number of clusters based on the Bayesian information criterion (BIC).

library(mclust, quietly = TRUE)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:mgcv':
## 
##     mvn
cl1 <- Mclust(rd1)$classification
colData(sce)$GMM <- cl1

library(RColorBrewer)
plot(rd1, col = brewer.pal(9,"Set1")[cl1], pch=16, asp = 1)

While k�-means does not have a similar functionality, we have shown in (Street et al. 2017) that simultaneous principal curves are quite robust to the choice of k�, so we select a k� of 4 somewhat arbitrarily. If this is too low, we may miss a true branching event and if it is too high or there is an abundance of small clusters, we may begin to see spurious branching events.

cl2 <- kmeans(rd1, centers = 4)$cluster
colData(sce)$kmeans <- cl2

plot(rd1, col = brewer.pal(9,"Set1")[cl2], pch=16, asp = 1)

3Using Slingshot

At this point, we have everything we need to run slingshot on our simulated dataset. This is a two-step process composed of identifying the global lineage structure with a cluster-based minimum spanning tree (MST) and fitting simultaneous principal curves to describe each lineage.

These two steps can be run separately with the getLineages and getCurves functions, or together with the wrapper function, slingshot (recommended). We will use the wrapper function for the analysis of the single-trajectory dataset, but demonstrate the usage of the individual functions later, on the bifurcating dataset.

The slingshot wrapper function performs both steps of trajectory inference in a single call. The necessary inputs are a reduced dimensional matrix of coordinates and a set of cluster labels. These can be separate objects or, in the case of the single-trajectory data, elements contained in a SingleCellExperiment object.

To run slingshot with the dimensionality reduction produced by PCA and cluster labels identified by Gaussian mixutre modeling, we would do the following:

sce <- slingshot(sce, clusterLabels = 'GMM', reducedDim = 'PCA')

As noted above, if no clustering results are provided, it is assumed that all cells are part of the same cluster and a single curve will be constructed. If no dimensionality reduction is provided, slingshot will use the first element of the list returned by reducedDims.

The output is a SingleCellExperiment object with slingshot results incorporated. All of the results are stored in a PseudotimeOrdering object, which is added to the colData of the original object and can be accessed via colData(sce)$slingshot. Additionally, all inferred pseudotime variables (one per lineage) are added to the colData, individually. To extract all slingshot results in a single object, we can use either the as.PseudotimeOrdering or as.SlingshotDataSet functions, depending on the form in which we want it. PseudotimeOrdering objects are an extension of SummarizedExperiment objects, which are flexible containers that will be useful for most purposes. SlingshotDataSet objects are primarily used for visualization, as several plotting methods are included with the package. Below, we visuzalize the inferred lineage for the single-trajectory data with points colored by pseudotime.

summary(sce$slingPseudotime_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.631  21.121  21.414  34.363  43.185
library(grDevices)
colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(sce$slingPseudotime_1, breaks=100)]

plot(reducedDims(sce)$PCA, col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(sce), lwd=2, col='black')

We can also see how the lineage structure was intially estimated by the cluster-based minimum spanning tree by using the type argument.

plot(reducedDims(sce)$PCA, col = brewer.pal(9,'Set1')[sce$GMM], pch=16, asp = 1)
lines(SlingshotDataSet(sce), lwd=2, type = 'lineages', col = 'black')

4Downstream Analysis

4.1Identifying temporally dynamic genes

After running slingshot, we are often interested in finding genes that change their expression over the course of development. We will demonstrate this type of analysis using the tradeSeq package (Van den Berge et al. 2020).

For each gene, we will fit a general additive model (GAM) using a negative binomial noise distribution to model the (potentially nonlinear) relationshipships between gene expression and pseudotime. We will then test for significant associations between expression and pseudotime using the associationTest.

library(tradeSeq)

# fit negative binomial GAM
sce <- fitGAM(sce)

# test for dynamic expression
ATres <- associationTest(sce)

We can then pick out the top genes based on p-values and visualize their expression over developmental time with a heatmap. Here we use the top 250 most dynamically expressed genes.

topgenes <- rownames(ATres[order(ATres$pvalue), ])[1:250]
pst.ord <- order(sce$slingPseudotime_1, na.last = NA)
heatdata <- assays(sce)$counts[topgenes, pst.ord]
heatclus <- sce$GMM[pst.ord]

heatmap(log1p(heatdata), Colv = NA,
        ColSideColors = brewer.pal(9,"Set1")[heatclus])

5Detailed Slingshot Functionality

Here, we provide further details and highlight some additional functionality of the slingshot package. We will use the included slingshotExample dataset for illustrative purposes. This dataset was designed to represent cells in a low dimensional space and comes with a set of cluster labels generated by k�-means clustering. Rather than constructing a full SingleCellExperiment object, which requires gene-level data, we will use the low-dimensional matrix of coordinates directly and provide the cluster labels as an additional argument.

5.1Identifying global lineage structure

The getLineages function takes as input an n ×× p matrix and a vector of clustering results of length n. It maps connections between adjacent clusters using a minimum spanning tree (MST) and identifies paths through these connections that represent lineages. The output of this function is a PseudotimeOrdering containing the inputs as well as the inferred MST (represented by an igraph object) and lineages (ordered vectors of cluster names).

This analysis can be performed in an entirely unsupervised manner or in a semi-supervised manner by specifying known initial and terminal point clusters. If we do not specify a starting point, slingshot selects one based on parsimony, maximizing the number of clusters shared between lineages before a split. If there are no splits or multiple clusters produce the same parsimony score, the starting cluster is chosen arbitrarily.

In the case of our simulated data, slingshot selects Cluster 1 as the starting cluster. However, we generally recommend the specification of an initial cluster based on prior knowledge (either time of sample collection or established gene markers). This specification will have no effect on how the MST is constructed, but it will impact how branching curves are constructed.

lin1 <- getLineages(rd, cl, start.clus = '1')
lin1
## class: PseudotimeOrdering 
## dim: 140 2 
## metadata(3): lineages mst slingParams
## pathStats(2): pseudotime weights
## cellnames(140): cell-1 cell-2 ... cell-139 cell-140
## cellData names(2): reducedDim clusterLabels
## pathnames(2): Lineage1 Lineage2
## pathData names(0):
plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(lin1), lwd = 3, col = 'black')

At this step, slingshot also allows for the specification of known endpoints. Clusters which are specified as terminal cell states will be constrained to have only one connection when the MST is constructed (ie., they must be leaf nodes). This constraint could potentially impact how other parts of the tree are drawn, as shown in the next example where we specify Cluster 3 as an endpoint.

lin2 <- getLineages(rd, cl, start.clus= '1', end.clus = '3')

plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(lin2), lwd = 3, col = 'black', show.constraints = TRUE)

This type of supervision can be useful for ensuring that results are consistent with prior biological knowledge. Specifically, it can prevent known terminal cell fates from being categorized as transitory states.

There are a few additional arguments we could have passed to getLineages for greater control:

  • dist.method is a character specifying how the distances between clusters should be computed. The default value is "slingshot", which uses the distance between cluster centers, normalized by their full, joint covariance matrix. In the presence of small clusters (fewer cells than the dimensionality of the data), this will automatically switch to using the diagonal joint covariance matrix. Other options include simple (Euclidean), scaled.fullscaled.diag, and mnn (mutual nearest neighbor-based distance).
  • omega is a granularity parameter, allowing the user to set an upper limit on connection distances. It represents the distance between every real cluster and an artificial .OMEGA cluster, which is removed after fitting the MST. This can be useful for identifying outlier clusters which are not a part of any lineage or for separating distinct trajectories. This may be a numeric argument or a logical value. A value of TRUE indicates that a heuristic of 1.51.5 (median edge length of unsupervised MST) should be used, implying that the maximum allowable distance will be 33 times that distance.

After constructing the MST, getLineages identifies paths through the tree to designate as lineages. At this stage, a lineage will consist of an ordered set of cluster names, starting with the root cluster and ending with a leaf. The output of getLineages is a PseudotimeOrdering which holds all of the inputs as well as a list of lineages and some additional information on how they were constructed. This object is then used as the input to the getCurves function.

5.2Constructing smooth curves and ordering cells

In order to model development along these various lineages, we will construct smooth curves with the function getCurves. Using smooth curves based on all the cells eliminates the problem of cells projecting onto vertices of piece-wise linear trajectories and makes slingshot more robust to noise in the clustering results.

In order to construct smooth lineages, getCurves follows an iterative process similar to that of principal curves presented in (Hastie and Stuetzle 1989). When there is only a single lineage, the resulting curve is simply the principal curve through the center of the data, with one adjustment: the initial curve is constructed with the linear connections between cluster centers rather than the first prinicpal component of the data. This adjustment adds stability and typically hastens the algorithm’s convergence.

When there are two or more lineages, we add an additional step to the algorithm: averaging curves near shared cells. Both lineages should agree fairly well on cells that have yet to differentiate, so at each iteration we average the curves in the neighborhood of these cells. This increases the stability of the algorithm and produces smooth branching lineages.

crv1 <- getCurves(lin1)
crv1
## class: PseudotimeOrdering 
## dim: 140 2 
## metadata(4): lineages mst slingParams curves
## pathStats(2): pseudotime weights
## cellnames(140): cell-1 cell-2 ... cell-139 cell-140
## cellData names(2): reducedDim clusterLabels
## pathnames(2): Lineage1 Lineage2
## pathData names(0):
plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(crv1), lwd = 3, col = 'black')

The output of getCurves is an updated PseudotimeOrdering which now contains the simultaneous principal curves and additional information on how they were fit. The slingPseudotime function extracts a cells-by-lineages matrix of pseudotime values for each cell along each lineage, with NA values for cells that were not assigned to a particular lineage. The slingCurveWeights function extracts a similar matrix of weights assigning each cell to one or more lineages.

The curves objects can be accessed using the slingCurves function, which will return a list of principal_curve objects. These objects consist of the following slots:

  • s: the matrix of points that make up the curve. These correspond to the orthogonal projections of the data points.
  • ord: indices which can be used to put the cells along a curve in order based their projections.
  • lambda: arclengths along the curve from the beginning to each cell’s projection. A matrix of these values is returned by the slingPseudotime function.
  • dist_ind: the squared distances between data points and their projections onto the curve.
  • dist: the sum of the squared projection distances.
  • w: the vector of weights along this lineage. Cells that were unambiguously assigned to this lineage will have a weight of 1, while cells assigned to other lineages will have a weight of 0. It is possible for cells to have weights of 1 (or very close to 1) for multiple lineages, if they are positioned before a branching event. A matrix of these values is returned by the slingCurveWeights function.

5.3Running Slingshot on large datasets

For large datasets, we highgly recommend using the approx_points argument with slingshot (or getCurves). This allows the user to specify the resolution of the curves (ie. the number of unique points). While the MST construction operates on clusters, the process of iteratively projecting all points onto one or more curves may become computationally burdensome as the size of the dataset grows. For this reason, we set the default value for approx_points to either 150150 or the number of cells in the dataset, whichever is smaller. This should greatly ease the computational cost of exploratory analysis while having minimal impact on the resulting trajectories.

For maximally “dense” curves, set approx_points = FALSE, and the curves will have as many points as there are cells in the dataset. However, note that each projection step in the iterative curve-fitting process will now have a computational complexity proportional to n2�2 (where n� is the number of cells). In the presence of branching lineages, these dense curves will also affect the complexity of curve averaging and shrinkage.

We recommend a value of 100100-200200, hence the default value of 150150. Note that restricting the number of unique points along the curve does not impose a similar limit on the number of unique pseudotime values, as demonstrated below. Even with an unrealistically low value of 55 for approx_points, we still see a full color gradient from the pseudotime values:

sce5 <- slingshot(sce, clusterLabels = 'GMM', reducedDim = 'PCA',
                   approx_points = 5)

colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(sce5$slingPseudotime_1, breaks=100)]

plot(reducedDims(sce5)$PCA, col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(sce5), lwd=2, col='black')

5.4Multiple Trajectories

In some cases, we are interested in identifying multiple, disjoint trajectories. Slingshot handles this problem in the initial MST construction by introducing an artificial cluster, called omega. This artificial cluster is separated from every real cluster by a fixed length, meaning that the maximum distance between any two real clusters is twice this length. Practically, this sets a limit on the maximum edge length allowable in the MST. Setting omega = TRUE will implement a rule of thumb whereby the maximum allowable edge length is equal to 33 times the median edge length of the MST constructed without the artificial cluster (note: this is equivalent to saying that the default value for omega_scale is 1.51.5).

rd2 <- rbind(rd, cbind(rd[,2]-12, rd[,1]-6))
cl2 <- c(cl, cl + 10)
pto2 <- slingshot(rd2, cl2, omega = TRUE, start.clus = c(1,11))

plot(rd2, pch=16, asp = 1,
     col = c(brewer.pal(9,"Set1"), brewer.pal(8,"Set2"))[cl2])
lines(SlingshotDataSet(pto2), type = 'l', lwd=2, col='black')

After fitting the MST, slingshot proceeds to fit simultaneous principal curves as usual, with each trajectory being handled separately.

plot(rd2, pch=16, asp = 1,
     col = c(brewer.pal(9,"Set1"), brewer.pal(8,"Set2"))[cl2])
lines(SlingshotDataSet(pto2), lwd=2, col='black')

5.5Projecting Cells onto Existing Trajectories

Sometimes, we may want to use only a subset of the cells to determine a trajectory, or we may get new data that we want to project onto an existing trajectory. In either case, we will need a way to determine the positions of new cells along a previously constructed trajectory. For this, we can use the predict function (since this function is not native to slingshot, see ?`predict,PseudotimeOrdering-method` for documentation).

# our original PseudotimeOrdering
pto <- sce$slingshot

# simulate new cells in PCA space
newPCA <- reducedDim(sce, 'PCA') + rnorm(2*ncol(sce), sd = 2)

# project onto trajectory
newPTO <- slingshot::predict(pto, newPCA)

This will yield a new, hybrid object with the trajectories (curves) from the original data, but the pseudotime values and weights for the new cells. For reference, the original cells are shown in grey below, but they are not included in the output from predict.

newplotcol <- colors[cut(slingPseudotime(newPTO)[,1], breaks=100)]
plot(reducedDims(sce)$PCA, col = 'grey', bg = 'grey', pch=21, asp = 1,
     xlim = range(newPCA[,1]), ylim = range(newPCA[,2]))
lines(SlingshotDataSet(sce), lwd=2, col = 'black')
points(slingReducedDim(newPTO), col = newplotcol, pch = 16)

6Session Info

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] mclust_6.0.0                uwot_0.1.14                
##  [3] Matrix_1.5-4                mgcv_1.8-42                
##  [5] nlme_3.1-162                RColorBrewer_1.1-3         
##  [7] slingshot_2.8.0             TrajectoryUtils_1.8.0      
##  [9] princurve_2.1.6             SingleCellExperiment_1.22.0
## [11] SummarizedExperiment_1.30.0 Biobase_2.60.0             
## [13] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
## [15] IRanges_2.34.0              S4Vectors_0.38.0           
## [17] BiocGenerics_0.46.0         MatrixGenerics_1.12.0      
## [19] matrixStats_0.63.0          BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.5                bitops_1.0-7             
##  [3] lattice_0.21-8            digest_0.6.31            
##  [5] magrittr_2.0.3            evaluate_0.20            
##  [7] sparseMatrixStats_1.12.0  grid_4.3.0               
##  [9] bookdown_0.33             fastmap_1.1.1            
## [11] jsonlite_1.8.4            BiocManager_1.30.20      
## [13] codetools_0.2-19          jquerylib_0.1.4          
## [15] cli_3.6.1                 rlang_1.1.0              
## [17] XVector_0.40.0            splines_4.3.0            
## [19] cachem_1.0.7              DelayedArray_0.26.0      
## [21] yaml_2.3.7                FNN_1.1.3.2              
## [23] tools_4.3.0               GenomeInfoDbData_1.2.10  
## [25] R6_2.5.1                  magick_2.7.4             
## [27] zlibbioc_1.46.0           irlba_2.3.5.1            
## [29] pkgconfig_2.0.3           bslib_0.4.2              
## [31] Rcpp_1.0.10               xfun_0.39                
## [33] highr_0.10                knitr_1.42               
## [35] htmltools_0.5.5           igraph_1.4.2             
## [37] rmarkdown_2.21            DelayedMatrixStats_1.22.0
## [39] compiler_4.3.0            RCurl_1.98-1.12

References

Cole, Michael, and Davide Risso. 2018. Scone: Single Cell Overview of Normalized Expression Data.

Haghverdi, Laleh, Aaron T. L. Lun, Michael D. Morgan, and John C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5): 421–27. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors | Nature Biotechnology.

Hastie, Trevor, and Werner Stuetzle. 1989. “Principal Curves.” Journal of the American Statistical Association 84 (406): 502–16.

Lun, Aaron, and Davide Risso. 2017. SingleCellExperiment: S4 Classes for Single Cell Data.

Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. 2018. “A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” Nature Communications 9: 284. A general and flexible method for signal extraction from single-cell RNA-seq data | Nature Communications.

Scrucca, Luca, Michael Fop, Thomas Brendan Murphy, and Adrian E. Raftery. 2016. “mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models.” The R Journal 8 (1): 205–33.

Street, Kelly, Davide Risso, Russell B Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2017. “Slingshot: Cell Lineage and Pseudotime Inference for Single-Cell Transcriptomics.” bioRxivhttps://doi.org/10.1101/128843.

Van den Berge, Koen, Hector Roux de Bézieux, Kelly Street, Wouter Saelens, Robrecht Cannoodt, Yvan Saeys, Sandrine Dudoit, and Lieven Clement. 2020. “Trajectory-Based Differential Expression Analysis for Single-Cell Sequencing Data.” Nature Communications 11 (1): 1201. Trajectory-based differential expression analysis for single-cell sequencing data | Nature Communications.

内容概要:本文详细介绍了C语言指针和字符串操作的基础知识与高级技巧。指针部分涵盖了指针作为数据类型的特点,包括指针变量的定义、间接赋值的应用场景及其重要性,以及不同级别的指针如何在函数间传递并修改实参的值。同时强调了指针操作的安全性问题,如不允许向NULL或未知地址拷贝内存,并讲解了`void*`指针的作用及其转换规则。字符串操作部分则重点讨论了字符串初始化、`sizeof`与`strlen`的区别、字符`\0`的作用及其与其他符号的区别,还展示了数组法和指针法两种操作字符串的方式,并给出了几个常见的字符串处理算法实例,如统计子串出现次数、去除字符串两端空白字符等。 适用人群:具有初步C语言基础的学习者,特别是对指针和字符串操作有进一步需求的编程人员。 使用场景及目标:①帮助读者深入理解指针的工作机制,掌握通过指针间接访问和修改内存的技术;②使读者能够熟练运用字符串操作的基本函数,并能编写高效的字符串处理代码;③培养读者的安全意识,避免因不当使用指针而导致程序崩溃或产生未定义行为。 阅读建议:由于指针和字符串是C语言中较为复杂的概念,建议读者在学习过程中多做笔记,动手实践书中的示例代码,尤其要注意理解指针间接赋值的原理,以及字符串处理函数的具体实现细节。此外,对于`void*`指针的理解和使用,应特别留意其类型转换的要求,确保代码的安全性和正确性。
# 单细胞转录组解析肺纤维化的分子机制与细胞异质性 **摘要** 本研究旨在通过单细胞转录组测序技术解析肺纤维化(PF)的分子机制及细胞异质性。选取临床确诊的肺纤维化患者组织样本(n=30)与健康对照(n=15),利用10x Genomics平台进行单细胞RNA测序,通过Seurat进行细胞聚类、差异基因分析和功能富集。结果显示:1. 肺纤维化组织中成纤维细胞亚群显著扩增,高表达COL1A1、ACTA2(α-SMA);2. 肺泡上皮细胞中异常激活的TGF-β信号通路与上皮-间质转化(EMT)相关基因(如SNAI1、VIM)显著上调;3. 巨噬细胞亚群M2型极化特征明显,与纤维化区域共定位。结论:单细胞转录组揭示了肺纤维化中关键细胞类型的动态变化及分子互作网络,为靶向治疗提供新思路。 **关键词**:肺纤维化;单细胞转录组;成纤维细胞活化;上皮-间质转化;TGF-β信号通路 **引言** 肺纤维化是一种以肺间质瘢痕形成为特征的致命性疾病,目前治疗手段有限。近年来,单细胞技术为解析其细胞异质性提供了新工具。国内外研究多聚焦于组织整体转录组,但对特定细胞亚群的贡献尚不明确。本研究通过单细胞测序,旨在解决以下问题:1. 肺纤维化中哪些细胞亚群驱动疾病进展?2. 关键信号通路如何调控细胞间互作?论文首先综述PF的病理机制研究现状,继而提出单细胞解析的必要性,最后阐明研究内容与框架。 **研究内容与方法** 1. **研究对象**:纳入特发性肺纤维化(IPF)患者30例(符合ATS/ERS诊断标准),健康对照15例(肺移植供体)。排除合并感染或恶性肿瘤者。 2. **实验流程**:新鲜组织经消化后获取单细胞悬液,10x Genomics建库,Illumina Novaseq测序。数据经Cell Ranger预处理,Seurat(v4.0)进行质控、标准化及聚类(分辨率=0.8)。 3. **分析策略**:差异基因鉴定(Wilcoxon检验),通路富集(KEGG/GO),细胞互作预测(CellChat)。 **研究结果** 1. 鉴定出12个细胞簇,包括成纤维细胞(COL1A1+)、肺泡上皮细胞(SFTPC+)、巨噬细胞(CD68+)等。 2. 成纤维细胞中肌成纤维细胞亚群(ACTA2+)占比增加3.2倍(P<0.001),与胶原沉积呈正相关(r=0.78)。 3. 肺泡Ⅱ型上皮细胞中EMT评分显著升高(P=0.003),与TGF-β1表达量正相关。 **讨论** 1. 肌成纤维细胞扩增验证了既往组织学研究,但单细胞层面揭示了其异质性来源(如部分源自EMT)。 2. TGF-β通路的核心作用与动物模型一致,但人类数据提示巨噬细胞-上皮细胞串扰可能为潜在干预靶点。 3. 局限性:样本量较小,未涵盖疾病动态进程;未来需结合空间转录组验证细胞定位。 **结论** 本研究系统描绘了肺纤维化的单细胞图谱,明确了关键细胞亚群及分子事件,为开发精准治疗策略奠定基础。后续研究将聚焦于调控EMT的小分子抑制剂筛选。 **参考文献** (示例,需按GB/T 7714补充) 1. Adams, T.S., et al. (2020). *Nature*. 2. Habermann, A.C., et al. (2020). *Nat Med*. 3. 王某某等. (2021). 中华医学杂志. --- **注**:以上内容严格遵循图示规范,摘要包含四要素,关键词为专业术语,讨论覆盖结果机制与不足,字数与格式符合要求。实际写作需补充具体数据与完整参考文献。 请按照以上内容生成相关结果图
07-11
To generate visual results based on single-cell transcriptomic analysis of pulmonary fibrosis, including cell clustering, gene expression patterns, and pathway activation, a combination of bioinformatics tools and visualization techniques is applied. The process involves analyzing high-dimensional gene expression data to reveal cellular heterogeneity, identify disease-associated subpopulations, and explore functional pathways. --- ### Perform Cell Clustering Using Dimensionality Reduction Cell clustering in single-cell RNA sequencing (scRNA-seq) typically begins with dimensionality reduction using Principal Component Analysis (PCA), followed by non-linear methods such as Uniform Manifold Approximation and Projection (UMAP) or t-distributed Stochastic Neighbor Embedding (t-SNE). These techniques help visualize the separation of cell populations and define clusters for further annotation. ```python # Example UMAP visualization using Scanpy (Python) import scanpy as sc adata = sc.read_h5ad("pulmonary_fibrosis_data.h5ad") sc.pp.pca(adata) sc.pp.neighbors(adata) sc.tl.umap(adata) sc.pl.umap(adata, color="leiden") ``` Clustering algorithms such as Leiden or Louvain are commonly used to group cells based on their transcriptomic similarity. --- ### Visualize Gene Expression Patterns Across Clusters Once clusters are defined, differentially expressed genes can be identified and visualized across cell types. Heatmaps, violin plots, and feature plots are frequently used to display gene expression levels within each cluster. For instance, in pulmonary fibrosis studies, genes like *ACTA2*, *COL1A1*, and *MMP9* may show distinct expression profiles among fibroblasts, epithelial cells, and immune cells. ```r # Example in Seurat (R) library(Seurat) pbmc <- readRDS("pulmonary_fibrosis_seurat.rds") FeaturePlot(pbmc, features = c("ACTA2", "COL1A1", "MMP9")) VlnPlot(pbmc, features = c("ACTA2", "COL1A1"), group.by = "cell_type") ``` These visualizations help highlight key marker genes that distinguish pathological from normal states. --- ### Analyze and Visualize Pathway Activation Pathway-level insights can be derived using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), or tools like **AUCell** and **SCENIC** to infer transcription factor activity and regulatory networks. Visualization includes heatmaps of pathway scores or bar charts showing enriched pathways. ```r # Example using AUCell in R library(AUCell) geneSets <- msigdbCollection(species="Homo sapiens", category="C2") cellScores <- AUCell_buildRankings(adata@assays$RNA@data) cellActivity <- AUCell_runAnalysis(geneSets, cellScores) pheatmap::pheatmap(as.matrix(cellActivity), scale="row") ``` This helps in identifying activated profibrotic pathways such as TGF-β signaling, extracellular matrix remodeling, and inflammatory responses. --- ### Integrate Trajectory Inference for Dynamic Pathway Changes Pseudotime trajectory inference using Monocle, Slingshot, or Pseudogp can reconstruct the progression of fibrotic states. This approach allows visualization of how gene expression and pathway activation change along a developmental continuum. ```r # Example using Monocle (R) library(monocle) cds <- readRDS("fibrosis_cds.rds") cds <- orderCells(cds) plot_cell_trajectory(cds, color_by = "pseudotime") plot_genes_in_pseudotime(cds, genes = c("TGFB1", "CTGF", "PDGFRB")) ``` Trajectory-based plots provide dynamic insights into the molecular mechanisms driving fibrosis progression. ---
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

生信小博士

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值