Performing Clustering using FEATS
This notebook shows how to perform clustering using the FEATS package.
The first step is to import libraries including functions from the FEATS package. We use the
json package to load a python dictionary of dataset specific parameters stored as a text file. The
pandas package is used to open data files stored as csv and to create dataframes. The
adjusted_rand_score function from the scikit-learn package is used for computing ARI. The
SingleCell class from the singlecelldata package is used for easy management of single-cell datasets. The functions in FEATS pcakage accepts SingleCell objects only to perform analysis. Plotting functions from the scplotlib library are used to generate cluster visualizations. The Cluster function from FEATS package is used to perform clustering.
# Import core libraries import json import pandas as pd from sklearn.metrics.cluster import adjusted_rand_score # Import SingleCell from singlecelldata import SingleCell # Import visualization functions from scplotlib import PCAPlot, SilhouettePlot, GeneExpHeatmap # Import the FEATS methods from feats import Cluster, GeneFilter, LogFilter
The next step is to load the dataset. Here we use Pandas to import data stored in csv format into dataframes. The dataframes are then used to create the SingleCell object as the functions in the FEATS package accepts SingleCell objects.
# Python dictionary storing dataset specific parameters. dataset_params = json.load(open("dset_param.txt")) dataset = "kolodziejczyk" dset_name = dataset_params[dataset]['dset_name'] path = "datasets/" data_path = path + dset_name + '/' + dset_name + "_data.csv" celldata_path = path + dset_name + '/' + dset_name + "_celldata.csv" genedata_path = path + dset_name + '/' + dset_name + "_genedata.csv" data = pd.read_csv(data_path, index_col=0) celldata = pd.read_csv(celldata_path, index_col=0) genedata = pd.read_csv(genedata_path, index_col = 0) # Create a single cell object sc = SingleCell(dataset,data,celldata,genedata) # Get parameters from the dictionary label = dataset_params[dataset]['label'] # This is the name of the # column which stores true cell labels. nc = dataset_params[dataset]['nc'] # This is the true number of # clusters in the dataset. # sc.print()
LogFilter function takes the Log (with base 2) of the expression counts stored in sc and stores the log transformed values in the data assay.
# Take a log transformation of gene expressions sc = LogFilter(sc)
GeneFilter function filters out genes based on the filtering criteria. It accepts three arguments apart from the sc object which are
expr_thres. The GeneFilter function filters out genes expressed (with expression value less than or equal to
expr_thres) in less than
min_cells and greater than
max_cells number of cells. Here we use 10% of cells as the
min_cells and 90% of cells as
expr_thres is 0 by default.
# Perform Gene Filtering sc = GeneFilter(sc, min_cells = int((10/100) * sc.dim), max_cells = int((90/100) * sc.dim), expr_thres = 0)
Applying Gene Filter . . . Number of features remaining after gene filtering: 10725
Cluster function from FEATS is used to cluster the cells in the dataset. The first argument is the SingleCell object. The second argument is the number of clusters in the data
n_clusters. Since we know this for this dataset, we pass the true number of clusters,
nc. This parameter also accepts a python list of integers as number of clusters. Clustering will be performed using all the integers in the list. It also accepts the string
'gap', in which case it will estimate the number of clusters using gap statistic. The output argument is the SingleCell object which stores the computed clusters and other information. The second output argument is the number of clusters in the data (which we don’t need in this case).
# Perform clustering sc, _ = Cluster(sc, k = nc)
Computing Temporary Clusters . . . Performing Feature Selection . . . Computing 3 clusters using best features . . . Saving final cluster labels in Single Cell object . . .
Adjusted rand index
We can assess the clustering performance by computing the ARI using
adjusted_rand_score function from scikit-learn package. The sc object stores the true cluster labels. We can get a numeric version of the true cell labels using the function
getNumericCellLabels if it is stored as a string or in any other form. The FEATS clustering also stores the computed cluster labels in the celldata assay under the column name
k is the number of clusters.
# Compute and print the Adjusted Rand Score print ("Adjusted Rand Index: ") print (adjusted_rand_score(sc.getNumericCellLabels(label), sc.getCellData("FEATS_3_Clusters")))
Adjusted Rand Index: 0.9919841315168355
Visualize the clusters formed
We can visualize the clusters formed using FEATS in various ways. (1) We can do a PCA plot which is a scatter plot in 2D after reducing the dimensionality using PCA. The scatter plot can be colored using the
color_by argument. We color the plot according to clusters computed by FEATS (fig1). (2) We can do a silhouette plot which will show the silhouette scores of the samples in each cluster (sorted by cluster and silhouette score). We have to specify the cluster information using the
cluster_by argument. Silhouette score is a measure of how similar cells are to their own cluster compared to other clusters. The silhouette values are in the range $[-1, 1]$. Values closer to 1 indicate that samples are well matched to their own cluster and poorly matched to neighbouring clusters (fig2).
fig1 = PCAPlot(sc, color_by = "FEATS_3_Clusters", marker_size = 10, marker_thickness = 2) fig1.show()
fig2 = SilhouettePlot(sc, cluster_by = "FEATS_3_Clusters") fig2.show()
High scoring genes (Top genes)
We can visualize the top genes according to their F-scores. To do this, we can generate a heatmap of the genes sorted by the F-scores using the
GeneExpHeatmap function from scplotlib package (fig3). Cells can be sorted according to their clusters with a horizontal colored bar on top of the heatmap indicating the cluster. The
color_by argument is used to do this. The genes can be sorted also using the
sort_by argument. We use the F-scores stored in the sc object after performing FEATS clustering. The F-scores are stored under the column
FEATS_F_Score in the genedata assay. We can also point to the column which stores the gene names using the
name_by argument. In this case the column
feature_symbol stores the gene names in the genedata assay. In addition, we can specify the number of genes we want using the
top_num_genes argument. We can also generate the heatmap where cells are sorted according to the temporary clusters computed by FEATS (fig4). The temporary clusters are computed to perform ANOVA.
# Plot top expressions fig3 = GeneExpHeatmap(sc, color_by = 'FEATS_3_Clusters', sort_by = 'FEATS_F_Score', name_by = 'feature_symbol', top_num_genes = 10) fig3.show()
fig4 = GeneExpHeatmap(sc, color_by = 'FEATS_3_Clusters_Temp', sort_by = 'FEATS_F_Score', name_by = 'feature_symbol', top_num_genes = 10) fig4.show()
For clarification on any of the functions used in this document please contact us via email: email@example.com