Run examples

This section demonstrates how to use scGO with examples.

Train an scGO model using the Baron dataset

The Baron dataset covers a range of cell types found in the pancreas, including acinar cells, activated stellate cells, alpha cells, beta cells, delta cells, ductal cells, endothelial cells, epsilon cells, gamma cells, macrophages, mast cells, quiescent stellate cells, and Schwann cells. The full dataset is available from GEO under accession GSE84133. Because the original dataset is large, this demo uses a subset of it. The demo dataset is located in the ./demo/ directory.

demo
├── baron_data.csv
├── baron_meta_data.csv
├── goa_human.gaf
└── TF_annotation_hg38.demo.tsv

1. Data normalization and filtering

Normalize the data and filter out genes that are expressed in only a few cells. The number of genes to retain can be set with the argument --num_genes (default: 2000). The normalized and filtered data will be saved in the ./demo/ folder.

python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv
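The exact procedure used by data_processing.py is not described in this section. As a rough illustration of what a step like this typically involves, the sketch below scales each cell to a common total count, applies a log1p transform, and keeps the genes detected in the most cells. The function name, the 10,000 scaling target, and the cells-by-genes layout are assumptions made for illustration, not scGO's implementation.

```python
import math


def norm_and_filter_sketch(counts, gene_names, num_genes=2000, target_sum=1e4):
    """Illustrative only: library-size normalize, log-transform, keep top genes.

    counts: list of rows, one row per cell and one column per gene (assumed layout).
    """
    # Rank genes by the number of cells in which they are detected (count > 0).
    n_cells_expressed = [
        sum(1 for row in counts if row[j] > 0) for j in range(len(gene_names))
    ]
    ranked = sorted(range(len(gene_names)), key=lambda j: -n_cells_expressed[j])
    keep = sorted(ranked[:num_genes])  # preserve the original column order

    normalized = []
    for row in counts:
        total = sum(row) or 1.0  # avoid division by zero for empty cells
        normalized.append([math.log1p(row[j] / total * target_sum) for j in keep])
    return normalized, [gene_names[j] for j in keep]


# Toy matrix: 3 cells x 4 genes (values are placeholders).
toy_counts = [[5, 0, 1, 0], [3, 0, 0, 2], [4, 1, 2, 0]]
toy_genes = ["INS", "GCG", "SST", "PPY"]
filtered, kept = norm_and_filter_sketch(toy_counts, toy_genes, num_genes=2)
print(kept)
```

Ranking by the number of expressing cells mirrors the description above; a different criterion, such as variance, would change which genes survive.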

2. Building network connections

Build the network connections between the gene layer, TF layer, and GO layer according to the GO and TF annotations.

python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf  --TF_annotation demo/TF_annotation_hg38.demo.tsv
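One common way to encode annotation-driven connections is a binary mask in which an entry is 1 only when a gene is annotated to a term. The sketch below illustrates that idea on a few toy gene/GO pairs. The function name and data layout are hypothetical, and the files actually written by build_network may be structured differently.

```python
def build_mask_sketch(genes, annotations):
    """Return (terms, mask) where mask[i][j] == 1 if genes[i] is annotated to terms[j].

    annotations: list of (gene, term) pairs, e.g. parsed from an annotation file.
    """
    gene_index = {g: i for i, g in enumerate(genes)}
    # Keep only terms that annotate at least one gene present in the matrix.
    terms = sorted({t for g, t in annotations if g in gene_index})
    term_index = {t: j for j, t in enumerate(terms)}

    mask = [[0] * len(terms) for _ in genes]
    for gene, term in annotations:
        if gene in gene_index:
            mask[gene_index[gene]][term_index[term]] = 1
    return terms, mask


# Toy inputs for illustration only.
toy_genes = ["INS", "GCG", "SST"]
toy_annotations = [
    ("INS", "GO:0005179"),
    ("GCG", "GO:0005179"),
    ("INS", "GO:0006006"),
    ("TP53", "GO:0006915"),  # gene not in the matrix, so it is ignored
]
terms, mask = build_mask_sketch(toy_genes, toy_annotations)
for gene, row in zip(toy_genes, mask):
    print(gene, row)
```

A mask like this can be multiplied element-wise with a layer's weight matrix so that only annotated connections carry signal.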

3. Model training

To train the scGO model on your own dataset from scratch, run the scGO.py script with the train command. scGO accepts a single-cell gene expression matrix (.csv) as input. A metadata file (.csv) is also required for training; its cell_type column is used as the training label. Specify where to save the model with the argument --model. The number of training epochs can be set with the argument --epochs, and the model state is saved at the end of each epoch. Training time depends on the size of your dataset and the available compute, and may range from minutes to several hours. Here is an example of the training process using the demo dataset.

python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --meta_data demo/baron_meta_data.csv --model models/scGO.demo.pkl
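Before starting a long training run, it can be worth confirming that the metadata actually contains the label column and seeing how many cells each class has. A minimal sketch using only the standard library; the helper name is hypothetical, and the cell_id column in the toy input is an assumption about the file layout.

```python
import csv
import io
from collections import Counter


def check_label_column(handle, label="cell_type"):
    """Return per-class cell counts, or raise if the label column is missing."""
    reader = csv.DictReader(handle)
    if label not in (reader.fieldnames or []):
        raise ValueError(f"metadata has no column named {label!r}")
    return Counter(row[label] for row in reader)


# In-memory stand-in for a metadata file such as demo/baron_meta_data.csv.
toy_meta = io.StringIO(
    "cell_id,cell_type\n"
    "cell_a,Macrophages\n"
    "cell_b,Epsilon cells\n"
    "cell_c,Macrophages\n"
)
class_counts = check_label_column(toy_meta)
print(class_counts.most_common())
```

For a real file, pass an open handle instead, for example open("demo/baron_meta_data.csv", newline="").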

During the training process, the following information can be used to monitor and evaluate the performance of the model:

epoch 0         accuracy:        0.8    loss:    0.6765657464663187
epoch 1         accuracy:        1.0    loss:    0.21495116502046585
epoch 2         accuracy:        1.0    loss:    0.045270659029483795
epoch 3         accuracy:        1.0    loss:    0.01799739959339301
epoch 4         accuracy:        1.0    loss:    0.008456315845251083
epoch 5         accuracy:        1.0    loss:    0.0038305727454523244
epoch 6         accuracy:        1.0    loss:    0.0015001039331158001
epoch 7         accuracy:        1.0    loss:    0.0006393008710195621
epoch 8         accuracy:        1.0    loss:    0.0003799065889324993
epoch 9         accuracy:        1.0    loss:    0.00025623222851815325
epoch 10        accuracy:        1.0    loss:    0.0001992921606870368
epoch 11        accuracy:        1.0    loss:    0.0001639570424837681
epoch 12        accuracy:        1.0    loss:    0.00013759526094266525
epoch 13        accuracy:        1.0    loss:    0.0001203383071697317
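
If the training output is redirected to a file, lines in the format shown above can be parsed to track convergence programmatically. A minimal sketch, assuming the log lines keep exactly this layout (the helper name is hypothetical):

```python
import re

# Matches lines such as: "epoch 3    accuracy:    1.0    loss:    0.0179"
LOG_LINE = re.compile(
    r"epoch\s+(\d+)\s+accuracy:\s+([\d.]+)\s+loss:\s+([\d.eE+-]+)"
)


def parse_training_log(text):
    """Return a list of (epoch, accuracy, loss) tuples found in the text."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            epoch, accuracy, loss = match.groups()
            records.append((int(epoch), float(accuracy), float(loss)))
    return records


sample = """\
epoch 0         accuracy:        0.8    loss:    0.6765657464663187
epoch 1         accuracy:        1.0    loss:    0.21495116502046585
epoch 2         accuracy:        1.0    loss:    0.045270659029483795
"""
history = parse_training_log(sample)
losses = [loss for _, _, loss in history]
# True when the loss decreases at every epoch.
print(all(a > b for a, b in zip(losses, losses[1:])))
```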

Application of pretrained scGO to predict new data

Predicting new data

After training completes, the model file scGO.demo.pkl is stored in the ./models/ folder. This trained model can be used to make predictions on new data. Use the predict command to predict new data, and specify where to write the predicted results with the --output argument.

python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv  --model models/scGO.demo.pkl --output demo/baron_data_filtered.predicted.csv

The prediction results include a predicted cell type label along with the confidence probability of that prediction. The following is an example of the prediction results:

                        cell_id       predicted cell_type      probability
0   human1_lib1.final_cell_0123       Epsilon cells            0.999998
1   human1_lib1.final_cell_0288       Macrophages              1.000000
2   human1_lib1.final_cell_0309       Epsilon cells            0.999936
3   human1_lib1.final_cell_0323       Epsilon cells            0.999969
4   human1_lib1.final_cell_0417       Macrophages              0.999999
..  ...                               ...                      ...
68  human4_lib1.final_cell_0349       Macrophages              0.999999
69  human4_lib1.final_cell_0579       Macrophages              1.000000
70  human4_lib3.final_cell_0064       Macrophages              0.999999
71  human4_lib3.final_cell_0215       Macrophages              1.000000
72  human4_lib3.final_cell_0574       Macrophages              1.000000
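
The prediction file can then be inspected with standard tooling. The sketch below counts the predicted labels and lists low-confidence cells using only the standard library. It assumes the CSV header uses the column names shown in the table above (cell_id, predicted cell_type, probability), which may differ from the actual file, and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import csv
import io
from collections import Counter


def summarize_predictions(handle, min_probability=0.9):
    """Count predicted labels and collect cell IDs below the probability cutoff."""
    reader = csv.DictReader(handle)
    counts = Counter()
    low_confidence = []
    for row in reader:
        counts[row["predicted cell_type"]] += 1
        if float(row["probability"]) < min_probability:
            low_confidence.append(row["cell_id"])
    return counts, low_confidence


# In-memory stand-in for the prediction CSV (assumed header).
sample = io.StringIO(
    "cell_id,predicted cell_type,probability\n"
    "human1_lib1.final_cell_0123,Epsilon cells,0.999998\n"
    "human1_lib1.final_cell_0288,Macrophages,1.000000\n"
    "human3_lib3.final_cell_0819,Epsilon cells,0.880849\n"
)
counts, low_confidence = summarize_predictions(sample)
print(counts)
print(low_confidence)
```

For a real run, replace the in-memory sample with open("demo/baron_data_filtered.predicted.csv", newline="").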

Reporting novel cell type

scGO can flag novel cell types when the argument --indicate_novel_cell_type is set to True. Predictions with a low confidence probability are assigned the label novel cell type. The following is an example of the command and the resulting predictions.

python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv  --model models/scGO.demo.pkl --indicate_novel_cell_type True --output demo/baron_data_filtered_novel.predicted.csv


                        cell_id        predicted cell_type     probability
0   human4_lib1.final_cell_0035        Macrophages             0.999773
1   human3_lib3.final_cell_0413        Macrophages             1.000000
2   human1_lib1.final_cell_0428        novel cell type         0.515880
3   human3_lib3.final_cell_0819        Epsilon cells           0.880849
4   human3_lib3.final_cell_0621        Epsilon cells           0.998823
..  ...                                ...                     ...
93  human2_lib1.final_cell_0399        Epsilon cells           0.999883
94  human2_lib1.final_cell_0544        Macrophages             0.999957
95  human4_lib1.final_cell_0326        Epsilon cells           0.999983
96  human2_lib3.final_cell_0147        Macrophages             1.000000
97  human4_lib1.final_cell_0295        Macrophages             0.999955
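
The idea behind this option can be illustrated with a small sketch: convert the classifier's output scores to probabilities and relabel a cell as a novel cell type when the highest probability falls below a cutoff. The function names, the softmax step, and the 0.6 cutoff are assumptions for illustration; the threshold scGO actually applies is not stated in this section.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def label_with_novelty(scores, class_names, threshold=0.6):
    """Return (label, probability); low-confidence cells are marked as novel."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return class_names[best], probs[best]
    return "novel cell type", probs[best]


classes = ["Epsilon cells", "Macrophages"]
confident = label_with_novelty([6.0, -2.0], classes)  # one class dominates
ambiguous = label_with_novelty([0.1, 0.0], classes)   # nearly tied scores
print(confident[0])
print(ambiguous[0])
```

Raising the threshold marks more cells as novel at the cost of rejecting some correct predictions, so the cutoff is best tuned on cells with known labels.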

Train a regression model that predicts a continuous value

In addition to discrete cell types, scGO provides a regression mode (set the task argument to regression) to predict a continuous cell state. The demo directory contains a metadata file, baron_meta_data_senescence_score.csv, with a column senescence_score, which is a continuous value. We can train a regression model to predict the senescence_score; the following is an example using the demo dataset. Data processing and network building follow the same steps as for the classification model. The only difference is setting the task argument to regression and setting the label argument to the metadata column you want to predict.

1. Data normalization and filtering

Normalize the data and filter out genes that are expressed in only a few cells. The number of genes to retain can be set with the argument --num_genes (default: 2000). The normalized and filtered data will be saved in the ./demo/ folder.

python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv

2. Building network connections

Build the network connections between the gene layer, TF layer, and GO layer according to the GO and TF annotations.

python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf  --TF_annotation demo/TF_annotation_hg38.demo.tsv

3. Training regression model

Set the task argument to regression and set the label argument to the metadata column that you want to predict.

python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --task regression --epoch 100 --batch_size 8 --meta_data demo/baron_meta_data_senescence_score.csv --label senescence_score --model models/scGO.senescence_score.demo.pkl

4. Predicting new data

Load the trained model and predict new data. The prediction results include a predicted cell type label along with the predicted value.

python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --task regression --model models/scGO.senescence_score.demo.pkl --output demo/baron_meta_data_senescence_score.predicted.csv
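
To judge a regression model, predicted values can be compared against known scores on held-out cells. The sketch below computes the mean squared error and the Pearson correlation with the standard library. The function names are hypothetical and the numbers are made-up placeholders, not scGO output.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Placeholder values standing in for known and predicted senescence scores.
true_scores = [0.1, 0.4, 0.5, 0.9]
predicted_scores = [0.2, 0.4, 0.6, 0.8]

error = mean_squared_error(true_scores, predicted_scores)
correlation = pearson_correlation(true_scores, predicted_scores)
print(f"MSE: {error:.4f}  Pearson r: {correlation:.3f}")
```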

After data processing and model training, scGO should have generated the following files. The trained models are saved in the ./models/ folder and can be reused for future predictions.

demo
├── baron_data.csv
├── baron_data_filtered.csv
├── baron_data_filtered.predicted.csv
├── baron_meta_data.csv
├── baron_meta_data_senescence_score.csv
├── baron_meta_data_senescence_score.predicted.csv
├── feature
├── gene_TF_dict
├── gene_to_TF_transform_matrix
├── goa_human.gaf.zip
├── GO_mask
├── GO_TF_mask
├── test_data.csv
├── TF_annotation_hg38.demo.tsv
├── TF_gene_dict
└── TF_mask
models
├── scGO.demo.pkl
└── scGO.senescence_score.demo.pkl
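
A quick way to confirm that a run completed is to check for the main outputs listed above. A minimal sketch; the helper name is hypothetical and the list covers only a few of the files, so extend it as needed.

```python
from pathlib import Path

# A subset of the files from the listing above.
EXPECTED_OUTPUTS = [
    "demo/baron_data_filtered.csv",
    "demo/baron_data_filtered.predicted.csv",
    "demo/baron_meta_data_senescence_score.predicted.csv",
    "models/scGO.demo.pkl",
    "models/scGO.senescence_score.demo.pkl",
]


def missing_outputs(root, expected=EXPECTED_OUTPUTS):
    """Return the expected paths that do not exist under the given root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Run from the repository root; an empty list means every file was found.
print(missing_outputs("."))
```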

Each demonstration is expected to take no more than about 3 minutes to run.