Run examples
This section demonstrates how to use scGO with examples.
Train the scGO model using the Baron dataset
The Baron dataset covers a range of cell types found in the pancreas, including acinar cells, activated stellate cells, alpha cells, beta cells, delta cells, ductal cells, endothelial cells, epsilon cells, gamma cells, macrophages, mast cells, quiescent stellate cells, and Schwann cells. The Baron dataset is available from GEO under accession GSE84133. Because the original dataset is large, this demo uses a subset of it. The demo dataset is located under the ./demo/ directory.
demo
├── baron_data.csv
├── baron_meta_data.csv
├── goa_human.gaf
└── TF_annotation_hg38.demo.tsv
1. Data normalization and filtering
Normalize the data and filter out genes that are expressed in few cells. The number of genes to retain can be set with the --num_genes argument; the default is 2000. The normalized and filtered data will be saved in the ./demo/ folder.
python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv
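The kind of transformation this step performs can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in data_processing.py: the recipe (scale each cell to a fixed total, apply log1p, keep the genes detected in the most cells), the function name norm_and_filter, the scale parameter, and the toy counts are all hypothetical.

```python
import math

def norm_and_filter(matrix, genes, num_genes=2000, scale=1e4):
    """Normalize a cells-by-genes count matrix and keep the most widely expressed genes.

    Assumed recipe (not confirmed by the scGO source): scale each cell to
    `scale` total counts, apply log1p, and keep the `num_genes` genes that
    are detected (count > 0) in the most cells.
    """
    # Number of cells in which each gene is detected.
    detected = [sum(1 for row in matrix if row[j] > 0) for j in range(len(genes))]
    # Rank genes by detection count, keep the top `num_genes`,
    # then restore the original column order.
    ranked = sorted(range(len(genes)), key=lambda j: -detected[j])
    keep = sorted(ranked[:num_genes])
    normalized = []
    for row in matrix:
        total = sum(row) or 1.0  # avoid dividing by zero for empty cells
        normalized.append([math.log1p(row[j] / total * scale) for j in keep])
    return normalized, [genes[j] for j in keep]

# Toy input: 3 cells x 4 genes (made-up counts).
counts = [[10, 0, 5, 0],
          [3, 0, 7, 1],
          [0, 0, 2, 0]]
genes = ["INS", "GCG", "SST", "PPY"]
data, kept = norm_and_filter(counts, genes, num_genes=2)
print(kept)
```

In this toy case the two retained genes are the ones detected in the most cells; the real script may rank or normalize differently.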
2. Building network connections
Build the network connections between the gene layer, TF layer, and GO layer according to the GO and TF annotations.
python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf --TF_annotation demo/TF_annotation_hg38.demo.tsv
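To give an idea of what a gene-to-GO connection looks like, the sketch below builds a binary mask from GAF annotations. It relies only on the GAF layout (comment lines start with "!", column 3 holds the gene symbol, column 5 the GO ID). The helper names read_gaf and build_mask, the mask layout, and the toy rows are assumptions for illustration; the masks that build_network actually writes may be organized differently, and the TF annotation file is not covered here.

```python
import io

def read_gaf(handle):
    """Yield (gene_symbol, go_id) pairs from a GAF file handle."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # header and comment lines start with '!'
        fields = line.rstrip("\n").split("\t")
        yield fields[2], fields[4]  # column 3: symbol, column 5: GO ID

def build_mask(genes, pairs):
    """Return (go_terms, mask), where mask[i][j] == 1 if genes[i] is annotated to go_terms[j]."""
    gene_index = {g: i for i, g in enumerate(genes)}
    # Keep only annotations for genes present in the expression matrix.
    annotated = [(g, t) for g, t in pairs if g in gene_index]
    go_terms = sorted({t for _, t in annotated})
    go_index = {t: j for j, t in enumerate(go_terms)}
    mask = [[0] * len(go_terms) for _ in genes]
    for g, t in annotated:
        mask[gene_index[g]][go_index[t]] = 1
    return go_terms, mask

# Toy GAF rows, truncated to the first five columns (real rows have more).
toy_gaf = io.StringIO(
    "!gaf-version: 2.2\n"
    + "\t".join(["UniProtKB", "P01308", "INS", "enables", "GO:0005179"]) + "\n"
    + "\t".join(["UniProtKB", "P01308", "INS", "involved_in", "GO:0006006"]) + "\n"
    + "\t".join(["UniProtKB", "P01275", "GCG", "enables", "GO:0005179"]) + "\n"
    + "\t".join(["UniProtKB", "P61278", "SST", "enables", "GO:0005179"]) + "\n"
)
go_terms, go_mask = build_mask(["INS", "GCG"], read_gaf(toy_gaf))
print(go_terms)
print(go_mask)
```

A mask of this kind lets a layer connect each gene only to the GO terms it is annotated with, rather than to every node in the next layer.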
3. Model training
To train the scGO model from scratch on your own dataset, run the scGO.py script with the train command. scGO accepts a single-cell gene expression matrix (.csv) as input. Meta data (.csv) is also required for training; its cell_type column is used as the training label. Specify where to save the model with the --model argument. The number of training epochs can be set with the --epochs argument, and the model state is saved at the end of each epoch. Training time depends on the size of your dataset and the available compute, and may range from minutes to several hours. Here is an example of the training process using the demo dataset.
python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --meta_data demo/baron_meta_data.csv --model models/scGO.demo.pkl
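As an illustration of how the cell_type column can serve as a training label, the sketch below maps each distinct cell type to an integer class index. The helper encode_labels, the cell_id column, and the toy rows are hypothetical; only the cell_type column name comes from the text above, and scGO may encode labels differently internally.

```python
import csv
import io

def encode_labels(meta_handle, label_column="cell_type"):
    """Read a meta data CSV and map each distinct label to an integer class index."""
    reader = csv.DictReader(meta_handle)
    labels = [row[label_column] for row in reader]
    classes = sorted(set(labels))  # stable, alphabetical class order
    class_index = {c: i for i, c in enumerate(classes)}
    return classes, [class_index[label] for label in labels]

# Toy meta data; the cell_id column is an assumed layout.
toy_meta = io.StringIO("cell_id,cell_type\nc1,beta\nc2,alpha\nc3,beta\n")
classes, encoded = encode_labels(toy_meta)
print(classes)
print(encoded)
```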
During the training process, the following information can be used to monitor and evaluate the performance of the model:
epoch 0 accuracy: 0.8 loss: 0.6765657464663187
epoch 1 accuracy: 1.0 loss: 0.21495116502046585
epoch 2 accuracy: 1.0 loss: 0.045270659029483795
epoch 3 accuracy: 1.0 loss: 0.01799739959339301
epoch 4 accuracy: 1.0 loss: 0.008456315845251083
epoch 5 accuracy: 1.0 loss: 0.0038305727454523244
epoch 6 accuracy: 1.0 loss: 0.0015001039331158001
epoch 7 accuracy: 1.0 loss: 0.0006393008710195621
epoch 8 accuracy: 1.0 loss: 0.0003799065889324993
epoch 9 accuracy: 1.0 loss: 0.00025623222851815325
epoch 10 accuracy: 1.0 loss: 0.0001992921606870368
epoch 11 accuracy: 1.0 loss: 0.0001639570424837681
epoch 12 accuracy: 1.0 loss: 0.00013759526094266525
epoch 13 accuracy: 1.0 loss: 0.0001203383071697317
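If you capture this output in a file, a small parser makes it easier to track convergence. The sketch below assumes the log lines have exactly the format shown above; parse_log, first_converged_epoch, and the tolerance value are hypothetical helpers, not part of scGO.

```python
import re

# Matches lines such as "epoch 3 accuracy: 1.0 loss: 0.0179...".
LOG_LINE = re.compile(r"epoch (\d+) accuracy: ([\d.]+) loss: ([\d.eE+-]+)")

def parse_log(text):
    """Return a list of (epoch, accuracy, loss) tuples found in `text`."""
    return [(int(e), float(a), float(l)) for e, a, l in LOG_LINE.findall(text)]

def first_converged_epoch(records, tol=1e-3):
    """Return the first epoch whose loss improved by less than `tol` over the previous epoch."""
    for (_, _, prev_loss), (epoch, _, loss) in zip(records, records[1:]):
        if prev_loss - loss < tol:
            return epoch
    return None

log = """epoch 0 accuracy: 0.8 loss: 0.6765657464663187
epoch 1 accuracy: 1.0 loss: 0.21495116502046585
epoch 2 accuracy: 1.0 loss: 0.045270659029483795
epoch 3 accuracy: 1.0 loss: 0.01799739959339301
epoch 4 accuracy: 1.0 loss: 0.008456315845251083
epoch 5 accuracy: 1.0 loss: 0.0038305727454523244
epoch 6 accuracy: 1.0 loss: 0.0015001039331158001
epoch 7 accuracy: 1.0 loss: 0.0006393008710195621
epoch 8 accuracy: 1.0 loss: 0.0003799065889324993
"""
records = parse_log(log)
print(first_converged_epoch(records))
```

With a tolerance of 1e-3, the loss in this log stops improving meaningfully at epoch 7, which suggests that later epochs add little for the demo dataset.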
Application of pretrained scGO to predict new data
Predicting new data
After training completes, the model file scGO.demo.pkl is stored in the ./models/ folder. This trained model can be used to make predictions on new data. Use the predict command to predict new data, and specify where to write the predicted results with the --output argument.
python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --model models/scGO.demo.pkl --output demo/baron_data_filtered.predicted.csv
The prediction results include a predicted cell type label along with the confidence probability of the predicted cell type. The following is an example of the prediction results:
                        cell_id  predicted cell_type  probability
0   human1_lib1.final_cell_0123        Epsilon cells     0.999998
1   human1_lib1.final_cell_0288          Macrophages     1.000000
2   human1_lib1.final_cell_0309        Epsilon cells     0.999936
3   human1_lib1.final_cell_0323        Epsilon cells     0.999969
4   human1_lib1.final_cell_0417          Macrophages     0.999999
..                          ...                  ...          ...
68  human4_lib1.final_cell_0349          Macrophages     0.999999
69  human4_lib1.final_cell_0579          Macrophages     1.000000
70  human4_lib3.final_cell_0064          Macrophages     0.999999
71  human4_lib3.final_cell_0215          Macrophages     1.000000
72  human4_lib3.final_cell_0574          Macrophages     1.000000
Reporting novel cell type
scGO can flag novel cell types when the --indicate_novel_cell_type argument is set to True. Predictions with a low confidence probability are assigned as novel cell type. The following is an example command and the resulting predictions with novel cell types.
python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --model models/scGO.demo.pkl --indicate_novel_cell_type True --output demo/baron_data_filtered_novel.predicted.csv
                        cell_id  predicted cell_type  probability
0   human4_lib1.final_cell_0035          Macrophages     0.999773
1   human3_lib3.final_cell_0413          Macrophages     1.000000
2   human1_lib1.final_cell_0428      novel cell type     0.515880
3   human3_lib3.final_cell_0819        Epsilon cells     0.880849
4   human3_lib3.final_cell_0621        Epsilon cells     0.998823
..                          ...                  ...          ...
93  human2_lib1.final_cell_0399        Epsilon cells     0.999883
94  human2_lib1.final_cell_0544          Macrophages     0.999957
95  human4_lib1.final_cell_0326        Epsilon cells     0.999983
96  human2_lib3.final_cell_0147          Macrophages     1.000000
97  human4_lib1.final_cell_0295          Macrophages     0.999955
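The idea of flagging low-confidence predictions can be sketched as a simple threshold on the top class probability. This is a minimal illustration, not scGO's implementation: the cutoff scGO uses is not stated here, so the threshold of 0.6 is a hypothetical value chosen only because it is consistent with the output above (0.52 is flagged, 0.88 is not), and softmax and call_cell_type are illustrative helper names.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def call_cell_type(logits, classes, threshold=0.6, novel_label="novel cell type"):
    """Return (label, probability); fall back to `novel_label` below `threshold`.

    `threshold` is a hypothetical cutoff, not a documented scGO default.
    """
    probs = softmax(logits)
    best = max(range(len(classes)), key=probs.__getitem__)
    label = classes[best] if probs[best] >= threshold else novel_label
    return label, probs[best]

classes = ["Macrophages", "Epsilon cells"]
confident = call_cell_type([5.0, 0.0], classes)  # one class clearly dominates
uncertain = call_cell_type([0.1, 0.0], classes)  # nearly a coin flip
print(confident)
print(uncertain)
```

A lower threshold flags fewer cells as novel, while a higher one is more conservative about assigning known labels.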
Train a regression model that predicts a continuous value
In addition to discrete cell types, scGO provides a regression mode (set the task argument to regression) to predict a continuous cell status. The demo directory contains a meta data file, baron_meta_data_senescence_score.csv, with a senescence_score column. The senescence_score is a continuous value, so we can train a regression model to predict it. The following is an example of the training process using the demo dataset. Data processing and network building are similar to those for the classification model. The only difference is setting the task argument to regression and setting the label argument to the metadata column that you aim to predict.
1. Data normalization and filtering
Normalize the data and filter out genes that are expressed in few cells. The number of genes to retain can be set with the --num_genes argument; the default is 2000. The normalized and filtered data will be saved in the ./demo/ folder.
python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv
2. Building network connections
Build the network connections between the gene layer, TF layer, and GO layer according to the GO and TF annotations.
python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf --TF_annotation demo/TF_annotation_hg38.demo.tsv
3. Training regression model
Set the task argument to regression and set the label argument to the metadata column that you aim to predict.
python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --task regression --epoch 100 --batch_size 8 --meta_data demo/baron_meta_data_senescence_score.csv --label senescence_score --model models/scGO.senescence_score.demo.pkl
4. Predicting new data
Load the pre-trained model and predict new data. The predicted results include a predicted cell type label along with the predicted value.
python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --task regression --model models/scGO.senescence_score.demo.pkl --output demo/baron_meta_data_senescence_score.predicted.csv
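Once predictions are available, they can be compared against the known scores with standard regression metrics. The sketch below computes the mean squared error and the Pearson correlation in plain Python. The numbers are made-up toy values, not scGO output, and how you load the true and predicted columns depends on the layout of the output file, which is not described here.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy values for illustration only.
true_scores = [0.1, 0.4, 0.7, 0.9]
predicted_scores = [0.2, 0.3, 0.8, 0.8]
error = mse(true_scores, predicted_scores)
corr = pearson_r(true_scores, predicted_scores)
print(error, corr)
```

A low error together with a correlation close to 1 indicates that the predicted values track the true scores well.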
After data processing and model training, scGO should have generated the following files. The trained model is saved in the ./models/ folder and can be used for future predictions.
demo
├── baron_data.csv
├── baron_data_filtered.csv
├── baron_data_filtered.predicted.csv
├── baron_meta_data.csv
├── baron_meta_data_senescence_score.csv
├── baron_meta_data_senescence_score.predicted.csv
├── feature
├── gene_TF_dict
├── gene_to_TF_transform_matrix
├── goa_human.gaf.zip
├── GO_mask
├── GO_TF_mask
├── test_data.csv
├── TF_annotation_hg38.demo.tsv
├── TF_gene_dict
└── TF_mask
models
├── scGO.demo.pkl
└── scGO.senescence_score.demo.pkl
Each demonstration is expected to take no more than about 3 minutes to run.