.. _run_examples:

Run examples
==================================

This section demonstrates how to use scGO with examples.

Train an scGO model using the Baron dataset
*******************************************

The Baron dataset covers a range of cell types found in the pancreas, including acinar cells, activated stellate cells, alpha cells, beta cells, delta cells, ductal cells, endothelial cells, epsilon cells, gamma cells, macrophages, mast cells, quiescent stellate cells, and Schwann cells. The Baron dataset is available under GEO accession GSE84133. Because the original dataset is large, this demo uses a subset of the Baron dataset. The demo dataset is located under the ``./demo/`` directory.

::

    demo
    ├── baron_data.csv
    ├── baron_meta_data.csv
    ├── goa_human.gaf
    └── TF_annotation_hg38.demo.tsv

**1. Data normalization and filtering**

Normalize the data and filter out genes that are expressed in few cells. The number of genes to retain can be set with the ``--num_genes`` argument; the default value is 2000. The normalized and filtered data will be saved in the ``./demo/`` folder.

::

    python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv

**2. Building network connections**

Build the network connections between the gene layer, the TF layer and the GO layer according to the GO annotations and TF annotations.

::

    python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf --TF_annotation demo/TF_annotation_hg38.demo.tsv

**3. Model training**

To train the scGO model on your own dataset from scratch, run the ``scGO.py`` script with the ``train`` command. scGO accepts a single-cell gene expression matrix (.csv) as input. A metadata file (.csv) is also required for training. The ``cell_type`` column in the metadata is used as the training label.
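Before training, it can help to confirm that the metadata really has a ``cell_type`` column and that every cell in the expression matrix has a label. The sketch below is not part of scGO. It assumes that both CSV files keep the cell identifier in their first column, which may not match the actual file layout, and the helper name ``check_labels`` is hypothetical.

.. code-block:: python

    import csv
    import io


    def check_labels(matrix_csv, meta_csv, label="cell_type"):
        """Return IDs of cells in the matrix that have no label in the metadata.

        Assumption (not confirmed by scGO): the first column of both files
        holds the cell ID.
        """
        meta_rows = list(csv.reader(meta_csv))
        header = meta_rows[0]
        if label not in header:
            raise ValueError(f"metadata has no '{label}' column")
        label_idx = header.index(label)
        # Cells whose label field is present and non-empty.
        labelled = {
            row[0]
            for row in meta_rows[1:]
            if len(row) > label_idx and row[label_idx].strip()
        }

        matrix_rows = csv.reader(matrix_csv)
        next(matrix_rows)  # skip the header row of gene names
        return [row[0] for row in matrix_rows if row and row[0] not in labelled]


    # Tiny in-memory stand-ins for the real CSV files (made-up values).
    matrix = io.StringIO(
        "cell_id,GENE_A,GENE_B\n"
        "cell_1,0.0,1.2\n"
        "cell_2,3.1,0.0\n"
        "cell_3,0.5,0.5\n"
    )
    meta = io.StringIO("cell_id,cell_type\ncell_1,alpha\ncell_2,beta\n")

    missing = check_labels(matrix, meta)
    print(missing)

With real data, the two ``io.StringIO`` objects would be replaced by open file handles for the filtered matrix and the metadata file.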
You can specify the model save path with the ``--model`` argument. The number of training epochs can be set with the ``--epochs`` argument, and the model state is saved at the end of each epoch. Training time depends on the size of your dataset and the available compute, and may range from minutes to several hours. Here is an example of training on the demo dataset.

::

    python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --meta_data demo/baron_meta_data.csv --model models/scGO.demo.pkl

During training, the following output can be used to monitor and evaluate the performance of the model:

::

    epoch 0 accuracy: 0.8 loss: 0.6765657464663187
    epoch 1 accuracy: 1.0 loss: 0.21495116502046585
    epoch 2 accuracy: 1.0 loss: 0.045270659029483795
    epoch 3 accuracy: 1.0 loss: 0.01799739959339301
    epoch 4 accuracy: 1.0 loss: 0.008456315845251083
    epoch 5 accuracy: 1.0 loss: 0.0038305727454523244
    epoch 6 accuracy: 1.0 loss: 0.0015001039331158001
    epoch 7 accuracy: 1.0 loss: 0.0006393008710195621
    epoch 8 accuracy: 1.0 loss: 0.0003799065889324993
    epoch 9 accuracy: 1.0 loss: 0.00025623222851815325
    epoch 10 accuracy: 1.0 loss: 0.0001992921606870368
    epoch 11 accuracy: 1.0 loss: 0.0001639570424837681
    epoch 12 accuracy: 1.0 loss: 0.00013759526094266525
    epoch 13 accuracy: 1.0 loss: 0.0001203383071697317

Application of pretrained scGO to predict new data
**************************************************

**Predicting new data**

After training completes, the model file ``scGO.demo.pkl`` is stored in the ``./models/`` folder. This trained model can be used to make predictions on new data. Use the ``predict`` command to predict new data, and set where the predicted results are written with the ``--output`` argument.
::

    python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --model models/scGO.demo.pkl --output demo/baron_data_filtered.predicted.csv

The prediction results include a predicted cell type label along with the confidence probability of the predicted cell type. The following is an example of the prediction results.

::

                            cell_id predicted cell_type  probability
    0   human1_lib1.final_cell_0123       Epsilon cells     0.999998
    1   human1_lib1.final_cell_0288         Macrophages     1.000000
    2   human1_lib1.final_cell_0309       Epsilon cells     0.999936
    3   human1_lib1.final_cell_0323       Epsilon cells     0.999969
    4   human1_lib1.final_cell_0417         Macrophages     0.999999
    ..                          ...                 ...          ...
    68  human4_lib1.final_cell_0349         Macrophages     0.999999
    69  human4_lib1.final_cell_0579         Macrophages     1.000000
    70  human4_lib3.final_cell_0064         Macrophages     0.999999
    71  human4_lib3.final_cell_0215         Macrophages     1.000000
    72  human4_lib3.final_cell_0574         Macrophages     1.000000

**Reporting novel cell type**

scGO can flag novel cell types when the ``--indicate_novel_cell_type`` argument is set to ``True``. Predictions with a low confidence probability are assigned the label ``novel cell type``. The following is an example of the command and of the prediction results with novel cell type reporting enabled.

::

    python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --model models/scGO.demo.pkl --indicate_novel_cell_type True --output demo/baron_data_filtered_novel.predicted.csv

                            cell_id predicted cell_type  probability
    0   human4_lib1.final_cell_0035         Macrophages     0.999773
    1   human3_lib3.final_cell_0413         Macrophages     1.000000
    2   human1_lib1.final_cell_0428     novel cell type     0.515880
    3   human3_lib3.final_cell_0819       Epsilon cells     0.880849
    4   human3_lib3.final_cell_0621       Epsilon cells     0.998823
    ..                          ...                 ...          ...
    93  human2_lib1.final_cell_0399       Epsilon cells     0.999883
    94  human2_lib1.final_cell_0544         Macrophages     0.999957
    95  human4_lib1.final_cell_0326       Epsilon cells     0.999983
    96  human2_lib3.final_cell_0147         Macrophages     1.000000
    97  human4_lib1.final_cell_0295         Macrophages     0.999955

Train a regression model that predicts a continuous value
*********************************************************

In addition to discrete cell types, scGO provides a regression mode (set the ``task`` argument to ``regression``) to predict a continuous cell status. The ``demo`` directory contains a metadata file, ``baron_meta_data_senescence_score.csv``, with a ``senescence_score`` column. The ``senescence_score`` is a continuous value, so we can train a regression model to predict it. The following is an example of the training process using the demo dataset. Data processing and network building are the same as for the classification model. The only difference is setting the ``task`` argument to ``regression`` and setting the ``label`` argument to the metadata column that you want to predict.

**1. Data normalization and filtering**

Normalize the data and filter out genes that are expressed in few cells. The number of genes to retain can be set with the ``--num_genes`` argument; the default value is 2000. The normalized and filtered data will be saved in the ``./demo/`` folder.

::

    python scripts/data_processing.py norm_and_filter --gene_expression_matrix demo/baron_data.csv --num_genes 2000 --output demo/baron_data_filtered.csv

**2. Building network connections**

Build the network connections between the gene layer, the TF layer and the GO layer according to the GO annotations and TF annotations.

::

    python scripts/data_processing.py build_network --gene_expression_matrix demo/baron_data_filtered.csv --GO_annotation demo/goa_human.gaf --TF_annotation demo/TF_annotation_hg38.demo.tsv

**3.
Training regression model**

Set the ``task`` argument to ``regression`` and set the ``label`` argument to the metadata column that you want to predict.

::

    python scripts/scGO.py train --gene_expression_matrix demo/baron_data_filtered.csv --task regression --epoch 100 --batch_size 8 --meta_data demo/baron_meta_data_senescence_score.csv --label senescence_score --model models/scGO.senescence_score.demo.pkl

**4. Predicting new data**

Load the pre-trained model and predict new data. The prediction results include a predicted cell type label along with the predicted value.

::

    python scripts/scGO.py predict --gene_expression_matrix demo/baron_data.csv --task regression --model models/scGO.senescence_score.demo.pkl --output demo/baron_meta_data_senescence_score.predicted.csv

After data processing and model training, scGO should have generated the following files. The trained models are saved in the ``./models/`` folder and can be reused for later predictions.

::

    demo
    ├── baron_data.csv
    ├── baron_data_filtered.csv
    ├── baron_data_filtered.predicted.csv
    ├── baron_meta_data.csv
    ├── baron_meta_data_senescence_score.csv
    ├── baron_meta_data_senescence_score.predicted.csv
    ├── feature
    ├── gene_TF_dict
    ├── gene_to_TF_transform_matrix
    ├── goa_human.gaf.zip
    ├── GO_mask
    ├── GO_TF_mask
    ├── test_data.csv
    ├── TF_annotation_hg38.demo.tsv
    ├── TF_gene_dict
    └── TF_mask
    models
    ├── scGO.demo.pkl
    └── scGO.senescence_score.demo.pkl

Each demonstration is expected to take up to about 3 minutes to run.
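To confirm that a run produced the files in the listing above, a small check such as the following can be used. It is a sketch and not part of scGO: the helper name ``missing_outputs`` is hypothetical, and the ``EXPECTED`` list only mirrors a few entries from the listing.

.. code-block:: python

    import tempfile
    from pathlib import Path

    # A few of the outputs named in the listing, relative to the repository root.
    EXPECTED = [
        "demo/baron_data_filtered.csv",
        "demo/baron_data_filtered.predicted.csv",
        "models/scGO.demo.pkl",
        "models/scGO.senescence_score.demo.pkl",
    ]


    def missing_outputs(root, expected=EXPECTED):
        """Return the expected relative paths that do not exist under root."""
        root = Path(root)
        return [rel for rel in expected if not (root / rel).exists()]


    # Self-contained demonstration: create placeholder files for the first
    # three entries in a temporary directory and report what is still missing.
    with tempfile.TemporaryDirectory() as tmp:
        for rel in EXPECTED[:3]:
            path = Path(tmp) / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        still_missing = missing_outputs(tmp)

    print(still_missing)

In a real checkout, calling ``missing_outputs(".")`` from the repository root after running the demo commands would list any of these files that were not created.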