Usage
Source code
The source code and data processing scripts are available on GitHub. You can download them by using the git clone command:
git clone https://github.com/yulab2021/modCnet.git
In the provided repository, the pretrained models are located under the ./model directory, and the data processing scripts and the main script are located under the ./script directory:
.
├── modCnet.yaml
├── data
│ ├── event_level_features_C_base_quality.csv
│ ├── event_level_features_C_length.csv
│ ├── event_level_features_C_mean.csv
│ ├── event_level_features_C_median.csv
│ ├── event_level_features_C_std.csv
│ ├── IVT_transcripts_ac4C.csv
│ ├── IVT_transcripts_C.csv
│ ├── IVT_transcripts_m5C.csv
│ ├── qPCR_curve_4.13_1.csv
│ ├── qPCR_curve_4.13_2.csv
│ └── Rn_cycle_curve.csv
├── demo_data
│ ├── ac4C.feature.test.tsv
│ ├── ac4C.feature.train.tsv
│ ├── C.feature.test.tsv
│ ├── C.feature.train.tsv
│ ├── GRCh38_subset_reference.fa
│ ├── HeLa
│ ├── IVT_DRS.reference.fasta
│ ├── IVT_fast5
│ ├── IVT_fast5_guppy
│ ├── IVT_fast5_guppy_single
│ ├── IVT.fastq
│ ├── IVT.feature
│ ├── IVT.sam
│ ├── m5C.feature.test.tsv
│ ├── m5C.feature.train.tsv
│ ├── model
│ └── test.feature.tsv
├── docs
│ └── test
├── model
│ ├── C_ac4C.pkl
│ ├── C_m5C_ac4C.pkl
│ ├── C_m5C.pkl
│ └── m5C_ac4C.pkl
├── README.md
├── results_reproduce
│ └── figure1_script.ipynb
└── script
├── modCnet.py
├── feature_extraction.py
├── __init__.py
├── model.py
├── models.py
├── __pycache__
├── read_level_prediction_to_site_level_prediction.py
├── transcriptome_location_to_genome_location.py
└── utils.py
Data processing
The data processing procedure is required for both training and prediction: the raw FAST5 files must be converted into feature files that the model can take as input. The data processing scripts are located in the ./script directory. Processing the raw FAST5 files involves the following steps.
1. Guppy basecalling
guppy_basecaller -i demo_data/IVT_ac4C -s demo_data/IVT_ac4C_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
2. Multi-fast5 to single-fast5
multi_to_single_fast5 -i demo_data/IVT_ac4C_guppy -s demo_data/IVT_ac4C_guppy_single -t 40 --recursive
3. Tombo resquiggling
tombo resquiggle --overwrite --basecall-group Basecall_1D_000 demo_data/IVT_ac4C_guppy_single demo_data/IVT_DRS_ac4C.reference.fasta --processes 40 --fit-global-scale --include-event-stdev
4. Mapping to reference
cat demo_data/IVT_ac4C_guppy/pass/*.fastq >demo_data/IVT_ac4C.fastq
minimap2 -ax map-ont demo_data/IVT_DRS_ac4C.reference.fasta demo_data/IVT_ac4C.fastq >demo_data/IVT_ac4C.sam
5. Feature extraction
python script/feature_extraction.py --input demo_data/IVT_ac4C_guppy_single \
--reference demo_data/IVT_DRS_ac4C.reference.fasta \
--sam demo_data/IVT_ac4C.sam \
--output demo_data/IVT_ac4C.feature.tsv \
--clip 10 \
--motif NNCNN
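The output of the feature extraction step is a tab-separated feature file (demo_data/IVT_ac4C.feature.tsv in this example). As a quick sanity check before training or prediction, you can load it with pandas; the snippet below is only a minimal sketch and assumes nothing beyond the file being tab-separated (the exact columns depend on feature_extraction.py).

# Quick sanity check of the extracted feature file; assumes only that it is tab-separated.
# Drop header=None if your feature file includes a header row.
import pandas as pd

features = pd.read_csv("demo_data/IVT_ac4C.feature.tsv", sep="\t", header=None)
print(features.shape)   # rows = extracted records, columns = features
print(features.head())  # preview the first few records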
Training from scratch
The de novo training mode in modCnet enables users to train the model from scratch using their own datasets. To train a modCnet model, both modified and modification-free Direct RNA Sequencing (DRS) data are required.
Before training, the raw FAST5 files need to undergo the data processing procedure described above. This produces one feature file per modification type, and the feature files should follow a naming convention that reflects the modification type they represent:
|-- data
| |-- C.feature.tsv
| |-- ac4C.feature.tsv
To evaluate performance during training, a separate test dataset is required. The following script randomly splits each feature file into training and test sets:
python script/train_test_split.py --input_file data/C.feature.tsv --train_file data/C.feature.train.tsv --test_file data/C.feature.test.tsv --train_ratio 0.8
python script/train_test_split.py --input_file data/ac4C.feature.tsv --train_file data/ac4C.feature.train.tsv --test_file data/ac4C.feature.test.tsv --train_ratio 0.8
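If you prefer to perform the split in your own code, an equivalent random 80/20 split can be done with pandas. This is only an illustrative sketch of what a random split does, not the repository's train_test_split.py:

# Illustrative random 80/20 split of a feature TSV (not the repository script).
import pandas as pd

features = pd.read_csv("data/C.feature.tsv", sep="\t")
train = features.sample(frac=0.8, random_state=42)  # 80% of records for training
test = features.drop(train.index)                   # remaining 20% for testing
train.to_csv("data/C.feature.train.tsv", sep="\t", index=False)
test.to_csv("data/C.feature.test.tsv", sep="\t", index=False)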
To train a modCnet model on a labelled training dataset, set the --run_mode argument to “train”. This trains the model from scratch. Test data are required to evaluate the model performance.
python script/modCnet.py --run_mode train \
--model_type C/ac4C \
--new_model model/C_ac4C.pkl \
--train_data_C data/C.feature.train.tsv \
--train_data_ac4C data/ac4C.feature.train.tsv \
--test_data_C data/C.feature.test.tsv \
--test_data_ac4C data/ac4C.feature.test.tsv \
--epoch 100
Training can be stopped either manually, based on performance on the test set, or automatically, by setting a maximum number of epochs. You can monitor test-set performance during training and stop once your criterion is met, such as reaching a target accuracy or loss. Alternatively, the --epoch argument sets a fixed maximum number of epochs, after which training stops automatically regardless of test-set performance. The training log should look something like this:
Epoch 2-2 Train acc: 0.853227, Test Acc: 0.801561, time: 0.684026
Epoch 2-3 Train acc: 0.857492, Test Acc: 0.809284, time: 0.689912
Epoch 2-4 Train acc: 0.859884, Test Acc: 0.810469, time: 0.695631
Epoch 2-5 Train acc: 0.863527, Test Acc: 0.812851, time: 0.701268
Epoch 2-6 Train acc: 0.865912, Test Acc: 0.814036, time: 0.701268
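If you stop training manually, a simple criterion is to track the best test accuracy and stop once it has not improved for a fixed number of epochs. The snippet below sketches this patience-based stopping rule in isolation; it is an illustration of the idea, not part of modCnet.py:

# Patience-based early stopping: stop once test accuracy has not improved for `patience` epochs.
def should_stop(test_acc_history, patience=5):
    if len(test_acc_history) <= patience:
        return False
    best_before = max(test_acc_history[:-patience])
    best_recent = max(test_acc_history[-patience:])
    return best_recent <= best_before

# Example: accuracy plateaus at 0.814, so training would stop once patience is exhausted.
history = [0.801, 0.809, 0.810, 0.812, 0.814, 0.814, 0.814, 0.814, 0.814, 0.814]
print(should_stop(history, patience=5))  # True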
Prediction
Pretrained models are saved in the ./model directory. You can load a pretrained model to predict modifications in new data by setting the --run_mode argument to “predict”. Before prediction, the raw FAST5 files need to undergo the data processing procedure described above.
python script/modCnet.py --run_mode predict \
--pretrained_model model/C_ac4C.pkl \
--feature_file data/WT.feature.tsv \
--predict_result data/WT.predict.tsv
The prediction result “data/WT.predict.tsv” has the following format:
transcript_id site motif read_id prediction probability
NM_001349947.2 552 AACCA 320a1a8b-7709-4335-8f6a-84f09ba6592a unmod 0.00014777448
XM_006720125.3 2437 ACCAG 53dd21de-f74b-44db-baa3-06c68772b7e1 unmod 0.062309794
NM_001321485.2 498 TGCTG 1f8ce6a2-5fac-4a2f-ae25-0abdb0de412e unmod 0.17353779
NM_001199673.2 2972 ATCAA 5781a0c4-ede0-452e-8789-9a43740451ab unmod 0.26891512
NM_014364.5 1233 GACAA 47f7b914-a51e-4eab-adb2-e500d8a46fd1 unmod 0.029849814
NM_001321485.2 515 GCCTC 31fe54e8-7724-40c6-aaa2-025ab5de7754 unmod 0.004975981
NM_001136267.2 1780 GACTA 62b6ab58-5ee0-4871-95d5-5db66a9c56c7 unmod 0.0018304548
NM_001143883.4 714 TGCAG 4fb0be9b-9628-46aa-9ba4-40a6456d7d52 unmod 0.1989807
NM_006012.4 1058 ATCTT 7c7ff067-1ead-4838-97c8-5fca91fdfe8a unmod 0.06284212
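Each row of the prediction file is a per-read call. To summarise these into per-site modification levels, the per-read calls can be grouped by transcript and site; the repository also provides script/read_level_prediction_to_site_level_prediction.py for this purpose. The pandas sketch below is only an illustration of that aggregation, assuming a tab-separated file with the column names shown above; it is not the repository script.

# Illustrative aggregation of read-level calls into per-site modification rates
# (assumes a tab-separated file with the column names shown above; not the repository script).
import pandas as pd

reads = pd.read_csv("data/WT.predict.tsv", sep="\t")
sites = (
    reads.groupby(["transcript_id", "site", "motif"])
         .agg(coverage=("read_id", "count"),
              mod_reads=("prediction", lambda p: (p != "unmod").sum()),
              mean_probability=("probability", "mean"))
         .reset_index()
)
sites["mod_rate"] = sites["mod_reads"] / sites["coverage"]
sites.to_csv("data/WT.site_level.tsv", sep="\t", index=False)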