Run examples
This sections gives examples on how to use the three modes of TandemMod.
Train m6A model using IVET m6A dataset
IVET datasets have been uploaded to GEO database under the accession number GSE227087. To train a m6A detection model, the followinng two fast5 files (m6A-modified and unmodified) are required.
IVET_DRS_m6A.tar.gz
IVET_DRS_unmodified.tar.gz
In this demo, subsets of the two datasets were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located undelr ./demo/IVET/
directory.
demo
└── IVET
├── IVET_m6A
│ └── IVET_m6A.fast5
└── IVET_unmod
└── IVET_unmod.fast5
1. Guppy basecalling
Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.
#m6A
guppy_basecaller -i demo/IVET/IVET_m6A -s demo/IVET/IVET_m6A_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
#unmodified
guppy_basecaller -i demo/IVET/IVET_unmod -s demo/IVET/IVET_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
2. Multi-reads FAST5 files to single-read FAST5 files
Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.
#m6A
multi_to_single_fast5 -i demo/IVET/IVET_m6A_guppy -s demo/IVET/IVET_m6A_guppy_single --recursive
#unmodified
multi_to_single_fast5 -i demo/IVET/IVET_unmod_guppy -s demo/IVET/IVET_unmod_guppy_single --recursive
3. Tombo resquiggling
In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step.
#m6A
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/IVET/IVET_m6A_guppy_single demo/IVET_reference.fa --processes 40 --fit-global-scale --include-event-stdev
#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/IVET/IVET_unmod_guppy_single demo/IVET_reference.fa --processes 40 --fit-global-scale --include-event-stdev
4. Map reads to reference
minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.
#m6A
cat demo/IVET/IVET_m6A_guppy/pass/*.fastq >demo/IVET/IVET_m6A.fastq
minimap2 -ax map-ont demo/IVET_reference.fa demo/IVET/IVET_m6A.fastq >demo/IVET/IVET_m6A.sam
#unmodified
cat demo/IVET/IVET_unmod_guppy/pass/*.fastq >demo/IVET/IVET_unmod.fastq
minimap2 -ax map-ont demo/IVET_reference.fa demo/IVET/IVET_unmod.fastq >demo/IVET/IVET_unmod.sam
5. Feature extraction
Extract signals and features from resquiggled fast5 files using the following python scripts.
#m6A
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/IVET/IVET_m6A_guppy_single --reference demo/IVET_reference.fa --sam demo/IVET/IVET_m6A.sam --output demo/IVET/m6A.signal.tsv --clip 10
python scripts/extract_feature_from_signal.py --signal_file demo/IVET/m6A.signal.tsv --clip 10 --output demo/IVET/m6A.feature.tsv --motif DRACH
#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/IVET/IVET_unmod_guppy_single --reference demo/IVET_reference.fa --sam demo/IVET/IVET_unmod.sam --output demo/IVET/unmod.signal.tsv --clip 10
python scripts/extract_feature_from_signal.py --signal_file demo/IVET/unmod.signal.tsv --clip 10 --output demo/IVET/unmod.feature.tsv --motif DRACH
In the feature extraction step, the motif pattern should be provided using the argument --motif
. The base symbols of the motif follow the IUB code standard. Here is the full definition of IUB base symbols:
IUB Base |
Expansion |
---|---|
A |
A |
C |
C |
G |
G |
T |
T |
M |
AC |
V |
ACG |
R |
AG |
H |
ACT |
W |
AT |
D |
AGT |
S |
CG |
B |
CGT |
Y |
CT |
N |
ACGT |
K |
GT |
6. Train-test split
The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio
to accommodate the specific requirements of the problem and the size of the dataset.
The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.
usage: train_test_split.py [-h] [--input_file INPUT_FILE]
[--train_file TRAIN_FILE] [--test_file TEST_FILE]
[--train_ratio TRAIN_RATIO]
Split a feature file into training and testing sets.
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE Path to the input feature file
--train_file TRAIN_FILE Path to the train feature file
--test_file TEST_FILE Path to the test feature file
--train_ratio TRAIN_RATIO Ratio of instances to use for training (default: 0.8)
#m6A
python scripts/train_test_split.py --input_file demo/IVET/m6A.feature.tsv --train_file demo/IVET/m6A.train.feature.tsv --test_file demo/IVET/m6A.test.feature.tsv --train_ratio 0.8
#unmodified
python scripts/train_test_split.py --input_file demo/IVET/unmod.feature.tsv --train_file demo/IVET/unmod.train.feature.tsv --test_file demo/IVET/unmod.test.feature.tsv --train_ratio 0.8
7. Train m6A model
To train the TandemMod model using your own dataset from scratch, you can set the --run_mode
argument to “train”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the model save path by using the argument --new_model
. The model’s training epochs can be defined using the argument --epochs
, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU
for training if CUDA is available on your device; otherwise, it will utilize the CPU
mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.
python scripts/TandemMod.py --run_mode train \
--new_model demo/model/m6A.demo.IVET.pkl \
--train_data_mod demo/IVET/m6A.train.feature.tsv \
--train_data_unmod demo/IVET/unmod.train.feature.tsv \
--test_data_mod demo/IVET/m6A.test.feature.tsv \
--test_data_unmod demo/IVET/unmod.test.feature.tsv \
--epoch 100
During training process, the following information can be used to monitor and evaluate the performance of the model:
device= cpu
train process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.494000,Test Acc: 0.581081,time0:00:08.936393
Epoch 1-0 Train acc: 0.514000,Test Acc: 0.817568,time0:00:06.084542
Epoch 2-0 Train acc: 0.796000,Test Acc: 0.668919,time0:00:06.000019
Epoch 3-0 Train acc: 0.672000,Test Acc: 0.770270,time0:00:07.456637
Epoch 4-0 Train acc: 0.786000,Test Acc: 0.763514,time0:00:06.132852
Epoch 5-0 Train acc: 0.824000,Test Acc: 0.834459,time0:00:06.584059
Epoch 6-0 Train acc: 0.810000,Test Acc: 0.814189,time0:00:06.600892
Epoch 7-0 Train acc: 0.780000,Test Acc: 0.790541,time0:00:07.301838
After the data processing and model training, the following files should be generated by TandemMod. The trained model m6A.demo.IVET.pkl
will be saved in the ./demo/model/
folder. You can utilize this model for making predictions in the future.
demo
├── IVET
│ ├── IVET_m6A
│ ├── IVET_m6A.fastq
│ ├── IVET_m6A_guppy
│ ├── IVET_m6A_guppy_single
│ ├── IVET_m6A.sam
│ ├── IVET_unmod
│ ├── IVET_unmod.fastq
│ ├── IVET_unmod_guppy
│ ├── IVET_unmod_guppy_single
│ ├── IVET_unmod.sam
│ ├── m6A.feature.tsv
│ ├── m6A.signal.tsv
│ ├── m6A.test.feature.tsv
│ ├── m6A.train.feature.tsv
│ ├── unmod.feature.tsv
│ ├── unmod.signal.tsv
│ ├── unmod.test.feature.tsv
│ └── unmod.train.feature.tsv
├── IVET_reference.fa
└── model
└── m6A.demo.IVET.pkl
Train m6A model using curlcake m6A dataset
Curlcake datasets are publicly available at the GEO database under the accession code GSE124309. In this demo, subsets of the curcake datasets (m6A-modified and unmodified) were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/curlcake/
directory.
demo
└── curlcake
├── curlcake_m6A
│ └── curlcake_m6A.fast5
└── curlcake_unmod
└── curlcake_unmod.fast5
1. Guppy basecalling
Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.
#m6A
guppy_basecaller -i demo/curlcake/curlcake_m6A -s demo/curlcake/curlcake_m6A_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
#unmodified
guppy_basecaller -i demo/curlcake/curlcake_unmod -s demo/curlcake/curlcake_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
2. Multi-reads FAST5 files to single-read FAST5 files
Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.
#m6A
multi_to_single_fast5 -i demo/curlcake/curlcake_m6A_guppy -s demo/curlcake/curlcake_m6A_guppy_single --recursive
#unmodified
multi_to_single_fast5 -i demo/curlcake/curlcake_unmod_guppy -s demo/curlcake/curlcake_unmod_guppy_single --recursive
3. Tombo resquiggling
In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step. Curlcake reference file can be download here.
#m6A
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/curlcake/curlcake_m6A_guppy_single demo/curlcake_reference.fa --processes 40 --fit-global-scale --include-event-stdev
#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/curlcake/curlcake_unmod_guppy_single demo/curlcake_reference.fa --processes 40 --fit-global-scale --include-event-stdev
4. Map reads to reference
minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.
#m6A
cat demo/curlcake/curlcake_m6A_guppy/pass/*.fastq >demo/curlcake/curlcake_m6A.fastq
minimap2 -ax map-ont demo/curlcake_reference.fa demo/curlcake/curlcake_m6A.fastq >demo/curlcake/curlcake_m6A.sam
#unmodified
cat demo/curlcake/curlcake_unmod_guppy/pass/*.fastq >demo/curlcake/curlcake_unmod.fastq
minimap2 -ax map-ont demo/curlcake_reference.fa demo/curlcake/curlcake_unmod.fastq >demo/curlcake/curlcake_unmod.sam
5. Feature extraction
Extract signals and features from resquiggled fast5 files using the following python scripts.
#m6A
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/curlcake/curlcake_m6A_guppy_single --reference demo/curlcake_reference.fa --sam demo/curlcake/curlcake_m6A.sam --output demo/curlcake/m6A.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py --signal_file demo/curlcake/m6A.signal.tsv --clip 10 --output demo/curlcake/m6A.feature.tsv --motif DRACH
#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/curlcake/curlcake_unmod_guppy_single --reference demo/curlcake_reference.fa --sam demo/curlcake/curlcake_unmod.sam --output demo/curlcake/unmod.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py --signal_file demo/curlcake/unmod.signal.tsv --clip 10 --output demo/curlcake/unmod.feature.tsv --motif DRACH
In the feature extraction step, the motif pattern should be provided using the argument --motif
. The base symbols of the motif follow the IUB code standard.
6. Train-test split
The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio
to accommodate the specific requirements of the problem and the size of the dataset.
The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.
usage: train_test_split.py [-h] [--input_file INPUT_FILE]
[--train_file TRAIN_FILE] [--test_file TEST_FILE]
[--train_ratio TRAIN_RATIO]
Split a feature file into training and testing sets.
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE Path to the input feature file
--train_file TRAIN_FILE Path to the train feature file
--test_file TEST_FILE Path to the test feature file
--train_ratio TRAIN_RATIO Ratio of instances to use for training (default: 0.8)
#m6A
python scripts/train_test_split.py --input_file demo/curlcake/m6A.feature.tsv --train_file demo/curlcake/m6A.train.feature.tsv --test_file demo/curlcake/m6A.test.feature.tsv --train_ratio 0.8
#unmodified
python scripts/train_test_split.py --input_file demo/curlcake/unmod.feature.tsv --train_file demo/curlcake/unmod.train.feature.tsv --test_file demo/curlcake/unmod.test.feature.tsv --train_ratio 0.8
7. Train m6A model
To train the TandemMod model using your own dataset from scratch, you can set the --run_mode
argument to “train”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the model save path by using the argument --new_model
. The model’s training epochs can be defined using the argument --epochs
, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU
for training if CUDA is available on your device; otherwise, it will utilize the CPU
mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.
python scripts/TandemMod.py --run_mode train \
--new_model demo/model/m6A.demo.curlcake.pkl \
--train_data_mod demo/curlcake/m6A.train.feature.tsv \
--train_data_unmod demo/curlcake/unmod.train.feature.tsv \
--test_data_mod demo/curlcake/m6A.test.feature.tsv \
--test_data_unmod demo/curlcake/unmod.test.feature.tsv \
--epoch 100
During training process, the following information can be used to monitor and evaluate the performance of the model:
device= cpu
train process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.482000,Test Acc: 0.788462,time0:00:07.666192
Epoch 1-0 Train acc: 0.514000,Test Acc: 0.211538,time0:00:04.977504
Epoch 2-0 Train acc: 0.496000,Test Acc: 0.211538,time0:00:05.498799
Epoch 3-0 Train acc: 0.694000,Test Acc: 0.432692,time0:00:05.893204
Epoch 4-0 Train acc: 0.814000,Test Acc: 0.639423,time0:00:06.149194
Epoch 5-0 Train acc: 0.806000,Test Acc: 0.711538,time0:00:05.443221
Epoch 6-0 Train acc: 0.828000,Test Acc: 0.831731,time0:00:05.706294
Epoch 7-0 Train acc: 0.808000,Test Acc: 0.846154,time0:00:05.674450
Epoch 8-0 Train acc: 0.804000,Test Acc: 0.822115,time0:00:05.956936
After the data processing and model training, the following files should be generated by TandemMod. The trained model m6A.demo.curlcake.pkl
will be saved in the ./demo/model/
folder. You can utilize this model for making predictions in the future.
demo
├── curlcake
│ ├── curlcake_m6A
│ ├── curlcake_m6A.fastq
│ ├── curlcake_m6A_guppy
│ ├── curlcake_m6A_guppy_single
│ ├── curlcake_m6A.sam
│ ├── curlcake_unmod
│ ├── curlcake_unmod.fastq
│ ├── curlcake_unmod_guppy
│ ├── curlcake_unmod_guppy_single
│ ├── curlcake_unmod.sam
│ ├── m6A.feature.tsv
│ ├── m6A.signal.tsv
│ ├── m6A.test.feature.tsv
│ ├── m6A.train.feature.tsv
│ ├── unmod.feature.tsv
│ ├── unmod.signal.tsv
│ ├── unmod.test.feature.tsv
│ └── unmod.train.feature.tsv
├── curlcake_reference.fa
└── model
└── m6A.demo.curlcake.pkl
Transfer m6A model to m7G using ELIGOS dataset
To transfer the pretrained m6A model to an m7G prediction model using the ELIGOS dataset, you can follow these steps:
Obtain the ELIGOS dataset: Download or access the ELIGOS m7G dataset, which consists of the necessary data (m7G-modified and unmodified) for training and testing.
Prepare the data: Preprocess the ELIGOS dataset to extact features for transfer learning.
Load the pretrained m6A model: Load the pretrained m6A model that you want to transfer to predict m7G modifications. This model should have been previously trained on a relevant m6A dataset.
Train the modified model: Use the ELIGOS m7G dataset to fine-tune the model’s parameters using transfer learning techniques.
Evaluate the performance: Assess the performance of the transferred m7G model on the m7G testing set from the ELIGOS dataset.
By following these steps, you can transfer the knowledge gained from the pretrained m6A model to predict m7G modifications using the ELIGOS dataset.
ELIGOS datasets are publicly available at the SRA database under the accession code SRP166020. In this demo, subsets of the ELIGOS datasets (m7G-modified and unmodified) were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/ELIGOS/
directory.
demo
└── ELIGOS
├── ELIGOS_m7G
│ └── ELIGOS_m7G.fast5
└── ELIGOS_unmod
└── ELIGOS_unmod.fast5
1. Guppy basecalling
Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.
#m7G
guppy_basecaller -i demo/ELIGOS/ELIGOS_m7G -s demo/ELIGOS/ELIGOS_m7G_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
#unmodified
guppy_basecaller -i demo/ELIGOS/ELIGOS_unmod -s demo/ELIGOS/ELIGOS_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
2. Multi-reads FAST5 files to single-read FAST5 files
Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.
#m7G
multi_to_single_fast5 -i demo/ELIGOS/ELIGOS_m7G_guppy -s demo/ELIGOS/ELIGOS_m7G_guppy_single --recursive
#unmodified
multi_to_single_fast5 -i demo/ELIGOS/ELIGOS_unmod_guppy -s demo/ELIGOS/ELIGOS_unmod_guppy_single --recursive
3. Tombo resquiggling
In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step. ELIGOS reference file can be download here.
#m7G
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/ELIGOS/ELIGOS_m7G_guppy_single demo/ELIGOS_reference.fa --processes 40 --fit-global-scale --include-event-stdev
#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/ELIGOS/ELIGOS_unmod_guppy_single demo/ELIGOS_reference.fa --processes 40 --fit-global-scale --include-event-stdev
4. Map reads to reference
minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.
#m7G
cat demo/ELIGOS/ELIGOS_m7G_guppy/pass/*.fastq >demo/ELIGOS/ELIGOS_m7G.fastq
minimap2 -ax map-ont demo/ELIGOS_reference.fa demo/ELIGOS/ELIGOS_m7G.fastq >demo/ELIGOS/ELIGOS_m7G.sam
#unmodified
cat demo/ELIGOS/ELIGOS_unmod_guppy/pass/*.fastq >demo/ELIGOS/ELIGOS_unmod.fastq
minimap2 -ax map-ont demo/ELIGOS_reference.fa demo/ELIGOS/ELIGOS_unmod.fastq >demo/ELIGOS/ELIGOS_unmod.sam
5. Feature extraction
Extract signals and features from resquiggled fast5 files using the following python scripts.
#m7G
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/ELIGOS/ELIGOS_m7G_guppy_single --reference demo/ELIGOS_reference.fa --sam demo/ELIGOS/ELIGOS_m7G.sam --output demo/ELIGOS/m7G.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py --signal_file demo/ELIGOS/m7G.signal.tsv --clip 10 --output demo/ELIGOS/m7G.feature.tsv --motif NNGNN
#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/ELIGOS/ELIGOS_unmod_guppy_single --reference demo/ELIGOS_reference.fa --sam demo/ELIGOS/ELIGOS_unmod.sam --output demo/ELIGOS/unmod.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py --signal_file demo/ELIGOS/unmod.signal.tsv --clip 10 --output demo/ELIGOS/unmod.feature.tsv --motif NNGNN
In the feature extraction step, the motif pattern should be provided using the argument --motif
. The base symbols of the motif follow the IUB code standard.
6. Train-test split
The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio
to accommodate the specific requirements of the problem and the size of the dataset.
The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.
usage: train_test_split.py [-h] [--input_file INPUT_FILE]
[--train_file TRAIN_FILE] [--test_file TEST_FILE]
[--train_ratio TRAIN_RATIO]
Split a feature file into training and testing sets.
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE Path to the input feature file
--train_file TRAIN_FILE Path to the train feature file
--test_file TEST_FILE Path to the test feature file
--train_ratio TRAIN_RATIO Ratio of instances to use for training (default: 0.8)
#m7G
python scripts/train_test_split.py --input_file demo/ELIGOS/m7G.feature.tsv --train_file demo/ELIGOS/m7G.train.feature.tsv --test_file demo/ELIGOS/m7G.test.feature.tsv --train_ratio 0.8
#unmodified
python scripts/train_test_split.py --input_file demo/ELIGOS/unmod.feature.tsv --train_file demo/ELIGOS/unmod.train.feature.tsv --test_file demo/ELIGOS/unmod.test.feature.tsv --train_ratio 0.8
7. Train m7G model
To transfer the pretrained TandemMod model to new types of modifications, you can set the --run_mode
argument to “transfer”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the pretrained model by using the argument --pretrained_model
and the new model save path by using the argument --new_model
. The model’s training epochs can be defined using the argument --epochs
, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU
for training if CUDA is available on your device; otherwise, it will utilize the CPU
mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.
usage: TandemMod.py [-h] --run_mode RUN_MODE
[--pretrained_model PRETRAINED_MODEL]
[--new_model NEW_MODEL] [--train_data_mod TRAIN_DATA_MOD]
[--train_data_unmod TRAIN_DATA_UNMOD]
[--test_data_mod TEST_DATA_MOD]
[--test_data_unmod TEST_DATA_UNMOD]
[--feature_file FEATURE_FILE]
[--predict_result PREDICT_RESULT] [--epoch EPOCH]
TandemMod, multiple types of RNA modification detection.
optional arguments:
-h, --help show this help message and exit
--run_mode RUN_MODE Run mode. Default is train
--pretrained_model PRETRAINED_MODEL Pretrained model file.
--new_model NEW_MODEL New model file to be saved.
--train_data_mod TRAIN_DATA_MOD Train data file, modified.
--train_data_unmod TRAIN_DATA_UNMOD Train data file, unmodified.
--test_data_mod TEST_DATA_MOD Test data file, modified.
--test_data_unmod TEST_DATA_UNMOD Test data file, unmodified.
--epoch EPOCH Training epoch
python scripts/TandemMod.py --run_mode transfer \
--pretrained_model demo/model/m6A.demo.IVET.pkl \
--new_model demo/model/m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl \
--train_data_mod demo/ELIGOS/m7G.train.feature.tsv \
--train_data_unmod demo/ELIGOS/unmod.train.feature.tsv \
--test_data_mod demo/ELIGOS/m7G.test.feature.tsv \
--test_data_unmod demo/ELIGOS/unmod.test.feature.tsv \
--epoch 100
During training process, the following information can be used to monitor and evaluate the performance of the transfered model:
device= cpu
transfer learning process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.544000,Test Acc: 0.489786,time0:00:08.688707
Epoch 1-0 Train acc: 0.674000,Test Acc: 0.857939,time0:00:05.190997
Epoch 2-0 Train acc: 0.748000,Test Acc: 0.813835,time0:00:05.426035
Epoch 3-0 Train acc: 0.778000,Test Acc: 0.753946,time0:00:05.180632
Epoch 4-0 Train acc: 0.854000,Test Acc: 0.776230,time0:00:05.236281
Epoch 5-0 Train acc: 0.886000,Test Acc: 0.817549,time0:00:05.219122
Epoch 6-0 Train acc: 0.926000,Test Acc: 0.889044,time0:00:05.470729
After the data processing and model training, the following files should be generated by TandemMod. The trained model m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl
will be saved in the ./demo/model/
folder. You can utilize this fine-tuned model for making predictions in the future.
demo
├── ELIGOS
│ ├── ELIGOS_m7G
│ ├── ELIGOS_m7G.fastq
│ ├── ELIGOS_m7G_guppy
│ ├── ELIGOS_m7G_guppy_single
│ ├── ELIGOS_m7G.sam
│ ├── ELIGOS_unmod
│ ├── ELIGOS_unmod.fastq
│ ├── ELIGOS_unmod_guppy
│ ├── ELIGOS_unmod_guppy_single
│ ├── ELIGOS_unmod.sam
│ ├── m7G.feature.tsv
│ ├── m7G.signal.tsv
│ ├── m7G.test.feature.tsv
│ ├── m7G.train.feature.tsv
│ ├── unmod.feature.tsv
│ ├── unmod.signal.tsv
│ ├── unmod.test.feature.tsv
│ └── unmod.train.feature.tsv
├── ELIGOS_reference.fa
└── model
├── m6A.demo.IVET.pkl
└── m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl
Predict m6A sites in human cell lines
HEK293T nanopore data is publicly available and can be downloaded from the SG-NEx project. In this demo, subset of the HEK293T nanopore data was taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/HEK293T/
directory.
demo
└── HEK293T
└── HEK293T_fast5
└── HEK293T.fast5
1. Guppy basecalling
Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.
guppy_basecaller -i demo/HEK293T/HEK293T_fast5 -s demo/HEK293T/HEK293T_fast5_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
2. Multi-reads FAST5 files to single-read FAST5 files
Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.
multi_to_single_fast5 -i demo/HEK293T/HEK293T_fast5_guppy -s demo/HEK293T/HEK293T_fast5_guppy_single --recursive
3. Tombo resquiggling
In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-plac. No separate files are generated in this step. GRCh38 transcripts file can be download here.
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/HEK293T/HEK293T_fast5_guppy_single demo/GRCh38_subset_reference.fa --processes 40 --fit-global-scale --include-event-stdev
4. Map reads to reference
minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.
cat demo/HEK293T/HEK293T_fast5_guppy/pass/*.fastq >demo/HEK293T/HEK293T.fastq
minimap2 -ax map-ont demo/GRCh38_subset_reference.fa demo/HEK293T/HEK293T.fastq >demo/HEK293T/HEK293T.sam
5. Feature extraction
Extract signals and features from resquiggled fast5 files using the following python scripts.
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/HEK293T/HEK293T_fast5_guppy_single --reference demo/GRCh38_subset_reference.fa --sam demo/HEK293T/HEK293T.sam --output demo/HEK293T/HEK293T.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py --signal_file demo/HEK293T/HEK293T.signal.tsv --clip 10 --output demo/HEK293T/HEK293T.feature.tsv --motif DRACH
In the feature extraction step, the motif pattern should be provided using the argument --motif
. The base symbols of the motif follow the IUB code standard.
7. Predict m6A sites
To predict m6A sites in HEK293T nanopore data using a pretrained model, you can set the --run_mode
argument to “predict”. You can specify the pretrained model by using the argument --pretrained_model
.
python scripts/TandemMod.py --run_mode predict \
--pretrained_model demo/model/m6A.demo.IVET.pkl \
--feature_file demo/HEK293T/HEK293T.feature.tsv \
--predict_result demo/HEK293T/HEK293T.prediction.tsv
During the prediction process, TandemMod generates the following files. The prediction result file is named “HEK293T.prediction.tsv”.
demo
└── HEK293T
├── HEK293T_fast5
├── HEK293T_fast5_guppy
├── HEK293T_fast5_guppy_single
├── HEK293T.fastq
├── HEK293T.feature.tsv
├── HEK293T.prediction.tsv
├── HEK293T.sam
└── HEK293T.signal.tsv
The prediction result “demo/HEK293T/HEK293T.prediction.tsv” provides prediction labels along with the corresponding modification probabilities, which can be utilized for further analysis.
transcript_id site motif read_id prediction probability
XM_005261965.4 10156 AAACA 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.1364245
XM_005261965.4 10164 AAACT 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.034915127
XM_005261965.4 10229 GAACC 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.4773725
XM_005261965.4 10241 GGACC 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.11096856
XM_005261965.4 10324 GGACT 60e0f6a3-2166-4730-9a10-8f8aaa750b37 mod 0.908553
XM_005261965.4 10362 AAACA 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.2004475
XM_005261965.4 10434 AGACA 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.1934688
XM_005261965.4 10498 GGACC 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.1313666
XM_005261965.4 10507 AAACA 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.030169742
XM_005261965.4 10511 AAACT 60e0f6a3-2166-4730-9a10-8f8aaa750b37 unmod 0.020174831
XM_005261965.4 10592 AGACT 60e0f6a3-2166-4730-9a10-8f8aaa750b37 mod 0.7666112
The execution time for each demonstration is estimated to be approximately 3-10 minutes.