Run examples

This sections gives examples on how to use the three modes of TandemMod.

Train m6A model using IVET m6A dataset

IVET datasets have been uploaded to GEO database under the accession number GSE227087. To train a m6A detection model, the followinng two fast5 files (m6A-modified and unmodified) are required.

IVET_DRS_m6A.tar.gz
IVET_DRS_unmodified.tar.gz

In this demo, subsets of the two datasets were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located undelr ./demo/IVET/ directory.

demo
└── IVET
    ├── IVET_m6A
    │   └── IVET_m6A.fast5
    └── IVET_unmod
        └── IVET_unmod.fast5

1. Guppy basecalling

Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.

#m6A
guppy_basecaller -i demo/IVET/IVET_m6A -s demo/IVET/IVET_m6A_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

#unmodified
guppy_basecaller -i demo/IVET/IVET_unmod -s demo/IVET/IVET_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

2. Multi-reads FAST5 files to single-read FAST5 files

Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.

#m6A
multi_to_single_fast5 -i demo/IVET/IVET_m6A_guppy -s demo/IVET/IVET_m6A_guppy_single --recursive

#unmodified
multi_to_single_fast5 -i demo/IVET/IVET_unmod_guppy -s demo/IVET/IVET_unmod_guppy_single --recursive

3. Tombo resquiggling

In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step.

#m6A
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/IVET/IVET_m6A_guppy_single  demo/IVET_reference.fa --processes 40 --fit-global-scale --include-event-stdev

#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/IVET/IVET_unmod_guppy_single  demo/IVET_reference.fa --processes 40 --fit-global-scale --include-event-stdev

4. Map reads to reference

minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.

#m6A
cat demo/IVET/IVET_m6A_guppy/pass/*.fastq >demo/IVET/IVET_m6A.fastq
minimap2 -ax map-ont demo/IVET_reference.fa demo/IVET/IVET_m6A.fastq >demo/IVET/IVET_m6A.sam

#unmodified
cat demo/IVET/IVET_unmod_guppy/pass/*.fastq >demo/IVET/IVET_unmod.fastq
minimap2 -ax map-ont demo/IVET_reference.fa demo/IVET/IVET_unmod.fastq >demo/IVET/IVET_unmod.sam

5. Feature extraction

Extract signals and features from resquiggled fast5 files using the following python scripts.

#m6A
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/IVET/IVET_m6A_guppy_single --reference demo/IVET_reference.fa --sam demo/IVET/IVET_m6A.sam --output demo/IVET/m6A.signal.tsv --clip 10
python scripts/extract_feature_from_signal.py  --signal_file demo/IVET/m6A.signal.tsv --clip 10 --output demo/IVET/m6A.feature.tsv --motif DRACH

#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/IVET/IVET_unmod_guppy_single --reference demo/IVET_reference.fa --sam demo/IVET/IVET_unmod.sam --output demo/IVET/unmod.signal.tsv --clip 10
python scripts/extract_feature_from_signal.py  --signal_file demo/IVET/unmod.signal.tsv --clip 10 --output demo/IVET/unmod.feature.tsv --motif DRACH

In the feature extraction step, the motif pattern should be provided using the argument --motif. The base symbols of the motif follow the IUB code standard. Here is the full definition of IUB base symbols:

IUB Base	Expansion
A	A
C	C
G	G
T	T
M	AC
V	ACG
R	AG
H	ACT
W	AT
D	AGT
S	CG
B	CGT
Y	CT
N	ACGT
K	GT

6. Train-test split

The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio to accommodate the specific requirements of the problem and the size of the dataset.

The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.

usage: train_test_split.py [-h] [--input_file INPUT_FILE]
                           [--train_file TRAIN_FILE] [--test_file TEST_FILE]
                           [--train_ratio TRAIN_RATIO]

Split a feature file into training and testing sets.

optional arguments:
  -h, --help                  show this help message and exit
  --input_file INPUT_FILE     Path to the input feature file
  --train_file TRAIN_FILE     Path to the train feature file
  --test_file TEST_FILE       Path to the test feature file
  --train_ratio TRAIN_RATIO   Ratio of instances to use for training (default: 0.8)

#m6A
python scripts/train_test_split.py --input_file demo/IVET/m6A.feature.tsv --train_file demo/IVET/m6A.train.feature.tsv --test_file demo/IVET/m6A.test.feature.tsv --train_ratio 0.8

#unmodified
python scripts/train_test_split.py --input_file demo/IVET/unmod.feature.tsv --train_file demo/IVET/unmod.train.feature.tsv --test_file demo/IVET/unmod.test.feature.tsv --train_ratio 0.8

7. Train m6A model

To train the TandemMod model using your own dataset from scratch, you can set the --run_mode argument to “train”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the model save path by using the argument --new_model. The model’s training epochs can be defined using the argument --epochs, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU for training if CUDA is available on your device; otherwise, it will utilize the CPU mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.

python scripts/TandemMod.py --run_mode train \
  --new_model demo/model/m6A.demo.IVET.pkl \
  --train_data_mod demo/IVET/m6A.train.feature.tsv \
  --train_data_unmod demo/IVET/unmod.train.feature.tsv \
  --test_data_mod demo/IVET/m6A.test.feature.tsv \
  --test_data_unmod demo/IVET/unmod.test.feature.tsv \
  --epoch 100

During training process, the following information can be used to monitor and evaluate the performance of the model:

device= cpu
train process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.494000,Test Acc: 0.581081,time0:00:08.936393
Epoch 1-0 Train acc: 0.514000,Test Acc: 0.817568,time0:00:06.084542
Epoch 2-0 Train acc: 0.796000,Test Acc: 0.668919,time0:00:06.000019
Epoch 3-0 Train acc: 0.672000,Test Acc: 0.770270,time0:00:07.456637
Epoch 4-0 Train acc: 0.786000,Test Acc: 0.763514,time0:00:06.132852
Epoch 5-0 Train acc: 0.824000,Test Acc: 0.834459,time0:00:06.584059
Epoch 6-0 Train acc: 0.810000,Test Acc: 0.814189,time0:00:06.600892
Epoch 7-0 Train acc: 0.780000,Test Acc: 0.790541,time0:00:07.301838

After the data processing and model training, the following files should be generated by TandemMod. The trained model m6A.demo.IVET.pkl will be saved in the ./demo/model/ folder. You can utilize this model for making predictions in the future.

demo
├── IVET
│   ├── IVET_m6A
│   ├── IVET_m6A.fastq
│   ├── IVET_m6A_guppy
│   ├── IVET_m6A_guppy_single
│   ├── IVET_m6A.sam
│   ├── IVET_unmod
│   ├── IVET_unmod.fastq
│   ├── IVET_unmod_guppy
│   ├── IVET_unmod_guppy_single
│   ├── IVET_unmod.sam
│   ├── m6A.feature.tsv
│   ├── m6A.signal.tsv
│   ├── m6A.test.feature.tsv
│   ├── m6A.train.feature.tsv
│   ├── unmod.feature.tsv
│   ├── unmod.signal.tsv
│   ├── unmod.test.feature.tsv
│   └── unmod.train.feature.tsv
├── IVET_reference.fa
└── model
       └── m6A.demo.IVET.pkl

Train m6A model using curlcake m6A dataset

Curlcake datasets are publicly available at the GEO database under the accession code GSE124309. In this demo, subsets of the curcake datasets (m6A-modified and unmodified) were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/curlcake/ directory.

demo
└── curlcake
    ├── curlcake_m6A
    │   └── curlcake_m6A.fast5
    └── curlcake_unmod
        └── curlcake_unmod.fast5

1. Guppy basecalling

Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.

#m6A
guppy_basecaller -i demo/curlcake/curlcake_m6A -s demo/curlcake/curlcake_m6A_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

#unmodified
guppy_basecaller -i demo/curlcake/curlcake_unmod -s demo/curlcake/curlcake_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

2. Multi-reads FAST5 files to single-read FAST5 files

Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.

#m6A
multi_to_single_fast5 -i demo/curlcake/curlcake_m6A_guppy -s demo/curlcake/curlcake_m6A_guppy_single --recursive

#unmodified
multi_to_single_fast5 -i demo/curlcake/curlcake_unmod_guppy -s demo/curlcake/curlcake_unmod_guppy_single --recursive

3. Tombo resquiggling

In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step. Curlcake reference file can be download here.

#m6A
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/curlcake/curlcake_m6A_guppy_single  demo/curlcake_reference.fa --processes 40 --fit-global-scale --include-event-stdev

#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/curlcake/curlcake_unmod_guppy_single  demo/curlcake_reference.fa --processes 40 --fit-global-scale --include-event-stdev

4. Map reads to reference

minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.

#m6A
cat demo/curlcake/curlcake_m6A_guppy/pass/*.fastq >demo/curlcake/curlcake_m6A.fastq
minimap2 -ax map-ont demo/curlcake_reference.fa demo/curlcake/curlcake_m6A.fastq >demo/curlcake/curlcake_m6A.sam

#unmodified
cat demo/curlcake/curlcake_unmod_guppy/pass/*.fastq >demo/curlcake/curlcake_unmod.fastq
minimap2 -ax map-ont demo/curlcake_reference.fa demo/curlcake/curlcake_unmod.fastq >demo/curlcake/curlcake_unmod.sam

5. Feature extraction

Extract signals and features from resquiggled fast5 files using the following python scripts.

#m6A
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/curlcake/curlcake_m6A_guppy_single --reference demo/curlcake_reference.fa --sam demo/curlcake/curlcake_m6A.sam --output demo/curlcake/m6A.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py  --signal_file demo/curlcake/m6A.signal.tsv --clip 10 --output demo/curlcake/m6A.feature.tsv --motif DRACH

#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/curlcake/curlcake_unmod_guppy_single --reference demo/curlcake_reference.fa --sam demo/curlcake/curlcake_unmod.sam --output demo/curlcake/unmod.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py  --signal_file demo/curlcake/unmod.signal.tsv --clip 10 --output demo/curlcake/unmod.feature.tsv --motif DRACH

In the feature extraction step, the motif pattern should be provided using the argument --motif. The base symbols of the motif follow the IUB code standard.

6. Train-test split

The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio to accommodate the specific requirements of the problem and the size of the dataset.

The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.

usage: train_test_split.py [-h] [--input_file INPUT_FILE]
                           [--train_file TRAIN_FILE] [--test_file TEST_FILE]
                           [--train_ratio TRAIN_RATIO]

Split a feature file into training and testing sets.

optional arguments:
  -h, --help                  show this help message and exit
  --input_file INPUT_FILE     Path to the input feature file
  --train_file TRAIN_FILE     Path to the train feature file
  --test_file TEST_FILE       Path to the test feature file
  --train_ratio TRAIN_RATIO   Ratio of instances to use for training (default: 0.8)

#m6A
python scripts/train_test_split.py --input_file demo/curlcake/m6A.feature.tsv --train_file demo/curlcake/m6A.train.feature.tsv --test_file demo/curlcake/m6A.test.feature.tsv --train_ratio 0.8

#unmodified
python scripts/train_test_split.py --input_file demo/curlcake/unmod.feature.tsv --train_file demo/curlcake/unmod.train.feature.tsv --test_file demo/curlcake/unmod.test.feature.tsv --train_ratio 0.8

7. Train m6A model

To train the TandemMod model using your own dataset from scratch, you can set the --run_mode argument to “train”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the model save path by using the argument --new_model. The model’s training epochs can be defined using the argument --epochs, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU for training if CUDA is available on your device; otherwise, it will utilize the CPU mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.

python scripts/TandemMod.py --run_mode train \
  --new_model demo/model/m6A.demo.curlcake.pkl \
  --train_data_mod demo/curlcake/m6A.train.feature.tsv \
  --train_data_unmod demo/curlcake/unmod.train.feature.tsv \
  --test_data_mod demo/curlcake/m6A.test.feature.tsv \
  --test_data_unmod demo/curlcake/unmod.test.feature.tsv \
  --epoch 100

During training process, the following information can be used to monitor and evaluate the performance of the model:

device= cpu
train process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.482000,Test Acc: 0.788462,time0:00:07.666192
Epoch 1-0 Train acc: 0.514000,Test Acc: 0.211538,time0:00:04.977504
Epoch 2-0 Train acc: 0.496000,Test Acc: 0.211538,time0:00:05.498799
Epoch 3-0 Train acc: 0.694000,Test Acc: 0.432692,time0:00:05.893204
Epoch 4-0 Train acc: 0.814000,Test Acc: 0.639423,time0:00:06.149194
Epoch 5-0 Train acc: 0.806000,Test Acc: 0.711538,time0:00:05.443221
Epoch 6-0 Train acc: 0.828000,Test Acc: 0.831731,time0:00:05.706294
Epoch 7-0 Train acc: 0.808000,Test Acc: 0.846154,time0:00:05.674450
Epoch 8-0 Train acc: 0.804000,Test Acc: 0.822115,time0:00:05.956936

After the data processing and model training, the following files should be generated by TandemMod. The trained model m6A.demo.curlcake.pkl will be saved in the ./demo/model/ folder. You can utilize this model for making predictions in the future.

demo
├── curlcake
│   ├── curlcake_m6A
│   ├── curlcake_m6A.fastq
│   ├── curlcake_m6A_guppy
│   ├── curlcake_m6A_guppy_single
│   ├── curlcake_m6A.sam
│   ├── curlcake_unmod
│   ├── curlcake_unmod.fastq
│   ├── curlcake_unmod_guppy
│   ├── curlcake_unmod_guppy_single
│   ├── curlcake_unmod.sam
│   ├── m6A.feature.tsv
│   ├── m6A.signal.tsv
│   ├── m6A.test.feature.tsv
│   ├── m6A.train.feature.tsv
│   ├── unmod.feature.tsv
│   ├── unmod.signal.tsv
│   ├── unmod.test.feature.tsv
│   └── unmod.train.feature.tsv
├── curlcake_reference.fa
└── model
       └── m6A.demo.curlcake.pkl

Transfer m6A model to m7G using ELIGOS dataset

To transfer the pretrained m6A model to an m7G prediction model using the ELIGOS dataset, you can follow these steps:

Obtain the ELIGOS dataset: Download or access the ELIGOS m7G dataset, which consists of the necessary data (m7G-modified and unmodified) for training and testing.
Prepare the data: Preprocess the ELIGOS dataset to extact features for transfer learning.
Load the pretrained m6A model: Load the pretrained m6A model that you want to transfer to predict m7G modifications. This model should have been previously trained on a relevant m6A dataset.
Train the modified model: Use the ELIGOS m7G dataset to fine-tune the model’s parameters using transfer learning techniques.
Evaluate the performance: Assess the performance of the transferred m7G model on the m7G testing set from the ELIGOS dataset.

By following these steps, you can transfer the knowledge gained from the pretrained m6A model to predict m7G modifications using the ELIGOS dataset.

ELIGOS datasets are publicly available at the SRA database under the accession code SRP166020. In this demo, subsets of the ELIGOS datasets (m7G-modified and unmodified) were taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/ELIGOS/ directory.

demo
└── ELIGOS
    ├── ELIGOS_m7G
    │   └── ELIGOS_m7G.fast5
    └── ELIGOS_unmod
        └── ELIGOS_unmod.fast5

1. Guppy basecalling

Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.

#m7G
guppy_basecaller -i demo/ELIGOS/ELIGOS_m7G -s demo/ELIGOS/ELIGOS_m7G_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

#unmodified
guppy_basecaller -i demo/ELIGOS/ELIGOS_unmod -s demo/ELIGOS/ELIGOS_unmod_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

2. Multi-reads FAST5 files to single-read FAST5 files

Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.

#m7G
multi_to_single_fast5 -i demo/ELIGOS/ELIGOS_m7G_guppy -s demo/ELIGOS/ELIGOS_m7G_guppy_single --recursive

#unmodified
multi_to_single_fast5 -i demo/ELIGOS/ELIGOS_unmod_guppy -s demo/ELIGOS/ELIGOS_unmod_guppy_single --recursive

3. Tombo resquiggling

In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-place. No separate files are generated in this step. ELIGOS reference file can be download here.

#m7G
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/ELIGOS/ELIGOS_m7G_guppy_single  demo/ELIGOS_reference.fa --processes 40 --fit-global-scale --include-event-stdev

#unmodified
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/ELIGOS/ELIGOS_unmod_guppy_single  demo/ELIGOS_reference.fa --processes 40 --fit-global-scale --include-event-stdev

4. Map reads to reference

minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.

#m7G
cat demo/ELIGOS/ELIGOS_m7G_guppy/pass/*.fastq >demo/ELIGOS/ELIGOS_m7G.fastq
minimap2 -ax map-ont demo/ELIGOS_reference.fa demo/ELIGOS/ELIGOS_m7G.fastq >demo/ELIGOS/ELIGOS_m7G.sam

#unmodified
cat demo/ELIGOS/ELIGOS_unmod_guppy/pass/*.fastq >demo/ELIGOS/ELIGOS_unmod.fastq
minimap2 -ax map-ont demo/ELIGOS_reference.fa demo/ELIGOS/ELIGOS_unmod.fastq >demo/ELIGOS/ELIGOS_unmod.sam

5. Feature extraction

Extract signals and features from resquiggled fast5 files using the following python scripts.

#m7G
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/ELIGOS/ELIGOS_m7G_guppy_single --reference demo/ELIGOS_reference.fa --sam demo/ELIGOS/ELIGOS_m7G.sam --output demo/ELIGOS/m7G.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py  --signal_file demo/ELIGOS/m7G.signal.tsv --clip 10 --output demo/ELIGOS/m7G.feature.tsv --motif NNGNN

#unmodified
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/ELIGOS/ELIGOS_unmod_guppy_single --reference demo/ELIGOS_reference.fa --sam demo/ELIGOS/ELIGOS_unmod.sam --output demo/ELIGOS/unmod.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py  --signal_file demo/ELIGOS/unmod.signal.tsv --clip 10 --output demo/ELIGOS/unmod.feature.tsv --motif NNGNN

In the feature extraction step, the motif pattern should be provided using the argument --motif. The base symbols of the motif follow the IUB code standard.

6. Train-test split

The train-test split is performed randomly, ensuring that the data points in each set are representative of the overall dataset. The default split ratios are 80% for training and 20% for testing. The train-test split ratio can be customized by using the argument --train_ratio to accommodate the specific requirements of the problem and the size of the dataset.

The training set is used to train the model, allowing it to learn patterns and relationships present in the data. The testing set, on the other hand, is used to assess the model’s performance on new, unseen data. It serves as an independent evaluation set to measure how well the trained model generalizes to data it has not encountered before. By evaluating the model on the testing set, we can estimate its performance, detect overfitting (when the model performs well on the training set but poorly on the testing set) and assess its ability to make accurate predictions on new data.

usage: train_test_split.py [-h] [--input_file INPUT_FILE]
                           [--train_file TRAIN_FILE] [--test_file TEST_FILE]
                           [--train_ratio TRAIN_RATIO]

Split a feature file into training and testing sets.

optional arguments:
  -h, --help                  show this help message and exit
  --input_file INPUT_FILE     Path to the input feature file
  --train_file TRAIN_FILE     Path to the train feature file
  --test_file TEST_FILE       Path to the test feature file
  --train_ratio TRAIN_RATIO   Ratio of instances to use for training (default: 0.8)

#m7G
python scripts/train_test_split.py --input_file demo/ELIGOS/m7G.feature.tsv --train_file demo/ELIGOS/m7G.train.feature.tsv --test_file demo/ELIGOS/m7G.test.feature.tsv --train_ratio 0.8

#unmodified
python scripts/train_test_split.py --input_file demo/ELIGOS/unmod.feature.tsv --train_file demo/ELIGOS/unmod.train.feature.tsv --test_file demo/ELIGOS/unmod.test.feature.tsv --train_ratio 0.8

7. Train m7G model

To transfer the pretrained TandemMod model to new types of modifications, you can set the --run_mode argument to “transfer”. TandemMod accepts both modified and unmodified feature files as input. Additionally, test feature files are necessary to evaluate the model’s performance. You can specify the pretrained model by using the argument --pretrained_model and the new model save path by using the argument --new_model. The model’s training epochs can be defined using the argument --epochs, and the model states will be saved at the end of each epoch. TandemMod will preferentially use the GPU for training if CUDA is available on your device; otherwise, it will utilize the CPU mode. The training process duration can vary, depending on the size of your dataset and the computational capacity, and may last for several hours.

usage: TandemMod.py [-h] --run_mode RUN_MODE
                    [--pretrained_model PRETRAINED_MODEL]
                    [--new_model NEW_MODEL] [--train_data_mod TRAIN_DATA_MOD]
                    [--train_data_unmod TRAIN_DATA_UNMOD]
                    [--test_data_mod TEST_DATA_MOD]
                    [--test_data_unmod TEST_DATA_UNMOD]
                    [--feature_file FEATURE_FILE]
                    [--predict_result PREDICT_RESULT] [--epoch EPOCH]

TandemMod, multiple types of RNA modification detection.

optional arguments:
  -h, --help                               show this help message and exit
  --run_mode RUN_MODE                      Run mode. Default is train
  --pretrained_model PRETRAINED_MODEL      Pretrained model file.
  --new_model NEW_MODEL                    New model file to be saved.
  --train_data_mod TRAIN_DATA_MOD          Train data file, modified.
  --train_data_unmod TRAIN_DATA_UNMOD      Train data file, unmodified.
  --test_data_mod TEST_DATA_MOD            Test data file, modified.
  --test_data_unmod TEST_DATA_UNMOD        Test data file, unmodified.
  --epoch EPOCH                            Training epoch

python scripts/TandemMod.py --run_mode transfer \
  --pretrained_model demo/model/m6A.demo.IVET.pkl \
  --new_model demo/model/m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl \
  --train_data_mod demo/ELIGOS/m7G.train.feature.tsv \
  --train_data_unmod demo/ELIGOS/unmod.train.feature.tsv \
  --test_data_mod demo/ELIGOS/m7G.test.feature.tsv \
  --test_data_unmod demo/ELIGOS/unmod.test.feature.tsv \
  --epoch 100

During training process, the following information can be used to monitor and evaluate the performance of the transfered model:

device= cpu
transfer learning process.
data loaded.
start training...
Epoch 0-0 Train acc: 0.544000,Test Acc: 0.489786,time0:00:08.688707
Epoch 1-0 Train acc: 0.674000,Test Acc: 0.857939,time0:00:05.190997
Epoch 2-0 Train acc: 0.748000,Test Acc: 0.813835,time0:00:05.426035
Epoch 3-0 Train acc: 0.778000,Test Acc: 0.753946,time0:00:05.180632
Epoch 4-0 Train acc: 0.854000,Test Acc: 0.776230,time0:00:05.236281
Epoch 5-0 Train acc: 0.886000,Test Acc: 0.817549,time0:00:05.219122
Epoch 6-0 Train acc: 0.926000,Test Acc: 0.889044,time0:00:05.470729

After the data processing and model training, the following files should be generated by TandemMod. The trained model m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl will be saved in the ./demo/model/ folder. You can utilize this fine-tuned model for making predictions in the future.

demo
├── ELIGOS
│   ├── ELIGOS_m7G
│   ├── ELIGOS_m7G.fastq
│   ├── ELIGOS_m7G_guppy
│   ├── ELIGOS_m7G_guppy_single
│   ├── ELIGOS_m7G.sam
│   ├── ELIGOS_unmod
│   ├── ELIGOS_unmod.fastq
│   ├── ELIGOS_unmod_guppy
│   ├── ELIGOS_unmod_guppy_single
│   ├── ELIGOS_unmod.sam
│   ├── m7G.feature.tsv
│   ├── m7G.signal.tsv
│   ├── m7G.test.feature.tsv
│   ├── m7G.train.feature.tsv
│   ├── unmod.feature.tsv
│   ├── unmod.signal.tsv
│   ├── unmod.test.feature.tsv
│   └── unmod.train.feature.tsv
├── ELIGOS_reference.fa
└── model
       ├── m6A.demo.IVET.pkl
       └── m7G.demo.ELIGOS.transfered_from_IVET_m6A.pkl

Predict m6A sites in human cell lines

HEK293T nanopore data is publicly available and can be downloaded from the SG-NEx project. In this demo, subset of the HEK293T nanopore data was taken for demonstration purposes due to the large size of the original datasets. The demo datasets were located under ./demo/HEK293T/ directory.

demo
└── HEK293T
    └── HEK293T_fast5
        └── HEK293T.fast5

1. Guppy basecalling

Basecalling converts the raw signal generated by Oxform Nanopore sequencing to DNA/RNA sequence. Guppy is used for basecalling in this step. In some nanopore datasets, the sequence information is already contained within the FAST5 files. In such cases, the basecalling step can be skipped as the sequence data is readily available.

guppy_basecaller -i demo/HEK293T/HEK293T_fast5 -s demo/HEK293T/HEK293T_fast5_guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

2. Multi-reads FAST5 files to single-read FAST5 files

Convert multi-reads FAST5 files to single-read FAST5 files. If the data generated by the sequencing device is already in the single-read format, this step can be skipped.

multi_to_single_fast5 -i demo/HEK293T/HEK293T_fast5_guppy -s demo/HEK293T/HEK293T_fast5_guppy_single --recursive

3. Tombo resquiggling

In this step, the sequence obtained by basecalling is aligned or mapped to a reference genome or a known sequence. Then the corrected sequence is then associated with the corresponding current signals. The resquiggling process is typically performed in-plac. No separate files are generated in this step. GRCh38 transcripts file can be download here.

tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/HEK293T/HEK293T_fast5_guppy_single  demo/GRCh38_subset_reference.fa --processes 40 --fit-global-scale --include-event-stdev

4. Map reads to reference

minimap2 is used to map basecalled sequences to reference transcripts. The output sam file serves as the input for the subsequent feature extraction step.

cat demo/HEK293T/HEK293T_fast5_guppy/pass/*.fastq >demo/HEK293T/HEK293T.fastq
minimap2 -ax map-ont demo/GRCh38_subset_reference.fa demo/HEK293T/HEK293T.fastq >demo/HEK293T/HEK293T.sam

5. Feature extraction

Extract signals and features from resquiggled fast5 files using the following python scripts.

python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/HEK293T/HEK293T_fast5_guppy_single --reference demo/GRCh38_subset_reference.fa --sam demo/HEK293T/HEK293T.sam --output demo/HEK293T/HEK293T.signal.tsv --clip=10
python scripts/extract_feature_from_signal.py  --signal_file demo/HEK293T/HEK293T.signal.tsv --clip 10 --output demo/HEK293T/HEK293T.feature.tsv --motif DRACH

In the feature extraction step, the motif pattern should be provided using the argument --motif. The base symbols of the motif follow the IUB code standard.

7. Predict m6A sites

To predict m6A sites in HEK293T nanopore data using a pretrained model, you can set the --run_mode argument to “predict”. You can specify the pretrained model by using the argument --pretrained_model.

python scripts/TandemMod.py --run_mode predict \
      --pretrained_model demo/model/m6A.demo.IVET.pkl \
      --feature_file demo/HEK293T/HEK293T.feature.tsv \
      --predict_result demo/HEK293T/HEK293T.prediction.tsv

During the prediction process, TandemMod generates the following files. The prediction result file is named “HEK293T.prediction.tsv”.

demo
└── HEK293T
    ├── HEK293T_fast5
    ├── HEK293T_fast5_guppy
    ├── HEK293T_fast5_guppy_single
    ├── HEK293T.fastq
    ├── HEK293T.feature.tsv
    ├── HEK293T.prediction.tsv
    ├── HEK293T.sam
    └── HEK293T.signal.tsv

The prediction result “demo/HEK293T/HEK293T.prediction.tsv” provides prediction labels along with the corresponding modification probabilities, which can be utilized for further analysis.

transcript_id   site    motif   read_id                                 prediction   probability
XM_005261965.4  10156   AAACA   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.1364245
XM_005261965.4  10164   AAACT   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.034915127
XM_005261965.4  10229   GAACC   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.4773725
XM_005261965.4  10241   GGACC   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.11096856
XM_005261965.4  10324   GGACT   60e0f6a3-2166-4730-9a10-8f8aaa750b37    mod          0.908553
XM_005261965.4  10362   AAACA   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.2004475
XM_005261965.4  10434   AGACA   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.1934688
XM_005261965.4  10498   GGACC   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.1313666
XM_005261965.4  10507   AAACA   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.030169742
XM_005261965.4  10511   AAACT   60e0f6a3-2166-4730-9a10-8f8aaa750b37    unmod        0.020174831
XM_005261965.4  10592   AGACT   60e0f6a3-2166-4730-9a10-8f8aaa750b37    mod          0.7666112

The execution time for each demonstration is estimated to be approximately 3-10 minutes.