Data preprocessing

This section describes how to transform the raw electrical signal data from nanopore direct RNA sequencing (DRS) into features that can be used to study RNA modifications. The preprocessing procedure typically consists of the following stages.

Raw FAST5 files: Begin with the raw FAST5 files generated by the nanopore sequencing platform. These files contain the raw electrical signal data captured during sequencing.
Basecalling: Perform basecalling on the raw FAST5 files using basecalling software such as Guppy or Albacore. Basecalling converts the raw electrical signals into nucleotide sequences (e.g., A, C, G, T).
Resquiggling: Extract events from the basecalled FAST5 files. Events are segments of the raw signal associated with individual bases. This step aligns the raw signal data with the corrected sequences.
Feature extraction: Extract features from the events that capture signal characteristics indicative of RNA modifications. Features can include event duration, mean, standard deviation, and other signal-based properties.
Preprocessing and normalization: Preprocess the extracted features by applying normalization techniques such as scaling or log transformation. Normalization helps reduce biases and systematic variation in the features across samples.
Dataset creation: Combine the preprocessed features to create a dataset, which will be used for training or prediction purposes.
Additional processing: Depending on the specific requirements of the analysis, additional processing steps may be performed, such as filtering out low-quality events, removing noise, or applying statistical methods.
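
As a rough illustration of the normalization stage listed above, the following sketch log-transforms event durations and applies z-score scaling to each feature column. The feature names (mean, stdev, duration), the example values, and the function names are hypothetical, and this is not TandemMod's own normalization code; it only shows the general technique using the Python standard library.

```python
import math
import statistics

def zscore(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def normalize_features(rows):
    """Normalize a list of per-event feature dicts column by column.

    Durations are log-transformed first, on the assumption (made for
    this sketch) that dwell times are right-skewed.
    """
    columns = {
        "mean": [r["mean"] for r in rows],
        "stdev": [r["stdev"] for r in rows],
        "duration": [math.log(r["duration"]) for r in rows],
    }
    scaled = {name: zscore(vals) for name, vals in columns.items()}
    return [
        {name: scaled[name][i] for name in scaled}
        for i in range(len(rows))
    ]

# Hypothetical events: current mean, standard deviation, duration (samples).
events = [
    {"mean": 95.2, "stdev": 2.1, "duration": 12},
    {"mean": 110.7, "stdev": 3.4, "duration": 48},
    {"mean": 102.3, "stdev": 1.8, "duration": 20},
]
normalized = normalize_features(events)
```

After scaling, every column has zero mean and unit variance, so features measured on different scales contribute comparably to downstream training or prediction.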

Basecalling

Guppy is used for basecalling in TandemMod. Guppy, like the now-deprecated Albacore and other basecallers, takes files in FAST5 format as input. In addition to basecalling, Guppy filters out low-quality reads and clips Oxford Nanopore adapters. More detailed documentation about Guppy can be found in the official Oxford Nanopore Technologies repository. This step can be time-consuming and may take several hours or even days to complete, depending on the available computational capacity:

guppy_basecaller -i demo/fast5 -s demo/guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

Arguments

Argument name	Required	Default	Description
-i=DIR	Yes	NA	Input directory, containing FAST5 files generated by the nanopore sequencing platform
-s=DIR	Yes	NA	Output directory for the FAST5 files and the basecalled sequences.
--num_callers=NUM	No	1	Number of processes to run.
--fast5_out	Yes	None	Write FAST5 files to the output directory.
--config=STR	Yes	None	Configuration file. TandemMod uses "rna_r9.4.1_70bps_hac.cfg"; adjust it to match the DRS platform.
--recursive	Yes	None	Process files in the input directory recursively (batch processing).
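
When basecalling is scripted, it can help to assemble the command from its parameters. The sketch below builds the same guppy_basecaller invocation shown above using only the flags listed in the table; the helper name build_guppy_command is hypothetical, and the command is only printed here, not executed.

```python
import shlex

def build_guppy_command(fast5_dir, out_dir, config, num_callers=1):
    """Return the guppy_basecaller argument list used in this section."""
    return [
        "guppy_basecaller",
        "-i", fast5_dir,
        "-s", out_dir,
        "--num_callers", str(num_callers),
        "--recursive",
        "--fast5_out",
        "--config", config,
    ]

cmd = build_guppy_command(
    "demo/fast5", "demo/guppy", "rna_r9.4.1_70bps_hac.cfg", num_callers=40
)
print(shlex.join(cmd))
# To run it (requires Guppy on PATH): subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a directory name contains spaces.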

Multi-fast5 to single-fast5

If FAST5 reads are stored in multi-read format, ont_fast5_api is recommended for converting multi-read FAST5 files to single-read FAST5 files. A multi-read FAST5 file is usually about 200-300 MB in size. Convert multi-read files to single-read files:

multi_to_single_fast5 -i demo/guppy -s demo/guppy_single -t 40 --recursive

Arguments

Argument name	Required	Default	Description
-i=DIR	Yes	NA	Input directory, containing multi-read FAST5 files.
-s=DIR	Yes	NA	Output directory, containing single-read FAST5 files.
-t=NUM	No	1	Number of processes to run.
--recursive	Yes	NA	Process files in the input directory recursively (batch processing).
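
To decide whether conversion is needed, the layout of a FAST5 file can be guessed from the names of its top-level HDF5 groups. The following is a minimal sketch with a hypothetical helper, classify_fast5_layout, that works on a plain list of group names; reading those names from a real file would need an HDF5 library such as h5py, which is an assumption and not part of the commands in this section.

```python
def classify_fast5_layout(top_level_keys):
    """Guess the FAST5 layout from the names of the top-level HDF5 groups.

    Multi-read files hold one group per read, named "read_<read id>".
    Single-read files keep the signal under a top-level "Raw" group.
    """
    keys = list(top_level_keys)
    if any(k.startswith("read_") for k in keys):
        return "multi-read"
    if "Raw" in keys:
        return "single-read"
    return "unknown"

# Hypothetical group names; with h5py (an assumption) they could come from:
#     with h5py.File(path, "r") as f:
#         layout = classify_fast5_layout(f.keys())
print(classify_fast5_layout(["read_0001", "read_0002"]))
print(classify_fast5_layout(["Analyses", "Raw", "UniqueGlobalKey"]))
```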

Resquiggling

The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference and then the raw signal is assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in TandemMod:

tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/guppy_single demo/reference_transcripts.fasta --processes 40 --fit-global-scale --include-event-stdev

Arguments

Argument name	Required	Default	Description
--overwrite	Yes	NA	Overwrite the previously corrected group in the FAST5 files.
--basecall-group	No	Basecall_1D_000	FAST5 group from which to obtain the original basecalls.
--processes	No	1	Number of processes to run.
--fit-global-scale	No	NA	Fit a global scaling parameter for all reads.
--include-event-stdev	No	NA	Include the standard deviation of each corrected event in the output.
args[0]	Yes	NA	FAST5 base directory.
args[1]	Yes	NA	Reference transcripts, in FASTA format.
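
Conceptually, resquiggling assigns a segment of the raw signal to each base. The sketch below summarizes such segments into per-base events (mean, standard deviation, length) after shifting the read by its median and scaling by its median absolute deviation. The signal values, segment boundaries, and function names are hypothetical, and the normalization shown is one common choice; Tombo computes the actual signal assignment and scaling, so this illustrates the idea and is not Tombo's implementation.

```python
import statistics

def normalize_signal(signal):
    """Shift by the median and scale by the median absolute deviation."""
    med = statistics.median(signal)
    mad = statistics.median(abs(x - med) for x in signal)
    if mad == 0:
        raise ValueError("signal has zero median absolute deviation")
    return [(x - med) / mad for x in signal]

def summarize_events(signal, starts, bases):
    """Summarize each base's signal segment.

    starts[i] is the index where the segment for bases[i] begins;
    the last segment runs to the end of the signal.
    """
    ends = list(starts[1:]) + [len(signal)]
    events = []
    for base, start, end in zip(bases, starts, ends):
        segment = signal[start:end]
        events.append({
            "base": base,
            "mean": statistics.fmean(segment),
            "stdev": statistics.pstdev(segment),
            "length": end - start,
        })
    return events

# Hypothetical raw current samples and segment start indices.
raw = [80, 82, 81, 100, 102, 101, 99, 90, 91]
norm = normalize_signal(raw)
events = summarize_events(norm, starts=[0, 3, 7], bases="GAC")
```

Each entry in events then carries the kind of per-base summary (mean, standard deviation, dwell length) that later feature extraction builds on.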

Feature extraction

minimap2 is used to map basecalled sequences to reference transcripts:

cat demo/guppy/pass/*.fastq >demo/m6A.fastq
minimap2 -ax map-ont demo/reference_transcripts.fasta demo/m6A.fastq >demo/m6A.sam
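
The SAM file written by minimap2 can contain unmapped reads as well as secondary and supplementary alignments. As a minimal sketch of how the primary alignment of each read could be selected, the function below filters records by their SAM flag bits. The function name and the example records are hypothetical, and this is not necessarily how the TandemMod scripts read the file.

```python
def primary_alignments(sam_lines):
    """Map read name -> (reference name, 1-based position) for primary hits.

    Skips header lines and reads flagged as unmapped (0x4),
    secondary (0x100), or supplementary (0x800).
    """
    skip_mask = 0x4 | 0x100 | 0x800
    hits = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, ref, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & skip_mask:
            continue
        hits[name] = (ref, pos)
    return hits

# Hypothetical records; in practice, iterate over open("demo/m6A.sam").
example = [
    "@SQ\tSN:transcript_1\tLN:1500\n",
    "read_a\t0\ttranscript_1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n",
    "read_a\t256\ttranscript_2\t40\t0\t8M\t*\t0\t0\t*\t*\n",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
hits = primary_alignments(example)
```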

Extract signal files from FAST5 files:

python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/guppy_single --reference demo/reference_transcripts.fasta --sam demo/m6A.sam --output demo/m6A.signal.tsv --clip 10

Arguments

Argument name	Required	Default	Description
--fast5	Yes	NA	FAST5 base directory.
--reference	Yes	NA	Reference transcripts, in FASTA format.
-p	No	1	Number of processes to run.
--sam	Yes	NA	Alignment results, output from minimap2.
--output	Yes	NA	Output file containing current signals.
--clip	Yes	NA	Number of bases to clip at both ends.

Extract features from signal files:

python scripts/extract_feature_from_signal.py  --signal_file demo/m6A.signal.tsv --clip 10 --output demo/m6A.feature.tsv --motif DRACH
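
The --motif value in the command above, DRACH, is written in IUPAC nucleotide codes (D = A, G or U; R = A or G; H = A, C or U). As a hedged illustration of what matching such a motif involves, the sketch below expands IUPAC codes into a regular expression and lists every occurrence in a sequence. The function names and the example sequence are hypothetical, T stands in for U, and whether extract_feature_from_signal.py interprets the motif this way is an assumption.

```python
import re

# IUPAC nucleotide codes, with T standing in for U
# (assumption: sequences are written with T rather than U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate an IUPAC motif such as DRACH into a compiled regex."""
    # The lookahead makes overlapping occurrences discoverable.
    return re.compile("(?=(" + "".join(IUPAC[c] for c in motif) + "))")

def find_motif(sequence, motif):
    """Return (0-based start, matched k-mer) for every motif occurrence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical sequence with two DRACH sites.
sites = find_motif("TTGGACTAAACA", "DRACH")
print(sites)  # [(2, 'GGACT'), (7, 'AAACA')]
```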

Arguments

Argument name	Required	Default	Description
--signal_file	Yes	NA	File containing current signals.
--reference	Yes	NA	Reference transcripts, in FASTA format.
--output	Yes	NA	Output file containing features.
--clip	Yes	NA	Number of bases to clip at both ends.