Data preprocessing
This section describes how the raw electrical signal data from nanopore direct RNA sequencing (DRS) is transformed into features that can be used to study RNA modifications. Preprocessing typically consists of the following stages.
Raw FAST5 files: Begin with the raw FAST5 files generated by the nanopore sequencing platform. These files contain the raw electrical signal data captured during sequencing.
Basecalling: Perform basecalling on the raw FAST5 files using basecalling software such as Guppy (or the now deprecated Albacore). Basecalling converts the raw electrical signals into nucleotide sequences.
Resquiggling: Extract events from the basecalled FAST5 files. Events represent segments of the raw signal associated with individual bases. This step involves aligning the corrected sequences with the raw signal data.
Feature extraction: Extract features from the events that capture signal characteristics associated with RNA modifications. Features can include event duration, mean, standard deviation and other signal-based properties.
Preprocessing and normalization: Preprocess the extracted features by applying normalization techniques such as scaling or log transformation. Normalization helps to remove any biases or variations in the features across different samples.
Dataset creation: Combine the preprocessed features to create a dataset, which will be used for training or prediction purposes.
Additional processing: Depending on the specific requirements of the analysis, additional processing steps may be performed, such as filtering out low-quality events, removing noise, or applying statistical methods.
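The feature extraction and normalization stages above can be sketched in a few lines of Python. This is a minimal illustration, not TandemMod's implementation: the `events` list, the feature names, and the choice of z-score scaling and a log transform of event length are all assumptions made for the example.

```python
import math
import statistics


def event_features(segment):
    """Summarise one event (the current samples assigned to a single base)."""
    return {
        "mean": statistics.fmean(segment),
        "median": statistics.median(segment),
        "std": statistics.pstdev(segment),
        "length": len(segment),  # number of samples, a proxy for event duration
    }


def zscore(values):
    """Scale values to zero mean and unit variance (one possible normalization)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical events: each inner list holds the current samples for one base.
events = [[80.0, 82.0, 81.0], [95.0, 97.0, 96.0, 96.0], [70.0, 70.0]]

features = [event_features(e) for e in events]
scaled_means = zscore([f["mean"] for f in features])
log_lengths = [math.log(f["length"]) for f in features]
```

Scaling and log transformation are applied per feature here; whether to normalize per read, per sample, or across the whole dataset depends on the analysis.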
Basecalling
Guppy is used for basecalling in TandemMod. Like the now deprecated Albacore, Guppy takes files in FAST5 format as input. In addition to basecalling, Guppy filters out low-quality reads and clips Oxford Nanopore adapters. More detailed documentation about Guppy can be found in the official Oxford Nanopore Technologies repository. This step can be time-consuming and may take several hours or even days to complete, depending on the available computational capacity:
guppy_basecaller -i demo/fast5 -s demo/guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg
Arguments
Argument name | Required | Default | Description |
---|---|---|---|
-i=DIR | Yes | NA | Input directory containing the FAST5 files generated by the nanopore sequencing platform. |
-s=DIR | Yes | NA | Output directory for the FAST5 files and the basecalled sequences. |
--num_callers=NUM | No | 1 | Number of processes to run. |
--fast5_out | Yes | None | Write FAST5 files to the output directory. |
--config=STR | Yes | None | Configuration file. TandemMod uses "rna_r9.4.1_70bps_hac.cfg"; adjust it to match the DRS platform. |
--recursive | Yes | None | Process the input directory recursively, allowing batch processing of files. |
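When basecalling is launched from a script, the command can be assembled as an argument list so that paths and the configuration name are easy to change. The sketch below only builds and prints the command shown above; the helper name `build_guppy_command` is hypothetical, and the flags are exactly those used in this section.

```python
import shlex


def build_guppy_command(fast5_dir, out_dir, config, num_callers=1):
    """Return the guppy_basecaller call as an argument list (hypothetical helper)."""
    return [
        "guppy_basecaller",
        "-i", fast5_dir,
        "-s", out_dir,
        "--num_callers", str(num_callers),
        "--recursive",
        "--fast5_out",
        "--config", config,
    ]


cmd = build_guppy_command(
    "demo/fast5", "demo/guppy", "rna_r9.4.1_70bps_hac.cfg", num_callers=40
)
print(shlex.join(cmd))

# To actually run it (requires Guppy to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when directory names contain spaces.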
Multi-fast5 to single-fast5
If the FAST5 reads are stored in multi-read format, ont_fast5_api is recommended for converting them to single-read FAST5 files. A multi-read FAST5 file is usually about 200-300 MB in size. To convert multi-read files to single-read files:
multi_to_single_fast5 -i demo/guppy -s demo/guppy_single -t 40 --recursive
Arguments
Argument name | Required | Default | Description |
---|---|---|---|
-i=DIR | Yes | NA | Input directory containing multi-read FAST5 files. |
-s=DIR | Yes | NA | Output directory for the single-read FAST5 files. |
-t=NUM | No | 1 | Number of processes to run. |
--recursive | Yes | NA | Process the input directory recursively, allowing batch processing of files. |
Resquiggling
The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference and then the raw signal is assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in TandemMod:
tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/guppy_single demo/reference_transcripts.fasta --processes 40 --fit-global-scale --include-event-stdev
Arguments
Argument name | Required | Default | Description |
---|---|---|---|
--overwrite | Yes | NA | Overwrite the previously corrected group in the FAST5 files. |
--basecall-group | No | Basecall_1D_000 | FAST5 group from which to obtain the original basecalls. |
--processes | No | 1 | Number of processes to run. |
--fit-global-scale | No | NA | Fit a global scaling factor. |
--include-event-stdev | No | NA | Include the event standard deviation. |
args[0] | Yes | NA | FAST5 base directory. |
args[1] | Yes | NA | Reference transcripts in FASTA format. |
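To make the idea of assigning raw signal to reference bases concrete, the sketch below normalizes a signal and then cuts it into per-base segments. It does not reproduce Tombo's algorithm, which fits the signal against an expected current level model. Median and median absolute deviation (MAD) scaling is used here as one common choice for nanopore signal, and the `starts` and `lengths` values are hypothetical stand-ins for the per-base boundaries that resquiggling produces.

```python
import statistics


def median_mad_normalize(signal):
    """Shift by the median and scale by the median absolute deviation."""
    shift = statistics.median(signal)
    scale = statistics.median([abs(x - shift) for x in signal])
    if scale == 0:
        raise ValueError("signal has zero MAD and cannot be scaled")
    return [(x - shift) / scale for x in signal]


def split_by_base(signal, starts, lengths):
    """Slice the signal into one segment per base using start/length pairs."""
    return [signal[s:s + n] for s, n in zip(starts, lengths)]


# Hypothetical raw samples and per-base boundaries for a two-base stretch.
raw_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = median_mad_normalize(raw_signal)
segments = split_by_base(normalized, starts=[0, 2], lengths=[2, 3])
```

Each resulting segment corresponds to one event, from which summary statistics such as the mean and standard deviation can be computed.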
Feature extraction
minimap2 is used to map basecalled sequences to reference transcripts:
cat demo/guppy/pass/*.fastq >demo/m6A.fastq
minimap2 -ax map-ont demo/reference_transcripts.fasta demo/m6A.fastq >demo/m6A.sam
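The SAM file written by minimap2 is tab-separated text, so the read-to-transcript mapping can be inspected without extra dependencies. The sketch below parses the mandatory SAM fields and keeps only mapped primary alignments. It is an illustration of the file format, not the logic of the TandemMod scripts, and the example records are made up.

```python
def parse_sam_line(line):
    """Parse the first six mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_id": fields[0],
        "flag": flag,
        "reference": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        # 0x100 marks secondary and 0x800 supplementary alignments
        "is_primary": not (flag & 0x900),
    }


def primary_alignments(lines):
    """Yield mapped primary alignments, skipping header lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        record = parse_sam_line(line)
        if not record["is_unmapped"] and record["is_primary"]:
            yield record


# Made-up records: one primary, one secondary, one unmapped.
sam_lines = [
    "@SQ\tSN:transcript_1\tLN:1000",
    "read_a\t0\ttranscript_1\t101\t60\t50M\t*\t0\t0\t*\t*",
    "read_a\t256\ttranscript_2\t5\t0\t50M\t*\t0\t0\t*\t*",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
kept = list(primary_alignments(sam_lines))
```

For a real file, pass an open file handle instead of the list, for example `primary_alignments(open("demo/m6A.sam"))`.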
Extract signal files from FAST5 files:
python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/guppy_single --reference demo/reference_transcripts.fasta --sam demo/m6A.sam --output demo/m6A.signal.tsv --clip 10
Arguments
Argument name | Required | Default | Description |
---|---|---|---|
--fast5 | Yes | NA | FAST5 base directory. |
--reference | Yes | NA | Reference transcripts in FASTA format. |
-p | No | 1 | Number of processes to run. |
--sam | Yes | NA | Alignment results output by minimap2. |
--output | Yes | NA | Output file containing current signals. |
--clip | Yes | NA | Number of bases to clip at both ends. |
Extract features from signal files:
python scripts/extract_feature_from_signal.py --signal_file demo/m6A.signal.tsv --clip 10 --output demo/m6A.feature.tsv --motif DRACH
Arguments
Argument name | Required | Default | Description |
---|---|---|---|
--signal_file | Yes | NA | File containing current signals. |
--reference | Yes | NA | Reference transcripts in FASTA format. |
--output | Yes | NA | Output file containing features. |
--clip | Yes | NA | Number of bases to clip at both ends. |
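The `--motif DRACH` value in the command above is written in IUPAC nucleotide codes (D = A, G or T; R = A or G; H = A, C or T). The sketch below shows one way to locate such motif sites in a sequence while ignoring positions near the ends. It is an illustration only: the function names are hypothetical, and the way `clip` is applied here is an assumption rather than the documented behaviour of `extract_feature_from_signal.py`.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as DRACH into a regex pattern."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif_sites(sequence, motif="DRACH", clip=0):
    """Return 0-based start positions of motif matches, overlaps included.

    Matches that extend into the first or last `clip` bases are skipped
    (an assumed interpretation of clipping, for illustration).
    """
    seq = sequence.upper().replace("U", "T")
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    upper = len(seq) - clip
    sites = []
    for match in pattern.finditer(seq):
        start = match.start()
        end = start + len(motif)
        if start >= clip and end <= upper:
            sites.append(start)
    return sites


example = "TTGGACTTTAAACATT"
all_sites = find_motif_sites(example)
clipped_sites = find_motif_sites(example, clip=3)
```

The lookahead in the pattern lets overlapping motif occurrences be reported, which a plain `re.findall` on the motif would miss.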