Data preprocessing

Preprocessing transforms the raw electrical signal data from nanopore direct RNA sequencing (DRS) into features that can be used to study RNA modifications. The procedure typically consists of the following stages.

  • Raw FAST5 files: Begin with the raw FAST5 files generated by the nanopore sequencing platform. These files contain the raw electrical signal data captured during sequencing.

  • Basecalling: Perform basecalling on the raw FAST5 files using basecalling software such as Guppy or Albacore. Basecalling converts the raw electrical signals into nucleotide sequences (e.g., A, C, G, T).

  • Resquiggling: Extract events from the basecalled FAST5 files. Events represent segments of the raw signal associated with individual bases. This step involves aligning the corrected sequences with the raw signal data.

  • Feature extraction: Extract features from the events that capture signal characteristics associated with RNA modifications. Features can include event duration, mean, standard deviation, and other signal-based properties.

  • Preprocessing and normalization: Preprocess the extracted features by applying normalization techniques such as scaling or log transformation. Normalization helps to remove any biases or variations in the features across different samples.

  • Dataset creation: Combine the preprocessed features to create a dataset, which will be used for training or prediction purposes.

  • Additional processing: Depending on the specific requirements of the analysis, additional processing steps may be performed, such as filtering out low-quality events, removing noise, or applying statistical methods.
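
The feature extraction and normalization stages above can be illustrated with a minimal sketch. It assumes a resquiggled read is already available as a list of raw current values plus one (start, length) segment per base. The function names `event_features` and `zscore`, the sample data, and the sampling rate are illustrative assumptions and are not taken from TandemMod.

```python
from statistics import mean, median, pstdev

def event_features(signal, segments, sample_rate_hz=3012.0):
    """Summarize each per-base segment (event) of a raw signal.

    signal         -- list of current values (hypothetical input)
    segments       -- list of (start, length) pairs, one per base
    sample_rate_hz -- assumed sampling rate; check your run's metadata
    """
    features = []
    for start, length in segments:
        chunk = signal[start:start + length]
        features.append({
            "mean": mean(chunk),
            "median": median(chunk),
            "std": pstdev(chunk),                   # population standard deviation
            "dwell_time": length / sample_rate_hz,  # seconds spent on this base
        })
    return features

def zscore(values):
    """Scale values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Toy signal with three events (made-up numbers).
signal = [80.0, 82.0, 81.0, 100.0, 102.0, 60.0, 61.0, 59.0, 60.0]
segments = [(0, 3), (3, 2), (5, 4)]
feats = event_features(signal, segments)
scaled_means = zscore([f["mean"] for f in feats])
```

Z-scoring is only one of the normalization options mentioned above; a log transformation or another scaling could be substituted in the same place.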

Basecalling

Guppy is used for basecalling in TandemMod. Like the now-deprecated Albacore and other basecallers, Guppy takes files in FAST5 format as input. In addition to basecalling, Guppy filters out low-quality reads and clips Oxford Nanopore adapters. More detailed documentation about Guppy can be found on the official Oxford Nanopore Technologies repository. Basecalling can be time-consuming and may take several hours or even days, depending on the available computational capacity:

guppy_basecaller -i demo/fast5 -s demo/guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

Arguments

| Argument name | Required | Default | Description |
|---|---|---|---|
| -i=DIR | Yes | NA | Input directory containing the FAST5 files generated by the nanopore sequencing platform. |
| -s=DIR | Yes | NA | Output directory for the FAST5 files and the basecalled sequences. |
| --num_callers=NUM | No | 1 | Number of processes to run. |
| --fast5_out | Yes | None | Write FAST5 files to the output directory. |
| --config=STR | Yes | None | Configuration file. TandemMod uses "rna_r9.4.1_70bps_hac.cfg"; adjust it to match the DRS platform. |
| --recursive | Yes | None | Process files in subdirectories recursively (batch processing). |
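
When basecalling is launched from a pipeline script rather than typed by hand, it can help to assemble the command programmatically. The sketch below only rebuilds the command shown in this section from the arguments in the table; the helper name `guppy_command` is hypothetical and is not part of Guppy or TandemMod.

```python
import shlex

def guppy_command(fast5_dir, out_dir, config, num_callers=1):
    """Build the guppy_basecaller argument list used in this section."""
    return [
        "guppy_basecaller",
        "-i", str(fast5_dir),
        "-s", str(out_dir),
        "--num_callers", str(num_callers),
        "--recursive",
        "--fast5_out",
        "--config", config,
    ]

cmd = guppy_command("demo/fast5", "demo/guppy",
                    "rna_r9.4.1_70bps_hac.cfg", num_callers=40)
print(shlex.join(cmd))
# To actually start the run (requires Guppy on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a directory name contains spaces.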

Multi-fast5 to single-fast5

If FAST5 reads are stored in multi-read format, ont_fast5_api is recommended for converting them to single-read FAST5 files. A multi-read FAST5 file is usually about 200-300 MB in size. To convert multi-read files to single-read files:

multi_to_single_fast5 -i demo/guppy -s demo/guppy_single -t 40 --recursive

Arguments

| Argument name | Required | Default | Description |
|---|---|---|---|
| -i=DIR | Yes | NA | Input directory containing multi-read FAST5 files. |
| -s=DIR | Yes | NA | Output directory for single-read FAST5 files. |
| -t=NUM | No | 1 | Number of processes to run. |
| --recursive | Yes | NA | Process files in subdirectories recursively (batch processing). |

Resquiggling

The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference and then the raw signal is assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in TandemMod:

tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/guppy_single  demo/reference_transcripts.fasta --processes 40 --fit-global-scale --include-event-stdev

Arguments

| Argument name | Required | Default | Description |
|---|---|---|---|
| --overwrite | Yes | NA | Overwrite the previous corrected group in the FAST5 files. |
| --basecall-group | No | Basecall_1D_000 | FAST5 group from which to obtain the original basecalls. |
| --processes | No | 1 | Number of processes to run. |
| --fit-global-scale | No | NA | Fit a global scaling factor. |
| --include-event-stdev | No | NA | Include the standard deviation of each event. |
| args[0] | Yes | NA | FAST5 base directory. |
| args[1] | Yes | NA | Reference transcripts, in FASTA format. |
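
Comparing raw current against an expected current level model requires the signal to be on a comparable scale across reads. As a rough illustration of that idea only (this is not Tombo's implementation), the sketch below applies a robust normalization: shift by the median, then scale by the median absolute deviation. The function name `median_mad_normalize` and the sample values are hypothetical.

```python
from statistics import median

def median_mad_normalize(signal):
    """Shift a signal by its median and scale by its median absolute deviation."""
    shift = median(signal)
    scale = median(abs(x - shift) for x in signal)
    if scale == 0:
        raise ValueError("signal has zero median absolute deviation")
    return [(x - shift) / scale for x in signal]

# Made-up raw current values; the last one is an outlier spike.
raw = [90.0, 95.0, 100.0, 105.0, 110.0, 300.0]
norm = median_mad_normalize(raw)
```

Median-based statistics are used here because a single spike, such as the last value in `raw`, barely changes the shift and scale, whereas it would distort a mean and standard deviation.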

Feature extraction

minimap2 is used to map basecalled sequences to reference transcripts:

cat demo/guppy/pass/*.fastq >demo/m6A.fastq
minimap2 -ax map-ont demo/reference_transcripts.fasta demo/m6A.fastq >demo/m6A.sam
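
The SAM file produced by minimap2 links each read to a reference transcript and position. As an illustration of how such a file can be read (the repository script may handle it differently), the sketch below keeps only primary, mapped alignments by checking the standard SAM flag bits. The function name `primary_alignments` and the example records are hypothetical.

```python
UNMAPPED, REVERSE, SECONDARY, SUPPLEMENTARY = 0x4, 0x10, 0x100, 0x800

def primary_alignments(sam_lines):
    """Yield (read_id, reference, position, strand) for primary mapped reads."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, flag, ref, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        strand = "-" if flag & REVERSE else "+"
        yield read_id, ref, pos, strand  # pos is 1-based, as in the SAM format

# Made-up SAM records for illustration.
example = [
    "@SQ\tSN:tx1\tLN:1000",
    "read1\t0\ttx1\t5\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "read1\t256\ttx2\t9\t0\t10M\t*\t0\t0\t*\t*",
    "read3\t16\ttx1\t20\t60\t4M\t*\t0\t0\tACGT\t*",
]
hits = list(primary_alignments(example))
```

With a real file, the same generator can be fed an open file handle, for example `primary_alignments(open("demo/m6A.sam"))`.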

Extract signal files from FAST5 files:

python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/guppy_single --reference demo/reference_transcripts.fasta --sam demo/m6A.sam --output demo/m6A.signal.tsv --clip 10

Arguments

| Argument name | Required | Default | Description |
|---|---|---|---|
| --fast5 | Yes | NA | FAST5 base directory. |
| --reference | Yes | NA | Reference transcripts, in FASTA format. |
| -p | No | 1 | Number of processes to run. |
| --sam | Yes | NA | Alignment results output by minimap2. |
| --output | Yes | NA | Output file containing current signals. |
| --clip | Yes | NA | Number of bases to clip at both ends. |

Extract features from signal files:

python scripts/extract_feature_from_signal.py  --signal_file demo/m6A.signal.tsv --clip 10 --output demo/m6A.feature.tsv --motif DRACH

Arguments

| Argument name | Required | Default | Description |
|---|---|---|---|
| --signal_file | Yes | NA | File containing current signals. |
| --reference | Yes | NA | Reference transcripts, in FASTA format. |
| --output | Yes | NA | Output file containing features. |
| --clip | Yes | NA | Number of bases to clip at both ends. |
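
The feature extraction command passes `--motif DRACH`, which is an IUPAC motif: D stands for A, G, or T, R for A or G, and H for A, C, or T. How the script applies the motif internally is not described here, so the following is only a sketch of how motif sites could be located in a transcript sequence. The function name `motif_sites` and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for common motifs, expressed as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]",
    "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}

def motif_sites(sequence, motif="DRACH"):
    """Return 0-based start positions of every motif occurrence."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    seq = sequence.upper().replace("U", "T")  # treat RNA and DNA letters alike
    # A lookahead reports overlapping occurrences as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]

sites = motif_sites("TTGGACTAAACAGGACC")
print(sites)  # [2, 7, 12]
```

For DRACH, the candidate modified adenosine is the third base of the motif, so it lies at `start + 2` for each reported position.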