.. _data_preprocessing:

Data preprocessing
==================================

This section describes how the raw electrical signal data from nanopore direct RNA sequencing (DRS) are transformed into features that can be used to study RNA modifications. The preprocessing procedure typically consists of the following stages.

* Raw FAST5 files: Begin with the raw FAST5 files generated by the nanopore sequencing platform. These files contain the raw electrical signal data captured during sequencing.
* Basecalling: Perform basecalling on the raw FAST5 files using basecalling software such as Guppy or Albacore. Basecalling converts the raw electrical signals into base sequences (e.g., A, C, G, T).
* Resquiggling: Extract events from the basecalled FAST5 files. Events are segments of the raw signal associated with individual bases. This step aligns the corrected sequences with the raw signal data.
* Feature extraction: Extract features from the events to capture signal characteristics that reflect RNA modifications. Features can include event duration, mean, standard deviation and other signal-based properties.
* Preprocessing and normalization: Preprocess the extracted features by applying normalization techniques such as scaling or log transformation. Normalization helps to remove biases and variation in the features across different samples.
* Dataset creation: Combine the preprocessed features into a dataset, which is used for training or prediction.
* Additional processing: Depending on the requirements of the analysis, additional steps may be performed, such as filtering out low-quality events, removing noise, or applying statistical methods.

Basecalling
********************

Guppy is used for basecalling in TandemMod. Guppy, like the now-deprecated Albacore and other FAST5-based basecallers, takes files in FAST5 format as input.
In addition to basecalling, Guppy also filters out low-quality reads and clips Oxford Nanopore adapters. More detailed documentation about Guppy can be found in the official Nanopore Technology repository. This step can be time-consuming and may take several hours or even days, depending on the available computational capacity::

    guppy_basecaller -i demo/fast5 -s demo/guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

Arguments

================================= ========== =================== ============================================================================================================
Argument name                     Required   Default             Description
================================= ========== =================== ============================================================================================================
-i=DIR                            Yes        NA                  Input directory containing the FAST5 files generated by the nanopore sequencing platform.
-s=DIR                            Yes        NA                  Output directory for the FAST5 files and the basecalled sequences.
--num_callers=NUM                 No         1                   Number of parallel basecallers to run.
--fast5_out                       Yes        None                Write FAST5 files to the output directory.
--config=STR                      Yes        None                Configuration file; TandemMod uses "rna_r9.4.1_70bps_hac.cfg". Adjust it to match the DRS platform.
--recursive                       Yes        None                Search the input directory recursively, allowing batch processing of files.
================================= ========== =================== ============================================================================================================

Multi-fast5 to single-fast5
***************************

If FAST5 reads are stored in multi-read format, ont_fast5_api is recommended for converting them to single-read FAST5 files. A multi-read FAST5 file is usually about 200-300 MB in size.

Convert multi-read files to single-read files::

    multi_to_single_fast5 -i demo/guppy -s demo/guppy_single -t 40 --recursive

Arguments

================================= ========== =================== ============================================================================================================
Argument name                     Required   Default             Description
================================= ========== =================== ============================================================================================================
-i=DIR                            Yes        NA                  Input directory containing multi-read FAST5 files.
-s=DIR                            Yes        NA                  Output directory for the single-read FAST5 files.
-t=NUM                            No         1                   Number of processes to run.
--recursive                       Yes        NA                  Search the input directory recursively, allowing batch processing of files.
================================= ========== =================== ============================================================================================================

Resquiggling
********************

The resquiggling algorithm is the basis of the Tombo framework. It takes as input a read file (in FAST5 format) containing the raw signal and the associated base calls. The base calls are mapped to a genome or transcriptome reference, and the raw signal is then assigned to the reference sequence based on an expected current level model.

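
To make the idea of assigning raw signal to a reference sequence concrete, the following is a minimal sketch and not Tombo's implementation. It assumes one expected current level per reference base (a deliberate simplification) and uses dynamic programming to split the signal into contiguous, non-empty segments, one per base, minimizing the squared deviation from the expected levels. The function name ``assign_signal`` and the toy values are hypothetical.

```python
def assign_signal(signal, expected_levels):
    """Toy signal-to-sequence assignment (illustrative only, not Tombo's code).

    Splits `signal` into len(expected_levels) contiguous, non-empty segments
    so that the summed squared deviation of each sample from the expected
    level of its base is minimal. Returns the start index of every segment.
    """
    n, k = len(signal), len(expected_levels)
    if k == 0 or n < k:
        raise ValueError("need at least one sample per reference base")

    inf = float("inf")
    # cost[j][i]: best cost of assigning the first i samples to the first j bases
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # starts_here[j][i]: True if sample i-1 is the first sample of base j-1
    starts_here = [[False] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0

    for j in range(1, k + 1):
        level = expected_levels[j - 1]
        for i in range(j, n + 1):
            err = (signal[i - 1] - level) ** 2
            stay = cost[j][i - 1]      # sample extends the current base
            move = cost[j - 1][i - 1]  # sample opens a new base
            if move <= stay:
                cost[j][i] = move + err
                starts_here[j][i] = True
            else:
                cost[j][i] = stay + err

    # Trace back from the last sample to recover the segment boundaries.
    starts = [0] * k
    i, j = n, k
    while j > 0:
        if starts_here[j][i]:
            starts[j - 1] = i - 1
            j -= 1
        i -= 1
    return starts


# Hypothetical toy data: three bases with clearly separated current levels.
toy_signal = [100, 101, 99, 80, 81, 120, 119, 121]
print(assign_signal(toy_signal, [100, 80, 120]))  # [0, 3, 5]
```

In practice the expected levels depend on the sequence context around each base and the signal is normalized first, so this sketch only illustrates the segmentation principle.
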
Tombo is used for resquiggling in TandemMod::

    tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/guppy_single demo/reference_transcripts.fasta --processes 40 --fit-global-scale --include-event-stdev

Arguments

================================= ========== =================== ============================================================================================================
Argument name                     Required   Default             Description
================================= ========== =================== ============================================================================================================
--overwrite                       Yes        NA                  Overwrite the previous corrected group in the FAST5 files.
--basecall-group                  No         Basecall_1D_000     FAST5 group from which the original basecalls are obtained.
--processes                       No         1                   Number of processes to run.
--fit-global-scale                No         NA                  Fit a single global scaling factor for all reads.
--include-event-stdev             No         NA                  Include the standard deviation of each event in the output.
args[0]                           Yes        NA                  FAST5 base directory.
args[1]                           Yes        NA                  Reference transcripts, in FASTA format.
================================= ========== =================== ============================================================================================================

Feature extraction
********************

minimap2 is used to map basecalled sequences to reference transcripts::

    cat demo/guppy/pass/*.fastq >demo/m6A.fastq
    minimap2 -ax map-ont demo/reference_transcripts.fasta demo/m6A.fastq >demo/m6A.sam

Extract signal files from FAST5 files::

    python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/guppy_single --reference demo/reference_transcripts.fasta --sam demo/m6A.sam --output demo/m6A.signal.tsv --clip 10

Arguments

================================= ========== =================== ============================================================================================================
Argument name                     Required   Default             Description
================================= ========== =================== ============================================================================================================
--fast5                           Yes        NA                  FAST5 base directory.
--reference                       Yes        NA                  Reference transcripts, in FASTA format.
-p                                No         1                   Number of processes to run.
--sam                             Yes        NA                  Alignment results, output from minimap2.
--output                          Yes        NA                  Output file containing current signals.
--clip                            Yes        NA                  Number of bases to clip at both ends.
================================= ========== =================== ============================================================================================================

Extract features from signal files::

    python scripts/extract_feature_from_signal.py --signal_file demo/m6A.signal.tsv --clip 10 --output demo/m6A.feature.tsv --motif DRACH

Arguments

================================= ========== =================== ============================================================================================================
Argument name                     Required   Default             Description
================================= ========== =================== ============================================================================================================
--signal_file                     Yes        NA                  File containing current signals.
--reference                       Yes        NA                  Reference transcripts, in FASTA format.
--output                          Yes        NA                  Output file containing features.
--clip                            Yes        NA                  Number of bases to clip at both ends.
================================= ========== =================== ============================================================================================================
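
The per-event features mentioned in the overview (duration, mean, standard deviation) and the normalization step can be illustrated with a short sketch. This is not the code in ``scripts/extract_feature_from_signal.py``; the function names, the choice of median/MAD scaling, and the toy data are assumptions made for illustration only.

```python
from statistics import mean, median, pstdev


def mad_normalize(signal):
    """Scale a signal by its median and median absolute deviation (MAD).

    Median/MAD scaling is one common robust normalization; it is used here
    as an assumed example, not as the method of any particular tool.
    """
    med = median(signal)
    mad = median(abs(x - med) for x in signal)
    if mad == 0:
        raise ValueError("signal has zero MAD and cannot be scaled")
    return [(x - med) / mad for x in signal]


def event_features(signal, starts):
    """Compute simple per-event features from a segmented signal.

    `starts` holds the first sample index of each event, in increasing
    order; each event runs up to the start of the next one.
    """
    bounds = list(starts) + [len(signal)]
    features = []
    for begin, end in zip(bounds, bounds[1:]):
        segment = signal[begin:end]
        if not segment:
            raise ValueError("every event needs at least one sample")
        features.append({
            "mean": mean(segment),
            "median": median(segment),
            "std": pstdev(segment),     # population standard deviation
            "length": len(segment),     # event duration in samples
        })
    return features


# Hypothetical toy signal with three events starting at samples 0, 3 and 5.
toy_signal = [100, 101, 99, 80, 81, 120, 119, 121]
normalized = mad_normalize(toy_signal)
for row in event_features(normalized, [0, 3, 5]):
    print(row)
```

Normalizing before computing the features keeps values comparable across reads whose raw current levels differ in offset or scale.
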