.. _data_preprocessing:

Data preprocessing
==================================
This section describes how the raw electrical signal data from nanopore direct RNA sequencing (DRS) is transformed into features that can be used to study RNA modifications. The preprocessing procedure typically consists of the following stages.

* Raw FAST5 files: Begin with the raw FAST5 files generated by the nanopore sequencing platform. These files contain the raw electrical signal data captured during sequencing.

* Basecalling: Perform basecalling on the raw FAST5 files using basecalling software such as Guppy or Albacore. Basecalling converts the raw electrical signals into nucleotide sequences.

* Resquiggling: Extract events from the basecalled FAST5 files. Events represent segments of the raw signal associated with individual bases. This step involves aligning the corrected sequences with the raw signal data.

* Feature extraction: Compute features from the events that capture signal characteristics affected by RNA modifications. Features can include event duration, mean, standard deviation and other signal-based properties.

* Preprocessing and normalization: Preprocess the extracted features by applying normalization techniques such as scaling or log transformation. Normalization helps to remove any biases or variations in the features across different samples.

* Dataset creation: Combine the preprocessed features to create a dataset, which will be used for training or prediction purposes.

* Additional processing: Depending on the specific requirements of the analysis, additional processing steps may be performed, such as filtering out low-quality events, removing noise, or applying statistical methods.
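
The feature extraction and normalization stages listed above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; it is not TandemMod's own implementation, and the function names ``event_features``, ``zscore`` and ``log_transform`` are hypothetical.

.. code-block:: python

    import math
    import statistics


    def event_features(signal):
        """Summarize one event (the current samples assigned to one base)."""
        return {
            "mean": statistics.fmean(signal),
            "median": statistics.median(signal),
            "stdev": statistics.pstdev(signal),
            "length": len(signal),  # event duration, in samples
        }


    def zscore(values):
        """Scale a feature column to zero mean and unit variance."""
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        if sigma == 0:
            return [0.0 for _ in values]  # constant column: nothing to scale
        return [(v - mu) / sigma for v in values]


    def log_transform(values):
        """Compress a skewed, non-negative feature such as event duration."""
        return [math.log1p(v) for v in values]

For example, ``zscore([f["mean"] for f in features])`` would normalize the mean current across a list of events, while ``log_transform`` suits the event lengths.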


Basecalling
********************

Guppy is used for basecalling in TandemMod. Like the now-deprecated Albacore and other basecallers, Guppy takes files in FAST5 format as input. In addition to basecalling, Guppy filters out low-quality reads and clips Oxford Nanopore adapters. More detailed documentation about Guppy can be found in the official `Nanopore Technology repository <https://github.com/nanoporetech/pyguppyclient>`_. Basecalling can be time-consuming and may take several hours or even days, depending on the available computing capacity::

    guppy_basecaller -i demo/fast5 -s demo/guppy --num_callers 40 --recursive --fast5_out --config rna_r9.4.1_70bps_hac.cfg

Arguments

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
-i=DIR                              Yes         NA                    Input directory, containing FAST5 files generated by the nanopore sequencing platform
-s=DIR                              Yes         NA                    Output directory for the FAST5 files and basecalled sequences.
--num_callers=NUM                   No          1                     Number of processes to run.
--fast5_out                         Yes         None                  Output FAST5 files to the directory.
--config=STR                        Yes         None                  Configuration file; TandemMod uses "rna_r9.4.1_70bps_hac.cfg". Adjust it to match the DRS platform.
--recursive                         Yes         None                  Search the input directory recursively for FAST5 files.
=================================   ==========  ===================  ============================================================================================================
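
To illustrate what filtering of low-quality reads means, the sketch below computes the mean quality score of a read from its FASTQ quality string (Phred+33 encoding) and compares it with a threshold. It is an illustration only and does not reproduce Guppy's internal logic; the function names and the threshold of 7 are assumptions made for this example.

.. code-block:: python

    import math


    def mean_qscore(quality_string, offset=33):
        """Mean Phred quality of a read, averaged over error probabilities."""
        if not quality_string:
            raise ValueError("empty quality string")
        # Each character encodes Q = ord(ch) - offset, with P(error) = 10^(-Q/10).
        probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
        return -10 * math.log10(sum(probs) / len(probs))


    def passes_filter(quality_string, min_qscore=7.0):
        """Return True if the read would be kept (threshold is an assumption)."""
        return mean_qscore(quality_string) >= min_qscore

Averaging the error probabilities rather than the Q values themselves keeps a few very poor bases from being hidden by many good ones.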

Multi-fast5 to single-fast5
***************************
If the reads are stored in multi-read FAST5 format, ont_fast5_api is recommended for converting them to single-read FAST5 files. A multi-read FAST5 file is usually about 200-300 MB in size. Convert multi-read files to single-read files::

    multi_to_single_fast5 -i demo/guppy -s demo/guppy_single -t 40 --recursive 

Arguments

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
-i=DIR                              Yes         NA                    Input directory, containing multi-reads FAST5 files.
-s=DIR                              Yes         NA                    Output directory, containing single-read FAST5 files.
-t=NUM                              No          1                     Number of processes to run.
--recursive                         Yes         NA                    Search the input directory recursively for FAST5 files.
=================================   ==========  ===================  ============================================================================================================
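
Before converting, it can help to check which layout a file uses. The sketch below classifies a FAST5 file from the names of its top-level HDF5 groups. It assumes the common layout in which multi-read files hold one ``read_<id>`` group per read and single-read files hold ``Raw`` and ``UniqueGlobalKey`` groups; the function name ``fast5_layout`` is hypothetical and is not part of ont_fast5_api.

.. code-block:: python

    def fast5_layout(top_level_keys):
        """Guess the FAST5 layout from the top-level HDF5 group names."""
        keys = list(top_level_keys)
        if any(key.startswith("read_") for key in keys):
            return "multi-read"
        if "Raw" in keys or "UniqueGlobalKey" in keys:
            return "single-read"
        return "unknown"


    # Possible usage with h5py (not imported here, to keep the sketch
    # dependency-free):
    #     with h5py.File(path, "r") as handle:
    #         layout = fast5_layout(handle.keys())

Only files reported as ``multi-read`` need to go through ``multi_to_single_fast5``.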

Resquiggling
********************
The resquiggling algorithm is the basis for the Tombo framework. It takes as input a read file (in FAST5 format) containing raw signal and associated base calls. The base calls are mapped to a genome or transcriptome reference and then the raw signal is assigned to the reference sequence based on an expected current level model. Tombo is used for resquiggling in TandemMod::

    tombo resquiggle --overwrite --basecall-group Basecall_1D_001 demo/guppy_single  demo/reference_transcripts.fasta --processes 40 --fit-global-scale --include-event-stdev

Arguments

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--overwrite                         Yes         NA                    Overwrite previous corrected group in FAST5 files.
--basecall-group                    No          Basecall_1D_000       FAST5 group from which to obtain the original basecalls.
--processes                         No          1                     Number of processes to run.
--fit-global-scale                  No          NA                    Fit one scaling parameter shared by all reads rather than one per read.
--include-event-stdev               No          NA                    Include the standard deviation of each event in the output.
args[0]                             Yes         NA                    Base directory containing the FAST5 files.
args[1]                             Yes         NA                    Reference transcripts, in FASTA format.
=================================   ==========  ===================  ============================================================================================================
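
The idea behind signal scaling and per-base events can be sketched as follows. The example normalizes a raw signal as ``(raw - shift) / scale`` using the median as shift and the median absolute deviation as scale, then summarizes the samples assigned to each base. It is a simplified illustration under these assumptions, not Tombo's code; the function names and the ``starts`` argument (the index at which each base's segment begins) are hypothetical.

.. code-block:: python

    import statistics


    def median_mad_normalize(raw):
        """Normalize a signal with median shift and MAD scale."""
        shift = statistics.median(raw)
        scale = statistics.median([abs(x - shift) for x in raw])
        if scale == 0:
            raise ValueError("signal has zero median absolute deviation")
        return [(x - shift) / scale for x in raw]


    def per_base_events(signal, starts):
        """Return (mean, stdev, length) for each base's signal segment."""
        bounds = list(starts) + [len(signal)]  # last segment runs to the end
        events = []
        for begin, end in zip(bounds, bounds[1:]):
            segment = signal[begin:end]
            events.append(
                (statistics.fmean(segment), statistics.pstdev(segment), end - begin)
            )
        return events

The standard deviation in each tuple corresponds to the kind of value that ``--include-event-stdev`` asks Tombo to store.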

Feature extraction
********************
minimap2 is used to map basecalled sequences to reference transcripts:: 
    
    cat demo/guppy/pass/*.fastq >demo/m6A.fastq
    minimap2 -ax map-ont demo/reference_transcripts.fasta demo/m6A.fastq >demo/m6A.sam
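
The resulting SAM file links each read to a reference transcript. As a rough sketch of how such a file can be read, the function below parses one SAM line using the standard column order and FLAG bits (``0x4`` unmapped, ``0x10`` reverse strand, ``0x100`` secondary, ``0x800`` supplementary). The function name ``parse_sam_line`` is hypothetical and this is not the parser used by the TandemMod scripts.

.. code-block:: python

    def parse_sam_line(line, primary_only=True):
        """Return (read_id, reference, position, is_reverse), or None if skipped."""
        if line.startswith("@"):
            return None  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            return None  # read is unmapped
        if primary_only and flag & (0x100 | 0x800):
            return None  # secondary or supplementary alignment
        # POS (column 4) is the 1-based leftmost mapping position.
        return fields[0], fields[2], int(fields[3]), bool(flag & 0x10)

Applied line by line to ``demo/m6A.sam``, this yields one record per primary alignment.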

Extract signal files from FAST5 files::
    
    python scripts/extract_signal_from_fast5.py -p 40 --fast5 demo/guppy_single --reference demo/reference_transcripts.fasta --sam demo/m6A.sam --output demo/m6A.signal.tsv --clip 10

Arguments

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--fast5                             Yes         NA                    Fast5 basedir.
--reference                         Yes         NA                    Reference transcripts, in fasta format.
-p                                  No          1                     Number of processes to run.
--sam                               Yes         NA                    Alignment results, output from minimap2.
--output                            Yes         NA                    Output file containing current signals.
--clip                              Yes         NA                    Number of bases to clip at both ends.
=================================   ==========  ===================  ============================================================================================================

Extract features from signal files::

    python scripts/extract_feature_from_signal.py  --signal_file demo/m6A.signal.tsv --clip 10 --output demo/m6A.feature.tsv --motif DRACH

Arguments

=================================   ==========  ===================  ============================================================================================================
Argument name                       Required    Default              Description
=================================   ==========  ===================  ============================================================================================================
--signal_file                       Yes         NA                    File containing current signals.
--reference                         Yes         NA                    Reference transcripts, in FASTA format.
--output                            Yes         NA                    Output file containing features.
--clip                              Yes         NA                    Number of bases to clip at both ends.
=================================   ==========  ===================  ============================================================================================================
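
The ``--motif DRACH`` and ``--clip 10`` options in the command above can be pictured with a short sketch. It expands an IUPAC motif (D = A/G/T, R = A/G, H = A/C/T) into a regular expression and reports the motif positions that lie outside the clipped ends of a sequence. How ``extract_feature_from_signal.py`` applies these options internally is not shown here, so the function ``motif_sites`` and its exact clipping rule are assumptions made for illustration.

.. code-block:: python

    import re

    # IUPAC nucleotide codes expressed as regular-expression fragments.
    IUPAC = {
        "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
        "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
        "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
        "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
    }


    def motif_sites(sequence, motif="DRACH", clip=10):
        """Return 0-based start positions of the motif outside the clipped ends."""
        body = "".join(IUPAC[base] for base in motif.upper())
        pattern = re.compile("(?=(" + body + "))")  # lookahead finds overlaps
        seq = sequence.upper().replace("U", "T")
        sites = []
        for match in pattern.finditer(seq):
            start = match.start()
            if start >= clip and start + len(motif) <= len(seq) - clip:
                sites.append(start)
        return sites

For instance, ``motif_sites("GGACT", clip=0)`` finds the single DRACH site at position 0, whereas the default ``clip=10`` discards it because it lies too close to the ends.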