SequenceForge-Lite Usage Guide

Code & full README

Installation

git clone https://github.com/iliapopov17/SequenceForge-Lite.git && cd SequenceForge-Lite
pip install -r requirements.txt

Import modules

Input

from bio_files_processor import *
from sequence_forge import *

Convert multiline fasta to online

Input

! head -6 demo_data/example_multiline_fasta.fasta

Output

>GTD323452 5S_rRNA NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCTGGTTAGTACCATGGTGGGGGACCACATGGGAATCCCT
GGTGCTGTG
>GTD678345 16S_rRNA NODE_80_length_720_cov_1.094737:313-719(+)
TTGGCTTCTTAGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACT
GAGCCCTTGGGAGTGGTCCATTTGAGCCGGCAACGGCACGTTTGGACTGCAAACTTGGGCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGT

Input

input_fasta_file = "demo_data/example_multiline_fasta.fasta"
output_fasta_file = "demo_data/example_oneline_fasta.fasta"
convert_multiline_fasta_to_oneline(input_fasta_file, output_fasta_file)

Output

Converted multiline FASTA to one-line FASTA. Saved as demo_data/example_oneline_fasta.fasta

Input

! head -4 demo_data/example_oneline_fasta.fasta

Output

>GTD323452 5S_rRNA NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCTGGTTAGTACCATGGTGGGGGACCACATGGGAATCCCTGGTGCTGTG
>GTD678345 16S_rRNA NODE_80_length_720_cov_1.094737:313-719(+)
TTGGCTTCTTAGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGAGCCCTTGGGAGTGGTCCATTTGAGCCGGCAACGGCACGTTTGGACTGCAAACTTGGGCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGT

Change fasta start position

Input

input_fasta_file = "demo_data/example_oneline_fasta.fasta"
shift_amount = 10
change_fasta_start_pos(input_fasta_file, shift_amount)

Output

Shifted FASTA sequence saved to demo_data/example_oneline_fasta_shifted.fasta

Input

! head -4 demo_data/example_oneline_fasta_shifted.fasta

Output

>GTD323452 5S_rRNA NODE_272_length_223_cov_0.720238:18-129(+)
GACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCTGGTTAGTACCATGGTGGGGGACCACATGGGAATCCCTGGTGCTGTGACGGCCATAG
>GTD678345 16S_rRNA NODE_80_length_720_cov_1.094737:313-719(+)
AGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGAGCCCTTGGGAGTGGTCCATTTGAGCCGGCAACGGCACGTTTGGACTGCAAACTTGGGCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTGGCTTCTT

Parse blast output

Input

input_file = "demo_data/example_blast_results.txt"
parse_blast_output(input_file)

Output

Best BLAST results saved to demo_data/example_blast_results_parsed.txt

Input

! head -4 demo_data/example_blast_results_parsed.txt

Output

DNA methylase [Enterobacteriaceae]
DUF1380 domain-containing protein [Escherichia coli]
DUF1380 family protein [Enterobacteriaceae]
DUF4158 domain-containing protein [Klebsiella pneumoniae]

FASTQ filter

Input

! head -4 demo_data/example_fastq.fastq

Output

@SRX079804:1:SRR292678:1:1101:24563:24563 1:N:0:1 BH:failed
ATTAGCGAGGAGGAGTGCTGAGAAGATGTCGCCTACGCCGTTGAAATTCCCTTCAATCAGGGGGTACTGGAGGATACGAGTTTGTGTG
+
BFFFFFFFB@B@A<@D>BDDACDDDEBEDEFFFBFFFEFFDFFF=CC@DDFD8FFFFFFF8/+.2,@7<<:?B/:<><-><@.A*C>D

Input

input_file = "demo_data/example_fastq.fastq"
fastq_filter(input_file, gc_bound=(40,60), length_bound=(0, 50), quality_threshold=30)

Output

Filtered FastQ. Saved as demo_data/example_fastq_filtered.fastq

Input

! head -4 demo_data/example_fastq_filtered.fastq

Output

@SRX079804:1:SRR292678:1:1101:654270:654270 1:N:0:1 BH:failed
ATCTTTCTCTCTCTGCTGCATTCTCCGCTTCAGCTCCTCAATTTCAATCA
+
GGGGGGEDGGFEGDGCGGBCEFGGGGFBFGGEGGGGGEGFCFEEGGFEDB

Example usage of DNASequence class

Input

dna_sequence = DNASequence("ACCGGCTAATCGGCT")
motif_to_find = "CGG"
print(type(dna_sequence))
print("DNA Sequence:", dna_sequence)
print("Length:", len(dna_sequence))
print("GC Content:", dna_sequence.gc_content())
print("Complement:", dna_sequence.complement())
print("Transcribed RNA Sequence:", dna_sequence.transcribe())
print(f"Indexes of {motif_to_find} motif occurrences:", dna_sequence.find_motif(motif_to_find))

Output

<class 'sequence_forge.DNASequence'>
DNA Sequence: ACCGGCTAATCGGCT
Length: 15
GC Content: 0.6
Complement: TGGCCGATTAGCCGA
Transcribed RNA Sequence: UGGCCGAUUAGCCGA
Indexes of CGG motif occurrences: [2, 10]

Example usage of RNASequence class

Input

rna_sequence = dna_sequence.transcribe()
motif_to_find = "GCC"
print(type(rna_sequence))
print("RNA Sequence:", rna_sequence)
print("Length:", len(rna_sequence))
print("GC Content:", rna_sequence.gc_content())
print("Codons:", rna_sequence.codons())
print(f"Indexes of {motif_to_find} motif occurrences:", rna_sequence.find_motif(motif_to_find))
print("Tranlated to Amino Acid Sequence:", rna_sequence.translate())

Output

<class 'sequence_forge.RNASequence'>
RNA Sequence: UGGCCGAUUAGCCGA
Length: 15
GC Content: 0.6
Codons: ['UGG', 'CCG', 'AUU', 'AGC', 'CGA']
Indexes of GCC motif occurrences: [2, 10]
Tranlated to Amino Acid Sequence: WPISR

Example usage of AminoAcidSequence class

Input

amino_acid_sequence = rna_sequence.translate()
print(type(amino_acid_sequence))
print("Amino Acid Sequence:", amino_acid_sequence)
print("Length:", len(amino_acid_sequence))
print("Molecular Weight:", amino_acid_sequence.get_molecular_weight())

Output

<class 'sequence_forge.AminoAcidSequence'>
Amino Acid Sequence: WPISR
Length: 5
Molecular Weight: 729.8299999999999

Custom RandomForestClassifier

Import needed library and module

Input

from custom_random_forest import RandomForestClassifierCustom
from sklearn.datasets import make_classification

Create dataset

Input

X, y = make_classification(n_samples=100000)
rf = RandomForestClassifierCustom(n_estimators=10, max_depth=10, max_features=5, random_state=42)

Fit using 2 n_jobs

Input

%%time
rf.fit(X, y, n_jobs=2)

Output

CPU times: total: 109 ms
Wall time: 7.95 s

Predict using 2 n_jobs

Input

%%time
predictions_2 = rf.predict_proba(X, n_jobs=2)

Output

CPU times: total: 78.1 ms
Wall time: 2.3 s

Fit using 1 n_jobs

Input

%%time
rf.fit(X, y, n_jobs=1)

Output

CPU times: total: 9.56 s
Wall time: 9.57 s

Predict using 1 n_jobs

Input

%%time
predictions_2 = rf.predict_proba(X, n_jobs=1)

Output

CPU times: total: 172 ms
Wall time: 171 ms

Time comparison (CPU times)

n_jobsfitpredict
19.44 s172 ms
2109 ms78.1 ms

Finally, check that the resulting predictions match

Input

predictions_1 == predictions_2

Output

array([[ True,  True],
       [ True,  True],
       [ True,  True],
       ...,
       [ True,  True],
       [ True,  True],
       [ True,  True]])