Open main menu

CDOT Wiki β

Changes

BLAStoise

8,133 bytes added, 19:12, 12 February 2017
Assignment 1
----
 
----
 
'''DNA Edit Distance and Sequence Alignment Tool - Analysis by Jonathan D.'''
 
This program is a bioinformatics tool which will calculate the edit distance between two sequences. Furthermore, if the option is selected, then the program will also perform a sequence alignment on the two sequences. The two sequences must be given in FASTA format, but there are already ready to use sequences within the test directory of the project.
 
Original source code can be be found [https://github.com/kbiscanic/bioinformatics here].
 
'''Compilation'''
 
To compile the program, navigate to the project’s src directory and run the following command:
 
$ g++ -std=c++11 -g -pg -O2 BasicEditDistance.cpp main.cpp Parser.cpp Result.cpp Sequence.cpp Solver.cpp SubmatrixCalculator.cpp Writer.cpp
 
'''Running the Program'''
 
There are three options to run the program, as noted in the github readme:
 
b - basic edit distance (Needleman-Wunsch)
 
d - edit distance (Masek-Paterson)
 
a - edit distance and alignment (Masek-Paterson)
 
Initial testing of the program was done by using the "b" option which utilizes the Needleman-Wunsch algorithm to calculate basic edit distance and maximum sequence length of 100 nucleotides. (FASTA files were included in the project under the test directory)
 
INPUT FILE (test-100.fa):
 
>Chromosome_3565586_3565685_0:0:0_0:0:0_0/1
ACCGGTTGCCCGCTACATGCTCCAACCATCCGGCGATGGTTACCTGCTGCCGGACTGGTATAGCGCAGAGCCGCGTCGACACCGCGTATCCGTGCCCCCC
>Chromosome_3568561_3568661_0:0:1_0:0:1_1/1
TGGGGATTGCCAGTCCGTCCGGGGAGGTATTCAGAAAGGTACACCGGTCTGTTGATATTCATGTAACAGGTATTAATGATGAAGAAAGGAATGGCAAACA
 
 
Run the following command to align the two sequences:
 
$ ./a.out a ../test/data/test-100.fa test.maf
 
SCREEN OUTPUT:
 
Submatrix dimension: 1
 
String A size: 100
 
String B size: 100
 
Submatrices in edit table: 100x100
 
Allocating 230 locations.
 
Allocation time: 1e-05s
 
1 / 5 (submatrices: 45 )
 
5 / 5 (submatrices: 225 )
 
Submatrix calculation time: 8.9e-05s
 
Edit path calculation (Masek-Paterson): 0.000149
 
 
TEXT OUTPUT (test.maf)
 
a score=58
 
s Chromosome_3565586_3565685_0:0:0_0:0:0_0/1 0 100 + 107 ACCGG-TTGCCCGCTACATGC-TCCAACCATCCGGCGATGGT-TACCTG-CTGCCG-GACTGGTATAGC--GCAGAGCCGCGTCGACACCGCGTATCCGTGCCCCCC
 
s Chromosome_3568561_3568661_0:0:1_0:0:1_1/1 0 100 + 107 TGGGGATTGCCAG-TCCGTCCGGGGAGGTATTCAG-AAAGGTACACCGGTCTGTTGATATTCATGTAACAGGTATTAATG-ATGAAGAAAG-GAAT--G-GCAAACA
 
 
 
As you can see, the two sequences are now aligned which will show conserved regions of nucleotides between the two.
 
'''Analysis'''
 
In order to perform gprof analysis, sequence length of 10000 was used:
 
$ ./a.out a ../test/data/test-10000.fa test.maf
$ gprof -p -b > FULL_O2.analysis.10000.flt
 
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
36.59 0.45 0.45 302086997 0.00 0.00 std::vector<int, std::allocator<int> >::operator[](unsigned long)
22.77 0.73 0.28 1 0.28 1.05 Solver::fill_edit_matrix()
13.82 0.90 0.17 201301730 0.00 0.00 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::operator[](unsigned long)
12.20 1.05 0.15 25000000 0.00 0.00 SubmatrixCalculator::sumSteps(int)
7.32 1.14 0.09 727566 0.00 0.00 std::vector<int, std::allocator<int> >::size() const
3.25 1.18 0.04 5 0.01 0.01 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::vector()
0.81 1.19 0.01 284812 0.00 0.00 std::_Vector_base<int, std::allocator<int> >::_M_deallocate(int*, unsigned long)
 
Even with a low sequence length of 10000, we can already begin to see that a lot of time is used in the function call Solver::fill_edit_matrix().
 
If we run the program for a larger sequence of 50000, the following flat profile is generated.
 
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
85.67 5.61 5.61 1 5.61 5.61 Solver::fill_edit_matrix()
8.09 6.14 0.53 5267025 0.00 0.00 SubmatrixCalculator::calculateFinalSteps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
3.05 6.34 0.20 63261681 0.00 0.00 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::_M_fill_insert(__gnu_cxx::__normal_iterator<std::vector<int, std::allocator<int> >*, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > >, unsigned long, std::vector<int, std::allocator<int> > const&)
3.05 6.54 0.20 1 0.20 0.93 SubmatrixCalculator::calculate()
 
At 50000, it is even more clear that the Solver::fill_edit_matrix() function takes up the majority of the execution since it's taking up 85% of the elapsed time.
 
Below is a snippet of the code:
 
<nowiki>
void Solver::fill_edit_matrix() {
all_columns.resize(row_num + 1, vector<int>(column_num + 1, 0));
all_rows.resize(row_num + 1, vector<int>(column_num + 1, 0));
top_left_costs.resize(row_num + 1, vector<int>(column_num + 1, 0));
 
int initialVector = 0;
for (int i = 0; i < submatrix_dim; i++){
initialVector = initialVector * 10 + 2; // 222 == "222" == (1, 1, 1)
}
 
// padding string b step vectors
for (int submatrix_j = 1; submatrix_j <= column_num; submatrix_j++) {
if ((submatrix_j * submatrix_dim - 1) >= string_b_real_size) {
vector<int> temp_vec(submatrix_dim, 0);
for (int i = 0;
i < (string_b_real_size - ((submatrix_j - 1) * submatrix_dim));
i++)
temp_vec[i] = 1;
all_rows[0][submatrix_j] = SubmatrixCalculator::stepsToInt(temp_vec);
} else {
all_rows[0][submatrix_j] = initialVector;
}
}
 
// padding string a step vectors
for (int submatrix_i = 1; submatrix_i <= row_num; submatrix_i++) {
if ((submatrix_i * submatrix_dim - 1) >= string_a_real_size) {
vector<int> temp_vec(submatrix_dim, 0);
for (int i = 0;
i < (string_a_real_size - ((submatrix_i - 1) * submatrix_dim));
i++)
temp_vec[i] = 1;
all_columns[submatrix_i][0] =
SubmatrixCalculator::stepsToInt(temp_vec);
} else {
all_columns[submatrix_i][0] = initialVector;
}
}
 
for (int submatrix_i = 1; submatrix_i <= row_num; submatrix_i++) {
if ((submatrix_i - 1) * submatrix_dim > string_a_real_size)
top_left_costs[submatrix_i][1] = string_a_real_size;
else
top_left_costs[submatrix_i][1] = (submatrix_i - 1) * submatrix_dim;
 
for (int submatrix_j = 1; submatrix_j <= column_num; submatrix_j++) {
 
pair<int, int> final_steps = subm_calc->resultIndex[
// offset calculation
str_a_offsets[submatrix_i] + // left string
str_b_offsets[submatrix_j] + // top string
// left steps
subm_calc->stepOffsets[0][all_columns[submatrix_i][submatrix_j - 1]] +
// top steps
subm_calc->stepOffsets[1][all_rows[submatrix_i - 1][submatrix_j]]
];
 
all_columns[submatrix_i][submatrix_j] = final_steps.first;
all_rows[submatrix_i][submatrix_j] = final_steps.second;
 
if (submatrix_j != 1) {
top_left_costs[submatrix_i][submatrix_j] =
top_left_costs[submatrix_i][submatrix_j - 1];
top_left_costs[submatrix_i][submatrix_j] += subm_calc->sumSteps(
all_rows[submatrix_i - 1][submatrix_j - 1]);
}
}
}
}
</nowiki>
=== Assignment 2 ===
=== Assignment 3 ===
26
edits