N/A
Team Members
- Woosle Park, Data Compression
- Akshat Patel,
- Jordan Pitters,
Progress
Assignment 1
Application 1 - Data Compression
Description: https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-compression-technique/
The algorithm used for data compression here is the Lempel–Ziv–Welch (LZW) algorithm. It is a lossless algorithm, meaning no data is lost when a file is compressed. LZW is generally used for GIF or PDF files, but for this example I used a .txt file because it was easier to manipulate and scale in size. The file used for compression is a .txt version of the Holy Bible (https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt) because its contents are large enough to show the compression time and percentage. The algorithm reads a file's sequence of symbols, groups them into strings, and assigns each new string a 12-bit code that is stored in a table. When decompressing, the same table is rebuilt from the stream of codes and the steps of compression are carried out in reverse.
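To make the table-building steps concrete, here is a minimal in-memory sketch of the same idea. It is not the program listed under Source Code: the names lzw_encode, lzw_decode, and kMaxCodes are made up for illustration, valid input is assumed, and codes are kept as integers in a vector instead of being written out as 12-bit strings.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical cap: 12-bit codes allow at most 4096 table entries (0..4095).
constexpr std::uint32_t kMaxCodes = 4096;

// Encode a string into a sequence of LZW codes.
std::vector<std::uint32_t> lzw_encode(const std::string& input) {
    std::unordered_map<std::string, std::uint32_t> dict;
    for (std::uint32_t i = 0; i < 256; ++i)
        dict[std::string(1, static_cast<char>(i))] = i;  // single-byte entries

    std::vector<std::uint32_t> codes;
    std::uint32_t next_code = 256;
    std::string current;
    for (char c : input) {
        std::string candidate = current + c;
        if (dict.count(candidate)) {
            current = candidate;             // keep extending the match
        } else {
            codes.push_back(dict[current]);  // emit the longest known string
            if (next_code < kMaxCodes) dict[candidate] = next_code++;
            current = std::string(1, c);
        }
    }
    if (!current.empty()) codes.push_back(dict[current]);
    return codes;
}

// Decode by rebuilding the same table from the code sequence.
std::string lzw_decode(const std::vector<std::uint32_t>& codes) {
    std::vector<std::string> dict;  // index == code
    for (int i = 0; i < 256; ++i)
        dict.push_back(std::string(1, static_cast<char>(i)));

    std::string out, previous;
    for (std::uint32_t code : codes) {
        std::string entry;
        if (code < dict.size())
            entry = dict[code];
        else
            entry = previous + previous[0];  // code not defined yet
        out += entry;
        if (!previous.empty() && dict.size() < kMaxCodes)
            dict.push_back(previous + entry[0]);
        previous = entry;
    }
    return out;
}
```

For the input "ABABABA" the encoder returns the codes 65, 66, 256, 258. The last code arrives before the decoder has defined it, which is the case the `previous + previous[0]` branch handles, and decoding gives back "ABABABA".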
Source Code:
// Compile with gcc 4.7.2 or later, using the following command line:
//
//    g++ -std=c++0x lzw.c -o lzw
//
// LZW algorithm implemented using fixed 12 bit codes.

#include <iostream>
#include <sstream>
#include <fstream>
#include <bitset>
#include <string>
#include <unordered_map>

#define MAX_DEF 4096

using namespace std;

// Convert a code to its 12-character binary representation
string convert_int_to_bin(int number)
{
    string result = bitset<12>(number).to_string();
    return result;
}

void compress(string input, int size, string filename)
{
    unordered_map<string, int> compress_dictionary(MAX_DEF);
    // Dictionary initializing with ASCII
    for (unsigned int i = 0; i < 256; i++) {
        compress_dictionary[string(1, i)] = i;
    }
    string current_string;
    unsigned int code;
    unsigned int next_code = 256;
    // Output file for compressed data
    ofstream outputFile;
    outputFile.open(filename + ".lzw");

    for (char& c : input) {
        current_string = current_string + c;
        if (compress_dictionary.find(current_string) == compress_dictionary.end()) {
            // Codes must fit in 12 bits, so the last usable code is MAX_DEF - 1
            if (next_code < MAX_DEF)
                compress_dictionary.insert(make_pair(current_string, next_code++));
            current_string.erase(current_string.size() - 1);
            outputFile << convert_int_to_bin(compress_dictionary[current_string]);
            current_string = c;
        }
    }
    if (current_string.size())
        outputFile << convert_int_to_bin(compress_dictionary[current_string]);
    outputFile.close();
}

void decompress(string input, int size, string filename)
{
    unordered_map<unsigned int, string> dictionary(MAX_DEF);
    // Dictionary initializing with ASCII
    for (unsigned int i = 0; i < 256; i++) {
        dictionary[i] = string(1, i);
    }
    string previous_string;
    unsigned int code;
    unsigned int next_code = 256;
    // Output file for decompressed data
    ofstream outputFile;
    outputFile.open(filename + "_uncompressed.txt");

    int i = 0;
    while (i < size) {
        // Extracting 12 bits and converting binary to decimal
        string subinput = input.substr(i, 12);
        bitset<12> binary(subinput);
        code = binary.to_ullong();
        i += 12;
        // A code that is not in the table yet must be previous + its first character
        if (dictionary.find(code) == dictionary.end())
            dictionary.insert(make_pair(code, (previous_string + previous_string.substr(0, 1))));
        outputFile << dictionary[code];
        // Stop growing the table once all 12-bit codes are in use, as compress() does
        if (previous_string.size() && next_code < MAX_DEF)
            dictionary.insert(make_pair(next_code++, previous_string + dictionary[code][0]));
        previous_string = dictionary[code];
    }
    outputFile.close();
}

string convert_char_to_string(const char *pCh, int arraySize)
{
    string str;
    if (pCh[arraySize - 1] == '\0')
        str.append(pCh);
    else
        for (int i = 0; i < arraySize; i++)
            str.append(1, pCh[i]);
    return str;
}

static void show_usage()
{
    cerr << "Usage: \n"
         << "Specify the file that needs to be compressed or decompressed\n"
         << "lzw -c input #compress file input\n"
         << "lzw -d input #decompress file input\n"
         << "Compressed data will be found in a file with the same name but with a .lzw extension\n"
         << "Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
         << endl;
}

int main(int argc, char* argv[])
{
    streampos size;
    char* memblock;

    // Both the mode flag and the input file name are required
    if (argc < 3) {
        show_usage();
        return (1);
    }
    ifstream file(argv[2], ios::in | ios::binary | ios::ate);
    if (file.is_open()) {
        size = file.tellg();
        memblock = new char[size];
        file.seekg(0, ios::beg);
        file.read(memblock, size);
        file.close();
        string input = convert_char_to_string(memblock, size);
        delete[] memblock;  // the buffer is no longer needed once copied
        if (string("-c") == argv[1])
            compress(input, size, argv[2]);
        else if (string("-d") == argv[1])
            decompress(input, size, argv[2]);
        else
            show_usage();
    }
    else {
        cout << "Unable to open file." << endl;
        show_usage();
    }
    return 0;
}
Flat Profiles:
bible.txt - 4,351,186 bytes
Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 50.04      0.18     0.18   5758089    31.29    31.29  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
 50.04      0.36     0.18                               compress(std::string, int, std::string)
  0.00      0.36     0.00   1402806     0.00    31.29  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
  0.00      0.36     0.00      4098     0.00     0.00  show_usage()
  0.00      0.36     0.00      4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
  0.00      0.36     0.00         1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
  0.00      0.36     0.00         1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
bible2.txt - 8,702,373 bytes
Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 48.39      0.44     0.44  11511109    38.26    38.26  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
 46.19      0.86     0.42                               compress(std::string, int, std::string)
  5.50      0.91     0.05   2804639    17.84    56.10  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
  0.00      0.91     0.00      4098     0.00     0.00  show_usage()
  0.00      0.91     0.00      4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
  0.00      0.91     0.00         1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
  0.00      0.91     0.00         1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
bible3.txt - 13,053,560 bytes
Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 47.58      0.58     0.58  17264129    33.63    33.63  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
 42.66      1.10     0.52                               compress(std::string, int, std::string)
  7.38      1.19     0.09   4206472    21.41    55.04  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
  1.64      1.21     0.02                               convert_char_to_string(char const*, int)
  0.82      1.22     0.01                               std::pair<std::__detail::_Node_iterator<std::pair<unsigned int const, std::string>, false, false>, bool> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<std::pair<unsigned int, std::string> >(std::integral_constant<bool, true>, std::pair<unsigned int, std::string>&&)
  0.00      1.22     0.00      4098     0.00     0.00  show_usage()
  0.00      1.22     0.00      4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
  0.00      1.22     0.00         1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
  0.00      1.22     0.00         1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
bible4.txt - 17,039,360 bytes
Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 60.43      0.96     0.96  22530032    42.65    42.65  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
 32.73      1.48     0.52                               compress(std::string, int, std::string)
  6.29      1.58     0.10   5486575    18.24    60.89  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
  0.63      1.59     0.01                               convert_char_to_string(char const*, int)
  0.00      1.59     0.00      4098     0.00     0.00  show_usage()
  0.00      1.59     0.00      4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
  0.00      1.59     0.00         1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
  0.00      1.59     0.00         1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
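One way to read the four profiles together is to divide the total sampled time (the final value in the cumulative seconds column) by the file size. The sketch below does only that arithmetic. The struct and function names are hypothetical, and the numbers are copied from the profiles above.

```cpp
#include <vector>

// One profiled run: input size in bytes and total sampled time in seconds.
struct ProfileRun {
    double bytes;
    double seconds;
};

// Figures taken from the flat profiles for bible.txt through bible4.txt.
std::vector<ProfileRun> profile_runs() {
    return {
        {4351186.0, 0.36},
        {8702373.0, 0.91},
        {13053560.0, 1.22},
        {17039360.0, 1.59},
    };
}

// Sampled time per input byte, in nanoseconds.
double ns_per_byte(const ProfileRun& run) {
    return run.seconds * 1e9 / run.bytes;
}
```

With these figures the result stays between roughly 83 and 105 ns per byte for all four files, which is what a run time that grows about linearly with input size would look like. gprof samples every 0.01 seconds, so the smaller differences may well be noise.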
Conclusion:
Most of the time is spent in the compress function and in the hash table, which takes up so much of the time because it is constantly being read and modified. If the hash table and the compress function could be parallelized, about 90% of the run time would be affected. The application should run in O(n) time, so run time increases linearly with file size. However, this application is not a good candidate for parallelization because of the dictionary hash table. The table needs to be globally accessible and is constantly read and modified, which could cause problems if multiple threads were running, especially since the reads and modifications must happen in sequence for the compression to be efficient.