Sunday, January 29, 2017

Creating Loops in C++

There are several ways to build loops in C++. First, let's use the traditional for loop to count the odd-valued elements of an array of integers:

int nums[]{4, 8, 15, 16, 23, 42};
int odd = 0;
auto nums_size = sizeof(nums)/sizeof(int); // number of elements in the array

for (int i = 0; i < nums_size; ++i) {
    odd += nums[i] & 1; // the least significant bit is 1 for odd numbers
}
cout << "# of odd nums: " << odd << endl;

We can write the same loop using the range-based for ("for-each") notation introduced in C++11:

int nums[]{4, 8, 15, 16, 23, 42};
int odd=0;

for (auto& num: nums){
    odd += num & 1;
}   
cout << "# of odd nums: " << odd << endl; 

Another approach is to build the loop with the for_each function from the STL:

int nums[]{4, 8, 15, 16, 23, 42};
int odd=0;
auto nums_size = sizeof(nums)/sizeof(int);

for_each(nums, nums + nums_size, [&odd](int& num) {
    odd += num & 1;
});
cout << "# of odd nums: " << odd << endl; 

We can achieve the same result with the count_if function from the STL:

int nums[]{4, 8, 15, 16, 23, 42};
int odd=0;
auto nums_size = sizeof(nums)/sizeof(int);

odd = count_if(nums, nums + nums_size, [](int& num) {
    return num & 1;
});
cout << "# of odd nums: " << odd << endl; 

One of the gains of solving the problem with count_if is that the solution can be parallelized easily:

int nums[]{4, 8, 15, 16, 23, 42};
int odd=0;
auto nums_size = sizeof(nums)/sizeof(int);

odd = __gnu_parallel::count_if(nums, nums + nums_size, [](int &value) {
    return value & 1;
});
cout << "# of odd nums: " << odd << endl; 

Now let's gather the code fragments given above into a single application:

#include <iostream>
#include <parallel/algorithm>

using namespace std;

int main() {
    int nums[]{4, 8, 15, 16, 23, 42};
    int odd = 0;
    auto nums_size = sizeof(nums)/sizeof(int);

    for (int i = 0; i < nums_size; ++i) {
        odd += nums[i] & 1;
    }
    cout << "# of odd nums: " << odd << endl;
    odd = 0;

    for (auto& num : nums) {
        odd += num & 1;
    }
    cout << "# of odd nums: " << odd << endl;

    odd = 0;
    for_each(nums, nums + nums_size, [&odd](int& num) {
        odd += num & 1;
    });
    cout << "# of odd nums: " << odd << endl;

    odd = count_if(nums, nums + nums_size, [](int& num) {
        return num & 1;
    });
    cout << "# of odd nums: " << odd << endl;

    odd = __gnu_parallel::count_if(nums, nums + nums_size, [](int &value) {
        return value & 1;
    });
    cout << "# of odd nums: " << odd << endl;
    cout << endl;

    return 0;
}

We can compile and run the code above with the GNU C++ compiler as shown below:

[student@host1]$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
[student@host1]$ g++ -o run -std=c++11 for_loop.cpp -lpthread -fopenmp
[student@host1]$ ./run
# of odd nums: 2
# of odd nums: 2
# of odd nums: 2
# of odd nums: 2
# of odd nums: 2

Of course, for a six-element array there is no practical point in parallelizing the odd-counting problem. But as the number of elements grows, parallelizing the work across as many threads as there are cores will provide a speedup. Let's examine how large this speedup is. First, let's solve the problem sequentially:

#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <chrono>

using namespace std;

const int size = 100000000;

vector<int> numbers;
random_device rd; 
mt19937_64 rng(rd()); // Mersenne Twister 19937 generator (64-bit)

typedef vector<int>::iterator vect_iter; // alias

void fill(vector<int> &data, int length) {
    for (int i = 0; i < length; ++i) {
        data.push_back(rng());
    }   
}

int main() {
    fill(numbers, size);

    auto strt = chrono::steady_clock::now();
    auto finalResult = count_if(numbers.begin(), numbers.end(), [](int &value) {
        return value & 1;
    }); 
    auto stp = chrono::steady_clock::now();
    cout << "Final result (sequential): "
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;
    return 0;
}

Let's run the sequential code above and look at its performance:

[student@host1]$ g++ -o run -std=c++11 parallel_for_loop1.cpp
[student@host1]$ ./run 
Final result (sequential): 50007237 @ 1782937391
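
The number printed after "@" is the raw tick count of the steady_clock duration (nanoseconds with libstdc++, so about 1.78 seconds here). If a more readable unit is preferred, the difference can be converted with chrono::duration_cast; a minimal sketch that could replace the output lines in the program above:

// print the elapsed time in milliseconds instead of raw ticks
auto elapsed_ms = chrono::duration_cast<chrono::milliseconds>(stp - strt).count();
cout << "Final result (sequential): " << finalResult
     << " @ " << elapsed_ms << " ms" << endl;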

Now let's try to solve the problem using the thread library introduced in C++11:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <random>
#include <chrono>
#include <algorithm>
#include <numeric>   // accumulate

using namespace std;

const int cores = thread::hardware_concurrency();
const int size = 100000000;
const int size_per_core = size / cores;

vector<int> numbers;
random_device rd;
mt19937_64 rng(rd()); // Mersenne Twister 19937 generator (64-bit)
vector<thread> threads;
vector<int> partialSolution;
mutex m;

typedef vector<int>::iterator vect_iter; // alias

void fill(vector<int> &data, int length) {
    for (int i = 0; i < length; ++i) {
        data.push_back(rng());
    }
}

// Counts the odd values in [begin, end) and appends the partial count
// to result under the protection of the mutex.
void count_odd(vect_iter begin, vect_iter end, vector<int> &result) {
    int counter = count_if(begin, end, [](int &value) {
        return value & 1;
    });
    m.lock();
    result.push_back(counter);
    m.unlock();
}

int main() {
    fill(numbers, size);
    std::chrono::steady_clock::time_point strt, stp;

    strt = chrono::steady_clock::now();
    auto begin = numbers.begin();
    for (int i = 0; i < cores; ++i, begin += size_per_core) {
        auto end = begin + size_per_core;
        threads.push_back(thread(count_odd, begin, end, ref(partialSolution)));
    }
    for (auto &t : threads) t.join();
    auto finalResult = accumulate(partialSolution.begin(), partialSolution.end(), int());
    stp = chrono::steady_clock::now();
    cout << "Final result (parallel): "
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;

    return 0;
}

Let's compile and run the solution we built using the thread library:

[student@host1]$ g++ -o run -std=c++11 parallel_for_loop2.cpp -lpthread
[student@localhost parallel-programming-example]$ ./run
Final result (parallel): 49995511 @ 1079533408
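
One detail worth noting: the splitting loop in main assumes that size is evenly divisible by cores; otherwise the last size % cores elements are never visited. A minimal sketch of one way to handle this, reusing the same globals and the count_odd function, is to hand the leftover elements to the final thread:

auto begin = numbers.begin();
for (int i = 0; i < cores; ++i) {
    // the last thread also takes the remaining size % cores elements
    auto end = (i == cores - 1) ? numbers.end() : begin + size_per_core;
    threads.push_back(thread(count_odd, begin, end, ref(partialSolution)));
    begin = end;
}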

Finally, let's solve the problem using the parallel algorithms available in the STL:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <random>
#include <chrono>
#include <parallel/algorithm>

using namespace std;

const int cores = thread::hardware_concurrency();
const int size = 100000000;
const int size_per_core = size / cores;

vector<int> numbers;
random_device rd;
mt19937_64 rng(rd()); // Mersenne Twister 19937 generator (64-bit)
vector<int> partialSolution;

typedef vector<int>::iterator vect_iter; // alias

void fill(vector<int> &data, int length) {
    for (int i = 0; i < length; ++i) {
        data.push_back(rng());
    }
}

int main() {
    fill(numbers, size);
    std::chrono::steady_clock::time_point strt, stp;

    // parallel solution
    strt = chrono::steady_clock::now();
    auto finalResult = __gnu_parallel::count_if(numbers.begin(), numbers.end(), [](int &value) {
        return value & 1;
    });
    stp = chrono::steady_clock::now();
    cout << "Final result (parallel [STL] algorithm): "
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;
    return 0;
}

Let's compile and run the solution that uses the parallel count_if function from the STL:


[student@host1]$ g++ -o run -std=c++11 parallel_for_loop3.cpp -lpthread -fopenmp
[student@host1]$ ./run
Final result (parallel [STL] algorithm): 49994371 @ 1391989217
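
Since the libstdc++ parallel mode is built on top of OpenMP (hence the -fopenmp flag in the compile command), the number of threads it uses can be limited with the OMP_NUM_THREADS environment variable, for example OMP_NUM_THREADS=2 ./run.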

To make a comparison possible, let's gather all three solutions in a single source file:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <random>
#include <chrono>
#include <numeric>
#include <parallel/algorithm>

using namespace std;

const int cores = thread::hardware_concurrency();
const int size = 100000000;
const int size_per_core = size / cores;

vector<int> numbers;
random_device rd;
mt19937_64 rng(rd()); // Mersenne Twister 19937 generator (64-bit)
vector<thread> threads;
vector<int> partialSolution;
mutex m;

typedef vector<int>::iterator vect_iter; // alias

void fill(vector<int> &data, int length) {
    for (int i = 0; i < length; ++i) {
        data.push_back(rng());
    }
}

// Counts the odd values in [begin, end) and appends the partial count
// to result under the protection of the mutex.
void count_odd(vect_iter begin, vect_iter end, vector<int> &result) {
    int counter = count_if(begin, end, [](int &value) {
        return value & 1;
    });
    m.lock();
    result.push_back(counter);
    m.unlock();
}

int main() {

    fill(numbers, size);
    std::chrono::steady_clock::time_point strt, stp;

    // hand-written thread solution
    strt = chrono::steady_clock::now();
    auto begin = numbers.begin();
    for (int i = 0; i < cores; ++i, begin += size_per_core) {
        auto end = begin + size_per_core;
        threads.push_back(thread(count_odd, begin, end, ref(partialSolution)));
    }
    for (auto &t : threads) t.join();
    auto finalResult = accumulate(partialSolution.begin(), partialSolution.end(), int());
    stp = chrono::steady_clock::now();
    cout << "Final result (parallel)    :"
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;

    // sequential solution
    strt = chrono::steady_clock::now();
    finalResult = count_if(numbers.begin(), numbers.end(), [](int &value) {
        return value & 1;
    });
    stp = chrono::steady_clock::now();
    cout << "Final result (sequential)  :"
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;

    // parallel stl algorithm solution
    strt = chrono::steady_clock::now();
    finalResult = __gnu_parallel::count_if(numbers.begin(), numbers.end(), [](int &value) {
        return value & 1;
    });
    stp = chrono::steady_clock::now();
    cout << "Final result (stl parallel):"
         << finalResult
         << " @ "
         << (stp - strt).count()
         << endl;
    return 0;
}

Let's compile it, run it 10 times, and compare the runtime performance:

[student@host1]$ for i in $(seq 1 10) ; do ./run; done
Final result (parallel)    :50003830 @ 391770880
Final result (sequential)  :50003830 @ 1722770927
Final result (stl parallel):50003830 @ 651193996
Final result (parallel)    :49995592 @ 399623007
Final result (sequential)  :49995592 @ 1745905026
Final result (stl parallel):49995592 @ 656114655
Final result (parallel)    :49996624 @ 403331645
Final result (sequential)  :49996624 @ 1751347528
Final result (stl parallel):49996624 @ 640525779
Final result (parallel)    :49998633 @ 382341755
Final result (sequential)  :49998633 @ 1731618656
Final result (stl parallel):49998633 @ 597300980
Final result (parallel)    :49999674 @ 380100468
Final result (sequential)  :49999674 @ 1737255926
Final result (stl parallel):49999674 @ 612496887
Final result (parallel)    :50000214 @ 394109006
Final result (sequential)  :50000214 @ 1728814510
Final result (stl parallel):50000214 @ 598540879
Final result (parallel)    :50000307 @ 436324428
Final result (sequential)  :50000307 @ 1767703757
Final result (stl parallel):50000307 @ 592917336
Final result (parallel)    :50001977 @ 421724960
Final result (sequential)  :50001977 @ 1729831445
Final result (stl parallel):50001977 @ 620272259
Final result (parallel)    :49998127 @ 410899174
Final result (sequential)  :49998127 @ 1813701819
Final result (stl parallel):49998127 @ 591852057
Final result (parallel)    :50000586 @ 415081631
Final result (sequential)  :50000586 @ 1790710476
Final result (stl parallel):50000586 @ 554117088

Let's try to interpret the results graphically:



On a 4-core Intel i7 processor, the speedup came out as shown in the chart below:
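
Using the first run above as a rough check of the chart: the hand-threaded solution is about 1722770927 / 391770880 ≈ 4.4 times faster than the sequential one, while the parallel count_if is about 1722770927 / 651193996 ≈ 2.6 times faster.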


Now let's turn the compiler's optimization settings all the way up and run the application again:

[student@host1]$ g++ -O3 -o run -std=c++11 parallel_for_loop.cpp -lpthread -fopenmp
[student@host1]$ for i in $(seq 1 10) ; do ./run; done
Final result (parallel)    :49993240 @ 85592097
Final result (sequential)  :49993240 @ 304494099
Final result (stl parallel):49993240 @ 87166009
Final result (parallel)    :49998753 @ 66552277
Final result (sequential)  :49998753 @ 312764432
Final result (stl parallel):49998753 @ 86458888
Final result (parallel)    :49988432 @ 67755317
Final result (sequential)  :49988432 @ 285352242
Final result (stl parallel):49988432 @ 87107690
Final result (parallel)    :49999577 @ 64545883
Final result (sequential)  :49999577 @ 289695428
Final result (stl parallel):49999577 @ 91366775
Final result (parallel)    :50000906 @ 72690986
Final result (sequential)  :50000906 @ 283258569
Final result (stl parallel):50000906 @ 81706558
Final result (parallel)    :49995889 @ 65718162
Final result (sequential)  :49995889 @ 298945194
Final result (stl parallel):49995889 @ 81954451
Final result (parallel)    :49988992 @ 66934218
Final result (sequential)  :49988992 @ 299720421
Final result (stl parallel):49988992 @ 113914656
Final result (parallel)    :50000404 @ 62499660
Final result (sequential)  :50000404 @ 311449204
Final result (stl parallel):50000404 @ 79420706
Final result (parallel)    :50005821 @ 72953287
Final result (sequential)  :50005821 @ 290275079
Final result (stl parallel):50005821 @ 87208042
Final result (parallel)    :49995294 @ 65187374
Final result (sequential)  :49995294 @ 310689078
Final result (stl parallel):49995294 @ 72715961

With optimization the running times dropped considerably; the speedup ratios did not improve much, but they moved closer to each other:



Comparison of the speedup over the sequential solution provided by the thread-based solution and the speedup over the sequential solution provided by the OpenMP-based parallel count_if function
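
For example, in the first optimized run the thread-based speedup is roughly 304494099 / 85592097 ≈ 3.6x, while the parallel count_if speedup is roughly 304494099 / 87166009 ≈ 3.5x.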

Finally, let's examine all the running times, both with and without optimization:
