The objective is to read the files in parallel and store each word, together with the fileName it came from, in sorted order.
This approach follows the Map-Reduce paradigm: mappers read files in parallel, extracting and organizing words, while reducers sort and write the results, keeping concurrency high and synchronization overhead minimal.
-
Assigning a fixed number of files to each thread is inefficient, since some files may be larger than others and some threads may run faster, leaving idle threads waiting unnecessarily. To avoid this, each thread dynamically picks up a new file as soon as it finishes processing the current one. If there are files left to process, the thread takes the next available one; otherwise, it stops.
-
Once a file is read, its data is temporarily stored in a thread-local unordered_map. After the entire file has been processed, the collected data is merged into a shared unordered_map accessible to all threads. An unordered_map was chosen because it offers O(1) average-case lookup, compared to O(n) for searching a plain vector.
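A minimal sketch of the merge step, assuming the shared map associates each word with the set of file names it appears in (`sharedIndex` and `mergeLocal` are illustrative names):

```cpp
#include <mutex>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Shared index: word -> set of file names containing it.
std::unordered_map<std::string, std::unordered_set<std::string>> sharedIndex;
std::mutex indexMutex;

// Called once per processed file; the lock is held only for the merge,
// not while the file is being read, which keeps contention low.
void mergeLocal(
        const std::unordered_map<std::string, std::unordered_set<std::string>>& local) {
    std::lock_guard<std::mutex> lock(indexMutex);
    for (const auto& [word, files] : local)
        sharedIndex[word].insert(files.begin(), files.end());
}
```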
-
After all mapper threads have completed their tasks, the collected data is transferred into a structured vector for sorting. Specifically, a vector of 26 sub-vectors is used, where each sub-vector corresponds to a letter of the alphabet. This enables parallel sorting, since each thread can independently process a specific letter, and keeps file writing efficient and concurrent.
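The bucketing step could look like the following sketch, which assumes lowercase ASCII words (the function name `bucketByLetter` is an assumption):

```cpp
#include <array>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Distribute the merged words into 26 per-letter buckets so that each
// reducer can later sort and write one letter independently.
std::array<std::vector<std::string>, 26> bucketByLetter(
        const std::unordered_map<std::string, std::unordered_set<std::string>>& merged) {
    std::array<std::vector<std::string>, 26> buckets;
    for (const auto& entry : merged) {
        const std::string& word = entry.first;
        if (!word.empty() && word[0] >= 'a' && word[0] <= 'z')
            buckets[word[0] - 'a'].push_back(word);
    }
    return buckets;
}
```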
-
The reducers wait until all mappers have finished execution, ensuring that all data is fully loaded into memory.
-
Once the mappers are done, reducers dynamically pick a letter from the alphabet vector, always selecting the first available one and removing it from the list.
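The shared letter queue can be sketched as a mutex-protected vector; an atomic index over `'a'..'z'` would work equally well (names are illustrative):

```cpp
#include <mutex>
#include <vector>

// Filled with 'a'..'z' during setup; reducers pop from the front.
std::vector<char> letters;
std::mutex lettersMutex;

bool takeLetter(char& out) {
    std::lock_guard<std::mutex> lock(lettersMutex);
    if (letters.empty()) return false;  // no letters left: the reducer stops
    out = letters.front();              // first available letter
    letters.erase(letters.begin());     // remove it so it is processed only once
    return true;
}
```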
-
If no words start with the chosen letter, the thread picks another. Otherwise, it sorts the words using the standard std::sort() with a custom comparator.
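The exact ordering criteria are not spelled out above; purely as an illustration, the comparator below sorts entries by how many files contain the word (descending), breaking ties alphabetically. `Entry` and `sortBucket` are assumed names:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One entry per word: the word plus the set of files it appears in.
using Entry = std::pair<std::string, std::set<std::string>>;

void sortBucket(std::vector<Entry>& bucket) {
    std::sort(bucket.begin(), bucket.end(),
              [](const Entry& a, const Entry& b) {
                  if (a.second.size() != b.second.size())
                      return a.second.size() > b.second.size(); // more files first
                  return a.first < b.first;                     // then alphabetical
              });
}
```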
-
After sorting, the thread writes the results to the corresponding output file.