## Final Report

Check out the final report here

## Summary of What We Completed

We have written, optimized, and tested the sequential version of the Huffman encoding and decoding algorithms. For the parallel CPU version, we debated between SIMD (via intrinsics or ISPC) and OpenMP.

However, Huffman compression and decompression do not map well onto SIMD. There is no elegant way of dealing with bits rather than bytes in SIMD, and different bytes compress to different numbers of bits (there is no fixed mapping from input vector size to output vector size), which makes byte alignment very difficult: the compressed form of a random 4-byte input could range from 2 to 4 bytes. Decompression is worse still, since resolving bit-level conflicts (where a single codeword straddles two bytes) is nearly impossible and might actually leave the algorithm slower than the sequential version. We therefore decided to focus on OpenMP.

For compression, we first sort the array in parallel to minimize the number of concurrent updates to the shared frequency dictionary, reducing contention and false sharing. We also use fine-grained locking for the frequency dictionary, locking each key-value pair individually. Once the symbol codes have been determined, each symbol is replaced by its code, with all symbols processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. Here we take advantage of the self-synchronizing property of Huffman codes: a decoder can start at an arbitrary point in the encoded bitstream, and at some point the bit offset will correct itself, producing the correct output from then on.

We read about the LZ77 algorithm, explored its different variants, and looked at ways to parallelize it. One naive approach runs LZ77 independently on different segments of the data; this can produce the same output as the sequential implementation if we use a fixed-size sliding window and re-read some of the data at segment boundaries. Another approach, outlined in *Practical Parallel Lempel-Ziv Factorization*, uses an unbounded sliding window and employs prefix sums and segment trees to compute the Lempel-Ziv factorization in parallel.

## Update on Deliverables

Our sequential implementations are close to finished, and we have some idea of how to parallelize the algorithms. Our goal for the checkpoint was to have both of these parts finished, so we have not completely met it. We may pivot to parallelizing the compression and decompression of the Huffman coding algorithm and drop the LZ77 part of the project altogether.

**Our new goals:**

- Parallelize the Huffman Coding compression.
- Parallelize the Huffman Coding decompression or LZ77 compression.

**Hope to achieve:**

- Complete both alternatives in the second goal above: Huffman Coding decompression and LZ77 compression.

## What We Will Bring to the Poster Session

At the poster session we plan to present graphs showing speedup vs. number of processors and compression ratio vs. number of processors for the Huffman Coding and LZ77 algorithms. We will also demo our compression algorithms on some sample data files.

## Issues

We are stuck deciding what our encoding for LZ77 will be. There are many
different implementations of LZ77 and encoding schemes for the algorithm, and
we have not yet settled on one. *Practical Parallel Lempel-Ziv Factorization*
doesn't specify which encoding scheme to use, and the fact that it uses
unbounded windows complicates things.

## Schedule

| Date | Goal |
|---|---|
| 11/4 to 11/10 | Read papers and scour the web for techniques with which to parallelize the algorithms. |
| 11/11 to 11/17 | Write most of the sequential implementation of the Huffman Coding algorithm. Start planning how to parallelize each algorithm on the CPU and/or GPU. |
| 11/18 to 11/24 | Complete the parallelized compression and decompression for Huffman Coding. |
| 11/25 to 12/1 | Complete the sequential and parallel compression code for LZ77. |
| 12/2 to 12/8 | Collect all the data necessary for the final report and finish a draft of the final report. |
| 12/9 to 12/15 | Finalize the final report. |
| Final (12/15) | Present! |

## Summary

We will explore how various lossless compression algorithms can be parallelized, compare them, and analyze the effect of parallelization on compression ratio. We will focus on the Huffman Coding and LZ77 algorithms, attempting to parallelize on both the CPU and the GPU. We will use the OpenMP API for the parallel version on the CPU (Xeon Phi processors) and CUDA for the parallel version on the GPU (GHC clusters).

## Background

*Lossless* compression algorithms attempt to compress data such that it is
possible to decompress it exactly into the original input, without any loss of
data/reduction in quality. This contrasts *lossy* compression algorithms which can
lose data.

The algorithms that this project attempts to parallelize are two popular lossless compression algorithms.

The Huffman Coding algorithm uses entropy encoding: it uses fewer bits to represent characters that occur more frequently, at the expense of using more bits for characters that occur less frequently.

The LZ77 algorithm uses dictionary encoding: repeated substrings are replaced with references into a sliding window over previously seen data. Bounding the window enables faster compression/decompression at the expense of a lower compression ratio.

The performance of our implementation will be measured by both the time taken to compress an input file and the compression ratio achieved. Compression ratio is the size of the uncompressed input divided by the size of the compressed output.

## Challenge

All compression algorithms require reading input data from files and writing data to files, which is expensive (but fairly parallelizable).

The first step of both of our compression algorithms is making a frequency dictionary. In order to achieve the same compression ratio, every processor must communicate its frequency dictionary with other processors. This results in a high communication to computation ratio. Hence, we will have to experiment with various techniques and determine the tradeoffs between performance and compression ratio in those techniques.

Moreover, every compression algorithm has three main parts:

- Reading the input file to build a frequency dictionary.
- Building an encoding format from the input, or using a pre-existing format.
- Encoding the input file using the encoding format.

These three parts must run in order, and hence require synchronization to prevent threads from moving to the next step before all other threads reach the same point in the program.

## Resources

Shun, Julian, and Fuyao Zhao. "Practical Parallel Lempel-Ziv Factorization." *Data Compression Conference (DCC), 2013*. IEEE, 2013.

W. Hu, J. Wen, W. Wu, Y. Han, S. Yang and J. Villasenor, “Highly Scalable Parallel Arithmetic Coding on Multi-Core Processors Using LDPC Codes,” in *IEEE Transactions on Communications*, vol. 60, no. 2, pp. 289-294, February 2012.

Youssef, Abdou S., and Abdou Youssef. *Parallel algorithms for entropy-coding
techniques*. US Department of Commerce, Technology Administration, National
Institute of Standards and Technology, 1998.

## Goals/Deliverables

**Plan to Achieve**:

- Complete sequential and parallel implementations of Huffman Coding and LZ77 algorithms.
- Achieve a speedup of around 20x with 64 processors, since some existing libraries/papers achieve around 15x with 64 processors.
- Demo the performance of the two algorithms as the number of processors increases, along with the corresponding decrease in compression ratio.
- Graphs showing the % of time taken by each inherently sequential component of the algorithm.
- The effect of factors like memory bandwidth on the speedup of the algorithm as the number of processors increases (especially important when determining the ideal number of processors to use while minimizing the decrease in compression ratio).

**Hope to Achieve**:

In addition to what we plan to achieve,

- Speed up the algorithms by at least 30x (2x more than other libraries) when using 64 processors.
- Run these compression algorithms on the NVIDIA GPU, and explore the tradeoff between higher computation throughput (due to more processors) and higher communication cost (due to copying memory from the CPU to the GPU).
- Compare the effect of memory bottlenecks on the CPU vs. the GPU.
- Complete the sequential and parallel implementations of arithmetic coding, another entropy encoding algorithm.

## Platform

Language: C++

API: OpenMP, CUDA

Machine: Xeon Phi on the Latedays Cluster, NVIDIA GPU on the GHC Cluster