In the shuffle stage, data is copied many times: the output of the map stage is sent over the network to the reduce stage, which involves a large amount of network I/O. If that data is compressed first, far less needs to be transferred.
Hadoop supports several compression algorithms:
- gzip
- bzip2
- LZO
- LZ4
- Snappy
Each of these algorithms strikes a different balance between compression ratio and compression/decompression speed. Considering both together, Google's Snappy offers the best overall trade-off for the shuffle, so Snappy is generally the one chosen.
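As a minimal sketch of how this is enabled, a MapReduce job driver can turn on map-output compression with Snappy using the standard Hadoop 2+ property keys `mapreduce.map.output.compress` and `mapreduce.map.output.compress.codec` (the class name `ShuffleCompressionExample` here is just an illustrative placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleCompressionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress map output before it is copied over the network in the shuffle.
        conf.setBoolean("mapreduce.map.output.compress", true);

        // Use Snappy, which favors compression/decompression speed.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "snappy shuffle example");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}
```

The same two properties can also be set cluster-wide in mapred-site.xml, so that every job's shuffle output is compressed by default.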
See you next time, bye!