9.8. Further Reading

  1. A Distributed Graph-Theoretic Framework for Automatic Parallelization in Multi-Core Systems. https://proceedings.mlsys.org/paper/2021/file/a5e00132373a7031000fd987a3c9f87b-Paper.pdf

  2. SCOP: Scientific Control for Reliable Neural Network Pruning. https://arxiv.org/abs/2010.10732

  3. Searching for Low-Bit Weights in Quantized Neural Networks. https://arxiv.org/abs/2009.08695

  4. GhostNet: More Features from Cheap Operations. https://arxiv.org/abs/1911.11907

  5. AdderNet: Do We Really Need Multiplications in Deep Learning? https://arxiv.org/abs/1912.13200

  6. Blockwise Parallel Decoding for Deep Autoregressive Models. https://arxiv.org/abs/1811.03115

  7. Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads. https://www.together.ai/blog/medusa

  8. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. https://arxiv.org/abs/2307.08691