[2023 New Book] Fast Python: High Performance Techniques for Large Datasets

Source: Zhuanzhi (专知)
This article is a book introduction; suggested reading time: 5 minutes.
The purpose of this book is to help you write more efficient applications in the Python ecosystem.

The purpose of this book is to help you write more efficient applications in the Python ecosystem. By more efficient, I mean that your code will use fewer CPU cycles, less storage space, and less network communication. This book takes a comprehensive approach to performance: we not only discuss code-optimization techniques in pure Python but also cover the efficient use of widely used libraries such as NumPy and pandas. Because pure Python is sometimes not fast enough, we also turn to Cython when we need more speed. Consistent with this comprehensive approach, we discuss the impact of hardware on code design as well: we analyze how modern computer architectures affect algorithm performance, examine the effect of network architecture on efficiency, and explore the use of GPU computing for fast data analysis.

The chapters of this book are largely self-contained, so you can jump straight to any chapter that matters to you. That said, the book is divided into four parts. Part I, Fundamental Methods (Chapters 1-4), covers introductory material.

■ Chapter 1 introduces the problem and explains why we must pay attention to the efficiency of computation and storage. It also introduces the book's methodology and offers suggestions for navigating it according to your needs.

■ Chapter 2 covers native Python optimizations: Python data structures, code profiling, memory allocation, and lazy programming techniques.
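To make the lazy-programming point concrete, here is a minimal sketch (my own illustration, not code from the book) of how a generator computes values only on demand:

```python
def read_squares(n):
    # Generator: yields one value at a time instead of
    # building the whole list in memory up front.
    for i in range(n):
        yield i * i

# Only the values actually consumed are ever computed,
# even though the generator could produce a billion of them.
first_three = []
for value in read_squares(10**9):
    first_three.append(value)
    if len(first_three) == 3:
        break
print(first_three)  # → [0, 1, 4]
```

Because nothing is materialized until it is requested, peak memory use stays constant regardless of how large the logical sequence is.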

■ Chapter 3 discusses concurrency and parallelism in Python and describes how best to take advantage of multiprocessing and multithreading (including the limitations of threads for parallel processing). The chapter also touches on asynchronous processing as a way to efficiently handle many concurrent, low-workload requests, typical of web services.
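As a hedged sketch of the distinction the chapter draws (not the book's code), threads help with I/O-bound work because the GIL is released while waiting on I/O, whereas CPU-bound work needs separate processes:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an I/O-bound task (e.g. a network request);
    # threads overlap such waits because the GIL is released during I/O.
    return len(url)

urls = ["https://example.com/a", "https://example.com/bb"]
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(fetch, urls))
print(sizes)  # → [21, 22]

# For CPU-bound work the GIL serializes threads; swapping in
# concurrent.futures.ProcessPoolExecutor uses multiple cores instead.
```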

■ Chapter 4 introduces NumPy, a library that allows you to efficiently work with multidimensional arrays. NumPy is at the heart of all modern data processing techniques, and as such, it is considered a fundamental library. This chapter shares specific NumPy techniques to develop more efficient code, such as views, broadcasting, and array programming.
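A brief illustration of views and broadcasting (my own example, assuming a standard NumPy install):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# A view shares memory with the original array: no copy is made,
# so writing through the view mutates `a`.
col = a[:, 0]
col[0] = 99
assert a[0, 0] == 99

# Broadcasting: the (3, 1) column of row means is "stretched" across
# the columns of `a` without materializing a full (3, 4) copy.
row_means = a.mean(axis=1, keepdims=True)
centered = a - row_means  # array programming: no explicit Python loop
assert centered.shape == (3, 4)
```

Both techniques avoid copies and Python-level loops, which is where most NumPy speedups come from.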

The second part, Hardware (Chapters 5 and 6), focuses on how to extract maximum efficiency from commonly used hardware and networks.

■ Chapter 5 covers Cython, a superset of Python that generates very efficient code. Python is a high-level interpreted language and, as such, is not optimized for the hardware it runs on. Several languages, like C or Rust, are designed to be as efficient as possible at the hardware level; Cython belongs to this category: although it is very close to Python, it compiles to C code. Generating the most efficient Cython code requires attention to how the code maps to an efficient implementation. In this chapter, we learn how to write efficient Cython code.
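As a hedged sketch only (hypothetical module and function names, not from the book), adding static C types is the core technique that lets Cython emit fast C instead of generic Python-object operations:

```cython
# sum_sq.pyx -- the cdef static types let the generated C loop run
# without per-iteration Python-object boxing and dispatch.
def sum_of_squares(long n):
    cdef long i, total = 0
    for i in range(n):
        total += i * i
    return total
```

Such a file is typically compiled ahead of time (for example with Cython's `cythonize` tooling) and then imported from Python like any other module.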

■ Chapter 6 discusses the impact of modern hardware architectures on designing efficient Python code. Given the way modern computers are designed, some counter-intuitive programming methods may be more efficient than expected. For example, in some cases it may be faster to process compressed data than uncompressed data, even though we pay the cost of the decompression algorithm. This chapter also covers the impact of CPU, memory, storage, and networking on algorithm design in Python. We discuss NumExpr, a library that can make NumPy code more efficient by using properties of modern hardware architectures.
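The compressed-data point can be sketched with the standard library alone (a toy illustration of my own, not the book's example; NumExpr itself is a separate third-party package):

```python
import zlib

# Highly repetitive data compresses well; moving far fewer bytes from
# storage or over the network can more than pay for the CPU cost of
# decompressing them.
raw = b"0123456789" * 100_000          # 1,000,000 bytes
packed = zlib.compress(raw)

ratio = len(raw) / len(packed)
assert ratio > 50                      # dramatically fewer bytes to read
assert zlib.decompress(packed) == raw  # lossless round trip
```

Whether the trade-off wins in practice depends on how compressible the data is and on the relative speeds of storage, network, and CPU, which is exactly the kind of analysis the chapter develops.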

Part III, Applications and Libraries for Modern Data Processing (Chapters 7 and 8), looks at typical applications and libraries used in modern data processing.

■ Chapter 7 focuses on using pandas, Python's dataframe library, as efficiently as possible. We cover pandas-specific techniques for optimizing code. Unlike most chapters in this book, this one builds on earlier ones: because pandas is based on NumPy, we draw on what we learned in Chapter 4 and apply NumPy-related techniques to optimize pandas. We also explore how to optimize pandas using NumExpr and Cython. Finally, I introduce Arrow, a library that can improve the performance of processing pandas dataframes.
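One common optimization of the kind the chapter covers, sketched as my own example under the assumption of a standard pandas install (not the book's code):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"] * 25_000,
    "temp": range(100_000),
})

# Low-cardinality string columns stored as categoricals keep one copy
# of each distinct value plus small integer codes, saving memory.
plain = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
packed = df["city"].memory_usage(deep=True)
assert packed < plain

# Prefer vectorized column arithmetic over df.apply with a Python
# lambda: the loop runs in compiled code, not the interpreter.
df["temp_f"] = df["temp"] * 9 / 5 + 32
```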

■ Chapter 8 studies the optimization of data persistence. We discuss Parquet, a library for efficiently handling columnar data, and Zarr, which can handle very large on-disk arrays. We also begin to discuss how to handle datasets larger than memory.
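The larger-than-memory idea reduces to processing data in bounded-size chunks; here is a minimal stdlib sketch (my own, not the book's code) of the pattern that chunked readers like Zarr and Parquet apply at much larger scale:

```python
import io

def chunked_sum(stream, chunk_size=1 << 16):
    # Read and process the stream piece by piece so that memory use
    # stays bounded by chunk_size, not by the total data size.
    total = 0
    while chunk := stream.read(chunk_size):
        total += sum(chunk)
    return total

data = bytes(range(256)) * 10
assert chunked_sum(io.BytesIO(data), chunk_size=100) == sum(range(256)) * 10
```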

Part IV, Advanced Topics (Chapters 9 and 10), deals with two final, very different approaches: using the GPU and using the Dask library.

■ Chapter 9 looks at how to use graphics processing units (GPUs) to process large datasets. We'll see that the GPU computing model (many simple processing units working in parallel) is well suited to tackling modern data science problems. We take two different approaches to utilizing the GPU. First, we discuss existing libraries that provide interfaces similar to those you already know, such as CuPy, a GPU version of NumPy. Second, we discuss how to generate code from Python that runs on the GPU.

■ Chapter 10 discusses Dask, a library that lets you write parallel code that scales out to many machines, whether locally or in the cloud, while providing familiar interfaces similar to NumPy and pandas.

Reprinted from: blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/132137832