Is Python too slow?

This article is reproduced from InfoQ; the author is Anna Anisienia.

Although Python is slower than many compiled languages, it is easy to use and extremely versatile. For many people, the usefulness of the language trumps its speed.

I am a Python engineer, so you might think I am biased. Still, I want to address some of the criticisms of Python and consider whether speed is really worth worrying about in everyday work such as data engineering, data science, and analytics.

Is Python too slow?

In my opinion, questions like this should be considered in the context of a specific use case. Is Python slower than compiled languages such as C at crunching numbers? Yes, it is. This has been known for years, which is why speed-critical Python libraries such as numpy exist: under the hood, numpy is written in C.

But is Python much slower than other (harder to learn and use) languages across all use cases? If you look at the performance benchmarks of Python libraries optimized for specific problems, you will find that they hold up quite well against compiled languages. For example, take a look at FastAPI's performance benchmarks. Obviously Go, being a compiled language, is much faster than Python, yet FastAPI still outperforms some Go frameworks at building REST APIs:

[Image: web framework benchmarks]

Side note: C++ and Java web frameworks that perform even better are not included in the list above.
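
For context, building a REST endpoint in FastAPI looks roughly like the minimal sketch below (the route and response are made up for illustration); this simplicity is a big part of the framework's appeal:

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/items/{item_id}")
async def read_item(item_id: int):
    # FastAPI validates and converts the path parameter based on the type hint
    return {"item_id": item_id}

if __name__ == "__main__":
    # Serve the app with uvicorn, the ASGI server typically paired with FastAPI
    uvicorn.run(app, host="0.0.0.0", port=8000)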

Similarly, when comparing Dask (written in Python) with Spark (written in Scala) on a data-intensive neuroimaging pipeline, the authors reached the following conclusion:

Overall, our results show that there is no substantial difference in performance between the two engines.

The question we should really ask ourselves is how much speed we actually need. If an ETL job is triggered only once a day, it does not matter whether it takes 20 seconds or 200 seconds. What matters more is that the code is easy to understand, package, and maintain, especially since computing resources keep getting cheaper while engineering time remains expensive.

Code speed vs practicality

From a practical point of view, when choosing a programming language for daily work, we need to answer several different questions.

Can you reliably solve multiple business problems in this language?

If all you care about is speed, don't use Python: there are faster alternatives for most use cases. Python's main advantages are its readability, its ease of use, and its ability to tackle a wide variety of problems. Python works as the "glue" that connects different systems, services, and use cases.

Can you find enough employees who understand this language?

Since Python is very easy to learn and use, its user base keeps growing. Business users who used to crunch numbers in Excel can pick up Pandas very quickly and become self-sufficient instead of relying on IT resources. This offloads work from the IT and analytics departments and shortens the time to value.
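
As an illustration of how close Pandas stays to the Excel mental model, here is a minimal sketch (the file and column names are made up) of a typical aggregation:

import pandas as pd

# Load a spreadsheet and summarize revenue by region, much like a pivot table in Excel
sales = pd.read_excel("sales.xlsx")  # hypothetical workbook; needs an Excel engine such as openpyxl
summary = (
    sales.groupby("region", as_index=False)["revenue"]
         .sum()
         .sort_values("revenue", ascending=False)
)
summary.to_excel("sales_by_region.xlsx", index=False)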

Nowadays, it is much easier to find a data engineer who understands Python and can maintain Spark data processing applications in this language than it is to find a Java or Scala engineer who does the same thing. Many organizations have turned to Python for many use cases just because they have a higher chance of finding employees who can "speak" the language.

In contrast, I know of companies that desperately need Java or C# developers to maintain existing applications, but these languages are hard to master (it takes years) and seem unattractive to new programmers, who can earn more working with simpler languages such as Go or Python.

Synergy between experts in different fields

If your company uses Python, business users, data analysts, data scientists, data engineers, back-end and web developers, DevOps engineers, and even system administrators are all likely to use the same language. This can create synergies in the project, allowing people from different fields to work together and use the same tools.

What is the real bottleneck in data processing?

In my own work, the bottleneck I usually run into is not the language itself but external resources. Let's look at a few concrete examples.

Writing to relational databases

When processing data in ETL, we eventually need to load it into some centralized place. Although we can use multithreading in Python to write to a relational database faster (by using more threads), increasing the number of parallel writes can max out the CPU capacity of the database itself.

In fact, this happened to me once when I used multithreading to speed up writes to an RDS Aurora database on AWS. The CPU utilization of the writer node climbed so high that I had to deliberately slow the code down by using fewer threads, to make sure the database instance would not be overwhelmed.

In other words, Python has mechanisms to parallelize and speed up many operations, but the relational database (constrained by its number of CPU cores) has its own limits, and no faster programming language can solve that problem for you.
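
A minimal sketch of that pattern, assuming a hypothetical insert_chunk() helper that writes one batch of rows with whatever database driver you use; the point is that max_workers, not the language, is the knob that protects the database:

from concurrent.futures import ThreadPoolExecutor

def insert_chunk(chunk):
    # Hypothetical helper: write one batch of rows using your database driver of choice
    ...

def load_in_parallel(chunks, max_workers=4):
    # More workers means faster loads, but also more CPU load on the database's writer node,
    # so this number is deliberately kept small
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(insert_chunk, chunks))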

Calling external APIs

Another example where the language itself is not the bottleneck is working with external REST APIs (for instance, to extract data for analysis). Although we can use parallelism to speed up extraction, it may be futile, because many external APIs limit the number of requests we can make within a given time window. You may therefore find yourself deliberately slowing your script down to stay under the API's rate limit:

time.sleep(10)  # deliberately pause so the API's request limit is not exceeded
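
Putting that together, a throttled extraction loop might look like the following sketch, assuming a hypothetical paginated endpoint and a limit of roughly 30 requests per minute:

import time
import requests

API_URL = "https://api.example.com/records"  # hypothetical endpoint

def fetch_pages(n_pages=10, max_requests_per_minute=30):
    delay = 60 / max_requests_per_minute  # spacing that keeps us under the rate limit
    results = []
    for page in range(1, n_pages + 1):
        response = requests.get(API_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        results.extend(response.json())  # assumes the endpoint returns a JSON list
        time.sleep(delay)  # deliberately slow down between calls
    return results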

Processing big data

In my experience with large datasets, no matter which language you use, you cannot load truly "big" data into the memory of a laptop. For such use cases you need a distributed processing framework such as Dask, Spark, or Ray, because a single server instance or laptop can only process so much data.

If you want to push the actual data processing work out to a cluster of compute nodes, or even use GPU instances to speed up the computation further, Python has a rich ecosystem of frameworks that make this easier:

  • Do you want to use GPUs to speed up data science computations? Use PyTorch, TensorFlow, Ray, or RAPIDS (or even SQL, via BlazingSQL)

  • Do you want to speed up Python code that processes big data? Use Spark (or Databricks), Dask, or Prefect (which abstracts over Dask under the hood); see the Dask sketch after this list

  • Do you want to speed up data analysis itself? Use a fast, in-memory columnar database and you can get high-speed processing with nothing more than SQL queries.
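
As promised above, here is a minimal Dask sketch (the bucket path and column names are made up): the pandas-like API stays the same, but the work is split into partitions that can be spread across many workers:

import dask.dataframe as dd

# Read a dataset that is too large for a single machine as a partitioned Dask DataFrame
df = dd.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path

# Lazily defined aggregation using the familiar pandas-style API
daily_totals = df.groupby("event_date")["amount"].sum()

# .compute() triggers the actual (potentially distributed) computation
print(daily_totals.compute())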

If you need to orchestrate and monitor data processing across a cluster of compute nodes, there are several workflow management platforms written in Python, such as Apache Airflow, Prefect, or Dagster, that speed up the development and maintenance of data pipelines. If you want to know more, check out my previous article.
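
To give a flavor of what such orchestration code looks like, here is a minimal Airflow 2-style DAG sketch (the task names and schedule are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Hypothetical extraction step, e.g. pulling data from an API or a database
    ...

def load():
    # Hypothetical load step, e.g. writing the results to a warehouse
    ...

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # run the pipeline once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds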

By the way, I suspect that some of the people who complain about Python are not using it to its full potential, or are not using data structures appropriate to the problem at hand.

All in all, if you need to process large amounts of data quickly, what you usually need is more computing resources rather than a faster programming language, and there are Python libraries that make it easy to distribute the work across hundreds of nodes.

Conclusion

In this article, we discussed whether Python is the real bottleneck in today's data processing work. Although Python is slower than many compiled languages, it is easy to use and versatile. We noted that for many people, the usefulness of the language trumps its speed.

Finally, we saw that, at least in data engineering, the bottleneck is usually not the language itself but the limits of external systems, and the sheer amount of data that cannot be processed on a single machine no matter which programming language you choose.
