9 Best Programming Languages for Big Data Processing

The wave of big data continues. It permeates almost every industry, flooding businesses with information and making old standbys like Excel look increasingly inadequate. Data processing is no longer trivial, and the need for sophisticated analytics and powerful, real-time processing has never been greater.

So, what are the best tools for sifting through huge datasets? Talking with data hackers, we learned their favorite languages and toolkits for hardcore data analysis.

R Language

If R ranks second on a list like this, no other language can claim first. Since 1997, it has grown in popularity around the world as a free alternative to expensive statistical software such as Matlab and SAS.

Over the past few years, R has become the darling of data science, known not only among nerdy statisticians but also to Wall Street traders, biologists, and Silicon Valley developers. Companies in a variety of industries, such as Google, Facebook, Bank of America, and The New York Times, use R, and its commercial adoption continues to spread.

The R language has a simple and obvious appeal. With just a few lines of R, you can sift through complex datasets, run the data through advanced modeling functions, and produce polished graphics to represent the numbers. It has been likened to a hyperactive version of Excel.

The greatest asset of the R language is the vibrant ecosystem that has developed around it: the R community is always adding new packages and features to its already rich feature set. It is estimated that over 2 million people use R, and a recent poll indicated that R is by far the most popular language for data science, used by 61% of respondents (followed by Python at 39%).

In addition, R has gradually made its way onto Wall Street. Bank analysts once pored over Excel files late into the night, but now R is being used more and more for financial modeling, particularly as a visualization tool, says Niall O'Connor, vice president at Bank of America. "The R language makes our mundane tables stand out," he said.

R's maturation has made it the language of choice for data modeling, although its capabilities become limited when companies need to build products at scale, and some say other languages are usurping its position there.

"R is better suited to sketching and outlining than to detailed construction," said Michael Driscoll, CEO of Metamarkets. "You won't find R at the heart of Google's page ranking or Facebook's friend recommendation algorithms. Engineers prototype in R, then hand off to models written in Java or Python."

Then again, back in 2010, Paul Butler famously created a global map of Facebook connections in R, a testament to the language's rich visualization capabilities, although he doesn't use R as often as he used to.

"R is becoming obsolete a little bit because of its slowness and its clunky handling of large datasets," Butler said.

So, what does he use instead? Read on.

Python

If R is a neurotic, lovable geek, Python is its easy-going, flexible cousin. Python quickly gained mainstream traction as a more practical language, combining R's ability to quickly mine complex data with the capacity to build production software. Python is intuitive and easier to learn than R, and its ecosystem has grown dramatically in recent years, making it capable of statistical analysis once reserved for R.

"It's progress for the industry. There's been a very clear transition from R to Python over the past two years," Butler said.

In data processing there is often a trade-off between scale and complexity, and Python has emerged as a compromise: IPython notebooks and NumPy can serve as a scratchpad for light work, while Python itself is a powerful tool for medium-scale data processing. Python's rich data community is another advantage, offering a large number of toolkits and functions.
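As a toy illustration of that scratchpad style (the revenue figures below are made up purely for the example), a few lines of NumPy are enough to summarize a dataset:

```python
import numpy as np

# Hypothetical daily revenue figures, invented for illustration.
revenue = np.array([120.0, 135.5, 98.2, 143.7, 110.1, 155.9, 101.4])

mean = revenue.mean()                     # average daily revenue
above_avg_days = int((revenue > mean).sum())  # days beating the average
p90 = np.percentile(revenue, 90)          # 90th-percentile day
```

In an IPython notebook, each of these lines can be run and inspected interactively, which is exactly the sketch-first workflow the article describes.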

Bank of America uses Python to build new products and interfaces within the bank's infrastructure and to process financial data. "Python is broad and flexible, so people flock to it," O'Connor said.

However, it's not the most performant language and is only used occasionally for large-scale core infrastructure, Driscoll said.

Julia

The vast majority of data science today is performed in R, Python, Java, MatLab, and SAS, but other languages survive in the cracks, and Julia is a rising star worth watching.

The general industry consensus is that Julia is still too obscure, but data hackers can't help getting excited about its potential to replace R and Python. Julia is a high-level, extremely fast, expressive language. It is faster than R, more scalable than Python, and fairly easy to learn.

"It's growing. Eventually, with Julia, you can do anything you can do with R and Python," Butler said.

But so far, adoption remains hesitant. The Julia data community is still in its early days, and to compete with R and Python it will need to add more packages and tools.

"It's still young, but it's making waves and it's very promising," Driscoll said.

Java

Java, and Java-based frameworks, form the backbone of Silicon Valley's biggest tech companies. "If you look at Twitter, LinkedIn and Facebook, Java is the underlying language for all of their data engineering infrastructure," Driscoll said.

Java does not provide the same quality of visualization as R and Python, and it is not the best choice for statistical modeling. However, if you're moving past prototyping and need to build large systems, Java is often your best bet.

Hadoop and Hive

Hadoop and Hive are a pair of Java-based tools developed to meet the huge demands of data processing. Hadoop ignited enthusiasm as the preferred Java-based framework for batch processing. It is slower than some other processing tools, but it is remarkably accurate, so it is widely used for back-end analytics. It pairs well with Hive, a query framework that runs on top of it.
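Hadoop's batch model boils down to the classic map/shuffle/reduce pattern. As a sketch only (plain single-machine Python standing in for what Hadoop distributes across a cluster), a word count looks like this:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big tools", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "tools": 1, "pipelines": 1}
```

Hadoop's value is that it runs each phase in parallel across many machines, with fault tolerance; the logic per phase stays this simple.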

Scala

Scala is another JVM language, and like Java it is increasingly used for large-scale machine learning and for building high-level algorithms. It is expressive and also able to build robust systems.

"Java is like building in steel, and Scala is like working with clay that you can later put in a kiln and turn into steel," Driscoll said.

Kafka and Storm

So what do you do when you need fast, real-time analytics? Kafka is your best friend. It has been around for about five years, but only recently became a popular framework for stream processing.

Kafka, born inside LinkedIn, is an ultra-fast messaging system. Kafka's downside? It's too fast: operating in real time makes it prone to errors, and it occasionally misses things.

"There's a tradeoff between accuracy and speed," Driscoll said, "so all the big tech companies in Silicon Valley use two pipelines: Kafka or Storm for real-time processing, and then Hadoop for a batch system that is slow but super accurate."
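That two-pipeline idea can be caricatured in a few lines of Python. This is purely illustrative: the hypothetical `speed_layer` below fakes lossiness by dropping every fourth event, standing in for a fast but imperfect stream processor, while the `batch_layer` recounts the stored log exactly, standing in for the slow Hadoop pass:

```python
# Toy "two pipelines": a fast path that may drop events,
# and a batch path that recounts everything exactly.
events = ["click"] * 10

def speed_layer(stream, drop_every=4):
    # Fast but lossy: pretend every fourth event is missed under load.
    return sum(1 for i, _ in enumerate(stream) if (i + 1) % drop_every != 0)

def batch_layer(stream):
    # Slow but exact: a full recount over the stored event log.
    return len(list(stream))

approx = speed_layer(events)  # quick answer, slightly off
exact = batch_layer(events)   # authoritative answer, arrives later
```

The real systems are vastly more sophisticated, but the division of labor is the same: serve the approximate number now, and reconcile it against the exact batch count later.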

Storm is another framework, written in Scala, that has gained a lot of traction in Silicon Valley for stream processing. Twitter acquired it, no doubt because of the enormous benefit fast event handling brings to Twitter.

Honorable Mentions:

➤MatLab

MatLab has been around for a long time, and despite its high price tag, it is still widely used in some very specific fields: research-intensive machine learning, signal processing, image recognition, to name a few.



➤Octave

Octave is very similar to MatLab, but it's free. However, it is rarely seen outside of academic signal-processing circles.

➤Go

Go is another rising star making waves. Developed by Google, it is loosely derived from C, and it is gaining ground on competitors such as Java and Python for building robust infrastructure.
