10 Languages and Tools That Help You Unlock the Potential Value of Big Data

As the big data boom continues to heat up, information floods in from nearly every field, such as the browsing histories and behavioral records of thousands of users, and simple processing in Excel is far from sufficient. But if you only know how to operate a piece of software, and not how to analyze the data logically, you are still doing no more than simple data processing.

That kind of work is easily replaced, and it never reaches the core of strategic planning.

Of course, basic skills are still indispensable. If you want to become a data scientist, you should have some understanding of the following languages and tools:

Hadoop and Hive

To meet the demand for processing large volumes of data, a group of Java-based tools has emerged. Hadoop has become the key Java-based framework for batch processing. Compared with other processing tools, Hadoop is much slower, but it is extremely accurate and is widely used for back-end data analysis. It pairs well with Hive, a query-based framework that runs on top of it.
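To make the batch model concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce phases around which Hadoop-style batch jobs are organized. It does not use Hadoop or Hive; the function names and the sample browsing records are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (page, 1) pair for every (user, page) record."""
    for _user, page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get one total per page."""
    return {key: sum(values) for key, values in grouped.items()}


def count_page_views(records):
    """Run the three phases in order over the whole batch."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical browsing-history records: (user, page)
sample_records = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/cart"),
    ("u3", "/home"),
]
page_views = count_page_views(sample_records)
print(page_views)
```

The whole data set is processed in one pass from start to finish, which is why this style is slow to return an answer but does not miss records.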

Python

If R is a neurotic but lovable geek, Python is its easygoing counterpart that gets along with everyone.

Python is rapidly becoming mainstream as a hybrid of R's fast, sophisticated data mining capabilities and a more practical language. Python is simpler and more intuitive to learn than R, and its ecosystem has grown dramatically in recent years, making it more capable of the statistical analysis that used to be R's territory.

Butler said, "Over the past two years there has been a marked shift from R to Python, like a giant striding steadily forward."

In data processing there is usually a trade-off between scale and sophistication, and Python has emerged as the compromise. IPython Notebook (a notebook tool) and NumPy can serve as a scratchpad for lighter workloads, while Python itself is a very good tool for medium-scale data processing. Python also has a rich community that provides a large number of toolkits and statistical features.
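As a small illustration of the scratchpad style of work described above, the sketch below summarizes a handful of made-up session durations using only Python's standard `statistics` module. The data and the `summarize` helper are hypothetical; on larger arrays the same summary would more typically be done with NumPy.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical session lengths in minutes
session_minutes = [3.0, 5.0, 4.0, 12.0, 6.0]
summary = summarize(session_minutes)
print(summary)
```

A few lines like these are usually enough for a first look at a data set before deciding whether heavier tooling is needed.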

Bank of America uses Python to build new products and interfaces within the bank's infrastructure, and also to process financial data. "Python is broad and quite flexible, so people are flocking to it," O'Donnell says.

However, although it can make up for R's shortcomings, it is still not the highest-performance language, and only occasionally can it be used to handle large-scale core infrastructure, Driscoll says.

Julia

Today, most data science is done in R, Python, Java, Matlab, and SAS, but there are still gaps to fill, and the newcomer Julia has spotted these pain points.

Julia is still too arcane to be widely adopted in industry, but data hackers light up when they talk about its potential to take the throne from R and Python. The reason is that Julia is a high-level, incredibly fast, and expressive language. It is faster than R, has the potential to handle larger scale than Python, and is also quite easy to pick up.

"Julia will become increasingly important. Ultimately, anything you can do in R and Python you will be able to do in Julia," Butler says.

For now, if anything is holding Julia back, it is probably its youth. Julia's data community is still in its early stages, and it needs more toolkits and packages before it can compete with R or Python.

Driscoll says that, young as it is, it has a real prospect of becoming mainstream.

R

Of all the programming languages on this list, you could forget any of the others, but the one you cannot forget is R. It quietly debuted in 1997, and its biggest advantage is that it is free, an alternative to expensive statistical software such as Matlab or SAS.

But over the past few years its value has soared, and it has become a treasure in the eyes of the data science community. It is familiar not only to nerdy statisticians but also to Wall Street traders, biologists, and Silicon Valley developers. Companies as diverse as Google, Facebook, Bank of America, and The New York Times all use R, and its commercial usefulness keeps growing.

R's appeal is that it is simple and approachable. With R, a few lines of code are enough to sift through complex data sets, manipulate the data with sophisticated modeling functions, and create tidy charts to present the numbers. By analogy, it is like a hyperactive version of Excel.

R's greatest asset is its vibrant ecosystem: the R community keeps adding new packages and features to an already rich feature set. More than 2 million people are estimated to use R, and a recent survey showed that R is by far the most popular language in data science, used by 61% of respondents (followed by Python at 39%).

It has also attracted Wall Street's attention. Traditionally, securities analysts pored over Excel files day and night, but R is now increasingly used in financial modeling, particularly as a visualization tool. Niall O'Connor, a vice president at Bank of America, says, "R makes our mundane tables stand out."

As a data modeling language, R is gradually maturing into a professional tool, although it remains limited when a company needs to produce large-scale products, and some say it has already been usurped by other languages.

"R is more for sketching than for building," says Michael Driscoll, CEO of Metamarkets, a leading analytics company. "You will not find R at the core of Google's page ranking or Facebook's friend recommendation algorithms. Engineers build a prototype in R and then hand the model off to be rewritten in Java or Python."

As a well-known example of R in use, in 2010 Paul Butler used R to create a map of the world for Facebook, demonstrating how rich and powerful the language's data visualization capabilities are, although he now uses R less than he used to.

"R is gradually becoming outdated. On large data sets it is slow and cumbersome," Butler says.

So what does he use instead? Judging by his remark in the Python section above, the answer is increasingly Python.

Java

Driscoll says that Java and Java-based frameworks form the backbone of Silicon Valley's biggest technology companies. If you look inside Twitter, LinkedIn, or Facebook, you will find that Java is the foundational language for all of their data engineering infrastructure.

Java does not have visualization capabilities as good as those of R and Python, and it is not the best tool for statistical modeling. But if you need to move past the prototype and build a large system, Java is usually your best choice.

Kafka and Storm

What do you turn to when you need fast, real-time analysis? Kafka will be your best partner. It has actually been around for about five years, but it has only recently become more and more popular with the rise of stream processing.

Kafka was born at LinkedIn and is a particularly fast messaging system. Its drawback? It is so fast that it can make mistakes when operating in real time, and it occasionally misses things.

You cannot have it both ways. "You have to choose between speed and accuracy," Driscoll says. So all the big technology companies in Silicon Valley use two pipelines: Kafka or Storm for real-time processing, and Hadoop for a batch processing system. That sounds like extra trouble and is somewhat slower, but the advantage is that it is extremely accurate.
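The two-pipeline idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses no Kafka, Storm, or Hadoop APIs, the function names are hypothetical, and the "missed" events are simulated by skipping every fourth event, which is a stand-in rather than a description of how any of those systems actually behave.

```python
from collections import Counter


def speed_layer(events, drop_every=4):
    """Fast path: count events as they arrive, but simulate occasional loss
    by skipping every `drop_every`-th event (a stand-in for missed messages)."""
    counts = Counter()
    for position, event in enumerate(events, start=1):
        if position % drop_every == 0:
            continue  # simulated missed event
        counts[event] += 1
    return counts


def batch_layer(events):
    """Slow path: recount the complete stored log, so nothing is missed."""
    return Counter(events)


def merged_view(realtime_counts, batch_counts):
    """Use the accurate batch figure where one exists,
    and fall back to the real-time estimate otherwise."""
    view = dict(realtime_counts)
    view.update(batch_counts)
    return view


# Hypothetical event stream
events = ["click", "view", "click", "view", "click", "buy", "view", "click"]
realtime = speed_layer(events)
batch = batch_layer(events)
print(realtime)
print(batch)
```

The real-time counts are available immediately but may undercount; the batch counts arrive later and replace them once the full log has been processed.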

Storm is another framework, written largely in Clojure, whose popularity in stream processing is rising sharply in Silicon Valley. Twitter acquired it, which is no surprise given Twitter's great interest in processing events quickly.

Scala

Scala is another JVM-based language and, like Java, it is increasingly the tool of choice for anyone who wants to do large-scale machine learning or build high-level algorithms. It is expressive and also capable of building reliable systems.

"Java is like building in steel. Scala is like working with clay that you can then put in a kiln and turn into steel," Driscoll says.

Matlab

Matlab has been around seemingly forever and, despite its very high price, it is still widely used in some very specific niches, including research-intensive machine learning, signal processing, and image recognition.

Go

Go is another newcomer that is steadily gaining ground. It was developed by Google and is loosely derived from C, and it is gradually becoming a competitor to Java and Python for building robust infrastructure.

Octave

Octave is very similar to Matlab, except that it is free. However, it is seldom seen outside academic signal-processing circles.

There are many tools to choose from, but you do not need to master every one of them. Know your goals and direction, and choose the most suitable tool. It will help you work more efficiently and achieve accurate results.

Origin blog.csdn.net/sdddddddddddg/article/details/91402164