Big data development vs. Java development: what is the difference?

I recently found that some students do not really understand the big data development engineer position, so I would like to briefly explain what a big data development engineer is, what big data development at Internet companies actually looks like today, and how the work differs from that of an ordinary Java or PHP engineer.

What is not big data development?

Using only databases (relational ones such as MySQL, SQL Server, and Oracle, or non-relational ones such as MongoDB and Redis) is not big data development, even if the data volume reaches millions or billions of records.

Querying data from a business system's database and outputting a report is not big data development.

Having front ends (web pages, H5, native mobile apps) report tracking data and writing those records to a database is not big data development.

What is big data development?

1. Skills required for big data development

I searched Zhaopin for the big data development engineer position and took screenshots of a few of the postings.

 


So the tools Internet companies currently have in mind when they say big data development are Hadoop, Hive, HBase, Spark, Kafka, and so on.

2. What big data development does

Boiled down to one word: statistics.

Boiled down to two kinds of metrics: PV and UV.

Boiled down to one sentence: computing PV and UV metrics.

In the PC Internet era, every portal (such as Sina, NetEase, and Sohu) cared about how many times its site was opened today (PV, page views) and how many people visited it today (UV, unique visitors). Slightly more complicated examples:

How many people clicked a button or link on a page, and how many times

A heat map of a page (the more clicks an area gets, the deeper its color)
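As a minimal sketch of the two basic metrics, PV is the total number of page views and UV is the number of distinct users behind them. The event list and the name `pv_uv` below are made up for illustration; real data would come from logs rather than an in-memory list.

```python
# Hypothetical page-view events: (user_id, page). All values are invented.
page_views = [
    ("u1", "/index"),
    ("u2", "/index"),
    ("u1", "/index"),
    ("u3", "/news"),
    ("u1", "/news"),
]


def pv_uv(events):
    """Return (PV, UV): total views and number of distinct users."""
    pv = len(events)
    # A set keeps each user_id once, which is the deduplication step for UV.
    uv = len({user_id for user_id, _page in events})
    return pv, uv


pv, uv = pv_uv(page_views)
print(f"PV={pv}, UV={uv}")  # PV=5, UV=3
```

The same user opening the page three times adds three to PV but only one to UV.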

In the mobile Internet era, everyone cares about how many times a mobile app is opened and how many users it has, but there is much other important data beyond that. Because of the limited size of the phone screen, the information feed has become the mainstream format in the mobile era.

The major portals care a great deal about their own news clients: how many exposures articles get in the feed, how many articles users click, and how long each article is read. The more articles users click and the longer they use the client, the higher the company's advertising revenue, so companies do everything they can to recommend content that users like.
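A rough sketch of the feed metrics just described: counting exposures and clicks per article. The event format, the article IDs, and the derived click rate are all assumptions made for illustration; the article itself only mentions exposures and clicks.

```python
from collections import Counter

# Hypothetical feed events: (event_type, article_id). Invented sample data.
feed_events = [
    ("exposure", "a1"),
    ("exposure", "a2"),
    ("click", "a1"),
    ("exposure", "a1"),
    ("exposure", "a3"),
    ("click", "a3"),
]


def summarize(events):
    """Count exposures and clicks per article."""
    exposures = Counter()
    clicks = Counter()
    for event_type, article_id in events:
        if event_type == "exposure":
            exposures[article_id] += 1
        elif event_type == "click":
            clicks[article_id] += 1
    return exposures, clicks


exposures, clicks = summarize(feed_events)
for article_id in sorted(exposures):
    shown = exposures[article_id]
    clicked = clicks[article_id]  # Counter returns 0 for articles never clicked
    print(f"{article_id}: exposures={shown}, clicks={clicked}, click rate={clicked / shown:.2f}")
```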

3. How this is done

Browsing behavior on websites, and article exposures and clicks in mobile clients, produce a very large volume of data, typically counted in the billions. The traditional approach of writing statistics directly into a database therefore cannot handle this work. (Example of the traditional approach: in a WordPress blog, every time a user reads an article, MySQL increments that article's read count by 1.)

So big data development computes these metrics from logs.

For example: logs from backend services (Apache, Tomcat, WebLogic, nginx).

Take the Apache access log of my personal website as an example.

 


In this log, the number of lines whose URL field starts with /year (marked in red in the screenshot) is the number of times the site's article pages were accessed, and the number of lines whose URL starts with /category (marked in blue) is the number of times the site's category pages were accessed.
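The counting described here can be sketched as follows. The sample lines are invented, written in Apache's common log format, and are not taken from the author's site; `count_by_prefix` and the regular expression are illustrative choices, not the author's actual tooling.

```python
import re
from collections import Counter

# Made-up sample lines in Apache common log format (documentation IP range).
SAMPLE_LOG = '''\
203.0.113.5 - - [10/Mar/2020:10:00:01 +0800] "GET /year/2019/some-post.html HTTP/1.1" 200 5120
203.0.113.7 - - [10/Mar/2020:10:00:05 +0800] "GET /category/hadoop/ HTTP/1.1" 200 2048
203.0.113.5 - - [10/Mar/2020:10:01:12 +0800] "GET /year/2020/another-post.html HTTP/1.1" 200 4096
203.0.113.9 - - [10/Mar/2020:10:02:30 +0800] "GET /favicon.ico HTTP/1.1" 404 209
'''

# Captures the URL from the quoted request part, e.g. "GET /year/... HTTP/1.1".
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) [^"]*"')


def count_by_prefix(lines, prefixes=("/year", "/category")):
    """Count log lines whose requested URL starts with each prefix."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        url = match.group(1)
        for prefix in prefixes:
            if url.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


counts = count_by_prefix(SAMPLE_LOG.splitlines())
print(counts["/year"], counts["/category"])  # 2 1
```

On a single small log file a script like this is enough; the big data tools named earlier become necessary when the logs no longer fit on one machine.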

Of course, this log of mine cannot be used to count users. Counting users requires recording the current user's unique identifier on every log line and then deduplicating; the number of identifiers left after deduplication is the number of users. No unique user identifier is reported here.

So how is the number of users counted? Internet companies generally create a unique user identifier on their own pages or clients and then actively report it to their own log servers.

The main difficulties in big data are:

The logs are too big. At a larger Internet company, one business line produces several TB of logs a day, and dozens or hundreds of TB a day at even bigger companies is no surprise. You need to master big data technologies such as the previously mentioned Hadoop, Hive, and so on.

The timeliness of the data. For offline computation, the previous day's logs have generally all been received by midnight, so the question is by what time the previous day's metrics can be computed. That depends on each company's requirements. For real-time computation, you need to master the relevant real-time technologies. For example: counting the number of users online on a website every 5 minutes.

The accuracy of the data. (This is the most important point. The job of big data development is to produce statistics, so if the statistics are not accurate ....)

Monitoring, monitoring, monitoring: whether a task has failed, whether data was output, and whether the output data is anomalous.

Disaster recovery, disaster recovery, disaster recovery: how to remedy a failed task. For example, if a real-time task is missing the data for 13:00 to 14:00 for some reason, how do you backfill it?
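The real-time example mentioned in the list above, counting online users every 5 minutes, can be sketched as grouping events into fixed 5-minute windows and deduplicating users inside each window. This is an in-memory illustration only: the event tuples and the name `users_per_window` are invented, and a production job would run on a streaming engine rather than a Python loop.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_MINUTES = 5

# Hypothetical reported events: (timestamp, user_id). Invented sample data.
heartbeats = [
    ("2020-03-10 12:00:10", "u1"),
    ("2020-03-10 12:01:45", "u2"),
    ("2020-03-10 12:04:59", "u1"),
    ("2020-03-10 12:05:00", "u3"),
    ("2020-03-10 12:09:30", "u1"),
]


def users_per_window(events):
    """Map each 5-minute window start to the number of distinct users in it."""
    users = defaultdict(set)
    for ts_text, user_id in events:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        # Round the timestamp down to the start of its 5-minute window.
        window_start = ts.replace(
            minute=ts.minute - ts.minute % WINDOW_MINUTES, second=0
        )
        users[window_start].add(user_id)
    return {start: len(ids) for start, ids in sorted(users.items())}


for start, count in users_per_window(heartbeats).items():
    print(start.strftime("%H:%M"), count)
```

Because each window is computed independently, a window whose data went missing can be recomputed on its own once the logs for that period are available again, which is one simple way to think about backfilling.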

Comparing big data development with general business development

Before moving into big data development, I used Java to develop business systems, for example HR systems (attendance, payroll, etc.) and billing systems.

Here is my personal understanding of business system development and big data development:

Business systems:

In one sentence: all kinds of CRUD operations on a database.

The difficulties are mainly:

Understanding complex business logic (for example, calculating wages: basic salary, social insurance contributions, attendance bonus, subsidies, reimbursements, bonuses, overtime pay, and so on all have to be calculated).

Keeping the online service running stably, for example under the high-concurrency load faced by sites such as Facebook and Taobao.

Big data development:

In one sentence: all kinds of operations on strings.

The difficulties are mainly:

The timeliness of the data. Take real-time data: if you want to know the number of users in the 10 minutes from 12:00 to 12:10 but the calculation is not finished until 20:00 that evening, the result is meaningless. Another example that everyone should have experienced: when you scroll through news on your phone and tap a certain article, then keep scrolling, articles similar to the one you tapped soon start to appear. Based on your clicks, the app promptly recommends things you are more likely to tap.

The accuracy of the data. Its importance is self-evident.

The stability and disaster recovery of the data.


Origin blog.csdn.net/mnbvxiaoxin/article/details/104762421