Six Recommendations on the Big Data Framework Hadoop

I was once asked, "How much experience do you have with big data and Hadoop?" I told them that I had been using Hadoop, but the data sets I work with rarely exceed a few TB.

They asked, "Can you use Hadoop to do simple grouping and aggregation?" I said of course I could, and told them I just needed to see some examples of the file format.

They handed me a flash drive containing all 600MB of their data — apparently not a sample, but the entire data set. For reasons I cannot understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.

Hadoop is in fact quite limited. Hadoop lets you run one kind of general-purpose computation, which I'll illustrate in pseudocode below:


Goal: count the number of books in the library.

Map: You count the books on the odd-numbered shelves, and I count the books on the even-numbered shelves. (The more people we have, the faster the counting goes.)

Reduce: We add up the counts each of us produced.

The only things we get to write ourselves are F(k, v) and G(k, v); apart from performance optimizations in the intermediate step, everything else is fixed.
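To make that concrete, here is a minimal sketch of the model in plain Python — the shelf counts are made-up toy data, and a real Hadoop job would of course distribute F and G across many machines:

```python
from collections import defaultdict

# Toy data: shelf number -> number of books on that shelf (assumed example values).
shelves = {1: 40, 2: 35, 3: 52, 4: 28}

def map_fn(shelf_number, book_count):
    """F(k, v): emit intermediate (key, value) pairs.
    Every shelf contributes its count under one shared key."""
    yield ("total_books", book_count)

def reduce_fn(key, values):
    """G(k, v): combine all values that share a key."""
    return key, sum(values)

# The "framework" part: shuffle intermediate pairs by key, then reduce each group.
intermediate = defaultdict(list)
for shelf, count in shelves.items():
    for key, value in map_fn(shelf, count):
        intermediate[key].append(value)

results = dict(reduce_fn(k, vs) for k, vs in intermediate.items())
print(results)  # {'total_books': 155}
```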

It forces you to do all of your computation, grouping, and aggregation inside Map and Reduce — working this way is like wearing a straitjacket, and many computations are actually better suited to other models. The only reason to put on the straitjacket is that it scales to very large data sets, and in most cases your data is probably several orders of magnitude smaller than that.

However, because "big data" and "Hadoop" are buzzwords, many people are willing to put on the straitjacket even though they don't actually need Hadoop.

First, what if my data is a few hundred megabytes? Excel may not be able to load it.

Data that is "big" for Excel is not big data, and there are other excellent tools for it — my favorite is Pandas. Pandas is built on top of the Numpy library, so it can hold hundreds of megabytes of data in memory in an efficient vectorized format. On the laptop I bought three years ago, Numpy can multiply 100,000,000 floating-point numbers together in the blink of an eye. Matlab and R are also excellent tools.
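As a quick illustration, a few hundred megabytes of CSV fit comfortably in memory with Pandas; the file name and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Load a few hundred MB of CSV straight into memory (hypothetical file and columns).
df = pd.read_csv("sales.csv")

# Vectorized grouping and aggregation -- no cluster required.
summary = df.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(summary)

# Numpy multiplies 100 million floats together in well under a second on a recent laptop.
x = np.random.rand(100_000_000)
y = np.random.rand(100_000_000)
z = x * y
print(z[:5])
```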

For a few hundred megabytes of data, the typical approach is to write a simple Python script that reads the file line by line, processes each line, and writes the result to another file.
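Something along these lines (the file names and the filtering rule are just placeholders):

```python
# Read one file line by line, transform or filter each line, write to another file.
# Input/output names and the filter condition are illustrative only.
with open("input.csv") as src, open("output.csv", "w") as dst:
    header = src.readline()
    dst.write(header)
    for line in src:
        fields = line.rstrip("\n").split(",")
        # Example rule: keep rows whose third column is non-empty.
        if len(fields) > 2 and fields[2]:
            dst.write(line)
```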

Second, what if my data is 10GB?

I bought a new laptop with 16GB of RAM and a 256GB SSD. Loading a 10GB CSV file into Pandas actually takes far less memory than you might expect, because a numeric string such as "17284832583" is stored as a 4-byte or 8-byte integer, and "284572452.2435723" is stored as an 8-byte double-precision float rather than as a string.

In the worst case, you may not be able to load all of the data into memory at the same time.
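Even then, Pandas can stream the file in chunks and you can declare compact numeric dtypes up front, so you never hold more than one chunk in memory at a time. The file name, columns, and dtypes below are assumptions for the sake of the sketch:

```python
import pandas as pd

# Explicit dtypes keep numeric columns as compact integers/floats instead of Python objects.
dtypes = {"user_id": "int64", "amount": "float64"}

totals = {}
# Stream the large CSV in 1-million-row chunks so memory use stays bounded.
for chunk in pd.read_csv("big.csv", dtype=dtypes, chunksize=1_000_000):
    grouped = chunk.groupby("user_id")["amount"].sum()
    for user, amount in grouped.items():
        totals[user] = totals.get(user, 0.0) + amount

print(len(totals), "users aggregated")
```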

Third, what if my data is 100GB, 500GB, or 1TB?

Buy a 2TB or 4TB hard disk, install PostgreSQL on your desktop PC or a server, and let it handle the data.
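As a rough sketch of the loading step, using the psycopg2 driver — the connection string, table definition, and CSV layout are all assumptions:

```python
import psycopg2

# Connection parameters, table name, and columns are hypothetical.
conn = psycopg2.connect("dbname=analytics user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id  BIGINT,
        ts       TIMESTAMP,
        amount   DOUBLE PRECISION
    )
""")

# Bulk-load a large CSV with COPY, which is far faster than row-by-row INSERTs.
with open("events.csv") as f:
    cur.copy_expert("COPY events FROM STDIN WITH (FORMAT csv, HEADER true)", f)

conn.commit()
cur.close()
conn.close()
```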

Fourth, Hadoop is far inferior to SQL and Python scripts

In terms of expressing computations, Hadoop is weaker than SQL and weaker than a Python script.

SQL is a very straightforward query language, well suited to business analytics. SQL queries are usually quite simple, and they are also very fast — if your database uses the right indexes, queries that take more than a second or two are the exception.
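For example, here is that kind of query against an in-memory SQLite database standing in for a real server, with made-up table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.5), (1, "east", 7.25), (3, "west", 3.0)],
)

# With the right index, lookups and aggregations can hit the index instead of scanning everything.
cur.execute("CREATE INDEX idx_orders_region ON orders (region)")

cur.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
)
print(cur.fetchall())  # [('east', 2, 17.25), ('west', 2, 28.5)]

conn.close()
```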

Hadoop has no concept of an index; it only knows full table scans. Hadoop is also full of leaky abstractions — I spent far more time dealing with Java memory errors, file fragmentation, and cluster contention than I spent on the data analysis itself.

If your data is not structured like a SQL table (for example plain text, JSON objects, or binary objects), it is usually easier to write a small Python script that processes your data row by row: store the data in files, process each file, and so on. Doing the same thing with Hadoop is a real hassle.
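For instance, newline-delimited JSON is easy to handle one record at a time; the field names here are invented:

```python
import json

# Process newline-delimited JSON one record at a time; file and field names are illustrative.
counts = {}
with open("events.jsonl") as f:
    for line in f:
        record = json.loads(line)
        event_type = record.get("type", "unknown")
        counts[event_type] = counts.get(event_type, 0) + 1

for event_type, n in sorted(counts.items()):
    print(event_type, n)
```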

Compared with SQL or a Python script, Hadoop is also much slower. With the right index, a SQL query is almost always fast — PostgreSQL simply looks up the index and retrieves the exact key. Hadoop, by contrast, does a full table scan and then re-sorts the entire table. The re-sort can be made fast by sharding the table across many machines, but when processing binary objects Hadoop also needs to make repeated round trips to the name node just to locate and process the data — a job that is better suited to a simple Python script.

Fifth, what if my data is more than 5TB?

Then you should consider using Hadoop — you don't have many other choices.

The only real advantage of Hadoop is that it scales very well. If you have a single table containing many terabytes of data, a Hadoop full table scan is a reasonable option. If your tables are not that large, avoid Hadoop like the plague — solving the problem with traditional methods will be far easier.

Sixth, Hadoop is an excellent tool

I don't hate Hadoop; I choose it when no other tool can handle my data. In addition, I recommend using Scalding rather than Hive or Pig. Scalding lets you write Hadoop job chains in Scala and hides the MapReduce plumbing underneath.
