A Small-Data Power Tool for the Big Data Era - asqlcell

Since Google published the classic MapReduce paper and Yahoo open sourced Hadoop as an implementation, "big data" has been a hot topic in the industry. With steadily improving machine performance and a growing ecosystem of tools and frameworks, data analysis has shifted from sampling to processing entire datasets. The hidden value that sampling used to throw away can now be recovered: data owners can discover new associations, extract finer-grained features, and make more accurate predictions.

When it comes to building a big data system, lofty concepts immediately start flying around: data warehouses, data lakes, lakehouses, stream computing, and so on. For a large enterprise that generates huge amounts of data every day, this infrastructure is indeed indispensable, and it may even be acceptable to sacrifice some flexibility for the sake of system construction. But do most small and medium-sized enterprises, or individual analysts inside large institutions, really need such a heavyweight system?

For most data workers, the current workflow looks like this: after getting the raw data, you either import it into Excel and start working, or request that it be loaded into the data warehouse and then work on the company's in-house platform. Plain Excel tops out at about 1 million rows, and performance often becomes unacceptable well before that. Excel is also unfriendly to engineering practice: mis-operations are hard to detect and record, and power-user features such as functions and VBA still have a high barrier in day-to-day work. The warehouse route depends on how fast the company's processes move; in some companies the actual analysis takes less than 30 minutes, but getting the data into the warehouse takes two days of paperwork.

Although Python is very popular and new data analysis tools keep appearing, SQL remains the language most data workers are comfortable with. Compared with procedural Python, SQL is declarative: you follow the specification and tell the system what you want, and the engine works out how to execute it, whereas with Python you have to design the algorithm yourself and tell the system how to run it. For data queries, SQL has a lower learning threshold and often more concise expressive power than Python. The question the industry keeps coming back to is: is there a tool that combines the advantages of Python and SQL and lets us analyze source data quickly?

Before exploring this question further, we have to answer: how much data counts as big data? A commonly cited rule of thumb (for example, on Baidu) puts consumer-facing (toC) data at tens of millions of records and business-facing (toB) data at around 100,000. Take a digitally transformed offline advertising company as an example: it operates 1 million offline digital display screens, releases about 1 million records of data every week, and each record is roughly 20 KB, so the data volume comes to about 20 GB per week. A standard distributed big data engine handles this scale without trouble, but with hardware prices, especially memory and disk, falling rapidly, could this volume be processed directly on a single machine?
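
As a quick sanity check on that sizing (using the figures above: 1 million records per week at roughly 20 KB each):

records_per_week = 1_000_000
record_size_kb = 20
weekly_gb = records_per_week * record_size_kb / (1024 ** 2)  # KB to GB
print(f"~{weekly_gb:.0f} GB of new data per week")  # about 19 GB, close to the 20 GB cited above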

The answer is yes, and this is the "small data" we are going to talk about today: data sets that a single machine can load and process. The tool I would like to recommend is asqlcell, a JupyterLab plugin that can be installed quickly with "pip install asqlcell". With this plugin you can use csv files directly as data sources, do your analysis and processing in SQL, and mix Python and SQL in the same notebook, playing to the strengths of each language and solving problems more efficiently.

Let's look at an example. Download the 2020 US election data from Kaggle, open JupyterLab, create a new notebook, and import the asqlcell plugin.

import asqlcell

After importing the plugin, a SQL cell is written as follows: the %%sql cell magic tells the notebook that the body of the cell is a SQL statement, and the name after it is the dataset that will receive the result.

%%sql dataset_name

your sql script

Next, we load the presidential election results for each county into the dataset president_county_candidates, which is a pandas DataFrame and can be used as an input source for other SQL queries.

%%sql president_county_candidates
select * from 'us2020/president_county_candidate.csv'

The result after execution is shown below. The csv file is about 1.5 MB with more than 30,000 rows, and loading completes in about 30 milliseconds. The returned result set supports paging so the full data can be browsed, and for numeric columns a small distribution chart is added to the column header.
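
Since the dataset is reported above to be an ordinary pandas DataFrame, it should also be usable as a regular Python variable in the notebook, so the next cell can inspect it with plain pandas calls. A minimal sketch (this availability is an assumption on my part; the column names follow the Kaggle file):

# the %%sql cell above stored its result under this name
print(type(president_county_candidates))   # expected: a pandas DataFrame
print(president_county_candidates.shape)   # (row count, column count)
president_county_candidates.head()         # first rows: state, county, candidate, total_votes, ...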

What we want to do next is find, for each state, the candidate with the most votes and tally their votes, based on the county-level results. If, like me, you are not deeply familiar with SQL and Python, you can get to the answer step by step:

1. Group the raw data by state and candidate and sum up each candidate's votes

%%sql state_result
select state, candidate, sum(total_votes) as total
from president_county_candidates
group by state, candidate
order by state, total desc
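
For comparison, roughly the same aggregation written in pandas instead of SQL, assuming as above that the loaded dataset is available as a DataFrame variable (a sketch for illustration, not part of asqlcell):

# group by state and candidate, sum the votes, then sort like the SQL above
state_result_pd = (
    president_county_candidates
    .groupby(["state", "candidate"], as_index=False)["total_votes"]
    .sum()
    .rename(columns={"total_votes": "total"})
    .sort_values(["state", "total"], ascending=[True, False])
)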

2. Based on the result of the previous step, use a subquery to first find the highest vote count in each state, and then select the rows whose vote counts match it.

%%sql state_winner
select * from state_result 
where (state, total) in (
    select (state, max(total)) 
    from state_result 
    group by state
)
order by total desc
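
The same "top candidate per state" result can also be expressed with a window function, which DuckDB supports. Here is a sketch of an equivalent query using the %sql line magic shown in the next step (my own variant, not from the original example):

state_winner_alt = %sql select state, candidate, total from (select *, rank() over (partition by state order by total desc) as rnk from state_result) where rnk = 1 order by total desc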

3. We can also use mixed programming to merge the two cells above into one, using the %sql line magic:

state_result = %sql select state, candidate, sum(total_votes) as total from president_county_candidates group by state, candidate order by state, total desc
state_winner = %sql select * from state_result where (state, total) in (select (state, max(total)) from state_result group by state) order by total desc
state_winner
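
Because state_winner is just a pandas DataFrame at this point, the analysis can continue in plain Python in the next cell, for example (a sketch using the column names from above):

# number of states (plus DC) carried by each winning candidate
print(state_winner["candidate"].value_counts())

# votes each winner received in the states they carried
print(state_winner.groupby("candidate")["total"].sum())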

Compared with the traditional data processing workflow, asqlcell has the following advantages:

  1. The size of the data you can process is limited only by memory. For a small or medium-sized enterprise whose business is not primarily online, one server is enough to hold and process all the data, out of the box.
  2. Loading is fast: a 100 MB csv file with about 2.5 million rows loads in under 0.7 seconds, and if the raw data is preprocessed into the column-compressed parquet format, loading takes only about 0.2 seconds (a sketch of this conversion follows the list).
  3. The analysis engine behind the scenes is DuckDB running in memory: it is fast, supports OLAP-style SQL well, and has a low learning threshold.
  4. Python and SQL can be mixed in the same notebook, which lowers the learning cost and lets each language play to its strengths.
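
Converting the csv to parquet, as mentioned in point 2, is a one-off preprocessing step. A minimal sketch with pandas (it assumes pyarrow or fastparquet is installed; the file names are only examples):

import pandas as pd

# read the csv once and write a column-compressed parquet copy next to it
df = pd.read_csv("us2020/president_county_candidate.csv")
df.to_parquet("us2020/president_county_candidate.parquet")

The parquet file should then work as a SQL source in the same way as the csv, which is where the faster load time quoted above comes from.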

As hardware costs continue to fall, the scale of data that counts as "small data" will keep growing. For enterprises whose business is not primarily online, being able to analyze GB-scale data out of the box would greatly improve productivity and cut infrastructure costs.

Project code address: GitHub - datarho/asqlcell

Comments and feedback are welcome!

Originally published at blog.csdn.net/panda_lin/article/details/128993917