Use open source software to quickly build a data analysis platform

Recently, many data analysis platform products have emerged in China, such as Magic Mirror and Data View.

The goal of these products is self-service BI: using visualization to provide data exploration, plus machine learning and prediction features. Their benchmark products are Tableau and SAP Lumira. Because I used to develop data visualization features for Lumira, I was very interested in this area, so I tried these products and felt there was still a big gap between them and the benchmarks. That made me want to try building a simple data analysis platform out of open source software.

The code is here: https://github.com/gangtao/dataplay2

Without further ado, here's the architecture diagram:

[Architecture diagram]

Here are the main open source software packages used:

Server side:

  • Python and pandas, which handle the CSV processing and the analysis back end (described later in this post)

Client:

  • reactjs for the UI, ECharts for charting, D3 for the model drawings, and DataTables plus papaparse for displaying and parsing CSV data (all described later in this post)

Development build tools:

  • nodejs  https://nodejs.org/en/ 

This should need no introduction either.

  • babel https://babeljs.io/ 

A JavaScript compiler that converts ES6 code into code browsers can execute; here it is mainly used to compile the JSX that reactjs uses.

Now that we've listed all this open source software, let's look at what dataplay2 does, and then at the role each of these packages plays and why I chose it.

Before getting into the main topic, a word about the name dataplay2. The "dataplay" part is easy to understand: I want to build an easy-to-use data platform that is as pleasant to use as playing. But why 2? Because the software is second-rate? Of course not. I actually wrote a dataplay before this one, with a slightly different architecture. To use ggplot in R as a grammar-driven visualization engine, I used an R/Python bridge on the server side, and visualization operations in the client generated ggplot commands. The advantage was a unified data model and grammar driving the visual analysis, which made it easy for users to explore data. But that architecture was too complicated: having both R and Python on the server side was more than I could stand, so I gave it up. The new dataplay2 uses the ECharts charting library for visualization; its pros and cons are discussed later.

Running dataplay2 is very simple. After downloading the code from GitHub, I recommend installing Anaconda, which provides all the Python dependencies. Go to the dataplay2/package directory and run:

python main.py

One supplementary note: because React's JSX needs to be compiled, you need to run the following commands to compile it with babel:

## install node first
## cd package/static
npm install -g babel-cli
npm install babel-preset-es2015 --save
npm install babel-preset-react --save
babel --presets es2015,react --watch js/ --out-dir lib/

In addition, you need to use bower to install all of the client-side dependencies:

## install bower first
## cd package/static
bower install

You can also refer to package/static/package.json for the required dependencies. A simpler build script should eventually be integrated to do all of this. The compiled JS files go into the lib directory; when you modify a source file in the js directory, babel (running with --watch) triggers a recompile and writes a new JS file into lib.

Then open localhost:5000 in your browser to start the client.

First, we go to the Data menu.

On this page, users can browse existing data or upload a CSV file to add a data set.

Here is a brief description of how this part is implemented.

Data upload uses a file input control, and the data table uses the DataTables control. For convenience, CSV files are stored directly in the local file system and processed with pandas on the server side. The client reads the CSV file through a REST API, parses it with papaparse, and displays it in the data table. This was done purely for convenience: the whole POC was built in three or four days during the holidays, so I went with whatever was easiest. A better approach would be to parse the CSV file in Python on the server side.

Note that there are strict requirements for the uploaded CSV file: there must be a header on the first line and no blank line at the end.
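To make this flow concrete, here is a minimal sketch of what the server side could look like, assuming a Flask application (dataplay2 listens on port 5000); the route paths and data directory are hypothetical, not the project's actual API:

import os
from flask import Flask, Response, request
import pandas as pd

app = Flask(__name__)
DATA_DIR = "./data"  # hypothetical location for the uploaded CSV files

@app.route("/data/<name>", methods=["POST"])
def upload_data(name):
    # For convenience, store the uploaded file directly in the local file system.
    request.files["file"].save(os.path.join(DATA_DIR, name + ".csv"))
    return "OK"

@app.route("/data/<name>", methods=["GET"])
def get_data(name):
    # Read the stored CSV with pandas and return it as raw CSV text,
    # which the client then parses with papaparse.
    df = pd.read_csv(os.path.join(DATA_DIR, name + ".csv"))
    return Response(df.to_csv(index=False), mimetype="text/csv")

if __name__ == "__main__":
    app.run(port=5000)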

Once you have the data, you can start your analysis. First, let's look at visual analysis: click the menu Analysis/Visualization.

For example, we select the Iris data source to do a Scatter Plot.

The main work of the visualization layer is transforming the tabular data from the CSV into ECharts' data structures according to the data binding. Because ECharts does not have a unified data model, each chart type needs its own data transformation logic (code: package/static/js/visualization).
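As an illustration of the kind of per-chart transformation this requires (the real code is JavaScript under package/static/js/visualization), here is a sketch in Python that turns row-oriented CSV data into an ECharts scatter option; the function name and field bindings are invented for the example:

def to_scatter_option(rows, x_field, y_field, series_field):
    # Group the points by the bound series field (e.g. the Iris species),
    # because ECharts expects one series entry per group.
    series = {}
    for row in rows:
        series.setdefault(row[series_field], []).append([row[x_field], row[y_field]])
    # Assemble the ECharts option: one scatter series per group.
    return {
        "legend": {"data": list(series)},
        "xAxis": {"name": x_field},
        "yAxis": {"name": y_field},
        "series": [{"name": name, "type": "scatter", "data": points}
                   for name, points in series.items()],
    }

# e.g. to_scatter_option(rows, "sepal_length", "sepal_width", "species")
# for rows like {"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}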

The main chart types currently supported are Pie, Bar, Line, Treemap, Scatter, and Area.

Having used it now, I feel ECharts has obvious strengths and weaknesses. The auxiliary features it provides are very good: you can easily add auxiliary lines and annotations, and save charts as images. However, because it lacks a unified data model, it is troublesome to extend. I hope to find time to try plotly. Highcharts, of course, is a very mature charting library that needs no introduction.

In fact, I was hoping to find a D3 implementation of ggplot, such as http://benjh33.github.io/ggd3/, but unfortunately that project seems inactive.

In addition to visualization-based analysis capabilities, there are machine learning capabilities.

Classification

The classification algorithms available are KNN, Naive Bayes, and SVM.

If exactly two features are selected for prediction, I use D3 to draw the model's decision regions; with more than two features there is no way to draw them.

The user can then choose to make predictions based on that model.
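Here is a minimal sketch of this classification flow, assuming scikit-learn on the server side (the post does not name the library); the helper functions are hypothetical. The mesh-grid helper shows the usual way to compute the decision regions that D3 then draws in the two-feature case:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# The three algorithms the user can choose from; defaults for all hyperparameters.
CLASSIFIERS = {"knn": KNeighborsClassifier, "bayes": GaussianNB, "svm": SVC}

def train(algorithm, X_train, y_train):
    model = CLASSIFIERS[algorithm]()
    model.fit(X_train, y_train)
    return model

def boundary_grid(model, X, steps=100):
    # With exactly two features, predict over a mesh grid covering the data,
    # so the client can render the predicted label of each grid cell.
    (x_min, y_min), (x_max, y_max) = X.min(axis=0), X.max(axis=0)
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    labels = model.predict(np.c_[xx.ravel(), yy.ravel()])
    return xx, yy, labels.reshape(xx.shape)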

The clustering and regression functions work basically the same way as classification.

Clustering

The only clustering algorithm implemented so far is K-means.

Linear regression

Logistic regression
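For completeness, a matching sketch of the clustering and regression pieces, under the same (unconfirmed) scikit-learn assumption:

from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

def cluster(X, k=3):
    # K-means is the only clustering algorithm implemented so far.
    return KMeans(n_clusters=k).fit_predict(X)

def regress(X, y, logistic=False):
    # Linear regression for numeric targets, logistic regression for categorical ones.
    model = LogisticRegression() if logistic else LinearRegression()
    model.fit(X, y)
    return model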


That's it for the basic functions. Here are some features I would like to add:

  • Data sources

    The only data source currently is CSV files; more data sources could be supported, such as databases/data warehouses, REST calls, streams, and so on.

  • Data model

    The current data model is quite simple: a pandas DataFrame, i.e. a flat CSV-like table. A database could be introduced, and support for hierarchical data also needs to be added.

  • Data wrangling

    Data wrangling is a necessary preparation step for data analysis, and there are many products in the industry focused on data preparation, such as Paxata and Trifacta.

    This version of dataplay has no data transformation or preparation features. pandas actually has very rich data wrangling functions; I hope to wrap a data wrangling DSL on top of them so users can prepare data quickly (see the sketch after this list).

  • Visualization library

    Baidu's ECharts is a very good visualization library, but it is not quite good enough for data exploration; I would like a set of front-end visualization libraries similar to ggplot. In addition, maps and hierarchical charts are common needs in data analysis.

    The ability to add more chart types is also needed.

  • Dashboard features

    This version of dataplay has no dashboard feature, which is standard in data analysis software and must be added. pyxley seems like a good choice, and it matches dataplay's architecture (Python, reactjs); I may try it when I have time.

  • Machine learning and prediction

    dataplay currently implements only the simplest machine learning algorithms. I think the direction should be to become more user-oriented and simpler: the user gives only simple inputs, such as the target attribute to predict and the attributes used for prediction, and the algorithm is selected automatically. Adding new algorithms also needs to become more convenient.
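As an illustration of the data wrangling DSL idea from the list above, here is a hypothetical sketch of a thin, chainable wrapper over pandas; all of the names and verbs are my own invention, not part of dataplay:

import pandas as pd

class Wrangler:
    def __init__(self, df):
        self.df = df

    def filter(self, expr):
        # Keep rows matching a pandas query expression, e.g. "sepal_length > 5".
        return Wrangler(self.df.query(expr))

    def derive(self, name, func):
        # Add a computed column derived from the existing ones.
        out = self.df.copy()
        out[name] = func(out)
        return Wrangler(out)

    def summarize(self, by, **aggs):
        # Group and aggregate, e.g. summarize(by="species", mean_len=("sepal_length", "mean")).
        return Wrangler(self.df.groupby(by).agg(**aggs).reset_index())

# Usage sketch:
# result = (Wrangler(pd.read_csv("iris.csv"))
#           .filter("sepal_length > 5")
#           .summarize(by="species", mean_len=("sepal_length", "mean"))
#           .df)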

Finally, a few brief impressions:

  • Reactjs is really good. I have never liked MVC; React's componentization is more comfortable to work with, and development efficiency is genuinely high. I completed the whole project in three or four days of vacation, and React deserves a lot of the credit.

  • dataplay's features are still fairly weak, but the basic structure is in place. If you like it, feel free to extend it. I won't necessarily have time to keep enhancing it, but you are welcome to discuss it with me.

Update:

Because many readers reported that it would not run properly, I made a Dockerfile; you can refer to https://github.com/gangtao/dataplay2/tree/master/docker to build it. I hope it solves the problems people had getting it to run.
The image has been published to Docker Hub: https://hub.docker.com/r/naughtytao/dataplay/
