Big Data and Artificial Intelligence Technology Guide (2)


A detailed summary of the common technology frameworks and algorithms used in big data and artificial intelligence.

Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation.

Users can develop distributed programs without understanding the underlying details of distribution, making full use of the power of clusters for high-speed computing and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements and allows streaming access to the data in the file system.

The core of the Hadoop framework's design is HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.

1. Hadoop provides the storage layer: distributed storage for massive amounts of data.

2. MapReduce is the compute engine that processes the stored data, and it is fast: given one job, multiple machines cooperate, each taking its own share of the task and running at the same time. When all the machines finish, their partial results are collected and merged to complete the overall job, as the sketch below illustrates.
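To make the map and reduce phases concrete, here is a minimal word-count sketch in Python using Hadoop Streaming; the file names and the paths in the comments are illustrative, not part of any official example.

```python
#!/usr/bin/env python
# mapper.py -- emits (word, 1) for every word on stdin.
# Hadoop Streaming would run it with something like (paths illustrative):
#   hadoop jar hadoop-streaming.jar -input /in -output /out \
#       -mapper mapper.py -reducer reducer.py
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts per word; Hadoop delivers input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```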

Spark

Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 by the AMPLab at the University of California, Berkeley, was open-sourced in 2010, and later became an Apache project.

Compared with other big data and MapReduce technologies such as Hadoop and Storm, Spark has the following advantages.

First of all, Spark provides us with a comprehensive and unified framework to manage the big data processing needs of data sets and data sources with different properties (batch data or real-time streaming data; text data, graph data, etc.).

Spark can run applications in a Hadoop cluster up to 100 times faster in memory, and even up to 10 times faster on disk.

Spark allows developers to write programs quickly in Java, Scala, or Python. It comes with a built-in set of more than 80 high-level operators, and it can be used to query data interactively from the shell.

In addition to Map and Reduce operations, it also supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities individually or combine them in a single data pipeline.

Hadoop and Spark: Hadoop, as a big data processing technology, has a history of about ten years and is regarded as the preferred solution for processing big data collections. MapReduce is an excellent solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing pipeline requires a Map phase and a Reduce phase, and to use this model, every use case has to be converted into the MapReduce pattern.

1. Spark does not store data; it only computes. When computing, it typically uses the data in Hadoop (HDFS), though it can also use data on a local server.

2. Multiple capabilities: Spark Streaming for stream computing, Spark SQL, a machine learning library (MLlib), graph computing (GraphX), and deep learning support; a sketch follows.
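As a minimal sketch of those capabilities, the following PySpark snippet counts words with two high-level operators and then queries the result with Spark SQL; the HDFS input path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Word count with high-level operators (the input path is hypothetical).
lines = spark.sparkContext.textFile("hdfs:///input/words.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# The same result can also be queried with Spark SQL.
df = counts.toDF(["word", "cnt"])
df.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, cnt FROM word_counts "
          "ORDER BY cnt DESC LIMIT 10").show()
```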

Hive

Hive is a data warehouse tool based on Hadoop. It maps structured data files onto database tables, provides a simple SQL-like query capability, and converts SQL statements into MapReduce tasks for execution.

Its advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing a dedicated MapReduce application, which makes it very suitable for statistical analysis over a data warehouse. Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.

Hive defines a simple SQL-like query language called HQL, which allows users familiar with SQL to query data. At the same time, this language also allows developers familiar with MapReduce to develop custom mappers and reducers to handle complex analysis tasks that the built-in mappers and reducers cannot complete.

Hive imposes no special data format: it works well on top of Thrift, supports configurable delimiters, and also allows users to specify their own data formats.

1. The first choice for a distributed data warehouse, and the de facto standard at Internet companies

2. Data analysis and processing are expressed as SQL statements, which is very powerful (see the sketch after this list)
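As a hedged sketch of what item 2 looks like in practice, here is HQL run from Python through PySpark's Hive support; the table `page_views` and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes a Spark build with Hive support that can reach the Hive metastore.
spark = (SparkSession.builder
         .appName("HiveDemo")
         .enableHiveSupport()
         .getOrCreate())

# HQL reads like ordinary SQL; classic Hive compiles it into MapReduce jobs.
spark.sql("""
    SELECT dt, COUNT(*) AS pv
    FROM page_views
    WHERE dt >= '2020-01-01'
    GROUP BY dt
    ORDER BY pv DESC
""").show()
```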

Storm

Storm is an open source distributed real-time computing system that can handle a large number of data streams simply and reliably.

Storm has many usage scenarios: such as real-time analysis, online machine learning, continuous computing, distributed RPC, ETL and so on.

Storm scales horizontally, is highly fault tolerant, guarantees that every message is processed, and is very fast (on a small cluster, each node can process millions of messages per second).

Storm is easy to deploy, operate, and maintain; more importantly, applications can be developed in any programming language.

A Storm application is structured as a topology, consisting of streams (data flows), spouts (stream sources), and bolts (stream operators).

1. Distributed real-time computing with millisecond-level latency. It is called "real time" in contrast to offline batch processing; a truly instantaneous response is more like a Java web request/response.

2. Near-real-time analysis, near-online machine learning, and similar near-real-time scenarios: there is some delay, but results no longer wait for a batch job that runs once a day or every few hours. A toy sketch of the topology concept follows.
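To illustrate the stream/spout/bolt vocabulary, here is a toy, single-process Python simulation; it is not the Storm API, only a sketch of how a topology chains a spout into bolts.

```python
# A toy, single-process simulation of Storm's spout -> bolt data flow.
# NOT the actual Storm API; it only illustrates the topology concepts.
import random

def sentence_spout():
    """Spout: a data-stream source that keeps emitting tuples."""
    sentences = ["the quick brown fox", "jumps over the lazy dog"]
    for _ in range(5):
        yield random.choice(sentences)

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: maintains running counts per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring spout -> bolt -> bolt mimics how a topology chains stream operators.
print(count_bolt(split_bolt(sentence_spout())))
```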

Hbase

HBase is the Apache Hadoop database, providing random, real-time read/write access to big data. It is an open-source implementation of Google's Bigtable. HBase's goal is to store and process large-scale data; more specifically, to handle very large tables, with billions of rows and millions of columns, using only commodity hardware.

HBase is an open-source, distributed, multi-versioned, column-oriented store. It can use the local file system directly, or Hadoop's HDFS, as its file storage layer. To improve data reliability and system robustness, and to fully exploit HBase's capacity for processing big data, HDFS is the better choice.

In addition, HBase stores sparse data. Specifically, HBase's data model sits between a pure key/value mapping and relational data.

The data stored in HBase is logically one very large table, whose columns can be added dynamically as needed, and the data in each cell can exist in multiple versions, distinguished by timestamp.

HBase also sits in the middle of the stack: it delegates storage downward (to the file system) while serving computation upward.

1. HBase is a database built on Hadoop; it cannot run without the Hadoop platform, and it uses Hadoop HDFS for its underlying storage

2. Distributed key/value storage, where one key maps to many values, i.e., multiple columns (see the sketch below)
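As a hedged illustration of the key-to-many-columns model, here is a sketch using the third-party happybase Python client (Thrift-based); the host, table, and column names are hypothetical.

```python
# A hedged HBase client sketch using the third-party happybase library.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
table = connection.table("user_actions")                # hypothetical table

# Write: one row key, multiple column-family:qualifier cells -- the
# "one key -> many columns" model described above.
table.put(b"user_001", {
    b"info:name": b"alice",
    b"stats:clicks": b"42",
})

# Random, real-time read by row key.
row = table.row(b"user_001")
print(row[b"info:name"])
connection.close()
```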

Solr

Solr is an open-source, Java-based search server built on Lucene that is easy to add to web applications.

Solr provides faceted search (that is, counts aggregated over result sets), hit highlighting, and multiple output formats (including XML/XSLT and JSON). It is easy to install and configure, and it comes with an HTTP-based administration interface. Solr is used on many large-scale websites and is mature and stable.

Solr wraps and extends Lucene, so Solr basically follows Lucene's related terms. More importantly, the index created by Solr is fully compatible with the Lucene search engine library.

With proper configuration, and in some cases a little coding, Solr can read and use indexes built by other Lucene applications.

In addition, many Lucene tools (such as Nutch and Luke) can use indexes created by Solr. Solr's strong out-of-the-box search features can be used as-is, or extended to meet enterprise needs.

1. Used for full-text search; it powers the search features of many e-commerce and recruitment websites

2. The bottom layer is based on Lucene, the Java search library (see the sketch below)
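As a hedged sketch of Solr from application code, the snippet below uses the third-party pysolr library; the core URL and the document fields are hypothetical.

```python
# A hedged Solr client sketch using the third-party pysolr library.
import pysolr

# Hypothetical core URL; always_commit makes the demo writes visible at once.
solr = pysolr.Solr("http://localhost:8983/solr/products", always_commit=True)

# Index a couple of documents.
solr.add([
    {"id": "1", "title": "red running shoes", "price": 59.9},
    {"id": "2", "title": "blue hiking shoes", "price": 89.0},
])

# Full-text query; Solr returns ranked hits much like Lucene itself.
for result in solr.search("title:shoes"):
    print(result["id"], result["title"])
```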

Elasticsearch

Elasticsearch is an open-source search engine built on Apache Lucene™. Whether in the open-source or the proprietary world, Lucene can be considered the most advanced, best-performing, and most feature-complete search engine library to date.

Elasticsearch is not just Lucene plus full-text search. It can also be described as: a distributed real-time document store in which every field is indexed and searchable; and a distributed real-time analytics search engine that can scale out to hundreds of servers and handle petabytes of structured or unstructured data. Objects in document-oriented applications are rarely simple key-value lists; more often they have complex data structures such as dates, geographic locations, other objects, or arrays. Sooner or later you will want to store such objects in a database. Storing them in a relational database of rows and columns is like disassembling a rich, expressive object into a very large table: you have to flatten the object to fit the table schema (usually one column per field) and then rebuild it on every query.

Elasticsearch is document-oriented, meaning it stores entire objects or documents. But it does more than store them: it also indexes the content of every document so that it can be searched. In Elasticsearch you index, search, sort, and filter documents, not rows and columns. This way of treating data is quite different from the traditional approach, and it is one of the reasons Elasticsearch can perform complex full-text search.

1. It is a full-text search engine like Solr; the "E" in ELK refers to it.

2. It is also based on the Lucene Java library and is broadly similar to Solr (see the sketch below)
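A hedged sketch with the official Python client follows; the exact keyword arguments vary between client versions (this follows the 8.x style), and the index and documents are hypothetical.

```python
# A hedged Elasticsearch client sketch (official Python client, 8.x style).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Documents can hold nested structure; every field gets indexed.
es.index(index="people", id="1", document={
    "name": "Alice",
    "joined": "2020-01-01",
    "location": {"lat": 40.7, "lon": -74.0},
})

# Full-text search over the indexed documents.
resp = es.search(index="people", query={"match": {"name": "alice"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"])
```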

LDA (Latent Dirichlet Allocation) model


LDA is a generative probabilistic topic model: each document is represented as a mixture over latent topics, and each topic as a probability distribution over words, with Dirichlet priors over both distributions. A hedged example follows.
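As a hedged illustration, the sketch below fits a two-topic LDA model on a toy corpus with the gensim library; the corpus and the topic count are illustrative.

```python
# A hedged LDA sketch using the gensim library on a toy corpus.
from gensim import corpora, models

texts = [["apple", "banana", "fruit"],
         ["football", "goal", "match"],
         ["fruit", "juice", "apple"],
         ["match", "team", "goal"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit a 2-topic model; each document becomes a mixture over these topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```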

MinHash clustering

MinHash is a type of locality-sensitive hashing (LSH) that can be used to quickly estimate the similarity of two sets. MinHash was proposed by Andrei Broder and was originally used to detect duplicate web pages in search engines; it can also be applied to large-scale clustering problems.

Similarity measure: the Jaccard index [2] is a statistic used to measure the similarity of sets (its complement, 1 − J, is the Jaccard distance). For sets A and B, J(A,B) = |A ∩ B| / |A ∪ B|; in other words, the Jaccard coefficient of A and B equals the ratio of the number of elements in their intersection to the number of elements in their union. Clearly, the Jaccard coefficient takes values in [0,1].

Mahout MinHash in practice: the input and output are sequence (serialized) files, not text files, but whether the output is serialized can be configured through the debugOutput parameter.

Command: mahout minhash --input /vsm1/reuters-vectors/tfidf-vectors --output /minhash/output --minClusterSize 2 --minVectorSize 3 --hashType LINEAR --numHashFunctions 20 --keyGroups 3 --numReducers 1 -ow
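Independently of Mahout, here is a minimal from-scratch Python sketch of the MinHash idea: the fraction of signature positions where two sets agree estimates their Jaccard similarity.

```python
# A minimal MinHash sketch (illustrative only; Mahout differs in detail).
import random

def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    """Build n random linear hash functions x -> (a*hash(x) + b) mod prime."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * hash(x) + b) % prime for a, b in params]

def minhash_signature(items, hash_funcs):
    # For each hash function, keep the minimum hash value over the set.
    return [min(h(x) for x in items) for h in hash_funcs]

A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "date", "fig"}

hs = make_hash_funcs(200)
sig_a, sig_b = minhash_signature(A, hs), minhash_signature(B, hs)

# Fraction of matching signature positions estimates J(A, B).
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(hs)
exact = len(A & B) / len(A | B)
print(f"estimated {estimate:.2f} vs exact {exact:.2f}")  # exact = 3/5 = 0.60
```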

Kmeans clustering

K-means is the most classic partition-based clustering method and one of the top-ten classic data mining algorithms. Its basic idea is to cluster around k center points in space, assigning each object to the nearest center; the value of each cluster center is then updated iteratively until the best clustering result is obtained. Suppose the sample set is to be divided into c classes.

The algorithm is described as follows:
(1) choose suitable initial centers for the c classes;
(2) in the k-th iteration, for every sample, compute the distance to each of the c centers and assign the sample to the class of the nearest center;
(3) update each class center, for example as the mean of the samples in that class;
(4) if, after updating by the iterative method of (2) and (3), the values of all c cluster centers remain unchanged, the iteration ends; otherwise, continue iterating.
The biggest advantages of this algorithm are its simplicity and speed. The keys to the algorithm are the choice of initial centers and the distance formula.

Process: first select k of the n data objects as initial cluster centers; assign each of the remaining objects to the most similar cluster (represented by its center) according to its similarity (distance) to the cluster centers; then recompute the center of each new cluster as the mean of all objects in it; repeat this process until the standard measure function converges (the mean squared error is usually used as the standard measure function). The resulting k clusters have these properties: each cluster itself is as compact as possible, and the clusters are as separate as possible.
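A minimal from-scratch Python sketch of the steps above, on toy 2-D points (a production system would use a library such as scikit-learn):

```python
# A minimal k-means sketch following steps (1)-(4) above, for 2-D points.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step (1): initial centers
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step (2): assign to nearest center
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                      # step (3): recompute means
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:           # step (4): converged
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```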

Canopy clustering

Canopy clustering is a simple, fast, and reasonably accurate method for grouping objects into clusters. Each object is represented by a point in a multi-dimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2. It starts from the set of points, removes one to create a canopy containing that point, and iterates over the remaining points: each point whose distance from the seed is below T1 is added to the canopy, and points within T2 of the seed are additionally removed from the candidate set, so they can no longer seed new canopies.

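A minimal Python sketch of canopy assignment under the stated T1 > T2 rule, using a deliberately cheap 1-D distance for illustration:

```python
# A minimal canopy clustering sketch with thresholds T1 > T2.
# The distance here is a cheap 1-D absolute difference, for illustration.
def canopy(points, t1, t2, dist=lambda a, b: abs(a - b)):
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        # Pick a seed (here simply the first; the algorithm picks at random).
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)            # inside T1: joins this canopy
            if d >= t2:
                still_remaining.append(p)    # outside T2: stays for later canopies
        remaining = still_remaining
        canopies.append(members)
    return canopies

print(canopy([1.0, 1.2, 1.4, 5.0, 5.3, 9.9], t1=1.0, t2=0.5))
```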

Bayesian classification algorithm

The Bayesian classification algorithm is a statistical classification method: a family of algorithms that classify using knowledge of probability and statistics. In many settings, the naïve Bayes (NB) classifier is comparable to decision tree and neural network classifiers. The algorithm scales to large databases, and it is simple, fast, and accurate. However, naïve Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes, and since this assumption often fails in practice, classification accuracy may suffer. For this reason, many Bayesian classifiers that relax the independence assumption have been derived, such as the tree-augmented naive Bayes (TAN) algorithm.

Bayes' theorem formula: P(A|B)=P(B|A)*P(A)/P(B)

Posterior probability = conditional probability (likelihood) × prior probability / evidence
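As a worked sketch of the formula, the following computes a posterior with hypothetical numbers (a test with 1% prevalence, 99% sensitivity, and a 5% false-positive rate):

```python
# A worked sketch of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# All numbers are hypothetical, chosen only to illustrate the arithmetic.
p_disease = 0.01                      # prior P(A)
p_pos_given_disease = 0.99            # likelihood P(B|A)
p_pos_given_healthy = 0.05            # false-positive rate

# Evidence P(B) by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.167
```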

Summary

This article has a companion video. For more articles, download the Chongdianleme app, which offers tens of thousands of free lessons and articles. For the companion book and textbook, see Chen Jinglei's new book: "Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series).

[New book introduction]
"Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series) [by Chen Jinglei] [Tsinghua University Press]
Features of the new book: a step-by-step explanation of distributed machine learning frameworks and their applications, with supporting hands-on projects such as a personalized recommendation system, face recognition, and a dialogue robot.

[New book introduction video]
"Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series) new-book introduction [Chen Jinglei]

Video features: an introduction to the new book, analysis of the latest cutting-edge technology hotspots, and career planning advice. After this lesson you will have a fresh technical perspective on the field of artificial intelligence, and a clearer view of your career development.

[Featured Course]
"Distributed Machine Learning in Action": an expert-level featured course on big data and artificial intelligence (AI)

[Free trial videos]:

A growth path toward a million-yuan annual salary in artificial intelligence / from Python to the latest hot technologies

From a zero-foundation introduction to Python programming to the advanced hands-on artificial intelligence course series

Video features: this series of expert-level courses has a companion book, "Distributed Machine Learning in Action"; the courses and the book complement each other, which greatly improves learning efficiency. The series takes distributed machine learning as its main line, first introducing in detail the big data technologies it depends on, then focusing on the current mainstream distributed machine learning frameworks and algorithms. The series emphasizes hands-on practice and closes with several industrial-grade system projects. The core content includes: big data and artificial intelligence at Internet companies, big data algorithm system architecture, big data fundamentals, Python programming, Java programming, Scala programming, Docker containers, the Mahout distributed machine learning platform, the Spark distributed machine learning platform, distributed deep learning frameworks and neural network algorithms, natural language processing algorithms, complete industrial-grade system projects (a recommendation system, face recognition, and a dialogue robot), and employment/interview skills/career planning/promotion guidance.

[About Chongdianleme]

The Chongdianleme app ("Have you charged today?") is an online education platform focused on vocational skills training for office workers.

It focuses on improving vocational skills and work efficiency, bringing real economic benefits. Have you charged today?

Chongdianleme official website:
http://www.chongdianleme.com/

Chongdianleme app download:
https://a.app.qq.com/o/simple.jsp?pkgname=com.charged.app

Features are as follows:

【All Industries and Positions】- Focused on improving office workers' vocational skills

Covering all industries and positions: whether you are an office worker, an executive, or an entrepreneur, there are videos and articles for you to learn from. The big data/AI, blockchain, and deep learning content reflects first-line, industrial-grade Internet practice.

Beyond professional skills, there are general workplace skills: corporate management, equity incentives and design, career planning, social etiquette, communication skills, presentation skills, meeting skills, email writing, relieving work stress, building connections, and more, improving your professional level and overall quality in every respect.

【Niuren (Expert) Classroom】- Learn from the work experience of top performers

1. Intelligent personalization engine:

Massive video courses covering all industries and positions. By mining skill-keyword preferences across industries and positions, the engine intelligently matches the skill courses most relevant to your current position.

2. Whole-network search

Enter keywords to search the massive course library; there is always a course that suits you.

3. Course detail pages

Besides playing the current video, the playback page shows related video courses and articles, reinforcing each skill point so that you can easily become a senior expert in a given field.

【Featured Reading】- Skill articles that are a pleasure to read

1. Personalized reading engine:

Tens of millions of articles covering all industries and positions. By mining skill-keyword preferences across industries and positions, the engine intelligently matches the skill-learning articles most relevant to your current position.

2. Whole-network article search

Enter keywords to search the massive article library; there are always skill-learning articles that interest you.

[Robot Teacher] - Personalized, fun learning

Built on search engines and deep learning, the robot teacher understands you better; chat and learn with it in natural language, blending learning with fun for efficient learning and a happier life.

【Short Courses】- Learn knowledge efficiently

Massive short courses support fragmented-time learning and quickly sharpen a specific skill point.
