Hardcore! Metro big data passenger flow analysis system

This hardcore project has been gaining momentum recently. The author has open-sourced it on both GitHub and Gitee (Code Cloud), and it has been on the Gitee trending list for several days running.

Lao Guang reached out to the author of this project. In the author's words: "I first came across the Shenzhen Municipal Government Data Open Platform while looking into data sources for a contest topic. I happened to see the Shenzhen Tong card data, felt there was a lot of information worth mining, and so I built this project."

This project mainly analyzes Shenzhen Tong card-swipe records, studies the passenger-carrying capacity of the Shenzhen Metro from a big data perspective, and explores directions for optimizing Shenzhen Metro services.

GitHub Star trend chart

Below are the GitHub and Gitee repositories; Stars are welcome. The data used by the project is included in the repository, and for visitors in mainland China, Gitee is faster.

  • https://github.com/geekyouth/SZT-bigdata

  • https://gitee.com/geekyouth/SZT-bigdata

1. Introduction

For big data engineers who are just getting started, this is a good hands-on project: it uses widely adopted technology frameworks, which deepens your understanding and practical command of each stack. Only by actually using them can you feel the differences, strengths, and weaknesses of each framework, laying a foundation for technology selection in future projects.

The data comes from the Shenzhen Municipal Government Data Open Platform: about 1.337 million offline Shenzhen Tong card-swipe records. The official site appears to have stopped serving the dataset, so the author provides a backup data source that can be downloaded from the project. The project adopts a combined offline + real-time approach, with multiple solutions.
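To make the data concrete, here is a minimal sketch of parsing one swipe record. The field names and column order (card_no, deal_date, station, deal_type, deal_money) are assumptions modeled on a typical open-platform CSV export, not the project's exact schema.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// One Shenzhen Tong swipe record. Field names are assumed, not the
// project's actual schema: card_no,deal_date,station,deal_type,deal_money
class SwipeRecord {
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    final String cardNo;
    final LocalDateTime dealDate;
    final String station;
    final String dealType;   // e.g. metro entry / metro exit
    final long dealMoneyFen; // fare in fen (1/100 RMB) to avoid float error

    SwipeRecord(String cardNo, LocalDateTime dealDate, String station,
                String dealType, long dealMoneyFen) {
        this.cardNo = cardNo;
        this.dealDate = dealDate;
        this.station = station;
        this.dealType = dealType;
        this.dealMoneyFen = dealMoneyFen;
    }

    // Parse one comma-separated line into a record.
    static SwipeRecord parse(String line) {
        String[] f = line.split(",", -1);
        return new SwipeRecord(f[0], LocalDateTime.parse(f[1], FMT), f[2],
                f[3], Math.round(Double.parseDouble(f[4]) * 100));
    }
}
```

Keeping fares in integer fen rather than floating-point yuan avoids rounding drift when later summing revenue per station or line.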

2. Effect

Let's take a look at the results at the current stage:

As the chart shows, the card-swipe records for 2018-09-01 are concentrated between 6:00 in the morning and midnight, and the morning peak is fairly consistent. Although this day was a Saturday, the peak periods are not especially pronounced. Let's zoom the Kibana timeline to see a more detailed curve:

Station throughput ranking for 2018-09-01:
Wuhe Station, Buji Station (Shenzhen East Railway Station), Luohu Station (Shenzhen Railway Station), Shenzhen North Station (Shenzhen North high-speed rail station)

On September 1, 2018, the highest single fare paid by any passenger that day was RMB 48.

On September 1, 2018, Line 5 was far ahead in passenger volume that day, and the Longgang Line crushed Line 1, much to the chagrin of Longgang residents!

There are many more analyses, such as:

  • Daily ranking of the busiest track sections

  • Ranking of average travel time for one-way direct (no-transfer) passengers, per line

  • Average commute time across all passengers

  • Commute time ranking across all passengers

  • Ranking of gate entries and exits per station

  • Ranking of gate entries and exits per line

  • Revenue ranking per station

  • Revenue ranking per line

  • Ranking of the percentage of exiting passengers who transferred, per line

  • Ranking of the percentage of direct passengers who received a discount

  • Ranking of passengers with the longest transfer times, and so on
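Most of the rankings above reduce to a group-by-count followed by a sort. As a minimal, runnable illustration (plain Java streams rather than the project's actual Spark/Flink jobs), here is how a per-station throughput ranking could be computed from a list of station events:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Rank stations by number of swipe events, busiest first.
class StationRanking {
    static List<Map.Entry<String, Long>> rank(List<String> stationEvents) {
        return stationEvents.stream()
                // Count swipes per station.
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()))
                .entrySet().stream()
                // Sort descending by count to produce the ranking.
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }
}
```

The same group-and-sort shape applies to the revenue and gate-count rankings, only with a different grouping key or a summing collector instead of a count.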

3. Technology selection

The overall architecture of the project looks like this:

Technology stack: the figure below shows the main technologies used in this project:

Java-1.8 / Scala-2.11: rich ecosystem, plenty of ready-made wheels;

Flink-1.10: the first choice for streaming workloads and ETL. Its momentum is in full swing, backed by Alibaba: light, flexible, and fast;

Redis-3.2: natural deduplication, automatic sorting, and above all it is fast. SSDB is a similar product, a cheaper disk-based alternative. Win10 | CentOS7 | Docker: pick any of the three; a yum install from the default CentOS repository gives version 3.2;
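The "natural deduplication" mentioned above can be sketched like this: each record's identity key is inserted exactly once, and replays are dropped. In the real pipeline Redis would play this role (for example via a set-if-absent write); here an in-memory set stands in so the logic is runnable anywhere, and the key layout (card number plus swipe time) is an assumption.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Deduplicate swipe records by identity key. An in-memory set stands in
// for Redis here; the key layout (cardNo + dealDate) is an assumption.
class Dedup {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true the first time a key is seen, false for replays.
    boolean firstTime(String cardNo, String dealDate) {
        return seen.add(cardNo + "|" + dealDate);
    }
}
```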

Kafka-2.1: a message queue for decoupling services, shaving traffic peaks, and publish/subscribe scenarios. Its best companion:

  • kafka-eagle-1.4.5: integrates production, consumption, KSQL, dashboards, monitoring, and alerting, and monitors ZooKeeper at the same time. The other Kafka monitoring tools I tried were eventually abandoned:

  • KafkaOffsetMonitor had too many problems; I dropped it without hesitation;

  • Kafka Manager has been renamed CMAK. Software written abroad can feel awkward to use, and it supported Kafka only up to 0.11 at the time, while Kafka itself had already moved on to 2.4.

Zookeeper-3.4.5: the basic dependency of the cluster. During leader election, the node with the larger ID has the advantage; the online status of each component is maintained through the session mechanism;

CDH-6.2: solves software compatibility, the hardest problem for programmers, with one-click installation of the whole family bucket of services;

Docker-19: deploy new software at top speed with no intrusion and no pollution, plus rapid scaling and service packaging. If no suitable runtime environment is available, Docker has to be the first choice;

Spring Boot-2.1.3: a staple of the Java ecosystem, essential for agile development;

knife4j-2.0: formerly swagger-bootstrap-ui; it makes REST API debugging incredibly convenient and beats the original Swagger UI by orders of magnitude;

Elasticsearch-7: the only reliable database in the full-text search field and the core search-engine service here, with millisecond responses over billions of records, though in practice it has plenty of pitfalls.

Kibana-7.4: a member of the ELK family for front-end visualization; even beginners have nothing to fear;

ClickHouse: Russia's best-known representative work used to be the nginx server; the currently popular ClickHouse is just as lightweight, yet its performance far exceeds every comparable database on the market, and its storage scales to the PB level. There is not much documentation yet; I am still studying it;

MongoDB-4.0: a document database, friendly to JSON data, mainly used as the crawler's data store;

Spark-2.3: the current mainstream domestic solution for real-time micro-batch and offline batch processing. This component is very resource-hungry; my laptop blue-screened during development, so I now submit jobs directly to the Spark cluster.

Hive-2.1: a must-have data warehouse for the Hadoop ecosystem and an OLAP structured store for offline big data processing. Strictly speaking it is an HQL parser; the query syntax is close to MySQL, though window functions are more involved.

Impala-3.2: as brisk and vigorous as an antelope. For the same complex Hive SQL query, Impala returns in milliseconds while Hive takes about 80 seconds or more;

HBase-2.1 + Phoenix: the unstructured database of the Hadoop ecosystem. The soul of HBase's design is the rowkey and multi-version control; grafting Phoenix onto HBase enables more complex business queries;
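To illustrate why the rowkey is called the soul of HBase, here is one hypothetical rowkey layout for swipe records: a small salt prefix to spread writes across regions, the station, then a reversed timestamp so each station's newest swipes sort first. This is an illustrative design, not the project's actual schema.

```java
// Build a hypothetical HBase rowkey: salt_station_reversedTs_cardNo.
class RowKeys {
    static String rowKey(String station, long epochMillis, String cardNo) {
        // Salt into 10 write buckets to avoid hotspotting a single region.
        int salt = Math.floorMod(cardNo.hashCode(), 10);
        // Reverse the timestamp so newer records sort lexicographically first.
        long reversed = Long.MAX_VALUE - epochMillis;
        return String.format("%d_%s_%019d_%s", salt, station, reversed, cardNo);
    }
}
```

With a layout like this, a prefix scan on salt + station returns that station's most recent swipes first, which suits "latest activity" queries.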

Kylin-2.5: a multidimensional pre-aggregation system that relies on memory for fast computation, but it has many limitations. It suits scenarios where the business is particularly stable and the dimensions are fixed and rarely change. Don't try it on a weak machine; the memory cost is too much to afford;

HUE-4.3: a gift from the CDH family bucket that emphasizes user experience. Operating the data warehouse with it is very convenient: permission control, Hive + Impala queries, HDFS file management, and writing Oozie scheduling scripts all rely on it;

Alibaba DataX: a heterogeneous data-source synchronization tool that covers most mainstream databases; you can even develop plugins yourself. If it cannot meet some special business need, I recommend FlinkX, a distributed data synchronization tool based on Flink; in theory you can develop your own plugins for it as well;

Oozie-5.1: its own UI is ugly, but paired with HUE it is acceptable. It is mainly used to write and run task scheduling scripts;

Sqoop-1.4: mainly used to export business data from MySQL into the HDFS data warehouse, and vice versa;

Mysql-5.7: something every programmer should know. If there is one language shared by programmers all over the world, it must be SQL. MySQL 8.0 is not yet widely adopted, and MariaDB is not recommended for now: its complex functions are not fully compatible with MySQL, and if a dependent component breaks because of the database, you will be left in tears;

Hadoop-3.0 (HDFS + YARN): HDFS is currently the most mainstream distributed storage system for massive data in the big data field. YARN here refers specifically to the Hadoop ecosystem's resource manager, which allocates cluster resources and ships with its own execution engine, MR;

Alibaba DataV: for visual dashboard display;

The author deliberately chose newer versions of the software, because new versions have more pitfalls than old ones, and stepping into more pitfalls sharpens your skills, so that new problems can be met move for move.

4. Development environment

Win10 IDEA 2019.3 Ultimate Edition: essential for Java/Scala development, integrating every kind of capability in one place;

Win10 DBeaver Enterprise Edition 6.3: kills every database client in the universe; it can connect to almost every commonly used database, and choosing the right driver is the key;

Win10 Sublime Text 3: the strongest lightweight editor around, with light-speed startup and unlimited plugins. It is mainly used for editing scattered files and real-time Markdown preview, and it is particularly friendly for front-end work.

CentOS7 CDH-6.2 cluster: contains the components below; the corresponding host roles and configuration are shown in the figure. The cluster needs at least 40 GB of total memory for basic usage, and within reason, the more RAM the better. As "Lu Xun" supposedly said, "in all the martial arts under heaven, only speed is unbeatable"; our pursuit is the faster the better;


Origin blog.csdn.net/weixin_47080540/article/details/110848905