[Graduation thesis of Shandong First Medical University] Research and design of traditional Chinese medicine medical record data mining system based on Hadoop


2.1 Overview of Hadoop

Hadoop is an open-source distributed computing platform developed under the Apache Software Foundation. It is a distributed systems framework that provides a powerful set of tools and techniques for large-scale data processing, supporting distributed storage and processing of massive data sets. It was originally developed by Doug Cutting and Mike Cafarella, and its name comes from a toy elephant belonging to Doug Cutting's son. Through distributed storage and computing, Hadoop can process data sets spread across thousands of nodes, with high scalability and fault tolerance. At the same time, Hadoop replaces a single expensive server with many cheap computers and low-cost storage, greatly reducing the cost of processing large-scale data.

The core idea of Hadoop is to store data in a distributed manner across multiple computers and then process it at scale with the MapReduce model. MapReduce is a distributed computing framework that divides a job into map and reduce tasks and completes the computation through data mapping and reduction. In the map stage, each node divides its input data into records and maps those records to key/value pairs; in the reduce stage, nodes merge records by key and perform reduction operations. In this way, the resources of a computer cluster can be used effectively to process large-scale data.
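To make the two stages concrete, the following is a minimal word-count-style example in Java using Hadoop's MapReduce API. Counting term frequencies is a standard illustration only, not the mining algorithm used by this system; the class names are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: split each input line into tokens and emit a <token, 1> key/value pair.
public class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                term.set(token);
                context.write(term, ONE);
            }
        }
    }
}

// Reduce stage: merge all records sharing the same key and sum their counts.
class TermCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // <term, total count>
    }
}
```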

Hadoop has a wide range of applications, especially in the field of big data. It can process not only structured data but also unstructured or semi-structured data, such as Web logs, emails, social network data, and audio and video. These capabilities are also used in search engines, recommendation systems, advertising, market trend analysis, and bioinformatics.

Overall, Hadoop is a powerful data processing tool that can help enterprises process and analyze big data efficiently and uncover new business opportunities. The next section introduces Hadoop's core components in more detail: HDFS, MapReduce, and Yarn.

2.2 Core components of Hadoop

As introduced above, Hadoop provides a series of powerful tools and techniques for large-scale data processing tasks. Its core components are the Hadoop Distributed File System (HDFS), MapReduce, and Yarn.

2.2.1 HDFS Overview

HDFS is the abbreviation of Hadoop Distributed File System. Within Hadoop, HDFS is the file system layer responsible for distributed file storage and the data replication mechanism, spreading large-scale data across multiple nodes. It has high fault tolerance and scalability and can support TB- or even PB-level data storage. HDFS adopts a master (Master)/slave (Slave) structure and stores data in fixed-size blocks (Block).

Generally speaking, a complete HDFS deployment consists of the following components:

  1. NameNode: the master node that manages the entire HDFS; a single cluster usually has only one.
  2. DataNode: the worker node of the system, which stores and manages data blocks; a cluster usually has many of them.
  3. Secondary NameNode: a helper node that checkpoints the NameNode's metadata so the cluster can recover if the NameNode goes down.
  4. Client: the interface through which users connect to the system, allowing them to operate on the cluster's files with client-side tools.
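To illustrate the Client role, here is a minimal sketch using Hadoop's Java FileSystem API to upload a file into HDFS. The NameNode address and file paths are assumptions for illustration, not values from the thesis.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address is an assumption for illustration.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

        // Copy a local medical-record file into HDFS; both paths are hypothetical.
        fs.copyFromLocalFile(new Path("/tmp/records.csv"),
                             new Path("/data/tcm/records.csv"));

        // Verify the file is now visible in the cluster.
        System.out.println("Exists: " + fs.exists(new Path("/data/tcm/records.csv")));
        fs.close();
    }
}
```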

2.2.2 MapReduce Overview

MapReduce is a computing framework and a core part of Hadoop, used for distributed computation over large-scale data stored in HDFS. The MapReduce model divides a user-written algorithm into small, scalable tasks and executes them concurrently on multiple nodes, achieving high-speed processing of large data sets. Processing is split into two stages, Map and Reduce: the Map stage decomposes large-scale data into key-value pairs and performs mapping calculations on different machines, while the Reduce stage aggregates and computes over the results output by the Map stage. A driver that wires the two stages together is sketched below.
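A job that combines a mapper and a reducer (such as the pair sketched in section 2.1) is configured and submitted by a driver class. Below is a minimal sketch; the input and output HDFS paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "term count");
        job.setJarByClass(TermCountDriver.class);

        // Wire up the Map and Reduce stages described above.
        job.setMapperClass(TermCountMapper.class);
        job.setReducerClass(TermCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```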

2.2.3 Yarn Overview

In addition to MapReduce, another core component of Hadoop is Yarn. Yarn (Yet Another Resource Negotiator) is Hadoop's resource management system, used to dynamically allocate computing resources to different users and programs. The emergence of Yarn enables Hadoop to support a variety of computing workloads and to manage and schedule resources across them.

In general, Hadoop's core components provide a complete set of solutions for large-scale data storage, distributed computing, and resource management, enabling flexible and efficient data processing. The next section introduces the application of Hadoop in big data processing.

2.3 Application of Hadoop in big data processing

With the rapid development of the Internet, data volume is growing explosively, and efficiently processing massive data has become a hot topic in information technology. As one of the most mainstream big data processing frameworks today, Hadoop is distributed, scalable, and highly reliable, which is why it is so widely used in big data processing.

Hadoop processes data mainly through two core tools: HDFS and MapReduce. HDFS is Hadoop's distributed file system: files are split into multiple blocks and distributed across multiple computers for efficient data access and fault tolerance. MapReduce is a distributed computing framework that splits large batches of data into small blocks and computes them on multiple nodes, achieving efficient processing of massive data. Working together, these two tools let Hadoop cope with complex big data processing scenarios, and Hadoop has gradually become the Swiss Army knife of the field.

In the big data environment, Hadoop is used more and more widely, and one of its most important applications is machine learning. Since machine learning usually requires processing and analyzing massive amounts of data, Hadoop's distributed storage and distributed computing give it unique advantages there. In addition, Hadoop is widely used for big data processing in fields such as finance, e-commerce, and government, for example bank interest rate forecasting, product sales forecasting, and administrative region planning.

Hadoop is not a panacea, and it has shortcomings in big data processing. For example, although Hadoop can handle large amounts of structured and semi-structured data, processing unstructured data remains difficult. In addition, because of its complexity, Hadoop requires professionals to configure, deploy, and manage it, so implementation is complicated. Therefore, when using Hadoop for big data processing, its advantages and disadvantages must be weighed and it must be applied rationally.

4.2 System design

In this section, this project describes the interactions between the system components and the technical framework in which they are implemented. It first introduces the implementation of the client and the overall architectural design, then describes the logical relationships between the system's components, and finally discusses the system's scalability and maintainability.

This project uses Webservice technology to implement the client. On the one hand, most non-technical users are currently accustomed to working on the Windows platform; on the other hand, exposing the cloud computation as a service decouples it from the client and facilitates service integration. For these two reasons, the design adopts Webservice, a remote invocation technology that works across programming languages and operating system platforms [1], with the Hadoop computing cluster as the server. The data interaction process of the Webservice remote call is shown in the figure.
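As an illustration of this design, the following is a minimal sketch of publishing such a service endpoint with the JAX-WS API bundled with Java 8. The service name, method, and URL are hypothetical, since the thesis does not specify the actual interface.

```java
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.Endpoint;

// A hypothetical service interface: the client (e.g. a Windows desktop program)
// calls this method remotely, and the server side forwards the request to the
// Hadoop cluster for processing.
@WebService
public class MiningService {

    @WebMethod
    public String submitMiningTask(String datasetPath) {
        // In a real deployment this would submit a MapReduce job to the cluster
        // and return a task id; here it only returns a placeholder.
        return "submitted:" + datasetPath;
    }

    public static void main(String[] args) {
        // Publish the service so that clients can call it over HTTP/SOAP.
        Endpoint.publish("http://localhost:8080/miningService", new MiningService());
    }
}
```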

In terms of system architecture, this project adopts a distributed architecture based on the Hadoop platform to realize data storage, processing, and analysis. The design also addresses system security, such as user identity authentication and data access control, to ensure the security and reliability of the system's data.

First, the system architecture of the Hadoop-based traditional Chinese medicine medical record data mining system needs to be introduced. The architecture design is the core of the overall system design and one of its foundations.

The overall architecture combines the Hadoop distributed framework, cloud computing technology, and traditional Chinese medicine data mining algorithms to create an efficient, stable, and maintainable system. The system is divided into several parts: data collection, data storage, data preprocessing, data mining, and data visualization, with the corresponding data flow built on top of them. In addition, the front end combines interface design with Web technology to provide a simple, easy-to-use, and highly interactive operation interface.

The logical relationships between the system's components are also important: data collection, preprocessing, mining, and visualization together form the data processing pipeline of the entire system. Each component is designed to be highly reusable and scalable, so when the system needs new functions or bug fixes, only the corresponding components need to be added or modified, without excessive impact on the system as a whole. This improves the system's maintainability and extensibility.

This section has presented the architecture of the Hadoop-based traditional Chinese medicine medical record data mining system; its design concepts and implementation approach provide a reference for subsequent system development and optimization.

4.3 Function implementation

This chapter introduces the implementation of each function of the system, covering data preprocessing, data storage, and data visualization. Data preprocessing mainly includes cleaning, filtering, and converting the original data to ensure its accuracy and completeness.

The purpose of data cleaning is to improve data quality so that useful information can be extracted more accurately. Its main tasks are to detect and correct errors, missing values, duplicates, and outliers in the data, and to convert data in different formats into a consistent format. It specifically includes the following steps (a simplified sketch follows the list):

  1. Missing value processing: filling in empty values.
  2. Noise data processing: deleting duplicate data and outliers, and grouping data.
  3. Consistency checking.
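As a concrete illustration of steps 1 and 2, here is a minimal, hedged sketch in Java. The field names (xinlv, yonghuzhanghao) anticipate the medical record table defined below, and the heart-rate thresholds are assumptions for illustration, not values taken from the thesis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative record cleaning: fill missing values, then drop duplicates and
// out-of-range outliers. Each record is a simple field-name -> value map.
public class RecordCleaner {

    public static List<Map<String, String>> clean(List<Map<String, String>> records) {
        Map<String, Map<String, String>> unique = new LinkedHashMap<>();
        for (Map<String, String> r : records) {
            // 1. Missing value processing: mark an absent heart-rate field.
            r.putIfAbsent("xinlv", "unknown");

            // 2. Noise data processing: skip records with an implausible heart rate.
            String hr = r.get("xinlv");
            if (!"unknown".equals(hr)) {
                try {
                    double v = Double.parseDouble(hr);
                    if (v < 20 || v > 250) continue;  // outlier, skip
                } catch (NumberFormatException e) {
                    continue;                          // malformed, skip
                }
            }

            // Deduplicate on the user account field (keeps the first occurrence).
            unique.putIfAbsent(r.get("yonghuzhanghao"), r);
        }
        return new ArrayList<>(unique.values());
    }
}
```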

To perform data analysis on top of the preprocessed data, the database first needs several tables to store user information, medical records, and other data. The specific tables are as follows:

  1. Medical record information table (table name: binglixinxi): stores medical record information, including user id, creation time, user account, user name, gender, height, weight, blood pressure, blood sugar, heart rate, history of present illness, past history, drug sensitivity history, and registration date. The specific fields and their functions are shown in the table below, followed by an illustrative DDL sketch.

Table 4-1 Medical record information table

| Field name | Type | Length | Description | Primary key | Default value |
|---|---|---|---|---|---|
| id | bigint | | primary key | yes | |
| addtime | timestamp | | creation time | | CURRENT_TIMESTAMP |
| yonghuzhanghao | varchar | 200 | user account | | |
| yonghuxingming | varchar | 200 | user name | | |
| xingbie | varchar | 200 | gender | | |
| shengao | varchar | 200 | height (cm) | | |
| tizhong | float | | weight (kg) | | |
| xueya | varchar | 200 | blood pressure | | |
| xuetang | float | | blood sugar | | |
| xinlv | float | | heart rate | | |
| xianbingshi | longtext | 4294967295 | history of present illness | | |
| jiwangshi | longtext | 4294967295 | past history | | |
| yaominshi | longtext | 4294967295 | drug sensitivity history | | |
| dengjiriqi | date | | registration date | | |
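For concreteness, the DDL implied by Table 4-1 can be issued over JDBC as sketched below. This assumes a MySQL backend (suggested by the longtext and CURRENT_TIMESTAMP usage, though the thesis does not state it outright), and the connection URL and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateBinglixinxiTable {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders, not real deployment values.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/tcm", "user", "password");
             Statement stmt = conn.createStatement()) {

            // DDL reconstructed from Table 4-1; column types follow the table.
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS binglixinxi (" +
                "  id BIGINT PRIMARY KEY," +
                "  addtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP," +
                "  yonghuzhanghao VARCHAR(200)," +
                "  yonghuxingming VARCHAR(200)," +
                "  xingbie VARCHAR(200)," +
                "  shengao VARCHAR(200)," +
                "  tizhong FLOAT," +
                "  xueya VARCHAR(200)," +
                "  xuetang FLOAT," +
                "  xinlv FLOAT," +
                "  xianbingshi LONGTEXT," +
                "  jiwangshi LONGTEXT," +
                "  yaominshi LONGTEXT," +
                "  dengjiriqi DATE" +
                ")");
        }
    }
}
```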

  2. Ordinary user information table (table name: yonghu): stores ordinary user account information, including user id, creation time, user account, password, user name, avatar, gender, and mobile phone number. The specific fields and their functions are shown in the table below.

Table 4-2 Ordinary user table

| Field name | Type | Length | Description | Primary key | Default value |
|---|---|---|---|---|---|
| id | bigint | | primary key | yes | |
| addtime | timestamp | | creation time | | CURRENT_TIMESTAMP |
| yonghuzhanghao | varchar | 200 | user account | | |
| mima | varchar | 200 | password | | |
| yonghuxingming | varchar | 200 | user name | | |
| touxiang | longtext | 4294967295 | avatar | | |
| xingbie | varchar | 200 | gender | | |
| shoujihaoma | varchar | 200 | mobile phone number | | |

  3. Administrator user information table (table name: users): stores administrator user information, including user id, user name, password, role, and creation time. The specific fields and their functions are shown in the table below.

Table 4-3 Administrator user table

| Field name | Type | Length | Description | Primary key | Default value |
|---|---|---|---|---|---|
| id | bigint | | primary key | yes | |
| username | varchar | 100 | user name | | |
| password | varchar | 100 | password | | |
| role | varchar | 100 | role | | administrator |
| addtime | timestamp | | creation time | | CURRENT_TIMESTAMP |

  4. Token table (table name: token): stores user identity credential data, including user id, user name, table name, role, token, creation time, and expiration time. The specific fields and their functions are shown in the table below; an illustrative token-issuing sketch follows the table.

Table 4-4 Token table

| Field name | Type | Length | Description | Primary key | Default value |
|---|---|---|---|---|---|
| id | bigint | | primary key | yes | |
| userid | bigint | | user id | | |
| username | varchar | 100 | user name | | |
| tablename | varchar | 100 | table name | | |
| role | varchar | 100 | role | | |
| token | varchar | 200 | token | | |
| addtime | timestamp | | creation time | | CURRENT_TIMESTAMP |
| expiratedtime | timestamp | | expiration time | | CURRENT_TIMESTAMP |
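To show how a row of the token table might be produced at login time, here is a hypothetical helper. The class, its method, the random-UUID token format, and the 24-hour lifetime are all assumptions for illustration, not the system's actual implementation.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical helper that builds a row matching the token table schema above.
public class TokenIssuer {

    public static class TokenRow {
        public long userid;
        public String username;
        public String tablename;  // which user table the account lives in
        public String role;
        public String token;
        public Timestamp addtime;
        public Timestamp expiratedtime;
    }

    public static TokenRow issue(long userid, String username,
                                 String tablename, String role) {
        TokenRow row = new TokenRow();
        row.userid = userid;
        row.username = username;
        row.tablename = tablename;
        row.role = role;
        // A random opaque token; real systems may use signed tokens instead.
        row.token = UUID.randomUUID().toString().replace("-", "");
        Instant now = Instant.now();
        row.addtime = Timestamp.from(now);
        // Assumed 24-hour validity; the thesis does not state the actual lifetime.
        row.expiratedtime = Timestamp.from(now.plus(24, ChronoUnit.HOURS));
        return row;
    }
}
```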

4.4 System performance test

After implementing the system functions, the system must be tested. The main work is to test each function in practice and to import and analyze simulated medical record data. The main purpose of testing is to verify that each function works as implemented and to assess the system's reliability and stability.

After logging in to the system as an administrator or ordinary user, users can modify their user information and password as needed.

Table of contents

Preface

Chapter 1 Introduction

1.1 Research background

1.2 Research content

Chapter 2 Introduction to Hadoop Architecture

2.1 Overview of Hadoop

2.2 Core components of Hadoop

2.2.1 HDFS Overview

2.2.2 MapReduce Overview

2.2.3 Yarn Overview

2.3 Application of Hadoop in big data processing

Chapter 3 Research on Data Mining Technology of Traditional Chinese Medicine Medical Records

3.1 Overview of data mining of traditional Chinese medicine medical records

3.2 Research on data mining algorithm of traditional Chinese medicine medical records

3.3 Application scenarios of TCM medical record data mining

Chapter 4 Design and Implementation of Medical Record Data System

4.1 Requirements analysis

4.2 System design

4.3 Function implementation

4.4 System performance test

4.5 Test result analysis

Chapter 5 Summary and Outlook

5.1 Summary of research work

5.2 Future work prospects

Acknowledgments

References
