

Analysis of CDGP Design and Essay Questions (June 2023)

(Follow the WeChat public account "Big Data Iron Eater" and reply "2023 CDGP" to get the full version.)


  • Hotel member modeling

(Answered in the original post with a hotel member data model diagram.)


  • Drawing on domestic and foreign data security laws and regulations, discuss how to build a security management system for overseas (cross-border) data transmission

Domestic: "Data Security Law", "Cyber ​​Security Law", "Data Export Security Assessment Measures" to be implemented in September 2022 Overseas: EU Data Protection Regulations, US Homeland Security Act and US Patriot Act, Federal Information Security Management Act, Canada 198 Act and others involve personal information: "Personal Information Protection Law" ● Construct data security management from the following aspects: (1) Information security, including: vulnerability, threat, risk, encryption, obfuscation/desensitization ( 2) Network security, including: backdoors, bots/corpses, firewalls, DMZs, keyloggers, penetration testing, virtual private networks (VPNs) (3) data security, including: facility security, device security, credential security, electronic Communication Security ● Management and construction through the data life cycle: the full data life cycle includes planning-design/activation-creation/acquisition-storage/maintenance-use-enhancement and disposal. Planning: Associating data with security and privacy requirements Design & Enablement: "Establish data protection and security measures in the system Creation/Acquisition: Classify new data so that data is properly protected Storage/Maintenance: Ensure data storage complies with policies and regulations Use: Manage access rights to ensure proper use of data and prevent misuse Enhancement: Stay ahead of regulatory requirements and identify new security threats Disposition: Process data in compliance with relevant policy and regulatory requirements


  • (1) What are the challenges of master data management? (2) What are the objectives of master data management? (3) How is master data identified? (4) What are the implementation steps of master data management?

(1) Challenge: entity resolution (identity management), the process of identifying and managing associations between data coming from different systems and processes. This process must be managed on an ongoing basis so that master data entities, instances, and identities stay consistent.
(2) Goal: to ensure the organization has complete, consistent, up-to-date, and authoritative master data in every process, and to encourage the enterprise to share master data across business units and application systems.
(3) Master data is data about business entities; it mainly includes reference data, enterprise structure data, and transaction structure data. The steps for identifying/resolving master data entities are: 1) matching; 2) identity resolution; 3) matching workflows and reconciliation types; 4) master data ID management; 5) affiliation management. (A matching sketch follows below.)
(4) Steps: identify drivers and requirements, evaluate and assess data sources, define the architectural approach, model the master data, define stewardship and maintenance responsibilities, and establish a governance system to promote the use of master data.
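As a concrete illustration of the matching step in (3), here is a minimal Python sketch (my illustration, not from the original answer) that standardizes two party names from different systems and scores their similarity; the records, threshold, and golden-ID convention are assumptions.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Standardize before matching: lowercase, drop punctuation and whitespace."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Candidate records from two source systems (illustrative data)
crm = {"id": "CRM-001", "name": "ACME Trading Co., Ltd."}
erp = {"id": "ERP-417", "name": "Acme Trading Co Ltd"}

score = match_score(crm["name"], erp["name"])
if score > 0.9:               # threshold chosen for illustration
    golden_id = "MD-0001"     # both records resolve to one master data ID
    print(f"Matched ({score:.2f}): {crm['id']} and {erp['id']} -> {golden_id}")
```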

  • (1) How do you build a data warehouse? (2) What are the characteristics of modern data architecture? (3) What are the similarities and differences between data warehouses and data lakes? (4) How do you solve the SCD problem?

(1) The main process of building a data warehouse: 1) understand requirements; 2) define and maintain the data warehouse / business intelligence architecture; 3) develop the data warehouse and data marts; 4) load the data warehouse; 5) implement the business intelligence product portfolio; 6) maintain the data products.

(2) Characteristics of modern data architecture. Characteristics of big data: 3V (volume, variety, velocity), plus low value density but high overall value. Characteristics of the architecture: unified lake and warehouse (lakehouse) and unified stream and batch processing; typical representatives are the Lambda and Kappa architectures. You can elaborate here according to the marks allotted.

(3) Similarities and differences between data warehouses and data lakes:
● Similarities: both are used for big data storage and analysis and are oriented to enterprise-level applications; both offer very large storage capacity and efficient data access; both support batch and real-time processing and can meet different processing needs; both serve business decision-making and data analysis.
● Differences:
Data structure: data warehouses use standardized schemas, while data lakes support arbitrary data formats and non-standardized storage.
Data sources: a data warehouse extracts data from different sources through ETL, then cleans, integrates, and processes it; a data lake stores unprocessed, uncleaned raw data in one unified storage space and supports direct reading and querying of all formats.
Data usage: a data warehouse mainly serves enterprise decision-making and report analysis, a relatively traditional style of analysis; a data lake has a wider range of applications and can support big data, machine learning, artificial intelligence, and other fields.
Data timeliness: warehouse data is mainly historical records, archived and processed in batches, so fresh data becomes available only after hours or days; a data lake supports more real-time processing and querying and can ingest and process data in real time.

(4) The SCD problem: the data in some dimension tables is not static but changes slowly over time; such a dimension is called a slowly changing dimension, and handling historical changes in dimension table data is the slowly changing dimension problem, SCD for short. Solutions: keep the original value; overwrite the attribute value; add a new dimension row; add a new dimension column; add a history table; or use a zipper table to save historical snapshots (recommended; see the sketch below).
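To illustrate the recommended zipper-table approach (my sketch, not from the original answer): each dimension row carries a validity interval, and a change closes the current row and opens a new one. The schema, sentinel date, and sample data are assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel meaning "currently valid"

# Zipper table: each row carries a validity interval (illustrative schema)
dim_customer = [
    {"cust_id": "C1", "city": "Beijing", "start": date(2023, 1, 1), "end": OPEN_END},
]

def apply_change(table, cust_id, new_city, change_day):
    """SCD Type 2 via zipper table: close the current row, open a new one."""
    for row in table:
        if row["cust_id"] == cust_id and row["end"] == OPEN_END:
            if row["city"] == new_city:
                return                  # attribute unchanged, nothing to do
            row["end"] = change_day     # close out the old snapshot
    table.append({"cust_id": cust_id, "city": new_city,
                  "start": change_day, "end": OPEN_END})

apply_change(dim_customer, "C1", "Shanghai", date(2023, 6, 1))
# Point-in-time query: which snapshot was valid on 2023-03-15?
q = date(2023, 3, 15)
print([r for r in dim_customer if r["start"] <= q < r["end"]])  # Beijing row
```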


  • (1) How do you determine the priority order for data quality management? (2) Based on the company's actual situation, build a data quality management system following the order determined in (1)

(1) Data quality management should start with the organization's most important data, i.e. the data whose higher quality brings the most value to the organization and its customers. Data can be prioritized by factors such as regulatory requirements, financial value, and immediate impact on customers.

(2) Based on the company's actual situation, rank the company's data content by priority (omitted here), then build the data quality management system along the data lifecycle:
Plan: define the characteristics of high-quality data.
Design & enable: define system and process controls that avoid data problems and maintain data quality.
Create/acquire: measure or inspect data to ensure it meets quality requirements.
Store/maintain: use systems and processes to test data so that it continues to meet expectations.
Use: use feedback loop mechanisms to continuously improve data quality.
Enhance: act on data quality improvement opportunities.
Dispose: correctly identify and handle data according to data quality requirements. (A rule-based check sketch follows below.)
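As a concrete illustration of the create/acquire stage, here is a minimal Python sketch (my illustration, not from the original answer) of rule-based quality checks for completeness, validity, and uniqueness; the field names, rules, and sample batch are assumptions.

```python
import re

def check_batch(rows):
    """Apply illustrative quality rules and collect (row index, issue) pairs."""
    issues, seen_ids = [], set()
    for i, row in enumerate(rows):
        if not row.get("email"):                              # completeness
            issues.append((i, "email missing"))
        elif not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", row["email"]):
            issues.append((i, "email invalid"))               # validity
        if row["id"] in seen_ids:                             # uniqueness
            issues.append((i, "duplicate id"))
        seen_ids.add(row["id"])
    return issues

batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 2, "email": None},
]
print(check_batch(batch))   # feed results back into the improvement loop
```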

  • Drawing on the company's practice, how would you build a metadata management system that ensures metadata quality?

Metadata is data. Like other data it has a lifecycle, and we must manage that lifecycle:
Plan: define metadata requirements.
Design & enable: create and manage metadata as part of ongoing data management activities.
Create/acquire: ensure metadata is created and meets quality requirements.
Store/maintain: ensure metadata remains current and continues to meet usage needs.
Use: use metadata to get value from data; enabling feedback loops can improve metadata quality.
Enhance: enrich existing metadata with new knowledge and implement new metadata requirements.
Dispose: purge or archive outdated metadata.

Steps: follow quality management steps to manage metadata quality: (1) define high-quality metadata; (2) define a metadata quality strategy; (3) define the initial assessment scope; (4) perform the initial metadata quality assessment; (5) identify and prioritize improvements; (6) define metadata quality improvement goals; (7) develop and deploy metadata quality operations; and so on.

● Metadata activities: define the metadata strategy, understand metadata requirements, define the metadata architecture, create and maintain metadata, and query, report on, and analyze metadata. (A completeness-metric sketch follows below.)
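To make "define high-quality metadata" and its assessment concrete, here is a minimal Python sketch (my illustration, not from the original answer) of a metadata record for one table plus a simple completeness metric; the fields and sample values are assumptions.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional

@dataclass
class TableMetadata:
    """Illustrative technical/business metadata for one table."""
    name: str
    owner: str = ""                       # business steward
    description: str = ""                 # business definition
    source_system: str = ""               # lineage: upstream system
    updated_at: Optional[datetime] = None

def completeness(md: TableMetadata) -> float:
    """Share of populated metadata fields: one simple quality metric."""
    values = [getattr(md, f.name) for f in fields(md)]
    return sum(1 for v in values if v not in ("", None)) / len(values)

md = TableMetadata(name="dwd_orders", owner="sales_ops",
                   source_system="erp", updated_at=datetime(2023, 6, 1))
print(f"{md.name}: {completeness(md):.0%} complete")  # description is missing
```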


  • What content went beyond the exam syllabus?

1. Data Mesh and Data Fabric

Both aim to solve the problem of accessing and analyzing data across technology stacks and platforms, letting data stay where it is rather than being centralized on one platform or in one domain. Data Fabric is technology-centric, while Data Mesh focuses on changes in methodology and organizational collaboration.

For more detail, see:

Understand the difference between Data Fabric and Data Mesh in 10 minutes! - Zhihu (zhihu.com)

2. Open-source big data components (Atlas appeared in the multiple-choice questions this time)

Common technology components are as follows:

● System platforms (Hadoop, CDH, HDP)

● Cloud platforms (AWS, GCP, Microsoft Azure)

● Monitoring & management (CM, Hue, Ambari, Dr. Elephant, Ganglia, Zabbix, Eagle, Prometheus)

● File systems (HDFS, GPFS, Ceph, GlusterFS, Swift, BeeGFS, Alluxio, JindoFS)

● Resource scheduling (K8s, YARN, Mesos, Standalone)

● Coordination frameworks (ZooKeeper, etcd, Consul)

● Data storage (HBase, Cassandra, ScyllaDB, MongoDB, Accumulo, Redis, Ignite, Geode, CouchDB, Kudu)

● Row/columnar storage formats (Parquet, ORC, Arrow, CarbonData, Avro)

● Data lake table formats (Iceberg, Hudi, Delta Lake)

● Data processing (MaxCompute, Hive, MapReduce, Spark, Flink, Storm, Tez, Samza, Apex, Beam, Heron)

● OLAP (Hologres, StarRocks, Greenplum, Trino/Presto, Kylin, Impala, Druid, Elasticsearch, HAWQ, Lucene, Solr, Phoenix)

● Data collection (Flume, Filebeat, Logstash, Chukwa)

● Data exchange (Sqoop, Kettle, DataX, NiFi)

● Messaging systems (Pulsar, Kafka, RocketMQ, ActiveMQ, RabbitMQ)

● Task scheduling (Azkaban, Oozie, Airflow, Crontab, DolphinScheduler)

● Data security & governance (Ranger, Sentry, Atlas)

● Data lineage (OpenLineage, Egeria, Marquez, DataHub)

● Machine learning (PAI, Mahout, MADlib, Spark ML, TensorFlow, Keras, MXNet)


  • Other knowledge points easily overlooked in the multiple-choice questions

1. Which functions belong to each phase of data management?
Phase 1: data integration and interoperability, data storage and operations, data security, data modeling and design.
Phase 2: data architecture, data governance, metadata.
Phase 3: data governance, data warehousing and business intelligence, reference data and master data, document and content management.
Phase 4: big data analytics, data mining.
2. Steps of data architecture work: define the scope, understand the requirements, design, implement.
3. What is unstructured data: word processing documents, emails, social media, chat rooms, flat files, spreadsheets, XML files, transactional information, reports, graphics, digital images, microfilm, video, and audio. A large amount of unstructured data also exists in paper documents.


Reprinted from: blog.csdn.net/kuankuanerfei/article/details/131453378