[Data Development] Data full-stack knowledge architecture, data (platform, development, management, analysis)


1. Data full-stack knowledge architecture

Data development

  • Refers to the process of using programming and technical tools to process and manage data. It involves collecting data from different sources, cleaning, transforming and integrating the data, building and maintaining data pipelines, and storing the data in an appropriate data warehouse or database. Data development also includes writing and maintaining data processing scripts, jobs, and workflows to ensure data quality, consistency, and reliability.
  • The goal of data development is to provide data scientists, analysts, and business users with reliable and meaningful data to support business decisions and insight discovery. Data Developers work closely with data engineers, data scientists, and business teams to understand business needs, design and implement data solutions, and ensure data accuracy and completeness.
  • Data development usually involves the use of programming languages ​​(such as Python, SQL, etc.) and technical tools (such as ETL tools, data flow processing frameworks, etc.) for data processing and transformation. Data developers need to have knowledge of data modeling, database management, data warehouse design and data governance, and be familiar with data processing processes and best practices.
  • In summary, data development is the process of building and maintaining data pipelines designed to support data-driven decision-making and analysis efforts. It plays a critical role in modern data-driven organizations, helping to achieve data reliability, availability, and actionability.

Data developers need the following skills and knowledge:

  • Programming language and scripting skills: Data developers should be proficient in at least one programming language, such as Python or SQL, and be able to write efficient scripts and code to process and transform data.
  • Database and data warehouse: Understand the basic concepts of relational databases and non-relational databases, be familiar with SQL language, be able to design and manage database table structures, and perform efficient data queries and operations.
  • Data processing and transformation: Skilled in data cleaning, transformation and integration, and able to use ETL (Extract, Transform, Load) tools or programming languages for data processing, resolving data quality problems, restructuring data, and similar tasks.
  • Data model and data architecture: Understand the principles and methods of data modeling and data architecture design, and be able to design and optimize data models to ensure effective storage and retrieval of data.
  • Data flow and data pipeline: Familiar with data flow processing frameworks (such as Apache Kafka, Apache Flink, etc.) and workflow scheduling tools (such as Apache Airflow), able to build and manage data flows and data pipelines, and implement real-time data processing and batch processing tasks.
  • Data quality and data governance: Understand the concepts and practices of data quality management and data governance, and be able to formulate and implement data quality rules to ensure the accuracy, consistency and completeness of data.
  • Version control and team collaboration: Familiar with the use of version control systems (such as Git) and able to collaborate with team members to develop and maintain data processing code and workflows.
  • Business understanding and communication skills: Have the ability to understand and analyze business needs, and be able to communicate effectively with data scientists, analysts and business teams to understand their needs and provide corresponding data solutions.
1. Data methods (thinking, statistics, practice, North Star)

Data methods refer to a set of methodologies and techniques used in processing and analyzing data. It combines knowledge and tools from multiple fields, including thinking methods, statistical principles, practical experience and indicator systems, to help people better understand and utilize data.

Thinking methods: Data methods emphasize the application of scientific thinking and logical thinking. In data analysis, you need to have a clear problem awareness, the ability to formulate and verify hypotheses. Thinking methods also include systematic thinking and overall observation to help discover potential connections and patterns behind the data.

Statistics: Statistics is an important foundation of data methods. It provides a set of probabilistic and inferential tools for extracting meaningful information from data. Statistical methods include descriptive statistics, inferential statistics, and regression analysis, which can help reveal relationships between data, verify hypotheses, and make predictions and decisions.

Practical experience: Practical experience refers to the accumulation of experience in data analysis and problem solving through the actual application of data methods. Practical experience includes understanding of data quality, data preprocessing techniques, model selection and optimization practices, etc. Through practice, data analysts can better understand the characteristics and limitations of data and improve the accuracy and effectiveness of analysis.

North Star Metrics: A North Star Metric is the key indicator an organization or team uses to measure performance and evaluate goal achievement. Together with supporting key performance indicators (KPIs) and measurable targets, it helps organizations or individuals quantitatively track and evaluate business or work performance. North Star Metrics can be applied across a variety of areas, including marketing, sales, operations, and more, to measure and improve business performance.

Balanced Scorecard is a performance management tool used to measure the performance of an organization or individual in achieving business goals and strategic direction. It not only focuses on financial indicators, but also includes indicators on customers, internal business processes and learning and growth to provide a comprehensive performance evaluation system.

The Balanced Scorecard was first proposed by Robert Kaplan and David Norton in 1992 and has been widely used since. It is based on a core idea: focusing solely on financial indicators cannot fully reflect an organization's performance and potential. The Balanced Scorecard therefore provides organizations with a more comprehensive and balanced performance evaluation framework by measuring four different perspectives.

Here are the four perspectives of the Balanced Scorecard:

  • Financial Perspective: This dimension focuses on the financial performance of the organization , including revenue, profit, cash flow and other indicators. The financial dimension is often an important indicator for assessing the economic status and sustainable development of an organization.

  • Customer Perspective: The customer perspective focuses on the organization's performance in meeting customer needs and providing value . This includes metrics such as customer satisfaction, market share, customer retention, etc., which are designed to measure how well an organization meets customer expectations and builds customer relationships through its products or services.

  • Internal Business Process Perspective: This dimension focuses on the efficiency and quality of the organization's internal processes and operations. It involves key business processes such as production processes, supply chain management, customer service, etc. to ensure that the organization can deliver products or services efficiently.

  • Learning and Growth Perspective: The Learning and Growth Perspective focuses on the training, development and innovation capabilities of organizational employees. It includes indicators such as employee satisfaction, employee training investment, innovative projects, etc., and is designed to evaluate the organization's learning and growth capabilities to promote the organization's long-term development.

2. Data tools: data warehouse

Data Warehouse is a centralized data storage system used to store, manage and analyze large amounts of structured and unstructured data. It is a subject-oriented, integrated, stable, and queryable data collection used to support enterprise decision-making and data analysis.

The main goal of a data warehouse is to integrate data from different data sources into a unified data model so that users can easily perform data analysis and query. It transforms data from the source system into a form suitable for analysis and query through the data extraction, transformation and loading (ETL) process, and stores it in the data warehouse.

  • Subject-oriented: The data warehouse is organized around business subjects to support specific analytical needs. Subjects can be sales, customers, products, etc., allowing users to conduct in-depth analysis of specific business areas.

  • Integrated: The data warehouse integrates data from different data sources, including relational databases, operating system logs, sensor data, etc. By integrating data into a unified data model, issues of data dispersion and redundancy are eliminated.

  • Stability and reliability: Data warehouse is a stable and reliable data storage system used for long-term storage and management of data. It has high availability and data redundancy mechanisms to ensure data security and reliability.

  • Queryable: The data warehouse provides flexible and high-performance query capabilities to support various data analysis and reporting needs. By using query languages ​​(such as SQL) and analysis tools, users can extract the required information from the data warehouse.

  • Support decision-making: Data warehouse provides important data support for enterprise decision-making. By analyzing and mining data, users can discover potential business trends, patterns and correlations to make more accurate decisions.

The following are some commonly used and famous data warehouses:

  • Teradata: Teradata is a well-known data warehouse solutions provider. They provide a high-performance and scalable data warehouse platform for storing and analyzing large-scale data. Teradata features include parallel processing capabilities, high availability, flexible data models and rich analytical capabilities.

  • Snowflake: Snowflake is a cloud-native data warehouse solution that is highly elastic and flexible. It adopts distributed architecture and column storage technology, supports structured and semi-structured data, and provides high-performance query and expansion capabilities. Snowflake also provides global data replication and security capabilities.

  • Amazon Redshift: Amazon Redshift is a high-performance data warehouse service provided by Amazon AWS. It is based on column storage technology and parallel processing architecture and is suitable for processing large-scale data sets. Redshift has elastic expansion capabilities and can automatically adjust computing and storage resources according to demand.

  • Google BigQuery: Google BigQuery is a managed cloud data warehouse service provided by Google Cloud. It has fast query performance and powerful scalability, supporting large-scale data analysis and real-time queries. BigQuery also integrates machine learning and AI capabilities for data mining and model training.

  • Microsoft Azure Synapse Analytics: Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is an enterprise-grade data warehousing solution on the Microsoft Azure platform. It provides high-performance data storage and processing capabilities, supporting structured and unstructured data. Synapse Analytics also integrates data lake storage and machine learning capabilities.

  • Alibaba Cloud Data Warehouse (AnalyticDB): Alibaba Cloud Data Warehouse is a big data analysis and storage solution provided by Alibaba Cloud. It is based on distributed architecture and column storage technology, with high performance and scalability. Alibaba Cloud Data Warehouse supports PB-level data storage and real-time query, and is widely used in e-commerce, finance, logistics and other fields.

  • Tencent Cloud Data Warehouse (TencentDB for TDSQL): Tencent Cloud Data Warehouse is a data warehouse solution provided by Tencent Cloud. It provides distributed, highly available data storage and computing capabilities, supporting petabyte-level data processing and analysis. Tencent Cloud Data Warehouse is widely used in games, social media, advertising and other fields.

  • Huawei Cloud Data Warehouse (FusionInsight): Huawei Cloud Data Warehouse is a big data analysis platform provided by Huawei Cloud. It provides powerful data storage and analysis capabilities, supporting structured and semi-structured data. Huawei Cloud Data Warehouse is suitable for various industries, such as finance, manufacturing, telecommunications, etc.

  • JD Cloud DWS: JD Cloud DWS is a big data warehouse solution provided by JD Cloud. It is based on column storage and distributed computing technology, with high performance and elastic expansion capabilities. JD Cloud Data Warehouse is widely used in e-commerce, logistics, finance and other fields.

The following are some common methods and techniques for using data warehouses:

  • Data model design: Good data model design is the foundation of the data warehouse. When designing a data model, you need to consider business needs and analysis goals, organize the data structure reasonably, and establish appropriate associations and hierarchical relationships. Commonly used data models include the star schema, the snowflake schema, etc. Choosing an appropriate data model helps simplify queries and improve performance (a minimal star-schema sketch appears after this list).

  • Data cleaning and transformation : Before loading data into the data warehouse, a process of data cleaning and transformation is usually required. This includes handling data quality issues such as missing values, duplicate values, outliers, and normalizing, standardizing, and formatting the data. The purpose of data cleaning and transformation is to ensure the consistency and accuracy of data and improve the reliability of subsequent analysis.

  • Regular maintenance and updates : The data warehouse needs to be maintained and updated regularly to ensure the timeliness and accuracy of the data. This includes scheduled data extraction, transformation, and loading (ETL) processes, as well as data quality checks and corrections. Regularly updating the data warehouse also maintains the integrity and adaptability of the data model in response to changes in business needs.

  • Use appropriate query tools and techniques : Choosing appropriate query tools and techniques can improve the efficiency of data warehouse query and analysis. Common query tools include SQL query language and business intelligence tools (such as Tableau, Power BI, etc.), which provide intuitive interfaces and rich visualization functions. In addition, using methods such as query optimization techniques and indexes can speed up queries and improve performance.

  • Leverage the analytical capabilities of the data warehouse : The data warehouse is not only a place for data storage, but also provides rich analytical capabilities. Users can use data warehouses to perform data mining, statistical analysis, trend analysis, predictive modeling, etc. By applying appropriate analysis methods and algorithms, valuable information hidden in the data can be discovered and support decision-making and business optimization.

  • Data security and rights management : Data stored in a data warehouse may contain sensitive information, so data security and rights management are important considerations. Ensure that access to the data warehouse is restricted and appropriate security measures are taken, such as data encryption, access log monitoring, user rights management, etc., to protect the confidentiality and integrity of the data.
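
As referenced in the data model design item above, here is a minimal star-schema sketch in SQL. The table and column names are illustrative assumptions rather than anything from the original text: a fact table references two dimension tables, followed by a typical analytical query.

-- Dimension tables describe the "who/what/when" of the business
CREATE TABLE dim_product (
  product_id   INT PRIMARY KEY,
  product_name VARCHAR(100),
  category     VARCHAR(50)
);

CREATE TABLE dim_date (
  date_id   INT PRIMARY KEY,
  full_date DATE,
  year      INT,
  month     INT
);

-- The fact table stores measurable events and references the dimensions
CREATE TABLE fact_sales (
  sale_id    BIGINT PRIMARY KEY,
  product_id INT REFERENCES dim_product (product_id),
  date_id    INT REFERENCES dim_date (date_id),
  quantity   INT,
  amount     DECIMAL(12, 2)
);

-- Typical analytical query: monthly sales amount by product category
SELECT d.year, d.month, p.category, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.month, p.category;

A snowflake schema would further normalize the dimension tables (for example, splitting category into its own table), trading simpler storage for more joins at query time.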

3. Data specification

Data specifications in data development are a series of rules and standards defined to ensure the consistency, reliability and maintainability of data. Here are some common data specifications:

  • Naming convention: The naming convention defines how data objects (tables, columns, views, etc.) are named. This includes using meaningful and understandable names, following a consistent style (such as camel case or snake_case/underscore naming), avoiding reserved words and special characters, etc.

  • Data type specification: The data type specification defines the data type that each field should use, such as integer, floating point number, string, etc. Making sure you choose the right data type can save storage space and improve query performance.

  • Constraint specifications: Constraint specifications are used to define constraints on data objects, such as primary keys, unique keys, foreign keys, etc. These constraints can ensure the integrity and consistency of data and prevent data that does not comply with business rules from being inserted or modified.

  • Data format specification: Data format specification defines the storage format and display format of data, such as date and time format, currency format, numerical precision, etc. This helps maintain data consistency and ensures correct data processing and calculations.

  • Data dictionary specification: The data dictionary specification defines the metadata information of the data object, including field meaning, value range, business rules, etc. Data dictionary can help data developers and data users understand the meaning and purpose of data, and improve the understandability and maintainability of data.

  • Coding specifications: Coding specifications define coding standards and rules for data development. This includes code indentation, naming conventions, comment conventions, etc. to improve code readability, maintainability, and reusability.

  • Data quality specifications: Data quality specifications define the quality standards and inspection rules for data. This includes provisions for data completeness, accuracy, consistency, timeliness, etc. to ensure the high quality and credibility of the data.

  • Data security specifications: Data security specifications are used to define security standards and protection measures for data. This includes access rights management, data encryption, handling of sensitive information, etc. to ensure data confidentiality and security.
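
To make several of these specifications concrete, here is a small hypothetical DDL sketch (PostgreSQL-style COMMENT syntax; the table, columns, and rules are illustrative assumptions) that applies a naming convention, explicit data types, constraints, and data dictionary comments:

-- Naming convention: lowercase snake_case, no reserved words or special characters
CREATE TABLE ods_customer_order (
  order_id     BIGINT        NOT NULL,              -- data type spec: 64-bit integer key
  customer_id  BIGINT        NOT NULL,              -- constraint spec: foreign key below
  order_status VARCHAR(20)   NOT NULL,              -- data type spec: bounded string
  order_amount DECIMAL(12,2) NOT NULL DEFAULT 0,    -- data format spec: two decimal places
  created_at   TIMESTAMP     NOT NULL,              -- data format spec: standard timestamp
  PRIMARY KEY (order_id),                           -- constraint spec: primary key
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
);

-- Data dictionary spec: record the meaning and allowed values of each object
COMMENT ON TABLE ods_customer_order IS 'Raw customer orders, loaded daily from the order system';
COMMENT ON COLUMN ods_customer_order.order_status IS 'Allowed values: CREATED, PAID, SHIPPED, CANCELLED';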

2. Data analysis tools

1. Big data platform

Big data platform refers to a technology platform used to store, process and analyze large-scale data . Here are some common big data platforms:

  • Apache Hadoop: Hadoop is the most widely used open source big data platform and has been adopted by many companies. Vendors including Cloudera, Hortonworks, and MapR have provided commercial solutions based on Hadoop.

  • Apache Spark : Spark is a big data processing platform that has emerged rapidly in recent years and is widely used by many companies. Large technology companies such as Facebook, Netflix, Uber, etc. are using Spark for big data processing and analysis.

  • Apache Kafka : Kafka is a streaming processing platform that is very popular for real-time data transmission and processing. Many large Internet companies such as LinkedIn, Netflix, Uber, etc. are using Kafka as a data streaming platform.

  • Amazon Web Services (AWS): AWS provides a range of cloud computing services, including big data processing and analysis services. Its big data services include Amazon EMR, Amazon Redshift, Amazon Kinesis, etc., and are widely used by many companies.

  • Google Cloud Platform (GCP) : GCP also provides various big data processing and analysis services. Services such as Google BigQuery and Google Cloud Dataflow are widely used by many companies, including Spotify, HSBC, etc.

  • Microsoft Azure: Azure is a cloud computing platform provided by Microsoft, which also provides big data processing and analysis services. Services such as Azure HDInsight and Azure Data Lake Analytics are used by many companies, including Adobe, Walmart, etc.

  • Cloudera: Cloudera is a company that provides enterprise-level big data solutions based on Hadoop. Its products include Cloudera Distribution for Hadoop (CDH) and Cloudera Data Platform (CDP), which are used by many companies in the field of big data.

  • MapR: MapR is another company that provides enterprise-level big data solutions based on Hadoop. Its products include the MapR Data Platform, which is used by many companies for big data processing and analysis.

The following are big data platforms commonly used by some large companies in China:

  • It should be noted that large companies usually either custom-build their big data platforms around their own business needs and technology stacks or select appropriate open source solutions, so the specific platform and technology choices vary from company to company. In addition, some companies use multiple big data platforms to meet different needs.

  • Alibaba Group: Alibaba has its own big data platform, including MaxCompute (distributed data processing platform), AnalyticDB (large-scale distributed database), DataWorks (data integration and development platform), etc. Alibaba has also open sourced some big data-related technologies, such as Flink SQL, Blink, etc.

  • Tencent Group: Tencent’s big data platform includes TencentDB (large-scale distributed database), Tencent Data Warehouse (data warehouse), Tencent Cloud Data Lake (data lake), etc. Tencent also makes extensive use of open source big data technologies such as Hadoop and Spark.

  • TDW: Tencent distributed data warehouse (TDW) is built on the open source software Hadoop and Hive, and has been extensively optimized and adapted to the company's specific circumstances, such as large data volumes and complex computations. Currently the largest single cluster has reached 5,600 nodes and runs more than 1 million jobs per day, making it the company's largest offline data processing platform.

  • Baidu: Baidu has its own big data platform, including Baidu Data Warehouse (data warehouse), Baidu BigQuery (big data analysis platform), Baidu FusionInsight (big data processing and analysis platform), etc.

  • Bytedance: Bytedance uses self-developed data platforms in the field of big data, including DolphinDB (high-performance distributed data processing and analysis engine), Bytedance Data Platform (big data processing platform), etc.

  • Huawei Technologies Co., Ltd.: Huawei provides big data solutions such as FusionInsight HD (big data processing and analysis platform) and FusionInsight LibrA (data management and analysis platform).

  • Meituan-Dianping: Meituan-Dianping adopts its own data platform in the field of big data, including DolphinDB (high-performance distributed data processing and analysis engine), Meituan Cloud Data Center, etc.

  • JD Group: JD uses its own data platform in the field of big data, including JDP (big data platform), JDP Fusion (data processing platform), etc.

2. Data development: warehousing + computation (key points)

OLTP & OLAP

Online Analytical Processing (OLAP) systems and Online Transaction Processing (OLTP) systems are two different data processing systems designed for different purposes. OLAP is optimized for complex data analysis and reporting, and OLTP is optimized for transaction processing and real-time updates.

A simple way to remember the difference: the T in OLTP stands for transaction, and OLTP systems are tuned for many small, fast transactions, while OLAP systems run fewer but heavier analytical queries.


KV (key-value) storage is a common data storage method, where K represents the key and V represents the value.
DB storage refers to relational database storage, where a data model (schema) needs to be defined up front.

Common table types in a data warehouse: full table, incremental table, zipper table, flow table, and snapshot table.
Full table: holds the latest status of all data every day.
(1) The entire table is reported regardless of whether anything changed.
(2) Each load contains all data (changed + unchanged).
Incremental table: holds only new data, i.e. data added since the last export.
(1) Records the amount of each increase, not the total amount.
(2) Flow refers to the increment within a certain period of time.
(3) Flow data is generally designed as an incremental table (daily is most common; monthly reports also exist).
(4) The difference between flow and stock: flow is an increment; stock is a total amount.
(5) An incremental table only reports changes; unchanged data is not loaded.
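
As a hedged sketch of the difference (Hive-style SQL; table and partition names are illustrative assumptions), a full-table load rewrites the entire daily partition with the latest state, while an incremental load appends only the records that changed since the previous run:

-- Full table: overwrite today's partition with the complete latest state
INSERT OVERWRITE TABLE dw_customer_full PARTITION (dt = '2024-01-02')
SELECT customer_id, name, status, updated_at
FROM ods_customer;

-- Incremental table: append only the rows changed since the last load
INSERT INTO TABLE dw_customer_incr PARTITION (dt = '2024-01-02')
SELECT customer_id, name, status, updated_at
FROM ods_customer
WHERE updated_at >= '2024-01-02 00:00:00';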

3. Data management: dictionary + dimension tables (key points)

Data dictionary and dimension tables: table information, data information, task information, and data warehouse information.

A data dictionary typically contains the following information:

  • Data element name: Record the name of the data element, such as table name, column name, etc.
  • Data element definition: Describes the meaning and purpose of the data element, including business meaning and technical definition.
  • Data type: Specify the data type of the data element, such as integer, string, date, etc.
  • Length and precision: Specify the length and precision limits of data elements, such as the maximum length of character fields, the number of decimal places for decimal fields, etc.
  • Value range: Define the value range or allowed value list of data elements to ensure data validity and consistency.
  • Data format: Specify the format of data elements, such as date and time format, currency format, etc.
  • Constraints: Specify constraints on data elements, such as primary keys, unique keys, foreign keys, etc.
  • Association: Record the associations between data elements, such as relationships between tables, relationships between columns, etc.

The functions of the data dictionary include:

  • Data understanding and documentation : The data dictionary provides definitions and descriptions of data elements to help users understand the meaning and purpose of the data. It can be used as a data document to facilitate user review and reference.
  • Data management and maintenance : The data dictionary records the attributes and constraints of data elements and can be used for data management and maintenance. For example, data can be checked for completeness, consistency, and accuracy through a data dictionary.
  • Data development and data integration : The data dictionary provides data developers with the definitions and attributes of data elements, helping them use the correct data elements during data development and data integration processes.
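
Most relational databases expose part of this metadata through system catalogs, which can serve as a basic data dictionary. A hedged example, assuming an information_schema-compatible database (such as MySQL or PostgreSQL) and an illustrative schema name:

-- List every column in the sales_dw schema with its type, length and nullability
SELECT table_name, column_name, data_type, character_maximum_length, is_nullable
FROM information_schema.columns
WHERE table_schema = 'sales_dw'
ORDER BY table_name, ordinal_position;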

A dimension table (dim table) is a table in the data warehouse used to describe the dimensional attributes of fact data.
Dimension tables usually contain business-related information, such as product, customer, time and other dimension attributes. The design of dimension tables includes defining dimension attributes, hierarchical relationships, dimension relationships, etc. to support data analysis and report queries.

In the data dictionary, dimension table information can include:

  • Dimension table name: Record the name of the dimension table.
  • Dimension table fields: Describe the fields in the dimension table, including field names, data types, lengths, etc.
  • Dimension attributes: Record dimension attributes, such as product name, customer name, time, etc.
  • Hierarchical relationship: Define the hierarchical relationship between dimension attributes, such as product category, product subcategory, product name, etc.
  • Dimension relationship: Record the relationships between dimensions, such as the association between the product dimension and the customer dimension.

————————————————————————

List retrieval, data details and field descriptions, associated tasks
List retrieval (List Retrieval) refers to using query operations in data management systems or applications to obtain data sets that meet specific conditions and present them to users in list form . List searches are often used to find and display summary information of data so that users can quickly browse and filter results.

List search generally includes the following steps:

  1. Query condition definition: The user specifies the conditions for the data to be retrieved, which can be filter conditions based on one or more fields, such as date range, keywords, status, etc.
  2. Data query: The system performs query operations based on user-defined query conditions and retrieves data that meets the conditions.
  3. Data presentation: Query results are displayed to users in list form, usually including key fields or summary information for each piece of data, so that users can quickly browse and filter the results.
  4. Paging and sorting: If the query result data is large, paging is usually performed to divide the results into multiple pages for display. Additionally, users can specify sorting rules to sort results by specific fields.
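
A minimal SQL sketch of these steps, assuming a hypothetical customer_order table and MySQL/PostgreSQL-style LIMIT/OFFSET paging:

-- Steps 1-2: filter by user-defined conditions; step 4: sort and return page 3 (20 rows per page)
SELECT order_id, customer_name, order_status, order_amount, created_at
FROM customer_order
WHERE order_status = 'PAID'
  AND created_at >= '2024-01-01'
ORDER BY created_at DESC
LIMIT 20 OFFSET 40;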

Data Details and Field Descriptions refers to providing detailed information and attribute descriptions about specific data objects (such as tables, columns, fields, etc.). This information usually includes the definition, purpose, data type, length, constraints, etc. of the data object.

The functions of data details and field descriptions include:

  1. Data understanding and interpretation: Provide detailed information and attribute descriptions of data objects to help users understand the meaning and purpose of the data.
  2. Data Development and Data Usage: For both data developers and data consumers, data details and field descriptions provide necessary information about data objects in order to use and process the data correctly.
  3. Data document and metadata management: Data details and field descriptions can be used as part of the data document to record and manage metadata information of data objects to facilitate maintenance and sharing.

Join Tasks refer to the association of multiple data tables or data sets based on shared fields in data management and data analysis to obtain richer data information. Correlation tasks are often used to combine data from different sources or different dimensions for more comprehensive data analysis and report generation.

Association tasks generally include the following steps:

  1. Association field selection: Determine shared fields for association that have the same or similar values ​​in different data tables.
  2. Data association: Perform association operations to connect data tables or data sets based on the selected related fields, and generate a new data table or data set containing related data.
  3. Association type: Determine the association type based on the matching of the associated fields, such as inner join, left join, right join, etc.
  4. Result presentation: Present the associated data results, usually in the form of a new data table or data set to the user or for subsequent analysis.

Data lineage, partition information, data preview, data label annotation

Data Lineage refers to the information that tracks and records the source, transmission path and transformation process of data in data management and data analysis. It provides data traceability capabilities and helps users understand the generation, conversion and usage history of data, as well as the relationships and dependencies between data.

Data lineage can include the following information:

  1. Data source: The original source of recorded data, which can be a database table, file, external system, etc.
  2. Data transmission path: describes the transmission path of data between different systems, components or tasks, including the input, output and transfer process of data.
  3. Data transformation operations: record the transformation operations performed on the data, such as filtering, aggregation, calculation, etc., as well as the order and parameters of the transformation operations.
  4. Data usage: Track which tasks, analyses, or reports the data is used by, and the dependencies between them.

Through data lineage, users can gain a global view of data, track changes and flows of data, identify the root causes of data quality issues, and analyze the reliability and trustworthiness of data.

Partition Information refers to information that organizes and manages data according to specific partition strategies in a data storage system . Partition information can help improve the efficiency of data query and processing, as well as optimize the performance of data storage and access.

Partition information generally includes the following:

  1. Partition Key: A field or attribute used to divide data, such as date, region, product, etc. According to the value of the partition key, the data will be allocated to the corresponding partition.
  2. Partition type: Specify the data type of the partition key, such as integer, date, string, etc.
  3. Partition rules: Define how the data is divided and organized according to the partition key, which can be range partitioning, hash partitioning, list partitioning, etc.
  4. Number of partitions: Specify the number of partitions, which determines the distribution and storage structure of data in physical storage.

Through partition information, data can be divided and organized according to partition keys, so that queries and processing only need to access specific partitions, reducing the amount of data scanned and improving query efficiency and performance.
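
A hedged sketch in Hive/Spark-style SQL (table and partition names are illustrative assumptions): the table is partitioned by a date key, so a query that filters on that key only scans the matching partition:

-- Daily partitions keyed by dt
CREATE TABLE user_event (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
PARTITIONED BY (dt STRING);

-- Partition pruning: only the dt = '2024-01-02' partition is scanned
SELECT event_type, COUNT(*) AS cnt
FROM user_event
WHERE dt = '2024-01-02'
GROUP BY event_type;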

Data Preview refers to the function of viewing and previewing samples or summary information of data in data management systems or tools. Data preview allows users to quickly understand the structure, content and quality of data before performing specific queries or operations, so that they can make appropriate decisions and adjustments.

Data preview typically includes the following:

  1. Data sample: Displays partial records or samples of data so that users can understand the structure and content of the data.
  2. Field summary: Provides basic statistical information of data fields, such as field type, maximum value, minimum value, average value, etc.
  3. Data quality check: Check for missing values, duplicate values, outliers and other data quality issues in the data , and give corresponding prompts or warnings.

Data preview can help users quickly understand the characteristics and quality of data so that they can make appropriate adjustments and decisions during data processing and analysis.

Data Tagging and Labeling refers to adding labels or identifiers to data objects (such as tables, columns, records, etc.) in order to classify, organize and manage data. Data labeling can be performed based on business needs and data characteristics, making the data easier to search, index, and identify.

The method and content of data labeling can be defined according to specific needs, for example:

  1. Business tags: Add business-related tags to data objects, such as product categories, customer types, geographical locations, etc., to classify and organize data according to business.

  2. Data quality label: Mark the quality status of data objects, such as completeness, accuracy, consistency, etc., to facilitate data quality management and control.

  3. Security labels: Add security level or sensitivity labels to data objects for data permission control and protection.

  4. Association tags: Mark associations between data objects, such as primary and foreign key relationships, dependencies between data sets, etc., to facilitate data association and analysis.

Data tagging can help organize and manage data, make data easier to search and discover, speed up data location and access, and also provide support for data analysis, data governance and compliance requirements. Through tagging, data can be managed and utilized in a more fine-grained manner.

————————————————————————
Data dimension table management
Data classification, grouping, filtering

Data Dimension Table Management refers to the process of managing and maintaining data dimension tables in a data management system or data warehouse. Data dimension tables are tables that store dimensional information related to business , such as product dimensions, time dimensions, geographical dimensions, etc. Data dimension table management includes data import, update, cleaning and maintenance to ensure the accuracy and consistency of dimension table data.

The main tasks of data dimension table management include:

  1. Data import: Import dimension table data from the source system or external data source into the data management system, usually through the ETL (Extract, Transform, Load) process.
  2. Data update: Update dimension table data in a timely manner according to changes in business needs and dimension information to maintain the timeliness and accuracy of the data. Updates can be full updates or incremental updates.
  3. Data cleaning: Clean and repair dimension table data, deal with data quality issues such as duplicate values, missing values, and erroneous values ​​to ensure the consistency and integrity of dimension table data.
  4. Data maintenance: Monitor the changes and usage of dimension table data, perform backup, recovery and performance optimization of dimension table data to ensure data reliability and availability.
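
As a hedged sketch of the data update task (standard SQL MERGE, which many warehouses support; table names are illustrative assumptions), new and changed rows from a staging table are merged into the dimension table:

-- Upsert staged customer records into the customer dimension
MERGE INTO dim_customer AS d
USING stg_customer AS s
  ON d.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET customer_name = s.customer_name, city = s.city
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, city)
  VALUES (s.customer_id, s.customer_name, s.city);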

Data Classification refers to classifying and organizing data according to certain rules and standards in order to better manage and utilize data. Data classification can classify and classify data based on multiple dimensions, such as business fields, data types, security levels, etc.

The purposes of data classification include:

  1. Data organization: Organize data according to certain classification methods to make data easier to find, access and manage. For example, data can be divided into sales data, financial data, human resources data, etc. according to business areas.
  2. Data permission control: Classify and mark data according to the security level and sensitivity of the data to facilitate corresponding permission control and data protection. For example, classify data into public data, internal data, confidential data, etc.
  3. Data analysis and report generation: Classify and group data according to needs to facilitate data analysis and generate corresponding reports. For example, classify sales data by product category to generate product sales reports.

Data classification can be defined according to specific needs and business rules. By classifying and organizing data, the management efficiency and utilization value of data can be improved.

Grouping refers to grouping data according to the values ​​of a certain field or multiple fields in the process of data analysis and query, in order to perform aggregate calculations or statistical analysis. Grouping operations are commonly used in scenarios such as data reports, data summaries, and data visualization.

Grouping operations can help users conduct more in-depth analysis and understanding of data, provide summary information and statistical results of data, and support decision-making and the generation of insights.

Filtering refers to selecting data records that meet the conditions from the data set based on specific conditions or rules, and filtering out the data that meets the conditions. Filtering operations are commonly used in scenarios such as data query, data analysis, and data processing.

Filtering operations can help users obtain specific data subsets as needed, filter out irrelevant or unqualified data, and provide more precise and targeted data analysis and processing capabilities.
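
A short SQL sketch combining both operations (assuming an illustrative customer_order table with a category column): WHERE filters rows before grouping, GROUP BY aggregates them, and HAVING filters the aggregated groups:

SELECT category, COUNT(*) AS order_count, SUM(order_amount) AS total_amount
FROM customer_order
WHERE order_status = 'PAID'          -- filtering: keep only paid orders
GROUP BY category                    -- grouping: one result row per category
HAVING SUM(order_amount) > 10000;    -- filtering applied after aggregation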

4. Data analysis: reports

Data analysis refers to the process of collecting, cleaning, transforming and interpreting data to extract useful information, insights and patterns to support decision making and problem solving. Data analysis can be applied to various fields and industries to help people better understand data, discover trends, identify correlations, and make effective data-based decisions.

A report is a form of data analysis, which is the result of summarizing, organizing and presenting data. Reports are usually presented in tables, charts, graphs, or other visual formats so that users can understand and interpret the data more intuitively.

The characteristics and purposes of data analysis reports include:

  1. Summary and summary: Reports summarize and summarize large amounts of data to provide an overview and core indicators of the data. For example, a sales report can display indicators such as total sales, sales volume, and average sales price.
  2. Visual display: Reports present data through visual methods such as charts and graphs, making the data easier to understand and compare. Common report chart types include bar charts, line charts, pie charts, etc.
  3. Trend analysis: Reports can display the time trend of data and help users discover cyclical, seasonal or long-term trends. Trend analysis can be presented through line charts, area charts, etc.
  4. Comparison and correlation: Reports can compare and correlate data from different dimensions or different data sets to reveal the relationships and differences between them. For example, a market share report can compare the sales share of different products.
  5. Decision support: Reports provide data visualization and integration, helping decision makers understand the data more accurately and make decisions based on the data. Reports can provide important guidance and basis for decision makers.

——————

When creating a data analysis report, the following steps are often involved:

  1. Data preparation: Collect, clean and organize the data to be analyzed to ensure the accuracy and completeness of the data.
  2. Report design: Determine the goals, audience and content of the report, and select appropriate chart types and presentation methods.
  3. Data summary and calculation: Summarize, calculate and aggregate data to generate indicators and values ​​in reports.
  4. Visual display: Use charts, graphs, etc. to visually present data to increase the readability and understandability of reports.
  5. Interpretation and analysis: Interpret, analyze and gain insight into the data in the report, and provide understanding and insights into the data.
  6. Report Publishing and Sharing: Share reports to relevant stakeholders so they can access and use them.

3. Data language (DDL, DML)

In database management systems, DDL (Data Definition Language) and DML (Data Manipulation Language) are two common data languages ​​used to define and operate data and structures in the database.

  • DDL (Data Definition Language) is used to define and manage database structures and objects, including tables, indexes, views, constraints, etc. Common DDL commands include:
    • CREATE: used to create database objects, such as tables, views, indexes, etc.
    • ALTER: used to modify the structure of database objects, such as changing a table structure, adding columns, modifying constraints, etc.
    • DROP: used to delete database objects, such as tables, views, indexes, etc.
    • TRUNCATE: used to quickly delete all data in a table while retaining the table structure.
    • COMMENT: used to add comments or descriptions to database objects.

  • DML (Data Manipulation Language) is used to operate on the data in the database, including inserting, querying, updating and deleting data. Common DML commands include:
    • SELECT: used to query data in the database and return a result set.
    • INSERT: used to insert new data into a database table.
    • UPDATE: used to update data in a database table.
    • DELETE: used to delete data from a database table.

In addition to DDL and DML, there are other data languages, such as DCL (Data Control Language) and TCL (Transaction Control Language):

  • DCL (Data Control Language) is used to define and manage database security and permissions , including authorizing user access permissions, revoking permissions, etc. Common DCL commands include GRANT and REVOKE.

  • TCL (Transaction Control Language) is used to manage database transactions , including controlling transaction submission and rollback. Common TCL commands include COMMIT and ROLLBACK.

These data languages ​​provide the ability to define, operate and manage databases. When using a database, you can use DDL to define the database structure, use DML to operate data in the database, use DCL to manage the security and permissions of the database, and use TCL to manage the consistency and concurrency control of database transactions.
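
DDL and DML are illustrated in the next section; as a brief hedged sketch of DCL and TCL (PostgreSQL-style syntax, with illustrative role, table, and column names):

-- DCL: grant and revoke privileges on a table
GRANT SELECT, INSERT ON customer_order TO analyst_role;
REVOKE INSERT ON customer_order FROM analyst_role;

-- TCL: group changes into one transaction and commit (or roll back) atomically
BEGIN;
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;
COMMIT;   -- use ROLLBACK; instead to undo both updates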

SQL commands (key points)

SQL (Structured Query Language) is a standardized language for interacting with relational databases. The following is the basic syntax of SQL and common keywords and statements:

  1. Create table:
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  column3 datatype,
  ...
);
  2. Insert data:
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
  3. Query data:
SELECT column1, column2, ...
FROM table_name
WHERE condition;
  4. Update data:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
  5. Delete data:
DELETE FROM table_name
WHERE condition;
  6. Filter and sort query data:
SELECT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column1 ASC/DESC;
  7. Use aggregate functions:
SELECT aggregate_function(column) AS alias
FROM table_name
GROUP BY column;

Common aggregate functions include COUNT, SUM, AVG, MIN, and MAX.

  8. Join multiple tables:
SELECT column1, column2, ...
FROM table1
JOIN table2 ON table1.column = table2.column;

Common join types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

  9. Use subqueries:
SELECT column1, column2, ...
FROM table_name
WHERE column IN (SELECT column FROM another_table WHERE condition);
  10. Create index:
CREATE INDEX index_name
ON table_name (column1, column2, ...);

The above is the basic syntax and common keywords and statements of SQL. The SQL language is very flexible and has many advanced features and syntax for complex queries, data operations, and database management.
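
One example of such an advanced feature is the window function, shown here as a hedged sketch against the illustrative customer_order table used earlier:

-- Rank each customer's orders by amount without collapsing rows
SELECT customer_id, order_id, order_amount,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_amount DESC) AS amount_rank
FROM customer_order;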

SQL table join (supplement)

When using SQL for table joins, you can use different join types (INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN) to handle the relationships between tables. Below are definitions and specific examples of each join type:

  1. Inner join (INNER JOIN): intersection

    • Definition: Inner join returns the rows in two tables that meet the join conditions, that is, only returns the intersection part of the data in the two tables.
    • Example:
    SELECT Orders.OrderID, Customers.CustomerName
    FROM Orders
    INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;

    In the example above, the inner join combines the Orders table and the Customers table on the condition that the CustomerID columns in the two tables are equal. The result returns the OrderID and CustomerName columns for the rows that satisfy the join condition.
    
  2. Left join (LEFT JOIN): left table + intersection

    • Definition: A left join returns all rows in the left table and rows in the right table that meet the join conditions. If there are no matching rows in the right table, the right column is displayed as NULL.
    • Example:
    SELECT Customers.CustomerName, Orders.OrderID
    FROM Customers
    LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

    In the example above, the left join combines the Customers table and the Orders table on the condition that the CustomerID columns in the two tables are equal. The result returns all CustomerName values and the corresponding OrderID; if a CustomerID has no matching row in the Orders table, the OrderID column is shown as NULL.
  3. Right join (RIGHT JOIN): right table + intersection

    • Definition: Right join returns all rows in the right table and rows in the left table that meet the join conditions. If there are no matching rows in the left table, the left column is displayed as NULL.
    • Example:
    SELECT Customers.CustomerName, Orders.OrderID
    FROM Customers
    RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

    In the example above, the right join combines the Customers table and the Orders table on the condition that the CustomerID columns in the two tables are equal. The result returns all OrderID values and the corresponding CustomerName; if a CustomerID has no matching row in the Customers table, the CustomerName column is shown as NULL.
  4. Full join (FULL JOIN): union

    • Definition: Full join returns all rows in the left table and right table. If there are no matching rows in the left table or right table, the corresponding column is displayed as NULL.
    • Example:
    SELECT Customers.CustomerName, Orders.OrderID
    FROM Customers
    FULL JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

    In the example above, the full join combines the Customers table and the Orders table on the condition that the CustomerID columns in the two tables are equal. The result returns all CustomerName and OrderID values; if a CustomerID has no matching row in the Customers table or the Orders table, the corresponding column is shown as NULL.
  5. Self Join:
    Syntax: SELECT columns FROM table1 JOIN table2 ON condition
    Matching conditions: Treat two aliases in the same table as different tables and match according to the connection conditions.
    Result: Used to compare relationships between different rows in the same table.
    Note: Self-joining requires the use of different table aliases to distinguish the two tables.
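
The self-join item above has no example; here is a hedged sketch using a hypothetical Employees table in which each row stores its manager's EmployeeID:

SELECT e.EmployeeName AS Employee, m.EmployeeName AS Manager
FROM Employees e
JOIN Employees m ON e.ManagerID = m.EmployeeID;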


The above are the definitions and examples of the various table join types. Depending on the relationships in the data and the query requirements, choosing the appropriate join type yields the required data combination and association operations.

In SQL, the default table connection type is INNER JOIN.

When multiple tables are used in a query and the connection type is not explicitly specified, SQL will use inner joins by default to associate the tables. Inner joins only return rows that meet the join conditions, that is, the intersection of the two tables.

For example, the query in the following example uses multiple tables but does not specify a join type:

SELECT Orders.OrderID, Customers.CustomerName
FROM Orders, Customers
WHERE Orders.CustomerID = Customers.CustomerID;

In this case, the default table join type is an inner join. The connection condition is that the CustomerID column of the Orders table and the CustomerID column of the Customers table are equal. The OrderID and CustomerName columns are returned only if the CustomerID in both tables matches.

It should be noted that using implicit inner join syntax may cause the query to be less readable and maintainable. Therefore, it is recommended to explicitly use the JOIN keyword to specify the connection type to enhance query understandability and maintainability.

Python (Spark, etc.)

Python has a wide range of data development libraries. Here are some common Python data development libraries:

  1. NumPy: NumPy is the most basic and commonly used numerical calculation library in Python. It provides efficient multi-dimensional array objects and broadcast functions, as well as a rich mathematical function library, and is a basic library for scientific computing and data analysis.

  2. pandas: pandas is a powerful data analysis and data processing library. It provides high-performance, flexible data structures (such as DataFrame and Series), as well as various data operation and processing functions, including data cleaning, data conversion, data filtering, data statistics, etc.

  3. Matplotlib: Matplotlib is a library for drawing high-quality charts and visualizations. It provides a wide range of drawing functions, including line graphs, scatter plots, bar charts, pie charts, etc., which can be used to explore data, display analysis results, and generate reports.

  4. seaborn: seaborn is a statistical data visualization library based on Matplotlib. It provides higher-level, more beautiful chart styles and drawing interfaces, and can easily create various statistical charts, such as heat maps, box plots, density plots, etc.

  5. scikit-learn: scikit-learn is a widely used machine learning library that provides a rich set of machine learning algorithms and tools. It supports tasks such as data preprocessing, feature engineering, model training and evaluation, and is an important tool for machine learning and data mining.

  6. TensorFlow and PyTorch: TensorFlow and PyTorch are two popular deep learning frameworks used for building and training neural network models. They provide flexible computational graph construction and automatic differentiation functions to support various deep learning tasks such as image classification, natural language processing, etc.

  7. SQLAlchemy: SQLAlchemy is a powerful SQL toolkit that provides access and operation interfaces to relational databases. It supports a variety of database backends and can perform operations such as SQL queries, data writing, and data management through Python code.

  8. PySpark: PySpark is an integrated library of Python and Apache Spark for large-scale data processing and distributed computing. It provides high-level APIs and tools to support tasks such as distributed data processing, machine learning, and graph computing.

These libraries cover multiple fields such as data analysis, visualization, machine learning, and big data processing, providing Python developers with a wealth of tools and functions to support various data-related development and application scenarios.

PySpark is an integrated library of the Python programming language and Apache Spark for large-scale data processing and distributed computing. Apache Spark is a fast, general-purpose and scalable open source cluster computing system that can process large-scale data sets and support complex data analysis and machine learning tasks.

PySpark provides complete integration with Spark's core functionality, enabling Python developers to leverage Python's simplicity and ease of use for large-scale data processing. Here are some features and functions of PySpark:

  1. Distributed data processing: PySpark is based on Spark's distributed computing engine and can process large-scale data sets. It supports parallel processing of data on the cluster, making full use of the computing resources of the cluster and improving data processing efficiency.
  2. High performance: PySpark utilizes Spark's in-memory computing and RDD (Resilient Distributed Dataset)-based data model to achieve high-performance data processing. It can cache data in memory and perform iterative calculations, greatly speeding up data processing.
  3. Data abstraction and manipulation: PySpark provides rich data abstraction and manipulation interfaces, including DataFrame and SQL queries. DataFrame is a data structure similar to a relational database table. It supports SQL-like query, filtering, aggregation and connection operations, making data processing more convenient and intuitive.
  4. Machine learning support: PySpark has a built-in machine learning library (MLlib), which provides common machine learning algorithms and tools. It supports machine learning tasks such as feature extraction, model training, model evaluation and prediction, and can handle large-scale machine learning data sets.
  5. Streaming processing: PySpark supports streaming data processing, which can process data streams in real time and perform streaming calculations. It integrates Spark Streaming and Structured Streaming to receive data streams from multiple data sources and process and analyze them in a micro-batch manner.
  6. Data source support: PySpark supports read and write operations from multiple data sources, including HDFS, Hive, JSON, CSV, Parquet, etc. It can read data from different types of data sources and write the processing results back to the data source.
  7. Extensibility and ecosystem: PySpark has good extensibility and can integrate third-party Python libraries and tools. In addition, the Spark ecosystem provides a wealth of tools and libraries, such as Spark SQL, Spark Streaming, and GraphX, which can be seamlessly integrated with PySpark to expand data processing and analysis functions.

——————————————————

When using PySpark for data processing and analysis, the following are some commonly used functions and their code implementation and explanation:

  1. Read data:
from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.getOrCreate()

# Read data from a CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

Explanation: The code above creates a SparkSession object via SparkSession.builder.getOrCreate() and uses the read.csv function to read a CSV file. header=True indicates that the first line contains the column names, and inferSchema=True makes Spark infer each column's data type automatically.

  2. Data preview:
# Show the first few rows of the DataFrame
df.show(5)

# List the column names of the DataFrame
df.columns

# Get the number of rows and columns of the DataFrame
df.count(), len(df.columns)

Explanation: In the code above, the show function displays the first few rows of the DataFrame (20 rows by default). The columns property returns a list of the DataFrame's column names. The count() function returns the number of rows, and len(df.columns) returns the number of columns.

  3. Data screening and filtering:
# Select rows that satisfy a condition
filtered_df = df.filter(df["age"] > 30)

# Filter on multiple conditions
filtered_df = df.filter((df["age"] > 30) & (df["gender"] == "Male"))

Explanation: In the code above, the filter function selects the rows that satisfy the given conditions. Conditions are built from column references and operators such as greater than (>) or equal to (==), and multiple conditions can be combined with logical operators such as AND (&).

  4. Data aggregation and statistics:
from pyspark.sql import functions as F

# Compute the average of a column
avg_age = df.select(F.avg("age")).first()[0]

# Group by a column and count rows per group
grouped_df = df.groupBy("gender").count()

# Group by a column and compute the average per group
grouped_df = df.groupBy("gender").agg(F.avg("age"))

Explanation: In the code above, the avg function computes the average of a column. The groupBy function groups the data by a column; you can then use the count function to count the rows in each group, or the agg function to compute other statistics per group.

  5. Data sorting:
# Sort by a column in ascending order
sorted_df = df.orderBy("age")

# Sort by multiple columns
sorted_df = df.orderBy(["age", "salary"], ascending=[False, True])

Explanation: In the code above, the orderBy function sorts by one or more columns, in ascending order by default. The ascending parameter specifies ascending or descending order for each column.

  6. Data writing:
# Write the DataFrame to a CSV file
df.write.csv("output.csv", header=True)

# Write the DataFrame to a database table
df.write.format("jdbc").option("url", "jdbc:mysql://localhost/mydatabase") \
    .option("dbtable", "mytable").option("user", "username") \
    .option("password", "password").save()

Explanation: In the code above, write.csv writes the DataFrame to a CSV file; you can specify whether to include the column names. write.format("jdbc") writes the DataFrame to a database table and requires the JDBC connection URL, table name, and authentication information.

——————————————————

In addition to Spark, Python also has some other big data development libraries. Here are some commonly used libraries:

  1. Dask: Dask is a flexible parallel computing library that can perform large-scale data processing in a single machine or distributed environment. It provides a Pandas-like API and supports parallelization and distributed computing, and can handle data sets that exceed the memory limits of a single machine.
  2. Hadoop: Hadoop is an open source distributed computing framework for processing large-scale data sets. Python provides libraries that integrate with Hadoop, such as Hadoop Streaming, which allow MapReduce jobs to be written in Python and run on a Hadoop cluster.
  3. Apache Kafka: Kafka is a high-throughput distributed messaging system for processing real-time streaming data. Python provides a Kafka client library. You can use Python to write producers and consumers to realize the transmission and processing of stream data.
  4. Apache Flink: Flink is a distributed stream processing and batch processing framework for real-time and offline data processing. Python provides Flink's Python API. You can use Python to write Flink jobs and perform distributed computing on the Flink cluster.
  5. Apache Airflow: Airflow is a programmable, schedulable, and monitorable workflow management platform for building and scheduling data pipelines. Python is the main programming language of Airflow. You can use Python to write workflow tasks and scheduling logic.
  6. Apache Storm: Storm is a distributed real-time computing system for processing high-speed data streams. Python provides Storm's Python API. You can use Python to write Storm topology and implement real-time streaming data processing and analysis.


Origin blog.csdn.net/qq_33957603/article/details/133212610