What is big data and why is it so important?


Big data is a combination of structured, semi-structured, and unstructured data collected by an organization. This data can be mined for information and used in machine learning projects, predictive modeling, and other advanced analytics applications.

Systems that process and store big data have become a common component of organizations' data management architectures. Big data is often characterized by the 3Vs: the large Volume of data in many environments, the wide Variety of data types stored in big data systems, and the Velocity at which the data is generated, collected, and processed. Analyst Doug Laney of Meta Group Inc. first identified these characteristics in 2001; Gartner further popularized them after it acquired Meta Group in 2005. More recently, several other Vs have been added to descriptions of big data, including veracity, value, and variability.
Although big data doesn't equate to any specific volume of data, big data deployments often involve terabytes (TB) or petabytes (PB) of data, and even exabytes (EB) of data captured over time.

The importance of big data

Companies use the big data accumulated in their systems to improve operations, provide better customer service, create personalized marketing campaigns based on specific customer preferences and, ultimately, increase profitability. Businesses that use big data hold a potential competitive advantage over those that don't, because they can make faster and more informed business decisions, as long as they use the data effectively.

For example, big data can provide companies with valuable insights into their customers, which can be used to refine marketing activities and techniques in order to increase customer engagement and conversion rates.

In addition, using big data enables companies to become increasingly customer-centric. Historical and real-time data can be used to assess consumers' evolving preferences, allowing companies to update and improve their marketing strategies and become more responsive to customer desires and needs.

Big data is also used by medical researchers to identify disease risk factors and by doctors to help diagnose illnesses and conditions in individual patients. In addition, data derived from electronic health records (EHRs), social media, the web, and other sources gives health agencies up-to-date information on infectious disease threats or outbreaks. In the energy industry, big data helps oil and gas companies identify potential drilling locations and monitor pipeline operations; likewise, utilities use it to track the electrical grid. Financial services firms use big data systems for risk management and real-time analysis of market data. Manufacturers and transportation companies rely on big data to manage their supply chains and optimize delivery routes. Other government uses include emergency response, crime prevention, and smart city initiatives.

Examples of big data

Big data comes from myriad sources, such as business transaction systems, customer databases, medical records, internet clickstream logs, mobile applications, social networks, scientific research repositories, machine-generated data, and real-time data from sensors in internet of things (IoT) environments. The data may be left in its raw form in the big data system or preprocessed using data mining tools or data preparation software so it's ready for particular analytics uses.

Taking customer data as an example, the different branches of analytics that can be done with the information in big data sets include the following:

Comparative analysis. This includes examining user behavior metrics and observing real-time customer engagement in order to compare a company's products, services, and brand authority with those of its competitors.
Social media listening. This is information about what people are saying on social media about a specific company or product, which goes beyond what can be conveyed in a poll or survey. This data can also help identify target audiences for marketing campaigns by observing the activity surrounding specific topics across different sources.
Marketing analysis. This includes information that can be used to make the promotion of new products, services, and initiatives more informed and innovative.
Customer satisfaction and sentiment analysis. All of the information gathered can reveal how customers feel about a company or brand, whether any potential problems may arise, how brand loyalty might be preserved, and how customer service efforts might be improved.
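As a rough illustration of the sentiment-analysis item above, here is a minimal keyword-based scoring sketch. The word lists and sample reviews are made up for illustration; real sentiment analysis would use a proper lexicon or a trained model.

```python
# Minimal keyword-based sentiment scoring over customer feedback.
# POSITIVE/NEGATIVE word sets and the sample reviews are illustrative only.
POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "poor", "refund", "disappointed"}

def sentiment_score(text: str) -> int:
    """Return positive-minus-negative keyword count for one review."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great product, fast delivery",
    "Broken on arrival, want a refund",
    "Helpful support team, excellent service",
]
scores = [sentiment_score(r) for r in reviews]  # one score per review
```

Aggregating such scores over millions of reviews is where the "big" part comes in; the scoring logic itself stays simple.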

Break down the characteristics of big data

Volume is the most commonly cited characteristic of big data. A big data environment doesn't have to contain a huge amount of data, but most do because of the nature of the data being collected and stored in them. Clickstreams, system logs, and stream processing systems are among the sources that typically produce massive volumes of data on an ongoing basis.

Big data also includes a variety of data types, including:

Structured data in databases and data warehouses based on Structured Query Language (SQL);
unstructured data, such as text and document files held in Hadoop clusters or NoSQL database systems; and
semi-structured data, such as web server logs or streaming data from sensors.
All of these data types can be stored together in a data lake, which is typically based on Hadoop or a cloud object storage service. In addition, big data applications often include multiple data sources that might not otherwise be integrated. For example, a big data analytics project may attempt to gauge a product's success and future sales by correlating past sales data, return data, and online buyer review data for that product.
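A correlation of that kind can be sketched as a simple join across sources keyed by product ID. The sample records and the 10%/3-star thresholds below are assumptions for illustration, not figures from the article.

```python
# Sketch: correlating three data sources by product ID to gauge product success.
# The sample records and flagging thresholds are made up for illustration.
sales = {"P1": 500, "P2": 120, "P3": 340}       # units sold
returns = {"P1": 25, "P2": 60, "P3": 10}        # units returned
avg_review = {"P1": 4.2, "P2": 2.1, "P3": 4.6}  # mean star rating

def product_health(pid: str) -> dict:
    """Combine per-product metrics from the three sources."""
    return {
        "product": pid,
        "return_rate": returns[pid] / sales[pid],
        "avg_review": avg_review[pid],
    }

report = [product_health(p) for p in sales]
# Flag products whose return rate exceeds 10% or whose rating is below 3 stars.
flagged = [r["product"] for r in report
           if r["return_rate"] > 0.10 or r["avg_review"] < 3.0]
```

At big data scale the same join would run as a distributed query over a data lake rather than over in-memory dictionaries.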

Velocity refers to the speed at which big data is generated and must be processed and analyzed. In many cases, sets of big data are updated on a real-time or near-real-time basis, instead of the daily, weekly, or monthly updates made in many traditional data warehouses. Big data analytics applications ingest, correlate, and analyze the incoming data and then render an answer or result based on an overarching query. This means data scientists and other data analysts must have a detailed understanding of the available data and possess some sense of what answers they're looking for, to make sure the information they get is valid and up to date.
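The shift from batch updates to streaming can be sketched as aggregating over a window of recent events instead of a daily dump. This is a toy in-memory version; the event names and the count-based window are assumptions for illustration, whereas production systems use time-based windows over platforms such as Kafka or Spark Streaming.

```python
# Sketch of near-real-time ingestion: events arrive as a stream and are
# aggregated over a small window of recent events rather than a daily batch.
from collections import deque

class WindowedCounter:
    """Counts events per key over the last `size` events (a simple window)."""
    def __init__(self, size: int):
        self.window = deque(maxlen=size)  # old events fall off automatically

    def ingest(self, key: str) -> None:
        self.window.append(key)

    def counts(self) -> dict:
        out = {}
        for k in self.window:
            out[k] = out.get(k, 0) + 1
        return out

stream = ["click", "view", "click", "click", "view", "buy"]
counter = WindowedCounter(size=4)  # keep only the 4 most recent events
for event in stream:
    counter.ingest(event)
```

Because the window slides as events arrive, the counts always reflect the freshest data, which is the essence of velocity.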

Managing data velocity is also important as big data analysis expands into fields such as machine learning and artificial intelligence (AI), where analytical processes automatically find patterns in collected data and use them to generate insights.

More features of big data

Looking beyond the original 3Vs, data veracity refers to the degree of certainty in data sets. Uncertain raw data collected from multiple sources, such as social media platforms and web pages, can cause serious data quality issues that may be difficult to pinpoint. For example, a company that collects data sets from hundreds of sources may be able to identify inaccurate data, but its analysts need data lineage information to trace where the data is stored so they can correct the issues.

Bad data leads to inaccurate analysis and may undermine the value of business analytics, because it can cause executives to mistrust the data as a whole. The amount of uncertain data in an organization must be accounted for before it's used in big data analytics applications. IT and analytics teams also need to ensure that they have enough accurate data available to produce valid results.
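Accounting for uncertain data usually starts with simple validity screening. The field names (`customer_id`, `age`) and plausibility rules below are assumptions for illustration; real pipelines would apply schema- or rule-based validation at much larger scale.

```python
# Sketch: screening raw records for veracity problems before analysis.
# Field names and validity rules are illustrative assumptions.
def validate(record: dict) -> list:
    """Return a list of quality problems found in one record."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        problems.append("implausible age")
    return problems

raw = [
    {"customer_id": "C1", "age": 34},
    {"customer_id": "", "age": 29},     # missing ID
    {"customer_id": "C3", "age": 214},  # implausible value
]
clean = [r for r in raw if not validate(r)]  # keep only problem-free records
rejected = len(raw) - len(clean)
```

Tracking the rejected count over time gives teams a rough measure of how much uncertain data their sources produce.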

Some data scientists also add value to the list of big data characteristics. As explained above, not all collected data has real business value, and the use of inaccurate data can weaken the insights provided by analytics applications. It's critical for organizations to employ practices such as data cleansing, and to confirm that the data relates to relevant business issues, before they use it in a big data analytics project.

Variability also often applies to sets of big data, which are less consistent than conventional transaction data, may have multiple meanings, or may be formatted differently from one data source to another; these factors further complicate efforts to process and analyze the data. Some people ascribe even more characteristics to big data; data scientists and consultants have created various lists with between seven and 10 of them.

How big data is stored and processed

The need to handle big data at high velocity places unique demands on the underlying computing infrastructure. The computing power required to quickly process huge volumes and varieties of data can overwhelm a single server or server cluster. Organizations must apply adequate processing capacity to big data tasks in order to achieve the required speed. This can potentially demand hundreds or thousands of servers that distribute the processing work and operate collaboratively in a clustered architecture, often based on technologies such as Hadoop and Apache Spark.
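The divide-and-recombine pattern that Hadoop and Spark scale across such clusters can be sketched in miniature. In this toy map-reduce word count, the "cluster" is just a list of text partitions processed and merged locally; the sample text is made up for illustration.

```python
# A toy map-reduce word count: each partition is mapped to partial counts
# independently (what a cluster node would do), then the partials are reduced
# into one result. Hadoop/Spark run the same pattern across many machines.
from collections import Counter
from functools import reduce

partitions = [
    "big data needs big storage",
    "big clusters process data",
]

def map_phase(text: str) -> Counter:
    """Count words in one partition (runs independently per node)."""
    return Counter(text.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    """Merge two partial counts (the shuffle/reduce step)."""
    return a + b

totals = reduce(reduce_phase, map(map_phase, partitions))
```

Because the map phase has no shared state, partitions can be processed on as many servers as needed; only the merge requires coordination, which is what makes the pattern scale.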

Achieving that kind of speed in a cost-effective manner is also a challenge. Many business leaders are reluctant to invest in an extensive server and storage infrastructure to support big data workloads, particularly ones that don't run 24/7. As a result, public cloud computing is now a primary vehicle for hosting big data systems. A public cloud provider can store vast amounts of data and scale up the required number of servers just long enough to complete a big data analytics project. The business pays only for the storage and compute time it actually uses, and the cloud instances can be turned off until they're needed again.

To improve service levels even further, public cloud providers offer managed big data services, including the following:

Alibaba Cloud and Tencent Cloud big data services;
Amazon EMR (formerly Elastic MapReduce); and
Microsoft Azure HDInsight.

In cloud environments, big data can be stored in the following:

Hadoop Distributed File System (HDFS);
low-cost cloud object storage, such as Amazon Simple Storage Service (S3);
NoSQL databases; and
relational databases.

For organizations that want to deploy big data systems on premises, commonly used Apache open source technologies, in addition to Hadoop and Spark, include the following:

YARN, Hadoop's built-in resource manager and job scheduler; the name stands for Yet Another Resource Negotiator, but the technology is usually referred to by the acronym alone;
MapReduce, a programming framework that is also a core component of Hadoop;
Kafka, an application-to-application messaging and data streaming platform;
the HBase database; and
SQL-on-Hadoop query engines, such as Drill, Hive, Impala, and Presto.
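SQL-on-Hadoop engines such as Hive and Presto let analysts use standard SQL against distributed data. As a small stand-in, the sketch below runs the same kind of aggregate query against an in-memory SQLite database; the table, columns, and sample rows are assumptions for illustration, not part of any Hadoop API.

```python
# Stand-in for a SQL-on-Hadoop query: the same GROUP BY aggregate an engine
# like Hive or Presto would run over a distributed table, executed here
# against an in-memory SQLite database. Schema and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, user_id TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", [
    ("/home", "u1"), ("/home", "u2"), ("/pricing", "u1"),
])
# Count hits per page, busiest first.
rows = conn.execute(
    "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page ORDER BY hits DESC"
).fetchall()
```

The point of the SQL-on-Hadoop engines is that this query text stays the same whether the table holds three rows or three billion; only the execution layer changes.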

Big data challenge

In addition to the processing capacity and cost issues, designing a big data architecture is another common challenge for users. Big data systems must be tailored to an organization's particular needs, a DIY undertaking that requires IT teams and application developers to piece together a set of tools from all the available technologies. Deploying and managing big data systems also requires new skills compared to those typically possessed by database administrators (DBAs) and developers focused on relational software.

Both of those issues can be eased by using a managed cloud service, but IT managers need to keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-premises data sets and processing workloads to the cloud is often a complex process for organizations.

Making the data in big data systems accessible to data scientists and other analysts is also a challenge, especially in distributed environments that include a mix of different platforms and data stores. To help analysts find relevant data, IT and analytics teams are increasingly working to build data catalogs that incorporate metadata management and data lineage functions. Data quality and data governance also need to be priorities to ensure that sets of big data are clean, consistent, and used properly.
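The core idea of a data catalog with lineage can be sketched as a registry that records, for each dataset, who owns it, where it lives, and which datasets feed it. All names (datasets, owners, source systems) below are hypothetical examples, not real catalog tooling.

```python
# Sketch of a minimal data catalog: metadata plus upstream lineage per
# dataset, the kind of record catalog tools maintain so analysts can find
# data and trace where it came from. All entries are hypothetical.
catalog = {}

def register(name: str, owner: str, source: str, upstream: list) -> None:
    """Record a dataset with its owner, source system, and direct lineage."""
    catalog[name] = {"owner": owner, "source": source, "upstream": upstream}

register("raw_orders", "sales-it", "order_db", upstream=[])
register("daily_revenue", "analytics", "warehouse", upstream=["raw_orders"])

def lineage(name: str) -> list:
    """Walk upstream dependencies to show every dataset this one derives from."""
    direct = catalog[name]["upstream"]
    result = list(direct)
    for parent in direct:
        result.extend(lineage(parent))
    return result
```

With lineage recorded this way, a quality problem found in one dataset can be traced back to its sources, and forward to everything built on top of them.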

Big data collection practices and regulations

For many years, companies had few restrictions on the data they collected from their customers. However, as the collection and use of big data have increased, so has data misuse. Concerned citizens who have experienced the mishandling of their personal data or have been victims of a data breach are calling for laws around data collection transparency and consumer data privacy.

The outcry about personal privacy violations led the European Union to pass the General Data Protection Regulation (GDPR), which took effect in May 2018; it limits the types of data that organizations can collect and requires opt-in consent from individuals or compliance with other specified lawful bases for collecting personal data. The GDPR also includes a right-to-be-forgotten provision, which lets EU residents ask companies to delete their data.
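Operationally, honoring a right-to-be-forgotten request means purging a person's records from every store the organization holds, not just one. The store names and record layout below are assumptions for illustration; real systems must also handle backups, logs, and downstream copies.

```python
# Sketch: honoring a "right to be forgotten" request by removing one
# person's records from every data store. Store names and record layout
# are illustrative assumptions.
stores = {
    "crm": [{"user": "alice", "email": "a@x.eu"},
            {"user": "bob", "email": "b@x.eu"}],
    "marketing": [{"user": "alice", "optin": True}],
}

def forget(user: str) -> int:
    """Delete all records for `user` across every store; return count removed."""
    removed = 0
    for name, records in stores.items():
        kept = [r for r in records if r["user"] != user]
        removed += len(records) - len(kept)
        stores[name] = kept
    return removed
```

Returning the removal count gives the compliance team something to record as evidence that the request was fulfilled.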

Although there is no similar federal law in the United States, the California Consumer Privacy Act (CCPA) aims to give California residents more control over the collection and use of their personal information by companies. The CCPA was signed into law in 2018 and took effect on January 1, 2020. In addition, U.S. government officials are investigating data handling practices, specifically among companies that collect consumer data and sell it to other companies for unknown uses.

The human side of big data analysis

Ultimately, the value and effectiveness of big data depend on the workers tasked with understanding the data and formulating the proper queries to direct big data analytics projects. Some big data tools meet specialized niches and allow less technical users to use everyday business data in predictive analytics applications. Other technologies, such as Hadoop-based big data appliances, help businesses implement a suitable compute infrastructure to tackle big data projects while minimizing the need for hardware and distributed software know-how.

Big data can be contrasted with small data, another evolving term that's often used to describe data whose volume and format make it easy to use for self-service analytics. One commonly quoted axiom of the distinction is: "Big data is for machines; small data is for people."

This article is reproduced from Snow Beast Software.


Origin: blog.csdn.net/u014674420/article/details/111930473