Techniques for Working with Traditional Data

  1. Raw data, also known as raw facts or primary data, is data that cannot be analyzed straight away. It is untouched data you have accumulated and stored on the server. The gathering of raw data is referred to as data collection.
  2. The processing step is what turns raw data into useful and meaningful information. The first thing to do after gathering enough raw data is data preprocessing: a group of operations that convert your raw data into a format that is more understandable, and hence useful, for further processing.
  3. Data preprocessing aims to fix the problems that inevitably occur during data gathering. For example, within some customer data that has been gathered, you may find a person registered as 932 years old, or see "United Kingdom" entered as a person's name. Before proceeding with any type of analysis, such data entries must be marked as invalid or corrected. That is what data preprocessing is all about.
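A minimal sketch of flagging such invalid entries, assuming a pandas DataFrame with hypothetical name and age columns:

```python
import pandas as pd

# Hypothetical customer records; column names and values are illustrative.
customers = pd.DataFrame({
    "name": ["Alice", "United Kingdom", "Bob"],
    "age": [34, 29, 932],
})

# Mark entries as invalid rather than silently dropping them: an age
# outside a plausible range, or a "name" that is actually a country,
# is flagged for later review or correction.
countries = {"United Kingdom", "France", "Germany"}
customers["valid"] = (
    customers["age"].between(0, 120) & ~customers["name"].isin(countries)
)
print(customers)
```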
  4. Class labeling: this involves labeling each data point with the correct data type, or arranging data by category. One such label is numerical. For example, if you are storing the number of goods sold daily, you are keeping track of numerical values: numbers that can be manipulated, such as computing the average number of goods sold per day or per month. The other label is categorical. Here you are dealing with information that cannot be manipulated mathematically. For example, a person's profession or place of birth simply provides information about them.
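A minimal sketch of the two labels in pandas, using invented goods_sold (numerical) and profession (categorical) columns:

```python
import pandas as pd

# Hypothetical daily records.
df = pd.DataFrame({
    "goods_sold": [120, 95, 143, 110],  # numerical: supports arithmetic
    "profession": ["teacher", "engineer", "teacher", "nurse"],  # categorical
})

# Numerical values can be aggregated meaningfully.
print(df["goods_sold"].mean())  # average goods sold per day

# Categorical values are labeled as such; you count them, not average them.
df["profession"] = df["profession"].astype("category")
print(df["profession"].value_counts())
```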
  5. Data cleansing: also known as data cleaning or data scrubbing. The goal is to deal with inconsistent data, which can come in various forms. Suppose you are provided with a data set containing the US states, and a quarter of the names are misspelled. In this situation, certain techniques must be applied to correct these mistakes.
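One way to correct misspellings is to match each entry against a list of known valid values. A minimal sketch using Python's standard-library difflib; the state list and misspelled records are made up:

```python
from difflib import get_close_matches

VALID_STATES = ["California", "Texas", "Florida", "New York", "Ohio"]

def clean_state(value: str) -> str:
    """Return the closest valid state name, or the value unchanged."""
    matches = get_close_matches(value, VALID_STATES, n=1, cutoff=0.7)
    return matches[0] if matches else value

raw = ["Calfornia", "Texas", "Florda", "New Yrok"]
print([clean_state(s) for s in raw])
# ['California', 'Texas', 'Florida', 'New York']
```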
  6. Missing values: not all customers will give you all the data you ask for. What often happens is that a customer will give you his name and occupation but not his age. Should that customer's entire record be disregarded, or should you instead fill in, say, the average age of the remaining customers? Data cleansing and dealing with missing values are problems that must be solved before you can process the data further.
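A minimal sketch of both options in pandas, using hypothetical records:

```python
import pandas as pd

customers = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "age": [34, None, 29, None],
})

# Option 1: disregard records with missing values entirely.
print(customers.dropna(subset=["age"]))

# Option 2: impute the missing ages with the mean of the known ones.
customers["age"] = customers["age"].fillna(customers["age"].mean())
print(customers)  # missing ages replaced by 31.5
```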
  7. Balancing: a common approach. Imagine you have compiled a survey to gather data on the shopping habits of men and women, and you want to ascertain who spends more money during the weekend. However, when you look at your data, you notice that 80 percent of respondents were female and only 20 percent male. Any trends you find will reflect women far more than men. To counteract this problem, the best thing to do is to apply balancing techniques, such as taking an equal number of respondents from each group, so the ratio is 50/50.
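A minimal sketch of balancing by downsampling the majority group in pandas (the column names and the 80/20 counts are assumptions taken from the example above; groupby-sample requires pandas 1.1+):

```python
import pandas as pd

# Hypothetical survey: 80 female respondents, 20 male.
survey = pd.DataFrame({
    "gender": ["female"] * 80 + ["male"] * 20,
    "weekend_spend": range(100),
})

# Downsample every group to the size of the smallest one: 50/50 ratio.
smallest = survey["gender"].value_counts().min()
balanced = survey.groupby("gender").sample(n=smallest, random_state=42)
print(balanced["gender"].value_counts())  # female 20, male 20
```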
  8. Shuffling datasets: another common approach. Shuffling the observations in your data set is just like shuffling a deck of cards. It ensures that your data set is free from unwanted patterns caused by problematic data collection. Data shuffling is a technique that improves predictive performance and helps avoid misleading results. But how does it avoid misleading results? It is a detailed process, but in a nutshell, shuffling is a way to randomize data. If I take the first 100 observations from the data set, that is not a random sample: the observations that were entered first would be the first to be extracted. If I shuffle the data first, I can be sure that when I take 100 consecutive entries, they will be random and most likely representative.
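A minimal sketch of that idea using only the standard library; the data is illustrative:

```python
import random

# Suppose observations were entered in chronological order.
observations = list(range(1000))

# Taking the first 100 rows is biased toward the earliest entries...
biased_sample = observations[:100]

# ...but after shuffling, the first 100 rows form a random sample.
random.seed(42)  # fixed seed so the result is reproducible
random.shuffle(observations)
random_sample = observations[:100]
print(biased_sample[:5], random_sample[:5])
```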
  9. ER diagram: an entity-relationship diagram is a complex, theoretical way of illustrating a database architecture. Specialists love it for the straightforward way in which the different shapes convey how the tables in a database are related.
  10. Relational schema: each rectangle represents a distinct data table, and the lines between rectangles show which tables are related and which are not.
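A minimal sketch of two related tables expressed in code rather than as a diagram, using Python's built-in sqlite3; the schema itself is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two tables; the foreign key plays the role of the "line" an ER diagram
# would draw between the customers rectangle and the orders rectangle.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
conn.close()
```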
  11. Data collection and pre-processing are essential for quantitative analysis.
  12. Traditional data example: think of a basic customer data table. It contains text information about given customers, and it gives a clear example of the difference between a numerical and a categorical variable. The customer_id numbers cannot be meaningfully manipulated: calculating an average ID would not give you any useful information. This means that even though they are numbers, they hold no numerical value and therefore represent categorical data. Numerical data, by contrast, carries useful quantitative information.
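A minimal sketch of that distinction; the customer table here is invented for illustration:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],  # numbers, but categorical in meaning
    "purchases":   [5, 12, 7],          # genuinely numerical
})

# An "average ID" is computable but meaningless, so store IDs as categories.
customers["customer_id"] = customers["customer_id"].astype("category")

print(customers["purchases"].mean())  # 8.0 -- a useful numerical summary
```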

Reposted from blog.csdn.net/BSCHN123/article/details/103519292