Project Overview

In this project, you will use R to apply exploratory data analysis techniques: examining relationships within and between variables, and exploring a selected dataset for distributions, outliers, and anomalies.

Why do this project?

Exploratory data analysis (EDA) is the numerical and graphical examination of data characteristics and relationships before formal, rigorous statistical analysis is applied.
EDA can surface insights that raise further questions and ultimately lead to predictive models. It is also an important "line of defense" against bad data and an opportunity to check whether your assumptions or intuitions about the dataset are correct.

What will I learn?

After completing this project, you will:
Understand the distribution of a variable and check for anomalies and outliers
Learn to quantify and visualize individual variables of a dataset by using appropriate charts such as scatterplots, histograms, bar charts, and boxplots
Explore variables before building predictive models to identify the most important variables and relationships in a dataset, calculate their correlations, and investigate conditional means
Learn useful methods and visualizations to examine relationships between multiple variables, such as reshaping data and using colors and shapes to uncover more information
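
As a rough illustration of these objectives, the sketch below takes a first pass over R's built-in mtcars data. The dataset and the choice of columns are assumptions made purely for illustration and are not taken from the project's dataset options; only base R functions are used.

```r
# Illustrative only: mtcars ships with R and stands in for your chosen dataset.
data(mtcars)

# Distribution of a single variable: numeric summary plus a histogram.
summary(mtcars$hp)
hist(mtcars$hp, main = "Horsepower", xlab = "hp")

# Flag potential outliers with the boxplot rule
# (points more than 1.5 * IQR beyond the hinges).
hp_outliers <- boxplot.stats(mtcars$hp)$out

# Relationship between two variables: scatterplot and Pearson correlation.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

# Conditional means: average mpg within each cylinder count.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# A third variable encoded as both color and point shape.
cyl_factor <- factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg,
     col = as.integer(cyl_factor), pch = as.integer(cyl_factor),
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cyl_factor),
       col = seq_along(levels(cyl_factor)),
       pch = seq_along(levels(cyl_factor)),
       title = "Cylinders")

print(hp_outliers)
print(wt_mpg_cor)
print(mpg_by_cyl)
```

Each step maps to one objective above: a single-variable distribution, an outlier check, a correlation, conditional means, and a multi-variable plot.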

Why is this important to my professional development?

Why study data analysis? "If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis."
—Hal Varian, UC Berkeley, Chief Economist at Google

Introduction

For this project, you will do your own exploratory data analysis and create an RMD file to explore the variables, structure, patterns, anomalies, and potential relationships of your chosen dataset. As you ask questions, create visualizations, and explore data, you'll gain insight into the entire analysis process.

This project is open-ended; there is more than one correct answer. As John Tukey said: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." We hope you ask interesting questions of the data and give yourself the opportunity to explore. We will provide some datasets for you to explore, but you can also choose a completely different dataset. Note that finding your own dataset and then arranging it into a form that R can read is time-consuming and labor-intensive. It may add an extra day, a week, or even months to your project. Therefore, find and wrangle your own dataset only if you already have programming and data wrangling skills.

Now, let's get to the project details!

Step 1 - Choose a dataset

First, select a dataset from the dataset options document. Choose a dataset based on your previous programming and data processing experience. The dataset you choose will not increase or decrease your chances of passing the final project assessment. Tidy datasets are often easier to work with because each variable is already a column and each row is already an observation, so no tidying is needed. The guidance we provide below can help you choose a dataset. The estimated time includes the time to read all project descriptions and evaluation criteria, conduct the analysis, and submit the final project.
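
To make the notion of a tidy dataset concrete, here is a small sketch with made-up numbers. The data frame `wide` and all of its column names and values are hypothetical. It stores one variable (temperature) spread across two columns, and the code rearranges it so that each variable is a column and each row is an observation.

```r
# Hypothetical, made-up data: one temperature column per month (not tidy).
wide <- data.frame(
  city     = c("A", "B"),
  temp_jan = c(1.5, 10.2),
  temp_jul = c(20.1, 28.4)
)

# Tidy version: one row per city-month observation, one column per variable.
tidy <- data.frame(
  city  = rep(wide$city, times = 2),
  month = rep(c("jan", "jul"), each = nrow(wide)),
  temp  = c(wide$temp_jan, wide$temp_jul)
)

print(tidy)

# Tidy data feeds directly into grouped summaries such as conditional means.
print(tapply(tidy$temp, tidy$month, mean))
```

Datasets that arrive in the wide form need this kind of reshaping before most plotting and summary functions can use them, which is part of the extra effort to budget for.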

Step 2 - Make sure the project is organized

You will eventually submit your project and may share it with friends, family, and employers. Make sure your project is organized before you start. We recommend creating a separate folder on your desktop that will eventually contain:

An RMD file containing, in order, the analysis report, the final figures and summary, and a reflection
An HTML file (knitted from your RMD file)
The dataset you used (submit only if you use your own dataset)

Step 3 - Explore the data

This part is fun: start exploring your data! As you explore, record your thoughts in an RMD file. Refer to the sample project we provide; your report should look similar to it.

Step 4 - Document Your Analysis

Document your exploration and analysis work in the RMD file you submit. The file should be formatted in markdown and should contain, in order:

Analysis and exploration of data in a stream-of-consciousness fashion.

a. Your thoughts should be organized by headings and text, and reflect your analytical work as you explore the data.

b. The graphs in this analysis need not be polished with labels, units, and titles; they are exploratory and disposable. However, they should be of the appropriate type and effectively convey the information you glean from them.

c. You can iterate on a graph within the same R code chunk, but you do not need to show every iteration of the graph in the analysis.

A closing section titled "Final Figures and Summary".

a. Choose three graphs from the analysis report to polish and share in this section. The three graphs should show different trends and should be polished with appropriate labels, units, and titles (see Project Evaluation Guidelines for more information).

A final section titled "Reflection".

a. In this section, describe in a few sentences your efforts, successes, and ideas for how to explore the dataset in the future (see Project Evaluation Guidelines for more information).

Step 5 - Knit the RMD file

Your knitted RMD file should not be one long block of R code. It should intersperse text and graphics with the code so that readers of the document can follow the questions you were thinking about as you explored the data.
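
The report layout from Step 4 and the conversion to HTML can be sketched as follows. This is a minimal example under stated assumptions: the file name, report title, and chunk contents are placeholders, and the HTML conversion runs only if the rmarkdown package and pandoc happen to be installed.

```r
# Hypothetical report skeleton. The section titles follow Step 4; the
# file name, report title, and chunk contents are placeholders.
fence <- strrep("`", 3)  # the three backticks that delimit an R code chunk

rmd_lines <- c(
  "---",
  "title: \"Exploratory Analysis\"",
  "output: html_document",
  "---",
  "",
  "# Analysis",
  "",
  "Notes on what I am looking at and why.",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars)",
  fence,
  "",
  "What the summary suggests and what to look at next.",
  "",
  "# Final Figures and Summary",
  "",
  "# Reflection"
)

# Write the skeleton to a temporary location.
rmd_path <- file.path(tempdir(), "eda_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Convert to HTML only if the rmarkdown package and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(html_path)
}
```

In RStudio, the Knit button performs the same conversion. Note how prose sits before and after the code chunk, so the resulting HTML reads as a narrative and not as a code listing.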

Step 6 - Document the data (if you have selected your own dataset)

The dataset you submit (only if you have selected your own dataset) should be accompanied by a text file, similar to the dataset descriptions in the R documentation, that documents the data source. It should also explain the variables in the dataset: their definitions, units, the levels of categorical variables, and, where possible, the data generation process, such as how the data were collected.

Supporting materials
Dataset options
Sample project

【Project 2】Exploratory Analysis