Quick data analysis tips and tricks in Python, explained in detail


Overview

Small tips and tricks can be very useful, especially in programming. Sometimes a little hack can save time, and occasionally even save the day.

A small shortcut or add-on can be a godsend and a real productivity booster. So here are some tips and tricks, some of which may be new to you, that I believe will come in very handy on your next data analysis project.


Profiling a Pandas DataFrame

Profiling is a process that helps us understand our data, and Pandas Profiling is a Python package that performs exploratory data analysis (EDA) on a Pandas DataFrame quickly and easily.

The df.describe() and df.info() functions in Pandas are usually the first step of the EDA process. However, they give only a very basic overview of the data and do not help much with large datasets. Pandas Profiling, on the other hand, displays a large amount of information with a single line of code, and it can also present that information as an interactive HTML report.

For a given dataset, the Pandas Profiling package computes histograms, modes, correlation coefficients, quantiles, and descriptive statistics, along with other information such as column types, unique values, and missing values.

Installation

Install with pip or with conda:

pip install pandas-profiling 
conda install -c anaconda pandas-profiling

Usage

The following code uses the classic Titanic dataset to demonstrate what the profiler can do.

#importing the necessary packages 
import pandas as pd 
import pandas_profiling
df = pd.read_csv('titanic/train.csv') 
pandas_profiling.ProfileReport(df)

This single line of code is enough to display a complete data analysis report in a Jupyter notebook. The report is very detailed and includes charts where they are needed.


The report can also be exported to an interactive HTML file using the following code.

profile = pandas_profiling.ProfileReport(df)
# the path is passed positionally because the keyword name differs between package versions
profile.to_file("Titanic data profiling.html")


Interactive plotting with Pandas

Pandas has a built-in .plot() function as part of the DataFrame class. However, the charts it renders are not interactive, which makes them less engaging. What if we want to draw interactive charts with Pandas without making major changes to the code? This is where the Cufflinks library comes in.

The Cufflinks library combines the power of plotly with the flexibility of pandas, which makes plotting very convenient. Let’s take a look at how to install the library and use it with pandas.

Installation

pip install plotly
# Plotly is a pre-requisite before installing cufflinks
pip install cufflinks

Usage

# importing Pandas
import pandas as pd
# importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

Now it is time to see the magic at work on the Titanic dataset.

df.iplot()

[Figure: df.iplot() vs df.plot()]

df.plot() produces a static chart, while df.iplot() produces an interactive and more detailed one, all without any major change in syntax.

Magic commands

Magic commands are a set of convenient functions in Jupyter notebooks designed to solve some common problems in standard data analysis. Use the command %lsmagic to see all available commands.

[Figure: list of all available magic commands]

There are two types of magic commands: line magics, which are prefixed by a single % character and operate on a single line of input, and cell magics, which are prefixed by a double %% and operate on multiple lines of input. If %automagic is set to 1 (on), line magics can be called without typing the initial %.

Next, let’s look at some commands that may be used in common data analysis tasks:

%pastebin

%pastebin uploads code to a pastebin service (dpaste.com in current IPython versions) and returns the URL. A pastebin is an online content hosting service that stores plain text, such as source code snippets, which can then be shared with others through a URL. GitHub Gist is similar to a pastebin, but with version control.

Save a Python script with the following content as file.py:

# importing Pandas
import pandas as pd
# importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

Then run %pastebin file.py in a Jupyter notebook to generate a shareable URL for it.

%matplotlib notebook

The %matplotlib inline magic renders static matplotlib plots inside the Jupyter notebook. Replace inline with notebook to get zoomable and resizable plots with no extra effort. Remember to call the magic before importing the matplotlib library.


%run

Use the %run magic to run a Python script inside the notebook.

%run file.py

%%writefile

%%writefile writes the cell contents to a file. For example, a cell whose first line is %%writefile foo.py saves the rest of the cell as a script named foo.py in the current directory.
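Outside a notebook the %%writefile syntax is not available, so the sketch below only mimics its effect in plain Python. The script body, the greet function and the temporary directory are all assumptions made for illustration; in a real notebook the file would land in the current directory.

```python
import tempfile
from pathlib import Path

# The text that would follow the "%%writefile foo.py" line in a notebook cell.
cell_body = 'def greet(name):\n    return f"Hello, {name}!"\n'

# A temporary directory stands in for the notebook's current directory here.
target = Path(tempfile.mkdtemp()) / "foo.py"
target.write_text(cell_body, encoding="utf-8")

# Read the file back to confirm what was written.
print(target.read_text(encoding="utf-8"))
```

The saved file can then be executed with %run foo.py from the notebook.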


%%latex

The %%latex magic renders the cell contents as LaTeX. It is useful for writing mathematical formulas and equations in a cell.
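As a small illustration, a cell like the following should render the formula for a sample mean. The formula is only an example chosen here, not one taken from the original notebook.

```latex
%%latex
\begin{equation}
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{equation}
```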


Find and resolve errors

The interactive debugger is also a magic command, but I have given it a category of its own. If an exception occurs while running a code cell, type %debug in a new cell and run it. This opens an interactive debugging environment that takes you directly to the place where the exception occurred. You can also inspect the values of variables assigned in the program and perform operations there. Press q to exit the debugger.


Pretty printing

If you want to print data structures in a readable way, pprint is the first choice. It is especially useful for dictionaries and JSON data: print puts everything on a single line, while pprint wraps long structures across several aligned lines.
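A minimal comparison of the two might look like this. The passenger record is an illustrative dictionary made up for this sketch, and pformat is used only to capture the pprint layout as a string.

```python
from pprint import pformat, pprint

# Illustrative nested record (assumed for this example).
passenger = {
    "name": "Braund, Mr. Owen Harris",
    "ticket": {"class": 3, "fare": 7.25, "embarked": "S"},
    "family": {"siblings_or_spouses": 1, "parents_or_children": 0},
    "survived": False,
}

print(passenger)   # the whole dictionary on one long line
pprint(passenger)  # one top-level key per line, with keys sorted

# pformat returns the same layout as a string instead of printing it.
formatted = pformat(passenger)
```

pprint only breaks a structure up when it does not fit within the line width (80 characters by default), which is why the short nested dictionaries stay on one line each.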


Make your notes stand out

Alert boxes (note boxes) can be used in a Jupyter notebook to highlight important content. The color of the box depends on the alert type specified. Just add any or all of the following code to the Markdown cells you want to highlight.

Blue alert box: information

<div class="alert alert-block alert-info"> 
<b>Tip:</b> Use blue boxes (alert-info) for tips and notes.  
If it’s a note, you don’t have to include the word “Note”.
 </div>


Yellow alert box: warning

<div class="alert alert-block alert-warning"> <b>Example:</b> Yellow Boxes are generally used to include additional examples or mathematical formulas. </div>


Green alert box: success

<div class="alert alert-block alert-success">
Use green boxes (alert-success) for success messages.
</div>


Red alert box: danger

<div class="alert alert-block alert-danger">
It is good to avoid red boxes but can be used to alert users to not delete some important part of code etc. 
</div>
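If the same boxes are needed in many places, the markup can be generated instead of typed by hand. The sketch below is only one way to do that: alert_box and ALERT_KINDS are hypothetical names introduced here, not part of Jupyter, and the function simply builds the same div strings shown above.

```python
from html import escape

# Alert classes taken from the four boxes above.
ALERT_KINDS = ("info", "warning", "success", "danger")


def alert_box(kind, message):
    """Return the HTML for one alert box (hypothetical helper, not a Jupyter API)."""
    if kind not in ALERT_KINDS:
        raise ValueError(f"unknown alert kind: {kind!r}")
    # escape() keeps characters such as < and & in the message from breaking the markup.
    return (
        f'<div class="alert alert-block alert-{kind}">\n'
        f"{escape(message)}\n"
        "</div>"
    )


print(alert_box("info", "Use blue boxes for tips and notes."))
```

The printed markup can be pasted into a Markdown cell like the hand-written examples.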


Display the output of every expression in a cell

Suppose a cell in a Jupyter notebook contains the following lines of code:

In  [1]: 10+5                    
        11+6
Out [1]: 17

By default, a cell displays only the value of its last expression; for the other values we need to add print(). However, all of the outputs can be displayed at once by adding the following snippet at the top of the notebook:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

After adding the code, all results are displayed one after another.

In  [1]: 10+5  
          11+6
          12+7
Out [1]: 15 
Out [1]: 17 
Out [1]: 19

Restore original settings:

InteractiveShell.ast_node_interactivity = "last_expr"

Run a Python script with the -i option

The typical way to run a Python script from the command line is python hello.py. However, running the same script with the -i flag, as in python -i hello.py, offers some additional advantages.

First, Python does not exit the interpreter when the program ends, so we can inspect the values of the variables and check the functions defined in the program.
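As a minimal sketch, assume hello.py contains the following. The file contents, the mean function and the variable names are made up for illustration.

```python
# hello.py (hypothetical example script)

def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


scores = [72, 88, 95]
average = mean(scores)
print(f"average = {average}")
```

After running python -i hello.py, the >>> prompt stays open, so typing scores, average or mean([1, 2, 3]) there shows the values and functions the script left behind.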


Secondly, we can easily invoke the Python debugger, since we are still inside the interpreter:

import pdb
pdb.pm()

This locates where the exception occurred and we can then handle the exception code.
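pdb.pm() needs an interactive session, so it cannot be shown as a self-running script. As a rough, non-interactive stand-in for the same idea, the sketch below uses the standard traceback module to find the innermost frame of an exception, which is the frame a post-mortem debugger would open. The parse_age function and its input are assumptions made for this example.

```python
import sys
import traceback


def parse_age(raw):
    """Convert a raw string to an integer age (fails on non-numeric input)."""
    return int(raw)


try:
    parse_age("unknown")
except ValueError:
    # The traceback object records every frame between here and the failure.
    tb = sys.exc_info()[2]
    innermost = traceback.extract_tb(tb)[-1]
    failing_function = innermost.name
    print(f"exception raised in {failing_function}()")
```

In an interactive session started with python -i, pdb.pm() goes one step further and lets you inspect the local variables of that frame, such as raw here.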

Comment out code automatically

Ctrl/Cmd + / comments out the selected lines in a cell; pressing the combination again uncomments the same lines.


Easy to delete but difficult to restore

Have you ever accidentally deleted a cell in your Jupyter notebook? If the answer is yes, then you can master this shortcut to undo a delete operation.

If you delete the contents of a cell, you can easily restore it by pressing CTRL/CMD + Z.

If you need to recover an entire deleted cell, press ESC + Z or use Edit > Undo Delete Cells.


Conclusion

In this article, I list some tips I collected while working with Python and Jupyter notebooks. I believe they will be useful and informative for you, making coding easy!

 
 
 


Origin blog.csdn.net/Rocky006/article/details/133136354