7 Settings for Data Science Projects: Ensure Dependencies, Keep Output Stable

When starting a data science project, we usually need some setup or configuration to ensure the required dependencies are available, keep the output stable across runs, and prepare common helper functions.

An example of project setup (from Handson-ML2)

This article will introduce some of the most helpful project settings in Jupyter Notebook.

1. Ensure the Python version

Check the Python interpreter version in Jupyter Notebook:

import sys

sys.version
'3.7.6 (default, Jan 8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]'

To ensure that the project runs with at least the minimum required version of the Python interpreter, add the following code to the project settings:

# Python ≥3.7 is required

import sys

assert sys.version_info >= (3, 7)

Python needs to be version 3.7 or above, otherwise an AssertionError will be thrown.

2. Ensure the package version

Check the version of an installed package, such as TensorFlow:

import tensorflow as tf

tf.__version__
'2.0.0'

Make sure the project runs on TensorFlow 2.0 or above, otherwise an AssertionError will be thrown:

# TensorFlow ≥2.0 is required

import tensorflow as tf

assert tf.__version__ >= "2.0"
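Note that comparing version strings lexicographically can be misleading once a component has more than one digit (for example, "10.0" compares less than "2.0" as a string). A minimal sketch of a numeric comparison, using a hypothetical `version_tuple` helper that assumes purely numeric dot-separated components:

```python
# Hypothetical helper: compare version components numerically, not as strings.
# Assumes the first two dot-separated components are plain integers.
def version_tuple(version):
    return tuple(int(part) for part in version.split(".")[:2])

# String comparison gets multi-digit components wrong:
print("10.0" >= "2.0")                  # False (lexicographic)
print(version_tuple("10.0") >= (2, 0))  # True (numeric)
```

For installed packages, `packaging.version.parse` offers a more robust comparison that also handles pre-release suffixes.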

3. Avoid drawing blurry images

Default plots in Jupyter Notebook look blurry. For example, a simple heatmap to find missing values. (https://towardsdatascience.com/using-pandas-pipe-function-to-improve-code-readability-96d66abfaf8)

import seaborn as sns
import matplotlib.pyplot as plt

# Default inline figure format is png
%matplotlib inline

sns.heatmap(df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')

Default images look blurry

As can be seen from the figure above, the text is blurred, the missing values in the Cabin column are too crowded, and the missing values in the Embarked column cannot be recognized.

To solve this problem, add %config InlineBackend.figure_format = 'retina' or %config InlineBackend.figure_format = 'svg' after %matplotlib inline, namely:

%matplotlib inline
# Set the inline figure format to 'retina' (or 'svg')
%config InlineBackend.figure_format = 'retina'

sns.heatmap(df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')

Image format is set to retina or svg

Compared with the previous picture, the above picture is clearer, and the missing values in the Embarked column can also be successfully identified.

4. Keep the output stable in different runs

Random numbers are used in many places in data science projects. For example:

train_test_split() from Scikit-Learn

np.random.rand() for initializing weights

If the random seed is not reset, a different number will appear for each call:

>>> np.random.rand(4)

array([0.83209492, 0.10917076, 0.15798519, 0.99356723])

>>> np.random.rand(4)

array([0.46183001, 0.7523687 , 0.96599624, 0.32349079])

np.random.seed(0) makes random numbers predictable:

>>> np.random.seed(0)

>>> np.random.rand(4)

array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])

>>> np.random.seed(0)

>>> np.random.rand(4)

array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])

If you reset the random seed before each run, you will get the same sequence of random numbers every time, so the project can keep its output stable from run to run.
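The same principle applies to Python's built-in random module; a minimal stdlib sketch showing that resetting the seed reproduces the exact sequence:

```python
import random

random.seed(0)
first = [random.random() for _ in range(4)]

random.seed(0)          # reset the seed before the second run
second = [random.random() for _ in range(4)]

print(first == second)  # True: the two sequences are identical
```

Scikit-Learn functions such as train_test_split() also accept a random_state parameter, which pins the randomness for that single call without touching the global seed.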

5. Output multiple results per cell

By default, Jupyter Notebook only displays the result of the last expression in a cell. To display every result, reconfigure the shell via IPython:

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
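To make this the default for every notebook session rather than repeating it per notebook, the same option can go into an IPython configuration file (a sketch; the path assumes the default profile location):

```python
# ~/.ipython/profile_default/ipython_config.py
c.InteractiveShell.ast_node_interactivity = "all"
```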

6. Save figures to a file

Matplotlib can save figures with the savefig() method, but an error will be thrown if the given path does not exist.

plt.savefig('./figures/my_plot.png')
FileNotFoundError: [Errno 2] No such file or directory: './figures/my_plot.png'

The best practice is to put all the figures in one place, such as a figures folder in the workspace. You can create the folder manually through the operating system's GUI, or by running a shell command in Jupyter Notebook, but it is better to write a small function to do it.
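The folder-creation step on its own can be handled with os.makedirs and exist_ok=True, so repeated calls are safe; a minimal sketch (the folder name is illustrative):

```python
import os

figures_dir = os.path.join(".", "figures")  # illustrative folder name
os.makedirs(figures_dir, exist_ok=True)     # creates the folder if missing
os.makedirs(figures_dir, exist_ok=True)     # no error if it already exists
print(os.path.isdir(figures_dir))           # True
```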

This is especially useful when some custom graphics settings or additional subfolders are required to group graphics. Here is the function to save an image to a file:

import os
import matplotlib.pyplot as plt

%matplotlib inline

# Where to save the figures
PROJECT_ROOT_DIR = "."
SUB_FOLDER = "sub_folder"  # a sub-folder
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", SUB_FOLDER)

def save_fig(name, images_path=IMAGES_PATH, tight_layout=True,
             extension="png", resolution=300):
    if not os.path.isdir(images_path):
        os.makedirs(images_path)
    path = os.path.join(images_path, name + "." + extension)
    print("Saving figure:", name)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=extension, dpi=resolution)

Now call save_fig('figure_name'): an images/sub_folder directory will be created in the workspace, and the figure will be saved there as "figure_name.png". In addition, the three most commonly used settings are provided:

· tight_layout automatically adjusts subplot padding

· extension saves the figure in a chosen format

· resolution sets the image resolution (dpi)

7. Download the data (and extract it)

Working with web data is commonplace for data scientists. It is possible to use a browser to download the data and run a command to unzip the file, but it is best to create a small function to do that. This is especially important when the data changes on a regular basis.

Write a small script to fetch the latest data (you can also set up a scheduled job that runs automatically at regular intervals). Automating the download process is also useful if the dataset needs to be installed on multiple machines.

Here is the function to download and decompress the data:

import os
import tarfile
import zipfile
import urllib.request

# Where to save the data
PROJECT_ROOT_DIR = "."
SUB_FOLDER = "group_name"
LOCAL_PATH = os.path.join(PROJECT_ROOT_DIR, "datasets", SUB_FOLDER)

def download(file_url, local_path=LOCAL_PATH):
    if not os.path.isdir(local_path):
        os.makedirs(local_path)
    # Download file
    print(">>> downloading")
    filename = os.path.basename(file_url)
    file_local_path = os.path.join(local_path, filename)
    urllib.request.urlretrieve(file_url, file_local_path)
    # untar/unzip file
    if filename.endswith("tgz") or filename.endswith("tar.gz"):
        print(">>> unpacking file:", filename)
        tar = tarfile.open(file_local_path, "r:gz")
        tar.extractall(path=local_path)
        tar.close()
    elif filename.endswith("tar"):
        print(">>> unpacking file:", filename)
        tar = tarfile.open(file_local_path, "r:")
        tar.extractall(path=local_path)
        tar.close()
    elif filename.endswith("zip"):
        print(">>> unpacking file:", filename)
        zip_file = zipfile.ZipFile(file_local_path)
        zip_file.extractall(path=local_path)
        zip_file.close()
    print("Done")

Now calling download("http://a_valid_url/housing.tgz") will create a datasets/group_name directory in the workspace, download housing.tgz, and extract housing.csv into that directory. This small function also works for CSV and text files.


Origin blog.csdn.net/qq_40016005/article/details/127020487