When starting a data science project, we usually need some setup or configuration: making sure the required dependencies are available, keeping the output stable across runs, and preparing common helper functions.
An example of project setup (from Handson-ML2)
This article introduces some of the most helpful project settings for Jupyter Notebook.
1. Check the Python version
Check the Python interpreter version in Jupyter Notebook:
import sys
sys.version
'3.7.6 (default, Jan  8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]'
To ensure that the project runs on at least the minimum required Python version, add the following code to the project setup:
# Python ≥3.7 is required
import sys
assert sys.version_info >= (3, 7)
Python must be version 3.7 or above; otherwise an AssertionError is raised.
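A bare assert stops the notebook but gives little context. A gentler variant is sketched below; MIN_PYTHON is an illustrative name, not part of the original setup:

```python
import sys

MIN_PYTHON = (3, 7)  # illustrative minimum; adjust per project
if sys.version_info < MIN_PYTHON:
    # Explain what went wrong instead of raising a bare AssertionError
    sys.exit("Python %d.%d+ is required, found %s"
             % (MIN_PYTHON[0], MIN_PYTHON[1], sys.version.split()[0]))
print("Python version OK")
```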
2. Check package versions
Check the versions of installed packages, such as TensorFlow:
import tensorflow as tf
tf.__version__
'2.0.0'
To make sure the project runs on TensorFlow 2.0 or above (otherwise an AssertionError is raised):
# TensorFlow ≥2.0 is required
import tensorflow as tf
assert tf.__version__ >= "2.0"
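Note that comparing version strings lexically can mislead: as strings, "10.0" sorts before "2.0". A safer sketch compares numeric components; version_at_least is a hypothetical helper, not a TensorFlow API:

```python
def version_at_least(version_str, minimum):
    """Compare a dotted version string against a tuple of ints numerically."""
    parts = tuple(int(p) for p in version_str.split(".")[:len(minimum)])
    return parts >= minimum

print(version_at_least("2.0.0", (2, 0)))  # True
print(version_at_least("10.1", (2, 0)))   # True, though "10.1" < "2.0" as strings
print(version_at_least("1.15", (2, 0)))   # False
```

It can then be used as `assert version_at_least(tf.__version__, (2, 0))`.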
3. Avoid blurry plots
Plots rendered with default settings in Jupyter Notebook look blurry. For example, a simple heatmap for finding missing values (from https://towardsdatascience.com/using-pandas-pipe-function-to-improve-code-readability-96d66abfaf8):
import seaborn as sns
import matplotlib.pyplot as plt
# Default figure format is png
%matplotlib inline
sns.heatmap(df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')
Default images look blurry
As can be seen from the figure above, the text is blurred, the missing values in the Cabin column are crowded together, and the missing values in the Embarked column cannot be made out at all.
To solve this problem, add %config InlineBackend.figure_format = 'retina' (or 'svg') after %matplotlib inline, namely:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # or 'svg'
sns.heatmap(df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')
Image format is set to retina or svg
Compared with the previous figure, this one is much clearer, and the missing values in the Embarked column can now be identified.
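The figure_format magic only works inside Jupyter. In plain scripts, raising the figure DPI via Matplotlib's rcParams gives a similar sharpening effect; a sketch, where 150 is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs outside a notebook
import matplotlib.pyplot as plt

# The default figure.dpi is usually 100; a higher value produces sharper PNGs
plt.rcParams["figure.dpi"] = 150
print(plt.rcParams["figure.dpi"])  # 150.0
```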
4. Keep the output stable across runs
Random numbers are used in many places in data science projects. For example:
· train_test_split() from Scikit-Learn
· np.random.rand() for initializing weights
If the random seed is not reset, each call produces different numbers:
>>> np.random.rand(4)
array([0.83209492, 0.10917076, 0.15798519, 0.99356723])
>>> np.random.rand(4)
array([0.46183001, 0.7523687 , 0.96599624, 0.32349079])
np.random.seed(0) makes random numbers predictable:
>>> np.random.seed(0)
>>> np.random.rand(4)
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
>>> np.random.seed(0)
>>> np.random.rand(4)
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
If you reset the random seed at the start of every run, you get the same random numbers, and hence the same data splits, every time. This keeps the project's output stable from run to run.
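A common pattern is to collect all the seed calls in one setup function; a minimal sketch, where set_seeds is a hypothetical helper and 42 an arbitrary seed:

```python
import random
import numpy as np

def set_seeds(seed=42):
    # Seed every generator the project uses in one place
    random.seed(seed)
    np.random.seed(seed)

set_seeds(42)
first = np.random.rand(4)
set_seeds(42)
second = np.random.rand(4)
print((first == second).all())  # True
```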
5. Output multiple results per cell
By default, Jupyter Notebook only displays the last result in a cell. To display multiple results, reconfigure the shell via IPython:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
6. Save figures to a file
Matplotlib can save figures with the savefig() method, but it throws an error if the given path does not exist.
plt.savefig('./figures/my_plot.png')
FileNotFoundError: [Errno 2] No such file or directory: './figures/my_plot.png'
The best practice is to keep all figures in one place, such as a figures folder in the workspace. You can create the folder manually through the operating system's file manager or by running a shell command in Jupyter Notebook, but it is better to write a small function for this.
This is especially useful when custom figure settings or additional subfolders for grouping figures are needed. Here is a function that saves a figure to a file:
import os
%matplotlib inline
import matplotlib.pyplot as plt

# Where to save the figures
PROJECT_ROOT_DIR = "."
SUB_FOLDER = "sub_folder"  # a sub-folder
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", SUB_FOLDER)

def save_fig(name, images_path=IMAGES_PATH, tight_layout=True,
             extension="png", resolution=300):
    if not os.path.isdir(images_path):
        os.makedirs(images_path)
    path = os.path.join(images_path, name + "." + extension)
    print("Saving figure:", name)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=extension, dpi=resolution)
Now calling save_fig('figure_name') creates an images/sub_folder directory in the workspace and saves the figure there as "figure_name.png". In addition, the three most commonly used settings are exposed:
· tight_layout automatically adjusts the subplot padding
· extension saves figures in various formats
· resolution sets the image resolution
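A condensed, self-contained variant of the same pattern can be checked outside Jupyter; a sketch, where the Agg backend and tempfile directory are only there so it runs as a plain script:

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # headless backend for plain scripts
import matplotlib.pyplot as plt

def save_fig(name, images_path, extension="png", resolution=300):
    os.makedirs(images_path, exist_ok=True)  # create the folder on demand
    path = os.path.join(images_path, name + "." + extension)
    plt.savefig(path, format=extension, dpi=resolution)
    return path

root = tempfile.mkdtemp()
plt.plot([1, 2, 3], [1, 4, 9])
out = save_fig("figure_name", os.path.join(root, "images", "sub_folder"))
print(os.path.exists(out))  # True
```

Using os.makedirs(..., exist_ok=True) makes the folder check and creation a single idempotent call.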
7. Download the data (and extract it)
Working with web data is commonplace for data scientists. It is possible to use a browser to download the data and run a command to unzip the file, but it is best to create a small function to do that. This is especially important when the data changes on a regular basis.
Write a small script to run whenever you need the latest data (you can also set up a scheduled job to run it automatically). Automating the download process is also useful if the dataset needs to be installed on multiple machines.
Here is the function to download and decompress the data:
import os
import tarfile
import zipfile
import urllib.request

# Where to save the data
PROJECT_ROOT_DIR = "."
SUB_FOLDER = "group_name"
LOCAL_PATH = os.path.join(PROJECT_ROOT_DIR, "datasets", SUB_FOLDER)

def download(file_url, local_path=LOCAL_PATH):
    if not os.path.isdir(local_path):
        os.makedirs(local_path)
    # Download the file
    print(">>> downloading")
    filename = os.path.basename(file_url)
    file_local_path = os.path.join(local_path, filename)
    urllib.request.urlretrieve(file_url, file_local_path)
    # Untar/unzip the file
    if filename.endswith("tgz") or filename.endswith("tar.gz"):
        print(">>> unpacking file:", filename)
        tar = tarfile.open(file_local_path, "r:gz")
        tar.extractall(path=local_path)
        tar.close()
    elif filename.endswith("tar"):
        print(">>> unpacking file:", filename)
        tar = tarfile.open(file_local_path, "r:")
        tar.extractall(path=local_path)
        tar.close()
    elif filename.endswith("zip"):
        print(">>> unpacking file:", filename)
        zip_file = zipfile.ZipFile(file_local_path)
        zip_file.extractall(path=local_path)
        zip_file.close()
    print("Done")
Now calling download("http://a_valid_url/housing.tgz") creates a datasets/group_name directory in the workspace, downloads housing.tgz, and extracts housing.csv into that directory. This small function also works for CSV and text files.
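Since the URL above is a placeholder, the unpacking half of the logic can be exercised with a locally created archive instead of a real download; a sketch using a zip file and a temporary directory:

```python
import os
import tempfile
import zipfile

# Build a tiny zip archive locally, then reuse the same extraction pattern
root = tempfile.mkdtemp()
archive = os.path.join(root, "housing.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("housing.csv", "a,b\n1,2\n")

local_path = os.path.join(root, "datasets", "group_name")
os.makedirs(local_path, exist_ok=True)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(path=local_path)

extracted = os.path.join(local_path, "housing.csv")
print(os.path.exists(extracted))  # True
```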