Operation series (16) -- 12 essential command line tools for data scientists

> wget

wget is a file retrieval tool for downloading files from remote locations. Basic usage for downloading a remote file looks like this:
wget:
https://en.wikipedia.org/wiki/wget

$ wget http://aima.cs.berkeley.edu/data/iris.csv
--2018-04-18 13:52:38--  http://aima.cs.berkeley.edu/data/iris.csv
Resolving aima.cs.berkeley.edu (aima.cs.berkeley.edu)... 128.32.189.73
Connecting to aima.cs.berkeley.edu (aima.cs.berkeley.edu)|128.32.189.73|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) [text/plain]
Saving to: ‘iris.csv’

> cat

cat writes the contents of one or more files to standard output; its name comes from the word "concatenate". It can also handle more complex file processing, such as merging files together (that is, true file concatenation), appending one file to another, and numbering the lines of a file.
cat:
https://en.wikipedia.org/wiki/Cat_(Unix)

$ cat iris.csv
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa
5.4,3.7,1.5,0.2,setosa
...
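
The other uses mentioned above (merging, appending, and numbering lines) might look like the following minimal sketch. The file names part1.csv, part2.csv, and merged.csv and the scratch directory are made up for illustration; they are not part of the original walkthrough.

```shell
# Hypothetical sample files in a scratch directory (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n' > "$workdir/part1.csv"
printf '6.7,3,5.2,2.3,virginica\n' > "$workdir/part2.csv"

# Concatenate two files into a new one.
cat "$workdir/part1.csv" "$workdir/part2.csv" > "$workdir/merged.csv"

# Append one file to the end of another.
cat "$workdir/part2.csv" >> "$workdir/part1.csv"

# Print the merged file with line numbers.
cat -n "$workdir/merged.csv"
```

Note that ">" overwrites the target file while ">>" appends to it, so mixing them up is an easy way to lose data.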

> wc

The wc command reports line, word, and byte counts for text files. With no options set, wc prints one line per file containing, from left to right, the line count, the word count (note: a line consisting of a single string with no spaces counts as one word), the byte count, and the file name.
wc:
https://en.wikipedia.org/wiki/Wc_(Unix)

 $ wc iris.csv
 150  150 3800 iris.csv
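
Each of the three counts can also be requested on its own. The sketch below assumes a small made-up file, sample.csv, so that the numbers are easy to check by hand; the variable names are illustrative only.

```shell
# Hypothetical two-line sample file (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n' > "$workdir/sample.csv"

# Reading from standard input keeps the file name out of the output.
line_count=$(wc -l < "$workdir/sample.csv")   # lines only
word_count=$(wc -w < "$workdir/sample.csv")   # whitespace-separated words only
byte_count=$(wc -c < "$workdir/sample.csv")   # bytes only

echo "lines=$line_count words=$word_count bytes=$byte_count"
```

Because each CSV row contains no spaces, the word count equals the line count here, just as in the iris.csv output above.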

> head

The head command writes the first n lines of a file to standard output (10 lines by default). The number of lines can be set with the -n option, as follows.
head:
https://en.wikipedia.org/wiki/Head_(Unix)

~$  head -n 6 iris.csv
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa

> tail

Can you guess what tail does? It prints the last n lines of a file (10 by default), which makes it the counterpart of head.
tail:
https://en.wikipedia.org/wiki/Tail_(Unix)

~$ tail -n 5 iris.csv
6.7,3,5.2,2.3,virginica
6.3,2.5,5,1.9,virginica
6.5,3,5.2,2,virginica
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica
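
head and tail also combine well. As a rough sketch, the commands below skip a header row and pull out a range of lines from the middle of a file. The file sample.csv and its header are assumptions made for this example.

```shell
# Hypothetical sample file with a header row (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'sepal_length,species\n5.1,setosa\n4.9,setosa\n6.7,virginica\n5.9,virginica\n' > "$workdir/sample.csv"

# Skip the header: "+2" tells tail to start output at line 2.
body=$(tail -n +2 "$workdir/sample.csv")
echo "$body"

# Print lines 2 through 3: take the first 3 lines, then the last 2 of those.
middle=$(head -n 3 "$workdir/sample.csv" | tail -n 2)
echo "$middle"
```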

> find

find is a filesystem tool for searching for specific files. The following command searches the directory tree starting from the current directory ("."), looking for entries whose names start with "iris" followed by any characters ("-name 'iris*'") and that are regular files ("-type f"):
find:
https://en.wikipedia.org/wiki/Find_(Unix)

~$ find . -name 'iris*' -type f
./iris.csv
./notebooks/kmeans-sharding-init/sharding/tests/results/iris_time_results.csv
./notebooks/ml-workflows-python-scratch/iris_raw.csv
./notebooks/ml-workflows-python-scratch/iris_clean.csv
...
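
To see how the "-type" test changes the result, here is a small sketch run against a throwaway directory tree. All file and directory names below are invented for the example.

```shell
# Hypothetical directory tree (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/notebooks" "$workdir/iris_dir"
touch "$workdir/iris.csv" "$workdir/notebooks/iris_clean.csv" "$workdir/notebooks/notes.txt"

# Regular files whose names start with "iris"; iris_dir is skipped by -type f.
csv_files=$(find "$workdir" -name 'iris*' -type f | sort)
echo "$csv_files"

# Same name pattern, but directories only.
dirs=$(find "$workdir" -name 'iris*' -type d)
echo "$dirs"
```

Quoting the pattern ('iris*') matters: without the quotes the shell may expand the wildcard before find ever sees it.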

> cut

The cut command extracts sections from each line of text. It can select by byte, character, or field, and it is especially useful for extracting columns from CSV files. The following command outputs the fifth column ("-f 5") of the iris.csv file using the comma delimiter ("-d ','"):
cut:
https://en.wikipedia.org/wiki/Cut_(Unix)

~$ cut -d ',' -f 5 iris.csv
species
setosa
setosa
setosa
...
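
The "-f" option also accepts lists and ranges of fields. A minimal sketch, assuming a tiny made-up file named sample.csv:

```shell
# Hypothetical two-line sample file (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '5.1,3.5,1.4,0.2,setosa\n6.7,3,5.2,2.3,virginica\n' > "$workdir/sample.csv"

# Fields 1 and 5 (a comma-separated list of field numbers).
first_and_last=$(cut -d ',' -f 1,5 "$workdir/sample.csv")
echo "$first_and_last"

# Fields 1 through 3 (a range).
first_three=$(cut -d ',' -f 1-3 "$workdir/sample.csv")
echo "$first_three"
```

The selected fields are joined with the same delimiter in the output, so the result is still valid CSV.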

> uniq

uniq normalizes text output by collapsing repeated lines. Note that it only removes duplicates that are adjacent to each other. On its own this does not seem very useful, but it becomes powerful when used to build pipelines (connecting the output of one command to the input of another).
uniq:
https://en.wikipedia.org/wiki/Uniq
The following command lists the distinct categories in the fifth column of the iris dataset along with their counts ("-c"):

~$ tail -n 150 iris.csv | cut -d "," -f 5 | uniq -c
50 setosa
50 versicolor
50 virginica
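
The pipeline above works because the rows of iris.csv are already grouped by species. When the input is not grouped, a common approach is to sort it first so that duplicates become adjacent. The sketch below uses an invented file, species.txt, to show the difference.

```shell
# Hypothetical unsorted list of labels (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'setosa\nvirginica\nsetosa\nversicolor\nvirginica\nsetosa\n' > "$workdir/species.txt"

# No two adjacent lines are equal, so uniq alone removes nothing.
unsorted_lines=$(uniq "$workdir/species.txt" | wc -l)
echo "$unsorted_lines"

# Sorting first makes duplicates adjacent, so uniq -c can count them.
counts=$(sort "$workdir/species.txt" | uniq -c)
echo "$counts"
```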

> awk

awk is not actually a "command", but a complete programming language. It is used to process and extract text, and can be invoked as a single-line command from the command line.
awk:
https://en.wikipedia.org/wiki/AWK
It takes some time to fully master awk, but in the meantime here is an example to practice with. Given the rather limited textual diversity of the sample file iris.csv, the following command uses awk to search the file for lines containing the string "setosa" and print each matching line (held in the $0 variable) to standard output:

~$ awk '/setosa/ { print $0 }' iris.csv
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
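
awk becomes more interesting once it splits each line into fields. The following sketch, run on a small invented file (sample.csv), uses "-F ','" to set the comma as the field separator, filters rows by a numeric condition, and computes a per-species mean of the first column. The threshold and data values are arbitrary choices for illustration.

```shell
# Hypothetical three-line sample file (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '5.0,3.5,1.4,0.2,setosa\n4.0,3.0,1.4,0.2,setosa\n6.0,3,5.2,2.3,virginica\n' > "$workdir/sample.csv"

# Print the fifth field of every row whose first field exceeds 4.5.
large=$(awk -F ',' '$1 > 4.5 { print $5 }' "$workdir/sample.csv")
echo "$large"

# Mean of column 1 per species; for-in order is unspecified, so sort the result.
means=$(awk -F ',' '
  { sum[$5] += $1; n[$5]++ }
  END { for (s in sum) printf "%s %.2f\n", s, sum[s] / n[s] }
' "$workdir/sample.csv" | sort)
echo "$means"
```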

> grep

grep is another text processing tool used to find matching strings and regular expressions.
grep:
https://en.wikipedia.org/wiki/Grep
If you spend a lot of time processing text, grep is well worth mastering. For more useful examples, see:
https://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples
The following command prints every line of iris.csv that contains "vir", ignoring case ("-i"):

~$ grep -i "vir" iris.csv
6.3,3.3,6,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3,5.9,2.1,virginica
...
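
A few other grep options come up often in data work. This is a rough sketch on an invented three-line file (sample.csv): "-c" counts matching lines, "-v" inverts the match, "-n" prefixes line numbers, and "-E" enables extended regular expressions.

```shell
# Hypothetical three-line sample file (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '5.1,3.5,1.4,0.2,setosa\n7.0,3.2,4.7,1.4,versicolor\n6.3,3.3,6,2.5,virginica\n' > "$workdir/sample.csv"

# Count the lines matching "VIR", ignoring case.
match_count=$(grep -c -i 'VIR' "$workdir/sample.csv")
echo "$match_count"

# Keep only the lines that do NOT contain "setosa".
non_setosa=$(grep -v 'setosa' "$workdir/sample.csv")
echo "$non_setosa"

# Match either of two species and show the line number of each hit.
numbered=$(grep -n -E 'versicolor|virginica' "$workdir/sample.csv")
echo "$numbered"
```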

> sed

sed is a stream editor, a text processing and transformation tool similar to awk. Below we use it to replace every occurrence of "setosa" with "iris-setosa" in iris.csv, writing the result to output.csv:
sed:
https://en.wikipedia.org/wiki/Sed

~$ sed 's/setosa/iris-setosa/g' iris.csv > output.csv
~$ head output.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,iris-setosa
4.9,3,1.4,0.2,iris-setosa
4.7,3.2,1.3,0.2,iris-setosa
...
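
sed commands can also be restricted to certain lines by putting an address in front of them. The sketch below assumes a small invented file, species.txt, with a header on its first line.

```shell
# Hypothetical sample file with a header row (not from the article).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'species\nsetosa\nsetosa\nvirginica\n' > "$workdir/species.txt"

# Substitute only on lines 2 through the last line ("$"), skipping the header.
renamed=$(sed '2,$s/setosa/iris-setosa/' "$workdir/species.txt")
echo "$renamed"

# -n suppresses the default output; "2,3p" prints only lines 2 through 3.
slice=$(sed -n '2,3p' "$workdir/species.txt")
echo "$slice"

# Delete the first line (the header) from the output.
no_header=$(sed '1d' "$workdir/species.txt")
echo "$no_header"
```

None of these commands modify species.txt itself; as in the example above, redirect the output to a new file if you want to keep the result.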

> history

history is very simple but also very useful, especially when you rely on the command line for repetitive data preparation work. It lists the commands you have previously run.
history:
https://en.wikipedia.org/wiki/History_(Unix)

~$ history
547  tail iris.csv
548  tail -n 150 iris.csv
549  tail -n 150 iris.csv | cut -d "," -f 5 | uniq -c
550  clear
551  history

This article has given a brief introduction to each of these 12 handy command-line tools. It is only a first taste of what the command line can do for data science (or any other purpose). Now it's time to let these tools free your productivity from the mouse.
