Linux commands for data science

This article gives a brief overview of a number of different Linux commands, with particular emphasis on how each command can be used in the context of data science tasks. Our goal is to convince you that each of these commands is very useful, and to help you understand what role each one can play when you are processing or analyzing data.

Pipe symbol "|"

Many readers will already be familiar with the "|" symbol, but if you are not, it is worth pointing out in advance: all of the inputs and outputs of the commands discussed in the following sections can be "piped" into one another using the "|" symbol. This means that these commands, each of which performs one specialized task, can be chained together to create powerful and very short mini-programs, all directly on the command line!
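
For example, here is a small sketch (assuming a hypothetical web server log file called access.log) that chains a few of the commands covered below to count how many times a particular page was requested:

cat access.log | grep "GET /index.html" | wc -l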

grep

What is grep? grep is a tool for extracting matching text from files. You can specify a number of different control flags and options that let you be very selective about which subset of text is extracted from a file or stream. grep is generally a "line-oriented" tool, which means that when it finds matching text, it prints the entire line containing the match, although you can use the "-o" flag to print only the matching part of the line.

Why is grep useful? grep is useful because it is the fastest way to search for a specific piece of text across a large number of files. Some good use cases: filtering accesses to a particular web page out of a huge web server log; searching a code base for instances of a particular keyword (this is much faster and more reliable than searching with the Eclipse editor); filtering the output of another command in a Unix pipe.

What does grep have to do with data science? grep is very useful for specific data science tasks because it lets you very quickly filter out the information you need from a data set. Your source data likely contains a lot of information that is irrelevant to the question you are trying to answer. If the data is stored as individual rows in a text file, you can use grep to extract only the rows you want to process, provided you can think of a sufficiently precise search rule to filter them. For example, if you had the following .csv file with one sales record per line:

item, modelnumber, price, tax
Sneakers, MN009, 49.99, 1.11
Sneakers, MTG09, 139.99, 4.11
Shirt, MN089, 8.99, 1.44
Pants, N09, 39.99, 1.11
Sneakers, KN09, 49.99, 1.11
Shoes, BN009, 449.22, 4.31
Sneakers, dN099, 9.99, 1.22
Bananas, GG009, 4.99, 1.11

You can use this command:

grep Sneakers sales.csv 

to filter out only the sales records that contain the text "Sneakers". Here is the result of running this command:

Sneakers, MN009, 49.99, 1.11
Sneakers, MTG09, 139.99, 4.11
Sneakers, KN09, 49.99, 1.11
Sneakers, dN099, 9.99, 1.22

You can also use complex regular expressions with grep to search for text that matches a certain pattern. For example, this command uses grep to filter out all model numbers that begin with "BN" or "MN" followed by at least 3 digits:

grep -o "\(BN\|MN\)\([0-9]\)\{3\}" sales.csv 

Here is the result of running this command:

MN009
MN089
BN009

sed

What is sed? sed is a tool for performing search-and-replace operations. For example, you can use the following command:

sed -i 's/dog/cat/g' * 

to replace all occurrences of "dog" with "cat" in every file in the working directory.

Why is sed useful? sed is useful because you can use regular expressions to perform complex matching and substitution. It also supports back-references in the replacement, which let you match an arbitrary pattern and then change only part of the matched text. For example, the following sed command searches for two quoted strings on any given line and swaps their positions without changing any other part of the text. It also turns the quotes into parentheses at the same time:

echo 'The "quick brown" fox jumped over the "lazy red" dog.' | sed -E 's/"([^"]+)"([^"]+)"([^"]+)"/(\3)\2(\1)/'

The results are as follows:

The (lazy red) fox jumped over the (quick brown) dog.

What does sed have to do with data science? The biggest data science use case for sed arises when your data does not quite conform to the format you need. For example, suppose your boss hands you a text file data.txt containing thousands of numbers that have been erroneously wrapped in double quotes:

age,value
"33","5943"
"32","543"
"34","93"
"39","5943"
"36","9943"
"38","8943"

You can run the file through the following sed command:

cat data.txt | sed 's/"//g'

to strip out all of the quotes, with the following result:

age,value
33,5943
32,543
34,93
39,5943
36,9943
38,8943

This is very useful if you need to import the numbers into another program that cannot handle quotes around numbers. If you have ever run into a data set that could not be imported or processed correctly because of some simple formatting error, odds are there is a sed command that can fix your problem.

awk

What is awk? awk is a general-purpose computing tool for more advanced search-and-replace operations that may require general computation.

Why is awk useful? awk is useful because it is essentially a general-purpose programming language that makes it very easy to process formatted lines of text. There is some overlap with what sed can do, but awk is much more powerful. awk can also be used for changes that need to remember state between different rows.

What does awk have to do with data science? Suppose you have a CSV file, temps.csv, that contains temperature values, but instead of using only Celsius or only Fahrenheit, the file mixes the two units, denoting Celsius with C and Fahrenheit with F:

temp,unit
26.1,C
78.1,F
23.1,C
25.7,C
76.3,F
77.3,F
24.2,C
79.3,F
27.9,C
75.1,F
25.9,C
79.0,F

You can use a single simple awk command to normalize everything to Celsius:

cat temps.csv | awk -F',' '{if($2=="F")print (($1-32)*5/9)",C";else print $1","$2}'

The result will be:

temp,unit
26.1,C
25.6111,C
23.1,C
25.7,C
24.6111,C
25.1667,C
24.2,C
26.2778,C
27.9,C
23.9444,C
25.9,C
26.1111,C

All of the temperature values have been normalized to degrees Celsius.
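
Because awk can remember state between rows, it can also do simple aggregation. Here is a rough sketch that reuses the same temps.csv file, skips the header line, converts any Fahrenheit readings on the fly, and keeps a running sum and count in order to print the average temperature in Celsius:

cat temps.csv | awk -F',' 'NR>1 {c = ($2=="F") ? ($1-32)*5/9 : $1; sum += c; n += 1} END {print "average:", sum/n, "C"}'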

sort

What is sort? The name says it all: it is used for sorting!

Why is sort useful? Sorting on its own is not terribly useful, but it is an important prerequisite for many other tasks. Want to find the largest or smallest value? Sort the data and take the first or last item. Want the top ten? Sort, then take the last 10. Need numeric sorting rather than dictionary (lexicographic) order? The sort command does both! Let's try sorting the following file of random text, foo.txt, in a few different ways:

0
1
1234
11
ZZZZ
1010
0123
hello world
abc123
Hello World
9
zzzz

Here is a command to perform a default sort:

cat foo.txt | sort

The result is:

0
0123
1
1010
11
1234
9
abc123
Hello World
hello world
ZZZZ
zzzz

Note that the above is a dictionary (lexicographic) sort, not a numerical sort, so the numbers may not be in the order you expect. For numeric sorting we can use the '-n' flag instead:

cat foo.txt | sort -n 

The results are as follows:

0
abc123
Hello World
hello world
ZZZZ
zzzz
1
9
11
0123
1010
1234

Now the numbers are in the correct order. Another common requirement is to sort in reverse order, which you can do with the '-r' flag:

cat foo.txt | sort -r

The results are as follows:

zzzz
ZZZZ
hello world
Hello World
abc123
9
1234
11
1010
1
0123
0

What does sorting have to do with data science? Several of the other data-science-related Linux commands in this article (comm, uniq, etc.) require you to sort the input data first. Another useful flag of the sort command is '-R', which rearranges the input lines randomly. This is useful for developing large numbers of test cases for other software that needs to work regardless of the order of the lines in a file.
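
For instance, here is a quick sketch that shuffles the same foo.txt file; the output order will be different on every run:

cat foo.txt | sort -R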

comm

What is comm? comm is a tool for computing the results of set operations (union, intersection and complement) on the lines of the input text files.

Why is comm useful? comm is useful when you want to learn which lines two files have in common, or which lines appear in only one of them.

What does comm have to do with data science? Suppose you have two lists of email addresses: a file named signups.txt containing the email addresses of people who signed up for your newsletter:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

And another file named purchases.txt containing the email addresses of people who purchased your product:

[email protected]
[email protected]
[email protected]
[email protected]

Given these files, you may want to know the answers to three different questions: 1) Which users signed up and purchased the product? 2) Which users signed up for the newsletter but never converted to a purchase? 3) Which users made a purchase but never signed up for the newsletter? Using the comm command, you can easily answer all three. Here is a command to find the users who signed up for the newsletter and also made a purchase:

comm -12 signups.txt purchases.txt 

The results are as follows:

[email protected] 
[email protected]

And here is how we can find the users who signed up for the newsletter but did not convert:

comm -23 signups.txt purchases.txt

The results are as follows:

[email protected] 
[email protected] 
[email protected]

Finally, here is a command that shows the users who made a purchase without ever signing up for the newsletter:

comm -13 signups.txt purchases.txt

The results are as follows:

[email protected] 
[email protected]

The comm command requires that any input passed to it be sorted first. Often your input files will not be pre-sorted, but in bash you can use the following syntax to pass the output of sort directly into comm without creating any extra files:

comm -12 <(sort signups.txt) <(sort purchases.txt)

uniq

What is uniq? The uniq command helps you answer questions about uniqueness.

Why is uniq useful? If you want to remove duplicate lines and output only the unique ones, uniq will do that. Want to know how many times each item is duplicated? uniq will tell you. Want to output only the duplicated items (for example, to sanity-check input that should already be unique)? You can do that too.

What does uniq have to do with data science? Suppose you have a file called 'sales.csv' full of sales data:

Shoes,19.00
Shoes,28.00
Pants,77.00
Socks,12.00
Shirt,22.00
Socks,12.00
Socks,12.00
Boots,82.00

You would like a concise list of all the unique products in this data set. You just need to use awk to pick out the product column, pipe the result into sort, and then into uniq:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq

The results are as follows:

Boots 
Pants 
Shirt 
Shoes 
Socks

Next you might want to know how many of each unique item were sold:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c

The results are as follows:

1 Boots
1 Pants
1 Shirt
2 Shoes
3 Socks

You can also use the '-d' flag with uniq to get a list of only the items that occur more than once, as shown below. This is mostly useful when working with lists that are supposed to be (nearly) unique already.
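
For example, here is a quick sketch using the same sales.csv to list only the products that appear more than once:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -d

The results are as follows:

Shoes
Socks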

tr

What is tr? The tr command is a tool for removing or replacing individual characters or sets of characters.

Why is tr useful? The most common reason I find myself using tr is to remove unwanted carriage return characters from files that were created on a Windows machine. The following example illustrates this, piping the result into xxd so we can inspect the hex:

echo -en "Hello\r" | tr -d "\r" | xxd

You can also use tr for other special-case corrections that may need to be applied somewhere in the middle of a Unix pipe. For example, you may occasionally run into binary data whose records are separated by null characters instead of newlines. You can swap all the null characters in a file for newlines with the following tr command:

echo -en "\0" | tr \\0 \\n | xxd 

Note that the doubled '\' characters in the command above are necessary because tr expects "\0" to denote the null character, but the '\' itself needs to be escaped in the shell. The command above pipes the result into xxd so you can verify it. In a real use case you probably would not want xxd at the end of the pipe.

What does tr have to do with data science? The tr command's relationship with data science is not as profound as some of the other commands listed here, but it is often a necessary addition for the special-case fixes and clean-up that may be needed at some stage of processing your data.

cat

What is cat? The cat command is a tool you can use to concatenate files and print them to stdout.

Why is cat useful? The cat command is useful when you need to stitch multiple files together, or when you want to send a file's contents to stdout.

What does cat have to do with data science? The "concatenating" feature of cat comes up a lot when performing data science tasks. A common situation is encountering multiple CSV files of similarly formatted content that need to be aggregated. Suppose you have three .csv files of email addresses: one from newsletter signups, one from purchases, and one from another purchase list. You might want to estimate your potential reach across all of your user data, so you need to count the number of distinct emails across all three files. You can use cat to print them out together and then use sort and uniq to print the unique set of emails:

cat signups.csv purchases.csv purchased.csv | awk -F'\t' '{print $1}' | sort | uniq

You are probably used to seeing people use cat to read a file and pipe it into another program:

cat file.txt | somecommand

You will also occasionally see people point out that this is a "useless use of cat" and is not necessary, because you can use this syntax instead:

somecommand < file.txt

head

What is head? The head command lets you print only the first few lines (or bytes) of a file.

Why is head useful? head is very useful if you want to peek at a small portion of a huge (many GiB) file, or if you want to compute the "top 3" results from another part of your analysis.
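
For example, here is a minimal sketch that peeks at just the first five lines of a hypothetical huge log file:

head -n 5 giant_access.log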

What does head have to do with data science? Suppose you have a file "sales.csv" containing a list of sales data for the products you sell:

Shoes,19.00
Shoes,19.00
Pants,77.00
Pants,77.00
Shoes,19.00
Shoes,28.00
Pants,77.00
Boots,22.00
Socks,12.00
Socks,12.00
Socks,12.00
Shirt,22.00
Socks,12.00
Boots,82.00
Boots,82.00

You might want to know the answer to the question: "What are the top three most popular products, from most to least popular?" You can answer it with this pipeline:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n -r | head -n 3

The shell pipeline above feeds the sales data into awk, which prints only the first column of each line. We then sort the product names (because the uniq program requires sorted input) and use uniq to get the count of each unique product. To order the list of product counts from largest to smallest, we use 'sort -n -r' on the count values. Finally, we pipe the complete list into 'head -n 3' to see only the first three entries:

4 Socks
4 Shoes
3 Pants

tail

What is tail? The tail command is a companion to the head command, so you can expect it to work much like head, except that it prints the end of a file rather than the beginning.

Why is tail useful? The tail command is useful for all the same kinds of tasks that make the head command useful.

What does tail have to do with data science? Here is an example of how to use tail to compute the bottom three products in the sales data from the previous section:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n -r | tail -n 3

The result is:

 3 Pants
 3 Boots
 1 Shirt

Note that this is probably not the presentation format you want, since the lowest count is at the bottom. To see the lowest count at the top, you can use the head command instead, without the reverse sort:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n | head -n 3

The result is:

1 Shirt
3 Boots
3 Pants

Another good use case for tail is removing the first line of a file. For example, if you have this CSV data:

product,price
Shoes,19.00
Shoes,28.00
Pants,77.00
Socks,12.00
Shirt,22.00

and you try to count the distinct products using awk and uniq as follows:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c

you will end up with the following output:

1 Pants
1 product
1 Shirt
2 Shoes
1 Socks

The output contains the word "product" from the header, which we do not want. What we need to do is trim off the header line and process only the remaining lines (from line 2 onwards in our example). We can do this with the tail command by prefixing the line number at which output should start (1-based index) with '+':

cat sales.csv | tail -n +2 | awk -F',' '{print $1}' | sort | uniq -c

Now we get the desired result with the header omitted:

1 Pants
1 Shirt
2 Shoes
1 Socks

wc

What is wc? The wc command is a tool you can use to get word counts and line counts.

Why is wc useful? This command is useful whenever you want to quickly answer the question "How many lines are there?" or "How many characters is this?"

What does wc have to do with data science? Many questions can be quickly rephrased as "How many lines are in this file?" Want to know how many emails are on your mailing list? You can use this command:

wc -l emails.csv 

and then subtract one from the result (if the file contains a CSV header).

If you have multiple files in the working directory and want line counts for all of them (including a grand total), you can use a wildcard:

wc -l *.csv

It is often useful to count the number of characters in a piece of text or a file. You can even paste text into an echo statement (using -n to avoid the trailing newline, which would increase the count by 1):

echo -n "Here is some text that you'll get a character count for" | wc -c

The result is:

55
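
wc can also count words with the '-w' flag. For example, running the same text through 'wc -w':

echo -n "Here is some text that you'll get a character count for" | wc -w

The result is:

11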

find

What is find? The find command can search for files using a number of different options, and it can also execute commands for each file it finds.

Why is find useful? The find command is useful for searching for files using any number of different options (file/directory type, file size, file permissions, etc.), but one of its most useful features comes from the "-exec" option, which lets you execute a command on each file after it has been found.

What does find have to do with data science? First, here is an example showing how to use the find command to list all of the files and folders in and below the working directory:

find .
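
You can also narrow the search using find's matching options. For example, here is a small sketch that lists only the .csv files at or below the working directory:

find . -type f -name "*.csv"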

As you saw with the wc command above, you can count the number of lines in every file in the working directory. But if you want to iterate over all files, directories and subdirectories to get the total number of lines across every file (for example, the number of lines of code in your code base), you can use find to print the text of every file and then pipe the aggregated output into wc to get the line count:

find . -type f -exec cat {} \; | wc -l

Of course, you can change '.' to any other directory you like to run a command similar to the one above on a directory other than the working directory. Just be careful when running find with '-exec', especially if you are running as root! A lot of damage could be done if you accidentally ran the wrong command on the "/" directory.

tsort

What is tsort? tsort is a tool that can be used to perform a topological sort.

Why is it useful? "Topological sort" is the solution to many real-world problems that you probably run into every day without noticing. A very famous example is coming up with a schedule for completing a number of tasks when some tasks cannot be started until other tasks have been completed. Such considerations are necessary in construction: you cannot finish the job of painting the walls until the drywall has been installed. You cannot install the drywall until the electrical work has been done, you cannot finish the electrical work until the wall framing is complete, and so on. If you are only building one house you may be able to keep all of this in your head, but large construction projects require a more automated method. Let's review an example that uses construction tasks to build a house, in the file task_dependencies.txt:

wall_framing foundation
foundation excavation
excavation construction_permits
dry_wall electrical
electrical wall_framing
wall_painting crack_filling
crack_filling dry_wall

In the file above, each line consists of two "words". When the tsort command processes the file, it assumes that the first word describes something that needs to come after the second word. Once all of the lines have been processed, tsort outputs all of the words in order from the most downstream dependency to the least downstream dependency. Let's try it now:

cat task_dependencies.txt | tsort

The results are as follows:

wall_painting
crack_filling
dry_wall
electrical
wall_framing
foundation
excavation
construction_permits

You may remember from above that you can use the '-R' flag with the sort command to get the lines of a file in random order. If we repeatedly "randomize" the dependency list and pipe it into tsort, you will find that the result is always the same, even though the output of 'sort -R' is different every time:

cat task_dependencies.txt | sort -R | tsort

This is because the actual dependencies among the tasks do not change, even if we rearrange the lines of the file.

This only scratches the surface of topological sorting, but hopefully it piques your interest enough that you go look at the Wikipedia page on topological sorting.

What does this have to do with data science? Topological sorting is a fundamental graph theory problem that shows up in many places: machine learning, logistics, scheduling, and project management.

tee

What is tee? The tee command is a tool that lets you stream information into a file while at the same time printing it to the output of the current stream.

What does tee have to do with data science? The tee command does not actually do any analytical work for you, but it is very useful when you are trying to debug why a complex shell pipeline is not working. Let's take the example from above and place references to the tee command between each stage of the pipeline:

cat sales.csv | tail -n +2 | tee after_tail.log | awk -F',' '{print $1}' | tee after_awk.log | sort | tee after_sort.log | uniq -c | tee after_uniq.log 

Now when you run this command, you get four files that each show what the output looked like at a given stage of the pipeline. This can be extremely convenient if you ever need to come back and inspect a shell pipeline that experienced a rare or complicated failure. Complex regular expressions are often used in pipelines like this, and they sometimes match things you did not intend them to match, so using this method lets you easily gain deeper insight into what is happening at every stage.

">" Redirection symbol

The ">" symbol is the output redirection symbol. It can be used to redirect output into a file instead of printing it to the screen:

cat sales.csv | tail -n +2 | awk -F',' '{print $1}' | sort | uniq -c > unique_counts.txt 

"<" Redirection symbol

What is <? The "<" symbol is an input redirection symbol that can be used to point the contents of a file at a program's input. This is an alternative to the "useless use of cat" problem discussed above:

grep Pants < sales.csv

Confusing results with Unicode

One common problem you will eventually run into is related to mixing different Unicode encodings. Of particular note is the fact that many enterprise software vendors choose UTF-16 rather than UTF-8 when encoding .csv files or database dumps.

For example, suppose you want to grep a group of files for all instances of the word 'Hello'. First, you check what one of the files contains:

cat sometext.txt

You can see that it contains the text Hello:

Hello World!

However, when you grep this file for 'Hello', grep finds nothing. How could this happen? The answer becomes clearer when you look at the file in hex:

xxd sometext.txt

The output is as follows:

00000000: fffe 4800 6500 6c00 6c00 6f00 2000 5700  ..H.e.l.l.o. .W. 
00000010: 6f00 7200 6c00 6400 2100 0a00            o.r.l.d.!...

What is happening here is that the file sometext.txt is encoded as UTF-16, but your terminal is (probably) set to use UTF-8 by default. Printing the UTF-16-encoded characters to a UTF-8 terminal does not show an obvious problem, because the extra UTF-16 bytes are simply not represented on the terminal, while every other byte looks the same as a regular ASCII character encoded in UTF-8.

As you can see in the output above, this file is not encoded as UTF-8 but as UTF-16LE. grep does not find the text 'Hello' because when you type 'Hello' on the command line, the characters are interpreted in the character encoding currently set in your terminal environment (which is probably UTF-8). Therefore, the search string does not include the extra null bytes that follow each ASCII character in the UTF-16 file, and the search fails. If you do want to search for the UTF-16 characters, you can use a grep search like this:

grep -aP "H\x00e\x00l\x00l\x00o\x00" * sometext.txt

The 'a' flag is necessary to turn on binary file searching, because the null characters in the UTF-16 file would otherwise cause grep to interpret the file as binary and skip it. The 'P' flag specifies that the grep pattern should be interpreted as a Perl-compatible regular expression, which makes grep interpret the '\x' escapes. An often easier option is to simply convert the file to UTF-8 using iconv:

iconv -f UTF-16 -t UTF-8 sometext.txt > sometext-utf-8.txt 

Now you will not need to take any special steps when processing this file, because its encoding is likely to be compatible with your terminal's current encoding:

00000000: 4865 6c6c 6f20 576f 726c 6421 0a         Hello World!.
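
As a side note, a quick way to spot this kind of encoding problem in the first place is the file command, which guesses a file's encoding (a sketch; the exact wording of its output varies between versions):

file sometext.txt

On the original UTF-16 file it should report something along the lines of "Little-endian UTF-16 Unicode text".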

Piping directly from a database

You are not much of a data scientist if you cannot work with databases. Fortunately, most common database applications have some mechanism for running ad-hoc queries directly from the command line. Note that this practice is very crude and not at all recommended for serious investigations; it is just for getting quick, low-fidelity results. Let's start with an example using a Postgres SQL server. Suppose you have a simple database table called urls:

DROP TABLE urls;
CREATE TABLE urls (
  id serial NOT NULL PRIMARY KEY,
  url character varying(1000)
);
insert into urls (url) values ('http://example.com/');
insert into urls (url) values ('http://example.com/foo.html');
insert into urls (url) values ('http://example.org/index.html');
insert into urls (url) values ('http://google.ca/');
insert into urls (url) values ('http://google.ca/abc.html');
insert into urls (url) values ('https://google.ca/404.html');
insert into urls (url) values ('http://example.co.uk/');
insert into urls (url) values ('http://twitter.com/');
insert into urls (url) values ('http://blog.robertelder.org/');

You would like to create a list showing how common each domain name is among the urls in this table. You can start by creating a command to extract the url data (for similar queries with multiple columns, the output fields will be separated by the ',' delimiter):

psql -d mydatascience -t -A -F"," -c "select url from urls;"

This produces the following output:

http://example.com/
http://example.com/foo.html
http://example.org/index.html
http://google.ca/
http://google.ca/abc.html
https://google.ca/404.html
http://example.co.uk/
http://twitter.com/
http://blog.robertelder.org/

Now we can add a simple regular expression to the pipeline to select only the domain name:

psql -d mydatascience -t -A -F"," -c "select url from urls;" | sed -E "s/^https?:\/\/([^\/]+).*/\1/"

which gives the list of domains we are interested in:

example.com
example.com
example.org
google.ca
google.ca
google.ca
example.co.uk
twitter.com
blog.robertelder.org

Now we can use the sort / uniq techniques mentioned above to arrive at a final solution:

psql -d mydatascience -t -A -F"," -c "select url from urls;" | sed -E "s/^https?:\/\/([^\/]+).*/\1/" | sort | uniq -c | sort -n -r
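
The final output should look something like this (the ordering among the domains that appear only once may vary):

3 google.ca
2 example.com
1 twitter.com
1 example.org
1 example.co.uk
1 blog.robertelder.org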

The mysql client has a similar set of command-line options for extracting data onto the command line:

mysql ... -s -r -N -e "select 1,2;"

Of course, you may point out that your favourite query language can perform these operations directly in SQL as a single query, but the point here is to show that, if necessary, you can also perform them on the command line.

Summary

As we have discussed in this article, there are many Linux commands that are very useful for quickly solving data science problems. This article showed only a couple of useful flags for each command, but in reality there are dozens more. Hopefully your interest has been piqued enough to study them further.
