Linux is the most common platform for scientific computing and deployment of data science tools.
Open source and community support.
Things break; when they break using Linux, it’s easy to fix.
Cost: it’s free!
Debian/Ubuntu is a popular choice for personal computers.
RHEL/CentOS is popular on servers.
The teaching server for this class runs CentOS 7.
MacOS is derived from Unix, not Linux: its Darwin core builds on BSD and the XNU kernel. It is POSIX compliant, so most shell commands we review here apply to the MacOS terminal as well. Windows/DOS, unfortunately, is a totally different breed.
Show distribution/version on Linux:
cat /etc/*-release
CentOS Linux release 7.7.1908 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.7.1908 (Core)
CentOS Linux release 7.7.1908 (Core)
Show distribution/version on MacOS:
sw_vers -productVersion
or
system_profiler SPSoftwareDataType
A shell translates commands to OS instructions.
Most commonly used shells include bash, csh, tcsh, zsh, etc.
The default shell in MacOS changed from bash to zsh in MacOS v10.15.
Sometimes a command or a script does not run simply because it’s written for another shell.
We mostly use bash shell commands in this class.
Determine your login shell ($SHELL holds the login shell; echo $0 shows the shell currently running):
echo $SHELL
/bin/bash
List available shells:
cat /etc/shells
/bin/sh
/bin/bash
/usr/bin/sh
/usr/bin/bash
Change to another shell:
exec bash -l
The -l option indicates it should be a login shell.
Change your login shell permanently:
chsh -s /bin/bash userid
Then log out and log in.
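We can navigate to previous/next commands with the up and down arrow keys; the history command lists previously entered commands. The pushd and popd commands maintain a stack of directories (not of commands), which makes it easy to jump to another directory and back.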
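A minimal sketch of how pushd and popd work on a stack of directories, assuming bash (both are bash builtins). The scratch directories data and results are made up for illustration:

```shell
# Create two scratch directories (hypothetical, for illustration only).
base=$(mktemp -d)
mkdir -p "$base/data" "$base/results"

cd "$base/data"
start_dir=$(pwd)

# pushd changes directory and remembers where we came from.
pushd "$base/results" > /dev/null
during_dir=$(pwd)

# popd returns to the directory saved on top of the stack.
popd > /dev/null
end_dir=$(pwd)

echo "started in: $start_dir"
echo "pushed to:  $during_dir"
echo "popped to:  $end_dir"
```

After popd we are back where we started, without having to retype the path.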
Bash provides the following standard completions by default. Far fewer typing errors and much less time!
Pathname completion.
Filename completion.
Variablename completion: echo $[TAB][TAB].
Username completion: cd ~[TAB][TAB].
Hostname completion: ssh hwachou@[TAB][TAB].
It can also be customized to auto-complete other things, such as a command’s options and arguments. Google bash completion for more information.
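As a rough sketch of such customization, assuming bash with programmable completion: the complete builtin can register a fixed word list for a command. The command name mytool and its subcommands are hypothetical:

```shell
# Offer fixed subcommands when completing a hypothetical command "mytool".
complete -W "start stop status" mytool

# Show the completion rule that is now registered for mytool.
complete -p mytool
```

In an interactive session (e.g., with the first line placed in ~/.bashrc), typing mytool st[TAB][TAB] would then offer the matching words.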
man is man’s best friend
Online help for shell commands: man commandname.
# display documentation for the ls command
man ls
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
fied.
Mandatory arguments to long options are mandatory for short options
too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
--author
with -l, print the author of each file
-b, --escape
print C-style escapes for nongraphic characters
--block-size=SIZE
scale sizes by SIZE before printing them; e.g., '--block-size=M'
prints sizes in units of 1,048,576 bytes; see SIZE format below
-B, --ignore-backups
do not list implied entries ending with ~
-c with -lt: sort by, and show, ctime (time of last modification of
file status information); with -l: show ctime and sort by name;
otherwise: sort by ctime, newest first
-C list entries by columns
--color[=WHEN]
colorize the output; WHEN can be 'never', 'auto', or 'always'
(the default); more info below
-d, --directory
list directories themselves, not their contents
-D, --dired
generate output designed for Emacs' dired mode
-f do not sort, enable -aU, disable -ls --color
-F, --classify
append indicator (one of */=>@|) to entries
--file-type
likewise, except do not append '*'
--format=WORD
across -x, commas -m, horizontal -x, long -l, single-column -1,
verbose -l, vertical -C
--full-time
like -l --time-style=full-iso
-g like -l, but do not list owner
--group-directories-first
group directories before files;
can be augmented with a --sort option, but any use of
--sort=none (-U) disables grouping
-G, --no-group
in a long listing, don't print group names
-h, --human-readable
with -l, print sizes in human readable format (e.g., 1K 234M 2G)
--si likewise, but use powers of 1000 not 1024
-H, --dereference-command-line
follow symbolic links listed on the command line
--dereference-command-line-symlink-to-dir
follow each command line symbolic link
that points to a directory
--hide=PATTERN
do not list implied entries matching shell PATTERN (overridden
by -a or -A)
--indicator-style=WORD
append indicator with style WORD to entry names: none (default),
slash (-p), file-type (--file-type), classify (-F)
-i, --inode
print the index number of each file
-I, --ignore=PATTERN
do not list implied entries matching shell PATTERN
-k, --kibibytes
default to 1024-byte blocks for disk usage
-l use a long listing format
-L, --dereference
when showing file information for a symbolic link, show informa‐
tion for the file the link references rather than for the link
itself
-m fill width with a comma separated list of entries
-n, --numeric-uid-gid
like -l, but list numeric user and group IDs
-N, --literal
print raw entry names (don't treat e.g. control characters spe‐
cially)
-o like -l, but do not list group information
-p, --indicator-style=slash
append / indicator to directories
-q, --hide-control-chars
print ? instead of nongraphic characters
--show-control-chars
show nongraphic characters as-is (the default, unless program is
'ls' and output is a terminal)
-Q, --quote-name
enclose entry names in double quotes
--quoting-style=WORD
use quoting style WORD for entry names: literal, locale, shell,
shell-always, c, escape
-r, --reverse
reverse order while sorting
-R, --recursive
list subdirectories recursively
-s, --size
print the allocated size of each file, in blocks
-S sort by file size
--sort=WORD
sort by WORD instead of name: none (-U), size (-S), time (-t),
version (-v), extension (-X)
--time=WORD
with -l, show time as WORD instead of default modification time:
atime or access or use (-u) ctime or status (-c); also use spec‐
ified time as sort key if --sort=time
--time-style=STYLE
with -l, show times using style STYLE: full-iso, long-iso, iso,
locale, or +FORMAT; FORMAT is interpreted like in 'date'; if
FORMAT is FORMAT1<newline>FORMAT2, then FORMAT1 applies to
non-recent files and FORMAT2 to recent files; if STYLE is pre‐
fixed with 'posix-', STYLE takes effect only outside the POSIX
locale
-t sort by modification time, newest first
-T, --tabsize=COLS
assume tab stops at each COLS instead of 8
-u with -lt: sort by, and show, access time; with -l: show access
time and sort by name; otherwise: sort by access time
-U do not sort; list entries in directory order
-v natural sort of (version) numbers within text
-w, --width=COLS
assume screen width instead of current value
-x list entries by lines instead of by columns
-X sort alphabetically by entry extension
-1 list one file per line
SELinux options:
--lcontext
Display security context. Enable -l. Lines will probably be
too wide for most displays.
-Z, --context
Display security context so it fits on most displays. Displays
only mode, user, group, security context and file name.
--scontext
Display only security context and file name.
--help display this help and exit
--version
output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024).
Units are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (pow‐
ers of 1000).
Using color to distinguish file types is disabled both by default and
with --color=never. With --color=auto, ls emits color codes only when
standard output is connected to a terminal. The LS_COLORS environment
variable can change the settings. Use the dircolors command to set it.
Exit status:
0 if OK,
1 if minor problems (e.g., cannot access subdirectory),
2 if serious trouble (e.g., cannot access command-line argument).
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report ls translation bugs to <http://translationproject.org/team/>
AUTHOR
Written by Richard M. Stallman and David MacKenzie.
COPYRIGHT
Copyright © 2013 Free Software Foundation, Inc. License GPLv3+: GNU
GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
The full documentation for ls is maintained as a Texinfo manual. If
the info and ls programs are properly installed at your site, the com‐
mand
info coreutils 'ls invocation'
should give you access to the complete manual.
GNU coreutils 8.22 August 2019 LS(1)
cat prints the contents of a file:
cat runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
head prints the first 10 lines of a file:
head runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
head -l prints the first \(l\) lines of a file:
head -15 runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
tail prints the last 10 lines of a file:
tail runSim.R
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
tail -l prints the last \(l\) lines of a file:
tail -15 runSim.R
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
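head and tail can be combined to pull out a range of lines. A small sketch on a generated 20-line file (a stand-in for a real file), using the explicit -n form of the line-count option:

```shell
# Build a 20-line sample file (a stand-in for a real script or data file).
sample=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$sample"

# -n N is the explicit form of -N.
first3=$(head -n 3 "$sample")

# Lines 6 through 8: keep the first 8 lines, then the last 3 of those.
middle=$(head -n 8 "$sample" | tail -n 3)

# tail -n +K starts printing at line K (here: line 19 to the end).
from19=$(tail -n +19 "$sample")

printf '%s\n' "$first3" "---" "$middle" "---" "$from19"
rm -f "$sample"
```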
| sends output from one command as input of another command.
> directs output from one command to a file (overwriting the file if it exists).
>> appends output from one command to a file.
< reads input from a file.
Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.
See HW1.
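The four operators in one small sketch. The file words.txt and its contents are made up and live in a scratch directory:

```shell
workdir=$(mktemp -d)
log="$workdir/words.txt"

# > creates (or overwrites) the file with the command's output.
printf 'pear\napple\n' > "$log"

# >> appends to the end of the existing file.
printf 'fig\napple\n' >> "$log"

# < feeds the file to the command's standard input.
n_lines=$(wc -l < "$log")

# | chains commands: sort the words, then drop duplicates.
unique_sorted=$(sort "$log" | uniq)

echo "lines: $n_lines"
echo "$unique_sorted"
rm -rf "$workdir"
```

The file ends up with four lines, and the pipeline reduces them to three distinct words in sorted order.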
less is more; more is less
more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.
less is also a pager, but has more functionality, e.g., scrolling both upwards and downwards through the input.
less doesn’t need to read the whole file before starting, so it opens large files quickly.
grep
grep prints lines that match an expression:
Show lines that contain string CentOS:
# quotes not necessary if not a regular expression
grep 'CentOS' linux.Rmd
- RHEL/CentOS is popular on servers.
- The teaching server for this class runs CentOS 7.
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.Rmd
grep 'CentOS' *.Rmd
grep -n 'CentOS' linux.Rmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
Search multiple text files:
grep 'CentOS' *.Rmd
- RHEL/CentOS is popular on servers.
- The teaching server for this class runs CentOS 7.
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.Rmd
grep 'CentOS' *.Rmd
grep -n 'CentOS' linux.Rmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
Show matching line numbers:
grep -n 'CentOS' linux.Rmd
34:- RHEL/CentOS is popular on servers.
36:- The teaching server for this class runs CentOS 7.
323:- Show lines that contain string `CentOS`:
326: grep 'CentOS' linux.Rmd
331: grep 'CentOS' *.Rmd
336: grep -n 'CentOS' linux.Rmd
353:- Replace `CentOS` by `RHEL` in a text file:
355: sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
Find all files in current directory with .png extension:
ls | grep '\.png$'
key_authentication_1.png
key_authentication_2.png
linux_directory_structure.png
linux_filepermission_oct.png
linux_filepermission.png
Richard_Stallman_2013.png
screenshot_top.png
Find all directories in the current directory:
ls -al | grep '^d'
drwxrwxr-x. 2 huazhou huazhou 4096 Jan 15 01:01 .
drwxrwxr-x. 7 huazhou huazhou 117 Jan 14 17:01 ..
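A few more grep options, sketched on a made-up sample file: -c counts matching lines, -i ignores case, -v inverts the match. The last two commands show why a literal dot in a pattern should be escaped:

```shell
notes=$(mktemp)
cat > "$notes" <<'EOF'
RHEL/CentOS is popular on servers.
Debian/Ubuntu is popular on personal computers.
The teaching server runs centos 7.
plot.png
plotxpng
EOF

# -c counts matching lines; -i ignores case.
n_exact=$(grep -c 'CentOS' "$notes")
n_nocase=$(grep -ci 'centos' "$notes")

# -v inverts the match: lines that do NOT mention "popular".
n_not_popular=$(grep -vc 'popular' "$notes")

# An unescaped . matches any character; \. matches a literal dot.
loose=$(grep -c '.png$' "$notes")
strict=$(grep -c '\.png$' "$notes")

echo "$n_exact $n_nocase $n_not_popular $loose $strict"
rm -f "$notes"
```

The loose pattern also matches plotxpng, which is not a .png file name; the escaped pattern does not.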
sed
sed is a stream editor.
Replace the first CentOS on each line by RHEL in a text file:
sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
- RHEL/RHEL is popular on servers.
- The teaching server for this class runs RHEL 7.
- Show lines that contain string `RHEL`:
grep 'RHEL' linux.Rmd
grep 'RHEL' *.Rmd
grep -n 'RHEL' linux.Rmd
- Replace `RHEL` by `RHEL` in a text file:
sed 's/RHEL/RHEL/' linux.Rmd | grep RHEL
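By default s/// changes only the first match on each line. A short sketch on made-up input strings showing the g flag and the -n ... p combination:

```shell
line='CentOS and CentOS Stream'

# Without a flag, s/// replaces only the first match on each line.
first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')

# The g flag replaces every match on the line.
all=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')

# -n with the p flag prints only the lines where a substitution happened.
changed=$(printf 'CentOS 7\nUbuntu 20.04\n' | sed -n 's/CentOS/RHEL/p')

echo "$first_only"
echo "$all"
echo "$changed"
```

The -n ... p form does the same job as piping sed into grep when we only want to see the lines that changed.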
awk
awk is a filter and report writer.
First let’s display the first lines of the file /etc/passwd:
head /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user information (comment), (6) home directory, and (7) command shell, separated by :.
Print sorted list of login names:
awk -F: '{ print $1 }' /etc/passwd | sort | head -10
203bdemo
adm
amisheth26
andyliugraduateschool
bin
brendonchau
brett.young
bursontung97
char.flournoy
chenyu1997
Print number of lines in a file, as NR stands for Number of Records (one record per line by default):
awk 'END { print NR }' /etc/passwd
69
or
wc -l /etc/passwd
69 /etc/passwd
or (not displaying file name)
wc -l < /etc/passwd
69
Print entries (whole lines) with UID in range 1000-1035:
awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
huazhou:x:1000:1001::/home/huazhou:/bin/bash
juhkim111:x:1001:1003::/home/juhkim111:/bin/bash
raguilar2:x:1002:1004::/home/raguilar2:/bin/bash
elalb:x:1003:1005::/home/elalb:/bin/bash
sdalmia:x:1004:1006::/home/sdalmia:/bin/bash
gdewey:x:1005:1007::/home/gdewey:/bin/bash
mfdong12:x:1006:1008::/home/mfdong12:/bin/bash
farboodi:x:1007:1009::/home/farboodi:/bin/bash
mfaulis17:x:1008:1010::/home/mfaulis17:/bin/bash
char.flournoy:x:1009:1011::/home/char.flournoy:/bin/bash
willgertsch:x:1010:1012::/home/willgertsch:/bin/bash
ynguo94:x:1011:1013::/home/ynguo94:/bin/bash
ghancock:x:1012:1014::/home/ghancock:/bin/bash
krh005:x:1013:1015::/home/krh005:/bin/bash
yilanh19:x:1014:1016::/home/yilanh19:/bin/bash
nslly19:x:1015:1017::/home/nslly19:/bin/bash
tonylim:x:1016:1018::/home/tonylim:/bin/bash
andyliugraduateschool:x:1017:1019::/home/andyliugraduateschool:/bin/bash
lnliuxue:x:1018:1020::/home/lnliuxue:/bin/bash
y9lyu:x:1019:1021::/home/y9lyu:/bin/bash
lillynhan:x:1020:1022::/home/lillynhan:/bin/bash
joodeh:x:1021:1023::/home/joodeh:/bin/bash
wenlanpan:x:1022:1024::/home/wenlanpan:/bin/bash
stpraser18:x:1023:1025::/home/stpraser18:/bin/bash
qiqi0610:x:1024:1026::/home/qiqi0610:/bin/bash
Johnrandazzo1996:x:1025:1027::/home/Johnrandazzo1996:/bin/bash
amisheth26:x:1026:1028::/home/amisheth26:/bin/bash
ranjana.n.w:x:1027:1029::/home/ranjana.n.w:/bin/bash
naomixu:x:1028:1030::/home/naomixu:/bin/bash
xurui1996:x:1029:1031::/home/xurui1996:/bin/bash
brett.young:x:1030:1032::/home/brett.young:/bin/bash
hanyanyuan:x:1031:1033::/home/hanyanyuan:/bin/bash
203bdemo:x:1032:1034::/home/203bdemo:/bin/bash
dalekim25:x:1033:1035::/home/dalekim25:/bin/bash
chenyu1997:x:1034:1036::/home/chenyu1997:/bin/bash
hk_lian:x:1035:1037::/home/hk_lian:/bin/bash
Print login names and login shells in comma-separated format:
awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
root,/bin/bash
bin,/sbin/nologin
daemon,/sbin/nologin
adm,/sbin/nologin
lp,/sbin/nologin
sync,/bin/sync
shutdown,/sbin/shutdown
halt,/sbin/halt
mail,/sbin/nologin
operator,/sbin/nologin
games,/sbin/nologin
ftp,/sbin/nologin
nobody,/sbin/nologin
systemd-network,/sbin/nologin
dbus,/sbin/nologin
polkitd,/sbin/nologin
ntp,/sbin/nologin
sshd,/sbin/nologin
postfix,/sbin/nologin
chrony,/sbin/nologin
huazhou,/bin/bash
tss,/sbin/nologin
rstudio-server,/bin/bash
shiny,/bin/sh
mongod,/bin/false
saslauth,/sbin/nologin
juhkim111,/bin/bash
raguilar2,/bin/bash
elalb,/bin/bash
sdalmia,/bin/bash
gdewey,/bin/bash
mfdong12,/bin/bash
farboodi,/bin/bash
mfaulis17,/bin/bash
char.flournoy,/bin/bash
willgertsch,/bin/bash
ynguo94,/bin/bash
ghancock,/bin/bash
krh005,/bin/bash
yilanh19,/bin/bash
nslly19,/bin/bash
tonylim,/bin/bash
andyliugraduateschool,/bin/bash
lnliuxue,/bin/bash
y9lyu,/bin/bash
lillynhan,/bin/bash
joodeh,/bin/bash
wenlanpan,/bin/bash
stpraser18,/bin/bash
qiqi0610,/bin/bash
Johnrandazzo1996,/bin/bash
amisheth26,/bin/bash
ranjana.n.w,/bin/bash
naomixu,/bin/bash
xurui1996,/bin/bash
brett.young,/bin/bash
hanyanyuan,/bin/bash
203bdemo,/bin/bash
dalekim25,/bin/bash
chenyu1997,/bin/bash
hk_lian,/bin/bash
srishtimajumdar,/bin/bash
bursontung97,/bin/bash
jaketompkins97,/bin/bash
fredericy19,/bin/bash
brendonchau,/bin/bash
jshamsho,/bin/bash
tagibson,/bin/bash
postgres,/bin/bash
Print login names and indicate those with UID>=1000 as vip:
awk -F: -v status="" '{OFS = ","}
{if ($3 >= 1000) status="vip"; else status="regular"}
{print $1, status}' /etc/passwd
root,regular
bin,regular
daemon,regular
adm,regular
lp,regular
sync,regular
shutdown,regular
halt,regular
mail,regular
operator,regular
games,regular
ftp,regular
nobody,regular
systemd-network,regular
dbus,regular
polkitd,regular
ntp,regular
sshd,regular
postfix,regular
chrony,regular
huazhou,vip
tss,regular
rstudio-server,regular
shiny,regular
mongod,regular
saslauth,regular
juhkim111,vip
raguilar2,vip
elalb,vip
sdalmia,vip
gdewey,vip
mfdong12,vip
farboodi,vip
mfaulis17,vip
char.flournoy,vip
willgertsch,vip
ynguo94,vip
ghancock,vip
krh005,vip
yilanh19,vip
nslly19,vip
tonylim,vip
andyliugraduateschool,vip
lnliuxue,vip
y9lyu,vip
lillynhan,vip
joodeh,vip
wenlanpan,vip
stpraser18,vip
qiqi0610,vip
Johnrandazzo1996,vip
amisheth26,vip
ranjana.n.w,vip
naomixu,vip
xurui1996,vip
brett.young,vip
hanyanyuan,vip
203bdemo,vip
dalekim25,vip
chenyu1997,vip
hk_lian,vip
srishtimajumdar,vip
bursontung97,vip
jaketompkins97,vip
fredericy19,vip
brendonchau,vip
jshamsho,vip
tagibson,vip
postgres,regular
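Two further awk idioms, sketched on a small passwd-style file with made-up entries (alice and bob are hypothetical): a BEGIN block to set the separators once, and an associative array to count records per group:

```shell
# A small passwd-style sample (made-up entries, same 7-field layout).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
alice:x:1000:1001::/home/alice:/bin/bash
bob:x:1001:1002::/home/bob:/bin/zsh
EOF

# BEGIN runs once before any input, a natural place to set FS and OFS.
names=$(awk 'BEGIN { FS = ":"; OFS = "," } $3 >= 1000 { print $1, $7 }' "$sample")

# Count accounts per login shell with an associative array; END runs last.
per_shell=$(awk -F: '{ count[$7]++ } END { for (s in count) print s, count[s] }' "$sample" | sort)

echo "$names"
echo "$per_shell"
rm -f "$sample"
```

The loop over an awk array visits keys in no particular order, hence the final sort.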
Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it’s not installed by default on many Linux distributions.
emacs filename to open a file with emacs.
CTRL-x CTRL-f to open an existing or new file.
CTRL-x CTRL-s to save.
CTRL-x CTRL-w to save as.
CTRL-x CTRL-c to quit.
Google emacs cheatsheet
C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.
Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.
vi filename to start editing a file.
vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
:x<Return> quits vi and saves changes.
:q!<Return> quits vi without saving latest changes.
:w<Return> saves changes.
:wq<Return> quits vi and saves changes.
Google vi cheatsheet
Statisticians write a lot of code. It is critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.
RStudio, Eclipse, Emacs, Matlab, Visual Studio, etc.
Ctrl+C to cancel a non-responding or long-running program.
OS runs processes on behalf of user.
Each process has Process ID (PID), Username (UID), Parent process ID (PPID), Time and date process started (STIME), CPU time used (TIME), etc.
ps
PID TTY TIME CMD
19455 ? 00:00:01 rsession
19529 ? 00:00:00 R
19644 ? 00:00:00 sh
19645 ? 00:00:00 ps
All currently running processes:
ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Jan09 ? 00:00:24 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
root 2 0 0 Jan09 ? 00:00:00 [kthreadd]
root 4 2 0 Jan09 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Jan09 ? 00:00:00 [ksoftirqd/0]
root 7 2 0 Jan09 ? 00:00:00 [migration/0]
root 8 2 0 Jan09 ? 00:00:00 [rcu_bh]
root 9 2 0 Jan09 ? 00:01:02 [rcu_sched]
root 10 2 0 Jan09 ? 00:00:00 [lru-add-drain]
root 11 2 0 Jan09 ? 00:00:02 [watchdog/0]
root 12 2 0 Jan09 ? 00:00:01 [watchdog/1]
root 13 2 0 Jan09 ? 00:00:00 [migration/1]
root 14 2 0 Jan09 ? 00:00:00 [ksoftirqd/1]
root 16 2 0 Jan09 ? 00:00:00 [kworker/1:0H]
root 17 2 0 Jan09 ? 00:00:01 [watchdog/2]
root 18 2 0 Jan09 ? 00:00:00 [migration/2]
root 19 2 0 Jan09 ? 00:00:01 [ksoftirqd/2]
root 21 2 0 Jan09 ? 00:00:00 [kworker/2:0H]
root 22 2 0 Jan09 ? 00:00:01 [watchdog/3]
root 23 2 0 Jan09 ? 00:00:00 [migration/3]
root 24 2 0 Jan09 ? 00:00:01 [ksoftirqd/3]
root 26 2 0 Jan09 ? 00:00:00 [kworker/3:0H]
root 28 2 0 Jan09 ? 00:00:00 [kdevtmpfs]
root 29 2 0 Jan09 ? 00:00:00 [netns]
root 30 2 0 Jan09 ? 00:00:00 [khungtaskd]
root 31 2 0 Jan09 ? 00:00:00 [writeback]
root 32 2 0 Jan09 ? 00:00:00 [kintegrityd]
root 33 2 0 Jan09 ? 00:00:00 [bioset]
root 34 2 0 Jan09 ? 00:00:00 [bioset]
root 35 2 0 Jan09 ? 00:00:00 [bioset]
root 36 2 0 Jan09 ? 00:00:00 [kblockd]
root 37 2 0 Jan09 ? 00:00:00 [md]
root 38 2 0 Jan09 ? 00:00:00 [edac-poller]
root 39 2 0 Jan09 ? 00:00:00 [watchdogd]
root 46 2 0 Jan09 ? 00:00:00 [kswapd0]
root 47 2 0 Jan09 ? 00:00:00 [ksmd]
root 48 2 0 Jan09 ? 00:00:02 [khugepaged]
root 49 2 0 Jan09 ? 00:00:00 [crypto]
root 57 2 0 Jan09 ? 00:00:00 [kthrotld]
root 59 2 0 Jan09 ? 00:00:00 [kmpath_rdacd]
root 60 2 0 Jan09 ? 00:00:00 [kaluad]
root 61 2 0 Jan09 ? 00:00:00 [kpsmoused]
root 63 2 0 Jan09 ? 00:00:00 [ipv6_addrconf]
root 76 2 0 Jan09 ? 00:00:00 [deferwq]
root 115 2 0 Jan09 ? 00:00:48 [kauditd]
root 400 2 0 Jan09 ? 00:00:00 [virtscsi-scan]
root 401 2 0 Jan09 ? 00:00:00 [scsi_eh_0]
root 402 2 0 Jan09 ? 00:00:00 [scsi_tmf_0]
root 427 2 0 Jan09 ? 00:00:00 [bioset]
root 428 2 0 Jan09 ? 00:00:00 [xfsalloc]
root 429 2 0 Jan09 ? 00:00:00 [xfs_mru_cache]
root 430 2 0 Jan09 ? 00:00:00 [xfs-buf/sda1]
root 431 2 0 Jan09 ? 00:00:00 [xfs-data/sda1]
root 432 2 0 Jan09 ? 00:00:00 [xfs-conv/sda1]
root 433 2 0 Jan09 ? 00:00:00 [xfs-cil/sda1]
root 434 2 0 Jan09 ? 00:00:00 [xfs-reclaim/sda]
root 435 2 0 Jan09 ? 00:00:00 [xfs-log/sda1]
root 436 2 0 Jan09 ? 00:00:00 [xfs-eofblocks/s]
root 437 2 0 Jan09 ? 00:02:48 [xfsaild/sda1]
root 438 2 0 Jan09 ? 00:00:04 [kworker/0:1H]
root 487 2 0 Jan09 ? 00:00:00 [kworker/2:1H]
root 502 1 0 Jan09 ? 00:01:07 /usr/lib/systemd/systemd-journald
root 540 1 0 Jan09 ? 00:00:00 /usr/lib/systemd/systemd-udevd
root 559 1 0 Jan09 ? 00:01:58 /sbin/auditd
root 646 2 0 Jan09 ? 00:00:00 [hwrng]
root 673 2 0 Jan09 ? 00:00:00 [nfit]
root 873 1 0 Jan09 ? 00:00:01 /opt/shiny-server/ext/node/bin/shiny-server /opt/shiny-server/lib/main.js
dbus 875 1 0 Jan09 ? 00:00:05 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 880 1 0 Jan09 ? 00:00:04 /usr/lib/systemd/systemd-logind
polkitd 882 1 0 Jan09 ? 00:00:01 /usr/lib/polkit-1/polkitd --no-debug
root 888 1 0 Jan09 ? 00:00:00 /usr/sbin/acpid
chrony 918 1 0 Jan09 ? 00:00:00 /usr/sbin/chronyd
root 919 1 0 Jan09 ? 00:00:01 /usr/sbin/crond -n
root 926 1 0 Jan09 tty1 00:00:00 /sbin/agetty --noclear tty1 linux
root 928 1 0 Jan09 ttyS0 00:00:00 /sbin/agetty --keep-baud 115200,38400,9600 ttyS0 vt220
root 944 1 0 Jan09 ? 00:00:01 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
root 973 1 0 Jan09 ? 00:00:10 /usr/sbin/NetworkManager --no-daemon
root 1125 973 0 Jan09 ? 00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-18fa3a36-80a2-442f-bc21-276a1bb368be-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0
root 1365 1 0 Jan09 ? 00:00:51 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
root 1366 1 0 Jan09 ? 00:00:00 /usr/sbin/cupsd -f
root 1369 1 0 Jan09 ? 00:01:00 /usr/bin/google_osconfig_agent
root 1370 1 0 Jan09 ? 00:00:55 /usr/sbin/rsyslogd -n
rstudio+ 1441 1 0 Jan09 ? 00:02:12 /usr/lib/rstudio-server/bin/rserver
root 1628 1 0 Jan09 ? 00:00:24 /usr/bin/python /usr/bin/google_network_daemon
root 1631 1 0 Jan09 ? 00:00:12 /usr/sbin/sshd -D
root 1633 1 0 Jan09 ? 00:00:46 /usr/bin/python /usr/bin/google_accounts_daemon
root 1635 1 0 Jan09 ? 00:00:10 /usr/bin/python /usr/bin/google_clock_skew_daemon
root 1717 1 0 Jan09 ? 00:00:02 /usr/libexec/postfix/master -w
postfix 1719 1717 0 Jan09 ? 00:00:00 qmgr -l -t unix -u
mongod 1721 1 0 Jan09 ? 00:24:13 /usr/bin/mongod -f /etc/mongod.conf
root 1838 2 0 Jan09 ? 00:00:00 [kworker/3:1H]
root 1839 2 0 Jan09 ? 00:00:00 [kworker/1:1H]
root 4474 1631 0 02:07 ? 00:00:00 sshd: raguilar2 [priv]
raguila+ 4478 4474 0 02:07 ? 00:00:00 sshd: raguilar2@pts/4
raguila+ 4479 4478 0 02:07 pts/4 00:00:00 -bash
root 7502 2 0 02:49 ? 00:00:00 [kworker/2:1]
root 7504 2 0 02:49 ? 00:00:00 [kworker/0:0]
root 8240 1 0 03:01 ? 00:00:00 /usr/sbin/anacron -s
root 11033 8240 0 03:45 ? 00:00:00 /bin/bash /bin/run-parts /etc/cron.daily
root 11037 11033 0 03:45 ? 00:00:00 /usr/bin/python -tt /usr/sbin/yum-cron
root 11038 11033 0 03:45 ? 00:00:00 awk -v progname=/etc/cron.daily/0yum-daily.cron progname { ???? print progname ":\n" ???? progname=""; ??? } ??? { print; }
root 11046 2 0 03:45 ? 00:00:00 [kworker/u8:0]
root 13888 1631 0 04:29 ? 00:00:00 sshd: lnliuxue [priv]
lnliuxue 13892 13888 0 04:29 ? 00:00:00 sshd: lnliuxue@pts/1
lnliuxue 13893 13892 0 04:29 pts/1 00:00:00 -bash
postfix 15098 1717 0 04:45 ? 00:00:00 pickup -l -t unix -u
root 15491 1631 0 04:51 ? 00:00:00 sshd: qiqi0610 [priv]
qiqi0610 15495 15491 0 04:51 ? 00:00:00 sshd: qiqi0610@pts/2
qiqi0610 15496 15495 0 04:51 pts/2 00:00:00 -bash
root 15783 2 0 04:54 ? 00:00:00 [kworker/3:0]
root 16214 2 0 05:00 ? 00:00:00 [kworker/0:1]
root 17791 2 0 05:21 ? 00:00:00 [kworker/3:1]
root 18456 2 0 05:31 ? 00:00:00 [kworker/1:2]
root 18848 2 0 05:37 ? 00:00:00 [kworker/1:0]
root 19171 2 0 05:42 ? 00:00:00 [kworker/1:1]
root 19174 2 0 05:42 ? 00:00:00 [kworker/0:2]
huazhou 19455 1441 7 05:47 ? 00:00:01 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token BB375147
huazhou 19529 19455 48 05:47 ? 00:00:00 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/ucla-biostat203b-2020winter.github.io/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
huazhou 19646 19529 0 05:47 ? 00:00:00 sh -c 'bash' -c 'ps -eaf' 2>&1
huazhou 19647 19646 0 05:47 ? 00:00:00 ps -eaf
root 24201 2 0 Jan14 ? 00:00:01 [kworker/u8:1]
root 31010 2 0 00:59 ? 00:00:00 [kworker/2:0]
All Python processes:
ps -eaf | grep python
root 944 1 0 Jan09 ? 00:00:01 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
root 1365 1 0 Jan09 ? 00:00:51 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
root 1628 1 0 Jan09 ? 00:00:24 /usr/bin/python /usr/bin/google_network_daemon
root 1633 1 0 Jan09 ? 00:00:46 /usr/bin/python /usr/bin/google_accounts_daemon
root 1635 1 0 Jan09 ? 00:00:10 /usr/bin/python /usr/bin/google_clock_skew_daemon
root 11037 11033 0 03:45 ? 00:00:00 /usr/bin/python -tt /usr/sbin/yum-cron
huazhou 19648 19529 0 05:47 ? 00:00:00 sh -c 'bash' -c 'ps -eaf | grep python' 2>&1
huazhou 19649 19648 0 05:47 ? 00:00:00 bash -c ps -eaf | grep python
huazhou 19651 19649 0 05:47 ? 00:00:00 grep python
Process with PID=1:
ps -fp 1
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Jan09 ? 00:00:24 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
All processes owned by a user:
ps -fu huazhou
UID PID PPID C STIME TTY TIME CMD
huazhou 19455 1441 7 05:47 ? 00:00:01 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token BB375147
huazhou 19529 19455 49 05:47 ? 00:00:00 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/ucla-biostat203b-2020winter.github.io/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
huazhou 19654 19529 0 05:47 ? 00:00:00 sh -c 'bash' -c 'ps -fu huazhou' 2>&1
huazhou 19655 19654 0 05:47 ? 00:00:00 ps -fu huazhou
Kill process with PID=1001:
kill 1001
Kill all R processes.
killall -r R
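A self-contained sketch of the start, check, and kill cycle, using a harmless sleep job in place of a real long-running program:

```shell
# Start a long-running job in the background; $! holds its PID.
sleep 300 &
pid=$!

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$pid" 2>/dev/null; then
  was_running=yes
else
  was_running=no
fi

# Plain kill sends SIGTERM (signal 15), a polite request to terminate.
kill "$pid"

# wait reports 128 + signal number for a job ended by a signal.
status=0
wait "$pid" || status=$?

echo "was running: $was_running, exit status after kill: $status"
```

If a process ignores SIGTERM, kill -9 (SIGKILL) forces termination, at the cost of giving the program no chance to clean up.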
top
top prints realtime process information (very useful).
top
Exit the top program by pressing the q key.
SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.
On Linux or Mac terminal, access the teaching server by
ssh username@server.ucla-biostat-203b.com
For Windows users, there are 3 options: (1) (highly recommended) Git Bash, (2) (not recommended) PuTTY program (free), or (3) (may be an overkill for this class) use WSL for Windows to install a full fledged Linux system within Windows.
Key authentication is more secure than password. Most passwords are weak.
A script or a program may need to systematically SSH into other machines.
Log into multiple machines using the same key.
Seamless use of many services: Git, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.
Many servers only allow key authentication and do not accept password authentication.
Public key. Put on the machine(s) you want to log in to.
Private key. Put on your own computer. Consider this as the actual key in your pocket; never give to others.
Messages from server to your computer are encrypted with your public key. They can only be decrypted using your private key.
Messages from your computer to server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
On Linux, Mac, or Windows Git Bash, to generate a key pair:
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
[KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of id_rsa generates a private key file named id_rsa and a public key file named id_rsa.pub.
[USERNAME] is the user for whom you will apply this SSH key.
Use an (optional) passphrase different from your password.
Set correct permissions on the .ssh folder and key files:
chmod 700 ~/.ssh
chmod 400 ~/.ssh/[KEY_FILENAME]
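To see what such modes look like in a listing, here is a sketch on a scratch directory standing in for ~/.ssh (nothing in the real ~/.ssh is touched). The mode 644 for the public key is an assumption, chosen because a public key need not be secret:

```shell
# Use a scratch directory as a stand-in for ~/.ssh (nothing real is touched).
fake_ssh=$(mktemp -d)
touch "$fake_ssh/id_rsa" "$fake_ssh/id_rsa.pub"

chmod 700 "$fake_ssh"             # directory: owner may read, write, enter
chmod 400 "$fake_ssh/id_rsa"      # private key: owner read only
chmod 644 "$fake_ssh/id_rsa.pub"  # public key: readable by everyone

# The first 10 characters of ls -l output encode the file type and mode.
dir_mode=$(ls -ld "$fake_ssh" | cut -c1-10)
key_mode=$(ls -l "$fake_ssh/id_rsa" | cut -c1-10)
pub_mode=$(ls -l "$fake_ssh/id_rsa.pub" | cut -c1-10)

echo "$dir_mode $key_mode $pub_mode"
rm -rf "$fake_ssh"
```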
Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,
ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
Test your new key.
ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
Now you don’t need a password each time you connect from your machine to the teaching server.
If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.
The same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.
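One optional convenience, sketched under assumptions: an entry in the OpenSSH client configuration file can record the host, user, and key so they need not be typed each time. The alias teaching is hypothetical, and the sketch writes to a scratch file rather than to the real ~/.ssh/config:

```shell
# Write to a scratch file here; the real file would be ~/.ssh/config.
config=$(mktemp)
cat > "$config" <<'EOF'
# "teaching" is a hypothetical alias; replace the bracketed placeholders.
Host teaching
    HostName server.ucla-biostat-203b.com
    User [USERNAME]
    IdentityFile ~/.ssh/[KEY_FILENAME]
EOF

cat "$config"
```

With such an entry in ~/.ssh/config, ssh teaching should behave like the longer ssh -i command used to test the key.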
scp securely transfers files between machines using SSH.
## copy file from local to remote
scp [LOCALFILE] [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FOLDER]
## copy file from remote to local
scp [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
sftp is FTP via SSH.
Globus is a GUI program for securely transferring files between machines. To use Globus you will have to go to https://www.globus.org/ and log in through UCLA by selecting your existing organizational login as UCLA. Then you will need to download their Globus Connect Personal software and set your laptop as an endpoint. Very detailed instructions can be found at https://www.hoffman2.idre.ucla.edu/file-transfer/globus/.
GUIs for Windows (WinSCP) or Mac (Cyberduck).
(My preferred way) Use a version control system (git, svn, cvs, …) to sync project files between different machines and systems.
Windows uses a pair of CR
and LF
for line breaks.
Linux/Unix uses an LF
character only.
MacOS X also uses a single LF
character. But old Mac OS used a single CR
character for line breaks.
If transferred in binary mode (byte for byte, with no conversion) between OSs, a text file can look like a mess.
Most transfer programs automatically switch to text mode when transferring text files and convert line breaks between different OSs, but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly that a text file is being transferred.
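When a file does arrive with the wrong line breaks, we can inspect and repair it in the shell. The sketch below creates a small file with Windows-style line breaks in a scratch folder (the file names are made up for illustration), shows the raw bytes, and strips the CR characters.

```shell
tmp_dir="$(mktemp -d)"

# Create a two-line file with Windows-style CRLF line breaks.
printf 'alpha\r\nbeta\r\n' > "$tmp_dir/windows.txt"

# od -c prints each byte, so \r and \n become visible.
od -c "$tmp_dir/windows.txt"

# Strip every CR to obtain Unix-style LF line breaks.
tr -d '\r' < "$tmp_dir/windows.txt" > "$tmp_dir/unix.txt"

# Count CR bytes before and after the conversion.
cr_before="$(tr -cd '\r' < "$tmp_dir/windows.txt" | wc -c | tr -d ' ')"
cr_after="$(tr -cd '\r' < "$tmp_dir/unix.txt" | wc -c | tr -d ' ')"
echo "CR bytes before: $cr_before, after: $cr_after"
```

Where it is installed, the dedicated dos2unix utility does the same job.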
Start R in the interactive mode by typing R
in shell.
Then run an R script by
source("script.R")
Demo script meanEst.R
implements a (terrible) estimator of the mean \[
{\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}.
\]
## check if a given integer is prime
isPrime = function(n) {
  ## integers below 2 (including index 1) are not prime
  if (n < 2) {
    return (FALSE)
  }
  if (n <= 3) {
    return (TRUE)
  }
  ## trial division by 2, ..., floor(sqrt(n))
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}
## estimate mean only using observations with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}
print(estMeanPrimes(rnorm(100000)))
To run your R code non-interactively, i.e., in batch mode, we have at least two options:
# default output to meanEst.Rout
R CMD BATCH meanEst.R
or
# output to stdout
Rscript meanEst.R
We typically automate batch calls using a scripting language, e.g., Python, Perl, or a shell script.
Specify arguments in R CMD BATCH
:
R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
Specify arguments in Rscript
:
Rscript script.R mu=1 sig=2 kap=3
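As an illustration of automating batch calls from a shell script, the sketch below loops over a small, hypothetical grid of mu and sig values and builds one Rscript command per combination. It is a dry run that only prints the commands, so it works even where R is not installed. The variable names are made up for this example.

```shell
# Hypothetical parameter grid; script.R is the script name used above.
mu_vals="0 1"
sig_vals="1 2"

cmd_count=0
for mu in $mu_vals; do
  for sig in $sig_vals; do
    cmd="Rscript script.R mu=$mu sig=$sig kap=3"
    # Dry run: print the command instead of executing it.
    echo "$cmd"
    cmd_count=$((cmd_count + 1))
  done
done
echo "generated $cmd_count commands"
```

To actually launch the jobs, replace the echo of the command with the command itself.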
Parse command line arguments using the magic formula
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}
in the R script. After running the above code, each command line argument of the form name=value is available as a variable in the global environment.
To understand the magic formula commandArgs
, run R by:
R '--args mu=1 sig=2 kap=3'
and then issue commands in R
commandArgs()
commandArgs(TRUE)
Understand the magic formula parse
and eval
:
rm(list=ls())
print(x)
Error in print(x): object 'x' not found
parse(text="x=3")
expression(x = 3)
eval(parse(text="x=3"))
print(x)
[1] 3
runSim.R
has components: (1) command argument parser, (2) method implementation, (3) data generator with unspecified parameter n
, and (4) estimation based on generated data.
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
  ## integers below 2 (including index 1) are not prime
  if (n < 2) {
    return (FALSE)
  }
  if (n <= 3) {
    return (TRUE)
  }
  ## trial division by 2, ..., floor(sqrt(n))
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}
## estimate mean only using observations with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
Call runSim.R
with sample size n=100
:
R CMD BATCH '--args n=100' runSim.R
or
Rscript runSim.R n=100
[1] 0.1883607
Many statistical computing tasks take a long time: simulation, MCMC, etc. If we log out of the Linux server before a job has finished, the job is killed.
nohup
command in Linux runs a program immune to hangup signals and writes its output to nohup.out
by default. Logging out will not kill the process; we can log in later to check status and results.
nohup
is a POSIX standard utility and is thus available on Linux and MacOS.
Run runSim.R
in the background and write output to nohup.out
:
nohup Rscript runSim.R n=100 &
[1] 0.1704422
The &
at the end of the command instructs the shell to run this command in the background, so we regain control of the terminal immediately.
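Here is a small sketch of the same pattern that runs without R: a short sleep stands in for a long job such as Rscript runSim.R n=100, and the log file name job.log is a hypothetical choice. It also shows how to capture the job's process ID with $! and check whether the job is still alive.

```shell
work_dir="$(mktemp -d)"
log_file="$work_dir/job.log"        # hypothetical log file name

# Stand-in for a long job such as: Rscript runSim.R n=100
nohup sh -c 'sleep 1; echo "job finished"' > "$log_file" 2>&1 < /dev/null &
job_pid=$!                          # PID of the background job
echo "started job with PID $job_pid"

# The prompt is free immediately; check on the job while it runs.
if kill -0 "$job_pid" 2>/dev/null; then
  echo "job is still running"
fi

wait "$job_pid"                     # only for this demo: block until it ends
cat "$log_file"
```

Because the output is redirected explicitly, nothing is written to nohup.out here. In real use we would skip the wait, log out, and look at the log file later.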
screen
is another popular utility, but not installed by default.
Typical workflow using screen
.
Access remote server using ssh
.
Start jobs in batch mode.
Detach jobs.
Exit from server, wait for jobs to finish.
Access remote server using ssh
.
Re-attach jobs, check on progress, get results, etc.
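The workflow above can be sketched with a few screen commands. This sketch assumes screen is installed and skips itself otherwise. The session name sim_demo and the variable screen_demo are hypothetical, and a short sleep stands in for the batch job. Interactive steps appear as comments because they need a terminal.

```shell
session="sim_demo"   # hypothetical session name

if command -v screen >/dev/null 2>&1; then
  # -d -m: start the session detached; -S: give it a name.
  screen -dmS "$session" sh -c 'sleep 2'
  # List sessions; this is what we would run after logging back in.
  screen -ls || true
  # Interactive equivalents:
  #   screen -S sim_demo     start a named session and work inside it
  #   Ctrl-a d               detach, leaving jobs running
  #   screen -r sim_demo     re-attach later
  screen -S "$session" -X quit >/dev/null 2>&1 || true   # clean up the demo session
  screen_demo="ran"
else
  screen_demo="skipped"
fi
echo "screen demo: $screen_demo"
```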
R in conjunction with nohup
(or screen
) can be used to orchestrate a large simulation study.
It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.
We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.
Python in many ways makes a better glue.
Suppose we have a script runSim.R
, which runs a simulation based on command line argument n
, and a collection of n
values that we want to use in our simulation study.
Option 1: manually call runSim.R
for each setting.
Option 2 (smarter): automate calls using R and nohup
.
Let’s demonstrate using the script autoSim.R
cat autoSim.R
# autoSim.R
nVals <- seq(100, 1000, by=100)
for (n in nVals) {
  oFile <- paste("n", n, ".txt", sep="")
  sysCall <- paste("nohup Rscript runSim.R n=", n, " > ", oFile, sep="")
  system(sysCall, wait = FALSE)
  print(paste("sysCall=", sysCall, sep=""))
}
Note that when we call a shell command using the system
function in R, we set the optional argument wait=FALSE
so that the jobs can run in parallel.
Rscript autoSim.R
[1] "sysCall=nohup Rscript runSim.R n=100 > n100.txt"
[1] "sysCall=nohup Rscript runSim.R n=200 > n200.txt"
[1] "sysCall=nohup Rscript runSim.R n=300 > n300.txt"
[1] "sysCall=nohup Rscript runSim.R n=400 > n400.txt"
[1] "sysCall=nohup Rscript runSim.R n=500 > n500.txt"
[1] "sysCall=nohup Rscript runSim.R n=600 > n600.txt"
[1] "sysCall=nohup Rscript runSim.R n=700 > n700.txt"
[1] "sysCall=nohup Rscript runSim.R n=800 > n800.txt"
[1] "sysCall=nohup Rscript runSim.R n=900 > n900.txt"
[1] "sysCall=nohup Rscript runSim.R n=1000 > n1000.txt"
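The same orchestration can also be written directly as a shell loop. The sketch below is one possible version, not the script used in class: the function run_sim is a hypothetical stand-in for Rscript runSim.R so that the example runs without R, and the output files go to a scratch folder.

```shell
out_dir="$(mktemp -d)"               # scratch folder for the output files

# Hypothetical stand-in for: Rscript runSim.R n=$1
run_sim() {
  sleep 1
  echo "simulated result for n=$1"
}

for n in $(seq 100 100 1000); do
  # Each iteration starts one background job with its own output file.
  run_sim "$n" > "$out_dir/n$n.txt" 2>&1 &
done

wait                                 # block until all background jobs finish
ls "$out_dir"
```

With the real script, the loop body would be nohup Rscript runSim.R n=$n > n$n.txt & and we would drop the wait so that we can log out while the jobs run.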
Now we just need to write a script to collect results from the output files.
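A collection script could look like the following sketch. It assumes each output file holds a single number printed by R in the form [1] value. The two input files are made-up placeholders created in a scratch folder, not real simulation results, and the name summary.csv is hypothetical.

```shell
res_dir="$(mktemp -d)"

# Made-up placeholder files mimicking R's printed output; not real results.
printf '[1] 0.25\n' > "$res_dir/n100.txt"
printf '[1] -0.10\n' > "$res_dir/n200.txt"

summary="$res_dir/summary.csv"
echo "n,estimate" > "$summary"
for f in "$res_dir"/n*.txt; do
  n="$(basename "$f" .txt)"     # e.g. n100
  n="${n#n}"                    # strip the leading "n" -> 100
  # R prints a scalar as "[1] value"; keep the second field of the last such line.
  est="$(awk '$1 == "[1]" { v = $2 } END { print v }' "$f")"
  echo "$n,$est" >> "$summary"
done
cat "$summary"
```

Note that the shell expands n*.txt in alphabetical order, so with the real files n1000.txt would come before n200.txt; sort the table numerically afterwards if the order matters.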
Later on we learn how to coordinate large scale computation on the UCLA Hoffman2 cluster, using Linux and R scripting.