Why Linux

Linux is the most common platform for scientific computing and deployment of data science tools.

Distributions of Linux


Linux shells

Shells

  • A shell translates commands to OS instructions.

  • Commonly used shells include bash, csh, tcsh, and zsh.

  • The default shell in macOS changed from bash to zsh starting with macOS 10.15.

  • Sometimes a command or a script does not run simply because it was written for another shell.

  • We mostly use bash shell commands in this class.

  • Determine your login shell (stored in the SHELL environment variable; it does not change when you start another shell from it):

    echo $SHELL
    /bin/bash
  • List available shells:

    cat /etc/shells
    /bin/sh
    /bin/bash
    /usr/bin/sh
    /usr/bin/bash
  • Change to another shell:

    exec bash -l

    The -l option indicates it should be a login shell.

  • Change your login shell permanently:

    chsh -s /bin/bash userid

    Then log out and log in.

Command history and bash completion

We can navigate to previous/next commands with the up and down arrow keys, and list earlier commands with the history command. (pushd and popd maintain a stack of directories, not of commands.)

Bash provides the following standard completions by default. They save typing time and reduce typing errors!

  • Pathname completion.

  • Filename completion.

  • Variablename completion: echo $[TAB][TAB].

  • Username completion: cd ~[TAB][TAB].

  • Hostname completion: ssh hwachou@[TAB][TAB].

  • Completion can also be customized to auto-complete other things, such as a command’s options and arguments. Google bash completion for more information.

man is man’s best friend

Online help for shell commands: man commandname.

# display documentation for the ls command
man ls
LS(1)                            User Commands                           LS(1)



NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci‐
       fied.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..

       --author
              with -l, print the author of each file

       -b, --escape
              print C-style escapes for nongraphic characters

       --block-size=SIZE
              scale sizes by SIZE before printing them; e.g., '--block-size=M'
              prints sizes in units of 1,048,576 bytes; see SIZE format below

       -B, --ignore-backups
              do not list implied entries ending with ~

       -c     with -lt: sort by, and show, ctime (time of last modification of
              file status information); with -l: show ctime and sort by  name;
              otherwise: sort by ctime, newest first

       -C     list entries by columns

       --color[=WHEN]
              colorize  the  output;  WHEN can be 'never', 'auto', or 'always'
              (the default); more info below

       -d, --directory
              list directories themselves, not their contents

       -D, --dired
              generate output designed for Emacs' dired mode

       -f     do not sort, enable -aU, disable -ls --color

       -F, --classify
              append indicator (one of */=>@|) to entries

       --file-type
              likewise, except do not append '*'

       --format=WORD
              across -x, commas -m, horizontal -x, long -l, single-column  -1,
              verbose -l, vertical -C

       --full-time
              like -l --time-style=full-iso

       -g     like -l, but do not list owner

       --group-directories-first
              group directories before files;

              can   be  augmented  with  a  --sort  option,  but  any  use  of
              --sort=none (-U) disables grouping

       -G, --no-group
              in a long listing, don't print group names

       -h, --human-readable
              with -l, print sizes in human readable format (e.g., 1K 234M 2G)

       --si   likewise, but use powers of 1000 not 1024

       -H, --dereference-command-line
              follow symbolic links listed on the command line

       --dereference-command-line-symlink-to-dir
              follow each command line symbolic link

              that points to a directory

       --hide=PATTERN
              do not list implied entries matching shell  PATTERN  (overridden
              by -a or -A)

       --indicator-style=WORD
              append indicator with style WORD to entry names: none (default),
              slash (-p), file-type (--file-type), classify (-F)

       -i, --inode
              print the index number of each file

       -I, --ignore=PATTERN
              do not list implied entries matching shell PATTERN

       -k, --kibibytes
              default to 1024-byte blocks for disk usage

       -l     use a long listing format

       -L, --dereference
              when showing file information for a symbolic link, show informa‐
              tion  for  the file the link references rather than for the link
              itself

       -m     fill width with a comma separated list of entries

       -n, --numeric-uid-gid
              like -l, but list numeric user and group IDs

       -N, --literal
              print raw entry names (don't treat e.g. control characters  spe‐
              cially)

       -o     like -l, but do not list group information

       -p, --indicator-style=slash
              append / indicator to directories

       -q, --hide-control-chars
              print ? instead of nongraphic characters

       --show-control-chars
              show nongraphic characters as-is (the default, unless program is
              'ls' and output is a terminal)

       -Q, --quote-name
              enclose entry names in double quotes

       --quoting-style=WORD
              use quoting style WORD for entry names: literal, locale,  shell,
              shell-always, c, escape

       -r, --reverse
              reverse order while sorting

       -R, --recursive
              list subdirectories recursively

       -s, --size
              print the allocated size of each file, in blocks

       -S     sort by file size

       --sort=WORD
              sort  by  WORD instead of name: none (-U), size (-S), time (-t),
              version (-v), extension (-X)

       --time=WORD
              with -l, show time as WORD instead of default modification time:
              atime or access or use (-u) ctime or status (-c); also use spec‐
              ified time as sort key if --sort=time

       --time-style=STYLE
              with -l, show times using style STYLE: full-iso, long-iso,  iso,
              locale,  or  +FORMAT;  FORMAT  is interpreted like in 'date'; if
              FORMAT  is  FORMAT1<newline>FORMAT2,  then  FORMAT1  applies  to
              non-recent  files  and FORMAT2 to recent files; if STYLE is pre‐
              fixed with 'posix-', STYLE takes effect only outside  the  POSIX
              locale

       -t     sort by modification time, newest first

       -T, --tabsize=COLS
              assume tab stops at each COLS instead of 8

       -u     with  -lt:  sort by, and show, access time; with -l: show access
              time and sort by name; otherwise: sort by access time

       -U     do not sort; list entries in directory order

       -v     natural sort of (version) numbers within text

       -w, --width=COLS
              assume screen width instead of current value

       -x     list entries by lines instead of by columns

       -X     sort alphabetically by entry extension

       -1     list one file per line

       SELinux options:

       --lcontext
              Display security context.   Enable -l. Lines  will  probably  be
              too wide for most displays.

       -Z, --context
              Display  security context so it fits on most displays.  Displays
              only mode, user, group, security context and file name.

       --scontext
              Display only security context and file name.

       --help display this help and exit

       --version
              output version information and exit

       SIZE is an integer and optional unit (example:  10M  is  10*1024*1024).
       Units  are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (pow‐
       ers of 1000).

       Using color to distinguish file types is disabled both by  default  and
       with  --color=never.  With --color=auto, ls emits color codes only when
       standard output is connected to a terminal.  The LS_COLORS  environment
       variable can change the settings.  Use the dircolors command to set it.

   Exit status:
       0      if OK,

       1      if minor problems (e.g., cannot access subdirectory),

       2      if serious trouble (e.g., cannot access command-line argument).

       GNU  coreutils  online  help:  <http://www.gnu.org/software/coreutils/>
       Report ls translation bugs to <http://translationproject.org/team/>

AUTHOR
       Written by Richard M. Stallman and David MacKenzie.

COPYRIGHT
       Copyright © 2013 Free Software Foundation, Inc.   License  GPLv3+:  GNU
       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This  is  free  software:  you  are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for ls is maintained as a  Texinfo  manual.   If
       the  info and ls programs are properly installed at your site, the com‐
       mand

              info coreutils 'ls invocation'

       should give you access to the complete manual.



GNU coreutils 8.22                August 2019                            LS(1)

Work with text files

View/peek text files

  • cat prints the contents of a file:

    cat runSim.R
    ## parsing command arguments
    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }
    
    ## check if a given integer is prime
    isPrime = function(n) {
      if (n <= 3) {
        return (TRUE)
      }
      if (any((n %% 2:floor(sqrt(n))) == 0)) {
        return (FALSE)
      }
      return (TRUE)
    }
    
    ## estimate mean only using observation with prime indices
    estMeanPrimes = function (x) {
      n = length(x)
      ind = sapply(1:n, isPrime)
      return (mean(x[ind]))
    }
    
    # simulate data
    x = rnorm(n)
    
    # estimate mean
    estMeanPrimes(x)

  • head prints the first 10 lines of a file:

    head runSim.R
    ## parsing command arguments
    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }
    
    ## check if a given integer is prime
    isPrime = function(n) {
      if (n <= 3) {
        return (TRUE)
      }

    head -l prints the first \(l\) lines of a file:

    head -15 runSim.R
    ## parsing command arguments
    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }
    
    ## check if a given integer is prime
    isPrime = function(n) {
      if (n <= 3) {
        return (TRUE)
      }
      if (any((n %% 2:floor(sqrt(n))) == 0)) {
        return (FALSE)
      }
      return (TRUE)
    }
  • tail prints the last 10 lines of a file:

    tail runSim.R
      n = length(x)
      ind = sapply(1:n, isPrime)
      return (mean(x[ind]))
    }
    
    # simulate data
    x = rnorm(n)
    
    # estimate mean
    estMeanPrimes(x)

    tail -l prints the last \(l\) lines of a file:

    tail -15 runSim.R
      return (TRUE)
    }
    
    ## estimate mean only using observation with prime indices
    estMeanPrimes = function (x) {
      n = length(x)
      ind = sapply(1:n, isPrime)
      return (mean(x[ind]))
    }
    
    # simulate data
    x = rnorm(n)
    
    # estimate mean
    estMeanPrimes(x)

  • Questions:
    • How to see the 11th line of the file and nothing else?
    • What about the 11th to the last line?

Piping and redirection

  • | sends output from one command as input of another command.

  • > directs output from one command to a file.

  • >> appends output from one command to a file.

  • < reads input from a file.

  • Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.

  • See HW1.
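The four operators above can be tried out safely on a throwaway file. The following is a minimal sketch; the scratch directory, the file name fruits.txt, and the variable names are made up for illustration.

```shell
# Work in a scratch directory so no existing files are touched
workdir=$(mktemp -d)
cd "$workdir"

# > creates (or overwrites) a file with the command's output
printf 'banana\napple\n' > fruits.txt
# >> appends to the end of the same file
printf 'cherry\n' >> fruits.txt

# < feeds the file to the command's standard input
line_count=$(wc -l < fruits.txt)

# | chains commands: sort the lines, then keep the first one
first_fruit=$(sort fruits.txt | head -n 1)

echo "lines: $line_count, first after sorting: $first_fruit"
```

Note that > silently overwrites an existing file, so >> is the safer choice when accumulating output across several commands.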

less is more; more is less

  • more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.

  • less is also a pager, but has more functionality, e.g., scrolling both upwards and downwards through the input.

  • less doesn’t need to read the whole file before displaying it, so it opens large files faster than more.

grep

grep prints lines that match an expression:

  • Show lines that contain string CentOS:

    # quotes not necessary if not a regular expression
    grep 'CentOS' linux.Rmd
    - RHEL/CentOS is popular on servers.
    - The teaching server for this class runs CentOS 7.
    - Show lines that contain string `CentOS`:
        grep 'CentOS' linux.Rmd
        grep 'CentOS' *.Rmd
        grep -n 'CentOS' linux.Rmd
    - Replace `CentOS` by `RHEL` in a text file:
        sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
  • Search multiple text files:

    grep 'CentOS' *.Rmd
    - RHEL/CentOS is popular on servers.
    - The teaching server for this class runs CentOS 7.
    - Show lines that contain string `CentOS`:
        grep 'CentOS' linux.Rmd
        grep 'CentOS' *.Rmd
        grep -n 'CentOS' linux.Rmd
    - Replace `CentOS` by `RHEL` in a text file:
        sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
  • Show matching line numbers:

    grep -n 'CentOS' linux.Rmd
    34:- RHEL/CentOS is popular on servers.
    36:- The teaching server for this class runs CentOS 7.
    323:- Show lines that contain string `CentOS`:
    326:    grep 'CentOS' linux.Rmd
    331:    grep 'CentOS' *.Rmd
    336:    grep -n 'CentOS' linux.Rmd
    353:- Replace `CentOS` by `RHEL` in a text file:
    355:    sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
  • Find all files in current directory with .png extension:

    ls | grep '\.png$'
    key_authentication_1.png
    key_authentication_2.png
    linux_directory_structure.png
    linux_filepermission_oct.png
    linux_filepermission.png
    Richard_Stallman_2013.png
    screenshot_top.png
  • Find all directories in the current directory:

    ls -al | grep '^d'
    drwxrwxr-x. 2 huazhou huazhou    4096 Jan 15 01:01 .
    drwxrwxr-x. 7 huazhou huazhou     117 Jan 14 17:01 ..
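In a regular expression an unescaped . matches any single character, so a pattern meant to match a literal .png extension should escape the dot. The sketch below illustrates the difference on three hypothetical file names (they are not files from the class repository).

```shell
# Hypothetical file names, one per line
names='plot.png
notes_png
apng'

# Unescaped dot: any one character followed by "png" at the end of the line
loose=$(printf '%s\n' "$names" | grep '.png$')

# Escaped dot: a literal "." followed by "png" at the end of the line
strict=$(printf '%s\n' "$names" | grep '\.png$')

echo "unescaped dot matches:"
printf '%s\n' "$loose"
echo "escaped dot matches:"
printf '%s\n' "$strict"
```

The unescaped pattern accepts all three names, while the escaped pattern accepts only plot.png.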

sed

  • sed is a stream editor.

  • Replace CentOS by RHEL in a text file:

    sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
    - RHEL/RHEL is popular on servers.
    - The teaching server for this class runs RHEL 7.
    - Show lines that contain string `RHEL`:
        grep 'RHEL' linux.Rmd
        grep 'RHEL' *.Rmd
        grep -n 'RHEL' linux.Rmd
    - Replace `RHEL` by `RHEL` in a text file:
        sed 's/RHEL/RHEL/' linux.Rmd | grep RHEL
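Without a flag, the s command replaces only the first match on each line; appending g replaces every match. A minimal sketch on a made-up input line:

```shell
# A made-up line containing the search string twice
line='CentOS 7 and CentOS 8'

# No flag: only the first match on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')

# g flag: every match on the line is replaced
all_matches=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')

echo "$first_only"
echo "$all_matches"
```

Like the command above, this only prints the edited text; the input itself is left unchanged.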

awk

  • awk is a filter and report writer.

  • First let’s display first lines of the file /etc/passwd:

    head /etc/passwd
    root:x:0:0:root:/root:/bin/bash
    bin:x:1:1:bin:/bin:/sbin/nologin
    daemon:x:2:2:daemon:/sbin:/sbin/nologin
    adm:x:3:4:adm:/var/adm:/sbin/nologin
    lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
    sync:x:5:0:sync:/sbin:/bin/sync
    shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
    halt:x:7:0:halt:/sbin:/sbin/halt
    mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
    operator:x:11:0:operator:/root:/sbin/nologin

    Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user ID info, (6) home directory, and (7) command shell, separated by :.

  • Print sorted list of login names:

    awk -F: '{ print $1 }' /etc/passwd | sort | head -10
    203bdemo
    adm
    amisheth26
    andyliugraduateschool
    bin
    brendonchau
    brett.young
    bursontung97
    char.flournoy
    chenyu1997
  • Print the number of lines in a file (NR is awk's count of records read so far, so in the END block it equals the number of lines):

    awk 'END { print NR }' /etc/passwd
    69

    or

    wc -l /etc/passwd
    69 /etc/passwd

    or (not displaying file name)

    wc -l < /etc/passwd
    69
  • Print the entries of users with UID in the range 1000-1035:

    awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
    huazhou:x:1000:1001::/home/huazhou:/bin/bash
    juhkim111:x:1001:1003::/home/juhkim111:/bin/bash
    raguilar2:x:1002:1004::/home/raguilar2:/bin/bash
    elalb:x:1003:1005::/home/elalb:/bin/bash
    sdalmia:x:1004:1006::/home/sdalmia:/bin/bash
    gdewey:x:1005:1007::/home/gdewey:/bin/bash
    mfdong12:x:1006:1008::/home/mfdong12:/bin/bash
    farboodi:x:1007:1009::/home/farboodi:/bin/bash
    mfaulis17:x:1008:1010::/home/mfaulis17:/bin/bash
    char.flournoy:x:1009:1011::/home/char.flournoy:/bin/bash
    willgertsch:x:1010:1012::/home/willgertsch:/bin/bash
    ynguo94:x:1011:1013::/home/ynguo94:/bin/bash
    ghancock:x:1012:1014::/home/ghancock:/bin/bash
    krh005:x:1013:1015::/home/krh005:/bin/bash
    yilanh19:x:1014:1016::/home/yilanh19:/bin/bash
    nslly19:x:1015:1017::/home/nslly19:/bin/bash
    tonylim:x:1016:1018::/home/tonylim:/bin/bash
    andyliugraduateschool:x:1017:1019::/home/andyliugraduateschool:/bin/bash
    lnliuxue:x:1018:1020::/home/lnliuxue:/bin/bash
    y9lyu:x:1019:1021::/home/y9lyu:/bin/bash
    lillynhan:x:1020:1022::/home/lillynhan:/bin/bash
    joodeh:x:1021:1023::/home/joodeh:/bin/bash
    wenlanpan:x:1022:1024::/home/wenlanpan:/bin/bash
    stpraser18:x:1023:1025::/home/stpraser18:/bin/bash
    qiqi0610:x:1024:1026::/home/qiqi0610:/bin/bash
    Johnrandazzo1996:x:1025:1027::/home/Johnrandazzo1996:/bin/bash
    amisheth26:x:1026:1028::/home/amisheth26:/bin/bash
    ranjana.n.w:x:1027:1029::/home/ranjana.n.w:/bin/bash
    naomixu:x:1028:1030::/home/naomixu:/bin/bash
    xurui1996:x:1029:1031::/home/xurui1996:/bin/bash
    brett.young:x:1030:1032::/home/brett.young:/bin/bash
    hanyanyuan:x:1031:1033::/home/hanyanyuan:/bin/bash
    203bdemo:x:1032:1034::/home/203bdemo:/bin/bash
    dalekim25:x:1033:1035::/home/dalekim25:/bin/bash
    chenyu1997:x:1034:1036::/home/chenyu1997:/bin/bash
    hk_lian:x:1035:1037::/home/hk_lian:/bin/bash
  • Print login names and login shells in comma-separated format:

    awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
    root,/bin/bash
    bin,/sbin/nologin
    daemon,/sbin/nologin
    adm,/sbin/nologin
    lp,/sbin/nologin
    sync,/bin/sync
    shutdown,/sbin/shutdown
    halt,/sbin/halt
    mail,/sbin/nologin
    operator,/sbin/nologin
    games,/sbin/nologin
    ftp,/sbin/nologin
    nobody,/sbin/nologin
    systemd-network,/sbin/nologin
    dbus,/sbin/nologin
    polkitd,/sbin/nologin
    ntp,/sbin/nologin
    sshd,/sbin/nologin
    postfix,/sbin/nologin
    chrony,/sbin/nologin
    huazhou,/bin/bash
    tss,/sbin/nologin
    rstudio-server,/bin/bash
    shiny,/bin/sh
    mongod,/bin/false
    saslauth,/sbin/nologin
    juhkim111,/bin/bash
    raguilar2,/bin/bash
    elalb,/bin/bash
    sdalmia,/bin/bash
    gdewey,/bin/bash
    mfdong12,/bin/bash
    farboodi,/bin/bash
    mfaulis17,/bin/bash
    char.flournoy,/bin/bash
    willgertsch,/bin/bash
    ynguo94,/bin/bash
    ghancock,/bin/bash
    krh005,/bin/bash
    yilanh19,/bin/bash
    nslly19,/bin/bash
    tonylim,/bin/bash
    andyliugraduateschool,/bin/bash
    lnliuxue,/bin/bash
    y9lyu,/bin/bash
    lillynhan,/bin/bash
    joodeh,/bin/bash
    wenlanpan,/bin/bash
    stpraser18,/bin/bash
    qiqi0610,/bin/bash
    Johnrandazzo1996,/bin/bash
    amisheth26,/bin/bash
    ranjana.n.w,/bin/bash
    naomixu,/bin/bash
    xurui1996,/bin/bash
    brett.young,/bin/bash
    hanyanyuan,/bin/bash
    203bdemo,/bin/bash
    dalekim25,/bin/bash
    chenyu1997,/bin/bash
    hk_lian,/bin/bash
    srishtimajumdar,/bin/bash
    bursontung97,/bin/bash
    jaketompkins97,/bin/bash
    fredericy19,/bin/bash
    brendonchau,/bin/bash
    jshamsho,/bin/bash
    tagibson,/bin/bash
    postgres,/bin/bash
  • Print login names and label those with UID >= 1000 as vip:

    awk -F: -v status="" '{OFS = ","} 
    {if ($3 >= 1000) status="vip"; else status="regular"} 
    {print $1, status}' /etc/passwd
    root,regular
    bin,regular
    daemon,regular
    adm,regular
    lp,regular
    sync,regular
    shutdown,regular
    halt,regular
    mail,regular
    operator,regular
    games,regular
    ftp,regular
    nobody,regular
    systemd-network,regular
    dbus,regular
    polkitd,regular
    ntp,regular
    sshd,regular
    postfix,regular
    chrony,regular
    huazhou,vip
    tss,regular
    rstudio-server,regular
    shiny,regular
    mongod,regular
    saslauth,regular
    juhkim111,vip
    raguilar2,vip
    elalb,vip
    sdalmia,vip
    gdewey,vip
    mfdong12,vip
    farboodi,vip
    mfaulis17,vip
    char.flournoy,vip
    willgertsch,vip
    ynguo94,vip
    ghancock,vip
    krh005,vip
    yilanh19,vip
    nslly19,vip
    tonylim,vip
    andyliugraduateschool,vip
    lnliuxue,vip
    y9lyu,vip
    lillynhan,vip
    joodeh,vip
    wenlanpan,vip
    stpraser18,vip
    qiqi0610,vip
    Johnrandazzo1996,vip
    amisheth26,vip
    ranjana.n.w,vip
    naomixu,vip
    xurui1996,vip
    brett.young,vip
    hanyanyuan,vip
    203bdemo,vip
    dalekim25,vip
    chenyu1997,vip
    hk_lian,vip
    srishtimajumdar,vip
    bursontung97,vip
    jaketompkins97,vip
    fredericy19,vip
    brendonchau,vip
    jshamsho,vip
    tagibson,vip
    postgres,regular
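The awk examples above can also be written with BEGIN and END blocks, which run once before the first record and once after the last record, respectively. The following is a sketch of that alternative style, not a replacement for the commands above; it reads three records copied from the /etc/passwd listing into a shell variable (sample, labels, and records are illustrative names) so that it runs the same way on any machine.

```shell
# Three records in /etc/passwd format, copied from the listing above
sample='root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
huazhou:x:1000:1001::/home/huazhou:/bin/bash'

# BEGIN runs once before any input is read, so OFS is set a single time;
# the conditional expression picks the label for each record
labels=$(printf '%s\n' "$sample" |
  awk -F: 'BEGIN { OFS = "," } { print $1, ($3 >= 1000 ? "vip" : "regular") }')
printf '%s\n' "$labels"

# END runs once after the last record, when NR holds the total record count
records=$(printf '%s\n' "$sample" | awk 'END { print NR }')
echo "records: $records"
```

The parentheses around the conditional expression matter: inside a print statement, an unparenthesized > would be read as output redirection.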

Text editors

Source: Editor War on Wikipedia.

Emacs

  • Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it’s not installed by default on many Linux distributions.

  • Basic survival commands:
    • emacs filename to open a file with emacs.
    • CTRL-x CTRL-f to open an existing or new file.
    • CTRL-x CTRL-s to save.
    • CTRL-x CTRL-w to save as.
    • CTRL-x CTRL-c to quit.
  • Google emacs cheatsheet

C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.

Vi

  • Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.

  • Basic survival commands:
    • vi filename to start editing a file.
    • vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
    • :x<Return> quits vi and saves changes.
    • :q!<Return> quits vi without saving latest changes.
    • :w<Return> saves changes.
    • :wq<Return> quits vi and saves changes.
  • Google vi cheatsheet

IDE (Integrated Development Environment)

  • Statisticians write a lot of code. It is critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within the editor, debugging, profiling, version control, etc.

  • Examples: RStudio, Eclipse, Emacs, Matlab, Visual Studio, etc.

Processes

Cancel a non-responding program

  • Press Ctrl+C to cancel a non-responding or long-running program.

Processes

  • The OS runs processes on behalf of users.

  • Each process has a process ID (PID), an owner (UID), a parent process ID (PPID), the time or date the process started (STIME), the cumulative CPU time it has used (TIME), etc.

    ps
      PID TTY          TIME CMD
    19455 ?        00:00:01 rsession
    19529 ?        00:00:00 R
    19644 ?        00:00:00 sh
    19645 ?        00:00:00 ps
  • All currently running processes:

    ps -eaf
    UID        PID  PPID  C STIME TTY          TIME CMD
    root         1     0  0 Jan09 ?        00:00:24 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
    root         2     0  0 Jan09 ?        00:00:00 [kthreadd]
    root         4     2  0 Jan09 ?        00:00:00 [kworker/0:0H]
    root         6     2  0 Jan09 ?        00:00:00 [ksoftirqd/0]
    root         7     2  0 Jan09 ?        00:00:00 [migration/0]
    root         8     2  0 Jan09 ?        00:00:00 [rcu_bh]
    root         9     2  0 Jan09 ?        00:01:02 [rcu_sched]
    root        10     2  0 Jan09 ?        00:00:00 [lru-add-drain]
    root        11     2  0 Jan09 ?        00:00:02 [watchdog/0]
    root        12     2  0 Jan09 ?        00:00:01 [watchdog/1]
    root        13     2  0 Jan09 ?        00:00:00 [migration/1]
    root        14     2  0 Jan09 ?        00:00:00 [ksoftirqd/1]
    root        16     2  0 Jan09 ?        00:00:00 [kworker/1:0H]
    root        17     2  0 Jan09 ?        00:00:01 [watchdog/2]
    root        18     2  0 Jan09 ?        00:00:00 [migration/2]
    root        19     2  0 Jan09 ?        00:00:01 [ksoftirqd/2]
    root        21     2  0 Jan09 ?        00:00:00 [kworker/2:0H]
    root        22     2  0 Jan09 ?        00:00:01 [watchdog/3]
    root        23     2  0 Jan09 ?        00:00:00 [migration/3]
    root        24     2  0 Jan09 ?        00:00:01 [ksoftirqd/3]
    root        26     2  0 Jan09 ?        00:00:00 [kworker/3:0H]
    root        28     2  0 Jan09 ?        00:00:00 [kdevtmpfs]
    root        29     2  0 Jan09 ?        00:00:00 [netns]
    root        30     2  0 Jan09 ?        00:00:00 [khungtaskd]
    root        31     2  0 Jan09 ?        00:00:00 [writeback]
    root        32     2  0 Jan09 ?        00:00:00 [kintegrityd]
    root        33     2  0 Jan09 ?        00:00:00 [bioset]
    root        34     2  0 Jan09 ?        00:00:00 [bioset]
    root        35     2  0 Jan09 ?        00:00:00 [bioset]
    root        36     2  0 Jan09 ?        00:00:00 [kblockd]
    root        37     2  0 Jan09 ?        00:00:00 [md]
    root        38     2  0 Jan09 ?        00:00:00 [edac-poller]
    root        39     2  0 Jan09 ?        00:00:00 [watchdogd]
    root        46     2  0 Jan09 ?        00:00:00 [kswapd0]
    root        47     2  0 Jan09 ?        00:00:00 [ksmd]
    root        48     2  0 Jan09 ?        00:00:02 [khugepaged]
    root        49     2  0 Jan09 ?        00:00:00 [crypto]
    root        57     2  0 Jan09 ?        00:00:00 [kthrotld]
    root        59     2  0 Jan09 ?        00:00:00 [kmpath_rdacd]
    root        60     2  0 Jan09 ?        00:00:00 [kaluad]
    root        61     2  0 Jan09 ?        00:00:00 [kpsmoused]
    root        63     2  0 Jan09 ?        00:00:00 [ipv6_addrconf]
    root        76     2  0 Jan09 ?        00:00:00 [deferwq]
    root       115     2  0 Jan09 ?        00:00:48 [kauditd]
    root       400     2  0 Jan09 ?        00:00:00 [virtscsi-scan]
    root       401     2  0 Jan09 ?        00:00:00 [scsi_eh_0]
    root       402     2  0 Jan09 ?        00:00:00 [scsi_tmf_0]
    root       427     2  0 Jan09 ?        00:00:00 [bioset]
    root       428     2  0 Jan09 ?        00:00:00 [xfsalloc]
    root       429     2  0 Jan09 ?        00:00:00 [xfs_mru_cache]
    root       430     2  0 Jan09 ?        00:00:00 [xfs-buf/sda1]
    root       431     2  0 Jan09 ?        00:00:00 [xfs-data/sda1]
    root       432     2  0 Jan09 ?        00:00:00 [xfs-conv/sda1]
    root       433     2  0 Jan09 ?        00:00:00 [xfs-cil/sda1]
    root       434     2  0 Jan09 ?        00:00:00 [xfs-reclaim/sda]
    root       435     2  0 Jan09 ?        00:00:00 [xfs-log/sda1]
    root       436     2  0 Jan09 ?        00:00:00 [xfs-eofblocks/s]
    root       437     2  0 Jan09 ?        00:02:48 [xfsaild/sda1]
    root       438     2  0 Jan09 ?        00:00:04 [kworker/0:1H]
    root       487     2  0 Jan09 ?        00:00:00 [kworker/2:1H]
    root       502     1  0 Jan09 ?        00:01:07 /usr/lib/systemd/systemd-journald
    root       540     1  0 Jan09 ?        00:00:00 /usr/lib/systemd/systemd-udevd
    root       559     1  0 Jan09 ?        00:01:58 /sbin/auditd
    root       646     2  0 Jan09 ?        00:00:00 [hwrng]
    root       673     2  0 Jan09 ?        00:00:00 [nfit]
    root       873     1  0 Jan09 ?        00:00:01 /opt/shiny-server/ext/node/bin/shiny-server /opt/shiny-server/lib/main.js
    dbus       875     1  0 Jan09 ?        00:00:05 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
    root       880     1  0 Jan09 ?        00:00:04 /usr/lib/systemd/systemd-logind
    polkitd    882     1  0 Jan09 ?        00:00:01 /usr/lib/polkit-1/polkitd --no-debug
    root       888     1  0 Jan09 ?        00:00:00 /usr/sbin/acpid
    chrony     918     1  0 Jan09 ?        00:00:00 /usr/sbin/chronyd
    root       919     1  0 Jan09 ?        00:00:01 /usr/sbin/crond -n
    root       926     1  0 Jan09 tty1     00:00:00 /sbin/agetty --noclear tty1 linux
    root       928     1  0 Jan09 ttyS0    00:00:00 /sbin/agetty --keep-baud 115200,38400,9600 ttyS0 vt220
    root       944     1  0 Jan09 ?        00:00:01 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
    root       973     1  0 Jan09 ?        00:00:10 /usr/sbin/NetworkManager --no-daemon
    root      1125   973  0 Jan09 ?        00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-18fa3a36-80a2-442f-bc21-276a1bb368be-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0
    root      1365     1  0 Jan09 ?        00:00:51 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
    root      1366     1  0 Jan09 ?        00:00:00 /usr/sbin/cupsd -f
    root      1369     1  0 Jan09 ?        00:01:00 /usr/bin/google_osconfig_agent
    root      1370     1  0 Jan09 ?        00:00:55 /usr/sbin/rsyslogd -n
    rstudio+  1441     1  0 Jan09 ?        00:02:12 /usr/lib/rstudio-server/bin/rserver
    root      1628     1  0 Jan09 ?        00:00:24 /usr/bin/python /usr/bin/google_network_daemon
    root      1631     1  0 Jan09 ?        00:00:12 /usr/sbin/sshd -D
    root      1633     1  0 Jan09 ?        00:00:46 /usr/bin/python /usr/bin/google_accounts_daemon
    root      1635     1  0 Jan09 ?        00:00:10 /usr/bin/python /usr/bin/google_clock_skew_daemon
    root      1717     1  0 Jan09 ?        00:00:02 /usr/libexec/postfix/master -w
    postfix   1719  1717  0 Jan09 ?        00:00:00 qmgr -l -t unix -u
    mongod    1721     1  0 Jan09 ?        00:24:13 /usr/bin/mongod -f /etc/mongod.conf
    root      1838     2  0 Jan09 ?        00:00:00 [kworker/3:1H]
    root      1839     2  0 Jan09 ?        00:00:00 [kworker/1:1H]
    root      4474  1631  0 02:07 ?        00:00:00 sshd: raguilar2 [priv]
    raguila+  4478  4474  0 02:07 ?        00:00:00 sshd: raguilar2@pts/4
    raguila+  4479  4478  0 02:07 pts/4    00:00:00 -bash
    root      7502     2  0 02:49 ?        00:00:00 [kworker/2:1]
    root      7504     2  0 02:49 ?        00:00:00 [kworker/0:0]
    root      8240     1  0 03:01 ?        00:00:00 /usr/sbin/anacron -s
    root     11033  8240  0 03:45 ?        00:00:00 /bin/bash /bin/run-parts /etc/cron.daily
    root     11037 11033  0 03:45 ?        00:00:00 /usr/bin/python -tt /usr/sbin/yum-cron
    root     11038 11033  0 03:45 ?        00:00:00 awk -v progname=/etc/cron.daily/0yum-daily.cron progname { ????   print progname ":\n" ????   progname=""; ???       } ???       { print; }
    root     11046     2  0 03:45 ?        00:00:00 [kworker/u8:0]
    root     13888  1631  0 04:29 ?        00:00:00 sshd: lnliuxue [priv]
    lnliuxue 13892 13888  0 04:29 ?        00:00:00 sshd: lnliuxue@pts/1
    lnliuxue 13893 13892  0 04:29 pts/1    00:00:00 -bash
    postfix  15098  1717  0 04:45 ?        00:00:00 pickup -l -t unix -u
    root     15491  1631  0 04:51 ?        00:00:00 sshd: qiqi0610 [priv]
    qiqi0610 15495 15491  0 04:51 ?        00:00:00 sshd: qiqi0610@pts/2
    qiqi0610 15496 15495  0 04:51 pts/2    00:00:00 -bash
    root     15783     2  0 04:54 ?        00:00:00 [kworker/3:0]
    root     16214     2  0 05:00 ?        00:00:00 [kworker/0:1]
    root     17791     2  0 05:21 ?        00:00:00 [kworker/3:1]
    root     18456     2  0 05:31 ?        00:00:00 [kworker/1:2]
    root     18848     2  0 05:37 ?        00:00:00 [kworker/1:0]
    root     19171     2  0 05:42 ?        00:00:00 [kworker/1:1]
    root     19174     2  0 05:42 ?        00:00:00 [kworker/0:2]
    huazhou  19455  1441  7 05:47 ?        00:00:01 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token BB375147
    huazhou  19529 19455 48 05:47 ?        00:00:00 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/ucla-biostat203b-2020winter.github.io/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
    huazhou  19646 19529  0 05:47 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf' 2>&1
    huazhou  19647 19646  0 05:47 ?        00:00:00 ps -eaf
    root     24201     2  0 Jan14 ?        00:00:01 [kworker/u8:1]
    root     31010     2  0 00:59 ?        00:00:00 [kworker/2:0]
  • All Python processes:

    ps -eaf | grep python
    root       944     1  0 Jan09 ?        00:00:01 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
    root      1365     1  0 Jan09 ?        00:00:51 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
    root      1628     1  0 Jan09 ?        00:00:24 /usr/bin/python /usr/bin/google_network_daemon
    root      1633     1  0 Jan09 ?        00:00:46 /usr/bin/python /usr/bin/google_accounts_daemon
    root      1635     1  0 Jan09 ?        00:00:10 /usr/bin/python /usr/bin/google_clock_skew_daemon
    root     11037 11033  0 03:45 ?        00:00:00 /usr/bin/python -tt /usr/sbin/yum-cron
    huazhou  19648 19529  0 05:47 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf | grep python' 2>&1
    huazhou  19649 19648  0 05:47 ?        00:00:00 bash -c ps -eaf | grep python
    huazhou  19651 19649  0 05:47 ?        00:00:00 grep python
  • Process with PID=1:

    ps -fp 1
    UID        PID  PPID  C STIME TTY          TIME CMD
    root         1     0  0 Jan09 ?        00:00:24 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
  • All processes owned by a user:

    ps -fu huazhou
    UID        PID  PPID  C STIME TTY          TIME CMD
    huazhou  19455  1441  7 05:47 ?        00:00:01 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token BB375147
    huazhou  19529 19455 49 05:47 ?        00:00:00 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/ucla-biostat203b-2020winter.github.io/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
    huazhou  19654 19529  0 05:47 ?        00:00:00 sh -c 'bash'  -c 'ps -fu huazhou' 2>&1
    huazhou  19655 19654  0 05:47 ?        00:00:00 ps -fu huazhou

Kill processes

  • Kill process with PID=1001:

    kill 1001
  • Kill all R processes (the -r option interprets the name as a regular expression):

    killall -r R
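
As a hedged illustration of the kill command above, the following sketch starts a harmless sleep process as a stand-in for a real job, terminates it by PID, and inspects its exit status. The sleep job and the variable names pid and status are assumptions made for the demo, not part of the original notes.

```shell
# Start a harmless background job to stand in for a runaway process.
sleep 300 &
pid=$!    # $! expands to the PID of the most recent background job

# kill sends SIGTERM (signal 15) by default, asking the process to exit.
kill "$pid"

# wait reaps the job and returns its exit status;
# a process ended by a signal reports 128 + the signal number.
status=0
wait "$pid" || status=$?
echo "process $pid exited with status $status"
```

With the default SIGTERM the reported status is 143 (128 + 15). If a process ignores SIGTERM, kill -9 PID sends SIGKILL, which cannot be caught; use it as a last resort because the process gets no chance to clean up.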

top

  • top prints realtime process information (very useful).

    top

  • Exit the top program by pressing the q key.

Secure shell (SSH)

SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure connections over an insecure network.

  • On a Linux or Mac terminal, access the teaching server by

    ssh username@server.ucla-biostat-203b.com
  • For Windows users, there are three options: (1) (highly recommended) Git Bash, (2) (not recommended) the free PuTTY program, or (3) (may be overkill for this class) WSL (Windows Subsystem for Linux), which installs a full-fledged Linux system within Windows.

Use keys over password

  • Key authentication is more secure than password authentication. Most passwords are weak.

  • A script or program may need to systematically SSH into other machines.

  • Log into multiple machines using the same key.

  • Seamless use of many services: Git, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.

  • Many servers only allow key authentication and do not accept password authentication.

Key authentication


  • Public key. Put it on the machine(s) you want to log in to.

  • Private key. Keep it on your own computer. Consider this as the actual key in your pocket; never give it to others.

  • Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.

  • Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
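
The sign-and-verify idea in the last bullet can be tried out locally. The sketch below is only an illustration of public-key signatures, not a reproduction of what SSH does internally; it assumes the openssl command-line tool is installed, and all file names are hypothetical.

```shell
# Work in a throwaway directory so no real keys are touched.
workdir=$(mktemp -d)

# Generate a 2048-bit RSA private key and derive its public key.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
  -out "$workdir/private.pem" 2>/dev/null
openssl pkey -in "$workdir/private.pem" -pubout -out "$workdir/public.pem"

# Sign a message with the private key.
printf 'hello from my laptop\n' > "$workdir/message.txt"
openssl dgst -sha256 -sign "$workdir/private.pem" \
  -out "$workdir/message.sig" "$workdir/message.txt"

# Anyone holding the public key can check the signature.
verify_result=$(openssl dgst -sha256 -verify "$workdir/public.pem" \
  -signature "$workdir/message.sig" "$workdir/message.txt")
echo "$verify_result"

# Changing the message invalidates the signature (openssl exits non-zero).
printf 'tampered\n' >> "$workdir/message.txt"
if openssl dgst -sha256 -verify "$workdir/public.pem" \
     -signature "$workdir/message.sig" "$workdir/message.txt" >/dev/null 2>&1; then
  tamper_check="accepted"
else
  tamper_check="rejected"
fi
echo "tampered message: $tamper_check"
```

The first check prints Verified OK; the second is rejected because the signature no longer matches the modified file. Only the holder of the private key could have produced a signature that verifies, which is the basis of key authentication.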

Steps to generate keys

  • On Linux, Mac, or Windows Git Bash, to generate a key pair:

    ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    • [KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of id_rsa generates a private key file named id_rsa and a public key file named id_rsa.pub.

    • [USERNAME] is the user for whom you will apply this SSH key.

    • Use an (optional) passphrase different from your password.

  • Set correct permissions on the .ssh folder and the private key file:

    chmod 700 ~/.ssh
    chmod 400 ~/.ssh/[KEY_FILENAME]

  • Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,

    ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
  • Test your new key.

    ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
  • Now you don’t need a password each time you connect from your machine to the teaching server.

  • If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.

  • The same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.
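
To see what the key-generation steps produce without touching your real ~/.ssh folder, here is a minimal sketch that writes a demo key pair into a temporary directory. The names keydir, demo_key, and demo_user are placeholders chosen for illustration, and the empty passphrase is used only so the command runs without prompting.

```shell
# Generate a demo key pair in a throwaway directory (not in ~/.ssh).
keydir=$(mktemp -d)
keyfile="$keydir/demo_key"      # hypothetical [KEY_FILENAME]

# -N "" sets an empty passphrase so the command runs without prompting;
# for a real key, omit -N and type a passphrase when asked.
ssh-keygen -q -t rsa -f "$keyfile" -C "demo_user" -N ""

# Make the private key read-only for the owner, as in the step above.
chmod 400 "$keyfile"
key_mode=$(ls -l "$keyfile" | cut -c1-10)
echo "private key mode: $key_mode"

# The public key is a single line: key type, base64 key data, then the comment.
key_type=$(cut -d' ' -f1 "$keyfile.pub")
key_comment=$(cut -d' ' -f3 "$keyfile.pub")
echo "public key type: $key_type, comment: $key_comment"

# Print the fingerprint, useful for checking which key a server has on file.
ssh-keygen -l -f "$keyfile.pub"
```

For a real key protected by a passphrase, running eval "$(ssh-agent -s)" followed by ssh-add ~/.ssh/[KEY_FILENAME] loads the key into the agent, so the passphrase is typed once per session instead of once per connection.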

Transfer files between machines

  • scp securely transfers files between machines using SSH.

    ## copy file from local to remote
    scp [LOCALFILE] [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FOLDER]
    ## copy file from remote to local
    scp [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
  • sftp is an FTP-like file transfer program that runs over SSH.

  • Globus is a GUI program for securely transferring files between machines. To use Globus, go to https://www.globus.org/ and log in through UCLA by selecting UCLA as your existing organizational login. Then download the Globus Connect Personal software and set up your laptop as an endpoint. Very detailed instructions can be found at https://www.hoffman2.idre.ucla.edu/file-transfer/globus/.

  • GUIs for Windows (WinSCP) or Mac (Cyberduck).

  • (My preferred way) Use a version control system (git, svn, cvs, …) to sync project files between different machines and systems.

Line breaks in text files

  • Windows uses a pair of CR and LF for line breaks.

  • Linux/Unix uses an LF character only.

  • MacOS X also uses a single LF character. But old Mac OS used a single CR character for line breaks.

  • If transferred in binary mode (bit by bit) between OSs, a text file could look like a mess.

  • Most transfer programs automatically switch to text mode when transferring text files and convert line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly that a text file is being transferred.
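
When a text file does arrive with the wrong line breaks, it can be checked and repaired from the shell. The sketch below builds a small file with Windows-style line endings, detects the CR characters, and removes them; the file names are hypothetical.

```shell
workdir=$(mktemp -d)

# Write a two-line file with Windows-style CRLF line endings.
printf 'alpha\r\nbeta\r\n' > "$workdir/windows.txt"

# Count lines that contain a carriage return (CR).
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$workdir/windows.txt")
echo "lines with CR before conversion: $crlf_lines"

# Delete every CR to obtain Unix-style LF-only line endings.
tr -d '\r' < "$workdir/windows.txt" > "$workdir/unix.txt"

# grep -c exits non-zero when the count is 0, hence the "|| true".
cr_after=$(grep -c "$cr" "$workdir/unix.txt" || true)
echo "lines with CR after conversion: $cr_after"

# od -c shows the raw bytes, with CR printed as \r and LF as \n.
od -c "$workdir/windows.txt"
```

Where it is installed, the dos2unix utility performs the same conversion in place; tr is used here because it is available on essentially every Unix-like system.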

Run R in Linux

Interactive mode

  • Start R in the interactive mode by typing R in shell.

  • Then run R script by

    source("script.R")

Batch mode

  • Demo script meanEst.R implements a (terrible) estimator of the mean \[ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}. \]

    ## check if a given integer is prime
    isPrime = function(n) {
      if (n <= 1) {
        return (FALSE)  # 0 and 1 are not prime
      }
      if (n <= 3) {
        return (TRUE)   # 2 and 3 are prime
      }
      if (any((n %% 2:floor(sqrt(n))) == 0)) {
        return (FALSE)
      }
      return (TRUE)
    }
    
    ## estimate mean only using observation with prime indices
    estMeanPrimes = function (x) {
      n = length(x)
      ind = sapply(1:n, isPrime)
      return (mean(x[ind]))
    }
    
    print(estMeanPrimes(rnorm(100000)))

  • To run your R code non-interactively aka in batch mode, we have at least two options:

    # default output to meanEst.Rout
    R CMD BATCH meanEst.R

    or

    # output to stdout
    Rscript meanEst.R
  • We typically automate batch calls using a scripting language, e.g., Python, Perl, or a shell script.

Pass arguments to R scripts

  • Specify arguments in R CMD BATCH:

    R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
  • Specify arguments in Rscript:

    Rscript script.R mu=1 sig=2 kap=3
  • Parse command line arguments using the magic formula

    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }

    in the R script. After running the above code, all command line arguments are available as variables in the global environment.


  • To understand the magic formula commandArgs, run R by:

    R --args mu=1 sig=2 kap=3

    and then issue commands in R

    commandArgs()
    commandArgs(TRUE)

  • Understand the magic formula parse and eval:

    rm(list=ls())
    print(x)
    Error in print(x): object 'x' not found
    parse(text="x=3")
    expression(x = 3)
    eval(parse(text="x=3"))
    print(x)
    [1] 3

  • runSim.R has components: (1) command argument parser, (2) method implementation, (3) data generator with unspecified parameter n, and (4) estimation based on generated data.

    ## parsing command arguments
    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }

    ## check if a given integer is prime
    isPrime = function(n) {
      if (n <= 1) {
        return (FALSE)  # 0 and 1 are not prime
      }
      if (n <= 3) {
        return (TRUE)   # 2 and 3 are prime
      }
      if (any((n %% 2:floor(sqrt(n))) == 0)) {
        return (FALSE)
      }
      return (TRUE)
    }

    ## estimate mean only using observation with prime indices
    estMeanPrimes = function (x) {
      n = length(x)
      ind = sapply(1:n, isPrime)
      return (mean(x[ind]))
    }

    # simulate data
    x = rnorm(n)

    # estimate mean
    estMeanPrimes(x)

  • Call runSim.R with sample size n=100:

    R CMD BATCH '--args n=100' runSim.R

    or

    Rscript runSim.R n=100
    [1] 0.1883607

Run long jobs

  • Many statistical computing tasks take a long time: simulation, MCMC, etc. If we log out of Linux before a job finishes, the job is killed.

  • The nohup command in Linux runs a program immune to hangups and writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.

  • nohup is part of the POSIX standard and thus available on Linux and MacOS.

  • Run runSim.R in the background and write output to nohup.out:

    nohup Rscript runSim.R n=100 &
    [1] 0.1704422

    The & at the end of the command instructs Linux to run this command in the background, so we regain control of the terminal immediately.
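
The same pattern can be tried without R. In the sketch below a one-second sh command is a hypothetical stand-in for a long job such as Rscript runSim.R n=100, and the log file name job.log is an assumption; redirecting the output explicitly is optional but makes it obvious where results end up.

```shell
workdir=$(mktemp -d)

# A short stand-in for a long-running job (hypothetical).
# Redirecting stdout and stderr sends output to job.log instead of nohup.out.
nohup sh -c 'sleep 1; echo "job finished"' > "$workdir/job.log" 2>&1 &
job_pid=$!
echo "started background job with PID $job_pid"

# The terminal is free immediately; kill -0 only tests whether the PID is alive.
if kill -0 "$job_pid" 2>/dev/null; then
  echo "job is still running"
fi

# In a real session you would log out here and check back later.
# For this demo, simply wait for the job and read its log.
wait "$job_pid"
cat "$workdir/job.log"
```

Recording the PID at launch makes it easy to check on the job later with ps -fp PID, or to stop it with kill PID.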

screen

  • screen is another popular utility, but not installed by default.

  • Typical workflow using screen.

    1. Access remote server using ssh.

    2. Start jobs in batch mode.

    3. Detach jobs.

    4. Exit from server, wait for jobs to finish.

    5. Access remote server using ssh.

    6. Re-attach jobs, check on progress, get results, etc.

Use R to call R

R in conjunction with nohup (or screen) can be used to orchestrate a large simulation study.

  • It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.

  • We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.

  • Python in many ways makes a better glue.

  • Suppose we have
    • runSim.R which runs a simulation based on command line argument n.
    • A large collection of n values that we want to use in our simulation study.
    • Access to a server with 128 cores.
      How do we parallelize the job?
  • Option 1: manually call runSim.R for each setting.

  • Option 2 (smarter): automate calls using R and nohup.

  • Let’s demonstrate using the script autoSim.R

    cat autoSim.R
    # autoSim.R
    
    nVals <- seq(100, 1000, by=100)
    for (n in nVals) {
      oFile <- paste("n", n, ".txt", sep="")
      sysCall <- paste("nohup Rscript runSim.R n=", n, " > ", oFile, sep="")
      system(sysCall, wait = FALSE)
      print(paste("sysCall=", sysCall, sep=""))
    }

    Note that when we call a bash command using the system function in R, we set the optional argument wait=FALSE so that jobs can run in parallel.

  • Run autoSim.R:

    Rscript autoSim.R
    [1] "sysCall=nohup Rscript runSim.R n=100 > n100.txt"
    [1] "sysCall=nohup Rscript runSim.R n=200 > n200.txt"
    [1] "sysCall=nohup Rscript runSim.R n=300 > n300.txt"
    [1] "sysCall=nohup Rscript runSim.R n=400 > n400.txt"
    [1] "sysCall=nohup Rscript runSim.R n=500 > n500.txt"
    [1] "sysCall=nohup Rscript runSim.R n=600 > n600.txt"
    [1] "sysCall=nohup Rscript runSim.R n=700 > n700.txt"
    [1] "sysCall=nohup Rscript runSim.R n=800 > n800.txt"
    [1] "sysCall=nohup Rscript runSim.R n=900 > n900.txt"
    [1] "sysCall=nohup Rscript runSim.R n=1000 > n1000.txt"
  • Now we just need to write a script to collect results from the output files.

  • Later on we will learn how to coordinate large-scale computation on the UCLA Hoffman2 cluster, using Linux and R scripting.
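
The launch-then-collect workflow described in this section can also be sketched in plain shell. This is one possible approach under stated assumptions, not the course's own collection script: run_sim is a hypothetical stand-in for Rscript runSim.R n=..., its output is a placeholder string rather than a real estimate, and results.csv is an assumed output name.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for "Rscript runSim.R n=$1" so the sketch runs without R.
# It prints one line in the same "[1] value" shape that R prints.
run_sim() {
  echo "[1] estimate-for-n=$1"
}

# Launch one background job per setting, each writing to its own file.
for n in 100 200 300; do
  run_sim "$n" > "$workdir/n$n.txt" &
done
wait    # block until every background job has finished

# Collect: strip the leading "[1] " and pair each value with its n.
for f in "$workdir"/n*.txt; do
  n=$(basename "$f" .txt)
  n=${n#n}
  value=$(sed 's/^\[1\] *//' "$f")
  printf '%s,%s\n' "$n" "$value"
done > "$workdir/results.csv"

cat "$workdir/results.csv"
```

The glob n*.txt expands in lexical order, so with settings of mixed length (such as 100 and 1000) the rows may not come out in numeric order; piping the result through sort -t, -k1,1n fixes that.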