sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.12       digest_0.6.24  
## [13] rlang_0.4.4     evaluate_0.14

Source: https://tensorflow.rstudio.com/keras/articles/examples/imdb_lstm.html

Prepare data

From documentation:

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.

Retrieve IMDB data:

library(keras)

max_features <- 20000
batch_size <- 32

# Cut texts after this number of words (among top max_features most common words)
maxlen <- 80  

cat('Loading data...\n')
## Loading data...
imdb <- dataset_imdb(num_words = max_features)
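
With num_words = max_features, dataset_imdb() keeps only the 20,000 most frequent words; by default, any rarer word is replaced by the out-of-vocabulary index 2. A quick sanity check (a sketch, not part of the original example):

max(unlist(imdb$train$x)) < max_features  # should be TRUE: no index reaches 20,000

The first training example and its label: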
imdb$train$x[[1]]
##   [1]     1    14    22    16    43   530   973  1622  1385    65   458  4468
##  [13]    66  3941     4   173    36   256     5    25   100    43   838   112
##  [25]    50   670     2     9    35   480   284     5   150     4   172   112
##  [37]   167     2   336   385    39     4   172  4536  1111    17   546    38
##  [49]    13   447     4   192    50    16     6   147  2025    19    14    22
##  [61]     4  1920  4613   469     4    22    71    87    12    16    43   530
##  [73]    38    76    15    13  1247     4    22    17   515    17    12    16
##  [85]   626    18 19193     5    62   386    12     8   316     8   106     5
##  [97]     4  2223  5244    16   480    66  3785    33     4   130    12    16
## [109]    38   619     5    25   124    51    36   135    48    25  1415    33
## [121]     6    22    12   215    28    77    52     5    14   407    16    82
## [133] 10311     8     4   107   117  5952    15   256     4     2     7  3766
## [145]     5   723    36    71    43   530   476    26   400   317    46     7
## [157]     4 12118  1029    13   104    88     4   381    15   297    98    32
## [169]  2071    56    26   141     6   194  7486    18     4   226    22    21
## [181]   134   476    26   480     5   144    30  5535    18    51    36    28
## [193]   224    92    25   104     4   226    65    16    38  1334    88    12
## [205]    16   283     5    16  4472   113   103    32    15    16  5345    19
## [217]   178    32
imdb$train$y[[1]]
## [1] 1
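
To read the review itself, the integer sequence can be mapped back through the word index. A minimal sketch, assuming the default index_from = 3 offset (indexes 0, 1, and 2 are reserved for padding, the start marker, and out-of-vocabulary words):

word_index <- dataset_imdb_word_index()        # named list: word -> integer index
idx <- unlist(word_index)
reverse_index <- setNames(names(idx), idx)     # index (as character) -> word
decoded <- sapply(imdb$train$x[[1]], function(i) {
  if (i >= 4) reverse_index[[as.character(i - 3)]] else '?'  # undo the offset; '?' for reserved indexes
})
cat(paste(decoded, collapse = ' '), '\n')

Extract the training and test sets: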
x_train <- imdb$train$x
y_train <- imdb$train$y
x_test <- imdb$test$x
y_test <- imdb$test$y

cat(length(x_train), 'train sequences\n')
## 25000 train sequences
cat(length(x_test), 'test sequences\n')
## 25000 test sequences
cat('Pad sequences (samples x time)\n')
## Pad sequences (samples x time)
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)
cat('x_train shape:', dim(x_train), '\n')
## x_train shape: 25000 80
cat('x_test shape:', dim(x_test), '\n')
## x_test shape: 25000 80

Build model

cat('Build model...\n')
## Build model...
model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>% 
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

# Try using different optimizers and different optimizer configs
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## embedding (Embedding)               (None, None, 128)               2560000     
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 64)                      49408       
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       65          
## ================================================================================
## Total params: 2,609,473
## Trainable params: 2,609,473
## Non-trainable params: 0
## ________________________________________________________________________________
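
The parameter counts in the summary can be reproduced by hand: the embedding stores one 128-dimensional vector per vocabulary entry, the LSTM has four gates, each with input, recurrent, and bias weights, and the dense layer maps 64 units to a single sigmoid output.

max_features * 128            # embedding: 20,000 * 128 = 2,560,000
4 * (64 * (128 + 64) + 64)    # LSTM: 4 gates x (input + recurrent weights + biases) = 49,408
64 * 1 + 1                    # dense: 64 weights + 1 bias = 65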

Training

cat('Train...\n')
## Train...
system.time({
model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = 1,
  validation_data = list(x_test, y_test)
)
})
##    user  system elapsed 
## 300.634  96.535 118.650
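
One epoch is enough for a demonstration; for better accuracy you would normally train longer and watch the validation metrics. A sketch (not run here) with more epochs and early stopping:

history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = 15,
  validation_data = list(x_test, y_test),
  callbacks = list(callback_early_stopping(monitor = 'val_loss', patience = 2))
)
plot(history)  # training vs. validation loss and accuracy per epoch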

Testing

scores <- model %>% evaluate(
  x_test, y_test,
  batch_size = batch_size
)
cat('Test score:', scores[[1]])
## Test score: 0.3786198
cat('Test accuracy:', scores[[2]])
## Test accuracy: 0.83232
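
The trained model can also score individual reviews: predict() returns the sigmoid output, i.e. the estimated probability that a review is positive. A short sketch on the first few test sequences:

probs <- model %>% predict(x_test[1:3, , drop = FALSE])  # 3 x 1 matrix of probabilities
round(probs, 3)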