## Learn how to write efficient R code

**R** is widely used in industry and science as a data analysis tool. The programming language is an essential tool for data-driven tasks. For many statisticians and data scientists, R is the first choice for statistical questions.

Data scientists often work with large amounts of data and complex statistical problems. Memory and runtime play a central role here. You need to write efficient code to achieve maximum performance. In this article, we present tips that you can use directly in your next R project.

Data scientists often want to optimise their code to make it faster. In some cases, you will trust your intuition and try something out. This approach has the drawback that you will probably optimise the wrong parts of your code, wasting time and effort. You can only optimise your code if you know where it is slow. The solution is **code profiling**. Code profiling helps you find the slow parts of your code!

Rprof() is a built-in tool for code profiling. Unfortunately, Rprof() is not very user-friendly, so we do not recommend using it directly. Instead, we recommend the profvis package. Profvis visualises the code profiling data collected by Rprof(). You can install the package via the R console with the following command:

`install.packages("profvis")`

In the next step, we profile some example code.

```r
library("profvis")

profvis({
  y <- 0
  for (i in 1:10000) {
    y <- c(y, i)
  }
})
```

If you run this code in RStudio, you will get the following output.

At the top, you can see your R code with bar graphs for memory and runtime for each line of code. This display gives you an overview of possible problems in your code but does not help you identify the exact cause. In the memory column, you can see how much memory (in MB) has been allocated (the bar on the right) and released (the bar on the left) by each call. The time column shows the runtime (in ms) for each line. For example, you can see that line 4 takes 280 ms.

At the bottom, you can see the **Flame Graph** with the full call stack. This graph gives you an overview of the whole sequence of calls. You can move the mouse pointer over individual calls to get more information. It is also noticeable that the garbage collector (`<GC>`) accounts for a large share of the runtime. The cause is the statement `y <- c(y, i)`, which copies and reallocates the growing vector `y` in every iteration, leading to increased memory usage. **Please avoid such copy-and-grow operations!**
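As a sketch of the fix: preallocate the result once (or use a fully vectorised expression) instead of growing `y` with `c()` in every iteration.

```r
# Preallocating avoids the repeated copy-and-grow that triggers the
# garbage collector in the profiled loop above.
y <- numeric(10001)      # allocate the full length once
for (i in 1:10000) {
  y[i + 1] <- i          # write in place instead of y <- c(y, i)
}

# Even better: a fully vectorised form produces the same result.
stopifnot(identical(y, c(0, 1:10000)))
```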

You can also use the **Data tab**. The *Data tab* gives you a compact overview of all calls and is especially suitable for complex nested calls. If you want to learn more about profvis, you can visit its GitHub page.

Perhaps you have heard of vectorisation. But what is that? Vectorisation is not just about avoiding `for()` loops. It goes one step further: you have to think in terms of vectors instead of scalars. Vectorisation is very important for speeding up R code. Vectorised functions use loops written in C instead of R. Loops in C have less overhead, which makes them much faster. Vectorisation means finding the existing R function implemented in C that most closely matches your task. The functions `rowSums()`, `colSums()`, `rowMeans()` and `colMeans()` are handy for speeding up your R code. These vectorised matrix functions are typically much faster than the `apply()` function.
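To make "thinking in vectors" concrete, here is a small sketch: the scalar loop and the vectorised `sum()` compute the same total, but `sum()` runs its loop in C.

```r
x <- 1:10000

# scalar thinking: accumulate element by element in an R loop
total <- 0
for (v in x) {
  total <- total + v
}

# vector thinking: one call to a C-implemented function
stopifnot(total == sum(x))   # both yield 50005000
```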

To measure runtime, we use the R package microbenchmark. In this package, the evaluation of all expressions is done in C to minimise overhead. As output, the package provides an overview of statistical indicators. You can install the microbenchmark package via the R console with the following command:

`install.packages("microbenchmark")`

Now, we compare the runtime of the `apply()` function with the `colMeans()` function. The following code example demonstrates it (we name the data frame `df` to avoid shadowing the built-in `data.frame()` function):

```r
library("microbenchmark")

df <- data.frame(a = 1:10000, b = rnorm(10000))

microbenchmark(times = 100, unit = "ms", apply(df, 2, mean), colMeans(df))

# example console output:
# Unit: milliseconds
#               expr      min        lq      mean    median        uq      max neval
# apply(df, 2, mean) 0.439540 0.5171600 0.5695391 0.5310695 0.6166295 0.884585   100
#       colMeans(df) 0.183741 0.1898915 0.2045514 0.1948790 0.2117390 0.287782   100
```

In both cases, we calculate the mean value of each column of a data frame. To ensure a reliable result, we make 100 runs (`times = 100`) using the microbenchmark package. As a result, we see that the `colMeans()` function is about three times faster.

We recommend the online book Advanced R if you want to learn more about vectorisation.

Matrices have some similarities with data frames. A matrix is a two-dimensional object, and some functions work in the same way on both. One difference: all elements of a matrix must have the same type. Matrices are often used for statistical calculations. For example, the function lm() converts its input data internally into a matrix before the results are calculated. In general, matrices are faster than data frames. Now, we look at the runtime differences between matrices and data frames.

```r
library("microbenchmark")

m  <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
df <- data.frame(a = c(1, 3), b = c(2, 4))

microbenchmark(times = 100, unit = "ms", m[1, ], df[1, ])

# example console output:
# Unit: milliseconds
#     expr      min        lq       mean    median       uq      max neval
#  m[1, ] 0.000499 0.0005750 0.00123873 0.0009255 0.001029 0.019359   100
# df[1, ] 0.028408 0.0299015 0.03756505 0.0308530 0.032050 0.220701   100
```

We perform 100 runs using the microbenchmark package to obtain a meaningful statistical evaluation. It is recognisable that accessing the first row of the matrix is about 30 times faster than for the data frame. **That is impressive!** A matrix is significantly faster, so you should prefer it to a data frame where possible.
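A practical consequence, sketched below: if a data frame holds only numeric columns, you can convert it once with `as.matrix()` and benefit from the faster matrix access afterwards (`df` is a hypothetical example frame).

```r
df <- data.frame(a = c(1, 3), b = c(2, 4))
m  <- as.matrix(df)          # one-time conversion cost

# subsequent accesses use the faster matrix indexing
stopifnot(is.matrix(m), m[1, "a"] == 1, m[2, "b"] == 4)
```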

You probably know the function `is.na()` for checking whether a vector contains missing values. There is also the function `anyNA()`, which checks whether a vector has any missing values at all. Now we test which function has the faster runtime.

```r
library("microbenchmark")

x <- c(1, 2, NA, 4, 5, 6, 7)

microbenchmark(times = 100, unit = "ms", anyNA(x), any(is.na(x)))

# example console output:
# Unit: milliseconds
#          expr      min       lq       mean   median       uq      max neval
#      anyNA(x) 0.000145 0.000149 0.00017247 0.000155 0.000182 0.000895   100
# any(is.na(x)) 0.000349 0.000362 0.00063562 0.000386 0.000393 0.022684   100
```

The evaluation shows that `anyNA()` is, on average, significantly faster than `any(is.na(x))`. You should use `anyNA()` where possible.
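The speed difference has a simple reason: `anyNA()` can stop at the first missing value it finds, while `any(is.na(x))` must first build a full logical vector. A small sketch:

```r
# NA at the front: anyNA() can return almost immediately,
# while is.na() still allocates and scans a logical vector
# as long as the input.
x <- c(NA, rnorm(1e6))
stopifnot(anyNA(x), any(is.na(x)))   # same answer, different cost
```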

`if() ... else()` is the standard control flow construct, while `ifelse()` is a more user-friendly, vectorised alternative. `ifelse()` works according to the following scheme:

```r
# test: condition; if_yes: value if the condition is true; if_no: value if false
ifelse(test, if_yes, if_no)
```

From the perspective of many programmers, `ifelse()` is more readable than the multiline alternative. The drawback is that `ifelse()` is not as computationally efficient. The following benchmark shows that `if() ... else()` runs more than 20 times faster.

```r
library("microbenchmark")

if.func <- function(x) {
  for (i in 1:1000) {
    if (x < 0) {
      "negative"
    } else {
      "positive"
    }
  }
}

ifelse.func <- function(x) {
  for (i in 1:1000) {
    ifelse(x < 0, "negative", "positive")
  }
}

microbenchmark(times = 100, unit = "ms", if.func(7), ifelse.func(7))

# example console output:
# Unit: milliseconds
#           expr      min       lq       mean   median        uq      max neval
#     if.func(7) 0.020694 0.020992 0.05181552 0.021463 0.0218635 3.000396   100
# ifelse.func(7) 1.040493 1.080493 1.27615668 1.163353 1.2308815 7.754153   100
```

You should avoid calling `ifelse()` inside tight loops, as it slows down your program considerably.
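Note that `ifelse()` is itself vectorised: applied once to a whole vector, it is perfectly fine. The problem is only calling it repeatedly on scalars inside a loop. A sketch:

```r
x <- c(-3, 5, -1, 7)

# one vectorised call instead of a loop over scalar values
labels <- ifelse(x < 0, "negative", "positive")
stopifnot(identical(labels, c("negative", "positive", "negative", "positive")))
```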

Most computers have several processor cores, which allows tasks to be processed in parallel. This concept is called parallel computing. The R package parallel enables parallel computing in R applications and is pre-installed with base R. With the following commands, you can load the package and see how many cores your computer has:

```r
library("parallel")

no_of_cores <- detectCores()
print(no_of_cores)

# example console output:
# [1] 8
```

Parallel data processing is ideal for Monte Carlo simulations. Each core independently simulates a realisation of the model, and at the end the results are summarised. The following example is based on the online book Efficient R Programming. First, we need to install the devtools package. With the help of this package, we can download the efficient package from GitHub. You need to enter the following commands in the RStudio console:

```r
install.packages("devtools")
library("devtools")

devtools::install_github("csgillespie/efficient", args = "--with-keep.source")
```

The efficient package contains a function `snakes_ladders()` that simulates a single game of Snakes and Ladders. We will use this simulation to compare the runtime of the `sapply()` and `parSapply()` functions. `parSapply()` is the parallelised variant of `sapply()`.

```r
library("parallel")
library("microbenchmark")
library("efficient")

N <- 10^4
cl <- makeCluster(4)

microbenchmark(times = 100, unit = "ms",
               sapply(1:N, snakes_ladders),
               parSapply(cl, 1:N, snakes_ladders))

stopCluster(cl)

# example console output:
# Unit: milliseconds
#                               expr      min       lq     mean   median       uq      max neval
#        sapply(1:N, snakes_ladders) 3610.745 3794.694 4093.691 3957.686 4253.681 6405.910   100
# parSapply(cl, 1:N, snakes_ladders)  923.875 1028.075 1149.346 1096.950 1240.657 2140.989   100
```

The evaluation shows that `parSapply()` calculates the simulation on average about 3.5 times faster than the `sapply()` function. **Wow!** You can quickly integrate this tip into an existing R project.
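If you do not want to install the efficient package, the same pattern works with any function. The following is a self-contained sketch with a toy simulation; the worker function and cluster size are illustrative, not part of the benchmark above.

```r
library("parallel")

cl <- makeCluster(2)                  # two worker processes
res <- parSapply(cl, 1:4, function(i) {
  mean(runif(1000))                   # toy Monte Carlo draw per task
})
stopCluster(cl)                       # always release the workers

stopifnot(length(res) == 4, all(res > 0 & res < 1))
```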

There are cases where R is simply slow. You use all kinds of tricks, but your R code is still too slow. In this case, you should consider rewriting your code in another programming language. For other languages, there are interfaces in R in the form of R packages, such as Rcpp and rJava. It is easy to write C++ code, especially if you have a software engineering background, and you can then use it from R.

First, you have to install Rcpp with the following command:

`install.packages("Rcpp")`

The following example demonstrates the approach:

```r
library("Rcpp")

cppFunction('
  double sub_cpp(double x, double y) {
    double value = x - y;
    return value;
  }
')

result <- sub_cpp(142.7, 42.7)
print(result)

# console output:
# [1] 100
```

C++ is a powerful programming language, which makes it well suited for code acceleration. For very complex calculations, we recommend using C++ code.
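As a further sketch of the same approach, loop-heavy code also moves naturally to C++. `sum_cpp()` below is a hypothetical re-implementation of R's `sum()` for numeric vectors, not a function from any package.

```r
library("Rcpp")

cppFunction('
  double sum_cpp(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) {
      total += x[i];   // the loop runs in compiled C++, not in R
    }
    return total;
  }
')

stopifnot(all.equal(sum_cpp(c(1, 2, 3.5)), 6.5))
```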

In this article, we learned how to analyse R code. The profvis package supports you in the evaluation of your R code. You can use vectorised functions like `rowSums()`, `colSums()`, `rowMeans()` and `colMeans()` to speed up your program. In addition, you should prefer matrices to data frames where possible. Use `anyNA()` instead of `any(is.na())` to check whether a vector has any missing values. You can speed up your R code by using `if() ... else()` instead of `ifelse()` in loops. Moreover, you can use parallelised functions from the parallel package for complex simulations. Finally, you can achieve maximum performance for complex code sections by using the Rcpp package.

There are many books for learning R. Below you will find three books that we think are excellent for learning efficient R programming: