April 29, 2019

Basics in R Programming

Are you about to begin a project in R? Before you watch any tutorial, read these basic standards.
Rboy

I spent my last 8 weeks on R and I must admit that, after many months on Python and JavaScript, I almost got knocked out by R's ways of working.

As a consultant, I work on a variety of team projects. My last one led us to develop a tailor-made R solution in the style of a Python scraping project. Sadly, the client environment made us use R. Are you used to developing in a state-of-the-art IDE? Are you familiar with basic programming principles? Then read on!

Customize RStudio

You should begin by setting up the best programming environment you can, so that the hours you will spend in it in the near future are as comfortable as possible. The last thing you want to do is to spend hours on a trivial “bug” such as:

Yes, it took us hours to understand that this O is not a 0

So, start with configuring RStudio:

  • Choose a font that can’t trick you (Global Options > Appearance)

  • Choose a theme that fits your eyes and your taste (Global Options > Appearance)

Moreover, if you want to use git correctly:

  • Remove trailing whitespace on saving (Global Options > Code > Saving)

  • Make sure that your files end with a newline (Global Options > Code > Saving)

  • Encode your files with UTF-8 (Global Options > Code > Saving). Note that this setting may affect how files saved with a different encoding are opened.

Naming standards

Only in R could I see as many naming conventions as this:

dot.case, camelCase, you name it.

You should choose one naming convention and stick to it throughout your project. Fixing it afterward with multi-cursor editing won’t work 90% of the time. Do not let the inconsistency of R's own naming influence you.

Not convinced by the naming conventions? Check out this article.

Unit-tests

As long as you develop your code as a package, R offers an easy testing environment. Keep in mind that the more you unit-test your code, the more confident you can be in what it actually does.

If you don’t know unit-testing or don’t see the point, give 5 minutes to this StackOverflow thread.
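To give an idea of what such a test can look like, here is a minimal sketch. It assumes the testthat package, which this article does not name but which is a common choice for R packages. The helper clean_price is hypothetical and only there to have something to test.

```r
library(testthat)

# Hypothetical helper: keep digits and the decimal point, then convert.
clean_price <- function(text) {
  as.numeric(gsub("[^0-9.]", "", text))
}

# In a package, this block would typically live in tests/testthat/.
test_that("clean_price extracts the numeric value", {
  expect_equal(clean_price("$12.50"), 12.5)
  expect_equal(clean_price("1,299 EUR"), 1299)
})
```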

Hidden behaviors

There are many things that happen behind your code in R and some are not straightforward. Here are a few behaviors we discovered over a span of 8 weeks.

Checking if a variable is NA

One of our regular pain points was checking if a variable is NA. There are lots of ways to do it:

  • variable == NA: the double equals operator checks if the value of your variable equals NA. This comparison makes no sense in R (if you want more details, you can refer to this): its result is itself NA rather than TRUE or FALSE, so it won’t work as a test.

  • is.na(variable): this function is vectorized. It is performed element-wise and thus returns a logical mask with one entry per element of your variable.

  • identical(variable, NA): this function reliably tests whether a variable is exactly the single logical value NA. Nevertheless, it won’t match the other NA types in R, such as NA_character_.

Indeed, R contains different types of NA. But R also allows functions to return custom NAs, such as in the package “rvest”.


If you run into this kind of issue with NA values, you should use anyNA(variable).
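The checks listed above can be compared side by side in base R. This is only a small sketch; the variable names are made up for the illustration.

```r
scores <- c(1, NA, 3)

# == propagates NA instead of answering TRUE or FALSE.
equals_result <- scores == NA            # NA NA NA

# is.na() works element-wise and returns a logical mask.
mask <- is.na(scores)                    # FALSE TRUE FALSE

# identical() only matches the plain logical NA.
same_logical <- identical(NA, NA)                # TRUE
same_character <- identical(NA_character_, NA)   # FALSE

# anyNA() answers "is there at least one NA?" with a single value.
has_missing <- anyNA(scores)             # TRUE
```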

Unexpected Autocompletion

R has a tendency to autocomplete a few key elements:

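  • functions’ arguments (which are also keyword arguments): an unambiguous abbreviation of an argument name is matched to the full name.

  • column names when manipulating data frames: with the $ operator, an unambiguous abbreviation of a column name is matched to the full name.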

It didn’t bother us but I can imagine plenty of situations where it could have.
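Here is a small sketch of both behaviors in base R, where this is known as partial matching. The function scale_values and the data frame prices are hypothetical names invented for the example.

```r
# Hypothetical function with a long argument name.
scale_values <- function(values, multiplier = 1) {
  values * multiplier
}

# 'mult' is silently completed to 'multiplier'.
scaled <- scale_values(c(1, 2, 3), mult = 10)

# Hypothetical data frame with a single column.
prices <- data.frame(amount = c(10, 20, 30))

# '$' silently completes 'amo' to 'amount'.
partial_column <- prices$amo

# '[[' matches exactly by default, so the abbreviated name finds nothing.
exact_column <- prices[["amo"]]
```

Using [[ ]] with the full column name is one way to avoid relying on the completion.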

EDIT: You can make R display warnings when such autocompletion happens by setting the warnPartialMatchArgs and warnPartialMatchDollar options in your .Rprofile.
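The original snippet is not reproduced here. As a sketch, and assuming you want warnings for every kind of partial matching, the following base R options could go in your .Rprofile:

```r
# Candidate lines for ~/.Rprofile: warn whenever R completes an
# abbreviated argument name, `$` name, or attribute name.
options(
  warnPartialMatchArgs   = TRUE,
  warnPartialMatchDollar = TRUE,
  warnPartialMatchAttr   = TRUE
)
```

These are R options rather than RStudio settings, so they also apply when your scripts run outside the IDE, as long as the .Rprofile is loaded.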

R is a functional programming language

Coming from Python, with a more procedural/imperative programming use, I was surprised by the following behavior. R is a functional programming language, meaning that every call, every expression is a value.

So, when defining an R function, you have to remember that a call of this function takes the value of the last evaluated expression. At first, I considered this way of working as an implicit return value. But a Reddit fellow corrected me on this point.

You have the choice to return a value mid-function, as in Python or JavaScript. But the return statement is not needed for the last instruction.
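A minimal sketch of both styles in one function. The function sign_label is hypothetical and only serves the illustration.

```r
# Hypothetical function mixing an early return with a final expression.
sign_label <- function(x) {
  if (is.na(x)) {
    return("missing")   # explicit return in the middle of the function
  }
  # No return() needed: the value of this if/else expression
  # is the value of the function call.
  if (x < 0) "negative" else "non-negative"
}

label_missing <- sign_label(NA)
label_negative <- sign_label(-2)
label_positive <- sign_label(3)
```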


Should I pipe functions?

R allows you to pipe functions in at least 2 ways. Let’s consider the following instruction:

Quite difficult to read.

Magrittr

The first alternative to this difficult-to-read line comes with the package magrittr:

Pipe functions with %>%

This solution allows you to pipe functions in a single instruction! We used it when chaining basic functions, namely rvest functions when scraping web pages. It allowed us to reduce the number of variables while keeping good readability.
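To make the comparison concrete, here is a sketch with base R string functions instead of rvest, so that it runs without a web page. The input vector is invented for the example and is not the instruction from our project.

```r
library(magrittr)

# Invented input data.
words <- c(" apple ", "Banana", " cherry")

# Nested calls: you have to read from the inside out.
result_nested <- paste(toupper(trimws(words)), collapse = ", ")

# Same computation with %>%: each result becomes the first
# argument of the next call.
result_piped <- words %>%
  trimws() %>%
  toupper() %>%
  paste(collapse = ", ")
```

Both variables end up holding the same string.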

I think it’s great but one could consider it difficult to read: after reading from left to right, you have to jump back to the beginning of the line to recall which variable you are assigning. Moreover, the disappearance of the first argument of each piped function could be considered misleading.

Built-in “->” operator

You can also use the built-in -> assignment operator:

The . object is used as a temporary storage

Even though it takes place on multiple lines, with this alternative, you get rid of the 2 drawbacks of magrittr.
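As a sketch of this style, using invented input data and base R string functions rather than our actual scraping code:

```r
# Invented input data.
words <- c(" apple ", "Banana", " cherry")

# Each step reads left to right and stores its result in `.`,
# a regular (if unusual) variable name used as temporary storage.
trimws(words) -> .
toupper(.) -> .
paste(., collapse = ", ") -> result
```

Every function keeps all of its arguments visible, and the assigned variable appears at the end of the line, in reading order.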

Nevertheless, the use of this operator is not recommended by Hadley Wickham's style guide, as this article points out.

Conclusion

I can’t recommend one of these 2 solutions over the other: you should choose the one that fits you best. In any case, you should treat it as a convention, like the naming one. Switching between piping styles shouldn’t become a mental workload when developing.

Learn the shortcuts

If you have to learn only one, it is Alt + Shift + K or Opt + Shift + K. It will give you a summary of all the available shortcuts.

Other common shortcuts available in every IDE:

  • Move lines with Alt + Up/Down or Opt + Up/Down

  • Go to function definition with F2

  • Indent automatically with Ctrl + I or Cmd + I

  • Select all occurrences with Ctrl + Alt + K (couldn’t find it on Mac)

  • Go to next occurrence with Ctrl + K or Cmd + E

  • Format your code with Ctrl + Shift + A or Cmd + Shift + A

  • etc.

Or you can print a cheatsheet, whatever fits your way of working.

Here are more cheatsheets if you want!

Thanks to Dan Ringwald, Clément Walter, Nicolas Jean, and Emna Kamoun. 
