As a consultant, I work on a variety of projects as part of a team. On my latest one, we developed a tailor-made R solution structured like a Python scraping project. Sadly, the client's environment forced us to use R. Are you used to developing in a state-of-the-art IDE? Are you familiar with basic programming principles? Then read on!
Begin by setting up the best programming environment you can, so that the hours you are about to spend in it are as comfortable as possible. The last thing you want is to lose hours on a trivial “bug” such as:
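A hypothetical illustration (the variable names are invented for this sketch): in a font where the digit 1 and the lowercase letter l look alike, the two names below are hard to tell apart, yet R treats them as unrelated variables.

```r
# Two distinct variables whose names look alike in an ambiguous font
total1 <- 10   # name ends with the digit one
totall <- 20   # name ends with a lowercase letter L

# Mixing them up silently changes the result
doubled <- totall * 2   # 40, not the 20 you would get from total1
```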
So, start by configuring RStudio:
Choose a font that can’t trick you (Global Options > Appearance)
Choose a theme that fits your eyes and your taste (Global Options > Appearance)
Moreover, if you want to use git correctly:
Strip trailing whitespace when saving (Global Options > Code > Saving)
Make sure that your files end with a newline (Global Options > Code > Saving)
Encode your files in UTF-8 (Global Options > Code > Saving). Note that this setting may affect how files saved in a different encoding are opened.
Only in R have I seen as many naming conventions side by side as this:
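As a small sketch of that disparity, the calls below use only base R functions, each named in a different style (the variable names on the left are my own, chosen for illustration):

```r
# All of these functions ship with base R, yet follow different styles
n_missing  <- sum(is.na(c(1, NA)))      # dot.case:               is.na
indices    <- seq_len(3)                # snake_case:             seq_len
row_totals <- rowSums(matrix(1:4, 2))   # lowerCamelCase:         rowSums
now        <- Sys.time()                # Capitalized.dot:        Sys.time
some_date  <- as.Date("2020-01-01")     # lower.Capitalized mix:  as.Date
```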
Choose one naming convention and stick to it throughout your project: fixing names afterward with multi-cursor editing won’t work 90% of the time. Do not let R’s own inconsistency sway you.
Not convinced by the naming conventions? Check out this article.
As long as you develop your code as a package, R offers an easy testing environment. Keep in mind that the more you unit-test your code, the more confident you can be in what it actually does.
If you don’t know unit-testing or don’t see the point, give 5 minutes to this StackOverflow thread.
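As a minimal sketch of what such a test can look like, assuming the testthat package (the article does not name the testing tool) and a hypothetical helper called parse_price:

```r
# Hypothetical helper; in a package it would live under R/
parse_price <- function(text) {
  as.numeric(trimws(text))
}

# In a package, this block would live under tests/testthat/
# The guard only keeps the sketch runnable when testthat is not installed
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("parse_price strips whitespace and returns a number", {
    testthat::expect_equal(parse_price(" 12.5 "), 12.5)
    testthat::expect_true(is.na(suppressWarnings(parse_price("n/a"))))
  })
}
```

With this layout, the whole suite can typically be run in one go from the package root, for instance through devtools::test().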
A lot happens behind the scenes in R, and some of it is not straightforward. Here are a few behaviors we discovered over a span of 8 weeks.
One of our regular pain points was checking whether a variable is NA. There are several ways to do it:
variable == NA: the double-equals operator compares the value of your variable with NA. This makes no sense in R (if you want more details, you can refer to this) and won’t work: any comparison with NA yields NA rather than TRUE or FALSE.
is.na(variable): this function is vectorized, so it suits vectors and tables alike. It works element-wise and thus returns a logical mask shaped like your variable.
identical(variable, NA): this function reliably tests whether a variable is an atomic vector holding the single logical value NA. However, it won’t match the other NA types in R.
Indeed, R has several types of NA (the logical NA, plus NA_integer_, NA_real_ and NA_character_). R also allows functions to return custom NAs, as in the package “rvest”.
If you run into this kind of issue with NA values, use anyNA(variable), which returns a single TRUE or FALSE whatever the NA type.
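The differences between these checks can be sketched with base R only (the variable names are mine):

```r
x <- c(1, NA, 3)

# `==` propagates NA: every comparison with NA is itself NA
eq_result <- x == NA                           # NA NA NA

# is.na() is vectorized and returns a logical mask
na_mask <- is.na(x)                            # FALSE TRUE FALSE

# identical() only matches the logical NA of length one
id_logical   <- identical(NA, NA)              # TRUE
id_character <- identical(NA_character_, NA)   # FALSE: different type

# anyNA() answers "is there at least one NA?" for any NA type
any_numeric   <- anyNA(x)                      # TRUE
any_character <- anyNA(NA_character_)          # TRUE
```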
R has a tendency to silently complete abbreviated names (partial matching) in a few key places:
function arguments (which can all be passed by name, like keyword arguments):
column names when manipulating data frames with the $ operator:
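Both cases can be reproduced with a short sketch; the function scale_values and the column temperature are hypothetical names chosen for the example:

```r
# 1. Function arguments: `mult` is silently completed to `multiplier`
scale_values <- function(values, multiplier = 2) {
  values * multiplier
}
scaled <- scale_values(1:3, mult = 10)

# 2. Data frame columns: `$` completes `temp` to `temperature`
df <- data.frame(temperature = c(20.5, 21.0))
via_dollar <- df$temp

# `[[` matches names exactly by default, so the abbreviation finds nothing
via_brackets <- df[["temp"]]   # NULL
```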
It didn’t bother us but I can imagine plenty of situations where it could have.
EDIT: You can make R emit warnings whenever such autocompletion (partial matching) happens by editing your .Rprofile:
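The original snippet is not shown here; a likely equivalent, using the partial-matching warning options that base R provides, would be:

```r
# Lines to add to ~/.Rprofile; all three are standard base R options
options(
  warnPartialMatchArgs   = TRUE,  # abbreviated function arguments
  warnPartialMatchDollar = TRUE,  # abbreviated names after `$`
  warnPartialMatchAttr   = TRUE   # abbreviated attribute names in attr()
)
```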
Coming from Python, where I was used to a more procedural/imperative style, I was surprised by the following behavior. R is at heart a functional programming language: every call, every expression evaluates to a value.
So, when defining an R function, remember that a call to it takes the value of the last expression evaluated in its body. At first, I thought of this as an implicit return value. But a Reddit fellow corrected me on this point.
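A minimal sketch of this behavior, with hypothetical function names:

```r
# No explicit return(): the call evaluates to the last expression of the body
add_one <- function(x) {
  x + 1
}

# if/else is an expression too, so it can be the value of a function
describe <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

three <- add_one(2)
label <- describe(-5)
```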
R allows you to pipe functions in at least 2 ways. Let’s consider the following instruction:
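The original instruction is not reproduced here; as a stand-in, consider this nested call built from base R functions only, which has to be read from the inside out:

```r
# Read inside-out: sum first, then sqrt, then round to one decimal
result <- round(sqrt(sum(c(1, 4, 9))), 1)
```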
The first alternative to this difficult-to-read line comes with the package magrittr:
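As a sketch, assuming magrittr is installed and taking a nested base R call as the hard-to-read starting point (the example values are mine):

```r
library(magrittr)

# Nested form, for comparison: round(sqrt(sum(c(1, 4, 9))), 1)
# Piped form: each result becomes the first argument of the next call
piped_result <- c(1, 4, 9) %>% sum() %>% sqrt() %>% round(1)
```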
This solution allows you to pipe functions in a single instruction! We used it when chaining basic functions, notably rvest functions when scraping web pages. It allowed us to reduce the number of variables while keeping good readability.
I think it’s great but one could consider it difficult to read: after reading from left to right, you have to jump back to the beginning of the line to recall which variable you are assigning. Moreover, the disappearance of the first argument of each piped function could be considered misleading.
You can also use a built-in alternative, the -> (rightward) assignment operator:
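One way to read this suggestion, sketched with base R functions and intermediate names of my own choosing: each step sits on its own line, keeps all its arguments visible, and names its result at the end of the line.

```r
# Each line reads left to right: compute, then assign
c(1, 4, 9) -> values
sum(values) -> total
sqrt(total) -> root
round(root, 1) -> arrow_result
```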
Even though it spans multiple lines, this alternative avoids both drawbacks of magrittr.
I can’t recommend one of these 2 solutions over the other; choose the one that suits you best. In any case, treat it as a convention, just like the naming one. Switching between piping styles shouldn’t add to your mental workload when developing.
If you learn only one shortcut, make it Alt + Shift + K (Opt + Shift + K on Mac). It displays a summary of all the available shortcuts.
Other common shortcuts available in every IDE:
Move lines with Alt + Up/Down or Opt + Up/Down
Go to function definition with F2
Indent automatically with Ctrl + I or Cmd + I
Select all occurrences with Ctrl + Alt + K (couldn’t find it on Mac)
Go to next occurrence with Ctrl + K or Cmd + E
Format your code with Ctrl + Shift + A or Cmd + Shift + A
Or you can print a cheatsheet, whatever fits your way of working.
Here are more cheatsheets if you want!