Scripting in R

From spatial-analyst.net
Almost everybody I know has had serious difficulties switching from a statistical package (or a GIS) to R syntax. The cause is not only the lack of a GUI or the relatively limited explanation of the functions. It is mainly that R asks for a serious time investment (as you will soon find out, you will very frequently need to debug the code yourself, extend the existing functionality, or even try to contact the creators), and it requires that you largely change your data analysis philosophy. R is also increasingly extensive, and this is often a problem for less experienced users of statistics: it quickly becomes difficult to find which package to use, which method, which parameters to set, and what the results mean. Very little of such information comes with the installation of R. One thing is certain: switching to R without any help and without the right strategy can be very frustrating.


Why R?

Before I give you practical tips on how to (systematically) solve your problems in R, I need to emphasize that it is really worth the effort! My usual argument to students for why I decided to break my head with R is that, at the end of the day, I do find a solution and I do get my work done (although I was very suspicious at the beginning, and very frustrated midway). Do not get me wrong: although some R gurus exaggerate the capabilities of R ("every analysis is doable in R"), R is indeed powerful. Consequently, the number of users and the functionality of R are growing exponentially (read more about R in this New York Times article; at the moment, R has between 250,000 and 300,000 users; compare also the website traffic for SAS, Matlab and R-project).

Important point
The specific analysis you want to do in R is probably already available in a package. If not, you can consider extending an existing package; if that is not possible, you can consider using R to control external applications; and if that fails too, you can consider improving R itself.


There are at least five strong reasons to switch to R:

  • It is of high quality: it is a non-commercial product of international collaboration between top statisticians.
  • It helps you think critically: it stimulates critical thinking about problem-solving rather than a push-the-button mentality.
  • It is open-source software: the source code is published, so you can see the exact algorithms being used, and expert statisticians can make sure the code is correct.
  • It allows automation: repetitive procedures can easily be automated by user-written scripts or functions.
  • It can handle and generate maps: R now also provides rich facilities for interpolation and statistical analysis of spatial data, including export to GIS packages and Google Earth.

One more thing: some very smart people use R. This is not by accident!

Making friends with R

I created Quick-R for one simple reason. I wanted to learn R and I am a teacher at heart. The easiest way for me to learn something is to teach it.

-- Robert I. Kabacoff, the creator of Quick-R

First, note that you can edit R scripts in user-friendly script editors such as Tinn-R, RStudio, and/or JGR, or use the package R Commander (Rcmdr), which has a user-friendly graphical interface. This will help you get some first ideas about the R syntax. More detailed instructions on where to obtain R and how to install it can be found in this article.

Important point
The best way to learn R is to look at the existing scripts, then adjust/improve/extend/combine them to fit your needs.


Second, you should take small steps before you get into really sophisticated script development (invest some time!). Start with some simple examples and then try to do the same exercises with your own data. The best way to learn R is to look at the existing scripts. For example, a French colleague, Romain Francois, has been maintaining a gallery of R scripts that is dedicated to the noble goal of getting you addicted to R. A similar website is this R Graphical Manual. Robert I. Kabacoff maintains a small website called Quick-R that gives an overview of the main R philosophy and functionality. John Verzani maintains a website with simple examples in R that will get you going (if you are used to cookbooks, R has them too). In fact, there is even a package that could be the first package you should consider using:

> install.packages("UsingR")

Third, if your R script does not work, do not break your head; try to get help. Try to obtain books on R; Chambers (2008), Venables and Ripley (2002) and/or Murrell (2006) in particular are considered classics. If you are interested in running spatial analysis or geostatistics in R, then Bivand et al. (2008), Reimann et al. (2008), and/or Diggle and Ribeiro (2006) are definitely a must. Next, search the internet for people with similar problems. Web resources on R are quite extensive, and often you will find that all you need is already there. If none of the above helps, try to follow courses on R or contact some of the R gurus. However, keep in mind that these are extremely busy people and that they prefer to communicate with you about your problems via the R mailing lists.

Important point
Before you send a question to a mailing list, please consider that there is a high probability that somebody else has already had the same problem, so you should first do your homework and carefully study what is already there.


Many books on R today also include the original scripts used to produce some of the results/graphics in the book. For example, the books by Bivand et al. (2008) and Reimann et al. (2008) come with accompanying websites where you can see how each example of each chapter/figure has been produced. It is important to start from such scripts, even if they seem rather extensive at first sight, because this way you will systematically learn the best practice of scripting (the books on R are often written by the R developers --- they know exactly what happens and why). Other useful websites with R scripts connected with spatial data analysis and plotting are:

R developers are under no obligation to provide ANY support. If you really want to get useful feedback, try to ask "SMART" questions that contribute to the progress of the whole community, and not only to your personal goals. Before posting any question, make sure you read the R posting guide. One of the best resources on how to start programming (developing packages) is the Advanced R programming page by Hadley Wickham.

Figure: JGR is an example of a user-friendly R scripting environment: it allows you to browse datasets and get hints on method-specific attributes. Another efficient scripting environment is RStudio, which works on various operating systems and is highly informative.
Figure: StatET is another example of a user-friendly R scripting environment. It is based on the Eclipse development platform. To install StatET, first install Eclipse, then add the WalWare repository and install StatET.

Scripting in R

R has really become the second language for people coming out of grad school now, and there’s an amazing amount of code being written for it.

—Max Kuhn, after NYT

R is a command-line based environment, but users rarely write things directly to the command line. It is more common to first write the code in a text editor (Tinn-R, JGR) and then "send lines" of code to the R command line. When writing an R script, there are a few useful tips that you might consider following (especially if you plan to share the script with a wider community):

  • Comment your script and explain the steps you are doing. Comments in R are inserted after the "#" sign. There are never enough comments in an R script!
  • Add some meta-information at the beginning of your script: its authors, last update, purpose, inputs and outputs, a reference where somebody can find more information (R scripts usually come as supplementary materials for project reports or articles), and difficulties one might experience.
  • Once you have tested your script and seen that it works, tidy up the code: remove unnecessary lines, improve the code where needed, and test it using extreme inputs. In R, many equivalent operations can be run via different paths. In fact, even the same techniques are often implemented in several packages, which is all beneficial to the users. On the other hand, not all methods are equally efficient (speed, robustness), i.e. equally elegant, so it is often worth investigating what might be the most elegant way to run an analysis.
  • Place the input data on-line (this way you only need to distribute the script) and then fetch the data using the "download.file" function in R.

All these things will make life easier for your colleagues, but also for yourself if you decide to come back to your own script in a few years (or a few months).
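The tips above could be combined into a script skeleton like the one below. This is only a minimal sketch: the header fields, the URL, the file name obs.csv and the helper fetch_input() are hypothetical placeholders chosen for illustration, not part of R or of any package.

```r
# ------------------------------------------------------------
# Title   : Example analysis script (hypothetical)
# Authors : Your Name <you@example.org>
# Updated : YYYY-MM-DD
# Purpose : Summarise point observations stored in a CSV table
# Inputs  : obs.csv (fetched from the URL given below)
# Outputs : summary statistics printed to the console
# Notes   : the download step needs an internet connection
# ------------------------------------------------------------

# Hypothetical location of the input data; replace with your own
data_url <- "https://example.org/data/obs.csv"

# Download the input only if it is not already on disk,
# and return the path of the local copy
fetch_input <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile
}

# Usage (not run here, because the URL above is a placeholder):
# obs <- read.csv(fetch_input(data_url, "obs.csv"))
# summary(obs)
```

Checking for an existing local copy before downloading is a design choice, not a requirement: it keeps repeated runs fast and lets the script work off-line once the data have been fetched.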

Another thing you might consider is to write the code and the comments together in Tinn-R using Sweave. Note that you can still run such a script from Tinn-R; you only need to specify where the R code begins ("<<>>=") and ends ("@"). This way, you distribute not only the code, but also all explanations, formulae etc.
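To make the chunk markers concrete, here is a minimal sketch that writes a tiny Sweave document from R and then extracts its R code with Stangle from the utils package. The document text, the chunk name sample-mean and the temporary file names are made up for illustration.

```r
# Write a minimal Sweave (.Rnw) document to a temporary file;
# the R code sits between the "<<...>>=" and "@" markers
rnw_file <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The sample mean is computed in the chunk below.",
  "<<sample-mean>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw_file)

# Extract only the R code from the document into a plain script
r_file <- tempfile(fileext = ".R")
utils::Stangle(rnw_file, output = r_file, quiet = TRUE)

# Show the extracted script
cat(readLines(r_file), sep = "\n")
```

Running Sweave on the same file would instead produce a LaTeX file with the code and its output woven into the text; that step is left out here because it is only useful with a LaTeX installation at hand.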

The best strategy to find what you look for

From a period in which geographic information systems, and later geocomputation and geographical information science, have been agenda setters, there seems to be interest in trying things out, in expressing ideas in code, and in encouraging others to apply the coded functions in teaching and applied research settings.

—Roger Bivand, in "Implementing Spatial Data Analysis Software Tools in R"

In the previous section we mentioned that, because the R community is large and extensive, the chance that a solution to your problem is already out there is pretty high. The remaining issue is how to find this information. There are several sources you should look at: (1) the help files installed locally on your machine; (2) mailing/discussion lists; (3) various websites and tutorials; and (4) the commercial literature. You should really search in this order, from your machine to a bookstore, although it is always a good idea to cross-check all possible sources and compare alternatives. Please also bear in mind that (a) not everything you grab from the web is correct and up to date; and (b) it is also possible that you have an original problem that nobody has experienced before.

Searching your local installation of R

Imagine that you would like to find out which packages on your machine can run interpolation by kriging. You can quickly find this out by running the function help.search, which will give something like this:

> help.search("kriging")

Help files with alias or concept or title matching ‘kriging’ using fuzzy matching:

image.kriging(geoR)             Image or Perspective Plot with Kriging Results
krige.bayes(geoR)               Bayesian Analysis for Gaussian Geostatistical Models
krige.conv(geoR)                Spatial Prediction -- Conventional Kriging
krweights(geoR)                 Computes kriging weights
ksline(geoR)                    Spatial Prediction -- Conventional Kriging
legend.krige(geoR)              Add a legend to a image with kriging results
wo(geoR)                        Kriging example data from Webster and Oliver
xvalid(geoR)                    Cross-validation by kriging
krige(gstat)                    Simple, Ordinary or Universal, global or local, Point or Block Kriging, or simulation.
krige.cv(gstat)                 (co)kriging cross validation, n-fold or leave-one-out
ossfim(gstat)                   Kriging standard errors as function of grid spacing and block size
krige(sgeostat)                 Kriging
prmat(spatial)                  Evaluate Kriging Surface over a Grid
semat(spatial)                  Evaluate Kriging Standard Error of Prediction over a Grid

Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.

This shows that kriging (and its variants) is implemented in (at least) four packages. We can now display the help for the function "krige" that is available in the package gstat:

> help(krige, package=gstat)
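As a small variation on the search above, help.search can be restricted to a single package, and apropos lists the objects whose names match a pattern. The sketch below uses the keyword "smooth" and the package stats only because stats ships with every R installation; neither is taken from the kriging example.

```r
# Search the installed help pages, restricted to one package
hits <- help.search("smooth", package = "stats")

# The matches are stored in a data frame; show topic and package
unique(hits$matches[, c("Topic", "Package")])

# List objects on the search path whose name contains "smooth"
apropos("smooth")
```

The same calls with "kriging" or "krig" as the pattern should reproduce the kind of listing shown earlier, provided the geostatistical packages are installed.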

Searching the R-project

The archives of the mailing lists are available via the servers in Zurich. They are fairly extensive, and the only way to find something useful is to search them. The fastest way to search all R mailing lists is to use the RSiteSearch function. For example, imagine that you are trying to run kriging and the console gives you the following error message:

"Error : dimensions do not match: locations XXXX and data YYYY" 

Based on the error message we can list 3-5 keywords that will help us search the mailing lists, e.g.:

> RSiteSearch("krige {dimensions do not match}")

This will give over 15 messages with a thread matching exactly your error message. This means that other people also had this problem, so now you only need to locate the right solution. You should sort the messages by date and then start from the most recent message. The answer to your problem will be in one of the replies submitted by the mailing list subscribers. You can quickly check if this is a solution that you need by making a small script and then testing it.

Sometimes you only want to browse the functionality of contributed R packages and see what R has to offer. The best way to quickly find out what is there is to browse the complete list of contributed packages. Open the URL and then do a search using some more generic keyword e.g. "kriging".

Searching the www

Of course, you can at any time Google the key words of interest. However, you might instead consider using the Rseek.org search engine maintained by Sasha Goodman. The advantage of using Rseek over e.g. general Google is that it focuses only on the R publications, mailing lists, vignettes, tutorials etc. The result of the search is sorted in categories, which makes it easier to locate the right source.

R mailing lists: Do's and Don'ts!

If you are ultimately not able to find a solution yourself, you can try sending a description of your problem to a mailing list. Note that there are MANY R mailing lists, so you first have to find the right one. Sending the right message to the wrong mailing list will still leave you without an answer. Also keep in mind that everything you send to a mailing list is public and archived, so cross-check your message before you send it.

Important point
When asking for help from a mailing list, use the existing pre-installed datasets to describe your problem. (Then you only need to communicate the problem and not the specifics of a dataset; there is also no need to share your data.)


Do's:

  • If you have not done so already, read the R posting guide!
  • Use the existing pre-installed datasets (they come together with a certain package) to describe your problem (you can list all available datasets on your machine by typing 'data()'). This way you do not have to attach your original data or waste time trying to explain your case study.
  • If your problem is completely specific to your dataset, then upload it to an FTP server or a web directory so that somebody can access it and see what is really going on.
  • Link your problem to some existing problems; put it in some actual context (to really understand what I mean here, you should consider attending the useR! conferences).
  • Acknowledge the work (time spent) other people do to help you.
  • You can submit not only the problems you discover but also the information that you think is interesting for the community.
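A question that follows the Do's above could be accompanied by a short, self-contained script such as the sketch below. The cars dataset and the linear model are arbitrary stand-ins for "your problem", and including sessionInfo() is a suggestion added here rather than something the list above prescribes.

```r
# Use a dataset that ships with R, so that readers can rerun the code as is
data(cars)
str(cars)

# The smallest piece of code that shows the behaviour you are asking about
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# Report the R version, platform and loaded packages together with the question
sessionInfo()
```

Anyone on the list can paste these lines into a fresh R session and see the same objects, which is the point of relying on pre-installed data.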

Don'ts:

  • Do not send poorly formulated questions. Make sure you give a technical description of your data, the purpose of your analysis, and even details about your operating system, RAM etc. Put yourself in the position of the person who is willing to help: provide all the information they would need, as if they were sitting at your computer.
  • Do not send too much. One message, one question (or better, "one message, one problem"). Nobody reading the R mailing lists has time to read long articles with multiple discussion points. Your problem should fit on half a page; if somebody gets more interested, you can continue the discussion off the list.
  • R comes with ABSOLUTELY NO WARRANTY. If you lose data or get strange results, you are welcome to improve the code yourself (or consider obtaining some commercial software). Complaining to a mailing list about what frustrates you about R makes no sense, because nobody is obliged to take any responsibility.
  • R is a community project (it is based on the solidarity between the users). Think what you can do for the community and not what the community can do for you.

Probably the worst thing that can happen to your question is that you do not get any reply (and this does not necessarily mean that nobody wants to help you or that nobody knows the solution)! There are several possible reasons why this may have happened:

  • You have asked too much! Some people post questions that would take even a few weeks to solve (maybe you should write a project proposal instead?). You should always limit yourself to 1-2 key, concrete issues. Broader discussions about more general topics and statistical theory are sometimes also welcome, but they should be connected with specific packages.
  • You did not introduce your question/topic properly. If your question is very specific to your field and the subscribers cannot really understand what you are doing, you need to think of ways to introduce your field and describe the specific context. The only way to learn the language used on the R mailing lists is to browse the existing mails (archives).
  • You are requesting that somebody do work for you that you could do yourself! R and its packages are all open source, which allows YOU to double-check the underlying algorithms and extend them where necessary. If you want other people to do programming for you, then you are in the wrong place (some commercial software companies do accept wish-lists and similar types of requests, but that is what they are paid for anyway).
  • Your question has already been answered a few times (and it is quite annoying that you did not do your homework to check this).

Remember: everything you send to a mailing list is read by a large group of people (for example, R-sig-geo has over 1000 subscribers), and it is archived on-line, so you should be careful about what you post. If you develop a reputation for being ignorant and/or too sloppy, people might start ignoring your questions even if they eventually start taking the right shape.

