Listen to the podcast
On episode 28, we finally get around to tackling R, a language for statistical computing. R has a storied history as GPL-licensed software related to the S language, which came out of Bell Labs, and R itself was influenced by Scheme. R is the go-to tool for many statisticians, analysts, and data scientists. It comes with the full suite of plotting, array, and math libraries that computational scientists have come to know and expect. As a high-level, dynamic, and extensible language, R is definitely worth knowing about, even if you aren’t a statistician!
Today’s useRs include:
- Geraldine van der Auwera
- Phillip Keung (special guest)
- Christopher Jordan-Squire (special guest)
- Anthony Scopatz (moderator)
Phillip Keung is a biostatistics graduate student and works in statistical genetics at the University of Washington in Seattle, WA.
Christopher Jordan-Squire is a mathematics graduate student at the University of Washington in Seattle, WA. His love of applied problems has overcome his original training in pure mathematics and deep love of the Fourier transform. With training in both pure mathematics and applied statistics, he now works in the intersection of the two fields. He is currently researching non-parametric mixture model estimation, and also does statistics consulting and python coding in his not-so-ample spare time.
Intro Music: Batman Theme
Outro Music: X-Men Theme
Tony Theodore
2012/07/09
Downloaded this with Instacast and it seems to actually be episode 25 … just checked and it’s the same in iTunes.
Anthony Scopatz
2012/07/09
I know that we were experiencing this issue with episode 27, but episode 28 should be fine.
The thing is though that we have fixed our feed for episode 27, so it is iTunes’ or Instacast’s caching mechanism that isn’t picking up our fix. Not a whole lot we can do on our end…
Tony Theodore
2012/07/09
Strange, episode 27 isn’t in the iTunes index at all, but is in Instacast.
Really enjoying the podcast, keep up the great work!
Anthony Scopatz
2012/07/09
Weird… That is strange… Thank you though! I am glad you like it!
David Nusinow
2012/07/10
This episode seemed really harsh to R. Maybe there just wasn’t enough familiarity with it on the panel? A few points that I think were sorely missed during the discussion:
1) The notion that R doesn’t have collections is silly. The list is a fundamental R data structure that was pretty obviously missing from this episode. You can trivially have named objects in a list. Although it’s not hashed (and so doesn’t have the nice performance properties of a hash/dict) it’s a workable data structure for collections. A built-in hash would be nice and I hope the R-core team adds it one day.
2) The for loop part was a little embarrassing. For loops are probably R’s most glaring weakness, but it’s because they’re horribly slow rather than the reasons mentioned on the podcast. A for loop can iterate over a vector or list trivially. You can also use the apply family of functions, which were mentioned briefly on the podcast. They provide a very nice and functional interface to the concept of iterating over a collection. There’s also a set of functions named after lisp’s versions (python shares these) like Map and Reduce. These are just frontends to the apply family.
3) The foreach package is actually very neat. It provides a nice for loop-like interface over a number of pluggable backends for parallelization, like the snow package mentioned on the podcast. So you can write a normal for loop while you’re developing your code, and later do minor adjustments to make it a foreach loop and you’ve instantly got code that is parallelized across multiple cores or a cluster. These should also allow access to generators, as in python’s xrange mentioned on the show, although I haven’t tested it myself. A caveat to this is the standard one for distributed coding, same as python’s GIL: sharing state is a non-trivial problem, so be careful or only do this for applications where you don’t need to share state.
4) Related to working with collections, a package that was mentioned during the podcast was Hadley Wickham’s plyr package. This is a package that deals with what he calls the “split, apply, combine” problem, which is a pretty basic one in data analysis. You’ve got a collection of different but related things, you want to split them up by some criterion (like country of origin, mutational status, etc.), apply a function to each subcollection, and then reassemble the data that came out of those functions on each subcollection into a whole. This adds another layer of ease to slicing data, and it’s something I’ve yet to see in another language.
5) R is actually a very impressive language on its own. It’s incredibly ugly and has some very real problems, but it’s generally a surprisingly well-designed language because it inherits so much from both UNIX (S was developed at Bell at the same time as UNIX) and Scheme (the S developers consciously drew from lisp, and R itself began life as a scheme interpreter). It’s a fully functional language that, above a beginner’s level, encourages high level programming and exposes a lot of low-level power. Computing on the language (a la Scheme) is quite common, and as an example it’s part of how Hadley put ggplot2 together. The terminal allows easy access to the implementation of any function written in R, allowing you to understand how the underlying code works with no ceremony, and also for easy debugging. When people talk about how R sucks it reminds me of how they used to talk about JavaScript a few years ago. The language is problematic, but don’t write it off, there’s a lot of elegance in there.
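As a rough illustration of points 1, 2, and 4 above, here is a minimal sketch in base R only. The data and variable names are invented for the example, and the split, apply, combine step uses base R's split and sapply as a stand-in for plyr, so it should not be read as plyr's own interface.

```r
# Hypothetical data, invented for illustration.
scores <- list(alice = c(90, 85), bob = c(70, 80, 75))

# 1) A named list works as a simple (unhashed) key-value collection.
scores[["carol"]] <- c(88, 92)
alice_scores <- scores[["alice"]]

# 2) A for loop can iterate directly over the names of a list...
totals_loop <- numeric(0)
for (name in names(scores)) {
  totals_loop[name] <- sum(scores[[name]])
}

# ...and sapply expresses the same iteration functionally.
totals_apply <- sapply(scores, sum)

# Map and Reduce follow the Lisp naming.
doubled <- Map(function(x) x * 2, scores)
grand_total <- Reduce(`+`, totals_apply)

# 4) Split, apply, combine in base R: split values by group,
# apply a function to each piece, and collect the results.
df <- data.frame(
  country = c("US", "US", "FR", "FR"),
  value = c(1, 3, 10, 20)
)
pieces <- split(df$value, df$country)
means <- sapply(pieces, mean)
```

The loop and the sapply call compute the same per-name totals; the loop version is mostly useful while developing, and the functional version is usually the more idiomatic choice.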
Anthony Scopatz
2012/07/10
Hello David, Thanks for the clarifications you posted. I would like to state that I have great respect for R (which is why I wanted to do this episode) but little experience other than V&V for other statistical tools. I am not an expert here, but do feel that it is important.
That said, I think it would be great if there was another R episode to talk about all of the issues that you brought up. If you want to grab a couple of your friends or other people in the R community, I would be happy to moderate another episode. Alternatively, if you want to moderate it yourself then you’ll need to be on another episode before that, which I am also happy to make happen. In either event, if you are interested you should send an introduction email out to inscight-dev@googlegroups.com or get in touch with me.
Andrew Prayle
2012/07/19
I second @David Nusinow’s points. I think that there were errors, e.g. for loops do exist. The various data structures in R are really useful. The idea that R is not usable for large datasets doesn’t really square with the availability of packages such as bigmemory. It’s trivial to spin up a cluster of workstations with R and then use any of R’s commands and additional packages over the cluster. And to talk about scientific computing without describing Bioconductor (more like a suite of packages than a single package) in any detail is also a serious omission.
You’ve got to note that R was written by people who wanted to get statistical programming done and implement new statistical techniques with it and then apply them. So a key advantage is that when a new statistical technique is described in the literature it is almost immediately available in R (often a package accompanies the paper). So it is literally years ahead of other statistical packages such as Stata. Philosophically R also changes the way you do exploratory data analysis, due to the interactive nature of the way you can use it. Once the dataset is cleaned, it is trivial to produce sophisticated plots, and explore the data at the same time as doing the analysis, rather than blindly running the same types of statistical analysis on each dataset. And if R gets too slow, you can always write functions in C and call them from R, or use the Rcpp package.
Finally, anyone can make python code look ugly, and R code is said to be ugly but is actually pretty intuitive after a while. To do a linear regression in R you can have a data frame with columns called ‘lung_function’, ‘age’, ‘sex’ etc:
model1 <- lm(lung_function ~ age + sex + height, data = mydata)
summary(model1)
A key thing is that to do a randomForest you also type:
require(randomForest)
model2 <- randomForest(lung_function ~ age + sex + height, data = mydata)
So the consistency amongst packages is really useful. How simple do you want to make a language for statistical analysis? I think you should do another episode and get a few prominent people from the community on it. Maybe R does suck, but you only have to look at things like which tools people use in competitions like Kaggle to see what's useful and used.
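To make the point about a shared formula interface concrete, here is a small sketch that runs with base R alone. The data frame is made up (the column names follow the comment above, the values are invented), and glm stands in for randomForest so that no extra package is needed.

```r
# Made-up data for illustration only; the values are invented.
mydata <- data.frame(
  lung_function = c(3.1, 3.8, 2.9, 4.2, 3.5, 2.7, 4.0, 3.3),
  age = c(34, 28, 51, 25, 40, 60, 30, 45),
  sex = factor(c("F", "M", "F", "M", "M", "F", "M", "F")),
  height = c(162, 180, 158, 185, 175, 160, 178, 165)
)

# The same formula object drives two different modelling functions.
f <- lung_function ~ age + sex + height
model_lm <- lm(f, data = mydata)
model_glm <- glm(f, data = mydata, family = gaussian())

# With a Gaussian family and identity link, glm fits the same
# linear model as lm, so the coefficients should agree.
coef(model_lm)
coef(model_glm)
```

Swapping the modelling function while keeping the formula and the data argument unchanged is the consistency being described.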
Anthony Scopatz
2012/07/19
Hi Andrew, I am sorry that you think that we missed the boat here. I make the same offer to you that I made to David. If you want to do another episode on R, I am in full support. Please, let me know!
David Nusinow
2012/07/19
Hi Anthony,
I tried to send an introduction to the google group, but it won’t let me post without being a member, and somehow I’m not able to find the group on the search page to join. I’ll keep looking, but I’ve copy & pasted the email I sent below. I’d be happy to represent a happy R user on the podcast if there’s interest.
copy and paste below
—————————
Hello,
My name is David Nusinow and I was invited to introduce myself here
after I commented on the podcast episode about R. I wasn’t terribly
happy with how R was treated in that episode, and thought that it
could do with more users who were familiar with it. By way of who I
am, I’m not a pillar of the R community or anything, just a happy user
so I may not be the best representative for the language. I’m a
computational biologist working in the fields of developmental
biology, proteomics, genomics, and systems biology. I come from the
Free Software community and a biology background, so I’m more a
programmer biologist than statistician, although I’ve had to learn a
fair amount of stats and machine learning for my work.
I’ve been using R almost exclusively for the past three years for
almost all my real computational analysis including data cleaning,
statistical or machine learning analysis, and visualization. I’ve
spent a lot of time with the broader collection of libraries in CRAN,
as well as those from Bioconductor, which is a huge motivation for me
to use the language. I’ve managed to use most of the popular elements
of and packages for the language, and I feel like I’m reasonably
familiar with even those parts I don’t know very well. I’d be happy to
talk more about R, either to teach about it, correct misconceptions,
compare it to other languages (I’m a bit of a programming language
nerd), or just geek out about cool stuff that it’s capable of. Or…
you know… rant about its problems.
Anthony Scopatz
2012/07/19
Hi David, Whoops I didn’t realize that we were closed. I will open up the list and send you an invite as well. There is definite interest!
David Nusinow
2012/07/19
Cool, thanks! I got the invite and subscribed. I’ll send that email to the list I suppose.