R is a programming language for data management, statistical analysis, and visualisation. R is free, open source software and is one of the most popular environments for statistical computing.
Data management is about getting, cleaning, and preparing data for statistical analysis and visualisation. Tasks like filtering, summarising, mutating, and merging data and pivoting between wide and tall data formats are everyday jobs for a data manager.
Data management is to statistical analysis what masking and covering are to painting: do it properly, or things get messy.
Data management tasks are often ignored in statistics courses, where students are supplied with clean, pre-prepared datasets. In practice, however, data management usually takes up most of the time spent on a data analysis.
Many researchers use point-and-click software like MS Excel, MS Access or SPSS for data management. This leads to non-reproducible, undocumented analyses that are difficult or impossible to share between researchers, let alone publish. With R, data management can be automated in fully reproducible, transparent and shareable ways. And when data gets updated, one only has to press a button to re-run the full analysis, including outputting text, tables and figures to documents and web pages.
Data visualisation is about presenting data using graphical methods that support insight and reveal information that would otherwise be overlooked. R’s graphing capabilities are unsurpassed. With R (and a clean dataset), creating even very sophisticated plots is straightforward. Never again will you need to manually (re)format points and lines in Excel plots.
This course is for you if you want valid and reproducible results from your statistical analyses. In one day you will learn what they forgot to teach you in statistics class, and what possibly got in your way or even forced you back to treacherous Excel when you started working with real-life data. (Don’t click the link unless you are ready for some serious Excel bashing.)
Beware that R is a full-blown programming language and there will be (almost) no pointing and clicking in this course. Learning R is like learning a spoken language: you’ll quickly learn how to order two beers in a bar, but it takes years of practice to become fluent in the language. This course will teach you a bit more than ordering beers and more than enough to continue studying on your own. (Yes, you can make R order beers.)
The course is hands-on and 100% PowerPoint-free, and within the first hour you will be able to perform basic data manipulation tasks and create multivariate visualisations like the one produced by the following few lines of R code:
library(ggplot2)

# CO2 is an example dataset that ships with R
ggplot(data = CO2,
       mapping = aes(x = conc, y = uptake, group = Plant,
                     colour = Treatment, linetype = Type)) +
  geom_point() +
  geom_line() +
  labs(title = 'Carbon dioxide uptake in grass plants',
       x = 'Ambient CO2 concentration',
       y = 'CO2 uptake')
In the course you will practise the basic tasks of data management using real-life datasets:
Read data into R: Data are (almost) always stored in external files or databases. Importing data is the first step in any analysis.
Tidy messy data: Tidy data is where every row is an observation, every column is a variable, and every cell is a single value. Real-life data are often messy and in serious need of tidying.
Arrange data, e.g. sort patients by age.
Select variables, e.g. in a dataset where each patient record has 200 variables pick only the variables you need, e.g. height and weight.
Filter observations, e.g. in a dataset of 5 million patients pick those from your own hospital.
Mutate (add or modify) variables, e.g. calculate BMI from body height and weight or age from date of birth.
Summarise variables, e.g. calculate the mean and standard deviation of height for all patients by sex and age group.
Join two or more datasets into one, e.g. merge patient data with procedure data by patient-id.
Plot data using The Grammar of Graphics – i.e. map data to coordinates, colours, shapes etc. and output to points, lines, bars, etc. In the plot above each data value is mapped to a 2D position, a colour, and a line type. The values are presented using points and connected with straight lines.
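To give a feel for these tasks, here is a minimal sketch in R. It assumes the dplyr and tidyr packages, which the task names above happen to match but which this text does not name explicitly. All datasets, column names and values are invented for illustration only.

```r
library(dplyr)  # assumed package: arrange, select, filter, mutate, summarise, joins
library(tidyr)  # assumed package: pivot_longer for wide-to-tall reshaping

# Hypothetical patient data, made up for this example
patients <- tibble(
  patient_id = 1:6,
  sex        = c("F", "M", "F", "M", "F", "M"),
  age        = c(34, 58, 71, 45, 29, 63),
  height_cm  = c(165, 180, 158, 175, 170, 182),
  weight_kg  = c(60, 85, 55, 90, 68, 77),
  hospital   = c("A", "A", "A", "A", "B", "B")
)

# Read: in practice this would be a file path; inline text keeps the sketch self-contained
procedures <- read.csv(text = "patient_id,procedure
1,MRI
2,ECG
2,X-ray
4,MRI")

# Tidy: pivot a wide table (one column per visit) into a tall one (one row per visit)
wide <- tibble(patient_id = 1:2, visit_1 = c(120, 135), visit_2 = c(118, 130))
tall <- wide %>%
  pivot_longer(cols = starts_with("visit_"),
               names_to = "visit", values_to = "systolic_bp")

# Filter, select, mutate and arrange in one pipeline
own_hospital <- patients %>%
  filter(hospital == "A") %>%                        # keep patients from hospital A
  select(patient_id, sex, age, height_cm, weight_kg) %>%  # keep only needed variables
  mutate(bmi = weight_kg / (height_cm / 100)^2) %>%  # BMI = kg / m^2
  arrange(age)                                       # sort by age, youngest first

# Summarise: mean and standard deviation of height by sex
height_by_sex <- own_hospital %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height_cm), sd_height = sd(height_cm))

# Join: attach procedures to patients by patient_id (patients without procedures are kept)
joined <- left_join(patients, procedures, by = "patient_id")
```

Each step returns a new data frame and leaves the input untouched, so the whole chain can be re-run from the raw data whenever the data are updated.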
This course is on general data management and visualisation. Look elsewhere for courses on specific statistical methods and specialised visualisation techniques, e.g. for genomic data. However, if you already know statistics, R will be your life-long friend.
This course is aimed primarily at medical researchers (doctors, biologists, engineers, etc.), but anybody with an interest in managing data with R is welcome. Prior R experience is not required but is an advantage. A basic understanding of data management tasks in general (select, filter, summarise etc.) is also an advantage, as is any programming experience.
This course may be held wherever and whenever at least 12 people ask for it.
The course may be completed in one day (6 hours) or two half days (3 hours each).
Please note that R and RStudio versions from “Softwareshoppen” are outdated, and that installing “unauthorised” software on a corporate PC is a pain. So you are better off using a private PC (or Mac) where you have administrator rights.
CORRECTION: It turns out that it is actually possible to use the versions of R and RStudio that are currently available from “Softwareshoppen” (Region H only). So if you do not have a private PC, you may use your work PC.