Review of Reproducible Research by Johns Hopkins University on Coursera

This week I completed the course Reproducible Research by Johns Hopkins University on Coursera. The course introduces tools for publishing research documents that contain the data processing code, raw data and results. Research is "reproducible" if an independent researcher can fetch the code, fetch the data, execute the scripts and verify the results.

IMO, this is akin to the software engineering practices of software quality assurance, code review and continuous integration. These practices are meant to solve the problem where code "works on my machine" but nowhere else. This is extremely important in bioinformatics because erroneous research can lead to erroneous clinical trials, as described in the lecture The Importance of Reproducible Research in High-Throughput Biology.

Key Lectures
My favorite lecture of the course was The Importance of Reproducible Research in High-Throughput Biology, given by Keith A. Baggerly, Ph.D., of the MD Anderson Cancer Center, Houston, TX. The lecture discusses Dr. Baggerly's attempt to reverse engineer the results of a study that had numerous errors. See this NYT article for more details.

Projects
For the first project, we analyzed activity monitoring data recorded by a fitness tracker. I calculated the mean number of steps for each 5-minute interval, grouped by weekdays and weekends (i.e. one group for Monday-Friday intervals, one group for Saturday-Sunday intervals). I concluded that the user is most active on weekdays because the 5-minute interval with the highest mean step count occurs in the weekday group.
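
The grouping step can be sketched in a few lines. This is only an illustration in Python using the standard library, not the code from my report. The record layout (date, interval label, step count), the function names and the sample values are all made up for the example, and skipping missing readings is just one possible way to handle them.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def day_type(d):
    """Return 'weekend' for Saturday/Sunday, otherwise 'weekday'."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return "weekend" if d.weekday() >= 5 else "weekday"


def mean_steps_by_interval(records):
    """Average the step counts per (day type, interval) pair.

    `records` is a hypothetical list of (date, interval, steps) tuples;
    a steps value of None marks a missing reading and is skipped.
    """
    buckets = defaultdict(list)
    for day, interval, steps in records:
        if steps is None:
            continue
        buckets[(day_type(day), interval)].append(steps)
    return {key: mean(values) for key, values in buckets.items()}


def busiest_interval(means, group):
    """Return ((group, interval), mean) for the highest mean in one group."""
    candidates = {k: v for k, v in means.items() if k[0] == group}
    return max(candidates.items(), key=lambda kv: kv[1])


# Made-up sample data, only to exercise the functions.
records = [
    (date(2012, 10, 1), 835, 200),   # Monday
    (date(2012, 10, 2), 835, 100),   # Tuesday
    (date(2012, 10, 6), 835, 90),    # Saturday
    (date(2012, 10, 7), 835, None),  # Sunday, missing reading
    (date(2012, 10, 6), 1200, 120),  # Saturday
]

means = mean_steps_by_interval(records)
weekday_peak = busiest_interval(means, "weekday")
weekend_peak = busiest_interval(means, "weekend")
```

Comparing `weekday_peak` with `weekend_peak` is the same comparison I used for the conclusion above: whichever group holds the larger peak mean is the more active one by that measure.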

For the second project, we analyzed the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. First, I showed the data processing steps performed prior to the analysis. Next, I summed the fatalities, injuries and economic cost for each weather event type. Finally, I ranked the weather event types by (1) public health impact and (2) economic impact. The results show that tornadoes are a significant risk in terms of injuries, fatalities and economic cost. Additionally, excessive heat poses a public health risk based on fatalities. Floods pose the greatest risk in terms of economic cost. I also published the report to RPubs.com.
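
The summing and ranking steps look roughly like this. Again, this is a hedged Python sketch and not the code from the published report. The row layout, the function names and the numbers are invented for illustration and are not the actual results. The name clean-up (trimming whitespace and upper-casing) is an assumed processing step, included only to show why cleaning has to come before the totals.

```python
from collections import defaultdict


def totals_by_event(rows):
    """Sum fatalities, injuries and cost for each weather event type.

    `rows` is a hypothetical list of
    (event_type, fatalities, injuries, cost_usd) tuples.
    """
    totals = defaultdict(lambda: {"fatalities": 0, "injuries": 0, "cost": 0.0})
    for event, fatalities, injuries, cost in rows:
        # Assumed clean-up: treat "tornado " and "TORNADO" as one event type.
        key = event.strip().upper()
        totals[key]["fatalities"] += fatalities
        totals[key]["injuries"] += injuries
        totals[key]["cost"] += cost
    return dict(totals)


def rank(totals, field):
    """Return the event types ordered from largest to smallest total."""
    return sorted(totals, key=lambda event: totals[event][field], reverse=True)


# Made-up rows, only to exercise the functions.
rows = [
    ("TORNADO", 5, 60, 2.0e6),
    ("tornado ", 3, 40, 1.0e6),
    ("EXCESSIVE HEAT", 7, 10, 0.0),
    ("FLOOD", 1, 2, 9.0e6),
]

totals = totals_by_event(rows)
by_fatalities = rank(totals, "fatalities")
by_cost = rank(totals, "cost")
```

Ranking each measure separately is what lets one event type lead on fatalities while a different one leads on economic cost.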

I earned a certificate for completing the course. The next course in the series is Statistical Inference.