Category Archives: data-science

Kaggle Credit Risk Competition

Kaggle Competition Goal

Detect which loans are at risk of default using credit application data and third-party credit data.

My Approach

Fetch the data for the Home Credit Default Risk competition from Kaggle, generate numeric and categorical features, then build models using TensorFlow, scikit-learn, and XGBoost.
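
To make the modeling step concrete, here is a minimal sketch in R using the xgboost package; the repo itself builds models in Python with TensorFlow, scikit-learn, and XGBoost, and the columns and hyperparameters below are illustrative rather than the repo's actual pipeline.

    # Minimal sketch (not the repo's pipeline): train a gradient-boosted
    # classifier on a few numeric features from application_train.csv.
    library(xgboost)

    app <- read.csv("application_train.csv")

    # Illustrative numeric features; xgboost handles NA values natively.
    feats <- c("AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY",
               "DAYS_BIRTH", "DAYS_EMPLOYED")
    X <- as.matrix(app[, feats])
    y <- app$TARGET

    model <- xgboost(data = X, label = y,
                     objective = "binary:logistic", eval_metric = "auc",
                     nrounds = 200, max_depth = 6, eta = 0.1, verbose = 0)

    pred <- predict(model, X)   # predicted probability of default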

GitHub

See my kaggle_credit_risk GitHub repo to view the source for generating features, training models, and running model experiments.

Kaggle Seizure Prediction Competition

Kaggle Competition Goal

Detect seizure (preictal) or non-seizure (interictal) segments of intracranial electroencephalography (iEEG) data. See the Kaggle EEG Competition page for more details.

My Approach

  • Extract basic stats and FFT features for non-overlapping 30-second iEEG windows (see the sketch after this list)
  • Detect signal drop out and impute missing data with mean for each feature per window
  • Predict seizure and non-seizure segments using a stacked model.
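
A minimal sketch of the featurization and imputation steps is below, assuming a segment matrix of iEEG samples (rows are time points, columns are channels) and a 400 Hz sampling rate; the statistics, FFT band, and drop-out threshold are illustrative choices, not the exact features in the repo.

    # Sketch only: `segment`, the 400 Hz rate, and the 0.5-30 Hz band are assumptions.
    fs      <- 400            # assumed sampling rate (Hz)
    win_len <- 30 * fs        # 30-second non-overlapping windows

    window_features <- function(x) {
      # A flat signal within a window is treated as drop-out and returned as NA.
      if (sd(x) < 1e-6) return(c(mean = NA, sd = NA, band_power = NA))
      spec  <- abs(fft(x))^2
      freqs <- (seq_along(x) - 1) * fs / length(x)
      c(mean = mean(x), sd = sd(x),
        band_power = sum(spec[freqs >= 0.5 & freqs <= 30]))
    }

    n_win <- floor(nrow(segment) / win_len)
    feats <- do.call(rbind, lapply(seq_len(n_win), function(w) {
      rows <- ((w - 1) * win_len + 1):(w * win_len)
      unlist(lapply(seq_len(ncol(segment)),
                    function(ch) window_features(segment[rows, ch])))
    }))

    # Impute missing values with the mean of each feature column across windows.
    for (j in seq_len(ncol(feats))) {
      feats[is.na(feats[, j]), j] <- mean(feats[, j], na.rm = TRUE)
    }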

Model Details

For more details about the model, see my GitHub repo with the documentation and R code.

Completed Coursera Data Science Specialization

I completed the 10-course data science specialization by Johns Hopkins University on Coursera.

Here is my specialization certificate:
https://www.coursera.org/account/accomplishments/specialization/JKSAW82GLH35

Retrospective

I enjoyed the course, but it took me way more time than I expected because I struggled with a few issues:

  • First, I wish I had taken the NLP online course (https://www.youtube.com/watch?v=-aMYz1tMfPg) before starting the Capstone.
  • There were issues installing RWeka and rJava, and it took me several days to work through them. I eventually moved to quanteda (https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html).
  • I also waited far too long to develop a method for testing my model on a subset of the training data, so I could tell whether changes improved or hurt performance. It turns out that my model trained on a 25% sample performed just as well as a model trained on 100% of the data. I should have spent more time trying different models on the 25% sample (see the sketch after this list).
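
Here is the kind of subsetting I mean, as a minimal sketch only: the file name follows the Capstone's SwiftKey data, and build_ngram_model() and perplexity() are hypothetical stand-ins for whatever training and evaluation functions you use.

    # Sketch: build_ngram_model() and perplexity() are hypothetical placeholders.
    set.seed(42)
    lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

    # Hold out a validation set, then train on a 25% sample vs. the full set.
    val_idx    <- sample(seq_along(lines), size = floor(0.1 * length(lines)))
    validation <- lines[val_idx]
    training   <- lines[-val_idx]
    train_25   <- sample(training, size = floor(0.25 * length(training)))

    model_25  <- build_ngram_model(train_25)
    model_all <- build_ngram_model(training)
    perplexity(model_25, validation)    # if these two scores are close,
    perplexity(model_all, validation)   # the 25% sample is good enough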

I'm thankful for the discussion forum and the final peer-review process; both helped me learn how to improve my model and demo application. I really appreciate the instructors for creating this specialization. I've learned a lot.

Activity Recognition of Weight Lifting Exercises Data

Course Project for Practical Machine Learning by Johns Hopkins University on Coursera

The project includes the following report:

Qualitative Activity Recognition of Weight Lifting Exercises Data: In this project, we use R to build a classifier from the sensor data. The data consists of a training set containing over 19,000 samples, each with 152 predictor variables and a classe outcome variable taking the value ‘A’, ‘B’, ‘C’, ‘D’, or ‘E’. The testing set consists of 20 samples without the classe outcome variable. The goal is to build a classifier from the training data to predict the classe of each testing sample.
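
A minimal sketch of that setup in R is below; the pml-training.csv and pml-testing.csv file names follow the course project, while the random-forest model and 5-fold cross-validation are illustrative choices rather than necessarily the report's final model.

    # Sketch: the model choice and column filtering here are illustrative.
    library(caret)

    training <- read.csv("pml-training.csv")
    testing  <- read.csv("pml-testing.csv")

    # Keep numeric sensor columns with no missing values; drop bookkeeping columns.
    numeric_cols  <- names(training)[sapply(training, is.numeric)]
    complete_cols <- names(training)[colSums(is.na(training)) == 0]
    predictors    <- setdiff(intersect(numeric_cols, complete_cols),
                             c("X", "raw_timestamp_part_1",
                               "raw_timestamp_part_2", "num_window"))

    fit <- train(x = training[, predictors], y = factor(training$classe),
                 method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

    predict(fit, newdata = testing[, predictors])   # classe predictions for the 20 samples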

Data
The project uses sensor data collected by Groupware@LES.

Review of Reproducible Research by Johns Hopkins University on Coursera

This week I completed the course Reproducible Research by Johns Hopkins University on Coursera. The course introduces tools for publishing research documents that contain data processing code, raw data, and results. Research is "reproducible" if an independent researcher can fetch the code, fetch the data, execute the scripts, and verify the results.

IMO, this is akin to the software engineering practices of Software Quality Assurance, Code Reviews, and Continuous Integration. These practices are meant to solve the problem where the code "works on my machine" but nowhere else. This is extremely important in bioinformatics because erroneous research can lead to erroneous clinical trials, as described in the lecture The Importance of Reproducible Research in High-Throughput Biology.

Key Lectures
My favorite lecture of the course was The Importance of Reproducible Research in High-Throughput Biology, given by Keith A. Baggerly, Ph.D., of the MD Anderson Cancer Center in Houston, TX. The lecture discusses Dr. Baggerly's attempt to reverse engineer the results of a study that had numerous errors. See this NYT article for more details.

Projects
For the first project, we analyze activity monitoring data created by a fitness tracker. First, I calculate the mean number of steps for each 5-minute interval, grouped into weekday and weekend intervals (i.e., one group for Monday-Friday intervals and one group for Saturday-Sunday intervals). I conclude that the user is most active on weekdays because the interval with the highest mean step count falls in the weekday group.
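
A minimal sketch of that aggregation, assuming the course's activity.csv with its steps, date, and interval columns:

    # Sketch of the weekday/weekend aggregation described above.
    library(dplyr)

    activity <- read.csv("activity.csv")

    activity %>%
      mutate(day_type = ifelse(weekdays(as.Date(date)) %in% c("Saturday", "Sunday"),
                               "weekend", "weekday")) %>%
      group_by(day_type, interval) %>%
      summarise(mean_steps = mean(steps, na.rm = TRUE), .groups = "drop") %>%
      group_by(day_type) %>%
      slice_max(mean_steps, n = 1)   # most active 5-minute interval per group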

For the second project, we analyze the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. First, I show the data processing steps performed prior to the analysis. Next, I calculate the total number of fatalities, the total number of injuries, and the total economic cost per weather event type. Finally, I rank the weather event types by (1) public health impact and (2) economic impact. The results show that tornadoes pose a significant public health risk in terms of injuries, fatalities, and economic cost. Additionally, excessive heat poses a public health risk based on fatalities. Floods pose the greatest risk in terms of economic cost. I also published the report to RPubs.com.
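
A minimal sketch of the ranking step, assuming the file name below and the data's EVTYPE, FATALITIES, and INJURIES columns (the economic-cost calculation, which needs the PROPDMG/CROPDMG columns and their exponent codes, is omitted here):

    # Sketch: rank weather event types by total fatalities and injuries.
    library(dplyr)

    storms <- read.csv("repdata_data_StormData.csv.bz2")   # assumed file name

    impact <- storms %>%
      group_by(EVTYPE) %>%
      summarise(fatalities = sum(FATALITIES),
                injuries   = sum(INJURIES),
                .groups = "drop")

    impact %>% arrange(desc(fatalities)) %>% head(10)   # top event types by fatalities
    impact %>% arrange(desc(injuries))   %>% head(10)   # top event types by injuries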

I earned a certificate for completing the course. The next course in the series is Statistical Inference.