PERSONALIZED MEDICINE: REDEFINING CANCER TREATMENT

Predict the effect of Genetic Variants to enable Personalized Medicine

This data set was retrieved from the Kaggle competition platform and can be downloaded from www.kaggle.com

Kaggle ran this competition in partnership with Memorial Sloan Kettering Cancer Center (MSKCC).

A lot has been said during the past several years about how precision medicine and, more concretely, genetic testing are going to disrupt the way diseases like cancer are treated.

Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature.

MSKCC provided an expert-annotated knowledge base where world-class researchers and oncologists have manually annotated thousands of mutations.

We need to develop a Machine Learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

Data Description

A genetic mutation can be classified into one of nine different classes.

This is not a trivial task since interpreting clinical evidence is very challenging even for human specialists. Therefore, modeling the clinical evidence (text) will be critical for the success of your approach.

Both the training and the test data sets are provided as two separate files. One (training_variants / test_variants) provides the information about the genetic mutations, whereas the other (training_text / test_text) provides the clinical evidence (text) that the human experts used to classify the genetic mutations. The two files are linked via the ID field.

For example, the genetic mutation (row) with ID=15 in the file training_variants was classified using the clinical evidence (text) from the row with ID=15 in the file training_text.
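The ID-based link described above can be sketched with pandas. This is only an illustration on in-memory stand-in data: the column names (ID, Gene, Variation, Class, Text), the "||" separator in the text file, and the sample values are assumptions, not taken from this write-up, so check them against the files you download.

```python
import io
import pandas as pd

# Stand-in for the variants file (assumed to be a plain CSV).
variants_raw = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "15,GENE_A,VARIANT_X,4\n"
    "16,GENE_B,VARIANT_Y,1\n"
)

# Stand-in for the text file (assumed to use "||" between ID and text,
# with a one-line header that is skipped).
text_raw = io.StringIO(
    "ID,Text\n"
    "15||Clinical evidence for row 15, including commas.\n"
    "16||Clinical evidence for row 16.\n"
)

variants = pd.read_csv(variants_raw)
text = pd.read_csv(
    text_raw,
    sep=r"\|\|",        # regex separator, so the python engine is required
    engine="python",
    header=None,
    skiprows=1,
    names=["ID", "Text"],
)

# Join the mutation metadata with its clinical evidence on the shared ID.
merged = variants.merge(text, on="ID", how="inner")

evidence_15 = merged.loc[merged["ID"] == 15, "Text"].iloc[0]
print(merged)
```

After the merge, each row carries both the mutation description and the text that was used to classify it.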

Finally, to make it more exciting, some of the test data is machine-generated to prevent hand labeling. You will submit all the results of your classification algorithm, and we will ignore the machine-generated samples.

Model

An XGBoost model was developed to predict, for each sample, the probability of each of the nine mutation classes.

The complete project notebook can be found in my GitHub repository.