Easy Start

In this contest, we provide you with five data sets (which you can download from Google Drive) that we have extracted from TG-GATEs for you.

You can use these five data sets for the contest. However, since data and data preprocessing are extremely important in Data Science, we strongly recommend that you read the Data Preprocessing section to appreciate the real situation of Big Data and Data Science in Biology and Medicine. In my view, this is the most important thing you should take home after the contest.

Next, let’s get a general idea of our data. We show this in R, but you can use other programming languages (e.g. Python or MATLAB) for this purpose too. Actually, we use Python for all of the analysis that follows once we have these five data sets, because Python gives us popular machine learning (scikit-learn) and deep learning (TensorFlow) tools.

Data description

exprs_train.csv - The drug-response gene differential expression data that you should use to train your model
exprs_test.csv - The data on which you make predictions; we will calculate a score for each team based on the unpublished data “pathology_test.csv”
pathology_train.csv - The labels corresponding to the samples in exprs_train.csv
go.bp.roche.symbols.gmt - A GMT format file containing gene set/pathway information from MSigDB
path.reactomeV55.roche.symbols.gmt - A GMT format file containing gene set/pathway information from Reactome

How to use these data

Drug-response gene differential expression data

The “exprs” data here are drug-response gene differential expression data from rat in vivo liver, collected from TG-GATEs. For the details, please check the Data Preprocessing section. But if you want to get started as soon as possible, you can regard it simply as a matrix whose rows are samples and whose columns are features.

We load the exprs_train data in R and explain it a bit more, as you will need this knowledge in stages two and three.

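# The first column holds the sample names, so use it as row names
exprs_train <- read.csv("../TGGATEs_tutorial_secrete/data/exprs_train.csv", header = TRUE, row.names = 1)
# Show the first rows of the first six gene columns
head(exprs_train[, 1:6])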

##                                 SART3       MIEN1         IMP4       SAR1B
## omeprazole_3 hr_Low      -0.136554174 -0.11181198 -0.218282335 -0.03260668
## hydroxyzine_4 day_Middle -0.158193103 -0.04118753  0.070575646 -0.05022884
## hydroxyzine_29 day_Low   -0.005548889 -0.05259548 -0.168798439  0.06060698
## quinidine_8 day_Middle   -0.065046300 -0.07373302 -0.007335606  0.08516626
## disopyramide_6 hr_High   -0.041048175 -0.07277274  0.067440315  0.16439248
## diltiazem_6 hr_High      -0.084768276  0.21114965  0.220124107  0.11420903
##                               PPHLN1        SSR3
## omeprazole_3 hr_Low      -0.16402278 -0.14767145
## hydroxyzine_4 day_Middle -0.02678692 -0.18726024
## hydroxyzine_29 day_Low   -0.04301561  0.01234981
## quinidine_8 day_Middle   -0.08804269 -0.22610470
## disopyramide_6 hr_High    0.19531008 -0.02047350
## diltiazem_6 hr_High       0.21623079  0.31194707

dim(exprs_train)

## [1] 2822 5613
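
Since the later analysis is done in Python, the same load can be sketched there as well. The snippet below is only a minimal sketch using the standard library: `load_matrix` is a hypothetical helper name, and it assumes the CSV header has a leading cell for the row-name column. The tiny in-memory file stands in for exprs_train.csv and reuses rounded values from the head() output above.

```python
import csv
import io


def load_matrix(handle):
    """Read a CSV whose first column holds row names and whose header holds column names."""
    reader = csv.reader(handle)
    header = next(reader)
    col_names = header[1:]  # assumption: the first header cell labels the row-name column
    row_names, values = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        row_names.append(row[0])
        values.append([float(x) for x in row[1:]])
    return row_names, col_names, values


# Tiny stand-in for exprs_train.csv (values rounded from the head() output).
sample_csv = io.StringIO(
    ",SART3,MIEN1,IMP4\n"
    "omeprazole_3 hr_Low,-0.1366,-0.1118,-0.2183\n"
    "hydroxyzine_4 day_Middle,-0.1582,-0.0412,0.0706\n"
)
rows, cols, matrix = load_matrix(sample_csv)
print(len(rows), len(cols))  # number of samples, number of features
```

To read the real file, pass `open(path, newline="")` instead of the in-memory stand-in; the two printed numbers should then match the dim() output above.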

As you can see, the row names of the expression data have the format “drug/compound_time_dose level”. Each row is therefore the liver gene differential expression data for a rat that was administered “drug/compound” at “dose level” and sacrificed at “time” for measurement. Starting from the “drug/compound” name, you can collect more information for ‘making-a-bigger-box’ in stage 3.
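
The row-name format can be taken apart with a few lines of Python. This is a sketch, not part of the contest code: `SampleInfo` and `parse_sample_name` are hypothetical names, and splitting from the right is an assumption made so that a compound name containing an underscore would stay in one piece.

```python
from typing import NamedTuple


class SampleInfo(NamedTuple):
    compound: str
    time: str
    dose: str


def parse_sample_name(name: str) -> SampleInfo:
    """Split a row name such as 'omeprazole_3 hr_Low' into compound, time and dose level."""
    # Split from the right so that a compound name containing "_" stays intact.
    parts = name.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected sample name: {name!r}")
    return SampleInfo(*parts)


info = parse_sample_name("hydroxyzine_29 day_Low")
print(info.compound, "|", info.time, "|", info.dose)  # hydroxyzine | 29 day | Low
```

The compound field is the natural key for looking up extra information about each drug.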

The column names of the expression data are gene names, so each column is the differential expression level of one gene across all the samples. You may have noticed already that we have more features (5613) than samples (2822) in our data. This is a general problem that we encounter when applying machine learning techniques in Biology/Medicine. The aim of stage 2 is to use feature selection/extraction methods to reduce the number of features while keeping equivalent predictive power.
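
As one illustration of what reducing the number of features can look like, the sketch below keeps the k genes whose values vary most across samples. A variance filter is just one simple, unsupervised choice picked for this example; it is not the method prescribed for stage 2. The function names, gene names G1 to G3 and the numbers are all made up.

```python
from statistics import pvariance


def top_variance_features(matrix, col_names, k):
    """Return the names of the k columns with the largest variance across samples."""
    columns = list(zip(*matrix))  # transpose: one tuple of values per feature
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(col_names)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # restore the original column order
    return [col_names[j] for j in keep]


def select_columns(matrix, col_names, keep_names):
    """Return the matrix restricted to the columns listed in keep_names."""
    idx = [col_names.index(name) for name in keep_names]
    return [[row[j] for j in idx] for row in matrix]


# Toy data: 3 samples x 3 hypothetical genes.
toy_genes = ["G1", "G2", "G3"]
toy_matrix = [
    [0.0, 1.0, 5.0],
    [0.1, -1.0, 5.0],
    [0.0, 2.0, 5.3],
]
kept = top_variance_features(toy_matrix, toy_genes, k=2)
reduced = select_columns(toy_matrix, toy_genes, kept)
print(kept)
```

Whatever method you choose, fit the selection on the training data only and then apply the same column list to the test data.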

Pathology data

Pathology data are pathological records corresponding to samples in expression data.

# Row names are the same sample names as in exprs_train
pathology_train <- read.csv("../TGGATEs_tutorial_secrete/data/pathology_train.csv", header = TRUE, row.names = 1)
head(pathology_train)

##                          Microgranuloma Increasedmitosis Hypertrophy
## omeprazole_3 hr_Low                   0                0           0
## hydroxyzine_4 day_Middle              0                0           0
## hydroxyzine_29 day_Low                0                0           0
## quinidine_8 day_Middle                0                0           0
## disopyramide_6 hr_High                0                0           0
## diltiazem_6 hr_High                   0                0           0

As you can see, we provide you with three pathologies: “Microgranuloma”, “Increasedmitosis” and “Hypertrophy”. During the contest, it’s up to you whether to predict them separately or to predict all three simultaneously (for example, by multi-label classification).
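
The two options correspond to two views of the same label matrix, which the following sketch makes concrete. It assumes the labels are coded 0/1, as the head() output suggests; the helper names and the toy rows are made up and do not come from the real data.

```python
from collections import Counter

PATHOLOGIES = ["Microgranuloma", "Increasedmitosis", "Hypertrophy"]


def per_pathology_targets(label_rows):
    """Separate view: one binary target list per pathology, for three independent models."""
    return {name: [row[j] for row in label_rows] for j, name in enumerate(PATHOLOGIES)}


def label_combinations(label_rows):
    """Multi-label view: count how often each combination of the three labels occurs."""
    return Counter(tuple(row) for row in label_rows)


# Toy label matrix (4 samples x 3 pathologies), made up for illustration.
toy_labels = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]
targets = per_pathology_targets(toy_labels)
combos = label_combinations(toy_labels)
print(targets["Hypertrophy"])
print(combos.most_common(1))
```

Looking at the combination counts on the real training labels is a quick way to see how often pathologies co-occur, which is what a multi-label model could exploit.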

Pathway data

The two GMT format files contain pathway/gene set information that you can exploit if you decide to use gene-set enrichment analysis for feature selection.

Each row in a GMT format file is one gene set, described by a “gene set name”, a “gene set description” and the genes in the gene set.
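
GMT files are tab-delimited, so a row can be read by splitting on tabs: the first field is the name, the second the description, and the rest are genes. The sketch below parses that layout and then keeps only the genes that appear as columns in the expression data. `read_gmt` and `restrict_to_measured` are hypothetical helper names, and the two toy gene sets are made up (they only borrow gene symbols from the head() output above).

```python
import io


def read_gmt(handle):
    """Parse GMT lines into {gene set name: (description, set of genes)}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:  # need a name, a description and at least one gene
            continue
        name, description = fields[0], fields[1]
        genes = {g for g in fields[2:] if g}  # drop empty trailing fields
        gene_sets[name] = (description, genes)
    return gene_sets


def restrict_to_measured(gene_sets, measured_genes):
    """Keep only the genes of each set that are present among the measured genes."""
    measured = set(measured_genes)
    return {name: genes & measured for name, (_, genes) in gene_sets.items()}


# Toy GMT content; set names, descriptions and groupings are invented.
toy_gmt = io.StringIO(
    "SET_A\tmade-up description\tSART3\tIMP4\tNOT_MEASURED\n"
    "SET_B\tmade-up description\tSAR1B\n"
)
sets = read_gmt(toy_gmt)
restricted = restrict_to_measured(sets, ["SART3", "MIEN1", "IMP4", "SAR1B"])
print(sorted(restricted["SET_A"]))
```

With the real files, the measured genes would be the column names of exprs_train, and each restricted gene set gives a group of columns that can be scored or summarized together.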