Clean and normalize this data (to at least 1NF) and load it into a new schema called clinvar, in a
table called variants. From the VCF specification and the downloaded ClinVar data, infer which
attributes are needed and how to identify each row and each column (hint: rows end with \n, columns
are separated by \t, and header lines start with ##). Pay close attention to CHROM, POS, ID,
REF, ALT and INFO (the really nasty one that violates 1NF…). Because INFO is ‘;’ separated, you
must expand it into several columns, of which CLNVC, CLNSIG, RS and especially GENEINFO are
needed. You may discard the rest if you wish, and you may use any programming language or processing
tool to complete this section.
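As a rough illustration of the expansion step, the sketch below splits one tab-separated VCF data line into its fixed columns and pulls the four required keys out of INFO. The class name VcfLineParser, the method parseLine and the sample record are made up for this example; it assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and is not a complete loader.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns one VCF data line into a flat row (one value per column).
class VcfLineParser {
    // INFO keys the assignment asks to keep; every other key is discarded.
    private static final List<String> WANTED =
            Arrays.asList("CLNVC", "CLNSIG", "RS", "GENEINFO");

    // Returns null for empty lines and header lines (anything starting with '#').
    static Map<String, String> parseLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] f = line.split("\t", -1);
        if (f.length < 8) {
            throw new IllegalArgumentException("expected at least 8 tab-separated fields");
        }
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHROM", f[0]);
        row.put("POS", f[1]);
        row.put("ID", f[2]);
        row.put("REF", f[3]);
        row.put("ALT", f[4]);
        // f[5] is QUAL and f[6] is FILTER; they are not needed here.
        for (String key : WANTED) {
            row.put(key, null); // stays null when the key is absent from INFO
        }
        for (String pair : f[7].split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // flag-style entry without a value
            }
            String key = pair.substring(0, eq);
            if (WANTED.contains(key)) {
                row.put(key, pair.substring(eq + 1));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up sample record, not taken from the real ClinVar file.
        String sample = "1\t12345\t100\tG\tA\t.\t.\t"
                + "ALLELEID=1;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;GENEINFO=X:1234;RS=42";
        System.out.println(parseLine(sample));
    }
}
```

Each returned map corresponds to one candidate row of clinvar.variants; inserting it into the database (for example through a JDBC PreparedStatement) is left out of this sketch.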
(See the included NetBeans 8.2 project, which has the Connector/J driver .jar file.)
(2a) Write a Java application that returns all rows that are Pathogenic (hint: CLNSIG=Pathogenic).
(2b) Of the genes that are Pathogenic, what is the frequency? This is a histogram of gene symbol to number of occurrences (hint: look at GENEINFO=X:1234 and split on ‘:’ to get the symbol ‘X’ out).
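One possible shape for the counting step in (2b) is sketched below. It assumes the GENEINFO values have already been fetched, for instance with a query along the lines of SELECT GENEINFO FROM clinvar.variants WHERE CLNSIG = 'Pathogenic' (table and column names follow the assignment text). The class name GeneHistogram and the sample values are hypothetical. The sketch also assumes that a GENEINFO value may list several symbol:id pairs separated by '|'; check the ## header of the downloaded file before relying on that.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts how often each gene symbol occurs in a list of GENEINFO values.
class GeneHistogram {
    static Map<String, Integer> count(List<String> geneInfoValues) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by gene symbol
        for (String geneInfo : geneInfoValues) {
            if (geneInfo == null || geneInfo.isEmpty()) {
                continue; // row without GENEINFO
            }
            // Assumption: several genes may be packed into one value as "A:1|B:2".
            for (String pair : geneInfo.split("\\|")) {
                String symbol = pair.split(":", 2)[0]; // "X:1234" -> "X"
                if (!symbol.isEmpty()) {
                    counts.merge(symbol, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the result of the Pathogenic query.
        List<String> sample = Arrays.asList("X:1234", "Y:99", "X:1234", "X:1234|Z:7");
        System.out.println(count(sample)); // prints {X=3, Y=1, Z=1}
    }
}
```

The same histogram could be computed in SQL with GROUP BY once the gene symbol is stored in its own column; doing it in Java keeps the split on ‘:’ from the hint visible.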
CREATE TABLE VCFInfo
( AF_ESP FLOAT, AF_EXAC FLOAT, AF_TGP FLOAT, ALLELEID INT,
  CLNDN varchar2(500), CLNDNINCL varchar2(500), CLNDISDB varchar2(500), CLNDISDBINCL varchar2(500),
  CLNHGVS varchar2(500), CLNREVSTAT varchar2(500), CLNSIG varchar2(500), CLNSIGCONF varchar2(500),
  SSR varchar2(500) );