A selected list of Seth Dobrin's publications and articles.
How IBM is advancing AI governance to help clients build trust and transparency
As more and more organizations scale their use of AI, they’re challenged with mitigating the associated risks and building genuine trust in AI decision-making. When it comes to trustworthy AI, we believe that consumers, clients and all stakeholders need to know how AI impacts their day-to-day lives, organizations, and work. …
New Certifications to Help Close the Data Scientist Deficit
In this era of swelling data, the mining of insights to predict future outcomes with greater accuracy, to automate tasks, and to recommend actions based on that data is growing increasingly critical for organizations and businesses of all sizes...
Martin Fleming and Seth Dobrin
Helping Data Science Flourish – One Client at a Time
By now, in 2018, most enterprises have established some type of program to utilize math to help them make money. For many this means taking advantage of data science, or more specifically, machine learning. But despite seemingly widespread adoption, some studies show that more than two-thirds of these companies are failing to realize value from their data work ...
Countdown to GDPR: Yes, it applies to you—no, you don’t have to panic
Are you dealing with information that belongs to EU subjects? Does your company have a “Data Protection Officer”? If the answer to the first question is yes and the answer to the second is no, then the new General Data Protection Regulation (GDPR) probably applies to you, and you might not be prepared to comply. That’s ok. There’s still time. But first…
Unifying Data Governance for the Future
Mastering fast-growing data volumes across the enterprise is one of the first and most critical steps in establishing a cognitive business. To do it requires adopting advanced analytics that enable an organization to better understand and control its data, but also to gain insights that set the stage for driving new business models…
Introduction: It was obvious from the start that the data lake was a different type of project. It was so much more than new data processing technology built around the Apache Hadoop open source platform. The data lake needs a new type of information governance, and this governance affects every aspect of the way an organization collects, processes, and governs their data—challenging traditional lines of control and ownership. However, when we began the partnership between IBM and ING, none of us realized the true extent of the impact it would have, both to an organization’s operation and the way we design data driven solutions...
Mandy Chessell, Ferd Scheepers, Maryna Strelchuk, Ron van der Starre, Seth Dobrin
Journey to Digital Series
Raiders of Every Industry: The Journey to Digital
Companies have a choice today. They can be the disrupted or the disruptor. Every industry will be disrupted in the coming years. None are safe. The safer they seem, the more susceptible they probably are. In fact, only two things stand a chance of protecting incumbents ...
The Journey to Digital: Part 2, Data Transformation
Companies have a choice today. They can be the disrupted or the disruptor. I laid out the case for this in Raiders of Every Industry: The Journey to Digital and Journey To Digital: Part 1, Table Stakes.
In those introductory posts, I note that becoming digital is typically a three-phase journey ...
The Journey to Digital: Part 1, Table Stakes
In the initial post in this series, I gave a picture of the entire digital journey. With this post, let’s dive into part 1 of that journey: Table stakes. Effectively transforming a company requires a commitment to do things differently, and requires that your partners and vendors do things differently too ...
The Journey to Digital: Part 3, Insight Transformation
In those previous posts, I laid out our perspective on what a true digital transformation requires. I introduced the need for automation in the form of machine learning in software and platforms and described how I advise clients to build their data strategies as core data assets.
The Journey to Digital: Part 4 The Final Chapter, Digital Nirvana
The final stage is where the enterprise really begins to see fundamental changes in its business operations and performance. These changes include how you do business, what you offer, how you offer it, and to whom ...
Rise of the Policy Catalog
The term ‘data governance’ is of course deceptively simple. In reality, it refers to multiple, interlocking, co-dependent categories of data management, organizational collaboration, and policy — from data ownership to data supply chains, and from data protection and legal holds through to compliance with regulations ...
Machine Learning and Governance: An Interview with Seth Dobrin
Well, I’ll first say that we’re already doing a lot of things right, and it’s clear that our customers trust us and we make some of the best products on the market — just ask the analysts ...
Don’t Let Data Science Become a Scam
Companies have been sold on the alchemy of data science. They have been promised transformative results. They modeled their expectations after their favorite digital-born companies. They have piled a ton of money into hiring expensive data scientists and ML engineers ...
Six Steps Up: From Zero to Data Science for the Enterprise
Data has intrinsic value to the enterprise, but how to quantify these data assets has been a struggle for many organizations and for many enterprises as they establish modern data practices and data organizations ...
Below is a selection of Seth's academic publications and research.
Abstract: Pairwise distance data for maize (Zea mays L.) inbred lines generated using sets of single nucleotide polymorphisms (SNPs) selected from a 50k Infinium array were compared with pairwise distances generated using a set of 163 simple sequence repeat (SSR) loci previously identified to help determine essentially derived variety (EDV) status (UPOV, 1991). Final comparisons were made using 26,874 SNPs after discarding SNPs with insufficient data quality or vulnerability to ascertainment bias. Inbred lines developed in the United States or in western Europe that had been previously published to establish SSR-based thresholds provided the means to determine equivalent SNP-based protocols. Use of 3072 SNPs selected to provide even genomic coverage according to genetic and physical maps provided robust, precise, high discrimination among inbred lines with consistent zonal classification with up to 20% missing data. Comparisons of intercepts and slopes for SSR and SNP inbred pairwise distance data translated the 82% SSR green-orange similarity threshold to 91% using SNPs and the 90% SSR orange-red threshold to 95% using SNPs. Information required to conduct analyses using these 3072 SNPs is presented.
Yves Rousselle, Elizabeth Jones, Alain Charcosset, Philippe Moreau, Kelly Robbins, Benjamin Stich, Carsten Knaak, Pascal Flament, Zivian Karaman, Jean-Pierre Martinant, Michael Fourneau, Alain Taillardat, Michel Romestant, Claude Tabel, Javier Bertran, Nicolas Ranc, Denis Lespinasse, Philippe Blanchard, Alex Kahler, Jialiang Chen, Jonathan Kahler, Seth Dobrin, Todd Warner, Ron Ferris, Stephen Smith
Background and Purpose— Inbred mouse strains C57BL/6J (B6) and C3H/HeJ (C3H) exhibit marked differences in atherosclerotic lesion formation in the carotid arteries on the apolipoprotein E–deficient (apoE−/−) background when fed a Western diet. Quantitative trait locus analysis was performed on an intercross between B6.apoE−/− and C3H.apoE−/− mice to determine genetic factors contributing to variation in the phenotype.
Methods— Female B6.apoE−/− mice were crossed with male C3H.apoE−/− mice to generate F1 hybrids, which were intercrossed to generate 241 female F2 progeny. At 6 weeks of age, F2 mice were started on a Western diet. After being fed the diet for 12 weeks, F2 mice were analyzed for phenotypes such as lesion size in the left carotid arteries and plasma lipid levels and typed for 154 genetic markers spanning the mouse genome.
Results— One significant quantitative trait locus, named CAth1 (25 cM, log of the odds score: 4.5), on chromosome 12 and 4 suggestive quantitative trait loci, on chromosomes 1, 5, 6, and 11, respectively, were identified to influence carotid lesion size. One significant quantitative trait locus on distal chromosome 1 accounted for major variations in plasma low-density lipoprotein/very-low-density lipoprotein, high-density lipoprotein cholesterol, and triglyceride levels. Carotid lesion size was not significantly correlated with plasma low-density lipoprotein/very-low-density lipoprotein or high-density lipoprotein cholesterol levels.
Conclusions— These data indicate that the loci for carotid lesions do not overlap with those for aortic lesions as identified in a previous cross derived from the same parental strains, and carotid atherosclerosis and plasma lipids are controlled by separate genetic factors in the B6 and C3H mouse model.
Abstract: Adolescent idiopathic scoliosis (AIS) is a common disorder with strong evidence for genetic predisposition. Quantitative trait loci (QTLs) for AIS susceptibility have been identified on chromosomes. We performed a genome‐wide genetic linkage scan in seven multiplex families using 400 marker loci with a mean spacing of 8.6 cM. We used Genehunter Plus to generate linkage statistics, expressed as homogeneity (HLOD) scores, under dominant and recessive genetic models. We found a significant linkage signal on chromosome 12p, whose support interval extends from near 12pter, spanning approximately 10 million bases or 31 cM. Fine mapping within the region using 20 additional markers reveals maximum HLOD = 3.7 at 5 cM under a dominant inheritance model, and a split peak maximum HLOD = 3.2 at 8 and 18 cM under a recessive inheritance model. The linkage support interval contains 95 known genes. We found evidence suggestive of linkage on chromosomes 1, 6, 7, 8, and 14. This study is the first to find evidence of an AIS susceptibility locus on chromosome 12. Detection of AIS susceptibility QTLs on multiple chromosomes in this and other studies demonstrates that the condition is genetically heterogeneous.
Abstract: Background – This paper presents a retrospective statistical study on the newly-released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates if such techniques can benefit from including demographic and clinical data.
Results – We compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. By including demographic and clinical data, classification performance improves to 0.97 and 0.94 respectively.
Conclusion – This paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal control at a very high rate. Moreover, it shows that classification performance improves by including demographic and clinical data. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated to the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated to the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.
Abstract: A genome‐wide scan in 60 bipolar affective disorder (BPAD) affected sib‐pairs (ASPs) identified linkage on chromosome 21 at 21q22 (D21S1446, NPL = 1.42, P = 0.08), a BPAD susceptibility locus supported by multiple studies. Although this linkage only approaches significance, the peak marker is located 12 Kb upstream of S100B, a neurotrophic factor implicated in the pathology of psychiatric disorders, including BPAD and schizophrenia. We hypothesized that the linkage signal at 21q22 may result from pathogenic disease variants within S100B and performed an association analysis of this gene in a collection of 125 BPAD type I trios. S100B single nucleotide polymorphisms (SNPs) rs2839350 (P = 0.022) and rs3788266 (P = 0.031) were significantly associated with BPAD. Since variants within S100B have also been associated with schizophrenia susceptibility, we reanalyzed the data in trios with a history of psychosis, a phenotype in common between the two disorders. SNPs rs2339350 (P = 0.016) and rs3788266 (P = 0.009) were more significantly associated in the psychotic subset. Increased significance was also obtained at the haplotype level. Interestingly, SNP rs3788266 is located within a consensus‐binding site for Six‐family transcription factors suggesting that this variant may directly affect S100B gene expression. Fine‐mapping analyses of 21q22 have previously identified transient receptor potential gene melastatin 2 (TRPM2), which is 2 Mb upstream of S100B, as a possible BPAD susceptibility gene at 21q22. We also performed a family‐based association analysis of TRPM2 which did not reveal any evidence for association of this gene with BPAD. Overall, our findings suggest that variants within the S100B gene predispose to a psychotic subtype of BPAD, possibly via alteration of gene expression. © 2007 Wiley‐Liss, Inc.
Abstract: Bipolar disorder (BPD) is a complex genetic disorder with cycling symptoms of depression and mania. Despite the extreme complexity of this psychiatric disorder, attempts to localize genes which confer vulnerability to the disorder have had some success. Chromosomal regions including 4p16, 12q24, 18p11, 18q22, and 21q21 have been repeatedly linked to BPD in different populations. Here we present the results of a whole genome scan for linkage to BPD in an Irish population. Our most significant result was at 14q24 which yielded a non‐parametric LOD (NPL) score of 3.27 at the D14S588 marker with a nominal P‐value of 0.0006 under a narrow (bipolar type I only) model of affection. We previously reported linkage to 14q22‐24 in a subset of the families tested in this analysis. We also obtained suggestive evidence for linkage at 4q21, 9p21, 12q24, and 16p13, chromosomal regions that have all been previously linked to BPD. Additionally, we report on a novel approach to linkage analysis, STRUCTURE‐Guided Linkage Analysis (SGLA), which is designed to reduce genetic heterogeneity and increase the power to detect linkage. Application of this technique resulted in more highly significant evidence for linkage of BPD to three regions including 16p13, a locus that has been repeatedly linked to numerous psychiatric disorders.
Abstract: This study compares and contrasts three different high-density single nucleotide polymorphism genotyping platforms using data generated on the 270 HapMap samples. The differences in minor allele frequencies are evaluated, coverage across the entire genome is assessed using r², and the coverage of the ENCyclopedia Of DNA Elements (ENCODE) regions is compared using both single- and multi-point evaluation. All of these analyses are carried out on the three HapMap populations.
SA Tishkoff, FA Reed, A Froment, MW Smith, SM Williams, SA Omar, MJ Kotze, GS Pretorius, M Ibrahim, O Doumbo, M Thera, C Wambebe, SE Dobrin, JL Weber
Siobhan Roche, Fiona Cassidy, Chengfeng Zhao, Jonathon Badger, Lisa Mooney, Catherine Delaney, Seth Dobrin, Patrick McKeon
Abstract: Our laboratory has been testing ways to reduce costs, sample volumes, and labor in microsatellite (or short tandem repeat polymorphism) genotyping. Microsatellite genotyping involves polymerase chain reaction amplification of a short (100–400 bp) fragment of chromosomal DNA that encompasses the tandem repeats followed by electrophoresis to size the amplification products. Using a continuous polypropylene tape (array tape) embossed with 384-well arrays, conforming to the microtiter plate standard, we have been able to perform the amplification reactions in smaller volumes and to decrease handling of stacks of microtiter plates. Instruments were constructed in-house to achieve these results. However, the problem of removal of the samples from the tape for electrophoresis remained. We report here efficient piercing of the tape seal for extraction of the samples using a CO2 laser. Scoring of the seals with the laser weakens them sufficiently to permit extraction of the samples with a syringe array. CO2 lasers are robust systems that do not contain many frequently replaced parts and do not require frequent recalibration. In addition, the laser is software controlled, allowing for highly reproducible scoring and easy switching between 384-, 1536-, and 96-well formats.
Abstract: Until now, performing whole-genome association studies has been an unattainable, but highly desirable, goal for geneticists. With the recent advent of high-throughput genotyping platforms, this goal is now a reality for geneticists today and for clinicians in the not-so-distant future. This review will cover a broad range of topics to provide an overview of this emerging branch of genetics, and will provide references to more specific sources. Specifically, this review will cover the technologies available today and in the near future, the specific types of whole-genome association studies, the benefits and limitations of these studies, the applications to complex disease–gene interactions, diagnostic devices, therapeutics, and finally, we will describe the 5-year perspective and key issues.
Abstract: Most epileptic disorders can be traced to an abnormality of cortical architecture, channel-mediated currents, neuronal growth and differentiation, or cerebral metabolism. In most cases, however, the underlying biologic complexity of epilepsy precludes the identification of the genetic cause, and 65 to 79 percent of recurrent seizure syndromes remain unexplained. Microarray analysis of DNA samples can be a powerful tool for revealing a genetic lesion in well-defined families. We have used this approach in Old Order Amish families, some members of which have a clinical and neuropathological phenotype that we designate as the cortical dysplasia–focal epilepsy (CDFE) syndrome. We identified a genetic variation in the gene encoding CASPR2 in affected patients, a finding that suggests that CASPR2 influences brain development.
The mania and the delusions surrounding the genomic overlap of bipolar type I and schizophrenia
S Dobrin, P Stafford, C Zhao, SCHIZOPHRENIA RESEARCH 81, 45-46
Kevin A. Strauss, M.D., Erik G. Puffenberger, Ph.D., Matthew J. Huentelman, Ph.D., Steven Gottlieb, M.D., Seth E. Dobrin, Ph.D., Jennifer M. Parod, B.S., Dietrich A. Stephan, Ph.D., and D. Holmes Morton, M.D.
New England Journal of Medicine 354 (13), 1370-1377
F Cassidy, J Badger, C Zhao, S Dobrin, S Roche, P McKeon
Abstract: Two well-supported theories of schizophrenia pathogenesis are the neurotransmitter theory and the neurodevelopmental theory, suggesting, respectively, that dysregulation of neurotransmitter signaling and abnormal brain development are causative in this disease. The strongest evidence of neurotransmitter involvement is the suggestion of abnormal dopamine signaling in the prefrontal cortex, and one of the strongest indications of developmental abnormalities contributing to this disease is an inverse layering of the prefrontal cortex. These two theories of schizophrenia pathogenesis can be united by their involvement of the prefrontal cortex, where structural abnormalities could lead to neurochemical abnormalities. Accordingly, any gene expressed in the prefrontal cortex of developing brains is a functional candidate for schizophrenia. We have previously reported strong linkage to 15q15 (LOD = 3.57; P = 2.6×10⁻⁵) in a collection of German multiplex families segregating the periodic catatonia subtype of schizophrenia in a nearly Mendelian fashion. A gene within our 15q15 linkage region, DLL4, is expressed in developing forebrain and produces a NOTCH4 ligand. Variants of NOTCH4 are associated with schizophrenia, thus DLL4 is both a functional as well as a positional candidate for schizophrenia. We screened this gene for mutations in three affected individuals and two unrelated controls and found two previously unreported SNPs, one non-synonymous polymorphism that changed an arginine to a histidine in exon 7 and one synonymous polymorphism in exons. The non-synonymous SNP is a rare variant in that it was not found in 100 control chromosomes; however, it did not cosegregate with the disease in the extended family so it is not causative in this pedigree. It is unlikely that mutations in DLL4 are causative in this collection of families with linkage to 15q15.
Data Mining Whole-Genome Expression Profiling
S Lal, S Dobrin, D Stephan, Arizona State University
DP McKeane, J Meyer, SE Dobrin, KM Melmed, S Ekawardhani, NA Tracy, KP Lesch, DA Stephan
Abstract: We have identified a lethal phenotype characterized by sudden infant death (from cardiac and respiratory arrest) with dysgenesis of the testes in males [Online Mendelian Inheritance in Man (OMIM) accession no. 608800]. Twenty-one affected individuals with this autosomal recessive syndrome were ascertained in nine separate sibships among the Old Order Amish. High-density single-nucleotide polymorphism (SNP) genotyping arrays containing 11,555 single-nucleotide polymorphisms evenly distributed across the human genome were used to map the disease locus. A genome-wide autozygosity scan localized the disease gene to a 3.6-Mb interval on chromosome 6q22.1-q22.31. This interval contained 27 genes, including two testis-specific Y-like genes (TSPYL and TSPYL4) of unknown function. Sequence analysis of the TSPYL gene in affected individuals identified a homozygous frameshift mutation (457_458insG) at codon 153, resulting in truncation of translation at codon 169. Truncation leads to loss of a peptide domain with strong homology to the nucleosome assembly protein family. GFP-fusion expression constructs were constructed and illustrated loss of nuclear localization of truncated TSPYL, suggesting loss of a nuclear localization patch in addition to loss of the nucleosome assembly domain. These results shed light on the pathogenesis of a disorder of sexual differentiation and brainstem-mediated sudden death, as well as give insight into a mechanism of transcriptional regulation.
Erik G Puffenberger, Diane Hu-Lince, Jennifer M Parod, David W Craig, Seth E Dobrin, Andrew R Conway, Elizabeth A Donarum, Kevin A Strauss, Travis Dunckley, Javier F Cardenas, Kara R Melmed, Courtney A Wright, Winnie Liang, Phillip Stafford, C Robert Flynn, D Holmes Morton, Dietrich A Stephan