Poster Gallery 2015 | Mind Bytes 2019 - Research Computing Expo and Symposium

MindBytes Poster Gallery 2015

#1 Great Match in Natural Language Processing in Big Data at Scale

#2 Bayesian variant-based pathway enrichment analysis using GWAS summary statistics

#3 Testing Land Coverage Classification Algorithms for Optimizing Flood Detection in Hyperspectral Image Data

#4 Robust Prior Analysis and Detection of Significant Change

#5 Scenes

#6 A scalable and cost-effective method for measuring pharyngeal pumping under controlled conditions

#7 Non-universal star formation in turbulent interstellar medium

#8 In Silico Construction of a Host/Pathogen Patient Cohort Using HPC Parameter Sweeps on an Agent Based Model of Sepsis

#9 Merging novel imaging technologies to understand muscle dynamics in monkey mouths

#10 Coupled charge transport in the Cl-/H+ antiporter

#11 Water Management in the U.S. Southwest: A Systems View

#12 Halo Rendering via Phase Space Projection

#13 Multiscale simulations reveal the proton pumping mechanism in cytochrome c oxidase

#14 Predicting Chicago Real Estate Market Absorption

#15 Integrating big data analysis and visualization

#16 Organic solar cell models predict how structures change properties

#17 Mean-variance Optimization for Equity Portfolio Selection

#18 Predicting Financial Market Direction Using Social Media Data

#19 Insight into the Impact of Marketing Activities on Sales Using Marketing Mix Modeling

#20 WEST: Novel Scalable Software for Materials by Design

#21 Beyond-DFT Electronic Structure: Spin-Orbit Coupling and Surface Defect Calculations

Great Match in Natural Language Processing in Big Data at Scale

Google Ngram Viewer is the best-known implementation of n-grams, an essential tool for natural language processing. Since the birth of that project, no one has been able to compete with Google Ngram Viewer, or has even tried to. In this project, the author challenges Google Ngram Viewer in terms of the number of n-grams generated, building the n-grams from scratch with maximum use of multicore CPUs.
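As a rough illustration of building n-grams from scratch on several cores, the sketch below counts n-grams per document in separate processes and merges the partial tables. It is not the author's implementation: the function names, the whitespace tokenization, and the one-document-per-task split are all assumptions made for the example, and n-grams are not counted across document boundaries.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in one token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def merge_counts(partials):
    """Sum per-document counters into a single table."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total


def count_ngrams_parallel(documents, n, workers=None):
    """Hypothetical driver: count each document in a worker process, then merge."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge_counts(pool.map(partial(count_ngrams, n=n), documents))
```

Calling `count_ngrams_parallel(docs, 2)`, where `docs` is a list of token lists, would return bigram counts over all documents. On platforms that start worker processes by spawning, the call needs to sit under an `if __name__ == "__main__":` guard.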

Bayesian variant-based pathway enrichment analysis using GWAS summary statistics

Carbonetto and Stephens (2013) developed a multiple-SNP modeling approach that integrated pathway enrichment analysis with variant prioritization in enriched pathways, and demonstrated its potential to yield novel biological insights into complex human traits and diseases. The method, however, is limited by the requirement of individual-level genotype and phenotype data, which are not widely available for large GWAS. In contrast, single-SNP association summary statistics are often released in the public domain. Here we present a new Bayesian method for multiple-SNP pathway analysis that relies solely on GWAS summary statistics and linkage disequilibrium (LD) structure inferred from a public reference panel. Our method adopts a recently proposed large-scale Bayesian regression model for GWAS summary statistics (Zhu and Stephens, ASHG 2015). Unlike previous work, in which each SNP was treated as equally likely a priori to be associated with the phenotype, the new method allows the prior probability of each SNP being associated to depend on its membership of a pathway, so that potential enrichment of associations within the pathway can be captured. A parallel algorithm using mean-field variational approximation is developed to ensure scalability for genome-wide applications. On summary statistics of 435,615 SNPs in a GWAS of Crohn's disease and 3,160 curated pathways from eight web databases, our method obtains results comparable to the analysis that used individual-level data (Carbonetto and Stephens, 2013). The top-ranked pathways that show strong support for enrichment in Crohn's disease are IL12-mediated signaling, cytokine signaling, IL23-mediated signaling and immune system (Bayes Factor = 2.63e9, 7.41e8, 5.25e8, 1.74e6). We also apply the method to the summary statistics of 1,064,575 SNPs in a GWAS of human height and 3,700 pathways, and identify highly enriched gene sets that play important roles in bone biology, including Hedgehog signaling, RAC1 signaling and Y branching of actin filaments.
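The idea of a pathway-dependent prior can be made concrete with a small sketch. It assumes a base-10 log-odds parameterization, in which a background term `theta0` sets the prior for every SNP and an enrichment term `theta` is added for SNPs inside the pathway. The abstract does not state the exact form, so both the parameterization and the numeric values here are illustrative assumptions.

```python
def prior_inclusion_probability(in_pathway, theta0, theta):
    """Prior probability that a SNP is associated with the phenotype.

    Assumed form: log10(pi / (1 - pi)) = theta0 + theta * [SNP is in pathway].
    theta0 and theta are hypothetical names for the background and
    enrichment parameters.
    """
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


# Illustrative values only: background odds of 1e-4, 100-fold enrichment.
p_outside = prior_inclusion_probability(False, theta0=-4.0, theta=2.0)
p_inside = prior_inclusion_probability(True, theta0=-4.0, theta=2.0)
```

With `theta = 0` the two probabilities coincide, which corresponds to the earlier setting where every SNP is equally likely a priori to be associated.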

Testing Land Coverage Classification Algorithms for Optimizing Flood Detection in Hyperspectral Image Data

In remote sensing, hyperspectral imaging instruments provide data with potentially high predictive performance in image classification. However, the limited computational capabilities onboard these sensors do not allow full utilization of the data. For instance, the algorithms onboard NASA’s Earth Observing-1 (EO-1) satellite are limited to using only 12 of 242 spectral bands due to data size. Project Matsu, a cloud-based collaboration with NASA, makes processed hyperspectral data available to the public, within 24 hours of acquisition. Utilizing this framework facilitates fast access and computational power over the full dataset, allowing us to test machine learning algorithms for land cover classification in hyperspectral data using all 242 bands to improve upon the existing water detection algorithms currently used onboard the satellite. Using a diverse training set of hyperspectral data, we achieve a significant accuracy increase of 5%-20%.

Robust Prior Analysis and Detection of Significant Change

Our research focuses on the analysis of future climate change and the effects of model misspecification on economic policy. In particular, uncertainty in the climate component of a linked economic-climate model affects robust energy and consumption policies. A decision maker wants to minimize regret under the least favorable outcome that could result from applying a policy across the set of climate models considered.
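The decision rule described above can be sketched as a minimax-regret choice over a small payoff table. This is a toy illustration only: the policy names, model names, and payoff numbers are invented for the example, and the poster's actual economic-climate model is not reproduced here.

```python
def minimax_regret_policy(payoffs):
    """Pick the policy whose worst-case regret across models is smallest.

    payoffs maps policy name -> {model name -> payoff}. Regret under a model
    is the gap between the best achievable payoff for that model and the
    payoff of the policy actually chosen.
    """
    models = list(next(iter(payoffs.values())).keys())
    best_per_model = {m: max(p[m] for p in payoffs.values()) for m in models}
    worst_regret = {
        name: max(best_per_model[m] - p[m] for m in models)
        for name, p in payoffs.items()
    }
    choice = min(worst_regret, key=worst_regret.get)
    return choice, worst_regret[choice]


# Hypothetical welfare payoffs under two candidate climate models.
example_payoffs = {
    "aggressive": {"low_sensitivity": 6, "high_sensitivity": 9},
    "moderate": {"low_sensitivity": 8, "high_sensitivity": 7},
    "no_action": {"low_sensitivity": 10, "high_sensitivity": 2},
}
chosen, regret = minimax_regret_policy(example_payoffs)
```

In this table the moderate policy is selected: it is never the best choice under either model, but its worst-case regret is the smallest of the three.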

Scenes

Social scientists and others consider many types of contexts that can join in the scenes approach. Although the components have been used previously, scenes analysts join them together to create a new holistic synthesis. That is, a scene includes: (1) Neighborhoods, rather than cities, metro regions, states/provinces, or nations. (2) Physical structures, such as dance clubs or shopping malls. (3) Persons, described according to their race, class, gender, education, occupation, age, and the like. (4) Specific combinations of 1-3, and the activities which join them, like young tech workers attending a local punk concert. (5) These four in turn express symbolic meanings, values defining what is important about the experiences offered in a place. General meanings include legitimacy, defining a right or wrong way to live; theatricality, an attractive way of seeing and being seen by others; authenticity, a real or genuine identity. (6) Publicness – rather than the uniquely personal and private, scenes are projected by public spaces, available to passers-by and deep enthusiasts alike. (7) Politics and policy, especially policies and political controversies about how to shape, sustain, alter, or produce a given scene, how certain scenes attract (or repel) residents, firms, and visitors, or how some scenes mesh with political sensibilities, voting patterns, and specific organized groups, such as new social movements.

A scalable and cost-effective method for measuring pharyngeal pumping under controlled conditions

C. elegans feeding consists of two pharyngeal motions: pumping and isthmus peristalsis. Pumping is typically quantified by counting the number of quasi-periodic contractions of the terminal bulb during a fixed short period. Under ideal imaging conditions, i.e., high magnification and high spatial and temporal resolutions, automated detection of pharyngeal pumping can be achieved using intensity threshold-based machine vision. However, such conditions require the dedication of significant resources to every animal, thus limiting the throughput of the assay. We employ a mixture of affordable optics and novel analysis to build a high-throughput imaging and analysis pipeline. Models of regulatory strategies can potentially be tested using detailed experimental data and may assist in conceptualizing the data in terms of an optimality principle in feeding.
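The intensity-threshold detection mentioned for ideal imaging conditions can be sketched as counting rising edges in a brightness trace of the terminal bulb. This is a minimal illustration under assumed inputs (one intensity value per video frame and a hand-picked threshold); it is not the analysis pipeline built for the poster.

```python
def count_pumps(intensity, threshold):
    """Count upward crossings of the threshold in a per-frame intensity trace.

    Each crossing from at-or-below to above the threshold is treated as one
    pump. A trace that starts above the threshold counts as one crossing.
    """
    count = 0
    above = False
    for value in intensity:
        if value > threshold and not above:
            count += 1
        above = value > threshold
    return count


def pumps_per_minute(intensity, threshold, frames_per_second):
    """Convert the pump count into a rate, given the video frame rate."""
    duration_seconds = len(intensity) / frames_per_second
    return 60.0 * count_pumps(intensity, threshold) / duration_seconds


# Hypothetical one-second trace recorded at 10 frames per second.
trace = [0.1, 0.2, 0.9, 0.8, 0.1, 0.95, 0.2, 0.1, 0.7, 0.1]
```

A fixed threshold of this kind relies on a clean, high-contrast signal, which is why it suits the high-magnification setting and not lower-cost optics.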

Non-universal star formation in turbulent interstellar medium

Recent observational evidence indicates variation in the efficiency with which giant molecular clouds (GMCs) convert their gas into stars. A consistent theory of galaxy formation must explain this variation. Common numerical models, in which the star formation efficiency (SFE) is assumed constant (at the level of a few percent) above a certain density threshold, are by design unsuitable for this purpose. Theoretically, variation of the SFE is attributed to the turbulent nature of the interstellar medium. Unfortunately, state-of-the-art galaxy formation models lack the relevant small-scale motions because their limited resolution barely reaches the typical scale of the largest GMCs (a few tens of pc). However, with an appropriate subgrid model of turbulence, star formation can be connected to the resolved dynamics with the aid of theoretical and numerical models of star formation in a turbulent medium. In this work we implement such a model, coupled with a prescription for star formation in compressible MHD turbulence. We find that our model predicts a distribution of r.m.s. turbulent velocities consistent with local and extragalactic observations (on average a few km/s on a 100 pc scale). In our model, turbulence is produced in warm gas (T ∼ 10^4 K) at the level of a few km/s and is amplified by compression in spiral arms up to a few tens of km/s. As far as star formation is concerned, our simulation yields a distribution of rates that is in good agreement with both local GMC data and resolved extragalactic star formation maps. The resulting variation in efficiency is found to be due to scatter in turbulent properties. Our model predicts a high abundance of molecular gas that forms stars inefficiently, along with the existence of very efficient GMCs.

In Silico Construction of a Host/Pathogen Patient Cohort Using HPC Parameter Sweeps on an Agent Based Model of Sepsis

Current predictive models for sepsis generally use correlative methods, and as such are limited in their individual precision due to patient heterogeneity and data sparseness. The use of computational modeling and simulation can aid in the process of contextualizing data generated by complex systems in order to describe their behavior. Towards this end, we have performed a multi-dimensional parameter sweep on a previously validated model of sepsis. Data from this parameter sweep has been used to construct an in silico cohort of patients, defined by parameters representing host health and microbial virulence, upon which further studies and simulations can be performed to both understand the septic process and design putative interventions.

Merging novel imaging technologies to understand muscle dynamics in monkey mouths

Muscles can function in diverse ways to move the skeleton, and our lab examines how muscles and skeletons work together to produce movements; we focus primarily on chewing and swallowing in primates. Here we describe two new technologies that can aid in understanding anatomy and function: 1) XROMM (X-ray Reconstruction Of Moving Morphology), a tool for visualizing skeletal movements based on x-ray video and CT scanning; and 2) contrast-enhanced CT scanning, which allows for accurate depiction of individual muscle fibers in addition to skeletal anatomy. Each of these techniques is computationally intensive in its own right, and integrating them to gain a more complete understanding of musculoskeletal biomechanics is difficult. We faced problems with data volume (many files of different types, some of them very large), data management, the logistics of sharing sensitive and complex data, and analytical tools for processing our data. Inside the University, the RCC has customized hardware and software solutions for automatically downloading and archiving our files, keeping each trial’s data organized, and expediting metadata entry for each trial. In support of our collaborative efforts outside of the University, the RCC has arranged secure, streamlined access to our data for off-campus collaborators and automated a pipeline for applying computational tools developed at other universities to our data.

Coupled charge transport in the Cl-/H+ antiporter

The chloride channels (ClC) are a family of proteins that transport Cl- across membranes either as selective ion channels or as secondary active Cl-/H+ antiporters. ClC-ec1, a prokaryotic homologue of ClC, uses a Cl- gradient to pump H+ thermodynamically uphill through the membrane, or vice versa. We have studied this ion exchange process by calculating the free energy profile for migration of each ion through the protein channel. The free energy calculations were done with a suite of multiscale methods, ranging from classical MD simulation to a semi-quantum mechanical reactive model and the hybrid QM/MM method. Here we report results on the proton transport process and the coordinated movement of Cl- ions through the external and internal gate regions. A Markov state model was constructed to connect the intermediate states identified in the free energy profile of Cl- transport. The model showed good agreement with experimental results for the Cl- conduction rate and the Cl-/H+ exchange ratio. Our results suggest a plausible mechanism for coupled ion exchange.

Water Management in the U.S. Southwest: A Systems View

Halo Rendering via Phase Space Projection

N-body simulations are often used to study how dark matter self-organizes into a complex "cosmic web" of filaments and halos. However, the discrete nature of simulations makes it difficult to study the low-density regions of the cosmic web. We developed a code that studies these regions by directly representing the continuous phase-space structure of a simulation's dark matter. This allows for the creation of ultra-high resolution density maps, like the one shown.

Multiscale simulations reveal the proton pumping mechanism in cytochrome c oxidase

Cytochrome c oxidase (CcO) reduces oxygen to water and uses the released free energy to pump protons across the membrane, contributing to the transmembrane proton electrochemical gradient that drives ATP synthesis. Herein, we provide a complete atomic level description of the key steps of the proton pumping mechanism in aa3-type CcO. We have used multiscale reactive molecular dynamics simulations to explicitly characterize (with free energy profiles and calculated rates) the internal proton transport events that enable pumping and chemistry during a reaction step that involves proton transport to the pump loading site (PLS) and to the catalytic site (binuclear center, BNC) (the A→PR→F transition). Our results show that both proton transport events are thermodynamically driven by electron transfer from heme a to the BNC, but that pumping (amino acid residue E286 to the PLS) is kinetically favored, while transfer of the chemical proton (E286 to the BNC) is rate-limiting. The calculated rates are in quantitative agreement with experimental measurement. The back flow of the pumped proton from the PLS to E286 is prevented by the fast reprotonation of E286 through the D-channel and a large free energy barrier for the back flow reaction. Proton transport through the D-channel is not rate-limiting during the A→PR→F transition, but is strongly coupled to solvation changes across the N121-N139 asparagine gate. Our results also show how the D-channel biases unidirectional proton transport from the inner to outer side of the membrane.

Predicting Chicago Real Estate Market Absorption

Opportunities to support urban economic decision-making with analytical models are extensive in the real estate market. Both buyers and sellers face uncertainty in real estate transactions in large metropolitan areas about when to time a transaction and at what cost. A housing demand index based on microscopic home showings events data can provide decision-making support for buyers and sellers on a very granular time and spatial scale. In the current real estate market, both buyers and sellers make decisions without knowing the present and future state of the large and dynamic real estate market. Consequently, accurate and granular housing market demand forecasts play a valuable role in these decisions. In this paper, we aim to predict housing market demand by developing housing demand indices using high-volume, high-velocity data on home showings, listing events, and historic sales data. By employing a combination of traditional market measures supplemented by the number of home showings, the indices result in timely insight into housing market demand. We demonstrate our analysis using data from seven million individual records sourced from a unique, proprietary dataset that has not previously been explored in application to the real estate market. We then employ a series of predictive models to estimate current and forecast future housing demand. Specifically, we first develop a shorter-term market demand heat index that predicts housing demand for the subsequent week using only past weekly market demand and home showings data.

Integrating big data analysis and visualization

Language is manifested at multiple, interconnected levels of structure: letters/sounds, words, phrases, etc. How can linguistic structure be learned? With the availability of large datasets (e.g., text corpora, transcribed speech collections), we develop unsupervised approaches to learning linguistic structure from unstructured data, which has important implications for both the scientific study of language and language technologies. We are constructing Linguistica 5, a software package with a graphical user interface that integrates linguistic data analysis and visualization.

Organic solar cell models predict how structures change properties

Solar cell devices based upon organic electronics possess many advantages relative to traditional silicon photovoltaics, including potentially inexpensive manufacturing, band gaps which are tunable through chemical modification, and high optical absorption coefficients. Recent progress in generating improved polymers for organic solar cells has suggested that precise electronic energy level alignment between multiple donor polymers (e.g. PTB7, PID2, etc.) significantly improves device performance. The effect of specific morphologies on device efficiency is difficult to determine experimentally, but our theoretical models can directly probe how structural changes affect electronic properties, such as band locations and band gaps. Using periodic plane wave DFT and many body perturbation theory (G0W0) calculations, we calculated the dependence of electronic states on order parameters such as backbone dihedrals. Hybrid density functional theory, in particular, performs well for these systems. Our results suggest that accurate description of the device performance must include a description of local disorder. In particular, the potential energy scans we performed suggest multiple possible configurations which differ significantly in electronic structure. Using the lowest energy conformations, our alignment of electronic energy levels is in good agreement with experimental values.

Mean-variance Optimization for Equity Portfolio Selection

This research explores mean-variance optimization for stock portfolio modeling on high-dimensional datasets. The availability of data allows average investors to obtain large datasets for use in making investment decisions. The two-step model used Sortino ratio ranking and then mean-variance optimization to select a portfolio of stocks. The resultant portfolio outperformed the benchmark (S&P 500) both in-sample and out-of-sample.
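The first step of the two-step model, ranking by Sortino ratio, can be sketched as follows. The sketch uses the standard definition (mean excess return over a target, divided by downside deviation); the target of zero, the ticker names, and the return series are assumptions for illustration, and the second step (mean-variance optimization over the selected stocks) is not shown.

```python
import math


def sortino_ratio(returns, target=0.0):
    """Mean excess return over the target divided by downside deviation."""
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    downside_dev = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside_dev == 0.0:
        # No returns below target: treat as unbounded if there is any gain.
        return math.inf if mean_excess > 0 else 0.0
    return mean_excess / downside_dev


def rank_by_sortino(returns_by_ticker, top_k, target=0.0):
    """Return the top_k tickers ordered from highest to lowest Sortino ratio."""
    ranked = sorted(
        returns_by_ticker,
        key=lambda t: sortino_ratio(returns_by_ticker[t], target),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical periodic returns for three made-up tickers.
sample_returns = {
    "AAA": [0.02, -0.01, 0.03, 0.01],
    "BBB": [0.05, -0.06, 0.04, -0.02],
    "CCC": [0.01, 0.01, 0.0, 0.02],
}
shortlist = rank_by_sortino(sample_returns, top_k=2)
```

Unlike a volatility-based ranking, this penalizes only returns that fall below the target, so a stock with large upside swings is not marked down for them.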

Predicting Financial Market Direction Using Social Media Data

This research project is a study of investor sentiment, as derived from social media content collected using RCC infrastructure, and its ability to predict short-term stock market direction. The team chose the S&P 500 and Russell 2000 indices to represent the stock market for this project. The study uses cutting-edge text analytics and natural language processing techniques to analyze articles posted on financial social media websites and derive daily ‘Sentiment Measures’ from them. The results indicate that these ‘Sentiment Measures’ have a strong short-term relationship with the Russell 2000 Index. On the other hand, there was no strong evidence of a short-term relationship between the sentiment measures and the S&P 500 Index. Using these sentiment measures, the research team was able to predict the Russell index direction for the next trading day with an 80% accuracy rate.

Insight into the Impact of Marketing Activities on Sales Using Marketing Mix Modeling

The purpose of this research is to help HAVI Global Solutions’ client estimate the impact of historical marketing and pricing activities and then forecast the impact of future activities through the use of predictive statistical models. These models can provide the client with insight into where and how to apply marketing investment dollars more effectively. The marketing mix model developed for this research will do the following: correctly identify the cannibalization effect of promotions across different products; minimize data collinearity risks among independent variables (such as different types of promotions); contain a model validation process that efficiently validates models for large numbers of products; and allow for scalability into more markets.

WEST: Novel Scalable Software for Materials by Design

Petascale computational resources have provided the opportunity to perform quantum simulations of materials properties of unprecedented size, yielding results that complement experimental observations and may lead to the discovery of new materials, designed using the basic principles of quantum mechanics. Density functional theory (DFT) is one of the main tools used in first-principles simulations of materials, but several of the current approximations of exchange and correlation functionals do not provide the level of accuracy required for predictive calculations of electronic properties, such as photoemission and absorption spectra. Many-Body Perturbation Theory (MBPT) offers an improved level of accuracy, but it is more computationally demanding than DFT and often difficult to apply to realistic materials, including those with disorder, defects, interfaces or nanostructured solids. We describe the features of WEST, an open source massively parallel code to compute excited state properties of molecules and materials (www.west-code.org), which is scalable up to ~32k BG/Q nodes. We will discuss the electronic structure of systems relevant to solar energy conversion processes obtained with WEST, as well as the parallel performance of the code.

Beyond-DFT Electronic Structure: Spin-Orbit Coupling and Surface Defect Calculations

While density functional theory (DFT) is widely used, it gives large errors for several properties, such as semiconductor band gaps. The GW approximation to the Dyson equation is a very successful method which begins with a DFT calculation and computes corrections based on many-body theory. GW calculations have traditionally been extremely challenging to perform for large systems, but newly developed algorithms and a high performance code developed by some of the authors, WEST, greatly improve the efficiency of GW calculations. Here, we present extensions and applications of the WEST code. First, spin-orbit coupling is included in calculations, so that solids and nanoparticles containing heavy elements such as gold and lead can be calculated with much better accuracy. Second, we use WEST to refine the simulation of a dangling bond on a hydrogen-passivated silicon surface. This system shows promise for quantum information applications, and it is also an intermediate step toward the fabrication of complex atomic-scale silicon devices.