Projects

Here are some examples of our ongoing projects, along with key publications and presentations for each project. More projects will be added soon!


Fast Subset Scanning

While most previous work in machine learning and data mining has focused on detection and classification of single data records, pattern detection extends these methods to groups of records, in order to detect and identify patterns not visible from any individual record alone. A key idea of our work is subset scanning: we frame the pattern detection problem as a search over subsets of the data, in which we define a measure of the “interestingness” or “anomalousness” of a subset, and maximize this “score function” over all potentially relevant subsets.

Subset scanning often improves detection power as compared to heuristic methods (which are not guaranteed to find optimal subsets), top-down detection methods (which fail to detect small-scale patterns that are not evident from global aggregates), and bottom-up detection methods (which fail to detect subtle patterns that are only evident when a group of data records are considered collectively). Of course, subset scanning creates both statistical and computational challenges, the most serious of which is the computational infeasibility of exhaustively searching over the exponentially many subsets.

One key breakthrough along these lines is our fast subset scan (Neill, 2012), which enables exact and efficient optimization over subsets of the data. However, fast subset scan only solves the unconstrained best subset problem, thus creating additional challenges as to how we can incorporate real-world constraints such as proximity for spatial data, connectivity for graph and network data, and temporal consistency for dynamic, evolving patterns. Our recent work uses fast subset scan as a building block to address all of these challenges, as well as to integrate information from many data streams, high-dimensional data, and complex data sources such as text and images, thus dramatically extending the range of pattern discovery problems that we can address.

The Linear-Time Subset Scanning Property

In the subset scan framework, our primary goal is to find the subsets of the data which are most anomalous (or that best match some known and relevant pattern) by maximizing a score function F(S). Since an exhaustive search over subsets is computationally infeasible, typical spatial scan methods either restrict the search space, e.g. by searching over circular or rectangular regions, or perform a heuristic search. The former approach has low detection power for regions outside the search space, while the latter does not guarantee that an optimal or near-optimal region will be found. However, we have recently discovered that many pattern detection methods satisfy a property (linear-time subset scanning, or LTSS) which allows efficient optimization over all subsets of the data: the highest-scoring (most anomalous or most relevant) of all the exponentially many subsets of the data can be found in linear time, by sorting the data records according to some priority function and searching only over regions containing the k highest-priority records (letting k vary from 1 to the total number of records N). This approach enables us to optimize F(S) by evaluating only N of the 2^N possible subsets.

We have shown that LTSS holds for many relevant score functions, including parametric and non-parametric scan statistics; for example, it holds for any expectation-based scan statistic in a single-parameter exponential family. These are likelihood ratio statistics F(S) = Pr(Data | H1(S)) / Pr(Data | H0), where H0 assumes that each data element x_i is drawn with some expected value μ_i, and H1(S) assumes a constant multiplicative increase in expected value for the affected subset S.
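
To make the LTSS recipe concrete, here is a minimal Python sketch (with hypothetical data) for the expectation-based Poisson statistic: writing C and B for the total observed and expected counts of a subset S, the maximized log-likelihood ratio is C log(C/B) + B - C whenever C > B (and 0 otherwise), and the LTSS priority function is the ratio x_i/μ_i. Sorting by priority and scoring only the N top-k subsets recovers the exact optimum over all 2^N subsets.

    import numpy as np

    def ebp_score(C, B):
        # Expectation-based Poisson log-likelihood ratio for a subset with
        # observed count C and expected count B (zero if counts are not elevated).
        return C * np.log(C / B) + B - C if C > B else 0.0

    def fast_subset_scan(counts, baselines):
        # LTSS: sort records by priority x_i / mu_i, then score only the N
        # "top-k" subsets instead of all 2^N subsets.
        order = np.argsort(-counts / baselines)       # highest priority first
        C = np.cumsum(counts[order])                  # running totals for top-k subsets
        B = np.cumsum(baselines[order])
        scores = [ebp_score(c, b) for c, b in zip(C, B)]
        k = int(np.argmax(scores))
        return set(int(i) for i in order[:k + 1]), scores[k]

    # Hypothetical example: 8 locations with expected counts mu_i, observed x_i.
    mu = np.full(8, 10.0)
    x = np.array([12., 25., 9., 30., 11., 8., 28., 10.])
    subset, score = fast_subset_scan(x, mu)
    print(sorted(subset), round(score, 2))            # flags locations 1, 3, and 6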

Incorporating Real-World Constraints

Since LTSS only guarantees a solution to the unconstrained (all-subsets) optimization problem, the biggest challenge is to incorporate constraints such as spatial proximity, graph connectedness, or temporal consistency to ensure that relevant and useful subsets are detected. In recent work, we have developed a number of novel and powerful methods for constrained optimization, using the unconstrained LTSS method as a building block.

For the spatial event detection problem, we often want to use spatial information to constrain our search by penalizing or excluding unlikely subsets (e.g. spatially dispersed or highly irregular regions). Thus we propose fast localized scan approaches which incorporate spatial proximity constraints into the LTSS framework, either restricting or penalizing the neighborhood size. For example, we can constrain our search to regions consisting of a center location and some subset of its k-nearest neighbors, using LTSS to reduce the complexity from exponential to linear in k. In practice, this allows us to perform spatial detection tasks in milliseconds that would require thousands of years for an exhaustive search. We also demonstrate that proximity-constrained subset scans substantially improve the timeliness and accuracy of event detection, as compared to the traditional (circular) spatial scan statistic, particularly when the affected region is elongated or irregular in shape.
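
A sketch of the fast localized scan under these proximity constraints, reusing fast_subset_scan from the sketch above (the coordinates, counts, and neighborhood size here are hypothetical): for each center location we apply LTSS only within that center's k-nearest-neighborhood, so the cost per center is linear rather than exponential in k.

    import numpy as np
    from scipy.spatial import cKDTree

    def fast_localized_scan(coords, counts, baselines, k=10):
        # For each center location, apply LTSS (fast_subset_scan, above) to
        # the center's k nearest neighbors; keep the best-scoring region.
        tree = cKDTree(coords)
        best_subset, best_score = None, -np.inf
        for center in range(len(coords)):
            _, nbrs = tree.query(coords[center], k=k)
            nbrs = np.atleast_1d(nbrs)
            subset, score = fast_subset_scan(counts[nbrs], baselines[nbrs])
            if score > best_score:
                best_subset = {int(nbrs[i]) for i in subset}
                best_score = score
        return best_subset, best_score

    # Hypothetical data: 100 random locations with a localized increase at 3 of them.
    rng = np.random.default_rng(0)
    coords = rng.random((100, 2))
    mu = np.full(100, 10.0)
    x = rng.poisson(mu).astype(float)
    x[[4, 17, 23]] += 15.0
    print(fast_localized_scan(coords, x, mu, k=10))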

In some settings, the affected subset may be irregular in shape but explained by an underlying graph structure. In event detection, the graph structure may represent spatial adjacency, but can also represent traffic patterns, rivers, or other natural phenomena. Unconstrained LTSS does not incorporate this additional information and may return disconnected subsets. Therefore, in Speakman, McFowland, and Neill (2015), we propose GraphScan to incorporate connectivity constraints into the subset scan. Given a known graph structure, GraphScan is guaranteed to identify the highest-scoring connected subgraph while scaling to graph sizes an order of magnitude larger than the previous state of the art. GraphScan’s dramatic speed improvements come from an efficient depth-first search algorithm that incorporates an extension of the LTSS property to avoid searching provably suboptimal subsets. GraphScan demonstrates improved power and timeliness of detection when identifying affected regions along rivers and interstate highways as compared to heuristic alternatives.

In many applications, such as tracking contamination in a water distribution system or predicting the spread of an outbreak, it is necessary to identify dynamic patterns, where the affected spatial region (or subgraph of a network) may change over time. However, typical space-time scan statistics simply aggregate observed counts across time steps, assuming that the affected region does not change over time. An alternative approach, identifying different subsets on each time step, may result in detection of disjoint (and therefore, implausible) space-time regions. In Speakman, Zhang, and Neill (2013), we propose the Dynamic Subset Scan, a fruitful compromise between these two extremes that identifies dynamic patterns while enforcing soft temporal consistency constraints between adjacent time steps. The Dynamic Subset Scan proceeds by iteratively optimizing the detected subset at each time step t using the subsets at time steps t-1 and t+1, rewarding overlap and penalizing abrupt changes. Given an underlying graph structure, hard constraints on connectivity can also be incorporated, and we develop a heuristic approach, Additive GraphScan, which can scale to tens of thousands of nodes. Using both connectivity and temporal consistency constraints, Dynamic Subset Scan outperformed competing approaches when tracking, detecting, and source-tracing contaminant plumes spreading through a water distribution system equipped with noisy binary sensors.

Multivariate and Multidimensional Fast Subset Scans

In Neill, McFowland, and Zheng (2013), we extended the fast subset scan approach from univariate to multivariate spatial datasets. The key insight is that LTSS can either be used to efficiently optimize a score function over subsets of locations for a given subset of the monitored data streams, or to optimize over subsets of streams for a given subset of locations. Thus we can iterate between optimizing over locations and streams until the algorithm converges to a (local) maximum of the score function over all subsets of locations and streams, and use multiple randomized restarts to approach the global maximum.
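
The alternating optimization can be sketched as follows, again reusing fast_subset_scan from above and simplifying so that the score of a (locations, streams) subset depends only on its aggregate count and baseline (an assumption of this sketch, not a full account of the published method). Each step is an exact LTSS optimization over one dimension, so the score never decreases, and random restarts help escape poor local maxima; all data and parameter names here are hypothetical.

    import numpy as np

    def multivariate_scan(counts, baselines, n_restarts=20, seed=0):
        # counts, baselines: N locations x M streams. Alternate between exact
        # LTSS steps (fast_subset_scan, above) over locations given streams
        # and over streams given locations; the score never decreases.
        rng = np.random.default_rng(seed)
        N, M = counts.shape
        best = (None, None, -np.inf)
        for _ in range(n_restarts):
            streams = np.flatnonzero(rng.random(M) < 0.5)   # random initial streams
            if streams.size == 0:
                streams = np.array([int(rng.integers(M))])
            score_prev = -np.inf
            while True:
                locs, _ = fast_subset_scan(counts[:, streams].sum(axis=1),
                                           baselines[:, streams].sum(axis=1))
                locs = np.array(sorted(locs))
                strm, score = fast_subset_scan(counts[locs, :].sum(axis=0),
                                               baselines[locs, :].sum(axis=0))
                streams = np.array(sorted(strm))
                if score <= score_prev + 1e-12:             # local maximum reached
                    break
                score_prev = score
            if score_prev > best[2]:
                best = (locs, streams, score_prev)
        return best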

In Neill and Kumar (2013), we extended the multivariate LTSS approach to multidimensional tensor data. The approach is a natural generalization of our previous algorithm, in which we randomly initialize the algorithm then iteratively optimize over subsets of each tensor mode given the other modes. This process converges to a local maximum of the score function, and then multiple randomized restarts can be used to approach the global maximum. This approach improves detection power by enabling us to simultaneously search over space-time regions, subsets of the monitored data streams, and subpopulations with different demographic or behavioral characteristics (e.g. age groups, gender, socioeconomic status, and race/ethnicity), thus increasing our ability to detect disease outbreaks which have different impacts on different subpopulations.

Fast Generalized Subset Scanning

In McFowland et al. (2013), we developed the Fast Generalized Subset Scan framework for efficient pattern detection in general multivariate datasets. In this case, we have an arbitrary set of attributes measured for each of a large set of data records, and our goal is to detect self-similar subsets of data records for which some subset of attributes are anomalous. Our approach consists of four steps: 1) efficiently learning a Bayesian network which represents the assumed null distribution of the data; 2) computing the conditional probability of each attribute value in the dataset given the Bayes Net, conditioned on the other attribute values for that record; 3) computing an empirical p-value range corresponding to each attribute value by ranking the conditional probabilities, where under the null hypothesis we expect empirical p-values to be uniformly distributed on [0,1]; and 4) using a nonparametric scan statistic to detect subsets of records and attributes with an unexpectedly large number of low (significant) empirical p-values. The final step is computationally expensive (exponential in the numbers of records and attributes for a naïve search), but linear-time subset scanning can be used to efficiently search over all subsets of records for a given subset of attributes, or to efficiently search over all subsets of attributes for a given subset of records. As in the multivariate spatial setting, we can iterate between these two efficient steps until convergence to a local maximum of the score function, and use multiple restarts to approach the global maximum. This approach was evaluated on multiple application domains, including early detection of simulated anthrax bio-attacks, discovery of patterns of illicit container shipments for customs monitoring, and network intrusion detection, demonstrating improved detection accuracy, efficient run time, and ability to correctly characterize the affected subset of attributes in all three domains.
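
The nonparametric scan statistic in step 4 can take several forms; the sketch below uses the Higher Criticism statistic as one common choice (the p-values here are hypothetical). For each threshold α, it compares N_α, the number of empirical p-values no larger than α, to its null expectation N·α, and standardizes the excess.

    import numpy as np

    def npss_hc(pvalues):
        # Higher Criticism statistic: max over thresholds alpha of the
        # standardized excess (N_alpha - N*alpha) / sqrt(N*alpha*(1-alpha)),
        # where N_alpha counts empirical p-values no larger than alpha.
        p = np.sort(np.asarray(pvalues))
        N = len(p)
        n_alpha = np.arange(1, N + 1)                 # N_alpha at alpha = p_(i)
        alphas = np.clip(p, 1e-12, 1 - 1e-12)
        hc = (n_alpha - N * alphas) / np.sqrt(N * alphas * (1 - alphas))
        best = int(np.argmax(hc))
        return hc[best], alphas[best]

    # Hypothetical record: mostly uniform p-values plus a few very small ones.
    rng = np.random.default_rng(1)
    pvals = np.concatenate([rng.uniform(size=50), [0.001, 0.002, 0.004]])
    print(npss_hc(pvals))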

Key Publications

Univariate Fast Subset Scan with spatial proximity constraints:
Daniel B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society (Series B: Statistical Methodology) 74(2): 337-360, 2012. (pdf)

Multivariate Fast Subset Scan:
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset scan for multivariate event detection. Statistics in Medicine 32: 2185-2208, 2013. (pdf)

Multidimensional Fast Subset Scan:
Daniel B. Neill and Tarun Kumar. Fast multidimensional subset scan for outbreak detection and characterization. Online Journal of Public Health Informatics 5(1), 2013. (pdf)

Fast Generalized Subset Scan:
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. Journal of Machine Learning Research, 14: 1533-1561, 2013. (pdf)

GraphScan:
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous patterns with connectivity constraints. Journal of Computational and Graphical Statistics 24(4): 1014-1033, 2015. (pdf)

Dynamic Subset Scan:
Skyler Speakman, Yating Zhang, and Daniel B. Neill. Dynamic pattern detection with temporal consistency and connectivity constraints. Proceedings of the 13th IEEE International Conference on Data Mining, 697-706, 2013. (pdf)

Penalized Subset Scan:
Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Penalized fast subset scanning. Journal of Computational and Graphical Statistics, 25(2): 382-404, 2016. Selected for “Best of JCGS” invited session by the journal’s editor in chief. (pdf)

Background Reading

Expectation-Based Scan Statistics:
Daniel B. Neill. Expectation-based scan statistics for monitoring spatial time series data. International Journal of Forecasting 25: 498-517, 2009. (pdf)

Nonparametric Scan Statistics:
Daniel B. Neill and Jeff Lingwall. A nonparametric scan statistic for multivariate disease surveillance. Advances in Disease Surveillance 4:106, 2007. (pdf)

Key Presentations

Daniel B. Neill. Scaling up event and pattern detection to big data. NYU Stern School of Business, Information Systems Seminar, New York, NY, April 2014. (pdf)

Daniel B. Neill. Fast subset scanning for scalable event and pattern detection. Stony Brook University, Stony Brook, NY, May 2013. (pdf)

Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. Sixth International Workshop on Applied Probability, Jerusalem, Israel, June 2012. (pdf)


Scalable Bayesian Event Detection and Visualization

One ongoing theme of our work is the integration of many data sources to achieve earlier and more accurate detection of emerging events and other patterns. Rather than making strong assumptions (e.g., conditional independence of the monitored data streams), our approaches model the multivariate data in a Bayesian spatial event detection framework.

Advantages of this “Bayesian Scan Statistic” over the typical, frequentist scan statistic approach include easier incorporation of prior information regarding an event’s distribution in space and time and its impact on the affected region, fast computation (randomization testing is not necessary in the Bayesian framework), intuitive visualizations that incorporate our uncertainty about the event type and affected region, higher detection power, and ability to characterize events by distinguishing between multiple event types.

Multivariate Bayesian Scan Statistic

The multivariate Bayesian scan statistic (MBSS) is a powerful spatial event detection method that can integrate information from multiple data streams and can model and distinguish between multiple event types (Neill and Cooper, 2010). This method, an extension of our earlier univariate Bayesian scan statistic work (Neill et al., 2006), allows us to achieve faster and more accurate detection of emerging patterns through incorporation of prior information. It also enables accurate characterization and differentiation of multiple event types, and can accurately learn event models from labeled training data, expert knowledge, or a combination of the two.

The output of MBSS can be easily visualized as a posterior probability map by computing the posterior probabilities that each event type E_k has affected each spatial location s_i, summing the posterior probabilities for all regions S containing s_i. Unlike standard spatial scan visualizations, which do not compute probabilities but instead show the most likely cluster, this method is able to quantify its uncertainty about the spatial extent and type of the detected events.

Fast Subset Sums

Our Fast Subset Sums method (Neill, 2011) extends the MBSS framework to enable detection and visualization of irregularly shaped spatial clusters in multivariate data. The original MBSS method limits its search to circular regions (similar to the original, non-Bayesian spatial scan statistic), and it is computationally infeasible to use this method to compute and sum posterior probabilities over the exponentially many subsets of the data. However, Fast Subset Sums can efficiently produce the posterior probability map, computing the summed posterior probability over all subsets containing each location s_i without computing the posterior probability of each individual subset, thus enabling rapid detection and visualization of irregular clusters. This provides substantial improvements in spatial accuracy and timeliness of event detection as compared to MBSS, while maintaining the scalability and fast run time of the original MBSS method.
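
The flavor of the computation can be conveyed with a small sketch under two simplifying assumptions that we label explicitly: a uniform prior over all 2^N subsets, and a region likelihood ratio that factorizes into per-location ratios w_i. The sum over all subsets then has the closed form Π_i (1 + w_i), and the summed posterior over subsets containing location j follows from the identity Σ_{S∋j} Π_{i∈S} w_i = [w_j/(1+w_j)] Π_i (1 + w_i), so the full posterior probability map is computed in linear time rather than by enumerating subsets.

    import numpy as np

    def posterior_map(w, prior_h1=0.01):
        # w: per-location likelihood ratios, assuming the likelihood ratio of
        # a region S is the product of w_i for i in S, with a uniform prior
        # over the 2^N - 1 nonempty subsets.
        w = np.asarray(w, dtype=float)
        N = len(w)
        total = np.prod(1.0 + w)             # sum over all 2^N subsets (incl. empty)
        sum_h1 = total - 1.0                 # sum over the nonempty subsets
        contain_j = (w / (1.0 + w)) * total  # sum over subsets containing j
        lik_h1 = sum_h1 / (2.0 ** N - 1)     # average likelihood ratio under H1
        odds = prior_h1 * lik_h1 / (1.0 - prior_h1)
        p_h1 = odds / (1.0 + odds)           # Pr(some event occurred | data)
        return p_h1 * contain_j / sum_h1     # Pr(location j is affected | data)

    # Hypothetical per-location likelihood ratios (> 1 suggests elevated counts):
    print(np.round(posterior_map(np.array([0.8, 1.1, 6.0, 5.2, 0.9])), 3))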

Generalized Fast Subset Sums

In Shao et al. (2011), we developed a generalization of the Fast Subset Sums method which allows the sparsity of the detected region to be controlled. We propose a hierarchical probabilistic model with three steps: first, choosing the center location s_c from a multinomial distribution; second, choosing the neighborhood size k from a multinomial distribution; and third, independently choosing whether to include (with probability p) or exclude (with probability 1-p) each location in the k-neighborhood of the center. We demonstrate that our previously proposed MBSS and Fast Subset Sums methods correspond to special cases of this Generalized Fast Subset Sums (GFSS) method, with p = 1 and p = 0.5, respectively, and show that appropriate choice of the sparsity parameter p enables much faster detection and higher spatial accuracy than either MBSS or FSS.
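
A small sketch of this three-step generative model follows (all inputs are hypothetical). Note that when querying a k-d tree at the center's own coordinates, the center is returned as its own nearest neighbor, so it belongs to its k-neighborhood and is included with probability p like any other location.

    import numpy as np
    from scipy.spatial import cKDTree

    def sample_gfss_region(coords, center_probs, size_probs, p, rng):
        # Step 1: center ~ multinomial; step 2: neighborhood size k ~ multinomial;
        # step 3: include each location in the center's k-neighborhood w.p. p.
        tree = cKDTree(coords)
        center = rng.choice(len(coords), p=center_probs)
        k = rng.choice(len(size_probs), p=size_probs) + 1   # k in 1..k_max
        _, nbrs = tree.query(coords[center], k=k)
        nbrs = np.atleast_1d(nbrs)
        return {int(i) for i in nbrs[rng.random(len(nbrs)) < p]}

    rng = np.random.default_rng(0)
    coords = rng.random((50, 2))                     # hypothetical locations
    uniform_centers = np.full(50, 1 / 50)
    uniform_sizes = np.full(10, 1 / 10)              # k_max = 10
    print(sample_gfss_region(coords, uniform_centers, uniform_sizes, p=0.5, rng=rng))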

Learning Event Models

In Shao et al. (2011), we also consider learning event models for the Generalized Fast Subset Sums framework. We demonstrate that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled training examples, and that the resulting GFSS method with learned p distribution outperforms MBSS, FSS, and GFSS with a uniform p distribution. We also show that two otherwise identical event types with different sparsities can be reliably distinguished by learning each event’s p distribution, and that learning both an event’s sparsity distribution and its relative effects on different data streams leads to more timely detection and better characterization than learning either parameter on its own. More recently, we have developed an expectation-maximization (EM)-based method which enables simultaneous learning of the distributions for the sparsity parameter, neighborhood size, and center location.

Key Publications

Univariate Bayesian Spatial Scan:
Daniel B. Neill, Andrew W. Moore, and Gregory F. Cooper. A Bayesian spatial scan statistic. In Y. Weiss, et al., eds. Advances in Neural Information Processing Systems 18, 1003-1010, 2006. (pdf)

Multivariate Bayesian Scan Statistic:
Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning 79: 261-282, 2010. (pdf)

Fast Subset Sums:
Daniel B. Neill. Fast Bayesian scan statistics for multivariate event detection and visualization. Statistics in Medicine 30(5): 455-469, 2011. (pdf)

Generalized Fast Subset Sums and Learning Event Models:
Kan Shao, Yandong Liu, and Daniel B. Neill. A generalized fast subset sums framework for Bayesian event detection. Proceedings of the 11th IEEE International Conference on Data Mining, 617-625, 2011. (pdf)

Key Presentations

Kan Shao, Yandong Liu, and Daniel B. Neill. A generalized fast subset sums framework for Bayesian event detection. Presented at the 11th IEEE International Conference on Data Mining, 2011. (pdf)


Applications of Event and Pattern Detection

An essential goal of our work is to benefit the public good through development and deployment of novel methods for event and pattern detection. We work directly with a variety of organizations in the public and private sectors, including public health practitioners, hospitals, police departments, and city leaders, to develop data-driven solutions that can improve public health, safety, and security. Three of our primary application areas are disease surveillance, law enforcement and urban analytics, and health care. We have also begun working on applications to conflict prediction, human rights, and international development, as well as security applications such as intrusion detection, customs monitoring, and infrastructure monitoring.

Disease Surveillance

The disease surveillance domain has served as our primary testbed for the development of new event and pattern detection methods, and our work has been incorporated into multiple deployed disease surveillance systems in the U.S., Canada, and Sri Lanka. This work began as part of the National Retail Data Monitor project, a research collaboration between the University of Pittsburgh and CMU for disease surveillance using over-the-counter medication sales. In collaboration with a team led by Dr. Rick Davies of the Ottawa Heart Institute, we have developed and deployed systems for real-time disease surveillance in the Grey Bruce region of Ontario (ECADS project), Ottawa (ASSET project), and several other Canadian cities (Data Fusion project). The ECADS and ASSET systems are complete, and the deployed systems are in regular use by Grey Bruce Public Health and Ottawa Public Health, respectively. The ECADS system has enabled Grey Bruce to rapidly detect several emerging outbreaks (E. coli, scarlet fever, cryptosporidiosis, etc.), and our work has given both health departments powerful tools for outbreak investigation and response. The Data Fusion project is focused on development and deployment of methods for fusion of multiple health data sources, with applications to the monitoring of hospital-acquired infections and detection of patterns of events related to drug abuse. Additional biosurveillance system deployments (in collaboration with CMU’s Auton Lab) include systems in Sri Lanka and in Tamil Nadu, India. Most recently, we have worked with the North Carolina Department of Public Health (and are starting new collaborations with the New York City Department of Health and Mental Hygiene and the Illinois Department of Public Health), focusing on identifying “novel” disease outbreaks with previously unseen patterns of symptoms.

Some key publications and presentations in the disease surveillance domain include:

Mallory Nobles, Lana Deyneka, Amy Ising, and Daniel B. Neill. Identifying emerging novel outbreaks in textual emergency department data. International Society for Disease Surveillance Annual Conference, Philadelphia, PA, December 2014. (pdf)

Daniel B. Neill. New directions in artificial intelligence for public health surveillance. IEEE Intelligent Systems 27(1): 56-59, 2012. (pdf)

Daniel B. Neill. Spatial scan tips and tricks for practical outbreak detection. Invited webinar for the International Society for Disease Surveillance, January 2011. (pdf)

Daniel B. Neill. Research challenges for biosurveillance: the next ten years (invited plenary). International Society for Disease Surveillance Annual Conference, Park City, UT, December 2010. (pdf)

Daniel B. Neill. An empirical comparison of spatial scan statistics for outbreak detection. International Journal of Health Geographics 8: 20, 2009. (pdf) (open access)

Xia Jiang, Gregory F. Cooper, and Daniel B. Neill. Generalized AMOC curves for evaluation and improvement of event surveillance. Proceedings of the American Medical Informatics Association Annual Symposium, 281-285, 2009. (pdf)

Maheshkumar R. Sabhnani, Daniel B. Neill, Andrew W. Moore, Fu-Chiang Tsui, Michael M. Wagner, and Jeremy U. Espino. Detecting anomalous patterns in pharmacy retail data. Proceedings of the KDD 2005 Workshop on Data Mining Methods for Anomaly Detection, 2005. (pdf)

M. Wagner, F.-C. Tsui, J. Espino, W. Hogan, J. Hutman, J. Hersh, D. Neill, A. Moore, G. Parks, C. Lewis, and R. Aller. A national retail data monitor for public health surveillance. Morbidity and Mortality Weekly Report 53: 40-42, 2004. (pdf)

Law Enforcement and Urban Analytics

The EPD Lab is currently collaborating with the city leadership of Chicago (Department of Innovation and Technology, or DoIT) and Pittsburgh (Department of Innovation and Performance) to develop novel methods and open-source predictive analytics tools that will help city leaders make smarter, faster decisions in real time, addressing and preventing problems before they develop. By incorporating predictive analytics into the day-to-day operations of multiple city departments, we will help these cities to improve delivery of core services and to effectively respond to a range of issues (including public health, sanitation, traffic management, public safety, and infrastructure maintenance) before they become full-fledged crises, saving money, time, and lives.

Our collaboration with the City of Chicago started in 2009, when we began working with the newly formed Predictive Analytics Group at the Chicago Police Department (CPD). In Neill and Gorr (2007), we developed novel machine learning methods for event detection and prediction, which we then incorporated into our “CrimeScan” software. This tool enables advance prediction of emerging hot-spots of violent crime with high accuracy at very fine spatial and temporal resolutions, and it was incorporated into the CPD’s day-to-day policing operations for crime prediction and prevention through targeted patrols. The key insight of the CrimeScan framework is to use detection for prediction: we apply our state-of-the-art methods for early detection of emerging spatial clusters of various leading indicator types, and use these clusters to predict that the event of interest is likely to occur nearby. For example, to predict violent crime, we monitor 911 calls (classified by call type, such as “shots fired” or “gang loitering”) and crime offense reports for more minor crimes (such as “simple battery” or “criminal damage to property”). These methods and data sources have been shown to achieve accurate prediction of hot-spots of violence, thus enabling early, targeted, and effective policing interventions. According to our CPD partners, “based upon deployment suggestions indicated in the CrimeScan intelligence reports, important arrests were affected, weapons were seized, and crimes were prevented”.

While the CrimeScan methodology and software continued to evolve through our work with the CPD, the establishment of the City of Chicago’s Department of Innovation and Technology gave us an opportunity to apply these methods to many other urban challenges through a broader collaboration with the City. We generalized the CrimeScan approach to create our “CityScan” software, which enables us to incorporate many other data sources and prediction of many event types relevant to city operations. For example, we are currently focusing on predicting emerging clusters of 311 calls for service (such as abandoned buildings or sanitation complaints), thus anticipating citizen needs and enabling the city to respond proactively. One of our initial focus areas for 311 call prediction is in collaboration with the City’s Department of Streets and Sanitation (DSS), focusing on predicting and preventing rodent complaints. By predicting where clusters of these complaints are likely to occur, we can more precisely target the City’s preventative baiting crews, with the goal of reducing the overall level of rat infestation. We are currently working with DoIT and DSS to conduct a randomized, controlled experiment to determine whether we can reduce the number of rodent complaints by predicting, targeting, and preventing rat infestations before they occur. We have also performed preliminary analyses using retrospective data from Pittsburgh and Baltimore, demonstrating that substantial benefits could be gained from even a small, precisely targeted preventive rodent baiting program based on CityScan predictions.

Our CrimeScan software was featured in a Chicago Sun-Times article (1/22/2011):

“It was a bit like a scene from ‘Minority Report,’ the 2002 Tom Cruise movie that featured genetically altered humans with special powers to predict crime. In October, the Chicago Police Department’s new crime-forecasting unit was analyzing 911 calls for service and produced an intelligence report predicting a shooting would happen soon on a particular block on the South Side. Three minutes later, it did, police officials say…”

Our rodent prevention work has also received press coverage.

We are in the process of performing a comprehensive analysis of CityScan and submitting this work for publication. Our earlier results are available as:

Daniel B. Neill and Wilpen L. Gorr. Detecting and preventing emerging epidemics of crime. Advances in Disease Surveillance 4:13, 2007. (pdf)

Professor Neill recently served on the National Science Foundation’s Subcommittee on Youth Violence, which was commissioned by Congressman Frank Wolf in response to the shooting at Sandy Hook Elementary. Our report, focusing on risk factors associated with mass shootings and street violence, was presented to Congress and discussed at a hearing before the House Appropriations Commerce-Justice-Science (CJS) subcommittee. A substantially revised and expanded version of our report has recently been published in the American Psychological Association’s flagship journal, American Psychologist:

Brad J. Bushman, Katherine Newman, Sandra L. Calvert, Geraldine Downey, Mark Dredze, Michael Gottfredson, Nina G. Jablonski, Ann S. Masten, Calvin Morrill, Daniel B. Neill, Daniel Romer, and Daniel W. Webster. Youth violence: what we know and what we need to know. American Psychologist 71(1): 17-39, 2016. (pdf) (APA press release)

Health Care

We have numerous ongoing projects at the intersection of machine learning, event detection, and health care information systems. First, funded by a Healthcare Technology Innovation grant from the University of Pittsburgh Medical Center (UPMC) and a gift from the Highmark/CMU Disruptive Health Technology Institute, we are in the process of creating a widely applicable methodological and implementation framework for using massive quantities of healthcare data (including electronic health records and health insurance claims data) to discover patterns of care with significant potential impacts on patient outcomes (e.g., mortality, length of stay, and readmissions) and healthcare costs. Second, we have developed novel methods for detection of “regions of interest” in digital pathology slides, and demonstrated that these methods can accurately detect prostate cancer. Third, we have developed novel methods for classification and visualization of high-dimensional health data, applied to assessment of the risk of diabetes and related complications. Fourth, we have developed a collaborative filtering approach to medication reconciliation, predicting and correcting omissions from a patient’s medication list in order to reduce the risk of adverse drug events. Finally, with colleagues at the Technical University of Munich, we are working on hospital length-of-stay management, and have developed a method to predict a hospital patient’s Diagnosis Related Group (DRG) at the beginning of their stay. This technology will enable hospitals to better predict costs, reimbursement, and patient flow, enabling better scheduling of patients to increase hospital profit and improve quality of care.

Some key publications in the health care domain include:

Discovering anomalous patterns of care:
Daniel B. Neill. Using artificial intelligence to improve hospital inpatient care. IEEE Intelligent Systems 28(2): 92-95, 2013. (pdf)

Digital pathology:
Sriram Somanchi and Daniel B. Neill. Discovering anomalous patterns in large digital pathology images. Proceedings of the 8th INFORMS Workshop on Data Mining and Health Informatics, 2013. (pdf)

Chronic disease risk:
Christopher A. Harle, Daniel B. Neill, and Rema Padman. Information visualization for chronic disease risk assessment. IEEE Intelligent Systems 27(6): 81-85, 2012. (pdf)

Medication reconciliation:
Sharique Hasan, George T. Duncan, Daniel B. Neill, and Rema Padman. Automatic detection of omissions in medication lists. Journal of the American Medical Informatics Association 18(4): 449-458, 2011. (pdf)

Prediction of diagnosis-related groups:
Daniel Gartner, Rainer Kolisch, Daniel B. Neill, and Rema Padman. Machine learning approaches for early DRG classification and resource allocation. INFORMS Journal on Computing 27(4): 718-734, 2015. (pdf) (supplementary material)

Event Detection and Prediction from Social Media Data

In Chen and Neill (2014, 2015), we developed new machine learning methods for event detection in social media data such as Twitter, enabling earlier detection and advance prediction of emerging patterns of events such as civil unrest, human rights violations, and rare disease outbreaks. Most previous approaches are based on burst detection, topic modeling, or clustering techniques, which cannot naturally model the implicit heterogeneous network structure in social media. As a result, only limited information, such as terms and geographic locations, can be used, limiting the ability of such methods to detect spatially localized events that (in their early stages, at least) may have only subtle signals in the social media data.

Our novel non-parametric heterogeneous graph scan (NPHGS) approach first models the entire Twitter network as a heterogeneous graph, where the graph nodes represent various entity types (Twitter users, tweets, keywords, locations, links, and hashtags), and the graph edges represent the multiple relations (reply, retweet, follower, etc.) between these entities. Various node and neighborhood features are measured for each entity type (e.g., numbers of tweets, retweets, followers, mentions, and replies for each Twitter user). The anomalousness of each node is measured using a novel two-stage calibration process which accounts for the heterogeneity of different node types and node attributes, and the correlations between different attributes of a node. Given an empirical p-value measuring the anomalousness of each node, we then efficiently maximize a nonparametric scan statistic over connected subgraphs to identify the most anomalous network clusters. Finally, the event represented by each cluster is summarized with information such as type of event, geographical locations, time, and participants.
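
The calibration step can be illustrated with a minimal sketch (hypothetical features, and a deliberately simplified per-node combination rule in place of the published two-stage procedure): each node feature is converted to an empirical p-value by ranking the current value against a reference sample, so that under the null hypothesis the p-values are approximately uniform on [0,1].

    import numpy as np

    def empirical_pvalues(current, reference):
        # One-sided empirical p-values per node and feature: the fraction of
        # reference values at least as large as the current value.
        current = np.asarray(current)        # shape (n_nodes, n_features)
        reference = np.asarray(reference)    # shape (n_ref, n_features)
        exceed = (reference[None, :, :] >= current[:, None, :]).sum(axis=1)
        return (exceed + 1) / (reference.shape[0] + 1)

    rng = np.random.default_rng(2)
    ref = rng.poisson(5, size=(200, 3)).astype(float)   # historical node features
    cur = rng.poisson(5, size=(10, 3)).astype(float)    # current node features
    cur[0] = [25, 30, 28]                               # one clearly anomalous node
    p = empirical_pvalues(cur, ref)
    print(np.round(p.min(axis=1), 3))                   # simple per-node summary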

These methodological advances have been applied to predict civil unrest and for early detection of rare disease outbreaks, and we have demonstrated substantial improvements over the previous state of the art in each domain, increasing detection power, forecasting accuracy, and forecasting lead time while reducing time to detection. Accurate prediction of civil unrest events (e.g., strikes, protests, and riots) at the city level was possible up to 1 week in advance. Most recently, using Twitter data from Mexico, we were able to accurately detect relevant clusters of human rights-related tweets prior to international news sources, and in some cases, prior to local news reports. This could enhance the information-gathering missions of human rights organizations by pinpointing specific abuses, revealing events and details that may be blocked from traditional media sources, and providing evidence of emerging patterns of human rights violations. This could lead to more timely, targeted, and effective advocacy, as well as other potential interventions.

Our key publications in social media event detection include:

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1166-1175, 2014. (pdf)

Feng Chen and Daniel B. Neill. Human rights event detection from heterogeneous social media graphs. Big Data 3(1): 34-40, 2015. (pdf)


Gaussian Processes and Kernel Methods

We are currently developing new methodological approaches for scalable Gaussian process regression and kernel space-time independence tests, with applications to prediction, causal inference, change point detection, and leading indicator selection.

Prediction and Crime Forecasting

In Flaxman et al. (2015a), we extend scalable Kronecker methods to non-Gaussian likelihoods, enabling applications outside of standard regression settings. To fit count data, we develop a spatiotemporal log Gaussian Cox process (LGCP) with highly expressive spectral mixture covariance kernels. Our model is capable of learning intricate structure in large datasets, allowing us to derive new scientific insights from the data and to perform accurate long-range forecasts. We apply our model to small-area crime rate forecasting using a decade of publicly available date-stamped and geocoded crime reports from Chicago. We demonstrate that our approach scales to sample sizes far exceeding the state of the art in fitting LGCPs, and indeed in fitting most Gaussian process regression problems, without extreme simplifying assumptions or approximations. Moreover, we demonstrate that our crime forecasts far outperform predictions made using popular alternatives, and we use the learned structure to gain insights into the fundamental properties of these data.
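
For intuition, here is a small numpy sketch of a one-dimensional spectral mixture covariance of the kind used in this model, in the standard form k(τ) = Σ_q w_q exp(−2π²τ²v_q) cos(2πτμ_q); the weights, means, and variances below are hypothetical (e.g., a weekly and an annual component for crime counts).

    import numpy as np

    def spectral_mixture_kernel(x1, x2, weights, means, variances):
        # 1-D spectral mixture covariance: each of the Q components is a
        # Gaussian in the frequency domain with mean mu_q and variance v_q.
        tau = x1[:, None] - x2[None, :]                 # pairwise lags
        K = np.zeros_like(tau)
        for w, mu, v in zip(weights, means, variances):
            K += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
        return K

    # Hypothetical hyperparameters: a weekly and an annual crime-count component.
    t = np.linspace(0, 60, 200)                         # days
    K = spectral_mixture_kernel(t, t, weights=[1.0, 0.5],
                                means=[1 / 7, 1 / 365], variances=[1e-4, 1e-6])
    print(K.shape, np.round(K[0, :3], 3))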

Causal Inference

In Flaxman et al. (2015b), we consider the use of Gaussian processes for conditional independence tests, a key component of modern causal structure learning and causal orientation algorithms. Kernel-based tests of independence have gained popularity in recent years as a way to deal with non-linear dependencies, but testing for conditional independence remains a challenging problem. We highlight the important issue of non-iid observations: when data are observed in space, time, or on a network, “nearby” observations are likely to be similar, which biases any estimates of dependence between variables. Our solution is to integrate Gaussian process regression (to control for spatial and temporal correlations and obtain residuals) with the Hilbert-Schmidt Independence Criterion, a kernel-based independence test. We illustrate this on two classic datasets, one spatial and one temporal, that are usually treated as iid, and show how properly accounting for spatial and temporal variation can lead to more reasonable causal graphs. We also show how highly structured data, like images and text, can be used in a causal inference framework, using a novel structured input/output Gaussian process formulation. We demonstrate this idea on a dataset of translated sentences, showing that we can accurately predict the source language.
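
A minimal sketch of this two-stage test follows (hypothetical data, and scikit-learn's generic GP regressor in place of our specific models): regress each variable on the shared covariate with a Gaussian process, then apply HSIC to the residuals, assessing significance by permutation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def hsic(x, y, sigma=1.0):
        # Biased HSIC estimate with RBF kernels: trace(K H L H) / n^2.
        n = len(x)
        K = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * sigma**2))
        L = np.exp(-np.subtract.outer(y, y) ** 2 / (2 * sigma**2))
        H = np.eye(n) - np.ones((n, n)) / n
        return np.trace(K @ H @ L @ H) / n**2

    def residual_hsic_test(t, x, y, n_perm=500, seed=0):
        # Regress x and y on the shared covariate t with GPs, then test the
        # residuals for dependence with HSIC via a permutation test.
        rng = np.random.default_rng(seed)
        kernel = RBF() + WhiteKernel()
        T = t[:, None]
        rx = x - GaussianProcessRegressor(kernel).fit(T, x).predict(T)
        ry = y - GaussianProcessRegressor(kernel).fit(T, y).predict(T)
        stat = hsic(rx, ry)
        null = [hsic(rx, rng.permutation(ry)) for _ in range(n_perm)]
        return stat, float(np.mean(np.array(null) >= stat))  # statistic, p-value

    # Hypothetical example: x and y share a trend in t but are independent given t.
    t = np.linspace(0, 10, 100)
    x = np.sin(t) + 0.3 * np.random.default_rng(1).normal(size=100)
    y = np.sin(t) + 0.3 * np.random.default_rng(2).normal(size=100)
    print(residual_hsic_test(t, x, y))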

Change Point Detection

In Herlands et al. (2016), we present a scalable Gaussian process model for identifying and characterizing smooth multidimensional change points, and for automatically learning changes in expressive covariance structure. Our approach differs from standard change point detection methods in that it can model both gradual changes and changes that occur heterogeneously (e.g., in space). We use Random Kitchen Sink features to flexibly define a change surface and expressive spectral mixture kernels to capture the complex statistical structure. Finally, through the use of novel Kronecker methods for additive non-separable kernels, we can scale the model to very large datasets. We demonstrate the model on numerical and real-world data, including a large spatiotemporal disease incidence dataset, where we identify previously unknown heterogeneous changes in space and time. In particular, we demonstrate how the effect of the measles vaccine introduced in the US in 1963 was spatiotemporally varying. Our model automatically discovers the time frame in which the measles vaccine was introduced, and accurately represents the spatially heterogeneous changes in disease dynamics after vaccine introduction, thus providing new insights into the spatial and temporal dynamics of reported disease incidence.
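
A toy sketch of the change-surface ingredient (hypothetical inputs and weights, and far simpler than the full model): Random Kitchen Sink (random Fourier) features define a flexible function of space and time, and a sigmoid of that function gives smoothly varying weights in [0,1] that blend two regimes, so changes can be gradual and spatially heterogeneous rather than abrupt and global.

    import numpy as np

    def rks_features(X, n_features=100, lengthscale=1.0, seed=0):
        # Random Kitchen Sink (random Fourier) features approximating an RBF
        # kernel: phi(x) = sqrt(2/D) * cos(W x + b).
        rng = np.random.default_rng(seed)
        W = rng.normal(0.0, 1.0 / lengthscale, size=(X.shape[1], n_features))
        b = rng.uniform(0.0, 2 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    def change_surface(X, alpha):
        # Sigmoid of a linear function of RKS features: smooth weights in [0,1]
        # that blend two regimes, e.g. pre- and post-vaccine disease dynamics:
        # f(x) = s(x) * f1(x) + (1 - s(x)) * f2(x).
        return 1.0 / (1.0 + np.exp(-rks_features(X) @ alpha))

    # Hypothetical space-time inputs (longitude, latitude, year) and weights:
    X = np.random.default_rng(3).uniform(size=(500, 3))
    alpha = np.random.default_rng(4).normal(size=100)
    s = change_surface(X, alpha)
    print(float(s.min()), float(s.max()))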

Related EPD Lab projects in progress include “Fast hierarchical Gaussian processes”, “Scalable leading indicator selection via bivariate kernel space-time interaction tests”, and “New multivariate point process models for crime and 311 call prediction”. More information about these projects will be available soon.

Key Publications

Prediction and Crime Forecasting:
Seth R. Flaxman, Andrew Gordon Wilson, Daniel B. Neill, Hannes Nickisch, and Alexander J. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. Proceedings of the 32nd International Conference on Machine Learning, JMLR: W&CP 37, 2015a. (pdf)

Causal Inference:
Seth R. Flaxman, Daniel B. Neill, and Alexander J. Smola. Gaussian processes for independence tests with non-iid data in causal inference. ACM Transactions on Intelligent Systems and Technology, 7(2): 22:1-22:23, 2015b. (pdf)

Change Point Detection:
William Herlands, Andrew Gordon Wilson, Hannes Nickisch, Seth Flaxman, Daniel B. Neill, Willem van Panhuis, and Eric P. Xing. Scalable Gaussian processes for characterizing multidimensional change surfaces. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, JMLR: W&CP 51: 1013-1021, 2016. (pdf)