Contributed talks

20 November (Tuesday), 2018

* Contributed talk: Anneli Ågren, SLU.se

Title:  Using High Resolution Digital Elevation Models and Machine Learning to generate Better Maps of Stream Networks and Wet Soils: a case-study of the Swedish forest landscape
Authors: Anneli Ågren, William Lidberg and Mats Nilsson
Abstract: Forest management is conducted using heavy machinery driving on forest soils. Wet soils have low bearing capacity, making them more susceptible to soil disturbance, which results in leaching of nutrients and pollutants to nearby surface waters. In order to protect streams, it is important not to drive on wet soils, especially in close connection to streams. But where are the streams and the wet soils?

Headwaters make up the majority of any given stream network, yet they are poorly mapped; in fact, we show that 81.5% of all running waters in Sweden are missing from today’s maps. Forested wetlands and wet soils near streams and lakes are also often missing from current maps. Better maps are needed as a tool to plan forest management in order to increase production and protect surface waters. Topographical modelling of hydrological features, such as stream networks, the topographic wetness index and depth to water, has been suggested as a solution to this problem. These methods are easy to implement on large scales but are also static and do not take differences in spatial runoff patterns or soil textures into account. Here the Swedish landscape is used as a test bench to 1) investigate how mapping of small stream channels (<6 m width) can be improved using high resolution digital elevation models, and 2) evaluate whether machine learning can be used to create more accurate maps of wet soils.

The best modelled stream channel network was generated by breaching the DEM, calculating the accumulated flow, and extracting a stream network using a 2 ha stream initiation threshold. The best available map of today, the Swedish property map (1:12 500), had a Matthews Correlation Coefficient (MCC) of 0.387, while the MCC for the 2 ha stream channel network was 0.463. The most accurate stream channel network was 4.5 times longer than the currently mapped stream network. While the map is not perfect, it is a clear improvement compared to today’s maps.
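
As an illustration of how such a threshold translates into practice, the sketch below converts a 2 ha initiation area to a cell count, thresholds a flow-accumulation grid and scores the result with MCC. The grid, the reference raster and the 2 m cell size are placeholder assumptions for illustration, not the authors' data or pipeline.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Illustrative sketch (not the authors' pipeline): threshold a D8 flow-accumulation
# grid (in cells) at a 2 ha stream-initiation area and score the extracted stream
# network against a reference raster with the Matthews Correlation Coefficient.

cell_size_m = 2.0                                # assumed 2 x 2 m DEM resolution
cell_area_m2 = cell_size_m ** 2
threshold_cells = 2 * 10_000 / cell_area_m2      # 2 ha = 20 000 m^2 -> 5 000 cells

rng = np.random.default_rng(0)
flow_acc = rng.exponential(scale=2_000, size=(500, 500))      # stand-in for real data
reference = rng.integers(0, 2, size=(500, 500), dtype=bool)   # stand-in mapped streams

predicted_streams = flow_acc >= threshold_cells

mcc = matthews_corrcoef(reference.ravel(), predicted_streams.ravel())
print(f"MCC for the 2 ha network: {mcc:.3f}")
```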

To map the wet soils we used a machine learning approach in which the National Forest Inventory of Sweden was used to train a random forest classifier. The input layers to the model comprise 14 topographically derived terrain indices of wet soils at high resolution (2 × 2 m pixels), along with soil types and runoff data for all of Sweden (in total ca 30 TB of data). The predicted map had a substantial agreement with the field plots (Cohen’s kappa index of agreement = 0.66, overall accuracy = 84%). A visual inspection of the resulting map also agreed well with the authors’ first-hand knowledge of the wet soils in the Krycklan catchment. The new maps show both the previously mapped larger open wetlands (mires and bogs) and the smaller wet areas along stream networks, a clear improvement for planning forest management.
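
A minimal sketch of this kind of workflow is given below, with synthetic stand-ins for the terrain indices, soil type, runoff and NFI plot labels; it illustrates only the random forest training and the kappa/accuracy evaluation, not the authors' actual data handling.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Illustrative sketch only: synthetic stand-ins for 14 terrain indices plus soil
# type and runoff as predictors, and wet/dry field-plot labels as the target.
rng = np.random.default_rng(1)
n_plots = 5_000
X = pd.DataFrame(
    rng.normal(size=(n_plots, 16)),
    columns=[f"terrain_index_{i}" for i in range(14)] + ["soil_type", "runoff"],
)
y = rng.integers(0, 2, size=n_plots)   # 1 = wet soil, 0 = dry soil (placeholder)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("Cohen's kappa:", round(cohen_kappa_score(y_test, pred), 2))
print("Overall accuracy:", round(accuracy_score(y_test, pred), 2))
```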

* Contributed talk: Thomas Hellström, UMU.se

Title: The Reasonable Ineffectiveness of Data 
Abstract:  
We are currently experiencing two complementary approaches to problem solving: model-driven and data-driven. The former, classical, approach has been extremely successful, not least within physics, and was praised by the Nobel laureate Eugene Wigner in the paper The Unreasonable Effectiveness of Mathematics in the Natural Sciences (1960). Wigner described the usefulness of mathematics as bordering on the mysterious, for which there is no rational explanation. The latter approach was praised 49 years later by three Google researchers, including Peter Norvig, in the paper The Unreasonable Effectiveness of Data (2009). The data-driven approach has also been extremely successful at solving practical problems, for instance in data mining, image analysis and speech recognition. The authors suggest that we should stop aiming at elegant theories and instead make use of the enormous amounts of available data to solve problems.

In this talk I will compare the two approaches, showing how certain aspects of ineffectiveness associated with the data-driven approach are quite reasonable and expected. A key aspect is the difference between correlation and causation, and the discussion is inspired by Judea Pearl, who has long pointed to the need for causal models to reach understanding. The ideas are popularized in Pearl’s recent book The Book of Why: The New Science of Cause and Effect (2018).

* Contributed talk: Gunnar Mathiason, HIS.se

Title: Production data analysis for process refinement in hot rolling 
Abstract:  
In the competitive steel manufacturing industry, there is a challenging need to reduce manufacturing costs and make production more flexible, while also increasing product quality. Replacing complete production equipment is very costly, so manufacturers must meet these business challenges by improving their current equipment. For this, there is a need to develop approaches for full-scale analysis of entire production lines, but due to the difficulty and complexity of large-scale data analysis in these settings, production analysis often targets limited sections of the production operations, as incremental improvements. We present an analysis case in which machine learning (ML) algorithms were used in a major Swedish hot rolling mill for a full-scale analysis of data from an entire production line. We aimed for a deeper understanding, by operators and production analysts, of the causes of one major product quality issue. We deployed both standard and new machine learning algorithms on three months of detailed production data, and we developed prediction models that were combined in a prototype as a new production analysis tool.

Two new algorithms were developed specifically for this case. To make use of the clustering of samples in the input space, caused by different production parameters (such as thickness and length), we developed a refinement of the Supervised Growing Neural Gas (SGNG) algorithm with stronger modeling abilities [1], which outperforms alternative algorithms. We also developed an LSTM-based neural network with an attention mechanism [2] to analyze the sequential data describing the in-process product shape, data that was found to be a determining factor for the quality issue studied. Our approach can not only predict the quality issue well; the attention layer also produces a series of weight values over the shape that points out which parts of the shape the classification model relies on.
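
A simplified sketch of such an attention-equipped LSTM classifier is shown below, written in PyTorch with illustrative layer sizes; it is not the authors' model from [2]. The point is that the forward pass returns both the class logits and a weight series over the shape profile.

```python
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    """Sketch of an LSTM classifier with an additive attention layer whose weights
    indicate which parts of the in-process shape profile drive the prediction.
    Layer sizes are illustrative assumptions, not the authors' architecture."""

    def __init__(self, n_features=1, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # one attention score per time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        h, _ = self.lstm(x)                      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, seq_len)
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)   # weighted sum
        return self.head(context), weights       # logits + attention over the shape

# Example: a batch of 8 shape profiles, 200 measurement points each (toy data)
model = AttentionLSTMClassifier()
logits, attn = model(torch.randn(8, 200, 1))
print(attn.shape)   # torch.Size([8, 200]) -- weight series over the profile
```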

By segmenting the full set of production data based on the production line operations, we could train a prediction model at each segmentation boundary, predicting the quality issue from upstream data only at each such boundary. Segments downstream of which the prediction precision increases pinpoint the locations where causes of quality issues arise. Our approach found the second machine operation to be a major contributor to the quality issue, rather than the third machine operation, which had been pointed out as a contributor before examining the data in this way. The segmentation approach also showed that the properties of the incoming material contributed significantly to the quality issue.
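
This boundary-wise idea can be illustrated with the hedged sketch below, which uses synthetic data and a generic classifier: a model is trained on the cumulative upstream features at each boundary, and a jump in score between consecutive boundaries suggests where the cause of the quality issue is likely located.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative sketch: features grouped by production segment; at each boundary a
# model is trained on the cumulative upstream features only. All data are synthetic
# and the segment names are placeholders, not the actual production line layout.
rng = np.random.default_rng(2)
n = 2_000
segments = {
    "incoming_material": rng.normal(size=(n, 5)),
    "machine_1": rng.normal(size=(n, 5)),
    "machine_2": rng.normal(size=(n, 5)),
    "machine_3": rng.normal(size=(n, 5)),
}
# Placeholder label, driven mostly by machine_2 data in this toy example
y = (segments["machine_2"][:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

upstream = []
for name, block in segments.items():
    upstream.append(block)
    X = np.hstack(upstream)              # upstream data only, up to this boundary
    score = cross_val_score(GradientBoostingClassifier(), X, y,
                            cv=3, scoring="roc_auc").mean()
    print(f"after {name:<18} AUC = {score:.2f}")   # jump in AUC flags the segment
```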

In this work we believe that we have somewhat “opened the AI black box” by showing which shapes matter for the classification of quality-issue risk. The operators and the production analysts can now better understand which shapes increase the risk. The collaborative analysis approach of the project team shows how findings from data can be translated all the way into actual production refinement suggestions. Further, this study serves as a case for how the metal manufacturing industry can capitalize on its large production databases to gain new insights for increased competitiveness.

* Contributed talk: Jörgen Carlsson, UmeåEnergi.se

Title: Umeå Energi AB- Inside the Smart City development of Umeå
Abstract: All utilities of today, and certainly Umeå Energi AB, face huge challenges from the ongoing digitalization and the transformation of the energy business as a whole. Customers of tomorrow, and indeed of today, demand a more open approach and possibilities to co-create on our common infrastructure platforms, such as district heating and cooling, power grids and more.
Umeå Energi, the municipally owned utility of Umeå, has taken it upon itself to lead the change of Umeå towards an open platform for integrating the energy business of tomorrow. The business opportunities of the Smart City lie in focusing on the aspiration to reduce the climate impact of customers, leading the work of integrating mass data harvesting and utilization, smart-contract business architectures and more.
The key is to nail the value proposition of tomorrow’s societies. In trying to achieve this, we are currently working on several projects aiming for Smart City development, such as RUGGEDISED, Den Koldioxidsnåla Platsen, 5G testbeds and more.
We are now moving into the next phase of development, aiming to implement testbeds for energy cooperation.


21 November (Wednesday), 2018

* Contributed talk: Philip Buckland, UMU.se

Title: Data Analysis in Palaeoecology and Environmental Archaeology 
Abstract: 
Large-scale databases for the storage, management and dissemination of palaeoenvironmental data are becoming increasingly important in a range of fields, from archaeology to climatology and conservation science. By including data from a very large number of sites in a single data source, they facilitate a level of spatial and temporal aggregation which would otherwise be unfeasibly time consuming. Open Access databases also promote data re-use, transparency and interdisciplinary collaboration spanning the humanities and natural sciences. Although the amounts of data used in palaeoecology are often regarded as insufficient to be worthy of the title Big Data, we now have the data and tools available to investigate empirically large-scale patterns of environmental change, spanning thousands of years, in an efficient, reproducible and accessible manner. By integrating or linking these data with others from ecology, history and even literature studies, we can increase the potential and power of investigations enormously. This presentation will demonstrate some ongoing research using the Strategic Environmental Archaeology Database (SEAD), a research data infrastructure for palaeoenvironmental and archaeological data, and part of an international network of databases for studying past climates, environmental change and people.

* Contributed talk: Niclas Ståhl, HIS.se

Title: The need for Bayesian models in the search for novel medical drugs 
Authors:  
Niclas Ståhl, Jonas Boström, Göran Falkman, Alexander Karlsson and Gunnar Mathiason
Abstract: 
The search for novel molecules that can act as drugs is often a very costly process. Therefore, much of the recent research effort within the area of drug discovery has been devoted to the development of new data-driven methods and techniques to make the process more effective. To this end, machine learning models, such as neural networks, have been used since, at least, the pioneering work of Gasteiger and Zupan [1]. A recent trend is to use generative models, such as the work of Olivecrona et al. [2], to generate novel molecules. These models are typically trained through reinforcement learning, where the model is optimized towards a given target. Since the properties of the generated molecules are unknown, an additional model for the evaluation of the properties is needed. A common solution is to use a machine learning model that is trained on a set of molecules with known properties [2, 3]. A drawback with this is the risk that the generative model can drift away from the initial distribution used to train the predictive model [4]. As a consequence, the predictive model is forced to extrapolate. Hence, when this occurs there is no guarantee of the validity of the predictions. Although this problem is pertinent to all problems where generated data have to be evaluated, it is of particular concern when it comes to generated molecules, since these are positioned in a non-Euclidean space and it is more difficult to determine whether a given model is extrapolating.

In this paper, we show a possible solution to the problem of drift in the generative model by the use of deep Bayesian neural networks [5], which allow for the quantification of model uncertainty. This enables the model to both internally represent and externally present when it is performing uncertain predictions due to extrapolation. To demonstrate the utility of such models, we have conducted three separate experiments concerning molecules with known equilibrium constants with dopamine D2 and D4. Within this set of molecules, there are two sub-groups: tricyclic and non-tricyclic molecules. In the first experiment, a deep Bayesian neural network is trained on a mix of these two subsets, resulting in a model that can predict the properties of new molecules from both sets with good accuracy. In the two following experiments, the deep Bayesian neural network is instead trained on only one of the subsets, respectively. This results in models which perform well within the same subgroup of molecules used for training, while performing poorly at predicting the properties of molecules in the other set. Our results show that the uncertainty of the trained models was clearly coupled to their accuracy, and thus it is possible to detect when the model is extrapolating. This shows the importance of staying in the same domain as the training data when generative models are used in combination with predictive models, and of having mechanisms for detecting when the generative model drifts.
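
As a hedged illustration of how model uncertainty can flag extrapolation, the sketch below uses Monte Carlo dropout, one common approximation of a deep Bayesian neural network; it is not the authors' implementation from [5], and the input dimension merely stands in for molecular descriptors.

```python
import torch
import torch.nn as nn

# Sketch of uncertainty quantification with Monte Carlo dropout (an assumed,
# common approximation of a deep Bayesian neural network, not necessarily the
# authors' model). High predictive variance flags inputs where the model is
# extrapolating, e.g. generated molecules far from the training distribution.

class MCDropoutRegressor(nn.Module):
    def __init__(self, n_features=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def predict_with_uncertainty(model, x, n_samples=50):
    model.train()                            # keep dropout active at prediction time
    samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and uncertainty

model = MCDropoutRegressor()
x = torch.randn(4, 128)                      # placeholder molecular descriptors
mean, std = predict_with_uncertainty(model, x)
print(std.squeeze())                         # large std -> treat prediction as unreliable
```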

* Contributed talk: Amira Soliman, RISE SICS.se

Title: Graph-based Analytics for Decentralized Online Social Networks
Authors:
 Amira Soliman, Sarunas Girdzijauskas
Abstract: 
Decentralized Online Social Networks (DOSNs) have been introduced as a privacy-preserving alternative to the existing Online Social Network frameworks. DOSNs remove the dependency on a centralized provider and operate as distributed information management platforms. The main objective behind decentralization is to preserve users’ privacy in both shared content and communication. Social networks can be considered one of the earliest domains to shed light on users’ privacy issues in today’s digital world, yet user privacy regulations are by no means limited to social networks.

The General Data Protection Regulation (GDPR) represents one of the efforts by the European Union (EU) to strengthen and unify data protection for EU citizens. Accordingly, data management and analytics services need to adapt in order to follow privacy-preserving regulations. Thus, decentralized data processing is the only way to provide privacy-preserving data analytics for DOSNs.

In this talk, we present algorithms for providing privacy-preserving data analytics for DOSNs. Our proposed algorithms access the data locally and require no global knowledge about the social graph; hence they conform to the DOSN requirements of preserving user privacy in both the data and network dimensions. Furthermore, our algorithms follow an unsupervised machine learning scheme, thus removing the need for collecting labeled training data, which causes privacy violations. We also combine graph analytics with machine learning. Through this integration, different applications can be provided that are capable of analyzing autonomous data sources, as well as user interactions, in a distributed and decentralized way that suitably fits DOSNs. Moreover, our graph-based analytics go beyond current data mining methods that extract patterns from textual data, and leverage the interlinked nature of social networks, which connects data with network topology. Specifically, we use community detection as our core analytics component for analyzing users’ topological interactions, and we introduce a community-aware learning mechanism that extracts patterns for each individual community, where the correlations are more pronounced. Additionally, our proposed methods can be executed efficiently and effectively in decentralized environments. Our solutions follow a node-centric programming paradigm, such that the nodes of the social network require only local information and perform only local operations. Finally, our graph-based analytics provide basic components that can be combined to provide several services for DOSNs, such as identity validation and spam detection.
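
As a minimal illustration of the node-centric paradigm (an assumption for illustration, not the authors' algorithm), the sketch below runs asynchronous label propagation for community detection, in which every node reads only its neighbours' labels and no global view of the graph is needed.

```python
import random
from collections import Counter

# Node-centric sketch: each node repeatedly adopts the majority label among its
# neighbours, so community structure emerges from purely local operations.

def label_propagation(adjacency, n_rounds=20, seed=0):
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}   # every node starts in its own community
    for _ in range(n_rounds):
        nodes = list(adjacency)
        rng.shuffle(nodes)
        for node in nodes:
            neighbour_labels = Counter(labels[n] for n in adjacency[node])
            if neighbour_labels:
                labels[node] = neighbour_labels.most_common(1)[0][0]  # adopt majority label
    return labels

# Toy social graph with two obvious communities
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(graph))
```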

* Contributed talk: Sabri Pllana, LNU.se

Title: Workload Distribution on Heterogeneous Systems based on Meta-heuristics and Machine Learning: A Performance and Energy Aware Approach
Authors:  
Suejb Memeti and Sabri Pllana
Abstract: 
Due to their ability to provide high performance and energy efficiency, accelerators (including GPUs, FPGAs and the Intel Xeon Phi) are being used collaboratively with conventional general-purpose CPUs to increase the overall performance and energy efficiency of applications. Some of the most powerful supercomputers in the TOP500 list are heterogeneous at the node level.

The increase in performance of the top supercomputers in the world is often associated with an increase in power consumption. For example, from June 2008 to June 2016 the average power consumption of the top 10 supercomputers increased from 1.33 MW to 8.88 MW. The increased power consumption has raised the demand from researchers to consider power-aware approaches that optimize energy consumption with no or minimal impact on the performance of parallel applications. As a result, in 2018 the average power consumption has decreased to 6.12 MW while performance continues to increase.
Utilizing the computational power of all the available resources (CPUs + accelerators) in heterogeneous systems is essential to achieve good performance. However, due to the different performance characteristics of their processing elements, achieving a good workload distribution across multiple devices in heterogeneous systems is non-trivial. Optimal system configurations that result in the highest throughput do not necessarily provide the most power-efficient results. Furthermore, the optimal system configurations are likely to change for different applications, input problem sizes and available resources. Determining the optimal system configuration using brute force may be prohibitively time-consuming.
We combine meta-heuristics and machine learning to determine near-optimal system configuration parameters with regard to performance and energy consumption. Meta-heuristics are used to search for the optimal system configuration in the given parameter space, whereas machine learning is used to evaluate the proposed configurations. To evaluate our approach, we use a parallel application for DNA sequence analysis on a heterogeneous platform that comprises two 12-core Intel Xeon CPUs and an Intel Xeon Phi co-processor with 61 cores. The results show that by using only about 5% of the total number of experiments required by enumeration, we can determine near-optimal system configurations.
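
The general idea can be sketched as follows, under the assumption of a toy configuration space and a synthetic cost function: a surrogate model is trained on a small sample of measured runs, and simulated annealing then searches the configuration space using the surrogate instead of further measurements. This is an illustrative sketch, not the authors' implementation.

```python
import math
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hedged sketch: a meta-heuristic (simulated annealing) explores (threads,
# affinity, chunk) configurations, while an ML surrogate trained on a small
# sample of "measured" runs predicts the objective (e.g. runtime or energy).

rng = np.random.default_rng(3)

def true_cost(cfg):                       # placeholder for an actual measured run
    threads, affinity, chunk = cfg
    return (threads - 40) ** 2 + 5 * affinity + abs(chunk - 16) + rng.normal(0, 2)

# 1) Train a surrogate on a small set of measured configurations
measured = [(rng.integers(1, 61), rng.integers(0, 3), rng.integers(1, 65)) for _ in range(100)]
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(measured, [true_cost(c) for c in measured])

def predicted_cost(cfg):
    return surrogate.predict([cfg])[0]

# 2) Simulated annealing over the configuration space, evaluated by the surrogate
current = best = (8, 1, 8)
temperature = 10.0
rand = random.Random(0)
for step in range(500):
    candidate = (rand.randint(1, 60), rand.randint(0, 2), rand.randint(1, 64))
    delta = predicted_cost(candidate) - predicted_cost(current)
    if delta < 0 or rand.random() < math.exp(-delta / temperature):
        current = candidate
        if predicted_cost(current) < predicted_cost(best):
            best = current
    temperature *= 0.99

print("near-optimal configuration (threads, affinity, chunk):", best)
```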
The key contributions of our work include:
– A meta-heuristic-based optimization approach to explore the large system configuration space.
– A supervised machine learning approach to evaluating the performance and energy consumption of data-parallel applications.
– An approach that combines the meta-heuristics with machine learning to determine near-optimal system configurations.
– Joint consideration of performance and energy consumption.
– Experimental evaluation of our approach with a DNA sequence analysis application.

Note: Our preliminary results have been published in the following papers:
1. Memeti, S., & Pllana, S. (2017). Combinatorial optimization of DNA sequence analysis on heterogeneous systems. Concurrency and Computation: Practice and Experience, 29(7), e4037.
2. Memeti, S., & Pllana, S. (2018). A machine learning approach for accelerating DNA sequence analysis. The International Journal of High Performance Computing Applications, 32(3), 363-379.

* Contributed talk: Ruben Buendia, HB.se

Title:  Probabilistic Prediction for Accurate Hit Estimation in Iterative Screening
Abstract: Iterative screening (IS) has emerged as a promising approach to increase the efficiency of high-throughput screening (HTS) campaigns in early-stage drug discovery projects. In this approach, selected hits from a given round of screening are used to enrich a compound activity prediction model for the next iteration. By learning from a subset of the compound library, predictive models can infer which compounds to screen next. This approach is referred to as Quantitative Structure-Activity Relationship (QSAR) modelling, where the compound chemical structures are used to derive the predictor variables, or features, and the compound activities are the target variables, or labels.

One of the challenges of iterative screening is choosing the portion of the compound library to be screened at each iteration. This choice cannot be approached by traditional machine learning methods. To solve this problem, we propose a method that provides accurate estimates of the number of hits that would be retrieved in each iteration of an IS campaign. The method applies a novel mathematical framework, Inductive Venn-ABERS predictors (IVAP). Venn-ABERS predictors, applied on top of a scoring classifier, offer a way to assign well-calibrated probabilities to predictions under standard assumptions about data generation. In their inductive form, IVAPs make use of a calibration set of instances that cannot be utilized for training.
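
A naive reference sketch of the Venn-ABERS construction is given below (illustrative only, not the implementation used in the study): for each test score, isotonic regression is fitted on the calibration scores augmented with the hypothetical labels 0 and 1, giving the interval [p0, p1] and a merged probability. Summing such calibrated probabilities over the compounds considered for the next iteration is one way to estimate the expected number of hits.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Naive Venn-ABERS sketch: for a test score s, fit isotonic regression on the
# calibration scores augmented with (s, 0) and with (s, 1); the two fitted values
# at s give [p0, p1], which can be merged into a single calibrated probability.

def venn_abers(cal_scores, cal_labels, test_score):
    fitted = []
    for hypothetical_label in (0, 1):
        scores = np.append(cal_scores, test_score)
        labels = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(scores, labels)
        fitted.append(iso.predict([test_score])[0])
    p0, p1 = fitted
    return p0, p1, p1 / (1.0 - p0 + p1)      # interval and merged probability

# Toy calibration set: raw classifier scores paired with observed activity labels
rng = np.random.default_rng(4)
cal_scores = rng.uniform(size=200)
cal_labels = (cal_scores + rng.normal(0, 0.2, size=200) > 0.5).astype(int)

p0, p1, p = venn_abers(cal_scores, cal_labels, test_score=0.8)
print(f"p0={p0:.2f}, p1={p1:.2f}, merged probability={p:.2f}")
```
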
The method was retrospectively evaluated on data from six full HTS campaigns against six different biological targets (over two million different compounds in total). The targets were chosen to represent typical cases with high, moderate and low proportions of hits (two targets for each case). The proposed method showed the ability to provide accurate estimates of the number of hits that would be retrieved at each iteration. It therefore gives IS campaigns efficient control over the number of compounds to be tested and the resulting expected enrichment. Secondly, the enrichment offered by the method, in terms of hits and their diversity, was evaluated against random selection. Here, the method exhibited a great ability to produce enriched subsets for all six targets.
The price for the value that the present method might add to drug discovery projects is the cost of screening the compounds in the calibration set (which can be economically significant). However, in a recently accepted paper, it was shown that IVAPs can be estimated using out-of-bag calibration samples. For that reason, a Random Forest combined with an IVAP in which the calibration set consists of out-of-bag samples has been implemented and is currently under test. If successful, the information obtained would be 'for free'.

* Contributed talk: Klara Leffler, UMU.se

Title:  Intelligent data sampling promotes accelerated medical imaging: sharper positron emission tomography
Authors:  Klara Leffler, Ida Häggström and Jun Yu
Abstract: Positron emission tomography (PET) is an imaging modality widely used in oncology, neurology and cardiology. It has the ability to produce images that map functional processes in the body, e.g. to locate the activity of cancerous tumours. On the positive side, PET data have high sensitivity and high spatial resolution. On the negative side, PET imaging relies on radioactive decay within the body and time-consuming data acquisition, and the resulting images are highly corrupted by noise. At the same time, the systems are heavily oversampled with respect to the information relevant for the purpose at hand. Since images are compressible, one might wonder whether it is necessary to acquire the amount of data that we do today. Could we not simply measure the compressed information from a subsample of the data and still reconstruct images of the same quality as for the fully sampled data? Reducing the amount of required data promotes an accelerated imaging process, which opens up for a reduction in sampling time and radioactive dosage, while simultaneously allowing for less noisy, higher-resolution images.

We propose an intelligent data sampling method for PET imaging based on a wavelet sparsification of the PET images, followed by compressed sensing reconstruction from undersampled PET data. The key ingredient of the compressed sensing technique is an assumption of sparsity in the unknown image, which allows us to reconstruct complete images from a significantly smaller set of measurements. PET images are not naturally sparse but implicitly sparse; more specifically, they are known to be sparse in the wavelet domain.
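
A toy sketch of this kind of reconstruction is shown below, with a random Gaussian operator standing in for the actual PET acquisition model: ISTA-style iterations alternate a data-consistency step with soft-thresholding of the wavelet coefficients. It illustrates the principle only, not the proposed method.

```python
import numpy as np
import pywt

# Toy compressed-sensing sketch (illustrative only): a random Gaussian operator A
# replaces the real PET system model, and iterative soft-thresholding enforces
# sparsity of the image in the wavelet domain while fitting few measurements.

rng = np.random.default_rng(5)
n_side = 32
image = np.zeros((n_side, n_side))
image[8:14, 10:20] = 1.0                              # simple "hot" region stand-in

n_pixels = n_side * n_side
n_meas = n_pixels // 3                                # heavily undersampled
A = rng.normal(size=(n_meas, n_pixels)) / np.sqrt(n_meas)
y = A @ image.ravel()                                 # simulated undersampled data

def shrink_wavelet(img, thresh, wavelet="haar", level=3):
    """Soft-threshold the wavelet coefficients of img (promotes wavelet sparsity)."""
    coeffs = pywt.wavedec2(img, wavelet, mode="periodization", level=level)
    arr, slices = pywt.coeffs_to_array(coeffs)
    arr = np.sign(arr) * np.maximum(np.abs(arr) - thresh, 0.0)
    rec = pywt.waverec2(pywt.array_to_coeffs(arr, slices, output_format="wavedec2"),
                        wavelet, mode="periodization")
    return rec[: img.shape[0], : img.shape[1]]

step = 0.9 / np.linalg.norm(A, 2) ** 2                # ISTA step size below 1/L
lam = 0.02
x = np.zeros(n_pixels)
for _ in range(300):
    x = x + step * (A.T @ (y - A @ x))                # gradient step on the data fit
    x = shrink_wavelet(x.reshape(n_side, n_side), lam * step).ravel()

print("relative error:", np.linalg.norm(x - image.ravel()) / np.linalg.norm(image))
```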

The application possibilities are promising. Magnetic resonance imaging has proven to be a prominent application area, where combinations of compressed sensing with total variation regularisation and wavelet sparsifying transforms have shown convincing results. We aim to lead PET imaging down the same path. The overall efficiency of PET imaging can be improved via reduced radioactive dosage and scan times, without losing image quality, or even while simultaneously reducing image noise. This opens up for a technique that can offer sharper images, which in turn will better aid clinical diagnostics.