Posters

15:00 – 16:20,  20 November (Tuesday), 2018

Poster Presentation & FIKA:  MIT-place, MIT-huset (one floor up from the lecture room)

(listed in alphabetical order by title)

 

* (Nr. 1) Title:   A Minimum Spanning Tree Clustering Approach for Mining Sequence Datasets
Authors: Shahrooz Abghari, Veselka Boeva, Niklas Lavesson, Håkan Grahn, Selim Ickin and Jörgen Gustafsson
Abstract: We propose an unsupervised approach for outlier detection in sequence datasets. Outlier detection has been studied in many domains; outliers arise for different reasons, such as mechanical issues, fraudulent behavior, and human error. Our approach consists of a preprocessing step and three main steps: 1) sequential pattern mining, 2) frequent sequential pattern clustering, and 3) minimum spanning tree (MST) building and outlier detection analysis.
In the preprocessing step, data segmentation, the data is partitioned into equal-sized segments in order to identify sequential patterns. The first step, sequential pattern mining, concerns the extraction of frequent sequential patterns and the mapping of these patterns to the records of the sequence dataset.

The PrefixSpan algorithm is used to find frequent sequential patterns in each segment. The extracted patterns can lead us to collective outliers. Furthermore, the extracted patterns are mapped to the sources they come from, which provides additional information about the patterns, such as a pattern's frequency and its occurrence time. The latter is useful for finding contextual outliers. In the second step, frequent sequential pattern clustering, the selected patterns are clustered by applying the affinity propagation (AP) algorithm. AP can estimate the number of clusters from the data. In the third step, minimum spanning tree building and outlier detection analysis, the exemplars of the clusters are used to build a complete weighted graph, where the vertices are the exemplars and the edges are weighted by the distances between them. The aim is to determine a subset of edges that connects all the vertices, without any cycles, with the minimum total edge weight. In order to identify outliers, the longest edge of the tree is removed, and the constructed MST is replaced by the resulting sub-trees. The sub-trees are ranked from smallest to largest based on the number of records they match within the sequence dataset. Here the smallest sub-trees can be regarded as outliers.
The proposed approach can support domain experts in the identification of outliers. Building the minimum spanning tree on top of the clustering solution makes it possible to identify clusters of outliers, which reduces the time complexity of the approach. The proposed approach has been evaluated in two different experimental scenarios, namely on two different sequence datasets: smart meter data and video session data. Both datasets contain sequences of event types that show either the operational status of a smart meter or the actions taking place in a viewer's video session. The results of the experiments on the smart meter data are more comprehensible than those on the video session data. The main reason is that the event types in smart meters are explicitly detailed, explaining the status of the devices, whereas the event types in video session data are general, which requires more investigation and expert knowledge in order to detect video sessions with quality issues. Validation of the results on the video session data by the domain experts showed that 67% of the sessions labeled by the proposed approach were correct.
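
A minimal sketch of the third step (MST building and outlier analysis), under some simplifying assumptions: the AP exemplars are given as plain feature vectors, Euclidean distance is used, and sub-trees are ranked by their number of vertices rather than by the number of matched records as in the actual method.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

exemplars = np.random.rand(10, 5)            # placeholder for the AP cluster exemplars

dist = squareform(pdist(exemplars))          # complete weighted graph on the exemplars
mst = minimum_spanning_tree(dist).toarray()  # edge subset, no cycles, minimum total weight

# Remove the longest MST edge, splitting the tree into sub-trees.
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0.0

n_trees, labels = connected_components(mst, directed=False)
sizes = np.bincount(labels, minlength=n_trees)

# Rank sub-trees from smallest to largest; the smallest ones are outlier candidates.
ranking = np.argsort(sizes)
print("outlier-candidate exemplars:", np.where(labels == ranking[0])[0])
```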

* (Nr. 2) Title:  Automated surface finish defect detection using a statistical learning approach
Authors:  Natalya Pya Arnqvist, Blaise Ngendangenzwa, Leif Nilsson, Eric Lindahl and Jun Yu
Abstract: Automated and intelligent manufacturing processes are highly desired for increasing production efficiency and product quality as well as for decreasing labor costs. Surface treatment of commercial vehicle bodies, such as trucks, cars and buses, is highly automated except for the inspection and evaluation of what to repair and adjust. In particular, the paint quality inspection process is still mainly performed manually by most automotive manufacturers worldwide, including AB Volvo. At the paint shop in the Volvo cab plant in Umeå, every cab from the body shop goes through multiple stages where fully automated robots and a group of color technicians work together, and a computerized tracking system monitors every cab through each painting stage. However, the final step in the paint shop is a human visual and tactile inspection which ensures that there are no missed spots, irregularities, bumps, scratches or uneven surfaces. Human vision and touch are still important for such inspection, but they are susceptible to inconsistency due to unavoidable human error, limited time for the inspection of each cab, and difficulties in finding defects owing to their micrometer size or less accessible locations on the surface; there are also defects that are only visible at particular viewing angles. This joint research project between Umeå University, AB Volvo and Volvo Cars develops and implements automated inspection and detection of surface finish defects using a statistical learning approach. Rather than analyzing natural images of the painted surfaces of the cabs, reflections of sinusoidal patterns on those surfaces are captured and used in a classification problem for defect detection. Five types of feature descriptors have been analyzed in this work: histograms of oriented gradients, local binary patterns, 2D wavelet transforms, P-spline smoothing features, and features based on variabilities. A comparative analysis of the performance of classifiers based on these features, including Support Vector Machines and Random Forests, has been carried out. Moreover, probability-based performance evaluation metrics have been proposed as alternatives to the conventional metrics. The use of probability-based metrics allows for uncertainty estimation of the predictive performance of a classifier.
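
As an illustration of one descriptor/classifier pair from the list above, the sketch below feeds histogram-of-oriented-gradients features to a support vector machine. The images, labels, and HOG parameters are placeholders, not the project's actual pipeline.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def hog_features(images):
    # One HOG descriptor per captured reflection image of the sinusoidal pattern.
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

images = np.random.rand(200, 64, 64)     # stand-in for reflection images
labels = np.random.randint(0, 2, 200)    # 1 = defect, 0 = defect-free

X_train, X_test, y_train, y_test = train_test_split(
    hog_features(images), labels, test_size=0.25, random_state=0)

# probability=True yields class probabilities, which probability-based
# evaluation metrics of the kind mentioned above would operate on.
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```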

* (Nr. 3) Title:  Data-driven Multiscale DDoS Detection based on Total-variation Metric
Authors: Monowar Bhuyan and Erik Elmroth
Abstract:  The rapid growth in scale and sophistication of distributed denial-of-service (DDoS) attacks is inducing failures and collateral damage at a higher rate than before, generating extensive numbers of alarms per day for popular internet services. DDoS attacks are regarded as one of the most substantial threats to modern Internet services. DDoS attacks exhibit two significantly different characteristics concerning traffic-rate dynamics, namely low-rate and high-rate attacks. Low-rate attacks exploit weaknesses of protocols rather than just exhausting network resources like flooding attacks do, and can evade defense systems because their traffic resembles legitimate traffic. High-rate attacks attempt to exhaust the resources of anything from individual systems to datacenters, with traffic resembling flash crowds. Both types of attack result in Quality-of-Service (QoS) degradation and economic loss. In cloud datacenters, service owners often auto-scale the amount of resources used for running a service in order to manage load variations, which may come from either legitimate traffic or attack traffic. Taking advantage of how systems are managed in a datacenter, intruders follow four different ways to penetrate a target: (i) frequency-rate variations, (ii) burst-peak variations, (iii) burst-width variations, and (iv) random variations. Intruders use traffic scale-in and scale-out strategies based on operational observations of datacenters to direct attacks at a target. As such attacks are distributed and coordinated, each compromised system scales its traffic rate with malformed packets towards a target over a period of time, resulting in a multi-scale attack.

We aim to develop a mechanism to detect multi-scale DDoS attacks using a generalized total variation metric. The proposed metric is highly sensitive to different variations in the network traffic and yields a larger separation between legitimate and attack traffic than other detection mechanisms. Most low-rate attackers evade the security system by scaling periodic packet bursts in and out towards the bottleneck router, which severely degrades the Quality of Service (QoS) of TCP applications. Our proposed mechanism can effectively identify attack traffic of this nature, despite its similarity to legitimate traffic, based on the spacing value of our metric. We have evaluated our mechanism using datasets from CAIDA DDoS, MIT Lincoln Lab, and real-time testbed traffic. Our results demonstrate that our mechanism exhibits good accuracy and scalability in the detection of multi-scale low-rate DDoS attacks.
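
The generalized metric itself is not specified in the abstract; as a rough illustration of the underlying idea, the sketch below computes the classical total variation distance between normalized traffic-rate histograms, where a larger value signals traffic that deviates from the legitimate profile.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two discrete distributions: 0 (identical) to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

bins = np.linspace(0, 1000, 51)   # packets-per-second buckets
legit_a = np.histogram(np.random.normal(300, 60, 5000), bins)[0].astype(float)
legit_b = np.histogram(np.random.normal(300, 60, 5000), bins)[0].astype(float)
bursty = np.histogram(np.random.normal(650, 40, 5000), bins)[0].astype(float)

# Two legitimate samples are close; periodic low-rate bursts stand out.
print("TV(legit, legit): ", total_variation(legit_a, legit_b))
print("TV(legit, bursty):", total_variation(legit_a, bursty))
```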

* (Nr. 4) Title:   Electricity Consumption Profiling of Individual Households using Clustering Analysis
Authors: Christian Nordahl, Veselka Boeva, Håkan Grahn and Marie Netz
Abstract: Today, much of what we do as individuals in our homes incorporate the use of electrical appliances.
Being able to determine whether a resident is adhering to their normal routine or deviating from the norm, solely by analyzing their electricity consumption, would be beneficial to many different application domains, with elder care and home care systems being prime examples.
With the adoption of smart meters, we now have an opportunity for remote monitoring of residents at low cost, with no intrusion of their privacy, and with easy deployment.
In this study, we have performed a cluster analysis on a set of households to determine if a normal behavior could be identified.
We have used k-medoids with two different distance metrics, Euclidean distance and Dynamic Time Warping, to see how the characteristics of the households’ daily electricity consumption patterns are captured.
The produced results show that we can identify a number of different signatures that reflect how the individuals in the household are normally behaving in their everyday life.
We plan to apply the proposed approach on a future data collection study to be able to more clearly identify, visualize, and analyze the individual households’ electricity consumption behaviors, and use the created signatures to determine if it is possible to detect abnormal behavior.
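
A minimal sketch of the clustering step, assuming daily consumption profiles as plain arrays: a naive dynamic-programming DTW and a PAM-style k-medoids over the resulting distance matrix. The data shapes and k are placeholders.

```python
import numpy as np

def dtw(a, b):
    """Naive O(len(a)*len(b)) Dynamic Time Warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def k_medoids(dist, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)    # assign to nearest medoid
        new = []
        for c in range(k):
            members = np.where(labels == c)[0]
            within = dist[np.ix_(members, members)].sum(axis=0)
            new.append(members[within.argmin()])        # most central member
        if set(new) == set(medoids):
            break
        medoids = np.array(new)
    return medoids, labels

days = np.random.rand(60, 24)   # 60 daily profiles of hourly consumption
dist = np.array([[dtw(a, b) for b in days] for a in days])
medoids, labels = k_medoids(dist, k=3)
print("daily-signature medoids:", medoids)
```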

* (Nr. 5) Title:  Explainability in Different Explainable AI Domains
Authors: Sule Anjomshoae, Amro Najjar and Kary Främling
Abstract:  Explainable artificial intelligence (XAI) is a growing field that helps users understand their intelligent agents' behavior and reasoning in order to support effective interaction. Although a large body of research deals with explanations and intelligibility, definitions for explainable AI seem underspecified. As works on producing explanations come from multiple fields, this study seeks to gain insight into how different AI communities use these terms interchangeably. Prior to commencing the review, the studies working on XAI are divided into data-driven and goal-driven ones for the purpose of analysis. The data-driven studies mainly consider opening black-box systems and understanding how machine learning algorithms reach a particular prediction, whereas the goal-driven studies emphasize explaining the behavior, actions, and decisions of an autonomous agent (e.g., agents, robots). The findings show that in the machine learning community there appears to be some agreement on the concept of interpretability to explain the cause of a decision [1]. The use of the term interpretability is sometimes equated with transparency (the opposite of black-box-ness), which connotes some sense of understanding the components and mechanisms of how the model works [2]. However, it has been argued that since not all models are readily interpretable, the term justification was introduced for explaining why a decision is good, without explaining the process of the actual decision-making [3]. Another important finding is that there is no agreed definition of what constitutes comprehensibility. While some describe comprehensibility as understanding the data which led to the decision, others refer to it as the presentation of natural language explanations (text and visualization) [4, 5]. On the other side of the spectrum, numerous terms are used to describe explainability in the goal-driven studies as well. One line of research has emphasized understandability, referring to the state of mind of the AI in terms of its beliefs, desires, intentions, emotions, perceptions, capabilities, and limitations [6]. In another study, the notion of plan explicability was introduced for reducing the mismatch between the mind of the robot and that of the human, by generating plans that match the human's expected plan [7, 8]. Several other notions of explainability, such as readability, legibility, predictability, transparency, and intelligibility, have also been proposed for communicating a robot's goals. These terms are often used interchangeably and without precision when referring to predicting and understanding a robot's goals, actions, and emotions [6]. Consequently, the fact that different research communities work on XAI results in variations in the use of these terms, even though they may refer to similar practices. To overcome this problem, this research proposes to conduct a review of explanations from both goal-driven and data-driven studies in order to identify the properties of an explanation and provide common-ground definitions. The connections between definitions of explanations and classifications of terms within and between domains will be presented in a framework. The proposed framework would be an initial step toward reducing the divergence in terminology and facilitating collaboration across different XAI domains by providing unifying terms.

* (Nr. 6) Title:  Fault detection using deep learning: a case study in 5G
Authors: Tobias Sundqvist, Monowar H. Bhuyan and Erik Elmroth
Abstract:  We surround ourselves with wireless devices that aid us in our daily lives: smart-phones, smart-watches, GPS units, tablets, and many other devices. Some of these use radio technology to communicate with other devices through a system called the Radio Access Network (RAN). To support our increasing demands for new services, bandwidth, latency, and reliability, the RAN has become an extensive and complex system. Parts of the RAN now need to be virtualized and distributed to meet the 5G standard [1], [2], which will increase its complexity even more. The larger a system is, the more likely it is that something fails. It might be acceptable to occasionally lose the connection during a phone call or a movie, but for a self-driving car or robotic surgery, a failure can be lethal.
Engineers make the RAN more reliable by testing many different scenarios and evaluating whether the RAN behaves normally or not. This task is very challenging, since there is a lot of data to analyse. To aid the RAN engineers, our research focuses on how machine learning techniques, such as variants of Long Short-Term Memory (LSTM) [3], can be used to analyse the large volumes of RAN data and automatically detect abnormal RAN behaviour.
Large telecom vendors such as Ericsson, Nokia and Huawei are still developing the 5G RAN, so there is no real system that we can access data from. Instead, we used a test bed developed by Tieto in Umeå. Tieto collaborates with many large telecom vendors and has deep knowledge of building RANs. The test bed simulates a city with a 5G network in which 10,000 citizens live. It is possible to simulate different kinds of scenarios for the citizens and at the same time collect metrics from the 5G traffic.
Preliminary results comparing test cases with and without faults show that it is possible to use LSTM to detect faulty behaviour. Our experiments focus on the identification and detection of various faults in RANs using LSTM. An important part is also to determine which RAN metrics are needed and how often they need to be collected in order to detect the faults. By finding faults in the RAN earlier and more quickly with LSTM networks, we can eliminate many faults and make the future 5G network more reliable, and perhaps then we can put our lives in the hands of autonomous cars or robots.
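
A minimal sketch of this kind of LSTM-based detection, assuming (our simplification, not necessarily the testbed setup) that the model is trained to predict the next vector of RAN metrics from a sliding window over fault-free traces, and that large prediction errors at test time flag abnormal behaviour:

```python
import numpy as np
import tensorflow as tf

WINDOW, N_METRICS = 20, 8
normal = np.random.rand(5000, N_METRICS).astype("float32")   # fault-free traces

def windows(series):
    X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    return X, series[WINDOW:]

X, y = windows(normal)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, N_METRICS)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(N_METRICS),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

# At test time, large next-step prediction errors indicate abnormal behaviour.
X_test, y_test = windows(np.random.rand(500, N_METRICS).astype("float32"))
errors = np.mean((model.predict(X_test, verbose=0) - y_test) ** 2, axis=1)
threshold = errors.mean() + 3 * errors.std()
print("suspect time steps:", np.where(errors > threshold)[0])
```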

* (Nr. 7) Title:  HSTREAM: A directive-based language extension for heterogeneous stream computing
Authors: Suejb Memeti and Sabri Pllana
Abstract: Nowadays, huge amounts of data are generated through various mechanisms, such as scientific measurements and experiments (including genetics, physics, and astronomy), social media (including Facebook and Twitter), and health care. The current challenges of big data include storing and processing very large files.
However, in big data applications, the entire dataset does not necessarily have to be processed at once. Furthermore, in most big data applications, the data may be streamed, which means it flows in real time. In such cases, the data needs to be processed in chunks and continuously.
Heterogeneous parallel computing systems comprise multiple non-identical processing units (PU), including CPUs on the host and accelerating devices. Most of the top supercomputers in the world comprise multiple nodes with heterogeneous processing units. For instance, the nodes of the current number one supercomputer in the TOP500 list consist of two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs.
While the combination of such heterogeneous processing units may deliver high performance, scalability, and energy efficiency; programming and optimizing such systems is much more complex. Different manufacturers of accelerating devices prefer to use different programming frameworks for offloading (which means transferring the data and control from the host to the device). For instance, OpenMP is used to offload computations to Intel Xeon Phi accelerators, whereas CUDA and OpenCL are used for offloading computations to GPUs.
We present HSTREAM, a compiler directive-based language extension that supports heterogeneous stream computing. HSTREAM aims to keep the same simplicity as programming with OpenMP and to enable programmers to easily utilize the available heterogeneous parallel computing resources on the host (CPU threads) and devices (GPUs or Intel Xeon Phis). The HSTREAM source-to-source compiler performs several analysis steps (lexical, syntactic, and semantic) and generates target-specific code from a given source code annotated with HSTREAM compiler directives and a PDL file that describes the hardware architecture. HSTREAM supports code generation for multi-core CPUs using OpenMP, GPUs using CUDA, and Intel Xeon Phis (also known as MIC) using Intel Language Extensions for Offloading (LEO). The HSTREAM runtime is responsible for scheduling the workload across the heterogeneous PUs.
We use the HSTREAM source-to-source compiler to generate the heterogeneous version of the STREAM benchmark. We evaluate the generated heterogeneous STREAM benchmark with respect to programming productivity and performance. The experimental results show that HSTREAM keeps the same simplicity as OpenMP, and the code generated for execution on heterogeneous systems delivers higher performance compared to CPUs-only and GPUs-only execution.
Major contributions of our work include:
– HSTREAM compiler – a source-to-source compiler for generating target specific code from high-level directive-based annotated source code.
– HSTREAM runtime – a runtime system for scheduling the workload across various non-identical processing units.
– Evaluation of the usefulness of HSTREAM using applications from the STREAM and STREAM2 benchmarks.
Note: This poster is based on our paper accepted for presentation at the 21st IEEE International Conference on Computational Science and Engineering (CSE 2018), 29–31 October 2018.

* (Nr. 8) Title:  Impact of Poisoning Attacks under Regression Learning
Authors: Pratyush Kumar Deka, Monowar H. Bhuyan and Erik Elmroth

* (Nr. 9) Title:  Language Processing Center North
Authors: Suna Bensch, Martin Berglund, Henrik Björklund, Johanna Björklund and Frank Drewes
Abstract:  The Language Processing Center North (LPCN) is a special interest group comprising researchers and professionals from different areas whose work involves the digital processing of linguistic data. The founding members believe that this cross-disciplinary view will become ever more valuable as we progress further into the machine learning era. We anticipate that natural language interfaces (NLI) will blur the lines between formal and natural languages, and observe that the humanities are already, to an increasing degree, taking advantage of digital tools in their study of human communication. Current research at the LPCN ranges from the development of formal models for NLP, to the implementation of NLIs for robots, and investigations into the semantics and computational properties of programming languages.

In our poster presentation, we report on current work on linguistic data analysis and invite others to join the effort. Language processing is a key technological enabler for artificial intelligence research, and is central to many modern technologies ranging from speaking robots to automated translation. Today, research in this area is carried out in a number of different research groups at Umeå University and other institutions in northern Sweden. The region is also home to several companies for which language technology stands to add value. Communication and collaboration between these entities have, however, been limited to date, and the formation of the LPCN is intended to remedy this.

* (Nr. 10) Title:  Machine Learning for Better Process Control in Waste Water Treatment Plant
Authors: Dong Wang
Abstract: Online probes are used along the process line of the largest waste water treatment plant in Umeå, which makes it possible to collect large volumes of historical data for the process parameters. Based on the historical data, a random forest regression model was built, using the probe data recorded upstream of the effluent as input and each effluent parameter as output. Random search based on the out-of-bag error was performed to find the optimal hyperparameters of the model (number of trees, number of features for splitting, minimum leaf size). With suspended solids as the dependent variable, the R2 on the test set was 0.76. The three most important variables were 121:TT01, 211S507:FT01 and 311S502:CT01, in descending order. That 121:TT01 (temperature) is the most important of the parameters is in line with professional knowledge in this industry. Additionally, centered partial dependence plots (PDPs) of the important variables were drawn to give a deeper understanding of the way in which they are important, which would enable the operators to understand the process better and optimize it more precisely. Besides the model for suspended solids, a random forest model for PO4 will also be built, as well as deep learning models for both parameters. The accuracies of the random forest and deep learning models will then be compared.
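
A small sketch of this modelling recipe in scikit-learn: random search over random forest hyperparameters scored directly by the out-of-bag error, followed by a centered partial dependence plot for the most important variable. The data here is synthetic; only the variable names are taken from the text above.

```python
import random
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((1000, 3)),
                 columns=["121:TT01", "211S507:FT01", "311S502:CT01"])
y = 2 * X["121:TT01"] + rng.normal(0, 0.1, 1000)   # stand-in for suspended solids

best, best_oob = None, -np.inf
for _ in range(20):                                # random hyperparameter search
    params = dict(n_estimators=random.randrange(100, 500),
                  max_features=random.choice(["sqrt", "log2", 1.0]),
                  min_samples_leaf=random.randrange(1, 10))
    rf = RandomForestRegressor(oob_score=True, n_jobs=-1, **params).fit(X, y)
    if rf.oob_score_ > best_oob:                   # out-of-bag R2 as the score
        best, best_oob = rf, rf.oob_score_

print("OOB R2:", round(best_oob, 2))
print("importances:", dict(zip(X.columns, best.feature_importances_.round(2))))
# Centered PDP of the most important variable (scikit-learn >= 1.1).
PartialDependenceDisplay.from_estimator(best, X, ["121:TT01"], centered=True)
```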

* (Nr. 11) Title:  Parallel k-means Clustering with Triangle Inequality on Spark
Authors: Ambika Shrestha Chitrakar and Slobodan Petrovic
Abstract: k-means clustering is one of the most widely used methods in data mining. However, it is a hill-climbing method, and it becomes slower as the amount of data (n), the number of iterations (e), and the number of clusters (k) increase. k-means clustering with triangle inequality (k-meansTI) is one of the improved versions of the standard k-means algorithm. It can reduce the time complexity from O(kne) to O(n) by skipping many point-center distance computations, while giving the same clustering results as the standard k-means. Because of the exponential growth of data, it is important to implement such algorithms on big data platforms like Hadoop and Spark that support parallel implementation. k-meansTI has previously been implemented using multi-threaded architectures, OpenMPI, and the Hadoop MapReduce framework. This poster presents a framework for implementing the k-meansTI algorithm on Spark. The reason for choosing Spark is that it supports in-memory persistence and is fault tolerant like Hadoop. The poster also includes experimental results that compare the performance of parallel k-meansTI on Spark with the Spark ML parallel k-means algorithm. The experimental results show that parallel k-meansTI on Spark can be faster than Spark ML k-means when a dataset is large, does not contain many sparse data instances, and is high dimensional. Conversely, Spark ML k-means can be faster when a dataset contains many sparse instances, regardless of its dimensionality and the number of clusters to be computed.
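
The pruning idea at the heart of k-meansTI can be sketched in a few lines (in the style of Elkan's bounds, which may differ in detail from the poster's variant): if a point's distance to its assigned center is at most half the distance from that center to the nearest other center, the triangle inequality guarantees that no other center is closer, so the remaining distance computations for that point can be skipped.

```python
import numpy as np

def assign_with_pruning(points, centers, labels):
    """One reassignment pass that skips points proven optimal by the bound."""
    dist_cc = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    np.fill_diagonal(dist_cc, np.inf)
    s = dist_cc.min(axis=1) / 2       # half the distance to the nearest other center
    skipped = 0
    for i, x in enumerate(points):
        d_assigned = np.linalg.norm(x - centers[labels[i]])
        if d_assigned <= s[labels[i]]:    # no other center can be closer
            skipped += 1
            continue
        labels[i] = np.argmin(np.linalg.norm(centers - x, axis=1))
    return labels, skipped

points = np.random.rand(10000, 16)
centers = np.random.rand(8, 16)
labels = np.argmin(np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
labels, skipped = assign_with_pruning(points, centers, labels)
print(f"skipped {skipped} of {len(points)} full distance computations")
```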

* (Nr. 12) Title:  Phenomenological Ontology Guided Conceptual Modeling for Creating Meaningful Structures of Data (Demo & poster)
Authors: Tomas Jonsson
Abstract: Previously demonstrated at the International Conference on Conceptual Modeling 2018, October, China. For Enterprise Information Systems (EIS), conceptual ER models are fundamental. We demonstrate how a phenomenological modeling ontology (phenomenology as in the philosophy of Husserl et al.) for conceptual models assists in creating auto-generated EIS that are meaningful from a user perspective. The application of the phenomenological ontology to conceptual modeling is shown in a graphical, easy-to-use CASE tool, CoreWEB, which supports both the editing of models and the generation of user interfaces for consistent manipulation of instance data. As a sample system for the demo, the supporting EIS for Projekt Lazarus 2018 (www.projektlazarus.se), a Swedish 600-character Live Action Role Play (LARP) enterprise, is used. A visual and easy-to-grasp demonstration of conceptual modeling with live generation of EIS should be a welcome holistic complement to specialized, focused, in-depth presentations. Audiences at all levels will find food for thought and inspiration to continue exploring conceptual modeling in general, and phenomenon-oriented approaches with systems generation in particular. The tool is available online for educational institutions, together with some demo videos, at https://www.ameis.se/cml

* (Nr. 13) Title:  Prehospital resource optimisation – Ambulance direction
Authors: Johanna Björklund
Abstract: An aging population, urbanization, and medical progress demand flexible prehospital care. In particular, the emergency medical service must be customisable so that resources are used sustainably, efficiently, and equitably. In the research project “prehospital resource optimisation”, we develop a broad solution that enables prehospital care to be organised in an optimal way, which requires that we can predict key metrics for different resource allocations and simulate future alarms. Our solution combines unique historical alarm data, advanced statistical modelling and large-scale simulations to derive information that allows for transparent decision-making and optimal resource utilisation. This makes it easier to highlight the implications for specific patient groups, which is central from a democracy and gender perspective.

The focus of this presentation is the part of the project dedicated to supporting ambulance direction. When SOS Alarm receives an emergency call, they quickly have to assess the severity of the situation and assign an ambulance accordingly. Several factors influence the decision, e.g., the capabilities of individual ambulances, the distance to the emergency site, the current status of the ambulance, the overarching assignment scheme, and the risk of leaving geographical areas without ambulance cover. By analysing historical data, we can help the operators understand the consequences of alternative ambulance directions.

Prehospital care in Sweden has about 660 ambulances, responds to about 1.2 million emergency calls per year, and costs more than 4 billion SEK per year. The tool that we are developing lays the groundwork for systematic quality management, which in the long run can yield large efficiency gains. Behind the project are SOS Alarm, Umeå University, and the public health providers for Västerbotten, Jämtland, Härjedalen, Norrbotten, and Västernorrland.

* (Nr. 14) Title:  Robust Prediction of Air Compressor Failures using Long Short Term Memory
Authors: Kunru Chen, Sepideh Pashami, Yuantao Fan and Slawomir Nowaczyk
Abstract:  We propose a method to predict compressor failures using sensor data collected on board slightly over 1000 Volvo trucks during 2015 and 2016. The goal is to proactively predict maintenance needs before a failure happens. A flexible maintenance strategy is crucial for commercial transportation, since both fixed maintenance intervals and reactive strategies lead to a waste of resources. Predictive maintenance instead pays close attention to the status of the machine and only suggests repairs and component replacements when needed.
We formulate the problem as a classification task which predicts whether a compressor failure will happen in a truck within a specified prediction horizon. A recurrent neural network with the Long Short-Term Memory (LSTM) architecture is used as the prediction model due to its ability to capture dynamic processes. Previously, Prytz et al. [1] used Random Forests (RF) in a similar setting. However, their method assumes that the records in the data are independent, which is a simplification: compressor wear increases gradually over time. Therefore, we claim that the temporal relationships between records are crucial for tracking the status of the machine, and that LSTM is a better choice for this problem.
Two sources of data are combined in this study: sensor data and repair information. The former is collected during truck operation, and the latter records the times of repairs for different truck components. Not all trucks in the dataset have repair records, since a considerable fraction of the trucks do not fail during their uptime.
The high class imbalance makes it extremely challenging to train a model for both classes in the data. To address this, we have implemented a method to resample the data from both categories in different proportions: 1) we adjust the number of faulty trucks; and 2) we adjust the amount of healthy data from each truck. Different combinations of these two numbers yield the 10 learning setups in our experiment.
We present results comparing LSTM against RF. LSTM outperforms random forest in most of the setups with respect to AUC score. Regarding accuracy, LSTM performs better in 3 setups, while RF works best in 2 setups. In the learning setups with a large amount of healthy data per faulty truck, RF generally performs well; however, it is hard for LSTM to capture the dynamics of failure development when too much of the information indicates that the truck is healthy. In a binary setup, where the algorithms observe either healthy or faulty records in a given window, both algorithms achieve the highest accuracy (around 0.85) and AUC score (0.90). In highly imbalanced setups, both algorithms perform poorly and no significant difference can be observed. Additionally, the predictions of LSTM stay quite stable over time, showing a consistent slow trend from the “healthy” to the “faulty” class. The predictions of RF, on the other hand, are volatile; even within a short period, 10 days for example, they can switch twice.
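
A sketch of these two resampling knobs on a hypothetical frame of per-truck, per-day records; the column names and ratios are illustrative, not the study's actual values.

```python
import numpy as np
import pandas as pd

def resample(df, faulty_ratio, healthy_per_truck):
    """Knob 1: share of faulty trucks. Knob 2: healthy records kept per truck."""
    faulty_ids = df.loc[df["label"] == 1, "truck_id"].unique()
    n_healthy_trucks = int(len(faulty_ids) / faulty_ratio) - len(faulty_ids)
    healthy_ids = np.random.choice(
        df.loc[~df["truck_id"].isin(faulty_ids), "truck_id"].unique(),
        n_healthy_trucks, replace=False)
    keep = df[df["truck_id"].isin(np.concatenate([faulty_ids, healthy_ids]))]
    # Cap the number of healthy records contributed by each truck.
    healthy = (keep[keep["label"] == 0]
               .groupby("truck_id", group_keys=False)
               .apply(lambda g: g.tail(healthy_per_truck)))
    return pd.concat([healthy, keep[keep["label"] == 1]]).sort_index()

# Example setup: 1 faulty truck in 3, at most 30 healthy days per truck.
# train_df = resample(records, faulty_ratio=1 / 3, healthy_per_truck=30)
```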

Reference [1] Rune Prytz. Machine learning methods for vehicle predictive maintenance using off-board and on-board data. PhD Thesis, Halmstad University Press, 2014.

* (Nr. 15) Title:  Self-adaptive Privacy Concern Detection for User-generated Content
Authors: Xuan-Son Vu and Lili Jiang
Abstract: To protect user privacy in data analysis, a state-of-the-art strategy is differential privacy, in which statistical noise is injected into the true analysis output. The noise masks the individuals' sensitive information contained in the dataset. However, determining the amount of noise is a key challenge, since too much noise will destroy data utility while too little noise will increase privacy risk. Though previous research has designed mechanisms to protect data privacy in different scenarios, most existing studies assume uniform privacy concerns for all individuals. Consequently, applying an equal amount of noise to all individuals leads to insufficient privacy protection for some users, while over-protecting others. To address this issue, we propose a self-adaptive approach for privacy concern detection based on user personality. Our experimental studies demonstrate the effectiveness of the approach in providing suitable personalized privacy protection for cold-start users (i.e., users without privacy-concern information in the training data).
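
A minimal sketch of the underlying mechanism with per-user privacy levels: a Laplace mechanism whose noise scale is calibrated to each user's privacy concern. In the approach above the concern level is predicted from personality; here it is simply given as a hypothetical mapping.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon):
    """Release value + Laplace noise; a smaller epsilon means stronger privacy."""
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = {"alice": 34, "bob": 27}
# Hypothetical detected privacy concerns mapped to per-user epsilon values:
# a highly concerned user gets a small epsilon, hence more noise.
epsilons = {"alice": 0.1, "bob": 1.0}

for user, age in ages.items():
    noisy = laplace_mechanism(age, sensitivity=1.0, epsilon=epsilons[user])
    print(f"{user}: true={age}, released={noisy:.1f} (eps={epsilons[user]})")
```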

* (Nr. 16) Title:  Statistical Learning in a Manufacturing Environment
Authors:  Niklas Fries, Xijia Liu and Patrik Rydén
Abstract: The FIQA project is a collaboration between Volvo Trucks, Volvo Cars and Umeå University, partially financed by Vinnova FFI. The aim of the project is to model the paint quality at the Volvo Trucks cab manufacturing unit in Umeå using cab specific process/environmental variables. The models will be utilized in an automated alarm system which will predict the expected risk of defects as well as perform an automatic root cause analysis in the case of an alarm.
Process data are generated at high resolution by various sensors, including temperature measurements, robotic data, and dust measurements. Additionally, the cabs are tracked throughout the complete manufacturing process, generating complete tracking data for all cabs. In a novel approach, the unstructured sensor data have been combined with the tracking data in order to generate more than 5000 cab-specific process variables. In addition, we have cab-specific quality data measuring various types of defects, including dust and craters. We have gathered process and quality data from several months of production and more than 50,000 cabs.
Statistical learning and machine learning techniques have been used to model different aspects of quality, using the track-specific process variables as explanatory variables. Challenging and important parts of the modeling involve the fitting of hyperparameters, the pre-processing of the data matrices, and estimating the performance of the classifiers. The modelling approach allows us to predict the probability that a cab has a certain type of defect, i.e., the risk of a defect. An alarm is triggered when the predicted risk of defects exceeds a critical level, which in turn initiates an automated root-cause analysis that identifies the process steps responsible for the decline in quality.

* (Nr. 17) Title:  Teaching Data Science the Interdisciplinary Way: Learning Cycles and Diverse Skillsets
Authors: Vasili Mankevich and Johan Sandberg
Abstract: Due to the potential to extract useful knowledge by means of data mining and statistical analysis, there has been a significant increase in educational offerings that teach data science courses, issue degree certificates, and offer master's programs. Such programs focus on three skill sets: 1) information technology skills necessary for accessing and working with data (e.g. relational databases, OLAP); 2) analytical skills drawing from various disciplines that enable data analysis (e.g. statistical analysis, machine learning, econometrics); 3) business and communication skills that facilitate appropriate problem formulation and value extraction from data science solutions. Successful educational programs need to synthesize and balance these three aspects.

However, recent reports show that academic programs consistently underperform in preparing data science professionals. We studied the challenges and opportunities of teaching data science skills to students of diverse backgrounds (often the case in master's level programs). Our preliminary findings suggest that to balance the three skill sets, educators need to recognize their distinct learning cycles. A learning cycle consists of exploration, concept introduction, and application, and the emphasis on each stage varies greatly across skill sets. Technical data science skills rely on a multitude of separate systems and technologies involving numerous concepts, and therefore require continuous exploration and short cycles of learning (and assessment). Analytical skills require deeper engagement, and hence longer learning cycles with an emphasis on concept introduction and formal instruction. Business and communication skills rely on engagement with high-level social systems (e.g. an organization, a market) and require longer involvement with the learning context, almost exclusively achieved through simulation projects or projects with industry partners. Our study indicates that data science education must consider the distinct learning cycles students experience, and suggests methods for doing so.

* (Nr. 18) Title:    Training model-based and model-free regulators for server fan control with TensorFlow
Authors: Rickard Brannvall
Abstract: In a modern datacenter, large amounts of electric energy are converted to heat by CPUs, memory and network equipment. Cooling units and fans remove heat from the server racks, which further adds to the consumption of electricity. It can be challenging to design automatic controls that use these resources efficiently so as to simultaneously minimize electricity costs and avoid damage to equipment caused by overheating. As a step towards building such a regulator, we here look at a scaled-down system of six servers with fans, analysing both model-based and model-free regulators using TensorFlow.

TensorFlow is popular for training deep neural networks in supervised, unsupervised, or reinforcement learning. In this application we use it for calibrating traditional time-series models and for controller design, taking advantage of its ability to express sequential models and pass them through its powerful automatic differentiator. The particular model used is a so-called grey-box model, inspired by the physical laws of heat exchange but with all parameters obtained by optimization. First, the model is encoded as an RNN and fitted to the time-series data via n-step prediction error minimization (PEM), leveraging TensorFlow's ability to express rich multi-step cost functions.

In regulator design, one's preferences in terms of tolerance for deviation from the set-point, smoothness of the control signal, and the economics of resource use are expressed in a precise cost function. We directly obtain an MPC-type regulator from our model by writing such a cost function on the n-step predictions. The optimal control signal is then solved for approximately in real time (on CPU), which allows online use for server-fan control. We also compare with model-free regulators: first a PID controller with fixed constants calibrated against our model in TensorFlow, and finally an autotuning regulator that adjusts its parameters online by gradient descent.
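
A toy version of the grey-box calibration step, reduced to one server and scalar coefficients (the real setup has six servers and richer inputs): a heat-exchange recursion is unrolled as a recurrent computation, and its parameters are fitted by backpropagating the n-step prediction error through the whole rollout.

```python
import numpy as np
import tensorflow as tf

dt = 1.0                 # sampling step
a = tf.Variable(0.5)     # heating coefficient (power draw -> temperature)
b = tf.Variable(0.5)     # cooling coefficient (fan speed x temperature gap)

def rollout(T0, power, fan, T_amb=20.0):
    """Unroll T[t+1] = T[t] + dt * (a*power[t] - b*fan[t]*(T[t] - T_amb))."""
    T, out = T0, []
    for t in range(power.shape[0]):
        T = T + dt * (a * power[t] - b * fan[t] * (T - T_amb))
        out.append(T)
    return tf.stack(out)

# Synthetic "measurements" from a hidden true model with a=0.8, b=0.3.
def simulate(T0, power, fan, a_true=0.8, b_true=0.3, T_amb=20.0):
    T, out = T0, []
    for p, f in zip(power, fan):
        T = T + dt * (a_true * p - b_true * f * (T - T_amb))
        out.append(T)
    return np.array(out, dtype="float32")

power = np.random.rand(100).astype("float32")
fan = (0.5 + 0.5 * np.random.rand(100)).astype("float32")
T_meas = tf.constant(simulate(30.0, power, fan))
power_tf, fan_tf = tf.constant(power), tf.constant(fan)

# n-step prediction error minimization: backpropagate through the rollout.
opt = tf.keras.optimizers.Adam(learning_rate=0.02)
for step in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((rollout(30.0, power_tf, fan_tf) - T_meas) ** 2)
    opt.apply_gradients(zip(tape.gradient(loss, [a, b]), [a, b]))

print("fitted a, b:", a.numpy(), b.numpy())   # should move towards 0.8 and 0.3
```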

[Joint work with: Joar Svartholm, Rickard Liikamaa and Jonas Gustavsson]

* (Nr. 19) Title:  Towards Understanding ICU Procedures using Similarities in Patient Trajectories – An exploratory study on the MIMIC-III intensive care database
Authors: Alexander Galozy, Slawomir Nowaczyk and Anita Sant’Anna
Abstract: Recent advances in Artificial Intelligence have prompted a sheer explosion of new research initiatives and applications, not only improving existing technologies but also opening opportunities for new and exciting applications. One field that has seen a surge in activity in recent years is the health care sector. One problem commonly encountered in health care, and especially in the intensive care unit (ICU), is overtreatment or mistreatment of patients, resulting in increased morbidity and mortality. Utilizing existing electronic health records in conjunction with machine learning and data mining techniques to detect harmful or deviating treatments provides the opportunity for needed change. We explore the MIMIC-III intensive care unit database and conduct experiments on an interpretable representation based on the Simplified Acute Physiology Score III (SAPS-III) and the Oxford Acute Severity of Illness Score (OASIS), which are commonly used to predict in-hospital mortality. The sub-scores of the severity scores represent the health state of patients throughout the hospital stay, and the health state trajectories are clustered using a powerful kernel technique, the Time Series Cluster Kernel, a recently developed method for robust time series clustering with missing data that employs an ensemble learning approach with informative priors. Medications and procedure contexts are clustered on semantic representations learned from data using word embeddings, to find commonalities and deviations in treatments among patients with similar and individual health state trajectories. Changes in the health state are correlated with administered medications and performed procedures to evaluate treatments by their effect on said health state, where commonalities and deviations can be detected and analyzed. Our results show the potential and the limitations of detecting commonalities and deviations in treatment in an ICU setting, using the proposed representation of a patient's health state in conjunction with contemporary representation learning and clustering techniques.

* (Nr. 20) Title:  Using computational molecular evolutionary methods to predict functional capabilities of (p)ppGpp synthesising enzymes
Authors: Chayan Kumar Saha and Gemma Catherine Atkinson
Abstract: Bacteria respond to nutritional stress by producing (p)ppGpp, which triggers a stringent response that down-regulates protein synthesis and up-regulates amino acid biosynthesis. The level of (p)ppGpp in bacteria is controlled by enzymes belonging to the RelA/SpoT homologue (RSH) family, named for their sequence similarity to the RelA and SpoT enzymes of Escherichia coli. RelA is able to synthesize (p)ppGpp, and SpoT is able to both synthesize and hydrolyse (p)ppGpp. High-throughput sensitive sequence searching of over 1000 genomes from across the tree of life, in combination with phylogenetic analyses, previously allowed the classification of RSHs into 30 subgroups [1]. These belong to three larger groups: small alarmone synthetases (SASs) that contain only the (p)ppGpp synthesis (SYNTH) domain, small alarmone hydrolases (SAHs) that contain only the (p)ppGpp hydrolysis (HD) domain, and long RSHs which carry both SYNTH and HD domains, often with additional C-terminal domains. My bioinformatics PhD project aims to take advantage of new functional and structural information available for SASs, as well as the many new genome sequences now available, to revisit this group of RSHs, identifying new subgroups, retracing their evolution and predicting functional capabilities that can be tested experimentally. I have started with 24,072 genomes from across the tree of life (bacteria, archaea, eukaryotes and viruses), in which I have identified 35,616 RSH proteins and classified them into subgroups according to the previous scheme. The next step is to carry out phylogenetic analyses to identify new subfamilies and improve the previous classification. By developing tools for analyzing the co-distribution and conservation of flanking genes, I will predict novel functional associations of SASs and other stress response proteins of interest in the research group. My analyses, in combination with biochemical and in vivo experiments, will allow the wealth of evolutionary and sequence data we have collated to be used to link sequence, structure and function.

Reference [1] Atkinson GC, Tenson T, Hauryliuk V (2011) PLoS ONE 6(8): e23479

* (Nr. 21) Title:   A.I.D.A the digital receptionist – An Artificial Intelligent Assistant-solution for Dental Healthcare
Authors: Eva Mårell-Olsson and Peter Bergström
Abstract:  Swedish dental healthcare faces several challenges, shared with the rest of the western world, concerning how to develop effective dental healthcare in a digitized society. For society, the challenges concern a growing population along with a growing staff shortage. The shortage of dentists and dental nurses cannot be solved by increased education alone, and it is therefore necessary to find other solutions. In addition, there is an ongoing digitalisation of dental healthcare, and increased demands are placed on developing new innovative solutions, as well as on existing structures and systems. The focus is on creating effective dental healthcare in combination with an agile and experimental way of meeting the challenges of the future. The use of Artificial Intelligence (AI) in dental healthcare brings hope of increased quality and effectiveness along with increased growth and better welfare. This project concerns an ongoing applied study in dental healthcare. The study context is a newly built public dental clinic in the city center of Umeå, Sweden, which opened for patients in September 2018.

The clinic's vision is to become the world's smartest dental clinic, using smart digital tools and developing smart working methods. Further, the clinic aims to be a fully functioning test bed within two years. The clinic is fully digitized with the latest equipment available for dental healthcare. The recruited staff have a specific responsibility to be part of the exploration and to share their experiences on how to develop dental healthcare for the future. Another focus concerns how to provide more effective dental healthcare for the patients. One of the ongoing projects is an AI-based communication solution under development called A.I.D.A. – the digital receptionist (Artificial Intelligent Dental Assistant).
Aim and method
The study aims to investigate how an AI system can optimize the communication between patients and staff as well as the communication among the staff within the clinic. The research questions focus on patients' and staff's experiences and their use of the developed AI system, concerning possibilities as well as challenges. Design-based research methods will be used as a means to develop an approach to meet the challenges of staff shortage based on the use of AI. Three types of data will be collected: staff interviews, patient interviews, and log data from the AI system. The theoretical framework is based on activity theory, where motives, goals, actions and operations are key starting points. Activity theory embraces the exploration and understanding of a context in relation to how social relations, materials, tools and intentions affect acting in different situations.
Thematic analysis will be used for constructing an understanding and meaning of the collected empirical material and to identify key themes and emerging patterns based on the study’s aim and research questions.

* (Nr. 22) Title:  Prehospital resource optimisation – Spatio-temporal modelling and forecasting of emergency calls
Authors: Fekadu L. Bayisa, Nana Li, Markus Ådahl, Patrik Rydén, Johanna Björklund, Ottmar Cronie
Abstract: The aim of the project “prehospital resource optimisation” is to develop a solution which enables prehospital care to be organised in an optimal way. Such a system would allow the current prehospital resources (e.g. ambulances) to be used in an “optimal” way and would, in addition, act as a quality management tool used to create (economic) efficiency gains.

Prehospital care in Sweden has about 660 ambulances, responds to about 1.2 million emergency calls per year, and costs more than 4 billion SEK per year. Given these figures, it is not hard to realise that the behaviour of the 112-emergency call frequency throughout Sweden (and any other country, for that matter) is rather complex and dynamic, with ever-changing conditions in both space and time. For instance, in a given municipality, the statistics for incoming calls during a Monday in January of one year are most likely different from the statistics for the incoming calls during, say, a Saturday in August of another year. In other words, the spatial dynamics of the calls tend to change over time. Moreover, the underlying risk of a call occurring at a given location and time changes with, e.g., the age distribution of the population, urbanisation, and general societal behaviours.

A key component of the intended system is a tool which describes and forecasts the call dynamics. Not only does such a tool allow one to indicate geographical “hot-spot” regions for a given future timeframe, it also makes it possible to create simulated future call scenarios/realisations, which may be used to evaluate different ambulance dispatch/routing strategies. The part of the project presented here specifically deals with the spatio-temporal modelling and forecasting of 112-emergency calls, based on unique historical call data. Our modelling set-up re-calibrates itself over time, as new call data enters the system, in order to adequately reflect the changing spatio-temporal dynamics of the calls.
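
As a deliberately simplified illustration of the hot-spot and simulation components (the project's model is a self-recalibrating spatio-temporal model, not a plain density estimate), the sketch below fits a kernel density estimate to synthetic call locations, locates the strongest hot-spot on a grid, and samples simulated future calls:

```python
import numpy as np
from scipy.stats import gaussian_kde

calls = np.random.randn(2, 500) * [[5.0], [3.0]]   # synthetic (x, y) call locations

kde = gaussian_kde(calls)
xs, ys = np.meshgrid(np.linspace(-15, 15, 60), np.linspace(-10, 10, 40))
intensity = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

peak = np.unravel_index(intensity.argmax(), intensity.shape)
print("strongest hot-spot near:", xs[peak], ys[peak])

# Sampling from the fitted density gives simulated future call locations,
# which could feed evaluations of dispatch/routing strategies.
simulated = kde.resample(100)
```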

Partners in the project are SOS Alarm, Umeå University, and the public health providers in the regions of Västerbotten, Norrbotten, and Västernorrland. The project is partially financed by Vinnova.