Course Description

KeyNote and Courses



  • Richard Bonneau
    (New York University) [introductory]
    Large Scale Machine Learning Methods for Integrating Protein Sequence and Structure to Predict Gene Function

    Summary

    Machine learning is advancing biology, biotech and medicine in a large number of ways. A key grand challenge in biology is the annotation of genomes. We sequence large numbers of genomes for diverse reasons: from diagnosing and tailoring treatment for diseases to cataloging and mining biodiversity. We will dive deeply into the problem of predicting the function of proteins and protein families. This is a good illustrative problem at the intersection of biology and ML for several reasons: 1) input data is diverse, with networks, sequences, and three-dimensional structures all in great abundance, 2) protein function is organized into a hierarchy of many thousands of labels, both organized and challengingly rich, and 3) the potential for positive impact is quite high, with applications ranging from bioremediation and biosynthesis to ecology and medicine.

    Syllabus

    • 1: Framing of the problem: Overview of features, labels, and past work in protein function prediction.
    • 1.1 The data: proteins, sequences, structures and networks.
    • 1.2 Previous work on protein function prediction.
    • 1.3 Common task challenges: the critical assessment of function annotation (CAFA) and MouseFunc
    • 1.4 Introduction to autoencoders with examples drawn from biology and social networks analysis.
    • 2: Learning from networks: deep multimodal autoencoders applied to the problem of integrating multiple biological networks.
    • 2.1 Biological networks: a rich but complicated input
    • 2.2 Using non-negative matrix factorization to combine high dim features
    • 2.3 Deep multimodal autoencoders for combining networks
    • 2.4 Behind the curtain: would other simple methods work better? alternate architectures
    • 3: Learning from structure using graph convolutional neural networks and integrating features
    • 3.1 Similarities between sequences and structure: biology’s workhorses
    • 3.2 Sequence autoencoders
    • 3.3 Protein sequence and protein family CNNs
    • 3.4 Salience mapping to localize function
    • 3.5 Protein structure features overview
    • 3.6 Encapsulating and learning from protein structure using Graph CNNs

    References

    Pre-requisites

    Familiarity with statistics and basic concepts of machine learning. Biology will be introduced and biology knowledge is not needed, but it couldn't hurt to read up on basic concepts like sequence alignment, biological sequence databases, and protein structure.

    Short Bio

    Richard Bonneau joined the Simons Foundation in 2014 to develop next-generation computational biology methods for the Center for Computational Biology at the Flatiron Institute. He is also a faculty member (and former acting director) at NYU’s Center for Data Science. He focuses on creating new methods for using protein structure modeling to interpret genetic variation and new methods for understanding biological networks. Before coming to the foundation, Bonneau was a senior scientist at the Institute for Systems Biology in Seattle, Washington, and before that he was a senior scientist at Structural GenomiX in San Diego, California. He is co-director of the Social Media and Political Participation Lab at New York University. He holds a Ph.D. in biomolecular structure and design from the University of Washington, Seattle.

    Short Bio (Vladimir Gligorijevic, Research Fellow):

    Vladimir Gligorijevic joined the Simons Foundation in March 2017 as a member of the Systems Biology group at the Center for Computational Biology to develop protein function prediction methods using deep learning techniques. Prior to this, Gligorijevic was a research assistant in the computing department at Imperial College London. There, he worked on developing machine learning methods for integration of large-scale, heterogeneous biological data with applications in protein function prediction and precision medicine. Gligorijevic holds a B.Sc. and M.Sc. in physics from the University of Belgrade in Serbia and a Ph.D. in computer science from Imperial College London, United Kingdom.



    Altan Cakir
    (Istanbul Technical University) [introductory/intermediate]
    Processing Big Data with Apache Spark: From Science to Industrial Applications

    Summary

    Apache Spark, an open-source cluster-computing framework that provides a fast and general engine for large-scale processing, has been one of the most exciting technologies in big data development in recent years. The main idea behind this technology is to provide a memory abstraction that allows data to be shared efficiently, in memory, across the different stages of a map-reduce job. Our lecture starts with a brief introduction to Spark and its ecosystem, and then shows how some common techniques (classification, collaborative filtering, and anomaly detection, among others) apply to fields such as particle physics, genomics, social media analysis, web analytics and finance. If you have an entry-level understanding of machine learning and statistics, and program in Python or Scala, you will find these subjects useful for working on your own big data projects.
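
    The value of sharing data in memory across stages can be illustrated with a small plain-Python sketch. It is not Spark code: the `LazyDataset` class and its `map`, `cache`, `collect` and `reduce` methods are hypothetical stand-ins written for this illustration, and they only mimic the general idea of lazy transformations plus an optional in-memory copy.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces a list
        self._cached = None        # in-memory copy, filled on first use
        self._use_cache = False

    def map(self, fn):
        # Transformations are lazy: nothing runs until an action is called.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask for the computed data to be kept in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def reduce(self, fn):
        return reduce(fn, self.collect())


calls = {"parse": 0}


def parse(line):
    calls["parse"] += 1            # count how often the expensive stage runs
    return int(line)


raw = LazyDataset(lambda: ["1", "2", "3", "4"])

# Without caching, each downstream action recomputes the parse stage.
parsed = raw.map(parse)
total = parsed.reduce(lambda a, b: a + b)
largest = parsed.reduce(max)
uncached_calls = calls["parse"]

# With caching, the parsed data is computed once and shared by both actions.
calls["parse"] = 0
parsed_cached = raw.map(parse).cache()
total_cached = parsed_cached.reduce(lambda a, b: a + b)
largest_cached = parsed_cached.reduce(max)
cached_calls = calls["parse"]

print(total, largest, uncached_calls, cached_calls)
```

    In this toy run the parse stage is executed twice as often without the cache as with it, which is the kind of saving an in-memory abstraction aims for when several stages reuse the same intermediate data.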

    Syllabus

    • Introduction to Data Analysis with Apache Spark
    • Spark Programming Model with RDD
    • Running Spark Applications on Hadoop / AWS Systems
    • Spark SQL
    • Spark Streaming
    • Machine Learning with Spark MLlib
    • Advanced Analytics Applications with Spark
    • Analysis of real-world applications

    References

    • https://spark.apache.org, Unified Analytics Engine for Big Data
    • Advanced Analytics with Spark: Patterns For Learning From Data at Scale, A. Teller, M. Pumperla, M. Malohlava
    • Mastering Machine Learning with Apache Spark 2.x, S. Amirgodshi, M. Rajendran, B. Hall, S. Mei

    Pre-requisites

    Python, Statistics, Machine Learning

    Short Bio

    Altan Cakir received his M.Sc. degree in theoretical particle physics from Izmir Institute of Technology in 2006 and then went straight to graduate school at the Karlsruhe Institute of Technology, Germany, from which he earned a Ph.D. in experimental high energy and particle physics in 2010. During his Ph.D., he was responsible for research on new physics searches with the CMS detector at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). Thereafter he was a post-doctoral research fellow at Deutsches Elektronen-Synchrotron (DESY), a national nuclear research center in Hamburg, Germany, where he spent 5 years, and he then took up his present full professor position at Istanbul Technical University (ITU), Istanbul, Turkey. Currently, Altan Cakir is the leader of the ITU-CMS group at CERN and leads a data analysis group at the CMS detector. Furthermore, he was a visiting faculty member at Fermi National Accelerator Laboratory (Fermilab), Illinois, USA in 2017. His group’s expertise is focused on machine learning techniques in large-scale data analysis. However, their research is highly interdisciplinary, with expertise in the group ranging from particle physics, detector development and big data synthesis to economics and industrial applications.

    Altan Cakir has been involved in a large number of high-profile research projects at CERN, DESY and Fermilab over the last thirteen years. He enjoys being able to integrate his research with teaching key concepts of science and big data technologies. He finds it rewarding to be part of the development of the next generation of scientists, and to help his students move on to careers all over the world, in academia, industry and government.

    The following lectures on big data are periodically given by Assoc. Prof. Dr. Altan Cakir in the Big Data and Business Analytics Program (http://bigdatamaster.itu.edu.tr) at Istanbul Technical University: Big Data Technologies and Applications, Machine Learning with Big Data



    Jiannong Cao
    (Hong Kong Polytechnic University) [introductory/intermediate]
    Cross-Domain Multi-Source Big Data Fusion and Analytics

    Summary:

    Big data analytics using cross-domain multi-source datasets allows us to study the phenomena of interest by fusing views from multiple angles, helping us to identify meaningful problems and discover new insights. However, we need methods and techniques to address challenges such as heterogeneity, uncertainty and high dimensionality in analyzing cross-domain datasets. In this lecture, I will describe a general framework of cross-domain big data fusion and analytics, and introduce existing work, including our own, on fusing and analyzing datasets from multiple domains to uncover the underlying patterns, correlations and interactions. Example applications concern human and urban dynamics, such as predicting traffic congestion, optimizing demand dispatching in emerging on-demand services, and designing wireless networks.

    Syllabus:

      1. Introduction to big data analytics (0.5 hour)
      2. Cross-domain multi-source data analytics: opportunities and challenges (1 hour)
      3. Cross-domain multi-source data analytics framework and techniques (1.5 hours)
      4. Examples: our work on cross-domain data analytics (1 hour)
      5. Summary and future directions (0.5 hour)

    References:

    • [1] Y. Zheng, “Methodologies for Cross-Domain Data Fusion: An Overview”, IEEE Transactions on Big Data, 1(1): 16-34 (2015)
    • [2] C. Shi, Y. Li, J. Zhang, Y. Sun, Philip S. Yu, “A Survey of Heterogeneous Information Network Analysis”, IEEE Transactions on Knowledge and Data Engineering, Volume: 29 , Issue: 1 , Jan. 1 (2017)
    • [3] C-W. Tsai, C-F. Lai, H-C. Chao, A. V. Vasilakos, “Big Data Analytics: a Survey”, Journal of Big Data, December 2:21(2015)
    • [4] Y. Zheng, L. Capra, O. Wolfson, H. Yang, “Urban Computing: Concepts, Methodologies, and Applications”, ACM Transactions on Intelligent Systems and Technology, Vol. 5, No. 3, September (2014)

    Pre-requisites:

    Basic knowledge in big data analytics, machine learning, and urban computing applications.

    Short bio:

    Dr. Cao is currently a Chair Professor of the Department of Computing at The Hong Kong Polytechnic University, Hong Kong. He is also the director of the Internet and Mobile Computing Lab in the department and the director of the University’s Research Facility in Big Data Analytics. His research interests include parallel and distributed computing, wireless sensing and networks, pervasive and mobile computing, and big data and cloud computing. He has co-authored 5 books, co-edited 9 books, and published over 500 papers in major international journals and conference proceedings. He received Best Paper Awards from conferences including DSAA’2017, IEEE SMARTCOMP 2016, ISPA 2013, IEEE WCNC 2011, etc.

    Dr. Cao served as the Chair of the Technical Committee on Distributed Computing of the IEEE Computer Society 2012-2014, a member of the IEEE Fellows Evaluation Committee of the Computer Society and the Reliability Society, a member of the IEEE Computer Society Education Awards Selection Committee, a member of the IEEE Communications Society Awards Committee, and a member of the Steering Committee of IEEE Transactions on Mobile Computing. Dr. Cao has also served as chair and member of organizing and technical committees of many international conferences, and as associate editor and member of the editorial boards of many international journals. Dr. Cao is a fellow of IEEE and an ACM distinguished member. In 2017, he received the Overseas Outstanding Contribution Award from China Computer Federation.



    Nello Cristianini
    (University of Bristol) [introductory]
    The Interface between Big Data and Society

    Summary

    Syllabus

    Ethical and Social Implications of AI

    The introduction of AI in the midst of society has created new opportunities and new challenges that include deep issues of fairness, transparency, and human autonomy. The solution to those new problems cannot be just technical, but there is a role for technical solutions too, within a more general effort to understand what can be done, so that we can safely coexist with intelligent machines in a data-driven society.

    Social Media Analysis

    The analysis of social media content can reveal new information about society, public opinion and even possibly biology. We review recent work that is based on statistical text mining, and that integrates various signals to reveal novel psychological and behavioural patterns in large populations.

    News Content Analysis

    The analysis of media content, both present and historical, can reveal new insights about trends, biases and events, that would otherwise be difficult to analyse. We show a series of examples - starting with the digitisation process and ending with the creation of maps and timelines - illustrating how big data and digital humanities can meet and provide new help for historians.

    Pre-requisites

    References

    • Cristianini, N. (2016). A different way of thinking. New Scientist 232(3101), 39-43.
    • Cristianini, N. (2016). Intelligence reinvented. New Scientist, 232(3097), 37-41.

    Short Bio

    Nello Cristianini has been a Professor of Artificial Intelligence at the University of Bristol since March 2006, and is a recipient of both an ERC Advanced Grant and a Royal Society Wolfson Merit Award. He has wide research interests in the areas of data science, artificial intelligence, machine learning, and applications to computational social sciences, digital humanities, and news content analysis.

    He has contributed extensively to the field of statistical AI. Before his appointment to Bristol he held faculty positions at the University of California, Davis, and visiting positions at the University of California, Berkeley, and at many other institutions. Before that he was a research assistant at Royal Holloway, University of London. He has also held positions in industry. He has a PhD from the University of Bristol, an MSc from Royal Holloway, University of London, and a Degree in Physics from the University of Trieste. He is co-author of the books 'An Introduction to Support Vector Machines' and 'Kernel Methods for Pattern Analysis' with John Shawe-Taylor, and 'Introduction to Computational Genomics' with Matt Hahn (all published by Cambridge University Press).



    Geoffrey C. Fox
    (Indiana University, Bloomington) [intermediate]
    High Performance Big Data Computing

    Summary

    Big data problems can be classified into three main categories: batch processing (Hadoop), stream processing (Apache Flink and Apache Heron) and iterative machine learning and graph problems (Apache Spark). Each of these problems has different processing, communication and storage requirements. Therefore, each system provides separate solutions to these needs.

    All these systems use a dataflow programming model to perform distributed computations. With this model, big data frameworks represent a computation as a generic graph in which the nodes perform computations and the edges represent the communication between them. The nodes of the graph can be executed on different machines in the cluster depending on the requirements of the application.
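
    A minimal single-process sketch of the dataflow idea follows, assuming a tiny hand-written graph. The node names and functions are hypothetical and are not taken from any of the systems named above; a real framework would place the nodes on different machines and turn the edges into network communication.

```python
from graphlib import TopologicalSorter

# Each node is a computation. Names and functions are hypothetical examples.
nodes = {
    "source": lambda: [3, 1, 4, 1, 5, 9, 2, 6],
    "square": lambda xs: [x * x for x in xs],
    "evens": lambda xs: [x for x in xs if x % 2 == 0],
    "sum": lambda squares, evens: sum(squares) + sum(evens),
}

# Each edge carries data: a node maps to the upstream nodes it reads from,
# in the order their outputs are passed as arguments.
edges = {
    "source": [],
    "square": ["source"],
    "evens": ["source"],
    "sum": ["square", "evens"],
}


def run_dataflow(nodes, edges):
    """Run every node once, after all of its upstream nodes have finished."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[upstream] for upstream in edges[name]]
        results[name] = nodes[name](*inputs)
    return results


results = run_dataflow(nodes, edges)
print(results["sum"])  # → 185
```

    Because "square" and "evens" depend only on "source", nothing in the graph forces an order between them, which is what lets a scheduler run such nodes in parallel.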

    We identify four key tasks in big data systems:

    1) Acquiring computing resources, 2) spawning and managing executor processes/threads, 3) handling communication between processes, and 4) managing the data, both static and intermediate. An independent component can be developed for each of these tasks. However, current systems provide tightly coupled solutions to these tasks, with the exception of resource scheduling.

    Twister2 [1-3] is a loosely coupled, component-based approach to big data. Each of the four essential abstractions has different implementations to support various applications, so the architecture is pluggable. It can be used to solve all three types of big data problems mentioned above.

    In this tutorial, we review big data problems and systems, explain the Twister2 architecture and features, and provide examples for developing and running applications on the Twister2 system. By learning Twister2, big data developers will gain experience with a flexible big data solution that can be used to solve all three types of big data problems.

    Syllabus

    • Introduction to big data problems and systems
    • Decoupling big data solutions (big data stack)
    • Twister2 overview
    • Resource scheduling (Kubernetes, Mesos)
    • Communication (MPI on Twister2)
    • Task scheduling and execution (Fault Tolerance)
    • Data representation
    • Developing Big Data Solutions in Twister2
    • Batch Processing example
    • Streaming example
    • Machine learning example
    • Conclusion

    References

    • [1] Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Twister2: Design of a Big Data Toolkit" in EXAMPI 2017 workshop November 12 2017 at SC17 conference, Denver CO 2017.
    • [2] Supun Kamburugamuve, Pulasthi Wickramasinghe, Kannan Govindarajan, Ahmet Uyar, Gurhan Gunduz, Vibhatha Abeykoon, Geoffrey Fox, "Twister:Net - Communication Library for Big Data Processing in HPC and Cloud Environments", Proceedings of Cloud 2018 Conference July 2-7 2018, San Francisco.
    • [3] Kannan Govindarajan, Supun Kamburugamuve, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Task Scheduling in Big Data - Review, Research: Challenges, and Prospects", Proceedings of 9th International Conference on Advanced Computing (ICoAC), December 14-16, 2017, India.

    Pre-requisites

    Basic knowledge of computer algorithms and software; knowledge of machine learning; some knowledge about big data systems including streaming systems and batch systems.

    Short Bio

    Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University where he was Senior Wrangler. He is now a distinguished professor of Engineering, Computing, and Physics at Indiana University where he is director of the Digital Science Center, and Department Chair for Intelligent Systems Engineering at the School of Informatics, Computing, and Engineering. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory, and Peterhouse College Cambridge. He has supervised the Ph.D. of 73 students and published around 1300 papers (over 500 with at least 10 citations) in physics and computing, with an h-index of 77 and over 35000 citations. He is a Fellow of APS (Physics) and ACM (Computing) and works on the interdisciplinary interface between computing and applications. His current application collaborations are in Biology, Pathology, Sensor Clouds and Ice-sheet Science, Image processing and Particle Physics. His architecture work is built around High-performance computing enhanced Software Defined Big Data Systems on Clouds and Clusters. The analytics focuses on scalable parallel machine learning. He is an expert on streaming data and robot-cloud interactions. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.



    David Gerbing
    (Portland State University) [introductory]
    Data Visualization with R

    Summary

    This seminar introduces the R language via data visualization, aka computer graphics, in the context of a discussion of best practices and considerations for the analysis of big data. Code to generate the graphs is presented in terms of R base graphics, Hadley Wickham's ggplot2 package, and the author's lessR package. The content of the seminar is summarized in R Markdown files, available to all participants, that include commentary and implementation of all the code presented in the seminar. These explanatory examples serve as templates for applications to new data sets.

    Syllabus

    Session 1

    Introduction to R

    • R functions and syntax
    • R variable types
    • Read data into R

    Specialized Graphic Functions

    • Functions from the lessR package
    • The ggplot function from the ggplot2 package
    • Base R graphics

    Themes

    Session 2

    Bar Charts for Distributions of Categorical Variables

    • R factor variables and lessR doFactors function
    • Counts of one variable
    • Joint frequencies of two variables
    • Statistics for a second variable plotted against one variable

    Graphs for Distributions of a Continuous Variable

    • Histograms and binning
    • Densities
    • Boxplot
    • Scatterplot, 1-dimensional
    • Introduction to the integrated Violin/Box/Scatterplot, the VBS plot

    Scatterplots, 2-dimensional

    • With two or more continuous variables
    • A categorical variable with a continuous variable
    • Bubble plots with categorical variables
    • Two variable plot with a third variable, categorical or continuous

    Session 3

    Scatterplots, 2-dimensional (continued)

    • Visualization of relationships for big data sets

    Time Series Plots

    • One-variable plot
    • Stacked time-series plot
    • Area plots
    • Forecasts

    Interactive R Visualizations

    • Shiny
    • Plotly

    References

    • Gerbing, D. W. (2013). R Data Analysis without Programming, NY: Routledge.
    • Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

    Pre-requisites

    Basic understanding of data analysis

    Short Bio

    David Gerbing, Ph.D., has been Professor of Quantitative Methods in the School of Business Administration at Portland State University since 1987. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming, which describes his lessR package, and many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.



    Craig Knoblock
    (University of Southern California) [intermediate/advanced]
    Building Knowledge Graphs

    Summary

    There is a tremendous amount of data spread across the web and stored in databases that can be turned into an integrated semantic network of data, called a knowledge graph. Knowledge graphs have been applied to a variety of challenging real-world problems including combating human trafficking by analyzing web ads, identifying illegal arms sales from online marketplaces, and predicting cyber attacks using data extracted from both the open and dark web. However, exploiting the available data to build knowledge graphs is difficult due to the heterogeneity of the sources, the scale of the data, and the noise in the data. In this course I will present the techniques for building knowledge graphs, including extracting data from online sources, cleaning and transforming the data, aligning the data to a common terminology, linking the data across sources, and constructing and querying knowledge graphs at scale.
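
    The later stages of that pipeline (cleaning, alignment to a common terminology, linking, graph construction and querying) can be sketched on toy data. Everything in the sketch is an assumption made for illustration: the two sources, their field names, the `FIELD_MAP` vocabulary and the exact-match linking rule are hypothetical, and real systems use far more robust extraction and linking methods.

```python
import re

# Two hypothetical sources describing the same seller with different fields.
source_a = [{"seller_name": "ACME  Corp.", "city": "Los Angeles"}]
source_b = [{"vendor": "acme corp", "phone": "555-0100"}]

# Source alignment: map each source's field names onto one shared vocabulary.
FIELD_MAP = {
    "seller_name": "name",
    "vendor": "name",
    "city": "location",
    "phone": "phone",
}


def clean(value):
    """Data cleaning: drop punctuation, collapse whitespace, lower-case."""
    no_punct = re.sub(r"[^\w\s-]", "", value)
    return re.sub(r"\s+", " ", no_punct).strip().lower()


def build_graph(*sources):
    """Turn records into (subject, predicate, object) triples."""
    triples = set()
    for records in sources:
        for record in records:
            aligned = {FIELD_MAP[k]: clean(v) for k, v in record.items()}
            # Naive entity linking: identical cleaned names share one node.
            entity = "entity:" + aligned["name"].replace(" ", "_")
            for predicate, obj in aligned.items():
                triples.add((entity, predicate, obj))
    return triples


def query(triples, subject=None, predicate=None, obj=None):
    """Return matching triples in sorted order; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    )


graph = build_graph(source_a, source_b)
print(query(graph, subject="entity:acme_corp"))
```

    After linking, the location from the first source and the phone number from the second hang off the same node, so a single query returns facts that no individual source contained.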

    Syllabus

      1. Knowledge graphs
      2. Web data extraction
      3. Data cleaning and transformation
      4. Source alignment
      5. Entity linking
      6. Graph construction and querying

    Pre-requisites

    Background in computer science and some basic knowledge of AI, machine learning, and databases will be helpful, but not required.

    References

    Short Bio

    Craig Knoblock is a Research Professor of both Computer Science and Spatial Sciences at the University of Southern California (USC), Executive Director of the USC Information Sciences Institute, Research Director of the Center on Knowledge Graphs, and Associate Director of the Informatics Program at USC. He received his Bachelor of Science degree from Syracuse University and his Master’s and Ph.D. from Carnegie Mellon University in computer science. His research focuses on techniques for describing, acquiring, and exploiting the semantics of data. He has worked extensively on source modeling, schema and ontology alignment, entity and record linkage, data cleaning and normalization, extracting data from the Web, and combining all of these techniques to build knowledge graphs. He has published more than 300 journal articles, book chapters, and conference papers on these topics and has received 7 best paper awards on this work. Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Fellow of the Association for Computing Machinery (ACM), past President and Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and winner of the 2014 Robert S. Engelmore Award.



    Geoff McLachlan
    (University of Queensland) [intermediate/advanced]
    Applying Finite Mixture Models to Big Data

    Summary

    Attention is focussed initially on the role of finite mixture models in modelling and clustering heterogeneous data. The use of the EM algorithm to fit mixture distributions via maximum likelihood is reviewed. Extensions of the commonly used normal (Gaussian) mixture models are considered through the use of hidden variables to formulate various skew component distributions suitable for representing clusters in the presence of skewness and kurtosis. Also, hidden variables in the form of latent factors are adopted to fit mixtures of factor analyzers in situations where the dimension of the feature data is large relative to the amount of data within a cluster. Consideration is given to further extensions of this approach to handle big data of possibly high dimension after an appropriate reduction, where necessary, in the number of variables. This latter reduction is effected in the first instance by a clustering of the variables via mixtures of linear mixed models. There is also coverage of deep normal mixtures to increase the flexibility of these models. Another role of mixture models addressed is their use for controlling the false positive rate in multiple comparisons and testing. The statistical methodology presented is illustrated by several real-data examples from various fields, including flow cytometry, bioinformatics, and image analysis.
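
    As a pointer to what fitting a mixture by EM involves, here is a minimal sketch for the simplest case, a two-component univariate normal mixture. The simulated data, the starting values, the fixed iteration count and the variance floor are all assumptions of this sketch, not choices taken from the course material.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_normals(data, n_iter=200):
    """Fit a two-component univariate normal mixture by maximum likelihood via EM."""
    n = len(data)
    grand_mean = sum(data) / n
    overall_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Crude starting values (an assumption): extremes as means, pooled variance.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted maximum likelihood updates of the parameters.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against a collapsing component
    return weights, means, variances


# Simulated data: two well-separated clusters centred at 0 and 5.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 1) for _ in range(300)]

weights, means, variances = em_two_normals(data)
print(weights, means, variances)
```

    The E-step supplies the hidden component labels in expectation, and the M-step then reduces to weighted versions of the usual sample proportion, mean and variance.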

    Syllabus

    Role of mixture distributions in modelling and clustering heterogeneous data; maximum likelihood fitting of mixture models via the EM algorithm; Normal (Gaussian) mixture models and extensions for clusters in the presence of skewness and kurtosis; mixtures of linear mixed models; mixtures of factor analyzers for high-dimensional data; deep mixture factor analyzers; role of mixture models in controlling the false discovery rate in multiple comparisons and testing; applications of the aforementioned methodology to data from flow cytometry, microarray analyses, magnetic resonance imaging, and other various studies.

    References

      1. Lee, S.X., Leemaqz, K.L., and McLachlan, G.J. (2018). A block EM algorithm for multivariate skew normal and skew t-mixture models. IEEE Transactions on Neural Networks and Learning Systems. (Advance Access published 09 March, 2018). To appear. Preprint arXiv:1608.02797.
      2. Lee, S.X. and McLachlan, G.J. (2016a). Finite mixtures of canonical fundamental skew t-distributions: the unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing 26, 573-589. Correction. Preprint (2014) arXiv: 1401.8182.
      3. Lee, S.X. and McLachlan, G.J. (2018). EMMIXcskew: an R Package for the fitting of a mixture of canonical fundamental skew t-distributions. Journal of Statistical Software 83, No. 3.
      4. Lee, S.X., McLachlan, G.J., and Pyne, S. (2016). Modelling of inter-sample variation in flow cytometric data with the joint clustering and matching (JCM) procedure. Cytometry: Part A 89A, 30-43.
      5. McLachlan, G.J., Bean, R.W., and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422.
      6. Ng, S.K., McLachlan, G.J., Wang, K., Nagymanyoki, Z., Liu, S., and Ng, S.W. (2015). Inference on differential expression using cluster-specific contrasts of mixed effects. Biostatistics 16, 98-112.
      7. Nguyen, H.D., McLachlan G.J., Ullmann, J.F.P., and Janke, A.L. (2016). Spatial clustering of time-series via mixtures of autoregressive models and Markov random fields. Statistica Neerlandica 70, 414-439.
      8. Viroli, C. and McLachlan, G.J. (2018). Deep Gaussian mixture models. Statistics and Computing. (Advance Access, published 1 December, 2017). To appear. Preprint arXiv:1711.06929.

    Pre-requisites

    Participants are expected to have a knowledge of Statistics at a level corresponding to at least the third year of a three-year undergraduate course in Science. Background knowledge particularly useful for this Course can be obtained from the two monographs McLachlan and Krishnan (2008, The EM Algorithm and Extensions. Second Edition, Wiley) and McLachlan and Peel (2000, Finite Mixture Models, Wiley). This material, however, will be briefly reviewed.

    Short Bio

    Geoff McLachlan is Professor of Statistics in the School of Mathematics and Physics at the University of Queensland. His research in Statistics is in the related fields of classification, cluster and discriminant analyses, image analysis, machine learning, neural networks, and pattern recognition, and in the field of statistical inference. The focus in the latter field has been on the theory and applications of finite mixture models and on estimation via the EM algorithm. A common theme of his research in these fields has been statistical computation, with particular attention being given to the computational aspects of the statistical methodology. He has written over 270 peer-reviewed research articles which have received over 40,000 citations. He has written six monographs, on discriminant analysis (McLachlan, 1992), mixture models (McLachlan and Basford, 1988; McLachlan and Peel, 2000), the EM algorithm (McLachlan and Krishnan, 1997 & 2008), and the analysis of gene expression data (McLachlan, Do, and Ambroise, 2004). He is currently an associate editor of several journals including Advances in Data Analysis and Classification, BMC Bioinformatics, Journal of Classification, Statistics and Computing, and Statistical Modelling. He is a former president of the International Federation of Classification Societies (IFCS). He is a fellow of the Australian Academy of Science and also a fellow of the American Statistical Association and the Royal Statistical Society. From 2007 to 2011 he was an Australian Research Council Professorial Fellow. He has received several awards, including the Pitman Medal of the Statistical Society of Australia in 2010, the IEEE ICDM Research Contributions Award in 2011, and the IFCS Research Medal for Outstanding Research Achievements in 2017.



    Folker Meyer
    (Argonne National Laboratory) [intermediate]
    Skyport2: A Multi Cloud Framework for Executing Scientific Workflows

    Summary

    Executing scientific workflows at scale poses a significant challenge to many teams and institutions, we present a unified system for portable, reproducible execution on local and remote resources.

    The Skyport system [1] provides containerized workflow execution with Docker [2] across systems boundaries. Allowing researchers to execute scientific workflows across system boundaries using the AWE [1, 3, 4] workflow engine and the SHOCK [5] as an active object store. AWE and SHOCK are implemented a RESTful service for managing and executing workflows, workflows are specified in Common workflow language (CWL) format[6].

    CWL is a single, multi-vendor language for describing scientific workflows, created by a community of practitioners. Besides being multi-vendor, and thus supported by multiple execution engines, CWL has another critical feature: the separation of science content from computational implementation. This allows experts in each domain to focus on their own area (CWL, http://commonwl.org).
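    To give a feel for what a CWL description looks like, the following is a minimal sketch and not part of the Skyport distribution. It writes a CWL v1.0 CommandLineTool as a Python dictionary and serializes it to JSON, which CWL accepts alongside YAML. The wrapped tool (echo), the input name "message", and the helper name to_json are illustrative assumptions.

```python
import json

# Tool description: what to run and which inputs it takes.
# "echo" and the input name "message" are illustrative assumptions.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {
            "type": "string",
            "inputBinding": {"position": 1},  # first positional argument
        }
    },
    "outputs": [],
}

# Job object: the run-specific values, kept apart from the tool description.
echo_job = {"message": "Hello from CWL"}


def to_json(document):
    """Serialize a CWL document or job object as indented JSON text."""
    return json.dumps(document, indent=2, sort_keys=True)


tool_text = to_json(echo_tool)
job_text = to_json(echo_job)
print(tool_text)
print(job_text)
```

    Keeping the tool description and the job object in separate documents mirrors the separation of concerns described above: the description says how a tool is invoked, while the engine decides where and how it is executed.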

    One large scale use case driving the development of Skyport is MG-RAST [7, 8] (https://www.mg-rast.org), a hosted data analytics system used by over 40,000 researchers all over the planet.

    Syllabus

    • Session 1: Initial system setup and execution of demo workflow
      • System overview
        • Scientific computing & workflows
        • Distributed computing
        • Example use case MG-RAST
        • CWL, Docker (why we use them)
      • Install Skyport2 services (using a single Docker Compose image)
        • Install authentication services, AWE-server and SHOCK-server
        • Load demo data into SHOCK
        • Load demo workflow
        • Install awe-worker node
        • Download results from SHOCK
    • Session 2: Customizing system setup and monitoring execution
      • Setup of expanded system
        • Basis for customization
        • Creation of custom data types
        • Monitoring execution via the web interface or cmd-line
        • Adding tools to workflows
        • Adding workflow steps
      • Hands-on exercise: creating and executing a customized workflow
    • Session 3: Advanced topics
      • Combining AWE with other execution engines
        • Using Singularity [9]
      • Adding data types to SHOCK

    Pre-requisites

    The assumption is that participants will bring a laptop with the ability to execute multiple Docker containers. Participants should test their available memory and hard drive space. Software systems required:

    • Docker
      • Docker Compose (recent version)
    • ASCII editor: e.g. Emacs, vi, Textmate, ...

    The participants will be asked to install software (via Docker), modify configuration files and perform other Unix command line style activities.

    References

      1. Wolfgang Gerlach WT, Andreas Wilke, Dan Olson, Folker Meyer: Container orchestration for scientific workflows. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on: 2015/3/9. IEEE Transactions on Knowledge and Data Engineering 2016: 377-378.
      2. Merkel D: Docker: lightweight Linux containers for consistent development and deployment. Linux Journal 2014, 2014(239):2.
      3. Tang W, Bischof J, Desai N, Mahadik K, Gerlach W, Harrison T, Wilke A, Meyer F: Workload Characterization for MG-RAST Metagenomic Data Analytics Service in the Cloud. In: Proc of IEEE Int’l Conf on Big Data. 2014.
      4. Tang W, Wilkening J, Bischof J, Gerlach W, Wilke A, Desai N, Meyer F: Building Scalable Data Management and Analysis Infrastructure for Metagenomics. In: 5th International Workshop on Data-Intensive Computing in the Clouds. IEEE 2013.
      5. Bischof J, Wilke A, Gerlach W, Harrison T, Paczian T, Tang W, Trimble W, Wilkening J, Desai N, Meyer F: Shock: Active Storage for Multicloud Streaming Data Analysis. In: 2nd IEEE/ACM International Symposium on Big Data Computing: 2015; Limassol, Cyprus.
      6. Common Workflow Language, v1.0.
      7. Wilke A, Bischof J, Gerlach W, Glass E, Harrison T, Keegan KP, Paczian T, Trimble WL, Bagchi S, Grama A et al: The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 2016, 44(D1):D590-594.
      8. Meyer F, Bagchi S, Chaterji S, Gerlach W, Grama A, Harrison T, Paczian T, Trimble WL, Wilke A: MG-RAST version 4 - lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Briefings in bioinformatics 2017.
      9. Kurtzer GM, Sochat V, Bauer MW: Singularity: Scientific containers for mobility of compute. PLoS One 2017, 12(5):e0177459.

    Short Bio

    Folker Meyer is a computational biologist at Argonne National Laboratory and a Professor of Bioinformatics at the University of Chicago. He is also the Deputy Division Director for the Biology Division at Argonne National Laboratory. Folker was trained as a computer scientist, and with that came his interest in building software systems to answer complex biological questions. He is the driving force behind the MG-RAST project.



    Wladek Minor
    (University of Virginia) [introductory/advanced]
    Big Data in Biomedical Sciences

    Summary

    Syllabus

    • Big Data and Big Data in Biomedical Sciences
    • Why big data is perceived as a big problem - technological considerations
    • Data reduction - should we preserve unreduced (raw) data?
    • Databases and databanks
    • Data mining with the use of raw data, databanks and databases
    • Data Integration
    • Automatic and semi-automatic curation of large amounts of data
    • Conversion of databanks into databases
    • Database priorities – content and design
    • Interaction between databases
    • Modern data management in biomedical sciences – necessity or luxury
    • Automatic data harvesting – close reality or still on the horizon
    • Reproducibility of the biomedical experiments - drug discovery considerations
    • Artificial Intelligence and machine learning in drug discovery
    • Big data in medicine - new possibilities
    • Future considerations

    Pre-requisites

    References

    Short Bio

    Harrison Distinguished Professor of Molecular Physiology and Biological Physics, University of Virginia. Development of methods for structural biology, in particular macromolecular structure determination by protein crystallography. Data management in structural biology, data mining as applied to drug discovery, bioinformatics. Member of Center of Structural Genomics of Infectious Diseases. Former Member of Midwest Center for Structural Genomics, New York Center for Structural Genomics and Enzyme Function Initiative.



    Soumya Mohanty
    (University of Texas Rio Grande Valley) [introductory/intermediate]
    Swarm Intelligence Methods for Statistical Regression

    Summary

    Big Data applications generally require flexible, and hence high-dimensional, statistical models to capture meaningful patterns in the data, and this usually leads to challenging non-linear and non-convex global optimization problems. The large data volume that must be handled makes these problems even harder. This course will introduce methods from the field of computational swarm intelligence (SI) that have proven useful in solving such optimization problems, with a focus on an in-depth presentation of Particle Swarm Optimization (PSO). PSO is a metaheuristic inspired by observations of cooperative behavior in multi-agent biological systems. It shows remarkable robustness across a wide range of optimization problems, reducing the burden that is generally involved in tuning stochastic global optimization algorithms. The course will use concrete problems to illustrate the application of PSO to statistical regression and address practical issues that are often encountered in creating a successful implementation.

    Syllabus

    The following is a list of the main topics to be covered in the course.

      • Overview:
          • The role of optimization in statistical analysis
          • Fundamental results in optimization theory
              • Survey of SI algorithms
          • Particle Swarm Optimization
          • Performance benchmarking
      • Tuning considerations
      • Applications of PSO to:
          • Parametric regression
          • Non-parametric regression
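    As an illustration of PSO applied to parametric regression, the following is a minimal sketch and not the course's own implementation. It uses a global-best PSO with an inertia weight to fit the amplitude and frequency of a sinusoid by least squares. The toy signal model, the search bounds, the swarm settings, and all function names are assumptions made for this example.

```python
import math
import random


def pso_minimize(cost, bounds, n_particles=30, n_iters=200, seed=0,
                 w=0.72984, c1=1.49618, c2=1.49618):
    """Global-best PSO; returns (best position, best cost, best-cost history)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    history = [gbest_cost]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the particle to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
        history.append(gbest_cost)
    return gbest, gbest_cost, history


def make_data(a=2.0, f=3.0, n=100, sigma=0.3, seed=1):
    """Toy data: a sinusoid with amplitude a and frequency f in white noise."""
    rng = random.Random(seed)
    t = [k / n for k in range(n)]
    y = [a * math.sin(2 * math.pi * f * tk) + rng.gauss(0.0, sigma) for tk in t]
    return t, y


def sum_squared_residuals(params, t, y):
    """Least-squares cost; non-convex in the frequency parameter."""
    a, f = params
    return sum((yk - a * math.sin(2 * math.pi * f * tk)) ** 2
               for tk, yk in zip(t, y))


bounds = [(0.0, 5.0), (0.5, 10.0)]  # assumed search box for (a, f)
t, y = make_data()
best, best_cost, history = pso_minimize(
    lambda p: sum_squared_residuals(p, t, y), bounds)
print("estimated (a, f):", best, "cost:", best_cost)
```

    The best-cost history is non-increasing by construction, so plotting it is a simple first check when tuning the swarm size, the number of iterations, or the coefficients w, c1, and c2.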

    Pre-requisites

    None, as this is an introductory course. However, familiarity with basic probability theory and statistics will be a plus.

    References

      • The following textbooks (and references therein):
        • "Swarm intelligence methods for statistical regression", Soumya D. Mohanty, To be published (CRC press, Fall 2018).
        • "Fundamentals of Computational Swarm Intelligence", A. P. Engelbrecht, Wiley.
        • "Particle Swarm Optimization", M. Clerc, ISTE.
      • Additional material to be provided during the course.

    Short Bio

    Soumya D. Mohanty, Professor of Physics at UTRGV, completed his PhD degree in 1997 at the Inter-University Center for Astronomy and Astrophysics, India. He subsequently held post-doctoral positions at Northwestern University, Penn State, and the Max-Planck Institute for Gravitational Physics. He was also a visiting scholar with the LIGO project at Caltech. Mohanty's research has focused on solving some of the important data analysis challenges faced in the realization of Gravitational Wave (GW) astronomy across all observational frequency bands. These include semi-parametric regression of very weak signals in noisy data, high-dimensional non-linear parametric regression, time series classification, and analysis of data from large heterogeneous sensor arrays. His work has been funded by grants from the Research Corporation, the U.S. National Science Foundation, and NASA.



    Sankar K. Pal
    (Indian Statistical Institute) [introductory/advanced]
    Machine Intelligence and Soft Granular Mining: Features, Applications and Challenges

    Summary

    The lecture has two parts. Beginning with the role of pattern recognition in data mining and machine intelligence, it describes the various components of granular computing (GrC), information granules, significance of fuzzy sets and rough sets in GrC and its relevance in mining large data sets. Uncertainty modelling in fuzzy-rough framework, including generalised entropy measures, concerning data analytics is emphasized.

    The second part deals with some mining applications such as video tracking in ambiguous situations, selection of genes and miRNAs in bioinformatics, and community detection, target set selection and link prediction in social networks. All these data possess Big Data characteristics. Roles of different kinds of granules, f-information measures and rough lower approximation in mining are demonstrated. Significance of lower approximation for knowledge encoding in designing granular neural networks, estimating the object model in unsupervised tracking, and determining the probability of definite and doubtful regions in cancer classification is illustrated. Finally, the concept of perception granules in natural language understanding and the use of z-numbers in abstracting various semantic-precisiations are explained. New terms like generalized rough entropy, granular flow graph, rough filter, intuitionistic entropy, granular social network model, fuzzy-rough community, double bounded rough sets, and z*-numbers are defined.

    Several examples are provided to explain the aforesaid concepts. The talk concludes by mentioning the challenging issues and future directions of research, including their significance in the computational theory of perception, natural computing, and granulated deep learning.
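    For readers new to rough sets, the following is a minimal sketch of the lower and upper approximations mentioned above, using the classical crisp definitions rather than the fuzzy-rough extensions of the lecture. The toy decision table, the attribute names, and the function names are assumptions made for this example.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Group objects that are indiscernible on the given attributes (the granules)."""
    granules = defaultdict(set)
    for obj_id, values in objects.items():
        granules[tuple(values[a] for a in attributes)].add(obj_id)
    return list(granules.values())


def rough_approximations(granules, target):
    """Return the (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for g in granules:
        if g <= target:   # granule lies entirely inside the target: definite region
            lower |= g
        if g & target:    # granule overlaps the target: possible region
            upper |= g
    return lower, upper


# Hypothetical toy table: five objects described by two symbolic attributes.
objects = {
    "x1": {"a": "high", "b": "yes"},
    "x2": {"a": "high", "b": "yes"},
    "x3": {"a": "low", "b": "no"},
    "x4": {"a": "low", "b": "yes"},
    "x5": {"a": "low", "b": "yes"},
}
target = {"x1", "x2", "x4"}  # the concept to be approximated

granules = equivalence_classes(objects, ["a", "b"])
lower, upper = rough_approximations(granules, target)
boundary = upper - lower                    # the doubtful region
roughness = 1 - len(lower) / len(upper)     # 0 means the concept is crisp
print(sorted(lower), sorted(upper), sorted(boundary), roughness)
```

    Here x4 and x5 are indiscernible but only x4 belongs to the concept, so both fall in the boundary (doubtful) region while x1 and x2 form the definite region.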

    Pre-requisites

    Knowledge of pattern recognition, fuzzy sets, rough sets, neural networks, data mining, probability theory

    Short Bio

    Sankar K. Pal received PhD degrees from Calcutta University and Imperial College, London. He joined the Indian Statistical Institute, Calcutta, in 1975 as a CSIR-SRF, where he became a full professor in 1987, a distinguished scientist in 1998, and the Director in 2005. He is currently an INSA Distinguished Professor Chair. He founded the Machine Intelligence Unit and the Center for Soft Computing Research at his Institute. He has worked at Imperial College, London; UC Berkeley; UMD, College Park; the NASA JSC, Houston, Texas; and the US Naval Research Lab, Washington DC. He has served as an IEEE CS Distinguished Visitor since 1987 and has held several visiting positions in Italy, Poland, Hong Kong, and Australia. A Fellow of IEEE, TWAS, IAPR, IFSA, and all four National Academies for Science/Engineering in India, he is a coauthor of 20 books and more than 400 research publications in the areas of pattern recognition, machine learning, image/video processing, data mining, web intelligence, soft computing, bioinformatics, and cognitive machines. He is or has been on the editorial boards of 20 journals, including some IEEE Transactions. Coveted national and international awards he has received include the S.S. Bhatnagar Prize (India), Padma Shri (India), Khwarizmi International Award (Iran), and NASA Tech Brief Award (USA). He has visited 45 countries as a keynote or invited speaker.



    Lior Rokach
    (Ben-Gurion University of the Negev) [introductory/advanced]
    Ensemble Learning

    Summary

    Ensemble learning imitates our second nature to seek several opinions before making a crucial decision. The core principle is to weigh several individual models and combine them in order to reach a decision or prediction that is better than the one obtained by each of them separately. Researchers from various disciplines have explored the use of ensemble methods since the late seventies.

    This short course aims to provide a methodical and coherent presentation of classical ensemble methods as well as extensions and novel approaches that were recently introduced. Along with algorithmic descriptions of each method, we will provide a description of the settings in which this method is applicable and the trade-offs incurred by using the method.

    Syllabus

    Decision tree learning, Introduction to Ensemble Learning, Random forest, Gradient boosting, Fusion methods, Error-correcting output codes, Ensemble diversity, Ensemble pruning, Bias-variance tradeoff, Combining deep learning with ensemble learning
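    The core principle of combining several individual models can be illustrated with a minimal sketch of bagging: decision stumps are trained on bootstrap samples and combined by majority vote. This is an illustration under stated assumptions, not the course material. The toy one-dimensional data set, the number of models, and all function names are made up for this example.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(yi != l_lab for yi in left)
                      + sum(yi != r_lab for yi in right))
            if best is None or errors < best[0]:
                best = (errors, j, thr, l_lab, r_lab)
    return best[1:]


def predict_stump(stump, row):
    j, thr, l_lab, r_lab = stump
    return l_lab if row[j] <= thr else r_lab


def majority_vote(labels):
    """Fusion rule: the label predicted by the most models wins."""
    return Counter(labels).most_common(1)[0][0]


def fit_bagging(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    return majority_vote(predict_stump(m, row) for m in models)


# Hypothetical toy data: one feature, class 1 when the feature is at least 0.5.
X = [[k / 20] for k in range(20)]
y = [0 if row[0] < 0.5 else 1 for row in X]
models = fit_bagging(X, y)
print(predict_bagging(models, [0.0]), predict_bagging(models, [0.95]))
```

    Random forests add random feature selection at each split on top of this scheme, while boosting replaces the independent bootstrap samples with sequentially reweighted training sets.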

    Pre-requisites

    Familiarity with probability theory, statistics, and algorithms will be assumed, at the level typically taught at the bachelor level in computer science or engineering programs. Basic understanding of machine learning and data mining would be helpful.

    References

    1. Chen, Tianqi, and Carlos Guestrin. "Xgboost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2016.
    2. Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.

    3. Rokach, Lior, and Oded Z. Maimon. Data mining with decision trees: theory and applications. Second Edition. World Scientific, 2014.

    4. Rokach, Lior. Pattern classification using ensemble methods. Vol. 75. World Scientific, 2010.

    5. Rokach, Lior. Decision forest: Twenty years of research, Information Fusion 27, 111-125

    6. Zhou, Zhi-Hua. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.

    Short Bio

    Lior Rokach is a professor of data science at the Ben-Gurion University of the Negev, where he currently serves as the chair of the Department of Software and Information System Engineering. His research interests lie in the areas of Machine Learning, Big Data, Deep Learning and Data Mining and their applications. Prof. Rokach is the author of over 300 peer-reviewed papers in leading journals and conference proceedings. Rokach has authored several popular books in data science, including Data Mining with Decision Trees (1st edition, World Scientific Publishing, 2007, 2nd edition, World Scientific Publishing, 2015). He is also the editor of "The Data Mining and Knowledge Discovery Handbook" (1st edition, Springer, 2005; 2nd edition, 2010) and "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015). He currently serves as an editorial board member of ACM Transactions on Intelligent Systems and Technology (ACM TIST) and an area editor for Information Fusion (Elsevier).



    Michael Rosenblum
    (University of Potsdam) [introductory/intermediate]
    Synchronization Approach to Time Series Analysis

    Summary

    The course presents data analysis techniques that are based on the theory of coupled oscillators and aimed at reconstructing the network structure and inferring node properties from observations. The approach assumes that the multivariate time series under study are outputs of weakly coupled self-sustained oscillators and that the signals are appropriate for phase estimation. Therefore, the course will begin with a short introduction to the theory of interacting oscillators and their synchronization. We will discuss effects of phase and frequency locking and illustrate them with numerous examples. Next, we will discuss how these ideas can be used in data analysis. The main idea is to infer a model for the phase dynamics of the observed network. Hence, the first step is to estimate phases from time series, and this step will be discussed in detail. Then we will proceed with an analysis of the phase model, with the goal of obtaining the strength of directed links and inferring the network structure. Thus, our technique represents an approach to the connectivity problem, relevant for physiology, neuroscience, and other fields. We demonstrate that our technique provides effective phase connectivity which is close, though not identical, to the structural one. However, for weak coupling we achieve a good separation between existing and non-existing connections. We also discuss how the frequencies and phase response curves of interacting units can be estimated. Next, we extend the approach to cover the case of pulse-coupled neuron-like units, where only times of spikes can be registered, so that the data represent point processes. We will also demonstrate how the inferred phase model can predict synchronisation domains in experiments.

    Syllabus

    1. Self-sustained oscillator. Forced and coupled oscillators, phase and frequency locking, synchronization domains. Model of phase dynamics.
    2. Phase and its estimation from data. Hilbert Transform. Phase vs angle variable.
    3. Strength of interaction. Synchronization indices (phase locking values).
    4. Direction of interaction. Reconstruction and analysis of phase model for two interacting units.
    5. Networks, structural and effective connectivity. Triplet analysis, true and spurious connections.
    6. An example: cardio-respiratory interaction.
    7. Case of pulse coupled oscillators.
    8. Properties of nodes: natural frequencies, phase response curves.
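    Items 2 and 3 of the syllabus can be sketched as follows: the phase is estimated from the analytic signal (Hilbert transform), and a 1:1 synchronization index (phase locking value) is computed from the phase difference. This is a minimal sketch and not the lecturer's code. It assumes narrow-band oscillatory signals, uses a direct O(n^2) discrete Fourier transform to stay within the standard library, and all function names and test signals are made up for this example.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal x + i*H[x] obtained by zeroing negative frequencies."""
    n = len(x)
    spectrum = [sum(xk * cmath.exp(-2j * math.pi * m * k / n)
                    for k, xk in enumerate(x))
                for m in range(n)]
    half = (n + 1) // 2
    for m in range(1, n):
        if m < half:
            spectrum[m] *= 2   # positive frequencies are doubled
        elif not (n % 2 == 0 and m == n // 2):
            spectrum[m] = 0    # negative frequencies are removed (Nyquist bin kept)
    return [sum(sm * cmath.exp(2j * math.pi * m * k / n)
                for m, sm in enumerate(spectrum)) / n
            for k in range(n)]


def phases(x):
    """Instantaneous phase: the argument of the analytic signal."""
    return [cmath.phase(z) for z in analytic_signal(x)]


def synchronization_index(x1, x2):
    """1:1 phase locking value: |<exp(i(phi1 - phi2))>|, between 0 and 1."""
    diffs = [p1 - p2 for p1, p2 in zip(phases(x1), phases(x2))]
    return abs(sum(cmath.exp(1j * d) for d in diffs) / len(diffs))


n = 128
same_a = [math.cos(2 * math.pi * 8 * k / n) for k in range(n)]
same_b = [math.cos(2 * math.pi * 8 * k / n - 1.0) for k in range(n)]  # constant lag
other = [math.cos(2 * math.pi * 13 * k / n) for k in range(n)]        # other frequency

locked = synchronization_index(same_a, same_b)
unlocked = synchronization_index(same_a, other)
print(locked, unlocked)
```

    An index near 1 indicates a nearly constant phase difference, while an index near 0 indicates a drifting one. Note that a high index alone says nothing about the direction of interaction, which is the subject of item 4.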

    References

    • A. Pikovsky, M. Rosenblum, and J. Kurths, Synchronization. A Universal Concept in Nonlinear Sciences, Cambridge University Press, 2001.

    Pre-requisites

    Basic knowledge of calculus and differential equations.

    Short Bio

    MICHAEL ROSENBLUM has been a research scientist and Professor in the Department of Physics and Astronomy, University of Potsdam, Germany, since 1997.

    His main research areas are nonlinear dynamics, synchronization theory, and time series analysis, with application to biological systems. The most important results include description of phase synchronization of chaotic systems, analysis of complex collective dynamics in large networks of interacting oscillators, development of feedback techniques for control of collective synchrony in neuronal networks (as a model of deep brain stimulation of Parkinsonian patients), methods for reconstruction of oscillatory networks from observations, and application of these methods to analysis of cardio-respiratory interaction in humans.

    Michael Rosenblum studied physics at Moscow Pedagogical University, and went on to work in the Mechanical Engineering Research Institute of the USSR Academy of Sciences, where he was awarded a PhD in physics and mathematics. He was a Humboldt fellow in the Max-Planck research group on nonlinear dynamics, and a visiting scientist at Boston University. He is a co-author (with A. Pikovsky and J. Kurths) of the book "Synchronization. A Universal Concept in Nonlinear Sciences", Cambridge University Press, 2001 and has published over 100 peer-reviewed publications.

    Michael Rosenblum served as a member of the Editorial Board of Physical Review E from 2008 to 2013. Since 2014 he has been an Editor of Chaos: Int. J. of Nonlinear Science. He was named an American Physical Society Outstanding Referee for 2015.



    Hanan Samet
    (University of Maryland) [introductory/intermediate]
    Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial and Spatio-textual Databases, Geographic Information Systems (GIS), and Location-based Services

    Summary

    The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial and spatiotextual databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.

    We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space, so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space.
    For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO JAVA applet is presented that illustrates these methods (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). They are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava).

    The above has been in the context of the traditional geometric representation of spatial data. In the final part, we review the more recent textual representation used in location-based services, where the key issue is resolving ambiguities. For example, does "London" correspond to the name of a person or a location and, if it corresponds to a location, which of the over 700 different instances of "London" is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM at http://tinyurl.com/newsstand-cacm or a cached version at http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the accompanying video at https://vimeo.com/106352925

    Syllabus

    1. Introduction a. Sample queries b. Spatial Indexing c. Sorting approach d. Minimum bounding rectangles (e.g., R-tree) e. Disjoint cells (e.g., R+-tree, k-d-B-tree) f. Uniform grid g. Location-based queries vs. feature-based queries h. Region quadtree i. Dimension reduction j. Pyramid k. Region quadtrees vs. pyramids l. Space ordering methods

    2. Points a. point quadtree b. MX quadtree c. PR quadtree d. k-d tree e. Bintree f. BSP tree

    3. Lines a. Strip tree b. PM1 quadtree c. PM2 quadtree d. PM3 quadtree e. PMR quadtree

    4. Rectangles and arbitrary objects a. MX-CIF quadtree b. Loose quadtree c. Partition fieldtree d. R-tree

    5. Surfaces and Volumes a. Restricted quadtree b. Region octree c. PM octree

    6. Metric Data a. vp-tree b. gh-tree c. mb-tree

    7. Operations a. Incremental nearest object location b. Boolean set operations

    8. Spatial Database Issues a. General issues b. Specific issues

    9. Indexing for spatiotextual databases and location-based services delivered on platforms such as smart phones and tablets a. Incorporation of spatial synonyms in search engines b. Toponym recognition c. Toponym resolution d. Spatial reader scope e. Incorporation of spatiotemporal data f. System integration issues g. Demos of live systems on smart phones

    10. Example systems a. SAND internet browser b. JAVA spatial data applets c. STEWARD d. NewsStand e. TwitterStand
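    Two syllabus items, the PR quadtree (2c) and incremental nearest object location (7a), can be sketched together. The code below is a minimal illustration and not the SAND or VASCO implementation. It builds a bucket PR quadtree and reports points in order of increasing distance from a query point by keeping tree blocks and points in one priority queue, so the number of neighbors need not be fixed in advance. It assumes that all points lie inside the root square; the class and function names and the random test data are made up for this example.

```python
import heapq
import itertools
import math
import random


class PRQuadtree:
    """Bucket PR quadtree over the square with corner (x0, y0) and side `size`."""

    def __init__(self, x0, y0, size, capacity=1):
        self.x0, self.y0, self.size, self.capacity = x0, y0, size, capacity
        self.points = []      # points held here while the node is a leaf
        self.children = None  # four quadrants once the node has split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split an overfull leaf; the size guard stops endless splitting
        # when duplicate points are inserted.
        if len(self.points) > self.capacity and self.size > 1e-9:
            h = self.size / 2
            self.children = [
                PRQuadtree(self.x0 + dx * h, self.y0 + dy * h, h, self.capacity)
                for dy in (0, 1) for dx in (0, 1)]
            pts, self.points = self.points, []
            for q in pts:
                self._child_for(q).insert(q)

    def _child_for(self, p):
        h = self.size / 2
        ix = 1 if p[0] >= self.x0 + h else 0
        iy = 1 if p[1] >= self.y0 + h else 0
        return self.children[2 * iy + ix]

    def min_dist(self, q):
        """Smallest possible distance from q to any point inside this block."""
        dx = max(self.x0 - q[0], 0.0, q[0] - (self.x0 + self.size))
        dy = max(self.y0 - q[1], 0.0, q[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)


def incremental_nearest(root, q):
    """Yield (distance, point) pairs in order of increasing distance from q."""
    counter = itertools.count()  # tie-breaker so the heap never compares items
    heap = [(root.min_dist(q), next(counter), root)]
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, PRQuadtree):
            if item.children is None:
                for p in item.points:
                    d = math.hypot(q[0] - p[0], q[1] - p[1])
                    heapq.heappush(heap, (d, next(counter), p))
            else:
                for child in item.children:
                    heapq.heappush(heap, (child.min_dist(q), next(counter), child))
        else:
            yield dist, item  # nothing left in the queue can be closer


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(50)]
tree = PRQuadtree(0.0, 0.0, 1.0)
for p in points:
    tree.insert(p)

query = (0.3, 0.3)
three_nearest = list(itertools.islice(incremental_nearest(tree, query), 3))
print(three_nearest)
```

    Because the generator is lazy, asking for one more neighbor resumes the search where it stopped instead of restarting it, which is the practical benefit of the incremental formulation.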

    References

    1. H. Samet. ``Foundations of Multidimensional and Metric Data Structures.'' Morgan-Kaufmann, San Francisco, 2006.

    2. H. Samet. ``A sorting approach to indexing spatial data.'' International Journal of Shape Modeling. 14(1):15--37, June 2008.

    3. G. R. Hjaltason and H. Samet. ``Index-driven similarity search in metric spaces.'' ACM Transactions on Database Systems, 28(4):517--580, December 2003.

    4. G. R. Hjaltason and H. Samet. ``Distance browsing in spatial databases.'' ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD.

    5. G. R. Hjaltason and H. Samet. ``Ranking in spatial databases.'' In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., Portland, ME, August 1995, 83--95. Also Springer-Verlag Lecture Notes in Computer Science

    6. H. Samet. ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.'' Addison-Wesley, Reading, MA, 1990.

    7. H. Samet. ``The Design and Analysis of Spatial Data Structures.'' Addison-Wesley, Reading, MA, 1990.

    8. C. Esperanca and H. Samet. ``Experience with SAND/Tcl: a scripting tool for spatial databases.'' Journal of Visual Languages and Computing, 13(2):229--255, April 2002.

    9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. ``Use of the SAND spatial browser for digital government applications.'' Communications of the ACM, 46(1):63--66, January 2003.

    10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. ``NewsStand: A new view on news.'' Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, November 2008, 144--153. SIGSPATIAL 10-Year Impact Award.

    11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. ``Reading news with maps by exploiting spatial synonyms.'' Communications of the ACM, 57(10):64--77, October 2014.

    12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. ``TwitterStand: News in tweets.'' Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November 2009, 42--51.

    13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. ``Geotagging with local lexicons to build indexes for textually-specified spatial data.'' Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, CA, March 2010, 201--212.

    14. M. D. Lieberman and H. Samet. ``Multifaceted Toponym Recognition for Streaming News.'' Proceedings of the ACM SIGIR Conference. Beijing, July 2011, 843--852.

    15. M. D. Lieberman and H. Samet. ``Adaptive Context Features for Toponym Resolution in Streaming News.'' Proceedings of the ACM SIGIR Conference. Portland, OR, August 2012, 731--740.

    16. M. D. Lieberman and H. Samet. ``Supporting Rapid Processing and Interactive Map-Based Exploration of Streaming News.'' Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Redondo Beach, CA, November 2012, 179--188.

    17. H. Samet, B. C. Fruin, and S. Nutanong. ``Duking it out at the smartphone mobile app mapping API corral: Apple, Google, and the competition.'' In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems (MobiGIS 2012), Redondo Beach, CA, November 2012.

    18. H. Samet, S. Nutanong, and B. C. Fruin. Dynamic presentation consistency issues in smartphone mapping apps. Communications of the ACM, 59(9):58--67, September 2016.

    19. H. Samet, S. Nutanong, and B. C. Fruin. Static presentation consistency issues in smartphone mapping apps. Communications of the ACM, 59(5):88--98, May 2016.

    20. Spatial Data Structure applets at http://www.cs.umd.edu/~hjs/quadtree/index.html.

    Pre-requisites

    Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

    Short Bio

    Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book ``Foundations of Multidimensional and Metric Data Structures'' (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf) published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, ``Design and Analysis of Spatial Data Structures'' and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990.
    He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), 2009 UCGIS Research Award, 2010 CMPS Board of Visitors Award at the University of Maryland, 2011 ACM Paris Kanellakis Theory and Practice Award, 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. The 2008 ACM SIGSPATIAL ACMGIS best paper award winner also received the SIGSPATIAL 10-Year Impact Award. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.



    Rory Smith
    (Monash University) [intermediate/advanced]
    Statistical Inference: Optimal Methods for Learning from Signals in Noise

    Summary

    What is the statistically optimal way to detect and extract information from signals in noisy data? After detecting ensembles of signals, what can we learn about the population of all the signals? This course will address these questions using the language of Bayesian inference. After reviewing the basics of Bayes' theorem, we will frame the problem of signal detection in terms of hypothesis testing and model selection. Extracting information from signals will be cast in terms of computing posterior density functions of signal parameters. After reviewing model selection and parameter estimation, the course will focus on practical methods. Specifically, we will implement sampling algorithms and use them to perform model selection and parameter estimation on signals in synthetic data sets. Finally, we will ask what can be learned about the population properties of an ensemble of signals. This population-level inference will be studied as a hierarchical inference problem.

    Syllabus

    • The basics of Bayesian inference
    • Parameter estimation, hypothesis testing and model selection
    • Sampling methods: MCMC and Nested Sampling
    • Illustrative examples: Detecting signals in noise using Bayesian inference
    • Illustrative examples: Performing parameter estimation to learn about a signal's properties
    • Hierarchical inference: Using "Hyper-parameter" estimation to learn about populations of signals
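
    As a rough illustration of the parameter-estimation and model-selection items above, the following minimal sketch recovers the amplitude of a sinusoid in white Gaussian noise with a random-walk Metropolis sampler, and compares the "signal" and "noise-only" hypotheses with a log Bayes factor computed by simple quadrature. The signal model, the flat prior on [0, 10], the step size, and all function names are assumptions made for this example; they are not taken from the course material.

```python
import math
import random


def make_data(amplitude, n=200, noise_sigma=1.0, seed=1):
    """Synthetic data: a sinusoid of known frequency buried in white Gaussian noise."""
    rng = random.Random(seed)
    template = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    data = [amplitude * s + rng.gauss(0.0, noise_sigma) for s in template]
    return data, template


def log_likelihood(amplitude, data, template, noise_sigma=1.0):
    """Gaussian log-likelihood for a given amplitude (additive constants dropped)."""
    residual_sq = sum((d - amplitude * s) ** 2 for d, s in zip(data, template))
    return -0.5 * residual_sq / noise_sigma ** 2


def metropolis(data, template, prior_max=10.0, n_steps=5000, step=0.2, seed=2):
    """Random-walk Metropolis sampler under a flat prior on [0, prior_max]."""
    rng = random.Random(seed)
    current = prior_max / 2
    current_logl = log_likelihood(current, data, template)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        if 0.0 <= proposal <= prior_max:  # proposals outside the prior are rejected
            proposal_logl = log_likelihood(proposal, data, template)
            accept_prob = math.exp(min(0.0, proposal_logl - current_logl))
            if rng.random() < accept_prob:
                current, current_logl = proposal, proposal_logl
        samples.append(current)
    return samples


def log_bayes_factor(data, template, prior_max=10.0, n_grid=2001):
    """ln(Z_signal / Z_noise); the signal evidence is a Riemann sum over the amplitude."""
    grid = [prior_max * i / (n_grid - 1) for i in range(n_grid)]
    logls = [log_likelihood(a, data, template) for a in grid]
    peak = max(logls)  # factor out the peak to avoid underflow
    integral = sum(math.exp(l - peak) for l in logls) * (grid[1] - grid[0])
    log_z_signal = peak + math.log(integral / prior_max)
    log_z_noise = log_likelihood(0.0, data, template)  # noise-only model: amplitude 0
    return log_z_signal - log_z_noise


data, template = make_data(amplitude=1.0)
samples = metropolis(data, template)
posterior = samples[1000:]  # discard burn-in
print("posterior mean amplitude:", sum(posterior) / len(posterior))
print("ln Bayes factor (signal vs noise):", log_bayes_factor(data, template))
```

    A one-dimensional problem like this could be solved on a grid alone; the sampler is included only because the same loop carries over to higher-dimensional parameter spaces where quadrature is impractical.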

    References

    • Probability Theory: The logic of Science, E.T. Jaynes, Cambridge University Press.
    • Nested sampling for general Bayesian computation, J. Skilling, Bayesian Analysis (2006).
    • For an overview of Markov Chain Monte Carlo (MCMC), see e.g. Bayesian Data Analysis, Andrew Gelman, Chapman & Hall.
    • For a practical implementation of MCMC, see e.g. emcee, D. Foreman-Mackey, https://dfm.io/emcee/current/

    Pre-requisites

    Basic probability theory, sampling, Python, and JupyterHub.

    Short Bio

    Dr. Rory Smith is a lecturer in physics at Monash University in Melbourne, Australia. In 2013-2017, he was a senior postdoctoral fellow at the California Institute of Technology, where he worked on searches for gravitational waves. Dr. Smith participated in the landmark first detection of gravitational waves, for which the 2017 Nobel Prize in Physics was awarded. Dr. Smith's research focuses on detecting astrophysical gravitational-wave signals from black holes and neutron stars, and extracting the rich astrophysical information encoded within to study the fundamental nature of spacetime.



    Jaideep Srivastava
    (University of Minnesota) [intermediate]
    Social Computing - Concepts and Applications

    Summary

    Social Computing is an emerging discipline, and like any discipline at a nascent stage it can mean different things to different people. However, three distinct threads are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and multiplayer online games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The three modules of this course are structured according to this description. The goal of this course is to explore, in a tutorial manner and through case studies and discussion, what Social Computing is, where it is headed, and where it is taking us.

    Syllabus

    • Module 1: Socio-technical systems
      • Introduction to Social Computing
      • Socio-technical systems
        • Examples of a number of social computing systems, e.g. Twitter, Facebook, MMO games, etc.
      • Applying data mining to social computing systems
    • Module 2: Computational Social Science
      • Online trust
      • Social influence
      • Individual and group/team performance
      • Identifying and preventing bad behavior
    • Module 3: Solving Problems of Societal Relevance
      • Social computing for humanitarian assistance
      • Wrap-up discussion
        • Privacy and ethics
        • Where are we headed

    References

    To be provided later.

    Pre-requisites

    This course is intended primarily for graduate students. Following are the potential audiences:

    • Computer Science graduate students: All that is needed for this audience is interest in one of the themes of social computing
    • Social Science graduate students: Some exposure to building models from data, at least what these techniques are and what they can do
    • Management graduate students: Those with MIS focus

    Short Bio

    Jaideep Srivastava (https://www.linkedin.com/in/jaideep-srivastava-50230/) is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of machine learning and data mining. Dr. Srivastava has significant experience in the industry, in both consulting and executive roles. Most recently he was the Chief Scientist for Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he was the data mining architect for Amazon.com (www.amazon.com), built a data analytics department at Yodlee (www.yodlee.com), and served as the Chief Technology Officer for Persistent Systems (www.persistentsys.com). He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics (www.ninjametrics.com), based on his research in behavioral analytics. He was an advisor and Chief Scientist for CogCubed (www.cogcubed.com), an innovative company with the goal of revolutionizing the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc (https://www.teladoc.com/), a public company. He has been a technology advisor to a number of startups at various stages, including Jornaya (https://www.jornaya.com/), a leader in cross-industry lead management, and Kipsu (http://kipsu.com/), which provides an innovative approach to improving service quality in the hospitality industry. Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China.
He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is a technology advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.3 billion citizens of India. Dr. Srivastava has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and an MS and PhD from the University of California, Berkeley.



    Mayte Suárez-Fariñas
    (Icahn School of Medicine at Mount Sinai) [intermediate]
    A Practical Guide to the Analysis of Longitudinal Data Using R

    Summary

    Longitudinal data is obtained when a time-sequence of measurements is made on a response variable for each of a number of subjects in an experimental or observational study. In such cases, the responses of each individual are usually highly similar over time and, thus, classical regression models, which assume independent observations, are inadequate. This course aims to give attendees insight into the theoretical concepts behind, and practical experience with, the models used for the analysis of longitudinal data, particularly mixed-effect models. It will provide (1) an introduction to the theoretical foundations of mixed models, (2) a guide to building, examining, interpreting and comparing mixed-effect models, as well as to conducting hypothesis testing, and (3) the necessary resources to conduct a wide range of longitudinal analyses. All practical exercises will be conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.

    Syllabus

    1. Introduction to linear Mixed-Effect Models.
    2. Random-intercept and random-intercept-and-slope models
    3. Mixed-Effects Models in R’s nlme package
    4. Build, examine, interpret, expand and compare mixed effects models
    5. Testing hypotheses in mixed-effect models through R’s emmeans package
    6. Non-linear Mixed-effect Models
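
    For orientation, the random-intercept-and-slope model named in item 2 is commonly written as below. The notation is an assumption chosen for this sketch (subject i, measurement occasion j, time t_ij) and is not taken from the course material.

```latex
% Linear mixed-effect model with a random intercept and a random slope per subject
y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{D}),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

    Here \beta_0 and \beta_1 are the fixed (population-level) intercept and slope, b_{0i} and b_{1i} are the deviations of subject i from them, and \mathbf{D} is their 2x2 covariance matrix. Setting b_{1i} = 0 gives the random-intercept model. In nlme's formula notation this random-effects structure is typically written as random = ~ time | subject, where time and subject are hypothetical column names.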

    Pre-requisites

    Students must be familiar with basic R functions to read and manipulate data and generate basic data plots. Familiarity with statistical concepts and a basic understanding of regression and ANOVA are expected. Participants need a laptop with R (https://cran.r-project.org/) and RStudio (https://www.rstudio.com/) installed to complete the exercises.

    References

    • Diggle, P. J., Heagerty, P., Liang, K-Y. and Zeger, S. L. (2002) Analysis of Longitudinal Data. Second Edition. Oxford: Oxford University Press.
    • Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004) Applied Longitudinal Analysis. New York: Wiley.

    Short Bio

    Mayte Suarez-Farinas, PhD is currently an Associate Professor at the Center for Biostatistics and The Department of Genetics and Genomics Science of the Icahn School of Medicine at Mount Sinai, New York. She received a master's degree in mathematics from the University of Havana, Cuba and, in 2003, a Ph.D. degree in quantitative analysis from the Pontifical Catholic University of Rio de Janeiro, Brazil. Prior to joining Mount Sinai, she was co-director of Biostatistics at the Center for Clinical and Translational Science at the Rockefeller University, where she developed methodologies for data integration across omic studies, and a framework to evaluate drug response at the molecular level in proof-of-concept studies in inflammatory skin diseases using mixed-effect models and machine learning. Her long-term goals are to develop robust statistical techniques to mine and integrate complex high-throughput data, tailored to specific disease models, with an emphasis on immunological diseases, and to develop precision medicine algorithms to predict treatment response and phenotype.



    Jeffrey Ullman
    (Stanford University) [introductory]
    Big-data Algorithms That Aren't Machine Learning

    Summary

    We shall study algorithms that have been found useful in querying large datasets. The emphasis is on algorithms that cannot be considered "machine learning."

    Syllabus

    • Locality-sensitive hashing: shingling, minhashing, applications;
    • PageRank and related ideas: topic-specific PageRank, combatting link spam;
    • Stream-processing algorithms: counting occurrences, counting unique values, sampling;
    • Graph-processing algorithms: counting neighborhoods, counting triangles, transitive closure.
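
    To make the first syllabus item concrete, here is a minimal sketch of shingling and minhashing: each document becomes a set of k-character shingles, and the fraction of positions at which two minhash signatures agree estimates the Jaccard similarity of the two sets. The shingle length, the number of hash functions, the use of salted SHA-1 as a stand-in for random permutations, and the sample sentences are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of all k-character substrings (shingles) of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def salted_hash(shingle, seed):
    """Deterministic 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_hashes=200):
    """For each hash function, keep the minimum hash value over the (non-empty) set."""
    return [min(salted_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimate_similarity(sig_a, sig_b):
    """Fraction of agreeing signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dogs"
set_a, set_b = shingles(doc_a), shingles(doc_b)
sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
print("exact Jaccard:   ", jaccard(set_a, set_b))
print("minhash estimate:", estimate_similarity(sig_a, sig_b))
```

    The signatures have a fixed length regardless of document size, which is what makes it practical to compare them in bulk; locality-sensitive hashing then bands the signatures so that only likely-similar pairs are compared at all.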

    Pre-requisites

    A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

    References

    We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text: Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org

    Short Bio

    A brief on-line bio is available at http://i.stanford.edu/~ullman/pub/opb.txt



    Andrey Ustyuzhanin
    (National Research University Higher School of Economics) [intermediate/advanced]
    Surrogate Modelling for Fun and Profit

    Summary

    In this mini-course, I’m going to explain a powerful approach that has been used in many industrial cases for complex design optimisations. Surrogate data-driven models help to build a balanced model of a complex process or object. Those models can then be used to study the properties of the complex object or process in a computationally efficient way. The same method can be applied to several scientific cases, ranging from new detector design and the estimation of theory parameters to the substitution of a slow simulator by a faster neural network. I'll present an overview of the method and several industrial and scientific examples: the design of an active muon shield for the SHiP experiment and fast tuning of Pythia parameters. The course includes both a theoretical part and practical hands-on exercises.

    Syllabus

    • Introduction to optimisation problems in modern design techniques

    • Practical cases: SHiP detector design optimisation, parametric model optimisation

    • Overview of non-gradient optimisation methods. Gaussian processes as an example of Bayesian inference.

    • Choice of acquisition functions

    • Tips and tweaks. Saving resources on computationally-heavy tasks by altering basic assumptions

    • Variational optimisation, natural evolution strategy

    • Adversarial variational optimisation. Design of a deep surrogate model
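
    The idea behind the variational-optimisation and evolution-strategy items can be sketched in a few lines: instead of differentiating the black-box objective, one samples candidates from a Gaussian search distribution and estimates the gradient of the expected objective with respect to the distribution mean. The sketch below is a simplified variant with a fixed isotropic sigma and standardised scores; full natural evolution strategies also adapt the covariance using the natural gradient. The objective, its optimum at (3, -2), and all hyperparameters are hypothetical choices for this example.

```python
import random
import statistics


def black_box(x):
    """Stand-in for an expensive, non-differentiable simulator (hypothetical objective)."""
    target = (3.0, -2.0)
    return sum(abs(xi - ti) for xi, ti in zip(x, target))


def evolution_strategy(objective, start, sigma=0.1, lr=0.01,
                       population=50, n_iters=300, seed=0):
    """Minimise `objective` by following a sampled gradient of the search-distribution mean."""
    rng = random.Random(seed)
    mu = list(start)
    dim = len(mu)
    for _ in range(n_iters):
        # Sample perturbations from the search distribution N(mu, sigma^2 I).
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
        scores = [objective([m + sigma * e for m, e in zip(mu, eps)]) for eps in noise]
        # Standardise the scores so the step size does not depend on the objective's scale.
        mean = statistics.fmean(scores)
        std = statistics.pstdev(scores) or 1.0
        shaped = [(s - mean) / std for s in scores]
        # Monte Carlo estimate of the gradient with respect to mu.
        grad = [sum(z * eps[k] for z, eps in zip(shaped, noise)) / (population * sigma)
                for k in range(dim)]
        mu = [m - lr * g for m, g in zip(mu, grad)]  # descent step
    return mu


best = evolution_strategy(black_box, start=[0.0, 0.0])
print("estimated optimum:", best, "objective:", black_box(best))
```

    Only objective evaluations are needed, so the same loop applies when the objective is a simulator call; the cost is the number of evaluations per iteration, which is the motivation for replacing the simulator with a cheaper surrogate.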

    References

    • Methods based on response surfaces, Jones 2001

    • Smola A. J. and Scholkopf, B. (2004) A tutorial on support vector regression. Statistics and Computing

    • Bergstra J., Yamins, D., Cox, D. D. (2013) Making a Science of Model Search

    • Kramer O., Ciaurri, D. E., & Koziel, S. (2011). Derivative-free optimization.

    • Zitzler E., et al. Improving the strength Pareto evolutionary algorithm for multiobjective optimization. (2002)

    • Snelson E. Tutorial: Gaussian process models for machine learning.

    • Wierstra D, Schaul T, Glasmachers T, Sun Y, Peters J, Schmidhuber J. Natural evolution strategies

    • Wang Z, Zoghi M, Hutter F, Matheson D, De Freitas N. Bayesian Optimization in High Dimensions via Random Embeddings

    • Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R. Scalable bayesian optimization using deep neural networks

    • Louppe G., Cranmer, K. (2017). Adversarial Variational Optimization of Non-Differentiable Simulators arXiv:1707.07113

     

    Pre-requisites

    Python 3 [basic level], familiarity with machine learning concepts and tools [basic level]

    Short Bio

    Dr Andrey Ustyuzhanin is the head of Yandex-CERN joint research projects as well as the head of the Laboratory of Methods for Big Data Analysis at NRU HSE. His team is a member of two frontier international research collaborations: LHCb, a collaboration at the Large Hadron Collider, and SHiP (Search for Hidden Particles), an experiment designed for the discovery of New Physics. His group is unique in both collaborations since the majority of its members come from the computer and data science worlds. The primary priority of his research is the design of new machine learning methods and their use in solving tough scientific enigmas, thus improving the fundamental understanding of our world. Among the projects he has worked on are improving the efficiency of online triggers at LHCb, speeding up the BDT-based online processing formula, designing custom convolutional neural networks for processing tracks of muon-like particles on smartphone cameras, and developing an algorithm for tracking in scintillator optical fibre detectors and emulsion cloud chambers. These projects aid research at various experiments: LHCb, OPERA, SHiP and CRAYFIS. Discovering the deeper truth about the Universe by applying data analysis methods is the primary source of inspiration in Andrey’s lifelong journey. Andrey is a co-author of a Coursera course on machine learning aimed at solving particle physics challenges and an organiser of annual international summer schools on a similar set of topics. Andrey graduated from the Moscow Institute of Physics and Technology in 2000 and received a PhD in 2007 from the Institute of System Programming of the Russian Academy of Sciences.



    Wil van der Aalst
    (RWTH Aachen University) [introductory/intermediate]
    Process Mining: Data Science in Action

    Summary

    Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.

    The course explains the key analysis techniques in process mining. Participants will learn various process discovery algorithms. These can be used to automatically learn process models from raw event data. Various other process analysis techniques that use event data will be presented. Moreover, the course will provide easy-to-use software, real-life data sets, and practical skills to directly apply the theory in a variety of application domains.

    Process mining provides not only a bridge between data mining and business process management; it also helps to address the classical divide between "business" and "IT". Evidence-based business process management based on process mining helps to create a common ground for business process improvement and information systems development.

    Note that Gartner recently identified process mining software as a new and important class of software. Currently, there are over 25 vendors providing commercial process mining tools and a rapid uptake of the new technology is expected.

    Syllabus

    The course focuses on process mining as the bridge between data science and process science. The course will introduce the three main types of process mining.

      1. The first type of process mining is discovery. A discovery technique takes an event log and produces a process model without using any a-priori information. An example is the Alpha-algorithm that takes an event log and produces a process model (a Petri net) explaining the behavior recorded in the log.
      2. The second type of process mining is conformance. Here, an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa.
      3. The third type of process mining is enhancement. Here, the idea is to extend or improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. An example is the extension of a process model with performance information, e.g., showing bottlenecks.

    Process mining techniques can be used in an offline, but also online setting. The latter is known as operational support. An example is the detection of non-conformance at the moment the deviation actually takes place. Another example is time prediction for running cases, i.e., given a partially executed case the remaining processing time is estimated based on historic information of similar cases.
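
    As a small illustration of discovery, the sketch below computes the directly-follows relation and the resulting footprint of an event log, which is the first step of the Alpha-algorithm mentioned above. Constructing the Petri net from the footprint is omitted. The event log and the function names are hypothetical and chosen for this example only.

```python
from itertools import product


def directly_follows(log):
    """All pairs (a, b) such that b immediately follows a in at least one trace."""
    return {(trace[i], trace[i + 1]) for trace in log for i in range(len(trace) - 1)}


def footprint(log):
    """Classify each ordered activity pair: '->' causal, '<-' reverse causal,
    '||' parallel (both orders observed), '#' never directly adjacent."""
    follows = directly_follows(log)
    activities = sorted({activity for trace in log for activity in trace})
    relation = {}
    for a, b in product(activities, repeat=2):
        forward, backward = (a, b) in follows, (b, a) in follows
        if forward and backward:
            relation[a, b] = "||"
        elif forward:
            relation[a, b] = "->"
        elif backward:
            relation[a, b] = "<-"
        else:
            relation[a, b] = "#"
    return relation


# Hypothetical event log: each trace is the ordered list of activities of one case.
event_log = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "e", "d"],
]
fp = footprint(event_log)
activities = sorted({activity for trace in event_log for activity in trace})
print("   " + "  ".join(f"{b:>2}" for b in activities))
for a in activities:
    print(f"{a:>2} " + "  ".join(f"{fp[a, b]:>2}" for b in activities))
```

    In this log, b and c occur in both orders and are therefore classified as parallel, while e never appears next to b or c, which suggests it is an alternative branch between a and d.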

    The course illustrates the concepts and algorithms with many examples based on real-life event logs. After taking this course, participants will be able to run process mining projects and will have a good understanding of the Business Process Intelligence field.

    Pre-requisites

    This course is aimed at both students (Master or PhD level) and professionals. A basic understanding of logic, sets, and statistics (at the undergraduate level) is assumed. Basic computer skills are required to use the software provided with the course (but no programming experience is needed). Participants are also expected to have an interest in process modeling and data mining but no specific prior knowledge is assumed as these concepts are introduced in the course.

    References

    W.M.P. van der Aalst. Process Mining: Data Science in Action. Springer-Verlag, Berlin, 2016. (The course will also provide access to slides, several articles, software tools, and data sets.)

    Short Bio

    Prof.dr.ir. Wil van der Aalst is a full professor at RWTH Aachen University leading the Process and Data Science (PADS) group. He is also part-time affiliated with the Technische Universiteit Eindhoven (TU/e). Until December 2017, he was the scientific director of the Data Science Center Eindhoven (DSC/e) and led the Architecture of Information Systems group at TU/e. Since 2003, he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a visiting researcher at Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 200 journal papers, 20 books (as author or editor), 450 refereed conference/workshop publications, and 65 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an H-index of 138 and has been cited over 85,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over ten scientific journals, he is also playing an advisory role for several companies, including Fluxicon, Celonis, Processgold, and Bright Cape. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. In 2017, he was awarded a Humboldt Professorship.



    Zhongfei Zhang
    (Binghamton University) [introductory/advanced]
    Relational and Multimedia Data Learning

    Summary

    This course aims to give the audience an introduction to knowledge discovery and machine learning theories, together with case studies in real-world applications, for relational and multimedia data as well as the relationships between them. The course begins with an extensive introduction to the fundamental concepts and theories of knowledge discovery and machine learning for relational and multimedia data, and then showcases several important real-world applications as case studies of big data knowledge discovery and learning from relational and multimedia data.

    Syllabus

    The course consists of three two-hour sessions. The syllabus is as follows:

    • First session: Introduction to the fundamental concepts and theories for relational and multimedia data, with a specific focus on an overview of the wide spectrum of techniques and technologies available for knowledge discovery and learning, the relationships among them, and their application to big data scenarios through real-world case studies;
    • Second session: Specific discussions on the classic and state-of-the-art methods for relational data learning;
    • Third session: Specific discussions on the state-of-the-art methods for multimedia data learning.

    Pre-requisites:

    College-level mathematics and fundamentals of computer science.

    References:

      1. Bo Long, Zhongfei (Mark) Zhang, and Philip S. Yu, Relational Data Clustering: Models, Algorithms, and Applications, Taylor & Francis/CRC Press, 2010, ISBN: 9781420072617
      2. Zhongfei (Mark) Zhang and Ruofei Zhang, Multimedia Data Mining -- A Systematic Introduction to Concepts and Theory, Taylor & Francis Group/CRC Press, 2008, ISBN: 9781584889663
      3. Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu, Machine Learning Approaches to Link-Based Clustering, in Link Mining: Models, Algorithms and Applications, Edited by Philip S. Yu, Christos Faloutsos, and Jiawei Han, Springer, 2010
      4. Zhen Guo, Zhongfei Zhang, Eric P. Xing, and Christos Faloutsos, Multimodal Data Mining in a Multimedia Database Based on Structured Max Margin Learning, ACM Transactions on Knowledge Discovery and Data Mining, ACM Press, 2015
      5. http://www.cs.binghamton.edu/~forweb/publicationsactive.html

    Short Bio:

    Zhongfei (Mark) Zhang is a full professor of Computer Science at State University of New York (SUNY) at Binghamton, and directs the Multimedia Research Computing Laboratory at the university. He has also served as a QiuShi Chair Professor at Zhejiang University, China, and as the Director of the Data Science and Engineering Research Center at that university while he was on leave from State University of New York (SUNY) at Binghamton, USA. He received a B.S. in Electronics Engineering (with Honors) and an M.S. in Information Sciences, both from Zhejiang University, China, and a PhD in Computer Science from the University of Massachusetts at Amherst, USA. His research interests include machine learning and artificial intelligence, data mining and knowledge discovery, multimedia information indexing and retrieval, computer vision, and pattern recognition. He is the author and co-author of the first monograph on multimedia data mining and the first monograph on relational data clustering, respectively. His research is sponsored by a wide spectrum of government funding agencies, industrial labs, as well as private agencies. He has published over 200 papers in premier venues in his areas and is an inventor on more than 30 patents. He has served on several journal editorial boards and received several professional awards, including best paper awards at the premier conferences in his areas.