Communication Cost in Parallel Query Evaluation
Dan Suciu, University of Washington
What is the minimum amount of communication required to compute a query in parallel, on a cluster of servers? In the simplest case, when we join two relations without skewed data, we can get away with reshuffling the data once. But if the data is skewed, or if we need to compute multiple joins, then it turns out that the total communication cost is significantly larger than the size of the input data. In this talk I will describe a class of algorithms for which we can prove formally that their communication cost is optimal. I will end by describing several open questions, including concrete queries for which the optimal communication cost is unknown.
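The skew-free base case mentioned above can be sketched concretely: each relation is hash-partitioned on the join key in a single communication round, after which every server joins its local fragments. This is an illustrative sketch (the relation names, data, and server count are my assumptions, not from the talk):

```python
# Single-round hash reshuffle for joining R(a, b) and S(b, c) on b.
# With non-skewed data, each tuple travels to exactly one server,
# so total communication equals the input size.

def shuffle(relation, key_index, num_servers):
    """Partition tuples across servers by hashing the join key."""
    partitions = [[] for _ in range(num_servers)]
    for t in relation:
        partitions[hash(t[key_index]) % num_servers].append(t)
    return partitions

def parallel_join(R, S, num_servers=4):
    r_parts = shuffle(R, 1, num_servers)   # R(a, b): join key b is at index 1
    s_parts = shuffle(S, 0, num_servers)   # S(b, c): join key b is at index 0
    output = []
    for r_part, s_part in zip(r_parts, s_parts):
        # Local hash join on each server's fragments.
        index = {}
        for (a, b) in r_part:
            index.setdefault(b, []).append(a)
        for (b, c) in s_part:
            for a in index.get(b, []):
                output.append((a, b, c))
    return output

R = [(1, 'x'), (2, 'y'), (3, 'x')]
S = [('x', 10), ('y', 20)]
print(sorted(parallel_join(R, S)))
```

A skewed join key (one value of `b` appearing in most tuples) would overload a single server, which is exactly why the skewed and multi-join cases need the more sophisticated algorithms the talk discusses.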
Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books, Data on the Web: from Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011). He is a Fellow of the ACM, holds twelve US patents, received best paper awards at SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, the 10 Year Most Influential Paper Award at ICDE 2013, and the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF CAREER Award and of an Alfred P. Sloan Fellowship. Suciu is an associate editor for the Journal of the ACM, the VLDB Journal, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS and ACM TOIS. Suciu's PhD students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016, respectively, and Nilesh Dalvi was a runner-up in 2008.
Learning Models over Relational Databases
Dan Olteanu, University of Oxford
In this talk I will give an overview of recent results on learning classification and regression models over training datasets defined by feature extraction queries over relational databases. I will argue that the performance of the learning task can benefit tremendously from the sparsity in the relational data and the structure of the relational feature extraction queries. The mainstream approach to learning over relational data is to materialise the training dataset, export it out of the database system, and then import it into specialised statistical software packages. These three steps are very expensive and completely unnecessary.
I will instead argue for an in-database approach that avoids these steps by decomposing the learning task into aggregates and pushing them past the joins of the feature extraction queries. This approach comes with lower asymptotic complexity than the mainstream approach and several orders-of-magnitude speed-up over state-of-the-art systems such as TensorFlow, R, and scikit-learn, whenever the latter do not run into memory or internal design limitations.
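The core idea of pushing aggregates past joins can be illustrated with a toy example of my own construction (not the authors' system): an aggregate such as SUM(x*y), which arises in least-squares gradient computations, can be computed over a join R(x, k) ⋈ S(k, y) without ever materialising the join result, by factoring the sum per join key.

```python
# SUM(x*y) over the join of R(x, k) and S(k, y), computed two ways.
from collections import defaultdict
from itertools import product

R = [(2.0, 'a'), (3.0, 'a'), (5.0, 'b')]   # tuples (x, k)
S = [('a', 1.0), ('a', 4.0), ('b', 7.0)]   # tuples (k, y)

# Naive: materialise the join, then aggregate.
# Worst case O(|R| * |S|) intermediate tuples.
naive = sum(x * y for (x, k1), (k2, y) in product(R, S) if k1 == k2)

# Pushed: one linear pass per relation, then combine per join key,
# using the identity sum over the join of x*y = sum_k (sum_x x) * (sum_y y).
sum_x = defaultdict(float)
sum_y = defaultdict(float)
for x, k in R:
    sum_x[k] += x
for k, y in S:
    sum_y[k] += y
pushed = sum(sum_x[k] * sum_y[k] for k in sum_x.keys() & sum_y.keys())

print(naive, pushed)  # both equal 60.0
```

The pushed version runs in time linear in the input relations rather than in the (potentially much larger) join result, which is the source of the asymptotic gains mentioned above.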
I will also highlight on-going work on linear algebra over databases and point out exciting directions for future work.
This work is based on a long-standing collaboration with my PhD students Maximilian Schleich and Jakub Zavodny and more recent collaboration with Mahmoud Abo-Khamis and Hung Q. Ngo from RelationalAI and XuanLong Nguyen from the University of Michigan.
Dan Olteanu is Professor of Computer Science at the University of Oxford and Computer Scientist at RelationalAI. He received his BSc in Computer Science from Politehnica University of Bucharest in 2000 and his PhD from the University of Munich in 2005. He spends his time understanding hard computational challenges and designing simple, scalable solutions to them. He has published over 70 papers in the areas of database systems, AI, and theoretical computer science, contributing to XML query processing, incomplete information and probabilistic databases, factorised databases, scalable and incremental in-database optimisation, and the commercial systems LogicBlox and RelationalAI. He co-authored the book Probabilistic Databases (2011). He is the recipient of an ERC Consolidator grant (2016) and an Oxford Outstanding Teaching award (2009). He has served as a member of over 60 programme committees, as associate editor for PVLDB and IEEE TKDE, as track chair for IEEE ICDE'15, group leader for ACM SIGMOD'15, vice chair for ACM SIGMOD'17, and co-chair for AMW'18. He is currently serving as associate editor for ACM TODS. Eight of Olteanu's former students received the prize for the best thesis in their respective year and cohort at Oxford.
Democracy Big Bang: What data management can(not) do for journalism
Ioana Manolescu, INRIA
The tremendous power of Big Data has not been lost on journalists. As more and more human activity leaves electronic traces, or happens exclusively (or first) in an electronic manner, content management technologies such as databases, knowledge representation, information retrieval and natural language processing are increasingly called upon to help journalists automate and expedite their work.
In recent years, I have worked to understand the connections between existing (or missing!) algorithms and techniques and the real-world problems faced by journalists, whose work has become increasingly precarious due to the shift in advertising revenue, yet remains essential to the functioning of a modern society. I will discuss results obtained in this area together with colleagues, within the ANR ContentCheck project and the Inria-AIST joint team WebClaimExplain, then present a subjective list of open problems and a perspective on the future of data journalism and fact-checking as content management problems that I believe our community should study.
Ioana Manolescu is the lead of the CEDAR Inria team, focusing on rich data analytics at cloud scale. She is a member of the PVLDB Endowment Board of Trustees and a co-president of the ACM SIGMOD Jim Gray PhD dissertation committee. Recently, she has been a general chair of the IEEE ICDE 2018 conference, an associate editor for PVLDB 2017 and 2018, and the program chair of SSDBM 2016. She has co-authored more than 130 articles in international journals and conferences, and contributed to several books. Her main research interests include data models and algorithms for computational fact-checking, performance optimizations for semistructured data and the Semantic Web, and distributed architectures for complex, large-scale data.
Blockchain 2.0: opportunities and risks
Patrick Valduriez, INRIA
Popularized by bitcoin and other digital currencies, the blockchain has the potential to revolutionize our economic and social systems. Blockchain was invented for bitcoin to solve the double-spending problem of previous digital currencies without the need for a trusted central authority. The original blockchain is a public, distributed ledger that can record and share transactions among a number of computers in a secure and permanent way. It is a complex distributed database infrastructure, combining several technologies such as P2P, data replication, consensus protocols, and cryptography.
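The ledger's "secure and permanent" property rests on a simple data structure: each block stores the hash of its predecessor, so altering any past block invalidates every later hash. A minimal sketch of this chaining idea (illustrative only; real blockchains add consensus, P2P replication, and digital signatures on top):

```python
import hashlib
import json

def block_hash(prev, txs):
    # Hash the block's contents, including the previous block's hash,
    # so tampering with any block breaks the chain from that point on.
    payload = json.dumps({'prev': prev, 'txs': txs}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    prev = chain[-1]['hash'] if chain else '0' * 64
    chain.append({'prev': prev, 'txs': transactions,
                  'hash': block_hash(prev, transactions)})

def verify(chain):
    prev = '0' * 64
    for block in chain:
        if block['prev'] != prev:
            return False
        if block['hash'] != block_hash(block['prev'], block['txs']):
            return False
        prev = block['hash']
    return True

chain = []
append_block(chain, ['alice pays bob 5'])
append_block(chain, ['bob pays carol 2'])
assert verify(chain)
chain[0]['txs'] = ['alice pays bob 500']   # tamper with history...
assert not verify(chain)                   # ...and verification fails
```

This tamper evidence is what lets the ledger be shared among mutually distrusting computers; the consensus protocol then decides which chain extension everyone accepts.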
The term Blockchain 2.0 refers to new applications of the blockchain that go beyond transactions and enable the exchange of assets without powerful intermediaries. Examples of applications are smart contracts, persistent digital IDs, intellectual property rights, blogging, voting, reputation, etc. Blockchain 2.0 could dramatically cut transaction costs by automating operations and removing intermediaries. It could allow people to monetize their own information and creators of intellectual property to be properly compensated. The potential impact on society is also huge, as excluded people could join the global economy, e.g. by having free digital bank accounts.
In this talk, I will introduce Blockchain 2.0 technologies and applications, and discuss the opportunities and risks. In developing countries, for instance, the lack of existing infrastructure and regulation may be a chance to embrace the blockchain revolution and leapfrog traditional solutions. But there are also risks, related to regulation, security, privacy, or integration with existing practice, which must be well understood and addressed.
Patrick Valduriez is a senior scientist at Inria and LIRMM, University of Montpellier, France. He has also been a professor of computer science at University Pierre et Marie Curie (UPMC) in Paris and a researcher at Microelectronics and Computer Technology Corp. in Austin, Texas. He received his Ph.D. degree and Doctorat d'Etat in CS from UPMC in 1981 and 1985, respectively. He is the head of the Zenith team (between Inria and University of Montpellier, LIRMM) that focuses on data science, in particular data management in large-scale distributed and parallel systems and scientific data management. He has authored and co-authored many technical papers and several textbooks, among which "Principles of Distributed Database Systems". He currently serves as associate editor of several journals, including the VLDB Journal, Distributed and Parallel Databases, and Internet and Databases. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD 2004, EDBT 2008, and VLDB 2009. He obtained the best paper award at VLDB 2000. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria – French Academy of Sciences – Dassault Systèmes. He is an ACM Fellow.