Paper Digest: SIGMOD 2020 Highlights
The ACM Special Interest Group on Management of Data (SIGMOD) conference is one of the top conferences on database management systems and data management technology. In 2020, it is to be held virtually due to the COVID-19 pandemic.
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights and summaries to quickly get the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
TABLE 1: SIGMOD 2020 Papers, Keynotes, Tutorials, Demos, and Student Abstracts
|Systems and ML: When the Sum is Greater than Its Parts
|Recommending Deployment Strategies for Collaborative Tasks
|Dong Wei, Senjuti Basu Roy, Sihem Amer-Yahia
|We propose StratRec, an optimization-driven middle layer that recommends deployment strategies and alternative deployment parameters to requesters by accounting for worker availability.
|Human-in-the-loop Outlier Detection
|Chengliang Chai, Lei Cao, Guoliang Li, Jian Li, Yuyu Luo, Samuel Madden
|In this work, we propose a human-in-the-loop outlier detection approach HOD that effectively leverages human intelligence to discover the true outliers.
|QUAD: Quadratic-Bound-based Kernel Density Visualization
|Tsz Nam Chan, Reynold Cheng, Man Lung Yiu
|Our goal is to improve the performance of KDV, in order to support large datasets (e.g., one million points) and high screen resolutions (e.g., 1280 x 960 pixels).
|ShapeSearch: A Flexible and Efficient System for Shape-based Exploration of Trendlines
|Tarique Siddiqui, Paul Luh, Zesheng Wang, Karrie Karahalios, Aditya Parameswaran
|We propose ShapeSearch, an efficient and flexible pattern-searching tool, that enables the search for desired patterns via multiple mechanisms: sketch, natural-language, and visual regular expressions.
|Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using Materialization
|Liming Dong, Qiushi Bai, Taewoo Kim, Taiji Chen, Weidong Liu, Chen Li
|We present a novel middleware-based technique called Marviq.
|Transactional Causal Consistency for Serverless Computing
|Chenggang Wu, Vikram Sreekanti, Joseph M. Hellerstein
|We present protocols for MTCC implemented in a system called HYDROCACHE.
|Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings
|Tarique Siddiqui, Alekh Jindal, Shi Qiao, Hiren Patel, Wangchao Le
|In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer.
|Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure
|Ingo Müller, Renato Marroquín, Gustavo Alonso
|In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing.
|Starling: A Scalable Query Engine on Cloud Functions
|Matthew Perron, Raul Castro Fernandez, David DeWitt, Samuel Madden
|In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization.
|Learning a Partitioning Advisor for Cloud Databases
|Benjamin Hilprecht, Carsten Binnig, Uwe Röhm
|In this paper, we introduce a new learned partitioning advisor based on Deep Reinforcement Learning (DRL) for OLAP-style workloads.
|DB4ML – An In-Memory Database Kernel with Machine Learning Support
|Matthias Jasny, Tobias Ziegler, Tim Kraska, Uwe Roehm, Carsten Binnig
|In this paper, we revisit the question of how ML algorithms can be best integrated into existing DBMSs to not only avoid expensive data copies to external ML tools but also to comply with regulatory reasons.
|Active Learning for ML Enhanced Database Systems
|Lin Ma, Bailu Ding, Sudipto Das, Adith Swaminathan
|In this paper, we address this performance degradation by using B-instances to collect additional data during deployment.
|Qd-tree: Learning Data Layouts for Big Data Analytics
|Zongheng Yang, Badrish Chandramouli, Chi Wang, Johannes Gehrke, Yinan Li, Umar Farooq Minhas, Per-Åke Larson, Donald Kossmann, Rajeev Acharya
|In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques.
|Facilitating SQL Query Composition and Analysis
|Zainab Zolaktaf, Mostafa Milani, Rachel Pottinger
|We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution.
|MONSOON: Multi-Step Optimization and Execution of Queries with Partially Obscured Predicates
|Sourav Sikdar, Chris Jermaine
|In this paper, we describe a query optimizer called the Monsoon optimizer.
|Causal Relational Learning
|Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, Sudeepa Roy, Dan Suciu
|In this paper, we present a formal framework for causal inference from such relational data.
|Sample Debiasing in the Themis Open World Database System
|Laurel Orr, Magdalena Balazinska, Dan Suciu
|We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population.
|Stochastic Package Queries in Probabilistic Databases
|Matteo Brucato, Nishant Yadav, Azza Abouzied, Peter J. Haas, Alexandra Meliou
|We provide methods for in-database support of decision making under uncertainty.
|Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints
|Xi Liang, Zechao Shang, Sanjay Krishnan, Aaron J. Elmore, Michael J. Franklin
|We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples.
|Mining Approximate Acyclic Schemes from Relations
|Batya Kenig, Pranay Mundra, Guna Prasaad, Babak Salimi, Dan Suciu
|In this paper we present Maimon, a system for discovering approximate acyclic schemes and MVDs from data.
|AliCoCo: Alibaba E-commerce Cognitive Concept Net
|Xusheng Luo, Luxin Liu, Yonghua Yang, Le Bo, Yuanpeng Cao, Jinghang Wu, Qiang Li, Keping Yang, Kenny Q. Zhu
|In this paper, we propose to construct a large-scale E-commerce Cognitive Concept Net named "AliCoCo", which is practiced in Alibaba, the largest Chinese e-commerce platform in the world.
|A1: A Distributed In-Memory Graph Database
|Chiranjeeb Buragohain, Knut Magne Risvik, Paul Brett, Miguel Castro, Wonhee Cho, Joshua Cowhig, Nikolas Gloy, Karthik Kalyanaraman, Richendra Khanna, John Pao, Matthew Renzelmann, Alex Shamis, Timothy Tan, Shuheng Zheng
|In this paper we describe the A1 data model, RDMA optimized data structures and query execution.
|IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2
|Yuanyuan Tian, En Liang Xu, Wei Zhao, Mir Hamid Pirahesh, Sui Jun Tong, Wen Sun, Thomas Kolanko, Md. Shahidul Haque Apu, Huijuan Peng
|In this paper, we propose an in-DBMS graph query approach, IBM Db2 Graph, to support synergistic and retrofittable graph queries inside the IBM Db2 relational database.
|An Ontology-Based Conversation System for Knowledge Bases
|Abdul Quamar, Chuan Lei, Dorian Miller, Fatma Ozcan, Jeffrey Kreulen, Robert J. Moore, Vasilis Efthymiou
|In this paper, we propose an ontology-based conversation system for domain-specific KBs.
|Aggregation Support for Modern Graph Analytics in TigerGraph
|Alin Deutsch, Yu Xu, Mingxi Wu, Victor E. Lee
|We describe how GSQL, TigerGraph’s graph query language, supports the specification of aggregation in graph analytics.
|GIANT: Scalable Creation of a Web-scale Ontology
|Bang Liu, Weidong Guo, Di Niu, Jinwen Luo, Chaoyue Wang, Zhen Wen, Yu Xu
|In this paper, we present GIANT, a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs.
|The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?
|Magda Balazinska, Surajit Chaudhuri, Anastasia Ailamaki, Juliana Freire, Sailesh Krishnamurthy, Michael Stonebraker
|Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries
|Pierre Bourhis, Daniel Deutch, Yuval Moskovitch
|In this paper we present the first (to our knowledge) algebraic provenance model, for a fragment of update queries, that is invariant under set equivalence.
|Causality-Guided Adaptive Interventional Debugging
|Anna Fariha, Suman Nath, Alexandra Meliou
|We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures.
|PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
|Yinjun Wu, Val Tannen, Susan B. Davidson
|This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy.
|BugDoc: Algorithms to Debug Computational Processes
|Raoni Lourenço, Juliana Freire, Dennis Shasha
|We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures.
|Computing Local Sensitivities of Counting Queries with Joins
|Yuchao Tao, Xi He, Ashwin Machanavajjhala, Sudeepa Roy
|In this paper, we present a novel approach to compute local sensitivity of counting queries involving join operations by tracking and summarizing tuple sensitivities.
|Long-lived Transactions Made Less Harmful
|Jongbin Kim, Hyunsoo Cho, Kihwang Kim, Jaeseon Yu, Sooyong Kang, Hyungsoo Jung
|In this paper, we formalize such rules into our version pruning theorem and version classification, of which all form theoretical foundations for our new version management system, vDriver, that bases its record versioning on a new principle: Single In-row Remaining Off-row (SIRO) versioning.
|Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks
|Erfan Zamanian, Julian Shun, Carsten Binnig, Tim Kraska
|In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions.
|Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning
|Guna Prasaad, Alvin Cheung, Dan Suciu
|Towards addressing this, we propose Strife—a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control.
|A Transactional Perspective on Execute-order-validate Blockchains
|Pingcheng Ruan, Dumitrel Loghin, Quang-Trung Ta, Meihui Zhang, Gang Chen, Beng Chin Ooi
|Inspired by optimistic concurrency control in modern databases, we propose a novel method to enhance the execute-order-validate architecture, by reordering transactions to reduce the abort rate.
|Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates
|Surabhi Gupta, Sanket Purandare, Karthik Ramachandra
|We present Aggify, a technique for optimizing loops over query results that overcomes these overheads.
|Querying Shared Data with Security Heterogeneity
|Yang Cao, Wenfei Fan, Yanghao Wang, Ke Yi
|We formalize query answering as a bi-criteria optimization problem, to minimize both data sharing toll and parallel query evaluation cost. Despite the hardness, we develop a set of approximate algorithms to generate distributed query plans that minimize data sharing toll and reduce parallel evaluation cost.
|SAGMA: Secure Aggregation Grouped by Multiple Attributes
|Timon Hackenjos, Florian Hahn, Florian Kerschbaum
|In this work we present SAGMA — an encryption scheme for performing secure aggregation grouped by multiple attributes.
|Cryptε: Crypto-Assisted Differential Privacy on Untrusted Servers
|Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, Somesh Jha
|In this work, we propose Cryptε, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model (2) without any trusted data collector like in the local model.
|Estimating Numerical Distributions under Local Differential Privacy
|Zitao Li, Tianhao Wang, Milan Lopuhaä-Zwakenberg, Ninghui Li, Boris Škorić
|We introduce a new reporting mechanism, called the square wave (SW) mechanism, which exploits the numerical nature in reporting.
|FalconDB: Blockchain-based Collaborative Database
|Yanqing Peng, Min Du, Feifei Li, Raymond Cheng, Dawn Song
|In this paper, we present FalconDB, which enables different parties with limited hardware resources to efficiently and securely collaborate on a database.
|Exact Single-Source SimRank Computation on Large Graphs
|Hanzhi Wang, Zhewei Wei, Ye Yuan, Xiaoyong Du, Ji-Rong Wen
|In this paper, we present ExSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs.
|Distributed Processing of k Shortest Path Queries over Dynamic Road Networks
|Ziqiang Yu, Xiaohui Yu, Nick Koudas, Yang Liu, Yifan Li, Yueting Chen, Dingyu Yang
|We therefore propose KSP-DG, a distributed algorithm for identifying k-shortest paths in a dynamic graph.
|On the Optimization of Recursive Relational Queries: Application to Graph Queries
|Louis Jachiet, Pierre Genevès, Nils Gesbert, Nabil Layaida
|We propose mu-RA, a variation of the Relational Algebra equipped with a fixpoint operator for expressing recursive relational queries.
|Pensieve: Skewness-Aware Version Switching for Efficient Graph Processing
|Tangwei Ying, Hanhua Chen, Hai Jin
|In this work, we observe: 1) high degree vertices incur much more significant storage overheads during graph version evolving compared to low degree vertices; 2) the skewed access frequency among graph versions greatly influences the system performance for version reproducing.
|Extending Graph Patterns with Conditions
|Grace Fan, Wenfei Fan, Yuanhao Li, Ping Lu, Chao Tian, Jingren Zhou
|We propose an extension of graph patterns, referred to as conditional graph patterns and denoted as CGPs.
|Elastic Machine Learning Algorithms in Amazon SageMaker
|Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado, Amir Sadoughi, Yury Astashonok, Piali Das, Can Balioglu, Saswata Chakravarty, Madhav Jha, Philip Gautier, David Arpin, Tim Januschowski, Valentin Flunkert, Yuyang Wang, Jan Gasthaus, Lorenzo Stella, Syama Rangapuram, David Salinas, Sebastian Schelter, Alex Smola
|We discuss such challenges and derive requirements for an industrial-scale ML platform. Next, we describe the computational model behind Amazon SageMaker, which is designed to meet such challenges.
|Timon: A Timestamped Event Database for Efficient Telemetry Data Processing and Analytics
|Wei Cao, Yusong Gao, Feifei Li, Sheng Wang, Bingchen Lin, Ke Xu, Xiaojie Feng, Yucong Wang, Zhenjun Liu, Gejin Zhang
|Timon is a timestamped event database that aims to support aggregations and handle late arrivals both correctly (i.e., upholding the exactly-once semantics) and efficiently.
|Vertica-ML: Distributed Machine Learning in Vertica Database
|Arash Fard, Anh Le, George Larionov, Waqas Dhillon, Chuck Bear
|In this paper, we present our distributed machine learning subsystem within the Vertica database. We explain the architecture of the subsystem, and present a set of experiments to evaluate the performance of the machine learning algorithms implemented on top of it.
|Database Workload Capacity Planning using Time Series Analysis and Machine Learning
|Antony S. Higginson, Mihaela Dediu, Octavian Arsene, Norman W. Paton, Suzanne M. Embury
|In this paper we look at the forecasting techniques in use today and evaluate if those techniques are applicable to the deeper layers of the technological stack such as clustered database instances, applications and groups of transactions that make up the database workload.
|The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development
|Micah J. Smith, Carles Sala, James Max Kanter, Kalyan Veeramachaneni
|To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems.
|When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web
|In this talk, I will discuss our work on Dataset Search, which provides search capabilities over potentially all dataset repositories on the Web.
|The Challenge of Building Effective, Enterprise-scale Data Lakes
|In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.
|Cleaning Denial Constraint Violations through Relaxation
|Stella Giannakopoulou, Manos Karpathiotakis, Anastasia Ailamaki
|We propose an approach that performs probabilistic repair of denial constraint violations on-demand, driven by the exploratory analysis that users perform.
|On Multiple Semantics for Declarative Database Repairs
|Amir Gilad, Daniel Deutch, Sudeepa Roy
|We show that there are no one-size-fits-all semantics for repairs in this inclusive setting, and we consequently introduce multiple alternative semantics, presenting the case for using each of them.
|Discovery Algorithms for Embedded Functional Dependencies
|Ziheng Wei, Sven Hartmann, Sebastian Link
|We show that the discovery problem of eFDs is NP-complete, W[2]-complete in the output, and has a minimum solution space that is larger than the maximum solution space for functional dependencies.
|SCODED: Statistical Constraint Oriented Data Error Detection
|Jing Nathan Yan, Oliver Schulte, MoHan Zhang, Jiannan Wang, Reynold Cheng
|We develop SCODED, an SC-Oriented Data Error Detection system, comprising two key components: (1) SC Violation Detection : checks whether an SC is violated on a given dataset, and (2) Error Drill Down : identifies the top-k records that contribute most to the violation of an SC.
|A Statistical Perspective on Discovering Functional Dependencies in Noisy Data
|Yunjia Zhang, Zhihan Guo, Theodoros Rekatsinas
|We study the problem of discovering functional dependencies (FD) from a noisy data set.
|Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines
|Michael Haubenschild, Caetano Sauer, Thomas Neumann, Viktor Leis
|In this work, we propose a new logging and recovery design that supports incremental and fuzzy checkpointing, index recovery, out-of-memory workloads, and low-latency transaction commits.
|Lethe: A Tunable Delete-Aware LSM Engine
|Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Manos Athanassoulis
|To address these challenges, in this paper, we build a new key-value storage engine, Lethe, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order.
|BinDex: A Two-Layered Index for Fast and Robust Scans
|Linwei Li, Kai Zhang, Jiading Guo, Wen He, Zhenying He, Yinan Jing, Weili Han, X. Sean Wang
|In order to obtain fast and robust scans under all selectivities, this paper proposes BinDex, a two-layered index structure based on binned bitmaps that can be used to significantly accelerate the scan operations for in-memory column stores.
|Analysis of Indexing Structures for Immutable Data
|Cong Yue, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, Sheng Wang, Xiaokui Xiao
|To alleviate the above problem, we present a comprehensive analysis of the existing index structures for immutable data, and evaluate both their asymptotic and empirical performance.
|Tree-Encoded Bitmaps
|Harald Lang, Alexander Beischl, Viktor Leis, Peter Boncz, Thomas Neumann, Alfons Kemper
|We propose a novel method to represent compressed bitmaps.
|ALEX: An Updatable Adaptive Learned Index
|Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska
|In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes.
|Learning Multi-Dimensional Indexes
|Vikram Nathan, Jialin Ding, Mohammad Alizadeh, Tim Kraska
|In this paper, we introduce Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout.
|The Case for a Learned Sorting Algorithm
|Ani Kristo, Kapil Vaidya, Ugur Çetintemel, Sanchit Misra, Tim Kraska
|In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data.
|QuickSel: Quick Selectivity Learning with Mixture Models
|Yongjoo Park, Shucheng Zhong, Barzan Mozafari
|In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms.
|Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
|Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, Gautam Das
|In this paper, we propose two complementary approaches that are effective for this scenario.
|Efficient Algorithms for Densest Subgraph Discovery on Large Directed Graphs
|Chenhao Ma, Yixiang Fang, Reynold Cheng, Laks V.S. Lakshmanan, Wenjie Zhang, Xuemin Lin
|In this paper, we develop an efficient and scalable DDS solution.
|GPU-Accelerated Subgraph Enumeration on Partitioned Graphs
|Wentian Guo, Yuchen Li, Mo Sha, Bingsheng He, Xiaokui Xiao, Kian-Lee Tan
|In this paper, we propose a new approach for GPU-accelerated subgraph enumeration that can efficiently scale to large graphs beyond the GPU memory.
|In-Memory Subgraph Matching: An In-depth Study
|Shixuan Sun, Qiong Luo
|We study the performance of eight representative in-memory subgraph matching algorithms.
|G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching
|Yeonsu Park, Seongyun Ko, Sourav S. Bhowmick, Kyoungmin Kim, Kijae Hong, Wook-Shin Han
|In this paper, for the first time, we present a comprehensive study of the existing cardinality estimation techniques for subgraph matching queries, scaling far beyond the original experiments.
|Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees
|Tashin Reza, Matei Ripeanu, Geoffrey Sanders, Roger Pearce
|We present a new algorithmic pipeline for approximate matching that combines edit-distance based matching with systematic graph pruning.
|A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
|Venkata Vamsikrishna Meduri, Lucian Popa, Prithviraj Sen, Mohamed Sarwat
|In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms.
|ZeroER: Entity Resolution using Zero Labeled Examples
|Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, Saravanan Thirumuruganathan
|We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires Zero labeled examples, yet can achieve performance comparable to supervised approaches?
|Towards Interpretable and Learnable Risk Analysis for Entity Resolution
|Zhaoqiang Chen, Qun Chen, Boyi Hou, Zhanhuai Li, Guoliang Li
|In this paper, we propose an interpretable and learnable framework for risk analysis, which aims to rank the labeled pairs based on their risks of being mislabeled.
|SLIM: Scalable Linkage of Mobility Data
|Fuat Basık, Hakan Ferhatosmanoğlu, Buğra Gedik
|We present a scalable solution to link entities across mobility datasets using their spatio-temporal information.
|Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach
|Yaoshu Wang, Chuan Xiao, Jianbin Qin, Xin Cao, Yifang Sun, Wei Wang, Makoto Onizuka
|In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
|Fast Join Project Query Evaluation using Matrix Multiplication
|Shaleen Deep, Xiao Hu, Paraschos Koutris
|In this paper, we study how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication.
|Maintaining Acyclic Foreign-Key Joins under Updates
|Qichen Wang, Ke Yi
|In this paper, we study the problem of incrementally maintaining the query results of these joins under updates, i.e., insertion and deletion of tuples to any of the relations.
|Thrifty Query Execution via Incrementability
|Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, Michael J. Franklin
|In this paper, we propose a new metric, incrementability, to quantify the cost-effectiveness of IVM to decide how eagerly or lazily databases should incrementally execute a query.
|A Method for Optimizing Opaque Filter Queries
|Wenjia He, Michael R. Anderson, Maxwell Strome, Michael Cafarella
|We propose voodoo indexing, a two-phase method for optimizing opaque filter queries.
|Functional-Style SQL UDFs With a Capital ‘F’
|Christian Duta, Torsten Grust
|This paper describes how to compile such functional-style UDFs into SQL:1999 recursive common table expressions.
|Learning to Validate the Predictions of Black Box Classifiers on Unseen Data
|Sebastian Schelter, Tammo Rukat, Felix Biessmann
|We propose a simple approach to automate the validation of deployed ML models by estimating the model’s predictive performance on unseen, unlabeled serving data.
|Learning Over Dirty Data Without Cleaning
|Jose Picado, John Davis, Arash Termehchy, Ga Young Lee
|We propose Dirty Learn, DLearn, a novel learning system that learns directly over dirty databases effectively and efficiently without any preprocessing.
|Complaint-driven Training Data Debugging for Query 2.0
|Weiyuan Wu, Lampros Flokas, Eugene Wu, Jiannan Wang
|We propose two novel heuristic approaches based on influence functions which both require linear retraining steps.
|Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks
|Riccardo Cappuzzo, Paolo Papotti, Saravanan Thirumuruganathan
|We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases.
|Minimization of Classifier Construction Cost for Search Queries
|Shay Gershtein, Tova Milo, Gefen Morami, Slava Novgorodov
|The goal of our research is to devise effective algorithms to choose which classifiers one should train to address a given query load while minimizing the cost.
|Scaling Up Distance Labeling on Graphs with Core-Periphery Properties
|Wentao Li, Miao Qiao, Lu Qin, Ying Zhang, Lijun Chang, Xuemin Lin
|To scale up distance labeling, this paper proposes Core-Tree (CT) Index to facilitate a critical and effective trade-off between the index size and query time.
|Factorized Graph Representations for Semi-Supervised Learning from Sparse Data
|Krishna Kumar P., Paul Langton, Wolfgang Gatterbauer
|We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph.
|Reliable Data Distillation on Graph Convolutional Network
|Wentao Zhang, Xupeng Miao, Yingxia Shao, Jiawei Jiang, Lei Chen, Olivier Ruas, Bin Cui
|Therefore, we propose Reliable Data Distillation, a reliable data driven semi-supervised GCN training method.
|Regular Path Query Evaluation on Streaming Graphs
|Anil Pacaci, Angela Bonifati, M. Tamer Özsu
|We propose deterministic algorithms to efficiently evaluate persistent RPQs under both arbitrary and simple path semantics in a uniform manner.
|Timely Reporting of Heavy Hitters using External Memory
|Prashant Pandey, Shikha Singh, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, Cynthia A. Phillips
|We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence becomes a heavy hitter).
|A Framework for Emulating Database Operations in Cloud Data Warehouses
|Mohamed A. Soliman, Lyublena Antova, Marc Sugiyama, Michael Duller, Amirhossein Aleyasen, Gourab Mitra, Ehab Abdelhamid, Mark Morcos, Michele Gage, Dmitri Korablev, Florian M. Waas
|In this paper we build on our earlier work in adaptive data virtualization and present novel techniques that allow running applications utilizing sophisticated database features within foreign query engines lacking the native support of such features.
|Taurus Database: How to be Fast, Available, and Frugal in the Cloud
|Alex Depoutovitch, Chong Chen, Jin Chen, Paul Larson, Shu Lin, Jack Ng, Wenlin Cui, Qiang Liu, Wei Huang, Yong Xiao, Yongjun He
|In this paper, we describe the design of Taurus, a new multi-tenant cloud database system.
|Reliability Analytics for Cloud Based Distributed Databases
|Mathieu B. Demarne, Jim Gramling, Tomer Verona, Miso Cilimdzic
|We present RADD, an innovative analytic pipeline used to measure reliability and availability for cloud-based distributed databases by leveraging the vast amount of telemetry present in the cloud.
|CockroachDB: The Resilient Geo-Distributed SQL Database
|Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, Peter Mattis
|This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware.
|Azure SQL Database Always Encrypted
|Panagiotis Antonopoulos, Arvind Arasu, Kunal D. Singh, Ken Eguro, Nitish Gupta, Rajat Jain, Raghav Kaushik, Hanuma Kodavalla, Donald Kossmann, Nikolas Ogg, Ravi Ramamurthy, Jakub Szymaszek, Jeffrey Trimmer, Kapil Vaswani, Ramarathnam Venkatesan, Mike Zwilling
|This paper presents Always Encrypted, a recently released feature of Microsoft SQL Server that uses column granularity encryption to provide cryptographic data protection guarantees.
|Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning
|Ori Bar El, Tova Milo, Amit Somech
|To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook.
|Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
|Cong Yan, Yeye He
|We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists would manipulate data, which are documented by data science notebooks widely available today.
|IDEBench: A Benchmark for Interactive Data Exploration
|Philipp Eichmann, Emanuel Zgraggen, Carsten Binnig, Tim Kraska
|In this paper we argue that this is due to the fact that the workloads and metrics of popular analytical benchmarks such as TPC-H or TPC-DS were designed for traditional performance reporting scenarios, and do not capture distinctive IDE characteristics.
|Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data
|Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, Yukun Zheng, Carsten Binnig, Jean-Daniel Fekete, Dominik Moritz
|In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads.
|Benchmarking Spreadsheet Systems
|Sajjadur Rahman, Kelly Mack, Mangesh Bendre, Ruilin Zhang, Karrie Karahalios, Aditya Parameswaran
|We present a benchmarking study that evaluates and compares the performance of three popular systems, Microsoft Excel, LibreOffice Calc, and Google Sheets, on a range of canonical spreadsheet computation operations.
|Order-Preserving Key Compression for In-Memory Search Trees
|Huanchen Zhang, Xiaoxuan Liu, David G. Andersen, Michael Kaminsky, Kimberly Keeton, Andrew Pavlo
|We present the High-speed Order-Preserving Encoder (HOPE) for in-memory search trees.
|A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics
|Anil Shanbhag, Samuel Madden, Xiangyao Yu
|In this paper, we adopt a model-based approach to understand when and why the performance gains of running queries on GPUs vs on CPUs vary from the bandwidth ratio (which is roughly 16× on modern hardware).
|Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects
|Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, Volker Markl
|In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0.
|Robust Performance of Main Memory Data Structures by Configuration
|Tiemo Bang, Ismail Oukid, Norman May, Ilia Petrov, Carsten Binnig
|In this paper, we present a new approach for achieving robust performance of data structures making it easier to reuse the same design for different hardware generations but also for different workloads.
|Black or White? How to Develop an AutoTuner for Memory-based Analytics
|Mayuresh Kunjir, Shivnath Babu
|We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems.
|Vista: Optimized System for Declarative Feature Transfer from Deep CNNs at Scale
|Supun Nakandala, Arun Kumar
|We present Vista, a new data system that resolves these issues by elevating this workload to a declarative level on top of dataflow and deep learning systems.
|Optimizing Machine Learning Workloads in Collaborative Environments
|Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Ziawasch Abedjan, Tilmann Rabl, Volker Markl
|To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse.
|GOGGLES: Automatic Image Labeling with Affinity Coding
|Nilaksh Das, Sanya Chaba, Renzhi Wu, Sakshi Gandhi, Duen Horng Chau, Xu Chu
|We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set.
|DeepSqueeze: Deep Semantic Compression for Tabular Data
|Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, Ugur Cetintemel
|We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation.
|TRACER: A Framework for Facilitating Accurate and Interpretable Analytics for High Stakes Applications
|Kaiping Zheng, Shaofeng Cai, Horng Ruey Chua, Wei Wang, Kee Yuan Ngiam, Beng Chin Ooi
|In this paper, we propose a general framework TRACER to facilitate accurate and interpretable predictions, with a novel model TITV devised for healthcare analytics and other high stakes applications such as financial investment and risk management.
|Application Driven Graph Partitioning
|Wenfei Fan, Ruochun Jin, Muyang Liu, Ping Lu, Xiaojian Luo, Ruiqi Xu, Qiang Yin, Wenyuan Yu, Jingren Zhou
|For an algorithm of our interest, what partitioning strategy fits it the best and improves its parallel execution? Is it possible to develop graph algorithms with partition transparency, such that the algorithms work under different partitions without changes? This paper aims to answer these questions.
|Progressive Top-K Nearest Neighbors Search in Large Road Networks
|Dian Ouyang, Dong Wen, Lu Qin, Lijun Chang, Ying Zhang, Xuemin Lin
|In this paper, we propose a novel parameter-free index-based solution for the kNN query based on the concept of tree decomposition in large road networks.
|Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs
|Yingxia Shao, Shiyue Huang, Xupeng Miao, Bin Cui, Lei Chen
|In this paper, to clearly study the efficiency of various node sampling methods in the context of second-order random walk, we design a cost model, and then propose a new node sampling method following the acceptance-rejection paradigm to achieve a better balance between memory and time cost.
|Hub Labeling for Shortest Path Counting
|Yikai Zhang, Jeffrey Xu Yu
|While many works have been devoted to devising efficient distance oracles to compute the shortest distance between any vertices s and t, we study the problem of efficiently counting the number of shortest paths between s and t in light of its applications in tasks such as betweenness-related analysis.
|CHASSIS: Conformity Meets Online Information Diffusion
|Hui Li, Hui Li, Sourav S. Bhowmick
|In this paper, we present a novel framework called CHASSIS to characterize online information diffusion by bridging the classical information diffusion model with conformity from social psychology.
|Architecture-Intact Oracle for Fastest Path and Time Queries on Dynamic Spatial Networks
|Victor Junqiu Wei, Raymond Chi-Wing Wong, Cheng Long
|In this paper, we propose an efficient distance and path oracle on dynamic road networks using the randomization technique.
|Data Series Progressive Similarity Search with Probabilistic Quality Guarantees
|Anna Gogolou, Theophanis Tsandilas, Karima Echihabi, Anastasia Bezerianos, Themis Palpanas
|We present and experimentally evaluate a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering.
|A GPU-friendly Geometric Data Model and Algebra for Spatial Queries
|Harish Doraiswamy, Juliana Freire
|As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects.
|Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures
|John Paparrizos, Chunwei Liu, Aaron J. Elmore, Michael J. Franklin
|Importantly, this study (i) omitted multiple distance measures, including a classic measure in the time-series literature; (ii) considered only a single time-series normalization method; and (iii) reported only raw classification error rates without statistically validating the findings, resulting in or fueling four misconceptions in the time-series literature.
|MIRIS: Fast Object Track Queries in Video
|Favyen Bastani, Songtao He, Arjun Balasingam, Karthik Gopalakrishnan, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, Sam Madden
|We propose a novel query-driven tracking approach that integrates query processing with object tracking to efficiently process object track queries and address the computational complexity of object detection methods.
|ACM SIGMOD Jim Gray Dissertation Award Talk
|Jose M. Faleiro
|This dissertation proposes and explores the use of deterministic execution to address these concerns.
|Effective Data Versioning for Collaborative Data Analytics
|In my PhD thesis, we develop solutions for versioned data management for collaborative data analytics.
|Organizing Data Lakes for Navigation
|Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, Renée J. Miller
|We present a new probabilistic model of how users interact with an organization and propose an approximate algorithm for the data lake organization problem.
|Finding Related Tables in Data Lakes for Interactive Data Science
|Yi Zhang, Zachary G. Ives
|We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables.
|Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference
|Mohammad Raza, Sumit Gulwani
|In this work we present a novel program synthesis approach which combines the benefits of deductive and enumerative synthesis strategies, yielding a semi-supervised technique with which concise programs expressible in standard languages can be synthesized from very few examples.
|SPARQL Rewriting: Towards Desired Results
|Xun Jian, Yue Wang, Xiayu Lei, Libin Zheng, Lei Chen
|Despite their hardness, we propose a (1-1/e)-approximation method for query-restricting and two heuristics for query-relaxing.
|Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study
|Farahnaz Akrami, Mohammed Samiul Saeef, Qingheng Zhang, Wei Hu, Chengkai Li
|This paper is the first systematic study with the main objective of assessing the true effectiveness of embedding models when the unrealistic triples are removed.
|Bitvector-aware Query Optimization for Decision Support Queries
|Bailu Ding, Surajit Chaudhuri, Vivek Narasayya
|In this work, we study how bitvector filters impact query optimization.
|Efficient Join Synopsis Maintenance for Data Warehouse
|Zhuoyue Zhao, Feifei Li, Yuxi Liu
|Towards that end, we propose a novel algorithm SJoin that can maintain a join synopsis over a pre-specified general θ-join query in a dynamic database with continuous inflows of updates.
|Adaptive HTAP through Elastic Resource Scheduling
|Aunn Raza, Periklis Chrysogelos, Angelos Christos Anadiotis, Anastasia Ailamaki
|We propose an in-memory system design which is non-intrusive to the current state-of-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach.
|SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries
|Yoon-Min Nam, Donghyoung Han, Min-Soo Kim
|In this paper, we propose an effective query planning method for complex OLAP queries.
|Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores
|Siqiang Luo, Subarna Chatterjee, Rafael Ketsetsidis, Niv Dayan, Wilson Qin, Stratos Idreos
|We introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree based key-value stores.
|RID: Deduplicating Snapshot Computations
|Nikos Tsikoudis, Liuba Shrira
|This paper describes the design, implementation, and performance of RID, the first language-independent optimization framework that eliminates duplicate computations in SQL programs running over low-level snapshots by exploiting snapshot metadata efficiently.
|Architecting a Query Compiler for Spatial Workloads
|Ruby Y. Tahboub, Tiark Rompf
|In this paper, we discuss the underlying reasons why standard query compilation techniques are not fully effective when applied to spatial workloads, and we demonstrate how a particular style of query compilation based on techniques borrowed from partial evaluation and generative programming manages to avoid most of these difficulties by extending the scope of custom code generation into the data structures layer.
|LISA: A Learned Index Structure for Spatial Data
|Pengfei Li, Hua Lu, Qian Zheng, Long Yang, Gang Pan
|We propose a novel Learned Index structure for Spatial dAta (LISA for short).
|Effective Travel Time Estimation: When Historical Trajectories over Road Networks Matter
|Haitao Yuan, Guoliang Li, Zhifeng Bao, Ling Feng
|In this paper, we study the problem of origin-destination (OD) travel time estimation where the OD input consists of an OD pair and a departure time.
|The Solution Distribution of Influence Maximization: A High-level Experimental Study on Three Algorithmic Approaches
|In this paper, we report a high-level experimental study on three well-established algorithmic approaches for influence maximization, referred to as Oneshot, Snapshot, and Reverse Influence Sampling (RIS).
|Influence Maximization Revisited: Efficient Reverse Reachable Set Generation with Bound Tightened
|Qintian Guo, Sibo Wang, Zhewei Wei, Ming Chen
|In this paper, we present a study on this key phase and propose an efficient random RR set generation algorithm under IC model.
|Truss-based Community Search over Large Directed Graphs
|Qing Liu, Minjun Zhao, Xin Huang, Jianliang Xu, Yunjun Gao
|In view of its hardness, we propose two efficient 2-approximation algorithms, named Global and Local, that run in polynomial time yet with quality guarantee.
|Densely Connected User Community and Location Cluster Search in Location-Based Social Networks
|Junghoon Kim, Tao Guo, Kaiyu Feng, Gao Cong, Arijit Khan, Farhana M. Choudhury
|In this paper we propose the GeoSocial Community Search problem (GCS) which aims to find a social community and a cluster of spatial locations that are densely connected in a location-based social network simultaneously.
|Global Reinforcement of Social Networks: The Anchored Coreness Problem
|Qingyuan Linghu, Fan Zhang, Xuemin Lin, Wenjie Zhang, Ying Zhang
|Since the coreness of a user has been validated as the "best practice" for capturing user engagement, we propose and study the anchored coreness problem in this paper: anchoring a small number of vertices to maximize the coreness gain (the total increment of coreness) of all the vertices in the network.
|Confidentiality Support over Financial Grade Consortium Blockchain
|Ying Yan, Changzheng Wei, Xuepeng Guo, Xuming Lu, Xiaofu Zheng, Qi Liu, Chenhui Zhou, Xuyang Song, Boran Zhao, Hui Zhang, Guofei Jiang
|In this paper, we present a system design called CONFIDE to support on-chain confidentiality by leveraging a Trusted Execution Environment (TEE).
|PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension
|Wen Yang, Tao Li, Gai Fang, Hong Wei
|To address these issues, we designed a novel scheme for extending the index-type of PostgreSQL (PG), which enables a similar vector search and achieves a high-performance level and strong reliability of PG.
|Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs
|Fabio Maschi, Muhsen Owaida, Gustavo Alonso, Matteo Casalino, Anthony Hock-Koon
|In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights.
|Spur: Mitigating Slow Instances in Large-Scale Streaming Pipelines
|Ke Wang, Avrilia Floratou, Ashvin Agrawal, Daniel Musgrave
|In this paper, we highlight some of the unique challenges imposed by this large scale of operation: other concurrent workloads sharing the cluster may cause random performance deterioration; unavailability of external dependencies may cause temporary stalls in the pipeline; scarcity in the underlying resource manager may cause arbitrarily long delays or rejection of container allocation requests.
|Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications
|Yan Yan, Stephen Meyles, Aria Haghighi, Dan Suciu
|In this work, we describe Amperity’s entity matching framework, Fusion, and how its design provides solutions to these challenges.
|QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster
|Aristotelis Leventidis, Jiahui Zhang, Cody Dunne, Wolfgang Gatterbauer, H.V. Jagadish, Mirek Riedewald
|We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture the meaning of deeply nested queries.
|Duoquest: A Dual-Specification System for Expressive SQL Queries
|Christopher Baik, Zhongjun Jin, Michael Cafarella, H. V. Jagadish
|Consequently, we propose dual-specification query synthesis, which consumes both a NLQ and an optional PBE-like table sketch query that enables users to express varied levels of domain knowledge.
|SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns
|Prashanth Dintyala, Arpit Narechania, Joy Arulraj
|In this paper, we present SQLCheck, a holistic toolchain for automatically finding and fixing anti-patterns in database applications.
|DBPal: A Fully Pluggable NL2SQL Training Pipeline
|Nathaniel Weir, Prasetya Utama, Alex Galakatos, Andrew Crotty, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Ugur Cetintemel, Carsten Binnig
|Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation.
|SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data
|Vraj Shah, Side Li, Arun Kumar, Lawrence Saul
|In this work, we propose to bridge this gap by designing a speech-driven querying system and interface for structured data we call SpeakQL. We present the first dataset of spoken SQL queries and a generic approach to generate them for any arbitrary schema.
|Near-Optimal Distributed Band-Joins through Recursive Partitioning
|Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald
|Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost.
|ChronoCache: Predictive and Adaptive Mid-Tier Query Result Caching
|Brad Glasbergen, Kyle Langendoen, Michael Abebe, Khuzaima Daudjee
|In this paper we present ChronoCache, a mid-tier caching system that exploits the presence of geo-distributed edge nodes to cache database query results closer to users.
|Cheetah: Accelerating Database Queries with Switch Pruning
|Muhammad Tirmazi, Ran Ben Basat, Jiaqi Gao, Minlan Yu
|In this paper, we leverage programmable switches in the network to partially offload query computation to the switch.
|External Merge Sort for Top-K Queries: Eager input filtering guided by histograms
|Yannis Chronis, Thanh Do, Goetz Graefe, Keith Peters
|To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory.
|Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing
|Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, Ge Yu
|In this paper, we lay an analytical foundation for conditions to check if a recursive aggregate program that is monotonic or even non-monotonic can be executed incrementally and asynchronously with its correct result.
|Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems
|Ahmed S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref
|Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage.
|Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines
|Bonaventura Del Monte, Steffen Zeuch, Tilmann Rabl, Volker Markl
|In this paper, we propose Rhino, a library for efficient reconfigurations of running queries in the presence of very large distributed state.
|Grizzly: Efficient Stream Processing Through Adaptive Query Compilation
|Philipp M. Grulich, Sebastian Breß, Steffen Zeuch, Jonas Traub, Janis von Bleichert, Zongxiong Chen, Tilmann Rabl, Volker Markl
|In this paper, we present Grizzly, a novel adaptive query compilation-based SPE, to enable highly efficient query execution.
|LightSaber: Efficient Window Aggregation on Multi-core Processors
|Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk
|Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs.
|Parallel Index-based Stream Join on a Multicore CPU
|Amirhesam Shahvarani, Hans-Arno Jacobsen
|In this paper, we introduce an index data structure, called the partitioned in-memory merge tree, to address the challenges that arise when indexing highly dynamic data, which are common in streaming settings.
|Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination
|Conglong Li, Minjia Zhang, David G. Andersen, Yuxiong He
|To achieve a better tradeoff between latency and accuracy, we propose a novel approach that adaptively determines search termination conditions for individual queries.
|Theoretically-Efficient and Practical Parallel DBSCAN
|Yiqiu Wang, Yan Gu, Julian Shun
|This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth).
|A Relational Matrix Algebra and its Implementation in a Column Store
|Oksana Dolmatova, Nikolaus Augsten, Michael H. Böhlen
|This paper proposes a principled solution at the logical level.
|Locality-Sensitive Hashing Scheme based on Longest Circular Co-Substring
|Yifan Lei, Qiang Huang, Mohan Kankanhalli, Anthony K. H. Tung
|In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee.
|Continuously Adaptive Similarity Search
|Huayi Zhang, Lei Cao, Yizhou Yan, Samuel Madden, Elke A. Rundensteiner
|In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process.
|Automating Exploratory Data Analysis via Machine Learning: An Overview
|Tova Milo, Amit Somech
|In this tutorial, we review recent lines of work for automating EDA.
|Crowdsourcing Practice for Efficient Data Labeling: Aggregation, Incremental Relabeling, and Pricing
|Alexey Drutsa, Valentina Fedorova, Dmitry Ustalov, Olga Megorskaya, Evfrosiniya Zerminova, Daria Baidakova
|In this tutorial, we present a portion of unique industry experience in efficient data labeling via crowdsourcing shared by both leading researchers and engineers from Yandex.
|State of the Art and Open Challenges in Natural Language Interfaces to Data
|Fatma Özcan, Abdul Quamar, Jaydeep Sen, Chuan Lei, Vasilis Efthymiou
|In this tutorial, we will review these natural language interface solutions in terms of their interpretation approach, as well as the complexity of the queries they can generate.
|SIGMOD 2020 Tutorial on Fairness and Bias in Peer Review and Other Sociotechnical Intelligent Systems
|Nihar B. Shah, Zachary Lipton
|Our presentation will cover a wide range of disciplinary perspectives, with the first part focusing on the social impacts of technology and the formulations of fairness and bias defined via protected characteristics, and the second part taking a deep dive into peer review and distributed human evaluations to explore other forms of bias, such as that due to subjectivity, miscalibration, and dishonest behavior.
|Le Taureau: Deconstructing the Serverless Landscape & A Look Forward
|Anurag Khandelwal, Arun Kejariwal, Karthikeyan Ramasamy
|Inspired by Picasso’s Le Taureau, in this tutorial we shall deconstruct the evolution of serverless, with the overarching intent of facilitating a better understanding of the serverless landscape.
|Beyond Analytics: The Evolution of Stream Processing Systems
|Paris Carbone, Marios Fragkoulis, Vasiliki Kalavri, Asterios Katsifodimos
|The goal of this tutorial is threefold. First, we aim to review and highlight noteworthy past research findings, which were largely ignored until very recently. Second, we intend to underline the differences between early (’00-’10) and modern (’11-’18) streaming systems, and how those systems have evolved through the years. Most importantly, we wish to turn the attention of the database community to recent trends: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins.
|Optimal Join Algorithms Meet Top-k
|Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
|This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results.
|Key-Value Storage Engines
|Stratos Idreos, Mark Callaghan
|In this tutorial, we survey the state-of-the-art approaches on how the core storage engine of a key-value store system is designed.
|RASQL: A Powerful Language and its System for Big Data Applications
|Jin Wang, Guorui Xiao, Jiaqi Gu, Jiacheng Wu, Carlo Zaniolo
|To this end, we propose the Recursive-aggregate-SQL (RASQL) language and its system on top of Apache Spark to express and execute complex queries and declarative algorithms in many applications, such as graph search and machine learning.
|PL/SQL Without the PL
|Denis Hirn, Torsten Grust
|We demonstrate a source-to-source compilation technique that can translate user-defined PL/SQL functions into plain SQL queries.
|Analysis of Database Search Systems with THOR
|Theofilos Belmpas, Orest Gkini, Georgia Koutrika
|To help towards this direction, we present THOR that makes 4 important contributions: a query benchmark, a framework for comparing different systems, several search system implementations, and a highly interactive tool for comparing different search systems.
|BOOMER: A Tool for Blending Visual P-Homomorphic Queries on Large Networks
|Yinglong Song, Huey Eng Chua, Sourav S. Bhowmick, Byron Choi, Shuigeng Zhou
|In this demonstration, we present a novel system called BOOMER to realize this paradigm on more generic but complex bounded 1-1 p-homomorphic (BPH) queries on large networks.
|AURORA: Data-driven Construction of Visual Graph Query Interfaces for Graph Databases
|Sourav S. Bhowmick, Kai Huang, Huey Eng Chua, Zifeng Yuan, Byron Choi, Shuigeng Zhou
|In this demonstration, we present a novel data-driven visual subgraph query interface construction engine called AURORA.
|vChain: A Blockchain System Ensuring Query Integrity
|Haixin Wang, Cheng Xu, Ce Zhang, Jianliang Xu
|We demonstrate its verifiable query operations, usability, and performance with visualization for better insights.
|AUDITOR: A System Designed for Automatic Discovery of Complex Integrity Constraints in Relational Databases
|Wentao Hu, Dongxiang Zhang, Dawei Jiang, Sai Wu, Ke Chen, Kian-Lee Tan, Gang Chen
|In this demonstration, we present a new definition of integrity constraint that is more powerful for anomalous data discovery.
|SHARQL: Shape Analysis of Recursive SPARQL Queries
|Angela Bonifati, Wim Martens, Thomas Timm
|In SHARQL, we show how the analysis and exploration of several hundred million queries is possible.
|High Performance Distributed OLAP on Property Graphs with Grasper
|Hongzhi Chen, Bowen Wu, Shiyuan Deng, Chenghuan Huang, Changji Li, Yichao Li, James Cheng
|This demo presents Grasper, an RDMA-enabled distributed graph OLAP system, which adopts a series of new system designs to overcome the challenges of OLAP on graphs.
|ProcAnalyzer: Effective Code Analyzer for Tuning Imperative Programs in SAP HANA
|Kisung Park, Taeyoung Jeong, Chanho Jeong, Jaeha Lee, Dong-Hun Lee, Young-Koo Lee
|In this demonstration, we present ProcAnalyzer, an expressive and intuitive tool for troubleshooting issues related to performance, code quality, and security.
|LATTE: Visual Construction of Smart Contracts
|Sean Tan, Sourav S Bhowmick, Huey Eng Chua, Xiaokui Xiao
|In this demonstration, we present a novel visual smart contract construction system on Ethereum called LATTE to make smart contract development accessible to non-programmers.
|PROUD: PaRallel OUtlier Detection for Streams
|Theodoros Toliopoulos, Christos Bellas, Anastasios Gounaris, Apostolos Papadopoulos
|We introduce PROUD, standing for PaRallel OUtlier Detection for streams, which is an extensible engine for continuous multi-parameter parallel distance-based outlier (or anomaly) detection tailored to big data streams.
|MithraCoverage: A System for Investigating Population Bias for Intersectional Fairness
|Zhongjun Jin, Mengjing Xu, Chenkai Sun, Abolfazl Asudeh, H. V. Jagadish
|We demonstrate MithraCoverage, a system for investigating population bias over the intersection of multiple attributes.
|MC3: A System for Minimization of Classifier Construction Cost
|Shay Gershtein, Tova Milo, Gefen Morami, Slava Novgorodov
|In this demo, we introduce MC3, a real-time system that helps data analysts decide which classifiers to construct to minimize the costs of answering a set of search queries.
|Sentinel: Understanding Data Systems
|Brad Glasbergen, Michael Abebe, Khuzaima Daudjee, Daniel Vogel, Jian Zhao
|We demonstrate the Sentinel system, which enables administrators to analyze systems and applications by building models of system execution and comparing them to derive key differences in behaviour.
|BugDoc: A System for Debugging Computational Pipelines
|Raoni Lourenço, Juliana Freire, Dennis Shasha
|We recently proposed a new approach that makes use of provenance to automatically and iteratively infer root causes and derive succinct explanations of failures; this approach was implemented in our prototype, BugDoc.
|TQVS: Temporal Queries over Video Streams in Action
|Yueting Chen, Xiaohui Yu, Nick Koudas
|We present TQVS, a system capable of conducting efficient evaluation of declarative temporal queries over real-time video streams.
|ExTuNe: Explaining Tuple Non-conformance
|Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani
|We present ExTuNe, a system for Explaining causes of Tuple Non-conformance.
|Interactively Discovering and Ranking Desired Tuples without Writing SQL Queries
|Xuedi Qin, Chengliang Chai, Yuyu Luo, Nan Tang, Guoliang Li
|We propose to demonstrate such a system, namely DExPlorer.
|Synner: Generating Realistic Synthetic Data
|Miro Mannino, Azza Abouzied
|Synner provides instant feedback on every user interaction by visualizing a preview of the generated data.
|InCognitoMatch: Cognitive-aware Matching via Crowdsourcing
|Roee Shraga, Coral Scharf, Rakefet Ackerman, Avigdor Gal
|We present InCognitoMatch, the first cognitive-aware crowdsourcing application for matching tasks.
|CoClean: Collaborative Data Cleaning
|Mashaal Musleh, Mourad Ouzzani, Nan Tang, AnHai Doan
|We propose a crowd-in-the-loop cleaning system, called CoClean, built on top of Python Pandas dataframe, a widely used library for data scientists.
|STAR: A Distributed Stream Warehouse System for Spatial Data
|Zhida Chen, Gao Cong, Walid G. Aref
|In this demonstration, we present the STAR (Spatial Data Stream Warehouse) system.
|T-REx: Table Repair Explanations
|Daniel Deutch, Nave Frost, Amir Gilad, Oren Sheffer
|To assist users in understanding the output of such data repair algorithms, we propose T-REx, a system for providing data repair explanations through Shapley values.
|SVQ++: Querying for Object Interactions in Video Streams
|Daren Chao, Nick Koudas, Ioannis Xarchakos
|We demonstrate that this system can efficiently identify frames in a streaming video in which an object is interacting with another in a specific way, dramatically increasing the frame processing rate and speeding up query processing by at least two orders of magnitude depending on the query.
|F-IVM: Learning over Fast-Evolving Relational Data
|Milos Nikolic, Haozhe Zhang, Ahmet Kara, Dan Olteanu
|We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
|CoMing: A Real-time Co-Movement Mining System for Streaming Trajectories
|Ziquan Fang, Yunjun Gao, Lu Pan, Lu Chen, Xiaoye Miao, Christian S. Jensen
|To this end, we develop CoMing, a real-time co-movement pattern mining system, to handle streaming trajectories.
|Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift
|Nemanja Borić, Hinnerk Gildhoff, Menelaos Karavelas, Ippokratis Pandis, Ioanna Tsalouchidou
|In this demonstration we present the spatial functionality of Amazon Redshift and its integration with other Amazon services, such as Amazon Aurora PostgreSQL and Amazon S3.
|Big Data Series Analytics Using TARDIS and its Exploitation in Geospatial Applications
|Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner
|In this demonstration, we present GENET, a new interactive exploration tool that lets users perform Big Data Series Approximate Retrieval and Recursive Interactive Clustering on large-scale geospatial datasets using TARDIS index techniques.
|CDFShop: Exploring and Optimizing Learned Index Structures
|Ryan Marcus, Emily Zhang, Tim Kraska
|This demonstration allows audience members to (1) gain an intuition about various tuning parameters of RMIs and why learned index structures can greatly accelerate search, and (2) understand how automatic optimization techniques can be used to explore space/time tradeoffs within the space of RMIs.
|TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines
|Emily Caveness, Paul Suganthan G. C., Zhuo Peng, Neoklis Polyzotis, Sudip Roy, Martin Zinkevich
|In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced.
|Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing
|Zuozhi Wang, Kai Zeng, Botong Huang, Wei Chen, Xiaozong Cui, Bo Wang, Ji Liu, Liya Fan, Dachuan Qu, Zhenyu Hou, Tao Guan, Chen Li, Jingren Zhou
|In this paper, we present Grosbeak, a novel data warehouse that supports resource-aware incremental computing to process recurring routine jobs, smooths the resource skew, and optimizes the resource usage.
|Demonstration of BitGourmet: Data Analysis via Deterministic Approximation
|Saehan Jo, Immanuel Trummer
|We demonstrate BitGourmet, a novel data analysis system that supports deterministic approximate query processing (DAQ).
|Bring Your Own Data to X-PLAIN
|Eliana Pastor, Elena Baralis
|X-PLAIN is an interactive tool that allows human-in-the-loop inspection of the reasons behind model predictions.
|Physical Visualization Design
|Lana Ramjit, Zhaoning Kong, Ravi Netravali, Eugene Wu
|We demonstrate PVD, a system that visualization designers can use to co-design the interface and system architecture of scalable and expressive visualization.
|Demonstration of Chestnut: An In-memory Data Layout Designer for Database Applications
|Mingwei Samuel, Cong Yan, Alvin Cheung
|Demonstration of Chestnut: An In-memory Data Layout Designer for Database Applications
|Breaking Down Memory Walls in LSM-based Storage Systems
|Breaking Down Memory Walls in LSM-based Storage Systems
|Context-Free Path Querying via Matrix Equations
|We show how to reduce CFPQ evaluation to solving systems of matrix equations over R, a problem for which there exist high-performance solutions.
|Simulation-based Approximate Graph Pattern Matching
|In this paper, we propose a simulation-based approximate pattern matching algorithm that is not only efficient to compute, but also able to capture reasonable matches that are missed by existing algorithms.
|High-Dimensional Vector Similarity Search: From Time Series to Deep Network Embeddings
|High-Dimensional Vector Similarity Search: From Time Series to Deep Network Embeddings
|Rethinking Message Brokers on RDMA and NVM
|Rethinking Message Brokers on RDMA and NVM
|Monte Carlo Tree Search for Generating Interactive Data Analysis Interfaces
|We propose to adopt Monte Carlo Tree Search (MCTS) to search for the optimal interface, accounting for hierarchical layout as well as usability in terms of how easy it is to express the query log.
|Continuous Prefetch for Interactive Data Applications
|Continuous Prefetch for Interactive Data Applications
|Re-evaluating the Performance Trade-offs for Hash-Based Multi-Join Queries
|Re-evaluating the Performance Trade-offs for Hash-Based Multi-Join Queries
|Interactive View Recommendation
|This paper presents an attempt towards interactive view recommendation that automatically discovers the utility function composition during an exploration that best matches the user’s intentions and exploration task.
|From Worst-Case to Average-Case Analysis: Accurate Latency Predictions for Key-Value Storage Engines
|Meena Jagadeesan, Garrett Tanzer
|In this work, we start to develop an average-case analysis of the performance of storage engines that can achieve significantly more accurate predictions than existing worst-case models.
|Towards the Scheduling of Vertex-constrained Multi Subgraph Matching Query
|Kongzhang Hao, Longbin Lai
|In this paper, we study the problem of vertex-constrained multi subgraph matching query (vMSQ), for which we propose a novel scheduling algorithm that processes multiple queries in parallel while taking into consideration load balance and maximum possible sharing of computation.
|Serverless Query Processing on a Budget
|We propose a model that will allow service providers to dynamically provision clusters to achieve their users’ desired time-cost tradeoffs.
|Workload-Aware Column Imprints
|We propose efficient algorithms to construct our data structures.
|Towards Scalable UDTFs in Noria
|In this work we design single-tuple UDF and User Defined Aggregates (UDA) interfaces for Noria, a state-of-the-art dataflow system with incremental materialized views.
|Column Partition and Permutation for Run Length Encoding in Columnar Databases
|In this paper, we consider compressing columns using Run Length Encoding (RLE).
|Supporting Database Constraints in Synthetic Data Generation based on Generative Adversarial Networks
|In our research, we focus on data synthesis for relational databases where the database constraints of the original data must be imposed on the generated data.
|An Evaluation of Methods of Compressing Doubles
|In this paper, we perform such a comparison of methods and evaluate their performance in terms of compression ratio and throughput achieved across two dataset repositories of time series and featurized machine-learning problems, as well as on a dataset of machine logs.
|MemFlow: Memory-Aware Distributed Deep Learning
|Towards this we introduce MemFlow, an optimization framework for distributed deep learning that performs joint optimization over memory usage and computation time when searching for a parallelization strategy.
|JSON Schema Matching: Empirical Observations
|JSON Schema Matching: Empirical Observations