Paper Digest: SIGMOD 2015 Highlights
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
TABLE 1: SIGMOD 2015 Papers (each entry lists title, authors, and highlight, in that order)
|From Data to Insights @ Bare Metal Speed
|Jignesh M. Patel
|Data analytics platforms today largely employ data processing kernels (e.g. implementation of selection and join operator algorithms) that were developed for a now bygone hardware era.
|Distributed Outlier Detection using Compressive Sensing
|Ying Yan, Jiaxing Zhang, Bojun Huang, Xuzhan Sun, Jiaqi Mu, Zheng Zhang, Thomas Moscibroda
|In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure.
|Locality-aware Partitioning in Parallel Database Systems
|Erfan Zamanian, Carsten Binnig, Abdallah Salama
|In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates.
|ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout
|Ziqiang Feng, Eric Lo, Ben Kao, Wenjian Xu
|In this paper we present ByteSlice, a new main memory storage layout that supports both highly efficient scans and lookups.
|Implicit Parallelism through Deep Language Embedding
|Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, Volker Markl
|In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer’s productivity.
|From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System
|Shumo Chu, Magdalena Balazinska, Dan Suciu
|In this paper, we describe a system that can efficiently compute complex join queries, including queries with cyclic joins, on a massively parallel architecture.
|sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms
|Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, Mohamed Hefeeda
|We apply these optimizations to the popular Principal Component Analysis (PCA) algorithm.
|Exploiting Matrix Dependency for Efficient Distributed Matrix Computation
|Lele Yu, Yingxia Shao, Bin Cui
|In this paper, we propose a novel matrix computation system, named DMac, which exploits the matrix dependencies in matrix programs for efficient matrix computation in the distributed environment.
|LEMP: Fast Retrieval of Large Entries in a Matrix Product
|Christina Teflioudi, Rainer Gemulla, Olga Mykytiuk
|To address this problem, we propose the LEMP algorithm, which efficiently retrieves only the large entries in the product matrix without actually computing it.
|Skew-Aware Join Optimization for Array Databases
|Jennie Duggan, Olga Papaemmanouil, Leilani Battle, Michael Stonebraker
|In this paper, we introduce a join optimization framework that is skew-aware for distributed joins.
|Resource Elasticity for Large-Scale Machine Learning
|Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss
|In this paper, we introduce a simple and robust approach to automatic resource elasticity for large-scale ML.
|SEMROD: Secure and Efficient MapReduce Over HybriD Clouds
|Kerim Yasin Oktay, Sharad Mehrotra, Vaibhav Khadilkar, Murat Kantarcioglu
|This paper describes SEMROD, a sensitive data aware MapReduce (MR) framework for hybrid clouds.
|Authenticated Online Data Integration Services
|Qian Chen, Haibo Hu, Jianliang Xu
|In this paper, we take the first step to propose authenticated data integration services to ensure data and query integrity even in the presence of an untrusted integration server.
|ENKI: Access Control for Encrypted Query Processing
|Isabelle Hang, Florian Kerschbaum, Ernesto Damiani
|ENKI: Access Control for Encrypted Query Processing
|Collaborative Access Control in WebdamLog
|Vera Zaychik Moffitt, Julia Stoyanovich, Serge Abiteboul, Gerome Miklau
|We propose a novel access control model that operates within a distributed data management framework based on datalog.
|Automatic Enforcement of Data Use Policies with DataLawyer
|Prasang Upadhyaya, Magdalena Balazinska, Dan Suciu
|We introduce novel algorithms to efficiently evaluate policies that can cut policy-checking overheads to only a few percent of the total query runtime.
|TencentRec: Real-time Stream Recommendation in Practice
|Yanxiang Huang, Bin Cui, Wenyu Zhang, Jie Jiang, Ying Xu
|In this paper, we tackle the “big”, “real-time” and “accurate” challenges in real-time recommendation, and propose a general real-time stream recommender system built on Storm named TencentRec from three aspects, i.e., “system”, “algorithm”, and “data”.
|Twitter Heron: Stream Processing at Scale
|Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja
|This paper presents the design and implementation of this new system, called Heron.
|Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database
|Lucas Braun, Thomas Etter, Georgios Gasparis, Martin Kaufmann, Donald Kossmann, Daniel Widmer, Aharon Avitzur, Anthony Iliopoulos, Eliezer Levy, Ning Liang
|This paper presents an industrial use case and a novel architecture that integrates key-value-based event processing and SQL-based analytical processing on the same distributed store while minimizing the total cost of ownership.
|Why Big Data Industrial Systems Need Rules and What We Can Do About It
|Paul Suganthan G.C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan
|In this paper we explore this issue.
|Overview of Data Exploration Techniques
|Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri
|In this tutorial, we survey recent developments in the emerging area of database systems tailored for data exploration.
|Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?
|Christopher Ré, Divy Agrawal, Magdalena Balazinska, Michael Cafarella, Michael Jordan, Tim Kraska, Raghu Ramakrishnan
|Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?
|Cost-based Fault-tolerance for Parallel Data Processing
|Abdallah Salama, Carsten Binnig, Tim Kraska, Erfan Zamanian
|In this paper, we present a novel cost-based fault-tolerance scheme which tackles this issue.
|Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases
|Aaron J. Elmore, Vaibhav Arora, Rebecca Taft, Andrew Pavlo, Divyakant Agrawal, Amr El Abbadi
|To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs.
|Madeus: Database Live Migration Middleware under Heavy Workloads for Cloud Environment
|Takeshi Mishima, Yasuhiro Fujiwara
|To efficiently address the hot spot problem, we propose a middleware approach called Madeus that conducts database live migration.
|Lineage-driven Fault Injection
|Peter Alvaro, Joshua Rosen, Joseph M. Hellerstein
|We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection.
|Diversity-Aware Top-k Publish/Subscribe for Text Stream
|Lisi Chen, Gao Cong
|We propose a novel solution to efficiently processing a large number of DAS queries over a stream of documents.
|Diverse and Proportional Size-l Object Summaries for Keyword Search
|Georgios Fakas, Zhi Cai, Nikos Mamoulis
|In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e. diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs.
|Local Filtering: Improving the Performance of Approximate Queries on String Collections
|Xiaochun Yang, Yaoshu Wang, Bin Wang, Wei Wang
|In this paper, we explore the opposite paradigm focusing on finding out the differences of database strings to the query string.
|Exact Top-k Nearest Keyword Search in Large Networks
|Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong
|In this paper, we propose algorithms for top-k nearest keyword search that provide exact solutions and which handle networks of very large sizes.
|Efficient Algorithms for Answering the m-Closest Keywords Query
|Tao Guo, Xin Cao, Gao Cong
|In this paper, we prove that the problem of answering mCK queries is NP-hard.
|Minimum Spanning Trees in Temporal Graphs
|Silu Huang, Ada Wai-Chee Fu, Ruifeng Liu
|We propose efficient linear time algorithms for computing MSTa.
|Efficient Enumeration of Maximal k-Plexes
|Devora Berlowitz, Sara Cohen, Benny Kimelfeld
|This paper presents the first provably efficient algorithms, both for enumerating the maximal k-plexes and for enumerating the maximal connected k-plexes.
|Divide & Conquer: I/O Efficient Depth-First Search
|Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Zechao Shang
|In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk.
|Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity
|Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, Wenjie Zhang
|In this paper, we study the problem of efficiently computing the Steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC.
|COMMIT: A Scalable Approach to Mining Communication Motifs from Dynamic Networks
|Saket Gurukar, Sayan Ranu, Balaraman Ravindran
|In this paper, we study this problem by mining Communication motifs from dynamic interaction networks.
|LASH: Large-Scale Sequence Mining with Hierarchies
|Kaustubh Beedkar, Rainer Gemulla
|We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies.
|Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time
|Michael Cochez, Hao Mou
|In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage.
|DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation
|Junhao Gan, Yufei Tao
|In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs, ones widely believed to be impossible, could be made in theoretical computer science.
|The TagAdvisor: Luring the Lurkers to Review Web Items
|Azade Nazi, Mahashweta Das, Gautam Das
|In this paper, we address the problem of how to engage the lurkers (i.e., people who read reviews but never take time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task.
|Supporting Data Uncertainty in Array Databases
|Liping Peng, Yanlei Diao
|In this paper, we address uncertain data management in array databases, which may involve both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes.
|Identifying the Extent of Completeness of Query Answers over Partially Complete Databases
|Simon Razniewski, Flip Korn, Werner Nutt, Divesh Srivastava
|In this paper, we propose a natural class of completeness patterns, expressed by selections on database tables, to specify complete parts of database tables.
|k-Hit Query: Top-k Query with Probabilistic Utility Function
|Peng Peng, Raymond Chi-Wing Wong
|In this paper, we present various interesting properties of k-hit queries.
|Linking Temporal Records for Profiling Entities
|Furong Li, Mong Li Lee, Wynne Hsu, Wang-Chiew Tan
|In this paper, we present a new solution for understanding how two facts may be temporally related and exploit the knowledge to profile how entities evolve over time.
|Telco Churn Prediction with Big Data
|Yiqing Huang, Fangzhou Zhu, Mingxuan Yuan, Ke Deng, Yanhua Li, Bing Ni, Wenyuan Dai, Qiang Yang, Jia Zeng
|We show that telco big data can make churn prediction much easier from the 3V’s perspectives: Volume, Variety, Velocity.
|The LDBC Social Network Benchmark: Interactive Workload
|Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, Peter Boncz
|This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies.
|Rethinking Data-Intensive Science Using Scalable Analytics Systems
|Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
|In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%.
|QMapper for Smart Grid: Migrating SQL-based Application to Hive
|Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen, Songlin Hu
|In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL.
|Three Favorite Results
|Being honored as the ACM Athena Lecturer has inspired me to reflect upon the research I’ve conducted over my career to date.
|The Power Behind the Throne: Information Integration in the Age of Data-Driven Discovery
|Laura M. Haas
|I will describe the environment we are creating, the advances in the field that enable it, and the challenges that remain.
|On the Design and Scalability of Distributed Shared-Data Databases
|Simon Loesing, Markus Pilman, Thomas Etter, Donald Kossmann
|In this paper, we analyze an alternative architecture design for distributed relational databases that overcomes the limitations of partitioned databases.
|Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems
|Thomas Neumann, Tobias Mühlbauer, Alfons Kemper
|We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees.
|FOEDUS: OLTP Engine for a Thousand Cores and NVRAM
|We analyze the characteristics of these machines and find that no existing database is appropriate.
|Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems
|Joy Arulraj, Andrew Pavlo, Subramanya R. Dulloor
|To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates.
|Private Release of Graph Statistics using Ladder Functions
|Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao
|In this paper, we introduce a new method which guarantees differential privacy.
|Bayesian Differential Privacy on Correlated Data
|Bin Yang, Issei Sato, Hiroshi Nakagawa
|In this paper, we focus on the private perturbation algorithms on correlated data.
|Modular Order-Preserving Encryption, Revisited
|Charalampos Mavroforakis, Nathan Chenette, Adam O’Neill, George Kollios, Ran Canetti
|In this paper, we systematically address this vulnerability and show that MOPE can be used to build a practical system for executing range queries on encrypted data while providing a significant security improvement over the basic OPE.
|Chiaroscuro: Transparency and Privacy for Massive Personal Time-Series Clustering
|Tristan Allard, Georges Hébrail, Florent Masseglia, Esther Pacitti
|In this paper, we propose Chiaroscuro, a complete solution for clustering personal data with strong privacy guarantees.
|Persistent Data Sketching
|Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen
|In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
|Scalable Distributed Stream Join Processing
|Qian Lin, Beng Chin Ooi, Zhengkui Wang, Cui Yu
|In this paper, we propose a novel stream join model, called join-biclique, which organizes a large cluster as a complete bipartite graph.
|SCREEN: Stream Data Cleaning under Speed Constraints
|Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu
|Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle, (3) support on out-of-order arrivals of data points, and (4) adaptive window size for balancing repair accuracy and efficiency.
|Location-Aware Pub/Sub System: When Continuous Moving Queries Meet Dynamic Event Streams
|Long Guo, Dongxiang Zhang, Guoliang Li, Kian-Lee Tan, Zhifeng Bao
|In this paper, we propose a new location-aware pub/sub system, Elaps, that continuously monitors moving users subscribing to dynamic event streams from social media and E-commerce applications.
|CE-Storm: Confidential Elastic Processing of Data Streams
|Nick R. Katsipoulakis, Cory Thoma, Eric A. Gratta, Alexandros Labrinidis, Adam J. Lee, Panos K. Chrysanthis
|We will demonstrate both systems working in tandem and also visualize their behavior over time under different scenarios.
|A SQL Debugger Built from Spare Parts: Turning a SQL:1999 Database System into Its Own Debugger
|Benjamin Dietrich, Torsten Grust
|We demonstrate a new incarnation of Habitat, an observational debugger for SQL.
|Exploratory Keyword Search with Interactive Input
|Zhifeng Bao, Yong Zeng, H.V. Jagadish, Tok Wang Ling
|Therefore, we propose a framework called ClearMap that natively supports a visualized exploratory search paradigm on XML data.
|QE3D: Interactive Visualization and Exploration of Complex, Distributed Query Plans
|Daniel Scheibli, Christian Dinse, Alexander Boehm
|In this demonstration, we show how its interactive, three-dimensional plan representation helps to understand and quickly identify hotspots in complex, real-world scenarios.
|DataXFormer: An Interactive Data Transformation Tool
|John Morcos, Ziawasch Abedjan, Ihab Francis Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker
|In this demonstration, we present the user-interaction with DataXFormer and show scenarios on how it can be used to transform data and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
|Quality-Driven Continuous Query Execution over Out-of-Order Data Streams
|Yuanzhen Ji, Hongjin Zhou, Zbigniew Jerzak, Anisoara Nica, Gregor Hackenbroich, Christof Fetzer
|We demonstrate a prototype stream processing system, which extends SAP Event Stream Processor with the implementation of AQ-K-slack.
|MoDisSENSE: A Distributed Spatio-Temporal and Textual Processing Platform for Social Networking Services
|Ioannis Mytilinis, Ioannis Giannakopoulos, Ioannis Konstantinou, Katerina Doka, Dimitrios Tsitsigkos, Manolis Terrovitis, Lampros Giampouras, Nectarios Koziris
|In this work, we present MoDisSENSE, an open-source distributed platform that provides personalized search for points of interest and trending events based on the user’s social graph by combining spatio-textual user generated data.
|DocRicher: An Automatic Annotation System for Text Documents Using Social Media
|Qiang Hu, Qi Liu, Xiaoli Wang, Anthony K.H. Tung, Shubham Goyal, Jisong Yang
|We demonstrate a system, DocRicher, to enrich a text document with social media that implicitly references certain passages of it.
|A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications
|Li-Yan Yuan, Lengdong Wu, Jia-Huai You, Yan Chi
|We propose to demonstrate Rubato DB, a highly scalable NewSQL system, supporting various consistency levels from ACID to BASE for OLTP and big data applications.
|G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
|Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust, Ion Stoica
|In this demonstration, we present G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques.
|Mining and Forecasting of Big Time-series Data
|Yasushi Sakurai, Yasuko Matsubara, Christos Faloutsos
|The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in large-scale time-series sequences.
|Optimal Spatial Dominance: An Effective Search of Nearest Neighbor Candidates
|Xiaoyang Wang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Muhammad Aamir Cheema
|Efficient algorithms are proposed for the dominance check and corresponding NN candidates computation.
|THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads
|Farhan Tauheed, Thomas Heinis, Anastasia Ailamaki
|We propose THERMAL-JOIN, a novel spatial self-join algorithm for dynamic memory-resident workloads.
|Indexing Metric Uncertain Data for Range Queries
|Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen, Baihua Zheng
|In this paper, we represent metric uncertain data by using an object-level model and a bi-level model, respectively.
|Efficient Route Planning on Public Transportation Networks: A Labelling Approach
|Sibo Wang, Wenqing Lin, Yi Yang, Xiaokui Xiao, Shuigeng Zhou
|This paper presents Timetable Labelling (TTL), an efficient indexing technique for route planning on timetable graphs.
|The Importance of Being Expert: Efficient Max-Finding in Crowdsourcing
|Aris Anagnostopoulos, Luca Becchetti, Adriano Fazzone, Ida Mele, Matteo Riondato
|As we highlight in this work, the exclusive use of nonexpert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demand some degree of specialization to avoid excessive uncertainty and inconsistency in the answers.
|Minimizing Efforts in Validating Crowd Answers
|Nguyen Quoc Viet Hung, Duong Chi Thang, Matthias Weidlich, Karl Aberer
|Therefore, we develop a probabilistic model that helps to identify the most beneficial validation questions in terms of both improvement of result correctness and detection of faulty workers.
|iCrowd: An Adaptive Crowdsourcing Framework
|Ju Fan, Guoliang Li, Beng Chin Ooi, Kian-lee Tan, Jianhua Feng
|To this end, we propose an adaptive crowdsourcing framework, called iCrowd.
|QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications
|Yudian Zheng, Jiannan Wang, Guoliang Li, Reynold Cheng, Jianhua Feng
|In this paper, we investigate the online task assignment problem: Given a pool of n questions, which of the k questions should be assigned to a worker?
|tDP: An Optimal-Latency Budget Allocation Strategy for Crowdsourced MAXIMUM Operations
|Vasilis Verroios, Peter Lofgren, Hector Garcia-Molina
|We focus on one of the most extensively studied crowdsourcing operations, the MAX operation (finding the best element in a collection under human criteria), and we study the problem of budget allocation into rounds for this operation.
|Thrifty: Offering Parallel Database as a Service using the Shared-Process Approach
|Petrie Wong, Zhian He, Ziqiang Feng, Wenjian Xu, Eric Lo
|In this demonstration, we present Thrifty, a Parallel-Database-as-a-Service operated using the "shared-process" approach.
|BenchPress: Dynamic Workload Control in the OLTP-Bench Testbed
|Dana Van Aken, Djellel E. Difallah, Andrew Pavlo, Carlo Curino, Philippe Cudré-Mauroux
|We recently introduced OLTP-Bench, an extensible testbed for benchmarking relational databases that is bundled with 15 workloads.
|Demonstrating "Data Near Here": Scientific Data Search
|V.M. Megler, David Maier
|We include an analysis showing that our summary-based approach gives a reasonable approximation of such a "complete dataset" similarity measure.
|Slider: An Efficient Incremental Reasoner
|Jules Chevalier, Julien Subercaze, Christophe Gravier, Frédérique Laforest
|We contribute to solving these problems by introducing Slider, an efficient incremental reasoner.
|WANalytics: Geo-Distributed Analytics for a Data Intensive World
|Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Konstantinos Karanasos, Jitendra Padhye, George Varghese
|We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements.
|FTT: A System for Finding and Tracking Tourists in Public Transport Services
|Huayu Wu, Jo-Anne Tan, Wee Siong Ng, Mingqiang Xue, Wei Chen
|In this paper, we present our system FTT (Finding and Tracking Tourists) to identify tourists from public transport commuters in a city, and to further track their movements from one place to another.
|SharkDB: An In-Memory Storage System for Massive Trajectory Data
|Haozhou Wang, Kai Zheng, Xiaofang Zhou, Shazia Sadiq
|In this storage design, we try to explore the potential opportunities, which can boost the performance of query processing for trajectory data.
|Ringo: Interactive Graph Analytics on Big-Memory Machines
|Yonathan Perez, Rok Sosič, Arijit Banerjee, Rohan Puttagunta, Martin Raison, Pararth Shah, Jure Leskovec
|We present Ringo, a system for analysis of large graphs.
|STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data
|Robert Christensen, Lu Wang, Feifei Li, Ke Yi, Jun Tang, Natalee Villa
|We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data.
|PAXQuery: Parallel Analytical XML Processing
|Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu, Juan A.M. Naranjo
|We demonstrate PAXQuery, a novel system that parallelizes the execution of XQuery queries over large collections of XML documents.
|Cache-Efficient Aggregation: Hashing Is Sorting
|Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, Franz Färber
|In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same.
|Efficient Similarity Join and Search on Multi-Attribute Data
|Guoliang Li, Jian He, Dong Deng, Jian Li
|In this paper we study similarity join and search on multi-attribute data.
|Holistic Indexing in Main-memory Column-stores
|Eleni Petraki, Stratos Idreos, Stefan Manegold
|This paper introduces holistic indexing, a new approach to automated index tuning in dynamic environments.
|CliffGuard: A Principled Framework for Finding Robust Database Designs
|Barzan Mozafari, Eugene Zhen Ye Goh, Dong Young Yoon
|Thus, we propose a new type of database designer that is robust against parameter uncertainties, so that overall performance degrades more gracefully when future workloads deviate from the past.
|Exploiting Correlations for Expensive Predicate Evaluation
|Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, Christopher Re
|In this paper, we study ways to efficiently evaluate selection queries with UDF predicates.
|Query-Oriented Data Cleaning with Oracles
|Moria Bergman, Tova Milo, Slava Novgorodov, Wang-Chiew Tan
|To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles.
|BigDansing: A System for Big Data Cleansing
|Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin
|In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing.
|Data X-Ray: A Diagnostic Tool for Data Errors
|Xiaolan Wang, Xin Luna Dong, Alexandra Meliou
|Our contributions are three-fold.
|KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing
|Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye
|We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data.
|Crowd-Based Deduplication: An Adaptive Approach
|Sibo Wang, Xiaokui Xiao, Chun-Hee Lee
|This paper presents ACD, a new crowd-based algorithm for data deduplication.
|Minimizing Commit Latency of Transactions in Geo-Replicated Data Stores
|Faisal Nawab, Vaibhav Arora, Divyakant Agrawal, Amr El Abbadi
|In this work, we derive a lower-bound on commit latency.
|Optimizing Optimistic Concurrency Control for Tree-Structured, Log-Structured Databases
|Philip A. Bernstein, Sudipto Das, Bailu Ding, Markus Pilman
|To address them, we describe a high-performance transaction mechanism that uses optimistic concurrency control on a multi-versioned tree-structured database stored in a shared log.
|The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis
|Sudip Roy, Lucja Kot, Gabriel Bender, Bailu Ding, Hossein Hojjat, Christoph Koch, Nate Foster, Johannes Gehrke
|This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes.
|Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity
|Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica
|In this work, we empirically investigate modern ORM-backed applications’ use and disuse of database concurrency control mechanisms.
|REEF: Retainable Evaluator Execution Framework
|Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamurthy, Raghu Ramakrishnan, Sriram Rao, Russel Sears, Beysim Sezgin, Julia Wang
|This paper presents REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager.
|Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications
|Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, Carlo Curino
|In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes.
|Design and Implementation of the LogicBlox System
|Molham Aref, Balder ten Cate, Todd J. Green, Benny Kimelfeld, Dan Olteanu, Emir Pasalic, Todd L. Veldhuizen, Geoffrey Washburn
|In this paper, we discuss the design considerations behind the LogicBlox system and give an overview of its implementation, highlighting innovative aspects.
|Spark SQL: Relational Data Processing in Spark
|Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia
|Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis.
|Graft: A Debugging Tool For Apache Giraph
|Semih Salihoglu, Jaeho Shin, Vikesh Khanna, Ba Quan Truong, Jennifer Widom
|We address the problem of debugging programs written for Pregel-like systems.
|Even Metadata is Getting Big: Annotation Summarization using InsightNotes
|Dongqing Xiao, Armir Bashllari, Tyler Menard, Mohamed Eltabakh
|In this paper, we demonstrate the InsightNotes system, a summary-based annotation management engine over relational databases.
|StoryPivot: Comparing and Contrasting Story Evolution
|Anja Gruenheid, Donald Kossmann, Theodoros Rekatsinas, Divesh Srivastava
|In this demonstration we present StoryPivot, a framework that helps its users to detect evolving stories in event datasets over time.
|The Flatter, the Better: Query Compilation Based on the Flattening Transformation
|Alexander Ulrich, Torsten Grust
|We demonstrate the insides and outs of a query compiler based on the flattening transformation, a translation technique designed by the programming language community to derive efficient data-parallel implementations from iterative programs.
|D2WORM: A Management Infrastructure for Distributed Data-centric Workflows
|Martin Jergler, Mohammad Sadoghi, Hans-Arno Jacobsen
|In this demonstration, we present D2Worm, a Distributed Data-centric Workflow Management system.
|NL2CM: A Natural Language Interface to Crowd Mining
|Yael Amsterdamer, Anna Kukliansky, Tova Milo
|To account for these challenges, we develop new, dedicated modules and embed them within the modular and easily extensible architecture of NL2CM.
|Optimistic Recovery for Iterative Dataflows in Action
|Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios Katsifodimos, Stephan Ewen, Kostas Tzoumas, Volker Markl
|In this paper, we demonstrate our recovery mechanism with the Apache Flink data processing engine.
|A Secure Search Engine for the Personal Cloud
|Saliha Lallali, Nicolas Anciaux, Iulian Sandu Popa, Philippe Pucheral
|We have implemented our engine on a real tamper resistant hardware device and present its capacity to regulate the access to a personal dataspace.
|IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows
|Katerina Doka, Nikolaos Papailiou, Dimitrios Tsoumakos, Christos Mantas, Nectarios Koziris
|To this end, we demonstrate IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments.
|Just can’t get enough: Synthesizing Big Data
|Tilmann Rabl, Manuel Danisch, Michael Frank, Sebastian Schindler, Hans-Arno Jacobsen
|As a solution, we present an automatic approach to data synthetization from existing data sources.
|Rack-Scale In-Memory Join Processing using RDMA
|Claude Barthels, Simon Loesing, Gustavo Alonso, Donald Kossmann
|In this paper we focus on implementing parallel in-memory joins using Remote Direct Memory Access (RDMA), a communication mechanism to transfer data directly into the memory of a remote machine.
|Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation
|Max Heimel, Martin Kiefer, Volker Markl
|We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.
|Rethinking SIMD Vectorization for In-Memory Databases
|Orestis Polychroniou, Arun Raghavan, Kenneth A. Ross
|In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters.
|A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew
|Yinan Li, Craig Chasseur, Jignesh M. Patel
|This paper proposes a padded encoding scheme to address this opportunity.
|GetReal: Towards Realistic Selection of Influence Maximization Strategies in Competitive Networks
|Hui Li, Sourav S. Bhowmick, Jiangtao Cui, Yunjun Gao, Jianfeng Ma
|In this paper, we propose a novel framework based on game theory to provide a more realistic solution to the IM problem in competitive networks by jettisoning these unrealistic assumptions.
|Influence Maximization in Near-Linear Time: A Martingale Approach
|Youze Tang, Yanchen Shi, Xiaokui Xiao
|This paper presents an influence maximization algorithm that provides the same worst-case guarantees as the state of the art, but offers significantly improved empirical efficiency.
|Community Level Diffusion Extraction
|Zhiting Hu, Junjie Yao, Bin Cui, Eric Xing
|This paper introduces a new approach, i.e., COmmunity Level Diffusion (COLD), to uncover and explore temporal diffusion.
|BEAR: Block Elimination Approach for Random Walk with Restart on Large Graphs
|Kijung Shin, Jinhong Jung, Sael Lee, U. Kang
|In this paper, we propose BEAR, a fast, scalable, and accurate method for computing RWR on large graphs.
|The Minimum Wiener Connector Problem
|Natali Ruchansky, Francesco Bonchi, David García-Soriano, Francesco Gullo, Nicolas Kourtellis
|In this paper we study the novel problem of finding a minimum Wiener connector: given a connected graph G=(V,E) and a set Q ⊆ V of query vertices, find a subgraph of G that connects all query vertices and has minimum Wiener index.
|From Group Recommendations to Group Formation
|Senjuti Basu Roy, Laks V.S. Lakshmanan, Rui Liu
|We consider the complementary problem of how to form groups such that the users in the formed groups are most satisfied with the suggested top-k recommendations.
|Real-Time Multi-Criteria Social Graph Partitioning: A Game Theoretic Approach
|Nikos Armenatzoglou, Huy Pham, Vasilis Ntranos, Dimitris Papadias, Cyrus Shahabi
|In this paper, we introduce RMGP, a type of real-time multi-criteria graph partitioning for social networks that groups the users based on their connectivity and their similarity to a set of input classes.
|Utility-Aware Social Event-Participant Planning
|Jieying She, Yongxin Tong, Lei Chen
|Existing approaches usually assume that each user only attends one event or ignore location information.
|Online Video Recommendation in Sharing Community
|Xiangmin Zhou, Lei Chen, Yanchun Zhang, Longbing Cao, Guangyan Huang, Chen Wang
|In this paper, we propose an approach based on the content and social information of videos for the recommendation in sharing communities.
|Large-scale Predictive Analytics in Vertica: Fast Data Transfer, Distributed Model Creation, and In-database Prediction
|Shreya Prasad, Arash Fard, Vishrut Gupta, Jorge Martinez, Jeff LeFevre, Vincent Xu, Meichun Hsu, Indrajit Roy
|This paper presents the design of a high performance data transfer mechanism, new data-structures in Distributed R to maintain data locality with database table segments, and extensions to Vertica for saving and deploying R models.
|Oracle Workload Intelligence
|Quoc Trung Tran, Konstantinos Morfonios, Neoklis Polyzotis
|In this work, we present Oracle Workload Intelligence (WI), a tool for workload modeling and mining, as our attempt to infer the processes that generate a given workload.
|Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components
|John Colgrove, John D. Davis, John Hayes, Ethan L. Miller, Cary Sandvig, Russell Sears, Ari Tamches, Neil Vachharajani, Feng Wang
|In this paper, we describe Purity, the foundation of Pure Storage’s Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability.
|On Improving User Response Times in Tableau
|Pawel Terlecki, Fei Xu, Marianne Shaw, Valeri Kim, Richard Wesley
|In this paper we discuss key data processing components in Tableau: the query processor, query caches, Tableau Data Engine [1, 2] and Data Server.
|Data Management in Non-Volatile Memory
|Stratis D. Viglas
|In what follows we present the current work in the area with a view towards identifying the open problems and exposing the research opportunities.
|TEGRA: Table Extraction by Global Record Alignment
|Xu Chu, Yeye He, Kaushik Chakrabarti, Kris Ganjam
|In this work, we address the important problem of automatically extracting multi-column relational tables from such lists.
|Mining Quality Phrases from Massive Text Corpora
|Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han
|In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation.
|Mining Subjective Properties on the Web
|Immanuel Trummer, Alon Halevy, Hongrae Lee, Sunita Sarawagi, Rahul Gupta
|We describe the Surveyor system that mines the dominant opinion held by authors of Web content about whether a subjective property applies to a given entity.
|Microblog Entity Linking with Social Temporal Context
|Wen Hua, Kai Zheng, Xiaofang Zhou
|In this paper, we propose an efficient solution to link entities in tweets by analyzing their social and temporal context.
|Graph-Aware, Workload-Adaptive SPARQL Query Caching
|Nikolaos Papailiou, Dimitrios Tsoumakos, Panagiotis Karras, Nectarios Koziris
|In this work we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results.
|Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins)
|Medha Atre
|In this paper, we present Left Bit Right (LBR), a technique for well-designed nested BGP and OPTIONAL pattern queries.
|How to Build Templates for RDF Question/Answering: An Uncertain Graph Similarity Join Approach
|Weiguo Zheng, Lei Zou, Xiang Lian, Jeffrey Xu Yu, Shaoxu Song, Dongyan Zhao
|We propose several structural and probability pruning techniques to speed up joining.
|RBench: Application-Specific RDF Benchmarking
|Shi Qiao, Z. Meral Özsoyoğlu
|To address the needs of diverse applications, we propose an application-specific framework, called RBench, to generate RDF benchmarks.
|ALEX: Automatic Link Exploration in Linked Data
|Ahmed El-Roby, Ashraf Aboulnaga
|In this paper, we present ALEX, a system that aims at improving the quality of links between RDF data sets by using feedback provided by users on the answers to linked data queries.
|k-Shape: Efficient and Accurate Clustering of Time Series
|John Paparrizos, Luis Gravano
|In this paper, we present k-Shape, a novel algorithm for time-series clustering.
|SMiLer: A Semi-Lazy Time Series Prediction System for Sensors
|Jingbo Zhou, Anthony K.H. Tung
|We propose a new method to apply the Gaussian Process (GP) to sensor time series prediction.
|SQLGraph: An Efficient Relational-Based Property Graph Store
|Wen Sun, Achille Fokoue, Kavitha Srinivas, Anastasios Kementsietsidis, Gang Hu, Guotong Xie
|We show that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores.
|Updating Graph Indices with a One-Pass Algorithm
|Dayu Yuan, Prasenjit Mitra, Huiwen Yu, C. Lee Giles
|In order to address this issue, we propose a time-efficient one-pass algorithm that is designed to update a graph index by scanning each frequent subgraph at most once.
|Amazon Redshift and the Case for Simpler Data Warehouses
|Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, Vidhya Srinivasan
|In this paper, we discuss an oft-overlooked differentiating characteristic of Amazon Redshift — simplicity.
|ShareInsights: An Unified Approach to Full-stack Data Processing
|Mukund Deshpande, Dhruva Ray, Sameer Dixit, Avadhoot Agasti
|In this paper we present a platform that aims to significantly reduce the time it takes to build data pipelines.
|An Incremental Anytime Algorithm for Multi-Objective Query Optimization
|Immanuel Trummer, Christoph Koch
|We present an incremental anytime algorithm for MOQO, analyze its complexity and show that it offers an attractive tradeoff between result update frequency, single invocation time complexity, and amortized time over multiple invocations.
|Output-sensitive Evaluation of Prioritized Skyline Queries
|Niccolò Meneghetti, Denis Mindolin, Paolo Ciaccia, Jan Chomicki
|In this paper we show that querying using non-compensatory preferences is computationally efficient.
|Learning Generalized Linear Models Over Normalized Data
|Arun Kumar, Jeffrey Naughton, Jignesh M. Patel
|In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting.
|Utilizing IDs to Accelerate Incremental View Maintenance
|Yannis Katsis, Kian Win Ong, Yannis Papakonstantinou, Kevin Keliang Zhao
|This work contributes an ID-based incremental view maintenance (IVM) system for a large subset of SQL that includes the algebraic operators selection, join, grouping and aggregation, generalized projection involving functions, antisemijoin (and therefore negation/difference), and union.
|S4: Top-k Spreadsheet-Style Search for Query Discovery
|Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, Surajit Chaudhuri
|To address this limitation, we study the problem of efficiently discovering top-k project join queries which approximately contain the given example tuples in their output.
|Proactive Annotation Management in Relational Databases
|Karim Ibrahim, Xiao Du, Mohamed Eltabakh
|In this paper, we propose the Nebula system, an advanced and proactive annotation management engine in relational databases.
|Weighted Coverage based Reviewer Assignment
|Ngai Meng Kou, Leong Hou U., Nikos Mamoulis, Zhiguo Gong
|In this paper, we propose a generalized framework for fair reviewer assignment.
|Distributed Online Tracking
|Mingwang Tang, Feifei Li, Yufei Tao
|This problem was recently formalized and studied, and a principled approach with optimal competitive ratio was proposed.
|Knowledge Curation and Knowledge Fusion: Challenges, Models and Applications
|Xin Luna Dong, Divesh Srivastava
|Our tutorial highlights the similarities and differences between knowledge management and data integration, and has two goals.
|Smooth Task Migration in Apache Storm
|Mansheng Yang, Richard T.B. Ma
|To handle the task migration process more gracefully, we propose three task migration methods: (i) worker level migration, (ii) executor level migration, and (iii) executor level migration with reliable messaging.
|JAFAR: Near-Data Processing for Databases
|Oreoluwatomiwa O. Babarinsa, Stratos Idreos
|In this paper, we present JAFAR, a near data processing accelerator for pushing selects down to memory.
|Job Scheduling with Minimizing Data Communication Costs
|Trevor Clinkenbeard, Anisoara Nica
|The research presented in this paper analyzes different algorithms for scheduling a set of potentially interdependent jobs in order to minimize the total runtime, or makespan, when data communication costs are considered.
|One Loop Does Not Fit All
|Styliani Pantela, Stratos Idreos
|In this paper, we study JIT compilation for modern in-memory column-stores in detail and we show that, contrary to the common belief that vectorization outweighs the benefits of having one loop, there are cases in which creating a single loop is actually the optimal solution.
|DunceCap: Compiling Worst-Case Optimal Query Plans
|Adam Perelman, Christopher Ré
|In this study, we explore two algorithms that are asymptotically faster than pairwise algorithms for a large class of queries.
|DunceCap: Query Plans Using Generalized Hypertree Decompositions
|Susan Tu, Christopher Ré
|My contribution is to explore query planning using these join algorithms.