Paper Digest: SIGMOD 2015 Highlights

June 16, 2015 | June 26, 2020 | admin

The annual conference of the ACM Special Interest Group on Management of Data (SIGMOD) is one of the top conferences on database management systems and data management technology.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights and summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic papers, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.

Paper Digest Team
team@paperdigest.org

TABLE 1: SIGMOD 2015 Papers

No.	Title	Authors	Highlight
1	From Data to Insights @ Bare Metal Speed	Jignesh M. Patel	Data analytics platforms today largely employ data processing kernels (e.g. implementation of selection and join operator algorithms) that were developed for a now bygone hardware era.
2	Distributed Outlier Detection using Compressive Sensing	Ying Yan, Jiaxing Zhang, Bojun Huang, Xuzhan Sun, Jiaqi Mu, Zheng Zhang, Thomas Moscibroda	In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure.
3	Locality-aware Partitioning in Parallel Database Systems	Erfan Zamanian, Carsten Binnig, Abdallah Salama	In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows co-partitioning of sets of tables based on given join predicates.
4	ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout	Ziqiang Feng, Eric Lo, Ben Kao, Wenjian Xu	In this paper we present ByteSlice, a new main memory storage layout that supports both highly efficient scans and lookups.
5	Implicit Parallelism through Deep Language Embedding	Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, Volker Markl	In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer’s productivity.
6	From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System	Shumo Chu, Magdalena Balazinska, Dan Suciu	In this paper, we describe a system that can efficiently compute complex join queries, including queries with cyclic joins, on a massively parallel architecture.
7	sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms	Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, Mohamed Hefeeda	We apply these optimizations to the popular Principal Component Analysis (PCA) algorithm.
8	Exploiting Matrix Dependency for Efficient Distributed Matrix Computation	Lele Yu, Yingxia Shao, Bin Cui	In this paper, we propose a novel matrix computation system, named DMac, which exploits the matrix dependencies in matrix programs for efficient matrix computation in the distributed environment.
9	LEMP: Fast Retrieval of Large Entries in a Matrix Product	Christina Teflioudi, Rainer Gemulla, Olga Mykytiuk	To address this problem, we propose the LEMP algorithm, which efficiently retrieves only the large entries in the product matrix without actually computing it.
10	Skew-Aware Join Optimization for Array Databases	Jennie Duggan, Olga Papaemmanouil, Leilani Battle, Michael Stonebraker	In this paper, we introduce a join optimization framework that is skew-aware for distributed joins.
11	Resource Elasticity for Large-Scale Machine Learning	Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss	In this paper, we introduce a simple and robust approach to automatic resource elasticity for large-scale ML.
12	SEMROD: Secure and Efficient MapReduce Over HybriD Clouds	Kerim Yasin Oktay, Sharad Mehrotra, Vaibhav Khadilkar, Murat Kantarcioglu	This paper describes SEMROD, a sensitive data aware MapReduce (MR) framework for hybrid clouds.
13	Authenticated Online Data Integration Services	Qian Chen, Haibo Hu, Jianliang Xu	In this paper, we take the first step to propose authenticated data integration services to ensure data and query integrity even in the presence of an untrusted integration server.
14	ENKI: Access Control for Encrypted Query Processing	Isabelle Hang, Florian Kerschbaum, Ernesto Damiani	ENKI: Access Control for Encrypted Query Processing
15	Collaborative Access Control in WebdamLog	Vera Zaychik Moffitt, Julia Stoyanovich, Serge Abiteboul, Gerome Miklau	We propose a novel access control model that operates within a distributed data management framework based on datalog.
16	Automatic Enforcement of Data Use Policies with DataLawyer	Prasang Upadhyaya, Magdalena Balazinska, Dan Suciu	We introduce novel algorithms to efficiently evaluate policies that can cut policy-checking overheads to only a few percent of the total query runtime.
17	TencentRec: Real-time Stream Recommendation in Practice	Yanxiang Huang, Bin Cui, Wenyu Zhang, Jie Jiang, Ying Xu	In this paper, we tackle the “big”, “real-time” and “accurate” challenges in real-time recommendation, and propose a general real-time stream recommender system built on Storm named TencentRec from three aspects, i.e., “system”, “algorithm”, and “data”.
18	Twitter Heron: Stream Processing at Scale	Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja	This paper presents the design and implementation of this new system, called Heron.
19	Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database	Lucas Braun, Thomas Etter, Georgios Gasparis, Martin Kaufmann, Donald Kossmann, Daniel Widmer, Aharon Avitzur, Anthony Iliopoulos, Eliezer Levy, Ning Liang	This paper presents an industrial use case and a novel architecture that integrates key-value-based event processing and SQL-based analytical processing on the same distributed store while minimizing the total cost of ownership.
20	Why Big Data Industrial Systems Need Rules and What We Can Do About It	Paul Suganthan G.C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan	In this paper we explore this issue.
21	Overview of Data Exploration Techniques	Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri	In this tutorial, we survey recent developments in the emerging area of database systems tailored for data exploration.
22	Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?	Christopher Ré, Divy Agrawal, Magdalena Balazinska, Michael Cafarella, Michael Jordan, Tim Kraska, Raghu Ramakrishnan	Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?
23	Cost-based Fault-tolerance for Parallel Data Processing	Abdallah Salama, Carsten Binnig, Tim Kraska, Erfan Zamanian	In this paper, we present a novel cost-based fault-tolerance scheme which tackles this issue.
24	Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases	Aaron J. Elmore, Vaibhav Arora, Rebecca Taft, Andrew Pavlo, Divyakant Agrawal, Amr El Abbadi	To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs.
25	Madeus: Database Live Migration Middleware under Heavy Workloads for Cloud Environment	Takeshi Mishima, Yasuhiro Fujiwara	To efficiently address the hot spot problem, we propose a middleware approach called Madeus that conducts database live migration.
26	Lineage-driven Fault Injection	Peter Alvaro, Joshua Rosen, Joseph M. Hellerstein	We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection.
27	Diversity-Aware Top-k Publish/Subscribe for Text Stream	Lisi Chen, Gao Cong	We propose a novel solution for efficiently processing a large number of DAS queries over a stream of documents.
28	Diverse and Proportional Size-l Object Summaries for Keyword Search	Georgios Fakas, Zhi Cai, Nikos Mamoulis	In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e. diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs.
29	Local Filtering: Improving the Performance of Approximate Queries on String Collections	Xiaochun Yang, Yaoshu Wang, Bin Wang, Wei Wang	In this paper, we explore the opposite paradigm focusing on finding out the differences of database strings to the query string.
30	Exact Top-k Nearest Keyword Search in Large Networks	Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong	In this paper, we propose algorithms for top-k nearest keyword search that provide exact solutions and which handle networks of very large sizes.
31	Efficient Algorithms for Answering the m-Closest Keywords Query	Tao Guo, Xin Cao, Gao Cong	In this paper, we prove that the problem of answering mCK queries is NP-hard.
32	Minimum Spanning Trees in Temporal Graphs	Silu Huang, Ada Wai-Chee Fu, Ruifeng Liu	We propose efficient linear time algorithms for computing MST_a.
33	Efficient Enumeration of Maximal k-Plexes	Devora Berlowitz, Sara Cohen, Benny Kimelfeld	This paper presents the first provably efficient algorithms, both for enumerating the maximal k-plexes and for enumerating the maximal connected k-plexes.
34	Divide & Conquer: I/O Efficient Depth-First Search	Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Zechao Shang	In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk.
35	Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity	Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, Wenjie Zhang	In this paper, we study the problem of efficiently computing the Steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC.
36	COMMIT: A Scalable Approach to Mining Communication Motifs from Dynamic Networks	Saket Gurukar, Sayan Ranu, Balaraman Ravindran	In this paper, we study this problem by mining communication motifs from dynamic interaction networks.
37	LASH: Large-Scale Sequence Mining with Hierarchies	Kaustubh Beedkar, Rainer Gemulla	We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies.
38	Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time	Michael Cochez, Hao Mou	In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage.
39	DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation	Junhao Gan, Yufei Tao	In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n^(4/3)) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science.
40	The TagAdvisor: Luring the Lurkers to Review Web Items	Azade Nazi, Mahashweta Das, Gautam Das	In this paper, we address the problem of how to engage the lurkers (i.e., people who read reviews but never take time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task.
41	Supporting Data Uncertainty in Array Databases	Liping Peng, Yanlei Diao	In this paper, we address uncertain data management in array databases, which may involve both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes.
42	Identifying the Extent of Completeness of Query Answers over Partially Complete Databases	Simon Razniewski, Flip Korn, Werner Nutt, Divesh Srivastava	In this paper, we propose a natural class of completeness patterns, expressed by selections on database tables, to specify complete parts of database tables.
43	k-Hit Query: Top-k Query with Probabilistic Utility Function	Peng Peng, Raymond Chi-Wing Wong	In this paper, we present various interesting properties of k-hit queries.
44	Linking Temporal Records for Profiling Entities	Furong Li, Mong Li Lee, Wynne Hsu, Wang-Chiew Tan	In this paper, we present a new solution for understanding how two facts may be temporally related and exploit the knowledge to profile how entities evolve over time.
45	Telco Churn Prediction with Big Data	Yiqing Huang, Fangzhou Zhu, Mingxuan Yuan, Ke Deng, Yanhua Li, Bing Ni, Wenyuan Dai, Qiang Yang, Jia Zeng	We show that telco big data can make churn prediction much easier from the 3V’s perspectives: Volume, Variety, Velocity.
46	The LDBC Social Network Benchmark: Interactive Workload	Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, Peter Boncz	This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies.
47	Rethinking Data-Intensive Science Using Scalable Analytics Systems	Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson	In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%.
48	QMapper for Smart Grid: Migrating SQL-based Application to Hive	Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen, Songlin Hu	In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL.
49	Three Favorite Results	Jennifer Widom	Being honored as the ACM Athena Lecturer has inspired me to reflect upon the research I’ve conducted over my career to date.
50	The Power Behind the Throne: Information Integration in the Age of Data-Driven Discovery	Laura M. Haas	I will describe the environment we are creating, the advances in the field that enable it, and the challenges that remain.
51	On the Design and Scalability of Distributed Shared-Data Databases	Simon Loesing, Markus Pilman, Thomas Etter, Donald Kossmann	In this paper, we analyze an alternative architecture design for distributed relational databases that overcomes the limitations of partitioned databases.
52	Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems	Thomas Neumann, Tobias Mühlbauer, Alfons Kemper	We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees.
53	FOEDUS: OLTP Engine for a Thousand Cores and NVRAM	Hideaki Kimura	We analyze the characteristics of these machines and find that no existing database is appropriate.
54	Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems	Joy Arulraj, Andrew Pavlo, Subramanya R. Dulloor	To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates.
55	Private Release of Graph Statistics using Ladder Functions	Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao	In this paper, we introduce a new method which guarantees differential privacy.
56	Bayesian Differential Privacy on Correlated Data	Bin Yang, Issei Sato, Hiroshi Nakagawa	In this paper, we focus on the private perturbation algorithms on correlated data.
57	Modular Order-Preserving Encryption, Revisited	Charalampos Mavroforakis, Nathan Chenette, Adam O’Neill, George Kollios, Ran Canetti	In this paper, we systematically address this vulnerability and show that MOPE can be used to build a practical system for executing range queries on encrypted data while providing a significant security improvement over the basic OPE.
58	Chiaroscuro: Transparency and Privacy for Massive Personal Time-Series Clustering	Tristan Allard, Georges Hébrail, Florent Masseglia, Esther Pacitti	In this paper, we propose Chiaroscuro, a complete solution for clustering personal data with strong privacy guarantees.
59	Persistent Data Sketching	Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen	In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
60	Scalable Distributed Stream Join Processing	Qian Lin, Beng Chin Ooi, Zhengkui Wang, Cui Yu	In this paper, we propose a novel stream join model, called join-biclique, which organizes a large cluster as a complete bipartite graph.
61	SCREEN: Stream Data Cleaning under Speed Constraints	Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu	Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle, (3) support on out-of-order arrivals of data points, and (4) adaptive window size for balancing repair accuracy and efficiency.
62	Location-Aware Pub/Sub System: When Continuous Moving Queries Meet Dynamic Event Streams	Long Guo, Dongxiang Zhang, Guoliang Li, Kian-Lee Tan, Zhifeng Bao	In this paper, we propose a new location-aware pub/sub system, Elaps, that continuously monitors moving users subscribing to dynamic event streams from social media and E-commerce applications.
63	CE-Storm: Confidential Elastic Processing of Data Streams	Nick R. Katsipoulakis, Cory Thoma, Eric A. Gratta, Alexandros Labrinidis, Adam J. Lee, Panos K. Chrysanthis	We will demonstrate both systems working in tandem and also visualize their behavior over time under different scenarios.
64	A SQL Debugger Built from Spare Parts: Turning a SQL:1999 Database System into Its Own Debugger	Benjamin Dietrich, Torsten Grust	We demonstrate a new incarnation of Habitat, an observational debugger for SQL.
65	Exploratory Keyword Search with Interactive Input	Zhifeng Bao, Yong Zeng, H.V. Jagadish, Tok Wang Ling	Therefore, we propose a framework called ClearMap that natively supports visualized exploratory search paradigm on XML data.
66	QE3D: Interactive Visualization and Exploration of Complex, Distributed Query Plans	Daniel Scheibli, Christian Dinse, Alexander Boehm	In this demonstration, we show how its interactive, three-dimensional plan representation helps to understand and quickly identify hotspots in complex, real-world scenarios.
67	DataXFormer: An Interactive Data Transformation Tool	John Morcos, Ziawasch Abedjan, Ihab Francis Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker	In this demonstration, we present the user-interaction with DataXFormer and show scenarios on how it can be used to transform data and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
68	Quality-Driven Continuous Query Execution over Out-of-Order Data Streams	Yuanzhen Ji, Hongjin Zhou, Zbigniew Jerzak, Anisoara Nica, Gregor Hackenbroich, Christof Fetzer	We demonstrate a prototype stream processing system, which extends SAP Event Stream Processor with the implementation of AQ-K-slack.
69	MoDisSENSE: A Distributed Spatio-Temporal and Textual Processing Platform for Social Networking Services	Ioannis Mytilinis, Ioannis Giannakopoulos, Ioannis Konstantinou, Katerina Doka, Dimitrios Tsitsigkos, Manolis Terrovitis, Lampros Giampouras, Nectarios Koziris	In this work, we present MoDisSENSE, an open-source distributed platform that provides personalized search for points of interest and trending events based on the user’s social graph by combining spatio-textual user generated data.
70	DocRicher: An Automatic Annotation System for Text Documents Using Social Media	Qiang Hu, Qi Liu, Xiaoli Wang, Anthony K.H. Tung, Shubham Goyal, Jisong Yang	We demonstrate a system, DocRicher, to enrich a text document with social media content that implicitly references certain passages of it.
71	A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications	Li-Yan Yuan, Lengdong Wu, Jia-Huai You, Yan Chi	We propose to demonstrate Rubato DB, a highly scalable NewSQL system, supporting various consistency levels from ACID to BASE for OLTP and big data applications.
72	G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data	Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust, Ion Stoica	In this demonstration, we present G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques.
73	Mining and Forecasting of Big Time-series Data	Yasushi Sakurai, Yasuko Matsubara, Christos Faloutsos	The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in large-scale time-series sequences.
74	Optimal Spatial Dominance: An Effective Search of Nearest Neighbor Candidates	Xiaoyang Wang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Muhammad Aamir Cheema	Efficient algorithms are proposed for the dominance check and corresponding NN candidates computation.
75	THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads	Farhan Tauheed, Thomas Heinis, Anastasia Ailamaki	We propose THERMAL-JOIN, a novel spatial self-join algorithm for dynamic memory-resident workloads.
76	Indexing Metric Uncertain Data for Range Queries	Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen, Baihua Zheng	In this paper, we represent metric uncertain data by using an object-level model and a bi-level model, respectively.
77	Efficient Route Planning on Public Transportation Networks: A Labelling Approach	Sibo Wang, Wenqing Lin, Yi Yang, Xiaokui Xiao, Shuigeng Zhou	This paper presents Timetable Labelling (TTL), an efficient indexing technique for route planning on timetable graphs.
78	The Importance of Being Expert: Efficient Max-Finding in Crowdsourcing	Aris Anagnostopoulos, Luca Becchetti, Adriano Fazzone, Ida Mele, Matteo Riondato	As we highlight in this work, the exclusive use of nonexpert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demand some degree of specialization to avoid excessive uncertainty and inconsistency in the answers.
79	Minimizing Efforts in Validating Crowd Answers	Nguyen Quoc Viet Hung, Duong Chi Thang, Matthias Weidlich, Karl Aberer	Therefore, we develop a probabilistic model that helps to identify the most beneficial validation questions in terms of both improvement of result correctness and detection of faulty workers.
80	iCrowd: An Adaptive Crowdsourcing Framework	Ju Fan, Guoliang Li, Beng Chin Ooi, Kian-Lee Tan, Jianhua Feng	To this end, we propose an adaptive crowdsourcing framework, called iCrowd.
81	QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications	Yudian Zheng, Jiannan Wang, Guoliang Li, Reynold Cheng, Jianhua Feng	In this paper, we investigate the online task assignment problem: Given a pool of n questions, which of the k questions should be assigned to a worker?
82	tDP: An Optimal-Latency Budget Allocation Strategy for Crowdsourced MAXIMUM Operations	Vasilis Verroios, Peter Lofgren, Hector Garcia-Molina	We focus on one of the most extensively studied crowdsourcing operations, the MAX operation (finding the best element in a collection under human criteria), and we study the problem of budget allocation into rounds for this operation.
83	Thrifty: Offering Parallel Database as a Service using the Shared-Process Approach	Petrie Wong, Zhian He, Ziqiang Feng, Wenjian Xu, Eric Lo	In this demonstration, we present Thrifty, a Parallel-Database-as-a-Service operated using the "shared-process" approach.
84	BenchPress: Dynamic Workload Control in the OLTP-Bench Testbed	Dana Van Aken, Djellel E. Difallah, Andrew Pavlo, Carlo Curino, Philippe Cudré-Mauroux	We recently introduced OLTP-Bench, an extensible testbed for benchmarking relational databases that is bundled with 15 workloads.
85	Demonstrating "Data Near Here": Scientific Data Search	V.M. Megler, David Maier	We include an analysis showing that our summary-based approach gives a reasonable approximation of such a "complete dataset" similarity measure.
86	Slider: An Efficient Incremental Reasoner	Jules Chevalier, Julien Subercaze, Christophe Gravier, Frédérique Laforest	We contribute to solving these problems by introducing Slider, an efficient incremental reasoner.
87	WANalytics: Geo-Distributed Analytics for a Data Intensive World	Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Konstantinos Karanasos, Jitendra Padhye, George Varghese	We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements.
88	FTT: A System for Finding and Tracking Tourists in Public Transport Services	Huayu Wu, Jo-Anne Tan, Wee Siong Ng, Mingqiang Xue, Wei Chen	In this paper, we present our system FTT (Finding and Tracking Tourists) to identify tourists from public transport commuters in a city, and to further track their movements from one place to another.
89	SharkDB: An In-Memory Storage System for Massive Trajectory Data	Haozhou Wang, Kai Zheng, Xiaofang Zhou, Shazia Sadiq	In this storage design, we try to explore the potential opportunities, which can boost the performance of query processing for trajectory data.
90	Ringo: Interactive Graph Analytics on Big-Memory Machines	Yonathan Perez, Rok Sosič, Arijit Banerjee, Rohan Puttagunta, Martin Raison, Pararth Shah, Jure Leskovec	We present Ringo, a system for analysis of large graphs.
91	STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data	Robert Christensen, Lu Wang, Feifei Li, Ke Yi, Jun Tang, Natalee Villa	We present the STORM system to enable spatio-temporal online reasoning and management of large spatio-temporal data.
92	PAXQuery: Parallel Analytical XML Processing	Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu, Juan A.M. Naranjo	We demonstrate PAXQuery, a novel system that parallelizes the execution of XQuery queries over large collections of XML documents.
93	Cache-Efficient Aggregation: Hashing Is Sorting	Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, Franz Färber	In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same.
94	Efficient Similarity Join and Search on Multi-Attribute Data	Guoliang Li, Jian He, Dong Deng, Jian Li	In this paper we study similarity join and search on multi-attribute data.
95	Holistic Indexing in Main-memory Column-stores	Eleni Petraki, Stratos Idreos, Stefan Manegold	This paper introduces holistic indexing, a new approach to automated index tuning in dynamic environments.
96	CliffGuard: A Principled Framework for Finding Robust Database Designs	Barzan Mozafari, Eugene Zhen Ye Goh, Dong Young Yoon	Thus, we propose a new type of database designer that is robust against parameter uncertainties, so that overall performance degrades more gracefully when future workloads deviate from the past.
97	Exploiting Correlations for Expensive Predicate Evaluation	Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, Christopher Ré	In this paper, we study ways to efficiently evaluate selection queries with UDF predicates.
98	Query-Oriented Data Cleaning with Oracles	Moria Bergman, Tova Milo, Slava Novgorodov, Wang-Chiew Tan	To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles.
99	BigDansing: A System for Big Data Cleansing	Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin	In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing.
100	Data X-Ray: A Diagnostic Tool for Data Errors	Xiaolan Wang, Xin Luna Dong, Alexandra Meliou	Our contributions are three-fold.
101	KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing	Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye	We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data.
102	Crowd-Based Deduplication: An Adaptive Approach	Sibo Wang, Xiaokui Xiao, Chun-Hee Lee	This paper presents ACD, a new crowd-based algorithm for data deduplication.
103	Minimizing Commit Latency of Transactions in Geo-Replicated Data Stores	Faisal Nawab, Vaibhav Arora, Divyakant Agrawal, Amr El Abbadi	In this work, we derive a lower-bound on commit latency.
104	Optimizing Optimistic Concurrency Control for Tree-Structured, Log-Structured Databases	Philip A. Bernstein, Sudipto Das, Bailu Ding, Markus Pilman	To address them, we describe a high-performance transaction mechanism that uses optimistic concurrency control on a multi-versioned tree-structured database stored in a shared log.
105	The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis	Sudip Roy, Lucja Kot, Gabriel Bender, Bailu Ding, Hossein Hojjat, Christoph Koch, Nate Foster, Johannes Gehrke	This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes.
106	Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity	Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica	In this work, we empirically investigate modern ORM-backed applications’ use and disuse of database concurrency control mechanisms.
107	REEF: Retainable Evaluator Execution Framework	Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamurthy, Raghu Ramakrishnan, Sriram Rao, Russel Sears, Beysim Sezgin, Julia Wang	This paper presents REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager.
108	Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications	Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, Carlo Curino	In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes.
109	Design and Implementation of the LogicBlox System	Molham Aref, Balder ten Cate, Todd J. Green, Benny Kimelfeld, Dan Olteanu, Emir Pasalic, Todd L. Veldhuizen, Geoffrey Washburn	In this paper, we discuss the design considerations behind the LogicBlox system and give an overview of its implementation, highlighting innovative aspects.
110	Spark SQL: Relational Data Processing in Spark	Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia	Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis.
111	Graft: A Debugging Tool For Apache Giraph	Semih Salihoglu, Jaeho Shin, Vikesh Khanna, Ba Quan Truong, Jennifer Widom	We address the problem of debugging programs written for Pregel-like systems.
112	Even Metadata is Getting Big: Annotation Summarization using InsightNotes	Dongqing Xiao, Armir Bashllari, Tyler Menard, Mohamed Eltabakh	In this paper, we demonstrate the InsightNotes system, a summary-based annotation management engine over relational databases.
113	StoryPivot: Comparing and Contrasting Story Evolution	Anja Gruenheid, Donald Kossmann, Theodoros Rekatsinas, Divesh Srivastava	In this demonstration we present StoryPivot, a framework that helps its users to detect evolving stories in event datasets over time.
114	The Flatter, the Better: Query Compilation Based on the Flattening Transformation	Alexander Ulrich, Torsten Grust	We demonstrate the insides and outs of a query compiler based on the flattening transformation, a translation technique designed by the programming language community to derive efficient data-parallel implementations from iterative programs.
115	D2WORM: A Management Infrastructure for Distributed Data-centric Workflows	Martin Jergler, Mohammad Sadoghi, Hans-Arno Jacobsen	In this demonstration, we present D2Worm, a Distributed Data-centric Workflow Management system.
116	NL₂CM: A Natural Language Interface to Crowd Mining	Yael Amsterdamer, Anna Kukliansky, Tova Milo	To account for these challenges, we develop new, dedicated modules and embed them within the modular and easily extensible architecture of NL₂CM.
117	Optimistic Recovery for Iterative Dataflows in Action	Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios Katsifodimos, Stephan Ewen, Kostas Tzoumas, Volker Markl	In this paper, we demonstrate our recovery mechanism with the Apache Flink data processing engine.
118	A Secure Search Engine for the Personal Cloud	Saliha Lallali, Nicolas Anciaux, Iulian Sandu Popa, Philippe Pucheral	We have implemented our engine on a real tamper resistant hardware device and present its capacity to regulate the access to a personal dataspace.
119	IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows	Katerina Doka, Nikolaos Papailiou, Dimitrios Tsoumakos, Christos Mantas, Nectarios Koziris	To this end, we demonstrate IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments.
120	Just can’t get enough: Synthesizing Big Data	Tilmann Rabl, Manuel Danisch, Michael Frank, Sebastian Schindler, Hans-Arno Jacobsen	As a solution, we present an automatic approach to data synthetization from existing data sources.
121	Rack-Scale In-Memory Join Processing using RDMA	Claude Barthels, Simon Loesing, Gustavo Alonso, Donald Kossmann	In this paper we focus on implementing parallel in-memory joins using Remote Direct Memory Access (RDMA), a communication mechanism to transfer data directly into the memory of a remote machine.
122	Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation	Max Heimel, Martin Kiefer, Volker Markl	We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.
123	Rethinking SIMD Vectorization for In-Memory Databases	Orestis Polychroniou, Arun Raghavan, Kenneth A. Ross	In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters.
124	A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew	Yinan Li, Craig Chasseur, Jignesh M. Patel	This paper proposes a padded encoding scheme to address this opportunity.
125	GetReal: Towards Realistic Selection of Influence Maximization Strategies in Competitive Networks	Hui Li, Sourav S. Bhowmick, Jiangtao Cui, Yunjun Gao, Jianfeng Ma	In this paper, we propose a novel framework based on game theory to provide a more realistic solution to the IM problem in competitive networks by jettisoning these unrealistic assumptions.
126	Influence Maximization in Near-Linear Time: A Martingale Approach	Youze Tang, Yanchen Shi, Xiaokui Xiao	This paper presents an influence maximization algorithm that provides the same worst-case guarantees as the state of the art, but offers significantly improved empirical efficiency.
127	Community Level Diffusion Extraction	Zhiting Hu, Junjie Yao, Bin Cui, Eric Xing	This paper introduces a new approach, i.e., COmmunity Level Diffusion (COLD), to uncover and explore temporal diffusion.
128	BEAR: Block Elimination Approach for Random Walk with Restart on Large Graphs	Kijung Shin, Jinhong Jung, Sael Lee, U. Kang	In this paper, we propose BEAR, a fast, scalable, and accurate method for computing RWR on large graphs.
129	The Minimum Wiener Connector Problem	Natali Ruchansky, Francesco Bonchi, David García-Soriano, Francesco Gullo, Nicolas Kourtellis	In this paper we study the novel problem of finding a minimum Wiener connector: given a connected graph G=(V,E) and a set Q ⊆ V of query vertices, find a subgraph of G that connects all query vertices and has minimum Wiener index.
130	From Group Recommendations to Group Formation	Senjuti Basu Roy, Laks V.S. Lakshmanan, Rui Liu	We consider the complementary problem of how to form groups such that the users in the formed groups are most satisfied with the suggested top-k recommendations.
131	Real-Time Multi-Criteria Social Graph Partitioning: A Game Theoretic Approach	Nikos Armenatzoglou, Huy Pham, Vasilis Ntranos, Dimitris Papadias, Cyrus Shahabi	In this paper, we introduce RMGP, a type of real-time multi-criteria graph partitioning for social networks that groups the users based on their connectivity and their similarity to a set of input classes.
132	Utility-Aware Social Event-Participant Planning	Jieying She, Yongxin Tong, Lei Chen	Existing approaches usually assume that each user only attends one event or ignore location information.
133	Online Video Recommendation in Sharing Community	Xiangmin Zhou, Lei Chen, Yanchun Zhang, Longbing Cao, Guangyan Huang, Chen Wang	In this paper, we propose an approach based on the content and social information of videos for the recommendation in sharing communities.
134	Large-scale Predictive Analytics in Vertica: Fast Data Transfer, Distributed Model Creation, and In-database Prediction	Shreya Prasad, Arash Fard, Vishrut Gupta, Jorge Martinez, Jeff LeFevre, Vincent Xu, Meichun Hsu, Indrajit Roy	This paper presents the design of a high performance data transfer mechanism, new data-structures in Distributed R to maintain data locality with database table segments, and extensions to Vertica for saving and deploying R models.
135	Oracle Workload Intelligence	Quoc Trung Tran, Konstantinos Morfonios, Neoklis Polyzotis	In this work, we present Oracle Workload Intelligence (WI), a tool for workload modeling and mining, as our attempt to infer the processes that generate a given workload.
136	Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components	John Colgrove, John D. Davis, John Hayes, Ethan L. Miller, Cary Sandvig, Russell Sears, Ari Tamches, Neil Vachharajani, Feng Wang	In this paper, we describe Purity, the foundation of Pure Storage’s Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability.
137	On Improving User Response Times in Tableau	Pawel Terlecki, Fei Xu, Marianne Shaw, Valeri Kim, Richard Wesley	In this paper we discuss key data processing components in Tableau: the query processor, query caches, Tableau Data Engine [1, 2] and Data Server.
138	Data Management in Non-Volatile Memory	Stratis D. Viglas	In what follows we present the current work in the area with a view towards identifying the open problems and exposing the research opportunities.
139	TEGRA: Table Extraction by Global Record Alignment	Xu Chu, Yeye He, Kaushik Chakrabarti, Kris Ganjam	In this work, we address the important problem of automatically extracting multi-column relational tables from such lists.
140	Mining Quality Phrases from Massive Text Corpora	Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han	In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation.
141	Mining Subjective Properties on the Web	Immanuel Trummer, Alon Halevy, Hongrae Lee, Sunita Sarawagi, Rahul Gupta	We describe the Surveyor system that mines the dominant opinion held by authors of Web content about whether a subjective property applies to a given entity.
142	Microblog Entity Linking with Social Temporal Context	Wen Hua, Kai Zheng, Xiaofang Zhou	In this paper, we propose an efficient solution to link entities in tweets by analyzing their social and temporal context.
143	Graph-Aware, Workload-Adaptive SPARQL Query Caching	Nikolaos Papailiou, Dimitrios Tsoumakos, Panagiotis Karras, Nectarios Koziris	In this work we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results.
144	Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins)	Medha Atre	In this paper, we present Left Bit Right (LBR), a technique for well-designed nested BGP and OPTIONAL pattern queries.
145	How to Build Templates for RDF Question/Answering: An Uncertain Graph Similarity Join Approach	Weiguo Zheng, Lei Zou, Xiang Lian, Jeffrey Xu Yu, Shaoxu Song, Dongyan Zhao	We propose several structural and probability pruning techniques to speed up joining.
146	RBench: Application-Specific RDF Benchmarking	Shi Qiao, Z. Meral Özsoyoğlu	To address the needs of diverse applications, we propose an application-specific framework, called RBench, to generate RDF benchmarks.
147	ALEX: Automatic Link Exploration in Linked Data	Ahmed El-Roby, Ashraf Aboulnaga	In this paper, we present ALEX, a system that aims at improving the quality of links between RDF data sets by using feedback provided by users on the answers to linked data queries.
148	k-Shape: Efficient and Accurate Clustering of Time Series	John Paparrizos, Luis Gravano	In this paper, we present k-Shape, a novel algorithm for time-series clustering.
149	SMiLer: A Semi-Lazy Time Series Prediction System for Sensors	Jingbo Zhou, Anthony K.H. Tung	We propose a new method to apply the GP for sensor time series prediction.
150	SQLGraph: An Efficient Relational-Based Property Graph Store	Wen Sun, Achille Fokoue, Kavitha Srinivas, Anastasios Kementsietsidis, Gang Hu, Guotong Xie	We show that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores.
151	Updating Graph Indices with a One-Pass Algorithm	Dayu Yuan, Prasenjit Mitra, Huiwen Yu, C. Lee Giles	In order to address this issue, we propose a time-efficient one-pass algorithm that is designed to update a graph index by scanning each frequent subgraph at most once.
152	Amazon Redshift and the Case for Simpler Data Warehouses	Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, Vidhya Srinivasan	In this paper, we discuss an oft-overlooked differentiating characteristic of Amazon Redshift — simplicity.
153	ShareInsights: An Unified Approach to Full-stack Data Processing	Mukund Deshpande, Dhruva Ray, Sameer Dixit, Avadhoot Agasti	In this paper we present a platform that aims to significantly reduce the time it takes to build data pipelines.
154	An Incremental Anytime Algorithm for Multi-Objective Query Optimization	Immanuel Trummer, Christoph Koch	We present an incremental anytime algorithm for MOQO, analyze its complexity and show that it offers an attractive tradeoff between result update frequency, single invocation time complexity, and amortized time over multiple invocations.
155	Output-sensitive Evaluation of Prioritized Skyline Queries	Niccolò Meneghetti, Denis Mindolin, Paolo Ciaccia, Jan Chomicki	In this paper we show that querying using non-compensatory preferences is computationally efficient.
156	Learning Generalized Linear Models Over Normalized Data	Arun Kumar, Jeffrey Naughton, Jignesh M. Patel	In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting.
157	Utilizing IDs to Accelerate Incremental View Maintenance	Yannis Katsis, Kian Win Ong, Yannis Papakonstantinou, Kevin Keliang Zhao	This work makes the following contributions: (a) An ID-based IVM system for a large subset of SQL that includes the algebraic operators selection, join, grouping and aggregation, generalized projection involving functions, antisemijoin (and therefore negation/difference) and union.
158	S4: Top-k Spreadsheet-Style Search for Query Discovery	Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, Surajit Chaudhuri	To address this limitation, we study the problem of efficiently discovering top-k project join queries which approximately contain the given example tuples in their output.
159	Proactive Annotation Management in Relational Databases	Karim Ibrahim, Xiao Du, Mohamed Eltabakh	In this paper, we propose the Nebula system, an advanced and proactive annotation management engine in relational databases.
160	Weighted Coverage based Reviewer Assignment	Ngai Meng Kou, Leong Hou U., Nikos Mamoulis, Zhiguo Gong	In this paper, we propose a generalized framework for fair reviewer assignment.
161	Distributed Online Tracking	Mingwang Tang, Feifei Li, Yufei Tao	This problem was recently formalized and studied, and a principled approach with optimal competitive ratio was proposed.
162	Knowledge Curation and Knowledge Fusion: Challenges, Models and Applications	Xin Luna Dong, Divesh Srivastava	Our tutorial highlights the similarities and differences between knowledge management and data integration, and has two goals.
163	Smooth Task Migration in Apache Storm	Mansheng Yang, Richard T.B. Ma	To handle the task migration process more gracefully, we propose three task migration methods: (i) worker level migration, (ii) executor level migration, and (iii) executor level migration with reliable messaging.
164	JAFAR: Near-Data Processing for Databases	Oreoluwatomiwa O. Babarinsa, Stratos Idreos	In this paper, we present JAFAR, a near data processing accelerator for pushing selects down to memory.
165	Job Scheduling with Minimizing Data Communication Costs	Trevor Clinkenbeard, Anisoara Nica	The research presented in this paper analyzes different algorithms for scheduling a set of potentially interdependent jobs in order to minimize the total runtime, or makespan, when data communication costs are considered.
166	One Loop Does Not Fit All	Styliani Pantela, Stratos Idreos	In this paper, we study JIT compilation for modern in-memory column-stores in detail and we show that, contrary to the common belief that vectorization outweighs the benefits of having one loop, there are cases in which creating a single loop is actually the optimal solution.
167	DunceCap: Compiling Worst-Case Optimal Query Plans	Adam Perelman, Christopher Ré	In this study, we explore two algorithms that are asymptotically faster than pairwise algorithms for a large class of queries.
168	DunceCap: Query Plans Using Generalized Hypertree Decompositions	Susan Tu, Christopher Ré	My contribution is to explore query planning using these join algorithms.