
IEEE International Conference on Data Engineering

Database Management as a Service: Challenges and Opportunities
Divyakant Agrawal #1, Amr El Abbadi #2, Fatih Emekci ∗3, Ahmed Metwally @4
(#) Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA 93106, USA
(∗) LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA 94043, USA
(@) Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
(1) [email protected] (2) [email protected] (3) [email protected] (4) [email protected]

1084-4627/09 $25.00 © 2009 IEEE. DOI 10.1109/ICDE.2009.151

Abstract: Data outsourcing, or database as a service, is a new paradigm for data management in which a third party service provider hosts a database as a service. The service provides data management for its customers and thus obviates the need for the service user to purchase expensive hardware and software, deal with software upgrades and hire professionals for administrative and maintenance tasks. Since using an external database service promises reliable data storage at a low cost, it is very attractive for companies. Such a service would also provide universal access, through the Internet, to private data stored at reliable and secure sites. However, recent governmental legislation, competition among companies, and database thefts mandate that companies use secure and privacy preserving data management techniques. The data provider, therefore, needs to guarantee that the data is secure and be able to execute queries on the data, and the results of the queries must also be secure and not visible to the data provider. Current research has focused only on how to index and query encrypted data. However, querying encrypted data is computationally very expensive. Providing an efficient trust mechanism to push both database service providers and clients to behave honestly has emerged as one of the most important problems to solve before data outsourcing can become a viable paradigm. In this paper, we describe scalable privacy preserving algorithms for data outsourcing. Instead of encryption, which is computationally expensive, we use distribution on multiple data provider sites and information theoretically proven secret sharing algorithms as the basis for privacy preserving outsourcing. The technical contribution of this paper is the establishment and development of a framework for efficient, fault-tolerant, scalable and theoretically secure privacy preserving data outsourcing that supports a diversity of database operations executed on different types of data, which can even leverage publicly available data sets.

I. INTRODUCTION

Internet-scale computing has resulted in dramatic changes in the design and deployment of information technology infrastructure components. Cloud computing has been gaining in popularity in the commercial world, where various computing based capabilities are provided as a service to clients, thus relieving those clients from the need to develop expertise in these capabilities, as well as the need to manage and maintain the software providing these services. Amazon, for example, has created the service EC2, which provides clients with scalable servers, as well as another service, S3, which provides scalable storage to clients. Recently, NSF partnered with Google and IBM to offer academic institutions access to large scale distributed infrastructure under the NSF CLuE program. There has clearly been a radical paradigm shift due to the wide acceptance of and reliance on Internet and Web-based technologies.

One of the reasons for the success of Internet-scale computing is the role it has played in eliminating the size of an enterprise as a critical factor in its economic success. An excellent example of this change is the notion of data centers, which provide clients with the physical infrastructure needed to host their computer systems, including redundant power supplies, high bandwidth communication capabilities, environment monitoring, and security services. Data centers eliminate the need for small companies to make a large capital expenditure in building an infrastructure to create a global customer base. The data center model has been effective since it allows an enterprise of any size to manage growth with the popularity of its product or service while at the same time allowing the enterprise to cut its losses if the launched product or service does not succeed. During the past few years we have seen a rapid acceleration of innovation in new business paradigms, and data centers have played a very important role in this process.

In addition to the physical infrastructure needed to support Internet and web-based applications, such applications have data management needs as well. To enable more sophisticated business analysis and user customization, e-commerce applications maintain data or log information for every user interaction rather than only storing transaction data (e.g., sales transactions in the retail industry). This trend has resulted in an explosive growth in the amount of data associated
with these applications. Storage and retrieval of such data poses monumental challenges, especially for relatively small-sized companies, since the cost of data management is estimated to be five-to-ten times higher than the data acquisition cost [1]. More importantly, in-house data management requires a much higher level skill set to deal with issues of storage technologies, capacity planning, fault-tolerance and reliability, disaster recovery, and DBMS and operating system software upgrades. Most commercial entities would rather direct their valuable technical resources and engineering talent to focus on their business applications instead of becoming full-time data management companies.

Due to the above concerns, data outsourcing or database as a service has emerged as a new paradigm for data management in which a third party service provider hosts a database and provides the associated software and hardware support. The Database Service Provider (DBSP) provides data management for its customers, and thus obviates the need for the customer to purchase expensive hardware and software, deal with software upgrades, and hire professionals for administrative and maintenance tasks. Since using a DBSP promises reliable data storage at a low cost, it is very attractive for large enterprises such as commercial entities, intelligence agencies, and other public and private organizations. Such a service would also provide universal access through the Internet to private data stored at reliable and secure sites. A client company can outsource its data, and its employees then have access to the company data irrespective of their current locations. Rather than carrying data with them as they travel or logging remotely into their home machines, employees can store and access company data with the DBSP, thus eliminating the risk of data loss or theft.

Although the data outsourcing paradigm has compelling economic and technical advantages, its adoption has not mirrored the success of data centers. The main reason for this failure is that recent governmental legislation, competition among companies, and database thefts mandate that enterprises use secure and privacy preserving data management techniques. Using an external database service is an example of a straightforward client-server application in an environment where service providers and clients are honest and clients do not hesitate to share their data with database service providers. However, this is usually not the case, and thus the research challenge is to build a robust and efficient service to manage data in a secure and privacy-preserving manner. The DBSP, therefore, needs to guarantee that data is secure, that the results of the queries are also secure, and that the queries can be executed efficiently and correctly.

Previous research in the context of secure data outsourcing has focused on these areas independently. In the case of ensuring security or, equivalently, confidentiality of outsourced data, most of the research is concerned with ensuring that even though the data is outsourced to a third party, the individual data values should not be discernible to the service provider [1], [2], [3], [4], [5], [6], [7], [8], [9]. In the case of the security of query results, also called private information retrieval, the problem has been studied as the theoretical formulation where the client must be able to retrieve the ith element from N data elements without the service provider discovering that the client is interested in the ith element [10], [11], [12], [13], [14], [15], [16]. Finally, in the last case we are concerned with a malicious environment, and therefore need to ensure completeness and correctness of user queries in that the results returned by the service provider are indeed the exact answers to the user queries [17], [18], [19], [20], [21]. If, for example, the service provider corrupted the data, it would be impossible for the service user to recover it. To be able to use external DBSPs in real-world settings, there must be a mechanism to recover the data and also to verify whether data has been corrupted. Providing a trust mechanism to ensure both DBSPs and clients behave honestly has emerged as one of the most important problems that must be overcome before data outsourcing becomes a viable paradigm. Clearly, wide adoption of the data outsourcing framework will only be possible if all three issues are adequately addressed.

Technical Rationale

Most approaches proposed for secure and private data outsourcing are based on data encryption [22]. At the recent International Conference on Very Large Data Bases (VLDB 2007), Sion [22] presented a comprehensive review of the current state of the art in secure data outsourcing. A review of the tutorial notes reveals that indeed most of the techniques primarily rely on advanced cryptographic algorithms, and more recently there is some effort to design and develop special purpose hardware technology to overcome the overhead associated with data encryption. In fact, in his concluding slide, Sion states that the practical maturity of secure data outsourcing is in its infancy and adoption of such technology is barely crawling. Secure data outsourcing is complementary to the notion of data centers in that it will enable enterprises to outsource both application processing and data management to external entities, thus leveraging economies of scale and achieving significant efficiencies in Information Technology infrastructures.

II. BACKGROUND

A. Encryption-based Data Security

Current research on data security uses encryption to hide the content of the data from service providers [1], [3], [5], [7], [8], [9]. The concept of database as a service has been of interest to the research community under various guises. NetDB2 [1] was developed and deployed as a database service on the Internet. NetDB2 directly addresses two of the main challenges in developing databases as a service, namely data privacy and performance. NetDB2 uses encryption to ensure privacy and is the first work that directly addresses and evaluates the issue of performance, which is critical for success in databases. Recently, [23] proposed using homomorphic encryption to support secure aggregate outsourcing. In [19], Sion introduced the notion of query execution assurance in outsourced databases, namely, assurances are provided by the database server that

the served client queries were in fact correctly executed on the entire outsourced database. In [24], the vulnerability of using a single key for data encryption is raised, and hence the authors propose using a hierarchical key assignment scheme. Anciaux et al. [25] address the interesting problem of executing privacy preserving operations on both public and private data, when the latter are stored on smart USB keys.

The computational complexity of encrypting and decrypting data to execute a query increases the query response time. Therefore, this complexity is one of the bottlenecks in current solutions [26]. In fact, Agrawal et al. [26] show that computing a privacy preserving intersection using encryption results in a very high time complexity. The estimated cost of the encryption based approach on a synthetic data set consisting of 10 documents at one site and 100 documents at another site (each with 1000 words) could be as much as 2 hours of computation and approximately 3 Gigabits of data transmission. Similarly, for a real dataset consisting of approximately 1 million medical records, the encryption based approach takes approximately 4 hours of computation time and 8 Gbits of data transmission. Another problem with the data encryption approach is that finding the required tuples to execute the query over encrypted data is a significant challenge. In order to solve this problem, current proposals reveal some information about the content of the data to be used in filtering the required tuples [1], [2]. With a good filtration mechanism, the communication cost of retrieving data from service providers would be lower and thus the query response time would be much better. However, the quality of the filtration process strictly depends on the amount of information revealed to the service provider. Therefore, there is a privacy-performance tradeoff in these solutions. Finally, in order to execute range queries efficiently, order preserving data encryption mechanisms have been proposed [3]. However, it has also been argued that order preservation may weaken data security [5].

B. Private Information Retrieval

Interestingly, the problem of private information retrieval has been a research topic for more than a decade. The problem was first proposed in the context of a user accessing third-party data without revealing to the third party his/her exact interests [11]. An example scenario is an analyst who wants to retrieve information from an investor database made available by a certain company, but who would not like his/her future intentions to be exposed to that company. Although the original formulation of this problem was with respect to third-party data, it is very much applicable in the context of outsourced data. In particular, users may not want to reveal their queries to the service provider since this information can compromise the privacy of their behavior patterns.

Formally, private information retrieval is stated as follows: the database is modeled as a string $x$ of length $N$ held at a remote server, and the user wants to retrieve the bit $x_i$ for some $i$, without disclosing any information about $i$ to the server [11]. A trivial solution would be for the user to retrieve the entire database. A simple proof establishes that if we only have one server, the trivial solution is the best we can hope for [11]. A way to obtain sub-linear communication complexity is to replicate the database at several servers. It has been shown that with $k$ servers the communication complexity can be reduced to $O(N^{\frac{1}{2k-1}})$. There is a long history of theoretical research in this area and most solutions rely on the availability of multiple servers or multiple service providers [10], [11], [12], [13], [14], [15], [16].

The notion of private information retrieval, which ensures privacy of user queries, has also been extended to the case where the privacy of data is a concern. This is referred to as symmetric private information retrieval [27], [28], [29]. In a recent work by Sion and Carbunar [16], the authors have challenged the computational practicality of both private information retrieval and symmetric private information retrieval. Through extensive experimentation and evaluation, the authors established that private information retrieval protocols are several orders of magnitude slower than the trivial protocol of transferring the entire database to the client to ensure user privacy. Thus, in essence the authors doubt the viability of private information retrieval protocols in practice and state that alternative approaches are warranted to address the very real and practical problem of data outsourcing.

C. Information Distribution

In the area of computer and data security, there is an orthogonal approach which is based on information dispersal or distribution instead of encryption. Most of this work on security arose in the context of communicating a secret value from one party to another. Many approaches rely on encrypting the secret value using encryption keys and ensure that the information can only be revealed to a party that has the key. Shamir [30] proposed an orthogonal scheme that does not rely on the notion of keys. Instead, he proposed splitting the secret into n pieces such that the secret value can only be revealed if one has access to any k of these n pieces, k ≤ n. This scheme is shown to be information-theoretically secure as long as there is some guarantee that the adversary cannot access k pieces. Unlike encryption methods, Shamir’s secret sharing algorithm is computationally efficient. Our intent is to take a distributed approach towards secure data outsourcing in that we want to explore using a secret-sharing approach and multiple service providers. The advantage of this approach is that it addresses both data security and privacy-preserving querying of outsourced data.

In [31], [32], we proposed using Shamir’s secret sharing algorithm to execute privacy preserving operations among a set of distributed data warehouses. In [31], a middleware, Abacus, was developed to support selection, intersection and join operations. This approach was designed for data warehouses and took advantage of the dimension table in the star schema to hide information using inexpensive one-way hash functions. In [32], this work was generalized to any type of database, and used distributed third parties and Shamir’s secret sharing algorithm to secure information and support privacy preserving

selection, intersection, join and aggregation operations (such as SUM and MIN/MAX operations) on a set of distributed databases.

III. OVERVIEW OF THE PROPOSED APPROACH

In this section, we formulate the secure data outsourcing problem and then briefly discuss our proposed approach towards an effective solution. Our approach is to use data dispersion on multiple servers instead of data encryption to achieve data security and privacy of user queries. This approach is not only efficient, but also exploits the paradigm of Internet-scale computing by taking advantage of the large number of available resources. Assume that a client, which we refer to as the data source D, wants to outsource its data to eliminate its database maintenance cost by using the database service provided by database service providers $DAS_1, \ldots, DAS_n$. D needs to store and access its data remotely without revealing the content of the database to any of the database services.

For the sake of this discussion, assume D has a single table Employees, with employee names and salaries, in its database and stores Employees using the services provided by $DAS_1, \ldots, DAS_n$. After storing Employees, D needs to query Employees without revealing any information about either the content of the table or the queries. D can pose any of the following queries over time:

1) Exact match queries such as: Retrieve all information about employees whose name is ‘John’.
2) Range queries such as: Retrieve all information about employees whose salary is between 10K and 40K.
3) Aggregate queries such as MIN/MAX, MEDIAN, SUM and AVERAGE. Aggregates can be over ranges, for example the sum of the salaries of employees whose salary is between 10K and 40K. In addition, they can be over exact matches, such as the average of the salaries of all employees whose name is ‘John’.

We propose a complete approach to execute exact match, range, and aggregation queries in a privacy preserving manner. The goal is to build a practical scalable system and answer queries without revealing any information. Throughout the paper, we will assume that there are two kinds of attributes in tables, namely numeric attributes such as salary and non-numeric attributes such as name. We will contrast the work in [1], [2], referred to as data encryption, wherever appropriate, with our proposed technique so as to highlight the differences and compare them.

In our solution, data is divided into n shares and each share is stored at a different service provider. When a query is generated at a data source, it is rewritten, the relevant shares are retrieved from the service providers, and the query answer is reconstructed at the data source. In order to answer queries, any k of the service providers must be available. The main idea here is that the service providers are not able to infer anything about the content of the data they store, and still the data source is able to query its database by incurring reasonable communication and computation costs.

We start by discussing simple techniques of outsourcing a singleton numeric attribute such as salary in an idealized environment. Then, we generalize our techniques to deal with more complex operations, and propose extensions for non-numeric data. The solution is based on Shamir’s secret sharing method [30]. Data source D divides the numeric value $v_s$ into n shares and stores one share at each service provider, $DAS_1, DAS_2, \ldots, DAS_n$, such that knowledge of any k (k ≤ n) shares in addition to some secret information, X, known only to the data source is required to reconstruct the secret. Since even complete knowledge of k − 1 service providers cannot reveal any information about the secret even if X is known, this method is information theoretically secure [30].

In the secret sharing method, data source D chooses a random polynomial $q(x)$ of degree k − 1 where the constant term is the secret value, $v_s$, and secret information X, which is a set of n random points, each corresponding to one of the database service providers. Then, data source D computes the share of each service provider as $q(x_i)$, $x_i \in X$, and sends it to database service provider $DAS_i$. To reconstruct the secret value $v_s$, the data source retrieves shares from the service providers. The shares can be written as follows:

\[
\begin{aligned}
\mathrm{share}(v_s, 1) &= q(x_1) = a x_1^{k-1} + b x_1^{k-2} + \dots + v_s \\
\mathrm{share}(v_s, 2) &= q(x_2) = a x_2^{k-1} + b x_2^{k-2} + \dots + v_s \\
&\;\;\vdots \\
\mathrm{share}(v_s, n) &= q(x_n) = a x_n^{k-1} + b x_n^{k-2} + \dots + v_s
\end{aligned}
\]

The secret value can be reconstructed using any k of the above equations since there are k unknowns including the secret value $v_s$. In order to reconstruct the secret value $v_s$, any set of k service providers would need to share the information they have received, and they would also need to know the set of secret points, X, used by D. Since only data source D knows X, only it can reconstruct the secret after getting at least k shares from any k of the service providers.

[Figure 1: data source D holds the salaries 10, 20, 40, 60, 80 and uses the polynomials $q_{10}(x) = 100x + 10$, $q_{20}(x) = 5x + 20$, $q_{40}(x) = x + 40$, $q_{60}(x) = 2x + 60$, $q_{80}(x) = 4x + 80$ to compute the Salary share column sent to each of $DAS_1$, $DAS_2$ and $DAS_3$.]
Fig. 1. Demonstration of Example 1
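The share computation and reconstruction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`eval_poly`, `make_shares`, `reconstruct`) are hypothetical, arithmetic is done over the integers and rationals as in Example 1 (a textbook Shamir implementation would instead work modulo a prime), and the range used for the random coefficients is arbitrary. The polynomials and the points X = {2, 4, 1} are those of Example 1.

```python
from fractions import Fraction
import random


def eval_poly(coeffs, x):
    """Evaluate a polynomial given as [c_{k-1}, ..., c_1, c_0] at x (Horner's rule)."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def make_shares(secret, k, points, rng=random):
    """Split `secret` into len(points) shares with a random degree-(k-1) polynomial.

    The constant term is the secret. The other k-1 coefficients are random
    integers in 1..1000 (an arbitrary range chosen for this sketch).
    """
    coeffs = [rng.randint(1, 1000) for _ in range(k - 1)] + [secret]
    return [eval_poly(coeffs, x) for x in points]


def reconstruct(points, shares):
    """Recover q(0), the secret, from k (point, share) pairs by Lagrange interpolation."""
    secret = Fraction(0)
    for i, (xi, yi) in enumerate(zip(points, shares)):
        term = Fraction(yi)
        for j, xj in enumerate(points):
            if j != i:
                # Lagrange basis polynomial for point i, evaluated at x = 0.
                term *= Fraction(-xj, xi - xj)
        secret += term
    return secret


# Example 1: n = 3 providers, k = 2, secret points X known only to the data source.
X = [2, 4, 1]
polys = {10: [100, 10], 20: [5, 20], 40: [1, 40], 60: [2, 60], 80: [4, 80]}

# Column of shares stored at each provider DAS_1..DAS_3.
stored = {i: [eval_poly(c, x) for c in polys.values()] for i, x in enumerate(X, 1)}
print(stored)

# Any two providers' shares, together with their secret points, recover salary 10.
print(reconstruct([X[0], X[2]], [stored[1][0], stored[3][0]]))
```

Because reconstruction needs both the shares and the matching points in X, a provider holding only its own column of shares cannot run `reconstruct` on its own, which mirrors the role of the secret information X in the text.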

We illustrate the secret sharing model using a concrete example. Assume that data source D needs to outsource the salary attribute of an Employees table using 3 database service providers, $DAS_1$, $DAS_2$ and $DAS_3$. In order to do this, it chooses 5 random polynomials of degree one, one for each salary value in the table, whose constant term is the salary (n = 3 and k = 2). In addition, secret information X, $X = \{x_1 = 2, x_2 = 4, x_3 = 1\}$, is also chosen, with one point for each database service provider. Therefore, the polynomials would be $q_{10}(x) = 100x + 10$, $q_{20}(x) = 5x + 20$, $q_{40}(x) = x + 40$, $q_{60}(x) = 2x + 60$ and $q_{80}(x) = 4x + 80$ for salaries {10, 20, 40, 60, 80} respectively. Then, it sends $\{q_{10}(x_i), q_{20}(x_i), q_{40}(x_i), q_{60}(x_i), q_{80}(x_i)\}$ to service provider $DAS_i$ to store them. The data outsourcing step is illustrated in Figure 1. Note that neither the polynomials nor the salaries are stored at the service providers; Figure 1 shows the information stored at each service provider. When a query is initiated at the data source, it needs to retrieve the shares corresponding to all salaries from the service providers, i.e., $\{q_{10}(x_i), q_{20}(x_i), q_{40}(x_i), q_{60}(x_i), q_{80}(x_i)\}$ from $DAS_i$. After this, it needs to find the coefficients of each polynomial q and thus all secret salaries (note that receiving any k shares is enough for this since the polynomials are of degree k − 1). In our example, data source D needs to receive shares from any 2 of the service providers and uses the set of secret points X to compute the coefficients of polynomials $q_{10}$, $q_{20}$, $q_{40}$, $q_{60}$ and $q_{80}$, and thus the salaries 10, 20, 40, 60 and 80 can be retrieved.

IV. PRACTICAL SOLUTIONS FOR SECURE DATA OUTSOURCING

In this section, we extend the techniques developed in the previous section to only retrieve the required data from service providers, or a small superset, and thus reduce the computation and communication cost in query processing. In the simple solution described above, the database service providers are primarily used as storage servers. They do not play any role in the query processing itself and therefore the proposed approach is not practical. This will result in a large communication cost since the entire database needs to be retrieved from the service providers for every query. Furthermore, the data source itself will become a processing bottleneck since it needs to process all user queries. In fact, this is exactly the same overhead as in the case of data encryption with a single server. In the rest of the paper we use this idealized solution as a starting point and propose refinements to make the solution more practical and scalable.

We now propose an extension to the information dispersal method, which will allow the retrieval of only the required tuples from the service providers instead of a superset. The key observation to achieve this is that the order of the values in the domain $DOM = \{v_1, v_2, \ldots, v_N\}$ needs to remain the same in the shares of the service providers. In other words, if data source D needs to outsource secret values from domain DOM and $v_1 < v_2 < \ldots < v_N$, the shares of a service provider $DAS_i$, $\mathrm{share}(v_1, i), \mathrm{share}(v_2, i), \ldots, \mathrm{share}(v_N, i)$, derived from $v_1, v_2, \ldots, v_N$ respectively, need to preserve the order (i.e., $\mathrm{share}(v_1, i) < \mathrm{share}(v_2, i) < \ldots < \mathrm{share}(v_N, i)$). Since the order of the shares at the service provider is not preserved in the solution in Section III, database service providers cannot filter the data. However, if we had a mechanism to construct the polynomials calculating shares in an order preserving manner for a specific domain, then data source D could retrieve only the required tuples instead of a superset to answer a query. We now propose an order preserving polynomial building technique to achieve this goal. Without loss of generality, we will assume that polynomials are of degree 3 and in the following form: $ax^3 + bx^2 + cx + d$ (i.e., k = 4). Given any two secret values $v_1$ and $v_2$ from a domain DOM, we need to construct two polynomials $p_{v_1}(x) = a_1 x^3 + b_1 x^2 + c_1 x + v_1$ and $p_{v_2}(x) = a_2 x^3 + b_2 x^2 + c_2 x + v_2$ for these values such that $p_{v_1}(x) < p_{v_2}(x)$ at all points x if $v_1 < v_2$. The key observation for our solution is that $p_{v_1}(x) < p_{v_2}(x)$ for all positive x values if $a_1 < a_2$, $b_1 < b_2$, $c_1 < c_2$ and $v_1 < v_2$. We first present a straightforward approach to construct a set of order preserving polynomials and show why it is not secure. Then, we present a secure method for constructing such order preserving polynomials.

A straightforward method to form a set of order preserving polynomials for a specific domain is to use monotonically increasing functions of the secret values to determine the coefficients of the polynomials. In this scheme, we need three monotonically increasing functions $f_a$, $f_b$ and $f_c$ to find the coefficients of the polynomial $p_{v_s}(x) = ax^3 + bx^2 + cx + v_s$ which is used to divide the secret value $v_s$. The coefficients of the polynomial $p_{v_s}$ are the values of the monotonically increasing functions of the secret value $v_s$, where $a = f_a(v_s)$, $b = f_b(v_s)$ and $c = f_c(v_s)$. Therefore, for two secret values $v_1$ and $v_2$ ($v_1 < v_2$) and their respective polynomials $p_{v_1}(x) = f_a(v_1)x^3 + f_b(v_1)x^2 + f_c(v_1)x + v_1$ and $p_{v_2}(x) = f_a(v_2)x^3 + f_b(v_2)x^2 + f_c(v_2)x + v_2$, the value of $p_{v_1}(x)$ is always less than the value of $p_{v_2}(x)$ for all positive x values. Since any service provider $DAS_i$ gets the value of the polynomials at point $x_i$, the share coming from secret value $v_1$, $\mathrm{share}(v_1, i)$, would always be less than the share coming from the secret value $v_2$, $\mathrm{share}(v_2, i)$ (i.e., $p_{v_1}(x_i) < p_{v_2}(x_i)$).

However, this solution is not secure enough to hide secret values from the service providers. For example, assume the following monotonic functions are used: $f_a(v_s) = 3v_s + 10$, $f_b(v_s) = v_s + 27$ and $f_c(v_s) = 5v_s + 1$. Then, the share of service provider $DAS_i$ from secret value $v_1$ would be $p_{v_1}(x_i) = (3v_1 + 10)x_i^3 + (v_1 + 27)x_i^2 + (5v_1 + 1)x_i + v_1$, which is $p_{v_1}(x_i) = (3x_i^3 + x_i^2 + 5x_i + 1)v_1 + (10x_i^3 + 27x_i^2 + x_i)$. Basically, the secrets are multiplied by the same constant and then the same constant is added to compute the share of a service provider for all secret values. Therefore, if a service provider is able to break this method for one secret item, it can determine the complete set of the secret values.

Since the above approach to construct an order preserving polynomial is not secure, we propose another scheme to build order preserving polynomials for values from a specific domain. In particular, we propose a secure method using

different coefficients for each secret value so that service providers cannot know the relation
between secret values except the order. In polynomial construction, the coefficients $a$, $b$ and
$c$ are chosen from the domains $DOM_a$, $DOM_b$ and $DOM_c$. Since the coefficients can be real
numbers, the sizes of the coefficient domains are independent from the data domain size. For
finite domain $DOM = \{v_1, v_2, \ldots, v_N\}$, the domains $DOM_a$, $DOM_b$ and $DOM_c$ are
divided into $N$ equal sections. For example, $DOM_a$ is divided into $N$ slots:
$[1, \frac{|DOM_a|}{N}]$ for $v_1$, $[\frac{|DOM_a|}{N} + 1, 2\frac{|DOM_a|}{N}]$ for $v_2$, ...,
$[(N-1)\frac{|DOM_a|}{N} + 1, |DOM_a|]$ for $v_N$. After this, coefficient $a_{v_i}$ for value
$v_i$ is selected from the slot $[(i-1)\frac{|DOM_a|}{N} + 1, i\frac{|DOM_a|}{N}]$ with the help
of hash function $h_a$, which maps $v_i$ to a value from
$[(i-1)\frac{|DOM_a|}{N} + 1, i\frac{|DOM_a|}{N}]$. The other coefficients $b_{v_i}$ and
$c_{v_i}$ are computed similarly with the hash functions from domains $DOM_b$ and $DOM_c$.
Finally, the polynomial used to divide the secret value $v_i$ into shares would be
$p_{v_i}(x) = a_{v_i}x^3 + b_{v_i}x^2 + c_{v_i}x + v_i$.

We now consider the security of the proposed polynomial construction technique. Basically, we
discuss what a service provider can infer from the stored data and then show that it cannot know
the content of the data with the inferred information. From the stored data, service provider
$DAS_i$ can know an upper bound on the sum of the domain sizes (i.e.,
$|DOM| + |DOM_a| + |DOM_b| + |DOM_c|$). This can only happen when it stores the last secret value
from $DOM$ and the coefficients are mapped to the last slots of the domains for the last secret
value $v_N$ in the domain. Assume this worst case. Then, the polynomial for secret value $v_N$
would be $p_{v_N}(x) = |DOM_a|x^3 + |DOM_b|x^2 + |DOM_c|x + v_N$ and the share of $DAS_i$ would be
$share(v_N, i) = p_{v_N}(x_i) = |DOM_a|x_i^3 + |DOM_b|x_i^2 + |DOM_c|x_i + v_N$. From this share,
$DAS_i$ can only know an upper bound on the sum of the sizes of the domains, and that upper bound
is too loose to infer something about the content of the data. Therefore, we claim that the
database service providers can only know an upper bound on the sum of the domains from the stored
information. Furthermore, database service provider $DAS_i$ cannot know each domain size or the
exact value of the sum of the coefficient domain sizes even if it knows the secret point $x_i$,
in the worst case scenario described above. This is because the share of $DAS_i$,
$share(v_N, i) = p_{v_N}(x_i) = |DOM_a|x_i^3 + |DOM_b|x_i^2 + |DOM_c|x_i + v_N$, is a single
equation with four unknowns, $|DOM_a|$, $|DOM_b|$, $|DOM_c|$ and $v_N$ (assuming $x_i$ is known).
Thus, these unknowns cannot be found.

In addition to the security guarantees, the construction preserves order. For any two secret
values $v_i$ and $v_j$ from the same domain, service provider $DAS_i$ will get its shares
$share(v_i, i) = p_{v_i}(x_i)$ (share of $DAS_i$ from $v_i$) and $share(v_j, i) = p_{v_j}(x_i)$
(share of $DAS_i$ from $v_j$). If $v_i < v_j$ then $share(v_i, i) < share(v_j, i)$ due to the
polynomial construction method.

V. RESEARCH DIRECTIONS & CHALLENGES

A. Query Processing

In this section, we briefly discuss how different queries are processed in the secret sharing
data outsourcing framework. We consider exact match, range, aggregation and join queries. In each
case, we will illustrate how the encryption scheme solves the problem and contrast it with our
approach.

Exact Match Queries. An example of an Exact Match Query would be “retrieve the names of all
employees whose salary is 20”. In the data encryption model [2], [1], all the attributes are
encrypted and the service provider only knows the encrypted values. Queries are formulated in
terms of the encrypted values, and tuples corresponding to those encrypted values are retrieved.
In the secret sharing model, data source D needs to retrieve shares from the service providers.
Therefore, it rewrites the query into k queries, one for each service provider. For example, the
rewritten query for $DAS_i$ would be: Retrieve the tuples of all employees whose salary is
$share(20, i)$, where $share(20, i)$ is the share of service provider $DAS_i$ for the secret
value 20. In order to find $share(20, i)$, data source D first constructs¹ the polynomial for
secret item 20, $p_{20}(x)$, and then it computes the shares, $share(20, i) = p_{20}(x_i)$. After
retrieving the corresponding tuples from the service providers, data source D computes the secret
values.

Range Queries. In order to process range queries in the encryption model [2], [1], labels are
associated with the encrypted tuples. These labels indicate the range of the particular encrypted
values. The labels are used to find a superset of the answer in the data encryption method. In
order to answer the same query in the secret sharing method, data source D rewrites the query
into n queries (one for each service provider). For example, the query sent to service provider
$DAS_i$ is: All employees whose salaries are between $share(20, i)$ and $share(50, i)$. In order
to compute the shares $share(20, i)$ and $share(50, i)$, two order preserving polynomials,
$p_{20}(x)$ and $p_{50}(x)$, are constructed ($share(20, i) = p_{20}(x_i)$ and
$share(50, i) = p_{50}(x_i)$). Service provider $DAS_i$ then sends the tuples of all employees
whose salaries are between $share(20, i)$ and $share(50, i)$. Since we have an order preserving
polynomial construction technique for the domain, $DAS_i$ can send only the required tuples.
After getting this information from the service providers, data source D executes the query by
solving the polynomials. Therefore, computation and communication are performed only for those
tuples which are required to answer the query. However, the secret sharing scheme does need to
communicate with multiple service providers. In return for this overhead, the scheme offers
greater fault tolerance and data availability in the presence of failures. Future work entails a
detailed performance evaluation to determine the computation versus communication trade-off
under the two models.

¹ Note that the polynomials are not stored at the data source, which would amount to storing the
entire data itself. Instead, the polynomials are generated as part of the front-end
query-processing at the data source.
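The polynomial construction and the exact match and range query rewriting described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `DOMAIN`, `COEFF_DOMAIN_SIZE` and `SECRET_POINTS` are hypothetical, the coefficients are integers rather than real numbers, an unkeyed SHA-256 digest stands in for the hash functions $h_a$, $h_b$ and $h_c$, the secret points $x_i$ are assumed to be positive (which the order argument needs), and four providers are used because a cubic is determined by four points.

```python
import hashlib
from fractions import Fraction

# Hypothetical parameters, chosen only for illustration.
DOMAIN = list(range(1, 101))   # data domain v_1 < ... < v_N
COEFF_DOMAIN_SIZE = 10**6      # |DOM_a| = |DOM_b| = |DOM_c|
SECRET_POINTS = [2, 3, 5, 7]   # one positive secret point x_i per provider


def coefficient(value, label):
    """Pick the coefficient for `value` from its slot of one coefficient domain."""
    rank = DOMAIN.index(value) + 1                  # 1-based position i of v_i in DOM
    slot_width = COEFF_DOMAIN_SIZE // len(DOMAIN)   # |DOM_a| / N
    low = (rank - 1) * slot_width + 1               # slot is [low, rank * slot_width]
    # Assumption: SHA-256 plays the role of the hash function h_a / h_b / h_c.
    digest = hashlib.sha256(f"{label}:{value}".encode()).digest()
    return low + int.from_bytes(digest, "big") % slot_width


def share(value, x):
    """Evaluate p_v(x) = a_v x^3 + b_v x^2 + c_v x + v at a provider's point x."""
    a, b, c = (coefficient(value, label) for label in "abc")
    return a * x**3 + b * x**2 + c * x + value


def reconstruct(points):
    """Recover the secret p_v(0) from four (x_i, share) pairs by Lagrange interpolation."""
    secret = Fraction(0)
    for j, (xj, yj) in enumerate(points):
        term = Fraction(yj)
        for m, (xm, _) in enumerate(points):
            if m != j:
                term *= Fraction(-xm, xj - xm)
        secret += term
    return int(secret)


def range_query(stored_shares, low, high, x):
    """Provider-side filter: a plain comparison on shares, no secrets involved."""
    lo, hi = share(low, x), share(high, x)   # computed by the data source and sent over
    return [s for s in stored_shares if lo <= s <= hi]


# Usage: outsource five salaries, then ask every provider for the range [20, 50].
salaries = [15, 20, 35, 50, 80]
stored = {x: [share(s, x) for s in salaries] for x in SECRET_POINTS}
answers = [range_query(stored[x], 20, 50, x) for x in SECRET_POINTS]
values = [reconstruct(list(zip(SECRET_POINTS, t))) for t in zip(*answers)]
print(values)  # [20, 35, 50]
```

Because the slot of a smaller value lies entirely below the slot of a larger one, every coefficient and the constant term grow together, so the shares at any positive point keep the order of the secrets. An exact match query is the special case where the provider compares against a single share.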
Aggregation Queries. We consider Sum/Average and Min/Max/Median aggregation queries. We classify
aggregation queries into two classes: 1) aggregations over exact matches and 2) aggregations over
ranges. We illustrate aggregation query processing techniques with the following example queries:
• Sum/Average of the salaries of the employees whose name is “John” (Sum/Average over Exact
  Match).
• Sum/Average of the salaries of the employees whose salary is between 20 and 40 (Sum/Average
  over Ranges).
• Min/Max/Median of the salaries of the employees whose name is “John” (Min/Max/Median over
  Exact Match).
• Min/Max/Median of the salaries of the employees whose salary is between 20 and 40
  (Min/Max/Median over Ranges).

Query execution consists of two steps. In the first step, service providers receive the rewritten
queries from the data source and perform an intermediate computation. In the second step, the
data source receives the intermediate results from all of the service providers and computes the
final answer. The above queries are rewritten as follows and sent to the service provider
$DAS_i$:
• Sum/Average of the salaries of the employees whose name is $share(\text{'John'}, i)$.
• Sum/Average of the salaries of the employees whose salary is between $share(20, i)$ and
  $share(40, i)$.
• Min/Max/Median of the salaries of the employees whose name is $share(\text{'John'}, i)$.
• Min/Max/Median of the salaries of the employees whose salary is between $share(20, i)$ and
  $share(40, i)$.
Then, $DAS_i$ finds the tuples needed to answer these queries and performs an intermediate
computation over them. After getting all of these intermediate results, data source D computes
the final answer.

Join Operations. So far in our development, we assumed that data sources have only one table for
the sake of the presentation and thus did not consider join operations involving multiple tables.
In order to provide database management as a service, it is necessary to support join queries
over multiple tables. We now demonstrate that the proposed technique can be applied if these
tables are related to each other through referential keys and the join is based on these keys.
Consider a simple schema consisting of two tables, Employees(EID, Name, Lastname, Department,
Salary) and Managers(EID, ManagerID, ManagerUserName, Password). A possible query may ask for the
salaries of all managers. To execute this query, these two tables should be joined using the
attribute EID. Our scheme can be directly applied to execute this query since the join is based
on two attributes which are from the same domain, and our polynomials are constructed for each
domain, not for each attribute. Therefore, this join can be done by the service provider at the
service provider site. However, if a join is based on two attributes from different domains, such
as Name and ManagerUserName, then the proposed approach cannot be used. Thus, the query asking
for the salaries of the managers whose name is the same as the ManagerUserName cannot be answered
with the proposed scheme. Such extensions and generalizations are the subject of future research
and development challenges.

B. Different Types of Data

We have considered only numeric attributes so far and the proposed technique is for numeric
attributes. We now briefly explore more complex data types. Of particular interest will be how to
represent non-numeric data, and potentially compressed data. We illustrate a simple approach for
demonstrating how to apply our scheme to non-numeric attributes. In particular, we need to
convert them to numeric attributes. For example, the attribute name, with a length of 5
characters (i.e., VARCHAR(5)), can be represented as a numeric attribute although it is in fact a
non-numeric attribute. For the sake of this discussion, assume that each character in a name is
one of the letters of the English alphabet and that names can be shorter than 5 characters. Thus,
the regular expression for this attribute is $(A|B|\ldots|Z|*)^5$, where $*$ represents blank.
For instance, name “ABC” is rewritten as “ABC**” while name “FATIH” is rewritten as “FATIH”
(because it already has 5 characters). The name attribute consists of a combination of 27
possible characters which are enumerated ($* = 0$, $A = 1$, $B = 2$, $C = 3$, ..., $Z = 26$), and
thus each name can represent a number in a number system of base 27. For example, name “ABC**”
can be rewritten as $(12300)_{27}$, which corresponds to
$1 \cdot 27^4 + 2 \cdot 27^3 + 3 \cdot 27^2 = 572994$ in decimal. With this simple enumeration
technique, non-numeric attributes can be converted into numeric attributes and then the proposed
outsourcing technique can be applied directly. With the proposed enumeration technique, widely
used queries over non-numeric attributes can be handled easily. For example, a query asking for
employees whose name starts with “AB” or a query asking for employees whose name is between
“Albert” and “Jack” can be converted into range queries and executed with the range query
processing technique discussed earlier.

C. Database Updates

Although our discussion so far has focused on database queries, we would like to note that the
proposed techniques are also applicable to database updates. An update would involve retrieving
the shares corresponding to the tuples that need to be updated by using the querying approaches
discussed above. The actual values for the tuples are reconstructed at the client, and the
relevant updates are performed. A new polynomial is constructed and the new shares are
distributed to all the service providers. Alternatively, lazy update approaches could be
incorporated; lazy updates, as well as incremental updating of values that might reduce the
communication overhead, are some of the possible directions for future work.

D. Management of Private and Public Data

Once data has been stored at a database service provider, a user may not only want to query
his/her own private data, but possibly some public data provided by the database service
provider. In fact, the database service provider can thus provide
a value added incentive: not only storing the client’s private data, but also providing seamless
access and integration with a large repository of public data. For example, a client’s data may
contain her private collection of friends, including information such as phone numbers,
addresses, etc. The server may have a database of restaurants and their addresses. The client can
exploit the public data to request restaurants that are close to a friend’s house, without
revealing any private information about the friend, i.e., their name, address, phone number, etc.
The combining or mash-up of public and private data is especially pertinent in the context of
applications arising from national security. Consider the case of an agency such as the FBI that
tracks suspicious individuals. Now consider many other public/private agencies such as the TSA
that need to correlate the travelers at San Francisco International Airport with the FBI List.
This example illustrates the need for secure mash-up of public and private data. This problem has
recently been addressed in the more limited context of private data stored in a smart USB key
[25]. Exploration of the problem of executing database queries on both private and public data in
the context of data outsourcing at database service providers remains a formidable challenge.

VI. CONCLUDING REMARKS

In this paper, we present scalable, secure and privacy-preserving algorithms for data
outsourcing. Instead of encryption, which is computationally expensive, we use distribution on
multiple data provider sites and information theoretically proven secret-sharing algorithms as
the basis for privacy preserving outsourcing. The research is timely due to the ever increasing
private and public data being generated. Also, the easy accessibility of data providers on the
web makes this paradigm attractive and scalable. Finally, the scientific challenges we have
identified are (a) efficient algorithms for the secure and private execution of different types
of database operations, e.g., intersection, join, aggregation, etc.; (b) exploration of different
failure models and the development of algorithms for both benign and malicious environments; and
(c) algorithms for the management and querying of both private and public data.

ACKNOWLEDGMENT

This research was partially supported by the UC Discovery Grant com-dig05-10189.

REFERENCES

[1] H. Hacigumus, B. R. Iyer, C. Li, and S. Mehrotra, “Executing SQL over encrypted data in the database service provider model,” in Proc. of the ACM SIGMOD Conf., 2002.
[2] B. Hore, S. Mehrotra, and G. Tsudik, “A privacy-preserving index for range queries,” in Proc. of the VLDB Conf., 2004, pp. 720–731.
[3] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preserving encryption for numeric data,” in Proc. of the ACM SIGMOD Conf., 2004, pp. 563–574.
[4] G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, and Y. Xu, “Two can keep a secret: A distributed architecture for secure database services,” in CIDR, 2005, pp. 186–199.
[5] M. Kantarcioglu and C. Clifton, “Security issues in querying encrypted data,” in Proc. of the IFIP Conference on Database and Applications Security, 2005.
[6] B. Hore, S. Mehrotra, and G. Tsudik, “A privacy-preserving index for range queries,” in Proc. of the International Conference on Very Large Data Bases, 2004.
[7] J. Li and R. Omiecinski, “Efficiency and security trade-off in supporting range queries on encrypted databases,” in Proc. of the IFIP Conference on Database and Applications Security, 2005.
[8] E. Shmueli, R. Waisenberg, Y. Elovici, and E. Gudes, “Designing secure indexes for encrypted databases,” in Proc. of the IFIP Conference on Database and Applications Security, 2005.
[9] Z. Yang, S. Zhong, and R. Wright, “Privacy-preserving queries on encrypted data,” in Proc. of the 11th European Symposium on Research in Computer Security, 2006.
[10] E. Kushilevitz and R. Ostrovsky, “Replication is not needed: Single database, computationally-private information retrieval,” in Proc. of the FOCS, 1997.
[11] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private information retrieval,” Journal of the ACM, vol. 45, no. 6, pp. 965–982, 1998.
[12] J. Stern, “A new and efficient all-or-nothing disclosure of secrets protocol,” in Proc. of ASIACRYPT, 1998.
[13] E. Kushilevitz and R. Ostrovsky, “One-way trapdoor permutations are sufficient for non-trivial single-server private information retrieval,” in Proc. of the EUROCRYPT, 2000.
[14] C. Cachin, S. Micali, and M. Stadler, “Computationally private information retrieval with polylogarithmic communication,” in Proc. of the EUROCRYPT, 1999.
[15] Y. Chang, “Single database private information retrieval with logarithmic communication,” 2004.
[16] R. Sion and B. Carbunar, “On the computational practicality of private information retrieval,” in Proc. of the Networks and Distributed Systems Security, 2007.
[17] P. Devanbu, M. Gertz, C. Martel, and S. Stubblebine, “Authentic third-party data publication,” in Proc. of the IFIP Workshop on Database Security, 2000.
[18] E. Mykletun, M. Narasimha, and G. Tsudik, “Authentication and integrity in outsourced databases,” in Proc. of the ISOC Symposium on Network and Distributed Systems Security, 2004.
[19] R. Sion, “Query execution assurance for outsourced database,” in Proc. of VLDB Conf., 2005.
[20] H. Pang, A. Jain, K. Ramamritham, and K. Tan, “Verifying completeness of relational query results in data publishing,” in Proc. of the ACM SIGMOD Conf., 2005.
[21] M. Narasimha and G. Tsudik, “Authentication of outsourced database using signature aggregation and chaining,” in Proc. of DASFAA, 2006.
[22] R. Sion, “Secure data outsourcing,” in Proc. of the VLDB Conf., 2007, pp. 1431–1432.
[23] T. Ge and S. B. Zdonik, “Answering aggregation queries in a secure system model,” in Proc. of the VLDB Conf., 2007, pp. 519–530.
[24] S. D. C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati, “Over-encryption: Management of access control evolution on outsourced data,” in Proc. of the VLDB Conf., 2007, pp. 123–134.
[25] N. Anciaux, M. Benzine, L. Bouganim, P. Pucheral, and D. Shasha, “Ghostdb: querying visible and hidden data without leaks,” in Proc. of the ACM SIGMOD Conf., 2007, pp. 677–688.
[26] R. Agrawal, A. Evfimievski, and R. Srikant, “Information sharing across private databases,” in Proc. of the ACM SIGMOD Conf., 2003, pp. 86–97.
[27] R. Ostrovsky and V. Shoup, “Private Information Storage,” in Proc. of the STOC, 1997.
[28] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, “Protecting data privacy in private information retrieval schemes,” in Proc. of the STOC, 1998, pp. 151–160.
[29] M. Naor and B. Pinkas, “Oblivious transfer and polynomial evaluation,” in Proc. of the STOC, 1999, pp. 245–254.
[30] A. Shamir, “How to share a secret,” Commun. ACM, vol. 22, no. 11, pp. 612–613, 1979.
[31] F. Emekci, D. Agrawal, and A. E. Abbadi, “Abacus: A distributed middleware for privacy preserving data sharing across private data warehouses,” in ACM/IFIP/USENIX 6th International Middleware Conference, 2005.
[32] F. Emekçi, D. Agrawal, A. El Abbadi, and A. Gulbeden, “Privacy preserving query processing using third parties,” in ICDE, 2006, p. 27.
