Parate 08 Trace Pub
Abhinav Parate, Gerome Miklau University of Massachusetts, Amherst Department of Computer Science 140 Governors Drive, Amherst, MA
Introduction
Sharing network traces across institutions is critical for advancing network research and protecting against cyberinfrastructure attacks. But the public release of network traces remains highly constrained by privacy and security concerns [5, 19]. Traces contain highly sensitive information about users in the network, proprietary applications running in enterprises, as well as network topology and other network-sensitive information that could be used to aid a cyber-attack.
The most common approach to enabling secure trace analysis is through anonymizing transformations. The original trace is transformed by removing sensitive content and obscuring sensitive fields, and the result is released to the public. The appeal of trace anonymization is that the trace owner can generate a single anonymized trace, publish it, and analysts can perform computations on the trace independently of the trace owner.
A number of anonymizing transformations have been proposed for network traces, with IP packet traces receiving the most attention. Proposed anonymization techniques include tcpurify [3], the tcpdpriv [2] and Crypto-PAn [8] tools (which can preserve prefix relationships among anonymized IP addresses), as well as frameworks for defining transformations, such as tcpmkpub [15]. Unfortunately, reliable standards for the privacy and utility of these transformations have been elusive. A number of recent attacks on trace transformation techniques have been demonstrated by researchers [18, 7, 12, 4, 6], and the research community is still actively pursuing metrics for trace privacy that can guide trace owners [18, 7]. In addition, the utility of anonymized traces to analysts has received less attention than privacy, and the benefits of improved anonymizing transformations sometimes come at the cost of usefulness to analysts of the published trace. In this paper we address the problem faced by a trace owner who wishes to allow a group of independent analysts to study an IP-level network trace.
Our trace publication framework allows the trace owner to anonymize a trace for the needs of a particular analysis, or a related set of analyses. The published traces can be more secure because they provide only the needed information, omitting everything else. In addition, we provide procedures for formally verifying that a published trace meets utility and privacy objectives.
Figure 1: The proposed trace protection framework: the original trace T may be transformed in multiple ways (T1, T2, T3, T23) to support the requirements of different analysts.
Our publication framework is illustrated informally in Figure 1. The figure shows an original trace T transformed in four different ways, for use by different analysts. Trace T1 contains sufficient information for both analysts A and B. Trace T2 is devised for use exclusively by analyst C, and trace T3 is customized for the needs of analyst D. An alternative to publishing both traces T2 and T3 is to derive the single trace T23, which can support analysts C and D simultaneously.
The goal of this paper is not to propose a novel anonymizing transformation for network traces. Instead, our goal is to create a framework in which basic trace transformation operations can be applied with a precise, formal understanding of their impact on trace utility and privacy. Our framework consists of the following:
Formal transformations. A set of simple, formally defined transformation operators that are applied to the trace to remove or obscure information. These include encryption, field removal, domain translation, etc. Transformation operators can be combined to form composite transformations. The output of the chosen composite transformation is published (along with the description of the transformation).
Input from the analyst. We assume the requesting trace analyst provides a description of the information needed for analysis. We propose a simple language for utility constraints which express the need for certain relationships to hold between the original trace and the published trace. The constraints can require, for example, that certain fields are present and unmodified, or that other values can be obscured as long as they preserve the ordering relationship present in the original trace. It is usually straightforward to determine the constraints that are needed to support a particular analysis; we provide examples of recent network studies of IP traces and their utility requirements in Section 2 and Appendix A.1.
Formally evaluating privacy and utility. Because both the transformations and the utility requirements of analysts are specified formally, it is possible for the trace owner to: (i) decide whether a composite trace transformation satisfies an analyst's requirements, implying that the desired analysis can be carried out on the transformed trace with perfect utility; (ii) compute the most secure transform satisfying a given set of analyst requirements; and (iii) compare the security of transforms, or analyze the impact of a collusion attack that might allow published traces to be combined.
Publishing multiple traces. Because our trace transformations are often customized to the needs of individual analyses, it is worth comparing our framework with an alternative to trace anonymization recently proposed by Mogul et al. [14]. In that work, the authors propose an execution framework in which analysts submit code to the trace owner. The trace owner executes the code locally on the trace and publishes only the program output to the trace analyst. The principal challenge is verifying whether the program output can be safely released without violating privacy. In our framework, the analyst submits a set of utility constraints, not a general-purpose program. Therefore, the trace owner does not bear the significant burden of evaluating the safety of a program or the danger of covert channels. In addition, we still publish a transformed trace, which both relieves the trace owner of the need to provide computational resources and allows the analyst to refine their computations on the trace. We believe that our utility constraints are therefore the right methodology for supporting a trace publisher in customizing trace transformation to the needs of analysts.
The advantages of our framework do entail some challenges for the trace owner. Compared with conventional trace anonymization, the trace owner in our framework must make more fine-grained choices about which transformed traces to publish to which users, and must compute and publish multiple anonymized traces instead of just one. We believe that the benefits to trace security warrant this effort. Our transformation operators are efficient to apply, and we provide a number of tools to help the trace owner make publication decisions. In addition, the trace owner can choose to publish a single trace supporting multiple analysts. For example, in Figure 1, publishing trace T23 may be easier than publishing both T2 and T3, but may require a sacrifice in privacy as, intuitively, Analyst C will receive some additional information used in the trace analysis of Analyst D.
An additional concern with publishing multiple traces is attacks in which a single party poses as two or more analysts, or two or more analysts collude. In this case different published views of the trace could be combined to reveal more information than intended. In the absence of a trustworthy authority validating identities, it is a challenge to counter such attacks. We can, however, analyze the risk of collusion formally: in Section 4 we show that it is possible to bound how much a group of colluding parties can learn, distinguishing between cases where collusion is a serious risk and cases where little can be gained from collusion.
The remainder of the paper is organized as follows. In Section 2, we describe a case study in which we apply the main components of our framework to a trace analysis of TCP connection characteristics. Section 3 describes our trace transformation operators and our language for specifying analyst requirements. Section 4 details the formal steps in utility analysis and the computation of the most secure transformation for an analysis. Section 5 measures the privacy of sample transformations quantitatively through experiments on a real network trace. We discuss related work in Section 6 and conclude in Section 7.
In this section, we provide an overview of our framework by describing its use in enabling a real study of TCP connection characteristics carried out by Jaiswal et al. [10]. First, we explain the analysis researchers wish to perform and derive the basic requirements that must be satisfied by any usable trace. Then we describe an anonymizing transformation, and finally, we verify that the transformation satisfies the requirements and give a brief statement on the privacy of the network trace.
Analysis Description. A TCP connection is identified by two IP addresses (ip1, ip2) and two ports (pt1, pt2), corresponding to the sender and receiver. Jaiswal et al. study the characteristics of TCP connections through passive monitoring [10]. Their study focuses on measuring the sender's congestion window (cwnd) and the connection round-trip time (RTT). In this analysis, the congestion window of a connection is estimated using a finite state machine (FSM). The transitions in this FSM are triggered by receiver-to-sender ACKs or by out-of-sequence packet retransmissions. The FSM processes the packets of a connection in the order in which they were observed at the observation point, O. The RTT is estimated indirectly, by adding the trip time of a packet from point O to the receiver and back to O to the trip time from O to the sender and back to O. The full details of this estimation can be found in [10].
Utility Requirements. Based on this description, we present the requirements, that is, the sufficient conditions that must be satisfied by any transformed trace supporting the analysis described.
1. R1 The trace must include the type of the packet (SYN or ACK).
2. R2 The trace must allow the analyst to identify whether any two given records belong to the same connection.
3. R3 The trace must allow the analyst to order packets in the same connection by sequence number (seq_no) or timestamp (ts).
4. R4 The trace should preserve the difference between ts values for packets in the same connection.
5. R5 The trace should preserve the difference between seq_no values for packets in the same connection.
6. R6 The sequence numbers of the sender's packets and the acknowledgement numbers (ack_no) of the receiver's packets in the same connection should be comparable for equality in the trace.
Table 1: A formal description of utility requirements sufficient to support the example analysis of TCP connection properties.
Formal Utility Requirements
Any(t) ⇒ t.syn = φ(t).syn
Any(t) ⇒ t.ack = φ(t).ack
Any(t1, t2) ⇒ ((t1.ip1 == t2.ip1) && (t1.ip2 == t2.ip2) && (t1.pt1 == t2.pt1) && (t1.pt2 == t2.pt2)) = ((φ(t1).ip1 == φ(t2).ip1) && (φ(t1).ip2 == φ(t2).ip2) && (φ(t1).pt1 == φ(t2).pt1) && (φ(t1).pt2 == φ(t2).pt2))
Same-Conn(t1, t2) ⇒ (t1.seq_no ≤ t2.seq_no) = (φ(t1).seq_no ≤ φ(t2).seq_no)
Same-Conn(t1, t2) ⇒ (t1.ts ≤ t2.ts) = (φ(t1).ts ≤ φ(t2).ts)
Same-Conn(t1, t2) ⇒ (t1.seq_no − t2.seq_no) = (φ(t1).seq_no − φ(t2).seq_no)
Same-Conn(t1, t2) ⇒ (t1.ts − t2.ts) = (φ(t1).ts − φ(t2).ts)
Opp-Pckts(t1, t2) ⇒ (t1.seq_no == t2.ack_no) = (φ(t1).seq_no == φ(t2).ack_no)
Any(t) ⇒ t.window = φ(t).window
Any(t) ⇒ t.dir = φ(t).dir
Qualifiers
Any(t) := { }
Any(t1, t2) := { }
Same-Conn(t1, t2) := {(t1.ip1 == t2.ip1), (t1.ip2 == t2.ip2), (t1.pt1 == t2.pt1), (t1.pt2 == t2.pt2)}
Opp-Pckts(t1, t2) := {(t1.ip1 == t2.ip1), (t1.ip2 == t2.ip2), (t1.pt1 == t2.pt1), (t1.pt2 == t2.pt2), (t1.dir != t2.dir)}
7. R7 The actual value of the TCP field window should be present in the transformed trace.
The above requirements are specified formally as a set of constraints given in Table 1. The formal requirements are constraints stating that certain relationships must hold between the original trace and the anonymized trace. The full description of our specification language is given in Section 3.2.
Trace Anonymization. Next we describe a simple transformation that provably satisfies the above utility requirements. For this transformation, we concatenate the fields (ip1, ip2) and encrypt the concatenated string to obtain the anonymized (ip1, ip2) fields. Similarly, we obtain the anonymized (pt1, pt2) by concatenating (pt1, pt2, ip1, ip2) and encrypting the result. This maps the same (pt1, pt2) values to different values if they come from different connections. The fields seq_no and ack_no are anonymized by linear translation within each connection such that the minimum value in these fields becomes 0. For example, the values (150, 165, 170) are linearly translated to (0, 15, 20). The field ts is anonymized similarly. We do not anonymize any other field. Finally, we remove any field that is not required in the analysis. In Section 3.1, we provide a basic set of formal transformation operators. The anonymization scheme described above can be expressed as a composite function of the formal transformations on individual fields. This composite transformation function is given by:
$$\phi = \Pi_X \circ E_{\{ip1,ip2\},\kappa_1} \circ E_{\{pt1,pt2\}(ip1,ip2),\kappa_2} \circ T_{\{ts\}(C)} \circ T_{\{seq\_no,ack\_no\}(C)} \circ I_{\{dir,window,syn,ack\}}$$
Here $C = \{ip1, ip2, pt1, pt2\}$, $E$ is the encryption operator, $T$ is the translation operator, $\Pi$ is the projection operator, $I$ is the identity operator, $X$ is the set of required attributes, and $\kappa_1$, $\kappa_2$ are the keys for the encryption function. In Table 2, a simplified example of a network trace is given. The records in this table are transformed using the above transformation function $\phi$ to obtain the anonymized view given in Table 3. The encrypted values have been replaced by variables for clarity.
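The translation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `translate`, the record layout (one dictionary per packet), and the choice of the per-connection shift as the smallest value seen in either target field are assumptions made for illustration. The sketch reproduces the (150, 165, 170) to (0, 15, 20) example and shows that using one shift for both seq_no and ack_no keeps them comparable for equality (R6).

```python
CONN = ("ip1", "ip2", "pt1", "pt2")  # fields that identify a connection


def translate(records, targets, group_by=CONN):
    """Shift the target fields of each group so that the group's minimum becomes 0."""
    # One shift per group: the smallest value seen in any target field (assumption).
    shift = {}
    for r in records:
        key = tuple(r[g] for g in group_by)
        low = min(r[f] for f in targets)
        shift[key] = min(low, shift.get(key, low))
    # Subtract the group's shift from every target field; other fields are copied.
    return [
        {**r, **{f: r[f] - shift[tuple(r[g] for g in group_by)] for f in targets}}
        for r in records
    ]


conn_a = {"ip1": "10.0.0.1", "ip2": "10.0.0.2", "pt1": 4321, "pt2": 80}
conn_b = {"ip1": "10.0.0.3", "ip2": "10.0.0.2", "pt1": 5555, "pt2": 80}
trace = [
    {**conn_a, "dir": "fwd", "seq_no": 150, "ack_no": 400},
    {**conn_a, "dir": "fwd", "seq_no": 165, "ack_no": 400},
    {**conn_a, "dir": "rev", "seq_no": 400, "ack_no": 165},
    {**conn_a, "dir": "fwd", "seq_no": 170, "ack_no": 401},
    {**conn_b, "dir": "fwd", "seq_no": 7000, "ack_no": 9000},
    {**conn_b, "dir": "fwd", "seq_no": 7010, "ack_no": 9000},
]

anon = translate(trace, targets=("seq_no", "ack_no"))
fwd_a = [r["seq_no"] for r in anon[:4] if r["dir"] == "fwd"]
print(fwd_a)                                     # [0, 15, 20]
print(anon[2]["ack_no"] == anon[1]["seq_no"])    # True: the ACK still matches
```

Because the shift depends on the connection, the second connection is translated independently and its sequence numbers also start at 0.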
Provable Utility. The utility analysis verifies that the anonymization scheme, defined by the composite transformation function, satisfies the constraints. Informally, as we do not anonymize the syn and ack bits in the trace, the type information of the packet is available, satisfying R1. The encryption of connection fields still supports grouping together records in the same connection (R2). The linear translation of ts and seq_no preserves the relative order (R3) and the relative differences in these fields (R4, R5). By using the same transformation for seq_no and ack_no, we make sure that R6 is satisfied, allowing equality tests on these fields. R7 is satisfied as the field window is not anonymized. The formal verification process requires a formal description of the requirements and of the anonymization scheme; it is described in detail in Section 4.1.
Privacy Analysis. Any anonymized view of the trace must be checked for the potential leakage of sensitive information about the users and the network involved. We describe a formal measure for evaluating the trace in Section 5. If this measure indicates a risk of information leakage, the publisher must decide whether to release the view. Even without the measure, we can make some statements on privacy based on the transformations chosen. In this example, the IP addresses and ports have been encrypted together. As a result, the information about individual hosts is lost. The frequency analysis of individual attributes such as IP addresses, which has been key to many de-anonymizing attacks, is not possible. Also, a simple transformation such as the translation of timestamps can have a significant impact on privacy and security. Due to the translation, every connection in the anonymized trace starts at time 0 and hence inter-arrival information of packets across connections is lost. These translations of ts, seq_no, and ack_no make a fingerprinting attack difficult for the adversary. The removal of fields that are not required prevents leakage of any undesired fingerprints. Yet some of the connections and users may have unique features, which must be identified using the measure in Section 5 before the publisher decides whether to release the view.
In this section we describe the two main objects of our framework: operators, used by the trace owner to define trace transformations, and constraints, used by analysts to express utility requirements.
3.1
The following transformation operators are applied to a network trace in order to obscure, remove, or translate field values. Each transformation operator removes information from the trace, making it more difficult for an adversary to attack, but also less useful for analysts. The trace owner may combine individual operators to form composite transformations, balancing utility and security considerations. The output of a composite transformation will be released to the analyst.
Operator descriptions. We consider a network trace as a table consisting of records representing IP packets. Each record contains packet header fields and a timestamp field, but does not include the packet payload.
Projection. The simplest operator is projection, which removes from the input trace one or more designated fields. The term projection is a reference to the relational algebra operator of the same name. Projection is denoted $\Pi_X$, where $X$ is a set of fields to be retained; all other fields are eliminated in the output trace.
Encryption. The encryption operator hides target fields by applying a symmetric encryption function to one or more fields. The encryption operator is denoted $E_{X(Y),\kappa}$, where $X$ is the set of target fields to be encrypted, $Y$ is an optional set of grouping attributes for encryption, and $\kappa$ is a secret encryption key. The encryption operation is applied as follows. For each record in the trace, the values of attributes from set $X$ are concatenated with the values of attributes from set $Y$ to form a string. The string is appropriately padded and then encrypted under a symmetric encryption algorithm using $\kappa$ as the key. The ciphertext output replaces the fields of $X$ in the output trace; the values for attributes in $Y$ are not affected. The encryption key is never shared, so the output trace must be analyzed without access to the values in these fields. A different encryption key is used for each encryption operator applied, but the same encryption key is used for all values of the fields in $X$. Thus, common values in an encrypted field are revealed to the analyst. However, if two records agree upon values in $X$ but differ in values in $Y$, then the encrypted values of $X$ will be different for these records. As a result, the encryption of two records will be the same only if they agree upon values for $X$ as well as for $Y$. Table 3 shows the result of applying the encryption operators $E_{\{ip1,ip2\},\kappa_1}$ and $E_{\{pt1,pt2\}(ip1,ip2),\kappa_2}$ to Table 2. The encryption allows connections (identified by source and destination IP and port fields) to be differentiated. However, it is not possible to see that two connections share the same destination port, for example. Further, because source and destination IP are used as input for the encryption of ports, it is not possible to correlate ports across different connections.
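The effect of grouping attributes on encryption can be seen in a small Python sketch. The paper calls for a symmetric encryption algorithm; to stay within the standard library, this sketch substitutes a keyed HMAC-SHA256 digest, which is deterministic under a fixed key but is not reversible encryption. The function names, key values, and record layout are hypothetical.

```python
import hashlib
import hmac


def encrypt_fields(record, targets, key, group_by=()):
    """Return one opaque token standing in for the ciphertext of the target fields."""
    # The repr of a tuple is an unambiguous serialization of the concatenated values.
    message = repr(tuple(record[f] for f in (*targets, *group_by))).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:12]


def anonymize(record, key_ip, key_pt):
    """Apply E on (ip1, ip2) and E on (pt1, pt2) grouped by (ip1, ip2)."""
    return {
        "ip": encrypt_fields(record, ("ip1", "ip2"), key_ip),
        "pt": encrypt_fields(record, ("pt1", "pt2"), key_pt, group_by=("ip1", "ip2")),
    }


key_ip, key_pt = b"hypothetical-key-1", b"hypothetical-key-2"
p1 = {"ip1": "10.0.0.1", "ip2": "10.0.0.9", "pt1": 4321, "pt2": 80}
p2 = dict(p1)                     # a second packet of the same connection
p3 = {**p1, "ip1": "10.0.0.7"}    # a different source host using the same ports

a1, a2, a3 = (anonymize(p, key_ip, key_pt) for p in (p1, p2, p3))
print(a1 == a2)                # True: packets of one connection share tokens
print(a1["pt"] == a3["pt"])    # False: equal ports are unlinkable across connections
```

Both fields are computed from the original record, reflecting the restriction that each field is affected by a single operator while grouping attributes refer to original values.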
Canonical Ordering. The canonical ordering operator is used to replace fields whose actual values can be eliminated as long as they are replaced by synthetic values respecting the ordering of the original values. The ordering operator is denoted $O_{X(Y)}$, where $X$ is the set of target fields to be replaced, and $Y$ is an optional set of grouping fields. If the input set $Y$ is empty, the data entries in the fields of $X$ are sorted and replaced by their position in the sorted list, beginning with zero. If the input set $Y$ is not empty, then the ordering operation is done separately for each group of records that agree on values for the columns in $Y$.
Translation. The translation operation is applied to numerical fields in the trace, shifting values through addition or subtraction of a given constant. The operator is denoted $T_{X(Y)}$, where $X$ is a set of target columns that are translated by the operator. The operator can optionally have another set of columns $Y$, called grouping columns, which are not affected by the operation. If the input set $Y$ is empty, all the data entries in the target columns in $X$ are shifted by subtracting a random parameter $c$ from each entry. If the input set $Y$ is not empty, then the records in a trace are formed into groups such that the records in each group have the same data entries for the columns in $Y$. For the records in each group, the target columns $X$ are shifted by a parameter $c$ whose value depends on the group. The parameter value can be chosen randomly or by using a function that takes the data entry of $Y$ for the group as input.
Scaling. The scaling operation scales all the values in a given field by multiplying them by a constant. The scaling operator is denoted $S_{X,k}$ for a set of target fields $X$; it scales all the values in the fields in $X$ by a factor of $k$.
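To make the difference between these operators concrete, here is a short Python sketch of canonical ordering and scaling on a single target field. The function names and the record layout are illustrative assumptions, and equal values are assumed to receive the same rank, which the description above leaves open.

```python
def canonical_order(records, target, group_by=()):
    """Replace each target value by its zero-based rank within its group."""
    # Collect the distinct target values of each group.
    values = {}
    for r in records:
        values.setdefault(tuple(r[g] for g in group_by), set()).add(r[target])
    # Rank the distinct values of each group in ascending order.
    rank = {g: {v: i for i, v in enumerate(sorted(vs))} for g, vs in values.items()}
    return [{**r, target: rank[tuple(r[g] for g in group_by)][r[target]]} for r in records]


def scale(records, target, k):
    """Multiply every target value by the constant k."""
    return [{**r, target: r[target] * k} for r in records]


trace = [
    {"conn": "a", "ts": 10},
    {"conn": "a", "ts": 30},
    {"conn": "a", "ts": 20},
    {"conn": "b", "ts": 500},
    {"conn": "b", "ts": 400},
]

grouped = [r["ts"] for r in canonical_order(trace, "ts", group_by=("conn",))]
ungrouped = [r["ts"] for r in canonical_order(trace, "ts")]
scaled = [r["ts"] for r in scale(trace, "ts", 3)]
print(grouped)    # [0, 2, 1, 1, 0]
print(ungrouped)  # [0, 2, 1, 4, 3]
print(scaled)     # [30, 90, 60, 1500, 1200]
```

Ordering keeps only the relative order (the gap between 10 and 30 is no longer recoverable), and with grouping the order across connections is lost as well. Scaling by a positive constant keeps order and ratios but not differences.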
It is sometimes convenient to consider the identity transformation, denoted $I_X$, which does not transform field $X$, including it in the output without modification.
Composite Transformations. The operators above can be combined to form composite transformations for a network trace. We assume in the sequel that composite transformations are represented in the following normal form:
$$\phi = \Pi_X \circ \phi^{1}_{X_1} \circ \phi^{2}_{X_2} \circ \cdots \circ \phi^{n}_{X_n} \qquad (1)$$
where $\phi^{i}_{X_i}$ refers to the $(i+1)$th operator in $\phi$, which acts on attribute set $X_i$, and for all $i$, $\phi^{i}_{X_i} \in \{E, T, O, S, I\}$. We denote the set of all such transformations $\Phi$. The last operation applied to the trace is the projection $\Pi_X$. Since $X$ is the set of attributes retained by $\phi$, with all others removed, any other operators on fields not in $X$ are irrelevant and can be disregarded. Thus, without loss of generality we assume $\forall i,\ X_i \subseteq X$. Further, we restrict our attention to composite operations in which each field in the trace is affected by only one operation: $\forall i \neq j,\ X_i \cap X_j = \{\}$. In the paper, we will assume $\Pi_X$ to be present even if it is not mentioned in $\phi$. For example, $E_{X_1} \circ T_{X_2}$ and $\Pi_{X_1 \cup X_2} \circ E_{X_1} \circ T_{X_2}$ mean the same.
Other operators. Our framework can easily accommodate other transformation operators. We have found that this simple set of operators can be used to generate safe transformations supporting a wide range of network analyses performed in the research literature (in addition to the example in Section 2, we consider other examples in Appendix A.1). In many cases, adding transformation operators to our framework requires only minor extensions to the algorithms described in Section 4. For example, prefix-preserving encryption of IP addresses could be added as a special transformation operator. We do not consider it here, since it has been thoroughly discussed in the literature [8, 15], and we have focused on supporting the wide range of networking studies that do not require prefix preservation. However, it is worth noting that some potentially important operators (e.g. random perturbation of numeric fields, or generalization of field values) will lead to analysis results that are approximately correct, but not exact. In this initial investigation, we are concerned with supporting exact analyses. We leave as future work the consideration of such operators and the evaluation of approximately correct analysis results.
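The normal-form conditions (operators drawn from E, T, O, S, I, target sets contained in the retained set X, and pairwise disjoint target sets) are easy to check mechanically. The following Python sketch is one way to do so; the `Op` record and the function `normal_form_errors` are hypothetical names, and the example encodes the Section 2 transformation.

```python
from dataclasses import dataclass

OPERATORS = {"E", "T", "O", "S", "I"}


@dataclass(frozen=True)
class Op:
    kind: str                          # one of OPERATORS
    targets: frozenset                 # the fields X_i rewritten by the operator
    group_by: frozenset = frozenset()  # optional grouping fields Y


def normal_form_errors(retained, ops):
    """Return the violated normal-form conditions; an empty list means valid."""
    errors, seen = [], set()
    for op in ops:
        if op.kind not in OPERATORS:
            errors.append(f"unknown operator {op.kind}")
        if not op.targets <= retained:
            errors.append(f"{op.kind}: targets {sorted(op.targets)} not all retained")
        if op.targets & seen:
            errors.append(f"{op.kind}: fields {sorted(op.targets & seen)} already transformed")
        seen |= op.targets
    return errors


C = frozenset({"ip1", "ip2", "pt1", "pt2"})
section2 = [
    Op("E", frozenset({"ip1", "ip2"})),
    Op("E", frozenset({"pt1", "pt2"}), frozenset({"ip1", "ip2"})),
    Op("T", frozenset({"ts"}), C),
    Op("T", frozenset({"seq_no", "ack_no"}), C),
    Op("I", frozenset({"dir", "window", "syn", "ack"})),
]
X = frozenset().union(*(op.targets for op in section2))

print(normal_form_errors(X, section2))   # []
bad = section2 + [Op("O", frozenset({"ts"}))]
print(normal_form_errors(X, bad))        # ["O: fields ['ts'] already transformed"]
```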
3.2
In our framework, the analyst seeking access to a network trace must specify their utility requirements formally. These requirements are expressed as a set of constraints asserting a given relationship between fields in the original trace and fields in the anonymized trace. The analyst is expected to specify constraints that are sufficient to allow the exact analysis to be carried out on the trace. Each constraint states which item of information must be preserved while anonymizing the trace. An item of information in a network trace can be either: (i) the value of some field in the trace, or (ii) the value of some arithmetic or boolean expression evaluated using only the fields from the network trace. The syntax of a constraint is as follows:
Definition 1 (Utility Constraint). A utility constraint is described by a rule of the following form:
qualifier ⇒ (expr(orig) = expr(anon))
A complete grammar for utility constraints is given in Table 4. We use record.field and φ(record).field, which are terminal symbols in the grammar, to mean any valid field in the original network trace and the transformed trace, respectively. The above constraint can be interpreted as follows: if one or more records in a trace satisfy the qualifying condition given in qualifier, then the value of the expression expr evaluated over these records must be equal to the value of the same expression when evaluated over the corresponding anonymized records. We can use this grammar to generate complex arithmetic expressions involving any fields in the trace. A constraint rule is unary if its conditions refer to a single record, or binary if they refer to two records. For example, if an analyst wants to test two port numbers involved in a connection for equality, this can be expressed as the following unary constraint:
Any(t) ⇒ ((t.pt1 == t.pt2) = (φ(t).pt1 == φ(t).pt2))
The qualifier Any(t) is true for any record in the trace. The constraint says that if the two port numbers in a record have the same value then the corresponding values in the anonymized record should also be the same, and likewise that differing values stay different. The information that needs to be preserved is the equality of port numbers. A transformation need not preserve the actual values of port numbers in order to satisfy this rule.
A binary constraint requires two records for evaluating its expression. For example, the analyst may want to verify that the acknowledgement number in one packet is equal to the sequence number of another packet moving in the opposite direction in a TCP connection. This requirement can be expressed as the following constraint:
Opp(t1, t2) ⇒ ((t1.ack == t2.seq) = (φ(t1).ack == φ(t2).seq))
The information that needs to be preserved here is the equality of two fields across records. The actual values need not be preserved. The qualifier Opp(t1, t2) is user-defined and states the conditions that must be true for two records to belong to packets moving in opposite directions in a connection. In this case, the list of conditions is {(t1.ip1 == t2.ip1), (t1.ip2 == t2.ip2), (t1.pt1 == t2.pt1), (t1.pt2 == t2.pt2), (t1.dir != t2.dir)}.
We believe trace analysts will be able to use these constraint rules to accurately describe the properties of a trace required for accurate analysis. In most cases it is not difficult to consider a trace analysis and derive the fields whose values must be unchanged, or the relationships between values that must be maintained. (See Appendix A.1 for examples.) We note that it is in the interest of the analyst to choose a set of constraint rules which is specific to the desired analysis task. Obviously, if all fields of all records in the trace are required to be unmodified, then the only satisfying trace will be the original, unanonymized trace. Our framework does not impose any explicit controls on the utility requirements submitted by analysts, except that the trace owner is likely to reject requests for constraint requirements that are too general.
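The reading of a binary constraint can be sketched in Python as a check on one concrete pair of traces. This is only an empirical test on a single input, not the static analysis of Section 4, and it assumes that the i-th anonymized record corresponds to the i-th original record. The function names are hypothetical; the qualifier and expression follow the Opp example above.

```python
from itertools import permutations

CONN = ("ip1", "ip2", "pt1", "pt2")


def opp(t1, t2):
    """Qualifier Opp(t1, t2): same connection, opposite directions."""
    return all(t1[f] == t2[f] for f in CONN) and t1["dir"] != t2["dir"]


def ack_matches_seq(t1, t2):
    """Expression (t1.ack == t2.seq)."""
    return t1["ack"] == t2["seq"]


def holds(orig, anon, qualifier, expr):
    """Check a binary constraint on one trace; anon[i] is the image of orig[i]."""
    return all(
        expr(orig[i], orig[j]) == expr(anon[i], anon[j])
        for i, j in permutations(range(len(orig)), 2)
        if qualifier(orig[i], orig[j])  # the qualifier is evaluated on original records
    )


conn = {"ip1": "h1", "ip2": "h2", "pt1": 4321, "pt2": 80}
orig = [
    {**conn, "dir": "fwd", "seq": 165, "ack": 400},
    {**conn, "dir": "rev", "seq": 400, "ack": 165},
]
# Same shift for both fields versus a different shift per field.
same_shift = [{**r, "seq": r["seq"] - 150, "ack": r["ack"] - 150} for r in orig]
mixed_shift = [{**r, "seq": r["seq"] - 150, "ack": r["ack"] - 100} for r in orig]

print(holds(orig, same_shift, opp, ack_matches_seq))   # True
print(holds(orig, mixed_shift, opp, ack_matches_seq))  # False
```

The second result illustrates why a constraint relating two fields forces them to be translated by the same constant.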
An important feature of our framework is that it enables the trace owner to reason formally about the relationship between utility requirements and anonymizing transformations. In this section we show how the trace owner can determine conclusively that a published trace will satisfy the utility requirements expressed by analysts. In addition, we show how the trace owner can derive the most secure transformation satisfying a desired set of utility constraints. Lastly, we show how the trace owner can compare alternative publication strategies and analyze the potential impact of collusion amongst analysts who have received traces. We refer to the formal reasoning about trace transformations and utility constraints as static analysis because these relationships between transformations hold for all possible input traces. Other aspects of trace privacy cannot be assessed statically; in Section 5 we measure the privacy of real traces under sample transformations.
4.1
We now show that it is possible to test efficiently whether a given transformation will always satisfy the utility requirements expressed by a set of constraints.
Definition 2 (Utility constraint satisfaction). Given a set of constraints $C$ and a transformation $\phi$, we say $\phi$ satisfies $C$ (denoted $\phi \models C$) if, for any input trace, the output of the transformation satisfies each constraint in $C$.
Checking utility constraint satisfaction is performed independently for each constraint rule in $C$ by matching the conditions specified in a constraint to the operators which impact the named fields. Recall that the general form for unary constraints is qualifier ⇒ (expr(t) = expr(φ(t))), where expr can be either a conjunctive normal form of one or more comparisons, or an arithmetic expression. Since a unary constraint involves only one record, each comparison expression must involve two attributes from the network trace. For each comparison or arithmetic expression in expr, we look for the corresponding entry in Table 5, which lists expressions and compatible transformation operators. If the composite transformation $\phi$ has a matching transformation in Table 5, then we proceed to the next comparison or sub-expression. Otherwise we conclude that $\phi$ does not satisfy the constraint. If $\phi$ has a matching transformation for each of the sub-expressions, the constraint is said to be satisfied by the transformation. The procedure for verifying binary constraints is similar, with some minor modifications. We describe the verification process in Appendix A.2.
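The table-lookup idea can be sketched as follows. The compatibility table here is an illustrative assumption derived from the operator descriptions in Section 3.1 and is not a copy of Table 5. It covers only relationships on a single field between records of one group, assumes a positive scaling factor, and ignores the matching of qualifiers against grouping attributes. All names are hypothetical.

```python
# Illustrative compatibility table (an assumption, not the paper's Table 5):
# the operators that keep each relationship on a field intact.
PRESERVES = {
    "value": {"I"},
    "equality": {"I", "E", "O", "T", "S"},
    "order": {"I", "O", "T", "S"},
    "difference": {"I", "T"},
}


def satisfies(transformation, constraints):
    """Check every (relationship, field) pair against the table.

    `transformation` maps a field to its operator letter; a field that is
    absent has been projected away and therefore preserves nothing.
    """
    for relationship, field in constraints:
        operator = transformation.get(field)
        if operator not in PRESERVES[relationship]:
            return False  # no compatible operator: the constraint fails
    return True


section2 = {
    "ip1": "E", "ip2": "E", "pt1": "E", "pt2": "E",
    "ts": "T", "seq_no": "T", "ack_no": "T",
    "window": "I", "syn": "I", "ack": "I", "dir": "I",
}
needs = [
    ("value", "syn"), ("value", "window"),
    ("order", "ts"), ("difference", "ts"), ("difference", "seq_no"),
]

print(satisfies(section2, needs))                  # True
print(satisfies({**section2, "ts": "O"}, needs))   # False: ordering loses differences
```

As in the procedure above, the check stops at the first sub-expression for which no compatible operator is found.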
4.2
Since each transformation operator removes information from the trace, some composite transformations can be compared with one another in terms of the amount of information they preserve. We show here that there is a natural partial order on transformations.
Definition 3 (Inverse Set of a Transformed Trace). Let $N$ be a network trace transformed using transformation $\phi$ to get the transformed trace $\phi(N)$. Then the inverse set for trace $\phi(N)$ is the set of all network traces that have the same algebraic properties as those retained in $\phi(N)$ and hence give $\phi(N)$ as output when transformed using $\phi$. We use the notation $\phi^{-1}[N]$ to represent this set.
Example. Consider a transformed trace $\phi(N)$ obtained by applying $\phi = \Pi_{\{A\}} \circ O_{\{A\}}$ to $N$, and suppose $\phi(N)$ is {0, 1, 2}. Applying $\phi$ to the network traces {1, 5, 100} and {4120, 5230, 6788} gives the same result as $\phi(N)$. In fact, there are $\binom{n}{3}$ traces that have the same algebraic properties as those retained by $\phi$ in $\phi(N)$, one for each choice of three distinct values, where $n$ is the size of the domain of attribute $A$. Thus, the inverse set $\phi^{-1}[N]$ consists of $\binom{n}{3}$ traces.
Definition 4 (Equivalence of Traces under Transformation $\phi$). Two traces $N_1$ and $N_2$ are equivalent under transformation $\phi$ iff $\phi^{-1}[N_1] = \phi^{-1}[N_2]$. We denote this relation as $N_1 \equiv_\phi N_2$.
It can be seen that the relation $\equiv_\phi$ is transitive (and, being defined by an equality of sets, also reflexive and symmetric). Hence, we can divide the entire domain of network traces into equivalence classes, where all the network traces in an equivalence class are equivalent under $\phi$.
Lemma 1 (Equivalence Class). For any network trace $N$ and transformation $\phi$, the equivalence class containing $N$ (denoted by $e_\phi(N)$) is the same as the inverse set $\phi^{-1}[N]$.
The proof of this lemma is given in Appendix A.3.
Definition 5 (Equivalence of Transformations). Two transformations $\phi_1$ and $\phi_2$ are equivalent if the relations $\equiv_{\phi_1}$ and $\equiv_{\phi_2}$ divide the domain of network traces into the same equivalence classes.
The implication of this definition is that for any trace $N$, the inverse sets satisfy $\phi_1^{-1}[N] = \phi_2^{-1}[N]$. In other words, the information retained under the two transformations is exactly the same.
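The inverse set in the canonical-ordering example can be enumerated by brute force for a small domain. The Python sketch below treats a trace as an ordered tuple of values of a single attribute and uses zero-based ranks, following the operator description in Section 3.1. The function names and the toy domain of size 6 are assumptions made for illustration.

```python
from itertools import product
from math import comb


def order_transform(values):
    """Canonical ordering of one attribute: each value becomes its zero-based rank."""
    rank = {v: i for i, v in enumerate(sorted(set(values)))}
    return tuple(rank[v] for v in values)


def inverse_set(output, domain):
    """All traces over `domain` that the transformation maps to `output`."""
    return [t for t in product(domain, repeat=len(output)) if order_transform(t) == output]


# The two traces from the example are equivalent under the transformation.
print(order_transform((1, 5, 100)) == order_transform((4120, 5230, 6788)))  # True

domain = range(1, 7)                        # toy attribute domain of size 6
candidates = inverse_set((0, 1, 2), domain)
print(len(candidates), comb(6, 3))          # 20 20
```

Every trace in `candidates` produces the same published output, so an adversary who sees only that output cannot distinguish among them. A larger inverse set therefore corresponds to less retained information.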
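Definition 6 (Strictness Relation). Given two composite transformations $\phi_1$ and $\phi_2$, we say that $\phi_1$ is stricter than $\phi_2$, written $\phi_1 \preceq \phi_2$, if for every network trace $N$, $e_{\phi_2}(N) \subseteq e_{\phi_1}(N)$. This definition implies that if the transformation $\phi_1$ is stricter than $\phi_2$, then for every network trace $N$, $\phi_2^{-1}[N] \subseteq \phi_1^{-1}[N]$.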
In other words, $\varphi_1$ contains less information about the original trace and hence its inverse set is at least as large as that obtained using $\varphi_2$. Using the definition of strictness, the most strict transformation is $\Pi_{\emptyset}$, which removes all attributes of the trace. The least strict transformation is $I_X$, which simply applies the identity transformation to all attributes, returning the original trace without modification. All other transformations fall between these two in terms of strictness. For example, we have $E_{X(Y)} \preceq O_{X(Y)}$ because encryption removes the ordering information from the data entries. Also, $E_{X(Y)} \preceq E_{X(Y')}$ if $Y' \subset Y$, as $E_{X(Y)}$ removes the equality information of $X$-entries from records which have the same entries for $Y'$ but differ in some attribute in $(Y - Y')$. More strictness relations for basic operators are given in Lemma 4 in Appendix A.3.

Recall that $\Phi$ denotes the set of all composite transformations. The following theorems show that the strictness relation has a number of convenient properties.

Theorem 1. $(\Phi, \preceq)$ is a partially ordered set.

Theorem 2. $(\Phi, \preceq)$ is a join-semilattice, i.e., for any two transformations $\varphi_1$ and $\varphi_2$, there is another transformation in $\Phi$, denoted $\varphi_1 \vee \varphi_2$, which is the least upper bound of $\varphi_1$ and $\varphi_2$.
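The strictness relation of Definition 6 can be illustrated on a toy domain by comparing the equivalence classes induced by two operators. The sketch below assumes that encryption retains only the equality pattern of entries and that an order-preserving operator retains the rank pattern; the function names are hypothetical and the check is exhaustive, so it is only practical for very small domains.

```python
from itertools import product


def equality_pattern(trace):
    """What we assume encryption retains: which entries are equal."""
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in trace)


def order_pattern(trace):
    """What we assume an order-preserving operator retains: the rank pattern."""
    ranks = {v: i for i, v in enumerate(sorted(set(trace)))}
    return tuple(ranks[v] for v in trace)


def classes(transform, traces):
    """Partition traces into classes with identical retained properties."""
    parts = {}
    for t in traces:
        parts.setdefault(transform(t), set()).add(t)
    return list(parts.values())


def is_stricter(t1, t2, traces):
    """True if every class of t2 lies inside some class of t1."""
    c1 = classes(t1, traces)
    return all(any(c <= big for big in c1) for c in classes(t2, traces))


traces = list(product(range(4), repeat=3))   # all 3-record traces over 4 values
print(is_stricter(equality_pattern, order_pattern, traces))
print(is_stricter(order_pattern, equality_pattern, traces))
```

In this toy model the first check succeeds and the second fails, matching the statement that encryption is stricter than the order-preserving operator but not the other way around.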
The proofs of these results are included in Appendix A.4. Theorem 2 can be easily extended to conclude that any set of transforms has a unique least upper bound and this fact has a number of important consequences for the trace publisher: First, given a set of constraints C it is important for the trace publisher to compute the most secure transformation satisfying C . Theorem 2 shows that such a transformation always exists.
Next, imagine that the trace publisher has derived three transforms $\varphi_1$, $\varphi_2$, $\varphi_3$ specific to three analyst requests. The publisher may wish to consider publishing a single trace that can satisfy all three requests simultaneously. The least upper bound of these three transformations, denoted $lub(\varphi_1, \varphi_2, \varphi_3)$, is the transformation the publisher must consider. Similarly, if the publisher has already released the traces derived from $\varphi_1$, $\varphi_2$, $\varphi_3$ and fears that the analysts may collude, then the least upper bound transformation $lub(\varphi_1, \varphi_2, \varphi_3)$ is a conservative bound on the amount of information the colluding parties could recover by working together.
Algorithm 1
Input: Set of constraints $C$
Output: Composite transform $\varphi$

 1  Let S = {} be the empty set of attributes;
 2  Let map = {} be the constraint-set to transformation map;
 3  foreach constraint c in C do
 4      foreach attribute a present in c do
 5          S = S ∪ {a};
 6      end
 7      φ = most secure operator from the look-up table that satisfies c;
 8      PUT(map, {c}, φ);
 9  end
10  while ∃ dependent sets C1, C2 in map do
11      φ1 = GET(map, C1);
12      φ2 = GET(map, C2);
13      φ = LEAST-UPPER-BOUND(φ1, φ2);
14      REMOVE(map, C1);
15      REMOVE(map, C2);
16      PUT(map, C1 ∪ C2, φ);
17  end
18  φ = Π_S;
19  foreach set C in map do
20      φ' = GET(map, C);
21      φ = φ ∘ φ';
22  end
23  return(φ)
4.3
The strictness relation can be used as the basis of an algorithm for finding the most secure transformation satisfying a set of utility requirements.
Definition 7 (Most Secure Transformation). Given a set of constraints $C$, we denote by $\Phi[C]$ the set of transformations satisfying the constraints of $C$. The most secure transformation is the minimum element in $\Phi[C]$, denoted $\min(\Phi[C])$.

Algorithm 1 computes the most secure transform given a set of constraints. The algorithm uses a map data structure which maps a set of constraints to its most secure transform. It starts by forming $|C|$ different constraint sets, each containing exactly one constraint. Using the look-up table for constraints, the strictest operator is obtained for each constraint and the entry is made in the map (Lines 3-8). As a next step, two constraint sets ($C_1$, $C_2$) are chosen such that there exists an attribute which is referred to by at least one constraint in each set. The composite transforms for $C_1$ and $C_2$ can operate differently on this common attribute. Thus, the least upper bound of these transforms is computed to get the most secure transform having the properties of both transforms (Lines 11-13). The steps for obtaining the lub can be seen in the proof of Theorem 2 (given in Appendix A.4). The constraint sets $C_1$ and $C_2$ are then merged into a single set, which is put into the map along with the lub, and the previous entries for the two sets are removed from the map (Lines 14-16). The above steps are repeated until no dependent constraint sets are left. At this point, all the transforms in the map transform disjoint sets of attributes and do not conflict. As a final step, all these transforms are composed. The resulting composition, along with the required projection operator, is returned as the most secure transform (Lines 18-23).
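The control flow of Algorithm 1 can be sketched in a few lines. This is a simplified illustration under strong assumptions, not the implementation used in the paper: each transform is modeled as a per-attribute choice of one basic operator, the operators are assumed to form a single chain E, O, T, I from most to least strict, the look-up table `LOOKUP` and the constraint kinds are hypothetical, and grouping sets and encryption keys are ignored.

```python
# Assumed strictness chain, from most strict (0) to least strict (3).
RANK = {"E": 0, "O": 1, "T": 2, "I": 3}

# Hypothetical look-up table: strictest operator satisfying each constraint kind.
LOOKUP = {"equality": "E", "order": "O", "difference": "T", "value": "I"}


def lub(t1, t2):
    """Per-attribute least upper bound of two transforms in the assumed chain."""
    out = dict(t1)
    for attr, op in t2.items():
        if attr not in out or RANK[op] > RANK[out[attr]]:
            out[attr] = op
    return out


def most_secure_transform(constraints):
    """constraints: list of (kind, attributes). Returns (projected attrs, ops)."""
    attrs = set()
    table = {}  # frozenset of constraint indices -> transform
    for i, (kind, cattrs) in enumerate(constraints):
        attrs |= set(cattrs)
        table[frozenset([i])] = {a: LOOKUP[kind] for a in cattrs}

    merged = True
    while merged:  # repeat until no two entries share an attribute
        merged = False
        keys = list(table)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                c1, c2 = keys[x], keys[y]
                if table[c1].keys() & table[c2].keys():
                    table[c1 | c2] = lub(table.pop(c1), table.pop(c2))
                    merged = True
                    break
            if merged:
                break

    result = {}
    for transform in table.values():  # remaining transforms are disjoint
        result.update(transform)
    return sorted(attrs), result


constraints = [("equality", ["ip1", "ip2"]), ("order", ["ts"]),
               ("difference", ["ts"]), ("value", ["win"])]
print(most_secure_transform(constraints))
```

In this toy run the two constraints on `ts` are dependent, so their operators are merged to the less strict of the two, while the other attributes keep the operator chosen from the look-up table.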
4.4
Customizing published traces to the needs of analysts means that any given analyst will have the information they need, but not more than they need. However, if a trace owner publishes a set of traces, we must consider the implications of one party acquiring these traces and attempting to combine them. The ability to correlate the information present in two (or more) transformed traces depends greatly on the particular transformations.

As a straightforward defense against the risks of collusion, the trace owner can always consider the least upper bound, lub, of the published transformations. The lub provides a conservative bound on the amount of information released, since each of the published traces could be computed from it. Therefore the trace owner can evaluate the overall privacy of publishing the lub transformation; if it is acceptable, then the risk of collusion can be ignored.

In many practical cases the lub is overly conservative, as it is not possible to merge the traces accurately. For example, if the two transformations $E_{\{ip1,ip2,pt1,pt2\},\kappa_1} \circ I_{win}$ and $E_{\{ip1,ip2\},\kappa_2} \circ I_{TTL}$ are released, it is difficult to correlate them. One trace includes a window field for each encrypted connection, while the other includes a TTL field for each pair of hosts. Because distinct port pairs are included in the encryption in the first trace, but removed from the second, it would be very difficult to relate packets across the two traces.

In general, the relationship between the information content present in two transformations $\varphi_1$, $\varphi_2$ and the information present in $lub(\varphi_1, \varphi_2)$ depends on (i) the transformation operators applied to fields common to $\varphi_1$ and $\varphi_2$, and (ii) the degree to which these fields uniquely identify packets in the trace. Static analysis tools can be used to evaluate (i); however, (ii) may depend on the actual input trace and requires quantitative evaluation, similar to that described next.
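One simple way to get a feel for point (ii) is to measure how often the fields shared by two published traces identify a record uniquely. The sketch below is only an illustration under assumed inputs: records are plain dictionaries, and `unique_fraction` is a hypothetical helper, not a measure defined in this paper.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose values on `fields` occur exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Toy trace: two packets share the same host pair, two do not.
records = [
    {"ip1": "a", "ip2": "b"},
    {"ip1": "a", "ip2": "b"},
    {"ip1": "a", "ip2": "c"},
    {"ip1": "d", "ip2": "e"},
]
print(unique_fraction(records, ["ip1", "ip2"]))
```

A value close to 1 would suggest that records can be matched across traces almost as well as in the lub, while a low value suggests that the lub overstates what colluding analysts could recover.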
The techniques in the previous section allow the trace owner to compare the information content of some transformations, and to find the most secure transformation that satisfies a given set of utility requirements. This provides a relative evaluation of trace privacy, but it does not allow the trace owner to decide whether a specific transformation meets an acceptable privacy standard. In this section we measure quantitatively the privacy of sample trace transformations by applying the transformations to a real enterprise IP trace, simulating attacks by an informed adversary, and measuring the risk of disclosure. In addition to providing a reliable evaluation of the absolute security of a transformation, the quantitative analysis also allows us to compare the security of transformations that are not comparable (recall that these cases occur because the relation $\preceq$ in Section 4 is only a partial order). Also, we can quantify the improved security that results from publishing two traces customized for separate analyses, as compared with publishing a single trace that can support both.
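Figure 2: (a) Five example transformations used in the quantitative evaluation of host anonymity. (b) Tree representing strictness relationships between the example transformations (i.e. [child] $\preceq$ [parent] in each case).

(a)
$\varphi_1 = E_{\{ip1\},\kappa} \circ E_{\{ip2\},\kappa} \circ E_{\{pt1\},\kappa_2} \circ E_{\{pt2\},\kappa_2} \circ T_{\{ts\}(C)} \circ T_{\{seq,ack\}(C)} \circ I_{\{win\}}$
$\varphi_2 = E_{\{ip1\},\kappa} \circ E_{\{ip2\},\kappa} \circ E_{\{pt1,pt2\},\kappa_2} \circ T_{\{ts\}(C)} \circ T_{\{seq,ack\}(C)} \circ I_{\{win\}}$
$\psi_1 = E_{\{ip1\},\kappa} \circ E_{\{ip2\},\kappa} \circ E_{\{pt1\},\kappa_2} \circ E_{\{pt2\},\kappa_2} \circ O_{\{ts\}(C)} \circ O_{\{seq,ack\}(C)} \circ O_{\{ipid\}(C)}$
$\psi_2 = E_{\{ip1\},\kappa} \circ E_{\{ip2\},\kappa} \circ E_{\{pt1,pt2\},\kappa_2} \circ O_{\{ts\}(C)} \circ O_{\{seq,ack\}(C)} \circ O_{\{ipid\}(C)}$
$\varphi_0 = E_{\{ip1\},\kappa} \circ E_{\{ip2\},\kappa} \circ T_{\{ts\}(C)} \circ T_{\{seq,ack\}(C)} \circ I_{\{pt1,pt2,win,TTL\}} \circ O_{\{ipid\}(C)}$

(b) [Tree: $\varphi_0$ at the root with children $\varphi_1$ and $\psi_1$; $\varphi_2$ is the child of $\varphi_1$ and $\psi_2$ is the child of $\psi_1$.]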
Experimental setup. We use a single IP packet trace collected at a gateway router of a public university. The trace consists of 1.83 million packets and has 41930 hosts, both external and internal. The trace was stored as a relational table in the open-source PostgreSQL database system running on a Dell workstation with an Intel Core2 Duo 2.39 GHz processor and 2GB RAM. Each transformation was applied to the trace using the database system.

Attack model. We focus on one of the most common threats considered in trace publication: the re-identification of anonymized host addresses. We assume the adversary has access (through some external means) to information about traffic properties of the hosts participating in the collected trace. We call these host fingerprints. The adversary attacks the published trace by matching fingerprints of these hosts to the available attributes of hosts in the published, transformed trace. The result of the attack is a set of unanonymized candidate hosts that could feasibly correspond to a target anonymized host.

Adversary knowledge. We consider a powerful adversary who is able to fingerprint hosts using the collection of host attributes described in Figure 4. The port1 in Figure 4 refers to the ports on which the fingerprinted host receives packets, whereas port2 and ip-address2 correspond to hosts communicating with the fingerprinted host. The rest of the fields have their usual meanings. We do not require an exact match of fingerprints and trace attributes. Instead, the adversary applies a similarity metric to the host pairs, and any un-anonymized host whose similarity to the anonymized host, A, is above a certain threshold is added to the candidate set of A. A higher threshold value reflects higher confidence of the adversary in the accuracy of his fingerprints. In order to simulate a strong adversary, we compute the fingerprints available to the adversary from the original trace and choose a high threshold value of 0.98.
If the fingerprints being matched are sets of values $X$ and $Y$, then the similarity is given by

$$sim(X, Y) = 1 - \frac{|X \cup Y| - |X \cap Y|}{|X \cup Y|}.$$

The similarity metric for continuous numeric-value fingerprints is given by

$$sim(x, y) = 1 - \frac{|x - y|}{\max(x, y)}.$$

Finally, we use the Pearson correlation coefficient to compare fingerprints which are chains of values (e.g. the chain of sequence numbers for a connection). We average the similarity of all the available fingerprints to compute the overall similarity of the host.
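The three similarity measures described above can be written down directly. The sketch below is a minimal illustration, with some assumptions that the text does not specify: empty sets and a pair of zeros are treated as fully similar, numeric fingerprints are assumed non-negative, chains are assumed to have equal length, and a chain with zero variance is given a correlation of 0.

```python
import math


def set_similarity(x, y):
    """Similarity of two set-valued fingerprints."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets are identical
    return 1 - (len(union) - len(x & y)) / len(union)


def numeric_similarity(x, y):
    """Similarity of two non-negative numeric fingerprints."""
    m = max(x, y)
    if m == 0:
        return 1.0  # assumption: both values are zero
    return 1 - abs(x - y) / m


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length chains."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: undefined correlation counts as no match
    return cov / math.sqrt(vx * vy)


def overall_similarity(scores):
    """Average of the similarities of all available fingerprints."""
    return sum(scores) / len(scores)


scores = [
    set_similarity({80, 443}, {80, 22}),   # ports seen for two hosts
    numeric_similarity(90, 100),           # e.g. a numeric traffic property
    pearson([1, 2, 3], [2, 4, 6]),         # two chains of values
]
print(scores, overall_similarity(scores))
```

The set measure reduces to the ratio of shared values to all values seen for either host, so it is 1 only when the two sets coincide.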
Privacy Measure. We measure privacy by computing, for various $k$, the value of $N(k)$: the number of anonymized hosts in the trace that have a candidate set of size less than or equal to $k$. For example, out of the total number of hosts in the trace, $N(1)$ indicates the number of addresses the adversary is able to uniquely de-anonymize. Clearly, a lower value of $N(k)$ indicates a more privacy-preserving trace.

Transformations. We evaluate the anonymity of the five transformations shown in Figure 2(a), which were motivated by some of the sample analyses considered earlier in the paper. The first two transformations $\varphi_1$, $\varphi_2$ support the example analysis given in Section 2. Using strictness relations among basic operators (Appendix ??), we can see that $\varphi_2 \preceq \varphi_1$. In transform $\varphi_1$, ports are encrypted separately, allowing an adversary to use external information on the entropy of the ports on which a host sends traffic. This information is unavailable in transformation $\varphi_2$ because ports are encrypted together.
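The privacy measure can be computed once candidate sets are known. The following sketch assumes a single numeric fingerprint per host and hypothetical host names; `candidate_set` and `n_of_k` are illustrative helpers, and the threshold default simply mirrors the value 0.98 used in the experiments.

```python
def candidate_set(target_fp, known_fps, similarity, threshold=0.98):
    """Un-anonymized hosts whose similarity to the target exceeds the threshold."""
    return {host for host, fp in known_fps.items()
            if similarity(target_fp, fp) > threshold}


def n_of_k(candidates, k):
    """N(k): number of anonymized hosts with a candidate set of size <= k."""
    return sum(1 for c in candidates.values() if len(c) <= k)


def sim(x, y):
    """Toy numeric similarity, assuming positive fingerprint values."""
    return 1 - abs(x - y) / max(x, y)


# Hypothetical fingerprints known to the adversary and seen in the trace.
known = {"h1": 100.0, "h2": 101.0, "h3": 500.0}
anonymized = {"a1": 100.0, "a2": 500.0}

candidates = {a: candidate_set(fp, known, sim) for a, fp in anonymized.items()}
print(candidates)
print(n_of_k(candidates, 1), n_of_k(candidates, 2))
```

Here host `a2` has a single candidate and is uniquely re-identified, while `a1` is hidden among two hosts with nearly identical fingerprints.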
The transformations $\psi_1$ and $\psi_2$ allow the analyst to count the number of hosts which are involved in connections with a high rate of out-of-sequence packets. The identification of out-of-sequence packets requires only the order information of sequence numbers, timestamps, acknowledgement numbers and IP identifiers in a connection. As these fields have been transformed using the O operator, none of them is available to the adversary for fingerprinting. It is easy to observe that $\psi_2 \preceq \psi_1$.

The transformation $\varphi_0$ has been chosen as a base case, in which all the fingerprints listed in Figure 4 are available to the adversary. Also, $\varphi_0$ is a relaxation of $\varphi_1$ and $\psi_1$. As such, it is capable of supporting both the above-mentioned analyses. Figure 2(b) illustrates the strictness relationships between the five transformations studied here.

Results. We have summarized our results in Figure 3. The privacy measure $N(1)$ gives the number of uniquely identified hosts and is greatest for $\varphi_0$, as expected. All the fingerprints are available in $\varphi_0$, leading to identification of 1904 of 41930 hosts. Also, the lower value of $N(1)$ for $\varphi_2$ when compared with $\varphi_1$ validates the relation $\varphi_2 \preceq \varphi_1$. It can be seen that the results are valid for similar relations among other operators. In addition, the privacy measure allows us to compare transforms $\varphi_2$ and $\psi_2$, which are incomparable statically. The lowest $N(1)$ value for $\psi_2$ indicates that it is the safest transform among the set of transforms considered. We can also study the significance of fields to the adversary using these results. For example, the significant difference in $N(1)$ value for $\varphi_1$ and $\varphi_2$ indicates that the entropy of ports is highly informative for the adversary. Finally, we can see that releasing two secure views $\varphi_2$ and $\psi_2$ for two different analysts results in disclosure of only 66 and 25 hosts respectively, which is much less than the 1904 hosts disclosed by a general view $\varphi_0$.
This illustrates the significant gain in anonymity that results from publishing two traces customized for individual analyses, rather than publishing a single trace to satisfy both.
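To put the reported counts in proportion, the disclosure rates can be computed from the figures given above (1904, 66 and 25 uniquely identified hosts out of 41930). The labels in this short calculation are descriptive only, and `disclosure_rate` is a hypothetical helper.

```python
TOTAL_HOSTS = 41930


def disclosure_rate(identified, total=TOTAL_HOSTS):
    """Percentage of hosts uniquely re-identified, rounded to two decimals."""
    return round(100 * identified / total, 2)


reported = {
    "single general view": 1904,
    "first customized view": 66,
    "second customized view": 25,
}
for label, count in reported.items():
    print(f"{label}: {count}/{TOTAL_HOSTS} = {disclosure_rate(count)}%")
# single general view: 1904/41930 = 4.54%
# first customized view: 66/41930 = 0.16%
# second customized view: 25/41930 = 0.06%
```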
Related Work
Slagell et al. [19] recognized the need for a framework to support the trade-off between the security and utility of the trace and to provide multiple levels of anonymization. They stressed the need for a metric to compare the utility of two traces based on the fields available in them. In [13], the authors have proposed a measure for evaluating the utility of a network trace for intrusion detection. The proposed measure is specific to intrusion detection and cannot be applied to other analyses.

As described in the introduction, a wide range of anonymizing transformations have been considered for network traces. In [8], Xu et al. proposed a cryptography-based prefix-preserving anonymization scheme for IP addresses. This scheme provides one-to-one consistent anonymization for IP addresses and is not applicable to the transformation of other fields. In [16], Pang et al. have proposed a framework with some of the possible transformations for the different packet fields. However, its main focus is transforming the information contained in the packet payload and not in the packet header fields. In [15], the various header fields have been studied in some detail and a framework to support anonymization of different fields is proposed. This framework allows anonymization policies to control the various fields. In theory, it can publish multiple views, but it lacks the tools for analyzing the utility of the different views.
In [14], Mogul et al. propose a framework that requires an analyst to write the analysis program in the language supported by the framework. This program is then reviewed by experts for any privacy or security issues. The approved program is executed by the trace owner and the results are provided to the analyst. We compared our framework with this approach in Section 1.

The PREDICT [1] repository has been established to make network traces available for research. The network traces are distributed at various data hosting sites, and access to the traces is authorized only after the purpose and identity of researchers is reviewed and verified. To the best of our knowledge, the anonymization of traces is not customized to the needs of analysts and multiple versions of traces are not published.
Conclusion
We have described a publication framework which allows a trace owner to customize published traces in order to minimize information disclosure while provably meeting the utility of analysts. Using our framework, the trace owner can verify a number of useful privacy and utility properties statically. Such properties hold for all possible traces, and can be verified efficiently. However, some aspects of trace privacy must be evaluated on each particular trace. We have implemented our techniques and quantified trace privacy under example transformations. Our preliminary implementation suggests that our trace transformation operators can be applied efficiently. In future work we would like to implement a trace transformation infrastructure to efficiently support multiple transformations of a trace using indexing and parallel data access. We would also like to extend our framework to support additional transformation operators, and to pursue further analysis of collusion risks.
References
[1] Predict. https://www.predict.org/.
[2] Tcpdpriv. http://ita.eee.lbl.gov/html/contrib/tcpdpriv.html.
[3] Tcpurify. http://irg.cs.ohiou.edu/~eblanton/tcpurify.
[4] Anonymization of IP traffic monitoring data - attacks on two prefix-preserving anonymization schemes and some proposed remedies. In Proceedings of the Workshop on Privacy Enhancing Technologies, pages 179-196, May 2005.
[5] K. Claffy. Ten things lawyers should know about internet research. http://www.caida.org/publications/papers/2008/lawyers_top_ten/.
[6] S. Coull, C. Wright, F. Monrose, M. Collins, and M. K. Reiter. Playing devil's advocate: Inferring sensitive information from anonymized network traces. In Proceedings of the 14th Annual Network and Distributed System Security Symposium, pages 35-47, February 2007.
[7] S. E. Coull, C. V. Wright, A. D. Keromytis, F. Monrose, and M. K. Reiter. Taming the devil: Techniques for evaluating anonymized network data. In Proceedings of the 15th Network and Distributed System Security Symposium, February 2008.
[8] J. Fan, J. Xu, M. Ammar, and S. Moon. Prefix-preserving IP address anonymization: Measurement-based security evaluation and a new cryptography-based scheme. Computer Networks, 46(2):263-272, October 2004.
[9] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose, and D. Towsley. Measurement and classification of out-of-sequence packets in a tier-1 IP backbone. In Proceedings of IEEE INFOCOM, pages 1199-1209, April 2003.
[10] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose, and D. Towsley. Inferring TCP connection characteristics through passive measurements. In Proceedings of INFOCOM, 2004.
[11] H. Jiang and C. Dovrolis. Source-level IP packet bursts: causes and effects. ACM Internet Measurements Conference (IMC), 2003.
[12] D. Koukis, S. Antonatos, and K. Anagnostakis. On the privacy risks of publishing anonymized IP network traces. In Proceedings of Communications and Multimedia Security, pages 22-32, Oct 2006.
[13] K. Lakkaraju and A. Slagell. Evaluating the utility of anonymized network traces for intrusion detection. In 4th Annual SECURECOMM Conference, September 2008.
[14] J. C. Mogul and M. Arlitt. Sc2d: an alternative to trace anonymization. In MineNet 06: Proceedings of the 2006 SIGCOMM workshop on Mining network data, pages 323-328, New York, NY, USA, 2006.
[15] R. Pang, M. Allman, V. Paxson, and J. Lee. The devil and packet trace anonymization. ACM SIGCOMM Computer Communication Review, 36(1):29-38, January 2006.
Table 6: Utility Requirements for Different Real Network Analyses along with the supporting transformation
Formal Utility Requirements Study of TCP connection interarrivals Any (t) t.syn = (t).syn Any (t) t.ack = (t).ack Any (t1, t2) ((t1.ip1== t2.ip1) && (t1.ip2 ==t2.ip2) && (t1.pt1== t2.pt1)&& (t1.pt2 ==t2.pt2)) = ((t1).ip1== (t2).ip1 && (t1).ip2 ==(t2).ip2 && (t1).pt1== (t2).pt1 && (t1).pt2 ==(t2).pt2) Any(t1,t2) (t1.ts t2.ts) = ((t1).ts (t2).ts) Study of Packet-Bursts: (addition to above requirements) Same-Conn(t1,t2) (t1.seq no t2.seq no) = ((t1).seq no (t2).seq no) Same-Conn(t1,t2) (t1.seq no t2.seq no) = ((t1).seq no (t2).seq no) Opp-Pckts(t1,t2) (t1.seq no == t2.ack no) = ((t1).seq no = (t2).ack no) Classication of Out-Of-Sequence Packets Same-Conn(t1,t2) (t1.ipid == t2.ipid) = ((t1).ipid = (t2).ipid) Rest of the requirements for this case are exactly same as given in Section 2 Transform 1 = I{syn,ack} E{ip1,ip2,pt1,pt2} T{ts} 2 = I{syn,ack} T{ts} E{ip1,ip2,pt1,pt2} T{seq no,ack no}(C ) 3 = 2 I{window,dir} Eipid(C )
[16] R. Pang and V. Paxson. A high-level programming environment for packet trace anonymization and transformation. In Proceedings of ACM SIGCOMM 03, pages 339-351, New York, NY, USA, 2003.
[17] V. Paxson and S. Floyd. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3:226-244, 1995.
[18] B. Ribeiro, W. Chen, G. Miklau, and D. Towsley. Analyzing privacy in enterprise packet trace anonymization. In Proceedings of the 15th Network and Distributed Systems Security Symposium, 2008.
[19] A. Slagell and W. Yurcik. Sharing computer network logs for security and privacy: A motivation for new methodologies of anonymization. IEEE/CREATENET SecureComm, pages 80-89, 2005.
APPENDIX A
A.1
We provide some examples of real-world analyses along with their description and the utility constraints. It can be seen that the utility constraints are easily derivable from the description of the analysis with some background knowledge about the networks.

In [17], TCP connection interarrivals have been studied to determine whether they can be modeled correctly with Poisson modeling. This study requires the starting time (which can be relative) for each connection, which can be obtained from SYN/FIN packets.

In [11], the causes and effects of packet bursts from individual flows have been studied. A packet burst is identified by several packets sent back-to-back in a flow. Their impact on aggregate traffic is studied by observing the scaling behavior of the traffic in the trace over a range of timescales.

In [9], the classification of out-of-sequence packets is done. This has similar requirements as given in Table 1. The additional requirement imposed is the availability of the IP identifier with its equality information preserved across records in a connection.

The utility constraints and the composite transformations satisfying them are given in Table 6.
A.2
Given transformation $\varphi$ and constraint set $C$, checking the utility constraint satisfaction requires checking the satisfiability of each constraint (unary or binary) in $C$. Here we present the key concepts in understanding the verification of binary constraints.

Qualifying Conditions. The qualifying conditions in a constraint are the set of conditions given in the qualifier that must hold true for the records qualified for constraint satisfaction. These conditions provide the comparisons of the fields in these records. An example of a qualifier and its set of conditions is given in Table 1.
Table 7: Lookup table for binary constraint required for verifying utility of a transformation
qualier-set expression t1.a t2.a t1.a t2.a t1.a t2.a t1.a t2.a t1.a t2.a t1.a + t2.a t1.a t2.a t1.a/t2.a t1.a t1.a t2.b t1.a t2.b t1.a t2.b t1.a t2.b Q t1.a t2.b t1.a + t2.b t1.a t2.b t1.a/t2.b Q t1.a ! = t2.b t1.a == t2.b transformations OX (Y ) , TX (Y ) , IX , SX,k o E{ a}(Y ), a X , QY Q TX (Y ) , IX a X , QY Q IX aX IX aX SX,k , IX aX IX , a X OX (Y ) , TX (Y ) , IX , SX,k o o E{ E{ a}(Y ), b}(Y ), {a, b} X QY Q TX (Y ) , IX {a, b} X , QY Q IX {a, b} X , QY Q IZ {a, b} Z SX,k , IX {a, b} X OX (Y ) , TX (Y ) , IX , SX,k o o E{ E{ a}(Y ), b}(Y ), {a, b} X , QY Q
Definition 8. Given two sets of conditions $Q_1$ and $Q_2$ with $Q_1 \neq Q_2$, we say that $Q_1$ is relaxed as compared to $Q_2$ if any records which satisfy the conditions in $Q_2$ also satisfy the conditions in $Q_1$. $Q_1$ is relaxed as compared to $Q_2$ only if $Q_1 \subset Q_2$.

Grouping Conditions. Transformation operators like $T_{X(Y)}$, $O_{X(Y)}$ or $E_{X(Y)}$ divide the trace into groups of records, where each group agrees upon its values for the attributes in set $Y$. The transformed attributes $X$ can be compared across records only if these two records belong to the same group. Thus, the set $Y = \{y_1, ..., y_n\}$ imposes a grouping condition, given by

$$Q_Y = \{(t1.y_1 == t2.y_1), (t1.y_2 == t2.y_2), ..., (t1.y_n == t2.y_n)\}.$$

Any transformation with grouping condition $Q_Y$ can satisfy a constraint with qualifying condition $Q$ only if the records selected by $Q$ are a subset of the records grouped by $Q_Y$. This holds true if $Q_Y \subseteq Q$.

Verification Process. The general form of a binary constraint is

$$qualifier(t1, t2) \Rightarrow (expr(t1, t2) = expr(\varphi(t1), \varphi(t2))).$$

Let us assume that the binary constraint has $Q$ as the set of qualifying conditions and $\varphi$ is the transformation being checked. As with the unary constraints, we divide the expression $expr$ into atomic sub-expressions, where each sub-expression can be either a comparison or an arithmetic expression. We check the satisfiability of each sub-expression. For any sub-expression except those of type (t1.a == t2.a), we look up the entry for the sub-expression in Table 7, which lists the expressions with compatible transformation operators and the requisites. If the transformation $\varphi$ has a matching transformation in Table 7 and it satisfies the requisites mentioned, the sub-expression is satisfied and we move on to the next sub-expression. For sub-expressions of type (t1.a == t2.a), we find the set $X$ of attributes such that $\forall x \in X$, (t1.x == t2.x) is present in $expr(t1, t2)$. These sub-expressions are satisfied if the following holds true:
For any encryption function of type EY, or EY (Z ), in , (Y X == {}) must be true. In addition, either (Z X ) or QZ Q must hold. If satises all the sub-expressions in constraint, the constraint is satised. For any transformation of type TY (Z ) or OY (Z ) such that (Y X ! = {}) then either Z X or QZ Q.
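The grouping-condition requisite used in this verification process amounts to a subset test on sets of equality conditions. The sketch below illustrates this under an assumed encoding: a condition (t1.y == t2.y) is represented by the attribute name y, and the qualifier `SAME_CONN` is a hypothetical encoding of a same-connection qualifier.

```python
def grouping_condition(group_attrs):
    """Q_Y for grouping set Y, encoded as the set of attributes required equal."""
    return frozenset(group_attrs)


def grouping_compatible(group_attrs, qualifier):
    """True if Q_Y is a subset of Q, the requisite for grouped operators."""
    return grouping_condition(group_attrs) <= qualifier


def may_be_relaxed(q1, q2):
    """Necessary condition from Definition 8: Q1 is a proper subset of Q2."""
    return q1 < q2


# Hypothetical qualifier: both records belong to the same connection.
SAME_CONN = frozenset({"ip1", "ip2", "pt1", "pt2"})

print(grouping_compatible({"ip1", "ip2", "pt1", "pt2"}, SAME_CONN))
print(grouping_compatible({"ip1", "ip2", "pt1", "pt2", "ttl"}, SAME_CONN))
print(may_be_relaxed(frozenset({"ip1", "ip2"}), SAME_CONN))
```

In this encoding, grouping by the connection attributes is compatible with the same-connection qualifier, while grouping additionally by a further attribute is not, because records of one connection may then fall into different groups.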
A.3
In this section, we describe the properties of strictness relations. We begin with the proof of Lemma 1.

Lemma (Equivalence Class). For any network trace $N$ and transformation $\varphi$, the equivalence class containing $N$ (denoted by $e_{\varphi}(N)$) is the same as the inverse set $\varphi^{-1}[N]$.

Proof. Let us consider any trace $N'$ such that $N' \in e_{\varphi}(N)$, i.e. $N' \equiv_{\varphi} N$. Hence, using the definition of equivalence (Definition 4), we have $\varphi^{-1}[N'] = \varphi^{-1}[N]$. Since $N \in \varphi^{-1}[N]$, $N' \in \varphi^{-1}[N']$ and $\varphi^{-1}[N'] = \varphi^{-1}[N]$, we get $N' \in \varphi^{-1}[N]$. Thus, any trace $N' \equiv_{\varphi} N$ is present in the inverse set $\varphi^{-1}[N]$. Now, we show that any two traces $N_1$ and $N_2$ in $\varphi^{-1}[N]$ belong to the equivalence class of $N$. Both $N_1$ and $N_2$ have all the algebraic properties retained by $\varphi$ in $\varphi(N)$. Since $\varphi$ retains only these properties, the traces $\varphi(N_1)$ and $\varphi(N_2)$ will have the same properties as $\varphi(N)$. As a result, the inverse sets of $\varphi(N_1)$ and $\varphi(N_2)$ will be the same as the inverse set of $\varphi(N)$. Hence proved.

Definition 9 (Independent Transformations). Two transformations are independent of each other if they transform and project disjoint sets of attributes.

Lemma 2. If $\varphi_1 \preceq \varphi_2$ and $\varphi$ is a transformation independent of $\varphi_1$ and $\varphi_2$, then $\varphi_1 \circ \varphi \preceq \varphi_2 \circ \varphi$.
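Proof. It is given that $\varphi_1$ and $\varphi$ are independent of each other. Thus, if $X$ and $Y$ are the attribute sets transformed and projected by $\varphi_1$ and $\varphi$ respectively, then $X \cap Y = \emptyset$. If these two transforms are used to transform a network trace $N$, then $\varphi_1(N)$ does not retain any information about attribute set $Y$. As a result, $\varphi_1^{-1}[N]$ contains traces with all possibilities for $Y$ from its domain. However, $\varphi^{-1}[N]$ consists of traces with fewer possibilities for $Y$, but it has all the possibilities for $X$ from its domain. When we apply $\varphi_1 \circ \varphi$, it retains information about $X$ as well as $Y$, and it can be seen that

1. $(\varphi_1 \circ \varphi)^{-1}[N] \subseteq \varphi^{-1}[N]$
2. $(\varphi_1 \circ \varphi)^{-1}[N] \subseteq \varphi_1^{-1}[N]$
3. $(\varphi_1 \circ \varphi)^{-1}[N] = \varphi^{-1}[N] \cap \varphi_1^{-1}[N]$

Similar statements can be made about $\varphi_2$ because $\varphi_2$ and $\varphi$ are independent as well. Thus, we have

1. $(\varphi_1 \circ \varphi)^{-1}[N] = \varphi^{-1}[N] \cap \varphi_1^{-1}[N]$
2. $(\varphi_2 \circ \varphi)^{-1}[N] = \varphi^{-1}[N] \cap \varphi_2^{-1}[N]$
3. $\varphi_2^{-1}[N] \subseteq \varphi_1^{-1}[N]$ (since $\varphi_1 \preceq \varphi_2$)

From set theory, we know that if $A \subseteq B$ then $A \cap U \subseteq B \cap U$. Using this, we get

$$(\varphi_2 \circ \varphi)^{-1}[N] \subseteq (\varphi_1 \circ \varphi)^{-1}[N].$$

This is true for any network trace $N$. Hence, we have shown that $\varphi_1 \circ \varphi \preceq \varphi_2 \circ \varphi$.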
Corollary 2 (Transitive Relation). The strictness relation is transitive, i.e. if $\varphi_1 \preceq \varphi_2$ and $\varphi_2 \preceq \varphi_3$, then $\varphi_1 \preceq \varphi_3$.

Corollary 3 (Strictness Chain). If there exists a chain of strictness relations from $\varphi_2$ to $\varphi_1$ such that $\varphi_1 \preceq \varphi'_1 \preceq ... \preceq \varphi'_n \preceq \varphi_2$, then $\varphi_1 \preceq \varphi_2$.

The proofs of these corollaries are straightforward using the definition of strictness (Definition 6). Corollary 3 allows us to compute strictness relations from the known strictness relations.
Example. Assume that we have already computed the following strictness relations using the definition of strictness: $E_{X_1} \circ E_{X_2} \succeq E_{X_1 \cup X_2}$ and $T_Y \succeq O_Y$ for any $X_1$, $X_2$ and $Y$. Let us compare $E_{X_1} \circ E_{X_2} \circ T_Y$ and $E_{X_1 \cup X_2} \circ O_Y$:

$$E_{X_1} \circ E_{X_2} \circ T_Y \succeq E_{X_1 \cup X_2} \circ T_Y \succeq E_{X_1 \cup X_2} \circ O_Y$$

Thus, there exists a chain where each step has been obtained using the precomputed relations, Lemma 2 and the commutative property given in Corollary 1. Finally, Corollary 3 implies that $E_{X_1 \cup X_2} \circ O_Y \preceq E_{X_1} \circ E_{X_2} \circ T_Y$.

Now, we give a theorem which allows us to get the minimal set of strictness relations that must be pre-computed in order to compare any two transformations.

Lemma 3. If $\varphi_1, \varphi_2, ..., \varphi_n$ are composite transforms independent of each other and each $\varphi_i \preceq \varphi$, then $\varphi_1 \circ \varphi_2 \circ ... \circ \varphi_n \preceq \varphi$.

Proof. We provide an informal proof of this lemma as follows. As each $\varphi_i$ is stricter than $\varphi$, the information contained in each $\varphi_i$ is upper-bounded by the information in $\varphi$. Since the $\varphi_i$ are independent, each has information about a distinct set of attributes. As a result, the overall information contained in the $\varphi_i$ can be at most equal to the information in $\varphi$. Thus, the composition of the $\varphi_i$ is stricter than or equal to $\varphi$.

Theorem 3 (Basis Set of Strictness Relation). If $S$ is the set of basic operators, then we can derive a relation between any two comparable composite transforms (in normal form) using the basis set of strictness relations, which is given by the relations for the following pairs:

1. $\{\theta_{X(Y)}, \theta_{X(Y')}\}$ where $Y' \subset Y$, $\theta \in S$.
2. $\{\theta_{X(Y)}, \theta_{X'(Y)}\}$ where $X' \subset X$, $\theta \in S$.
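3. $\{\theta_{X(Y)}, \theta_{X_1(Y)} \circ \theta_{X_2(Y)}\}$ where $X = X_1 \cup X_2$, $\theta \in S$.
4. $\{\theta_{X(Y)}, \theta'_{X(Y)}\}$, $\theta, \theta' \in S$.

Proof. We need to prove that we can find a strictness chain for any two comparable composite transformations $\varphi_1$ and $\varphi_2$. Let $\varphi_1 \preceq \varphi_2$. Let $X_1$, $X_2$ be the sets of attributes transformed and projected by $\varphi_1$ and $\varphi_2$ respectively. First, we show that $X_1 \subseteq X_2$. If $X_1$ had some attribute which is not in $X_2$, then the transformation $\varphi_1$ would have at least some information about this attribute, which can be used to reduce the possible values from its domain. This is not possible in $\varphi_2$. Hence, for any network trace $N$, $\varphi_2^{-1}[N] \not\subseteq \varphi_1^{-1}[N]$. This is a contradiction as $\varphi_1 \preceq \varphi_2$. Hence, $X_1 \subseteq X_2$.

Case I: $\varphi_1 = \theta_{X_1(Y_1)}$, $\varphi_2 = \theta_{X_2(Y_2)}$ where $X_1 \subseteq X_2$. First, we observe that in order to have $\varphi_1 \preceq \varphi_2$, we must have $Y_2 \subseteq Y_1$. If $Y_1$ were a subset of $Y_2$, then $\varphi_1$ would have records grouped by $Y_1$, and these groups would be bigger than the groups of records formed by $Y_2$. Thus, $\varphi_1$ would retain more information. Hence, we cannot have $Y_1 \subset Y_2$. These two sets cannot be disjoint either, as that would make the operators incomparable. Thus, $Y_2 \subseteq Y_1$. From the basis set, we can compare $\varphi_2 = \theta_{X_2(Y_2)}$ with $\varphi_3 = \theta_{X_2(Y_1)}$ using a relation of type 1. This relation must imply that $\varphi_3 \preceq \varphi_2$ (since $\varphi_1 \preceq \varphi_2$ for any $X_1$, $Y_1$, $X_2$, $Y_2$ where $X_1 \subseteq X_2$ and $Y_2 \subseteq Y_1$, we must have $\varphi_3 \preceq \varphi_2$). Now, we can compare $\varphi_3$ with $\varphi_1$ using a relation of type 2 in the basis set. This relation must imply that $\varphi_1 \preceq \varphi_3$. Thus, we have a chain of comparisons $\varphi_1 \preceq \varphi_3 \preceq \varphi_2$ obtainable from the basis set of relations.

Case II: $\varphi_1 = \theta_{X_1(Y_1)}$, $\varphi_2 = \theta'_{X_2(Y_2)}$ where $X_1 \subseteq X_2$. From the basis set, we can compare $\varphi_1 = \theta_{X_1(Y_1)}$ with $\varphi_3 = \theta'_{X_1(Y_1)}$ using a relation of type 4. This relation must imply that $\varphi_1 \preceq \varphi_3$ (since $\varphi_1 \preceq \varphi_2$ for any $X_1$, $Y_1$, $X_2$, $Y_2$ where $X_1 \subseteq X_2$, we must have $\varphi_1 \preceq \varphi_3$). Now, we look at $\varphi_4 = \theta'_{X_2(Y_1)}$. We can compare $\varphi_4$ to $\varphi_3$ using a relation of type 2 in the basis set. As $X_1 \subseteq X_2$, we can argue that $\varphi_3 \preceq \varphi_4$. As in Case I, we can show that $Y_2 \subseteq Y_1$. Now, using a relation of type 1, we can compare $\varphi_4$ to $\varphi_2$. Also, it must imply $\varphi_4 \preceq \varphi_2$.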
Since Y2 Y1 implies that 2 must have more information than 4 . Thus, we can have chain of comparison from 2 to 1 using relations in basis set only. k 1 2 k Case III: 1 = X (Y ) , 2 = X X ... X where X 1 Xi 1 (Y1 ) 2 (Y2 ) k (Yk ) 1 2 k Let us consider 3 = X (Y ) X (Y ) ... X (Y ) where Xi = Xi X X . Using case I, we can get the chain
1 2 k
of relations from 2 to 3 such that 3 2 . Using relations of type 4 in the basis set, we can get a chain of relations 1 2 k from 3 to 4 = X ... X2 (Y ) Xk (Y ) such that 4 3 . Finally, relations of type 3 in the basis set allow us to 1 (Y ) get a chain of relations from 4 to 1 = Sk X (Y ) = X (Y ) . Thus, we have a chain of relations from 2 to 1 using 1 i relations in the basis set only.

m n 1 2 m 1 2 n Case IV: 1 = X1 (Y1 ) X2 (Y2 ) ... Xm (Ym ) , 2 = X X ... X where 1 Xi 1 Xi n (Yn ) 1 (Y1 ) 2 (Y2 )

Since 1 2 and 1 and 2 are such that each attribute is transformed by exactly one operator, the information i retained by each operator in 1 is unique and is less than the overall information in 1 . Thus, we have each X i (Yi ) i i 1 . Since 1 2 , we must have Xi (Yi ) 2 . Using the steps in Case III, we can prove that each Xi (Yi ) 2 i using relations from the basis set. Now, using Lemma 3, X 2 for i = 1 to m implies that i (Yi )
2 m 1 2 ... X X X 2 (Y2 ) m (Ym ) 1 (Y1 )
The above expression implies 1 2 . Since Case IV represents the most general case, we have shown that we can compare any two comparable composite transforms 1 and 2 using only relations present in the basis set.

In the following lemma, we list the minimal set of strictness relations among the basic operators in {E, O, T, S, I, }. These relations can be verified using the definition of strictness. Any pair of composite transformations can be compared using these relations.

Lemma 4. Strictness Relation among Basic Operators
1. X (Z ) X (Y ) if Y Z and {E, O, T, S }
3. EX (Y ) EX1 (Y ) EX2 (Y ) if X = X1 X2
4. EX (Y ) and EX (Y )
5. X (Y ) X (Y )
6. {} EX (Y ) OX (Y )
7. OX (Y ) SX
8. OX (Y ) TX (Y )
9. X (Y ) IX if {E, O, T, S }
10. X X if X X
11. X1 X2 = X where {, I }
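The chains used in the proof of Theorem 3 can be pictured as paths in a directed graph whose edges are the precomputed basis relations. The following is a minimal sketch under that assumption, not the procedure used in this paper: the function `find_chain` and the transform labels `t1` to `t4` are hypothetical, and an edge `(a, b)` simply records that a basis relation links `a` to `b` in the direction of the strictness order. A breadth-first search then recovers a chain between two transforms when one exists.

```python
from collections import deque


def find_chain(basis, start, goal):
    """Return a list of transforms linking start to goal, or None.

    basis is an iterable of (a, b) pairs, each standing for one
    precomputed basis relation from a to b (hypothetical encoding).
    """
    graph = {}
    for a, b in basis:
        graph.setdefault(a, []).append(b)

    queue = deque([[start]])  # each queue entry is a partial chain
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain: the pair is not comparable in this direction


# Hypothetical basis relations mirroring the shape of Case I:
# t1 is linked to t3 (type 2), and t3 is linked to t2 (type 1).
basis = [("t1", "t3"), ("t3", "t2"), ("t4", "t2")]
chain = find_chain(basis, "t1", "t2")
reverse = find_chain(basis, "t2", "t1")
```

In this toy instance `chain` links `t1` to `t2` through `t3`, while `reverse` is `None` because no basis relation leads back from `t2`.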
A.4 Proof of Theorem 2
The proof of Theorem 2 requires a small result, which we state here as a lemma. It says that there exists a least upper bound in the set of composite transforms for any two basic operators.

Lemma 5 (Most Secure Transform for Basic Operators). If X1 (Y1 ) and X2 (Y2 ) are two operators from the set {E, O, T, S, I }, then the most secure transform is the least upper bound X1 (Y1 ) X2 (Y2 ) , and it is given by = X1 (Y1 ) X2 (Y2 ) if (X1 X2 ) = {} = X1 X2 (Y ) if = EX (Y ) EX1 X (Y1 ) EX2 X (Y2 ) if = = E
, = E and (X1 X2 )! = {}
Theorem. (, ) forms a join-semilattice i.e. any two elements in have a least upper bound in . 1 , 2 (1 2 )
Proof. We will use the notation {, , , } for composite transforms and {, , } for basic operators. We say that two transforms are independent if they transform disjoint sets of attributes. We say that one transform is in conflict with another if there exists an attribute which is transformed by both.

Case I 1 = and 2 = 

For any two basic operators, there exists a least upper bound (lub), and it is given in Lemma 5.

Case II 1 and 2 are independent.

This is a trivial case. Since the transforms 1 , 2 act on two disjoint attribute sets, the information in one is unrelated to the information in the other. Thus, the lub is given by the composition 1 2 .

Case III 1 = and 2 = , and are independent but and are in conflict.

We show that there exists a lub for {( ), } in and is given by = ( ) . From Case II, we know that is lub of and ( ). Since, ( ) is lub of and . Hence, is lub of { , , }. Also, anything greater than and is also greater than . Thus, ( ) and = implies ( ). Using this with , we get lub(( ), ). Thus, is an upper bound of 1 and 2 . Let us say be some upper bound of {( ), }. Hence, ( ) ( ) and ( ). Also, . We conclude that , and . The first two conclusions imply that ( ). Since and transform disjoint sets, ( ) and implies (( ) ). Hence, we have shown that any upper bound for {1 , 2 } is greater than which itself is an upper bound. Hence, is the lub.

Case IV 1 = 1 2 ... n and 2 = . in conflict with more than one i .

Let us begin with computing = 1 . From Lemma 5, we can observe that there can be at most one basic operator in which transforms attributes of only. Other basic operators in contain attributes of 1 . Hence, there is at most one operator in which can conflict with 2 as it can conflict with an operator containing attributes of only. Using Case III, there exists lub 2 . We can continue to get = ((...( 1 ) 2 )...n ). Now, we prove that is the lub of {, 1 ... n }. Let there be some upper bound of {, (1 ... n )}. Then, we have (1 ... n ) which implies 1 , 2 and so on. We have 1 ( 1 ) 2 ) ( 1 ) 2 (
(...( 1 ) 2 ... n ) = Thus, any upper bound is greater than . Also, can be shown to be an upper bound. Hence, is the least upper bound.

Case V 1 = 1 2 .... n and 2 = 1 2 .... m .

Using Case IV, we can compute (1 1 ). Similarly, we can compute ((1 1 ) 2 ). We can continue to get = (..(1 1 ) 2 )... m ). Using similar steps as in Case IV, we can prove that any upper bound of {1 , 2 } is also an upper bound of . Also, we can show that itself is an upper bound of 1 and 2 . Thus, is the lub.

Hence, we have proved that there exists a least upper bound in for 1 and 2 in . In addition, the proof outlines how it can be obtained.
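Cases IV and V obtain the least upper bound of two composite transforms by folding in one basic operator at a time. The sketch below illustrates only that folding pattern on a deliberately simplified toy model, which is an assumption made for illustration and not the operator algebra of this paper: a transform is a dictionary from attribute name to an integer strictness level (a missing attribute counts as level 0), the order `leq` compares levels attribute by attribute, and the pairwise `join` takes the per-attribute maximum. The attribute names and levels are hypothetical.

```python
from functools import reduce


def leq(a, b):
    """Toy order: a is below b if every level in a is at most b's level."""
    return all(level <= b.get(attr, 0) for attr, level in a.items())


def join(a, b):
    """Pairwise least upper bound in the toy model: per-attribute maximum.

    For independent transforms (disjoint attributes) this is just their
    union, matching the composition used in Case II.
    """
    return {attr: max(a.get(attr, 0), b.get(attr, 0))
            for attr in a.keys() | b.keys()}


def join_all(components):
    """Fold the pairwise join over a sequence of basic components."""
    return reduce(join, components, {})


# Two composite transforms, each given as a list of basic components.
rho1 = [{"src_ip": 2}, {"dst_port": 1}]
rho2 = [{"src_ip": 1, "dst_ip": 3}]  # conflicts with rho1 on src_ip

lub = join_all(rho1 + rho2)

# Any other upper bound of rho1 and rho2 must sit above lub.
other_upper = {"src_ip": 3, "dst_port": 1, "dst_ip": 3, "ttl": 1}
```

In this toy model the fold can take the components in any order, because the per-attribute maximum is associative and commutative. The proof above instead handles the conflicting operators explicitly case by case.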