Submitted By,
JOSNA KRISHNA
S7 CSE
ROLL No.:35
•INTRODUCTION
•SENSITIVE DATA IN COMPANIES
•DATA LEAKAGE: HOW?
•DANGER
•TOWARDS SECURITY
•EXISTING SYSTEM
•PROPOSED SYSTEM
•INTO THE ALGORITHMS
•CONCLUSION
DATA LEAKAGE:
Data leakage is the unauthorized transmission of sensitive data or information from within an organization to an external destination.
SENSITIVE DATA IN COMPANIES:
•Intellectual property
•Financial information
•Patient information
•Personal credit card data
•Other information, depending on the business and the industry
DATA LEAKAGE: HOW?
•In the course of business, data must be handed over to trusted third parties for some operations.
•Sometimes these trusted third parties act as points of data leakage.
•Data leakage happens mainly because of human error.
•A hospital may give patient records to researchers who will devise new treatments.
•A company may have partnerships with other companies that require sharing customer data.
•An enterprise may outsource its data processing, so data must be given to various other companies.
DANGER:
•The number of leaked sensitive data records has grown 10 times in recent years.
•The risk of accidental data leakage exceeds the risk posed by vulnerable software.
•Sensitive data leaks more often where there is no end-to-end encryption (example: PGP, Pretty Good Privacy).
TOWARDS SECURITY:
•Prevent direct access to sensitive data in clear text.
•Deploy a screening tool to:
-Scan computer file systems.
-Scan server storage.
-Inspect outbound network traffic.
•Data leak detection differs from antivirus (AV) and network intrusion detection systems (NIDS).
Data leak detection introduces:
-New security requirements
-Algorithmic challenges
Algorithmic Challenges:
-Data transformation
-Scalability
•Automata-based string matching cannot be used directly.
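The data transformation challenge can be illustrated with a small sketch. It is only an illustration, not part of the presented system: the function name `exact_match` and the sample strings are made up for this example. An exact string search finds an unmodified copy of the sensitive data but misses a copy whose separators were changed.

```python
def exact_match(pattern, content):
    """Plain exact string search, standing in for direct string matching."""
    return pattern in content

# Hypothetical sensitive record and two outbound messages.
SENSITIVE = "ACCT-7731-SECRET"
original_copy = "memo: ACCT-7731-SECRET attached"
transformed_copy = "memo: ACCT 7731 SECRET attached"  # dashes replaced by spaces

print(exact_match(SENSITIVE, original_copy))     # unmodified leak is found
print(exact_match(SENSITIVE, transformed_copy))  # transformed leak is missed
```

Even this trivial transformation defeats exact matching, which is why a similarity-based comparison is needed.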
EXISTING SYSTEM:
The existing system is based on set intersection.
The operation is performed on two sets of n-grams: one from the content and one from the sensitive data.
This method is used to detect similar documents in:
•The web.
•Shared malicious traffic patterns.
•Malware.
•E-mail spam.
Existing data leak detection tools:
•Symantec DLP
•Identity Finder
•Global Velocity
•GoCloud DLP, etc.
Drawbacks:
•Set intersection is orderless: the ordering of shared n-grams is not analyzed.
•It generates false alerts when n is set to a small value.
•It cannot detect partial data leaks.
•It is therefore not an adequate method.
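The set-intersection check and its order-insensitivity can be sketched as follows. This is a minimal illustration under assumptions: the helper names `ngrams` and `intersection_rate`, the use of character n-grams, and the sample strings are chosen for this example and are not taken from any of the tools listed.

```python
def ngrams(text, n):
    """Return the set of all character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def intersection_rate(sensitive, content, n=2):
    """Fraction of the sensitive n-grams that also occur in the content."""
    s = ngrams(sensitive, n)
    c = ngrams(content, n)
    return len(s & c) / len(s) if s else 0.0

sensitive = "abcd"
real_leak = "xxabcdyy"     # contains the sensitive string
scrambled = "cd-bc-ab"     # same 2-grams, but "abcd" never appears

print(intersection_rate(sensitive, real_leak))   # 1.0
print(intersection_rate(sensitive, scrambled))   # 1.0, a false alert
```

Both inputs get the same score because only membership of n-grams is compared, not their order, which is the source of false alerts for small n.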
PROPOSED SYSTEM:
The proposed system uses a sequence alignment algorithm.
It is executed on:
•The sampled sensitive data sequence.
•The sampled content being inspected.
The alignment yields the amount of sensitive data contained in the content.
Higher detection accuracy is achieved.
The scalability issue is solved by sampling both the sensitive data and the content sequence before aligning them.
A pair of algorithms is used:
•Comparable Sampling Algorithm
•Sampling Oblivious Alignment Algorithm
Together they provide:
•High detection specificity.
•Tolerance of pervasive and localized modifications.
o The Comparable Sampling Algorithm yields the same samples of a sequence wherever the sampling starts and ends.
o The Sampling Oblivious Alignment Algorithm infers the similarity between the original unsampled sequences from their samples, using dynamic programming.
•In this method, both the sensitive data and the content sequence are sampled.
•The alignment is performed on the sampled sequences.
•The sampling has the 'comparable sampling' property.
•Both algorithms run faster on a GPU than on a CPU.
•This promises high-speed security scanning.
INTO THE ALGORITHMS
Requirements:
Definition 1: A substring is a consecutive segment of the original string.
Definition 2: A subsequence does not require its items to be consecutive in the original string.
Definition 3: Given a string x that is a substring of y, comparable sampling on x and y yields x' and y' such that x' is similar to a substring of y'.
Definition 4: Given x as a substring of y, a subsequence-preserving sampling on x and y yields two subsequences x' and y' such that x' is a substring of y'.
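Definitions 1 and 2 can be made concrete with two small checks. The function names `is_substring` and `is_subsequence` are illustrative choices, and sequences are represented here as Python lists.

```python
def is_substring(x, y):
    """True if x occurs in y as one consecutive segment (Definition 1)."""
    n, m = len(x), len(y)
    return any(list(y[i:i + n]) == list(x) for i in range(m - n + 1))

def is_subsequence(x, y):
    """True if the items of x occur in y in order, gaps allowed (Definition 2)."""
    it = iter(y)
    # 'item in it' consumes the iterator, so the order of y is respected.
    return all(item in it for item in x)

y = [5, 1, 9, 8, 5]
print(is_substring([1, 9, 8], y))    # consecutive in y
print(is_substring([1, 8], y))       # not consecutive in y
print(is_subsequence([1, 8], y))     # in order, with a gap
print(is_subsequence([8, 1], y))     # wrong order
```

Every substring is also a subsequence, but not the other way round, which is why Definition 4 is the stronger requirement on a sampling method.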
Comparable Sampling Algorithm:
•It is deterministic and subsequence-preserving.
•It is unbiased.
•It yields the same samples of a sequence wherever the sampling starts and ends.
Input: an array S of items, a size |w| for a sliding window w, and a selection function f(w, N) that selects the N smallest items from a window w, i.e., f = min(w, N)
Output: a sampled array T
initialize T as an empty array of size |S|
w ← read(S, |w|)
let w.head and w.tail be the indices in S corresponding to the higher-indexed end and lower-indexed end of w, respectively
collection mc ← min(w, N)
while w is within the boundary of S do
    mp ← mc
    move w toward high index by 1
    mc ← min(w, N)
    if mc ≠ mp then
        item en ← collectionDiff(mc, mp)
        item eo ← collectionDiff(mp, mc)
        if en < eo then
            write value en to T at w.head's position
        else
            write value eo to T at w.tail's position
        end if
    end if
end while
Example: the sampling procedure uses a sliding window of size 6 (i.e., |w| = 6) and N = 3. The input sequence is 1,5,1,9,8,5,3,2,4,8. The initial window is w = [1,5,1,9,8,5] and the collection is mc = {1,1,5}.
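The sampling procedure and the example above can be sketched in Python as follows. This is a minimal, unoptimized sketch and not the authors' implementation: the names `smallest` and `comparable_sample`, the use of `None` for unsampled positions, and re-sorting the window at every step are assumptions made for illustration. It also assumes that an item leaving the collection is recorded at the index it occupied in S, which is one reading of "w.tail's position".

```python
from collections import Counter

def smallest(window, n):
    """Multiset of the n smallest items in the window (the min(w, N) selection)."""
    return Counter(sorted(window)[:n])

def comparable_sample(S, w_size, N):
    """Slide a window over S and keep only the items that enter or leave
    the collection of the N smallest items. Unsampled positions stay None."""
    T = [None] * len(S)
    if len(S) < w_size:
        return T
    tail, head = 0, w_size - 1            # inclusive window bounds in S
    mc = smallest(S[tail:head + 1], N)
    while head + 1 < len(S):              # the window can still move right
        mp = mc
        tail, head = tail + 1, head + 1
        mc = smallest(S[tail:head + 1], N)
        if mc != mp:
            en = next((mc - mp).elements())   # item that joined the collection
            eo = next((mp - mc).elements())   # item that left the collection
            if en < eo:
                T[head] = en              # the new item sits at the window head
            else:
                T[tail - 1] = eo          # assumption: eo just slid out at the old tail
    return T

print(comparable_sample([1, 5, 1, 9, 8, 5, 3, 2, 4, 8], 6, 3))
```

Under these assumptions the sketch keeps the two 1s and the 2 from the example input and leaves every other position unsampled. A production version would maintain the N smallest items incrementally instead of sorting each window.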
•The complexity of sampling with the selection function is O(n log|w|) or O(n), where n is the size of the input and |w| is the size of the window.
•The factor O(log|w|) comes from maintaining the smallest N items within the window.
Sampling Oblivious Alignment Algorithm:
Requirements:
•The algorithm runs on compact sampled sequences L.
•Extra fields are kept for the scoring matrix cells in dynamic programming.
•An extra step in the recurrence relation updates the null region.
•A complex weight function computes the similarity between two null regions.
The alignment provides:
•Order-aware comparison
•High tolerance to pattern variation
•Capability of detecting partial leaks
•Consistency
Input: a weight function fw; the visited cells in the H matrix that are adjacent to H(i, j), namely H(i−1, j−1), H(i, j−1), and H(i−1, j); and the i-th and j-th items La_i and Lb_j of the two sampled sequences La and Lb, respectively.
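The input above has the same dependency pattern as standard local sequence alignment: each cell H(i, j) is computed from H(i−1, j−1), H(i, j−1), and H(i−1, j). The sketch below is plain Smith-Waterman local alignment, not the sampling oblivious algorithm itself: it has no null regions and no extra cell fields, and it replaces the weight function fw with fixed scores (`match`, `mismatch`, `gap`) that are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # scoring matrix, first row/column are 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap          # item of a aligned to a gap
            left = H[i][j - 1] + gap        # item of b aligned to a gap
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

sensitive = "abcd"
print(local_alignment_score(sensitive, "xxabcdyy"))  # full leak
print(local_alignment_score(sensitive, "xxabyy"))    # partial leak
print(local_alignment_score(sensitive, "cd-bc-ab"))  # same 2-grams, wrong order
```

Because the score depends on the order of the items, the scrambled content scores lower than the full leak, and the partial leak still produces a proportionally smaller non-zero score.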
CONCLUSION:
•This work presents a content inspection technique for detecting sensitive data leaks.
•The detection approach is based on aligning two sampled sequences for similarity comparison.
•The alignment method is useful for common data leak scenarios.
Fast detection of transformed data leaks[mithun_p_c]
