Security and Privacy in a Big Data World
Dr. Flavio Villanustre, CISSP, LexisNexis Risk Solutions
VP of Information Security & Lead for the HPCC Systems open source initiative
28 January 2013
But what is Big Data?
•   Gartner told us that it’s defined by its four dimensions:
       •   Volume
       •   Velocity
       •   Variety
       •   Complexity
•   Driven by the proliferation of social media, sensors, the
    Internet of Things and the like (a lot of the latter)
•   Became accessible thanks to open source distributed data
    intensive platforms (for example, the open source
    LexisNexis HPCC Systems platform and Hadoop)
•   Made popular by consumer-oriented services such as
    recommendation systems and search engines


Big Data platforms: key design principles

•   Distributed local store
    • Move algorithm to the data – exploit locality
•   Many data problems are embarrassingly parallel
    • Leverage massive resource aggregation:
       •   Thousands of execution cores
       •   Hundreds of disk controllers
       •   Hundreds of network interfaces
       •   Terabytes of memory and massive memory bandwidth
•   Storage is cheap
•   Moving data into the system takes time (hence keep the
    data around, if possible)
•   It’s fine (and encouraged) to perform iterative exploration
•   In the end, it is all about the data
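
The "move algorithm to the data" principle can be sketched in a few lines. This is a minimal illustration, not how HPCC Systems or Hadoop implement it: the partitions, the `state` field, and the function names are hypothetical, and a thread pool stands in for separate nodes. Each partition is summarized where it lives, and only the small summaries are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_count(partition):
    """Runs next to the partition and returns only a small summary."""
    return Counter(record["state"] for record in partition)


def distributed_count(partitions):
    """Summarize every partition in parallel, then merge the summaries."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_count, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data: three partitions standing in for three nodes.
PARTITIONS = [
    [{"state": "GA"}, {"state": "FL"}],
    [{"state": "GA"}],
    [{"state": "NY"}, {"state": "GA"}],
]

if __name__ == "__main__":
    print(distributed_count(PARTITIONS))
```

The work is embarrassingly parallel because no partition needs to see another one; only the per-partition counters travel.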
A timeline of the main open source Big Data platforms

•   Late 1990s/early 2000s: LexisNexis designs the HPCC Systems
    platform to meet its own Big Data needs
•   2001: The first few systems are sold to companies and
    organizations
•   2004: Google’s MapReduce paper is published
•   2007: First release of Hadoop (designed after Google’s
    MapReduce ideas)
•   2008: First Hadoop Summit
•   2011: The LN HPCC Systems platform is officially released as
    an open source project
•   2012: These platforms gain mainstream adoption
Just when we thought that we knew data security…
Big Data is not your dad’s data
   a.   More data sources (beware! data + data > 2 * information)
   b.   Boiling the ocean is now within reach
   c.   Public clouds offer scalability but introduce risks
   d.   Distributed data stores can blur boundaries
   e.   Leveraging diverse skills implies more people accessing the data

Old tricks may not work
   a.   Tokenization only goes so far
   b.   Encryption at rest just protects against misplaced hardware
   c.   Tracking data provenance is hard
   d.   Enforcing data access controls can quickly get unwieldy
   e.   Conveying policy information across multiple systems is hard



The ever present challenges

•   Security
    •   Keeping the bad guys out
    •   Making sure the good guys are good and stay good
    •   Preventing mistakes
    •   Disposing of unnecessary/expired data
    •   Enforcing “least privilege” and “need to know”
•   Privacy
    • Statistics safer than aggregates
    • Aggregates safer than tokenized samples
    • Tokenized samples safer than individuals
    • Don’t underestimate the power of de-anonymization
    • Mistakes in privacy are irreversible
•   Security <> Privacy
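
One way to make the privacy ordering above measurable is a k-anonymity check: count how many records share each combination of quasi-identifiers and look at the smallest group. The sketch below is illustrative only; the records, field names, and function names are hypothetical and not taken from the talk.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination covers at least k records."""
    return smallest_group(records, quasi_identifiers) >= k


# Hypothetical tokenized sample: names removed, quasi-identifiers still present.
RECORDS = [
    {"zip": "30301", "birth_year": 1970, "sex": "F", "diagnosis": "A"},
    {"zip": "30301", "birth_year": 1970, "sex": "F", "diagnosis": "B"},
    {"zip": "30302", "birth_year": 1985, "sex": "M", "diagnosis": "C"},
]

if __name__ == "__main__":
    fields = ["zip", "birth_year", "sex"]
    print(smallest_group(RECORDS, fields), is_k_anonymous(RECORDS, fields, 2))
```

A smallest group of one means at least one person is unique in the sample, which is exactly the situation de-anonymization exploits.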

Be wary of “tokenization in a box”

•   Tokenized dataset * Tokenized dataset ~= identifiable data

•   The problem of eliminating inference is NP-complete

•   Several examples in the last decade:

    •   The “Netflix case”

    •   The “Hospital discharge data case”

    •   The “AOL case”
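
The first bullet on this slide, that combining tokenized datasets can yield identifiable data, can be illustrated with a toy linkage attack. This is a sketch under stated assumptions: the two datasets, the names in them, and the `link` function are invented for illustration and do not reproduce any of the cases listed above.

```python
def link(tokenized, public, keys):
    """Join two datasets on shared quasi-identifiers and keep unique matches."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for row in tokenized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            matches.append({**row, "name": candidates[0]["name"]})
    return matches


# Hypothetical tokenized dataset: the name was replaced by a token.
TOKENIZED = [
    {"token": "t1", "zip": "30301", "birth_date": "1970-01-02", "sex": "F",
     "diagnosis": "asthma"},
    {"token": "t2", "zip": "30302", "birth_date": "1985-06-07", "sex": "M",
     "diagnosis": "diabetes"},
]

# Hypothetical public dataset that still carries names.
PUBLIC = [
    {"name": "Alice Example", "zip": "30301", "birth_date": "1970-01-02", "sex": "F"},
    {"name": "Bob Example", "zip": "30302", "birth_date": "1985-06-07", "sex": "M"},
    {"name": "Carl Example", "zip": "30302", "birth_date": "1985-06-07", "sex": "M"},
]

if __name__ == "__main__":
    for match in link(TOKENIZED, PUBLIC, ["zip", "birth_date", "sex"]):
        print(match["name"], match["diagnosis"])
```

Here token `t1` is re-identified because only one public record shares its quasi-identifiers, while `t2` stays ambiguous between two candidates.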


Common sense to the rescue
•   Track data provenance, permissible use and expiration
    through metadata (data labels and Relationship Context
    Metadata, RCM)
•   Enforce fine-grained access controls through code/data
    encapsulation (data access wrappers)
•   Utilize homogeneous platforms that allow for end-to-end
    policy preservation and enforcement
•   Deploy (and properly configure) network- and host-based
    Data Loss Prevention
•   A comprehensive data governance process is king
•   Use proven controls (homomorphic encryption and private
    information retrieval (PIR) are, so far, only theoretical concepts)
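
The first two recommendations, labels that carry provenance, permissible use and expiration, and a wrapper that is the only way to reach the data, can be sketched as follows. This is a minimal illustration and not an HPCC Systems API: `DataLabel`, `LabeledDataset`, the purpose strings, and the feed name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataLabel:
    """Metadata that travels with the data (hypothetical structure)."""
    source: str                # provenance
    permitted_uses: frozenset  # permissible use
    expires: date              # expiration


class LabeledDataset:
    """Data access wrapper: callers never touch the rows directly."""

    def __init__(self, rows, label):
        self._rows = list(rows)
        self.label = label

    def read(self, purpose, today):
        """Return a copy of the rows only for a permitted, unexpired use."""
        if today >= self.label.expires:
            raise PermissionError("data has expired and should be destroyed")
        if purpose not in self.label.permitted_uses:
            raise PermissionError(f"use '{purpose}' is not permitted")
        return list(self._rows)

    def derive(self, rows):
        """Derived data inherits the label, so the policy travels with it."""
        return LabeledDataset(rows, self.label)


if __name__ == "__main__":
    label = DataLabel("hypothetical_feed", frozenset({"fraud_detection"}),
                      date(2014, 1, 1))
    dataset = LabeledDataset([{"id": 1}], label)
    print(dataset.read("fraud_detection", date(2013, 1, 28)))
```

Passing `today` explicitly keeps the check easy to test; a real wrapper would more likely read the clock itself and log every access.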


And always keep in mind that…


• Access to the hardware ~= Access to the data
   •   Encryption at rest only mitigates the risk for the isolated
       hard drive, but NOT if the decryption key goes with it
   •   Compromise of a running system is NOT mitigated by
       encryption of data at rest

• Virtualization may increase efficiency, but…
   •   In virtual environments, whoever has access to your
       VMM/hypervisor also has access to your data
   •   Cross-VM side-channel attacks are not just theoretical



Remember


•   Data exhibits its own form of quantum entanglement:
    once a copy is compromised, all other copies instantly are

•   The closer to the source the [filtering, grouping,
    tokenization] is applied, the lower the risk

•   YCLWYDH! You can’t lose what you don’t have (data
    destruction)
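
Applying filtering and tokenization close to the source might look like the sketch below, which drops unneeded fields and replaces sensitive ones with a keyed HMAC before a record ever enters the platform. The field names, the key, and the `ingest` function are assumptions made for illustration; the talk does not prescribe a tokenization scheme, and as noted earlier, tokenization alone does not prevent re-identification through linkage.

```python
import hashlib
import hmac


def tokenize(value, key):
    """Replace a value with a keyed, deterministic token (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def ingest(record, key, sensitive_fields, dropped_fields):
    """Filter and tokenize a record at the source, before it is stored."""
    out = {}
    for field, value in record.items():
        if field in dropped_fields:
            continue  # you can't lose what you don't have
        out[field] = tokenize(value, key) if field in sensitive_fields else value
    return out


if __name__ == "__main__":
    # Hypothetical record and key; a real key would come from a key manager.
    raw = {"ssn": "000-00-0000", "name": "Alice Example", "state": "GA"}
    print(ingest(raw, b"demo-key", sensitive_fields={"ssn"}, dropped_fields={"name"}))
```

Because the token is deterministic for a given key, records can still be joined on it downstream without exposing the original value.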




Useful Links
•   LexisNexis Risk Solutions: http://lexisnexis.com/risk
•   LexisNexis Open Source HPCC Systems Platform: http://hpccsystems.com
•   Cross-VM side-channel attacks:
    http://blog.cryptographyengineering.com/2012/10/attack-of-week-cross-vm-timing-attacks.html
•   K-Anonymity: a model for protecting privacy:
    http://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity.pdf
•   Robust de-anonymization of Large Sparse Datasets (the Netflix case):
    http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
•   Broken Promises of Privacy: Responding to the surprising failure of
    anonymization:
    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
•   Tamper detection and Relationship Context Metadata:
    http://blogs.gartner.com/ian-glazer/2011/08/19/follow-up-from-catalyst-2011-tamper-detection-and-relationship-context-metadata/

                                                          The HPCC Systems Platform   11
Questions?




             Email: info@hpccsystems.com



Global bigdata conf_01282013
