Technical Deep Dive
Radek Dymacz
Co-founder and CTO
Twitter: @kazoup @ChroniclesOfRD
Demo: https://demo.kazoup.com
Kazoup software.
Provides analytics, enterprise search and policy-based data management.
Delivered as a virtual appliance behind the corporate network.
Discovers and indexes large volumes of unstructured file data.
Overview.
Files
Currently supported file stores are Samba shares, including Windows and Linux.
Kazoup Appliance
All indexed data is stored locally on the Kazoup appliance, inside the customer network. It is accessible via a secure web interface providing search, analytics and archive services.
Customer infrastructure
Cloud Object Storage
When archiving, data leaving the customer network is de-duplicated, compressed at file level, and encrypted in transit and at rest.
Encrypted files are stored with supported object storage providers: AWS, Azure, Google, CenturyLink and private object storage.
AES 256 Envelope Data Encryption
SSL Connection over WAN
Kazoup virtual appliance.
Docker
Elasticsearch, Postgres, RabbitMQ, Celery, Tika, Natural Language Processing
Backend App (Python)
API (REST)
Frontend (Polymer + D3.js)
Preconfigured Ubuntu virtual appliance for VMware and Hyper-V with Docker service
Analytics.
Elasticsearch
Data Aggregations
D3.js
Metadata Content
Web Interface
Local Archive Files
Index
Web Frontend
Presentation
Analytics Engine
Data
NLP
Backend App
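The "Elasticsearch Data Aggregations" label above can be illustrated with a small request body. This is only a sketch: the field names `extension` and `size` are assumptions taken from the extracted-metadata list, and the deck does not show the queries the appliance actually runs.

```python
def extension_usage_aggregation(max_buckets=10):
    """Build an Elasticsearch request body that groups files by extension
    and sums their sizes. Field names are hypothetical."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "by_extension": {
                "terms": {"field": "extension", "size": max_buckets},
                "aggs": {
                    "total_size": {"sum": {"field": "size"}},
                },
            }
        },
    }
```

A frontend such as the D3.js one named above could plot each `by_extension` bucket against its `total_size` value.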
Discovery and indexing.
Elasticsearch
Discovery Worker: Recursively walks directories, finds files in the data source and passes them to the meta scan queue.
Meta Scan Worker: Extracts file metadata and updates the Elasticsearch document.
Checksum Worker: Finds files that have not yet been checksummed, computes the checksum and passes them to the queue for content extraction with Tika.
Tika Worker: Extracts content plus some additional metadata based on the file type, updates Elasticsearch and passes the text to the NLP queue.
NLP Worker: Runs Natural Language Processing (NLP) on the extracted text, performs Named Entity Recognition and updates the Elasticsearch document.
Scan phases: DISCOVERY AND METADATA SCAN; CHECKSUM AND CONTENT EXTRACTION SCAN
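The first two workers can be sketched in a few lines of Python. This is a minimal illustration, not the appliance's code: a `deque` stands in for the RabbitMQ queue, a plain dict stands in for the Elasticsearch index, and the function and field names are assumptions.

```python
import mimetypes
import os
from collections import deque


def discover(root):
    """Discovery step: recursively walk `root` and yield every file path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def meta_scan(path):
    """Meta scan step: build a metadata document for one file."""
    st = os.stat(path)
    mime, _encoding = mimetypes.guess_type(path)
    return {
        "name": os.path.basename(path),
        "location": os.path.dirname(path),
        "size": st.st_size,
        "access_time": st.st_atime,
        "modification_time": st.st_mtime,
        "mime_type": mime,  # None when the type cannot be guessed
        "extension": os.path.splitext(path)[1].lower(),
    }


def run_scan(root):
    """Chain both steps through an in-memory queue (stand-in for RabbitMQ)."""
    meta_scan_queue = deque(discover(root))
    index = {}  # stand-in for Elasticsearch: path -> document
    while meta_scan_queue:
        path = meta_scan_queue.popleft()
        index[path] = meta_scan(path)
    return index
```

Calling `run_scan("/mnt/share")` on a mounted share (a hypothetical path) would return one metadata document per file found.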
Extracted Metadata
● Name
● Location
● Size
● Access Time
● Modification Time
● Created Time
● MIME Type
● Extension
● Category
● SMB Extended Attributes
● AD ACL
Extracted Content.
● Checksum
● Content raw text
● Tika Metadata
● Language
● Named Entity Extraction
○ Location
○ Organisation
○ Places
○ Person
○ Money
○ Percent
○ Date
○ Time
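The checksum listed under extracted content could be computed along these lines. The deck does not name the hash algorithm, so SHA-256 here is an assumption, as are the function names and the `checksum` field.

```python
import hashlib


def file_checksum(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a binary stream, read in chunks
    so that large files never have to fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_checksum(doc):
    """True for index documents that have no checksum yet (hypothetical field)."""
    return not doc.get("checksum")
```

A worker would typically call `file_checksum(open(path, "rb"))` only for documents where `needs_checksum` is true.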
Search.
Elasticsearch
AD ACL
AD Authentication
User search
Search supports Active Directory integration. It allows users to log on to the web interface with their AD credentials. All search results are scoped to the AD permissions of the logged-on user. The web interface is optimised for desktop and mobile access.
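One common way to scope results to a user's permissions is to filter on the identifiers stored with each document's ACL. The sketch below builds such an Elasticsearch request body; the field names `name`, `content` and `acl_read`, and the use of SIDs as identifiers, are assumptions rather than details from the deck.

```python
def scoped_search_body(text, user_sids):
    """Build an Elasticsearch query body that matches `text` and only returns
    documents readable by at least one of the user's SIDs (hypothetical field)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": ["name", "content"]}}
                ],
                # Filters restrict the result set without affecting scoring.
                "filter": [{"terms": {"acl_read": sorted(user_sids)}}],
            }
        }
    }
```

Putting the permission check in `filter` rather than `must` keeps relevance scoring based on the search text alone.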
Deep Dive into Archive.
Elasticsearch
Archive Job
Archive Workers: Checksum, Encrypt, Multipart upload, Validate Checksum, Compress, Deduplicate
Files
Cloud or Private Object Storage
C7A2BE207FE44286AB12F22DFBA360E3
818705DB0AB94DD5B284336E9DAE39D4
AB76D89A01BB4C388F821A3E763AE44F
….
SSL Connection over WAN
AES 256 Data Encryption
ZLIB Compression
Kazoup archive has a policy-based engine with two types of policy. A mirror policy copies files but does not delete them from the original location, so files exist both locally and in the cloud. An archive policy copies files and deletes them from the original location once they have been archived successfully. You can mix and match archive and mirror policies.
The Archive Job runs daily, finds all files that match a given policy and sends them to the archive workers. The workers make sure files are checksummed before applying envelope encryption and compression. Once that is complete, a multipart upload transfers the data to the chosen object storage service. After a successful upload the file checksum is validated and the file is marked as sent to archive. Once the daily encrypted backup of the Kazoup appliance finishes, the files are marked as archived.
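The difference between the two policy types, together with de-duplication, compression and checksum validation, can be sketched as follows. This is an illustration under stated assumptions: a dict stands in for the object store and the multipart upload, SHA-256 is an assumed checksum, the envelope encryption step is left out, and all names are hypothetical.

```python
import hashlib
import zlib


def archive_file(path, data, policy, store, seen_checksums):
    """Process one file under a 'mirror' or 'archive' policy.

    Returns True if the local copy should be deleted afterwards.
    """
    if policy not in ("mirror", "archive"):
        raise ValueError(f"unknown policy: {policy}")
    checksum = hashlib.sha256(data).hexdigest()
    if checksum not in seen_checksums:
        # New content: compress it. Envelope encryption would also be
        # applied before upload; it is omitted from this sketch.
        compressed = zlib.compress(data, 6)
        store[checksum] = compressed  # stand-in for the multipart upload
        # Validate: the stored bytes must restore to the original checksum.
        restored = zlib.decompress(store[checksum])
        if hashlib.sha256(restored).hexdigest() != checksum:
            raise IOError(f"checksum mismatch after upload: {path}")
        seen_checksums.add(checksum)
    # Mirror keeps the local file; archive removes it once stored safely.
    return policy == "archive"
```

Identical content is uploaded once regardless of how many paths reference it, which is the effect de-duplication is meant to have.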
Encryption is important.
Kazoup Appliance
Cloud Object Storage
Data
Master Key
Key Generator
Envelope Key
Encryption
Encryption
Encrypted Data
Encrypted Envelope Key
Cloud stored object.
● Name - random, based on UUID, e.g. C7A2BE207FE44286AB12F22DFBA360E3
● Metadata encrypted with master key
○ x-amz-meta-x-kazoup-iv
○ x-amz-meta-x-kazoup-compression-level
○ x-amz-meta-x-kazoup-original-location
○ x-amz-meta-x-kazoup-envelope-key
○ x-amz-meta-x-kazoup-master-key-hash
○ x-amz-meta-x-kazoup-compression
● Content encrypted with envelope key
● Object Storage Provider metadata
○ size
○ last modified
○ object storage class
○ etc
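The object layout above can be sketched as a name plus a header map. The deck only says the name is random and UUID-based, so using a version 4 UUID is an assumption, although the example name is consistent with one. The function names, default values and placeholder inputs are hypothetical; in the real object the metadata values would already be encrypted with the master key.

```python
import uuid


def object_name():
    """Random object name: a UUID4 rendered as 32 uppercase hex characters."""
    return uuid.uuid4().hex.upper()


def object_metadata(iv, envelope_key, master_key_hash, original_location,
                    compression="zlib", compression_level=6):
    """Map the metadata fields listed above to their x-amz-meta-* headers.
    Header values are strings, so the level is converted explicitly."""
    prefix = "x-amz-meta-x-kazoup-"
    return {
        prefix + "iv": iv,
        prefix + "compression-level": str(compression_level),
        prefix + "original-location": original_location,
        prefix + "envelope-key": envelope_key,
        prefix + "master-key-hash": master_key_hash,
        prefix + "compression": compression,
    }
```

Because the stored name is random, the original path is only recoverable from the `original-location` metadata entry, not from the object key itself.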
AWS example.
Business Continuity.
When file archiving is enabled, the appliance configuration and index are encrypted and backed up daily to the Cloud Object Storage service. These can be easily recovered to a fresh Kazoup appliance installation in a disaster recovery (DR) scenario.
If the entire customer site or the appliance is lost, the software can be reinstalled from backups at the existing customer site, at a new customer location or even in the cloud, providing access to the archived files.
When the original file data source is not accessible, all archived data is still available via the appliance.
Appliance updates
● The appliance sends usage statistics every hour for billing purposes
○ appliance software version
○ size of analysed data
○ no sensitive information (such as file or folder names, content or ACLs) is stored or transferred to the Kazoup backend
● Automatic updates are pulled over HTTPS and do not require opening any additional ports
Integrated support.
● The appliance web frontend contains a built-in support chat application
● A remote session can be initiated by the client via the appliance console
○ gives access to the appliance behind the firewall for troubleshooting and diagnosis, without the need to open any additional ports
○ can only be initiated by the client
Kazoup Security.
Index data is stored inside the customer network
Any data leaving the customer network is encrypted in transit with SSL and at rest with AES-256 envelope encryption
Appliance updates are pulled over HTTPS and do not require opening any additional ports
Search integrates with Active Directory ACLs
Remote support can only be initiated by the client
Technology roadmap
Cloud Archive
Cloud data sources: OneDrive, Dropbox, Box, Google Drive, Slack, Gmail
Desktop and mobile search integration
Additional data sources: Office 365, SharePoint, Egnyte, Salesforce
Cloud SaaS version
We are here
Automated document classification, automatic anomaly detection
Speech API and audio to text
Natural Language API, content classification
Translation, distributed ML (leveraging spare compute on users' workstations for analytics)
Machine learning with TensorFlow, Vision API, OCR and image content analysis
We are strong believers in open source, and our software wouldn't be possible without the following technologies that help us make Kazoup happen:
Elasticsearch - powers our analytics and search engine.
Google Polymer and Web Components - the future of web development is here and we are using it.
Docker - our development, staging, QA, production, deployment and automatic updates hero.
RabbitMQ - robust messaging for our application.
HTTP/2 - delivers our application at speed even on high-latency mobile networks.
Envelope Encryption - keeps customer data safe and allows us to rotate the encryption key at any time.
Machine Learning - helps us make sense of large volumes of data.
The Twelve-Factor App and TDD - guide our development process.
GitHub/CircleCI - Continuous Integration and Deployment made right.
Intercom - helps us communicate directly with our users.
AWS - provides our backend infrastructure.
Python and Go Programming Language - make us happy.
Slack - helps us communicate internally.
Intelligent file data management.


Kazoup software appliance - A technical deep dive
