Data Insights' first master's thesis

A few days ago, one of our youngest teammates, Saurabh, presented his master's thesis at the Technical University of Munich (TUM) here in Germany. This was a great achievement for him, for the Data Insights team, and for me as his supervisor. Saurabh was selected as a master's student for the Data Insights master's thesis program in November 2017, and I was his supervisor for about nine months of this very intriguing adventure. The topic of his thesis was the encryption of the topic files in Apache Kafka. We began by downloading the Apache Kafka repository, which, like all open-source projects, is available to everyone.

While planning our approach, we agreed on some important points about the structure of the software that we were going to write (we called the project EnDec: Encrypter – Decrypter):

  1. EnDec had to be a plug-in that can be added to the Apache Kafka software without any change to the core code.
  2. The “activation” of the code must be done from an external standard preference file (the standard preference file of Apache Kafka).
  3. EnDec must be usable with any custom-made external encryption algorithm that the user prefers to use (or write).
  4. EnDec would not take care of managing the keys used for encryption or decryption.
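
Points 2 and 3 can be illustrated with a minimal sketch. It is not the code from the thesis: the interface `EnDecAlgorithm`, the loader `EnDecLoader`, the toy class `ReverseBytes` and the property name `endec.algorithm.class` are all hypothetical names chosen for this example. The idea is only that a class name read from the configuration decides which algorithm is used, and that a missing property leaves the plug-in inactive.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical interface that every pluggable algorithm implements.
interface EnDecAlgorithm {
    byte[] encrypt(byte[] plain);
    byte[] decrypt(byte[] encrypted);
}

// Toy placeholder so the sketch runs; reversing bytes is NOT encryption.
final class ReverseBytes implements EnDecAlgorithm {
    @Override
    public byte[] encrypt(byte[] plain) {
        return reverse(plain);
    }

    @Override
    public byte[] decrypt(byte[] encrypted) {
        return reverse(encrypted);
    }

    private static byte[] reverse(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return out;
    }
}

final class EnDecLoader {
    // Hypothetical property name, not taken from the EnDec repository.
    static final String ALGORITHM_CLASS = "endec.algorithm.class";

    // Returns empty when the property is absent, i.e. the plug-in is not activated.
    static Optional<EnDecAlgorithm> load(Map<String, ?> configs) {
        Object className = configs.get(ALGORITHM_CLASS);
        if (className == null) {
            return Optional.empty();
        }
        try {
            // Instantiate the user-supplied class through its no-argument constructor.
            EnDecAlgorithm algorithm = Class.forName(className.toString())
                    .asSubclass(EnDecAlgorithm.class)
                    .getDeclaredConstructor()
                    .newInstance();
            return Optional.of(algorithm);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load algorithm " + className, e);
        }
    }
}
```

In Kafka itself, a map like this reaches a serializer through its `configure(Map<String, ?> configs, boolean isKey)` method, so a loader of this kind could plausibly be called from there.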

Some of the points were really challenging and required repeated tests and further research. It was imperative that the code be clear, concise and, above all, completely “transparent” to the standard functioning of core Kafka. The main trick was to deploy the EnDec class as an extension of the core Serde class. Once that was done, we had our back door open and were able to develop all the EnDec logic without any interference with the functioning of core Kafka. We wrote an interface to standardize the way encryption and decryption are applied, and we applied it to the message after serialization and before deserialization. Because the management of the keys was not the objective of the software, we relied on HashiCorp Vault for the storage and exchange of the encryption and decryption keys.
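
The ordering described above (encrypt after serialization, decrypt before deserialization) can be sketched as follows. This is an illustration under stated assumptions, not the thesis code. `MiniSerializer` and `MiniDeserializer` are stand-ins that mirror the `serialize(topic, data)` and `deserialize(topic, data)` methods of Kafka's own interfaces so that the example runs without the Kafka library. `EnDecAlgorithm`, `AesGcm`, `EncryptingSerializer` and `DecryptingDeserializer` are hypothetical names. The sketch wraps an existing serializer rather than extending Serde, and it receives the key through a constructor, whereas in the project the keys came from Vault.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Stand-ins for Kafka's Serializer/Deserializer, reduced to one method each.
interface MiniSerializer<T> { byte[] serialize(String topic, T data); }
interface MiniDeserializer<T> { T deserialize(String topic, byte[] data); }

// Hypothetical interface standardizing how encryption and decryption are applied.
interface EnDecAlgorithm {
    byte[] encrypt(byte[] plain);
    byte[] decrypt(byte[] encrypted);
}

// Example algorithm: AES-GCM with a fresh random IV prepended to each message.
final class AesGcm implements EnDecAlgorithm {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;
    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    AesGcm(byte[] rawKey) {
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    @Override
    public byte[] encrypt(byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] body = cipher.doFinal(plain);
            // Output layout: IV followed by ciphertext and authentication tag.
            byte[] out = Arrays.copyOf(iv, iv.length + body.length);
            System.arraycopy(body, 0, out, iv.length, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    @Override
    public byte[] decrypt(byte[] encrypted) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, encrypted, 0, IV_BYTES));
            return cipher.doFinal(encrypted, IV_BYTES, encrypted.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Decryption failed", e);
        }
    }
}

// Serializes with the wrapped serializer first, then encrypts the resulting bytes.
final class EncryptingSerializer<T> implements MiniSerializer<T> {
    private final MiniSerializer<T> inner;
    private final EnDecAlgorithm algorithm;

    EncryptingSerializer(MiniSerializer<T> inner, EnDecAlgorithm algorithm) {
        this.inner = inner;
        this.algorithm = algorithm;
    }

    @Override
    public byte[] serialize(String topic, T data) {
        byte[] plain = inner.serialize(topic, data);
        return plain == null ? null : algorithm.encrypt(plain);
    }
}

// Decrypts the stored bytes first, then hands them to the wrapped deserializer.
final class DecryptingDeserializer<T> implements MiniDeserializer<T> {
    private final MiniDeserializer<T> inner;
    private final EnDecAlgorithm algorithm;

    DecryptingDeserializer(MiniDeserializer<T> inner, EnDecAlgorithm algorithm) {
        this.inner = inner;
        this.algorithm = algorithm;
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        byte[] plain = data == null ? null : algorithm.decrypt(data);
        return inner.deserialize(topic, plain);
    }
}
```

Because the wrappers only see byte arrays, the inner serializer and the algorithm can be swapped independently, which matches the requirement that users can bring their own encryption class.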

It took several months and several reviews to finalize the project, but in the end we had a good product on our hands. The benchmark tests showed that our plug-in added negligible latency to core Kafka, which is famous for its impressive performance. The only significant lag came, of course, from the external encryption and decryption classes and their algorithms.

After nine months of hard work, our teammate presented his master's thesis at TUM on the 1st of August. It was an extremely satisfying success for Data Insights, the first in its field, and also a good reason to be personally proud of my master's student, who did a great job.

The link to our GitHub repository: https://github.com/DataInsightsGmbH/thesis-kafka-endec/tree/master

The figure below describes the EnDec interface and its interaction with the existing Kafka components.