Traditional relational databases have dominated the data model for the past decades. However, NoSQL (Not only SQL), as well as graph databases have started attracting people’s attention in the last few years. In this blog, we’ll introduce a small internal application exploring data insights within Data Insights based on one of the most prominent Graph Databases — Neo4j.
Backstory
At the end of last year (2020), Data Insights hosted the first internal data competition. The scenario was proposed by our colleague Marin (who can always be depended on for inspiring ideas). This time, Marin suggested exploring and developing some data-driven applications based on the data we have on our internal Slack channels.
Therefore, my colleagues Christian, Mario, and I started brainstorming; which data-driven applications could be useful, and to which problems could they be applied? We immediately thought of the tedious quarterly updating of personal CVs, because sometimes we simply forget what we have been doing for the last quarter. Another potential application could be when we feel stuck with some very specific technical problems, but we don’t know which person to approach.
To solve these annoyances, we decided to build a small prototype of the central Data Insights Database based on our internal Slack messages. The proposed solution enables us to have:
- A clear picture of each person’s profile
- A big picture about how our profiles are linked with one another
- Identified domain experts for various technologies
- Rough ideas about what we have done over the past time (weeks, months, quarters, even years)
- The buzzwords we are talking about
This lays out the rough proposal of our internal project LinkeDIn (Data Insights’ version of LinkedIn). In this blog post, I would like to explain how we approached this problem, the tech stacks we used, focusing on Graph Database (not so surprising because of the title of this blog post :p).
Graph Database
What is Graph Database?
Traditionally, data are usually stored within a row-column table in a relational database. This means that a data record is a row in the database table, and this table could contain multiple attributes. However, it is wasteful and chaotic to store everything in just one table. Usually, therefore, there will be multiple tables for different categories (for example customers, orders, items, etc.). Under this architecture, if we would like to ask some questions (for example, which customers have bought which items), we need to join different tables (customer and items) to retrieve the results.
However, the idea of a graph database is different. Graph databases focus more on the links between data. In other words, how the data are connected. Therefore, in a graph database, the data are stored as Nodes (e.g., Customers, Items, Orders, etc.) and Relationships (e.g., BOUGHT, ORDERED, etc.), which aim at capturing how the data are linked. Besides, we can also add Properties for the nodes, for example, first_name and last_name for the Customer. Of course, we could also assign some Constraints to the relational database, to state which properties must exist.
Why Graph Database?
Now that we have a rough idea about what a graph database looks like, the next question is: why do we need a graph database? As mentioned above, to query the result of a complex question in a relational database, we might have to do multiple joins of tables or index lookups. This kind of operation could be quite expensive. However, in a graph database (for example, Neo4j), the database engine just needs to follow the pointers to navigate the data.
Besides, a graph database can answer the reversed question as fast as the non-reversed ones, which is usually not the case in a relational database. For example, “who bought the items?” or alternately “which items have been bought by whom?”
When we are talking about query latency (how long the query takes to run) in a relational database, this is highly dependent on the size of the data. On the other hand, in a graph database, it is proportional to the length of the traversed path in the graph.
Unlike a relational database, where the schema needs to be explicitly defined when creating the table, a graph database is schema-less. The constraints only need to be added as needed as the data grows. Since constraints are served as a schema, this makes schemas in graph databases more flexible.
As previously mentioned, the value of graph databases lies in the links between nodes, which includes indirect connections as well. To explore the indirect connections, we simply follow the paths in the graph database. This characteristic brings some astonishing advantages, for example, recognizing the hidden value of the patterns (people may have the same interests as the friends of friends).
All these intriguing properties provide a basis for multiple graph data science applications. These include using a machine learning approach for fraud detection, classification, pattern recognition, recommendation systems, graph embeddings, and so on.
Workflow
After a (short) introduction to the graph database, we are going back to our data-driven application. Let’s grasp a big picture of the workflow first, which includes 5 steps:
- Slack Scrapper: scrap data from slack channels
- Data Preprocessing: extract and prepare the data we are going to use
- Name Entity Recognition (NER): retrieve the terms that we are interested in from the data
- Graph Database: insert the data which includes nodes and relationships to Neo4j Database
- Cypher Query: use the Cypher query to analyze the data from Neo4j Database
Slack Data
Now we are familiar with the graph database, and we have also seen the workflow of the application, but what about the data? Which data should we collect, to solve the problems which we addressed? Well, luckily, in Data Insights, usually everybody writes down their weekly update within the Slackbot thread. And this is exactly the data we intend to extract. For example:
And the raw data in JSON format from the Slack channel looks this:
After preprocessing, mapping the user name, and extracting the messages we are interested in, we get the processed data resembling the following:
Then we feed this processed data into our NLP pipeline to get name entities. We also insert the nodes and the corresponding relationships between nodes into our graph in Neo4j. You would like to know more about what kind of nodes and relationships we insert into our graph? Here it comes!
MetaGraph
In our implementation, we choose Graph Database (Neo4j) as our data model. Now, we are going to illustrate our graph and how we are able to analyze the data within our graph.
As the wisdom goes, understand your data before you are going to do further analysis. So let’s take a look at the schema of our data. In Graph Database, a schema is the meta graph, which is shown below:
// Show meta-graph
CALL db.schema.visualization()
From this meta graph, we can observe the nodes and the relationships between them. For example:
- (Author) -[WROTE]-> (Message)
- (Message) -> [CONTAINS] -> (Tech)
- (Author) -[MENTIONED] -> (Person), etc.
Query Graph — Cypher
You probably also noticed that I use some weird notation to describe the relationships between entities. This might look a bit weird at the first glance, but still, it’s quite comprehensible without further explanation, right?
This is exactly the basic idea of Cypher, the equivalent of SQL for graph databases — straightforward and declarative when one wishes to analyze data in a graph database. Let us go through some scenarios.
Personal Profile
Say you would like to get a certain pattern from your graph database. For example, to get a certain personal profile (in this case, mine), you can use the MATCH
clause. The syntax will look like the following:
MATCH (a:AUTHOR {name:"hsiaoching"}) --> (t:tech)
RETURN a, t
You just need to specify the pattern you would like to observe. In this case, the author (hsiaoching), and the technical terms she’s mentioning. And voilà , here comes the answer after you ask your question in Cypher language. Apparently, I talk a lot about Amazon Web Services and SQL. 👀
How our personal profiles are linked with each other
Now I would like to know which technology terms are mentioned by our LinkeDIn colleagues, so this time I use the WITH
clause to specify the authors. Let’s say I want the authors to include christian.p and Mario. Now the syntax looks like this:
MATCH (author:AUTHOR) -[]- (item) WHERE item:tech and author.name in ["hsiaoching", "mario", "christian.p"] RETURN author, item
Apparently, Christian (Paul) and I were both mentioning Amazon Web Services and Confluence in our weekly update. And Mario and I were doing Testing and SQL at some point in the past weeks. 🤔
Domain Expert
Now, let’s say, I had some questions about Ab Initio, and I’d like to know which internal domain expert I should approach. So this time, instead of naming the specific author, I am specifying the technology term as Ab Initio.
MATCH (a:AUTHOR) --> (tech:tech {name: "Ab Initio"})
RETURN a, tech
It looks like we have a lot of experts who are equipped with the knowledge of Ab Initio. But of course, if we approach Pavel (the leader of our Ab Initio team), we should always be able to get our questions solved! 💡
Then what if the problems I have are regarding Amazon Web services?
MATCH (a:AUTHOR) --> (tech:tech {name: "Amazon Web Services"})
RETURN a, tech
This time, Pasquale, Socrates, and Evan might be the right person for me to discuss! 🤓
Let’s see what Pasquale talks about over the years
So far so good? Then let’s play around with some more complex examples. Let’s say we would like to know the buzzwords Pasquale (the CEO and Founder of Data Insights) talks about over the years.
Since, in our data model, the name of the Message node is the timestamp (the exact timestamp when the Author wrote this message), we need to parse the timestamp string to get the year. In Cypher, we can approach the similarly as with other programming languages, splitting the timestamp string (split
), and then converting it to integer (toInteger
). After that, we convert this integer to the datetime type (datetime
) and then retrieve the year of this timestamp.
You might notice that the MATCH
clause is now a bit more complex because we would like to retrieve buzzwords. These are words that not only Pasquale mentioned, but also technical terms that other people were using in the same year. We use a <> a2
syntax to make sure, when we are checking the message from the author (a
), we will only consider the messages from authors a2
other than a
.
We illustrate the code snippets below. Pasquale’s buzzwords from 2018 to 2021 are shown further below.
MATCH (a:AUTHOR {name:"Pasquale"})-->(msg:MESSAGE)-->(tech:tech)<--(msg2:MESSAGE)<--(a2:AUTHOR)
WITH a, a2, msg, msg2, tech,
toInteger(split(msg.name, ".")[0]) AS ts,
toInteger(split(msg2.name, ".")[0]) AS ts2
WHERE a <> a2
AND datetime({epochSeconds:ts}).year = 2018 // 2019, 2020, 2021
AND datetime({epochSeconds:ts2}).year = 2018 // 2019, 2020, 2021
RETURN a, tech, COUNT(tech) AS cnt
ORDER BY cnt DESC
LIMIT 10
2018
2019
2020
2021
Here, you may notice that there are always 32 counts between Pasquale and Amazon Web Services. The reason is that here the graph shows the nodes Author and Amazon Web Services, and the relationship between them is MENTIONED. This relationship does not depend on the timestamp in our data model (as mentioned before, only the node Message contains the information of timestamp). Here we avoid showing Message nodes in the graph on purpose in order to have a clean illustration.
As you can see, the cnt
from the table is actually from count(tech) AS cnt
. This code snippet is doing cartesian join for each message from the specified year between Pasquale and other authors (except Pasquale). It’s a rough approximation enabling us to grasp the tech terms people were writing about within the specified year.
Buzzwords within Data Insights
We could also ask a similar question about Data Insight’s buzzwords using Cypher. After seeing the previous examples, I believe now this syntax should be really straightforward for you.
MATCH (author:AUTHOR)-[e1]-(item)-[e2]-(author2:AUTHOR)
WHERE e1.count > 2 AND e2.count > 2 and item:tech
RETURN author, item, author2
It looks like, at Data Insights, we indeed talk a lot about Big Data buzzwords, such as BigData, Amazon Web Services, Ab Initio, DevOps, Databricks, Testing, Docker, Jenkins, etc. 📈
Summary
In this blog post, we have explained how to store data in a graph database and how to analyze these data within the Neo4j by using Cypher. Based on the characteristic of a graph database, tons of graph data science approaches can be applied to explore more relations between different entities. I hope you enjoyed reading this article, and feel free to let us know if you have any feedback 🙂