Becoming Data-Driven Part 3 – Data Tools


I wrote about the sources of the data we will need for our analytics. These data exist in different systems, internal or external to your organisation. It is advisable to have a dedicated platform for your analytics data. This platform may have one or more data tools, and I will refer to it as your data tool ecosystem. Analytics can consume a lot of system resources (memory, hard disk and central processing units, or CPUs), and you want to avoid impacting the performance of an operational system.

So you have your vision for your company, and you have your data. What is your technology vision? That vision is less about specific products and more about the capabilities the technologies you need must have. You need to understand your data and when it becomes stale; this will enable you to decide whether data must be received and processed from source in real time, hourly, daily, weekly, monthly and so on. Is your data going to grow very large (“big data”)? It is essential to note the critical characteristics of your data to help you choose the right tool.

Choosing and reviewing your tools should be done periodically, so set up a process for it. ThoughtWorks [1] has a tool, the Technology Radar, that the architectural team can use to drive the data technology vision. It has four stages: hold, assess, trial and adopt. There are three further states within the stages: new, moved in/out and no change.


Before choosing your data tools, you need to decide on your data storage; this depends on where your source data is stored and the type of computation your analytics require.

Data can be moved from one data system to another using a process known as Extract, Transform and Load (ETL). Moving and transforming data introduces points of potential failure and adds latency to the availability of the data. With memory becoming cheaper and new technologies emerging and improving, organisations are looking to reduce or eliminate this problem.
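To make the three ETL steps concrete, here is a minimal sketch in Python. The source rows, field names and the SQLite target are all made up for illustration; a real pipeline would extract from files or operational systems and load into your analytics datastore.

```python
import sqlite3

def extract():
    # Stand-in for pulling raw rows from a source system or file export.
    return [
        {"customer": " Alice ", "amount": "120.50"},
        {"customer": "bob", "amount": "80.00"},
    ]

def transform(rows):
    # Normalise names and convert amounts from text to numbers.
    return [
        (row["customer"].strip().title(), float(row["amount"]))
        for row in rows
    ]

def load(rows, conn):
    # Load the cleaned rows into the analytics datastore.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 200.5
```

Each step is a separate function on purpose: a failure in transform or load is a distinct point of failure, which is exactly the latency and fragility the paragraph above describes.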


Data virtualization

Data virtualization integrates data from disparate sources without moving the data. This can provide a single customer view without the users knowing where the data is stored. It eliminates the need for projects to create new systems, and the data is available as soon as it exists in the source systems. One disadvantage is that no relationships are defined between data from the different sources.


Data federation

Data federation is similar to data virtualization, but a standard data model defines the relationships between entities.

In-Memory computing

As the name suggests, in-memory computing uses random access memory to store all the data required for analysis, and processing is performed in memory.

In-database analytics

In-database analytics allows the processing of data within the database.
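To illustrate the idea, here is a small sketch using Python's built-in sqlite3 module as a stand-in database; the table and figures are invented. The point is that the aggregation runs inside the database engine and only the small result crosses back into application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 75.0)],
)

# The GROUP BY aggregation is executed by the database engine itself,
# not by looping over extracted rows in application code.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 75.0)]
```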


When creating a data platform, you may need one or more tools to help achieve your goal. We will discuss the tools on the market in this section. These tools use the microservice architecture, which I will discuss in a future article.

When you get data from different sources, there are different ways to store it. One of the most common and widely used is the relational database. Always look at the use case of the database you choose to ensure that it is the right choice.

Types of databases

Relational database

Relational databases store data in two-dimensional structures called tables. Tables can be related to other tables using inbuilt relational concepts.

Properties of relational databases: 

  • Data is structured
  • Table structure must exist before attempting to load data
  • ACID properties: atomicity, consistency, isolation and durability
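The properties above can be seen in a short sketch, again using sqlite3 as a stand-in relational database (the tables and values are made up): the table structure exists before any load, the two inserts commit atomically, and the foreign key relates the tables so a join recovers the link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure must exist before data is loaded.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,"
    " amount REAL, FOREIGN KEY (customer_id) REFERENCES customers (id))"
)

# Atomicity: both inserts commit together, or neither does.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (1, 1, 42.0)")

# The relational concept in action: a join across the related tables.
row = conn.execute(
    "SELECT c.name, o.amount FROM customers c"
    " JOIN orders o ON o.customer_id = c.id"
).fetchone()
print(row)  # ('Alice', 42.0)
```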

Examples of relational databases:

Graph database

A graph database uses graph structures for semantic queries with nodes, edges and properties to represent and store data.[2]
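A toy illustration of that data model, with invented people and relationships: nodes carry properties, edges carry labels, and a "semantic" query walks the edges rather than joining tables. Real graph databases persist this structure and add a query language; plain Python structures only show the shape.

```python
# Nodes with properties, and labelled edges between them.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "acme": {"type": "company"},
}
edges = [
    ("alice", "knows", "bob"),
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
]

def colleagues(person):
    # Semantic query: who works at the same company as this person?
    companies = {dst for src, rel, dst in edges
                 if src == person and rel == "works_at"}
    return sorted(src for src, rel, dst in edges
                  if rel == "works_at" and dst in companies and src != person)

print(colleagues("alice"))  # ['bob']
```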

Examples of graph databases:

Document Stores

A document store database is a non-relational database that stores its data in JSON-like documents [3] [4].

Properties of document stores:

  • High-performance data persistence
  • Replication & Failover
  • Outward scaling
  • Multiple storage engines
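A minimal sketch of the document model, with invented records: each document is self-contained and JSON-like, documents in the same collection need not share a schema, and queries match on fields rather than joining tables.

```python
import json

# A "collection" of JSON-like documents; note the two documents do not
# share the same set of fields.
collection = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "orders": [{"sku": "A1", "qty": 2}]},
]

def find(coll, **criteria):
    # Match documents whose fields equal the given criteria.
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob")[0]["orders"][0]["qty"])  # 2

# Documents serialise directly to JSON for storage or transport.
print(json.dumps(find(collection, name="Alice")[0]))
```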

Examples of document store database:

Columnar database

The columnar database is optimised for retrieving columns of data quickly and is excellent for analytical workloads. Columnar storage reduces overall disk I/O because only the columns a query needs are loaded from disk.
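The difference between the two layouts can be sketched in a few lines (the data is invented): in the columnar layout each column is stored contiguously, so an aggregate over one column only touches that column's values.

```python
# Row-oriented layout: a query must read whole rows.
rows = [
    {"id": 1, "region": "north", "amount": 100.0},
    {"id": 2, "region": "south", "amount": 75.0},
    {"id": 3, "region": "north", "amount": 50.0},
]

# Column-oriented layout: each column stored contiguously.
columns = {
    "id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "amount": [100.0, 75.0, 50.0],
}

# The analytical query "total amount" only touches the amount column,
# which is where the disk I/O saving comes from on a real columnar store.
print(sum(columns["amount"]))  # 225.0
```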

Examples of columnar databases [5]:

Key-Value database

A key-value database is a non-relational database that uses the key-value method to store data.
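The key-value method reduces every operation to get, put and delete on an opaque key. A toy in-memory store, with an invented session key, shows the interface; real key-value databases add persistence, replication and expiry on top of it.

```python
class KeyValueStore:
    """A dict-backed sketch of the key-value interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["milk"]})
print(store.get("session:42")["user"])  # alice
store.delete("session:42")
print(store.get("session:42"))  # None
```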

Examples of key-value databases:


Now that you have a new database, how do you periodically extract your data from the different internal and external sources into it?

As discussed above, we use the ETL process. Let us look at some tools that will allow you to extract the data from source systems or files, transform and normalise it, and load it into your datastore, ready for consumption by internal and external consumers.

Example ETL Tools:

Organisations use ETL processing patterns when the following conditions are met:

  1. The business does not require the data in real time.
  2. A batch processing workload is the best fit.


What if you want your data in near real time or real time? When an event happens, you want to predict its impact. Some use cases include:

  • Stock prices
  • Economic news
  • Weather news
  • Recommender systems 
  • Fraud detection
  • Decouple and scale microservices
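The producer/consumer shape behind these use cases can be sketched with a thread-safe queue standing in for a broker (the payment events and the fraud threshold are invented): the producer emits events, and the consumer reacts to each one as it arrives rather than waiting for a batch.

```python
import queue
import threading

events = queue.Queue()   # stand-in for a streaming broker topic
processed = []

def consumer():
    # React to each event as it arrives, e.g. flag it for fraud review.
    while True:
        event = events.get()
        if event is None:  # sentinel: stop consuming
            break
        processed.append({"event": event, "flagged": event["amount"] > 1000})

worker = threading.Thread(target=consumer)
worker.start()

# Producer side: events stream in one at a time.
events.put({"type": "payment", "amount": 1500})
events.put({"type": "payment", "amount": 20})
events.put(None)
worker.join()

print([e["flagged"] for e in processed])  # [True, False]
```

Because producer and consumer only share the queue, either side can be scaled or replaced independently, which is the decoupling benefit listed above.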

Example event streaming tools:

Master data management systems


Master data management systems are useful when you need to standardise your data. A good example is when you collect data from different organisations and each uses a different name for the same product. This can be challenging to manage in code: whenever a change is required, you have to go through your organisation's change process to get it into production.

Master data management allows business owners to change their data as and when needed with a history of changes available.

Master data management tools also allow you to build, govern and manage data quality, data hierarchies and a business glossary.
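A toy master-data record makes the idea concrete (the source system codes and product names are made up): each source's name maps to one canonical name, and when a business owner changes the canonical value, the old one is kept in a history.

```python
# One master record for a product known by different names in two
# hypothetical source systems, SUP-A and SUP-B.
master = {
    "canonical": "Whole Milk 1L",
    "aliases": {"SUP-A": "Milk Whole 1 Litre", "SUP-B": "1L WHL MILK"},
    "history": [],
}

def resolve(source, name):
    # Map a source system's product name to the canonical master name.
    if master["aliases"].get(source) == name:
        return master["canonical"]
    return None

def update_canonical(new_name, changed_by):
    # Business owners change master data; the old value is kept in history.
    master["history"].append((master["canonical"], changed_by))
    master["canonical"] = new_name

print(resolve("SUP-B", "1L WHL MILK"))  # Whole Milk 1L
update_canonical("Whole Milk 1 Litre", "data.steward")
print(master["history"])  # [('Whole Milk 1L', 'data.steward')]
```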
Examples of master data management systems:



On-premise

On-premise means your service is hosted in house, physically within your organisation. You are therefore responsible for updates, licences, servers and database software.

In the cloud

The hosting company handles cloud services, which means that it is responsible for the hardware, software and updates. Your organisation pays for the services it has requested and also by usage.

Choosing a data platform is not a small task, but you can mix and match tools and microservices depending on your data types and future data requirements.

In part four, I will be discussing analytics and its use cases.

Further reading


  1. 2021. [online] Available at: <> [Accessed 11 July 2021].
  2. 2021. Graph database – Wikipedia. [online] Available at: <> [Accessed 31 July 2021].
  3. 2021. JSON. [online] Available at: <> [Accessed 1 August 2021].
  4. 2021. Document Database – an overview | ScienceDirect Topics. [online] Available at: <> [Accessed 1 August 2021].
  5. 2021. Column-oriented DBMS – Wikipedia. [online] Available at: <> [Accessed 1 August 2021].
The Year is 2040 – Refactoring Legacy Systems


When building a system, it is worth sparing a thought for those who will maintain it and for how it will evolve seamlessly in the future. If you work with technology, front end or back end, you may have come across systems that have evolved to become monsters in the company. In some cases, when asked to modify some functionality, the developers break out in a sweat because they know they are going somewhere deep and do not know when, if at all, they will return.

There was a time when it was fashionable to write code with the business logic in database triggers (cringe!), the computer dark ages. Several generations of code were then built upon this not-so-great foundation. Fast-forward years later: you are asked to make some changes, you attempt one change, and you discover that the effect ripples across many “hidden” entities, which in turn affect the business processes.

So are we out of the woods? Well, a better question to ask is: do those legacy systems still exist? Unfortunately, they do, and the mindset of “if it isn't broken, why fix it” is still alive and kicking. To be clear, I speak of monolithic systems. I won't name names, but they are everywhere, and these systems are used every day to develop even more monolithic systems.

These systems are hard to refactor or migrate, and the mere thought of analysing them sends shivers down the developer's spine.
It is not only the refactoring that is the problem; it is the fact that these products do not scale. I know some companies have made an effort to say their monoliths are scalable, using all the buzzwords, but the fact is, these products do not scale. Under the hood, they are mere monoliths.

Fast forward to the year 2040: the world is different; people rarely talk to each other, and everyone walks about virtually. You are sitting in your living room and you think, ah, I have run out of milk, I will go to the shop. Instead of turning on your device and loading an application, you get into your virtual car and go to your virtual supermarket. You walk into the supermarket and see others there, the shelves are all stacked nicely, and you pick up a basket, put things in it and pay.

How is this possible? Well, you have a chip that uniquely identifies you. When you put your clothes on, it knows what clothes you have on, because these are also tagged. When you go into the shop virtually, you appear as yourself because you are all chipped up and wired to a giant computer. Your chip is also linked to your bank, your tax records and all the internet of things. Next, you hear the doorbell ring and a drone drops your delivery; yes, the chip knows where you live, so you do not even need to enter your address. Your little robot picks up your delivery and unpacks it.

So who are the developers? Are they humans or robots? What are the developers in 2040 thinking of what we are doing now? Code today is the legacy code of the future.