Becoming Data-Driven Part 3 – Data Tools

I wrote about the sources of the data we will need for our analytics. These data exist in different systems, internal or external to your organisation. It is advisable to have a dedicated platform for your analytics data. This platform may have one or more data tools, which I will refer to as your data tool ecosystem. Analytics can consume a lot of system resources, and you want to avoid impacting the performance of an operational system. System resources are memory, hard disk and central processing units (CPU).

So you have your vision of your company, and you have your data. What is your technology vision? This vision is about the capabilities that the technologies you need must have. You need to understand your data and when it becomes stale. This will enable you to decide whether data must be received and processed from source in real-time, hourly, daily, weekly, monthly and so on. Is your data going to grow very large (“big data”)? It is essential to note the critical elements of your data to help you choose the right tool.
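
As a rough illustration of the staleness question, the sketch below picks the slowest refresh cadence that still keeps data within a tolerated age. The cadence names come from the paragraph above; the intervals (for example, treating a month as 30 days) and the function name choose_cadence are assumptions made for this example.

```python
from datetime import timedelta

# Candidate cadences, slowest first. The intervals are illustrative assumptions.
CADENCES = [
    ("monthly", timedelta(days=30)),
    ("weekly", timedelta(days=7)),
    ("daily", timedelta(days=1)),
    ("hourly", timedelta(hours=1)),
]


def choose_cadence(max_staleness):
    """Return the slowest cadence whose interval fits within max_staleness."""
    for name, interval in CADENCES:
        if interval <= max_staleness:
            return name
    # Nothing batch-based is fast enough, so the data must be streamed.
    return "real-time"


sales_report = choose_cadence(timedelta(days=2))
fraud_checks = choose_cadence(timedelta(minutes=5))
```

Here a report that tolerates two-day-old data is refreshed daily, while a check that tolerates only five minutes falls through to real-time processing.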

Choosing and reviewing your tools should be done periodically. Set up a process to do this. ThoughtWorks [1] has a tool, the Technology Radar, that the architectural team can use to drive the data technology vision. It has four stages: hold, assess, trial and adopt. There are three further states within the stages: new, moved in/out and no change.

DATA STORAGE

Before choosing your data tools, you need to decide on your data storage, and this depends on where your source data is stored and the type of computation required for your analytics. 

Data can be moved from one data system to another using a process known as Extract, Transform and Load (ETL). Moving data from one location to another and transforming it introduces points of potential failure and adds latency to the availability of the data. With memory becoming cheaper and new technology emerging and improving, organisations are looking to reduce or eliminate this problem.

Virtualization

Data virtualization integrates data from disparate sources without moving the data. This can provide a single customer view without the users knowing where the data is stored. This process eliminates the need for projects to create new systems, and the data is available as soon as it is in the source systems. One disadvantage is that the data from the different sources is not related to each other.
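
A minimal sketch of the idea, assuming two hypothetical source systems (a CRM and a billing system) represented here as plain Python dictionaries. The class name CustomerView and all field names are invented for illustration. The view copies nothing up front and reads from each source only when asked.

```python
# Hypothetical source systems. The data stays where it is.
crm_system = {"C001": {"name": "Ada Obi", "email": "ada@example.com"}}
billing_system = {
    "C001": [
        {"invoice": "INV-1", "amount": 120.0},
        {"invoice": "INV-2", "amount": 80.0},
    ]
}


class CustomerView:
    """Single customer view that queries the sources at request time."""

    def __init__(self, crm, billing):
        self._crm = crm
        self._billing = billing

    def get(self, customer_id):
        profile = self._crm[customer_id]
        invoices = self._billing.get(customer_id, [])
        return {
            "customer_id": customer_id,
            "name": profile["name"],
            "email": profile["email"],
            "total_billed": sum(item["amount"] for item in invoices),
        }


view = CustomerView(crm_system, billing_system)
before = view.get("C001")["total_billed"]

# A new invoice lands in the source system and is visible immediately,
# because the view never held its own copy of the data.
billing_system["C001"].append({"invoice": "INV-3", "amount": 50.0})
after = view.get("C001")["total_billed"]
```

The caller of view.get does not need to know which system holds the name and which holds the invoices.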

Federation

Data federation is like data virtualization, but a standard data model shows the relationships between entities.

In-Memory computing

As the name suggests, in-memory computing uses random access memory (RAM) to store all the data required for analysis, and processing is performed in memory.
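
One small way to try this out is SQLite's in-memory mode, available through Python's standard sqlite3 module. This is only a sketch on a toy scale, and the table and sensor names are made up for the example.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Both the stored data and the processing of the query live in memory.
averages = dict(
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor")
)
```

The trade-off is that the data disappears when the connection closes, so an in-memory store is normally loaded from, or backed by, a durable source.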

In-database analytics

In-database analytics allows the processing of data within the database itself, so the data does not have to be moved to a separate tool before it is analysed.
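
The difference can be sketched with two ways of computing the same totals. The example uses Python's standard sqlite3 module and a made-up sales table. The first approach moves every row to the application, while the second lets the database do the aggregation and moves only the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 70.0)],
)

# Approach 1: pull every row out of the database, then aggregate in the app.
totals_in_app = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_in_app[region] = totals_in_app.get(region, 0.0) + amount

# Approach 2 (in-database): the database aggregates, one row per region returns.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

With three rows the difference is invisible, but with millions of rows the second approach avoids transferring the raw data at all.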


DATA TOOLS ECOSYSTEM

When creating a data platform, you may need one or more tools to help achieve your goal. We will discuss the tools on the market in this section. The tools use the microservice architecture, which I will discuss in a future article.

When you get data from different sources, there are different ways to store your data. One of the most common and widely used ways is through relational databases. Always look at the use case of the database you choose to ensure that it is the right choice.

Types of databases

Relational database

Relational databases store data in two-dimensional structures called tables. Tables can be related to other tables using built-in relational concepts such as primary keys and foreign keys.

Properties of relational databases: 

  • Data is structured
  • Table structure must exist before attempting to load data
  • ACID: atomicity, consistency, isolation and durability
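
The properties above can be sketched with Python's standard sqlite3 module. The customers and orders tables are invented for the example. The tables are created before any data is loaded, a foreign key relates them, and a failed transaction is rolled back as a whole (atomicity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The table structure must exist before data is loaded.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.commit()

# Atomicity: both inserts succeed together or neither is kept.
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (1, 1, 25.0)")
        conn.execute("INSERT INTO orders VALUES (2, 99, 10.0)")  # no customer 99
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the second insert violates the foreign key, the first insert is rolled back too and the orders table stays empty.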

Examples of relational databases:

Graph database

A graph database uses graph structures with nodes, edges and properties to represent and store data, and supports semantic queries over them [2].
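
To make the terms concrete, here is a toy sketch in plain Python rather than a real graph database. The node names, relationship types and helper functions are all invented for illustration.

```python
from collections import deque

# Nodes carry properties.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 41},
    "acme": {"label": "Company"},
}

# Edges are (source, relationship, target, properties).
edges = [
    ("alice", "KNOWS", "bob", {"since": 2019}),
    ("bob", "WORKS_AT", "acme", {"role": "analyst"}),
]


def neighbours(node, relation=None):
    """Nodes directly connected from `node`, optionally by one relationship type."""
    return [
        dst
        for src, rel, dst, _props in edges
        if src == node and (relation is None or rel == relation)
    ]


def reachable(start):
    """Every node that can be reached from `start` by following edges."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


friends = neighbours("alice", "KNOWS")
connected = reachable("alice")
```

Questions such as "who is connected to Alice, directly or indirectly?" are traversals like reachable, which is the kind of query a graph database is designed to answer efficiently.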

Examples of graph databases:

Document Stores

A document store database is a non-relational database that stores its data in JSON-like documents [3] [4].
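
A small sketch of what "JSON-like documents" means, using plain Python dictionaries and the standard json module instead of a real document store. The collection, field names and find helper are assumptions made for the example.

```python
import json

# Documents in the same collection do not need to share the same fields.
collection = [
    {"_id": 1, "name": "Ada", "address": {"city": "Lagos"}, "tags": ["vip"]},
    {"_id": 2, "name": "Ben", "email": "ben@example.com"},
]


def find(docs, **criteria):
    """Return documents whose top-level fields match all the criteria."""
    return [
        doc
        for doc in docs
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(collection, name="Ben")

# Each document serialises to JSON and back without a predefined table schema.
serialised = json.dumps(collection[0])
restored = json.loads(serialised)
```

Note that the first document nests an address and a list of tags, while the second has neither, which is the main contrast with a relational table.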

Properties of document stores:

  • High-performance data persistence
  • Replication & Failover
  • Outward scaling
  • Multiple storage engines

Examples of document store databases:

Columnar database

A columnar database stores data by column rather than by row. It is optimised for retrieving columns of data quickly and is well suited to analytical workloads. Because a query reads only the columns it needs, less data is loaded from disk, which reduces the overall disk I/O requirements.
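
The layout difference can be sketched in plain Python. This is a simplification with made-up order data, since a real columnar database also adds compression and other optimisations.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "north", "amount": 100.0},
    {"order_id": 2, "region": "south", "amount": 70.0},
    {"order_id": 3, "region": "north", "amount": 50.0},
]

# Column-oriented layout: the values of each column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing from the row layout visits every record to pick out one field.
total_from_rows = sum(row["amount"] for row in rows)

# Summing from the column layout touches only the "amount" column.
total_from_columns = sum(columns["amount"])
```

Both totals are the same. The benefit of the column layout is that an analytical query over one or two columns never has to read the others.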

Examples of columnar databases [5]:

Key-Value database

A key-value database is a non-relational database that stores data as a collection of key-value pairs, where each unique key is used to store and retrieve its value.
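
As a minimal sketch, the interface of a key-value store can be modelled with a Python dictionary. The class name KeyValueStore and the session key format are assumptions for the example, and a real product adds persistence, expiry and distribution on top of this.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque and looked up only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is not treated as an error.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "ada", "cart": ["sku-1"]})
session = store.get("session:42")
missing = store.get("session:99")
```

There is no query language here: if you do not know the key, you cannot find the value, which is what makes this model simple and fast.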

Examples of key-value databases:

ETL TOOLS

Now that you have a new database, how do you periodically extract your data from the different internal and external sources into it?

As discussed above, we use the ETL process. Let us look at some tools that will allow you to extract the data from source systems or files, transform and normalise it, and load it into your datastore, ready for ingestion by internal and external systems.

Example ETL Tools:

Organisations use ETL processing patterns when the following conditions are met. 

  1. The business does not require the data in real-time.
  2. A batch processing workload is the best fit.
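
The extract, transform and load steps described in this section can be sketched as a small batch job using only Python's standard library. The CSV content, column names and orders table are invented for the example. A dedicated ETL tool adds scheduling, monitoring and connectors on top of this basic pattern.

```python
import csv
import io
import sqlite3

# Hypothetical source extract. In practice this would be a file or an export.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
3,alice,not-a-number
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, convert types and set aside bad rows."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(
                (
                    int(row["order_id"]),
                    row["customer"].strip().lower(),
                    float(row["amount"]),
                )
            )
        except ValueError:
            rejected.append(row)  # a point of potential failure, kept for review
    return clean, rejected


def load(conn, rows):
    """Load: write the cleaned rows to the target datastore in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders"
        " (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(SOURCE_CSV))
load(conn, clean_rows)
total_loaded = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Two of the three source rows are loaded and the row with the unparseable amount is rejected. Using INSERT OR REPLACE keyed on order_id is one design choice that lets the same batch be re-run without creating duplicates.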

EVENT STREAMING

What if you want your data in near real-time or real-time? When an event happens, you want to predict the impact. Some use cases include:

  • Stock prices
  • Economic news
  • Weather news
  • Recommender systems 
  • Fraud detection
  • Decouple and scale microservices
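
To illustrate the fraud detection use case, the sketch below processes events one at a time as they arrive instead of waiting for a batch. A Python generator stands in for the event stream. The event fields, the window size and the "five times the recent average" rule are all assumptions made for the example, not a recommended fraud model.

```python
from collections import defaultdict, deque


def card_events():
    """Hypothetical stream of card payments, yielded one event at a time."""
    yield {"card": "A", "amount": 20.0}
    yield {"card": "A", "amount": 25.0}
    yield {"card": "B", "amount": 300.0}
    yield {"card": "A", "amount": 900.0}


def detect_suspicious(events, window=3, factor=5.0):
    """Flag a payment that exceeds `factor` times the card's recent average."""
    history = defaultdict(lambda: deque(maxlen=window))
    for event in events:
        recent = history[event["card"]]
        if recent and event["amount"] > factor * (sum(recent) / len(recent)):
            yield event  # raised as soon as the event is seen
        recent.append(event["amount"])


alerts = list(detect_suspicious(card_events()))
```

Only the 900.0 payment on card A is flagged, because it is far above that card's recent average of 22.5. The first payment on card B has no history to compare against, so it passes.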

Example event streaming tools:

MASTER DATA MANAGEMENT SYSTEMS

Master data management systems are useful when you need to standardise your data. A good example would be if you are collecting data from different organisations, where each may use a different name for the same product. This can be challenging to manage in code. When a change is required, you will have to go through the change process of your organisation to have this in production.

Master data management allows business owners to change their data as and when needed with a history of changes available.
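
The product-name example can be sketched as a mapping that is held as data rather than hard-coded, with every change recorded. The class name ProductMaster, the product names and the history fields are all hypothetical. A real master data management system provides this through a user interface with workflow and approvals.

```python
from datetime import datetime, timezone


class ProductMaster:
    """Toy master data store: maps source product names to one standard name."""

    def __init__(self):
        self._mapping = {}
        self.history = []  # every change is kept, newest last

    def set_mapping(self, source_name, standard_name, changed_by):
        key = source_name.strip().lower()
        self.history.append(
            {
                "source_name": key,
                "old": self._mapping.get(key),
                "new": standard_name,
                "changed_by": changed_by,
                "changed_at": datetime.now(timezone.utc),
            }
        )
        self._mapping[key] = standard_name

    def standardise(self, source_name):
        """Return the standard name, or the input unchanged if it is unmapped."""
        return self._mapping.get(source_name.strip().lower(), source_name)


master = ProductMaster()
master.set_mapping("Widget-A", "Widget A", changed_by="business.owner")
master.set_mapping("widget a (blue)", "Widget A", changed_by="business.owner")

standard = master.standardise("  WIDGET-A ")
unmapped = master.standardise("Gadget Z")
```

Because the mapping is data, a business owner can add or correct an entry without a code release, and the history list shows who changed what and when.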

Master data management tools also allow you to build, govern and manage data quality, data hierarchies and a business glossary.

Examples of master data management systems [6]:

ON-PREMISE OR IN THE CLOUD

On-Premise

On-premise means that your service is hosted in-house, physically within your organisation. You are therefore responsible for updates, licences, servers and database software.

In the cloud

With cloud services, the hosting company is responsible for the hardware, software and updates. Your organisation pays for the services it has requested and for its usage of them.

Choosing a data platform is not a small task, but you can mix and match tools using microservices, depending on your data types and future data requirements.

In part four, I will be discussing analytics and its use case.

Further reading

 

  1. Assets.thoughtworks.com. 2021. [online] Available at: <https://assets.thoughtworks.com/assets/technology-radar-vol-24-en.pdf> [Accessed 11 July 2021].
  2. En.wikipedia.org. 2021. Graph database – Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Graph_database> [Accessed 31 July 2021].
  3. Json.org. 2021. JSON. [online] Available at: <https://www.json.org/json-en.html> [Accessed 1 August 2021].
  4. Sciencedirect.com. 2021. Document Database – an overview | ScienceDirect Topics. [online] Available at: <https://www.sciencedirect.com/topics/computer-science/document-database> [Accessed 1 August 2021].
  5. En.wikipedia.org. 2021. Column-oriented DBMS – Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Column-oriented_DBMS> [Accessed 1 August 2021].
  6. https://www.gartner.com/reviews/market/master-data-management-solutions/
