I wrote about the sources of the data we will need for our analytics. These data exist in different systems, internal or external to your organisation. It is advisable to have a dedicated platform for your analytics data. This platform may have one or more data tools, which I will refer to as your data tool ecosystem. Analytics can consume a lot of system resources, namely memory, hard disk and central processing units (CPU), and you want to avoid impacting the performance of an operational system.
So you have your vision of your company, and you have your data. What is your technology vision? The vision is more about which capabilities the technologies you need must have. You need to understand your data and when it becomes stale. This will enable you to decide whether data must be received and processed from source in real-time, hourly, daily, weekly, monthly and so on. Is your data going to grow very large (“big data”)? It is essential to note the critical elements of your data to help you choose the right tool.
Choosing and reviewing your tools should be done periodically. Set up a process to do this. ThoughtWorks publishes the Technology Radar, a tool that the architectural team can use to drive the data technology vision. It has four rings: hold, assess, trial and adopt. Each entry also has one of three states: new, moved in/out or no change.
Before choosing your data tools, you need to decide on your data storage, and this depends on where your source data is stored and the type of computation required for your analytics.
Data can be moved from one data system to another using a process known as Extract, Transform and Load (ETL). Moving data from one location to another and transforming it introduces points of potential failure and adds latency to the availability of the data. With memory becoming cheaper and new technology emerging and improving, organisations are looking to reduce or eliminate this problem.
Data virtualization integrates data from disparate sources without moving the data. This can provide a single customer view without the users knowing where the data is stored. This approach removes the need for projects to create new systems, and the data is available as soon as it is in the source systems. One disadvantage is that data from the different sources is not related to each other.
Data federation is like data virtualization, but a standard data model shows the relationships between entities.
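To make the idea of a single customer view concrete, here is a minimal sketch in Python. The two dictionaries stand in for separate source systems, and the names `crm_system`, `billing_system` and `single_customer_view` are hypothetical. A real virtualization or federation product would connect to live systems; this only illustrates that the view is assembled at read time and nothing is copied.

```python
# Two stand-in "source systems"; in practice these would be separate databases or APIs.
crm_system = {101: {"name": "Ada Obi", "email": "ada@example.com"}}
billing_system = {101: {"balance": 42.5, "currency": "GBP"}}


def single_customer_view(customer_id):
    """Build the view on request by reading each source, without storing a copy."""
    view = {"customer_id": customer_id}
    view.update(crm_system.get(customer_id, {}))
    view.update(billing_system.get(customer_id, {}))
    return view


view = single_customer_view(101)

# A change in a source system is visible the next time the view is requested.
billing_system[101]["balance"] = 0.0
refreshed = single_customer_view(101)
```

The caller only sees one record and does not need to know which system holds which field.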
As the name suggests, in-memory computing uses random access memory (RAM) to store all the data required for analysis, and processing is performed in memory.
In-database analytics allows the processing of data within the database.
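As a rough illustration of in-database analytics, the sketch below uses Python's built-in `sqlite3` module and a made-up `sales` table. The aggregation runs inside the database engine, so only the small summary is returned to the application and the raw rows are never pulled out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0)],
)

# The database does the counting and averaging; Python only receives the result.
summary = conn.execute(
    "SELECT region, COUNT(*), AVG(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```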
DATA TOOLS ECOSYSTEM
When creating a data platform, you may need one or more tools to help achieve your goal. We will discuss the tools on the market in this section. The tools use the microservice architecture, which I will discuss in a future article.
When you get data from different sources, there are different ways to store your data. One of the most common and widely used ways is through relational databases. Always look at the use case of the database you choose to ensure that it is the right choice.
Types of databases
Relational databases store data in two-dimensional structures called tables. Tables can be related to other tables using inbuilt relational concepts such as primary and foreign keys.
Properties of relational databases:
- Data is structured
- Table structure must exist before attempting to load data
- ACID transactions: atomicity, consistency, isolation and durability
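The properties above can be sketched with Python's built-in `sqlite3` module. The `account` table and the `transfer` function are hypothetical and only there for illustration. The table must be created before any data is loaded, and a transfer that would break the balance rule is rolled back as a whole, which is the atomicity part of ACID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The structure is defined first; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)        # succeeds
rejected = transfer(conn, 1, 2, 500)  # would overdraw account 1, so it is rolled back
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
```

In the rejected transfer the credit to account 2 is applied first and then undone, so no partial change is left behind.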
Examples of relational databases:
A graph database uses graph structures for semantic queries with nodes, edges and properties to represent and store data.
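A toy version of the node, edge and property model can be written in plain Python. This is only a sketch to show the shape of the data; the names and sample values are made up, and a real graph database adds indexing, a query language and persistence.

```python
# Nodes carry properties; edges connect two nodes, have a type and their own properties.
nodes = {
    "alice": {"label": "Customer", "city": "Leeds"},
    "bob": {"label": "Customer", "city": "York"},
    "laptop": {"label": "Product", "price": 900},
}
edges = [
    ("alice", "BOUGHT", "laptop", {"year": 2021}),
    ("bob", "BOUGHT", "laptop", {"year": 2020}),
    ("alice", "KNOWS", "bob", {}),
]


def neighbours(node, relation):
    """Follow outgoing edges of one type from a node."""
    return [dst for src, rel, dst, _props in edges if src == node and rel == relation]


def buyers_of(product):
    """Answer a relationship question: who bought this product?"""
    return sorted(src for src, rel, dst, _props in edges if rel == "BOUGHT" and dst == product)
```

Queries are expressed in terms of relationships, for example `buyers_of("laptop")`, instead of joins between tables.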
Examples of graph databases:
A document store database is a non-relational database that stores its data in JSON-like documents.
Properties of document stores:
- High-performance data persistence
- Replication & Failover
- Outward scaling
- Multiple storage engines
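To show what a JSON-like document looks like in practice, here is a small sketch using only Python's `json` module. The `collection`, `insert` and `find` names are hypothetical. It does not attempt replication, scaling or storage engines; it only shows that documents in one collection do not need to share the same fields.

```python
import json

collection = {}  # document id -> JSON text


def insert(doc_id, document):
    """Store a document as JSON text, with no table structure defined up front."""
    collection[doc_id] = json.dumps(document)


def find(field, value):
    """Return every document whose top-level field equals the given value."""
    results = []
    for raw in collection.values():
        doc = json.loads(raw)
        if doc.get(field) == value:
            results.append(doc)
    return results


insert("c1", {"type": "customer", "name": "Ada", "tags": ["vip"]})
insert("c2", {"type": "customer", "name": "Ben", "address": {"city": "York"}})
insert("o1", {"type": "order", "customer": "c1", "total": 99.0})

customers = find("type", "customer")
```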
Examples of document store databases:
A columnar database is optimised for retrieving columns of data fast and is excellent for analytical workloads. Because a query reads only the columns it needs, less data is loaded from disk and the overall disk I/O is reduced.
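The difference between row and column layout can be sketched with plain Python lists. The sample orders are made up, and in-memory lists only approximate what happens on disk, but the idea carries over: an aggregate over one column touches only that column.

```python
# Row layout: each record is stored together, so a scan touches every field.
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 50.5},
]

# Column layout: each column is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing from rows walks through whole records.
total_from_rows = sum(row["amount"] for row in rows)

# Summing from columns reads a single contiguous list and ignores the rest.
total_from_columns = sum(columns["amount"])
```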
Examples of columnar databases:
A key-value database is a non-relational database that uses the key-value method to store data.
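The key-value method is simple enough to sketch with a Python dictionary. The `KeyValueStore` class below is hypothetical and leaves out everything a real product adds, such as persistence and distribution. The point is that the store looks values up by key only and does not inspect what is inside the value.

```python
class KeyValueStore:
    """A toy key-value store: values are opaque and are only reachable by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "ada", "basket": ["widget"]})
session = store.get("session:42")
store.delete("session:42")
missing = store.get("session:42")
```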
Examples of key-value databases:
Now that you have a new database, how do you periodically extract your data from the different internal and external sources into it?
As discussed above, we use the ETL process. Let us look at some tools that will allow you to extract the data from source systems or files, transform and normalise it, and load it into your datastore, ready for use by internal and external systems.
Example ETL Tools:
- Informatica PowerCenter
- SQL Server Integration Services (SSIS)
- Oracle Data Integrator (ODI)
Organisations use ETL processing patterns when the following conditions are met.
- The business does not require the data in real-time.
- A batch processing workload is the best fit.
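The extract, transform and load steps can be sketched as one small batch job using only the Python standard library. The CSV content, the `product` table and the function names are assumptions made for illustration. The commercial tools listed above do the same kind of work with scheduling, monitoring and connectors on top.

```python
import csv
import io
import sqlite3

# Stand-in for a source file; note the inconsistent spacing, case and a missing name.
RAW_CSV = """product,price,currency
 Widget ,10.50,gbp
GADGET,3.20,GBP
,1.00,GBP
"""


def extract(text):
    """Read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalise names and currency codes, cast prices, drop rows with no product."""
    cleaned = []
    for row in rows:
        name = row["product"].strip().lower()
        if not name:
            continue
        cleaned.append((name, float(row["price"]), row["currency"].strip().upper()))
    return cleaned


def load(conn, records):
    """Write the cleaned records to the target datastore in one transaction."""
    conn.execute("CREATE TABLE IF NOT EXISTS product (name TEXT, price REAL, currency TEXT)")
    with conn:
        conn.executemany("INSERT INTO product VALUES (?, ?, ?)", records)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT name, price, currency FROM product ORDER BY name").fetchall()
```

Each step is a point where the job can fail, and the data is only available after the whole batch has run, which is the latency mentioned earlier.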
What if you want your data in near real-time or real-time? When an event happens, you want to predict the impact. Some use cases include:
- Stock prices
- Economic news
- Weather news
- Recommender systems
- Fraud detection
- Decoupling and scaling microservices
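To contrast with batch processing, here is a minimal sketch of handling events one at a time as they arrive, using fraud detection as the example. The rule (more than three card payments within 60 seconds), the function name `detect_bursts` and the sample events are all assumptions. A real deployment would read from an event streaming platform, not from a list.

```python
from collections import defaultdict, deque


def detect_bursts(events, window_seconds=60, max_events=3):
    """Yield an alert as soon as a card exceeds max_events within the time window."""
    recent = defaultdict(deque)  # card id -> timestamps still inside the window
    for timestamp, card_id in events:
        window = recent[card_id]
        window.append(timestamp)
        # Forget payments that have fallen out of the window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield (timestamp, card_id)


# (seconds, card id) pairs in arrival order.
payments = [(0, "A"), (10, "A"), (15, "B"), (20, "A"), (30, "A"), (200, "A")]
alerts = list(detect_bursts(payments))
```

Because the function is a generator, each alert is produced when the triggering event is seen, not at the end of a batch.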
Example event streaming tools:
Master data management systems
Master data management systems are useful when you need to standardise your data. For example, if you collect data from different organisations, each may use a different name for the same product. This can be challenging to manage in code, because every time a change is required you will have to go through the change process of your organisation to get it into production.
Master data management allows business owners to change their data as and when needed with a history of changes available.
Master data management tools also allow you to build, govern and manage data quality, data hierarchies and a business glossary.
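The product name example can be sketched in a few lines of Python. The `ProductMaster` class and the product names are hypothetical, and a real master data management system offers far more. The sketch shows the two ideas from the text: the mapping is data that a business owner can change without a code release, and every change is kept in a history.

```python
from datetime import datetime, timezone


class ProductMaster:
    """Maps the names used by different organisations to one master product name."""

    def __init__(self):
        self._aliases = {}
        self.history = []  # one entry per change, oldest first

    def set_alias(self, source_name, master_name, changed_by):
        key = source_name.strip().lower()
        self.history.append({
            "alias": key,
            "old": self._aliases.get(key),
            "new": master_name,
            "changed_by": changed_by,
            "changed_at": datetime.now(timezone.utc),
        })
        self._aliases[key] = master_name

    def standardise(self, source_name):
        """Return the master name, or the original name if no mapping exists yet."""
        return self._aliases.get(source_name.strip().lower(), source_name)


master = ProductMaster()
master.set_alias("Coke 330ml", "Cola Can 330ml", changed_by="buyer")
master.set_alias("Cola can (330)", "Cola Can 330ml", changed_by="buyer")
master.set_alias("Coke 330ml", "Cola Can 33cl", changed_by="category manager")

name = master.standardise(" COKE 330ML ")
```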
Examples of master data management systems:
- IBM InfoSphere Master Data Management
- Informatica MDM
- Tibco EBX
- SAP Master Data Governance
- IBM Product Master
- Informatica Customer 360
- Microsoft MDS
ON-PREMISE OR IN THE CLOUD
On-premise means that your service is hosted in-house, physically within your organisation. You are therefore responsible for updates, licences, servers and database software.
In the cloud
Cloud services are run by the hosting company, which means that it is responsible for the hardware, software and updates. Your organisation pays for the services it has requested and for its usage of them.
Choosing a data platform is not a small task, but you can mix and match it with microservices depending on your data types and future data requirements.
In part four, I will be discussing analytics and its use case.
- Assets.thoughtworks.com. 2021. [online] Available at: <https://assets.thoughtworks.com/assets/technology-radar-vol-24-en.pdf> [Accessed 11 July 2021].
- En.wikipedia.org. 2021. Graph database – Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Graph_database> [Accessed 31 July 2021].
- Json.org. 2021. JSON. [online] Available at: <https://www.json.org/json-en.html> [Accessed 1 August 2021].
- Sciencedirect.com. 2021. Document Database – an overview | ScienceDirect Topics. [online] Available at: <https://www.sciencedirect.com/topics/computer-science/document-database> [Accessed 1 August 2021].
- En.wikipedia.org. 2021. Column-oriented DBMS – Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Column-oriented_DBMS> [Accessed 1 August 2021].