
Our Blog

Articles  

Welcome to our Blog!

Here you'll find cool data tech articles and our learnings from working on data solutions.

Unleashing Data Power with Lake House Architecture on AWS

Source AWS — https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/

Introduction
In our data-centric world, organizations are embracing contemporary data architectures to manage, process, and analyze vast volumes of information. The Lake House Architecture, a merger of the data lake and the data warehouse, is one such model. This article delves into the advantages of AWS-based Lake House Architecture and how it empowers businesses to fully harness their data.

A Lake house reference architecture as proposed by AWS. Source AWS — https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/

Data Lake architectures vertigo — do you get it?

In the previous post, I explained what a data lake is and its main business benefits. This time around we will take a brief look at the most commonly adopted related data architectures.

A data lake has only recently become a mainstream solution, adopted by businesses that want to stay on top of their data. Digital transformation was only the beginning: data creation in 2035 is forecast to reach 45 times that of 2020 (source: StatistaCharts). I believe that the value extracted from data grows in line with its volume, and data platform implementations are following suit. It is therefore no surprise that new data architecture patterns are cropping up and that the data lake has gone through its first evolutions. There are now three distinct data lake implementations and architectures: centralised (the plain data lake), decentralised (the data mesh), and a third, the Lakehouse, which integrates the data warehouse with the data lake.

When we talk about a data lake we usually refer to the centralised architecture. Having all data in one place certainly has its benefits and works well for organisations with small to medium data footprints. However, for those producing large volumes of data across the organisation, a decentralised data architecture is necessary. Enter the data mesh, an architecture that organises data by business domain (e.g., marketing or sales), giving more ownership to the producers of a given dataset and connecting them more closely with data consumers.

Last but not least, organisations that already utilise both a data warehouse and a data lake may want to implement the Lake House architecture, which integrates the two. From an operational standpoint, this avoids issues related to data gravity, namely unnecessary data movement and data redundancy; on the business side, robust analytics allows for faster time to market.

Now that we have covered the architectural patterns related to the data lake, my next posts will take a more detailed look at each architecture.

As always, for any data platform needs, contact us at DataPhoenix — a small end-to-end data solutions provider.

Image source AWS.

Understanding the Elements of a Well-Built Data Lake that Translate into Business Value

In this article, we will walk through the different layers of a well-built data lake and highlight common pitfalls and challenges to consider when implementing a data lake architecture.

Overview:

A typical data lake decouples storage, processing, and analytics. The data is segmented into landing, raw, trusted, and curated zones depending on its consumption readiness. Each layer should be designed with cost optimisation, flexibility, and security in mind. Let’s take a closer look at what each layer does and which services are typically used.
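In practice, the zones often map to prefixes within one or more S3 buckets. A minimal sketch in Python, with a hypothetical bucket name and layout:

    # Hypothetical zone layout within a single S3 bucket
    ZONES = {
        "landing": "s3://acme-datalake/landing/",  # as-received files, short retention
        "raw":     "s3://acme-datalake/raw/",      # immutable copies in original format
        "trusted": "s3://acme-datalake/trusted/",  # validated, cleaned, partitioned
        "curated": "s3://acme-datalake/curated/",  # consumption-ready, modelled data
    }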

Ingestion:

AWS supports a myriad of data producers, and ingestion methods can handle structured, semi-structured, and unstructured data loaded in batch or in real time (e.g., streaming with Kinesis Firehose). The files are stored in the AWS S3 service, which is excellent as it has no storage limit. When data arrives, it can be in any format; the structure is applied only when the data is read (schema on read). S3 is also budget-friendly, as Intelligent-Tiering delivers automatic storage cost savings when access patterns change. Last but not least, it boasts excellent security features, from fine-grained access policies to robust encryption with KMS.
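For the real-time path, here is a minimal sketch of pushing one event through Kinesis Firehose with boto3; the delivery stream name is hypothetical and assumed to be configured to deliver into S3:

    import boto3

    firehose = boto3.client("firehose", region_name="eu-west-1")
    # Send one newline-delimited JSON event; Firehose buffers records
    # and delivers them to the configured S3 destination.
    firehose.put_record(
        DeliveryStreamName="clickstream-to-lake",  # hypothetical stream
        Record={"Data": b'{"event": "page_view", "user": "42"}\n'},
    )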

Processing and Transformation:

In a data lake or Lake House architecture, the data processing layer is responsible for transforming data into a consumable state through validation, cleanup, normalisation, transformation, and enrichment. AWS Glue is used to track and catalogue the data. Purpose-built components perform a variety of transformations, big data processing (EMR), and near-real-time ETL. Amazon Athena can be used for quick probing into data with SQL-style queries.
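As an illustration, a transformation step can be kicked off as a Glue ETL job from Python; the job name and argument below are hypothetical:

    import boto3

    glue = boto3.client("glue")
    # Trigger a pre-defined Glue ETL job that cleans raw data into the trusted zone
    run = glue.start_job_run(
        JobName="raw-to-trusted-clickstream",  # hypothetical job defined in Glue
        Arguments={"--source_path": "s3://acme-datalake/raw/clickstream/"},
    )
    print(run["JobRunId"])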

Analytics/Visualisation:

This layer provides data access and consumption, primarily through S3. SageMaker allows machine learning models to be trained and deployed. Visualisations and dashboards can be built using tools like AWS QuickSight or offloaded to Tableau (a SaaS service). It’s essential to note that the quality of the analytics and visualisations depends on the quality of the data stored in the data lake.

AWS data lake architecture — DataPhoenix

Common pitfalls:

  • The main challenge with a data lake architecture is that raw data is stored without oversight of its contents. For a data lake to make data usable, it needs defined mechanisms to catalogue and secure data. Without these elements, data cannot be found or trusted, and the result is a “data swamp.”
  • A data lake is not the answer to all data analytics needs; businesses may also need to leverage a data warehouse. When the two are combined, we talk about a Lakehouse architecture.
  • Governance and access control mechanisms are also necessary to ensure the data being accessed is trustworthy. Often overlooked in the implementation phase, they later cause frustration among teams that lack the flexibility to re-process and consume data in a democratic way. When the familiar issues of data silos come into play, a data mesh architecture may be the answer.

By understanding the different layers of a well-built data lake, businesses can realise the full potential of their data, enabling advanced analytics, generating insights, and improving decision-making processes.

As always, for any data platform needs, contact us at DataPhoenix — a small end-to-end data solutions provider.

How well do you know a data lake and its main business benefits?

When I tackled my first data lake project, it was difficult for me to understand what it actually is. I finally grasped the concept once I understood that there are three types of data: structured, semi-structured, and unstructured. In layman’s terms, a data lake is a data management solution that stores and processes, at scale, raw data that doesn’t fit well into SQL databases (structured), data warehouses (structured), or NoSQL (unstructured) solutions. A key difference from other data management solutions is that the ability to store data economically lets you answer future business questions that you might otherwise not have had the data for.

A robust implementation of a data lake brings a wealth of benefits. Analytics extracted from new sources like log files, click-streams, social media, and IIoT can increase revenue by uncovering valuable insights; it lets you make better business decisions faster and brings budget savings. When data is all in one place, storage-cost optimised (AWS delivers automatic storage cost savings when access patterns change), catalogued, and democratised (available to everyone), you can expect lower OpEx as well. The benefits don’t stop there: cloud providers typically focus on security and governance features, reducing risks and helping you meet the highest compliance standards.

Data Lake architecture. Image source AWS.

Now that you understand what a data lake is and what the main benefits are, you can start preparing your business for robust data analytics. While there are many use cases for data stored in data lakes, I’m particularly excited about machine learning, which is closely related to AI.

In my next article, I will briefly overview data lake architecture on AWS.

As always, for any data platform needs, please contact us at info@dataphoenix.io — a small end-to-end data solutions provider.

DataPhoenix explains everything you need to know about data lakes in “The Definitive Guide to Data Lakes on AWS”

Overview

This guide aims to empower you with the fundamental knowledge needed to understand what a standard data lake is, its main architectural elements, and how you can utilise AWS services to manage your data securely and effectively. With this base-level understanding, you can build your expertise in data lakes and implement them across various business functions.


About the authors

DataPhoenix provides consulting and professional services focused on the delivery of data management platforms on AWS. Based on our in-depth experience delivering data platforms to small businesses and large enterprises, we present “The Definitive Guide to Data Lakes”, a comprehensive resource for decision-makers, CTOs, Data Architects, and Engineers. Delve into the world of data lakes and understand the technical aspects as well as the business benefits.



What is a data lake

Picture yourself working at a company, serving as the system administrator. Your primary responsibility is to ensure that everything runs seamlessly without any hitches. The company delegates tasks among smaller teams, each overseeing a different aspect, such as code development, project management, or database modelling. However, you soon encounter a problem: customers are complaining about slow system performance. Upon investigation, you realise that the database is nearing its capacity and is being used for all functions. It becomes apparent that a “one size fits all” approach is inappropriate. To address this issue, you need to segregate the data into separate components, allowing for more strategic administration.

A data lake serves as the solution to challenges arising from traditional database and data processing techniques, offering a superior approach for modern workloads.

If you find yourself in a similar situation, a data lake could be immensely beneficial. Cloud platforms like AWS offer data lake services that handle storage, analytics, real-time analysis, and machine learning separately.

In essence, a data lake is an intelligent storage system capable of accommodating any business-related data, both structured and unstructured, directly from its source.


Why use a data lake

Progressive businesses employ data lakes to leverage their data and outshine competitors. Pairing data lakes with machine learning solutions and in-depth analytics, they reap a wide array of benefits. The flexibility of AWS services enables businesses to attain their unique objectives. Insights gleaned from AWS Data Lakes can aid in decision-making, operational efficiency, cost reduction, proactive maintenance, productivity enhancement, research, customer retention, and more. Consequently, AWS data lakes can enhance your company’s efficiency in handling database workloads and reduce transactional costs.

Even if you’re unsure about how or when to use specific unstructured data, data lakes offer an inexpensive and secure storage solution. Businesses can extract insights from their data lakes whenever they’re ready, a considerable advantage over traditional databases, where data is discarded if it isn’t needed immediately; and unlike in a data warehouse, storing data in a data lake is much more cost-effective. This future-proofs your business without hindering access to vital business data. Data can be accumulated over months and analysed when a business case arises.

Another significant benefit of data lakes is their flexibility. Traditional databases often comprise multiple mechanisms and properties, which can pose challenges for platforms that require specific structures to process the data. In contrast, data lakes can store data without processing or altering the structure of the information (decoupling).

Furthermore, data lakes enable secure storage of both structured and unstructured data without scaling limitations. This advantage can save considerable time and effort that would otherwise be spent sorting out different data structures for the computer to process.

The term ‘lake’ aptly reflects these benefits, evoking the organic nature of an actual lake.


Characteristics of a data lake

A database typically collects only one type of structured data, whereas a data lake is data agnostic, meaning it is not limited to a specific structure, data type, or file size. This includes diverse formats like photos, videos, texts, compressed or encrypted files, and more. The freeform structure of data lakes enables businesses to make swift changes to queries and models, providing a competitive edge. Data lakes facilitate a broader range of analytics compared to traditional databases.

Another crucial characteristic of data lakes is their ability to help predict future outcomes based on historical data. This can help businesses detect market gaps and potential problems, and gain insight into future performance. This future-proofing can offer a significant advantage over competitors. The simplicity and flexibility of data lake infrastructure allow for structural remodelling as business needs evolve.


Data Lake vs Data Warehouse

A data warehouse is a centralised database that receives input from transactional systems and line-of-business applications and is set up to analyse this data. Data must be curated within data warehouses to fit into tables, which requires cleaning and formatting to facilitate fast SQL queries.

Although this may sound similar to data lakes, there are significant differences. Data warehouses are typically used for analytics and insights, but they require defining the structure of your data beforehand. This means they mostly contain structured data, and the structure must be defined before analysis can occur, an approach known as schema-on-write.

Some may prefer this method as a defined structure can identify or reject data anomalies. However, data lakes offer greater flexibility with a schema-on-read architecture, which allows for unstructured data processing. As a result, the quality of analytics varies between data warehouses and data lakes. Data lake analysis includes big data analytics, full-text search, real-time analytics, and machine learning, all of which can reveal valuable business insights.
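To make the contrast concrete, here is a minimal sketch in Python; the order record and its fields are hypothetical:

    import json

    # Schema-on-write (warehouse-style): shape and validate each record
    # before storing it; records that don't fit are rejected up front.
    def validate_order(record: dict) -> dict:
        return {"order_id": int(record["order_id"]),
                "amount": float(record["amount"])}

    # Schema-on-read (lake-style): store the raw line untouched and apply
    # structure only at query time, keeping fields you didn't anticipate.
    raw_line = '{"order_id": "42", "amount": "19.99", "note": "gift wrap"}'
    order = json.loads(raw_line)  # structure applied when reading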

Another difference is that data warehouses primarily use SQL for queries, limiting their flexibility. Data lakes, on the other hand, support a range of tools depending on the service needed. While data warehouses can provide business intelligence, batch reporting, and visualisations, data lakes also offer predictive analytics, data discovery, machine learning, and profiling.

Data warehouses typically cater only to business analysts, whereas AWS Data Lakes can support all users. From operations managers needing daily reports to data scientists conducting statistical analysis and predictive modelling, everyone in the business can access the required information from this single repository.

It’s worth noting that the choice between data lakes and data warehouses isn’t always mutually exclusive. Many organisations opt for a combination of the two, called a lake house architecture, catering to different needs and uses and ultimately extracting the greatest value from their data.


Data lake components on AWS

AWS Services serve as the foundation of a data lake, providing a plethora of options to cater to various needs. These services fall into three primary categories: data storage and organisation, data transfer, and analytics and processing. One of the main advantages of a data lake is its ability to store data on an unlimited scale, addressing any concerns regarding space constraints. Furthermore, data lakes can store both structured and unstructured data using a schema-on-read approach. They also organise data in a manner that readies it for transfer – a critical step for processing and analytics. Lastly, the analytics and processing category encompasses a wide array of services, enabling users to manage multiple databases, generate reports, visualise data, or modify their data structure.

Amazon S3 – data storage

Your business may require a service capable of handling vast amounts of structured and unstructured data. This necessitates a solution that supports schema-on-read and unlimited scaling. Amazon Simple Storage Service (S3), along with S3 Glacier, fits the bill, functioning as the ‘data storage and organisation’ service. It is designed to simplify computing for both you and developers.

A standout feature of S3 is its ability to store any type of data file, be it a video, document, audio file, or a mix of various formats. This data-agnostic approach ensures centralisation and consolidation of data in one location.

Moreover, S3 caters to individual scaling needs, accommodating anything from a few gigabytes to hundreds of terabytes. As your data grows over time, this service adapts accordingly. A key concern for many is the durability of data storage services. S3 boasts an exceptional durability rate of 99.999999999% (more nines than anybody cares to count) annually, making the chances of data loss exceedingly low. Consequently, even if you don’t use the data now, you can safely retrieve it in the future.

S3 also employs a clever storage technique that helps reduce costs. It offers various storage tiers, categorising data based on access frequency. For example, frequently accessed data incurs lower access charges but higher storage costs. Conversely, infrequently accessed data has lower storage costs but higher retrieval fees. Thanks to S3 Intelligent-Tiering, you needn’t worry about organising these tiers yourself, as the system monitors access patterns and moves the data for you. Amazon S3 Glacier and S3 Glacier Deep Archive cater to rarely accessed data by archiving it, ensuring lower storage costs. However, retrieving this data can be slower and more expensive, which shouldn’t pose an issue if the data is seldom used. More commonly accessed data is placed in the S3 Standard tier for easier access.
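A minimal sketch of landing an object directly in the Intelligent-Tiering storage class with boto3; the bucket name and key are hypothetical:

    import boto3

    s3 = boto3.client("s3")
    # Store the object in Intelligent-Tiering so S3 moves it between
    # access tiers automatically as usage patterns change.
    s3.put_object(
        Bucket="acme-datalake",  # hypothetical bucket
        Key="raw/clickstream/2024/05/events.json",
        Body=b'{"event": "page_view"}',
        StorageClass="INTELLIGENT_TIERING",
    )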


AWS Glue – data catalog

For businesses struggling to organise vast amounts of data or to identify data types and sources, AWS Glue is the ideal solution. AWS Glue is a fully managed, serverless service designed to process and organise data on your behalf. It features the AWS Glue Data Catalog, which contains tables that help categorise and declutter your data. The Glue crawler populates these tables in the appropriate databases by scanning your S3 data and classifying it by type, schema, and structure. While the crawler can perform this task automatically, you still have the option to classify data according to your preferences, which is particularly useful if the crawler cannot recognise certain data structures. As your data evolves over time, the crawler keeps the existing data catalog up to date.
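A minimal sketch of defining and running a crawler with boto3; the crawler name, IAM role, database, and S3 path are hypothetical:

    import boto3

    glue = boto3.client("glue")
    # Define a crawler over the raw zone; re-running it keeps the
    # data catalog in sync as the underlying data evolves.
    glue.create_crawler(
        Name="raw-zone-crawler",  # hypothetical
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://acme-datalake/raw/"}]},
        Schedule="cron(0 2 * * ? *)",  # nightly re-crawl
    )
    glue.start_crawler(Name="raw-zone-crawler")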


Kinesis – data streaming services

When moving data, the data’s origin and type are crucial factors to consider. The Kinesis family of services provides real-time data ingestion in a reliable, secure, and cost-effective manner at scale. Kinesis Data Streams collects data from all sources, processes it, and then moves it into another service, such as S3. If you prefer a self-managed approach, working with the streams directly through the Software Development Kit (SDK) is suitable for you. The SDK processes data in smaller fragments, giving you control over scaling and data destinations. One advantage of the SDK is that multiple consumers can process the data, allowing you to delegate tasks to colleagues.
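For example, producing a single record to a stream through the SDK looks like this; the stream name and payload are hypothetical:

    import boto3

    kinesis = boto3.client("kinesis")
    # The partition key determines which shard receives the record,
    # which is how Kinesis scales ingestion across shards.
    kinesis.put_record(
        StreamName="clickstream",  # hypothetical stream
        Data=b'{"event": "add_to_cart", "user": "42"}',
        PartitionKey="user-42",
    )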

If you prefer a more automated approach requiring minimal intervention, Kinesis Firehose is the service for you. Like SDK, Firehose processes data in real-time, with the only aspect requiring management being the data’s destination and storage location. This approach is typically better suited for cases that demand minimal processing before data movement and storage.

Another service to consider is API Gateway. If your data doesn’t require real-time processing, API Gateway could be a more suitable option. By pointing it at your HTTP source, the data is processed and directed to your data lake. One drawback, however, is that ingestion is not as extensive as with the Kinesis family of services; API Gateway offers a simpler approach with fewer features.


Batch processing data services

Batch processing is a common method for analysing and processing data. It can help identify errors, missing information, or areas for improvement within the data. AWS EMR, a managed Apache Hadoop solution, is a service designed for the efficient analysis and processing of vast amounts of data (even petabytes). It does this by organising a cluster of computers that coordinate to process data quickly, which is often more efficient and cost-effective than relying on a single supercomputer. EMR allows you to scale your data processing in a customised way, with S3 serving as your storage. This enables the efficient processing of large amounts of data in batches and is, again, cost-effective, as you only pay while your clusters are running. You can choose between temporary clusters for infrequent data processing or persistent clusters that continue running even after data processing is complete.
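A sketch of launching a temporary Spark cluster on EMR with boto3; the cluster name, script location, and log path are hypothetical, and the default EMR IAM roles are assumed to exist:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")
    emr.run_job_flow(
        Name="nightly-batch",  # hypothetical cluster name
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Temporary cluster: shuts down once the steps complete
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://acme-datalake/jobs/transform.py"],  # hypothetical script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://acme-datalake/emr-logs/",  # hypothetical log path
    )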


Data Analytics Services

Once your data is processed and structured, it’s ready for further analysis. Amazon Athena is a serverless service that processes your data using SQL, without relying on clusters. Athena integrates seamlessly with Amazon S3, allowing for direct data processing without the need for data movement.
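A minimal sketch of probing the lake with Athena from boto3; the table, database, and results location are hypothetical:

    import boto3

    athena = boto3.client("athena")
    # The query runs serverlessly against data in S3; results are
    # written to the given (hypothetical) S3 output location.
    query = athena.start_query_execution(
        QueryString="SELECT event, COUNT(*) FROM clickstream GROUP BY event",
        QueryExecutionContext={"Database": "datalake_curated"},  # hypothetical
        ResultConfiguration={"OutputLocation": "s3://acme-datalake/athena-results/"},
    )
    print(query["QueryExecutionId"])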

Other AWS services that work well with S3 include Amazon Redshift (a data warehouse), which is similar to Athena but better suited to pre-stored data. For real-time data processing, Amazon Kinesis Firehose can be used, with Amazon Kinesis Data Analytics creating reports to highlight anomalies and trends in real-time data.


Amazon Elasticsearch / OpenSearch

Amazon Elasticsearch, these days renamed to Amazon OpenSearch Service, helps you locate specific pieces of data, whether it’s text, video, structured, or unstructured. Supported by S3, this service makes finding data stored in your data lake a breeze. It also includes Kibana (now OpenSearch Dashboards), a tool that helps visualise your data after it has been analysed, making it more discoverable.


Machine Learning Services

Machine learning is an emerging field that utilises artificial intelligence and is especially useful for predicting future business outcomes. Many businesses use machine learning to gain a competitive advantage and save money, provided the predictions are accurate. It is crucial to select reliable data to minimise the risks associated with incorrect predictions.

AWS offers a range of machine learning services, such as Amazon SageMaker, which allows you to build, train, and deploy machine learning models easily. Additionally, AWS provides pre-trained AI services, like Amazon Rekognition for image and video analysis, and Amazon Comprehend for natural language processing. These services enable businesses to harness the power of machine learning to gain valuable insights and make data-driven decisions.
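As a taste of the pre-trained services, here is a minimal sketch that labels an image already sitting in the lake with Amazon Rekognition; the bucket and key are hypothetical:

    import boto3

    rekognition = boto3.client("rekognition")
    # Detect up to ten labels in an image stored in S3
    labels = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": "acme-datalake",  # hypothetical bucket
                            "Name": "curated/images/storefront.jpg"}},
        MaxLabels=10,
    )
    for label in labels["Labels"]:
        print(label["Name"], label["Confidence"])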


Summary

AWS provides a comprehensive suite of services designed to meet the diverse needs of businesses and individuals managing data lakes. These services cover data storage and organisation, data movement, processing, analytics, and machine learning. Key AWS services include Amazon S3 for storage, AWS Glue for data organisation, the Kinesis family for real-time data movement, AWS EMR for batch processing, Amazon Athena and Redshift for analytics, Amazon OpenSearch (formerly Elasticsearch) for data discovery, and Amazon QuickSight for dashboarding. The flexibility and interoperability of these services make AWS an ideal solution for those looking to store, process, and gain insights from their data. The choice of service depends on your specific use case. By leveraging the right combination of AWS services, users can effectively manage and make the most of their data in today’s data-driven world.

We have not covered every data lake related service, as AWS regularly releases new features that expand data lake capabilities. One of the recent additions is Amazon MWAA, a managed Apache Airflow tool that provides advanced batch-processing orchestration. That being said, AWS is by no means the final word for data platform needs. There is a wealth of products from competing vendors that can do a better job than native AWS services in certain scenarios. Large enterprises often choose to combine their AWS data cloud platform with products like Snowflake, a powerful data warehousing solution; Tableau for dashboarding and visualisation; Matillion for simplified and effective ETL; Atlan, which offers an AI-aided data catalogue; and, last but not least, Dataiku for AI-aided machine learning. Usually, those services come with a premium price tag compared to AWS.

Additionally, other general cloud providers like Google Cloud Platform and Microsoft Azure are competing for the crown of best data management platform, while Databricks focuses solely on data solutions.

Stay tuned to learn more about data platforms; you can follow the @DataPhoenix LinkedIn page.
As always, for any data platform needs, contact us at info@dataphoenix.io – a small end-to-end data solutions provider.

DataPhoenix team

DataPhoenix goes live!


Hi

There is a real thirst in so many organisations to get more value from their data. Helping them do this is something I’m really passionate about. That’s why I’m launching DataPhoenix, a data consultancy with expertise in architecting and implementing data lake and data mesh solutions on AWS to help organisations unleash the power of their data.

We’re working with some really exceptional people who also have expertise in data lakes, data pipelines, AWS cloud, and DevOps.  Whether you’re trying to make sense of existing big data, want to integrate disparate pools of data or simply want to know what’s possible to help you unleash the potential of your data, DataPhoenix can design and implement the right solution.

SoboL @DataPhoenix

Contact Us  

We can help you unleash your data’s potential. Get in touch with the DataPhoenix team here.