Extract, Transform, and Load Process

Any discussion of data analytics starts with the extract, transform, and load (ETL) process. This is a conceptual method for obtaining the information we need to analyze. It extracts information from a system, structures that information, and then stores it.

The first step is to extract the data. Let’s say we are a company that develops a mobile phone app and we want to perform data analytics. To begin the analysis, we need to extract the sales data. The extraction could either be automated, or we could assign an employee to extract the data manually.
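As a rough illustration of an automated extraction, here is a minimal Python sketch. It assumes the sales system can export a CSV file; the column names, the sample rows, and the function name extract_sales are all made up for this example and do not come from any real system.

```python
import csv
import io

# Hypothetical raw export from the app's sales system (made-up sample data).
RAW_EXPORT = """transaction_id,batch_number,product,amount
1001,B-17,Premium upgrade,4.99
1002,B-17,Coin pack,1.99
1002,B-17,Coin pack,1.99
1003,B-18,Premium upgrade,-4.99
"""


def extract_sales(source):
    """Read every row from a CSV source into a list of dicts, unchanged."""
    return list(csv.DictReader(source))


# io.StringIO stands in for an open file so the sketch runs on its own.
raw_rows = extract_sales(io.StringIO(RAW_EXPORT))
print(len(raw_rows), "rows extracted")
```

Nothing is cleaned at this point. The duplicate row, the batch number, and the negative amount are all still present, which is exactly what the transform step deals with.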

Having extracted the data, the second step is to transform it. This is when we format the data to prepare it for analysis. We deal with both structured data and unstructured data in this phase. Unstructured data may contain duplicates, irrelevant details, or sensitive information that needs to be deleted. For instance, if the extracted data includes irrelevant information, like a batch number, we can delete it. If we notice an erroneous transaction, we remove that too. Once the data has been cleaned up, it is in its structured form and ready for storage.

Study Tip: Unstructured data is not ready to use for analysis. Structured data is cleaned up and ready for analysis.
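The clean-up described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names and sample rows are invented, and treating a non-positive amount as an "erroneous transaction" is our own rule for the example, not a general definition.

```python
def transform_sales(raw_rows):
    """Return cleaned rows: no batch number, no duplicates, no invalid amounts."""
    seen_ids = set()
    cleaned = []
    for row in raw_rows:
        amount = float(row["amount"])
        if amount <= 0:
            continue  # assumed rule: a non-positive amount is an erroneous transaction
        if row["transaction_id"] in seen_ids:
            continue  # skip duplicates of a transaction we already kept
        seen_ids.add(row["transaction_id"])
        # Keep only the fields needed for analysis; batch_number is dropped.
        cleaned.append({
            "transaction_id": int(row["transaction_id"]),
            "product": row["product"],
            "amount": amount,
        })
    return cleaned


# Hypothetical extracted rows (made-up sample data).
raw_rows = [
    {"transaction_id": "1001", "batch_number": "B-17", "product": "Premium upgrade", "amount": "4.99"},
    {"transaction_id": "1002", "batch_number": "B-17", "product": "Coin pack", "amount": "1.99"},
    {"transaction_id": "1002", "batch_number": "B-17", "product": "Coin pack", "amount": "1.99"},
    {"transaction_id": "1003", "batch_number": "B-18", "product": "Premium upgrade", "amount": "-4.99"},
]

clean_rows = transform_sales(raw_rows)
print(len(clean_rows), "rows kept")
```

Of the four sample rows, one is a duplicate and one has a negative amount, so two rows survive the clean-up.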

Once we’ve extracted and transformed the data, the third step is to load it. This simply means we store the data in a particular type of database, from which we can begin using it for analysis.
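To make the load step concrete, here is a small Python sketch that stores cleaned rows in a database and then queries them. It uses an in-memory SQLite database purely as a stand-in for whatever storage a company actually uses; the table name, column names, and sample rows are assumptions for this example.

```python
import sqlite3

# Hypothetical cleaned rows produced by the transform step (made-up data).
clean_rows = [
    {"transaction_id": 1001, "product": "Premium upgrade", "amount": 4.99},
    {"transaction_id": 1002, "product": "Coin pack", "amount": 1.99},
]


def load_sales(connection, rows):
    """Create the sales table if needed and insert the cleaned rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "transaction_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    connection.executemany(
        "INSERT INTO sales (transaction_id, product, amount) "
        "VALUES (:transaction_id, :product, :amount)",
        rows,
    )
    connection.commit()


# ":memory:" keeps the sketch self-contained; a real system would use persistent storage.
connection = sqlite3.connect(":memory:")
load_sales(connection, clean_rows)

# Once loaded, the data can be queried for analysis.
row_count = connection.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = connection.execute("SELECT ROUND(SUM(amount), 2) FROM sales").fetchone()[0]
print(row_count, "rows loaded, total sales:", total)
```

The final query is the payoff of the whole process: because the data was cleaned before loading, a simple total can be trusted without further checks.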

When loading data, there are three potential storage options. The choice depends on whether the data is structured or unstructured, and whether it serves company-wide needs or department-specific needs.

The first option is a data warehouse, which only contains structured data for company-wide needs. Since it’s structured, it’s ready to use.

The next option is a data mart. This is like a smaller version of a warehouse, but instead of serving company-wide needs, it serves the needs of a specific department. Its data is also already structured and ready to use.

The last option is a data lake. This is suitable when there isn’t enough time to transform the data. All the information, both structured and unstructured, gets dumped here. For instance, a platform like Instagram might store all of its information in a data lake, including all text, images, and usage time data.

Study Tip: A data warehouse and data mart only contain structured data. A data lake contains structured and unstructured data.
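The three options can be summarized as a simple decision rule. The Python sketch below is a study aid only: the function name choose_storage and its parameters are made up, and it simplifies real-world practice by sending anything unstructured to the data lake and routing structured data by its audience.

```python
def choose_storage(is_structured, scope):
    """Pick a storage option from the two questions in the text.

    is_structured: True if the data has been cleaned and structured.
    scope: "company" for company-wide needs, "department" for one department.
    """
    if not is_structured:
        # Only a data lake accepts data that has not been transformed.
        return "data lake"
    if scope == "company":
        return "data warehouse"
    if scope == "department":
        return "data mart"
    raise ValueError("scope must be 'company' or 'department'")


print(choose_storage(True, "company"))
print(choose_storage(True, "department"))
print(choose_storage(False, "company"))
```

Keep in mind that a data lake can hold structured data as well, so the rule describes the typical choice rather than a strict restriction.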
