Best Practices in Azure Data Factory Version 2
While you can find plenty of “how to” articles on the web about Microsoft’s Azure Data Factory (ADF), there are virtually no “why to” articles. Here we’ll define some best practices to keep in mind while working in Azure Data Factory version 2.
Azure Data Factory Best Practices
Since there aren’t many guiding resources on Azure Data Factory version 2, I wanted to share some “bigger-picture” notions about how to approach orchestration and data pipelines from a more architectural perspective. Let me preface this by noting that ADF version 2 differs considerably from version 1 in its feature set, which is why the content below is aimed squarely at version 2.
First things first: remember that good architecture always calls for an appropriate separation of concerns and functionality between your solution layers. If you are working in ADF, it stands to reason that you are probably building a Modern Data Architecture solution in the Azure cloud. Therefore, your solution should consist of at least three separate layers (a sketch of one possible lake-zone layout follows the list):
- The Ingestion Layer - This layer focuses purely on the intake of raw data from source systems. In Modern Data Architecture, this layer typically stores data in a “raw zone” in a data lake store. Perform minimal cleansing or transformation here, if any at all. Further, non-system users should consume little (if any) data directly from the raw zone.
- Transformation/Experimentation Layer - This layer is where we massage data from the raw zone into a consumable form. It is typically more a process than a data store: transformations are applied to the raw data, and the results are written to the consumption layer. This is also where experimentation and data science needs are served, since those data sets are often used differently than the ones prepared for end-user consumption.
- Consumption Layer - This is where we store ready-to-use data for user consumption. It may take the form of a formal data warehouse or data mart, or it may live in the data lake itself in a “cleansed” or “user” zone.
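To make the zones concrete, here is a minimal sketch of one common naming and layout convention for these zones in Azure Data Lake Storage Gen2. The account name, container names, and path scheme are illustrative assumptions, not something ADF or the architecture requires:

```python
from datetime import date

# Hypothetical ADLS Gen2 account "mydatalake"; one filesystem (container)
# per zone keeps security and lifecycle policies simple to reason about.
ZONES = {
    "raw": "abfss://raw@mydatalake.dfs.core.windows.net",            # ingestion layer
    "cleansed": "abfss://cleansed@mydatalake.dfs.core.windows.net",  # transformation output
    "user": "abfss://user@mydatalake.dfs.core.windows.net",          # consumption layer
}

def raw_zone_path(source_system: str, entity: str, load_date: date) -> str:
    """Build a date-partitioned landing path in the raw zone,
    e.g. .../erp/customers/2020/04/06/."""
    return f"{ZONES['raw']}/{source_system}/{entity}/{load_date:%Y/%m/%d}/"

print(raw_zone_path("erp", "customers", date(2020, 4, 6)))
```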
Having laid out these concepts, note that Azure Data Factory version 2 doesn’t play much of a role in the consumption end of things. It is mostly intended as a key utility in the ingestion and transformation layers. That said, the specific role of ADF, and your approach to it, differs between those two layers.
While ingestion can be carried out by ADF alone (with some considerations), chiefly through its Copy activity, transformation is not as straightforward, and there ADF is better relegated to the role of orchestrator.
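As an example of the ingestion side, here is a minimal sketch of defining a copy pipeline through the azure-mgmt-datafactory Python SDK, following the pattern from Microsoft’s Python quickstart. The resource names are placeholders, the two datasets are assumed to already exist in the factory, and constructor details can vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Land source data in the raw zone as-is: no cleansing, no transformation.
copy = CopyActivity(
    name="IngestToRawZone",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="SourceSystemDataset")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="RawZoneDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "IngestRawPipeline",
    PipelineResource(activities=[copy]),
)
```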
Which brings an important point into focus…
ADF is primarily an orchestration tool, not so much a data transformation tool. Yes, it has some capabilities in that regard, but typical solutions defer the transformation logic to Databricks, Spark/Storm, or (less commonly these days) HDInsight.
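To illustrate that orchestrator role, here is a sketch of a pipeline whose only job is to hand transformation off to a Databricks notebook. The linked service name, notebook path, and parameter are assumptions for illustration, not the article’s prescribed setup:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference, PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# ADF only schedules and sequences; the heavy lifting happens in Databricks.
transform = DatabricksNotebookActivity(
    name="TransformRawToCleansed",
    notebook_path="/pipelines/raw_to_cleansed",  # hypothetical notebook
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureDatabricksLinkedService"),
    base_parameters={"load_date": "@{formatDateTime(utcnow(), 'yyyy-MM-dd')}"},
)

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "TransformPipeline",
    PipelineResource(activities=[transform]),
)
```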
What does this mean for your ELT/ETL architecture with ADF? It means considering each layer of the solution, and each zone of the data architecture, separately. Couple your ingestion subsystem loosely with your transformation subsystem (one way to do so is sketched below), and weigh the needs of each on its own. Don’t feel compelled to force ADF into a role it’s not suited for.
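One way to keep the two subsystems loosely coupled, shown here as a sketch rather than a prescription, is to let a storage event trigger start the transformation pipeline whenever new files land in the raw zone, instead of having the ingestion pipeline call it directly. The storage account resource ID and path filter below are illustrative:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger, PipelineReference, TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fire on new blobs in the raw container; ingestion never needs to know
# that a transformation pipeline exists on the other side.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/raw/blobs/erp/",  # container "raw", folder "erp"
    scope=("/subscriptions/<subscription-id>/resourceGroups/my-resource-group"
           "/providers/Microsoft.Storage/storageAccounts/mydatalake"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="TransformPipeline"))],
)

client.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "RawFileArrivedTrigger",
    TriggerResource(properties=trigger),
)
```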
Looking for more on Azure?
Explore more insights and expertise at smartbridge.com/data
Keep Reading: Modern Data Integration in Cloud — Azure Data Factory for Snowflake
Originally published at https://smartbridge.com on April 6, 2020.