Feature Engineering with Microsoft Fabric and Dataflow Gen2

Fabric Madness part 3

Image by author and ChatGPT. “Design an illustration, featuring a Paralympic basketball player in motion, this time the theme is on data pipelines” prompt. ChatGPT, 4, OpenAI, 15 April 2024. https://chat.openai.com.

In the previous post, we discussed how to use Notebooks with PySpark for feature engineering. While Spark offers a lot of flexibility and power, it can be quite complex and requires a fair amount of code to get started. Not everyone is comfortable with writing code or has the time to learn a new programming language, which is where Dataflow Gen2 comes in.

What’s Dataflow Gen2?

Dataflow Gen2 is a low-code data transformation and integration engine that lets you create data pipelines for loading data from a wide variety of sources into Microsoft Fabric. It’s based on Power Query, which is integrated into many Microsoft products, such as Excel, Power BI, and Azure Data Factory. Dataflow Gen2 is a great tool for creating data pipelines without code via a visual interface, making it easy to build pipelines quickly. If you’re already familiar with Power Query or aren’t afraid of writing code, you can also use the underlying M (“Mashup”) language to create more complex transformations.

In this post, we’ll walk through how to use Dataflow Gen2 to create the same features needed to train our machine learning model. We’ll use the same dataset as in the previous post, which contains data about college basketball games.

Fig. 1 — The end result. Image by author.

The Challenge

There are two datasets that we will be using to create our features: the regular season games and the tournament games. These two datasets are also split into the Men’s and Women’s tournaments, which will need to be combined into a single dataset. In total there are four csv files that need to be combined and transformed into two separate tables in the Lakehouse.

Using Dataflows there are multiple ways to solve this problem, and in this post I want to show three different approaches: a no code approach, a low code approach, and finally a more advanced all code approach.

The no code approach

The first and simplest approach is to use the Dataflow Gen2 visual interface to load the data and create the features.

The Data

The data we’re looking at is from the 2024 US college basketball tournaments, and was obtained from the ongoing March Machine Learning Mania 2024 Kaggle competition, the details of which can be found here. It is licensed under CC BY 4.0.

Loading the data

The first step is to get the data from the Lakehouse, which can be done by selecting the “Get Data” button in the Home ribbon and then choosing More… from the list of data sources.

Fig. 2 — Choosing a data source. Image by author.

From the list, select OneLake data hub to find the Lakehouse, and then, once selected, find the csv file in the Files folder.

Fig. 3 — Select the csv file. Image by author.

This will create a new query with four steps (sketched in M below), which are:

  • Source: A function that queries the Lakehouse for all of its contents.
  • Navigation 1: Converts the contents of the Lakehouse into a table.
  • Navigation 2: Filters the table to retrieve the selected csv file by name.
  • Imported CSV: Converts the binary file into a table.
Fig. 4 — Initial load. Image by author.
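For reference, the M that these four steps generate looks roughly like the following. This is only a sketch: the workspace and lakehouse IDs are placeholders, and the csv file name is just an example of one of the competition files.

let
    // Query the Lakehouse for all of its contents
    Source = Lakehouse.Contents(null){[workspaceId = "..."]}[Data]{[lakehouseId = "..."]}[Data],
    // Convert the contents of the Lakehouse into a table of items
    #"Navigation 1" = Source{[Id = "Files", ItemKind = "Folder"]}[Data],
    // Retrieve the selected csv file by name (example file name)
    #"Navigation 2" = #"Navigation 1"{[Name = "MRegularSeasonDetailedResults.csv"]}[Content],
    // Convert the binary file into a table
    #"Imported CSV" = Csv.Document(#"Navigation 2", [Delimiter = ",", QuoteStyle = QuoteStyle.None])
in
    #"Imported CSV"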

Now that the data is loaded we can start with some basic data preparation to get it into a format that we can use to create our features. The first thing we need to do is set the column names based on the first row of the dataset. This can be done by selecting the “Use first row as headers” option, either in the Transform group on the Home ribbon or in the Transform menu item.

The next step is to rename the column “WLoc” to “location”, either by selecting the column in the table view, or by right-clicking on the column and selecting “Rename”.
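Behind the scenes, these two steps generate M roughly along these lines (a sketch; the step names may differ slightly from what the editor produces):

// Promote the first row of the csv to column headers
#"Promoted headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars = true]),
// Rename the WLoc column to location
#"Renamed columns" = Table.RenameColumns(#"Promoted headers", {{"WLoc", "location"}})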

The location column contains the location of the game, which is either “H” for home, “A” for away, or “N” for neutral. For our purposes, we want to convert this to a numerical value, where “H” is 1, “A” is -1, and “N” is 0, as this will make it easier to use in our model. This can be done by selecting the column and then using the Replace values… transform in the Transform menu item.

Fig. 5 — Replace Values. Image by author.

This will need to be done for the other two location values as well.

Finally, we need to change the data type of the location column to be a Whole number instead of Text. This can be done by selecting the column and then choosing the data type from the drop down list in the Transform group on the Home ribbon.
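The generated step for the type change is roughly the following (a sketch; #"Replaced value" stands in for whatever the previous step is called in your query):

// Convert the location column from Text to Whole number
#"Changed column type" = Table.TransformColumnTypes(#"Replaced value", {{"location", Int64.Type}})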

Fig. 6 — Final data load. Image by author.

Instead of repeating the replace values step for each of the location values, a little bit of M code can be used to replace them all at once. This can be done by selecting the previous transform in the query (Renamed columns) and then selecting the Insert step button in the formula bar. This will add a new step, and you can enter the following code to replace the values in the location column.

Table.ReplaceValue(#"Renamed columns", each [location], each if Text.Comprises([location], "H") then "1" else if Text.Comprises([location], "A") then "-1" else "0", Replacer.ReplaceText, {"location"})

Adding features

We have the data loaded, but it’s still not right for our model. Each row in the dataset represents a game between two teams, and includes the scores and statistics for both the winning and the losing team in a single wide table. We need to create features that represent the performance of each team in the game, and to have one row per team per game.

To do this we need to split the data into two tables, one for the winning team and one for the losing team. The simplest way to do this is to create a new query for each team and then merge them back together at the end. There are a few ways that this could be done, however to keep things simple and understandable (especially if we ever need to come back to this later), we’ll create two references to the source query and then append them together again, after doing some light transformations.

Referencing the query can be done either from the Queries panel on the left, or by selecting the context menu of the query if using Diagram view. This will create a new query that references the original query, and any changes made to the original query will be reflected in the new one. I did this twice, once for the winning team and once for the losing team, and then renamed the columns by prefixing them with “T1_” and “T2_” respectively.
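If you would rather not rename each column by hand, the prefixing can also be expressed as a single M step like the one below. This is a sketch that mirrors the approach used in the all code section later on; the input step name is illustrative, and the losing-team reference simply swaps the two prefixes.

// Prefix winning-team columns with T1_ and losing-team columns with T2_
#"Prefixed columns" = Table.TransformColumnNames(#"Changed column type", each if Text.StartsWith(_, "W") then Text.Replace(_, "W", "T1_") else Text.Replace(_, "L", "T2_"))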

Fig. 7 — Split the dataset. Image by author.

Once the column names are set, we can combine the two queries back together by using Append Queries, and then create our first feature, which is the point difference between the two teams. This can be done by selecting the T1_Score and T2_Score columns and then choosing “Subtract” from the “Standard” group on the Add column ribbon.
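In M, the append and the new column come out roughly as follows (a sketch; Winners and Losers stand in for the names of the two referenced queries, and the UI may give the subtraction column a default name such as “Subtraction”, which you can rename to PointDiff):

// Append the winning-team and losing-team queries back into one table
Combined = Table.Combine({Winners, Losers}),
// Point difference between the two teams
PointDiff = Table.AddColumn(Combined, "PointDiff", each [T1_Score] - [T2_Score], Int64.Type)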

Now that’s done, we can load the data into the Lakehouse as a new table. The result should look something like this:

Fig. 8 — All joined up. Image by author.

There are a few limitations with the no code approach, the main one being that it’s hard to reuse queries or transformations. In the above example we would need to repeat the same steps another three times to load each of the individual csv files. This is where copy / paste comes in handy, but it isn’t ideal. Let’s look at a low code approach next.

The low code approach

In the low code approach we’ll use a combination of the visual interface and the M language to load and transform the data. This approach is more flexible than the no code approach, but still doesn’t require a lot of code to be written.

Loading the data

The goal of the low code approach is to reduce the number of repeated queries that are needed and to make it easier to reuse transformations. To do this we’ll take advantage of the fact that Power Query is a functional language, and that we can create functions to encapsulate the transformations we want to apply to the data. When we first loaded the data from the Lakehouse, four steps were created; the second step converted the contents of the Lakehouse into a table, with each row containing a reference to a binary csv file. We can use this as the input to a function, which will load the csv into a new table, using the Invoke custom function transformation for each row of the table.

Fig. 9 — Lakehouse query with the binary csv files in a column called Content. Image by author.

To create the function, select “Blank query” from the Get data menu, or right click the Queries panel and select “New query” > “Blank query”. In the new query window, enter the following code:

(TableContents as binary) =>
let
    Source = Csv.Document(TableContents, [Delimiter = ",", Columns = 34, QuoteStyle = QuoteStyle.None]),
    PromoteHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    PromoteHeaders

The code of this function has been copied from our initial no code approach, but instead of loading the csv file directly, it takes a parameter called TableContents, reads it as a csv file with Csv.Document, and then sets the first row of the data to be the column headers with Table.PromoteHeaders.

We can then use the Invoke custom function transformation to apply this function to each row of the Lakehouse query. This can be done by selecting the “Invoke custom function” transformation from the Add column ribbon and then selecting the function that we just created.
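The generated step looks roughly like this, assuming the function above was saved with the name LoadCSV (a sketch):

// Apply LoadCSV to the binary in the Content column of each row
#"Invoked custom function" = Table.AddColumn(#"Navigation 1", "Invoked custom function", each LoadCSV([Content]))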

Fig. 10 — Invoke custom function. Image by author.

This will create a new column in the Lakehouse query, with the entire contents of each csv file loaded into a table, which is represented as [Table] in the table view. We can then use the expand function on the column heading to expand the table into individual columns.
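Expanding produces a Table.ExpandTableColumn step that lists every column to pull out, roughly as below (a sketch; only a handful of the 34 columns are shown for brevity):

// Expand the nested tables into individual columns (column list truncated)
#"Expanded columns" = Table.ExpandTableColumn(#"Invoked custom function", "Invoked custom function", {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc"}, {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc"})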

Fig. 11 — Expand columns. Image by author.

The result effectively combines the two csv files into a single table, which we can then continue to create our features from as before.

There are still some limitations with this approach: while we’ve reduced the number of repeated queries, we still need to duplicate everything for both the regular season and tournament games datasets. This is where the all code approach comes in.

The all code approach

The all code approach is the most flexible and powerful, but also requires the most code to be written. It is best suited to those who are comfortable writing code and want full control over the transformations that are applied to the data.

Essentially what we’ll do is grab all the M code that was generated in each of the queries and combine it into a single query. This will allow us to load all the csv files in a single query and then apply the transformations to each of them in a single step. To get all the M code, we can select each query and then click on the Advanced Editor in the Home ribbon, which displays all the M code that was generated for that query. We can then copy and paste this code into a new query and combine it all together.

To do this, we need to create a new blank query and then enter the following code:

(TourneyType as text) =>
let
    Source = Lakehouse.Contents(null){[workspaceId = "..."]}[Data]{[lakehouseId = "..."]}[Data],
    #"Navigation 1" = Source{[Id = "Files", ItemKind = "Folder"]}[Data],
    #"Filtered rows" = Table.SelectRows(#"Navigation 1", each Text.Contains([Name], TourneyType)),
    #"Invoked custom function" = Table.AddColumn(#"Filtered rows", "Invoked custom function", each LoadCSV([Content])),
    #"Removed columns" = Table.RemoveColumns(#"Invoked custom function", {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path", "ItemKind", "IsLeaf"}),
    #"Expanded Invoked custom function" = Table.ExpandTableColumn(#"Removed columns", "Invoked custom function", {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc", "NumOT", "WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA", "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF", "LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA", "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF"}, {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc", "NumOT", "WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA", "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF", "LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA", "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF"}),
    #"Renamed columns" = Table.RenameColumns(#"Expanded Invoked custom function", {{"WLoc", "location"}}),
    Custom = Table.ReplaceValue(#"Renamed columns", each [location], each if Text.Contains([location], "H") then "1" else if Text.Contains([location], "A") then "-1" else "0", Replacer.ReplaceText, {"location"}),
    #"Change Types" = Table.TransformColumnTypes(Custom, {{"Season", Int64.Type}, {"DayNum", Int64.Type}, {"WTeamID", Int64.Type}, {"WScore", Int64.Type}, {"LTeamID", Int64.Type}, {"LScore", Int64.Type}, {"location", Int64.Type}, {"NumOT", Int64.Type}, {"WFGM", Int64.Type}, {"WFGA", Int64.Type}, {"WFGM3", Int64.Type}, {"WFGA3", Int64.Type}, {"WFTM", Int64.Type}, {"WFTA", Int64.Type}, {"WOR", Int64.Type}, {"WDR", Int64.Type}, {"WAst", Int64.Type}, {"WTO", Int64.Type}, {"WStl", Int64.Type}, {"WBlk", Int64.Type}, {"WPF", Int64.Type}, {"LFGM", Int64.Type}, {"LFGA", Int64.Type}, {"LFGM3", Int64.Type}, {"LFGA3", Int64.Type}, {"LFTM", Int64.Type}, {"LFTA", Int64.Type}, {"LOR", Int64.Type}, {"LDR", Int64.Type}, {"LAst", Int64.Type}, {"LTO", Int64.Type}, {"LStl", Int64.Type}, {"LBlk", Int64.Type}, {"LPF", Int64.Type}}),
    Winners = Table.TransformColumnNames(#"Change Types", each if Text.StartsWith(_, "W") then Text.Replace(_, "W", "T1_") else Text.Replace(_, "L", "T2_")),
    #"Rename L" = Table.TransformColumnNames(#"Change Types", each if Text.StartsWith(_, "W") then Text.Replace(_, "W", "T2_") else Text.Replace(_, "L", "T1_")),
    #"Replaced Value L" = Table.ReplaceValue(#"Rename L", each [location], each if [location] = 1 then -1 else if [location] = -1 then 1 else [location], Replacer.ReplaceValue, {"location"}),
    Losers = Table.TransformColumnTypes(#"Replaced Value L", {{"location", Int64.Type}}),
    Combined = Table.Combine({Winners, Losers}),
    PointDiff = Table.AddColumn(Combined, "PointDiff", each [T1_Score] - [T2_Score], Int64.Type)
in
    PointDiff

Note: the Lakehouse connection values have been removed

What’s happening here is that we’re:

  1. Loading the data from the Lakehouse;
  2. Filtering the rows to only include the csv files that match the TourneyType parameter;
  3. Loading the csv files into tables;
  4. Expanding the tables into columns;
  5. Renaming the columns;
  6. Changing the data types;
  7. Combining the two tables back together;
  8. Calculating the point difference between the two teams.

Using the query is then as simple as selecting it and then invoking the function with the TourneyType parameter.
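In other words, the new query’s source boils down to a single function call (a sketch; LoadTournamentData is whatever name you gave the function query above):

// Load and transform both the Men's and Women's regular season csv files in one go
Source = LoadTournamentData("RegularSeasonDetailedResults")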

Fig. 12 — Invoke function. Image by author.

This will create a new query with the function as its source, and the data loaded and transformed. It’s then just a case of loading the data into the Lakehouse as a new table.

Fig. 13 — Function load. Image by author.

As you can see, the LoadTournamentData function is invoked with the parameter “RegularSeasonDetailedResults”, which will load both the Men’s and Women’s regular season games into a single table.

And that’s it!

Conclusion

Hopefully this post has given you a good overview of how to use Dataflow Gen2 to prepare data and create features for your machine learning model. Its low code approach makes it easy to create data pipelines quickly, and it contains a lot of powerful features that can be used to build complex transformations. It’s a great first port of call for anyone who needs to transform data, and, more importantly, it has the benefit of not needing to write complex code that is prone to errors, hard to test, and difficult to maintain.

At the time of writing, Dataflows Gen2 are not supported by the Git integration, so it isn’t possible to version control or share the dataflows. This feature is expected to be released in Q4 2024.
