Introduction
In the previous article, I described how to connect to the Microsoft Graph API using Synapse Analytics, including linked service configuration and a sample pipeline. This pipeline consumes a generic dataset which uses the “Relative URL” dataset property together with pipeline and dataset parameters. Finally, a set of AAD groups and their attributes were retrieved and stored in a parquet dataset in ADLS.
In this post I will enhance the previous pipeline by adding additional features like filtering out empty AAD groups, retrieve and store group members, and create batches to minimize the number of API calls.
Business Scenario
Let’s imagine we have the following requirements from business:
- AAD groups with a name pattern must be imported in a database table
- The members of the previous group must be imported in another database table too
- The number of group members must be store in the group table
- The number of calls to the Graph API must be minimize
Target data model
The following diagram represents our target data model:
- graph_api.AAD_Group: table to store the AAD groups
- graph_api.AAD_Group_Member: table to store the members of the AAD groups existing in the AAD_Group table.
- graph api.AAD_Group_Member_Raw: this is a kind of staging table and will help us to store intermediate results of the API calls.
Remark: I have not optimized the data types in the model.
The Synapse Pipeline
This is an example pipeline to meet our requirements. Please be aware that I am not including logging or error handling in the design.
- Copy AAD Groups: this activity imports the AAD group as explained in the previous article. The only difference is we are using a table in a SQL dedicated pool instead of a parquet dataset as target.
- Create requests AAD Members: create the request body for each batch.
- SP Copy members: an SP to move the members from the AAD_Group_Member_Raw to the AAD_Group_Member table.
Grouping group IDs in batches
Some entities of the graph APIs accept batch requests of a maximum of 20 objects. In this case batches of 20 group IDs are built using SQL. Since all IDs in the graph_api.AAD_Group table must be processed, the total number of records is divided by 20. The result + 1 is the total number of batches (the last batch has from 0 to 19 objects). Then the SQL function NTILE is used to create buckets of 20 rows and assign a corresponding “batch” or bucket number.
declare @number_of_groups int;
set @number_of_groups = (select count(*) from graph_api.[AAD_Group]);
select
concat('[',
STRING_AGG(
CONCAT('{"id":"' ,Qry.Group_Id, '","method": "GET","url":"/groups/', Qry.[group_id]
,'/members?$count=true&$select=id,displayName,mail,userPrincipalName'
,',"headers":{"ConsistencyLevel": "eventual"}}') ,',')
,']') as json_output
from (
select Group_Id,
row_number() over(order by load_ts desc) as id,
ntile((@number_of_groups / 20)+1) over(order by load_ts desc) as tile from
[graph_api].[AAD_Group]
)as Qry group by tile;
This is an example output containing only two groups for visibility, but sets of 20 groups are created if there are enough results:
For additional information visit this article: Combine multiple requests in one HTTP call using JSON batching.
For Each Batch import the group members
Now we have the body for a batch request optimizing the number of calls with a reduction factor of 20. Let’s configure the Copy activity within the loop:
- The source dataset is the same used to get the groups but this time it has a different setting: Relative URL: https://graph.microsoft.com/v1.0/$batch
- Request Method = POST
- Request Body: the json_output of every item in the foreach collection
- Header: Content-Type = application/json
This is the Sink configuration:
The mappings are configured to iterate through the “responses” collection and from the body the following attributes:
- value: array containing the members
- @odata.count: total count of group members
- id: id of the request which is the group_id (because we decided it in order to keep track it)
- status: status of the API call – Only 200 responses are considered valid
This is the output of the graph_api.AAD_Group_Member_Raw Table:
Now it’s quite simple to detect groups without members for further processing. The members for every group are stored in an array in the column “value”. This array is processed in the next activity to copy the members to its target table.
Loading the AAD members table
Finally, we are going to move the members to the target table using a stored procedure to process JSON arrays of members and filter out empty groups or status other than 200.
CREATE PROC [graph_api].[Move_AAD_Members_Stg_DWH] AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON
-- Truncate target table
TRUNCATE TABLE [graph_api].[AAD_Group_Member]
-- Unpack json values
INSERT INTO [graph_api].[AAD_Group_Member]
select Member_Id, id as Group_Id, Member_Name, User_Principal_Name, Mail, GETDATE() as DSTS
from [graph_api].[AAD_Group_Member_Raw]
cross apply openjson(value) WITH (
Member_Id VARCHAR(32) '$.id',
Member_Name NVARCHAR(255) '$.displayName',
User_Principal_Name NVARCHAR(255) '$.userPrincipalName',
Mail NVARCHAR(255) '$.mail'
)
where member_count > 0 and status = '200';
END
Summary
In this article an approach to optimize calls to the Microsoft Graph API is described. To showcase the features database tables, Synapse pipelines and some SQL code were used to query AAD groups and their members. As in previous posts, SQL code was used to interact with JSON data.