Working with Arrays in Azure Streaming Analytics

When working with Azure Streaming Analytics (ASA) it is extremely common to use JSON for your events.  JSON allows us to express data in a rich way with things like nested objects and arrays.  This though can cause some head scratching further along the chain when we need to work with that data,  In this post I want to look at working with arrays in ASA.

 

Here is my event data

jsondata

As you can see I have an array of objects which describe samples.  In this format they are not much use to me.  What I need to do is Pivot this data as it comes through ASA ready to be put into my destination.

Here’s the query that does that

SELECT
source___timestamp,
source.time,
flattened.arrayindex,
flattened.arrayvalue.sample,
flattened.arrayvalue.x,
flattened.arrayvalue.y,
flattened.arrayvalue.z
INTO
output
FROM
source
CROSS APPLY GetElements(source.samples) AS flattened


And here is what it looks like on the other side


output



Perfect. It is now in a format I can use more easily.

DocumentDB Indexing Policies–Portal Choices

I have been using Azure DocumentDB for a good while now and think it is a very interesting service.  The fact that all attributes by default are able to be indexed with no real sign of performance impact on insertion speaks volumes for the team in Redmond.
As I said by default all paths within a document are indexed.  DocumentDB allows you, when creating a collection in the portal, to specify how you want to index documents for that collection.  In days gone by everything was indexed using a hash index and whilst this was good for equality joins it wouldn’t work with range queries (Yes I know you can specify ScanInQuery to get around it but this is a performance hole).
You now get a choice and I for one think this is great.  You can still create your own custom indexing policy whereby you may exclude certain paths from being indexed or you may put a range index on a string field but I think a large number of indexing cases will be satisifed by the options in the portal.  Here are a couple of examples of where range queries help.

 

“Get me all documents where people working in HR earn more than £50k per year”
“Show me documents where people who sat the music exam scored between 45 and 65 percent”

 

Example

 

I have created my database and collection and I also specified the default indexing policy from this screen.

 

image

 

This is a sample document from my collection

 

{
"candidateid": 123456,
"school": "Castleford Academy",
"results": 
[
{
"exam": "english",
"gradepct": 76
},
{
"exam": "maths",
"gradepct": 80
},
{
"exam": "music",
"gradepct": 55
}
]
}

 

And finally here is my query with results that asks for documents where candidates have a gradepct value greater than 78.

 

SELECT c.candidateid, examresult FROM c JOIN examresult IN c.results
WHERE examresult.gradepct > 78

 

[
{
"candidateid": "123456",
"examresult": {
"exam": "maths",
"gradepct": 80
}
}
]

 

Conclusion

 

I really like that DocumentDB has sensible/expected defaults when it comes to indexing your documents as well as still being able to be very precise with custom indexing policies.