Deleting Multiple Documents from Azure DocumentDB

I have been using DocumentDB a lot recently and wanted to share with you something that is harder to do than it should be.  When I say harder I mean you have to type more to get what seems a really easy thing to do.  I will also tell you how to do this more efficiently but because of cost sensitivity I couldn’t do it that way.

Scenario

I have a collection of documents.  These documents are being streamed into DocumentDB from a worker role which is reading from Azure Event Hubs.  As you can imagine I get a lot of documents in a relatively short space of time.  The size of a DocumentDB collection is 10GB.  I only want to use one collection.  My situation is that really I only need to keep two days worth of data in the collection at any one time.  My requirements therefore are

  • Maintain only one collection
  • Retain only two days worth of data
  • Remove documents on a schedule.

For point #3 I am using a .net StopWatch object and that is really simple.  Having only one collection is also very simple so really it comes down to

How do I delete a bunch of documents from DocumentDB?

First Attempt

//setup
string databaseId = ConfigurationManager.AppSettings["DatabaseId"];
string collectionId = ConfigurationManager.AppSettings["CollectionId"];
string endpointUrl = ConfigurationManager.AppSettings["EndPointUrl"];
string authorizationKey = ConfigurationManager.AppSettings["AuthorizationKey"];
//connection policy
ConnectionPolicy policy = new ConnectionPolicy()
{
ConnectionMode = ConnectionMode.Direct,
ConnectionProtocol = Protocol.Tcp
};
//Build our selection criteria
var sqlquery = "SELECT * FROM c WHERE " + ToUnixTime(DateTime.Now).ToString() + " - c.time > 172800";
//Get our client
DocumentClient client = new DocumentClient(new Uri(endpointUrl), authorizationKey, policy);
//Database
Database database = client.CreateDatabaseQuery().Where(db => db.Id == databaseId).ToArray().FirstOrDefault();

//Get a reference to the collection
DocumentCollection coll = client.CreateDocumentCollectionQuery(database.SelfLink).Where(c => c.Id == collectionId).ToArray().FirstOrDefault();
//Issue our query against the collection
var results = client.CreateDocumentQuery<Document>(coll.DocumentsLink, sqlquery).AsEnumerable();
Console.WriteLine("Deleting Documents");
//How many documents do we have to delete
Console.WriteLine("Count of docs to delete = {0}", results.Count().ToString());
//Enumerate the collection
foreach (var item in results)
{
// Console.WriteLine("Deleting");
client.DeleteDocumentAsync(item.SelfLink);
}
//How many documents are still left
var postquery = client.CreateDocumentQuery<Document>(coll.DocumentsLink, sqlquery).AsEnumerable();
Console.WriteLine("Count of docs remaining = {0}", postquery.Count().ToString());
Console.ReadLine();

You may be expecting the result of the second count to be 0.  Unless you have 100 documents or less as the result of the first query then you are going to be disappointed.  We enumerate through the result of our first query getting a reference to each document and deleting it.  Seem fine?  The problem is that DocumentDB only returns 100 documents to us at a time and we didn’t go back and ask for more.  The solution is to execute our query and tell DocumentDB to re-execute the query if it has more results.  You can see a visual example of this when you use Query Explorer in the portal.  Down the bottom, under your query after execution you will find something like this.

More

The solution.

Here is the code that asks if there are more results to be had and if there are then can we go get the next batch

//setup
string databaseId = ConfigurationManager.AppSettings["DatabaseId"];
string collectionId = ConfigurationManager.AppSettings["CollectionId"];
string endpointUrl = ConfigurationManager.AppSettings["EndPointUrl"];
string authorizationKey = ConfigurationManager.AppSettings["AuthorizationKey"];
//Connection Policy
ConnectionPolicy policy = new ConnectionPolicy()
{
ConnectionMode = ConnectionMode.Direct,
ConnectionProtocol = Protocol.Tcp
};
//build our selection criteria
var sqlquery = "SELECT * FROM c WHERE " + ToUnixTime(DateTime.Now).ToString() + " - c.time > 172800";
//client
DocumentClient client = new DocumentClient(new Uri(endpointUrl), authorizationKey, policy);
//database
Database database = client.CreateDatabaseQuery().Where(db => db.Id == databaseId).ToArray().FirstOrDefault();

//Get a reference to the collection
DocumentCollection coll =
client.CreateDocumentCollectionQuery(database.SelfLink)
.Where(c => c.Id == collectionId)
.ToArray()
.FirstOrDefault();
//First execution of the query
var results = client.CreateDocumentQuery<Document>(coll.DocumentsLink, sqlquery).AsDocumentQuery();

Console.WriteLine("Deleting Documents");
//While there are more results
while (results.HasMoreResults)
{
Console.WriteLine("Has more...");
//enumerate and delete the documents in this batch
foreach (Document doc in await results.ExecuteNextAsync())
{
await client.DeleteDocumentAsync(doc.SelfLink);
}
}
//second count should now be 0
var postquery = client.CreateDocumentQuery<Document>(coll.DocumentsLink, sqlquery).AsEnumerable();
Console.WriteLine("Count of docs remaining = {0}", postquery.Count().ToString());

The key is this statement

var results = client.CreateDocumentQuery<Document>(coll.DocumentsLink, sqlquery).AsDocumentQuery();

using AsDocumentQuery allows us to know if we have more results.

The Easy Way

The easy way and most definitely the proper way to do this is to use Partitions in DocumentDB.  A Partition is essentially a collection.  There are different types of partition but for this example I would have used a range partition over time.  When I wanted to delete documents I would have just simply dropped a partition.  My partitions would have been based on dates.  I would always have had 2 full partitions (full meaning closed for data) and one partition (current) that was filling with data

collections

In the example above collection #1 and #2 would be closed for data as we are filling collection #3.  Once collection #3 is full then we drop collection #1, add collection #4 and make that the one that is accepting data.

Conclusion

This is simple to do when you know how but it does seem like a long winded approach.  I would like to see something a little less verbose.