FROM MESSY DATA TO INSIGHT – LEGAL DATA IN THE REAL WORLD

ANDREW DUNKLEY

In my first blog post, I talked about the importance of starting your analytics journey in the right way. By defining valuable questions that have answers you can use to drive business decisions and behaviours, you’ll get off on the right foot.

So, now it’s time to dive into the data to get those answers!

This is normally the point where things slow down… Why? Because the data is almost always a mess. Incomplete, scattered across multiple systems, with different structures and questionable accuracy.

There are two common responses at this point. A lot of analysis projects simply die, abandoned as too hard, not worth the effort or not able to produce results in time for them to be useful. Others get bogged down in time-consuming and expensive manual data clean-ups, trying to produce an accurate dataset for analysis.

So what’s the solution?

You are not alone!

The first step is to not beat yourself up. Practically no one in the legal industry has really good quality data because legal data is complex! Legal doesn’t do enough transactions for the ‘big-data’ phrase to fit, but it does produce a lot of information. That information usually contains many different relationships, concepts, people, and organisations – and a lot of it is hidden in words, documents and unstructured text rather than nicely arranged in numbers and rows.

So don’t make the mistake of thinking that you can’t do analytics because you don’t have perfect data, or that you need to create a perfectly tagged database before you can start. Instead, there are two questions you can ask:

What can I do with the data that I have?
Given the importance of the questions I’m asking, what should I do to improve that data?

Data wrangling – Getting the most out of what you’ve got

There is always a data cleaning component to analysis work and it usually takes longer than the clever stuff that comes after. It can be a slog but spending time working with your data can be the difference between getting good results and none at all.

There are a number of things that you might need to do here:

Data mapping – working out how your data is structured and how that structure relates to the real-world question you are trying to answer. In particular, you need to get to grips with the ‘units of measurement’ in the data – claimants vs claims, contracts vs counterparties etc.
Normalising – standardising names and terms, correcting spelling errors, and making sure date formats from different systems line up.
Taxonomies – understanding where different taxonomies in different systems overlap and can be mapped against each other. You’re looking for ‘1:1’ and ‘MANY:1’ relationships – ‘1:MANY’ and ‘MANY:MANY’ relationships make analysis harder.
Proxy measures – if you haven’t collected a piece of information, sometimes you can approximate it from correlated data that you do have. A great example is basing matter close dates on last edit dates or time entry data.
Sampling – if you have a dataset that you don’t trust for some reason then checking a small sample of the underlying data can help you evaluate how reliable it is. This is designed to improve your confidence in the data rather than improving the data itself.
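
To make a few of these steps concrete, here is a minimal Python sketch of normalising, a MANY:1 taxonomy mapping and a proxy close date. Everything in it is an assumption for illustration: the two date formats, the work type labels and the function names are hypothetical and would need to be replaced with whatever your own systems actually hold.

```python
from datetime import date, datetime

# Hypothetical date formats used by two different source systems.
# Assumes slash-separated dates are day-first (UK style).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# Hypothetical MANY:1 mapping from one system's work types to a shared taxonomy.
WORK_TYPE_MAP = {
    "nda": "Commercial contracts",
    "supply agreement": "Commercial contracts",
    "employment tribunal": "Disputes",
}


def normalise_name(raw: str) -> str:
    """Trim, collapse whitespace and ignore case so the same name matches across systems."""
    return " ".join(raw.split()).casefold()


def parse_date(raw: str) -> date:
    """Try each known format in turn and fail loudly if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")


def map_work_type(raw: str) -> str:
    """Map a source work type to the shared taxonomy, flagging anything unmapped."""
    return WORK_TYPE_MAP.get(normalise_name(raw), "Unmapped")


def proxy_close_date(time_entry_dates: list[str]) -> date:
    """Proxy measure: treat the latest time entry as the matter close date."""
    return max(parse_date(d) for d in time_entry_dates)
```

Flagging unmapped values rather than guessing keeps you honest about how much of the data the mapping really covers, and the proxy close date is only as good as the assumption that work stops when the matter closes.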

It’s really important to remember the questions you are trying to answer when you are doing a data clean-up. Urgent questions sometimes mean you have to be agile and use smaller samples or proxy measures. You need to make sure that the data actually measures the thing you are trying to track. The aim is to get to a dataset that is good enough to give you the insight you need – that’s all.

Back classification – when is it worth it?

Sometimes all the data wrangling in the world doesn’t help. Maybe the data quality is just too poor or perhaps it’s simply not there, either because it wasn’t filled in or wasn’t asked for in the first place.

So, to answer our questions, we need to collect some data to analyse. Historically this meant using paralegals to read a lot of documents and create a big spreadsheet. This is slow, expensive and hard to justify, so it’s only worth doing for really important questions that don’t require an urgent answer – and how often do you see that?

Luckily there are other approaches that can produce a valuable dataset faster and more cost effectively:

Surveys – sometimes just asking your colleagues the answer in a structured way is the fastest route to helpful data. I once ran a workshop looking at client wait times on arrival – there was no data. A quick consultation with the receptionists gave us enough information to take meaningful decisions.
Small sample classifications – tagging a few examples from your dataset can often give valuable insights. It’s not 100% reliable, but it will give you an idea of the order of magnitude and amount of variation. This is particularly useful when you are dealing with numbers and values.
AI-assisted reviews – artificial intelligence has now advanced to the point where you can back classify large numbers of historic documents with high levels of accuracy. Providers like SYKE can offer this as a service faster and more cost-effectively than ever, saving you time, money and stress, so you can focus on more important legal work.
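
For the small sample approach, it can help to put a rough range around what the sample tells you. The sketch below draws a reproducible random sample of document IDs and then estimates the share of documents with a given tag. The Wilson score interval is my choice of method here, not something the approach above prescribes, and the function names and the 95% default are assumptions.

```python
import math
import random


def draw_sample(ids, k, seed=0):
    """Pick k items at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(ids), k)


def wilson_interval(hits, n, z=1.96):
    """Approximate range for a proportion, given hits out of n tagged examples.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width
```

For example, if 12 of 50 sampled contracts contain a particular clause, `wilson_interval(12, 50)` gives a range of very roughly 14% to 37%. That is wide, but it is often enough to tell you whether an issue is rare, common or somewhere in between.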

The key point here is that you aren’t trying to produce a perfect dataset. You are trying to get a representative dataset that is large and reliable enough to answer the questions you are asking. Having an eye on the issues you are trying to understand will help you decide how far you need to go when planning a back classification exercise.

Be confident but honest

The thread that runs through this can be summed up fairly simply – don’t be afraid of your data. You must be honest with yourself about its strengths and weaknesses and what that means for your analysis. This guides how you work with that data and whether you decide to improve it. That honesty then carries through to the analysis itself, so your stakeholders understand the choices you’ve made.

You want a dataset you can be confident in, not necessarily a perfect one. That confidence comes from understanding what’s in it, its accuracy, and how that relates to what you need to do with it.

If you want to start working through your data but don’t have the time or resources, or if you’d like an initial chat about how best to approach your data, get in touch.

Andrew is the Associate Director, Analytics & Insights for Data Services at SYKE.

He is an industry-recognised legal technologist with a history of innovative work in the sector. A qualified solicitor, he has a 10-year track record of designing and delivering legal technology and analytics. He has a law degree from Oxford University and has been published in a peer-reviewed journal through an R&D collaboration with the London School of Economics.
