Data science maturity and the cloud

Categories: code, python, rstats, data science, cloud, work chat, public sector

Author: Arthur Turrell

Published: March 1, 2023

Data science has enormous potential to do good in the public sector. The efficiencies that are possible from automation and reproducible analytical pipelines alone are huge; this is, if you like, doing existing tasks better. Throw machine learning and advanced analytics into the mix and data science can also complete entirely new tasks, expanding the horizon of what’s possible. It’s an exciting time to be a data scientist.

And yet I regularly speak to data scientists who are frustrated in their roles because the tech in their organisation simply does not give them the ability to do their job in the best way possible; or, even worse, they do not have the agency to do their job well. Data science, and data scientists, need the right conditions to flourish.

So, if you’re looking at your own organisation’s data science offering, what are the key things you should be able to do? And how can we ensure that data scientists have them?

How to check an organisation’s data science maturity

This is a highly personal, non-empirical, experience-based list of the essentials data scientists need to be productive. To some extent, each element builds on the ones before it.

  1. First of all, data scientists need an integrated development environment (IDE) to write their code in. No, this isn’t just a Jupyter Notebook, though vendors seem to think that’s all data scientists ever use (it’s great to have notebooks but they’re not enough on their own). It looks more like Visual Studio Code for most languages, or perhaps RStudio for R (though you can use Visual Studio Code for R too, as covered in this blog post).
  2. Packages for the integrated development environment. Before you’re even writing code, you need the right extensions (aka packages) for your IDE to allow you to work effectively. For example, the Python extension is critical for using Python in Visual Studio Code. But there are a bunch of others for markdown, automatically writing docstrings, colourising hex colour codes, integrating with GitHub, sorting your package imports, writing LaTeX, and on and on… These are essential to a (productive) data science workflow.
  3. A way to manage installations of programming languages that can execute code. This means installations of Python and R, but not just having a single version of those on a machine: data scientists need a way to manage multiple environments, usually on a per-project basis. This might mean one project is on Python 3.8.8, while another is using Python 3.10. Data scientists need control of this, and tools such as poetry or Anaconda give it to them. With this, data scientists can execute their code.
  4. A way to install packages and libraries for base installations of programming languages. Python and R alone aren’t much good. Their power comes from extending them with, in the case of Python, hundreds of thousands of extra code libraries. These libraries come from repositories such as PyPI and CRAN. In the case of Python, they are installed via an instruction on the command line that triggers dependency resolution and then a download over the internet. Both poetry and Anaconda can act as intermediaries to the Python repositories, and both can be used as command line tools to install packages into specific coding environments.
  5. Access to the command line. A command line is a way to write instructions directly to a computer. Data scientists need it for all kinds of things, from installing packages (see above), to renaming and moving files, to managing code environments (see 3). On some enterprise IT solutions, access to the command line is blocked. Windows doesn’t have a conventional command line (well, it does, but it uses a different set of commands and has fewer useful tools).
  6. A way to put code under version control. It’s best practice for data scientists to put code under version control, and it’s absolutely essential for collaboration and audit. In practice, this means an installation of git, the most popular version control tool. You can use git either through an integrated development environment (see 1) or through the command line (see above). Data scientists will also need a central repository service to share code with each other, usually GitLab or GitHub.
  7. The ability to create efficient stores of data, and to access data programmatically. It might seem like an absolute basic, but many organisations struggle with where to keep their data. There are infamous examples of public sector operations going wrong because of errors in spreadsheets, and the bottom line is that neither data nor computations should be in spreadsheets. Data scientists need to be able to flexibly create stores of data on servers; putting data on a shared network drive does not suffice. For example, most data scientists will need to be able to create databases that their colleagues can also access. They also need to be able to access stored data programmatically (ie through analytical tools such as R and Python); there’s a minimal sketch of what this looks like just after this list. Without efficient read and write options like these, data scientists are going to be slowed right down.
  8. A Unix-like computing environment, for example Linux or macOS. Microsoft’s Windows operating system has its strong points (and, despite its cost, it’s a popular solution for public sector IT), but it’s not at all geared toward coding or automation. So much so that some modern data science libraries don’t work at all on Windows. There are a host of reasons behind this. They don’t matter; the point is the same: for data scientists, working on Unix-like environments is just going to be a lot easier.
  9. Tooling around reproducibility. A key tenet of good data science, not to mention good analysis, is that it should be reproducible. Clearly this is important for reproducible analytical pipelines too. We’ve already met a few of the tools that can reproduce code environments (eg poetry and Anaconda), but data scientists also need tools to run pipelines (eg Make and Dagster; see the pipeline sketch after this list), and even to reproduce entire operating systems (eg Docker). So these tools need to be available and usable, and a good test of an organisation is whether it can support the deployment of Docker images.
  10. Continuous integration / continuous deployment, and the ability to schedule code execution. If we’re serious about getting data science solutions deployed in operational areas, it’s absolutely critical that data scientists can test code on the fly as part of pull requests, one element of continuous integration. And that a series of checks takes place before anything is deployed and goes live. Far from having these abilities, many organisations would struggle to have a script executed at a regular frequency. Without the ability to schedule events and scripts, what data science can do will be severely limited to tasks that keep a human in the loop, missing out on a lot of the potential benefits.
  11. The cloud. The reality of data science in 2023 is that much more can be achieved on the cloud than on a single laptop or an on-prem machine (say, a server sat in the basement). For example, if you’re working with data at enormous scales, you probably want to put it in something like Google’s BigQuery. I’m not even sure how you would deploy a machine learning model if not on the cloud, and asking how many models have been deployed to production is another good question for assessing an organisation’s data science maturity. There are emerging cloud services, such as Google Cloud Workstations and GitHub Codespaces, that make getting started on the cloud easier than ever, too. You may hear arguments that cloud isn’t safe. While I’m not sure I buy those arguments given the plausible alternatives, the policy of the UK government is cloud-first anyway, and has been since 2013. Increasingly, the best practice principle is to not code directly on your work laptop. So if you encounter an organisation that is entirely on-prem for “security” reasons, I’d really question whether it has a comparative advantage in providing secure computing services and what trade-off with efficiency and functionality it is implicitly making.
  12. The ability to compile code and install code-adjacent tools. While Python, R, and SQL do not need compiling in the same way that C++ does, packages for those languages do sometimes ship code that needs compiling. The packages that are front-ends to the Bayesian library Stan are great examples of this: even though you write Python or R code, somewhere in the background code in another language needs to be compiled. Enterprise Windows laptops will block that compilation. Another example is the popular Python geospatial data science package geopandas, which has a bunch of dependencies that aren’t written in Python at all but still need to be installed.
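
To make item 7 a bit more concrete, here is a minimal sketch of programmatic data access in Python, using pandas and SQLAlchemy. The connection string, table, and column names are hypothetical placeholders; the point is that data scientists should be able to write data to, and read data from, a shared database entirely from code rather than via a network drive or a spreadsheet.

```python
# A minimal sketch of programmatic data access (item 7).
# The connection string and table name are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Connect to a (hypothetical) PostgreSQL database provisioned in the cloud.
engine = create_engine("postgresql://user:password@db.example.org:5432/analysis")

# Write a dataframe to a table that colleagues can also query...
df = pd.DataFrame({"region": ["North", "South"], "value": [1.2, 3.4]})
df.to_sql("regional_values", engine, if_exists="replace", index=False)

# ...and read it back with an ordinary SQL query, ready for analysis.
result = pd.read_sql("SELECT region, value FROM regional_values", engine)
print(result)
```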
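
And for item 9, here is a sketch of what a small reproducible pipeline might look like using Dagster’s asset API, one of the pipeline tools mentioned above. The asset names and the transformation are purely illustrative, and the same idea could just as easily be expressed as a Makefile; what matters is that each step and its dependencies are declared explicitly so the whole pipeline can be re-run end to end.

```python
# A sketch of a tiny reproducible pipeline using Dagster's asset API.
# The asset names and transformations are illustrative only.
import pandas as pd
from dagster import asset, materialize


@asset
def raw_data():
    # In a real pipeline this might pull from a database or an API.
    return pd.DataFrame({"value": [1, 2, 3, 4]})


@asset
def summary_table(raw_data):
    # Dagster infers that this step depends on raw_data from the argument name.
    return raw_data.describe()


if __name__ == "__main__":
    # Run the whole pipeline, in dependency order, in a single process.
    materialize([raw_data, summary_table])
```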

Perhaps surprisingly, many organisations, even those with data scientists, struggle to provide items 1 to 4.

How to create the right environment for data science to flourish

You’re probably wondering how an organisation can put in place the environment that data scientists need to flourish. Looking at the list above, it might seem like a lot, but it’s actually not hard. Basically, an account with AWS (Amazon Web Services), GCP (Google Cloud Platform), or Azure (Microsoft’s cloud platform) will open up all of this. A lot of organisations get that far (though not all).

Where organisations then fall down is in putting up a barrier that stops data scientists provisioning the specific services they need from these cloud providers. Instead of giving data scientists a budget and telling them to get what they need, individual cloud services are often managed by an intermediate layer: usually the IT department, and sometimes an external vendor that aims to provide a complete solution.

On the face of it, this model makes sense: IT already provision and manage work laptops (plus all the programs on them), so why shouldn’t they also provision specific cloud services for data scientists? There are a few good reasons why I personally don’t believe they should:

  • the time of people in IT departments is usually extremely precious; we can save as much of it as possible by allowing data scientists to self-provision the services they need.
  • workers in IT departments are technical experts but are unlikely to be heavy users of data science tools themselves, which leads to a gap between data scientists’ needs and what is provisioned. The example of external vendors thinking data scientists just use Jupyter Notebooks for everything is a classic. I have had (extremely helpful) colleagues in IT who were surprised that data scientists needed to use the command line.
  • having data scientists own the budget and directly provision their own services makes for a tighter feedback loop between costs and services. If that link is broken, people can unwittingly run up huge bills.
  • having data scientists be able to self-provision means they feel empowered and are faster at getting what they need. I heard of one public sector organisation where it takes two weeks and numerous forms and emails to set up a (basic) SQL database; the result is that no-one sets up a SQL database, even when that would be the best solution. In general, I think it’s a good principle to give experts a brief, a budget, and an accountability framework, and then let them get on with the job; that applies to data scientists here.
  • work laptops are typically used by all staff, and so they need to be fairly foolproof, which is why IT specialists are needed to manage the fleet of work laptops and to triage any issues. Data scientists are themselves technical experts, so they do not actually need this level of service.
  • by introducing a threshold or barrier to the process (eg having to raise a service desk request to try something), you discourage the kind of innovation that may not work out, but just might, if only someone could try something quickly.

I’m not talking here about data scientists choosing whether it’s GCP or AWS or another provider supplying the cloud services; it makes a lot of sense for the IT department or similar to make that call. But within that outer wrapper, I think it makes much more sense for data scientists to choose the specific services they need without going through a middle layer.

A heart-shaped cloud floating by.

Avoid the wrong sort of cloud provision

If you stop to think about it, the model we usually use is one where enabling functions determine a service provider and then let people choose the specific products or services according to their local budget. The Chief Operating Officer might choose which firm serves up food in the canteen, but the COO isn’t going to actually come to the canteen and force you to eat the salad; you get to choose within your budget. Similarly, back when organisations actually needed stationery, there was usually a high-level agreement with a supplier, but local business areas would then decide what they needed within their budget. Why should it be different for specific cloud services for experts like data scientists?

Some might say there are risks with this approach. For example, IT specialists are trained in security practices, or can build in security practices, that prevent data leaks or the other things that keep Chief Information Officers up at night. I think data scientists could cover this just as well, though we might need more training in it. I would also say that this apparently risky counterfactual is better than where we are right now: we have data leaks and errors because people are using the wrong tools and tech (cf the problems with Excel spreadsheets, and people being forced to email data rather than access it programmatically because they cannot create databases or APIs). So I don’t really buy that there’s even a trade-off here. But even if there were, we tend to undervalue innovation because risks are tangible and apparent, while the improvements we could achieve by making a slightly different trade-off are not. Innovation may still be worth doing.

As Tim Harford has noted, it’s quite telling that so much innovation happened in the public sector during the pandemic, when the usual rules (and, I must say, barriers) were temporarily suspended. I believe there’s a win-win-win here: data scientists are empowered to innovate and improve public services, budget holders get accountability from those who are actually spending the money on cloud services, and ever-busy IT departments don’t have to manage cloud services on top of everything else.