AI product development is being held back by data engineering. It’s time to do something about it

With GenAI at the “Peak of Inflated Expectations”, a seeming lack of compelling use cases for LLMs beyond 1,000 chatbot and natural-language search demos, and more companies than ever cynically slapping “AI” onto their products to ride the hype, it’s easy to dismiss LLMs as a solution in search of a problem and roll your eyes at customers and investors who manage to bring every conversation back to “but what’s your AI strategy?”. It’s clear that LLMs are very powerful, and that the likes of Google and Microsoft will be able to build very cool features on top of all the data they already have access to, but it may be harder to see how that applies to the product you’re building: do your users really need yet another app-specific chatbot?

Part of that hesitation may stem from the fact that your company has never shipped any AI features before, and the world of data and ML engineering seems daunting. It may be easy enough to get a LangChain prototype running locally, but taking a demo like that into production is orders of magnitude more complicated, and running data and ML infrastructure is very different from running a SaaS web app. Data and ML engineers are difficult to hire, and your software engineers only have so much time to learn completely new disciplines. And even if you could hire or train them, it’s not obvious what you should build anyway!

Unlike past hype cycles like blockchain and web3, though, AI is coming for you whether you like it or not. Not just in the sense that every team will need to adopt AI in their everyday workflows to stay competitive, but also in the sense that an opportunity to use AI in your product to create value for your users will inevitably arise as ML models get better, pioneers explore more applications, and your proprietary data – and the data your customers would be willing to give your product access to – grows in size. Whether AI has true “intelligence” is hotly debated, but it’s clearly “smart” for some definition of the word, and any product would benefit from being smarter. LLMs and AI in general may currently overpromise and underdeliver in terms of direct valuable applications in any given product, but they’ll inherently get better over time and gradually grow into new use cases, rather than being fundamentally limited by our imagination of what the technology could do today.

That’s because AI is fundamentally about data, both the data used to train the model and the data that model is run on. The result of that model run is some new piece of data that didn’t exist until the model and your data were brought together, whether that’s a predicted number, a text summary, an answer to a natural language query, or new generated content. The new data may have been contained in the original data all along, or it may be a new insight that required a ton of external context, but either way the ML model was needed to unlock it, and different models unlock different types of data. This means that AI lets you and your customers leverage existing data to create new data, some of which would be very valuable to their business and competitive advantage. In other words, any given dataset has potential value locked inside it, and as the tools to unlock value from data get better, that potential value grows, as do the opportunity cost and competitive risk of not using as much of it as possible. The number of valuable insights that can be unlocked grows exponentially when datasets from different sources are combined and patterns can be identified across silos.

To unlock the full potential of data, we need both analytics and AI

Of course, this concept of unlocking the potential value stored in raw data is anything but new or unique to AI: it’s exactly what data analysts and scientists have been doing for decades. But there’s an inherent limit to the amount of value that can be unlocked using analytical SQL queries and Python notebooks that act only on the data at hand and bring in no outside context; that limit doesn’t apply to AI, which can always bring in more context and become better at pattern recognition. As more advanced ML techniques and pre-trained models become available, and as the datasets you have access to grow, the potential value locked up in that data keeps going up, as does the opportunity cost of not using every last bit of it.

Data teams can already unlock a lot of value from data using traditional analytics alone, but while that may have been enough to unlock 90% of a dataset’s potential value a few years ago, recent advances in AI have already brought that share down to, say, 70%, and it will continue to drop as the limits of AI continue to be pushed. Some organizations will have the resources to build and maintain their own state-of-the-art internal data and ML platforms to unlock as much of that value as possible, but as the state of the art grows increasingly complex, it will move out of reach of more and more organizations and their (typically modestly sized) internal data teams.

To unlock the potential value of their data, we predict that organizations will increasingly reach for advanced off-the-shelf AI and analytics products that they couldn’t have built themselves (at all, as quickly, or as well), rather than continuing to rely solely on their internal data teams. This means that demand for new products that integrate data from different sources will grow, as will demand for existing products to do more with the proprietary data they already hold, and to unlock additional value by combining this data with data from other sources. Some of these products will sit on top of a customer’s existing data warehouse (like a BI tool) and let them extract value from basically any data, no matter the source, but more will focus on specific industries or departments and let customers connect the relevant data sources directly, without requiring them to explicitly explain how the different tables and columns in their bespoke data warehouse relate to each other.

Building data-powered features is not as straightforward as it seems

If you’re thinking of building a product like that, or if you have an existing product and would like to start getting more value out of your proprietary data mixed with data from related tools and databases your customers use, your software engineers are going to have to learn how to ship features built on external data. Specifically, you’ll need to build infrastructure to retrieve it from wherever it’s currently stored, manipulate it into the appropriate shape for your use case, and store it somewhere your code and ML models can reach it, often merging and aggregating data from different sources along the way.

When you’re a software engineer like me, this sounds relatively straightforward – I know how to read from an API and write to a DB! – but spend some time around data engineers and you’ll realize how large the gap is between something that will run on your machine once, and something that you’d feel comfortable putting into production and charging your customers for.

Data engineering is not just a skill – it’s a discipline, with a dedicated job title, courses and conferences, and a long list of tools tailor-made for specific parts of the process. Organizations typically start with a single data scientist or analyst who knows enough data engineering to be dangerous, but as data teams grow they end up hiring dedicated data engineers to keep the ever-increasing amounts of data flowing, so that the data analysts and scientists can focus on extracting value from the data, not the surprisingly complex process of reliably acquiring and processing it.

If you’re building a new feature on top of external data, you’ll probably start by hand-writing some API request code and SQL INSERT statements, and running them using a background job scheduler of some kind. This may work well enough for a proof of concept, but it will quickly stop scaling as you add more (and more complex) APIs and authentication flows, more customers, and more steps in the data flow, and you end up needing to build things like monitoring, resource scaling, and automated testing. You’ll find yourself reinventing what data engineers know as workflow orchestrators, ETL (Extract, Transform, Load), and connectors.
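To make that concrete, here’s roughly what the proof-of-concept version tends to look like. This is a minimal sketch with hypothetical endpoint, table, and credential names, not a recommendation:

```python
import requests
import psycopg2  # assuming a Postgres database as the destination

API_TOKEN = "..."                            # hypothetical per-customer credential
DATABASE_URL = "postgresql://localhost/app"  # hypothetical application database

def sync_invoices():
    # Naive extract: fetch everything on every run; no pagination, retries, or rate limiting
    resp = requests.get(
        "https://api.example-saas.com/v1/invoices",  # hypothetical customer data source
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    invoices = resp.json()["data"]

    # Naive load: one INSERT per record; no incremental state, deduplication, or schema migrations
    conn = psycopg2.connect(DATABASE_URL)
    with conn, conn.cursor() as cur:
        for inv in invoices:
            cur.execute(
                "INSERT INTO invoices (id, customer_id, amount, issued_at) VALUES (%s, %s, %s, %s)",
                (inv["id"], inv["customer_id"], inv["amount"], inv["issued_at"]),
            )
    conn.close()

# Run from cron or a background job queue; nothing tells you when it silently stops working.
```

Every one of those comments (pagination, retries, incremental state, per-customer credentials, schema changes) is a piece of the production gap, and each one pushes you further into the tooling described below.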

Data engineering is a poor fit for software teams

Building a bespoke data engineering platform is clearly not a good use of your software engineers’ time, which should go toward building and shipping new valuable features – the “do something with the data”, not the “get the data”. You could hire a data engineer, or learn how to do proper data engineering yourself. But data engineers are expensive and difficult to hire (demand exceeds supply), and asking a software engineer to learn the entire data engineering tool stack is not a small ask: it’s less like asking a frontend developer to learn a new framework (like Svelte instead of React), and more like asking them to build a mobile app or a video game – a whole different discipline with its own tools, ecosystems, and best practices learned through experience, that no software engineer could master after just a few days or weeks heads-down in the docs.

Data engineering has become a full-time job, and the tools show it: they’re relatively low-level, very flexible with hundreds of knobs to turn, and typically focused on doing just one thing really well (per the Unix philosophy) so you can pick your favorite tool for each job and bring them all together into your ideal 4-to-10-component data platform, the so-called Modern Data Stack. That’s great, provided that data engineering is your whole job and you have the time to tweak every last detail to be just right and actively maintain complex infrastructure (which typically comprises mostly UI-based tools duct-taped together, rather than anything resembling infrastructure-as-code).

For a software team just trying to ship features, though, these existing data engineering tools are a poor fit: they’re made for building internal data platforms rather than multi-tenant products, and the endless flexibility is not worth the learning curve and ongoing maintenance burden, whether you try to hire full-time data engineers or train your software engineers to do what they do. And if you’re going to train your software engineers on any new skillset, let it be ML and LLM prompt engineering, where your business’s value is created, rather than data engineering, which is ultimately undifferentiated.

There are a lot of products that aim to help software teams integrate external data into their products, in categories like Unified APIs, Embedded EL, and Embedded iPaaS. But in response to the high-flexibility/high-maintenance nature of data engineering tools, they’ve gone too far in the other direction and become a black box, with limited data source support (nothing private or niche, and nothing the vendor’s product team hasn’t decided to prioritize), little to no data processing capability, and near-zero visibility into what’s going on or how to fix things when something inevitably goes wrong. They’re also inflexible in how they deliver the data to you: either by writing it into a database you’re expected to manage and scale yourself, or through a REST API with a highly opinionated “unified” data model that may or may not fit your needs. And if you want to see any of the data through a different lens, like a vector embedding to use in similarity search or LLM RAG, you’re on your own in data engineering land again.
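To give a sense of what being “on your own” looks like here: if all the integration product hands you is raw rows, generating and querying embeddings becomes yet another pipeline you own. Below is a minimal sketch using the open source sentence-transformers library and an in-memory index; a real feature would also need to store the vectors, keep them in sync with the source data, and re-embed whenever the model or schema changes.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes: pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Records pulled out of the integration product's database or "unified" REST API
documents = [
    "Invoice #1042: annual renewal, 12 seats, net-30 payment terms",
    "Support ticket: customer asking how to configure SSO",
    "Invoice #1043: usage overage for March",
]

# Embed once and normalize, so a dot product equals cosine similarity
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the top_k documents most similar to the query."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in best]

print(search("billing questions about renewals"))
```

Keeping an index like that fresh as customer data changes, for every tenant, is exactly the kind of undifferentiated pipeline work described above.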

In other words, every software team is going to have to start working with data at a scale that requires some amount of data engineering, and their options are to get all of the flexibility and all of the maintenance burden (and hire/train data engineers), or none of the flexibility and none of the maintenance burden, with no middle ground.

Software teams – and society – deserve better

Software engineers deserve a solution that sits somewhere on the spectrum between these two extremes, that enables powerful data and ML engineering capabilities without the massive complexity of current tools and infrastructure: something tailor-made for building products instead of internal analytics platforms, that embraces software development workflows and best practices, and lets developers move between higher-level (easier to use) and lower-level (more flexible) concepts and components as their use case requires. It should be fully managed and end-to-end, from letting customers connect their data sources to exposing that data to application code over SQL, APIs, or outgoing webhooks, with powerful data processing capabilities and use-case-appropriate storage in the middle. 

It should be built for tomorrow’s products, not yesterday’s, meaning native support for structured as well as unstructured data, automatic generation of embeddings for similarity search and LLM RAG, and the ability to use ML models and LLM prompt chains in any data flow or API endpoint. It should be data-infrastructure-as-code, letting you declaratively define what you need in terms of data sources, flow, and endpoints, without worrying about the infrastructure required to make it happen. It should be built around an open source library of data flow steps that will be sufficient most of the time and can be used with just a line of code, while letting you build, tweak, and debug them when you need to do something more advanced. It should be powerful and flexible enough to eventually become your business’s only “data cloud”, letting different teams build analytics dashboards, product features, or internal workflows based on one Single Source of Truth understanding of all the data relevant to your business and its users.
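As a purely illustrative sketch of what that “data-infrastructure-as-code” idea could feel like (the structure and names below are invented for this post, not Arch’s actual API or that of any existing tool): a pipeline declared as data, with sources, flows, and endpoints spelled out and the underlying infrastructure left to the platform.

```python
# Hypothetical declarative definition, expressed as plain Python data -- illustrative only.
product_data_backend = {
    "sources": [
        # Each customer connects their own accounts; credentials are managed per tenant
        {"name": "crm", "connector": "tap-salesforce", "credentials": "per-tenant"},
        {"name": "app_db", "connector": "tap-postgres", "credentials": "per-tenant"},
    ],
    "flows": [
        {
            "name": "account_health",
            "inputs": ["crm.accounts", "app_db.daily_usage"],
            "steps": [
                {"join": {"on": "account_id"}},
                {"embed": {"column": "notes", "into": "notes_embedding"}},  # for similarity search / RAG
            ],
            "schedule": "every 15 minutes",
        },
    ],
    "endpoints": [
        # Exposed to application code over an API instead of a database you manage yourself
        {"path": "/accounts/{id}/similar", "from": "account_health", "mode": "similarity", "column": "notes_embedding"},
    ],
}
```

The point is not the specific syntax, but that everything underneath such a definition (connectors, orchestration, storage, embedding infrastructure) would be someone else’s problem.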

The lack of such a solution, or anything close to it, means that software teams are not able to ship new AI features and other capabilities enabled by their customers’ data as quickly as they’d like: data engineering forms a bottleneck on making their product more valuable for their users. For those users, it means that tools to unlock the potential value of their data become available more slowly than they could: with every AI advance, the potential value of their data goes up, but some of that value cannot be unlocked until the advance is productized, so data engineering indirectly slows down their ability to learn and improve. For society, it means that the boost in innovation and progress these AI advances make possible is not arriving as quickly as it could: only a relatively small group of people (software engineers) is capable of unlocking this value for everyone else, so when they are held back by data engineering, everyone is held back.

We’re doing something about it, and we need your help

At the most fundamental level, our team is motivated by unlocking the potential value of data, as we believe this will increase productivity, innovation, and progress – both within an organization and in society as a whole. With Meltano, we used our experience at GitLab (where the project was founded) to make data integration accessible to aspiring data engineers around the world with a self-hostable open source platform, connector library, and connector SDK, and introduced data teams to software engineering best practices like infrastructure-as-code, version control, and CI/CD, enabling them to collaborate more effectively and with higher confidence in their output. 

Now with Arch, recognizing that the bottleneck in unlocking the potential value of data has shifted from data integration on data teams to data engineering on software teams, we’re using our unique experience at the intersection of data and software engineering to take a step towards the pie-in-the-sky “one data cloud to rule them all” vision described above. Our mission is to enable every software developer to build (AI) product features powered by external data, without needing to set up and maintain complex data engineering infrastructure. Arch is the bridge between your customers’ data and your code, the data backend for your AI product. 

For the past few months, we’ve been building Arch in close collaboration with some great design partners, and today we’re ready to start working with more software teams that are facing the challenge of building products on their customers’ data, to make sure that we build exactly what you need: enough flexibility that you can get the data exactly how you need it, and a low enough maintenance burden that you can quickly get back to actually using it. The vision above is a sketch of where Arch could go based on what we think software teams like yours will want and need in the medium-to-long term, but what we’ll actually end up building will depend on your short-term needs and how they evolve as this AI revolution plays out.


If anything I’ve written here resonates with you (or if you think I’m completely off the mark), I’d love to talk to you – feel free to book a slot on my calendar or reach out to douwe@arch.dev. You can also sign up on https://arch.dev to get access to Arch or just stay in the loop, and I encourage you to read the product launch blog post if you’d like to learn more about what Arch can actually do today and what we’ve got in store for the coming months.

Let’s work together to realize your wildest AI product ideas!