TL;DR: With Arch, we’re enabling every software developer to build (AI) product features on top of their customers’ data, without needing to set up and maintain complex data engineering infrastructure. Arch is the bridge between your customers’ data and your code. Intrigued? We’d love to work with you – book a time to chat with my co-founder and me.
For a few years now, our team has been working at the intersection of data engineering and software engineering, building Meltano: an open source project used by data engineers at thousands of organizations to efficiently build and manage data pipelines that bring data from hundreds of different sources together in a single place so that it can be analyzed effectively. Our approach has been unique in bringing software development best practices like version control, CI/CD, and isolated deployment environments to data integration, a field in which most tools are still point-and-click UIs with limited connector libraries and a “Save” button that deploys untested changes straight to production.
This year, the rise of Generative AI has made it clear that the bottleneck in unlocking the potential value of data has shifted from data integration on data teams to data engineering on software teams, as laid out by my co-founder Douwe in his blog post “AI product development is being held back by data engineering”. So we’ve decided to do something about it, and have made it our mission to enable every software developer to build (AI) product features powered by external data, without needing to set up and maintain complex data engineering infrastructure. Let’s dive in!
The Problem
If you’re a software engineer who wants to start using your customers’ external data in your features (whether ML-based or not), it’s easy to think that pulling the data out of the relevant sources, storing it somewhere, and then accessing that data from your code will be relatively straightforward. Reading from an API and writing to a database is not hard and you’ve probably done it hundreds of times already.
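Indeed, a first pass often looks something like this minimal sketch (the API, table, and credentials here are placeholders for illustration):

```python
import requests
import psycopg2

# Pull one customer's records from their (hypothetical) CRM API.
response = requests.get(
    "https://api.example-crm.com/v1/contacts",
    headers={"Authorization": "Bearer <customer-token>"},
    timeout=30,
)
response.raise_for_status()
contacts = response.json()["results"]

# Write them to a Postgres table, upserting on the primary key.
conn = psycopg2.connect("postgresql://user:password@localhost:5432/app")
with conn, conn.cursor() as cur:
    for contact in contacts:
        cur.execute(
            """
            INSERT INTO contacts (id, email, updated_at)
            VALUES (%s, %s, %s)
            ON CONFLICT (id) DO UPDATE
            SET email = EXCLUDED.email, updated_at = EXCLUDED.updated_at
            """,
            (contact["id"], contact["email"], contact["updated_at"]),
        )
conn.close()
```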
But once you start trying to move, shape, store, and use data at scale – for hundreds of customers and from dozens of sources – things get increasingly complicated, and you enter the realm of data engineering. Data engineering is an entire discipline that has spent years working on the hard problems of data movement, transformation, storage, and access, and there are many tools available to help data engineers do this – including our own Meltano. So instead of attempting to scale up the hand-written API code you’ve used so far, you’ll likely want to learn from data engineering rather than reinvent the wheel.
The data platforms that data engineers build to manage this work are typically made up of at least three to five separate tools that each focus on a specific area, such as data movement (also called extract and load), data transformation, data orchestration, data storage (data warehouses or lakehouses), data access and visualization, and observability, monitoring, and governance of the entire system. These powerful low-level tools help data engineers build data pipelines that make sure the right data is always in the right place, in the right format, at the right time. Having that many moving parts requiring ongoing maintenance is fine if you’re a full-time data engineer, but it’s a lot to take on for a software team that would rather work on the features that use the data than on the process of acquiring and processing it. It also doesn’t help that these tools were not built with your use case and preferences in mind: they’re typically UI-based rather than anything resembling infrastructure-as-code, and they assume all the data comes from the organization’s own sources rather than from multiple tenants (your customers) that each have their own sources and pipelines.
The point here is that data engineering is complex, and that data engineering tools are made for full-time data engineers building single-tenant internal data platforms, rather than multi-tenant products. As a software engineer, you should not have to become a data engineering expert and manage complex data infrastructure just to be able to build useful features for your customers on top of their data.
For a deeper dive on the problems that software teams face as they want to ship AI and other data-powered features and a vision of what an ideal solution could look like, I encourage you to read the “AI product development is being held back by data engineering” blog post that accompanies this one.
Unveiling Arch
This is why we’re building Arch – to let you build AI-powered features on top of your customers’ data without having to worry about building and maintaining bespoke data infrastructure just to reliably access that data. The goal for the initial version we’re presenting today was to build a platform that can power multi-tenant products with LLM-based features on top of data from any source, built explicitly for software engineers.
Arch is a completely new product built for this use case (and more!) that lets you do four key things:
- connect to any data source
- define the desired shape
- access data however you like
- manage your data infra as code
I want to go into detail on each of these areas and talk specifically about what we’re building and what our initial version of Arch will be able to do.
Connect to any data source
We’ve seen many demo projects and blog posts show how easy it is to build an LLM-powered application with just a single source. Tools like LangChain and LlamaIndex leave it as an exercise to the user to figure out how to go from a demo project to many data sources or customers. Making your customers’ data easily available for product features (and not just for a demo) requires a tool that is multi-tenant-aware and able to continually move data from any source at scale.
Multi-tenancy made easy
Many data tools make the basic assumption that you’re only using data from a single account. If you’re trying to provide a service on top of data your customers have, how to manage credentials, storage, and access is often left as an exercise for you – and the more customers you have, the more complex it gets.
With Arch, we make multi-tenancy easy. Your customers can authenticate with Arch directly via OAuth, or Arch can integrate with your own credential management system if you prefer. Their credentials are stored securely in our system and are accessible only to automated Arch processes. Each tenant’s data is easily segmented within Arch, either via unique per-tenant schemas or via per-tenant foreign keys within a single table.
Because Arch is built with multi-tenancy in mind, you don’t have to worry about it when building your application. Within your application code you can write the same logic for every customer and trust that each customer only has access to the data they’re supposed to see.
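To make the per-tenant-schema pattern concrete, here’s a rough sketch of what that looks like from application code against Postgres (the schema naming convention and table are illustrative, not Arch’s actual API):

```python
import psycopg2

def fetch_open_deals(conn, tenant_id: str):
    """Run the same query for any tenant by pointing at that tenant's schema."""
    with conn.cursor() as cur:
        # One schema per tenant, e.g. "tenant_acme" -- an illustrative convention.
        cur.execute("SELECT set_config('search_path', %s, false)", (f"tenant_{tenant_id}",))
        cur.execute("SELECT id, name, amount FROM deals WHERE status = 'open'")
        return cur.fetchall()

conn = psycopg2.connect("postgresql://user:password@localhost:5432/arch")
for tenant in ["acme", "globex"]:
    print(tenant, fetch_open_deals(conn, tenant))
```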
Fresh Data from Anywhere
We’re building Arch on top of our popular open source data integration project, Meltano, which lets Arch leverage a library of over 600 pre-existing connectors. For any existing connector, you can tell Arch which data you want to pull and how fresh you want it to be, and we’ll handle the rest.
More importantly though, it gives Arch the flexibility to support a diversity of data sources that many other platforms can’t. We’ve talked to dozens of engineers who are frustrated by other solutions in the market (typically called “Embedded EL” tools or “Unified APIs”): they’re limited to the connectors that those platforms’ product teams have decided to implement, and those connectors are typically closed source.
With our foundation built on Meltano, we’re able to use the existing connector library while also supporting custom connectors for niche and private sources. Our SDK has been used to build hundreds of custom connectors and we will continue to invest in making it as easy as possible to pull data from any source.
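As an illustration, a minimal custom connector built with the Meltano (Singer) SDK might look roughly like this – the API, stream, and fields are made up for the example:

```python
from singer_sdk import Tap
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class OrdersStream(RESTStream):
    """A hypothetical stream pulling orders from a private REST API."""

    name = "orders"
    path = "/orders"
    primary_keys = ["id"]
    replication_key = "updated_at"
    url_base = "https://api.example.com/v1"
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("amount", th.NumberType),
        th.Property("updated_at", th.DateTimeType),
    ).to_dict()


class TapExample(Tap):
    """A hypothetical tap that exposes the stream above."""

    name = "tap-example"

    def discover_streams(self):
        return [OrdersStream(tap=self)]
```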
Define the Desired Shape
An emerging pattern for features powered by large language models (LLMs) is to add relevant data to the prompt sent to the model. This is called Retrieval-Augmented Generation (RAG). By sending your prompt along with snippets of (potentially) relevant data, you can improve the quality of the LLM’s responses as well as perform useful knowledge extraction and summarization over the provided data. A common implementation of RAG encodes data into vector embeddings to improve the performance of search and retrieval when fetching relevant data. Many tools will help you get an initial batch of data vectorized (indeed, OpenAI will now do this for individual documents), but doing this at scale on a continual basis is a huge challenge – one firmly in the realm of data engineering. And what if you want to use the same data you’re vectorizing for analytic use cases? The answer right now is to use multiple database vendors for the different use cases and to build the data platform yourself.
Auto-embeddings, analytic workloads, and more
With Arch we manage the storage layer for you and make it easy to serve all of these use cases from the same platform. Behind the scenes we’re using PostgreSQL, one of the most popular and reliable open source databases in the world. Postgres also has a large library of extensions that support a wide variety of use cases: pgvector enables vector storage and similarity search; Hydra enables fast analytical queries, which typically read a few columns across many rows (the opposite of typical transactional workloads, which quickly fetch individual rows); and PostgresML adds machine learning for easy training and inference. We’re making a bet that Postgres is a fantastic solution for 90%+ of use cases. (Fun fact: the first data warehouse at GitLab, where I led the data team, was PostgreSQL!)
Within Arch it’s easy to designate any data for vector embedding generation. Arch will automatically chunk the data, generate the embeddings, store the vectors, and enable easy similarity search. We’re starting with OpenAI’s text-embedding-ada-002 model and will add more very soon. We want to make RAG-based workflows incredibly easy, and automatically managing embedding generation for data in Arch is a natural place to start.
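To sketch what this enables on the query side, here’s roughly what a similarity search against a pgvector-backed table looks like – the table and column names are illustrative, and in Arch the embedding column would be populated automatically:

```python
import psycopg2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def similar_chunks(conn, question: str, tenant_id: str, limit: int = 5):
    """Embed the question and return the closest document chunks for one tenant."""
    embedding = client.embeddings.create(
        model="text-embedding-ada-002",
        input=question,
    ).data[0].embedding
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"

    with conn.cursor() as cur:
        # "<=>" is pgvector's cosine distance operator; smaller means more similar.
        cur.execute(
            """
            SELECT chunk_text
            FROM document_chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (tenant_id, vector_literal, limit),
        )
        return [row[0] for row in cur.fetchall()]
```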
SQL, Python, and tenant-aware transformations
One of the many challenges of building features on customer data is getting the data into a shape that’s usable for different use cases. We’ve already touched on the LLM / embedding use case, but even in that example you’re unlikely to generate embeddings on data that comes directly out of a source system. Data often needs to be cleaned, filtered, and aggregated to prepare it for specific use cases.
A common scenario is working with data from a CRM (customer relationship management) tool such as Salesforce or HubSpot. The raw data from these systems often contains records that need to be removed (such as test accounts) or fields you may not want in their raw form, such as personally identifiable information (PII). With Arch it’s easy to specify that any data be hashed, filtered, or completely anonymized.
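As a toy illustration of that kind of rule – the field names and conditions are made up, and in Arch you would express this declaratively or as a SQL/Python transformation:

```python
import hashlib
from typing import Optional


def clean_contact(record: dict) -> Optional[dict]:
    """Drop test accounts and mask PII before the data is used downstream."""
    if record.get("email", "").endswith("@test.example.com"):
        return None  # filter out test accounts entirely

    cleaned = dict(record)
    # Replace the raw email with a stable hash so records can still be joined on it.
    cleaned["email"] = hashlib.sha256(record["email"].encode("utf-8")).hexdigest()
    cleaned.pop("phone", None)  # drop fields we never want to store
    return cleaned
```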
Another challenge is that these CRMs can be configured in unique ways by each of your customers. Each customer can have custom entities (an “object” in Salesforce terms), and each entity can have custom properties (fields/columns). Mapping that custom data is often a per-tenant process. Arch makes it easy to implement and manage custom entity and property mappings for each of your tenants – either directly in our platform or via an interactive UI you can present to your customers, with some help from AI to suggest likely mappings.
Yet another challenge in working with multiple tools from the same industry is that they each talk about the same concepts differently. Two CRMs may both have the concept of a “Deal”, but their data models may call it something completely different (“Opportunity” in the case of Salesforce). We want to make it easy to work with data from different sources within the same industry, so Arch has unified data transformations that translate the per-tool concepts into a single definition you can integrate with once. Unlike other platforms with similar mappings, ours are open source and easy to use and contribute to.
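To illustrate the idea (these mappings are simplified examples, not Arch’s actual unified models):

```python
# Per-tool field mappings onto a single unified "deal" shape.
FIELD_MAPS = {
    "salesforce_opportunity": {
        "Id": "deal_id",
        "Name": "deal_name",
        "Amount": "amount",
        "StageName": "stage",
    },
    "hubspot_deal": {
        "hs_object_id": "deal_id",
        "dealname": "deal_name",
        "amount": "amount",
        "dealstage": "stage",
    },
}


def to_unified_deal(source: str, record: dict) -> dict:
    """Translate a source-specific record into the unified deal definition."""
    mapping = FIELD_MAPS[source]
    return {unified: record.get(field) for field, unified in mapping.items()}
```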
At the end of the day, we know we can’t cover every single use case up front. But we can make a platform that runs your custom SQL and Python in a way that’s easy to manage in a multi-tenant environment. You can use our library of pre-built models as-is, or fork and customize them as needed. Arch won’t limit you with arbitrary restrictions.
Access Data How You Like
Over the past several months of talking with engineers and understanding their frustrations with existing products, one point kept coming up: they were often frustrated by the interface these tools provide for data access. So-called “Unified API” platforms only let you interface with their REST API. Unified APIs are essentially middleware that makes it “easier” to integrate with multiple different products (Plaid is an example of a Unified API for banking). This is useful because you no longer have to learn the ins and outs of each individual API, but you’re limited to the lowest-common-denominator endpoints they’ve implemented, and teams were often reaching for the built-in escape hatches – at which point they felt like they should’ve built the integration themselves.
On the other side are the “Embedded EL” providers, which require you to manage the storage layer yourself – but the benefit is that you can access the data via SQL. As we mentioned before, though, these tools are not built for multi-tenancy at scale, nor do they let you write data back to the source system. And if you prefer a GraphQL or REST API interface to the data, that’s yet another tool you’d have to set up yourself.
What if data access was easy?
With Arch, we wanted to build a best of both worlds approach to data access. Because Arch leverages PostgreSQL, you get easy access via SQL out of the box. Just connect, authenticate, and query.
But we go a step further and make it easy to create instant APIs – GraphQL or REST – for any table and any tenant. You control which tables can be queried, for both reads and writes. Supporting the vector similarity search use case is easy too: define the data for which you want auto-embeddings generated, and you get a custom endpoint for querying it.
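As a rough sketch of what consuming such an instant API could look like from your application – the endpoint shape and parameters shown here are hypothetical, not Arch’s final API:

```python
import requests

# Hypothetical instant REST endpoint exposing one tenant's "deals" table.
response = requests.get(
    "https://api.arch.dev/v1/tenants/acme/tables/deals",
    headers={"Authorization": "Bearer <arch-api-key>"},
    params={"status": "open", "limit": 50},
    timeout=30,
)
response.raise_for_status()
for deal in response.json()["rows"]:
    print(deal["name"], deal["amount"])
```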
Our near-term roadmap also includes SDKs for Python and TypeScript, so it’s easy to define these endpoints in the application code where you’re already working. We’ll also make it easy to monitor usage of the APIs, and we’ll provide more access control settings.
Manage your project as code
One of our goals with Meltano was to get more data engineers working like software engineers. We knew that they could benefit from version control, code review, automated testing, and isolated development environments. For a developer-first platform like Arch though, these features are table stakes. No engineer wants a platform built on ClickOps. Engineers want tools that integrate with their workflow and take advantage of years of improvements in tooling and best practices.
Built for Engineers
We’ve seen the benefits of the declarative infrastructure-as-code approach for software platforms, and Arch brings many of the same benefits to your data infrastructure. Arch lets you declaratively define data sources, transformations, and API endpoints, so you can check those definitions into version control and immediately benefit from code review best practices.
With those definitions in version control, you can define isolated development environments for different branches and confidently iterate on your features without touching production. And since Arch manages the data storage for you as well, we can make it easy to read production data while preventing writes from development branches. This gives you the confidence to iterate on features without wasting resources.
The declarative infra-as-code approach also makes it possible to integrate with code hosting platforms such as GitHub. We’ll make it easy to get per-pull-request previews and to run tests in CI before any code is merged.
At the end of the day, we want you to build with confidence, and we’re building Arch to make that confidence come easily.
Clear Pricing
A critical part of building engineer confidence in our platform is our pricing model. With our early partners and customers we’ve been using a usage-based model with a flat rate per compute hour, so that they can think of Arch like any other cloud resource in their tech stack. The current price is $0.75 per compute hour for any workloads run on the platform across data syncing, transformation, storage, and querying – for example, a month of workloads totaling 200 compute hours would come to $150.
In any pricing model, there is a tradeoff between predictability and value alignment. We’re fans of the usage-based model because it better aligns what our customers pay with the value they get from the work we take off their hands, and it’s easy to compare to an in-house solution running on your own compute. We’ve seen other tools charge on a per-connection or per-tenant basis, and while that’s certainly predictable, it doesn’t account for differences between tenants in data volumes or for subscription tiers that unlock different data-powered functionality. The usage-based model has been well received so far by our early customers and design partners, but we’re always looking for feedback, so let us know if you have a different preference.
The Road Ahead
We’ve got an ambitious roadmap and are moving quickly to make Arch the best possible solution for all your product’s external-data and AI-enablement needs. What we build, and in which order, will depend heavily on your feedback if you end up working with us, but these are some of the things we’re considering based on early user feedback:
- 2-way sync so you can push data back to upstream systems as easily as you can pull from them
- Support for streaming and event-based data sources, including webhook payloads
- Putting Arch in front of your application database so you can join your proprietary data with external data and get the same auto-embeddings and instant API benefits
- Using ML models, including LLMs, in the data flow to extract and enrich data
For the past few months, we’ve been building Arch in close collaboration with some great design partners, and today we’re ready to start working with more software teams facing the challenge of building products on their customers’ data. We want to make sure we build exactly what you need: flexible enough that you can get the data exactly how you need it, with a maintenance burden low enough that you can quickly get back to actually using the data. The potential features listed above and the vision laid out in the “AI product development is being held back by data engineering” blog post are a sketch of where Arch could go, based on what we think software teams like yours will want and need in the medium-to-long term, but what we actually end up building will depend on your short-term needs and how they evolve as this AI revolution plays out.
If anything I’ve written here resonates with you, we’d love to talk to you – feel free to book a slot on my calendar or sign up on https://arch.dev to get access to Arch or just stay in the loop. Let’s work together to realize your wildest AI product ideas!