CTO Sessions: Matei Zaharia, Databricks

What is the biggest issue that you’re helping customers with at the moment? "The biggest issue is how to make data and AI powered applications simpler and more reliable to build."

Name: Matei Zaharia

Company: Databricks

Job title: Co-Founder & Chief Technologist

Date started current role: July 2013

Location: Palo Alto, California

Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Zaharia tech-leads the MLflow development effort at Databricks in addition to other aspects of the platform. Zaharia’s research work was recognised through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).

What was your first job? I worked as a web developer for some science education organisations in Canada while I was in high school there. After that, I started working as a research assistant in university.

Did you always want to work in IT? Not always, but I got interested in computer science fairly early on and decided to study it in university. I was interested in STEM as a whole, but I thought that computer science was the field in which it was easiest to experiment with the latest ideas and advances and come up with something better, since everyone has access to a computer. This is different from fields that require a lot of lab equipment to work in.

What was your education? Do you hold any certifications? What are they? As an undergraduate, I studied Computer Science at the University of Waterloo. After that, I completed my PhD in Computer Science at the University of California, Berkeley. During my PhD, I started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing, and co-started other widely used datacenter software such as Apache Mesos, Alluxio, and Spark Streaming. My research work was recognised through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).

Explain your career path. Did you take any detours? If so, discuss. At university, I started doing research with a professor at Waterloo (Srinivasan Keshav) and got very interested in networking and distributed systems. My early work with Prof. Keshav was on peer-to-peer networks and mobile computing. I decided to apply to PhD programs afterward and joined UC Berkeley, where I quickly became interested in datacenter-scale computing and was involved in a lot of the early academic research in that area. As part of that process, I also got to know the open source distributed systems community well, started contributing to projects there, and came up with ideas for new projects, such as Apache Spark. When I finished my PhD, I spent two years full-time at Databricks and then also began working as a tenure-track professor to continue doing research, first at MIT and then at Stanford. It’s been very interesting to see both the research career path, which is focused on long-term trends and theoretical ideas, and the industry career path at an enterprise company like Databricks, which lets you see the tough real-world problems that organisations face.

What type of CTO are you? Essentially, I am an Architect/Product Manager because I’m mostly focused on product design as well as the technical architecture behind the product. At Databricks specifically, it is critical to design the technology and the product together because the product itself is used by data analysts and developers, so it needs to scale well, be reliable, and support the latest algorithms and techniques.

Which emerging technology are you most excited about the prospect of? Even though it's been around for a while, I think cloud computing is still an emerging technology: we're still figuring out better ways to build cloud applications, deliver them to enterprises, and enable people within an enterprise to share results and access data and applications.

I still believe there is a lot of transformation to come as more applications become cloud-native and elastic: most of the current computing stack does not fully take advantage of the cloud. When we started Databricks, we took a bet on the cloud, and we were ranked #5 on the Forbes Cloud 100 list. For example, we built one of the first cloud-based data engineering and data science environments, one of the most widely used data management layers for cloud storage (the open source Delta Lake project), and the first auto-scaling engine for data science and analytics workloads in the cloud (by modifying Apache Spark to support this).

Are there any technologies which you think are overhyped? Why? It is difficult to say, because in every area there are people overhyping certain technologies. Within the AI space, though, there are a lot of technologies and products that don’t address the end-to-end problem: companies assume a certain product will solve everything, then realise it can’t. In particular, anything that doesn't help with the data problem of keeping AI models well fed with high-quality data is going to have issues.

What is one unique initiative that you’ve employed over the last 12 months that you’re really proud of? I’m very proud of how we’ve been able to repeatedly work with early customers to design new products that, when they hit the market, are a good fit for a lot of companies. We’ve done this multiple times, starting three years ago with Delta Lake, a reliable data management layer over cloud data lakes that makes them look similar to a data warehouse and that helped start the current trend towards Lakehouse system architectures. This is something we co-developed early on with one very large technology company with a highly demanding use case, and were then able to roll out to smaller companies without any issues.
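To make that Lakehouse pattern concrete, here is a minimal sketch of the kind of usage Delta Lake enables, assuming PySpark with the open source delta-spark package installed; the paths and sample data are invented for illustration.

```python
# A minimal sketch of the Delta Lake pattern described above, assuming
# PySpark with the open source delta-spark package (`pip install delta-spark`).
# Paths and sample data are invented for illustration.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-sketch")
    # Standard open source Delta Lake setup (not Databricks-specific):
    # enable Delta's SQL extensions and catalog implementation.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing in the "delta" format lays a transaction log over plain Parquet
# files in the data lake, which is what provides warehouse-like ACID
# guarantees on top of cheap object storage.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Reads go through the log, so readers never see partially written data.
spark.read.format("delta").load("/tmp/delta/events").show()
```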

We later did this with MLflow, a major part of our managed machine learning platform, and then with smaller products. We’ve designed our whole development process and product management process so we can gather meaningful feedback from customers as soon as possible to deliver great products.
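As a rough illustration of MLflow's tracking component, the sketch below logs a parameter and a metric for one training run; it assumes `pip install mlflow scikit-learn`, and the model choice and metric name are arbitrary.

```python
# A rough sketch of MLflow experiment tracking; assumes
# `pip install mlflow scikit-learn`. The model and metric are arbitrary.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each run records its parameters, metrics, and artifacts in a tracking
# store, so experiments can be compared and reproduced later.
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
```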

Are you leading a digital transformation? If so, does it emphasise customer experience and revenue growth or operational efficiency? If both, how do you balance the two? As a seven-year-old company, Databricks doesn’t have too much legacy infrastructure to transform, but we are always looking at how to become even more data-driven and give people really great tools to make decisions. Within Databricks, we use our own technology to give many different users access to the latest data about how the business is doing, how our customers are doing, and how our products are performing for engineering teams.

For example, we have internal tools that our customer success team can check before they talk to a customer to see that customer’s usage, and our engineers and product managers can see product usage in real time and identify what is working and what isn’t.

Our technology is about maximising what the customer can build with their data assets, so ensuring they get the most out of our product is very important to our business.

What is the biggest issue that you’re helping customers with at the moment? The biggest issue is how to make data and AI powered applications simpler and more reliable to build. There are multiple pieces to that, including collecting, integrating, and cleaning data from different sources in close to real-time. There’s also a component of building the downstream applications, whether they're reporting, machine learning, or data science. We've found that most enterprises already have a data team and are using data and AI in their business in some way, but they want to scale their use of these applications quickly, which means they need new technology that makes this simpler, easier to scale, and broadly accessible within the company.
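As a hypothetical sketch of that near-real-time collect/integrate/clean step, this is roughly what it could look like with Spark Structured Streaming feeding a Delta table; the paths and schema are made up, and a Delta-enabled SparkSession (as in the earlier sketch) is assumed to already exist as `spark`.

```python
# A hypothetical sketch of the near-real-time collect/integrate/clean step
# with Spark Structured Streaming writing to a Delta table. The paths and
# schema are made up; a Delta-enabled SparkSession (as in the sketch above)
# is assumed to already exist as `spark`.
from pyspark.sql.functions import col

# Continuously ingest newly arriving JSON files from a landing zone.
raw = (
    spark.readStream
    .format("json")
    .schema("id INT, action STRING, ts TIMESTAMP")
    .load("/tmp/landing/events")
)

# Cleaning step: drop malformed rows before they reach downstream
# reporting and machine learning jobs.
clean = raw.dropna(subset=["id", "ts"]).filter(col("action").isNotNull())

# Write the cleaned stream to a Delta table; the checkpoint lets the job
# recover exactly where it left off after a failure.
query = (
    clean.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start("/tmp/delta/clean_events")
)
query.awaitTermination()
```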

With the pandemic, things are changing and everyone wants to understand how their business is changing in even more detail than before and how to make decisions informed by data.

How do you align your technology use to meet business goals? This can be challenging, but within our business, we deliver a cloud service that needs to be highly reliable and robust, so we are very careful to design our products in a way that allows them to recover from errors. We’re also very careful about user interfaces (UIs), ensuring they are compatible and well designed. Lastly, we pick technology internally that is simple and easy to build with when possible. Given the number of engineers at Databricks and the product surface area, we don’t have a very broad technology stack; rather, we try to pick a few tools that work well, use them for everything, and make sure that everyone is an expert in those.

Do you have any trouble matching product/service strategy with tech strategy? Overall, they are fairly independent of each other, so our experience has been okay. The most important part of the tech strategy is to ensure that we can move the product in the direction that makes sense over time.

We do a lot of planning and consider how we want the product to look in the future, how we want to deliver it, and what additional guarantees we want to provide. It is important to look a few years ahead of where you're going, but overall, we haven't seen too much of a clash between these strategies.

What makes an effective tech strategy? Overall, I believe it is important to keep things simple to understand and push back on complexity within reason, but there is always a tradeoff between delivering something quickly and making it perfect. If you choose simplicity or fewer moving parts, that will help with both objectives of developing it quickly and making it easier to maintain in the future.

What predictions do you have for the role of the CTO in the future? I think the CTO will become an increasingly important role because all companies are increasingly becoming technology companies. Within each business, there are a lot of unique problems and systems specific to that business, which means they need in-house talent to understand them and successfully solve the problems. Cloud computing makes it easier to adopt and test new technology, and to change it out if it isn't working. I think that's actually an exciting thing for most CTOs because it means they can move quicker instead of having a long feedback cycle on any decision they make. For tech company CTOs, the move to the cloud means you have to build software in a different way from the enterprise companies of the past, so there is a lot to learn, but it is necessary.

What has been your greatest career achievement? Overall, with everything we’ve built at Databricks, we’ve delivered a lot of high quality, industry changing products in a short time with a team that isn’t that large.

In 2014, we launched our Databricks Workspace, which was the first end-to-end cloud workspace for data science, and this type of interface has now become the standard for data scientists in many organisations.

We’ve continued building out the Apache Spark open source project, which has become the most widely used large-scale data processing framework. As we moved along, we launched two more open source projects and associated products: Delta Lake and MLflow. Three years after launch, half of our workload runs on Delta Lake, which is very rare for infrastructure that is so mission-critical for managing large data volumes and the thousands of applications that rely on them. But we figured out how to design this technology to give customers very powerful data management features while remaining highly reliable and scalable. We have even more exciting products coming, such as the cloud-hosted version of the Redash SQL analytics software that we announced at our Spark + AI Summit conference this year after we acquired Redash.

Bringing together a team and setting up the processes that can deliver these types of products quickly and reliably is very challenging, and I think that's the thing that we did best.

What are you reading now? Most recently I've been reading a bunch of novels by N. K. Jemisin.

Most people don't know that I… Used to do a lot of work with computer graphics in high school and college. I might have kept doing that if I hadn't met my undergrad research advisor, who worked on networks and peer-to-peer systems and got me interested in distributed systems.

In my spare time, I like to… Read a lot of books, and spend a lot of time outside walking, hiking and running.

Ask me to do anything but… Sleep less than a whole night - I find that very tough.