Lyft’s Amundsen: Data-Discovery with Built-In Trust

March 29, 2022 by Phu Nguyen


Knowing what data a company has, and where and how it’s stored, has gained urgency with the enactment of the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which took effect Jan. 1, 2020.

But other concerns, such as worker productivity, also come into play for organizations dealing with massive amounts of data. Ride-share company Lyft, for example, collects data on the more than 50 million rides it provides a month.

Yet data scientists spend up to a third of their time trying to find the data they need and then trying to figure out whether the data they find can be trusted, according to the company.

In response, Lyft built Amundsen, a data-discovery application on top of a metadata repository, to make it easier for data scientists and others to find and interact with data. It’s named after Norwegian explorer Roald Amundsen, whose expedition was the first to reach the South Pole, and its search experience is patterned after Google search. Lyft open-sourced the project in April.

Emphasis on Trust

Lyft has been growing rapidly, not only in the volume of services it provides but also in the number of employees joining the company. That leaves a knowledge gap about what data the company has, what work has previously been done on it and how up to date it is, product manager Mark Grover explained in a recent webinar.

Lyft’s data sources include both structured and unstructured data stores like Hive, Presto, Postgres and Amazon Redshift.

The company faced two challenges: no single data model fit all of its data resources, and each resource is stored and fetched differently.

Its requirements for the project included:

Trust embodied in the solution — things more trustworthy show up first in the search results.
Little manual curation — it needed to be automated.

A preference for open source. While the team considered open source projects like LinkedIn’s WhereHows and Apache Atlas, it ultimately decided the experience it wanted wasn’t out there, Grover said, and set out to build its own.
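The first requirement can be read as a ranking rule. The following is a minimal sketch, assuming that how often a table is queried serves as a proxy for trust; the `TableResult` type, its fields and the sample catalog are hypothetical and do not reflect Amundsen's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableResult:
    name: str
    description: str
    query_count: int  # hypothetical usage signal: how often the table is queried


def search(tables: list[TableResult], term: str) -> list[TableResult]:
    """Return tables whose name or description contains `term`, most-queried first."""
    needle = term.lower()
    matches = [
        t for t in tables
        if needle in t.name.lower() or needle in t.description.lower()
    ]
    # Assumption: heavier usage is treated as a stand-in for trustworthiness.
    return sorted(matches, key=lambda t: t.query_count, reverse=True)


catalog = [
    TableResult("eta_experiments", "scratch ETA tests", 3),
    TableResult("driver_eta", "estimated arrival times (ETA) for drivers", 1200),
    TableResult("rides", "completed rides", 5000),
]
ranked = [t.name for t in search(catalog, "eta")]
print(ranked)  # ['driver_eta', 'eta_experiments']
```

The heavily used `driver_eta` table outranks the scratch table even though both match the term, which is the behavior the requirement asks for.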

There are four parts to Amundsen:

Crawler: Called Databuilder, it works much like Google’s web crawlers. It crawls databases, dashboards and HR systems to determine which tables were created since the last run, which columns were added, who was added to the system, who left the company and more. It uses Apache Airflow to orchestrate jobs.
Search engine: Based on Elasticsearch and similar in feel to Google’s, it supports multiple types of search: normal, which matches records based on relevancy; category, which matches records first based on data type, then relevancy; and wildcard.
Front-end service: If, for example, you’re looking for data on estimated arrival times (ETAs) for drivers, you type “ETA” in the search box and get a ranked results page similar to Google search results. The information there includes how commonly a table is queried, when the table was last populated and who else is using it. Clicking the first result provides more detail, including the table’s schema, a quick preview of the data and statistics about its shape, such as means and standard deviations.
Graph database: A metadata repository containing information about tables, people and the relationships between them. It’s built on Neo4j, but support for Apache Atlas is in the works. It also exposes REST APIs so other services can push or pull metadata directly.
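The crawler's core question, what changed since the last run, can be illustrated with a plain snapshot diff. This is a sketch under stated assumptions, not Databuilder's API: the `Schema` alias, the `diff_snapshots` function and the sample tables are all hypothetical.

```python
# Hypothetical snapshot shape: table name -> set of column names.
Schema = dict[str, set[str]]


def diff_snapshots(previous: Schema, current: Schema) -> dict[str, object]:
    """Report tables and columns present now that were absent in the last crawl."""
    new_tables = sorted(current.keys() - previous.keys())
    new_columns = {
        table: sorted(columns - previous[table])
        for table, columns in current.items()
        # Only tables seen before can have "added" columns.
        if table in previous and columns - previous[table]
    }
    return {"new_tables": new_tables, "new_columns": new_columns}


previous = {"rides": {"ride_id", "started_at"}}
current = {
    "rides": {"ride_id", "started_at", "eta_seconds"},
    "drivers": {"driver_id"},
}
changes = diff_snapshots(previous, current)
print(changes)
# {'new_tables': ['drivers'], 'new_columns': {'rides': ['eta_seconds']}}
```

A real crawler would pull these snapshots from each source system on a schedule; the diff itself stays this simple.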

The first iteration focused heavily on tables, the working material of data scientists and analysts, who use raw data sets for analysis. Amundsen has since added a second node type: people.

“I can go to the page of a person on the team — what tables does she own, what does she bookmark, what does she use frequently? Those conversations we used to have on Slack don’t need to happen anymore because I have that information,” Grover said.
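The person page Grover describes amounts to a lookup over person-to-table relationships. Below is a minimal in-memory sketch, assuming edges are stored as simple tuples; the names and the relationship labels (`OWNS`, `BOOKMARKED`, `READS`) are hypothetical and are not taken from Amundsen's graph model.

```python
from collections import defaultdict

# (person, relationship, table) edges; all values here are illustrative.
EDGES = [
    ("alice", "OWNS", "driver_eta"),
    ("alice", "BOOKMARKED", "rides"),
    ("alice", "READS", "rides"),
    ("bob", "READS", "driver_eta"),
]


def person_page(edges, person):
    """Group the tables linked to `person` by relationship type."""
    page = defaultdict(list)
    for who, relation, table in edges:
        if who == person:
            page[relation].append(table)
    return dict(page)


def users_of(edges, table):
    """List everyone with any relationship to `table`."""
    return sorted({who for who, _, linked in edges if linked == table})


print(person_page(EDGES, "alice"))
# {'OWNS': ['driver_eta'], 'BOOKMARKED': ['rides'], 'READS': ['rides']}
print(users_of(EDGES, "driver_eta"))  # ['alice', 'bob']
```

The same edges answer both directions of the question: what a person works with, and who else is using a given table.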

Lyft plans to add more node types, including dashboards, streams, and ETL and data quality.

Metadata at the Core

Metadata is key to the next wave of big data applications, according to the company.

“We realized we were gathering all these interesting metadata that we wanted to use for data discovery and trust, but we could use it for other applications as well. What we ended up building was this data discovery application on top, but at the bottom was this metadata engine, the core of all the information that people use to power the data,” Grover said.

“If I know where all our data is stored, if I can tag all these columns as personal or private and know who’s accessing this data, then I can have a governance system based on this,” he said of the compliance use case for the system.

Rather than manual approaches or isolating sensitive data in a separate database or location, metadata can be used to restrict access appropriately and maintain compliance, he explained in a blog post.
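One way to picture metadata-driven access control is a tag check at read time. This is a minimal sketch, assuming each column carries a set of tags and each user holds a set of clearances; the `COLUMN_TAGS` mapping, the column names and the clearance model are hypothetical, with only the "personal" and "private" tag names taken from the quote above.

```python
# Hypothetical column metadata: (table, column) -> sensitivity tags.
COLUMN_TAGS = {
    ("rides", "ride_id"): set(),
    ("rides", "rider_email"): {"personal"},
    ("rides", "pickup_address"): {"personal", "private"},
}


def readable_columns(table, clearances):
    """Return the columns of `table` whose tags are all covered by `clearances`."""
    return sorted(
        column
        for (tagged_table, column), tags in COLUMN_TAGS.items()
        # A column is readable only if every tag on it is cleared.
        if tagged_table == table and tags <= clearances
    )


print(readable_columns("rides", set()))         # ['ride_id']
print(readable_columns("rides", {"personal"}))  # ['ride_id', 'rider_email']
```

Because the restriction is derived from tags, sensitive columns can stay in the same table as everything else instead of being moved to a separate store.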

For ETL and data quality, using profiles of data in all the tables, users can apply heuristics to determine how the data going in today compares with that of yesterday, then set some allowable percentage of difference.
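The day-over-day comparison can be sketched as a percentage-change check on profile metrics. The example below is an illustration under assumptions: the metric names, the profile dictionaries and the `drifted_metrics` function are hypothetical, and the treatment of a zero baseline is one possible choice among several.

```python
def drifted_metrics(yesterday, today, max_pct_change):
    """Return metrics whose value moved more than `max_pct_change` percent."""
    flagged = {}
    for metric, old in yesterday.items():
        new = today.get(metric)
        if new is None:
            continue  # metric missing today; a real check might flag this too
        if old == 0:
            # Percent change is undefined at zero, so flag any move away from it.
            if new != 0:
                flagged[metric] = float("inf")
            continue
        pct = abs(new - old) / abs(old) * 100
        if pct > max_pct_change:
            flagged[metric] = pct
    return flagged


yesterday = {"row_count": 1000, "mean_eta_seconds": 300.0}
today = {"row_count": 1040, "mean_eta_seconds": 420.0}
flagged = drifted_metrics(yesterday, today, max_pct_change=10.0)
print(sorted(flagged))  # ['mean_eta_seconds']
```

Here the row count grew by 4 percent, inside the 10 percent allowance, while the mean ETA jumped by about 40 percent and is flagged for review.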

With streams, it could determine which streams are trustworthy and which streams map to which data sets.

Feature Image: “Iceberg Framed” by Anita Ritenour. Licensed under CC BY-SA 2.0.

Source: InApps.net
