Data Engineering Articles

Data engineers empower company initiatives by building tools, infrastructure, frameworks, and services to get data “in shape” for analysts to query. This section includes articles about building and maintaining scalable data infrastructure, data modeling, piping data from one database to another (ETL), integrating data generated from SaaS tools into a single data warehouse, and optimizing data processing and storage.

Down with Pipeline debt / Introducing Great Expectations

This new Python library aims to help you beat down pipeline debt, a type of technical debt that infests backend data systems, by running automated tests on data (instead of code) at batch time (instead of compile or deploy time). - Great Expectations
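
As a rough illustration of what testing data (rather than code) at batch time can look like, the sketch below checks a batch of records against a few expectations before the batch moves downstream. It uses plain Python and is not the Great Expectations API; the helper names (`expect_not_null`, `expect_between`, `validate_batch`) and the sample records are hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
# An expectation takes a batch and returns a list of failure messages.
Expectation = Callable[[List[Record]], List[str]]


def expect_not_null(column: str) -> Expectation:
    """Build a check that flags rows where `column` is missing or None."""
    def check(batch: List[Record]) -> List[str]:
        return [
            f"row {i}: {column} is null"
            for i, row in enumerate(batch)
            if row.get(column) is None
        ]
    return check


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Build a check that flags non-null values outside [low, high]."""
    def check(batch: List[Record]) -> List[str]:
        return [
            f"row {i}: {column}={row[column]} outside [{low}, {high}]"
            for i, row in enumerate(batch)
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
    return check


def validate_batch(batch: List[Record], expectations: List[Expectation]) -> List[str]:
    """Run every expectation against the batch and collect all failures."""
    failures: List[str] = []
    for expectation in expectations:
        failures.extend(expectation(batch))
    return failures


# Hypothetical batch arriving from an upstream job.
batch = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 240},
]
failures = validate_batch(
    batch,
    [expect_not_null("user_id"), expect_between("age", 0, 120)],
)
for message in failures:
    print(message)
```

In a pipeline, a non-empty `failures` list would typically stop the load or raise an alert, so bad data is caught when the batch runs instead of surfacing later in a report.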

What is a Senior Data Visualization Engineer

“It differs from an analyst role in that the focus is not on a question but rather on an audience that typically needs something more than a single report and who expects views into the data that generate more than just the expected insights.” - Elijah Meeks

A Beginner’s Guide to Data Engineering — Part I

The perfect primer for aspiring data scientists who need to learn the basics to evaluate job opportunities or early-stage founders who are about to build the company’s first data team. - Robert Chang

Scaling Event Tables with Redshift Spectrum

As Mode’s customer base grew, we reached a point where our infrastructure wasn’t capable of handling the exponentially increasing volume of event data. Here’s how we saved Redshift performance by offloading 75% of our event data to S3 in less than a week. - Mode

Selecting a Cloud Provider

Since its inception, Etsy has hosted its site and services in self-managed data centers. Now the company is switching over to Google Cloud Platform. Their CTO shares what went into their five-month-long evaluation process. - Code as Craft

The Missing Layers of the Analytics Stack

Collect, transform, analyze. These are the three pillars that support the modern analytics stack. Looking ahead, new layers may be added to streamline current sticking points, like data cleansing and anomaly detection. - Fishtown Analytics

Apache Airflow for the confused

Do you need a clear explanation about this task orchestration tool, sans the technical language? This post unpacks the jargon with a very apropos metaphor—air traffic controllers. - NYC Capital Planning

Big Data Processing at Spotify: The Road to Scio (Part 1)

Using Scio, a Scala API built in-house, Spotify is able to run the majority of their workloads with a single system, with little operational overhead. - Spotify Labs

What, exactly, is dbt?

Go deep on dbt, a command line tool that handles the T (transform) in ETL. - Fishtown Analytics

Segment vs Fivetran vs Stitch: Which Data Ingest Should You Use?

Choosing a pipeline tool comes down to which of these criteria is your top priority: harnessing an open source framework, handling high volumes of data with minimal downtime, or getting your data into third-party tools. - Stephen Levin

ZATA: How we used Kubernetes and Google Cloud to expose our Big Data platform as a set of RESTful web services

An inside look at zulily's data platform, which makes data accessible to analysts, systems, and applications without sacrificing speed or storage options. - Tech @ zulily

How Stitch Consolidates A Billion Records Per Day

Ever wanted to know how the people who make ETL tools set up their data infrastructure? Wonder no more. - StackShare

Choosing an ETL tool for your analytics stack

In the market for an ETL solution? Here are the criteria we used when we evaluated ETL vendors for our own use here at Mode. - Mode

Airflow and the Future of Data Engineering: A Q&A

“[F]uture startups will be catapulted up the data maturity curve with access to better, cheaper, more accessible analytics software and services.” - Astronomer

The Rise of the Data Engineer

An in-depth manifesto for data science’s younger sibling. - Maxime Beauchemin

The State of Data Engineering

What makes a data engineer, well, a data engineer? And why does it feel like everyone is looking to hire one? This new study of LinkedIn data reveals that the number of data engineers doubled from 2013 to 2015, but demand still far outpaces supply. - Stitch Data

Goods: Organizing Google’s datasets

Most companies store their data in a central repository where everyone can go to publish or retrieve a dataset. Google manages their data in a different way: they’ve built (surprise!) a crawling engine to index datasets and gather metadata about them. This gives folks the freedom to make and use datasets however they like.

When to use unstructured datatypes in Postgres–Hstore vs. JSON vs. JSONB

PostgreSQL has supported NoSQL for a while now, but when should you use the relational mode and when should you use the non-relational mode? And if you use NoSQL, which data type should you pick? - Citus Data

Non-Mathematical Feature Engineering techniques for Data Science

This article is worth Pocketing for the straightforward, plain-English explanation of feature engineering alone. (And the best practices for pre-processing data ain’t bad either.) - Sachin Joglekar

Bridging the Gap Between Data Science and Data Engineering

Josh Wills, Director of Data Engineering at Slack, shares his thoughts on how data engineers and data scientists work best together. - Hakka Labs

The Purpose of Platforms in Data Science

How do you scale your data science org without hiring more people? Optimize for technical efficiency. In Uber’s case, that means data engineers building self-serve platforms to address specific problems in data scientists’ workflows. - Kevin Novak

Building Thumbtack’s Data Infrastructure

In this post, Thumbtack data engineer Nate Kupp sheds light on the company’s process for evaluating tools to add to their tech stack. It’s a goldmine for startups contemplating how to build a sustainable data infrastructure. - Thumbtack Engineering

Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department

Here’s one suggestion for fixing the sometimes hairy relationships between data scientists and engineers: optimize for autonomy, not technical efficiency. - Stitchfix

Choosing a Database for Analytics

A comprehensive rundown of criteria to consider when you’re ready to dedicate a database to analytics. Use this guide to evaluate your options depending on the type and size of your data, the state of your engineering resources, and your need to analyze data in real-time. - Segment