ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Mode Studio

The Collaborative Data Science Platform

Creating Pandas DataFrames & Selecting Data

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

In this lesson, you'll learn how to create and use a DataFrame, a Python data structure that is similar to a database or spreadsheet table. You'll learn how to:

The Python data analysis tools that you'll learn throughout this tutorial are very useful, but they become immensely valuable when they are applied to real data (and real problems). In this lesson, you'll be using tools from pandas, one of the go-to libraries for data manipulation, to conduct analysis of web traffic, which can help drive valuable decisions for a business.

Pandas DataFrames

Pandas has a few powerful data structures:

  • A table with multiple columns is a DataFrame.
  • A column of a DataFrame, or a list-like object, is a Series.

A DataFrame is a table much like in SQL or Excel. It's similar in structure, too, making it possible to use similar operations such as aggregation, filtering, and pivoting. However, because DataFrames are built in Python, it's possible to use Python to program more advanced operations and manipulations than SQL and Excel can offer. As a bonus, the creators of pandas have focused on making the DataFrame operate very quickly, even over large datasets.

DataFrames are particularly useful because powerful methods are built into them. In Python, methods are associated with objects, so you need your data to be in the DataFrame to use these methods. DataFrames can load data through a number of different data structures and files, including lists and dictionaries, csv files, excel files, and database records (more on that here).

Loading data into a Mode Python Notebook

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:

  1. Log into Mode or create an account.
  2. Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
  3. Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
  4. The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Creating a Pandas DataFrame

Prepping a DataFrame

In Mode Python Notebooks, the first cell is automatically populated with the following code to access the data produced by the SQL query:

datasets[0].head(n=5)

The datasets object is a list, where each item is a DataFrame corresponding to one of the SQL queries in the Mode report. So datasets[0] is a dataframe object within the datasets list. You can see that the above command produces a table showing the first 5 rows of the results of your SQL query.

Mode is able to do this because it comes pre-loaded with pandas. Still, you should get in the habit of giving librarias aliases so that you can refer to them easily throughout your code. Pandas is typically aliased as pd:

Input

import pandas as pd
    

You should also assign the DataFrame as a variable. Since you'll only be working with one DataFrame in this lesson, you can keep it simple and just call it data:

Input

data = datasets[0] # assign SQL query results to the data variable
    

One final step before we're ready to start analysis: text cleanup. There are a few missing values in this dataset (in SQL you'd refer to them as null). For the sake of making this easier to look at, use the fillna() method to replace missing values with empty strings:

Input

data = data.fillna('') # replace missing values with strings for easier text processing
    

About this dataset

As mentioned above, in this lesson you'll be working with web traffic data from a nonprofit called Watsi. Every row in this dataset corresponds to a person visiting a watsi.org page (this is known as a pageview). The general flow of pageviews is referred to as web traffic.

Every pageview (row in the dataset) is composed of:

  • 'referrer' The url that referred the user to the site (if available). For example, if someone arrived at the page through a Facebook link, referrer would be https://www.facebook.com
  • 'timestamp' The time the event occurred
  • 'title' The title of the page the user visited on the Watsi website
  • 'url' The url the user visited. For example, https://watsi.org/team/the-meteor-chef
  • 'user_agent' The software the user used to accessed the site, including platform, browser, and extensions
  • 'user_id' A unique id for each user (normally they’d be numbers—we've turned them into anonymous names instead)
  • 'referrer_domain' The domain of the url that referred the user to the site. For example, "facebook.com"
  • 'website_section' The section of the website visited. For example, the section of https://watsi.org/team/the-meteor-chef is "team"
  • 'platform' The device platform the user visited from. Possible values are "Desktop" and "Mobile"

Get context for the data

Through their website, Watsi enables direct funding of medical care. Take the time to understand what that looks like in practice. Visit some of the URLs you see in this dataset to familiarize yourself with the structure of the site and content, such as Mary's patient profile. Google Watsi and consider why people might engage with the service. Context is important - it'll help you make educated inferences in your analysis of the data.

Data sampling

This dataset contains 5,000 rows, which were sampled from a 500,000 row dataset spanning the same time period. Throughout these analyses, the number of events you count will be about 100 times smaller than they actually were, but the proportions of events will still generally be reflective of that larger dataset. In this case, a sample is fine because our purpose is to learn methods of data analysis with Python, not to create 100% accurate recommendations to Watsi.

Selecting columns in a DataFrame

As you learned in the previous lesson, you can select a value in a list or dictionary using brackets:

  • cities[0] (gets item at place 0 in the list "cities")
  • city_population['Tokyo'] (gets values associated with the key 'Tokyo' in the dictionary city_population)

Similarly, you can use brackets to select a column in the DataFrame:

Input

data['url']
    
Output

0                        https://watsi.org/
1    https://watsi.org/team/the-meteor-chef
2              https://watsi.org/gift-cards
3                        https://watsi.org/
4                        https://watsi.org/
Name: url, dtype: object
    

Selecting the column gives you access to the whole column, but will only show a preview. Below the column, the column name and data type (dtype) are printed for easy reference.

The url column you got back has a list of numbers on the left. This is called the index, which uniquely identifies rows in the DataFrame. You will use the index to select individual rows, similar to how you selected rows from a list in an earlier lesson. A unique identifier is often necessary to refer to specific records in the dataset. For example, the DMV uses license plates to identify specific vehicles, instead of "Blue 1999 Honda Civic in California," which may or may not uniquely identify a car.

Selecting columns will be important to much of the analysis you do throughout the tutorials, especially in grouping and counting events.

Selecting rows in a DataFrame

Selecting rows is useful for exploring the data and getting familiar with what values you might see. You can select rows by using brackets and row indexes. For example, you can select the first three rows of the DataFrame with the following code:

Input

data[:3]
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop
1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop
2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards Desktop

The ":3" between the brackets effectively means "up to index 3". Similarly, you could select everything "from index 4 up to (but not including) index 7":

Input

data[4:7]
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
4 https://watsi.org/fund-treatments 2016-02-14 19:30:08 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... ANDREE N watsi.org Desktop
5 http://blog.watsi.org/ 2015-10-15 06:04:40 Watsi | Fund a medical treatment on Watsi. 100... https://watsi.org/fund-treatments Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... SHAREN Y watsi.org fund-treatments Desktop
6 2015-12-25 10:23:43 Watsi | Redeem your Watsi Gift Card https://watsi.org/redeem Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) G... BRICE Z redeem Desktop

Finally, you can select "everything from index 4997 onward":

Input

data[4997:]
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
4997 http://aplus.com/a/kid-has-flexible-neck?c=654... 2016-01-03 02:48:38 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like ... NOELLE P aplus.com iPhone
4998 https://watsi.org/fund-treatments?page=2 2016-02-07 23:47:53 Watsi | Success! Sarah from Kenya raised $1,12... https://watsi.org/profile/6705ce017f7e-sarah Mozilla/5.0 (iPad; CPU OS 9_2 like Mac OS X) A... JERICA F watsi.org profile iPad
4999 https://watsi.org/ 2015-11-17 16:38:25 Watsi | Fund a medical treatment on Watsi. 100... https://watsi.org/fund-treatments?page=4 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5... MARIANNA I watsi.org fund-treatments Desktop

Selecting a specific row

To select a specific row, you must use the .iloc or .loc method, with the row's index in brackets. Which to use depends on what you are trying to return.

.iloc is integer index-based. If you .iloc[1], it will return to you the row at the 1st index regardless of the index’s name. .loc however is label-based, meaning that .iloc[1] will only return the row at the first index if "1" is the index’s label. In the case of this dataframe .iloc[1] and .loc[1] will return the same row.

Input

data.iloc[1]
    
Output

referrer           https://themeteorchef.com/snippets/making-use-...
timestamp                                        2016-02-24 23:12:10
title                                        Watsi | The Meteor Chef
url                           https://watsi.org/team/the-meteor-chef
user_agent         Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id                                                     WARREN Q
referrer_domain                                    themeteorchef.com
website_section                                                 team
platform                                                     Desktop
Name: 1, dtype: object
    
Input

data.loc[1]
    
Output

referrer           https://themeteorchef.com/snippets/making-use-...
timestamp                                        2016-02-24 23:12:10
title                                        Watsi | The Meteor Chef
url                           https://watsi.org/team/the-meteor-chef
user_agent         Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id                                                     WARREN Q
referrer_domain                                    themeteorchef.com
website_section                                                 team
platform                                                     Desktop
Name: 1, dtype: object
    

This is different from selecting columns. When selecting a column, you'll use data[], and when selecting a row, you'll use data.iloc[] or data.loc[]. To learn more about the differences between .iloc and .loc, check out pandas documentation.

Selecting rows and columns in a DataFrame

Just as you can select from rows or columns, you can also select from both rows and columns at the same time. For example, you can select the first three rows of the title column by naming both the column and rows in square brackets:

Input

data['title'][:3]
    
Output

0    Watsi | Fund medical treatments for people aro...
1                              Watsi | The Meteor Chef
2    Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
    

Think about this as listing the row and column selections one after another. Putting together a column selection and a row selection:

data['title']

data[:3]

You get the combined selection:

data['title'][:3]

The brackets selecting the column and selecting the rows are separate, and the selections are applied from left to right (in this last example, the column is selected, then it is filtered down to the first 3 rows).

In fact, selecting the rows and then the column yields the same result:

Input

data[:3]['title']
    
Output

0    Watsi | Fund medical treatments for people aro...
1                              Watsi | The Meteor Chef
2    Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
    

Practice Problem

Select records from rows 10 to 15 in the 'referrer' column.

View Solution

Lesson summary:

In this lesson, you learned to:

  • Create a pandas DataFrame with data
  • Select columns in a DataFrame
  • Select rows in a DataFrame
  • Select both columns AND rows in a DataFrame

In the next lesson, you'll learn how to count values and plot a bar chart.

Next Lesson

Pandas .values_count() & .plot()

Get more from your data

Your team can be up and running in 30 minutes or less.