ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Mode Studio

The Collaborative Data Science Platform

Deriving New Columns & Defining Python Functions

Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.

Goals of this lesson

This lesson builds on the pandas DataFrame data type you learned about in a previous lesson. You’ll learn how to:

Loading data into Mode Python notebooks

Mode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.

For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:

  1. Log into Mode or create an account.
  2. Navigate to this report and click Duplicate. This will take you to the SQL Query Editor, with a query and results pre-populated.
  3. Click Python Notebook under Notebook in the left navigation panel. This will open a new notebook, with the results of the query loaded in as a dataframe.
  4. The first input cell is automatically populated with datasets[0].head(n=5). Run this code so you can see the first five rows of the dataset.

datasets[0] is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.

Now you’re all ready to go.

Getting oriented with the data

This lesson uses data from Watsi. For a data dictionary with more information, click here.

Since you’ll be using pandas methods and objects, import the pandas library. Then, give the DataFrame a variable name and use the .head() method to preview the first five rows.

Input

import pandas as pd
    
Input

data = datasets[0] # assign SQL query results to the data variable
data = data.fillna('') # replace missing values with '' as in the previous lesson
    
Input

data.head()
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform
0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop
1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop
2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards Desktop
3 2016-02-05 21:19:30 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2... MICHEL O Desktop
4 https://watsi.org/fund-treatments 2016-02-14 19:30:08 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... ANDREE N watsi.org Desktop

Deriving new columns from existing data

As you remember from the previous lesson, people used different platforms (iPhone, Windows, OSX, etc) to view pages on Watsi's site. Count the values in the platform column to get an idea of the distribution (for a quick refresher on distributions, check out this lesson:

Input

data['platform'].value_counts()
    
Output

Desktop       3934
iPhone         515
Android        388
iPad           154
Opera Mini       7
IEMobile         1
BlackBerry       1
Name: platform, dtype: int64
    

But say that instead, you want to compare Mobile and Desktop, treating all mobile devices as one way of interacting with Watsi’s site. Maybe you have a thesis about how people are more likely to search for Watsi on their desktop computer, but not on their phone. To do this, you need to create a new value for every row with one of two possible values: “Mobile” or “Desktop.” You can do this by creating a derived column based on the values in the platform column. This new column is what’s known as a derived column because it’s been created using data from one or more existing columns.

To get the feel for this, start by creating a new column that is not derived from another column.

Creating a new column

Creating a column is much like creating a new key-value pair in a dictionary. By assigning values to the new column name, you add a column to the DataFrame:

Input

data['new'] = 2 # the value for all rows
    
Input

data[:3] # let's see that new column
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform new
0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop 2
1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop 2
2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards Desktop 2

Make sure you scroll all the way to the right to check out the new column you just made.

Just as you saw with dictionaries in the first lesson, assigning values to an existing column will overwrite that column:

Input

data['new'] = 'overwritten'
data[:3]
    
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform new
0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop overwritten
1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop overwritten
2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards Desktop overwritten

This is a simple example—you’ve just set the value for every row to be the same. In reality, you’ll almost never have use for a column where the values are all the same number. Instead, you’ll use functions to determine the value in each row of your new column.

Writing and testing Python functions

Say you wanted to compare just two categories—mobile and desktop. You can define mobile platforms in this list of strings:

Input

mobile = ['iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry']
    

You'll use this list to filter values in the platform column. But first, you’ll need to learn a few tools for comparing values.

The in operator

The handy Python operator in allows you to evaluate whether something exists in a list. The evaluation returns a boolean. Here's how you check if "iPad", "Desktop", and "Monty Python" are mobile platforms:

Input

print 'iPad' in mobile
print 'Desktop' in mobile
print 'Monty Python' in mobile
    
Output

True
False
False
    

This is very similar to the IN operator in SQL, where you might use:

SELECT * FROM tutorial.watsi_events
    WHERE platform IN ('iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry');

if control statements

Python has control statements, or pieces of logic, that will help you create your own functions. For example, you can check if the "Opera Mini" platform is in the mobile list and then print something if it returns a boolean of True.

Input

if 'Opera Mini' in mobile:
    print 'great success.'
    
Output

great success.
    

If statements must result in a True or False. If the if statement results in True, as in the above case, it will execute the code after the colon. Otherwise, it does not execute the code after the colon, like this:

Input

if 'The Marriage of Figaro' in mobile:
  print 'great success.'
    

'The Marriage of Figaro' is not in the mobile list, so the above statement evaluates to False, skips the code indented after the colon, and nothing is printed. This is very similar to how the CASE statement works in SQL.

This little bit of logic opens up a world of possibilities. Using an if statement, you can write a function that decides what to do based on the values you find.

else and elif control statements

If a value is not found in the mobile list, you might want to do something else with it. If the if statement evaluates to false, as the last one did, you might want the function to take a different action.

Input

if 'The Marriage of Figaro' in mobile:
 print 'great success.'
else:
 print 'that is immobile.'
    
Output

that is immobile.
    

The code after else: will execute when the if statement returns False. The keyword elif, similarly, would evaluate if nothing before it had returned True. The statement runs from top to bottom, and if a statement evaluates to True, it executes the code after the colon, and then does not look at any other elif or else statements in the series. For example:

Input

operas = ['The Marriage of Figaro', 'The Magic Flute', 'La traviata']

if 'The Marriage of Figaro' in mobile: print 'grave success.' elif 'The Marriage of Figaro' in operas: print 'that is a gravely beautiful piece.' else: print 'that is immobile.'

Output

that is a gravely beautiful piece.
    

As you can see, the else statement was not executed because the elif statement evaluated to True and ran the print statement 'that is a gravely beautiful piece.'

Defining functions

Functions are reusable code blocks that you can use to perform a single action. Functions can take in values (called "parameters" or "arguments") and perform logic. For more on the basics of functions, click here.

In the last statement you wrote, you performed logic using the if statement. Here's how you might rewrite it to take an argument:

Input

def is_in_mobile(platform): #define the "is_in_mobile" function, which accepts an argument called "platform"
    if platform in mobile:
        print 'great success.'
    

Let's break this down:

  • def — the keyword that signals to Python that you're writing a function
  • is_in_mobile — the name of your function, which can be anything you like, even my_hovercraft_is_full_of_eels()
  • (platform): — the parameter(s)

Now you can give the function a value, and it will execute the code you defined. Check to see if the BlackBerry phone is in the list mobile:

Input

is_in_mobile('BlackBerry')
    
Output

great success.
    

Parameters and arguments

The parameter is a very important part of the function. Think of it as a temporary variable name you use when you define the function, but that gets replaced when you run the function. In the above example, platform is the parameter. Functions can have many parameters—just look at the .plot() function you used in an earlier lesson.

When you run the function, the thing that replaces the parameter is called the argument. In the above example, 'BlackBerry' is the argument.

Here’s another example of a function in action, this time adding on an else statement:

Input

def is_in_mobile(platform):
    if platform in mobile:
        print 'great success.'
    else:
        print ':('
    
Input

is_in_mobile('SnackBerry')
    
Output

:(
    

return statements

Let's add another layer by writing a function that will allow you to label records as either 'mobile' or 'desktop'. To do this, you’ll use return statements.

A return statement is simple—it tells the computer "this is the result. Make it available for further use and end the if statement here." A return statement is different from a print statement, because when it executes, return makes the value available to store as a variable or to use in another function. print simply makes the value appear on the screen.

The function below takes in a platform argument and checks if the platform is in the mobile list. Note that after each of these if/else statements, there’s a return statement.

If platform is in the mobile list, it returns "Mobile" and terminates there. If the platform is't in the mobile list, the function continues to the next evaluation—whether platform is equal to "Desktop"—and so forth.

Try it out by first writing a function that accepts the platform argument:

Input

def filter_desktop_mobile(platform):
    if platform in mobile:
        return 'Mobile'
    elif platform == 'Desktop':
        return 'Desktop'
    else:
        return 'Not Known'
    

Now try running that function with 'Android' as the argument. This will effectively replace the word platform in the above function with 'Android' and then return the result. In this case, the returned result will be printed because it is the only output from the cell above:

Input

filter_desktop_mobile('Android')
    
Output

'Mobile'
    

The real use of return as opposed to print is the fact that you can assign the value to a variable name. For example:

Input

what_is_android = filter_desktop_mobile('Android')
print what_is_android
    
Output

Mobile
    

Functions best practices

Keep functions simple, do one thing.

Generally, functions should only do one logical thing. This is up to your interpretation, of course, but ask any seasoned programmer or data scientist for their advice (and war stories), and you'll find out that keeping it simple is the key to sanity.

Indent with spaces

As you saw above, the code inside for and if statements is indented. In other languages such a SQL and JavaScript, whitespace only matters for readability. But in Python, tabs and spaces can change what the code means. The notebook will also help automatically indent your code, to the customary 4-space indentation.

Test all the cases (if you can)

You can test your function to make sure it does what you expect. Testing is a big part of analysis, and helps you ensure that your code is working as expected. Go ahead and test some of the possible cases:

Input

print filter_desktop_mobile('iPhone')
print filter_desktop_mobile('Desktop')
print filter_desktop_mobile('Paper Plane')
    
Output

Mobile
Desktop
Not Known
    

Success! The function did what was expected, given some likely values.

Applying functions to DataFrames

The .apply() method allows you to apply a function to a column of a DataFrame. You can put the values of the existing platform column through the filter_desktop_mobile function you wrote and get a resulting Series:

Input

data['platform'].apply(filter_desktop_mobile)
    
Output

0       Desktop
1       Desktop
2       Desktop
3       Desktop
4       Desktop
5       Desktop
6       Desktop
7       Desktop
8       Desktop
9        Mobile
10       Mobile
11      Desktop
12       Mobile
13      Desktop
14      Desktop
15       Mobile
16       Mobile
17      Desktop
18      Desktop
19      Desktop
20       Mobile
21      Desktop
22      Desktop
23      Desktop
24      Desktop
25      Desktop
26       Mobile
27      Desktop
28      Desktop
29       Mobile
         ...   
4970    Desktop
4971    Desktop
4972    Desktop
4973    Desktop
4974    Desktop
4975    Desktop
4976    Desktop
4977    Desktop
4978    Desktop
4979     Mobile
4980    Desktop
4981     Mobile
4982    Desktop
4983    Desktop
4984    Desktop
4985    Desktop
4986    Desktop
4987    Desktop
4988    Desktop
4989    Desktop
4990    Desktop
4991     Mobile
4992     Mobile
4993    Desktop
4994    Desktop
4995    Desktop
4996    Desktop
4997     Mobile
4998     Mobile
4999    Desktop
Name: platform, dtype: object
    

This series looks as expected—just "Desktop" and "Mobile" values. You can store these values in a new column using the following code:

Input

data['platform_type'] = data['platform'].apply(filter_desktop_mobile)
    

Selecting multiple columns

To select multiple columns, you can pass a list of column names you want to select into the square brackets:

Input

data[['platform','platform_type']][14:18] # rows with different values to make sure it worked
    
Output
platform platform_type
14 Desktop Desktop
15 Android Mobile
16 iPhone Mobile
17 Desktop Desktop

Looks like the function worked.

Now count the values and use a bar chart to see how these the platforms stack up:

Input

data['platform_type'].value_counts().plot(kind='bar')
    
Output

      
    

Practice Problem

Store the length of each row's referrer value in a new column. Hint: We used a method to measure length in a previous lesson.

View Solution

Practice Problem

Create a derived column from referrer_domain that filters domain types of 'organization' (for '.org') and 'company' (for '.com'), labeling any others as 'other'. Then plot a bar chart of their relative frequency. Hint: Use the in keyword creatively.

For extra bonus points, select the records that were not referred from Watsi.org, and plot their relative frequency. Hint: Think about what values are not equal to.

View Solution

Lesson summary:

In this lesson, you learned about:

  • Derive new columns from existing data
  • Write and test functions using in, if, else, and elif
  • Define functions using parameters and arguments
  • Apply functions to DataFrames using .apply()
  • Select multiple columns

In the next lesson, you'll learn about grouping data for comparison.

Next Lesson

Pandas .groupby(), Lambda Functions, & Pivot Tables

Get more from your data

Your team can be up and running in 30 minutes or less.