Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.
Comparison operators on numerical data
The most basic way to filter data is using comparison operators. The easiest way to understand them is to start by looking at a list of them:
|Not equal to|
|Greater than or equal to|
|Less than or equal to|
These comparison operators make the most sense when applied to numerical columns. For example, let’s use
> to return only the rows where the West Region produced more than 30,000 housing units (remember, the units in this data table are already in thousands):
WHERE west > 30
Try running that query with each of the operators in place of
>. Try some values other than
30 to get a sense of how SQL operators work. When you’re ready, try out the practice problems.
Did the West Region ever produce more than 50,000 housing units in one month?
Did the South Region ever produce 20,000 or fewer housing units in one month?
Comparison operators on non-numerical data
All of the above operators work on non-numerical data as well.
!= make perfect sense—they allow you to select rows that match or don’t match any value, respectively. For example, run the following query and you’ll notice that none of the January rows show up:
WHERE month_name != 'January'
There are some important rules when using these operators, though. If you’re using an operator with values that are non-numeric, you need to put the value in single quotes:
Note: SQL uses single quotes (as opposed to double quotes) to reference column names.
You can use
<, and the rest of the comparison operators on non-numeric columns as well—they filter based on alphabetical order. Try it out a couple times with different operators:
WHERE month_name > 'January'
If you’re using
<=, you don’t necessarily need to be too specific about how you filter. Try this:
WHERE month_name > 'J'
The way SQL treats alphabetical ordering is a little bit tricky. You may have noticed in the above query that selecting
month_name > 'J' will yield only rows in which
month_name starts with “j” or later in the alphabet. “Wait a minute,” you might say. “January is included in the results—shouldn’t I have to use
month_name >= 'J to make that happen?” SQL considers ‘Ja’ to be greater than ‘J’ because it has an extra letter. It’s worth noting that most dictionaries would list ‘Ja’ after ‘J’ as well.
Write a query that only shows rows for which the month name is February.
Write a query that only shows rows for which the
month_name starts with the letter "N" or an earlier letter in the alphabet.
Arithmetic in SQL
You can perform arithmetic in SQL using the same operators you would in Excel:
/. However, in SQL you can only perform arthimetic across columns on values in a given row. To clarify, you can only add values in multiple columns from the same row together using
+—if you want to add values across multiple rows, you’ll need to use aggregate functions, which are covered in the Intermediate SQL section of this tutorial.
The example below illustrates the use of
west + south AS south_plus_west
The above example produces a column showing the sum of whatever is in the
west columns for each row. You can chain arithmetic functions, including both column names and actual numbers:
west + south - 4 * year AS nonsense_column
The columns that contain the arithmetic functions are called “derived columns” because they are generated by modifying the information that exists in the underlying data.
Write a query that calculates the sum of all four regions in a separate column.
As in Excel, you can use parentheses to manage the order of operations. For example, if you wanted to average the
south columns, you could write something like this:
(west + south)/2 AS south_west_avg
It occasionally makes sense to use parentheses even when it’s not absolutely necessary just to make your query easier to read.
Sharpen your SQL skills
Write a query that returns all rows for which more units were produced in the West region than in the Midwest and Northeast combined.
Write a query that calculates the percentage of all houses completed in the United States represented by each region. Only return results from the year 2000 and later.
Hint: There should be four columns of percentages.