BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without any infrastructure to manage and without a database administrator. BigQuery uses familiar SQL and a pay-as-you-go pricing model, so you can focus on analyzing data to find meaningful insights.

In this lab, we will explore the Wikipedia dataset using BigQuery.

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project:

Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!).

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few cents, but it could be more if you decide to run larger queries of your own. BigQuery pricing is documented online.

New users of Google Cloud Platform are eligible for a $300 free trial.

In the Google Cloud Platform Console, click the Menu icon on the top left of the screen:

Scroll down the menu to the bottom, and click BigQuery:

This will open up the BigQuery console in a new browser tab that looks like this:

But there is nothing in here! Luckily, there are tons of public datasets available in BigQuery for you to query. We'll query the Wikipedia dataset in the next section.

We need a dataset to query. We'll talk about loading data into BigQuery in another lab. In this section, we'll query the Wikimedia pageviews dataset, one of the many public datasets available in BigQuery today, alongside Hacker News, GitHub, GDELT (news events), and many more.

For this lab, we'll use the Wikimedia public dataset. To add the dataset, visit this URL:

https://bigquery.cloud.google.com/table/bigquery-samples:wikimedia_pageviews.201112

Click on bigquery-samples:wikimedia_pageviews; this is the dataset we'll use.

Scroll down, find the table 201112, and click it:

You can see the table schema in the Schema view on the right:

You can find out how much data is in the table by navigating to the Details view on the right:

Alright! Let's write a query to find the most popular Wikipedia pages in December 2011 using BigQuery.

Click Compose Query on the top left:

This will bring up the New Query view:

Let's find out the total number of Wikimedia views in December 2011 by writing this query:

SELECT SUM(views)
FROM [bigquery-samples:wikimedia_pageviews.201112]

Before we run the query, for the purpose of this lab, let's disable data caching so that we are not using any cached results. Click Show Options:

Then, uncheck Use Cached Results:

Click Run Query:

In a few seconds, the result will be listed at the bottom, and it'll also tell you how much data was processed:

In seconds, we queried a 105 GB table, but we only needed to process 12.2 GB of data to get the result! This is because BigQuery is a columnar database: since we only queried a single column, the total amount of data processed is significantly less than the total table size.
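
Because billing follows the columns a query actually reads, a query that reads no column data at all is cheaper still. As a minimal sketch against the same table: a plain row count can be answered from table metadata, and BigQuery typically reports it as processing 0 B:

SELECT COUNT(*)
FROM [bigquery-samples:wikimedia_pageviews.201112]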

Find Wikipedia page views

The Wikimedia dataset contains page views for all of the Wikimedia projects (including Wikipedia, Wiktionary, Wikibooks, Wikiquote, etc.). Let's narrow the query down to just Wikipedia pages by adding a WHERE clause:

SELECT SUM(views), wikimedia_project
FROM [bigquery-samples:wikimedia_pageviews.201112]
WHERE wikimedia_project = "wp"
GROUP BY wikimedia_project

Notice that, by querying an additional column, wikimedia_project, the amount of data processed increased from 12.2 GB to 18 GB.
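
With GROUP BY and ORDER BY we can also answer the question we started with: the most popular Wikipedia pages in December 2011. Here's a sketch (the LIMIT simply keeps the result list short):

SELECT title, SUM(views) views
FROM [bigquery-samples:wikimedia_pageviews.201112]
WHERE wikimedia_project = "wp"
GROUP BY title
ORDER BY views DESC
LIMIT 10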

BigQuery supports many familiar SQL features, such as CONTAINS, GROUP BY, ORDER BY, and a number of aggregation functions. In addition, you can also use regular expressions to query text fields! Let's try one:

SELECT title, SUM(views) views
FROM [bigquery-samples:wikimedia_pageviews.201112]
WHERE wikimedia_project = "wp"
AND REGEXP_MATCH(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC
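
As an aside, when you only need a substring match, the CONTAINS operator mentioned above is a simpler alternative to a regular expression. A sketch in the same legacy SQL dialect:

SELECT title, SUM(views) views
FROM [bigquery-samples:wikimedia_pageviews.201112]
WHERE wikimedia_project = "wp"
AND title CONTAINS 'Red'
GROUP BY title
ORDER BY views DESC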

Query across multiple tables

You can query across multiple tables too, by adding multiple tables in the FROM clause to form a union:

SELECT title, SUM(views) views
FROM
  [bigquery-samples:wikimedia_pageviews.201112],
  [bigquery-samples:wikimedia_pageviews.201111]
WHERE wikimedia_project = "wp"
AND REGEXP_MATCH(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC

Or, you can select a range of tables to form the union using a table wildcard function. Let's query over the entire year of 2011 by querying tables with "2011" as a prefix:

This query runs over tables totaling 1 TB in size, but only needs to process 672 GB of data!

SELECT title, SUM(views) views
FROM
  TABLE_QUERY([bigquery-samples:wikimedia_pageviews],
    'REGEXP_MATCH(table_id, r"^2011[\d]{2}")')
WHERE wikimedia_project = "wp"
AND REGEXP_MATCH(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC
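
The second argument to TABLE_QUERY is an ordinary expression over table metadata, so you can carve out any range of tables. For instance, here's a sketch that restricts the union to just the last quarter of 2011 (the monthly tables 201110 through 201112):

SELECT title, SUM(views) views
FROM
  TABLE_QUERY([bigquery-samples:wikimedia_pageviews],
    'REGEXP_MATCH(table_id, r"^20111[0-2]$")')
WHERE wikimedia_project = "wp"
AND REGEXP_MATCH(title, 'Red.*t')
GROUP BY title
ORDER BY views DESC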

It wasn't too hard to query that much data!

BigQuery is that simple! With basic SQL knowledge, you can now query terabytes of data!

What we've covered

- How to query public datasets in BigQuery using familiar SQL
- How BigQuery's columnar storage keeps the data processed well below the total table size
- How to query across multiple tables, including with the TABLE_QUERY table wildcard function

Next Steps

- Try loading your own data into BigQuery (we'll cover that in another lab)
- Explore the other public datasets, such as Hacker News, GitHub, and GDELT