BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage and don't need a database administrator. BigQuery uses SQL and it can take advantage of pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In this lab, we will load a dataset into BigQuery and query it.
If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign-in to Google Cloud Platform console (console.cloud.google.com) and create a new project:
Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!).
Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost you more than a few cents, but it could be more if you decide to use more storage or if you do not delete your objects (see "Delete a bucket" section at the end of this document). Google Cloud Storage pricing is documented here.
New users of Google Cloud Platform are eligible for a $300 free trial.
Navigate to the the Google Cloud Console from another browser tab/window, to https://console.cloud.google.com. Use the login credential given to you by the lab proctor.
In the Google Developer Console, click the Menu icon on the top left of the screen:
Scroll down the menu to the bottom, and click BigQuery:
This will open up the BigQuery console in a new browser tab, that looks like this:
But, there is nothing in here! Luckily, there are tons of Open Datasets available in BigQuery for you to query and of course you can upload your own data which we'll do in the next section.
First we need to create a new dataset in the project. A dataset is composed of multiple tables. To create a dataset, click the dropdown icon next to the project name, and click Create new dataset:
lab as the Dataset ID:
Click OK to create the dataset. The dataset is empty at the moment. We'll need to load the latest Wikimedia data.
First, from the Google Developer Console (not the BigQuery Console), open Google Cloud Shell, It is a command line environment running in the Cloud. This Debian-based virtual machine is loaded with all the development tools you'll need (
git and others) and offers a persistent 5GB home directory. Open the Google Cloud Shell by clicking on the icon on the top right of the screen:
Once Cloud Shell is opened, you'll need to download the latest Wikimedia dataset using
wget, followed by importing the data with
From the Wikimedia raw data dump page, navigate to data for January, 2016. Copy one of the links of the data file, such as
pagecounts-20160108-110000.gz, and then download that file using
wget in Cloud Shell:
$ wget https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/pagecounts-20160108-110000.gz
The downloaded file is a GZip'ed CSV file. You can load this file directly using the
bq command line utility. As part of the load command, you'll also describe the schema of the file. Make sure to replace
YourProjectID with your project id before running the command:
$ bq load -F" " \ --quote "" \ YourProjectID:lab.pagecounts_20160108_11 \ pagecounts-20160108-110000.gz \ language,title,requests:integer,content_size:integer
You can learn more about the bq command line in the documentation.
Go back to the BigQuery Console, and see that the table is created in the dataset (you may need to refresh the console if it was previously opened) and select it.
The select the Details option in the right hand panel. This will display information about the table data, including the size of the table and the number of rows. Note that the table contains 560 MB of data.
Click Compose Query on the top left:
This will bring up the New Query view:
Let's find out the total number of Wikimedia requests by writing this query:
SELECT SUM(requests) FROM [lab.pagecounts_20160108_11]
Click Run Query:
The query will take second or so to run. Once complete you'll see a message saying that the query completed along with how long it took and how much data was processed. The actual result of the query will be listed at the bottom which in this cases is the total number of requests for pages
This effectively ran a query across all rows in the table but only processed 56.2 MB of data (recall that the size of the dataset is 560 MB). BigQuery stores data by columns and not rows, this means that it only has to read data for columns specified in the query. In this case just that was only the ‘requests' column.
BigQuery also only reads data from disk once and will automatically scale queries across large numbers of machines. This makes it extremely efficient and extremely fast, even when querying huge datasets (TeraBytes and beyond).
You've learned how to import CSV files into BigQuery. You can also import JSON files and/or stream data into BigQuery using the API. Finally, for very large datasets, you can upload the data file into Google Cloud Storage first and then import that into BigQuery. Learn more about loading data into BigQuery.
bqcommand line tool