Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep. The lab exercises are based on a mock use case scenario.

Use Case Scenario:

You work for a technical services company that sells three monthly subscription products:

The company occasionally offers promotional discounts – therefore, some product prices may be slightly lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code over the course of three years.

In order to do this, you'll need to join your customer contacts data source (where zip code information resides) with sales data from your purchases data source. Once you've joined the data, you'll need to aggregate the results.
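In dataframe terms, the overall plan looks roughly like the sketch below (a minimal pandas illustration, not Dataprep's API; the zipcode column name is an assumption, while customer_id and ticket_price appear later in the lab):

```python
import pandas as pd

# Hypothetical sketch of the lab's end goal, expressed with pandas.
customers = pd.read_csv("lab_customers.csv")          # contacts, incl. zip codes
purchases = pd.read_csv("lab_2013_transactions.csv")  # one year of sales data

# Join purchases to contacts on the shared key, then aggregate by zip code.
joined = purchases.merge(customers, on="customer_id", how="left")
by_zip = joined.groupby("zipcode")["ticket_price"].agg(["count", "sum"])  # "zipcode" is assumed
```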

What you'll learn

What you'll need

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project:

Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!).

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few cents, but it could be more if you decide to use more storage or if you do not delete your objects (see the "Delete a bucket" section at the end of this document). Google Cloud Storage pricing is documented here.

New users of Google Cloud Platform are eligible for a $300 free trial.

Codelab-at-a-conference setup

The instructor will share temporary accounts with existing projects that are already set up, so you do not need to worry about enabling billing or about any costs associated with running this codelab. Note that all of these accounts will be disabled soon after the codelab is over.

Once you have received a temporary username and password from the instructor, log in to the Google Cloud Console: https://console.cloud.google.com/.

Here's what you should see once logged in:

Open a new tab and go to https://clouddataprep.com.

If prompted, click the "Sign-in with Google" button and enter your Qwiklabs username and password.

Click the Allow button to grant Cloud Dataprep access to your GCP project.

Check the box to accept the Google Cloud Dataprep license agreement and then click "Accept".

You will need to select the Cloud Storage bucket that you created earlier. In the setup window, click the name of your Cloud Storage bucket:

Inside your GCP bucket, click the folder named "next-2017-cdp". Click Use this folder to finish setting up Dataprep.

To begin wrangling your data, you first need to create a flow. A flow is composed of a set of related datasets and visualizes the connections between them.

Click the Create Flow button:

In the Create Flow pop-up box, specify the following:

You can leave the flow description blank.

At this point, your flow serves as an empty container. In order to wrangle data, you need to import data and then add the imported data to your flow.

Click on the Import & Add Datasets button to browse for data files on Google Cloud Platform:

Our sample data lives in Google Cloud Storage. Click on the GCS button to open the GCS file browser:

In the GCS file browser, navigate to the storage bucket that you created in the "Setup and Requirements" section at the beginning. Click on the name of the folder called "next-2017-cdp".

Click on the + icon next to the following datasets to import them into Cloud Dataprep:
lab_customers.csv
lab_2013_transactions.csv
lab_2014_transactions.csv
lab_2015_transactions.csv

You will see four cards appear on the right side of the screen:

Once you see all four cards appear, click on the blue Import Datasets button at the bottom of the screen to add the datasets to your flow.

Cloud Dataprep will bring you back to the Flow View page. There may be a brief pause before the page fully loads. On the left side of the screen, you will see a visual representation of the four datasets that you have added to your flow:

In the Flow View, the black icons represent each of your imported datasets. An imported dataset is the raw source data that resides on Google Cloud Storage.

The icon with a scroll inside a circle represents the data preparation recipe that you will apply to your imported datasets.

The square icon with a table image represents your wrangled datasets. In Cloud Dataprep, a wrangled dataset is a logical representation of your imported datasets after you have applied a data preparation recipe.

Notice that Cloud Dataprep has automatically created an initial data preparation recipe for each of your four imported datasets. This initial recipe contains structuring logic. You can see the steps contained in a recipe by clicking on the recipe icon:

You will see the automatically-applied parsing logic listed on the right-hand Details panel.

Now that you have created a flow, let's design a data preparation recipe to clean the customers dataset. We can add to the existing data preparation recipe that Cloud Dataprep has already created. Click on the recipe icon associated with the lab_customers.csv imported dataset:

In the Details panel for the lab_customers recipe, click on the blue button that reads Edit Recipe.

Cloud Dataprep opens the Transformer Grid. This is the visual, Excel-like interface where you can design the steps in your data preparation recipe. When you open the Transformer Grid, Cloud Dataprep automatically profiles the contents of your dataset and generates column-level histograms and data quality indicators. This profile information can be used to guide your data preparation process.

Scroll over to the "start_date" column.

Examine the horizontal bar at the top of the "start_date" column:

This is the data quality bar. The green part of the data quality bar represents valid values. The black part of the data quality bar represents missing or null values. Clicking on sections of the data quality bar will generate suggestions that contain data quality conditionals. These conditionals test whether each record is valid, empty, or invalid, depending on the section of the bar that you clicked.
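For intuition, the data quality bar is a per-column summary of valid versus missing values. If the same data were loaded into a pandas dataframe named customers (a rough sketch, not how Dataprep computes its profile):

```python
# Fraction of missing values per column -- roughly what the black
# portion of each data quality bar represents.
missing_fraction = customers.isna().mean()
print(missing_fraction["start_date"])
```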

We need to apply a filter to the rows to remove contacts where the "start_date" column is empty. To do this, click on the black part of the data quality bar.

Cloud Dataprep generates a list of suggested transformations based on your selection. Click on the card that reads "Delete rows where ismissing([start_date])":

Notice that after clicking on the card, Cloud Dataprep updates the data to show you a preview of this transformation. The rows highlighted in red will be removed from your dataset. Click on Add to Recipe to add this step to your recipe:
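In dataframe terms, this recipe step behaves like dropping every row whose start_date is missing. A minimal pandas equivalent of "Delete rows where ismissing([start_date])":

```python
# Keep only rows that have a non-missing start_date.
customers = customers[customers["start_date"].notna()]
```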

Next, look at the end_date column. Based on the data quality bar, you'll notice that a large number of rows have missing values. To make this column easier to work with, you need to insert a placeholder value (January 01, 2050) into those empty rows.

Click the black section of the data quality bar for the end_date column. This will generate another set of suggested transformations.

Scroll through the list of suggestions and click on the card that reads, "Set end_date to IFMISSING(end_date, NULL())":

This card will insert an explicit NULL() for each of the records with a missing end date. This suggestion is not exactly what we want to do, so we need to modify the underlying script line. Click Modify:

This will open the New Step builder. Cloud Dataprep's suggested transformation has already been populated, but you can make adjustments to the code.

Modify the text in the Formula box to read: ifmissing(end_date, '2050/01/01')

Click Add to Recipe.
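The modified step is a conditional fill: wherever end_date is missing, substitute the sentinel date. A pandas sketch of the same logic:

```python
# Mirror ifmissing(end_date, '2050/01/01'): replace missing end dates
# with a far-future placeholder so downstream date logic has no gaps.
customers["end_date"] = customers["end_date"].fillna("2050/01/01")
```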

You have now addressed the data quality problems in the customers dataset.

Let's now switch gears and operate on the transactions datasets. Click on the Dataprep Codelab flow name at the top of the screen:

This brings you back to the Flow View. We want to create a single dataset that unions the transactions datasets from 2013, 2014, and 2015. To do this, we will create a new recipe that branches from the lab_2013_transactions wrangled dataset. Click on the lab_2013_transactions wrangled dataset:

On the Details panel for the lab_2013_transactions wrangled dataset, click the three dots icon:

In the drop-down menu, click Add New Recipe:

Cloud Dataprep will create a new recipe and wrangled dataset named lab_2013_transactions - 2:

Click on this new wrangled dataset. In the Details panel, click on the three dots icon. From the drop-down menu, choose Edit name and description.

In the Edit Dataset pop up window, change the dataset name to read "Combined Transactions":

Click OK.

From the Details panel, click Edit Recipe. This will open the new "Combined Transactions" recipe in the Transformer Grid. Notice that the data in the grid is the structured data from the lab_2013_transactions.csv dataset.

You can combine multiple datasets with the same schema using a Union transform. In the New Step box, type "Union" and hit the "enter" key on your keyboard:

This will open the Union Tool.

The Union Output field displays the output schema for your dataset. Each box represents a column. Cloud Dataprep bases the output schema on the schema of the dataset from which you initiated the union transform. In this case, the columns in the Combined Transactions dataset determine the columns that will appear in the combined output.

Click on the Add Dataset button:

Place a checkmark next to the lab_2014_transactions dataset and the lab_2015_transactions dataset:

Click the button that reads Add Dataset and Align by Name.

Examine the column-to-column mappings. Click Add to Recipe to combine all three datasets:

After adding the union to your script, examine the column histogram for the transaction_date column:

Notice that this dataset now includes records from January 2013 through December 2015.
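Conceptually, the Union transform is a row-wise concatenation of datasets whose columns are aligned by name. A pandas sketch with hypothetical dataframe names for the three yearly files:

```python
import pandas as pd

# Stack the 2013, 2014, and 2015 transaction tables into one dataset;
# concat aligns columns by name, like "Add Dataset and Align by Name".
combined = pd.concat([tx_2013, tx_2014, tx_2015], ignore_index=True)
```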

Click on the Dataprep Codelab flow name to return to the flow view.

The flow visualization has updated to show how the three transactions datasets combine to form the combined transactions dataset:

We now want to enrich the transactions data with information about where each purchase was made. To do this, we will join the customer data to the transactions data. When performing a join, you should generally treat the larger dataset as the master dataset, or the "left side" of the join. The smaller dataset should be the detail dataset, or the "right side" of the join. In Cloud Dataprep, the dataset from which you initiate a join automatically becomes the master dataset.
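As a sketch, the same join in pandas with the combined transactions as the left (master) side and the customers as the right (detail) side would look like this:

```python
# Left join: every transaction row is kept, and matching customer
# attributes are appended wherever customer_id lines up.
enriched = combined.merge(customers, on="customer_id", how="left")
```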

Click on the "Combined Transactions" wrangled dataset:

In the Details panel, click Edit Recipe:

In the New Step box, type Join and hit the "enter" key on your keyboard:

Click the "lab_customers" dataset:

Click the Preview Selected Dataset button:

Cloud Dataprep will display a preview of the detail dataset.

Click on the Join Keys tab to begin designing your join:

Note that the join will be performed on the customer_id column. Cloud Dataprep tries to automatically infer the correct join keys.

Examine the fields in the data preview.

Click on the customer_id (lab_customers) field to remove it from the output dataset.

Click on these fields to add them to the join:

Click the Add to Recipe button:

Notice that the preview in the Transformer Grid now shows the result of your join.

As a final step, we want to aggregate transactions by product, year, and state.

First, we need to create a new column that contains the year of each transaction. Click the downward-facing arrow at the top of the "transaction_date" column:

Click Find:

Click Year (YYYY):

Notice that this transformation will create a new column that contains the year of each transaction.

Click Add to Recipe:
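The extraction is equivalent to parsing the date and keeping only its year component. Assuming the column is named transaction_date, a pandas sketch:

```python
# Derive a year column from transaction_date, like the Year (YYYY) suggestion.
enriched["year"] = pd.to_datetime(enriched["transaction_date"]).dt.year
```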

Click in the Choose a transformation box in the New Step panel, and choose "aggregate" from the list:

In the functions box, type min(ticket_price) and then hit enter. Repeat for the following additional functions:

Click the text box in the Group by field and select the columns for product and state, along with the new year column you created above:

Click the Add to Recipe button:
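The aggregate step collapses the joined data into one row per product/year/state group, computing the chosen statistics over ticket_price. A rough pandas equivalent (the product and state column names are assumptions, and the function list depends on what you entered above):

```python
# Group by product, year, and state; compute per-group statistics of
# ticket_price (min shown; add max/mean/sum to match your recipe).
summary = (
    enriched
    .groupby(["product", "year", "state"])["ticket_price"]
    .agg(["min"])
    .reset_index()
)
```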

Now that you have finished preparing your data, you're ready to produce a results file on Google Cloud Storage. Cloud Dataprep executes your data transformation recipe to produce your output file using the Dataflow engine.

Click Run Job at the top right of the Transformer Grid:

In the Run Job on Dataflow page, you can configure your output settings. By default, Cloud Dataprep will create a CSV file on GCS.

Click Run Job to kick off your Dataflow job:
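Once the Dataflow job finishes, the result is an ordinary CSV file on Cloud Storage. If you want to sanity-check it outside Dataprep, you can read it with pandas (the path below is hypothetical, and gs:// URLs require the gcsfs package):

```python
import pandas as pd

# Quick spot-check of the job's output file straight from Cloud Storage.
results = pd.read_csv("gs://your-bucket/path/to/output.csv")
print(results.head())
```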

Cloud Dataprep is that simple! It's easy to cleanse and enrich multiple data sources using an intuitive, visual interface.

What we've covered