
5 Principles You Need to Know Before Using Google Cloud Dataprep for Data Preparation
Alina Zhang
Aug 1, 2018
Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for
exploring, cleaning, and preparing structured and unstructured data.

There are five principles that are important to know before you begin data preparation with Dataprep.

1. Create a baseline by profiling your source data


Before you get started cleaning your dataset, it is helpful to create a profile of the source data. First, create a minimal recipe on the dataset after it has been ingested into the Transformer page. Then, click Run Job to generate a profile of the data, which can be used as a baseline for validating your results and debugging the origin of any data problems you discover.
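
Outside the Dataprep UI, you can think of that baseline profile as a saved snapshot of per-column statistics. Here is a minimal sketch in pandas of the kind of information such a profile captures (the file names are hypothetical examples; this is an illustration, not Dataprep's own output):

```python
import pandas as pd

# Load the raw source data (hypothetical file name).
source = pd.read_csv("orders_raw.csv")

# A baseline "profile": per-column types, missing-value counts,
# and distinct-value counts, captured before any wrangling begins.
baseline = pd.DataFrame({
    "dtype": source.dtypes.astype(str),
    "missing": source.isna().sum(),
    "distinct": source.nunique(),
})
print(baseline)
print(source.describe(include="all"))

# Persist the baseline so later results can be validated against it.
baseline.to_csv("orders_baseline_profile.csv")
```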

2. Normalize data before applying the Deduplicate transform


Removing identical rows from your dataset after a uniqueness check is a common step in data preparation. Google Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset.

There are two limitations:

- This transform is case-sensitive. So, if a column has the values Darren and DARREN, the rows containing those values are not considered duplicates and cannot be removed with this transform.
- Whitespace at the beginning and end of values is not ignored.

It is therefore necessary to normalize your data before applying the deduplicate transform. For example, you can use the LOWER function to make the case of each entry in a column consistent, then apply the TRIM function to remove leading and trailing whitespace.

[Figure: source data and preview of the LOWER function]
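
The same normalize-then-deduplicate pattern can be sketched in pandas (an illustrative equivalent of the LOWER, TRIM, and deduplicate steps; the column name is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Darren", "DARREN ", "  darren", "Alina"]})

# Normalize first: lowercase and strip leading/trailing whitespace,
# mirroring Dataprep's LOWER and TRIM functions.
df["name"] = df["name"].str.lower().str.strip()

# Deduplication now treats "Darren", "DARREN ", and "  darren" as
# the same value, so only one of those rows survives.
df = df.drop_duplicates()
print(df)  # two rows remain: darren and alina
```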


3. Join early and union later
You can enrich your data by joining or unioning datasets from multiple sources. Join operations should be performed early in your recipe, so that you reduce the chance of later changes to your join keys affecting the results of the join.

Union operations should be performed later in the recipe. By doing them late in the process, you minimize the chance that changes to the union operation, including dataset refreshes, affect the recipe and the output.
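
As a rough analogy in pandas (the datasets here are hypothetical, and this is not Dataprep recipe syntax), the ordering looks like this: enrich with a join at the start of the flow, and append refreshed or additional sources at the end:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 25.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["EU", "US"]})
orders_refresh = pd.DataFrame({"customer_id": [3], "amount": [7.5],
                               "region": ["APAC"]})

# Join early, while the join key (customer_id) has not yet been
# touched by any cleaning or transformation steps.
enriched = orders.merge(customers, on="customer_id", how="left")

# ... cleaning and transformation steps would go here ...

# Union late, so a refreshed batch can be appended without
# disturbing any of the steps above it.
result = pd.concat([enriched, orders_refresh], ignore_index=True)
print(result)
```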

4. Use statistical information to evaluate generated data


After you have completed your recipe and run the job, you can open the generated profile and the baseline profile you created for the source data in separate browser tabs to evaluate how consistent and complete your data remains from the beginning to the end of the wrangling process.

Instead of comparing data row by row, compare the statistical information in the generated profile with the statistics from the source profile, so that you can identify whether your recipe has introduced unwanted changes to these values.
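
In spirit, this is a statistical diff of the two profiles rather than a row-by-row comparison. A minimal pandas sketch of the idea (hypothetical file names, and assuming the same numeric columns survive the recipe):

```python
import pandas as pd

source = pd.read_csv("orders_raw.csv")
output = pd.read_csv("orders_wrangled.csv")

# Compare column statistics, not individual rows: unexpected shifts
# in count, mean, or min/max flag unwanted changes from the recipe.
# DataFrame.compare requires both frames to share the same labels.
print(source.describe().compare(output.describe()))
```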
5. Keep recipe records after profiling source data
For record keeping, click View Recipe to copy and paste the recipe used to create the profile, or use Download Recipe to save it as a text file.

These are the five principles to know before you start working on your datasets with Google Cloud Dataprep. If you have any questions about building data pipelines or training machine learning models on the cloud, feel free to leave me a message. Thanks for reading.



Responses
Conversation between Anjana Kondisetti and Alina Zhang.
Anjana Kondisetti
Nov 16, 2018
Hi @Alina,

I have seen your post, and it's very useful to me. I have a question: I just tried to import a 30-crore dataset from BigQuery into Google Cloud Dataprep, but only 36,590 rows were saved. How do I get the complete 30-crore dataset into Google Cloud Dataprep? If you can find any links to clarify…

Alina Zhang
Nov 17, 2018
Hi Anjana,

Thank you for raising this practical question. I have the answer for you.

"When your data is first loaded into the Transformer page, a sample of the data in your dataset is displayed in the data grid."
Conversation with Alina Zhang.
Jimmy Chu
Sep 13, 2018
What is the largest dataset, by number of records, on which Dataprep can efficiently perform deduplication?

Alina Zhang
Sep 14, 2018
Good question. "Prepare datasets of any size, megabytes to terabytes, with equal ease." (from the Dataprep documentation)

It depends on how you define "efficiently". Cleaning and transforming the sample dataset happens in real time. The job on the entire dataset is executed after you submit it. And the running time…
