Preparation
Alina Zhang
Aug 1, 2018
Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for
exploring, cleaning, and preparing structured and unstructured data.
There are 5 principles worth knowing before you start preparing data with
Dataprep.
The deduplicate transform is case-sensitive. If a column has the values Darren and
DARREN, the rows containing those values are not considered duplicates and are not
removed by this transform.
Whitespace at the beginning and end of values is not ignored either.
Normalize your data before applying the deduplicate transform. For example, use the
LOWER function to make the case of each entry in a column consistent, then apply
the TRIM function to remove leading and trailing whitespace.
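The effect of normalizing first can be sketched outside Dataprep in plain Python. This is only an illustration, not Dataprep's implementation: the helpers `normalize_column` and `deduplicate` and the sample rows are hypothetical, with `str.lower` and `str.strip` standing in for the LOWER and TRIM functions.

```python
def normalize_column(rows, column):
    """Lowercase and trim one column (stand-in for LOWER followed by TRIM)."""
    return [{**row, column: row[column].strip().lower()} for row in rows]


def deduplicate(rows):
    """Drop rows that exactly match an earlier row.

    The comparison is case- and whitespace-sensitive, like the transform
    described above.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: the first two rows differ only in case and spacing.
rows = [
    {"name": "Darren", "city": "Toronto"},
    {"name": "DARREN ", "city": "Toronto"},
    {"name": "Alina", "city": "Toronto"},
]

without_normalizing = deduplicate(rows)
with_normalizing = deduplicate(normalize_column(rows, "name"))

print(len(without_normalizing), len(with_normalizing))
```

Without normalization all three rows survive; after lowercasing and trimming the name column, the two Darren rows collapse into one.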
Union operations should be performed later in the recipe. By doing them later in
the process, you minimize the chance of changes to the union operation, including
dataset refreshes, affecting the recipe and the output.
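One reading of this guidance, sketched in plain Python rather than as a Dataprep recipe: clean each source on its own, and append the datasets only as the final step. The `clean` and `union` helpers and the monthly datasets are hypothetical names used for illustration.

```python
def clean(rows):
    """Hypothetical per-source cleaning step: trim and lowercase the name field."""
    return [{**row, "name": row["name"].strip().lower()} for row in rows]


def union(*datasets):
    """Append datasets row-wise, in the order given."""
    combined = []
    for dataset in datasets:
        combined.extend(dataset)
    return combined


# Hypothetical sources that may be refreshed independently.
january = [{"name": " Darren"}]
february = [{"name": "ALINA "}]

# Cleaning happens per source; the union is the last step, so a refreshed
# source only changes what flows into the final append.
combined = union(clean(january), clean(february))
print(combined)
```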
Instead of comparing data row by row, compare the statistics in the generated
profile with the statistics generated from the source, so that you can see whether
your recipe has introduced unwanted changes to these values.
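As a rough sketch of the idea, assuming a numeric column and a handful of summary statistics (Dataprep's profile is richer than this), the comparison might look like the following. The `profile` and `changed_stats` helpers and the price values are hypothetical.

```python
import statistics


def profile(values):
    """Summarize a numeric column with a few statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


def changed_stats(source, output, tolerance=1e-9):
    """Return the statistics whose values differ between two profiles."""
    return {
        name: (source[name], output[name])
        for name in source
        if abs(source[name] - output[name]) > tolerance
    }


# Hypothetical column before and after a deduplicate step.
source_prices = [10.0, 12.5, 12.5, 40.0]
output_prices = [10.0, 12.5, 40.0]

diff = changed_stats(profile(source_prices), profile(output_prices))
print(sorted(diff))
```

Here the count and mean change while the minimum and maximum do not, which is what you would expect from removing one duplicate row; a shift in the minimum or maximum would be worth investigating.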
5. Keep recipe records after profiling source data
For record keeping, click View Recipe to copy and paste the recipe used to create
the profile. You can also use Download Recipe to save it as a text file.
These are the 5 principles worth knowing before you start working on your
datasets with Google Cloud Dataprep. If you have any questions about building data
pipelines or training machine learning models on the cloud, feel free to leave me a
message. Thanks for reading.
I have seen your post; it's very useful to me, and I have a question. I just tried
to import the 30 crs data set from BigQuery to Google Cloud Dataprep, but only
36,590 rows were saved into Dataprep. How do I get the complete 30 cr data set into
Dataprep? If you find any links to clarify…
Thank you for coming with this practical question. I have the answer for you.
"When your data is first loaded into the Transformer page, a sample of the data in
your dataset is displayed in the data grid."
Jimmy Chu
Sep 13, 2018
What is the largest dataset, by number of records, on which Dataprep can
efficiently perform deduplication?
It depends on how you define "efficiently". Data cleaning and transformation on the
sample dataset happen in real time. The job on the entire dataset is executed
after you submit it. And the running time…