Google Cloud Professional Data Engineer Certification Exam - Study guide and Practice tests
I cleared the Professional Data Engineer certification recently, and with the question topics still fresh in my mind, I wanted to share some helpful tips for preparing for and acing the exam.
Why would you want to do a Google Cloud Professional Data Engineer Certification?
You may already have the skills to use Google Cloud, but how do you demonstrate this to a future employer or client? Two ways: through a portfolio of projects, or through a certification.
A certification says to future clients and employers, ‘Hey, I’ve got the skills, and I’ve put in the effort to get accredited.’
Who is it for?
If you’re already a data scientist, data engineer, data analyst, or machine learning engineer, or you’re looking for a career change into the world of data, the Google Cloud Professional Data Engineer certification is for you.
How much does it cost?
To sit the certification exam costs $200 USD. If you fail, you will have to pay the fee again to resit.
What is the exam format?
- Length: 2 hours
- Registration fee: $200 (plus tax where applicable)
- Languages: English, Japanese.
- Exam format: 50 multiple-choice and multiple-select questions, taken remotely or in person at a test center
- Exam Delivery Method:
- Take the online-proctored exam from your home
- Take the onsite-proctored exam at a testing center
- Prerequisites: None
- Recommended experience: 3+ years of industry experience including 1+ years designing and managing solutions using GCP.
Course Topics to Prepare:
You will need to be familiar with:
- 5 database products (BigQuery, Bigtable, Cloud SQL, Cloud Spanner, Datastore)
- 3 ETL tools (Dataflow, Dataprep, Dataproc)
- 5 misc products (Stackdriver, Cloud Composer, Cloud Storage, Pub/Sub, IoT Core)
- Machine learning basics and cloud basics (e.g. zones, regions)
Overwhelmed? Don't worry, let's take it one small step at a time:
BigQuery:
- Around 30% of the questions come from BigQuery, so prepare the topics below
- Understand what Google BigQuery is - a serverless, petabyte-scale cloud data warehouse
- BigQuery partitioned tables
- BigQuery table clustering
- How to use partitioning and clustering to reduce query costs
- BigQuery can automatically detect the schema of JSON files
- Understanding BigQuery streaming inserts
- BigQuery ML
- BigQuery use cases: analytics, easily working with huge volumes of historical data; ANSI SQL compliant
- Understanding access control for BigQuery tables - how it is managed via IAM and dataset permissions
- Learn about BigQuery authorized views and how they can be used to grant users access without granting permissions on the underlying tables
- BigQuery logs can be monitored using Stackdriver. Understand how to set up Stackdriver reporting for bytes scanned or for slot usage
- Understand the BigQuery slot quota per project and how it differs between the pay-as-you-go model and the fixed-price model
- Learn about BigQuery federated queries
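To make the partitioning and clustering point concrete, here is a toy model (plain Python, not BigQuery itself) of why filtering on the partition column cuts cost: on-demand queries are billed by bytes scanned, and a date filter lets the engine skip whole partitions. The table layout and sizes below are made up for illustration.

```python
from datetime import date, timedelta

# Toy model of a date-partitioned table: one 10 MB partition per day.
table = {date(2020, 7, 1) + timedelta(days=d): 10_000_000 for d in range(30)}

def bytes_scanned(table, start=None, end=None):
    """Only partitions inside the date filter are read (partition pruning)."""
    return sum(size for day, size in table.items()
               if (start is None or day >= start) and (end is None or day <= end))

full = bytes_scanned(table)  # no filter: every partition is scanned
week = bytes_scanned(table, start=date(2020, 7, 1), end=date(2020, 7, 7))
print(full, week)  # 300000000 70000000
```

Scanning one week instead of the whole month reads (and bills) less than a quarter of the bytes, which is exactly the effect the exam expects you to reason about.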
Bigtable:
- Understand Google Bigtable - a NoSQL, low-latency database
- Bigtable schema design - learn the best practices for choosing row keys
- In general, Bigtable performs well on tall and narrow tables, so limit the number of columns
- Common use cases: time-series data, IoT data, real-time stock market data
- Understand nodes, how to scale read and write throughput by choosing the number of nodes, and the trade-off between SSD and HDD disks
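One common row-key pattern for the time-series use cases above (a sketch of the general idea; the names and key layout are made up): lead with an entity id so writes spread across nodes instead of hotspotting on a timestamp prefix, and reverse the timestamp so the newest rows for each entity sort first.

```python
# Bigtable stores rows sorted lexicographically by key, so key design
# drives performance. Example key: device id first (spreads writes
# across the key space), then a zero-padded reversed timestamp
# (newest readings sort first within each device).
MAX_TS = 10**13 - 1  # larger than any epoch-millisecond value we expect

def row_key(device_id: str, ts_millis: int) -> str:
    return f"{device_id}#{MAX_TS - ts_millis:013d}"

older = row_key("sensor-42", 1_000)
newer = row_key("sensor-42", 2_000)
print(newer < older)  # True: the newer reading sorts first
```

A key that started with the raw timestamp instead would send every current write to the same node - the classic hotspotting mistake the exam likes to test.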
Cloud SQL:
- Cloud SQL is good for transactional databases where the data volume is less than 4 TB and there are at most 4,000 concurrent connections
- Choose Cloud SQL when the customer wants to port the application to the cloud as-is and does not want to spend on rewriting it
- Understand how to install the Stackdriver agent to monitor database performance
Cloud Spanner:
- Cloud Spanner is good for transactional databases that need virtually unlimited scalability, scaling of concurrent connections, and global consistency
- Cloud Spanner supports primary keys and secondary indexes. Learn how to choose them and how they impact data distribution across nodes and retrieval performance
- Learn how to improve performance by tweaking options such as the number of nodes and the schema/key design
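The key-distribution point can be illustrated with a toy range-sharding model (an assumed simplification, not Spanner's actual splitting algorithm): rows are assigned to splits by key range, so monotonically increasing primary keys all pile into one split, while hash-prefixed (or UUID) keys fan out across splits.

```python
import hashlib

# Made-up split boundaries dividing the key space into 4 ranges.
SPLIT_POINTS = ["4", "8", "c"]

def split_for(key: str) -> int:
    """Range sharding: the split is determined by where the key sorts."""
    return sum(1 for p in SPLIT_POINTS if key >= p)

def hash_prefixed(key: str) -> str:
    """Prefix the key with a short hash so contiguous ids spread out."""
    return hashlib.sha1(key.encode()).hexdigest()[:4] + "-" + key

sequential = [f"{n:06d}" for n in range(100)]  # monotonically increasing ids
plain = {split_for(k) for k in sequential}
spread = {split_for(hash_prefixed(k)) for k in sequential}
print(len(plain), len(spread))  # sequential ids pile into one split; hashed keys spread
```

This is the same reasoning behind the exam's advice to avoid timestamps or sequence numbers as the leading part of a Spanner primary key.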
Datastore:
- Understand what is Datastore and its use cases
Dataflow:
- Understand what Dataflow is (a managed runner for Apache Beam pipelines)
- Get familiar with Dataflow concepts like I/O transforms, side inputs, side outputs, and runners
- Get familiar with the various types of windowing - fixed vs. sliding vs. session windows
- How to improve the performance of Dataflow jobs
- How to update a running streaming job's code in place
- Understand the difference between cancelling vs. draining a Dataflow job
- Get familiar with PCollections and how they are used
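The fixed vs. sliding distinction can be sketched with a toy model (plain Python, not the Apache Beam API): fixed windows partition the timeline into non-overlapping buckets, while sliding windows overlap whenever the period is shorter than the size, so one event can land in several windows. Session windows (not shown) instead grow with activity and close after a gap.

```python
from collections import defaultdict

# Toy windowing model. Events are (timestamp_seconds, count) pairs.
events = [(2, 1), (7, 1), (31, 1), (34, 1), (65, 1)]

def fixed_windows(events, size):
    """Each event falls in exactly one [start, start + size) window."""
    out = defaultdict(int)
    for ts, v in events:
        out[ts - ts % size] += v
    return dict(out)

def sliding_windows(events, size, period):
    """Window starts are multiples of `period`; an event falls in every
    window whose [start, start + size) range covers it."""
    out = defaultdict(int)
    for ts, v in events:
        start = (ts // period) * period  # latest window start <= ts
        while start > ts - size and start >= 0:
            out[start] += v
            start -= period
    return dict(out)

print(fixed_windows(events, size=60))               # {0: 4, 60: 1}
print(sliding_windows(events, size=60, period=30))  # {0: 4, 30: 3, 60: 1}
```

Note how the event at t=34 is counted once by the fixed window but twice by the sliding windows (in the windows starting at 0 and at 30) - a favorite exam trap.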
Dataprep:
- Understand what Dataprep is - mostly used by analysts who prefer a GUI; commonly used for cleansing data whose schema changes frequently
- Dataprep internally uses Dataflow to run its pipelines
Dataproc:
- Understand what Dataproc is
- On-prem Hadoop clusters can be migrated to Dataproc; use Cloud Storage in lieu of HDFS, BigQuery in place of Hive, and Bigtable in place of HBase
- It is a best practice to use a Dataproc cluster for only one type of job and delete the cluster on completion
- If cost is a concern, you can provision worker nodes as preemptible VMs
- Understand how to run initialization actions on your cluster
- Understand Spark job processing and how to improve the performance of a Dataproc cluster
Cloud Storage:
- Go through Cloud Storage thoroughly; many questions touch on it
- Understand the difference between regional and multi-regional buckets, the costs associated with each, and what high availability means
- Understand the various storage classes - Standard, Nearline, Coldline, Archive
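A quick way to remember the storage classes is by access frequency. The rule-of-thumb picker below is my own heuristic, not an official API; the day thresholds mirror the documented minimum storage durations (Nearline 30 days, Coldline 90 days, Archive 365 days). Roughly: the colder the class, the cheaper the storage but the pricier the access.

```python
# Heuristic storage-class picker (illustrative only).
def storage_class(days_between_accesses: int) -> str:
    if days_between_accesses < 30:
        return "Standard"   # hot data, frequent access
    if days_between_accesses < 90:
        return "Nearline"   # accessed about once a month
    if days_between_accesses < 365:
        return "Coldline"   # accessed about once a quarter
    return "Archive"        # accessed less than once a year

print(storage_class(7), storage_class(45), storage_class(400))
# Standard Nearline Archive
```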
Pub / Sub:
- Decouples senders and receivers; stores messages for up to 7 days; guarantees at-least-once delivery; messages may arrive out of order
- Used in most real-time application use cases
- Understand Pub/Sub topics and subscriptions
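Because delivery is at-least-once and unordered, subscribers are expected to handle duplicates and reordering themselves. A minimal sketch of that pattern (the message shape here is assumed for illustration, not the Pub/Sub client library's type):

```python
# Toy subscriber-side dedup + reorder. Each "message" is a dict with a
# unique id, a publish timestamp, and a payload.
def process(messages):
    seen = set()
    unique = []
    for msg in messages:
        if msg["id"] not in seen:      # at-least-once => duplicates possible
            seen.add(msg["id"])
            unique.append(msg)
    # Re-establish order by publish time before handing data downstream.
    return [m["data"] for m in sorted(unique, key=lambda m: m["publish_time"])]

delivered = [  # arrival order: out of order, with a redelivery of id 1
    {"id": 2, "publish_time": 20, "data": "b"},
    {"id": 1, "publish_time": 10, "data": "a"},
    {"id": 1, "publish_time": 10, "data": "a"},
    {"id": 3, "publish_time": 30, "data": "c"},
]
print(process(delivered))  # ['a', 'b', 'c']
```

In real pipelines this is typically handled for you - for example, Dataflow deduplicates Pub/Sub messages by id - which is one reason the two products are so often paired on the exam.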
Misc Products:
Briefly touch upon Stackdriver, Cloud Composer, and IoT Core
Machine Learning Basics:
- Refresh your core machine learning concepts
- Understand when to build your own model vs. using AutoML vs. using Google's pre-trained ML APIs (Vision API, Translation API, etc.)
- Learn about Google AI Platform and how to use it to train and serve models
- Learn about the various hardware options Google offers for training models - CPUs, GPUs, and TPUs
Good luck with your exam!
nVector
posted on 24 Jul 20