Google Cloud Professional Data Engineer Certification Exam - Study Guide and Practice Tests

I cleared the Professional Data Engineer certification recently and, while the question topics are still fresh in my mind, I wanted to share some helpful tips for preparing for and acing the exam.

Why would you want to do a Google Cloud Professional Data Engineer Certification?

You may already have the skills to use Google Cloud, but how do you demonstrate this to a future employer or client? Two ways: through a portfolio of projects or a certification.

A certificate says to future clients and employers, ‘Hey, I’ve got the skills and I’ve put in the effort to get accredited.’

Who is it for?

If you’re already a data scientist, data engineer, data analyst, or machine learning engineer, or you’re looking for a career change into the world of data, the Google Cloud Professional Data Engineer Certification is for you.

How much does it cost?

Sitting the certification exam costs $200 USD. If you fail, you will have to pay the fee again to retake it.

What is the exam format?

  • Length: 2 hours
  • Registration fee: $200 (plus tax where applicable)
  • Languages: English, Japanese.
  • Exam format: 50 multiple-choice and multiple-select questions
  • Exam delivery method:
      • Take the online-proctored exam from your home
      • Take the onsite-proctored exam at a testing center (locate a test center near you)
  • Prerequisites: None
  • Recommended experience: 3+ years of industry experience including 1+ years designing and managing solutions using GCP.

Course Topics to Prepare:

You will need to be familiar with:

  • 5 database products (BigQuery, Bigtable, Cloud SQL, Cloud Spanner, Datastore)
  • 3 ETL tools (Dataflow, Dataprep, Dataproc)
  • 5 misc products (Cloud Monitoring, Cloud Composer, Cloud Storage, Pub/Sub, Analytics Hub)
  • Machine learning basics and cloud basics (e.g. zones, regions)

Overwhelmed? Don't worry, let's take it one small step at a time:

BigQuery:

  • Around 30% of questions come from BigQuery, so prepare the topics below thoroughly
  • Understand what BigQuery is - a serverless, petabyte-scale cloud data warehouse
  • BigQuery partitioned tables
  • BigQuery table clustering
  • How to use partitioning and clustering to reduce query costs (see the sketch after this list)
  • BigQuery can automatically detect the schema of JSON files
  • Understanding BigQuery streaming inserts
  • BigQuery ML
  • BigQuery use cases: analytics, easily works on huge volumes of historical data, ANSI SQL compliant
  • Understanding access control of BigQuery tables - how they are managed via IAM and dataset permissions; understand the dataViewer and dataEditor roles
  • Understand BigQuery snapshot tables and how to use snapshots as a cost-effective method for daily backups
  • Learn about BigQuery authorized views and how they can be used to grant access to users without granting permissions on the underlying tables
  • BigQuery logs can be monitored using Cloud Monitoring. Understand how to set up Cloud Monitoring reporting for bytes used or for slot usage, and how to investigate slowness
  • Understand the BigQuery slots quota per project and how it differs between the pay-as-you-go model and the fixed-price model
  • Understand usage quotas and how to set up daily usage quotas for teams that need predictable costs
  • Learn about BigQuery federated querying: how to query AWS tables using BigQuery Omni, and the differences between external tables and BigLake tables
  • Learn about Datastream. For frequently accessed tables, you can use streams to replicate tables from AWS and Cloud SQL into BigQuery. This avoids reading the data from AWS for every query and also reduces the load on the AWS services
  • BigQuery caching does not work when queries use CURRENT_TIMESTAMP or when querying multiple tables with a wildcard
  • Use partition expiry to delete older, unwanted data automatically and in a cost-effective manner
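
To make partitioning and clustering concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and column names are made up for illustration; the point is that a partition filter plus clustering lets BigQuery scan far fewer bytes.

    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project from your credentials

    # Hypothetical table id - replace with your own project/dataset/table.
    table_id = "my-project.analytics.events"

    schema = [
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("event_type", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ]

    table = bigquery.Table(table_id, schema=schema)
    # Partition by day on event_date and expire partitions after ~90 days.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_date",
        expiration_ms=90 * 24 * 60 * 60 * 1000,
    )
    # Cluster within each partition by the columns you filter on most.
    table.clustering_fields = ["customer_id", "event_type"]
    client.create_table(table)

    # Queries that filter on the partition column only scan the matching partitions.
    query = """
        SELECT customer_id, COUNT(*) AS events
        FROM `my-project.analytics.events`
        WHERE event_date BETWEEN '2020-07-01' AND '2020-07-07'
          AND event_type = 'purchase'
        GROUP BY customer_id
    """
    job = client.query(query)
    job.result()
    print(f"Bytes processed: {job.total_bytes_processed}")

Note that the partition expiry is set on the table itself, which is how you get automatic, cost-effective deletion of old data.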

Bigtable:

  • Understand Google Bigtable - a NoSQL, low-latency database that can handle billions of rows and thousands of simultaneous read and write operations
  • Bigtable schema design - learn the best practices for choosing row keys (see the sketch after this list)
  • In general Bigtable performs well on tall and narrow tables, so limit the number of columns
  • Common use cases: time-series data, IoT data, real-time stock market data
  • Understand nodes and how to scale write and read operations by choosing the number of nodes and by choosing between SSD and HDD disks
  • If a node fails, data is not lost because data is never stored on the Bigtable nodes themselves
  • Cluster IDs can only be changed by deleting and recreating the cluster. If you want to change from HDD to SSD, export the data into Avro format using Dataflow and import it into a new Bigtable instance
  • Understand the factors that cause slow Bigtable performance
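
Here is a minimal sketch of the row key idea, assuming the google-cloud-bigtable Python client and a hypothetical instance, table, and column family. Prefixing the key with the device id keeps one device's readings contiguous, and avoiding a plain timestamp prefix avoids hotspotting a single node.

    import datetime

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("iot-instance")      # hypothetical instance id
    table = instance.table("sensor-readings")       # hypothetical table; assumes a
                                                    # column family "measurements" exists

    def write_reading(device_id: str, temperature: float) -> None:
        now = datetime.datetime.utcnow()
        # Reverse the timestamp so the newest reading for a device sorts first.
        reversed_ts = (10 ** 13) - int(now.timestamp() * 1000)
        row_key = f"{device_id}#{reversed_ts}".encode()

        row = table.direct_row(row_key)
        row.set_cell("measurements", b"temperature", str(temperature), timestamp=now)
        row.commit()

    write_reading("sensor-42", 21.7)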

Cloud SQL:

  • Cloud SQL is good for transactional databases where the data volume is less than 4 TB and there are at most around 4,000 concurrent connections
  • Choose Cloud SQL when the customer wants to lift and shift an application to the cloud as-is and does not want to spend on rewriting it
  • Understand how to install the Cloud Monitoring agent to monitor DB performance
  • Read replica - a read-only copy of the primary instance that reflects changes in near real time. Used to offload read requests, analytics traffic, and queries from the primary instance (a sketch of creating one follows this list).
    Failover replica - a DR replica that can be used for failover in the event of a primary instance failure. A read replica cannot be used as a DR replica, as it will also be down during any failure of the primary instance.
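
As a rough sketch, a read replica is just another instance whose masterInstanceName points at the primary. This assumes the Cloud SQL Admin API via the google-api-python-client library; the project, instance, and tier names are made up.

    from googleapiclient import discovery

    # Build the Cloud SQL Admin API client (uses application default credentials).
    service = discovery.build("sqladmin", "v1beta4")

    replica_body = {
        "name": "orders-db-replica-1",        # hypothetical replica name
        "region": "us-central1",
        "masterInstanceName": "orders-db",    # hypothetical primary instance
        "settings": {"tier": "db-n1-standard-2"},
    }

    request = service.instances().insert(project="my-project", body=replica_body)
    response = request.execute()
    print(response["name"])  # name of the long-running operation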

Cloud Spanner:

  • Cloud Spanner is good for transactional databases that need virtually unlimited scalability, many concurrent connections, and global consistency
  • Cloud Spanner supports primary keys and secondary indexes. Learn how to choose these keys and how they impact data distribution across nodes and retrieval (see the sketch after this list)
  • Learn how to improve performance by tweaking various options, including the number of nodes and the schema/key design
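
A minimal sketch of the key-design point, assuming the google-cloud-spanner Python client and a hypothetical instance and database that already exist. Using a non-monotonic primary key (e.g. a UUID) spreads writes across nodes instead of hotspotting the last key range, and a secondary index covers the lookup pattern you actually need.

    from google.cloud import spanner

    client = spanner.Client(project="my-project")
    instance = client.instance("prod-instance")   # hypothetical instance id
    database = instance.database("orders-db")     # hypothetical, already created

    operation = database.update_ddl([
        """CREATE TABLE Orders (
               OrderId    STRING(36) NOT NULL,
               CustomerId STRING(36) NOT NULL,
               CreatedAt  TIMESTAMP NOT NULL,
               Amount     FLOAT64
           ) PRIMARY KEY (OrderId)""",
        "CREATE INDEX OrdersByCustomer ON Orders(CustomerId, CreatedAt DESC)",
    ])
    operation.result()  # wait for the schema change to complete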

Datastore:

  • Understand what Datastore is (a serverless NoSQL document database) and its use cases

Dataflow:

  • Understand what Dataflow is (it runs Apache Beam pipelines internally)
  • Get familiar with the various Dataflow concepts like IO connectors, side inputs, side outputs, and runners
  • Get familiar with the various types of windowing - fixed vs. sliding vs. session windows (a small Beam sketch follows at the end of this section)

    Fixed windows (tumbling) -
    These windows are non-overlapping and have a fixed interval duration. They are useful for batch analysis and aggregation use cases, such as hourly, daily, or monthly aggregates.

    Sliding windows (hopping) - These windows overlap and have a fixed interval duration. They are useful for moving averages and continuous insights, such as the last 30 minutes of web traffic recomputed every 5 minutes.

    Session windows - These windows have dynamically set intervals and are non-uniform across keys. They are useful for tracking user activity, such as a user's actions during a web session, e.g. end a web session if a user is idle for 30 minutes.

  • How to increase the performance of Dataflow jobs
  • How to update a running streaming job and change its code
  • Understand the difference between cancelling (stopping) vs. draining a Dataflow job
  • Get familiar with PCollections and how they are used.

    PCollection -
    An immutable, unordered collection of values used as the input and output of a PTransform.

    PTransform - A data processing operation, or step, in a pipeline. A PTransform can be applied to zero or more PCollection objects and produce zero or more PCollection objects. Examples of PTransforms include Map and Filter operations.

    Pipeline - A graph of transformations that defines the data processing operations: essentially the PCollections and the PTransforms applied to them.

    ParDo - A generic parallel-processing transform used for operations like filtering, type conversion, or extracting certain fields from each element.

  • Understand how you can add a Reshuffle transform to your pipeline code to resolve autoscaling (scale-up) issues
  • Tips to avoid zonal failures in Dataflow: use the --region flag instead of the --zone flag to specify a worker region. This allows Dataflow to automatically choose the best zone for the job, which can provide more fault tolerance.
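
To tie the windowing and PCollection/ParDo ideas together, here is a small, self-contained Beam sketch. The data, field names, and DoFn are made up purely for illustration; it runs locally with the default DirectRunner.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window


    class ParsePageView(beam.DoFn):
        """A ParDo that drops malformed records and emits (page, 1) with an event timestamp."""

        def process(self, line):
            parts = line.split(",")
            if len(parts) != 3:
                return  # drop malformed records instead of failing the pipeline
            user_id, page, ts = parts
            # Attach the event time so windowing uses it rather than processing time.
            yield window.TimestampedValue((page, 1), float(ts))


    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (
            pipeline
            | "Read" >> beam.Create([
                "alice,/home,1595575200",
                "bob,/home,1595575260",
                "alice,/pricing,1595578900",
            ])
            | "Parse" >> beam.ParDo(ParsePageView())
            | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
            | "CountPerPage" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )

Swapping window.FixedWindows(3600) for window.SlidingWindows(30 * 60, 5 * 60) or window.Sessions(30 * 60) gives you the sliding and session behaviour described above without changing the rest of the pipeline.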

Dataprep:

  • Understand what Dataprep is - mostly used by analysts who prefer a GUI; it is commonly used for cleansing data whose schema changes frequently
  • Dataprep internally uses Dataflow to run the pipeline
  • Sample question: You have data located in BigQuery that is used to generate reports for your company. You have noticed that some weekly executive report fields do not conform to company format standards. For example, report errors include different telephone formats and different country code identifiers. This is a frequent issue, so you need to create a recurring job to normalize the data. You want a quick solution that requires no coding. What should you do? Answer: Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.

Dataproc:

  • Understand what Dataproc is (managed Hadoop and Spark)
  • On-prem Hadoop clusters can be migrated to Dataproc; use Cloud Storage in lieu of HDFS and use Bigtable in place of HBase
  • Use HDFS storage for I/O-intensive jobs instead of Cloud Storage
  • It is a best practice to use a Dataproc cluster for only one type of job and to delete the cluster on completion
  • If cost is an issue you can instantiate worker nodes as preemptible instances (see the sketch after this list)
  • Understand how to run initialization actions on your cluster. If your company doesn't permit internet access, you can download a copy of the required files into Cloud Storage and use them in the initialization actions
  • Understand Spark job processing and how to improve the performance of a Dataproc cluster
  • For low-priority workloads, use standard mode with high-memory machines instead of high availability mode to save costs
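
A rough sketch of a cost-conscious cluster, assuming the google-cloud-dataproc Python client; the project, bucket, and machine types are hypothetical. It adds preemptible secondary workers and an initialization action staged in Cloud Storage.

    from google.cloud import dataproc_v1 as dataproc

    region = "us-central1"
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",        # hypothetical project id
        "cluster_name": "nightly-etl",     # one job type per cluster, delete when done
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
            # Preemptible secondary workers keep costs down for fault-tolerant jobs.
            "secondary_worker_config": {"num_instances": 4, "preemptibility": "PREEMPTIBLE"},
            # Init action staged in Cloud Storage - handy when workers have no
            # internet access and packages must come from your own bucket.
            "initialization_actions": [
                {"executable_file": "gs://my-bucket/install-deps.sh"}
            ],
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)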

Cloud Storage:

  • Thoroughly go through Cloud Storage; most questions touch upon it
  • Understand the difference between regional and multi-region buckets, the costs associated with each, and what high availability means. Know what Recovery Point Objective (RPO) is; if the requirement is an RPO of less than 15 minutes, use multi-region buckets
  • Understand the various storage classes - Standard, Nearline, Coldline, Archive and Autoclass. Know how to use Autoclass to apply transparent policies for users and why Autoclass is the best solution for random access patterns. Also understand how to set lifecycle policies that act on objects (see the sketch after this list)
  • Know how to use CMEK. If a CMEK key is compromised, create a new bucket with a new default CMEK key, copy the objects from the old bucket to the new bucket, and then destroy the old CMEK key
  • Storage Transfer Service - can be used to copy files from a regional bucket to a central multi-region bucket
  • Transfer Service for on-premises data - used for on-premises to Cloud Storage migration
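
Here is a minimal lifecycle-policy sketch using the google-cloud-storage Python client; the bucket name and the specific ages are made up. Each rule moves ageing objects to a colder class, and the last one deletes them.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-analytics-archive")   # hypothetical bucket name

    # Move objects to colder storage classes as they age, then delete them.
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
    bucket.add_lifecycle_delete_rule(age=365 * 5)
    bucket.patch()

    for rule in bucket.lifecycle_rules:
        print(rule)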

Pub / Sub:

  • Decouples sender and receiver. Stores messages for up to 7 days and guarantees message delivery, but messages may arrive out of order
  • Pub/Sub allows both push and pull subscriptions. You can also use exactly-once delivery to avoid duplicates on the receiver side
  • Used for most real-time application use cases
  • Understand Pub/Sub topics and subscriptions
  • Pub/Sub cannot retrieve messages after you have acknowledged them. However, sometimes you might need to replay acknowledged messages, for example if you performed an erroneous acknowledgment. You can use the seek feature to mark previously acknowledged messages as unacknowledged and force Pub/Sub to redeliver them. You can also use seek to discard unacknowledged messages by changing their state to acknowledged. Seek to a snapshot or seek to a timestamp to replay the messages in a subscription (see the sketch after this list).
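
A minimal seek-to-timestamp sketch, assuming the google-cloud-pubsub Python client and a hypothetical project and subscription. Replaying already-acknowledged messages also requires retain_acked_messages to be enabled on the subscription.

    import datetime

    from google.cloud import pubsub_v1
    from google.protobuf import timestamp_pb2

    with pubsub_v1.SubscriberClient() as subscriber:
        # Hypothetical project and subscription ids.
        subscription_path = subscriber.subscription_path("my-project", "orders-sub")

        # Replay everything delivered in the last hour.
        replay_from = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
        ts = timestamp_pb2.Timestamp()
        ts.FromDatetime(replay_from)

        subscriber.seek(request={"subscription": subscription_path, "time": ts})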

Misc Products:

Briefly touch upon Cloud Monitoring, Cloud Composer and Analytics Hub

Analytics Hub:

  • Analytics Hub enables self-service data sharing: users can publish, discover, and subscribe to data assets within the organization. You can also use it to share data assets outside the organization, which helps address challenges around data reliability and cost

Machine Learning Basics:

  • Refresh your machine learning concepts using this quick guide
  • Understand when to build your own model vs. using AutoML vs. using pre-trained Google ML APIs (Vision API, Translation API, etc.) - a small sketch of calling a pre-trained API follows this list
  • Learn about Google AI Platform and how to use it to train and serve models
  • Learn about the various hardware options Google offers for training a model - CPUs, GPUs and TPUs
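
To illustrate the pre-trained option, here is a minimal Vision API label-detection sketch using the google-cloud-vision Python client; the bucket and file name are hypothetical. No training data and no model management are needed, which is exactly when you would choose a pre-trained API over AutoML or a custom model.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))
    response = client.label_detection(image=image)

    for label in response.label_annotations:
        print(label.description, round(label.score, 2))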

Good Luck with your exam !

nVector

posted on 24 Jul 20
