Data strategy for the Google Cloud Platform.


Today, we will look at the storage options offered by the amazing Google Cloud Platform (GCP), as well as the methods of storing and retrieving data that Google recommends. The ‘GCP Data Layer Design’ covers the database and storage services (persistence) and the data access layer.

This blog post is a deeper look at the 8-step process described in ‘An 8-step process to architect solutions on the Google Cloud Platform (GCP)’.

Users do not distinguish between data loss, data corruption and downtime, so it’s all about data integrity. If the data is unaffected but unavailable to the user, then for the user the trust is gone, even though it might only be a temporary access issue.

According to the CAP theorem, you can only pick two of the three: consistency, availability and partition tolerance.

[Figure: consistency, availability and partition tolerance. Copyright: GCP]

Let’s say you chose Availability and Partition Tolerance: you have chosen an option known as BASE (Basically Available, Soft state, Eventual consistency). Here you might make an update, but it might not be visible immediately. Eventually there will be consistency, and the key advantage is the ability to write and expand the data quickly.
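
To make "eventually there will be consistency" concrete, here is a minimal toy sketch in Python. The class `EventuallyConsistentStore` and its methods are hypothetical names invented for illustration; they do not correspond to any GCP API, and real systems replicate asynchronously rather than through an explicit call.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes land on a primary and reach a replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # replication log not yet applied to the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        # May return stale data (or None) until replication catches up.
        return self.replica.get(key)

    def replicate(self):
        # Apply all pending updates, in order, to the replica.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read_from_replica("balance")  # replica has not seen the write yet
store.replicate()
fresh = store.read_from_replica("balance")  # replica has caught up
print(stale, fresh)
```

The first read misses the update and the second one sees it, which is the trade-off you accept in exchange for fast writes.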

If Consistency is important, we look at ACID transactions; ACID stands for Atomicity, Consistency, Isolation and Durability. Here, once a write is committed, it is guaranteed to be there. Databases offering these guarantees are typically fault-tolerant and replicated.
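
As a small illustration of the atomicity part of ACID, here is a sketch using Python's built-in `sqlite3` module rather than a GCP service. The `accounts` table and the `transfer` helper are made up for this example; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to bob has already been executed when the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.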

So, what are the factors that go into determining a data strategy? We need to think deeply about uptime, which is how often the data needs to be available to the user. For example, a bank might be closed in the evening so that you can’t access your funds, but that does not mean your data is gone. For latency, think about searching for a specialist doctor on your health insurance portal and waiting a long time for the search results to appear. Regarding scale and velocity, think about how large and how fast you expect your user base to grow. And finally, for privacy, data needs to be properly destroyed after a reasonable period of time.
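
One way to reason about the uptime factor is to turn an availability target into a yearly downtime budget. The sketch below is plain arithmetic; the targets in the loop are example values chosen for illustration, not figures from Google.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Return the yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

For instance, a 99.9% target leaves roughly 8.76 hours of downtime per year.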

The next step is data migration, where you transport the data from your on-prem environment, or wherever it currently lives, to the Google Cloud Platform (GCP). There is another method called data ingestion, where the data continues to reside on-prem and is periodically loaded to the cloud.

With the Google Cloud Platform (GCP):

  • You can migrate data directly from the console where you can drag and drop files.
  • You can use the gsutil command from the cloud shell.
  • You can also use Cloud Storage Transfer Service, which supports Amazon S3 as a source, to transfer data from AWS buckets to GCP. It also allows transfers between GCP buckets and can be used to back up data.
  • With the GCP Cloud Storage JSON API, you can compress data and do partial and resumable uploads to make data migrations more efficient.
  • Finally, for large data sizes, you have the Google Transfer Appliance.
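
To give a feel for how a resumable upload splits a file, here is a minimal sketch that only computes the `Content-Range` header value for each chunk; it makes no network calls. The function name `content_ranges` and the default chunk size are assumptions for illustration. The constraint it encodes, that every chunk except the last must be a multiple of 256 KiB, comes from the Cloud Storage resumable upload documentation.

```python
CHUNK_MULTIPLE = 256 * 1024  # resumable upload chunks must be a multiple of 256 KiB


def content_ranges(total_size, chunk_size=8 * CHUNK_MULTIPLE):
    """Yield the Content-Range header value for each chunk of an upload."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if chunk_size <= 0 or chunk_size % CHUNK_MULTIPLE != 0:
        raise ValueError("chunk_size must be a positive multiple of 256 KiB")
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1  # inclusive end offset
        yield f"bytes {start}-{end}/{total_size}"
        start = end + 1


ranges = list(content_ranges(5 * 1024 * 1024))  # a 5 MiB file in 2 MiB chunks
for header in ranges:
    print(header)
```

If a chunk fails, only that byte range needs to be sent again, which is what makes resumable uploads attractive for large migrations.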

[Figures: GCP Console, GCP Cloud Shell, Google Transfer Appliance]
Use Google Transfer Appliance if the data is larger than 100 GB.
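
Whether shipping an appliance beats uploading over the network depends on how much bandwidth you have. Here is a rough back-of-the-envelope sketch; the `utilization` factor and the 100 Mbps link are assumptions for illustration, not GCP figures.

```python
def transfer_days(size_gb, bandwidth_mbps, utilization=0.8):
    """Estimate the days needed to upload size_gb over a bandwidth_mbps link.

    utilization is an assumed fraction of the link that is actually
    available for the transfer.
    """
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    seconds = size_megabits / (bandwidth_mbps * utilization)
    return seconds / 86_400  # seconds per day


for size_gb in (100, 10_000, 1_000_000):
    days = transfer_days(size_gb, 100)
    print(f"{size_gb} GB over 100 Mbps: about {days:.1f} days")
```

Plugging in your own data size and link speed gives a quick sense of when an online transfer stops being practical.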

[Figure: Key components for data ingestion into GCP. Copyright: Google Cloud Platform]

Here are the many storage options provided by the Google Cloud Platform (GCP):

[Figure: GCP storage options. Copyright: Google Cloud Platform]

So how do you choose between so many storage options? Luckily, Google provides a nice decision flow to help us with it.

[Figure: decision flow from Google for choosing a storage option. Copyright: Google Cloud Storage]
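
A decision flow like this can also be written down as a small function. The sketch below is a simplified approximation and an assumption for illustration, not a transcription of Google's chart; the function name `choose_storage` and its yes/no parameters are hypothetical.

```python
def choose_storage(structured, analytics, relational,
                   needs_horizontal_scale=False, low_latency=False):
    """Suggest a GCP storage service from a few yes/no answers (simplified)."""
    if not structured:
        return "Cloud Storage"  # files, images, backups and other blobs
    if analytics:
        # Analytics workloads: low-latency access vs. warehouse-style SQL.
        return "Cloud Bigtable" if low_latency else "BigQuery"
    if relational:
        return "Cloud Spanner" if needs_horizontal_scale else "Cloud SQL"
    return "Cloud Datastore"  # structured but non-relational


print(choose_storage(structured=True, analytics=False, relational=True))
print(choose_storage(structured=True, analytics=True, relational=False))
```

Real workloads rarely fit a handful of booleans, so treat the result as a starting point for reading the service documentation.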

Let’s assume you chose Cloud Storage. There are multiple options within it, and cost and location-based access to data are the determining factors; the performance is the same across all the tiers. At the time of writing, regional storage costs 2 cents per GB per month and multi-regional comes in at 2.6 cents per GB. Nearline costs 1 cent per GB, but retrieving data costs an additional cent per GB. For the coldline option, the cost is 0.7 cents per GB, but the retrieval fee is higher at 5 cents per GB.
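
To compare the tiers for a given workload, here is a small calculator built from the prices quoted above. It assumes storage is billed per GB per month and ignores network and operation charges; the prices may be out of date, and the `monthly_cost` helper is a name made up for this sketch.

```python
# USD per GB, as quoted in this post; check current pricing before relying on them.
STORAGE_CLASSES = {
    "multi-regional": {"storage": 0.026, "retrieval": 0.0},
    "regional": {"storage": 0.02, "retrieval": 0.0},
    "nearline": {"storage": 0.01, "retrieval": 0.01},
    "coldline": {"storage": 0.007, "retrieval": 0.05},
}


def monthly_cost(storage_class, stored_gb, retrieved_gb=0):
    """Storage plus retrieval cost in USD for one month."""
    prices = STORAGE_CLASSES[storage_class]
    return stored_gb * prices["storage"] + retrieved_gb * prices["retrieval"]


# Example workload: 1000 GB stored, 100 GB read back during the month.
for name in STORAGE_CLASSES:
    print(f"{name}: ${monthly_cost(name, 1000, 100):.2f}")
```

The colder tiers are only cheaper if you rarely read the data back; as retrieval volume grows, the retrieval fee eats up the storage savings.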

[Figures: Cloud Storage options. Copyright: Google Cloud Platform]

For data analytics at scale, we choose BigQuery. Storage costs 2 cents per GB, and the cost drops to 1 cent per GB after 90 days. All the analytics resources and compute are built into the service, and beyond storage you pay only for the data you access: you pay only when you run analytic queries, at $5 per TB of data queried. You can run OLAP workloads up to petabyte scale, and it runs SQL with very fast processing times. Here’s a sample data warehouse solution architecture from Google.
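
The BigQuery pricing described above can be combined into a rough monthly estimate. This sketch uses only the figures quoted in this post, which may be out of date, and assumes the storage prices are per month; the function name and its parameters are hypothetical.

```python
def bigquery_monthly_cost(recent_gb, older_than_90_days_gb, queried_tb):
    """Rough monthly BigQuery bill in USD from the prices quoted in the post."""
    storage = recent_gb * 0.02 + older_than_90_days_gb * 0.01
    queries = queried_tb * 5.0  # $5 per TB scanned by queries
    return storage + queries


# Example: 500 GB of recent data, 2000 GB of older data, 3 TB queried.
estimate = bigquery_monthly_cost(500, 2000, 3)
print(f"Estimated monthly cost: ${estimate:.2f}")
```

Because queries are billed by data scanned, selecting only the columns you need is usually the easiest way to bring the second term down.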

[Figure: sample data warehouse solution architecture from Google. Copyright: Google Cloud Platform]

Here are a few decision trees from Google to help you decide when to choose Cloud SQL, Cloud Spanner, Cloud Datastore and Cloud Bigtable.

[Figures: decision trees for Cloud SQL, Cloud Spanner, Cloud Datastore and Cloud Bigtable. Copyright: Google Cloud Platform]

So, in summary, here are the storage options for your data layer design.

[Figure: summary of storage options for the data layer design. Copyright: Google Cloud Platform]
