23 November, 2022

Understanding BigQuery on data science

Today, all companies, regardless of whether they are public or private, use BigQuery with one thing in common: they understand the value and importance of data. And the key to success is in the data. By having them properly structured and studied in a systematic way, we can draw conclusions that will benefit us in the future.

Now, how does BigQuery tie into several of these data-related disciplines?It really works as a data warehousing tool, as it allows you to save and extract large amounts of data.

In the Data Mining part, it marries very well with the R language. Which helps us explore data that does not have a good structure, and this language precisely has an implementation of a library of this type.

Regarding Business Intelligence, there is a tool that Google recently acquired called Big Query BI Engine. Where you can do analysis and reports in real time with fairly low latency, with reports that are ready in seconds or even less than this time.

Finally, this tool has a very direct relationship with Machine Learning. Since it precisely has a product called Big Query ML (those who have worked with SQL will notice that it is extremely easy to learn).

The magic of BigQuery: Why is it different from other platforms?

With this tool you can do partitioning, which are manual partitions in different areas where you are saving your data.

Likewise, you can make a much lighter strategy such as clustering, where you have the possibility of manually organizing your keys by which your data will be saved. Let’s see other details that make this product a marvel.

It is fully managed

You don’t have to worry about anything; you have the security that it goes to the account of the same Google cloud. You do not have to do replicas, you do not have to do ring configurations or network configurations, everything is completely managed.

It has a vertical storage (columnar)

Unlike a traditional RDBMS, which is a horizontal, record-based store, with BigQuery, queries are much faster for aggregation and reporting.

Use access patterns to encrypt data

Through Capacitor (File Format), when the data is repeated, it compresses said repetitions and when extracting the data, it reconstructs them, even if the data is not ordered, behind the scenes it orders them.

Distributed File System (Colossus)

It is a document system that Google manages, it is not only for this tool; however, it also works.

Colossus offers the possibility of being able to save data redundantly on different hard drives, several machines or different zones. You can even change the regions or even save them in another country or continent.

Information saving optimization

It turns out that BigQuery behind the scenes is going to be doing sorts where it will intelligently index your data.

For example: if you normally extract the data by date, it will put the date as one of the primary keys to make the extraction much easier. This data will be 100% encrypted and you can even manage the encryption keys.

Unfortunately, within all the magic there is a great sin: you cannot select everything in a table. Why? Because being a columnar database, you would have to unpack all the columns and then you would have to associate one column with the other to reassemble the entire table, so it doesn’t make much sense.

How to use BigQuery?

Its interface is quite friendly and familiar. In Google Cloud we have the projects, which are the way in which the entire pool of resources that we are going to use is functionally separated.

As an example, if you have compute instances, or database instances, as well as warehouse tools, among others, these will be kept separate. On the other hand, we will be able to keep costs down.

When you have a SQL database that is in that same project, it’s going to be very easy to connect it with this storage tool, because that means that these projects are living in the same house so to speak.

So with BigQuery we don’t need to create a broker through a proxy, everything is taken care of within these working groups. By default, Google Cloud gives you some resources so you can work, experimenting and testing all its tools for free. Basically, all of these attachments have some free tier.

¿What else can you do?

New data sets can be added (it can be your personal data from another database). When producing the data sets, you have the power to define where you want it to be located. Taking into account the regions or geographical areas that we discussed earlier.

To make it easier, we suggest placing it in the United States. You can also modify other areas such as the expiration of the tables. Whether you want them to expire, and the encryption you want BigQuery to have. In Google Cloud we have the internal explorer, which allows us to search for data that we have, both inside and outside of data groups.

We can find a series of very interesting tables that are constantly updated. For example: baseball data, Blockchain data, crypto data, Covid-19 data, geographic data, google analytics data (it’s an excellent way to see how users use the internet), and more.

The way to interact with this product is really simple when exploring the data, we can even connect an external tool such as an API. If we want to write a new query, we can open a new editor and write a query in much the same way as we did in SQL.

Using BigQuery and practical example

The parts that a Query normally has are a select* and a from. The from would be from which table or from which data source we are going to extract the information. Here we are going to define where you want to take it from. In this example it is taken from BigQuery public data, which is precisely the data source that we found for the exemplification.

The period and period is going to give you, then, the particular data source and in this case we are going to focus on. The COVID-19 data, since for better or for worse it is still a topic that is on everyone’s lips, all time.

COVID-19 + ecdc, within that we have a table that is precisely that of COVID-19 and is called “geographic distribution worldwide”, this is going to be our data source.

Within these tables there is the possibility that you will find syntax errors, which are also seen when we leave the select empty. But, for the example that we want to develop we are first interested in seeing where we are going to get the data from in BigQuery.

First step, choose the table

Now, for example, you select the table that is going to give you some data immediately. This is how you can avoid doing that query of star and three. At that time the select from of the select start from in the table.

In the case of this table, this type of query is avoided because it will cause you a very large cost and it will really be an aggregation of all the data at the end. In case you want to explore the data without having to pay these costs, you can do it in the following way:

Detailed selection

What you can do, on one hand, when you click on the selection tab, is that you will see several selection tabs appear in BigQuery.

In the first you will see the scheme where you can observe in detail, what columns and divisions it would have. For now we would have “date”, “month”, “year”, confirmed deaths, daily deaths, among other data.

In the other column, the type of data it will appear. In the next column it will indicate if it is nullable or not nullable, which we use to know if said field can be empty or not.

Choice of storage size

If we want to see more details, there is a possibility that you will look at the information in the table. The full size is a 5.44 megabytes of long-term storage, which is all the storage is going to take in a very long period of time. This can change if you are constantly increasing data in BigQuery.

Similarly, you have the option to calculate what the size is going to be in the future with the number of rows. When the table was created, the last modification, the expiration of the table, and other particularities.

Groupings

It is ideal to make a group that helps us to make the discriminated blocks. By being able to make the sections of information and the criteria by which we are going to be guided. Which refers to this example of COVID-19 data, “countries and territories”, after this we will proceed to do it by month.

What this selection that you make in BigQuery. Will give you is the sum of confirmed cases, deaths in a given month, in a given year and in a given country. This type of organization is very useful to make the data more specific.

Choose the data you really need

This query processes 2.5 megabytes when it is executed. So, in this way, you will be able to see exactly how the query is going to change or how the amount of information that you are going to extract from the table changes for each query.

If you do a “select” everything will change completely. For example if you choose everything, you will get 5.4 megabytes which is the full size of the table. If, on the contrary, you are only adding specific data, then it will change to 2.5 to 2 megabytes. Depending on the amount of information that you bring from the table.

When you do this work, what changes is the columns that you bring and when the amount of extracted data decreases. The query becomes faster and the whole process also becomes less expensive.

Dremel in BigQuery

Among the things that this enterprise data warehouse tool does is that it has an execution system or a data processing system called Dremel.

What this system does is plan how it is going to execute the tables. The order of the processes to be executed, what it is going to do first, what it will do next, and then how it works with the first column, so that the is relating to others.

In the same way, it gives you the power to group, and in the process you will be able to observe how it is doing it. What is the strategy that is going to be used to execute a frame and the phases. Which as an addition to BigQuery is an advantage to all lights.

Regarding the waiting phase, it is the time we wait to have a machine that executes this task. How long it took to read that extracts all the data, how long it took to process the data.

What precisely becomes the obtaining of information regarding the sum of the aggregations, the accurate and quantifiable annotations in the writing. This task is being executed surreptitiously, and what really happens is that all this information is being saved in another file. It is a temporary one so that everyone can consult it.

In the same topic, you’ll also notice that this plugin in BigQuery, you will have the ability to unrestricted access to queries that are recurring. You can have them there already saved for future crises or quick views.

Google DataStudio

Another thing that may also be interesting is the possibility of working with Google Data Studio, which is a tool dedicated to helping visualize metrics through the processes of creating dashboards, presentations and exploring different data formats such as the ones we talked about earlier.

One benefit it has is that it gives you the ability to use it for free. Within its features, you can create different types of graphs to present a complete dashboard, Google Cloud is one of those products that is not so easy to find elsewhere, it does not have a comparable one in the market.

Actually, with BigQuery, this tool is very useful for data scientists and to explore some business questions in general. The experimentation tasks with Google Data Studio, gives you a visual that makes it a super easy tool.

It also has the option to connect realtime, which is useful for working with remote groups when producing reports, data collection, live analysis and real-time error correction.

This is simply one of the many options that this tool gives you, which you will have the power to take or just work online without the need to make a live connection.

Get data in Google Data Studio

To generate reports with this powerful and at the same time simple tool, you have to establish the source from which you will obtain your data, the wonderful thing about this is that you can add several sources at the same time and work with more complete information.

If you have a visual SQL database and other data sources such as a spreadsheet, you can also include them, which makes the report more secure in terms of the information obtained, as well as facilitating data collection.

When selecting it, Google Data Studio will ask you to give it access to the connection with the application, which, in this case, will be BigQuery. Right away you will see all the options that you have enabled to work, such as adding graphs of different types and analytics and tables.

In the same way, you can add filters that will allow you to discriminate the information by specific data that you want to highlight, or equally, you can filter the companies from which you are obtaining the data, integrating several companies in the same document.

Embedded data sources

When you create your data sources and edit the reports that will work from those sources, all that information will be reflected in said document, that is, they will be inserted in them.

Embedded data sources give the team working on this information the ability to collaborate much more easily on reports and their data sources. Anyone who has the power to edit the reports can also make changes to the data sources, as well as their connection to BigQuery.

You should also take into account that when you share or copy the report that is being worked on, all the data sources that are inserted are also shared or copied.

Reusable data sources

From the moment you generate data sources from the Google Data Studio home page, it gives you the ability to reuse these sources. The benefit is that you can leverage those data sources in different reports that are produced.

A substantial benefit is that with reusable data sources, you will have the permission to produce and share data models with the same information that the entire organization manages. In the case of sharing a reusable data source, this person who receives it will also have the power to modify it.

What does restrict this plug is that only the owner of the data source credentials will be the one that has the permission to edit the connection with the application, which for us is BigQuery.

Benefit of Embedded Data Sources

As you can share the data source as many times as you share the report in which they are inserted, then you will not have the double work of sharing these sources separately, which reduces duplication of management, involuntary errors or loss of information just by including everything in a single report.

These sources are independent elements that have the advantage of being used in multiple reports, and can also be edited from the same report or by accessing the main page of Google Data Studio.

Another advantage is that it generates a sample in the event that the data source is not working correctly, in turn, it shows if there is a problem with the connection with BigQuery or where you are obtaining the data from.

Data Flow

This is another of Google’s tools that we can connect to with the data warehouse and analysis tool that we’re looking at. It is used to connect any data input with another. This plugin allows:

Perform streaming data analytics with fluidity, less data latency and with the ability to carry out your processes quickly and very easily.
Simplify the management work, since it works without a server, which helps to minimize the operational load that is generated when working with data engineering. This frees the team from server- or cluster-based management and what goes into it, so they can focus on programming.
Reduces Total Cost of Ownership Although this plug is expensive to connect to applications like BigQuery, it also lowers other costs as it combines resource autoscaling with cost-saving, empowering batch data processing. Its way is through the handling of temporary volumes of work and with combined maximum peaks, without using more resources that generate additional expenses.

Although we know that it is a very interesting plug to use, we also know that it is not cheap, the trick is also that you get hold of other tools such as Cloud Function. Google Cloud Functions is the serverless computing alternative introduced by Google for building event-driven applications.

This gives you, as a developer, full access to Firebase and Google Cloud events, with the ease of applying code that responds to the events you’re accessing. Together with BigQuery you will be able to get notifications of interesting events that are happening at the moment.

Discover how to implement you´re blockchain in business here.

UTOPICODE

Autor

Comparte este artículo