Data Processing with Google Colab.

Google Colaboratory, or Colab for short, is a product of Google Research. It allows users to write and execute arbitrary Python code in the browser. It is well suited for machine learning, data analysis, and education. Colab is a cloud-based service: a server at Google runs the notebook rather than the user’s local machine.

Colab as a Notebook
Import Data in Google Colab
Input Data Manually
The Clean Way

Colab as a Notebook

Google Colab runs in the cloud, so the notebook’s local file system is not the local file system of the user’s machine. Therefore, a read_csv statement searches for the file on Google’s side rather than on the user’s side.
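
The difference is easy to check from inside a notebook. The sketch below is only an illustration: the helper name locate_on_runtime is hypothetical, and it inspects the file system of whichever machine runs the code, which in Colab is Google’s virtual machine rather than the user’s computer.

```python
import os


def locate_on_runtime(filename):
    """Return the absolute path of `filename` on the machine running this code, or None."""
    candidate = os.path.abspath(filename)
    return candidate if os.path.isfile(candidate) else None


# The working directory belongs to the runtime, not to the user's laptop.
print("Working directory:", os.getcwd())
# A file that exists only on the user's machine is reported as missing here.
print("myfile.csv found at:", locate_on_runtime("myfile.csv"))
```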

Import Data in Google Colab

There are two well-known ways to import data into a Colab notebook: the manual way and the clean way. Both have their advantages and disadvantages. The clean way covers two alternative solutions, which may be more appropriate, particularly when the user’s code has to be easy to industrialize.

Input Data Manually

There are two ways to input data manually:

File Upload Method
Mount Google Drive

File Upload Method

Data can be uploaded using files.upload() directly in the Colab notebook, which gives the user a traditional upload button for transferring files from the local machine into the Colab environment.

[] from google.colab import files
# Opens an upload button; returns a dictionary of file name -> file contents (bytes)
upload = files.upload()

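Then use io.BytesIO() from the standard library with pd.read_csv to read the uploaded file into a data frame. files.upload() returns a dictionary that maps each file name to its contents as bytes, so the bytes are wrapped in a BytesIO object (io.StringIO() would only work after decoding them to text).

[] import io
import pandas as pd
# upload is the dictionary returned by files.upload()
df = pd.read_csv(io.BytesIO(upload['myfile.csv']))
df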
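
An upload may contain more than one file. The following is a minimal sketch for that case, assuming the upload result is a dictionary of file names to raw bytes. The helper name csv_rows_from_upload and the sample dictionary are hypothetical stand-ins, so the snippet also runs outside Colab. It uses the standard-library csv module; the same decoded text could be handed to pd.read_csv through io.StringIO.

```python
import csv
import io


def csv_rows_from_upload(uploaded, encoding="utf-8"):
    """Parse every .csv entry of an upload dictionary into a list of row dicts."""
    parsed = {}
    for name, raw in uploaded.items():
        if not name.lower().endswith(".csv"):
            continue  # skip non-CSV uploads
        text = raw.decode(encoding)  # bytes -> str, so io.StringIO is valid here
        parsed[name] = list(csv.DictReader(io.StringIO(text)))
    return parsed


# Stand-in for the dictionary that files.upload() would return in Colab.
sample_upload = {
    "myfile.csv": b"city,count\nParis,3\nOslo,5\n",
    "notes.txt": b"ignore me",
}
tables = csv_rows_from_upload(sample_upload)
print(tables["myfile.csv"])
```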

Advantages of using File Upload Method

It is the easiest approach of all and requires only a few lines of code.

Disadvantages of using File Upload Method

The upload can take a while for large files. Whenever the notebook is restarted (because it fails or for some other reason), the upload has to be repeated manually. It is not the best solution, because the code does not re-execute automatically when the notebook is relaunched, and it requires tedious manual operations whenever the notebook crashes.
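
One way to soften the repeated-upload problem is to trigger the upload only when the file is missing from the runtime. The sketch below is an assumption-laden illustration, not an official Colab pattern: ensure_local_file is a hypothetical helper, and fetch stands for any callable that returns a dictionary of file names to bytes, as files.upload does.

```python
import os


def ensure_local_file(path, fetch):
    """Call `fetch` only if `path` is missing; return True when a fetch was needed."""
    if os.path.isfile(path):
        return False  # already on the runtime's disk, nothing to upload
    fetched = fetch()  # in Colab: files.upload, which returns {name: bytes}
    if not os.path.isfile(path):
        # Write the bytes ourselves in case the fetcher did not save the file.
        with open(path, "wb") as handle:
            handle.write(fetched[os.path.basename(path)])
    return True


# Hypothetical usage in a Colab cell:
# from google.colab import files
# ensure_local_file("myfile.csv", files.upload)
```

This only helps while the runtime’s disk survives, for example after a kernel restart. A freshly allocated runtime starts with an empty disk, so the upload is still needed there.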

Mount Google Drive

Users may upload the data to Google Drive before getting started in the notebook. Mounting Google Drive onto the Colab environment then lets the Colab notebook access the files in the user’s Google Drive.

Mount the drive using the drive.mount() function

[] from google.colab import drive
# Asks for authorization, then exposes Google Drive under /content/drive
drive.mount('/content/drive')
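
When the same notebook code may also run outside Colab, the mount step can be guarded. The sketch below is a minimal illustration with hypothetical helper names. It assumes that Colab has already loaded the google.colab package into the interpreter, which is a commonly used heuristic rather than a documented guarantee.

```python
import sys


def running_in_colab():
    """Return True when the google.colab package is already loaded in this interpreter."""
    return "google.colab" in sys.modules


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return whether a mount was attempted."""
    if not running_in_colab():
        return False
    from google.colab import drive  # only importable inside Colab

    drive.mount(mount_point)
    return True


print("Drive mounted:", mount_drive_if_colab())
```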

Then access anything in Google Drive directly

[] with open('/content/drive/My Drive/Dev.txt', 'w') as f:
    f.write('hello Google!')
# Print the file that was just written to Drive
!cat /content/drive/My\ Drive/Dev.txt
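
Paths under the mounted Drive are easy to mistype, and a missing mount produces the same error as a missing file. A small helper can make the failure clearer. This is a sketch only: drive_file is a hypothetical name, and DRIVE_ROOT assumes the mount location used in the cell above.

```python
import os

DRIVE_ROOT = "/content/drive/My Drive"  # assumed mount location; adjust if yours differs


def drive_file(*parts, root=DRIVE_ROOT):
    """Build a path under the mounted Drive and fail early with a clear message if it is missing."""
    path = os.path.join(root, *parts)
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found; is Google Drive mounted at {root}?")
    return path


# Hypothetical usage in a Colab cell, after mounting:
# import pandas as pd
# df = pd.read_csv(drive_file("myfile.csv"))
```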

Advantages of Mounting the Drive

Google Drive is user-friendly, and uploading data to it is an easy task for most people. Besides, once the data is uploaded, the user does not have to input it manually again when restarting the notebook; only the mount step has to be re-run. Hence, it is a better approach than uploading the files manually one by one.

Disadvantages of Mounting the Drive

The main disadvantage of this approach shows up in company use. It works well as long as the user is working on small projects. However, when access management and security are at stake, the user will find that this approach is difficult to industrialize. Using data on an openly shared Drive could lead to a major security breach.

Users may also not want to be in a 100% Google environment, as multi-cloud solutions give more independence from any single cloud vendor.

The Clean Way

If the project is small and the user knows that it will always remain only a notebook, the previous approaches may be acceptable. However, for any project that may grow in the future, separating data storage from the notebook is a sound step towards a good architecture.

A user who wants to move towards a better architecture for data storage in a Google Colab notebook should opt for a proper data storage solution.

Python offers many ways to connect to data stores. Two of them are AWS S3 for file storage and SQL for relational database storage.

AWS S3 Bucket

AWS is a cloud platform and S3 is its file storage service. It shares similar properties with the previously described ways of getting data into Google Colab. Accessing S3 file storage from Python performs well, and users may add authentication if need be.

[] import pandas as pd
import s3fs  # not called directly; pandas needs it installed to open s3:// paths
df = pd.read_csv('s3://bucket-name/file.csv')
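
For private buckets, credentials can be passed to pandas through its storage_options parameter, which forwards them to s3fs. The sketch below is one possible arrangement, not the only one: the helper names are hypothetical, and it assumes the credentials live in the usual AWS environment variables so that no secret is written into the notebook.

```python
import os


def s3_storage_options(env=os.environ):
    """Build the `storage_options` dict for pandas from AWS-style environment variables."""
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    return {"anon": True}  # fall back to anonymous access for public buckets


def read_s3_csv(url, env=os.environ):
    """Read a CSV from S3 into a DataFrame (requires pandas and s3fs to be installed)."""
    import pandas as pd  # imported here so the helper above works without pandas

    return pd.read_csv(url, storage_options=s3_storage_options(env))


# Hypothetical usage:
# df = read_s3_csv("s3://bucket-name/file.csv")
```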

S3 is a data storage solution widely adopted by the software community. Many developers prefer Google Drive only for its integration with other Google services, and Drive is valued more by individual users. Using S3 improves both the code and the architecture.

Disadvantages of using Colab with S3

AWS is itself a cloud-computing platform, which may take up resources and affect performance for the user. It is easy to implement; however, it may still be a disadvantage in some cases, as some company policies do not allow AWS. Besides, loading can take longer than from Google Drive since the data source is separate.

SQL Database

If the data is stored in a relational database such as MySQL, it would be a good approach to connect the Colab notebook directly to the database. SQLAlchemy is a package that lets the user send SQL queries to a relational database. This makes it possible to keep well-organized data in a separate SQL environment while keeping only the Python operations in the Colab notebook.
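
A minimal sketch of such a connection follows. Everything specific in it is an assumption for illustration: the helper names, the mysql+pymysql driver, and the host, user, password and database values are all hypothetical. The query helper needs pandas, SQLAlchemy and a MySQL driver installed, so it imports them lazily.

```python
from urllib.parse import quote_plus


def build_db_url(user, password, host, database, port=3306, driver="mysql+pymysql"):
    """Assemble a SQLAlchemy-style connection URL, escaping special characters in the credentials."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


def query_to_frame(url, sql):
    """Run `sql` against the database at `url` and return the result as a DataFrame."""
    # Imported lazily: requires pandas, SQLAlchemy and a driver such as PyMySQL.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    with engine.connect() as connection:
        return pd.read_sql(text(sql), connection)


# Placeholder credentials and host, for illustration only.
url = build_db_url("analyst", "p@ss/word", "db.example.com", "sales")
print(url)
# Hypothetical usage: df = query_to_frame(url, "SELECT * FROM orders LIMIT 10")
```

Escaping the password matters because characters such as @ or / would otherwise be read as part of the URL structure.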

Disadvantages of using SQL Database with Colab

Relational data storage is a poor fit for unstructured data, although a non-relational database may help in this scenario. Query execution time can be high for large tables. Managing the database may also cause problems if the user does not already have one or cannot easily share access to it.

Google Colab notebooks are great, but it can be a real struggle when data needs to be imported or exported. Importing data by manual upload or by mounting Google Drive is easy to use but hard to industrialize. Alternatives like AWS S3 or a relational database help make the system less manual and therefore better.

The two manual methods are better for short-term projects. The two methods with external storage may prove handy when a project needs a clean data store.