Google Colaboratory (Colab) is a product of Google Research. It allows users to write and execute arbitrary Python code in the browser, and it is well suited for machine learning, data analysis, and education. Colab is a cloud-based service: a server at Google runs the notebook rather than the user's local machine.
Because Colab runs in the cloud, its local file system is not the user's local file system. Therefore, a read_csv statement searches for the file on Google's side, not on the user's side.
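For example, the following call fails in a fresh Colab notebook if myfile.csv only exists on the user's laptop, because pandas looks for the file on the Colab server (the file name here is only illustrative):

[]
import pandas as pd

# The file lives on the user's machine, not on the Colab server,
# so pandas cannot find it and raises FileNotFoundError.
df = pd.read_csv('myfile.csv')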
There are two well-known methods to import data into a Colab notebook: one can be called the Manual Way and the other the Clean Way. Both have their advantages and disadvantages. There are also two alternative solutions, which may be more appropriate, particularly when the user's code has to be easy to industrialize.
There are two procedures to input data manually.
Data can be uploaded using files.upload() right in the Colab notebook, which gives the user a traditional upload button that allows transferring files from the local machine into the Colab environment.
[]
from google.colab import files
uploaded = files.upload()
Then use io.StringIO() with pd.read_csv to read the uploaded file into a DataFrame (the bytes returned by files.upload() need to be decoded first).
[]
import pandas as pd
import io

df = pd.read_csv(io.StringIO(uploaded['myfile.csv'].decode('utf-8')))
df
It is the easiest approach of all, although it requires a few lines of code.
The upload might take a while for large files, and whenever the notebook is restarted (because it crashed or for some other reason), the upload has to be repeated manually. That makes it a weak solution: the code will not re-execute on its own when relaunched, and it requires tedious manual operations whenever the notebook fails.
Alternatively, the user may upload the data to Google Drive before getting started in the notebook, then mount Google Drive onto the Colab environment, which means the Colab notebook can now access files in the user's Google Drive.
[]
from google.colab import drive
drive.mount('/content/drive')
[]
with open('/content/drive/My Drive/Dev.txt', 'w') as f:
    f.write('hello Google!')

!cat /content/drive/My\ Drive/Dev.txt
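Once Drive is mounted, a CSV stored there can be read like any local file. The path below is an illustrative placeholder and has to be replaced with the actual location of the file in the user's Drive:

[]
import pandas as pd

# Hypothetical path inside the mounted Drive; adjust to the real file location.
df = pd.read_csv('/content/drive/My Drive/data/myfile.csv')
df.head()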
Google Drive is user-friendly, and uploading data to Google Drive is an easy task for most people. Besides, once the data has been uploaded, the user does not have to input it manually again when restarting the notebook. Hence, it is a better approach than uploading the files manually one by one.
The main disadvantage of this approach concerns company use. It works well as long as the user is working on small projects, but when access management and security are at stake, this approach is difficult to industrialize. Keeping the data on an openly shared Drive could lead to a major security breach.
Users may also not want to be in a 100% Google environment, as multi-cloud solutions give more independence from any single cloud vendor.
If the project is small and the user knows it will always remain just a notebook, the previous approaches may be acceptable. However, for any project that may grow in the future, separating data storage from the notebook is a sound step towards a good architecture.
If the user wants to move towards a better architecture for data storage in a Google Colab notebook, the user should go for a proper data storage solution.
There are many possibilities in Python to connect with data stores. Two of them are AWS S3 for file storage and SQL for relational database storage.
AWS is a cloud-based service, and S3 is its file storage offering. It shares similar properties with the previously described ways of inputting data into Google Colab. Accessing S3 file storage from Python is very performant, and users may add authentication if need be.
[]
import pandas as pd
import s3fs  # s3fs lets pandas read files directly from s3:// URLs

df = pd.read_csv('s3://bucket-name/file.csv')
S3 is a data storage solution widely adopted by the software community. Many developers prefer Google Drive only for its integration with other Google services, which matters mostly to individual users. This approach improves both the code and the architecture.
AWS itself is a cloud-computing platform, which may take up additional resources and impact performance for the user. The approach is easy to implement; however, it may still be a disadvantage in some cases, as some company policies do not allow AWS. Besides, loading can be slower than loading from Google Drive, since the data source is separate.
If the data is stored in a relational database like MySQL or another, it would be a good approach to connect the Colab notebook directly to the database. SQLAlchemy is a package that permits the user to send SQL queries to the relational database. This keeps the data well organized in a distinct SQL environment, while keeping only the Python operations in the Colab notebook.
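As a minimal sketch, the cell below shows how such a connection might look with SQLAlchemy and pandas; the driver, connection string, database name, and table name are placeholders and must be adapted to the user's own database:

[]
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection string; credentials should come from a secret
# store rather than being hard-coded in the notebook.
# The pymysql driver may need to be installed first (e.g. pip install pymysql).
engine = create_engine('mysql+pymysql://user:password@db.example.com:3306/mydb')

# Send a SQL query to the database and load the result into a DataFrame.
df = pd.read_sql('SELECT * FROM mytable LIMIT 1000', con=engine)
df.head()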
Relational data storage is not suitable for disorganized data, but a non-relational database may help in that scenario. Query execution time can be high for large tables, and managing the database may itself be a problem if the user does not have one or cannot easily share access to it.
Google Colab notebooks are great, but it can be a real struggle when data needs to be imported or exported. Importing data by manual upload or by mounting Google Drive is easy to use but hard to scale. Alternatives like AWS S3 or a relational database make the system less manual and therefore better.
The two manual methods are better for short-term projects. The two methods with external storage may prove handy when a project needs a clean data store.