How To Contribute To Open Source Machine Learning

Contributing to open source machine learning is an excellent way to set yourself apart from the crowd. If you’re just starting out on your data science journey and want to build up your portfolio of projects, getting pull requests merged on a popular project is an excellent addition. While there are numerous ways to contribute throughout the development pipeline, this blog post focuses on two: data collection and model evaluation.

Machine learning is all around us. The modern web makes sophisticated use of machine learning algorithms to display the blog you are reading, your favorite products on Amazon, or the latest cat pictures on Reddit. This tutorial will introduce you to the field of machine learning and help you feel comfortable contributing to open source machine learning projects.

Get to know GitHub

Almost all open source projects use a version control system, a tool that helps with merging new code into the project (the main “repository”). Usually, the collaboration is centred around a website that hosts the central repository. These websites include GitHub, Bitbucket and GitLab, though GitHub is by far the most popular one and the one we use here at MindsDB.

Generally speaking, version control systems keep track of every change in a project without requiring you to save multiple copies of each file. Make an account on GitHub, log in, and explore the tabs at the top of the website. There you will find information about GitHub, development guides, open source projects, events, conferences, etc.

Caliban

Caliban (google/caliban on GitHub) is a machine learning project from Google for developing research workflows and notebooks in an isolated, reproducible computing environment. It addresses a common problem: when building data science projects, it is often difficult to set up a test environment that shows how the project behaves in real-life conditions, and it is not possible to predict every edge case. Caliban makes it easy to develop an ML model locally, run the code on your own machine, and then run that exact same code in a cloud environment on larger machines. In short, it simplifies Dockerized research workflows both locally and in the cloud.

DeepLearning4J

Capitalizing on the latest buzzword within the buzzword-laden field of ML, Deep Learning for Java brings to open source a strong set of algorithms for single-machine and distributed deep learning on Hadoop and Spark. It has a range of utilities for working with data and also has GPU (graphics processing unit) support.

What is deep learning? Increasingly used at places like Google, Facebook and Amazon, deep learning is a large-scale approach to neural networks designed to significantly reduce the amount of human intervention needed to train and maintain models while also providing significantly better results. DL4J, as it is called, also has a book (preorder) in the works by Adam Gibson and Josh Patterson.

Play with the software: Find Bugs

First, try the software. Install it, play with it, try to break it, and see whether it does what it says on the box. This is a great place to start adding value, because reporting the issues you find is easy and useful. Each project is likely to have a slightly different template for reporting issues, but essentially they all ask for similar information.
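As a rough illustration of the kind of information issue templates tend to ask for, here is a minimal Python sketch that assembles a plain-text bug report. The field names (title, steps, expected, actual, versions) and the package name `examplelib` are assumptions made for this example, not any project’s actual template, so always follow the template the project itself provides.

```python
import platform
from dataclasses import dataclass, field


@dataclass
class BugReport:
    """Illustrative container for the details a maintainer usually needs."""
    title: str
    steps: list[str]
    expected: str
    actual: str
    # Versions of the relevant packages, e.g. {"examplelib": "1.2.3"} (hypothetical name).
    versions: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Return the report as plain text, ready to paste into an issue form."""
        lines = [f"Title: {self.title}", "", "Steps to reproduce:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines += ["", f"Expected: {self.expected}", f"Actual: {self.actual}"]
        lines += ["", "Environment:"]
        # platform is in the standard library; it reports the local OS and interpreter.
        lines.append(f"  OS: {platform.platform()}")
        lines.append(f"  Python: {platform.python_version()}")
        lines += [f"  {name}: {version}" for name, version in self.versions.items()]
        return "\n".join(lines)


report = BugReport(
    title="fit() crashes on an empty training set",
    steps=["Install the library", "Call fit() with zero rows"],
    expected="A clear error message",
    actual="An unhandled exception",
    versions={"examplelib": "1.2.3"},
)
print(report.render())
```

Collecting the environment details automatically saves maintainers from having to ask for them in a follow-up comment.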

Understanding how a project works

Not all open source projects operate in the same way. Some allow contributions from anyone. Some require you to work your way up to get contribution privileges. Some have multiple people involved in managing a project. Others have a single person in charge, a so-called benevolent dictator for life. 

Contribution guidelines help you understand how to approach your participation in a project. They explain how to reach out about a contribution, provide templates for reporting bugs and suggesting features, and list the work maintainers need done, the project’s goals, etc. An excellent example is the Angular contribution guide, which lists all kinds of useful information for new contributors, such as commit message guidelines, coding rules, and submission guidelines, in great detail.

In addition to contribution guidelines, some projects have a code of conduct. It usually outlines community rules and behavior expectations, and it is meant to help you be an amiable and professional contributor and community member. Angular, for instance, has a code of conduct that lists what the team considers unprofessional conduct, its responsibilities to the community, and how to get in contact if someone violates it.

Big projects may have governance policies and team documents that outline specific roles in the community, teams, sub-committees, contribution workflows, how discussions are conducted, and who gets to commit. These kinds of documents are essential to understanding how the community operates. The about page on angular.io, for example, lists the core team members, their roles, and other contributors. On GitHub, the project also has a docs folder containing its contribution policies.

Even after you’ve gone through the documentation and done your research, you may still be stumped on a particular aspect of the project and need to ask active members of the community. To interact with other contributors, join community communication tools like Slack or IRC, sign up for newsletters, and subscribe to the mailing list. Angular uses Gitter as its community communication tool and directs contributors with questions or problems to Stack Overflow, where they can get help using the angular tag. Connect with community members and develop relationships with them, as this will expose you to facets of the project that you may be unaware of.

Having a good grasp of the technical aspects of the project and how it’s organized is essential to making contributions that meet the project’s standards. To understand the technical parts of the project, consult the project README, wikis, tutorials, and documentation. Angular, for example, has docs explaining its GitHub process, building and testing, coding standards, debugging, PR reviews, etc. Going a step further, look at past feature integrations and bug fixes in merged pull requests, which are full of discussions by other contributors and can be a rich source of context. As the project evolves, keep following its issues, features, discussions, pull requests, and bug fixes to continually learn how it works. For instance, a contributor can follow an Angular feature request discussion about a form API to better understand how Angular forms work, how bundle size is managed, etc.

An open source project is a lot like a project at any company you might work at: there will be a house coding style, a team culture, and workflows for getting things done. The difference is that open source projects can and will have a much more varied group of people working on them.

Analytics Zoo

Analytics Zoo is a unified data analytics and AI platform that unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline. The pipeline can scale efficiently from a laptop to a large cluster to process production big data. This project is maintained by Intel-analytics.

Analytics Zoo helps with an AI solution in the following ways:

  • It helps you prototype AI models easily.
  • It manages scaling efficiently.
  • It helps you add automation to your ML pipeline, such as feature engineering and model selection.

Explore existing Issues

Once you have figured out which project you want to contribute to, explore its GitHub repository, read the documents, and go to the issues tab. Here you will find all the open issues that you can work on. These issues range from beginner level to advanced level. If the repository is properly maintained, the issues will be tagged with labels such as “beginner”, “first-timers”, or “help-wanted” so that you can start with the easiest tasks to gain confidence and experience with the project.
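If you prefer to filter issues programmatically, GitHub’s REST API exposes an issue search endpoint that accepts the same qualifiers as the search box on the website. The sketch below only builds the request URL and does not send it. The label names are an assumption: every project chooses its own labels, so check the repository’s label list before relying on any of them. The helper name `beginner_issue_query` is made up for this example.

```python
from urllib.parse import urlencode

# GitHub's REST endpoint for searching issues and pull requests.
GITHUB_SEARCH_ISSUES = "https://api.github.com/search/issues"


def beginner_issue_query(repo, labels=("good first issue",)):
    """Build a search URL for open issues in `repo` that carry every label in `labels`.

    `repo` is written as "owner/name". The default label is only a common
    convention, not something every project uses.
    """
    terms = [f"repo:{repo}", "is:issue", "is:open"]
    # Quote each label so that names containing spaces stay a single search term.
    terms += [f'label:"{label}"' for label in labels]
    # urlencode percent-escapes the query so it is safe to place in a URL.
    return GITHUB_SEARCH_ISSUES + "?" + urlencode({"q": " ".join(terms)})


url = beginner_issue_query("google/caliban", labels=("help wanted",))
print(url)
```

You can open the printed URL in a browser or fetch it with any HTTP client; the response is JSON describing the matching issues.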

Help improve documentation

You don’t have to write code right away. You can contribute to the documentation, help create the developer’s guide, help other people solve their issues, etc. Read the documentation, and if you feel something is missing, this can be a great first place to contribute. You can either raise an issue or, if you feel confident enough, add the missing parts yourself.

Submitting your work

After you’ve completed work on your contribution, submit it as per the contribution guidelines. At times, your submission may not get a reply even after a reasonable amount of time has gone by. In such cases, respectfully request a review or get in touch with other contributors for assistance. Post-review changes may be requested. Try to make them as soon as possible so that your contribution is integrated promptly and does not become out-of-date or forgotten. If your contribution is rejected, ask for feedback to understand why. When reviewers ask questions, make comments, or give feedback, be responsive and check on your work for any updates regularly. Treat this like any other work and be professional, courteous, and respectful.

Suggest new features

If you like the project and find it useful, you can also start requesting new features to help improve it. (You can find our template here.) Or, even better, you can try to add them yourself.

Conclusion

Contribute to open source machine learning projects and be part of changing the world while becoming better informed about all things machine learning.
